Data scientists often work with geographic data that needs to be visualized on a map, and sometimes the maps themselves are the data. The data is often located in two-dimensional space (latitude and longitude), but for some applications we have a third dimension as well: elevation. We could represent the elevations using contours, color, or 3-D perspective, but with the new rayshader package for R by Tyler Morgan-Wall, it's easy to visualize such maps as 3-D relief maps complete with shadows, perspective and depth of field:

Dead-simple 3D surface plotting in the next version of rayshader! Apply your hillshade (or any image) to a 3D surface map. Video preview with rayshader's built-in palettes. #rstats

— Tyler Morgan-Wall (@tylermorganwall) July 2, 2018

Code:

```r
elmat %>%
  sphere_shade() %>%
  add_shadow(ray_shade(elmat)) %>%
  plot_3d(elmat)
```

pic.twitter.com/FCKQ9OSKpj

Tyler describes the rayshader package in a gorgeous blog post: his goal was to generate 3-D representations of landscape data that "looked like a paper weight". (Incidentally, you can use this package to produce *actual* paperweights with 3-D printing.) To this end, he went beyond simply visualizing a 3-D surface in `rgl` and added a rectangular "base" to the surface, as well as shadows cast by the geographic features. He also added support for detecting (or specifying) a water level: useful for representing lakes or oceans (like the map of the Monterey submarine canyon shown below), and for visualizing the effect of changing water levels, as in this animation of draining Lake Mead.
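The water-level support combines naturally with the pipeline from the tweet. Here is a minimal sketch using rayshader's documented `detect_water()` and `add_water()` functions; `elmat` is an elevation matrix as before, and the texture name and `zscale` value are just illustrative choices:

```r
# Sketch: shade an elevation matrix, detect and overlay water, add ray-traced
# shadows, then render in 3-D. Assumes `elmat` is a numeric elevation matrix.
library(rayshader)

elmat %>%
  sphere_shade(texture = "imhof1") %>%
  add_water(detect_water(elmat), color = "imhof1") %>%
  add_shadow(ray_shade(elmat)) %>%
  plot_3d(elmat, zscale = 10)
```

Adjusting `zscale` controls the vertical exaggeration of the relief relative to the horizontal grid spacing.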

The rayshader package is implemented as a standalone R package; it doesn't require any external 3-D graphics software to work. Not only does that make it easy to install and use, it also means the underlying computations are available for specialized data analysis tasks. For example, research analyst David Waldran used a LIDAR scan of downtown Indianapolis to create (with the lidR package) a 3-D map of the buildings, and then used the `ray_shade` function to simulate the shadows cast by the buildings at various times during a winter's day. Averaging those shadows yields this map of the shadiest winter spots in Indianapolis:

The rayshader package is available for download now from your local CRAN mirror. You can also find an overview and the latest version of the package at the GitHub repository linked below.

Github (tylermorganwall): rayshader, R Package for Producing and Visualizing Hillshaded Maps from Elevation Matrices, in 2D and 3D

Many illusions are based on the fact that our perceptions of color or brightness of an object are highly dependent on the background surrounding the object. For example, in this image (an example of the Cornsweet illusion) the upper and lower blocks are exactly the same color, according to the pixels displayed on your screen.

Mind = blown. These two blocks are exactly the same shade of grey. Hold your finger over the seam and check. pic.twitter.com/OqAnforGqs

— David Smith (@revodavid) December 5, 2013

Here's another, simpler representation of the principle, created by Colin Fay (in response to this video made with colored paper). In the animation below, the rectangle moving from left to right remains the same color throughout (a middling gray). But as the background around it changes, our perception of its color changes as well.

Colin created this animation in R using the gganimate package (available on GitHub from author Thomas Lin Pedersen), and the process is delightfully simple. It begins with a chart of 10 "points", each being the same grey square equally spaced across the shaded background. Then, a simple command animates the transitions from one point to the next, interpolating between them smoothly:

```r
library(gganimate)
gg_animated <- gg +
  transition_time(t) +
  ease_aes('linear')
```
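The snippet above assumes an existing ggplot object `gg` with a time variable `t`. A self-contained sketch of the same idea might look like this; the data construction is my own, and only the two gganimate calls mirror Colin's snippet:

```r
# Sketch: the same mid-grey square at 10 positions over a dark-to-light
# background; gganimate tweens the square between positions.
library(ggplot2)
library(gganimate)

df <- data.frame(t = 1:10, x = 1:10, y = 5)

gg <- ggplot(df, aes(x, y)) +
  # Approximate the background gradient as ten vertical grey strips
  annotate("rect", xmin = 0:9 + 0.5, xmax = 1:10 + 0.5,
           ymin = -Inf, ymax = Inf,
           fill = grey(seq(0.1, 0.9, length.out = 10))) +
  geom_point(shape = 15, size = 10, color = "grey50") +
  theme_void()

gg_animated <- gg + transition_time(t) + ease_aes('linear')
# animate(gg_animated)
```

Because the square's fill never changes, any color shift you perceive comes entirely from the background.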

You can find the complete R source code behind the animation at the blog post linked below, along with an interesting discussion of luminance and how you should consider it when choosing color scales for your data visualizations.

RTask: Remaking "Luminance-gradient-dependent lightness illusion" with R

Ever wanted to make R talk to you? Now you can, with the mscstts package by John Muschelli. It provides an interface to the Microsoft Cognitive Services Text-to-Speech API (hence the name) in Azure, and you can use it to convert any short piece of text to a playable audio file, rendering it as speech using a number of different voices.

Before you can generate speech yourself, you'll need a Bing Speech API key. If you don't already have an Azure account, you can generate a free 7-day API key in seconds by visiting the Try Cognitive Services page, selecting "Speech APIs", and clicking "Get API Key" for "Bing Speech":

No credit card is needed; all you need is a Microsoft, Facebook, LinkedIn or GitHub account. (If you need permanent keys, you can create an Azure account here which you can use for 5000 free Speech API calls per month, and *also* get $200 in free credits for use with any Azure service.)

Once you have your key (you'll actually get two, but you can use either one), you call the `ms_synthesize` function to convert text of up to 1,000 characters or so into MP3 data, which you can then save to a file. (See lines 8-10 in the `speakR.R` script, below.) Then you can play the file with any MP3 player on your system. On Windows, the easiest way to do that is to call `start` on the MP3 file itself, which will use your system default player. (You'll need to modify line 11 of the script to work on non-Windows systems.)

Saving the data to a file and then invoking a player got a bit tedious for me, so I created the `say` function in the script below to automate the process. Let's see it in action:
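A helper along those lines might be sketched as follows. This is not the original `speakR.R` script: the exact return-value handling of `ms_synthesize` (raw MP3 bytes in `$content`) and the `api_key` argument name are assumptions to check against your installed version's help page:

```r
# Sketch of a say() helper around mscstts::ms_synthesize() (Windows-only
# playback; assumptions noted above).
library(mscstts)

say <- function(text, key = Sys.getenv("BING_SPEECH_KEY")) {
  text <- gsub("[<>/]", "", text)            # strip SSML-reserved characters
  res  <- ms_synthesize(script = text, api_key = key)
  mp3  <- tempfile(fileext = ".mp3")
  writeBin(res$content, mp3)                 # save the raw MP3 data
  shell(paste("start", mp3))                 # play with the system default
  invisible(mp3)
}
```

On macOS or Linux you would swap the `shell(...)` call for `system(paste("afplay", mp3))` or a similar player invocation.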

Note that you can choose from a number of accents and spoken languages (including British English, Canadian French, Chinese and Japanese), as well as the gender of the voice (though not every language offers both female and male voices). You can also modify the volume, pitch, speaking rate, and even the pronunciation of individual words using the SSML standard. (This does mean you can't use characters recognized as SSML markup in your text, which is why the `say` function below filters out `<`, `>` and `/` first.)

The mscstts package is available for download now from your favorite CRAN mirror, and you can find the latest development version on Github. Many thanks to John Muschelli for putting this handy package together!

Many types of machine learning classifiers, not least commonly-used techniques like ensemble models and neural networks, are notoriously difficult to interpret. If the model produces a surprising label for any given case, it's difficult to answer the question, "why *that* label, and not one of the others?".

One approach to this dilemma is the technique known as LIME (Local Interpretable Model-Agnostic Explanations). The basic idea is that while for highly non-linear models it's impossible to give a simple explanation of the relationship between any one variable and the predicted classes at a *global* level, it might be possible to assess which variables are most influential on the classification at a *local* level, near the neighborhood of a particular data point. A procedure for doing so is described in this 2016 paper by Ribeiro et al, and implemented in the R package **lime** by Thomas Lin Pedersen and Michael Benesty (a port of the Python package of the same name).

You can read about how the lime package works in the introductory vignette *Understanding Lime*, but this limerick by Mara Averick also sums things up nicely:

There once was a package called lime,
Whose models were simply sublime,
It gave explanations for their variations,
One observation at a time.

"One observation at a time" is the key there: given a prediction (or a collection of predictions) it will determine the variables that most support (or contradict) the predicted classification.

The lime package also works with text data: for example, you may have a model that classifies a paragraph of text as having "negative", "neutral" or "positive" sentiment. In that case, lime will determine the words in the paragraph that are most important to determining (or contradicting) the classification. The package also helpfully provides a Shiny app that makes it easy to try out different sentences and see the local effect of the model.
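The basic workflow uses the package's `lime()` and `explain()` functions. A minimal sketch on tabular data, where the model and dataset are purely illustrative (lime supports caret models out of the box; other model types may need a `model_type()` method):

```r
# Sketch: explain individual predictions of a caret-trained classifier.
library(caret)
library(lime)

train_idx <- 6:150
model <- train(iris[train_idx, 1:4], iris$Species[train_idx], method = "lda")

explainer   <- lime(iris[train_idx, 1:4], model)
explanation <- explain(iris[1:5, 1:4], explainer,
                       n_labels   = 1,   # explain the top predicted class
                       n_features = 2)   # the two most influential variables

plot_features(explanation)   # one panel per explained observation
```

Each row of `explanation` reports, for one observation, a feature and the weight with which it supports or contradicts the predicted label.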

To learn more about the lime algorithm and how to use the associated R package, a great place to get started is the tutorial Visualizing ML Models with LIME from the University of Cincinnati Business Analytics R Programming Guide. The lime package is available on CRAN now, and you can always find the latest version at the GitHub repository linked below.

GitHub (thomasp): lime (Local Interpretable Model-Agnostic Explanations)

I had a great time in Budapest last week for the eRum 2018 conference. The organizers have already made all of the videos available online. Here's my presentation: Speeding up R with Parallel Programming in the cloud.

You can find (and download) my presentation slides here. And if you just want the references from the last slide, here are the links:

Who says there's no art in mathematics? I've long admired the generative art that Thomas Lin Pedersen occasionally posts (and that you can see on Instagram), and though he's a prolific R user I'm not quite sure how he makes his art. Marcus Volz has another beautiful portfolio of generative art, and has also created an R package you can use to create your own designs: the mathart package.

Generative art uses mathematical equations and standard graphical rendering tools (point and lines, color and transparency) to create designs. The mathart package provides a number of R functions to create some interesting designs from just a few equations. Complex designs emerge from just a few trigonometric functions, like this shell:

Or this abstract harmonograph:

Amazingly, the image above, and an infinite collection of images similar to it, is generated by just two equations implemented in R:

```
x = A1*sin(t*f1 + p1)*exp(-d1*t) + A2*sin(t*f2 + p2)*exp(-d2*t)
y = A3*sin(t*f3 + p3)*exp(-d3*t) + A4*sin(t*f4 + p4)*exp(-d4*t)
```
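These two equations describe a pair of damped sinusoids, and translating them into base R is direct. The parameter defaults below are my own picks for illustration; the actual `mathart::harmonograph` function may use different arguments:

```r
# Sketch: generate harmonograph points from the two equations above.
# A = amplitudes, f = frequencies, p = phases, d = damping factors.
harmonograph_points <- function(A = c(1, 1, 1, 1),
                                f = c(2, 3, 3, 2),
                                p = c(1/16, 3/2, 13/15, 1) * pi,
                                d = c(0.004, 0.006, 0.008, 0.019),
                                n = 10000, tmax = 100) {
  t <- seq(0, tmax, length.out = n)
  damped <- function(i) A[i] * sin(t * f[i] + p[i]) * exp(-d[i] * t)
  data.frame(x = damped(1) + damped(2),
             y = damped(3) + damped(4))
}

pts <- harmonograph_points()
# plot(pts, type = "l", axes = FALSE, xlab = "", ylab = "")
```

Varying the frequencies, phases and damping factors produces the "infinite collection" of related designs.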

You can have a lot of fun playing around with the parameters to the harmonograph function to see what other interesting designs you can find. You can find that function, and functions for designs of birds, butterflies, hearts, and more in the mathart package available on Github and linked below.

Github (marcusvolz): mathart

Since its inception over 40 years ago, when S (R's predecessor) was just a sketch on John Chambers' wall at Bell Labs, R has always been a language for providing interfaces. I was reminded of this during Dirk Eddelbuettel's presentation at the Chicago R User Group meetup last night, where he enumerated Chambers' three principles behind its design (from his 2016 book, Extending R):

- **Object**: Everything that exists in R is an object
- **Function**: Everything that happens in R is a function call
- **Interface**: Interfaces to other software are a part of R

The third principle "Interface" is demonstrated by R's broad connections to data sources, numerical and statistical computation libraries, graphical systems, external applications, and other languages. And it's further supported by the formal announcement this week of the reticulate package from RStudio, which provides a new interface between R and Python. With reticulate, you can:

- Import objects from Python, automatically converted into their equivalent R types. (For example, Pandas data frames become R data.frame objects, and NumPy arrays become R matrix objects.)
- Import Python modules, and call their functions from R
- Source Python scripts from R
- Interactively run Python commands from the R command line
- Combine R code and Python code (and output) in R Markdown documents, as shown in the snippet below
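The first few of those capabilities can be sketched in a short session. This assumes a local Python installation with NumPy available; the reticulate functions shown (`import`, `py_to_r`, `py_run_string`, `py`) are the package's documented interface:

```r
# Sketch: calling Python from R via reticulate.
library(reticulate)

np <- import("numpy")                  # import a Python module
m  <- np$arange(6L)$reshape(2L, 3L)    # call its functions/methods from R
m_r <- py_to_r(m)                      # NumPy array -> R matrix

py_run_string("greeting = 'hello from Python'")
py$greeting                            # access Python objects from R
```

In R Markdown documents the same machinery powers `{python}` chunks, with objects shared between the R and Python sessions.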

The reticulate package was first released on Github in January 2017, and has been available on CRAN since March 2017. It has already spawned several higher-level integrations between R and Python-based systems, including:

- H2O4GPU, an R package for H2O's GPU-based scikit-learn-like suite of algorithms;
- greta, a package for Bayesian model estimation with Markov chain Monte Carlo, based on TensorFlow;
- spacyr, a wrapper for the spaCy natural language processing toolkit; and
- XRPython, John Chambers' interface to Python based on his XR package for language extensions to R, which now uses reticulate for its low-level interface to Python.

The reticulate package is available now on CRAN. You can find more details in the announcement at the link below.

RStudio blog: reticulate: R interface to Python

**Update March 31**: Corrected date of first availability of reticulate on CRAN

During a discussion with some other members of the R Consortium, the question came up: who maintains the most packages on CRAN? DataCamp maintains a list of the most active maintainers by downloads, but in this case we were interested in the total number of packages by maintainer. Fortunately, this is pretty easy to figure out thanks to the CRAN repository tools now included in R, and a little dplyr (see the code below) gives the answer quickly[*].

And the answer? The most prolific maintainer is Scott Chamberlain from ROpenSci, who is currently the maintainer of 77 packages. Here's a list of the top 20:

```
   Maint                 n
 1 Scott Chamberlain    77
 2 Dirk Eddelbuettel    53
 3 Gabor Csardi         50
 4 Hadley Wickham       41
 5 Jeroen Ooms          40
 6 ORPHANED             37
 7 Thomas J. Leeper     29
 8 Bob Rudis            28
 9 Henrik Bengtsson     28
10 Kurt Hornik          28
11 Oliver Keyes         28
12 Martin Maechler      27
13 Richard Cotton       27
14 Robin K. S. Hankin   25
15 Simon Urbanek        24
16 Kirill Muller        23
17 Torsten Hothorn      23
18 Achim Zeileis        22
19 Paul Gilbert         22
20 Yihui Xie            21
```

[**Update** Mar 23: updated the R code and the results to treat Gabor Csardi and Gábor Csárdi as the same person, and corrected a trailing space issue that failed to count 2 of Hadley Wickham's packages.] (That list of orphaned packages with no current maintainer includes XML, d3heatmap, and flexclust, to name just 3 of the 37.) Here's the R code used to calculate the top 20:
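The original embedded gist isn't reproduced here, but a rough reconstruction of the computation (not the author's script) might look like the following. It uses `tools::CRAN_package_db()`, which requires internet access, plus dplyr; the name-cleaning regular expression is a simplification of the one the footnote describes:

```r
# Sketch: count CRAN packages per maintainer (requires internet + dplyr).
library(dplyr)

pkgs <- tools::CRAN_package_db()

top20 <- pkgs %>%
  distinct(Package, Maintainer) %>%
  # Drop the <email> part, surrounding quotes, and stray whitespace so the
  # same person isn't counted under several spellings of their name
  mutate(Maint = trimws(gsub('"', "", sub("<.*>", "", Maintainer)))) %>%
  count(Maint, sort = TRUE) %>%
  head(20)

top20
```

Results will differ from the table above, since CRAN's package list changes daily.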

[*] Well, it would have been quick, had I not noticed that some maintainers had two forms of their name in the database, one with surrounding quotes and one without. It seemed like it would be trivial to fix with a regular expression, but it took me longer than I hoped to come up with the final regexp on line 6 above, which is now barely distinguishable from line noise. As usual, there's an xkcd for this situation:

*by Antony Unwin, University of Augsburg, Germany*

There are many different methods for identifying outliers and a lot of them are available in **R**. But are outliers a matter of opinion? Do all methods give the same results?

Articles on outlier methods use a mixture of theory and practice. Theory is all very well, but outliers are outliers because they don't follow theory. Practice involves testing methods on data, sometimes with data simulated based on theory, better with 'real' datasets. A method can be considered successful if it finds the outliers we all agree on, but do we all agree on which cases are outliers?

The Overview Of Outliers (O3) plot is designed to help compare and understand the results of outlier methods. It is implemented in the **OutliersO3** package and was presented at last year's useR! in Brussels. Six methods from other **R** packages are included (and, as usual, thanks are due to the authors for making their functions available in packages).

The starting point was a recent proposal of Wilkinson's, his HDoutliers algorithm. The plot above shows the default O3 plot for this method applied to the stackloss dataset. (Detailed explanations of O3 plots are in the **OutliersO3** vignettes.) The stackloss dataset is a small example (21 cases and 4 variables) and there is an illuminating and entertaining article (Dodge, 1996) that tells you a lot about it.

Wilkinson's algorithm finds 6 outliers for the whole dataset (the bottom row of the plot). Overall, for various combinations of variables, 14 of the cases are found to be potential outliers (out of 21!). There are no rows for 11 of the possible 15 combinations of variables because no outliers are found with them. If using a tolerance level of 0.05 seems a little bit lax, using 0.01 finds no outliers at all for any variable combination.

Trying another method with tolerance level=0.05 (*mvBACON* from **robustX**) identifies 5 outliers, all ones found for more than one variable combination by *HDoutliers*. However, no outliers are found for the whole dataset and only one of the three variable combinations where outliers are found is a combination where *HDoutliers* finds outliers. Of course, the two methods are quite different and it would be strange if they agreed completely. Is it strange that they do not agree more?

There are four other methods available in **OutliersO3** and using all six methods on stackloss a tolerance level of 0.05 identifies the following numbers of outliers:

```
## HDo PCS BAC adjOut DDC MCD
## 14 4 5 0 6 5
```

Each method uses what I have called the tolerance level in a rather different way. Sometimes it is called alpha and sometimes (1 - alpha). As so often with **R**, you start wondering if more consistency would not be out of place, even at the expense of a little individuality. **OutliersO3** transforms where necessary to ensure that lower tolerance level values mean fewer outliers for all methods, but no attempt has been made to calibrate them equivalently. This is probably why `adjOutlyingness` finds few or no outliers (results of this method are mildly random). The default value, according to `adjOutlyingness`'s help page, is an alpha of 0.25.

The stackloss dataset is an odd dataset and small enough that each individual case can be studied in detail (cf. Dodge's paper for just how much detail). However, similar results have been found with other datasets (milk, Election2005, diamonds, …). The main conclusion so far is that different outlier methods identify different numbers of different cases for different combinations of variables as different from the bulk of the data (i.e. as potential outliers), or are these datasets just outlying examples?

There are other outlier methods available in **R** and they will doubtless give yet more different results. The recommendation has to be to proceed with care. Outliers may be interesting in their own right, they may be errors of some kind, and we may not agree whether they are outliers at all.

[Find the R code for generating the above plots here: OutliersUnwin.Rmd]