Many types of machine learning classifiers, including commonly-used techniques like ensemble models and neural networks, are notoriously difficult to interpret. If the model produces a surprising label for a given case, it's difficult to answer the question, "why *that* label, and not one of the others?".

One approach to this dilemma is the technique known as LIME (Local Interpretable Model-Agnostic Explanations). The basic idea is that while for highly non-linear models it's impossible to give a simple explanation of the relationship between any one variable and the predicted classes at a *global* level, it might be possible to assess which variables are most influential on the classification at a *local* level, in the neighborhood of a particular data point. A procedure for doing so is described in a 2016 paper by Ribeiro et al., and implemented in the R package **lime** by Thomas Lin Pedersen and Michael Benesty (a port of the Python package of the same name).
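The local-surrogate idea can be illustrated with a toy sketch in base R. This is only a conceptual illustration, not the lime package's actual algorithm; the "black-box" function `f`, the point `x0`, and all parameters are made up:

```r
# Conceptual sketch of LIME's core idea (not the lime package's algorithm):
# approximate a black-box model near one point with a weighted linear fit.
set.seed(42)
f <- function(X) sin(X[, 1]) + X[, 2]^2      # stand-in "black-box" model
x0 <- c(1, 0.5)                              # the observation to explain
X <- cbind(rnorm(500, x0[1], 0.3),           # perturbed samples near x0
           rnorm(500, x0[2], 0.3))
w <- exp(-rowSums(sweep(X, 2, x0)^2))        # proximity weights
fit <- lm(f(X) ~ X[, 1] + X[, 2], weights = w)  # local surrogate model
coef(fit)  # slopes indicate each variable's local influence on the prediction
```

The fitted slopes approximate how each variable drives the prediction near `x0`, even though no simple global description of `f` exists.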

You can read about how the lime package works in the introductory vignette *Understanding Lime*, but this limerick by Mara Averick also sums things up nicely:

> There once was a package called lime,
> Whose models were simply sublime,
> It gave explanations for their variations,
> One observation at a time.

"One observation at a time" is the key there: given a prediction (or a collection of predictions) it will determine the variables that most support (or contradict) the predicted classification.

The lime package also works with text data: for example, you may have a model that classifies a paragraph of text as having "negative", "neutral" or "positive" sentiment. In that case, lime will determine the words in that paragraph which are most important to determining (or contradicting) the classification. The package also helpfully provides a Shiny app that makes it easy to test out different sentences and see the local effect of the model.
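As a sketch of the basic workflow on tabular data (function names follow the lime package's documentation; the caret-trained random forest on the built-in iris data is just a stand-in for your own classifier):

```r
# Sketch of the lime workflow (assumes the caret, randomForest and
# lime packages are installed)
library(caret)
library(lime)

train_idx <- c(1:45, 51:95, 101:145)
model <- train(Species ~ ., data = iris[train_idx, ], method = "rf")

# Build an explainer from the training data and model...
explainer <- lime(iris[train_idx, 1:4], model)

# ...then explain predictions, one observation at a time
explanation <- explain(iris[-train_idx, 1:4][1:2, ], explainer,
                       n_labels = 1, n_features = 2)
plot_features(explanation)
```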

To learn more about the lime algorithm and how to use the associated R package, a great place to get started is the tutorial Visualizing ML Models with LIME from the University of Cincinnati Business Analytics R Programming Guide. The lime package is available on CRAN now, and you can always find the latest version at the GitHub repository linked below.

GitHub (thomasp): lime (Local Interpretable Model-Agnostic Explanations)

I had a great time in Budapest last week for the eRum 2018 conference. The organizers have already made all of the videos available online. Here's my presentation: Speeding up R with Parallel Programming in the cloud.

You can find (and download) my presentation slides here. And if you just want the references from the last slide, here are the links:

Who says there's no art in mathematics? I've long admired the generative art that Thomas Lin Pedersen occasionally posts (and that you can see on Instagram), and though he's a prolific R user I'm not quite sure how he makes his art. Marcus Volz has another beautiful portfolio of generative art, and has also created an R package you can use to create your own designs: the mathart package.

Generative art uses mathematical equations and standard graphical rendering tools (point and lines, color and transparency) to create designs. The mathart package provides a number of R functions to create some interesting designs from just a few equations. Complex designs emerge from just a few trigonometric functions, like this shell:

Or this abstract harmonograph:

Amazingly, the image above, and an infinite collection of images similar to it, is generated by just two equations implemented in R:

```
x = A1*sin(t*f1+p1)*exp(-d1*t) + A2*sin(t*f2+p2)*exp(-d2*t)
y = A3*sin(t*f3+p3)*exp(-d3*t) + A4*sin(t*f4+p4)*exp(-d4*t)
```

You can have a lot of fun playing around with the parameters to the harmonograph function to see what other interesting designs you can find. You can find that function, and functions for designs of birds, butterflies, hearts, and more in the mathart package available on Github and linked below.
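The harmonograph equations can also be plotted directly in base R; the parameter values below are arbitrary choices for illustration, not mathart's defaults:

```r
# Base-R sketch of the two harmonograph equations (arbitrary parameters)
t <- seq(0, 100, length.out = 20000)
A <- c(1, 1, 1, 1)
f <- c(2, 3, 3, 2)
p <- c(1/16, 3/2, 13/15, 1) * pi
d <- c(0.02, 0.03, 0.02, 0.02)
x <- A[1]*sin(t*f[1]+p[1])*exp(-d[1]*t) + A[2]*sin(t*f[2]+p[2])*exp(-d[2]*t)
y <- A[3]*sin(t*f[3]+p[3])*exp(-d[3]*t) + A[4]*sin(t*f[4]+p[4])*exp(-d[4]*t)
plot(x, y, type = "l", axes = FALSE, xlab = "", ylab = "")
```

Varying the frequencies `f`, phases `p` and damping terms `d` produces an endless family of related designs.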

Github (marcusvolz): mathart

Since its inception over 40 years ago, when S (R's predecessor) was just a sketch on John Chambers' wall at Bell Labs, R has always been a language for providing interfaces. I was reminded of this during Dirk Eddelbuettel's presentation at the Chicago R User Group meetup last night, where he enumerated Chambers' three principles behind R's design (from Chambers' 2016 book, Extending R):

- **Object**: Everything that exists in R is an object
- **Function**: Everything that happens in R is a function call
- **Interface**: Interfaces to other software are a part of R

The third principle "Interface" is demonstrated by R's broad connections to data sources, numerical and statistical computation libraries, graphical systems, external applications, and other languages. And it's further supported by the formal announcement this week of the reticulate package from RStudio, which provides a new interface between R and Python. With reticulate, you can:

- Import objects from Python, automatically converted into their equivalent R types. (For example, Pandas data frames become R data.frame objects, and NumPy arrays become R matrix objects.)
- Import Python modules, and call their functions from R
- Source Python scripts from R
- Interactively run Python commands from the R command line
- Combine R code and Python code (and output) in R Markdown documents, as shown in the snippet below
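A few of these capabilities can be sketched in a short session (this assumes reticulate is installed along with a Python interpreter that has NumPy available):

```r
# Sketch of reticulate's interface (assumes reticulate plus Python with NumPy)
library(reticulate)

np <- import("numpy")                      # import a Python module
m <- np$reshape(np$arange(9L), c(3L, 3L))  # call its functions from R
class(m)                                   # NumPy 2-D arrays become R matrices

py_run_string("greeting = 'hello from Python'")  # run Python commands
py$greeting                                # access Python objects from R
```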

The reticulate package was first released on Github in January 2017, and has been available on CRAN since March 2017. It has already spawned several higher-level integrations between R and Python-based systems, including:

- H2O4GPU, an R package for H2O's GPU-based scikit-learn-like suite of algorithms;
- greta, a package for Bayesian model estimation with Markov chain Monte Carlo, based on TensorFlow;
- spacyr, a wrapper for the spaCy natural language processing toolkit; and
- XRPython, John Chambers' interface to Python based on his XR package for language extensions to R, which now uses reticulate for its low-level interface to Python.

The reticulate package is available now on CRAN. You can find more details in the announcement at the link below.

RStudio blog: reticulate: R interface to Python

**Update March 31**: Corrected date of first availability of reticulate on CRAN

During a discussion with some other members of the R Consortium, the question came up: who maintains the most packages on CRAN? DataCamp maintains a list of most active maintainers by downloads, but in this case we were interested in the total number of packages by maintainer. Fortunately, this is pretty easy to figure out thanks to the CRAN repository tools now included in R, and a little dplyr (see the code below) gives the answer quickly[*].

And the answer? The most prolific maintainer is Scott Chamberlain from ROpenSci, who is currently the maintainer of 77 packages. Here's a list of the top 20:

```
   Maint                 n
1  Scott Chamberlain    77
2  Dirk Eddelbuettel    53
3  Gabor Csardi         50
4  Hadley Wickham       41
5  Jeroen Ooms          40
6  ORPHANED             37
7  Thomas J. Leeper     29
8  Bob Rudis            28
9  Henrik Bengtsson     28
10 Kurt Hornik          28
11 Oliver Keyes         28
12 Martin Maechler      27
13 Richard Cotton       27
14 Robin K. S. Hankin   25
15 Simon Urbanek        24
16 Kirill Muller        23
17 Torsten Hothorn      23
18 Achim Zeileis        22
19 Paul Gilbert         22
20 Yihui Xie            21
```

[**Update** Mar 23: updated the R code and the results to treat Gabor Csardi and Gábor Csárdi as the same person, and corrected a trailing space issue that failed to count 2 of Hadley Wickham's packages.] (That list of orphaned packages with no current maintainer includes XML, d3heatmap, and flexclust, to name just 3 of the 37.) Here's the R code used to calculate the top 20:
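The original script isn't reproduced in this excerpt, but a minimal sketch of the approach might look like the following (this uses `tools::CRAN_package_db()`, which requires an internet connection, plus dplyr; the name-cleaning regexp here is a simplified stand-in for the one discussed in the footnote):

```r
# Sketch: count CRAN packages by maintainer (assumes dplyr; needs internet)
library(dplyr)

pdb <- tools::CRAN_package_db()
pdb %>%
  distinct(Package, .keep_all = TRUE) %>%              # one row per package
  mutate(Maint = trimws(gsub('"|<[^>]*>', "", Maintainer))) %>%  # drop quotes, email
  count(Maint, sort = TRUE) %>%
  head(20)
```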

[*] Well, it would have been quick, except that I noticed some maintainers had two forms of their name in the database, one with surrounding quotes and one without. It seemed like it was going to be trivial to fix with a regular expression, but it took me longer than I hoped to come up with the final regexp on line 6 above, which is now barely distinguishable from line noise. As usual, there's an xkcd for this situation:

*by Antony Unwin, University of Augsburg, Germany*

There are many different methods for identifying outliers and a lot of them are available in **R**. But are outliers a matter of opinion? Do all methods give the same results?

Articles on outlier methods use a mixture of theory and practice. Theory is all very well, but outliers are outliers because they don't follow theory. Practice involves testing methods on data, sometimes with data simulated based on theory, better with 'real' datasets. A method can be considered successful if it finds the outliers we all agree on, but do we all agree on which cases are outliers?

The Overview Of Outliers (O3) plot is designed to help compare and understand the results of outlier methods. It is implemented in the **OutliersO3** package and was presented at last year’s useR! in Brussels. Six methods from other **R** packages are included (and, as usual, thanks are due to the authors for making their functions available in packages).

The starting point was a recent proposal of Wilkinson’s, his HDoutliers algorithm. The plot above shows the default O3 plot for this method applied to the stackloss dataset. (Detailed explanations of O3 plots are in the **OutliersO3** vignettes.) The stackloss dataset is a small example (21 cases and 4 variables) and there is an illuminating and entertaining article (Dodge, 1996) that tells you a lot about it.

Wilkinson’s algorithm finds 6 outliers for the whole dataset (the bottom row of the plot). Overall, for various combinations of variables, 14 of the cases are found to be potential outliers (out of 21!). There are no rows for 11 of the possible 15 combinations of variables because no outliers are found with them. If using a tolerance level of 0.05 seems a little bit lax, using 0.01 finds no outliers at all for any variable combination.

Trying another method at tolerance level 0.05 (*mvBACON* from **robustX**) identifies 5 outliers, all of them cases found by *HDoutliers* for more than one variable combination. However, no outliers are found for the whole dataset, and only one of the three variable combinations where outliers are found is a combination where *HDoutliers* finds outliers. Of course, the two methods are quite different, and it would be strange if they agreed completely. Is it strange that they do not agree more?

There are four other methods available in **OutliersO3**, and using all six methods on stackloss with a tolerance level of 0.05 identifies the following numbers of outliers:

```
## HDo PCS BAC adjOut DDC MCD
## 14 4 5 0 6 5
```
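Counts like these can be produced with the package's preparation and plotting functions (a sketch based on the interface shown in the **OutliersO3** vignettes; argument and component names are taken from there):

```r
# Sketch: run all six methods on stackloss at tolerance level 0.05
# (assumes the OutliersO3 package; interface as in its vignettes)
library(OutliersO3)

O3s <- O3prep(stackloss,
              method = c("HDo", "PCS", "BAC", "adjOut", "DDC", "MCD"),
              tols = 0.05)
O3m <- O3plotM(O3s)
O3m$nOut    # number of outliers identified by each method
O3m$gO3     # the combined O3 plot comparing the methods
```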

Each method uses what I have called the tolerance level in a rather different way. Sometimes it is called alpha and sometimes (1-alpha). As so often with **R**, you start wondering if more consistency would not be out of place, even at the expense of a little individuality. **OutliersO3** transforms where necessary to ensure that lower tolerance level values mean fewer outliers for all methods, but no attempt has been made to calibrate them equivalently. This is probably why `adjOutlyingness` finds few or no outliers (results of this method are mildly random). The default value, according to `adjOutlyingness`'s help page, is an alpha of 0.25.

The stackloss dataset is an odd dataset and small enough that each individual case can be studied in detail (cf. Dodge’s paper for just how much detail). However, similar results have been found with other datasets (milk, Election2005, diamonds, …). The main conclusion so far is that different outlier methods identify different numbers of different cases for different combinations of variables as different from the bulk of the data (i.e. as potential outliers)—or are these datasets just outlying examples?

There are other outlier methods available in **R** and they will doubtless give yet more different results. The recommendation has to be to proceed with care. Outliers may be interesting in their own right, they may be errors of some kind—and we may not agree whether they are outliers at all.

[Find the R code for generating the above plots here: OutliersUnwin.Rmd]

Modern machine learning platforms like Tensorflow have to date been used mainly by the computer science crowd, for applications like computer vision and language understanding. But as JJ Allaire pointed out in his keynote at the RStudio conference earlier this month (embedded below), there's a wealth of applications in the data science domain that have yet to be widely explored using these techniques. This includes things like time series forecasting, logistic regression, latent variable models, and censored data analysis (including survival analysis and failure data analysis).

The keras package for R provides a flexible, high-level interface for specifying machine learning models. (RStudio also provides some nice features when using the package, including a dynamically-updated convergence chart to show progress.) Networks defined with keras are flexible enough to specify models for data science applications, which can then be optimized using frameworks like TensorFlow (as opposed to traditional maximum-likelihood techniques), without limitations on data set size and with the ability to exploit modern computational hardware.
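For instance, a classical logistic regression can be written as a one-layer network and fitted with a stochastic optimizer rather than maximum likelihood (a sketch assuming the keras package and a TensorFlow backend are installed; `x_train`/`y_train` are placeholders):

```r
# Sketch: logistic regression as a minimal keras model (assumes the keras
# package with a TensorFlow backend; x_train / y_train are placeholders)
library(keras)

model <- keras_model_sequential() %>%
  layer_dense(units = 1, activation = "sigmoid", input_shape = 4)

model %>% compile(
  optimizer = optimizer_sgd(lr = 0.01),
  loss = "binary_crossentropy",
  metrics = "accuracy"
)

# model %>% fit(x_train, y_train, epochs = 20, batch_size = 32)
```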

For learning materials, RStudio's Tensorflow Gallery provides a good place to get started with several worked examples using real-world data. The book Deep Learning with R (Chollet and Allaire) provides even more worked examples translated from the original Python. If you want to dive into the mathematical underpinnings, the book Deep Learning (Goodfellow et al) provides the details there.

RStudio blog: TensorFlow for R

*by Boxuan Cui, Data Scientist at Smarter Travel*

Once upon a time, there was a joke:

In Data Science, 80% of time spent prepare data, 20% of time spent complain about need for prepare data.

— Big Data Borat (@BigDataBorat) February 27, 2013

According to a Forbes article, cleaning and organizing data is the most time-consuming and least enjoyable data science task. DataExplorer is one tool that addresses this, with its sole mission being to minimize the 80%, and to make the process enjoyable. As a result, one fundamental design principle is to be extremely user-friendly. Most of the time, one function call is all you need.

Data manipulation is powered by data.table, so tasks involving big datasets usually complete in a few seconds. In addition, the package is flexible enough with input data classes, so you should be able to throw in any `data.frame`-like object. However, certain functions require a `data.table` class object as input due to the update-by-reference feature, which I will cover later in the post.
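Update-by-reference means `data.table` modifies an object in place instead of returning a modified copy. A tiny illustration (assumes the data.table package is installed):

```r
# data.table's update-by-reference in miniature
library(data.table)
dt <- data.table(x = 1:3)
dt[, y := x * 2]   # adds column y in place; no re-assignment needed
dt                 # dt itself now has columns x and y
```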

Now enough said and let's look at some code, shall we?

Take the `BostonHousing` dataset from the `mlbench` library:

```
library(mlbench)
data("BostonHousing", package = "mlbench")
```

Without knowing anything about the data, my first 3 tasks are almost always:

```
library(DataExplorer)
plot_missing(BostonHousing) ## Are there missing values, and what is the missing data profile?
plot_bar(BostonHousing) ## What does the categorical frequency for each discrete variable look like?
plot_histogram(BostonHousing) ## What is the distribution of each continuous variable?
```

While there are not many interesting insights from `plot_missing` and `plot_bar`, below is the output from `plot_histogram`.

Upon scrutiny, the variable **rad** looks discrete, and I want to group **crim**, **zn**, **indus** and **b** into bins as well. Let's do so:

```
## Set `rad` to factor
BostonHousing$rad <- as.factor(BostonHousing$rad)
## Create new discrete variables
for (col in c("crim", "zn", "indus", "b")) {
  BostonHousing[[paste0(col, "_d")]] <- as.factor(ggplot2::cut_interval(BostonHousing[[col]], 2))
}
## Plot bar chart for all discrete variables
plot_bar(BostonHousing)
```

At this point, we have a much better understanding of the data distribution. Now assume we are interested in **medv** (median value of owner-occupied homes in USD 1000's), and would like to build a model to predict it. Let's plot it against all the other variables:

```
plot_boxplot(BostonHousing, by = "medv")
plot_scatterplot(subset(BostonHousing, select = -c(crim, zn, indus, b)), by = "medv", size = 0.5)
plot_correlation(BostonHousing)
```

And this is how you slice & dice your data, and analyze correlation with merely 3 lines of code.

Feature engineering is a crucial step in building better models. DataExplorer provides a couple of functions to ease the process. All of them require a `data.table` as the input object, because it is lightning fast. However, if you don't feel like coding in `data.table` syntax, you may adopt the following process:

```
## Set your data to `data.table` first
library(data.table)
your_data <- data.table(your_data)
## Apply DataExplorer functions
group_category(your_data, ...)
drop_columns(your_data, ...)
set_missing(your_data, ...)
## Set data back to its original class, e.g., a plain data frame
setDF(your_data)
```

Let's return to the `BostonHousing` dataset. For the rest of this section, we'll assume the data has been converted to a `data.table` already.

```
library(data.table)
BostonHousingDT <- data.table(BostonHousing)
```

Remember those transformed continuous variables? Let's drop them:

`drop_columns(BostonHousingDT, c("crim", "zn", "indus", "b"))`

Note: Because `data.table` updates by reference, the original object is updated without the need to re-assign a returned object.

Let's take a look at the discrete variable **rad**:

`plot_bar(BostonHousingDT$rad)`

I think categories other than 4, 5 and 24 are too sparse, and might skew my model fit. How could I group all the sparse categories together?

```
group_category(BostonHousingDT, "rad", 0.25, update = FALSE)
# rad cnt pct cum_pct
# 1: 24 132 0.2608696 0.2608696
# 2: 5 115 0.2272727 0.4881423
# 3: 4 110 0.2173913 0.7055336
```

Looks like grouping the bottom 25% of **rad**'s categories would give me what I need. Let's do so:

```
group_category(BostonHousingDT, "rad", 0.25, update = TRUE)
plot_bar(BostonHousingDT$rad)
```

In addition to categorical frequency, you may also play with the `measure` argument to group by the sum of a different variable. See `?group_category` for more example use cases.

To generate a report of your data:

`create_report(BostonHousing)`

Currently, there are not many options for this function, but it is my plan to support customization of the generated report, so stay tuned for more features!

I hope you enjoyed exploring the Boston housing data with me. Finally, here are some additional resources about the DataExplorer package: