Hadley Wickham, RStudio's Chief Scientist and prolific author of R books and packages, conducted an AMA (Ask Me Anything) session on Reddit this past Monday. The session was tremendously popular, generating more than 500 questions and comments and propelling the AMA to the front page of Reddit.

If you're not familiar with Hadley's work (which would be a surprise if you're an R user), his own introduction in the Reddit AMA post will fill you in:

*Broadly, I'm interested in the process of data analysis/science and how to make it easier, faster, and more fun. That's what has led to the development of my most popular packages like ggplot2, dplyr, tidyr, stringr. This year, I've been particularly interested in making it as easy as possible to get data into R. That's led to my work on the DBI, haven, readr, readxl, and httr packages. Please feel free to ask me anything about the craft of data science.*

*I'm also broadly interested in the craft of programming, and the design of programming languages. I'm interested in helping people see the beauty at the heart of R and learn to master it as easily as possible. As well as a number of packages like devtools, testthat, and roxygen2, I've written two books along those lines: Advanced R, which teaches R as a programming language, mostly divorced from its usual application as a data analysis tool; and R packages, which teaches software development best practices for R: documentation, unit testing, etc.*

Check out the comments at the link below, where you'll find insights from Hadley on the best way to teach R, Big Data in R, the elegance (or otherwise) of the R language, being productive, the best BBQ, and much more.

by Joseph Rickert

I have been a big fan of R user groups since I attended my first meeting. There is just something about the vibe of being around people excited about what they are doing that feels good. From a speaker's perspective, presenting at an R user group meeting must be the rough equivalent of doing "stand-up" at a club where you know almost everyone and you are pretty sure people are going to like your material. So while user groups don't necessarily ignite R creativity (people don't do their best work just to present at an R user group meeting), they do help shine the spotlight on some really good stuff.

I attend all of the Bay Area useR group meetings, and quite a few other R related events throughout the year, but I only get to experience a small fraction of what is going on in the R world. In the spirit of sharing the "wish I was there" feeling, here are a few recent user group presentations from around the globe that look like they were informative, entertaining and motivating.

Tommy O'Dell gave a "Welcome to dplyr" talk to the Western Australia R Group (WARG) on September 10th. This is a very good presentation until near the very end, when it becomes an absolutely great presentation!! Motivated by a desire to use dplyr with R 2.12, an older version of R not supported by dplyr, Tommy deconstructed the dplyr "magic" to write his own package, rdplyr. This is a wonderful example of how curiosity and open source can open up many possibilities. The following slide comes from the section where Tommy explains some of the problems he encountered and how he worked through them.

On the 16th of September, Kevin Little gave a talk to MadR about how he recovered after "hitting the wall" in a failed first attempt to interface with the SurveyMonkey API using the Rmonkey package. Kevin's description of how he worked through the process, which included wading into some JSON scripting, is a motivational case study. Kevin wrote a blog post that provides background for the project and has made his slides available here.

Also in September Jim Porzak, a long-time contributor to the San Francisco Bay Area R community, described a detailed customer segmentation analysis in a presentation to BARUG. The following slide examines the stability of the clusters.

Finally, there is a small treasure trove of relatively recent work at the BaselR presentations page. These include a presentation from Aimee Gott on the Mango Solutions development environment and one from Anne Kuemmel on using simulations to calculate confidence intervals in pharma applications. Also have a look at Daniel Sabanes Bove's presentation on using R to produce Microsoft PowerPoint presentations, and some thoughtful advice from Reinhold Koch on how to go about creating a lively R community within your company.


*by Bob Horton, Microsoft Senior Data Scientist*

Learning curves are an elaboration of the idea of validating a model on a test set, and have been widely popularized by Andrew Ng’s Machine Learning course on Coursera. Here I present a simple simulation that illustrates this idea.

Imagine you use a sample of your data to train a model, then use the model to predict the outcomes on data where you know what the real outcome is. Since you know the “real” answer, you can calculate the overall error in your predictions. The error on the same data set used to train the model is called the *training error*, and the error on an independent sample is called the *validation error*.

A model will commonly perform better (that is, have lower error) on the data it was trained on than on an independent sample. The difference between the training error and the validation error reflects *overfitting* of the model. Overfitting is like memorizing the answers for a test instead of learning the principles (to borrow a metaphor from the Wikipedia article). Memorizing works fine if the test is exactly like the study guide, but it doesn’t work very well if the test questions are different; that is, it doesn’t generalize. In fact, the more a model is overfitted, the higher its validation error is likely to be. This is because the spurious correlations the overfitted model memorized from the training set most likely don’t apply in the validation set.

Overfitting is usually more extreme with small training sets. In large training sets the random noise tends to average out, so that the underlying patterns are more clear. But in small training sets, there is less opportunity for averaging out the noise, and accidental correlations consequently have more influence on the model. Learning curves let us visualize this relationship between training set size and the degree of overfitting.

We start with a function to generate simulated data:

```
sim_data <- function(N, noise_level = 1){
  X1 <- sample(LETTERS[1:10], N, replace = TRUE)
  X2 <- sample(LETTERS[1:10], N, replace = TRUE)
  X3 <- sample(LETTERS[1:10], N, replace = TRUE)
  y <- 100 + ifelse(X1 == X2, 10, 0) + rnorm(N, sd = noise_level)
  data.frame(X1, X2, X3, y)
}
```

The input columns X1, X2, and X3 are categorical variables which each have 10 possible values, represented by capital letters `A` through `J`. The outcome is cleverly named `y`; it has a base level of 100, but if the values in the first two `X` variables are equal, this is increased by 10. On top of this we add some normally distributed noise. Any other pattern that might appear in the data is accidental.

Now we can use this function to generate a simulated data set for experiments.

```
set.seed(123)
data <- sim_data(25000, noise_level=10)
```

There are many possible error functions, but I prefer the root mean squared error:

`rmse <- function(actual, predicted) sqrt( mean( (actual - predicted)^2 ))`
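For example, on a small made-up vector of actuals and predictions (an illustration, not from the original post):

```r
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

rmse(c(2, 4, 6), c(2, 4, 6))  # 0: perfect predictions give zero error
rmse(c(2, 4, 6), c(2, 4, 9))  # sqrt(mean(c(0, 0, 9))) = sqrt(3), about 1.73
```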

To generate a learning curve, we fit models at a series of different training set sizes, and calculate the training error and validation error for each model. Then we will plot these errors against the training set size. Here the parameters are a model formula, the data frame of simulated data, the validation set size (vss), the number of different training set sizes we want to plot, and the smallest training set size to start with. The largest training set will be all the rows of the dataset that are not used for validation.

```
run_learning_curve <- function(model_formula, data, vss=5000, num_tss=30, min_tss=1000){
  library(data.table)
  max_tss <- nrow(data) - vss
  tss_vector <- seq(min_tss, max_tss, length=num_tss)
  data.table::rbindlist(lapply(tss_vector, function(tss){
    vs_idx <- sample(1:nrow(data), vss)
    vs <- data[vs_idx,]
    ts_eligible <- setdiff(1:nrow(data), vs_idx)
    ts <- data[sample(ts_eligible, tss),]
    fit <- lm(model_formula, ts)
    training_error <- rmse(ts$y, predict(fit, ts))
    validation_error <- rmse(vs$y, predict(fit, vs))
    data.frame(tss = tss,
               error_type = factor(c("training", "validation"),
                                   levels = c("validation", "training")),
               error = c(training_error, validation_error))
  }))
}
```

We’ll use a formula that considers all combinations of the input columns. Since these are categorical inputs, they will be represented by dummy variables in the model, with each combination of variable values getting its own coefficient.
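As a quick check (not part of the original post), model.matrix() shows how many dummy-variable columns this formula implies: with three 10-level factors, the full interaction y ~ X1*X2*X3 expands to 1 + 3*9 + 3*81 + 729 = 1000 coefficients.

```r
# Three 10-level factors, as in sim_data(); the particular values are arbitrary
d <- data.frame(
  X1 = factor(rep(LETTERS[1:10], times = 100), levels = LETTERS[1:10]),
  X2 = factor(rep(LETTERS[1:10], each = 100), levels = LETTERS[1:10]),
  X3 = factor(rep(LETTERS[1:10], each = 10, times = 10), levels = LETTERS[1:10])
)

# Intercept + 3 main effects (9 each) + 3 two-way (81 each) + three-way (729)
ncol(model.matrix(~ X1 * X2 * X3, d))  # 1000
```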

`learning_curve <- run_learning_curve(y ~ X1*X2*X3, data)`

With this example, you get a series of warnings:

```
## Warning in predict.lm(fit, vs): prediction from a rank-deficient fit may be
## misleading
```

This is R trying to tell you that you don’t have enough rows to reliably fit all those coefficients. In this simulation, training set sizes above about 7500 don’t trigger the warning, though as we’ll see the curve still shows some evidence of overfitting.

```
library(ggplot2)
ggplot(learning_curve, aes(x=tss, y=error, linetype=error_type)) +
  geom_line(size=1, col="blue") +
  xlab("training set size") +
  geom_hline(yintercept=10, linetype=3)
```

In this figure, the X-axis represents different training set sizes and the Y-axis represents error. Validation error is shown in the solid blue line on the top part of the figure, and training error is shown by the dashed blue line in the bottom part. As the training set sizes get larger, these curves converge toward a level representing the amount of irreducible error in the data. This plot was generated using a simulated dataset where we know exactly what the irreducible error is; in this case it is the standard deviation of the Gaussian noise we added to the output in the simulation (10; the root mean squared error is essentially the same as standard deviation for reasonably large sample sizes). We don’t expect any model to reliably fit this error since we know it was completely random.

One interesting thing about this simulation is that the underlying system is very simple, yet it can take many thousands of training examples before the validation error of this model gets very close to optimum. In real life, you can easily encounter systems with many more variables, much higher cardinality, far more complex patterns, and of course lots and lots of those unpredictable variations we call “noise”. You can easily encounter situations where truly enormous numbers of samples are needed to train your model without excessive overfitting. On the other hand, if your training and validation error curves have already converged, more data may be superfluous. Learning curves can help you see if you are in a situation where more data is likely to be of benefit for training your model better.

If you've developed a useful function in R (say, a function to make a forecast or prediction from a statistical model), you may want to call that function from an application other than R. For example, you might want to display the forecast (calculated in R) as part of a desktop, web-based or mobile application. One solution is to install R alongside the application and call it directly, but that can be difficult — or impossible, in the case of mobile apps. (You also need to be careful to comply with R's open-source GPL2 license.)

Oftentimes, an easier way is to install R on a cloud-based server, and call your R function via a remote API. If you manage such a server yourself, one solution is to install DeployR on the server, and publish your function that way. But now there's an even simpler alternative: use the AzureML package (now available on CRAN) to publish your function directly to the Microsoft Azure cloud service, and then call that function using a simple REST call.

To get started, you'll need your Azure Workspace ID wsID and Workspace Authorization Token wsAuth (this Technet blog post by Raymond Laghaeian provides the details, and if you don't yet have an Azure subscription a free trial is available). Then, use the publishWebService function to publish the function to the cloud. Here's the example from the blog post:

```
irisWebService <- publishWebService(
  "predictSpecies",         # R function to publish
  "irisSpeciesWebService",  # service name
  list("sep_len"="float", "sep_wid"="float",
       "pet_len"="float", "pet_wid"="float"),  # parameters and types
  list("species"="int"),    # result and type
  wsID, wsAuth              # authorization ID/token
)
```

All you need to do is specify the function to publish (here, a user-defined R function called predictSpecies), its input parameters (and their types), and the result type (along with a name for your service and authentication info). The AzureML package handles delivering the contents of your function (and any dependencies) to the Azure cloud, and setting up a web service for you according to your specifications. You can then manage this web service as a standard Azure service, test it out, and even monitor how often it's called:

Using the AzureML package, you can make any R function you create available to any application connected to the web, as long as the inputs and outputs are simple data types supported by the API. You can find more details in the AzureML package vignette and in the blog post linked below.

Technet Machine Learning Blog: Build & Deploy Predictive Web Apps Using RStudio and Azure ML

The RHadoop packages make it easy to connect R to Hadoop data (rhdfs), and write map-reduce operations in the R language (rmr2) to process that data using the power of the nodes in a Hadoop cluster. But getting the Hadoop cluster configured, with R and all the necessary packages installed on each node, hasn't always been so easy.

But now with HDInsight, Microsoft's Apache Hadoop-in-the-cloud service, it's much easier. As you configure your Hadoop cluster, you now have the option of installing R and RHadoop as part of the setup process. It's simply a matter of setting an option to run a pre-prepared script on the cluster nodes, and complete instructions are provided for Linux-based and Windows-based Hadoop clusters.

With the cluster thus configured, you can then use simple R commands to create data in HDFS, and use the mapreduce function from rmr2 to perform calculations on data using any R function, as shown in the toy example below:
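A sketch of what such a toy example typically looks like, following the standard rmr2 tutorials (it assumes a configured Hadoop cluster, or rmr2's local backend, with the packages installed):

```r
library(rmr2)

# For experimenting without a cluster you can use the local backend:
# rmr.options(backend = "local")

# Write a small vector of integers into HDFS
small_ints <- to.dfs(1:1000)

# A map-only job: for each value, emit the value as key and its square as value
squares <- mapreduce(
  input = small_ints,
  map = function(k, v) keyval(v, v^2)
)

# Read the results back into R
result <- from.dfs(squares)
head(result$val)
```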

The script also installs a collection of R packages that will be useful for your mapreduce calls: rJava, Rcpp, RJSONIO, bitops, digest, functional, reshape2, stringr, plyr, caTools, and stringdist. And of course you can modify the setup script to install any other packages or tools you need on the nodes.

HDInsight is available with your Microsoft Azure subscription, or you can try it for free with a one-month trial of Azure. If you're new to HDInsight, you might also want to check out these tutorials on getting started with Linux and Windows Hadoop clusters.

Microsoft HDInsight: Install and use R on Linux and Windows HDInsight Hadoop clusters

by Joseph Rickert

This week, the Infrastructure Steering Committee (ISC) of the R Consortium unanimously elected Hadley Wickham as its chair thereby also giving Hadley a seat on the R Consortium board of directors. Congratulations Hadley!!

This is a major step toward putting the R Consortium in business. Not only is the ISC the group that will decide which projects the R Consortium will undertake, but it will also be responsible for actually getting the work done. (Look here for the charter of the ISC.)

The whole process of funding, soliciting, selecting and executing projects will work something like this: The board of directors under the leadership of its chair, Richard Pugh of Mango Solutions, will establish a budget for projects. The ISC will solicit proposals for new projects both from R Consortium member companies and from the R Community at large. With approval from the board, the ISC will decide which projects to fund. From there on, the ISC will assemble resources and manage the work. That’s the plan. The devil, of course, is in the details. There is much work to be done to put all of the necessary infrastructure in place, but Hadley’s election makes it possible for the ISC to begin bootstrapping the process.

So, while there is currently no formal proposal process in place, and the ISC and the R Consortium are **not ready** to begin soliciting proposals from the public, it is not too early for the R Community to begin thinking about what work needs to be done. Now is the time to begin thinking on a grand scale; well, at least on a scale that might be a bit more ambitious than creating a single R package.

What type of project might make the cut? I don’t want to set up any constraints here, or limit possibilities. But, just to pick one application area, it seems to me that there were more than a few ideas kicked around in the HP Workshop on Distributed Computing in R held earlier this year that could be formulated into exciting and important projects. How about a unified interface for distributed computing in R?

If you have an idea for a project that you think would benefit the general R community, but that is more complicated than writing a simple package, please start thinking about how you would write up your ideas, elaborating on the benefits to the R Community, technical feasibility, required resources, etc. And stay tuned to R Consortium Announcements for information on when the proposal process will begin.

I’ll finish here by congratulating Hadley one more time, and by saying that I am very pleased to have the opportunity to work with him and the other members of the committee. I expect that with Hadley’s technical leadership, the guidance of the board of directors, and the participation of committed R users, the R Consortium will become an effective advocate and source of support for the R Community.

You can write to the ISC at: isc@r-consortium.org

**Some facts about the R Consortium**

**Founded:** June 19, 2015

**Status:** The R Consortium is organized as a Linux Foundation Collaborative Project

**Member organizations:** Alteryx, Google, Hewlett Packard, Ketchum Trading, Mango Solutions, Microsoft, Oracle, The R Foundation, RStudio and Tibco

**Board of Directors:** David Smith (Microsoft), Hadley Wickham (RStudio), John Chambers (R Foundation), J.J. Allaire (RStudio), Louis Bajuk-Yorgan (Tibco) and Chair, Richard Pugh (Mango Solutions)

**ISC Members:** Hadley Wickham (RStudio), Joseph Rickert (Microsoft), Luke Tierney (R Foundation) and Stephen Kaluzny (Tibco)


by Andrie de Vries

Every once in a while I try to remember how to do interpolation using R. This is not something I do frequently in my workflow, so I do the usual sequence of finding the appropriate help page:

```
?interpolate

Help pages:

stats::approx          Interpolation Functions
stats::NLSstClosestX   Inverse Interpolation
stats::spline          Interpolating Splines
```

So, the help tells me to use approx() to perform linear interpolation. This is an interesting function, because the help page also describes approxfun() that does the same thing as approx(), except that approxfun() returns a function that does the interpolation, whilst approx() returns the interpolated values directly.

(In other words, approxfun() acts a little bit like a predict() method for approx().)
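A small example with a few made-up points makes the distinction concrete:

```r
x <- 1:5
y <- c(1, 4, 9, 16, 25)

# approx() returns the interpolated values directly
approx(x, y, xout = 2.5)$y    # 6.5 (halfway between y = 4 and y = 9)

# approxfun() returns a function that you can call repeatedly
f <- approxfun(x, y)
f(2.5)            # 6.5
f(c(1.5, 3.5))    # 2.5 12.5
```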

The help page for approx() also points to stats::spline() to do spline interpolation and from there you can find smooth.spline() for smoothing splines.

Talking about smoothing, base R also contains the function smooth(), an implementation of running median smoothers (algorithm proposed by Tukey).

Finally I want to mention loess(), a function for local polynomial regression fitting. (The function loess() underlies stat_smooth() as one of the default smoothing methods in the ggplot2 package.)

I set up a little experiment to see how the different functions behave. To do this, I simulate some random data in the shape of a sine wave. Then I use each of these functions to interpolate or smooth the data.

On my generated data, the interpolation functions approx() and spline() give a quite ragged interpolation. The running median function smooth() doesn't do much better; there is simply too much variance in the data.

The smooth.spline() function does a great job at finding a smoother using default values.

The last two plots illustrate loess(), the local regression estimator. Notice that loess() needs a tuning parameter (span): the lower its value, the fewer points are used in each local fit. Thus with a span of 0.1 the curve tracks the data much more closely than with a span of 0.5.

Here is the code:
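A minimal reconstruction of such an experiment (an illustrative sketch, not the post's original script) might look like this:

```r
set.seed(42)

# Simulate a noisy sine wave
x <- seq(0, 4 * pi, length.out = 200)
y <- sin(x) + rnorm(length(x), sd = 0.5)

# Interpolation through every 10th point
idx <- seq(1, length(x), by = 10)
lin <- approx(x[idx], y[idx], xout = x)   # linear interpolation
spl <- spline(x[idx], y[idx], xout = x)   # spline interpolation

# Smoothers applied to the full data
med <- smooth(y)                          # Tukey running medians
ss  <- smooth.spline(x, y)                # smoothing spline
lo1 <- loess(y ~ x, span = 0.1)           # local regression, small span
lo5 <- loess(y ~ x, span = 0.5)           # local regression, larger span

# Compare a couple of the fits against the data
plot(x, y, pch = 16, col = "grey")
lines(ss, col = "blue", lwd = 2)
lines(x, predict(lo5, data.frame(x = x)), col = "red", lwd = 2)
```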

by John Mount (more articles) and Nina Zumel (more articles).

In this article we conclude our four-part series on basic model testing. When fitting and selecting models in a data science project, how do you know that your final model is good? And how sure are you that it's better than the models you rejected? In this concluding Part 4 of "How do you know if your model is going to work?" we demonstrate cross-validation techniques. Previously we worked on:

Cross-validation techniques attempt to improve statistical efficiency by repeatedly splitting the data into train and test sets and re-performing model fitting and model evaluation. For example, the variation called k-fold cross-validation splits the original data into k roughly equal-sized sets. To score each set we build a model on all data not in the set and then apply that model to the set. This means we build k different models (none of which is our final model, which is traditionally trained on all of the data).

This is statistically efficient, as each model is trained on a 1 - 1/k fraction of the data; for k=20 we are using 95% of the data for training. Another variation called "leave one out" (which is essentially jackknife resampling) is even more statistically efficient, as each datum is scored on a unique model built using all other data. It is, however, computationally inefficient, as you construct a very large number of models (except in special cases such as the PRESS statistic for linear regression).
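As an illustrative sketch (not code from the Win-Vector article), k-fold cross-validation of a simple linear model takes only a few lines of base R; the data here are simulated:

```r
set.seed(7)
d <- data.frame(x = runif(100))
d$y <- 3 * d$x + rnorm(100)

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

k <- 20
fold <- sample(rep(1:k, length.out = nrow(d)))  # assign each row to a fold

cv_errors <- sapply(1:k, function(i) {
  train <- d[fold != i, ]  # each model sees a (1 - 1/k) fraction of the data
  test  <- d[fold == i, ]
  fit <- lm(y ~ x, data = train)
  rmse(test$y, predict(fit, test))
})

mean(cv_errors)  # cross-validated estimate of out-of-sample error
```

Note that each of the k fitted models is discarded after scoring; only the error estimate is kept, which is exactly why cross-validation describes the fitting procedure rather than any single model.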

Statisticians tend to prefer cross-validation techniques to a test/train split, as cross-validation is more statistically efficient and can give sampling-distribution-style distributional estimates (instead of mere point estimates). However, remember that cross-validation techniques measure facts *about the fitting procedure* and *not about the actual model in hand* (so they are answering a different question than a test/train split). There is some attraction to actually scoring the model you are going to turn in (as is done with in-sample methods and test/train split, but not with cross-validation). The way to remember this is: bosses are essentially frequentist (they want to know their team and procedure tend to produce good models) and employees are essentially Bayesian (they want to know the actual model they are turning in is likely good; see here for how the nature of the question you are trying to answer controls whether you are in a Bayesian or frequentist situation).

To read more: Win-Vector - How do you know if your model is going to work? Part 4: Cross-validation techniques

The Effective Applications of R (EARL) Conference (held last week in London) is well-named. At the event I saw many examples of R being used to solve real-world industry problems with advanced statistics and data visualization. Here are just a few examples:

**AstraZeneca**, the pharmaceutical company, uses R to design clinical trials, and to predict the ending date of the trials based on planned interim analyses of the data.

**Allianz**, the financial services company, has deployed a massively-parallel "R-as-a-Service" production environment to support real-time banking processes.

**KPMG**, one of the "big four" consulting companies, used R to simulate the impact of rule-changes to Britain's National Lottery, to better distribute prizes amongst players.

**Allstate**, the insurance company, uses R to build predictive models to set premiums and calculate risk profiles.

**Douwe Egberts**, the coffee company, uses R to analyze consumer preferences for coffee, and to design coffee roasts with desired flavour profiles.

**Atass Sports**, a company that specializes in forecasting sports results, used R to identify cases of match-fixing in professional tennis.

There were also several other examples of industry applications from parallel sessions I couldn't attend, from companies including Lloyds of London, Shell, UBS, Deloitte, UniCredit, BCA Marketplace, TIM Group, PartnerRe, and hosts Mango Solutions.

The next EARL conference will be held in Boston, November 2-4, where I'm honoured to be included as a keynote speaker. I'm looking forward to learning about many more applications of R there!

Microsoft is sponsoring another free MOOC starting on September 24: **Data Science and Machine Learning Essentials**. This course provides a five-week introduction to machine learning and data science concepts, including the open-source programming tools for data science: R and Python. (Read more about the course in this post on TechNet.) The course is organized into 5 weekly modules, each concluding with a quiz (and if you wish, you can purchase a verified certificate from edX to show off your passing grade).

The course is presented by Cynthia Rudin (Professor of Statistics at MIT) and Steve Elston (author of Data Science in the Cloud with Azure ML and R), who will also participate in the course forum, and host office hours to answer questions that come up during the course.

If you're new to R, you might want to get prepared by reviewing the materials from the previous Microsoft-sponsored edX course, Introduction to R. The new course on Data Science Essentials begins online on September 24, and you can register for free at the link below.