by Joseph Rickert

Early October: somewhere the leaves are turning brilliant colors, temperatures are cooling down, and that back-to-school feeling is in the air. And for more people than ever before, it is going to seem like a good time to commit to really learning R. I have some suggestions for R courses below, but first: what does it mean to learn R anyway? My take is that the answer depends on a person's circumstances and motivation.

I find the following graphic to be helpful in sorting things out.

The X axis is time on Malcolm Gladwell's "Outliers" scale. His idea is that it takes 10,000 hours of real effort to master anything: R, Python or rock and roll guitar. The Y axis lists increasingly difficult R tasks, and the arrows within the plot area label increasingly proficient types of R users.

The point I want to make here is that a significant amount of very productive R work happens in the area around the red ellipse. So, while there is no avoiding "10,000" hours of hard work to become an R Jedi knight, a curious and motivated person can master enough R to accomplish his/her programming goals with a more modest commitment. There are three main reasons for this:

- R's functional programming style is very well suited for statistical modeling, data visualization and data science tasks
- The 7,000+ packages available in the R ecosystem provide tens of thousands of functions that make it possible to accomplish quite a bit without having to write much code
- Numerous, high-quality books and online materials devoted to teaching statistical theory and data science with R

If you have a background in some area of statistics or data science, a viable strategy for learning R is to identify a resource that works for you and just jump into the middle of things, picking up R as you go along.

The lists below link to courses that can either start you on a formal programming path, or help you become a productive R user in a particular application area. Some of the courses are "live events" that you take with a cohort of students, others are set up for self study.

The courses devoted to teaching R as a programming language are

- The Data Scientist's Toolbox
- R Programming
- Introduction to R Programming
- Introduction to R
- R Programming - Introduction 1
- Introducción a la programación estadística con R
- O’Reilly Code School

The first two courses above are from Coursera's Data Science Specialization sequence. Taught by Roger Peng, Jeff Leek and Brian Caffo, they are probably the gold standard for MOOC R courses. I am a little late with this post: The Data Scientist's Toolbox started this past Monday, but there is still time to catch up. The third course, Introduction to R Programming, is a relatively new edX course from Microsoft's online offerings that is getting great reviews. The fourth course on the list is a solid introduction to R from DataCamp. R Programming - Introduction 1 is a beginner's introduction to R taught by Paul Murrell or Tal Galili. Rounding out the list are a Spanish-language introduction to R from Coursera and O'Reilly's interactive Code School course.

These next three lists contain courses from DataCamp and statistics.com, and online resources from RStudio, that introduce more advanced features of R by building on basic R programming skills. Note that the final course on the DataCamp list introduces Big Data features of Revolution R Enterprise, which is available in the Azure Marketplace.

- Intermediate R
- Data Visualization in R with ggvis
- Data Manipulation with dplyr
- Data Analysis in R, the data.table Way
- Reporting with R Markdown
- Big data Analysis with Revolution R Enterprise

This next section lists courses from the major MOOCs, and from the non-MOOCs DataCamp and statistics.com, that use R to teach various quantitative disciplines.

**Coursera Courses**

- Data Analysis and Statistical Inference
- Developing Data Products
- Exploratory Data Analysis
- Getting and Cleaning Data
- Introduction to Computational Finance and Financial Econometrics
- Measuring Causal Effects in the Social Sciences
- Regression Models
- Reproducible Research
- Statistical Inference
- Statistics One

**edX Courses**

- Data Analysis for Life Sciences 1: Statistics and R
- Data Analysis for Life Sciences 2: Introduction to Linear Models and Matrix Algebra
- Data Analysis for Life Sciences 6: High-performance Computing for Reproducible Genomics
- Explore Statistics with R
- Sabermetrics 101: Introduction to Baseball Analytics

**Udacity Course**

**DataCamp**

**statistics.com**

Finally, here are a couple of Google apps and swirl, a new platform for teaching and learning R, that may be useful for learning on the go.

It's time to "go back to school" and make some headway against those 10,000 hours.

*by Bob Horton, Microsoft Senior Data Scientist*

Learning curves are an elaboration of the idea of validating a model on a test set, and have been widely popularized by Andrew Ng’s Machine Learning course on Coursera. Here I present a simple simulation that illustrates this idea.

Imagine you use a sample of your data to train a model, then use the model to predict the outcomes on data where you know what the real outcome is. Since you know the “real” answer, you can calculate the overall error in your predictions. The error on the same data set used to train the model is called the *training error*, and the error on an independent sample is called the *validation error*.

A model will commonly perform better (that is, have lower error) on the data it was trained on than on an independent sample. The difference between the training error and the validation error reflects *overfitting* of the model. Overfitting is like memorizing the answers for a test instead of learning the principles (to borrow a metaphor from the Wikipedia article). Memorizing works fine if the test is exactly like the study guide, but it doesn’t work very well if the test questions are different; that is, it doesn’t generalize. In fact, the more a model is overfitted, the higher its validation error is likely to be. This is because the spurious correlations the overfitted model memorized from the training set most likely don’t apply in the validation set.

Overfitting is usually more extreme with small training sets. In large training sets the random noise tends to average out, so that the underlying patterns are more clear. But in small training sets, there is less opportunity for averaging out the noise, and accidental correlations consequently have more influence on the model. Learning curves let us visualize this relationship between training set size and the degree of overfitting.

We start with a function to generate simulated data:

```
sim_data <- function(N, noise_level = 1){
  X1 <- sample(LETTERS[1:10], N, replace = TRUE)
  X2 <- sample(LETTERS[1:10], N, replace = TRUE)
  X3 <- sample(LETTERS[1:10], N, replace = TRUE)
  y <- 100 + ifelse(X1 == X2, 10, 0) + rnorm(N, sd = noise_level)
  data.frame(X1, X2, X3, y)
}
```

The input columns X1, X2, and X3 are categorical variables which each have 10 possible values, represented by the capital letters `A` through `J`. The outcome is cleverly named `y`; it has a base level of 100, but if the values in the first two `X` variables are equal, this is increased by 10. On top of this we add some normally distributed noise. Any other pattern that might appear in the data is accidental.

Now we can use this function to generate a simulated data set for experiments.

```
set.seed(123)
data <- sim_data(25000, noise_level = 10)
```

There are many possible error functions, but I prefer the root mean squared error:

`rmse <- function(actual, predicted) sqrt( mean( (actual - predicted)^2 ))`
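For instance, a quick sanity check of this function on a toy vector (a hypothetical example, not part of the original post):

```r
rmse <- function(actual, predicted) sqrt( mean( (actual - predicted)^2 ))

# Two of three predictions are exact; the third is off by 2,
# so the RMSE is sqrt(mean(c(0, 0, 4))) = sqrt(4/3):
rmse(c(1, 2, 3), c(1, 2, 5))  # approximately 1.155
```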

To generate a learning curve, we fit models at a series of different training set sizes, and calculate the training error and validation error for each model. Then we will plot these errors against the training set size. Here the parameters are a model formula, the data frame of simulated data, the validation set size (vss), the number of different training set sizes we want to plot, and the smallest training set size to start with. The largest training set will be all the rows of the dataset that are not used for validation.

```
run_learning_curve <- function(model_formula, data, vss = 5000, num_tss = 30, min_tss = 1000){
  library(data.table)
  max_tss <- nrow(data) - vss
  tss_vector <- seq(min_tss, max_tss, length = num_tss)
  data.table::rbindlist(lapply(tss_vector, function(tss){
    vs_idx <- sample(1:nrow(data), vss)
    vs <- data[vs_idx, ]
    ts_eligible <- setdiff(1:nrow(data), vs_idx)
    ts <- data[sample(ts_eligible, tss), ]
    fit <- lm(model_formula, ts)
    training_error <- rmse(ts$y, predict(fit, ts))
    validation_error <- rmse(vs$y, predict(fit, vs))
    data.frame(tss = tss,
               error_type = factor(c("training", "validation"),
                                   levels = c("validation", "training")),
               error = c(training_error, validation_error))
  }))
}
```

We’ll use a formula that considers all combinations of the input columns. Since these are categorical inputs, they will be represented by dummy variables in the model, with each combination of variable values getting its own coefficient.

`learning_curve <- run_learning_curve(y ~ X1*X2*X3, data)`

With this example, you get a series of warnings:

```
## Warning in predict.lm(fit, vs): prediction from a rank-deficient fit may be
## misleading
```

This is R trying to tell you that you don’t have enough rows to reliably fit all those coefficients. In this simulation, training set sizes above about 7500 don’t trigger the warning, though as we’ll see the curve still shows some evidence of overfitting.

```
library(ggplot2)
ggplot(learning_curve, aes(x = tss, y = error, linetype = error_type)) +
  geom_line(size = 1, col = "blue") +
  xlab("training set size") +
  geom_hline(yintercept = 10, linetype = 3)
```

In this figure, the X-axis represents different training set sizes and the Y-axis represents error. Validation error is shown in the solid blue line on the top part of the figure, and training error is shown by the dashed blue line in the bottom part. As the training set sizes get larger, these curves converge toward a level representing the amount of irreducible error in the data. This plot was generated using a simulated dataset where we know exactly what the irreducible error is; in this case it is the standard deviation of the Gaussian noise we added to the output in the simulation (10; the root mean squared error is essentially the same as standard deviation for reasonably large sample sizes). We don’t expect any model to reliably fit this error since we know it was completely random.

One interesting thing about this simulation is that the underlying system is very simple, yet it can take many thousands of training examples before the validation error of this model gets very close to optimum. In real life, you can easily encounter systems with many more variables, much higher cardinality, far more complex patterns, and of course lots and lots of those unpredictable variations we call “noise”. You can easily encounter situations where truly enormous numbers of samples are needed to train your model without excessive overfitting. On the other hand, if your training and validation error curves have already converged, more data may be superfluous. Learning curves can help you see if you are in a situation where more data is likely to be of benefit for training your model better.

The RHadoop packages make it easy to connect R to Hadoop data (rhdfs), and write map-reduce operations in the R language (rmr2) to process that data using the power of the nodes in a Hadoop cluster. But getting the Hadoop cluster configured, with R and all the necessary packages installed on each node, hasn't always been so easy.

But now with HDInsight, Microsoft's Apache Hadoop-in-the-cloud service, it's much easier. As you configure your Hadoop cluster, you now have the option of installing R and RHadoop as part of the setup process. It's simply a matter of setting an option to run a pre-prepared script on the cluster nodes, and complete instructions are provided for Linux-based and Windows-based Hadoop clusters.

With the cluster thus configured, you can then use simple R commands to create data in HDFS, and use the mapreduce function from rmr2 to perform calculations on data using any R function, as shown in the toy example below:
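The toy example itself did not survive extraction; a minimal sketch in the spirit of the standard RHadoop tutorials (an assumption, not necessarily the original code; it requires a configured cluster, or `rmr.options(backend = "local")` for local experimentation):

```r
library(rmr2)

# Write a vector of integers into HDFS:
small_ints <- to.dfs(1:1000)

# Map each integer to its square with a map-only job:
result <- mapreduce(
  input = small_ints,
  map   = function(k, v) keyval(v, v^2)
)

# Read the key-value pairs back from HDFS:
head(from.dfs(result)$val)
```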

The script also installs a collection of R packages that will be useful for your mapreduce calls: rJava, Rcpp, RJSONIO, bitops, digest, functional, reshape2, stringr, plyr, caTools, and stringdist. And of course you can modify the setup script to install any other packages or tools you need on the nodes.

HDInsight is available with your Microsoft Azure subscription, or you can try HDInsight for free with a free one-month trial of Azure. If you're new to HDInsight, you might also want to check out these tutorials on getting started with Linux and Windows Hadoop clusters.

Microsoft HDInsight: Install and use R on Linux and Windows HDInsight Hadoop clusters

by Andrie de Vries

Every once in a while I try to remember how to do interpolation using R. This is not something I do frequently in my workflow, so I do the usual sequence of finding the appropriate help page:

```
??interpolate

Help pages:
  stats::approx          Interpolation Functions
  stats::NLSstClosestX   Inverse Interpolation
  stats::spline          Interpolating Splines
```

So, the help tells me to use approx() to perform linear interpolation. This is an interesting function, because the help page also describes approxfun() that does the same thing as approx(), except that approxfun() returns a function that does the interpolation, whilst approx() returns the interpolated values directly.

(In other words, approxfun() acts a little bit like a predict() method for approx().)
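A quick illustration of the difference, using hypothetical values (not from the original post):

```r
x <- 1:10
y <- x^2

# approx() returns the interpolated values directly:
approx(x, y, xout = 2.5)$y   # 6.5, the linear interpolation between 4 and 9

# approxfun() returns a function that does the interpolation,
# which you can call later, predict()-style:
f <- approxfun(x, y)
f(2.5)                       # 6.5, the same result
f(c(1.5, 7.25))              # and it works on vectors too
```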

The help page for approx() also points to stats::spline() to do spline interpolation and from there you can find smooth.spline() for smoothing splines.

Talking about smoothing, base R also contains the function smooth(), an implementation of running median smoothers (algorithm proposed by Tukey).

Finally I want to mention loess(), a function that fits local polynomial regressions. (The function loess() underlies stat_smooth() as one of the defaults in the package ggplot2.)

I set up a little experiment to see how the different functions behave. To do this, I simulate some random data in the shape of a sine wave. Then I use each of these functions to interpolate or smooth the data.

On my generated data, the interpolation functions approx() and spline() give quite a ragged interpolation. The running median function smooth() doesn't do much better - there simply is too much variance in the data.

The smooth.spline() function does a great job at finding a smoother using default values.

The last two plots illustrate loess(), the local regression estimator. Notice that loess() needs a tuning parameter (span). The lower the value of span, the smaller the number of points used in each local fit. Thus with a value of 0.1 the fit tracks the data much more closely than with a value of 0.5.

Here is the code:
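The original code block did not survive; what follows is a minimal sketch of the experiment described above (the simulated sine-wave data and parameter choices are assumptions, not necessarily the original):

```r
set.seed(42)
x <- seq(0, 4 * pi, length.out = 200)
y <- sin(x) + rnorm(length(x), sd = 0.5)   # noisy sine wave

lin <- approx(x, y, n = 500)          # linear interpolation
spl <- spline(x, y, n = 500)          # spline interpolation
med <- smooth(y)                      # Tukey running medians
ss  <- smooth.spline(x, y)            # smoothing spline
lo1 <- loess(y ~ x, span = 0.1)       # local regression, small span
lo5 <- loess(y ~ x, span = 0.5)       # local regression, larger span

# Overlay two of the smoothers on the raw data:
plot(x, y, pch = 16, col = "grey")
lines(x, predict(ss, x)$y, col = "blue", lwd = 2)
lines(x, predict(lo5), col = "red", lwd = 2)
```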

by John Mount (more articles) and Nina Zumel (more articles).

In this article we conclude our four part series on basic model testing. When fitting and selecting models in a data science project, how do you know that your final model is good? And how sure are you that it's better than the models that you rejected? In this concluding Part 4 of our four part mini-series "How do you know if your model is going to work?" we demonstrate cross-validation techniques. Previously we worked on:

Cross validation techniques attempt to improve statistical efficiency by repeatedly splitting data into train and test sets and re-performing the model fit and model evaluation. For example: the variation called k-fold cross-validation splits the original data into k roughly equal-sized sets. To score each set we build a model on all data not in the set and then apply the model to our set. This means we build k different models (none of which is our final model, which is traditionally trained on all of the data).

This is statistically efficient as each model is trained on a 1-1/k fraction of the data; for k=20 we are using 95% of the data for training. Another variation called "leave one out" (which is essentially jackknife resampling) is even more statistically efficient, as each datum is scored on a unique model built using all other data. However, it is computationally expensive, as you must construct a very large number of models (except in special cases such as the PRESS statistic for linear regression).
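As a sketch, k-fold cross-validation for a simple linear model might look like this in base R (hypothetical data and code, not the authors' implementation):

```r
set.seed(1)
k <- 5
d <- data.frame(x = rnorm(100))
d$y <- 2 * d$x + rnorm(100)

# Assign each row to one of k folds at random:
folds <- sample(rep(1:k, length.out = nrow(d)))

cv_rmse <- sapply(1:k, function(i) {
  fit <- lm(y ~ x, data = d[folds != i, ])   # train on all other folds
  held_out <- d[folds == i, ]
  sqrt(mean((held_out$y - predict(fit, held_out))^2))
})

mean(cv_rmse)  # cross-validated estimate of out-of-sample error
```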

Statisticians tend to prefer cross-validation techniques to a test/train split as cross-validation techniques are more statistically efficient and can give sampling-distribution style distributional estimates (instead of mere point estimates). However, remember that cross validation techniques measure facts *about the fitting procedure* and *not about the actual model in hand* (so they are answering a different question than a test/train split). There is some attraction to actually scoring the model you are going to turn in (as is done with in-sample methods and test/train split, but not with cross-validation). The way to remember this is: bosses are essentially frequentist (they want to know that their team and procedure tend to produce good models) and employees are essentially Bayesian (they want to know that the actual model they are turning in is likely good; see here for how the nature of the question you are trying to answer controls whether you are in a Bayesian or frequentist situation).

To read more: Win-Vector - How do you know if your model is going to work? Part 4: Cross-validation techniques

by Joseph Rickert

In a recent post focused on plotting time series with the new dygraphs package, I did not show how easy it is to read financial data into R. However, in a thoughtful comment on the post, Achim Zeileis pointed out a number of features built into the basic R time series packages that everyone ought to know. In this post, I will just elaborate a little on what Achim sketched out. First off, I began the previous post with url strings that point to stock data for IBM and LinkedIn. Yahoo Finance makes this sort of thing easy: a quick search for a stock will bring you to a page with historical stock prices. Then, it is only a matter of copying the url to the associated csv file, and as Achim points out:

If you already have the download URLs ibm_url and lnkd_url, then you can also simply use zoo::read.zoo() and merge the resulting closing prices:
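Achim's snippet itself did not survive extraction; here is a sketch of the idea, with small in-line CSV fragments standing in for the downloaded files (the LNKD closing prices are made up for illustration; the IBM closes are taken from the Quandl output later in this post):

```r
library(zoo)

ibm_csv  <- "Date,Close\n2015-09-01,142.68\n2015-09-02,145.05"
lnkd_csv <- "Date,Close\n2015-09-01,184.00\n2015-09-02,188.00"  # made-up values

# With real download URLs, replace textConnection(...) with ibm_url / lnkd_url:
z_ibm  <- read.zoo(textConnection(ibm_csv),  header = TRUE, sep = ",", format = "%Y-%m-%d")
z_lnkd <- read.zoo(textConnection(lnkd_csv), header = TRUE, sep = ",", format = "%Y-%m-%d")

# Merge the closing prices into one multivariate zoo series:
z <- merge(IBM = z_ibm, LNKD = z_lnkd)
z
```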

And the ggplot2 figure can just be drawn with the autoplot() method for zoo series:

```
library("ggplot2")
autoplot(z, facets = NULL)
```

The resulting plot takes you most of the way to the ggplot produced in the post.

The final formatting can be accomplished with the additional ggplot commands used in my post. This is just delightful: three lines of code fetch the data and prepare it for plotting, and half a line produces a sophisticated default plot.

Of course, the manual step of hunting for urls is completely unnecessary. The get.hist.quote function in the tseries package will fetch a time series object for you that can also be plotted with the zoo autoplot function.

If you are really interested in working like a quant then the defaults in the quantmod package will give you the look and feel of a traders screen. The function:

```
getSymbols("IBM", src = "google")
```

will bring the "xts" / "zoo" time series object, IBM, containing historic IBM stock data directly into your workspace with no need even to make an assignment!

```
head(IBM)
           IBM.Open IBM.High IBM.Low IBM.Close IBM.Volume
2007-01-03    97.18    98.40   96.26     97.27    9199500
2007-01-04    97.25    98.79   96.88     98.31   10557200
2007-01-05    97.60    97.95   96.91     97.42    7222900
2007-01-08    98.50    99.50   98.35     98.90   10340100
2007-01-09    99.08   100.33   99.07    100.07   11108900
2007-01-10    98.50    99.05   97.93     98.89    8744900
```

Moreover, IBM is an "OHLC" object that, with the right plotting function like chartSeries from the quantmod package, will produce the kind of open-high-low-close charts favored by stock analysts for charting financial instruments. (There is even an R function to determine if you have an OHLC object.)

```
is.OHLC(IBM)
# [1] TRUE
chartSeries(IBM, type = "candle", subset = '2010-08-24::2015-09-02')
```

The getSymbols function will fetch data from the Yahoo, Google, FRED and Oanda financial services sites, and can also read from MySQL databases and from .csv and RData files.

Quandl, however, is probably the best place to go for free (and premium) financial data. Once you sign up for an account with Quandl, the following code will get a data frame with several columns of IBM stock information.

```
token <- "your_token_string"
Quandl.auth(token)  # Authenticate your token
ibmQ <- Quandl("WIKI/IBM", start_date = "2010-08-24", end_date = "2015-09-03")
head(ibmQ)
        Date   Open   High    Low  Close  Volume Ex-Dividend Split Ratio Adj. Open Adj. High Adj. Low Adj. Close Adj. Volume
1 2015-09-02 144.92 145.08 143.18 145.05 4243473           0           1    144.92    145.08   143.18     145.05     4243473
2 2015-09-01 144.84 144.98 141.85 142.68 5258877           0           1    144.84    144.98   141.85     142.68     5258877
3 2015-08-31 147.26 148.40 146.26 147.89 4093078           0           1    147.26    148.40   146.26     147.89     4093078
4 2015-08-28 147.75 148.20 147.18 147.98 4058832           0           1    147.75    148.20   147.18     147.98     4058832
5 2015-08-27 148.63 148.97 145.66 148.54 4762003           0           1    148.63    148.97   145.66     148.54     4762003
6 2015-08-26 144.09 146.98 142.14 146.70 6186742           0           1    144.09    146.98   142.14     146.70     6186742
```

Here, I have presented just some of the very basics, still not coming close to describing all that R offers for acquiring and manipulating financial time series information.

As a final note: a couple of years ago I posted a short tutorial for getting started with the Quandl R API. Unfortunately, since that time Quandl has changed its coding scheme, so my R code from that tutorial will not run without changes. The code in the file Quandl_code, however, produces the following plot of Asian currency exchange rates and may serve as an updated example.

Look here to decipher the Quandl currency codes.

by John Mount (more articles) and Nina Zumel (more articles)

When fitting and selecting models in a data science project, how do you know that your final model is good? And how sure are you that it's better than the models that you rejected? In this Part 2 of our four part mini-series "How do you know if your model is going to work?" we develop in-training set measures. Previously we worked on:

- Part 1: Defining the scoring problem

The most tempting procedure is to score your model on the data used to train it. The attraction is that this avoids the statistical inefficiency of denying some of your data to the training procedure.

A common way to assess score quality is to run your scoring function on the data used to build your model. We might try comparing several models scored by AUC or deviance (normalized to factor out sample size) on their own training data, as shown below. What we have done is take five popular machine learning techniques (random forest, logistic regression, gbm, GAM logistic regression, and elastic net logistic regression) and plot their performance in terms of AUC and normalized deviance on their own training data. For AUC larger numbers are better, and for deviance smaller numbers are better. Because we have evaluated multiple models we are starting to get a sense of scale. We should suspect an AUC of 0.7 on training data is good (though random forest achieved an AUC on training of almost 1.0), and we should be acutely aware that evaluating models on their own training data has an upward bias (the model has seen the training data, so it has a good chance of doing well on it; put another way, training data is not exchangeable with future data for the purpose of estimating model performance). There are two more Gedankenexperiment models that any data scientist should always have in mind:

- The null model (on the graph as "null model"). This is the performance of the best constant model (a model that returns the same answer for all datums). In this case it is a model that scores each and every row as having an identical 7% chance of churning. This is an important model that you want to do better than. It is also a model you are often competing against as a data scientist, as it is the "what if we treat everything in this group the same" option (often the business process you are trying to replace). The data scientist should always compare their work to the null model on deviance (null model AUC is trivially 0.5), and packages like logistic regression routinely report this statistic.
- The best single variable model (on the graph as "best single variable model"). This is the best model built using only one variable or column (in this case using a GAM logistic regression as the modeling method). This is another model the data scientist wants to outperform, as it represents the "maybe one of the columns is already the answer" case (if so, that would be very *good* for the business, as they could get good predictions without modeling infrastructure). The data scientist should definitely compare their model to the best single variable model. Until you significantly outperform the best single variable model, you have not outperformed what an analyst can find with a single pivot table.
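As a sketch of the null-model comparison on deviance (hypothetical simulated data, roughly matching the 7% churn rate mentioned above; not the authors' code):

```r
set.seed(2)
d <- data.frame(x = rnorm(2000))
d$y <- rbinom(2000, 1, plogis(-2.7 + 0.5 * d$x))  # roughly a 7% positive rate

fit <- glm(y ~ x, data = d, family = binomial)
fit$null.deviance  # deviance of the best constant model
fit$deviance       # deviance of the fitted model; never higher than the null
```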

At this point it would be tempting to pick the random forest model as the winner as it performed best on the training data. There are at least two things wrong with this idea:

To read more: win-vector blog: How do you know if your model is going to work? Part 2: In-training set measures

*by John Mount (more articles) and Nina Zumel (more articles) of Win-Vector LLC*

"Essentially, all models are wrong, but some are useful." George Box

Here's a caricature of a data science project: your company or client needs information (usually to make a decision). Your job is to build a model to predict that information. You fit a model, perhaps several, to available data and evaluate them to find the best. Then you cross your fingers that your chosen model doesn't crash and burn in the real world. We've discussed detecting if your data has a signal. Now: how do you know that your model is good? And how sure are you that it's better than the models that you rejected?

*Notice the Sun in the 4th revolution about the earth. A very pretty, but not entirely reliable, model.*

In this latest "Statistics as it should be" series, we will systematically look at what to worry about and what to check. This is standard material, but presented in a "data science" oriented manner: we are going to consider scoring system utility in terms of service to a *negotiable* business goal (one of the many ways data science differs from pure machine learning). To organize the ideas into digestible chunks, we are presenting this article as a four part series. This part (Part 1) sets up the specific problem.

Win-Vector blog: How do you know if your model is going to work? Part 1: The problem

by Andrie de Vries

Just more than a year ago I cobbled together some code to work with the (then) new version of Google Sheets. You can still find my musings and code at the blog post Reading data from the new version of Google Spreadsheets.

Since then, Jennifer Bryan (@JennyBryan) published a wonderful package that does far more than I ever tried, in a much more elegant way.

This package just **works**!

I read the first few lines of the excellent vignette, pasted the code into my R session, and had a listing of my Google sheets within seconds:

```
library(googlesheets)
library(dplyr)

(my_sheets <- gs_ls())
# (expect a prompt to authenticate with Google interactively HERE)
my_sheets %>% glimpse()
```

The package neatly handles handshaking with Google to get access to your authorisation token. It automatically opens a web browser where you log in to your Google account. Google then presents you with a token that you simply paste into your R session command line, and from there everything just works.

```
(my_sheets <- gs_ls())
Source: local data frame [6 x 10]

               sheet_title        author perm version             updated
1   RRO with MKL benchmark        andrie    r     new 2015-06-03 15:52:21
2 FMH: UAT Test plan (v1.…    des.holmes   rw     new 2015-02-10 00:07:26
3  Hospital Dat 2013_08_22 markratnaraj…   rw     new 2013-10-13 20:57:55
4        Strategy Time TBG   clementdata   rw     new 2013-04-23 19:43:21
5 hspot-rfp-alpha-scoresh… benoit.chovet   rw     new 2013-02-22 16:20:37
6 2012-Jan Mobile interne…    simbamangu    r     new 2013-01-07 06:03:56
Variables not shown: ws_feed (chr), alternate (chr), self (chr), alt_key (chr)
```

Since the googlesheets package is on CRAN, you can install by using:

```
install.packages("googlesheets")
```

I highly recommend reading the following excellent resources:

- The package vignette at https://cran.r-project.org/web/packages/googlesheets/vignettes/basic-usage.html
- The README at the project's github page https://github.com/jennybc/googlesheets
- Jenny's presentation at UseR!2015, available at https://speakerdeck.com/jennybc/googlesheets-talk-at-user2015