R has some good tools for importing data from spreadsheets, among them the readxl package for Excel and the googlesheets package for Google Sheets. But these only work well when the data in the spreadsheet are arranged as a rectangular table, not overly encumbered with formatting, and not generated with formulas. As Jenny Bryan pointed out in her recent talk at the useR!2016 conference (embedded below, or download the PDF slides here), in practice few spreadsheets have "a clean little rectangle of data in the upper-left corner", because most people use spreadsheets not just as a file format for data retrieval, but also as a reporting/visualization/analysis tool.

Nonetheless, for a practicing data scientist, there's a lot of useful data locked up in these messy spreadsheets that needs to be imported into R before we can begin analysis. As just one example given by Jenny in her talk, this spreadsheet was included as one of 15,000 spreadsheet attachments (one with 175 tabs!) in the Enron Corpus.

To make it easier to import data into R from messy spreadsheets like this, Jenny and co-author Richard G. FitzJohn created the jailbreakr package. The package is in its early stages, but it can already import Excel (xlsx format) and Google Sheets into R as new "linen" objects from which small sub-tables can easily be extracted as data frames. It can also print spreadsheets in a condensed text-based format with one character per cell — useful if you're trying to figure out why an apparently simple spreadsheet isn't importing as you expect. (Check out the "weekend getaway winner" story near the end of Jenny's talk for a great example.)

The jailbreakr package isn't yet on CRAN, but if you want to try it out you can download it from the Github repository (or even contribute!) at the link below.

Github (rsheets): jailbreakr

by Joseph Rickert

Data Science is all about getting access to interesting data, and it is really nice when some kind soul not only points out an interesting data set but also makes it easy for you to access it. Below is a list of 17 R packages that appeared on CRAN between May 1st and August 8th that, in one way or another, provide access to publicly available data.

bigQueryR: Provides an interface to Google's BigQuery. The vignette shows how to use it.

blscrapeR: Provides an API wrapper for Bureau of Labor Statistics data sets. There is a vignette showing how to access inflation and price data, one for accessing Wages and Benefits data, and one for mapping BLS data.

cdlTools: Provides functions to download USDA National Agricultural Statistics Service (NASS) cropscape data for a specified state.

dataone: The dataone R package enables R scripts to search, download and upload science data and metadata from/to the DataONE Federation. The website describes DataOne as "a community driven project providing access to data across multiple member repositories, supporting enhanced search and discovery of Earth and environmental data". The package comes with several vignettes including this overview.

dataRetrieval: Package to retrieve USGS and EPA hydrologic and water quality data, officially supported by USGS. The vignette gives several examples of downloading interesting data sets.

eechidna: Provides the data from the 2013 Australian Federal Election and tools to analyze it. There are several nicely done vignettes. The following plot which shows election results by polling place comes from the vignette on plotting polling stations.

There are also vignettes on census and election data, shapefiles and mapping Australia's Electorates.

getHFdata: Provides functions to download and aggregate high frequency trading data for Brazilian instruments directly from the Bovespa ftp site. There is a vignette to get you started.

googleAnalyticsR: Provides an interface to the Google Analytics Reporting API. There is a vignette.

googleway: Provides functions to retrieve data from 6 Google Maps APIs. The vignette shows how.

gutenberg: Search and download public domain works in the Project Gutenberg collection. The vignette shows you how to search and download public domain texts.

ie2miscdata: Contains a collection of USGS environmental and water resources data sets. There is a vignette showing how to create plots from the data. (See also: dataRetrieval.)

macleish: Provides functions to access data from the Ada & Archibald MacLeish field station in Whately, MA. The vignette shows how to obtain weather data.

muckrock: Contains public domain information on requests made through MuckRock under the US Freedom of Information Act.

nasadata: Provides an interface to NASA's Earth Imagery and Assets API and Earth Observatory and Natural Event Tracker.

oec: Provides an interface to the Observatory of Economic Complexity.

osi: Provides a connector to the Open Source Initiative API that provides machine-readable data about open source software licenses.

pewdata: Provides for reproducible, programmatic retrieval of survey data sets from the Pew Research Center. The vignette shows how to set up and use the package. Look here for an interesting poll about what Americans know about science.

TCGAretriever: Provides an interface to data sets from The Cancer Genome Atlas (TCGA) via the Cancer Genomic Data Server web service.

For more packages that provide APIs to data sets have a look at the CRAN Task View on Web Technologies and Services. For a list of interesting data sets out there in the wild see the MRAN Data Sources page.

[**Update**: added the dataRetrieval package, at the suggestion of Laura DeCicco.]

*Editor's note: This is Joe's last post to Revolutions as a member of the Microsoft team: he is heading on for further adventures in the world of R. We want to thank Joe for his many contributions to the blog over the past 6 years, and please join us in wishing him well!*

Hadley Wickham's dplyr package is an amazing tool for restructuring, filtering, and aggregating data sets using its elegant grammar of data manipulation. By default, it works on in-memory data frames, which means you're limited to the amount of data you can fit into R's memory. Hadley also provided an extension mechanism to make dplyr work with external data sources, and so Hong Ooi created the dplyrXdf package to work with Xdf data files. With dplyrXdf you can manipulate data files of virtually unlimited size using R, and even use the pipe operator %>% from the magrittr package.
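
For readers new to dplyr, here is the flavor of the grammar that dplyrXdf carries over to Xdf files, run on the built-in mtcars data frame (nothing below is specific to MRS or Xdf):

```r
library(dplyr)

# Average mpg and gross horsepower by cylinder count,
# restricted to cars weighing over 2000 lbs
result <- mtcars %>%
  filter(wt > 2) %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg), mean_hp = mean(hp)) %>%
  arrange(desc(cyl))
result
```

With dplyrXdf, the same pipeline of verbs can be pointed at an on-disk Xdf data source instead of an in-memory data frame.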

To use the dplyrXdf package, you will need to use Microsoft R Client (free download for Windows) or Microsoft R Server (on Windows, Linux, Hadoop or HDInsight with Spark). The Xdf files you create can then be used with the big-data functions of the included ScaleR package, enabling you to use R to perform statistical analysis of files hundreds of gigabytes in size.

To help you get started with the dplyrXdf package, Hong has created a new dplyrXdf cheat sheet (pdf). This handy and printable 2-page document explains how dplyrXdf:

- Extends dplyr framework to large, on-disk data sets
- Simplifies current interface to xdf functionality
- Handles the task of file management for the user
- Is transparent to other xdf-aware functions

It also includes some extended examples of working with big data with dplyrXdf and analyzing them with the ScaleR package. To download the cheat-sheet, click on the link below.

Microsoft Advanced Analytics: dplyrXdf cheat sheet

by Joseph Rickert

My guess is that a good many statistics students first encounter the bivariate Normal distribution as one or two hastily covered pages in an introductory textbook, and then don't think much about it again until someone asks them to generate two random variables with a given correlation structure. Fortunately for R users, a little searching on the internet will turn up several nice tutorials with R code explaining various aspects of the bivariate Normal. For this post, I have gathered together a few examples and tweaked the code a little to make comparisons easier.

Here are five different ways to simulate random samples from a bivariate Normal distribution with a given mean and covariance matrix.

To set up for the simulations, this first block of code defines N, the number of random samples to simulate, the means of the random variables, and the covariance matrix. It also provides a small function for drawing confidence ellipses on the simulated data.

```r
library(mixtools)   # for the ellipse() function

N <- 200            # Number of random samples
set.seed(123)

# Target parameters for univariate normal distributions
rho <- -0.6
mu1 <- 1; s1 <- 2
mu2 <- 1; s2 <- 8

# Parameters for bivariate normal distribution
mu <- c(mu1, mu2)                                         # Mean
sigma <- matrix(c(s1^2, s1*s2*rho, s1*s2*rho, s2^2), 2)   # Covariance matrix

# Function to draw ellipse for bivariate normal data
ellipse_bvn <- function(bvn, alpha){
  Xbar <- apply(bvn, 2, mean)
  S <- cov(bvn)
  ellipse(Xbar, S, alpha = alpha, col = "red")
}
```

The first method, the way to go if you just want to get on with it, is to use the mvrnorm() function from the MASS package.

```r
library(MASS)

bvn1 <- mvrnorm(N, mu = mu, Sigma = sigma)   # from MASS package
colnames(bvn1) <- c("bvn1_X1", "bvn1_X2")
```

It takes so little code to do the simulation that you could almost tweet it in a homework assignment.

A look at the source code for mvrnorm() shows that it uses eigenvectors to generate the random samples. The documentation for the function states that this method was selected because it is more stable than the alternative of using a Cholesky decomposition, which might be faster.
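
To see the principle behind the eigendecomposition approach (a sketch of the idea, not the actual MASS source), note that if sigma = E L E', then E L^(1/2) Z has covariance sigma when Z is standard Normal:

```r
# Sketch: multivariate normal samples via the eigendecomposition of the
# covariance matrix (the principle used by MASS::mvrnorm)
rmvn_eigen <- function(n, mu, sigma) {
  p  <- length(mu)
  eS <- eigen(sigma, symmetric = TRUE)
  A  <- eS$vectors %*% diag(sqrt(pmax(eS$values, 0)), p)  # A %*% t(A) == sigma
  Z  <- matrix(rnorm(n * p), nrow = p)                    # p x n standard normals
  t(A %*% Z + mu)                                         # n x p samples
}

set.seed(42)
samp <- rmvn_eigen(10000, mu = c(1, 1),
                   sigma = matrix(c(4, -9.6, -9.6, 64), 2))
round(cov(samp), 1)   # should be close to the target covariance matrix
```

The pmax() guard against tiny negative eigenvalues is why this route tolerates nearly singular covariance matrices better than a plain Cholesky factorization.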

For the second method, let's directly generate bivariate Normal random variates with the Cholesky decomposition. Remember that the Cholesky decomposition of sigma (a positive definite matrix) yields a matrix M such that M times its transpose gives sigma back again. Multiplying M by a matrix of standard Normal random variates and adding the desired mean gives a matrix of the desired random samples. A lecture from Colin Rundel covers some of the theory.

```r
M <- t(chol(sigma))
# M %*% t(M)   # recovers sigma
Z <- matrix(rnorm(2*N), 2, N)   # 2 rows, N columns
bvn2 <- t(M %*% Z) + matrix(rep(mu, N), byrow = TRUE, ncol = 2)
colnames(bvn2) <- c("bvn2_X1", "bvn2_X2")
```

For the third method we make use of a special property of the bivariate Normal that is discussed in almost all of those elementary textbooks. If X_{1} and X_{2} are two jointly distributed random variables, then the conditional distribution of X_{2} given X_{1} is itself Normal with mean = m_{2} + r(s_{2}/s_{1})(X_{1} - m_{1}) and variance = (1 - r^{2})s_{2}^{2}.

Hence, a sample from a bivariate Normal distribution can be simulated by first simulating a point from the marginal distribution of one of the random variables and then simulating from the second random variable conditioned on the first. A brief proof of the underlying theorem is available here.

```r
rbvn <- function(n, mu1, s1, mu2, s2, rho){
  X1 <- rnorm(n, mu1, s1)
  X2 <- rnorm(n, mu2 + (s2/s1) * rho * (X1 - mu1),
              sqrt((1 - rho^2) * s2^2))
  cbind(X1, X2)
}

bvn3 <- rbvn(N, mu1, s1, mu2, s2, rho)
colnames(bvn3) <- c("bvn3_X1", "bvn3_X2")
```

The fourth method, my favorite, comes from Professor Darren Wilkinson's Gibbs sampler tutorial. It is a very nice idea: using the familiar bivariate Normal distribution to illustrate the basics of the Gibbs sampling algorithm. Note that this looks very much like the previous method, except that now we are alternately sampling from the full conditional distributions.

```r
gibbs <- function(n, mu1, s1, mu2, s2, rho){
  mat <- matrix(ncol = 2, nrow = n)
  x <- 0
  y <- 0
  mat[1, ] <- c(x, y)
  for (i in 2:n) {
    x <- rnorm(1, mu1 + (s1/s2) * rho * (y - mu2), sqrt((1 - rho^2) * s1^2))
    y <- rnorm(1, mu2 + (s2/s1) * rho * (x - mu1), sqrt((1 - rho^2) * s2^2))
    mat[i, ] <- c(x, y)
  }
  mat
}

bvn4 <- gibbs(N, mu1, s1, mu2, s2, rho)
colnames(bvn4) <- c("bvn4_X1", "bvn4_X2")
```

The fifth and final way uses the rmvnorm() function from the mvtnorm package with the singular value decomposition method selected. The functions in this package are overkill for what we are doing here, but mvtnorm is probably the package you would want to use if you are calculating probabilities from high-dimensional multivariate distributions. It implements numerical methods for carefully calculating the high-dimensional integrals involved that are based on papers by Professor Alan Genz dating from the early '90s. These methods are briefly explained in the package vignette.

```r
library(mvtnorm)

bvn5 <- mvtnorm::rmvnorm(N, mu, sigma, method = "svd")
colnames(bvn5) <- c("bvn5_X1", "bvn5_X2")
```

Note that I have used the :: operator here to make sure that R uses the rmvnorm() function from the mvtnorm package. There is also an rmvnorm() function in the mixtools package that I used to get the ellipse function. Loading the packages in the wrong order could lead to the rookie mistake of having the function you want inadvertently masked.

Next, we plot the results of drawing just 100 random samples for each method. This allows us to see how the algorithms spread data over the sample space as they are just getting started.

```r
bvn <- list(bvn1, bvn2, bvn3, bvn4, bvn5)

par(mfrow = c(3, 2))
plot(bvn1, xlab = "X1", ylab = "X2", main = "All Samples")
for(i in 2:5){
  points(bvn[[i]], col = i)
}
for(i in 1:5){
  item <- paste("bvn", i, sep = "")
  plot(bvn[[i]], xlab = "X1", ylab = "X2", main = item, col = i)
  ellipse_bvn(bvn[[i]], .5)
  ellipse_bvn(bvn[[i]], .05)
}
par(mfrow = c(1, 1))
```

The first plot shows all 500 random samples, color coded by the method with which they were generated. The remaining plots show the samples generated by each method. In each of these plots the ellipses mark the 0.5 and 0.95 probability regions, i.e. the areas that should contain 50% and 95% of the points respectively. Note that bvn4, which uses the Gibbs sampling algorithm, looks like all of the rest. In most use cases for the Gibbs sampler it takes the algorithm some time to converge to the target distribution. In our case, we start out with a pretty good guess.

Finally, a word about accuracy: nice coverage of the sample space is not sufficient to produce accurate results. A little experimentation will show that, for all of the methods outlined above, regularly achieving a sample covariance matrix that is close to the target, sigma, requires something on the order of 10,000 samples, as is illustrated below.

```
> sigma
     [,1] [,2]
[1,]  4.0 -9.6
[2,] -9.6 64.0

> for(i in 1:5){
+   print(round(cov(bvn[[i]]), 1))
+ }
        bvn1_X1 bvn1_X2
bvn1_X1     4.0    -9.5
bvn1_X2    -9.5    63.8
        bvn2_X1 bvn2_X2
bvn2_X1     3.9    -9.5
bvn2_X2    -9.5    64.5
        bvn3_X1 bvn3_X2
bvn3_X1     4.1    -9.8
bvn3_X2    -9.8    63.7
        bvn4_X1 bvn4_X2
bvn4_X1     4.0    -9.7
bvn4_X2    -9.7    64.6
        bvn5_X1 bvn5_X2
bvn5_X1     4.0    -9.6
bvn5_X2    -9.6    65.3
```
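
A quick, self-contained way to see this convergence, reusing the Cholesky construction from the second method, is to track the largest deviation of the sample covariance from the target sigma as the sample size grows:

```r
set.seed(123)
mu    <- c(1, 1)
sigma <- matrix(c(4, -9.6, -9.6, 64), 2)
M     <- t(chol(sigma))

# Largest absolute entry-wise error of the sample covariance matrix
max_cov_err <- function(N) {
  Z   <- matrix(rnorm(2 * N), 2, N)
  bvn <- t(M %*% Z + mu)
  max(abs(cov(bvn) - sigma))
}

errs <- sapply(c(100, 1000, 10000), max_cov_err)
errs   # errors shrink roughly like 1/sqrt(N)
```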

Many people coming to R for the first time find it disconcerting to realize that there are several ways to do some fundamental calculation in R. My take is that rather than being a point of frustration, having multiple options indicates the richness of the R language. A close look at the package documentation will often show that yet another method to do something is a response to some subtle need that was not previously addressed. Enjoy the diversity!

by Joseph Rickert

My impression is that the JSM has become ever more R friendly over recent years. With two sessions organized around R tools and several talks featuring R packages, this year may turn out to be the beginning of a new era, one in which conference organizers see value in putting R on the agenda and prospective speakers perceive it to be advantageous to mention R, an R package or a Shiny app in their abstracts.

As should be expected, the vast majority of the presentations will focus on statistics or the application of statistical methods, and not on the underlying computational platform. Nevertheless, based on past experience I would be very surprised if there is not quite a bit more R talk buzzing around the conference.

If you are going to Chicago please stop by the Microsoft booth (232). We would be happy to tell you how we are using R at Microsoft and even more interested in hearing your opinion about what Microsoft should be doing with R. Also look for us at the opening night mixer (Sunday 6 - 8PM in the Expo Hall) and the Student Mixer (Monday 6 - 7:30PM in the Chicago Hilton Hotel).

Here follows my **R Users Guide to JSM 2016**. I have organized the talks by session number and included information on times and room numbers.

Session 21: Statistical Computing and Graphics Student Awards – Contributed papers

Sun, 7/31/2016, 2:00 PM - 3:50 PM – Room: CC-W175b

2:25 PM | The PICASSO Package for High-Dimensional Nonconvex Sparse Learning in R — Xingguo Li; Tuo Zhao, The Johns Hopkins University; Tong Zhang, Rutgers University; Han Liu, Princeton

3:05 PM | Using the Geomnet Package: Visualizing African Slave Trade, 1514-1866 — Samantha Tyner, Iowa State University

3:25 PM | Xgboost: An R Package for Fast and Accurate Gradient Boosting — Tong He, Simon Fraser University

Session 47: Making the Most of R Tools – Invited papers

Sun, 7/31/2016, 4:00 PM - 5:50 PM – Room: CC-W183b

4:05 PM | Thinking with Data Using R and RStudio: Powerful Idioms for Analysts — Nicholas Jon Horton, Amherst College; Randall Pruim, Calvin College; Daniel Kaplan, Macalester College

4:35 PM | Transform Your Workflow and Deliverables with Shiny and R Markdown — Garrett Grolemund, RStudio

Session 127: R Tools for Statistical Computing – Contributed papers

Mon, 8/1/2016, 8:30 AM - 10:20 AM – Room: CC-W196c

8:35 AM | The Biglasso Package: Extending Lasso Model Fitting to Big Data in R — Yaohui Zeng, University of Iowa; Patrick Breheny, University of Iowa

8:50 AM | Independent Sampling for a Spatial Model with Incomplete Data — Harsimran Somal, University of Iowa; Mary Kathryn Cowles, University of Iowa

9:05 AM | Introduction to the TextmineR Package for R — Thomas Jones, Impact Research

9:20 AM | Vector-Generalized Time Series Models — Victor Miranda Soberanis, University of Auckland; Thomas Yee, University of Auckland

9:35 AM | New Computational Approaches to Large/Complex Mixed Effects Models — Norman Matloff, University of California at Davis

9:50 AM | Broom: An R Package for Converting Statistical Modeling Objects Into Tidy Data Frames — David G. Robinson, Stack Overflow

10:05 AM | Exact Parametric and Nonparametric Likelihood-Ratio Tests for Two-Sample Comparisons — Yang Zhao, SUNY Buffalo; Albert Vexler, SUNY Buffalo; Alan Hutson, SUNY Buffalo; Xiwei Chen, SUNY Buffalo

Session 247: Better Communication with Statistical Graphics – Contributed papers

Mon, 8/1/2016, 2:00 PM - 3:50 PM – Room: CC-W184bc

2:35 PM | The Linked Microposter Plot as a New Means for the Visualization of Eye-Tracking Data — Chunyang Li, Utah State University; Juergen Symanzik, Utah State University

2:50 PM | Optimizing Diffusion Cartograms for Areal Data Using a New Evaluation Method — Xiaoyue Cheng, University of Nebraska - Omaha

3:35 PM | Interactive Graphics for Functional Data Analyses — Julia Wrobel, Columbia University; Jeff Goldsmith, Columbia Mailman School of Public Health

Session 349: Applications of Regression Trees on Sample Data – Contributed papers

Tue, 8/2/2016, 10:30 AM - 12:20 PM – Room: W184a

10:55 AM | Modeling Survey Data with Regression Trees — Daniell Toth, Bureau of Labor Statistics

Session 530: Applications in Drug Development – Contributed papers

Wed, 8/3/2016, 10:30 AM - 12:20 PM – Room: CC-W187a

11:15 AM | Facilitating Clinical Trial Simulation in Alzheimer's Disease Using the CAMD IPD, Literature Summary Level Data, and the 'adsim' R Package — Daniel Polhamus, Metrum Research Group

Session 353: Statistical Learning and Data Science – Contributed speed presentations

Tue, 8/2/2016, 10:30 AM - 12:20 PM – Room: CC-W181a

10:55 AM | An R Package Enabling Likelihood-Based Inference for Generalized Linear Mixed Models — Christina Knudson

Session 354: Business, Finance and Economic Statistics – Contributed speed presentations

Tue, 8/2/2016, 10:30 AM - 12:20 PM – Room: CC-W181b

12:00 PM | Optimal Stratification of Univariate Populations via the StratifyR Package — Karuna Garan Reddy, University of the South Pacific; Mohammed G. M. Khan, University of the South Pacific

**Posters**

Session 88: The Extraordinary Power of Data – Invited Poster Presentations

Sun, 7/31/2016, 6:00 PM - 8:00 PM – Room CC-Hall F1 West

1: Communicate Better with R, R Markdown, and Shiny — Garrett Grolemund, RStudio

Session 203: Environmental Statistics – Contributed Poster Presentations

Mon, 8/1/2016, 11:35 AM - 12:20 PM – Room: CC-Hall F1 West

6: Using the R Caret Package as a Teaching Tool for Topics in Classification and Prediction Methods: A Case Study — Keith Williams, University of Arkansas for Medical Sciences

Session 376: Posters on Statistics in Genomics and Genetics

Tue, 8/2/2016, 10:30 AM - 12:20 PM – Room: CC Hall F1 West

51: BANFF: An R Package for BAyesian Network Feature Finder — Zhou Lan, North Carolina State University ; Yize Zhao, Statistical and Applied Mathematical Sciences Institute ; Jian Kang, University of Michigan ; Tianwei Yu, Emory University

Session 449: Posters on Statistical Learning and Data Science

Tue, 8/2/2016, 2:00 PM - 2:45 PM – Room: CC-Hall F1 West

15: An R Package Enabling Likelihood-Based Inference for Generalized Linear Mixed Models — Christina Knudson

Session 453: Posters on Business and Economic Statistics

Tue, 8/2/2016, 3:05 PM - 3:50 PM – Room: CC Hall F1 West

27: Optimal Stratification of Univariate Populations via StratifyR Package — Karuna Garan Reddy, University of the South Pacific ; Mohammed G. M. Khan, University of the South Pacific

Session 556: Posters on Statistical Computing

Wed, 8/3/2016, 10:30 AM - 12:20 PM – Room CC-Hall F1 West

31: Lucid: An R Package for Pretty Printing Floating Point Numbers — Kevin Wright, DuPont Pioneer

by Joseph Rickert

New R packages keep rolling into CRAN at a prodigious rate: 184 in May, 195 in June, and July looks like it will continue the trend. I spent some time sorting through them and have picked out a few that are interesting from a data science point of view.

ANLP provides functions for building text prediction models. It contains functions for cleaning text data, building N-grams and more. The vignette works through an example.

fakeR generates fake data based on a given data set. Factors are sampled from contingency tables and numerical data is sampled from a multivariate normal distribution. The method works reasonably well for small data sets, producing fake data with the same correlation structure as the original data sets. The following example uses the simulate_dataset() function on a subset of the mtcars data set. The correlation plots look pretty much identical.

```r
library(fakeR)
library(corrplot)

data(mtcars)
df1 <- mtcars[, c(1:4, 6)]
set.seed(77)
df2 <- as.data.frame(simulate_dataset(df1))

par(mfrow = c(1, 2))
corrplot(cor(df1), method = "ellipse")
corrplot(cor(df2), method = "ellipse")
```

Note, however, that the simulated data might need some processing before being used in a model. The last row in the df2 simulated data set contains a negative value for displacement. fakeR contains functions for both time dependent and time independent variables. There is a vignette.
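
If a simulated column such as displacement must be non-negative, a one-line base R fix is enough; the df2 below is a toy stand-in for the simulated data frame, not fakeR output:

```r
# Toy stand-in for a simulated data frame with an impossible negative value
df2 <- data.frame(disp = c(160.0, 108.0, -12.3))

# Clamp negative displacements to zero before modeling
df2$disp <- pmax(df2$disp, 0)
df2
```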

heatmaply produces interactive heatmaps. More than just being attractive graphics these increase the already considerable value of the heatmap as an exploratory analysis tool. The following code from the package vignette shows the correlation structure of variables in the mtcars data set.

```r
library(heatmaply)

heatmaply(cor(mtcars),
          k_col = 2, k_row = 2,
          limits = c(-1, 1)) %>%
  layout(margin = list(l = 40, b = 40))
```

mscstexta4r provides an R client for the Microsoft Cognitive Services Text Analytics REST API, a suite of text analytics web services built with Azure Machine Learning that can be used to analyze unstructured text. The vignette explains the text processing capabilities as well as how to get going with it.

mvtboost extends the GBM model to fit boosted decision trees to multivariate continuous variables. The algorithm jointly fits models with multiple outcomes over a common set of predictors. According to the authors this makes it possible to: (1) choose the number of trees and shrinkage to jointly minimize prediction error in a test set over all outcomes, (2) compare tree models across outcomes, and (3) estimate the "covariance explained" by predictors in pairs of outcomes. For this last point, the authors explain:

The covariance explained matrix can then be organized in a p x q(q+1)/2 table, where q is the number of outcomes and p is the number of predictors. Each element is the covariance explained by a predictor for any pair of outcomes. When the outcomes are standardized to unit variance, each element can be interpreted as the correlation explained in any pair of outcomes by a predictor. Like the R^2 of the linear model, this decomposition is unambiguous only if the predictors are independent. Below we show the original covariance explained matrix.

The theory is explained in the paper "Finding Structure in Data". There are two vignettes: the mpg example and the Well-being example.

In a classification problem, label noise refers to the incorrect labeling of the training instances. The NoiseFiltersR package presents an extensive collection of label noise filtering algorithms along with impressive documentation and references for each algorithm. For example, the documentation for the Generalized Edition (GE), a similarity-based filter, is organized like this:

The package and its vignette comprise a valuable resource for machine learning.

preprosim describes itself as a "lightweight data quality simulation for classification". It contains functions to add noise, missing values, outliers, irrelevant features and other transformations that are useful in evaluating classification accuracy. The vignette provides a small example of how missing values and noise can affect accuracy.

polmineR provides text mining tools for the analysis of large corpora using the IMS Open Corpus Workbench (CWB) as the backend. There is a vignette to get you started.

RFinfer provides functions that use the infinitesimal jackknife to generate predictions and prediction variances from random forest models. There are Introduction and Jackknife vignettes.

Finally, there are two packages I recommend installing just for the sake of sanity.

gaussfacts provides random "facts" about Carl Friedrich Gauss. One or two of these and you should feel fine again.

rmsfact invokes random quotes from Richard M. Stallman. To be used when gaussfacts doesn't quite do it for you.

by Konstantin Golyaev, Data Scientist at Microsoft

I recently participated in an internal one-day Microsoft R Server (MRS) hackathon. For an experienced base R user but a complete MRS novice, this turned out to be an interesting challenge. R has a fantastic and unparalleled set of tools for exploratory data analysis, as long as your data set is small enough to fit comfortably in memory. In particular, the dplyr package offers concise and expressive data manipulation semantics, and is wildly popular among R users.

Unlike base R, MRS is able to handle large datasets that do not fit in memory. This is achieved by splitting the data into chunks that are small enough to fit in RAM, applying operations sequentially to every chunk, and storing the results, as well as the data, on disk using the binary xdf (eXtended Data Frame) format. The downside to this flexibility is the need to use MRS-specific functions from the RevoScaleR package, something that would have definitely slowed me down. Thankfully, the dplyrXdf package bridges the syntax gap between base R and MRS, and enables out-of-memory operations on large datasets with the same dplyr syntax. In what follows, I will describe the problem I chose to address and share the code used to obtain the solution.

Our data science team at Microsoft is split across locations in Redmond, the Bay Area, and Boston. For this reason, I decided to investigate whether planes departing from the corresponding major airports tend to arrive on time. To do this, I used the On-Time Performance Dataset from the Research and Innovative Technology Administration of the Bureau of Transportation Statistics. This dataset spans 26 years, 1987 through 2012, and is fairly large: over 148 million records, or 14 GB of raw information. The combined dataset can be obtained from the Revolution Analytics website. While it is definitely possible to work with such a dataset on a single machine using base R, unless one has access to a really powerful machine (e.g. an Azure VM Standard_DS14), it would have taken a while, and we only had several hours for the hackathon.

Instead, I used 2012 data for prototyping the code, and then leveraged the out-of-memory capabilities of MRS to execute the same code on the entire dataset. I tend to be a heavy dplyr user, which makes prototyping with small data a breeze. Unfortunately, dplyr did not work with MRS xdf datasets back then (and it still does not today), which would mean that I would have had to rewrite most of my code in a way that MRS would understand. Thankfully, the dplyrXdf package made it entirely unnecessary; I relied on it as a translation mechanism that relayed my dplyr-style logic to MRS.

The data analysis itself was quite straightforward, given the time constraints of the hackathon. I provisioned a Windows Data Science Virtual Machine on Azure, which comes with a trial version of MRS. I focused on the four major airports that our team uses most frequently: Seattle (SEA), Boston (BOS), San Francisco (SFO), and San Jose (SJC). I grouped the last two into a single entity called SVC (for ‘Silicon Valley Campus’). I eliminated a number of records with missing data, as well as the few records where arrival time was not within the two-hour window of scheduled time. I grouped the flights by departure day of week, as well as departure time, which I bucketed into four six-hour long groups: from midnight to 6:00 AM, from 6:01 AM to noon, and so on. The code for the analysis can be obtained from my GitHub repository.
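
The airport-grouping and time-bucketing steps can be sketched with ordinary dplyr code; the toy data frame and column names (origin, dep_hour, arr_delay) below are hypothetical stand-ins for the actual dataset's fields:

```r
library(dplyr)

# Toy stand-in for the on-time performance data; the real column names differ
flights <- data.frame(
  origin    = c("SEA", "BOS", "SFO", "SJC", "SEA", "BOS"),
  dep_hour  = c(1, 7, 13, 19, 23, 5),
  arr_delay = c(45, -10, 5, 20, 90, 60)
)

buckets <- flights %>%
  mutate(
    # Group the two Bay Area airports into a single SVC entity
    origin   = ifelse(origin %in% c("SFO", "SJC"), "SVC", origin),
    # Bucket departure times into four six-hour slots
    dep_slot = cut(dep_hour, breaks = c(0, 6, 12, 18, 24),
                   labels = c("00-06", "06-12", "12-18", "18-24"),
                   include.lowest = TRUE)
  ) %>%
  group_by(origin, dep_slot) %>%
  summarise(late_share = mean(arr_delay > 0), .groups = "drop")
buckets
```

With dplyrXdf, the same mutate/group_by/summarise chain can run against the full on-disk xdf file instead of an in-memory prototype.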

Once the data was properly partitioned, I computed frequencies of untimely arrivals within each cell, and plotted the results on a grid. In total, I obtained 84 plots – 3 locations, 7 days a week, 4 departure time groups. The vertical red line on each plot indicates the time when the flights were supposed to arrive. This means any probability mass to the right of the red line is bad news, while mass to the left of the red line is good news.

The first thing to notice from the SEA and SVC plots is that, unless you travel on a plane that departs between midnight and 6:00 AM, you are not very likely to arrive late.

**Frequency of late arrivals for Seattle (SEA)**

It appears that “red-eye” flights from the West Coast are more likely to arrive later than scheduled, and being late by as much as 90 minutes is not out of the question for such flights.

**Frequency of late arrivals for San Francisco (SFO) and San Jose (SJC)**

For flights originating from BOS, however, the story is rather different. Namely, the flights leaving after 6:00 PM are also reasonably likely to be late, in addition to flights leaving between midnight and 6:00 AM. In fact, the “red-eyes” from BOS have a weird bi-modal distribution of relative arrival time: they are either 30 minutes too early or 90 minutes late. This is something to keep in mind if you travel from BOS and have a connecting flight on the way to your final destination.

**Frequency of late arrivals for Boston (BOS)**

In conclusion, a combination of Windows Data Science VM with MRS together with dplyrXdf turned out to be a powerhouse, for three reasons. First, it was trivial to get all the tools up and running, by provisioning the VM and installing dplyrXdf from GitHub. Second, the MRS component allowed me to scale operations to a pretty large dataset without having to worry about the implementation details. Third, the dplyrXdf component made me very productive because it eliminated the need to learn MRS-specific commands and let me use my dplyr skills.

by Joseph Rickert

Just about two and a half years ago I wrote about some resources for doing Bayesian statistics in R. Motivated by the tutorial Modern Bayesian Tools for Time Series Analysis by Harte and Weylandt, which I attended at R/Finance last month, and by the upcoming tutorial An Introduction to Bayesian Inference using R Interfaces to Stan that Ben Goodrich is going to give at useR!, I thought I'd look into what's new. Well, Stan is what's new! Yes, Stan has been under development and available for some time. But somehow, while I wasn't paying close attention, two things happened: (1) the rstan package evolved to make the mechanics of doing Bayesian analysis in R really easy and (2) the Stan team produced and/or organized an amazing amount of documentation.

My impressions of doing Bayesian analysis in R were set in the WinBUGS era. The separate WinBUGS installation was always tricky, and then moving between the BRugs and R2WinBUGS packages presented some additional challenges. My recent Stan experience was nothing like this. I had everything up and running in just a few minutes. The directions for getting started with rstan are clear and explicit about making sure that you have the right tool chain in place for your platform. Since I am running R 3.3.0 on Windows 10, I installed Rtools34. This went quickly and as expected, except that C:\Rtools\gcc-4.x-y\bin did not show up in my path variable. Not a big deal: I used the menus in the Windows System Properties box to edit the Path statement by hand. After this, rstan installed like any other R package and I was able to run the 8schools example from the package vignette. The following 10-minute video by Ehsan Karim takes you through the install process and the vignette example.

The Stan documentation includes four major components: (1) the Stan Language Manual, (2) examples of fully worked-out problems, (3) contributed case studies and (4) both slides and video tutorials. This is an incredibly rich cache of resources that makes a very credible case for the ambitious project of teaching people with some R experience both Bayesian statistics and Stan at the same time. The "trick" here is that the documentation operates at multiple levels of sophistication, with entry points for students with different backgrounds. For example, a person with some R experience and the modest statistics background required for approaching Gelman and Hill's extraordinary text, Data Analysis Using Regression and Multilevel/Hierarchical Models, can immediately begin running rstan code for the book's examples. To run the rstan version of the example in section 5.1, Logistic Regression with One Predictor, with no changes, a student needs only to copy the R scripts and data into her local environment. In this case, she would need the R script 5._LogisticRegressionWithOnePredictor.R, the data nes1992_vote.data.R and the Stan code nes_logit.stan. The Stan code for this simple model is about as straightforward as it gets: variable declarations, parameter identification and the model itself.

data {
  int<lower=0> N;
  vector[N] income;
  int<lower=0,upper=1> vote[N];
}

parameters {
  vector[2] beta;
}

model {
  vote ~ bernoulli_logit(beta[1] + beta[2] * income);
}

Running the script will produce the iconic logistic regression plot:
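That plot is essentially the inverse-logit curve evaluated at the fitted coefficients. A minimal base-R sketch follows; the beta values here are made up for illustration, since the real values come from the posterior means of the Stan fit.

```r
# Sketch of the logistic regression curve: Pr(vote = 1) as a function
# of income. The coefficients are hypothetical placeholders.
invlogit <- function(x) 1 / (1 + exp(-x))

beta   <- c(-1.40, 0.33)   # hypothetical intercept and income slope
income <- seq(1, 5, length.out = 100)
p      <- invlogit(beta[1] + beta[2] * income)

plot(income, p, type = "l", ylim = c(0, 1),
     xlab = "Income category", ylab = "Pr(Republican vote)")
```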

I'll wind down by curbing my enthusiasm just a little by pointing out that Stan is not the only game in town. JAGS is a popular alternative, and there is plenty that can be done with unaugmented R code alone as the Bayesian Inference Task View makes abundantly clear.

If you are a book person and new to Bayesian statistics, I highly recommend Bayesian Essentials with R by Jean-Michel Marin and Christian Robert. The authors provide a compact introduction to Bayesian statistics that is backed up with numerous R examples. Also, the new book by Richard McElreath, Statistical Rethinking: A Bayesian Course with Examples in R and Stan looks like it is going to be an outstanding read. The online supplements to the book are certainly worth a look.

Finally, if you are a Bayesian or thinking about becoming one and you are going to useR!, be sure to catch the following talks:

- Bayesian analysis of generalized linear mixed models with JAGS, by Martyn Plummer
- bamdit: An R Package for Bayesian meta-Analysis of diagnostic test data by Pablo Emilio Verde
- Fitting complex Bayesian models with R-INLA and MCMC by Virgilio Gómez-Rubio
- bayesboot: An R package for easy Bayesian bootstrapping by Rasmus Arnling Bååth
- An Introduction to Bayesian Inference using R Interfaces to Stan by Ben Goodrich
- DiLeMMa - Distributed Learning with Markov Chain Monte Carlo Algorithms Using the ROAR Package by Ali Zaidi

by Lourdes O. Montenegro

*Lourdes O. Montenegro is a PhD candidate at the Lee Kuan Yew School of Public Policy, National University of Singapore. Her research interests cover the intersection of applied data science, technology, economics and public policy.*

Many of us now find it hard to live without a good quality internet connection. As a result, there is growing interest in characterizing and comparing internet performance metrics. For example, when planning to switch internet service providers or considering a move to a new city or country, internet users may want to research in advance what to expect in terms of download speed or latency. Cloud companies may want to provision adequately for different markets with varying levels of internet quality. And governments may want to benchmark their communications infrastructure and invest accordingly. Whatever the purpose, a consortium of research, industry and public interest organizations called Measurement Lab has made available the largest open and verifiable internet performance dataset on the planet. With the help of a combination of packages, R users can easily query, explore and visualize this large dataset at no cost.

In the example that follows, we use the *bigrquery* package to query and download results from the Network Diagnostic Tool (NDT) used by the U.S. FCC. The *bigrquery* package provides an interface to Google BigQuery, which hosts NDT results along with several other Measurement Lab (M-Lab) datasets. However, R users will first need to set up a BigQuery account and join the M-Lab mailing list to authenticate. Detailed instructions are provided on the M-Lab website. Once done, SQL-like queries can be run from within R. The results are saved as a data frame on which further analysis can be performed. Aside from the convenience of working within the R environment, the *bigrquery* package has another advantage: the only limitation on the size of the query results that can be saved for further exploration is the amount of available RAM. In contrast, the BigQuery web interface only allows query results of 16,000 rows or fewer to be exported to .csv format.

The following R script gives us the average download speed (in Mbps) per country in 2015.[1] The SQL-like query can be modified to return other internet performance metrics that may be of interest to the R user such as upload speed, round-trip time (latency) and packet re-transmission rates.

# Querying average download speed per country in 2015
require(bigrquery)

downquery_template <- "SELECT
  connection_spec.client_geolocation.country_code AS country,
  AVG(8 * web100_log_entry.snap.HCThruOctetsAcked /
      (web100_log_entry.snap.SndLimTimeRwin +
       web100_log_entry.snap.SndLimTimeCwnd +
       web100_log_entry.snap.SndLimTimeSnd)) AS downloadThroughput,
  COUNT(DISTINCT test_id) AS tests,
FROM plx.google:m_lab.ndt.all
WHERE IS_EXPLICITLY_DEFINED(web100_log_entry.connection_spec.remote_ip)
  AND IS_EXPLICITLY_DEFINED(web100_log_entry.connection_spec.local_ip)
  AND IS_EXPLICITLY_DEFINED(web100_log_entry.snap.HCThruOctetsAcked)
  AND IS_EXPLICITLY_DEFINED(web100_log_entry.snap.SndLimTimeRwin)
  AND IS_EXPLICITLY_DEFINED(web100_log_entry.snap.SndLimTimeCwnd)
  AND IS_EXPLICITLY_DEFINED(web100_log_entry.snap.SndLimTimeSnd)
  AND project = 0
  AND IS_EXPLICITLY_DEFINED(connection_spec.data_direction)
  AND connection_spec.data_direction = 1
  AND web100_log_entry.snap.HCThruOctetsAcked >= 8192
  AND (web100_log_entry.snap.SndLimTimeRwin +
       web100_log_entry.snap.SndLimTimeCwnd +
       web100_log_entry.snap.SndLimTimeSnd) >= 9000000
  AND (web100_log_entry.snap.SndLimTimeRwin +
       web100_log_entry.snap.SndLimTimeCwnd +
       web100_log_entry.snap.SndLimTimeSnd) < 3600000000
  AND IS_EXPLICITLY_DEFINED(web100_log_entry.snap.CongSignals)
  AND web100_log_entry.snap.CongSignals > 0
  AND (web100_log_entry.snap.State == 1
       OR (web100_log_entry.snap.State >= 5
           AND web100_log_entry.snap.State <= 11))
  AND web100_log_entry.log_time >= PARSE_UTC_USEC('2015-01-01 00:00:00') / POW(10, 6)
  AND web100_log_entry.log_time < PARSE_UTC_USEC('2016-01-01 00:00:00') / POW(10, 6)
GROUP BY country
ORDER BY country ASC;"

downresult <- query_exec(downquery_template, project="measurement-lab", max_pages=Inf)

Once we have the query results in a data frame, we can proceed to visualize and map average download speeds for each country. To do this, we can use the *rworldmap* package, which offers a relatively simple way to map country-level and gridded user datasets. Mapping is done mainly through two functions: (1) *joinCountryData2Map* joins the query results with shapefiles of country boundaries; (2) *mapCountryData* plots the choropleth map. Note that the join is best effected using either two- or three-letter ISO country codes, although the *rworldmap* package also allows join columns filled with country names.

In order to make the choropleth map prettier and more comprehensible, we can augment it with the *classInt* package to calculate natural breaks in the range of download speed results and the *RColorBrewer* package for a wider selection of color schemes. In the succeeding R script, we specify the Jenks method to cluster download speed results in a way that minimizes deviation from the class mean within each class while maximizing deviation across class means. Compared to other methods for clustering download speed results, the Jenks method draws a sharper picture of countries clocking greater than 25 Mbps on average.
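A toy example shows how the Jenks style picks breaks at the natural gaps in a vector; the speeds below are made-up values chosen so the clusters are obvious.

```r
# Illustration of classInt's "jenks" classification on made-up speeds:
# three clearly separated clusters should yield breaks at the gaps.
library(classInt)

speeds <- c(1, 2, 2, 3, 9, 10, 11, 24, 26, 27)
ci <- classIntervals(speeds, n = 3, style = "jenks")
ci$brks   # n + 1 break points, spanning min(speeds) to max(speeds)
```

With n classes, classIntervals returns n + 1 break points, which is why the map script below feeds classInt[["brks"]] straight into mapCountryData's catMethod argument.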

require(rworldmap)
require(classInt)
require(RColorBrewer)

downloadmap <- joinCountryData2Map(downresult,
                                   joinCode='ISO2',
                                   nameJoinColumn='country',
                                   verbose='TRUE')
par(mai=c(0,0,0.2,0), xaxs="i", yaxs="i")

# getting class intervals using a 'jenks' classification in classInt package
classInt <- classInt::classIntervals(downloadmap[["downloadThroughput"]],
                                     n=5, style="jenks")
catMethod = classInt[["brks"]]

# getting a colour scheme from the RColorBrewer package
colourPalette <- RColorBrewer::brewer.pal(5, 'RdPu')

mapParams <- mapCountryData(downloadmap,
                            nameColumnToPlot="downloadThroughput",
                            mapTitle="Download Speed (Mbps)",
                            catMethod=catMethod,
                            colourPalette=colourPalette,
                            addLegend=FALSE)
do.call(addMapLegend,
        c(mapParams, legendWidth=0.5, legendLabels="all",
          legendIntervals="data", legendMar=2))

Looking at the map, we see that the UK, Japan, Romania, Sweden, Taiwan, The Netherlands, Denmark and Singapore (if we squint!) are the best places to be for internet speed addicts. Until further investigation, we can safely discount the suspiciously high results for North Korea, since the number of observations is too low. In contrast, average download speeds in South Korea might be grossly underestimated when measured from foreign servers, as may be the case with NDT results, since most Koreans access locally hosted content. There are, of course, a number of caveats worth mentioning before drawing any conclusions regarding the causes of varying internet performance between countries. Confounding factors such as distance from the client to the test server, the client's operating system, and the proportion of fixed broadband to wireless connections will need to be controlled for. Despite these caveats, this tentative exploration already reveals interesting patterns in global internet performance that are worth a closer look.

[1] Thanks to Stephen McInerney and Chris Ritzo for code advice.