Julia Silge and David Robinson are both dab hands at using R to analyze text, from tracking the happiness (or otherwise) of Jane Austen characters, to identifying whether Trump's tweets came from him or a staffer. If you too would like to be able to make statistical sense of masses of (possibly messy) text data, check out their book Tidy Text Mining with R, available free online and soon to be published by O'Reilly.
The book builds on the tidytext package (to which Gabriela De Queiroz also contributed) and describes how to handle and analyze text data. The "tidy text" of the title refers to a standardized way of handling text data, as a simple table with one term per row (where a "term" may be a word, collection of words, or sentence, depending on the application). Julia gave several examples of tidy text in her recent talk at the RStudio conference:
Once you have text data in this "tidy" format, you can apply a vast range of statistical tools to it, by assigning data values to the terms. For example, you can use sentiment analysis tools to quantify terms by their emotional content, and analyze that. You can compare rates of term usage, such as between chapters or between authors, or simply create a word cloud of terms used. You could use topic modeling techniques to classify a collection of documents into like kinds.
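As a minimal sketch of the tidy format (the example sentence is invented; unnest_tokens is the tidytext function that does the tokenizing):

```r
library(dplyr)
library(tidytext)

docs <- data.frame(doc = 1,
                   text = "It is a truth universally acknowledged",
                   stringsAsFactors = FALSE)

# One row per term: unnest_tokens lowercases and splits the text
tidy_docs <- docs %>% unnest_tokens(word, text)
tidy_docs$word
# "it" "is" "a" "truth" "universally" "acknowledged"
```

From here, joining against a sentiment lexicon or counting terms per chapter is an ordinary dplyr join or count.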
There is a wealth of data sources you can use to apply these techniques: documents, emails, text messages ... anything with human-readable text. The book includes examples of analyzing works of literature (check out the janeaustenr and gutenbergr packages), downloading Tweets and Usenet posts, and even shows how to use metadata (in this case, from NASA) as the subject of a text analysis. But it's just as likely you have data of your own to try tidy text mining with, so check out Tidy Text Mining with R to get started.
The Microsoft R Server Tiger Team assists customers around the world in implementing large-scale analytic solutions. Along the way, they discover useful tips and best practices, and share them on the Tiger Team blog. Here are a few recent tips from the Tiger Team on using Microsoft R Server:
For more tips, including tips on operationalizing R scripts and using Microsoft R Server with data platforms including Teradata and Cloudera, check out the Tiger Team blog at the link below.
Switzerland is a country with lots of mountains, and several large lakes. While the political subdivisions (called municipalities) cover the high mountains and lakes, nothing much of economic interest happens in these places. (Raclette and sailing are wonderful, but don't count for our purposes.) For this reason, the Swiss Federal Statistical Office publishes the boundaries of the "productive" parts of the municipalities, and as this choropleth of average age in Swiss municipalities created by Timo Grossenbacher shows, leaving out the non-productive parts leaves us with a very different-looking Switzerland.
The choropleth would be more recognizable by filling in the non-productive areas with a traditional relief map, which is exactly what Timo does (along with breaking the age scale into discrete categories, for improved interpretability) in the publication-quality map below.
Timo's blog post, Beautiful thematic maps with ggplot2 (only), details the process of building maps like this using the ggplot2 package (and just a few others) for R. There are lots of useful nuggets of advice within the tutorial, including:
Test your code by running it in a fresh R session, started from the command line with "R --vanilla". The --vanilla option prevents R from running any initialization scripts (that might load packages) or loading any objects from a saved workspace.

When plotting projected coordinates, use coord_equal to display them as a map without distortion.

For the complete tutorial, including links to the code and data, check out Timo's blog post linked below.
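As a hedged sketch of the coord_equal tip (the munis data frame below is invented; in Timo's post the polygons come from the Swiss geodata):

```r
library(ggplot2)

# Invented stand-in for a fortified data frame of municipality polygons:
# two square "municipalities" with different average ages
munis <- data.frame(
  long    = c(0, 1, 1, 0,  2, 3, 3, 2),
  lat     = c(0, 0, 1, 1,  0, 0, 1, 1),
  group   = rep(c("a", "b"), each = 4),
  avg_age = rep(c(40, 48), each = 4)
)

p <- ggplot(munis, aes(long, lat, group = group, fill = avg_age)) +
  geom_polygon() +
  coord_equal() +  # one unit east = one unit north: no distortion
  theme_void()     # no axes or grid lines, as befits a map
```

coord_equal forces a 1:1 aspect ratio, which is what you want for planar projected coordinates.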
Timo Grossenbacher: Beautiful thematic maps with ggplot2 (only)
Update Dec 29: A couple of minor corrections based on feedback from Timo Grossenbacher
Bayesian Inference is a way of combining information from data with things we think we already know. For example, if we wanted to get an estimate of the mean height of people, we could use our prior knowledge that people are generally between 5 and 6 feet tall to inform the results from the data we collect. If our prior is informative and we don't have much data, this will help us to get a better estimate. If we have a lot of data, even if the prior is wrong (say, our population is NBA players), the prior won't change the estimate much. You might say that including such "subjective" information in a statistical model isn't right, but there's subjectivity in the selection of any statistical model. Bayesian Inference makes that subjectivity explicit.
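The height example can be made concrete with a conjugate normal-normal update (all the numbers here are invented for illustration, and the data variance is assumed known):

```r
# Prior belief: mean height ~ N(5.5 ft, 0.25); observations ~ N(mu, 0.09)
prior_mean <- 5.5
prior_var  <- 0.25
data_var   <- 0.09

heights <- c(6.6, 6.9, 6.7, 6.8)  # a (very tall) sample, e.g. NBA players
n    <- length(heights)
xbar <- mean(heights)

# Posterior precision is the sum of the prior and data precisions;
# the posterior mean is a precision-weighted average of prior and data
post_var  <- 1 / (1 / prior_var + n / data_var)
post_mean <- post_var * (prior_mean / prior_var + n * xbar / data_var)
round(post_mean, 2)  # about 6.65
```

With only four observations, the posterior mean of about 6.65 feet already sits much closer to the sample mean (6.75) than to the prior (5.5); with more data it would move closer still.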
Bayesian Inference can seem complicated, but as Brandon Rohrer explains, it's based on straightforward principles of conditional probability. Watch his video below for an elegant explanation of the basics.
If you'd like to try out some Bayesian statistics yourself, R has many packages for Bayesian Inference.
Data Science and Robots Blog: How Bayesian inference works
If you're already familiar with R, but struggling with out-of-memory or performance problems when attempting to analyze large data sets, you might want to check out this new EdX course, Analyzing Big Data with Microsoft R Server, presented by my colleague Seth Mottaghinejad. In the course, you'll learn how to build models using the RevoScaleR
package, and deploy those models to production environments like Spark and SQL Server. The course is self-paced with videos, tutorials and tests, and is free to audit.
(By the way, if you don't already know R, you might want to check out the courses Introduction to R for Data Science and Programming in R for Data Science first.)
The RevoScaleR package isn't available on CRAN: it's included with Microsoft R Server and Microsoft R Client. You can download and use Microsoft R Client for free, which provides an installation of R with the RevoScaleR
library built in and loaded when you start the session. An R IDE is also recommended: you can use R Tools for Visual Studio or RStudio.
The course is open now, and you can get started at EdX at the link below.
EdX: Analyzing Big Data with Microsoft R Server
The glmnetUtils package provides a collection of tools to streamline the process of fitting elastic net models with glmnet. I wrote the package after a couple of projects where I found myself writing the same boilerplate code to convert a data frame into a predictor matrix and a response vector. In addition to providing a formula interface, it also has a function (cvAlpha.glmnet) to do crossvalidation for both elastic net parameters α and λ, as well as some utility functions.
The interface that glmnetUtils provides is very much the same as for most modelling functions in R. To fit a model, you provide a formula and data frame. You can also provide any arguments that glmnet will accept. Here is a simple example:
mtcarsMod <- glmnet(mpg ~ cyl + disp + hp, data=mtcars)
## Call:
## glmnet.formula(formula = mpg ~ cyl + disp + hp, data = mtcars)
##
## Model fitting options:
## Sparse model matrix: FALSE
## Use model.frame: FALSE
## Alpha: 1
## Lambda summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
## 0.03326 0.11690 0.41000 1.02800 1.44100 5.05500
Under the hood, glmnetUtils creates a model matrix and response vector, and passes them to the glmnet package to do the actual model fitting. Prediction also works as you'd expect: just pass a data frame containing the new observations, along with any arguments that predict.glmnet needs.
# least squares regression: get predictions for lambda=1
predict(mtcarsMod, newdata=mtcars, s=1)
You may have noticed the options "use model.frame" and "sparse model matrix" in the printed output above. glmnetUtils includes a couple of options to improve performance, especially on wide datasets and/or datasets with many categorical (factor) variables.
The standard R method for creating a model matrix out of a data frame uses the model.frame function, which has a major disadvantage when it comes to wide data. It generates a terms object, which specifies how the original columns of data relate to the columns in the model matrix. This involves creating and storing a (roughly) square matrix of size p × p, where p is the number of variables in the model. When p > 10000, which isn't uncommon these days, the terms object can exceed a gigabyte in size. Even if there is enough memory to store the object, processing it can be very slow.
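You can see the quadratic growth directly by building a terms object for a wide formula (2000 variables here to keep it quick; exact sizes are platform-dependent):

```r
# A formula with 2000 predictors: ~ x1 + x2 + ... + x2000
f  <- reformulate(paste0("x", 1:2000))
tt <- terms(f)

# The "factors" attribute is the (roughly) p x p matrix described above
dim(attr(tt, "factors"))
print(object.size(attr(tt, "factors")), units = "MB")
```

At p = 2000 the matrix already runs to tens of megabytes; scaling p by 5 scales the storage by 25.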
Another issue with the standard approach is the treatment of factors. Normally, model.matrix will turn an N-level factor into an indicator matrix with N−1 columns, with one column being dropped. This is necessary for unregularised models as fit with lm and glm, since the full set of N columns is linearly dependent. However, this may not be appropriate for a regularised model as fit with glmnet. The regularisation procedure shrinks the coefficients towards zero, which forces the estimated differences from the baseline to be smaller. But this only makes sense if the baseline level was chosen beforehand, or is otherwise meaningful as a default; otherwise it is effectively making the levels more similar to an arbitrarily chosen level.
To deal with these problems, glmnetUtils by default will avoid using model.frame, instead building up the model matrix term-by-term. This avoids the memory cost of creating a terms object, and can be much faster than the standard approach. It will also include one column in the model matrix for all levels in a factor; that is, no baseline level is assumed. In this situation, the coefficients represent differences from the overall mean response, and shrinking them to zero is meaningful (usually). Machine learners may also recognise this as one-hot encoding.
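The two encodings can be compared with base R's model.matrix (this is standard R behaviour, shown for illustration, not glmnetUtils code):

```r
d <- data.frame(x = factor(c("a", "b", "c")))

# Treatment coding: intercept plus N-1 indicators, baseline "a" dropped
m1 <- model.matrix(~ x, d)
colnames(m1)   # "(Intercept)" "xb" "xc"

# No intercept: one indicator per level (one-hot encoding)
m2 <- model.matrix(~ x - 1, d)
colnames(m2)   # "xa" "xb" "xc"
```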
glmnetUtils can also generate a sparse model matrix, using the sparse.model.matrix function provided in the Matrix package. This works exactly the same as a regular model matrix, but takes up significantly less memory if many of its entries are zero. A scenario where this is the case would be where many of the predictors are factors, each with a large number of levels.
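A quick way to see the difference, one-hot encoding a single factor with many levels (exact sizes will vary by platform):

```r
library(Matrix)

set.seed(42)
lvls <- paste0("lvl", 1:1000)
d <- data.frame(x = factor(sample(lvls, 5000, replace = TRUE),
                           levels = lvls))

# One-hot encoding: a 5000 x 1000 matrix that is mostly zeros
dense  <- model.matrix(~ x - 1, d)
sparse <- sparse.model.matrix(~ x - 1, d)

print(object.size(dense),  units = "MB")   # tens of MB
print(object.size(sparse), units = "MB")   # a small fraction of that
```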
One piece missing from the standard glmnet package is a way of choosing α, the elastic net mixing parameter, similar to how cv.glmnet chooses λ, the shrinkage parameter. To fix this, glmnetUtils provides the cvAlpha.glmnet function, which uses crossvalidation to examine the impact on the model of changing α and λ. The interface is the same as for the other functions:
# Leukemia dataset from Trevor Hastie's website:
# http://web.stanford.edu/~hastie/glmnet/glmnetData/Leukemia.RData
load("~/Leukemia.rdata")
leuk <- do.call(data.frame, Leukemia)
cvAlpha.glmnet(y ~ ., data=leuk, family="binomial")
## Call:
## cvAlpha.glmnet.formula(formula = y ~ ., data = leuk, family = "binomial")
##
## Model fitting options:
## Sparse model matrix: FALSE
## Use model.frame: FALSE
## Alpha values: 0 0.001 0.008 0.027 0.064 0.125 0.216 0.343 0.512 0.729 1
## Number of crossvalidation folds for lambda: 10
cvAlpha.glmnet uses the algorithm described in the help for cv.glmnet, which is to fix the distribution of observations across folds and then call cv.glmnet in a loop with different values of α. Optionally, you can parallelise this outer loop, by setting the outerParallel argument to a non-NULL value. Currently, glmnetUtils supports the following methods of parallelisation:
parLapply in the parallel package: set outerParallel to a valid cluster object created by makeCluster.

rxExec as supplied by Microsoft R Server's RevoScaleR package: set outerParallel to a valid compute context created by RxComputeContext, or a character string specifying such a context.

The glmnetUtils package is a way to improve quality of life for users of glmnet. As with many R packages, it’s always under development; you can get the latest version from my GitHub repo. The easiest way to install it is via devtools:
library(devtools)
install_github("hong-revo/glmnetUtils")
A more detailed version of this post can also be found at the package vignette. If you find a bug, or if you want to suggest improvements to the package, please feel free to contact me at hongooi@microsoft.com.
Hadley Wickham, co-author (with Garrett Grolemund) of R for Data Science and RStudio's Chief Scientist, has focused much of his R package development on the un-sexy but critically important part of the data science process: data management. In the Tidy Tools Manifesto, he proposes four basic principles for any computer interface for handling data:
Reuse existing data structures.
Compose simple functions with the pipe.
Embrace functional programming.
Design for humans.
Those principles are realized in a new collection of his R packages: the tidyverse. Now, with a simple call to library(tidyverse) (after installing the package from CRAN), you can load into your R session a suite of tools that make managing data easier:
The tidyverse also loads purrr, for functional programming with data, and ggplot2, for data visualization using the grammar of graphics.
Installing the tidyverse package also installs for you (but doesn't automatically load) a raft of other packages to help you work with dates/time, strings, factors (with the new forcats package), and statistical models. It also provides various packages for connecting to remote data sources and data file formats.
Simply put, tidyverse puts a complete suite of modern data-handling tools into your R session, and provides an essential toolbox for any data scientist using R. (Also, it's a lot easier to simply add library(tidyverse) to the top of your script rather than the dozen or so library(...) calls previously required!) Hadley regularly updates these packages, and you can easily update them in your R installation using the provided tidyverse_update() function.
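A small taste of the style these packages encourage, using the pipe and dplyr verbs on the built-in mtcars data:

```r
library(tidyverse)  # loads dplyr, tidyr, readr, purrr, tibble, ggplot2, ...

# Average fuel economy by cylinder count, highest first
mtcars %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg)) %>%
  arrange(desc(avg_mpg))
```

Each step takes a data frame and returns a data frame, which is what makes the pipe composition work.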
For more on tidyverse, check out Hadley's post on the RStudio blog, linked below.
RStudio Blog: tidyverse 1.0.0
If you have dense data on a continuous scale, an effective way of representing the data visually is to use a heatmap, where the values are represented by a color on a continuous scale. For example, this chart from a Wall Street Journal interactive feature (and mentioned in Tal Galili's useR!2016 talk) represents the number of measles cases in each US state and year by a colored square:
(Here's how to create that chart in R.) But, note that scale at the bottom of the chart, mapping measles cases to a color on the rainbow. Here, we'll zoom in on it:
The scale you choose for a heat map is very important, and has a major impact on how the viewer will interpret the data presented. This scale has been chosen with care: while most of the scale is red, very few of the data cells are red (because the distribution of measles cases is skewed, thanks in particular to the introduction of a vaccine in 1963). A naively chosen scale would wash out the data.
The actual colors you choose are important too. The physics, technology, and neuroscience behind the interpretation of colors is surprisingly complex, but this talk on the default color schemes used in Python's matplotlib does a great job of explaining:
You can easily use the viridis color scales in R as well, thanks to the viridis package by Simon Garnier, which is available on CRAN. The package provides heatmap color schemes, all carefully chosen for optimized perception and usefulness for color-impaired viewers.
You can find several examples of using the viridis color palettes in the package vignette, both for base R graphics (including raster) and ggplot2. To get started, just install.packages("viridis") to install the package from CRAN.
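Here is a minimal sketch using the built-in volcano elevation data, first with base graphics and then with the scale_fill_viridis() function the package supplies for ggplot2:

```r
library(viridis)
library(ggplot2)

# Base graphics: a heatmap of volcano elevations with the viridis palette
image(volcano, col = viridis(256))

# ggplot2: the same data as a raster, with the viridis fill scale
df <- expand.grid(x = seq_len(nrow(volcano)), y = seq_len(ncol(volcano)))
df$z <- as.vector(volcano)
p <- ggplot(df, aes(x, y, fill = z)) +
  geom_raster() +
  scale_fill_viridis()
```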
Github (Simon Garnier): viridis
You download the data and complete your analysis with ample time to spare. Then, just before deadline, your collaborator lets you know that they've "fixed a data error". Now, you have to do your analysis all over again. This is the reproducibility horror story:
R provides many tools to make reproducibility easy, and the creators of the above video, Ecoinformática - AEET, provide a useful list of tutorials and guides. Chief amongst these is using the knitr package for R: the R language automates the process of importing, preparing and analyzing the data, while knitr automates the process of assembling text, code, tables and charts into a Word, PDF, HTML and many other document formats.
But while knitr solves a good chunk[*] of the reproducibility problem, there's one complicating factor it doesn't deal with: updated R packages. In the same way that a collaborator updating the data triggers a restart, someone updating an R package your script uses can also affect your results. (That someone was likely you, working on a different R project.) The checkpoint package for R solves that problem by letting you "lock in" the package versions you use with a project. It's easy to use: all you need to do is add a line like checkpoint("2016-08-31") to the beginning of your script, which scans your project for the packages it uses, installs those packages as they existed on CRAN on the given date, and points your R session at that private package library.
It does some clever things to avoid re-downloading packages if it doesn't need to, and to avoid storing multiple copies of the same package version, but that's the basic gist. Checkpoint also makes it really easy to share code with others, because you can be confident they'll also get the packages they need to make your script work. You can learn more about the checkpoint package here and in this vignette, and just install it from CRAN to get started. (If you use Microsoft R Open you don't even need to download it, it's already included.)
[*] pun intended
R has some good tools for importing data from spreadsheets, among them the readxl package for Excel and the googlesheets package for Google Sheets. But these only work well when the data in the spreadsheet are arranged as a rectangular table, and not overly encumbered with formatting or generated with formulas. As Jenny Bryan pointed out in her recent talk at the useR!2016 conference (and embedded below, or download PDF slides here), in practice few spreadsheets have "a clean little rectangle of data in the upper-left corner", because most people use spreadsheets not just as a file format for data retrieval, but also as a reporting/visualization/analysis tool.
Nonetheless, for a practicing data scientist, there's a lot of useful data locked up in these messy spreadsheets that needs to be imported into R before we can begin analysis. As just one example given by Jenny in her talk, this spreadsheet was included as one of 15,000 spreadsheet attachments (one with 175 tabs!) in the Enron Corpus.
To make it easier to import data into R from messy spreadsheets like this, Jenny and co-author Richard G. FitzJohn created the jailbreakr package. The package is in its early stages, but it can already import Excel (xlsx format) and Google Sheets into R as new "linen" objects from which small sub-tables can easily be extracted as data frames. It can also print spreadsheets in a condensed text-based format with one character per cell — useful if you're trying to figure out why an apparently simple spreadsheet isn't importing as you expect. (Check out the "weekend getaway winner" story near the end of Jenny's talk for a great example.)
The jailbreakr package isn't yet on CRAN, but if you want to try it out you can download it from the Github repository (or even contribute!) at the link below.
Github (rsheets): jailbreakr