The latest update to the world's most popular statistical data analysis software is now available. R 3.1.2 (codename: "Pumpkin Helmet") makes a number of minor improvements and bug fixes to the R language engine. You can see the complete list of changes here, which include improvements for the log-Normal distribution function, improved axis controls for histograms, a fix to the nlminb optimizer which was causing rare crashes on Windows (and traced to a bug in the gcc compiler), and some compatibility updates for the Yosemite release of OS X on Macs.
This latest update comes on the heels of another major R milestone: the CRAN package repository now features more than 6,000 user-contributed packages! (CRAN actually hit that milestone two days ago; as of this writing there are 6,004.) These packages are all ready to use with R 3.1.2 — the CRAN system automatically checks to make sure the packages pass all of their tests with the latest R version. You can search and explore R packages on MRAN, or simply browse R packages by topic area.
Revolution R Open will be updated to include R 3.1.2 very soon: the next update is in testing now and should be ready in a couple of weeks.
r-devel mailing list: R 3.1.2 is released
by Joseph Rickert
The San Francisco Bay Area Chapter of the Association for Computing Machinery (ACM) has been holding an annual Data Mining Camp and "unconference" since 2009. This year, to reflect the times, the group held a Data Science Camp and unconference, and we at Revolution Analytics were, once again, very happy to be a sponsor for the event and pleased to be able to participate.
In an ACM unconference, except for prearranged tutorials and the keynote address, there are no scheduled talks. Instead, anyone with the passion to speak gets two minutes to pitch a session. A show of hands determines what flies, the organizers allocate rooms and group talks by theme on-the-fly, and then off you go. The photo below shows how all of this sorted out on Saturday.
As you might expect, there was a lot of interest in Big Data, NoSQL, NLP etc., but there was also quite a bit of interest in R, enough to fill a large room for two back-to-back sessions. I was very happy to reprise some of the material from a recent webinar I presented on an introduction to Machine Learning and Data Science with R, and Ram Narasimhan (a longtime member of the Bay Area useR Group) gave a high-energy and very informative tutorial on the dplyr package that, judging from the audience reaction, inspired quite a few new R programmers.
But the real R highlight came early in the day. Irina Kukuyeva presented a tutorial on Principal Component Analysis with Applications in R and Python that was well worth getting up early for on a Saturday morning. Not only did Irina put together a very nice introduction to PCA, starting with the basic math and illustrating how PCA is used through case studies, but in a laudable effort to be as inclusive as possible, she also took the trouble to write both Python and R code for all of her examples! The following slide shows what PCA looks like in both languages.
This next slide shows what a good bit of statistics looks like in both languages.
For more presentations and tutorials by Irina that feature R, have a look at her Tutorial page.
If you ever find you need to embed the results of R functions — data, charts, or even a single calculation — into other applications, then you might want to take a look at DeployR Open. DeployR Open is an open-source server-based framework for R that makes it easy to call out to the server to run R code in real time.
The workflow is simple: An R programmer develops an R script (using their standard R tools) and publishes that script to the DeployR server. Once published, R scripts can be executed by any authorized application using the DeployR API. We provide native client libraries in Java, JavaScript and .NET to simplify making calls to the server. The R results returned on these calls can be embedded, displayed or processed in any way your application needs.
There are a few really nice features of this architecture:
DeployR Open offers a number of ways for client applications to integrate with the server:
DeployR Open is a 100% open-source project, and includes many features previously only available as part of Revolution R Enterprise DeployR. (That's how, despite being new as an open-source project, it's already at version 7.3 — it's been in use for more than four years!) If you'd like technical support for DeployR Open, phone and email support is included with a Revolution R Plus subscription.
The DeployR Open server is deployed as a single node, so it is mainly designed for prototyping, and for building and deploying applications where the expected load on the server is low or moderate. If you anticipate a need to scale to multiple server resources to handle increased workload and improved throughput, or want to enjoy seamless integration with popular enterprise security solutions such as SSO, LDAP, Active Directory or PAM, consider upgrading to Revolution R Enterprise DeployR.
If you'd like to learn more, check out the DeployR Open website. There, you can find developer documentation and administrator documentation, and of course download and install DeployR Open for free. Let us know what you think in the comments!
by Terry M. Therneau Ph.D.
Faculty, Mayo Clinic
About a year ago there was a query about how to do "type 3" tests for a Cox model on the R help list, which someone wanted because SAS does it. The SAS addition looked suspicious to me, but as the author of the survival package I thought I should understand the issue more deeply. It took far longer than I expected but has been illuminating.
First off, what exactly is this 'type 3' computation of which SAS is so deeply enamored? Imagine that we are dealing with a data set that has interactions. In my field of biomedical statistics all data relationships have interactions: an effect is never precisely the same for young vs old, fragile vs robust, long vs short duration of disease, etc. We may not have the sample size or energy to model them, but they exist nonetheless. Assume as an example that we had a treatment effect that increases with age; how then would one describe a main effect for treatment? One approach is to select an age distribution of interest and use the mean treatment effect, averaged over that age distribution.
To compute this, one can start by fitting a sufficiently rich model, get predicted values for our age distribution, and then average them. This requires almost by definition a model that includes an age by treatment interaction: we need reasonably unbiased estimates of the treatment effects at individual ages a,b,c,... before averaging, or we are just fooling ourselves with respect to this overall approach. The SAS type 3 method for linear models is exactly this. It assumes as the "reference population of interest" a uniform distribution over any categorical variables and the observed distribution of the data set for any continuous ones, followed by a computation of the average predicted value. Least squares means are also an average prediction taken over the reference population.
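The averaging idea above can be sketched directly in R. This is an illustrative example with simulated data and hypothetical variable names (not from the original post): fit a model with an age-by-treatment interaction, predict over a uniform reference grid of age categories, and average the predictions per treatment.

```r
# Illustrative sketch with simulated data: a treatment effect that grows
# with age group, plus noise. All names here are made up for illustration.
set.seed(1)
n <- 400
d <- data.frame(
  trt = factor(sample(c("A", "B"), n, replace = TRUE)),
  age = factor(sample(c("50-59", "60-69", "70-79", "80+"), n, replace = TRUE))
)
d$y <- 1 + (d$trt == "B") * (0.5 + 0.3 * as.integer(d$age)) + rnorm(n)

# Fit a model rich enough to include the age-by-treatment interaction
fit <- lm(y ~ trt * age, data = d)

# Reference population: uniform over the age categories
grid <- expand.grid(trt = levels(d$trt), age = levels(d$age))
grid$pred <- predict(fit, newdata = grid)

# Least-squares means: average prediction per treatment over the uniform grid
lsm <- tapply(grid$pred, grid$trt, mean)
lsm["B"] - lsm["A"]  # population-averaged "main effect" of treatment
```

The key point is that the averaging happens after the fit, over a reference distribution chosen by the analyst, here uniform over the age categories.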
A primary statistical issue with type 3 is the choice of reference. Assume for instance that age had been coded as a categorical with levels of 50-59, 60-69, 70-79 and 80+. A type 3 test answers the question of what the treatment effect would be in a population of subjects in which 1/4 were aged 50-59, another 1/4 were 60-69, etc. Since I will never encounter a set of subjects with said pattern in real life, such an average is irrelevant. A nice satire of the situation can be found under the nom de plume of Guernsey McPearson (Also have a look at Multi-Centre Trials and the Finally Decisive Argument). To be fair there are other cases where the uniform distribution is precisely the right population, e.g., a designed experiment that lost perfect balance due to a handful of missing response values. But these are rare to non-existent in my world, and type 3 remains an answer to the question that nobody asked.
Average population prediction also highlights a serious deficiency in R. Working out the algebra, type 3 tests for a linear model turn out to be a contrast, C %*% coef(fit), for a particular contrast vector or matrix C. This fits neatly into the SAS package, which has a simple interface for user-specified contrasts. (The SAS type 3 algorithm is at its heart simply an elegant way to derive C for their default reference population.) The original S package took a different view, which R has inherited, of pre- instead of post-processing. Several of the common contrasts one might want to test can be obtained by clever coding of the design matrix X, before the fit, causing the contrast of interest to appear as one of the coefficients of the fitted model. This is a nice idea when it works, but there are many cases where it is insufficient: a linear trend test or all possible pairwise comparisons, for example.
R needs a general and well thought out post-fit contrasts function. Population averaged estimates could be one option of said routine, with the SAS population one possible choice.
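For concreteness, here is a minimal sketch of testing a post-fit contrast by hand in base R, using the built-in mtcars data (this is my own illustration, not code from the survival package):

```r
# Minimal post-fit contrast sketch. With treatment coding, coef(fit) is:
# (Intercept) = mean mpg for 4 cylinders, then offsets for 6 and 8 cylinders.
fit <- lm(mpg ~ factor(cyl), data = mtcars)

# Contrast C picks out the 8-cylinder minus 6-cylinder difference
C <- matrix(c(0, -1, 1), nrow = 1)
est <- drop(C %*% coef(fit))
se  <- sqrt(drop(C %*% vcov(fit) %*% t(C)))
c(estimate = est, se = se, t = est / se)
```

A general post-fit contrast routine would wrap exactly this computation, accepting an arbitrary C matrix, including one derived from a user-chosen reference population.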
Also, I need to mention a couple more things:
How do you summarize fashion? For New York Fashion Week, the New York Times used the idea of "Fashion Fingerprints", distilling a designer's collections into small fragments highlighting the palette. Here's what Marc Jacobs' current collection looks like:
Click through for an interactive version where you can explore each design, and scroll down to the bottom where you can see even greater distillation: each designer represented as abstract color blocks, with colors represented from head to toe.
R user Giuseppe Paleologo noted that R is ideally suited to a task like this. In fact, it took him less than 10 minutes to create a prototype function in R, and less than an hour to create a clean working function. The actual code is less than 20 lines, and you can use it to process 2000 images in less than a minute. Here's the result:
You can find the R code that created this image, and more information about how it works, at the link below.
gappy3000: R is in fashion (via the author)
If you haven't heard the buzz about Docker but you often need to spin up Linux-based VMs for testing, simulations, etc., then you should check it out. In short, Docker rocks: we use it for testing our Linux-based distros of Revolution R Open. If you want to use R and Docker together, Dirk Eddelbuettel and Carl Boettiger have made it easy with Rocker, and have also provided a nice explanation of Docker itself:
While its use (superficially) resembles that of virtual machines, it is much more lightweight as it operates at the level of a single process (rather than an emulation of an entire OS layer). This also allows it to start almost instantly, require very little resources and hence permits an order of magnitude more deployments per host than a virtual machine.
Rocker provides pre-built Docker images for the currently-released base R distribution, as well as the in-progress R-devel build. It also provides a container for an RStudio Server instance.
You can find more info about Rocker at the blog post below, or at the Rocker Github page.
Dirk Eddelbuettel: Introducing Rocker: Docker for R
by Joseph Rickert
One of the most interesting R related presentations at last week’s Strata Hadoop World Conference in New York City was the session on Distributed R by Sunil Venkayala and Indrajit Roy, both of HP Labs. In short, Distributed R is an open source project with the end goal of running R code in parallel on data that is distributed across multiple machines. The following figure conveys the general idea.
A master node controls multiple worker nodes each of which runs multiple R processes in parallel.
As I understand it, the primary use case for the Distributed R software is to move data quickly from a database into distributed data structures that can be accessed by multiple, independent R instances for coordinated, parallel computation. The Distributed R infrastructure automatically takes care of the extraction of the data and the coordination of the calculations, including the occasional movement of data from a worker node to the master node when required by the calculation. The user interface to the Distributed R mechanism is through R functions that have been designed and optimized to work with the distributed data structures, and through a special "Distributed R aware" foreach() function that allows users to write their own distributed functions using ordinary R functions.
To make all of this happen, the Distributed R platform contains several components, which may be briefly described as follows:
The distributedR package contains the distributed data structures.
A really nice feature of the distributed data structures is that they can be populated and accessed by rows, columns and blocks making it possible to write efficient algorithms tuned to the structure of particular data sets. For example, data cleaning for wide data sets (many more columns than rows) can be facilitated by preprocessing individual features.
vRODBC is an ODBC client that provides R with database connectivity. This is the connection mechanism that permits the parallel loading of data from various data sources, including HP's Vertica database.
The HPdata package contains the functions that allow you to actually load distributed data structures from various data sources.
The HPDGLM package implements parallel, distributed GLM models (presently only linear regression, logistic regression and Poisson regression models are available). The package also contains functions for cross-validation and split-sample validation.
The HPdclassifier package is intended to contain several distributed classification algorithms. It currently contains a parallel distributed implementation of the random forests algorithm.
The HPdcluster package contains a parallel, distributed kmeans algorithm.
The HPdgraph package is intended to contain distributed algorithms for graph analytics. It currently contains a parallel, distributed implementation of the pagerank algorithm for directed graphs.
The following sample code, taken directly from the HPdclassifier User Guide but modified slightly for presentation here, is similar to the examples that Venkayala and Roy showed in their presentation. Note that after the distributed arrays are set up, they are loaded in parallel with data using the foreach function from the distributedR package.
library(HPdclassifier) # loading the library
Loading required package: distributedR
Loading required package: Rcpp
Loading required package: RInside
Loading required package: randomForest
distributedR_start() # starting the distributed environment
Workers registered - 1/1.
All 1 workers are registered.
[1] TRUE
ds <- distributedR_status() # query the environment
nparts <- sum(ds$Inst) # number of available distributed instances
# Describe the data
nSamples <- 100 # number of samples
nAttributes <- 5 # number of attributes of each sample
nSplits <- 1 # number of splits in each darray
# Create the distributed arrays
dax <- darray(c(nSamples,nAttributes), c(round(nSamples/nSplits),nAttributes))
day <- darray(c(nSamples,1), c(round(nSamples/nSplits),1))
# Load the distributed arrays with random data, partition by partition
foreach(i, 1:npartitions(dax),
  function(x=splits(dax,i), y=splits(day,i), id=i){
    x <- matrix(runif(nrow(x)*ncol(x)), nrow(x), ncol(x))
    y <- matrix(runif(nrow(y)), nrow(y), 1)
    update(x)
    update(y)
  })
# Fit the random forest model
myrf <- hpdrandomForest(dax, day, nExecutor=nparts)
# Prediction
dp <- predictHPdRF(myrf, dax)
Notwithstanding all of its capabilities, Distributed R is still clearly a work in progress. It is only available on Linux platforms. Algorithms and data must be resident in memory. Distributed R is not available on CRAN, and even with an excellent Installation Guide, installing the platform is a bit of an involved process.
Nevertheless, Distributed R is impressive, and I think a valuable contribution to open source R. I expect that users with distributed data will find the platform to be a viable way to begin high performance computing with R.
Note that the Distributed R project discussed in this post is an HP initiative and is not in any way related to http://www.revolutionanalytics.com/revolution-r-enterprise-distributedr.
by Andrie de Vries
Last week we announced the availability of Revolution R Open, an enhanced distribution of R. One of the enhancements is the inclusion of high performance linear algebra libraries, specifically the Intel MKL. This library significantly speeds up many statistical calculations, e.g. the matrix algebra that forms the basis of many statistical algorithms.
Several years ago, David Smith wrote a blog post about multithreaded R, where he explored the benefits of the MKL, in particular on Windows machines.
In this post I explore whether anything has changed.
To best use the power available in the machines of today, Revolution R Open is installed by default with the Intel Math Kernel Library (MKL), which provides the BLAS and LAPACK library functions used by R. The Intel MKL enables many common R operations to use all of the processing power available.
The MKL's default behavior is to use as many parallel threads as there are available cores. There’s nothing you need to do to benefit from this performance improvement — not a single change to your R script is required.
However, you can still control or restrict the number of threads using the setMKLthreads() function from the Revobase package delivered with Revolution R Open. For example, you might want to limit the number of threads to reserve some of the processing capacity for other activities, or if you're doing explicit parallel programming with the ParallelR suite or other parallel programming tools.
You can set the maximum number of threads as follows:
setMKLthreads(<value>)
where <value> is the maximum number of parallel threads, not to exceed the number of available cores.
Compared to open source R, the MKL offers significant performance gains, particularly on Windows.
Here are the results of 5 tests on matrix operations, run on a Samsung laptop with an Intel i7 4-core CPU. From the graphic you can see that a matrix multiplication runs 27 times faster with the MKL than without, and linear discriminant analysis is 3.6 times faster.
You can replicate the same tests by using this code:
---
---
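As a rough illustration of the kind of operation being timed, here is a minimal sketch (my own, not the original benchmark code) that times a large matrix multiplication, the operation that benefits most from the MKL:

```r
# Sketch: time one matrix multiplication; compare across BLAS implementations
set.seed(123)
n <- 500
A <- matrix(rnorm(n * n), n, n)
B <- matrix(rnorm(n * n), n, n)
elapsed <- system.time(AB <- A %*% B)[["elapsed"]]
elapsed  # elapsed time in seconds
```

Running the same snippet under a multithreaded BLAS and under reference BLAS is the simplest way to see the difference for yourself.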
Another famous benchmark was published by Simon Urbanek, one of the members of R-core. You can find his code at Simon's benchmark page. His benchmark consists of three different classes of tests:
I compared the total execution time of the benchmark script in RRO (with MKL) and R. Using Revolution R Open, the benchmark tests completed in 47.7 seconds. This compared to ~176 seconds using R-3.1.1 on the same machine.
To replicate these results, you can use the following script, which runs (sources) his code directly from the URL and captures the total execution time:
---
---
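The general pattern is simply to wrap source() in system.time(). Here is a sketch of that pattern using a stand-in temporary script, since the benchmark URL itself isn't reproduced here; for the real measurement you would pass the benchmark script's location instead:

```r
# Sketch: time the full execution of a sourced script
time_script <- function(path) {
  system.time(source(path, local = new.env()))[["elapsed"]]
}

# Stand-in for the benchmark script: a tiny temporary R file
tmp <- tempfile(fileext = ".R")
writeLines("x <- sum(seq_len(1e6))", tmp)
elapsed <- time_script(tmp)  # pass the benchmark script's URL for the real test
```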
Here is a summary of each of the individual tests:
| Test | R-3.1.1 (sec) | RRO (sec) | Performance gain |
|------|--------------:|----------:|-----------------:|
| I. Matrix calculation | | | |
| Create, transpose and deform matrix | 1.01 | 1.01 | 0.0 |
| Matrix computation | 0.40 | 0.40 | 0.0 |
| Sort random values | 0.72 | 0.74 | 0.0 |
| Cross product | 11.50 | 0.42 | 26.4 |
| Linear regression | 5.56 | 0.25 | 20.9 |
| II. Matrix functions | | | |
| Fast Fourier Transform | 0.45 | 0.47 | 0.0 |
| Compute eigenvalues | 0.74 | 0.39 | 0.9 |
| Calculate determinant | 2.87 | 0.24 | 10.8 |
| Cholesky decomposition | 4.50 | 0.25 | 16.8 |
| Matrix inverse | 2.71 | 0.25 | 9.9 |
| III. Programmation | | | |
| Vector calculation | 0.67 | 0.67 | 0.0 |
| Matrix calculation | 0.26 | 0.26 | 0.0 |
| Recursion | 0.95 | 1.06 | -0.1 |
| Loops | 0.43 | 0.43 | 0.0 |
| Mixed control flow | 0.41 | 0.37 | 0.1 |
| Total test time | 165.60 | 47.72 | 2.5 |
The Intel MKL makes a notable difference for many matrix computations. When running the Urbanek benchmark using the MKL on Windows, you can expect a performance gain of ~2.5x.
The caveat is that the standard R distribution uses different math libraries on different operating systems. For example, R on Mac OS X uses the ATLAS BLAS, which gives you performance comparable to the MKL.
To find out more about Revolution R Open, go to http://mran.revolutionanalytics.com/open/
You can download RRO at http://mran.revolutionanalytics.com/download/
Many R scripts depend on CRAN packages, and most CRAN packages in turn depend on other CRAN packages. If you install an R package, you'll also be installing its dependencies to make it work, and possibly other packages as well to enable its full functionality.
My colleague Andrie posted some R code to map package dependencies a couple of months ago, but now you can easily explore the dependencies of any CRAN package at MRAN. Simply search for a package and click the Dependencies Graph tab. Here's a very simple one: the foreach package.
The foreach package depends on two others: iterators and codetools, which will be automatically installed for you by install.packages when you install foreach. (We'll discuss the use of "Suggests" — as here with randomForest — later.) Now let's look at a more complex example: the caret package.
The caret package provides an interface to many of the predictive modeling packages on CRAN, and so it has several dependencies (nine, in fact — you can see the list by clicking on the Dependencies Table tab). But it also Suggests many more packages — these are packages that are not required to run caret, but if you do have them, there are more model types you can use within the caret framework.
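If you'd rather compute dependencies yourself, base R's tools::package_dependencies() does the same work. Here is a sketch using a small hand-built package database so it runs offline; a real session would pass db = available.packages() instead (which requires network access):

```r
# Sketch: first-level dependencies from a toy CRAN-style database.
# A real session would use db = available.packages() instead of this matrix.
db <- matrix(
  c("foreach",      "", "codetools, iterators", "randomForest",
    "iterators",    "", "",                     "",
    "codetools",    "", "",                     "",
    "randomForest", "", "",                     ""),
  ncol = 4, byrow = TRUE,
  dimnames = list(NULL, c("Package", "Depends", "Imports", "Suggests"))
)
deps <- tools::package_dependencies("foreach", db = db,
                                    which = c("Depends", "Imports"))
deps$foreach  # codetools and iterators, matching the MRAN graph
```

Adding "Suggests" to the which argument would pull in the optional packages as well, which is exactly the distinction the MRAN graphs draw.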
Here's a quick overview of the types of dependencies you'll find in the charts and tables on MRAN:
MRAN is updated daily, and so the Dependencies Graph is always up-to-date with the latest CRAN packages and their connections. Start exploring at the link below.
MRAN: Explore Packages
My second-favourite keynote from yesterday's Strata Hadoop World conference was this one, from Pinterest's John Rauser. To many people (especially in the Big Data world), Statistics is a series of complex equations, but just a little intuition goes a long way to really understanding data. John illustrates this wonderfully using an example of data collected to determine whether consuming beer causes mosquitoes to bite you more:
The big lesson here, IMO, is that so many statistical problems can seem complex, but you can actually get a lot of insight by recognizing that your data is just one possible instance of a random process. If you have a hypothesis for what that process is, you can simulate it, and get an intuitive sense of how surprising your data is. R has excellent tools for simulating data, and a couple of hours spent writing code to simulate data can often give insights that will be valuable for the formal data analysis to come.
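To make that idea concrete, here is a sketch of a permutation test in R with invented numbers (not the data from the talk): shuffle the group labels many times and ask how often chance alone produces a difference in means as large as the one observed.

```r
# Permutation-test sketch with invented data (not the mosquito study's numbers)
set.seed(42)
beer  <- c(27, 19, 20, 20, 23, 17, 21, 24, 31, 26, 28, 20, 27,
           19, 25, 31, 24, 28, 24, 29, 21, 21, 18, 27, 20)
water <- c(21, 22, 15, 12, 21, 16, 19, 15, 22, 24, 19, 23, 13,
           22, 20, 24, 18, 20)

observed <- mean(beer) - mean(water)
pooled   <- c(beer, water)

# Re-randomize the labels many times and recompute the difference in means
perm_diffs <- replicate(10000, {
  idx <- sample(length(pooled), length(beer))
  mean(pooled[idx]) - mean(pooled[-idx])
})

# How surprising is the observed difference under pure chance?
p_value <- mean(perm_diffs >= observed)
p_value
```

A few hours with this kind of simulation often tells you more about whether a result is surprising than any formula would.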
(By the way, my favourite keynote from the conference was Amanda Cox's keynote on data visualization at the New York Times, which featured several examples developed in R. Sadly, though, it wasn't recorded.)
O'Reilly Strata: Statistics Without the Agonizing Pain