by Joseph Rickert

The San Francisco Bay Area Chapter of the Association for Computing Machinery (ACM) has been holding an annual Data Mining Camp and "unconference" since 2009. This year, to reflect the times, the group held a *Data Science Camp* and unconference, and we at Revolution Analytics were, once again, very happy to be a sponsor for the event and pleased to be able to participate.

In an ACM unconference, except for prearranged tutorials and the keynote address, there are no scheduled talks. Instead, anyone with the passion to speak gets two minutes to pitch a session. A show of hands determines what flies, the organizers allocate rooms and group talks by theme on the fly, and then off you go. The photo below shows how all of this sorted out on Saturday.

As you might expect, there was a lot of interest in Big Data, NoSQL, NLP, etc., but there was also quite a bit of interest in R, enough to fill a large room for two back-to-back sessions. I was very happy to reprise some of the material from a recent webinar I presented on an introduction to Machine Learning and Data Science with R, and Ram Narasimhan (a longtime member of the Bay Area useR Group) gave a high-energy and very informative tutorial on the dplyr package that, judging from the audience reaction, inspired quite a few new R programmers.

But the real R highlight came early in the day. Irina Kukuyeva presented a tutorial on *Principal Component Analysis with Applications in R and Python* that was well worth getting up for early Saturday morning. Not only did Irina put together a very nice introduction to PCA, starting with the basic math and illustrating how PCA is used through case studies, but in a laudable effort to be as inclusive as possible, she also took the trouble to write both Python and R code for all of her examples! The following slide shows what PCA looks like in both languages.
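For readers who want to follow along at home, here is a minimal sketch of PCA in R (my reconstruction, not the code from Irina's slide; the Python analogue would use scikit-learn's PCA class):

```r
# PCA on a built-in data set: prcomp() centers the data and,
# with scale. = TRUE, standardizes each variable first.
pca <- prcomp(USArrests, scale. = TRUE)

summary(pca)   # proportion of variance explained by each component
head(pca$x)    # the observations projected onto the principal components
biplot(pca)    # quick joint view of scores and loadings
```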

This next slide shows what a good bit of statistics looks like in both languages.

For more presentations and tutorials by Irina that feature R, have a look at her Tutorial page.

by Joseph Rickert

One of the most interesting R related presentations at last week’s Strata Hadoop World Conference in New York City was the session on Distributed R by Sunil Venkayala and Indrajit Roy, both of HP Labs. In short, Distributed R is an open source project with the end goal of running R code in parallel on data that is distributed across multiple machines. The following figure conveys the general idea.

A master node controls multiple worker nodes each of which runs multiple R processes in parallel.

As I understand it, the primary use case for the Distributed R software is to move data quickly from a database into distributed data structures that can be accessed by multiple, independent R instances for coordinated, parallel computation. The Distributed R infrastructure automatically takes care of the extraction of the data and the coordination of the calculations, including the occasional movement of data from a worker node to the master node when required by the calculation. The user interface to the Distributed R mechanism is through R functions that have been designed and optimized to work with the distributed data structures, and through a special “Distributed R aware” foreach() function that allows users to write their own distributed functions using ordinary R functions.

To make all of this happen, the Distributed R platform contains several components that may be briefly described as follows:

The distributedR package contains:

- functions to set up the infrastructure for the distributed platform
- distributed data structures that are the analogues of R’s data frames, arrays and lists, and
- the functions foreach() and splits() to let users write their own parallel algorithms.

A really nice feature of the distributed data structures is that they can be populated and accessed by rows, columns and blocks, making it possible to write efficient algorithms tuned to the structure of particular data sets. For example, data cleaning for wide data sets (many more columns than rows) can be facilitated by preprocessing individual features.
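To make the partitioning idea concrete, here is a hedged sketch using the darray(dim, blocks) calling pattern from the example later in this post; the second argument gives the block size, so the same matrix can be split by rows, by columns, or into rectangular blocks:

```r
library(distributedR)
distributedR_start()

nr <- 100; nc <- 10
dRows   <- darray(c(nr, nc), c(nr/4, nc))     # 4 row-wise partitions
dCols   <- darray(c(nr, nc), c(nr, nc/2))     # 2 column-wise partitions
dBlocks <- darray(c(nr, nc), c(nr/2, nc/2))   # 4 rectangular blocks
```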

vRODBC is an ODBC client that provides R with database connectivity. This is the connection mechanism that permits the parallel loading of data from various data sources, including HP’s Vertica database.

The HPdata package contains the functions that allow you to actually load distributed data structures from various data sources.

The HPDGLM package implements parallel, distributed GLM models (presently only linear regression, logistic regression and Poisson regression models are available). The package also contains functions for cross validation and split-sample validation.
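As a purely hypothetical sketch of what fitting such a model might look like (I am assuming an hpdglm() entry point with a glm()-style family argument; check the HPDGLM documentation for the real signature):

```r
library(HPDGLM)
# Hypothetical call: dax is a darray of predictors and day a darray of
# 0/1 responses, as in the HPdclassifier example later in this post.
fit <- hpdglm(responses = day, predictors = dax, family = "binomial")
```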

The HPdclassifier package is intended to contain several distributed classification algorithms. It currently contains a parallel distributed implementation of the random forests algorithm.

The HPdcluster package contains a parallel, distributed kmeans algorithm.

The HPdgraph package is intended to contain distributed algorithms for graph analytics. It currently contains a parallel, distributed implementation of the pagerank algorithm for directed graphs.

The following sample code, taken directly from the HPdclassifier User Guide but modified slightly for presentation here, is similar to the examples that Venkayala and Roy showed in their presentation. Note that after the distributed arrays are set up, they are loaded in parallel with data using the foreach function from the distributedR package.

```r
library(HPdclassifier)   # loading the library
## Loading required package: distributedR
## Loading required package: Rcpp
## Loading required package: RInside
## Loading required package: randomForest

distributedR_start()     # starting the distributed environment
## Workers registered - 1/1.
## All 1 workers are registered.
## [1] TRUE

ds <- distributedR_status()   # restores a line missing from the listing:
                              # ds must be defined before its Inst column is used
nparts <- sum(ds$Inst)        # number of available distributed instances

# Describe the data
nSamples    <- 100   # number of samples
nAttributes <- 5     # number of attributes of each sample
nSplits     <- 1     # number of splits in each darray

# Create the distributed arrays
dax <- darray(c(nSamples, nAttributes),
              c(round(nSamples/nSplits), nAttributes))
day <- darray(c(nSamples, 1), c(round(nSamples/nSplits), 1))

# Load the distributed arrays in parallel with random data
foreach(i, 1:npartitions(dax),
        function(x = splits(dax, i), y = splits(day, i), id = i) {
          x <- matrix(runif(nrow(x) * ncol(x)), nrow(x), ncol(x))
          y <- matrix(runif(nrow(y)), nrow(y), 1)
          update(x)
          update(y)
        })

# Fit the random forest model
myrf <- hpdrandomForest(dax, day, nExecutor = nparts)

# Prediction
dp <- predictHPdRF(myrf, dax)
```

Notwithstanding all of its capabilities, Distributed R is still clearly a work in progress. It is only available on Linux platforms. Algorithms and data must be resident in memory. Distributed R is not available on CRAN, and even with an excellent Installation Guide, installing the platform is a bit of an involved process.

Nevertheless, Distributed R is impressive and, I think, a valuable contribution to open source R. I expect that users with distributed data will find the platform to be a viable way to begin high performance computing with R.

Note that the Distributed R project discussed in this post is an HP initiative and is not in any way related to http://www.revolutionanalytics.com/revolution-r-enterprise-distributedr.

A quick heads up that if you'd like to get a great introduction to doing data science with the R language, Joe Rickert will be giving a free webinar next Thursday, September 25: Data Science with R. Regular readers of the blog will be familiar with Joe's posts on this topic. A few recent examples include posts on comparing machine learning models, predictive models for airline delays, agent-based models, and many more. Register for the live webinar and Q&A with Joe, plus access to the slides and replay after the live session. Here's the overview:

Whenever data scientists are asked what software they use, R always comes up at the top of the list. In one recent survey, only SQL was rated higher than R. In this webinar we will explore what makes R so popular and useful. Starting with the big picture, we describe how R is organized and how to find your way around the R world. Then we will work through some examples highlighting features of R that make it attractive for data science work, including:

- Acquiring data
- Data manipulation
- Exploratory data analysis
- Model building
- Machine learning

Revolution Analytics webinars: Data Science with R

by Joseph Rickert

The days are getting shorter here in California and the summer R conferences useR! 2014 and JSM are behind us, but there are still some very fine conferences for R users to look forward to before the year ends.

DataWeek starts in San Francisco on September 15th. I will be conducting a bootcamp for new R users, and on Wednesday the 17th Skylar Lyon and David Smith will talk about R in production during an R Case Studies Session.

The same week, halfway around the world, the EARL (Effective Applications of the R Language) Conference starts in London. Ben Goldacre, author of the best sellers *Bad Science* and *Bad Pharma*, will be the keynote speaker. The technical program will include sessions from such R luminaries as Hadley Wickham, Patrick Burns, Matt Dowle, Andrie de Vries, Romain Francois, Tal Galili and more.

On October 5th through 9th, Predictive Analytics World - Healthcare will be held in Boston. Max Kuhn will be conducting a hands-on workshop on R for predictive modeling in Business and Healthcare.

One week later, on October 15th, Strata and Hadoop World will kick off in New York City. The RStudio team will start the conference with an R Day. Hadley Wickham, Winston Chang, Garrett Grolemund, JJ Allaire and Yihui Xie will all be giving presentations. On the 16th, Amanda Cox, the force behind the Times' superb R-based graphics, will be giving a keynote address. Other R-related sessions include a presentation by Sunil Venkayala and Indrajit Roy on HP's Distributed R platform. Revolution Analytics will, once again, be sponsoring this conference. If you go, please drop by the Revolution Analytics booth.

That same week, PAZUR'14, the Polish Academic R Users' Meeting, will be taking place. Workshops will be held on data visualization, the analysis of surveys, the exploration of geospatial data and much more. Revolution Analytics is pleased to be a sponsor here too.

On October 25th, the ACM will once again hold its very popular Data Science Bootcamp at eBay in San Jose. And, once again Revolution Analytics is proud to be a sponsor. I will be attending this event, so please look me up if you want to chat about R.

There is sure to be some R content at ICSM, the International Conference on Statistics and Mathematics, and at the Workshop on Bayesian Modeling, to be held November 24th through 26th in Surabaya, Indonesia.

Also, sometime in November, rOpenSci will be "bringing together a mix of developers and academics to hack on/create tools in the open data space using and for R". A date hasn't been set yet, but we will let you know when things are finalized. Here is the link to the Spring hackathon that took place in San Francisco, which was pretty impressive.

If I have missed anything, please let us know!

DataScience.LA has posted a great recap of the latest LA R meetup, which in turn was a recap of presentations from the useR! 2014 conference. Follow that link to review slides from the event, with summaries of useR! 2014 related to R and Python; Finance; dplyr; R books; SalesForce and R AnalyticFlow.

DataScience.LA has also posted more videos from useR! 2014, including Joe Cheng's demonstration of Shiny, Matthew Dowle's presentation on data.table and interviews with Heather Turner and Yihui Xie. Go check 'em out!

by Joseph Rickert

The Joint Statistical Meetings (JSM) get underway this weekend in Boston and Revolution Analytics is again proud to be a sponsor. More than 6,000 statisticians and data scientists from around the world are expected to attend and listen to thousands of presentations. It is true that many talks will be on specialized topics that only statisticians working in a particular field will have the interest and patience to sit through. However, there is evidence that the conference will have something exciting to offer data scientists and statisticians working in industry. Keyword searches yield 79 presentations for Big Data, 29 on Machine Learning, 17 on Data Science, 17 on Data Mining and 19 related to R. There is more than enough here to fill a data scientist’s dance card.

Three must-see presentations under the Big Data keyword are: Michael Franklin's presentation on Analyzing Data at Scale with the Berkeley Data Analytics Stack; Hui Jiang et al. on Implementation of Statistical Algorithms in Big Data Platforms and Tim Hesterberg's talk on Simulation-Based Methods in Statistics Education, and Google Tools. Under the Data Science label, Bill Ruh’s invited talk Industrial Internet, an Opportunity for Statisticians to Become Data Scientists looks most inviting. There are also quite a few Data Science talks that indicate some soul-searching within the academic community as to how the statistics curriculum ought to be changed. See, for example, Michael Rappa’s talk on Data Scientists: How Do We Prepare for the Future? and Johanna Hardin’s talk: Data Science and Statistics: How Should They Fit into Our Curriculum?

Here is the list of R related presentations:

**Saturday, August 2**

- 8:00 AM - 12:00 PM: Adaptive Tests of Significance Using R and SAS — Professional Development Continuing Education Course ASA Instructor: Tom O'Gorman

**Sunday, August 3**

- 8:30 AM - 5:00 PM: Adaptive Methods in Modern Clinical Trials — Professional Development Continuing Education Course ASA, Biometrics Section Instructors: Frank Bretz, Byron Jones, and Guosheng Yin
- 4:20 PM: Glassbox: An R Package for Visualizing Algorithmic Models: Max Ghenis and Ben Ogorek and Estevan Flores
- 4:45 PM: Bayesian Enrollment and Event Predictions in Clinical Trials Leveraging Literature Data: Aijun Gao and Fanni Natanegara and Govinda Weerakkody

**Monday, August 4**

- 8:55 AM: Thinking with Data in the Second Course: Nicholas J. Horton and Ben S. Baumer and Hadley Wickham
- 8:30 AM to 10:20 AM: Do You See What I See? Formal Usability Testing and Statistical Graphics: Marie C. Vendettuoli and Matthew Williams and Susan Ruth VanderPlas
- 8:35 AM: Preparing Students for Big Data Using R and RStudio: Randall Pruim
- 8:35 AM: Does R Provide What Customers Need?: Vipin Arora
- 8:55 AM: Doing Reproducible Research Unconsciously: Higher Standard, but Less Work: Yihui Xie
- 12:30 PM to 1:50 PM: Analyzing Umpire Performance Using PITCHf/x: Andrew Swift
- 3:30 PM: The Perfect Bracket: Machine Learning in NCAA Basketball: Sara Stoudt and Loren Santana and Ben S. Baumer

**Tuesday, August 5**

- 10:35 AM: Tools for Teaching R and Statistics Using Games: Brad Luen and Michael Higgins
- 2:00 PM: Multiple Treatment Groups: A Case Study with Health Care Practice and Policy Implications: Alexandra Hanlon and Karen Hirschman and Beth Ann Griffin and Mary Naylor
- 2:05 PM: glmmplus: An R Package for Messy Longitudinal Data: Ben Ogorek and Caitlin Hogan
- 3:30 PM: Give Me an Old Computer, a Blank DVD, and an Internet Connection and I'll Give You World-Class Analytics: Ty Henkaline

**Wednesday, August 6**

- 9:35 AM: Testing Packages for the R Language: Stephen Kaluzny and Lou Bajuk-Yorgan
- 9:50 AM: Using R Analytics on Streaming Data: Lou Bajuk-Yorgan and Stephen Kaluzny
- 10:35 AM: Shiny: Easy Web Applications in R: Joseph Cheng
- 10:30 AM to 12:20 PM: Classroom Demonstrations of Big Data: Eric A. Suess
- 11:00 AM: ggvis: Moving Toward a Grammar of Interactive Graphics: Hadley Wickham
- 3:05 PM: Accessing Data from the Census Bureau API: Alex Shum and Heike Hofmann

**Thursday, August 7**

- 9:20 AM: Predicting Dangerous E. Coli Levels at Erie, Pennsylvania, Beaches with Random Forests in R: Michael Rutter
- 9:25 AM: Beyond the Black Box: Flexible Programming of Hierarchical Modeling Algorithms for BUGS-Compatible Models Using NIMBLE: Perry de Valpine and Daniel Turek and Christopher J. Paciorek and Rastislav Bodik and Duncan Temple Lang

If you are going to JSM please come by booth #303 to say hello. You may also find the mobile apps (Apple or Android) that Revolution Analytics is sponsoring useful, and don't forget to fill out the survey for a chance to win an Apple TV.

Finally, I will be the program chair for Session 401, Monte Carlo Methods, to be held Tuesday, 8/5/2014, from 2:00 PM to 3:50 PM in room CC-101. If you are interested in simulation, be sure to drop in. I have seen the presentations and think they are well worth attending.

I was honoured to be invited earlier this month to the Directions in Statistical Computing (DSC) meeting in Brixen, Italy. DSC is one of two meetings run by the R Project and, unlike the useR! conference, is a much smaller and more intimate affair (DSC 2014 had about 30 participants). If you haven't come across the DSC meeting before (quite possible, given that it had last been held in 2009), R Core Group member Martyn Plummer has a nice overview of DSC.

A focus of the first day of the conference was the performance of the R computation engine. The organizers invited representatives from all of the "alternative" R engine implementations, and I believe it marked the first time that developers involved with pqR, Renjin, FastR, Riposte and TERR were gathered in the same place. (The CXXR project was unfortunately not represented.) Jan Vitek [slides] presented a fascinating comparison of the various projects, based on his interviews with the developers.

It was interesting to see the commonalities in many of the approaches. Three projects, Renjin [slides], FastR [slides] and Riposte [slides], use just-in-time compilation and an optimized bytecode engine. All have achieved impressive performance gains, but have struggled with compatibility (and especially with being able to run the 6000+ CRAN packages). But it's clear that their work is having an influence on R itself: Thomas Kalibera [slides] (who previously worked on the FastR project) is working with Luke Tierney and Jan Vitek to improve the performance of R's bytecode interpreter.
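For readers who haven't poked at this layer of R before, base R already ships with the bytecode compiler this work builds on. A minimal illustration (a sketch, using the standard compiler package):

```r
library(compiler)

# an interpreter-unfriendly function: a tight scalar loop
f  <- function(n) { s <- 0; for (i in seq_len(n)) s <- s + i; s }
fc <- cmpfun(f)          # compile f to bytecode

system.time(f(1e7))      # interpreted
system.time(fc(1e7))     # bytecode: typically noticeably faster on R of this era
```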

Other approaches are also being pursued to improve the performance of the R engine. Luke Tierney [slides] described new improvements in R 3.1 to streamline the reference counting system, and noted that several of the performance improvements implemented by Radford Neal [slides] in pqR have already been incorporated into the R engine. And Helena Kotthaus [slides] has done some very exciting work profiling the performance of the R engine, which has already led to performance improvements when virtual memory is being used.

Overall, it was exciting to see collaboration and research into R as a language, and especially the attention from the computer science community to the implementation of R. As Robert Gentleman (co-creator of R and conference lead) noted, R now has a new community beyond statisticians and data scientists: computer scientists. It's exciting to see how R is incorporating learning and innovation from this new community.

For more on DSC 2014, see the reports from Martyn Plummer on Day 1 and Day 2 of the conference. The full program, with links to download the slide presentation, is at the link below.

DSC 2014: Schedule (and slide downloads)

[Reposting to update with the new date for the webinar: Tuesday July 29.]

Just a quick heads-up that I'll be presenting with Neera Talbert (VP Professional Services, Revolution Analytics) in a free webinar on **Tuesday, July 29** on Applications in R: Success and Lessons Learned from the Marketplace. I'll describe several R applications from well-known companies (some of which can be seen in the presentation I gave at the China R User Conference), and Neera will present a few case studies of how the Revolution Analytics consulting group has helped companies using R in areas such as supply chain analytics, sensor data analysis, and R package validation and certification. Here's the abstract for the webinar:

Applications in R - Success and Lessons Learned from the Marketplace

Adoption of the R language has grown rapidly in the last few years, and R is ranked as the number-one data science language in several surveys. This accelerating R adoption curve has been driven by the Big Data revolution, and the fact that so many data scientists — having learned R at university — are actively unlocking the secrets hidden in these new, vast data troves.

In this webinar David Smith, Chief Community Officer, will take a look at the growth of R and the innovative uses of R in business, government and non-profit sectors. Then Neera Talbert, Vice President of Professional Services, will take you into the trenches of recent customer deployments and share best practices and pitfalls to avoid in deploying or expanding your own R applications.

You can sign up for the webinar (with live Q&A with me and Neera) at the link below, which will also automatically send a link to the slides and replay when they're available after the live presentation.

Revolution Analytics webinars: Applications in R: Success and Lessons Learned from the Marketplace (10AM PDT, July 29)

by Joseph Rickert

John Chambers opened UseR! 2014 by describing how the R language grew out of early efforts to give statisticians easier access to high quality statistical software. In 1976, computational statistics was a very active field, but most algorithms were compiled as Fortran subroutines. Building models with this software was not a trivial process. First you had to write a main Fortran program to implement the model and call the right subroutines, and then you had to write the job control language code to submit your job and get it executed. When John and his Bell Labs colleagues sat down on that May afternoon to work on what would become the first implementation of the S language, they were thinking about how they could make this process easier. The top half of John’s famous diagram from that afternoon schematically indicates their intention to design a software interface so that one could call an arbitrary Fortran subroutine, ABC, by wrapping it in some simplified calling syntax: XABC( ).

The main idea was to bring the best computational facilities to the people doing the analysis. As John phrased it: “combine serious computational challenges with convenience”. In the end, the designers of both S and its second incarnation, R, did much better than convenience. They built a tool to facilitate “flow”. When you are engaged in any mentally challenging work (including statistical analysis) at a high level of play, you want to be able to stay in the zone and not get knocked out by peripheral tasks that interrupt your thought processes. As engaging and meaningful as it is in its own right, writing code is not doing statistics. One of the big advantages of working with R is that you can do quite a bit of statistics with just a handful of functions and the simplest syntax. R is a tool that helps you keep moving forward. If you want to see something, then plot it. If the data is in the wrong format, then mutate it.
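A toy illustration of that rhythm, one function call per thought (using the built-in pressure data set and dplyr's mutate(), mentioned earlier in this post):

```r
library(dplyr)

plot(pressure)                                      # want to see it? plot it
pressure %>% mutate(log_pressure = log(pressure))   # wrong format? mutate it
```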

A second idea that flows from the idea of S as an interface is that S was not intended to be self sufficient. John was explicit that S was designed as an interface to the “best algorithms”, not as a “from the ground up programming language”. The idea of being able to make use of external computational resources is still compelling. There will always be high-quality stuff that we will want to get at. Moreover, as John elaborated: “unlike 38 years ago there are many possible interfaces to languages, to other computing models and to (specialized) hardware”. The challenge is to interface to applications that are “too diverse for one solution to fit them all”, and to do this “without losing the R that works in ‘ordinary’ circumstances”.

John offered three examples of R projects that extend the reach of R to leverage other computing environments.

- Rcpp - turns C++ code into an R function by generating an interface to C++ with much less programming effort than .Call (see the sketch after this list)
- RLLVM - enables compiling R language code into specialized forms for efficiency and other purposes
- H2O - provides a compressed, efficient external version of a data frame for running statistical models on large data sets.
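As a quick taste of the Rcpp workflow (a minimal sketch, not from John's talk): a few lines of C++ become a callable R function, with none of the boilerplate that .Call requires.

```r
library(Rcpp)

# Compile an inline C++ function and expose it to R under the same name
cppFunction('
  double sumSquares(NumericVector x) {
    double total = 0;
    for (int i = 0; i < x.size(); i++) total += x[i] * x[i];
    return total;
  }
')

sumSquares(c(1, 2, 3))   # 14
```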

These examples, chosen to represent each of the three different kinds of interface targets that John called out, also represent projects of different scope and levels of integration. With a total of 226 reverse depends and reverse imports, Rcpp is already a great success. It is likely that ready access to C++ will form a permanent part of the R programmer's mindset.

RLLVM is a much more radical and ambitious project that would allow R to be the window to entirely different computing models. As best I understand it, the central idea is to use the R environment as the system interface to “any number of new languages”, perhaps languages that have not yet been invented. RLLVM would “Use R syntax for commands to be interpreted in a different interpreter”. RLLVM seems to be a powerful idea and a direct generalization of the original XABC() idea.

The RH2O package is an example of providing R users with transparent access to data sets that are too large to fit into memory. It is one of many efforts underway (including those from Revolution Analytics) to integrate Hadoop, Teradata, Spark and other specialized computing platforms within the R environment. Some of these specialized platforms may indeed be long-lived, but it is not likely that all of them will be. From the point of view of doing statistics, it is the R interface that is likely to survive and persist; platforms will come and go.

An implication of the willingness of R developers to embrace diversity is that R is likely to always be a work in progress. There will be loose ends, annoying inconsistencies and unimplemented possibilities. I suppose that there are people who will never be comfortable with this state of affairs. It is not unreasonable to prefer a system where there is one best way to do something and where, within the bounds of some pre-established design, there is near perfect consistency. However, the pursuit of uniformity and consistency seems to me to doom designers to be at least one step behind, because it means continually starting over to get things right.

So what does this say about the future of R? John closed his talk by stating that “the best future would be one of variety, not uniformity”. I take this to mean that, for the near future anyway, whatever the next big thing is, it is likely that someone will write an R package to talk to it.

Some links regarding S and R History:

- John Chambers useR! 2006 slides
- Trevor Hastie's Interview with John Chambers
- Ross Ihaka: R: Past and Future History
- New York Times Article

UseR! 2014, the R user conference held last week in LA, was the most successful yet. Around 700 R users from around the world converged on the UCLA campus to share their experiences with the R language and to socialize with other data scientists, statisticians and others using R.

The week began with a series of 3-hour tutorials on topics as diverse as data management, visualization, statistics and biostatistics, programming, and interactive applications. (Joe Rickert reported on the tutorials from the field last week.) Unlike in previous years, the tutorial day was included in the registration, which meant that all of the sessions were jam-packed with interested R users.

The remaining 2-and-a-half days were packed with keynotes, contributed talks, and social sessions galore. Given the parallel nature of the tracks I couldn't make it to more than a fraction of the talks, but here are a few of the highlights from my notes:

- In the opening keynote, John Chambers shared the story of the genesis of the S language (which later begat R — see more in this 2013 interview). The three key principles behind the S language were **objects** (everything is an object), **functions** (everything that happens is a function call), and **interfaces** (the language is an interface to other algorithms). In fact, the very first sketch of the language (from a slide made in 1976, shown below) described it as an "Algorithm Interface".

- Jeroen Ooms demonstrated OpenCPU, a Web API for scientific computing. OpenCPU makes it possible to integrate R into web-based apps for non-R users (see the sketch after this list). You can see some examples of OpenCPU in action in the App Gallery, including the Stockplot app that Jeroen demonstrated.

- Karthik Ram talked about the ROpenSci project and fostering open science with R. The ROpenSci team has created dozens of R packages, including interfaces to public data sources, data visualization tools, and support for reproducible research.

- Google has more than 1000 R users on its internal R mailing list, according to Tim Hesterberg in Google's sponsor talk. (My sponsor talk for Revolution Analytics can be found here.)

- The heR panel discussion and mixer, which facilitated an excellent conversation about women in data science and the R community. (The useR! conference itself was around 25% women — certainly room for improvement, but better than many math or computer science conferences.)

- Thomas Fuchs from NASA/JPL, who revealed in two talks that R is used for vision analysis in space exploration (including the Mars HiRISE mission and deep-space astronomy). As a NASA buff this was a thrill for me to learn, and I hope to write more about it sometime.

- The conference banquet under the summer twilight on the UCLA lawns, featuring entertaining anecdotes from David McArthur.
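Circling back to Jeroen's OpenCPU demo: here is a hedged sketch of the core idea, calling an R function on the public OpenCPU demo server over HTTP from R itself (via the httr package; the endpoint pattern follows the OpenCPU API documentation):

```r
library(httr)

# POST to /ocpu/library/{package}/R/{function}; the body supplies arguments
res <- POST("https://cloud.opencpu.org/ocpu/library/stats/R/rnorm",
            body = list(n = 5))
cat(content(res, "text"))   # server returns paths to the outputs it created
```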

For more on the conference, the slides from many of the talks and tutorials are available at the useR! website. (If you presented there, please submit your slides via a pull request.) Also check out these reviews of the conference from Daniel Gutierrez at InsideBigData and Phyllis Zimbler Miller.

The next conference, useR! 2015, will be held in Aalborg, Denmark. I'm looking forward to it already!