by Joseph Rickert

We usually have a pretty good time at the monthly Bay Area useR Group (BARUG) meetings, but this month's meeting was a bit more of a party than usual. The very well connected PR team at Sqor Sports, our host company for the evening, secured San Francisco's très trendy 111 Minna Gallery for the venue. There was a full bar, house music for the networking portion of the meeting, gourmet grilled cheese sandwiches compliments of Revolution Analytics and drama — Matt Dowle, one of our speakers, was on a flight from London that was late getting in.

Oh, and yes: there were three very engaging presentations, well worth standing around in the dark for.

First up was Noah Gift, CTO of Sqor, a company with a mission to take sports marketing to a whole new level. They are creating a marketplace for athletes to build and promote their digital brands. Noah described how devilishly difficult it is to gather, clean and prepare the data. Correctly labeling social media data from several sources, generated by different athletes with the same name, poses a number of vexing challenges.

One surprising aspect of the technology Sqor is developing is what they call an Erlang-to-R bridge, which replaces many tasks they formerly accomplished with Python. Noah indicated that they plan to release this code as open source.

Below is a plot from Noah's presentation showing predictions from their R based machine learning algorithms.

Our second speaker was Stephen Elston, who gave a virtuoso live demo of using R on the Microsoft Azure Machine Learning cloud platform. Steve glided between the Azure workflow interface and running R scripts. He showed how to manipulate and transform data in both environments, go back and forth to run models in both Azure and R, and visualize results in R. Slides for Steve's talk are available, as is some R code, on Steve's GitHub site. Studying the scripts will give you an idea of the features he presented.

Finally, just in from London, and still lucid at what would have been 4AM his time, Matt Dowle walked through a summary of new features of data.table v1.9.4 and v1.9.5. There were several data.table users present, and Matt made a few new converts with a series of impressively fast benchmarks against base R. In one demo, Matt showed data.table's forder() taking only 17 seconds to sort 40 million random numerics, a task that took R 7 minutes. According to Matt, the trick for getting this kind of performance is data.table's C-based implementation of radix sorting which works on numeric, character and integer types, with no range restrictions (recall that base::sort.list(...,method="radix") is limited to integers with range < 100,000).
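To get a feel for the difference yourself, here is a minimal, scaled-down sketch of that benchmark (1 million values rather than 40 million, so it runs in seconds; `setorder()` uses the same C radix machinery as `forder()`):

```r
library(data.table)

set.seed(1)
x <- runif(1e6)               # scaled down from the 40-million-row demo
DT <- data.table(x)

system.time(setorder(DT, x))  # data.table's C radix sort (via forder)
system.time(sort(x))          # base R's comparison sort
```

On a full 40-million-row vector the gap grows to the minutes-versus-seconds difference Matt demonstrated.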

data.table's radix sorting, which scales linearly (i.e., below the O(n log n) bound for comparison sorts), is based on two papers: one by Terdiman and the other by Herf. However, where both of those papers work from the least significant digit, data.table works from the most significant digit to improve cache efficiency.
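To make the idea concrete, here is a toy least-significant-digit radix sort in R for non-negative integers (purely illustrative; data.table's real implementation is in C and, as noted, works from the most significant digit first):

```r
# Each pass reorders on one digit; a fixed number of O(n) passes
# replaces the O(n log n) comparisons of a general-purpose sort.
radix_sort <- function(x, base = 10L) {
  max_digits <- if (length(x)) floor(log(max(x, 1), base)) + 1 else 0
  for (d in seq_len(max_digits)) {
    digit <- (x %/% base^(d - 1)) %% base  # d-th digit from the right
    x <- x[order(digit)]                   # stable reorder on that digit
  }
  x
}

radix_sort(c(170L, 45L, 75L, 90L, 2L, 802L, 24L, 66L))
```

The stability of each per-digit pass is what makes earlier passes survive later ones.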

Matt also demonstrated data.table's new automatic indexes (you can now use == in i, and data.table will automatically build a secondary key) as well as using dplyr syntax with data.table. Matt emphasized that this flexibility shows the power of R's object-oriented design. He also claimed that both Python's pandas and R's dplyr made the wrong choice in using hashing; instead, data.table uses fast sorting based on the sort-order vector, which serves as an index in data.table.
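A quick sketch of the auto-indexing feature (assuming a current version of data.table, where `indices()` reports any secondary indexes that have been built):

```r
library(data.table)

set.seed(2)
DT <- data.table(id = sample(letters, 1e5, replace = TRUE),
                 value = rnorm(1e5))

res1 <- DT[id == "q"]  # first == query in i builds a secondary index on id
res2 <- DT[id == "z"]  # later queries reuse it as a fast binary search
indices(DT)            # lists the automatically created index
```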

For more benchmark information, be sure to visit Matt's GitHub site. If you are new to data.table, I recommend starting with Matt's 2014 useR! presentation, which explains some of the ideas underlying data.table as well as providing an introduction.

by Jeremy Reynolds

Senior R Trainer, Revolution Analytics

Last week, Revolution Analytics released its first massive open, online course through a partnership with datacamp.com: Introduction to Revolution R Enterprise for Big Data Analytics. You can sign up for the free course here.

This course provides a look at some of the tools provided by the RevoScaleR package that ships with Revolution R Enterprise. The course and the interactive training framework provided by the platform allow you to get a feel for how you can manipulate, visualize, and analyze large datasets with RevoScaleR.

There are four “chapters” within the course:

- Chapter 1 introduces the RevoScaleR package. Within the chapter, we discuss the challenges associated with big data and how the functions and algorithms in RevoScaleR address them. We walk through an example in which we demonstrate the use of several core RevoScaleR functions, and we provide exercises in which you can use RevoScaleR to create your first linear model on big data.
- Chapter 2 provides details for some of the RevoScaleR functions used to explore large datasets. We demonstrate how you can use these functions to summarize, cross-tabulate, and visualize variables in large datasets, and we provide hands-on exercises where you can practice with them.
- Chapter 3 covers the RevoScaleR functions used to manipulate and transform large datasets. We demonstrate how you can use these functions to perform simple and complex transformations. We provide a set of interactive exercises that allow you to practice creating transformed variables and to explore how chunking algorithms impact the ways in which you need to process data.
- Chapter 4 concentrates on the analysis of large data sets. We demonstrate how to use RevoScaleR functions to build statistical and machine learning models on large data sets. Here, we cover linear and logistic regression, k-means clustering, and decision tree estimation. For each kind of analysis, there are hands-on exercises in which you can explore some of the flexibility and power associated with the functions.
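The Chapter 4 model types all have familiar open-source analogues; as a rough sketch in base R (RevoScaleR's rx* functions provide the corresponding big-data versions, so treat the pairing here as approximate, not course material):

```r
set.seed(3)
fit_lm  <- lm(mpg ~ wt + hp, data = mtcars)                # linear regression
fit_glm <- glm(am ~ wt, data = mtcars, family = binomial)  # logistic regression
km      <- kmeans(scale(mtcars[, c("mpg", "wt")]), centers = 3)  # k-means

library(rpart)
tree <- rpart(am ~ wt + hp, data = mtcars, method = "class")     # decision tree
```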

We are very excited about our partnership with datacamp.com — the platform provides a unique, hands-on training environment in which you can practice in a live R environment, and you can “learn by doing.” We are diligently working to create more extensive content, including courses on the Fundamentals of the R Programming Language, Introductory Statistics with Revolution R Enterprise, Predictive Modeling, and Advanced R Programming. Stay tuned for more!

DataCamp: Introduction to Revolution R Enterprise for Big Data Analytics

On Wednesday next week, I'll be presenting a live webinar to introduce Revolution R Open and several other open source projects from Revolution Analytics. In the webinar I'll describe:

- The enhancements included in Revolution R Open
- The Reproducible R Toolkit and the checkpoint package
- How to call R from other applications with DeployR Open
- How to run R in Hadoop with RHadoop packages
- Parallel programming for R with the ParallelR packages

(Revolution Analytics offers support and open source assurance for all of the above with Revolution R Plus.) I'll also be sharing some of the latest statistics on R's popularity. And of course, there will be a live Q&A session at the end of the webinar where you can ask me questions. If you can't make the live session, a recording and slides will be sent to registered participants. You can register for the webinar, which takes place next Wednesday at 10AM Seattle time, at the link below. I hope to see you there!
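As a taste of the ParallelR approach, here is a minimal foreach sketch (assuming the foreach and doParallel packages are installed; this is an illustration, not material from the webinar):

```r
library(foreach)
library(doParallel)

cl <- makeCluster(2)   # two local worker processes
registerDoParallel(cl)

# run four independent simulations in parallel and combine with rbind
res <- foreach(i = 1:4, .combine = rbind) %dopar% {
  c(run = i, mean = mean(rnorm(1000)))
}

stopCluster(cl)
res
```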

Revolution Analytics Webinars: Introducing Revolution R Open: Enhanced, Open Source R distribution from Revolution Analytics

by Joseph Rickert

The San Francisco Bay Area Chapter of the Association of Computing Machinery (ACM) has been holding an annual Data Mining Camp and "unconference" since 2009. This year, to reflect the times, the group held a *Data Science Camp* and unconference, and we at Revolution Analytics were, once again, very happy to be a sponsor for the event and pleased to be able to participate.

In an ACM unconference, except for prearranged tutorials and the keynote address, there are no scheduled talks. Instead, anyone with the passion to speak gets two minutes to pitch a session. A show of hands determines what flies, the organizers allocate rooms and group talks by theme on the fly, and then off you go. The photo below shows how all of this sorted out on Saturday.

As you might expect, there was a lot of interest in Big Data, NoSQL, NLP etc., but there was also quite a bit of interest in R, enough to fill a large room for two back-to-back sessions. I was very happy to reprise some of the material from a recent webinar I presented on an introduction to Machine Learning and Data Science with R, and Ram Narasimhan (a longtime member of the Bay Area useR Group) gave a high energy and very informative tutorial on the dplyr package that, judging from the audience reaction, inspired quite a few new R programmers.

But the real R highlight came early in the day. Irina Kukuyeva presented a tutorial on *Principal Component Analysis with Applications in R and Python* that was well worth getting up for early Saturday morning. Not only did Irina put together a very nice introduction to PCA, starting with the basic math and illustrating how PCA is used through case studies, but in a laudable effort to be as inclusive as possible, she also took the trouble to write both Python and R code for all of her examples! The following slide shows what PCA looks like in both languages.

This next slide shows what a good bit of statistics looks like in both languages.
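Since the slides themselves aren't reproduced here, a minimal sketch of the R side (not Irina's actual code) using the built-in iris data:

```r
X <- scale(iris[, 1:4])  # center and scale the four measurements
pca <- prcomp(X)         # principal component analysis
summary(pca)             # proportion of variance explained per component
head(pca$x[, 1:2])       # scores on the first two principal components
```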

For more presentations and tutorials by Irina that feature R, have a look at her Tutorial page.

by Joseph Rickert

One of the most interesting R related presentations at last week’s Strata Hadoop World Conference in New York City was the session on Distributed R by Sunil Venkayala and Indrajit Roy, both of HP Labs. In short, Distributed R is an open source project with the end goal of running R code in parallel on data that is distributed across multiple machines. The following figure conveys the general idea.

A master node controls multiple worker nodes each of which runs multiple R processes in parallel.

As I understand it, the primary use case for the Distributed R software is to move data quickly from a database into distributed data structures that can be accessed by multiple, independent R instances for coordinated, parallel computation. The Distributed R infrastructure automatically takes care of the extraction of the data and the coordination of the calculations, including the occasional movement of data from a worker node to the master node when required by the calculation. The user interface to the Distributed R mechanism is through R functions that have been designed and optimized to work with the distributed data structures, and through a special “Distributed R aware” foreach() function that allow users to write their own distributed functions using ordinary R functions.

To make all of this happen, the Distributed R platform contains several components, which may be briefly described as follows:

The distributedR package contains:

- functions to set up the infrastructure for the distributed platform
- distributed data structures that are the analogues of R’s data frames, arrays and lists, and
- the functions foreach() and splits() to let users write their own parallel algorithms.

A really nice feature of the distributed data structures is that they can be populated and accessed by rows, columns and blocks making it possible to write efficient algorithms tuned to the structure of particular data sets. For example, data cleaning for wide data sets (many more columns than rows) can be facilitated by preprocessing individual features.

vRODBC is an ODBC client that provides R with database connectivity. This is the connection mechanism that permits the parallel loading of data from various sources data including HP’s Vertica database.

The HPdata package contains the functions that allow you to actually load distributed data structures from various data sources.

The HPDGLM package implements parallel, distributed GLM models (presently only linear regression, logistic regression and Poisson regression models are available). The package also contains functions for cross validation and split-sample validation.

The HPdclassifier package is intended to contain several distributed classification algorithms. It currently contains a parallel distributed implementation of the random forests algorithm.

The HPdcluster package contains a parallel, distributed kmeans algorithm.

The HPdgraph package is intended to contain distributed algorithms for graph analytics. It currently contains a parallel, distributed implementation of the pagerank algorithm for directed graphs.

The following sample code, taken directly from the HPdclassifier User Guide but modified slightly for presentation here, is similar to the examples that Venkayala and Roy showed in their presentation. Note that after the distributed arrays are set up, they are loaded in parallel with data using the foreach function from the distributedR package.

```r
library(HPdclassifier) # loading the library
## Loading required package: distributedR
## Loading required package: Rcpp
## Loading required package: RInside
## Loading required package: randomForest

distributedR_start() # starting the distributed environment
## Workers registered - 1/1.
## All 1 workers are registered.
## [1] TRUE

ds <- distributedR_status() # status of the environment (assumed; not in the original listing)
nparts <- sum(ds$Inst)      # number of available distributed instances

# Describe the data
nSamples <- 100     # number of samples
nAttributes <- 5    # number of attributes of each sample
nSplits <- 1        # number of splits in each darray

# Create the distributed arrays
dax <- darray(c(nSamples, nAttributes), c(round(nSamples/nSplits), nAttributes))
day <- darray(c(nSamples, 1), c(round(nSamples/nSplits), 1))

# Load the distributed arrays in parallel
foreach(i, 1:npartitions(dax),
  function(x = splits(dax, i), y = splits(day, i), id = i) {
    x <- matrix(runif(nrow(x) * ncol(x)), nrow(x), ncol(x))
    y <- matrix(runif(nrow(y)), nrow(y), 1)
    update(x)
    update(y)
  })

# Fit the random forest model
myrf <- hpdrandomForest(dax, day, nExecutor = nparts)

# Prediction
dp <- predictHPdRF(myrf, dax)
```

Notwithstanding all of its capabilities, Distributed R is still clearly a work in progress. It is only available on Linux platforms, algorithms and data must be resident in memory, and Distributed R is not available on CRAN; even with an excellent Installation Guide, installing the platform is a bit of an involved process.

Nevertheless, Distributed R is impressive, and I think a valuable contribution to open source R. I expect that users with distributed data will find the platform to be a viable way to begin high performance computing with R.

Note that the Distributed R project discussed in this post is an HP initiative and is not in any way related to http://www.revolutionanalytics.com/revolution-r-enterprise-distributedr.

A quick heads up that if you'd like to get a great introduction to doing data science with the R language, Joe Rickert will be giving a free webinar next Thursday, September 25: Data Science with R. Regular readers of the blog will be familiar with Joe's posts on this topic. A few recent examples include posts on comparing machine learning models, predictive models for airline delays, agent-based models, and many more. Register for the live webinar and Q&A with Joe, plus access to the slides and replay after the live session. Here's the overview:

Whenever data scientists are asked about what software they use, R always comes up at the top of the list. In one recent survey, only SQL was rated higher than R. In this webinar we will explore what makes R so popular and useful. Starting with the big picture, we describe how R is organized and how to find your way around the R world. Then we will work through some examples highlighting features of R that make it attractive for data science work, including:

- Acquiring data
- Data manipulation
- Exploratory data analysis
- Model building
- Machine learning

Revolution Analytics webinars: Data Science with R

by Joseph Rickert

The days are getting shorter here in California and the summer R conferences useR! 2014 and JSM are behind us, but there are still some very fine conferences for R users to look forward to before the year ends.

DataWeek starts in San Francisco on September 15th. I will be conducting a bootcamp for new R users, and on Wednesday the 17th, Skylar Lyon and David Smith will talk about R in production during an R Case Studies session.

The same week, halfway around the world, the EARL (Effective Applications of the R Language) Conference starts in London. Ben Goldacre, author of two best sellers Bad Science and Bad Pharma will be the keynote speaker. The technical program will include sessions from such R luminaries as Hadley Wickham, Patrick Burns, Matt Dowle, Andrie de Vries, Romain Francois, Tal Galili and more.

On October 5th through 9th, Predictive Analytics World - Healthcare will be held in Boston. Max Kuhn will be conducting a hands-on workshop on R for predictive modeling in Business and Healthcare.

One week later, on October 15th, Strata and Hadoop World will kick off in New York City. The RStudio team will start the conference with an R Day. Hadley Wickham, Winston Chang, Garrett Grolemund, JJ Allaire and Yihui Xie will all be giving presentations. On the 16th, Amanda Cox, the force behind the Times' superb R-based graphics, will be giving a keynote address. Other R-related sessions include a presentation by Sunil Venkayala and Indrajit Roy on HP's Distributed R platform. Revolution Analytics will, once again, be sponsoring this conference. If you go, please drop by the Revolution Analytics booth.

That same week, PAZUR'14, the Polish Academic R User's Meeting, will be taking place. Workshops will be held on data visualization, the analysis of surveys, the exploration of geospatial data and much more. Revolution Analytics is pleased to be a sponsor here too.

On October 25th, the ACM will once again hold its very popular Data Science Bootcamp at eBay in San Jose. And, once again Revolution Analytics is proud to be a sponsor. I will be attending this event, so please look me up if you want to chat about R.

There is sure to be some R content at ICSM, the International Conference on Statistics and Mathematics, and the Workshop on Bayesian Modeling, to be held November 24th through 26th in Surabaya, Indonesia.

Also, sometime in November, rOpenSci will be "bringing together a mix of developers and academics to hack on/create tools in the open data space using and for R". A date hasn't been set yet, but we will let you know when things are finalized. Here is the link to the Spring hackathon that took place in San Francisco, which was pretty impressive.

If I have missed anything, please let us know!

DataScience.LA has posted a great recap of the latest LA R meetup, which in turn was a recap of presentations from the useR! 2014 conference. Follow that link to review slides from the event, with summaries of useR! 2014 related to R and Python; Finance; dplyr; R books; SalesForce and R AnalyticFlow.

DataScience.LA has also posted more videos from useR! 2014, including Joe Cheng's demonstration of Shiny, Matthew Dowle's presentation on data.table and interviews with Heather Turner and Yihui Xie. Go check 'em out!

by Joseph Rickert

The Joint Statistical Meetings (JSM) get underway this weekend in Boston and Revolution Analytics is again proud to be a sponsor. More than 6,000 statisticians and data scientists from around the world are expected to attend and listen to thousands of presentations. It is true that many talks will be on specialized topics that only statisticians working in a particular field will have the interest and patience to sit through. However, there is evidence that the conference will have something exciting to offer data scientists and statisticians working in industry. Keyword searches yield 79 presentations for Big Data, 29 on Machine Learning, 17 on Data Science, 17 on Data Mining and 19 related to R. There is more than enough here to fill a data scientist's dance card.

Three must-see presentations under the Big Data keyword are: Michael Franklin's presentation on Analyzing Data at Scale with the Berkeley Data Analytics Stack; Hui Jiang et al. on Implementation of Statistical Algorithms in Big Data Platforms and Tim Hesterberg's talk on Simulation-Based Methods in Statistics Education, and Google Tools. Under the Data Science label, Bill Ruh’s invited talk Industrial Internet, an Opportunity for Statisticians to Become Data Scientists looks most inviting. There are also quite a few Data Science talks that indicated some soul searching within the academic community as to how the statistics curriculum ought to be changed. See, for example, Michael Rappa’s talk on Data Scientists: How Do We Prepare for the Future? and Johanna Hardin’s talk: Data Science and Statistics: How Should They Fit into Our Curriculum?

Here is the list of R related presentations:

**Saturday, August 2**

- 8:00 AM - 12:00 PM: Adaptive Tests of Significance Using R and SAS — Professional Development Continuing Education Course ASA Instructor: Tom O'Gorman

**Sunday, August 3**

- 8:30 AM - 5:00 PM: Adaptive Methods in Modern Clinical Trials — Professional Development Continuing Education Course ASA , Biometrics Section Instructors: Frank Bretz, Byron Jones, and Guosheng Yin
- 4:20 PM: Glassbox: An R Package for Visualizing Algorithmic Models: Max Ghenis and Ben Ogorek and Estevan Flores
- 4:45 PM: Bayesian Enrollment and Event Predictions in Clinical Trials Leveraging Literature Data: Aijun Gao and Fanni Natanegara and Govinda Weerakkody

**Monday, August 4**

- 8:55 AM: Thinking with Data in the Second Course: Nicholas J. Horton and Ben S. Baumer and Hadley Wickham
- 8:30 AM to 10:20 AM: Do You See What I See? Formal Usability Testing and Statistical Graphics: Marie C. Vendettuoli and Matthew Williams and Susan Ruth VanderPlas
- 8:35 AM: Preparing Students for Big Data Using R and Rstudio: Randall Pruim
- 8:35 AM: Does R Provide What Customer Need?: Vipin Arora
- 8:55 AM: Doing Reproducible Research Unconsciously: Higher Standard, but Less Work: Yihui Xie
- 12:30 PM to 1:50 PM: Analyzing Umpire Performance Using PITCHf/x: Andrew Swift
- 3:30 PM: The Perfect Bracket: Machine Learning in NCAA Basketball: Sara Stoudt and Loren Santana and Ben S. Baumer

**Tuesday, August 5**

- 10:35 AM: Tools for Teaching R and Statistics Using Games: Brad Luen and Michael Higgins
- 2:00 PM: Multiple Treatment Groups: A Case Study with Health Care Practice and Policy Implications: Alexandra Hanlon and Karen Hirschman and Beth Ann Griffin and Mary Naylor
- 2:05 PM: glmmplus: An R Package for Messy Longitudinal Data: Ben Ogorek and Caitlin Hogan
- 3:30 PM: Give Me an Old Computer, a Blank DVD, and an Internet Connection and I'll Give You World-Class Analytics: Ty Henkaline

**Wednesday, August 6**

- 9:35 AM: Testing Packages for the R Language: Stephen Kaluzny and Lou Bajuk-Yorgan
- 9:50 AM: Using R Analytics on Streaming Data: Lou Bajuk-Yorgan and Stephen Kaluzny
- 10:35 AM: Shiny: Easy Web Applications in R: Joseph Cheng
- 10:30 AM to 12:20 PM: Classroom Demonstrations of Big Data: Eric A. Suess
- 11:00 AM: ggvis: Moving Toward a Grammar of Interactive Graphics: Hadley Wickham
- 3:05 PM: Accessing Data from the Census Bureau API: Alex Shum and Heike Hofmann

**Thursday, August 7**

- 9:20 AM: Predicting Dangerous E. Coli Levels at Erie, Pennsylvania, Beaches with Random Forests in R: Michael Rutter
- 9:25 AM: Beyond the Black Box: Flexible Programming of Hierarchical Modeling Algorithms for BUGS-Compatible Models Using NIMBLE: Perry de Valpine and Daniel Turek and Christopher J. Paciorek and Rastislav Bodik and Duncan Temple Lang

If you are going to JSM please come by booth #303 to say hello. You may also find the mobile apps (Apple or Android) that Revolution Analytics is sponsoring useful, and don't forget to fill out the survey for a chance to win an Apple TV.

Finally, I will be the program chair for Session 401, Monte Carlo Methods to be held Tuesday, 8/5/2014, from 2:00 PM to 3:50 PM in room CC-101. If you are interested in simulation be sure to drop in. I have seen the presentations and think they are well worth attending.

I was honoured to be invited earlier this month to the Directions in Statistical Computing (DSC) meeting in Brixen, Italy. DSC is one of two meetings run by the R Project, and unlike the useR! conference, DSC is a much smaller, more intimate meeting (DSC 2014 had about 30 participants). If you haven't come across the DSC meeting before (quite possible, given that it had last been held in 2009), R Core Group member Martyn Plummer has a nice overview of DSC.

A focus of the first day of the conference was the performance of the R computation engine. The organizers invited representatives from all of the "alternative" R engine implementations, and I believe it marked the first time that developers involved with pqR, Renjin, FastR, Riposte and TERR were gathered in the same place. (The CXXR project was unfortunately not represented.) Jan Vitek [slides] presented a fascinating comparison of the various projects, based on his interviews with the developers.

It was interesting to see the commonalities in many of the approaches. Three projects, Renjin [slides], FastR [slides] and Riposte [slides] use just-in-time compilation and an optimized bytecode engine. All have achieved impressive performance gains, but have struggled with compatibility (and especially being able to run the 6000+ CRAN packages). But it's clear that their work is having an influence on R itself: Thomas Kalibera [slides] (who previously worked on the FastR project) is working with Luke Tierney and Jan Vitek to improve the performance of R's bytecode interpreter.
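The bytecode engine these projects are improving is the same one users can reach today through the base compiler package; a minimal sketch (note that recent versions of R byte-compile functions automatically on first use, so the explicit cmpfun() call is mostly illustrative):

```r
library(compiler)

# an interpreted loop of the kind that benefits from byte-compilation
f <- function(n) {
  s <- 0
  for (i in seq_len(n)) s <- s + i
  s
}

fc <- cmpfun(f)    # explicitly byte-compile the function
f(1e4) == fc(1e4)  # same answer either way
```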

Other approaches are also being pursued to improve the performance of the R engine. Luke Tierney [slides] described new improvements in R 3.1 to streamline the reference counting system, and noted that several of the performance improvements implemented by Radford Neal [slides] in pqR have already been incorporated into the R engine. And Helena Kotthaus [slides] has done some very exciting work to profile the performance of the R engine which has already led to performance improvements when virtual memory is being used.

Overall, it was exciting to see collaboration and research into R as a language, and especially the attention from the computer science community to the implementation of R. As Robert Gentleman (co-creator of R and conference lead) noted, R now has a new community beyond statisticians and data scientists: computer scientists. It's exciting to see how R is incorporating learning and innovation from this new community.

For more on DSC 2014, see the reports from Martyn Plummer on Day 1 and Day 2 of the conference. The full program, with links to download the slide presentation, is at the link below.

DSC 2014: Schedule (and slide downloads)