Many thanks to everyone who attended Tuesday's webinar, Applications in R - Success and Lessons Learned from the Marketplace. We had a great turnout and a very lively Q&A session. I've already shared many of the slides describing how companies like Google, Facebook and the New York Times use R, but this is the first time the presentation has been recorded: you can watch it in the embedded video below. Revolution Analytics' VP of Professional Services Neera Talbert also joined the presentation to describe some of the lessons we've learned helping companies implement R in production environments.

You can download the slides (including links to all of the referenced applications) from the link below.

Revolution Analytics webinars: Applications in R - Success and Lessons Learned from the Marketplace

[Reposting to update with the new date for the webinar: Tuesday July 29.]

Just a quick heads-up that I'll be presenting with Neera Talbert (VP Professional Services, Revolution Analytics) in a free webinar on **Tuesday, July 29** on Applications in R: Success and Lessons Learned from the Marketplace. I'll describe several R applications from well-known companies (some of which can be seen in the presentation I gave at the China R User Conference), and Neera will present a few case studies of how the Revolution Analytics consulting group has helped companies using R in areas such as supply chain analytics, sensor data analysis, and R package validation and certification. Here's the abstract for the webinar:

Applications in R - Success and Lessons Learned from the Marketplace

Adoption of the R language has grown rapidly in the last few years, and R is ranked as the number-one data science language in several surveys. This accelerating R adoption curve has been driven by the Big Data revolution, and the fact that so many data scientists — having learned R at university — are actively unlocking the secrets hidden in these new, vast data troves.

In this webinar David Smith, Chief Community Officer, will take a look at the growth of R and the innovative uses of R in business, government and non-profit sectors. Then Neera Talbert, Vice President, Professional Services will take you into the trenches of recent customer deployments and share best practices and pitfalls to avoid in deploying or expanding your own R applications.

You can sign up for the webinar (with live Q&A with me and Neera) at the link below, which will also automatically send a link to the slides and replay when they're available after the live presentation.

Revolution Analytics webinars: Applications in R: Success and Lessons Learned from the Marketplace (10AM PDT, July 29)

Revolution Analytics, founded in 2007, was the first company devoted to the R project. Since then, we've been behind several R initiatives, including the RHadoop project and the network of R user groups around the world. I gave this short presentation today at the useR! 2014 conference in Los Angeles with some of the highlights from Revolution Analytics from 2007-2014.

Slideshare: Revolution Analytics: a 5-minute history

by Joseph Rickert

useR! 2014 is just about two weeks away, and I am very much looking forward to meeting R users from around the world. This is just a great time to catch up with old friends, hopefully make some new friends, and talk about R and R user groups. The number of R user groups continues to increase. Over the past six months, new groups have formed in Chennai (India), Exeter (UK), Miami (FL), Durham (NH), Albany (NY) and Charlotte (NC). There are now 141 user groups listed in Revolutions User Group Directory.

Forty-nine groups are clustered in Europe.

At Revolution Analytics we are always looking for new ways to support R user groups and to do what we can to spot trends and highlight ways that R users help each other and share information. For example, many R user groups post the slides from talks and presentations on their websites. Taken together, these make a considerable source of reference material with respect to what's hot in R. Some groups are beginning to use their sites to present training material and share code. For example, have a look at the Resources page on the NH UseRs site, and the GitHub code on "*EXTRACTION DE DONNÉES SUR LE WEB*" posted by R Addicts Paris.

If you are going to UCLA for useR! please stop by the Revolution Analytics table to chat. (I'll be the guy in the hat.) We would very much like to hear your ideas about R user groups and what more we may be able to do to help.

"A growing body of evidence indicates that the most meaningful way to access predictive analytics and enhance the reputation of Data Science is through open source analytics, which greatly hinges upon the free open source programming language R", according to Dataversity in the recent article "The Relevance of Open Source (Advanced) Analytics". The article also includes several business use cases for R. I was also interviewed for the article, and when asked why companies should invest in R as a data science platform, this was my reply:

“Investing in R, whether from the point of view of an individual Data Scientist or a company as a whole is always going to pay off because R is always available. If you’ve got a Data Scientist new to an organization, you can always use R. If you’re a company and you’re putting your practice on R, R is always going to be available. And, there’s also an ecosystem of companies built up around R including Revolution Enterprise to help organizations implement R into their machine critical production processes.”

You can read the entire article at the link below.

Dataversity: The Relevance of Open Source (Advanced) Analytics

Many companies are considering switching from SAS to R for statistical data analysis, and may be wondering how R compares in performance and data size scalability to the legacy SAS systems (base SAS and SAS/Stat) they are currently using. Performance and scalability for R is exactly what Revolution R Enterprise (RRE) was designed for. In a recent webinar, Thomas Dinsmore described a benchmarking process to compare performance of legacy SAS and RRE. (The benchmarking process is described in the white paper *Revolution R Enterprise: Faster Than SAS*, and you can see the code behind the benchmarking process here.) In the webinar, Thomas revealed the following results:

- RRE ran the tasks forty-two times faster than legacy SAS on the larger data set
- RRE outperformed legacy SAS on every task
- The RRE performance advantage ranged from 10X to 300X
- The RRE advantage increased when we tested on larger data sets
- SAS’ new HP PROC, where available, only marginally improved SAS performance

Also in the webinar, John Wallace, founder and CEO of DataSong, described how performance and scalability requirements led to the selection in 2011 of Revolution R Enterprise as the analytics engine in their software-as-a-service platform. DataSong's industry-leading marketing analytics system currently analyzes more than $3 billion in marketing spend by major retailers.

The slides from the webinar are embedded above, and you can watch and download the full webinar at the link below.

Revolution Analytics webinars: Is Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed

by Joseph Rickert

In last week’s post, I sketched out the history of Generalized Linear Models and their implementations. In this post I’ll attempt to outline how GLM functions evolved in R to handle large data sets.

The first function to make it possible to build GLM models with datasets that are too big to fit into memory was bigglm() from Thomas Lumley's biglm package, which was released to CRAN in May 2006. bigglm() is an example of an external-memory or "chunking" algorithm. This means that data is read from some source on disk and processed one chunk at a time. Conceptually, chunking algorithms work as follows: a program reads a chunk of data into memory, performs intermediate calculations to compute the required sufficient statistics, saves the results and reads the next chunk. The process continues until the entire dataset is processed. Then, if necessary, the intermediate results are assembled into a final result.
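The chunking idea is easy to sketch in a few lines of base R. The illustration below uses ordinary least squares, whose sufficient statistics are just X'X and X'y, rather than a full GLM (which needs several passes of iterative reweighting); the simulated data and chunk size are made up for the example:

```r
# Simplified chunking sketch: accumulate X'X and X'y one chunk at a time,
# then assemble the final coefficients. Only one chunk is "in memory"
# inside the loop at any given time.
set.seed(42)
n <- 1000
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)

chunk_size <- 100
xtx <- matrix(0, 2, 2)   # running X'X
xty <- matrix(0, 2, 1)   # running X'y

for (start in seq(1, n, by = chunk_size)) {
  idx <- start:min(start + chunk_size - 1, n)
  X <- cbind(1, x[idx])               # design matrix for this chunk only
  xtx <- xtx + crossprod(X)           # accumulate sufficient statistics
  xty <- xty + crossprod(X, y[idx])
}

beta_chunked <- solve(xtx, xty)       # assemble the final result
beta_full <- coef(lm(y ~ x))          # same fit, with all data in memory
all.equal(as.numeric(beta_chunked), as.numeric(beta_full))
```

The chunked and in-memory coefficients agree; for a GLM, bigglm() applies the same pattern repeatedly, making one pass over the chunks per iteration of the fit.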

According to the documentation trail, bigglm() is based on Alan Miller's 1991 refinement (algorithm AS 274, implemented in Fortran 77) of W. Morven Gentleman's 1975 Algol algorithm (AS 75). Both of these algorithms work by updating the Cholesky decomposition of the design matrix with new observations. For a model with p variables, only the p x p triangular Cholesky factor and a new row of data need to be in memory at any given time.

bigglm() does not do the chunking for you. Working with the algorithm requires figuring out how to feed it chunks of data from a file or a database that are small enough to fit into memory with enough room left for processing. (Have a look at the make.data() function defined on page 4 of the biglm pdf for the prototype example of chunking by passing a function to bigglm()'s data argument.) bigglm() and the biglm package offer few features for working with data. For example, bigglm() can handle factors but it assumes that the factor levels are consistent across all chunks. This is very reasonable under the assumption that the appropriate place to clean and prepare the data for analysis is the underlying database.
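As an illustration of that data-function pattern, here is a sketch modeled on the make.data() prototype from the biglm documentation. bigglm() calls the supplied function with reset = TRUE to rewind to the start of the data, then repeatedly with reset = FALSE, expecting a data frame per call and NULL when the data is exhausted:

```r
# Build a reader closure over a CSV file that returns one chunk per call.
make_chunk_reader <- function(filename, chunksize = 5000) {
  conn <- NULL
  col_names <- NULL
  function(reset = FALSE) {
    if (reset) {
      if (!is.null(conn)) close(conn)
      conn <<- file(filename, open = "r")
      # Read the header once so each chunk can reuse the column names.
      col_names <<- gsub('"', "", strsplit(readLines(conn, n = 1), ",")[[1]])
      invisible(NULL)
    } else {
      chunk <- tryCatch(
        read.csv(conn, header = FALSE, col.names = col_names,
                 nrows = chunksize),
        error = function(e) NULL)   # no lines left: signal end of data
      if (is.null(chunk) || nrow(chunk) == 0) NULL else chunk
    }
  }
}

# Hypothetical usage (the file name, columns and formula are placeholders,
# not a real dataset):
# fit <- biglm::bigglm(delayed ~ distance + dayofweek,
#                      data = make_chunk_reader("flights.csv"),
#                      family = binomial())
```

The reader itself is plain base R; only the final (commented) call requires the biglm package.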

The next step in the evolution of building GLM models with R was the development of memory-mapped data structures, along with the appropriate machinery to feed bigglm() data stored on disk. In late 2007, Daniel Adler et al. released the ff package, which provides data structures that, from R's point of view, make data residing on disk appear as if it were in RAM. The basic idea is that only a chunk (pagesize) of the underlying data file is mapped into memory, and this data can be fed to bigglm(). This strategy really became useful in 2011 when Edwin de Jonge, Jan Wijffels and Jan van der Laan released ffbase, a package of statistical functions designed to exploit ff's data structures. ffbase contains quite a few functions, including some for basic data manipulation such as ffappend() and ffmatch(). For an excellent example of building a bigglm() model with a fairly large data set, have a look at the post from the folks at BNOSAC. This is one of the most useful, hands-on posts with working code for building models with R and large data sets to be found. (It may be a testimony to the power of provocation.)

Not long after ff debuted (June 2008), Michael Kane, John Emerson and Peter Haverty released bigmemory, a package for working with large matrices backed by memory-mapped files. There followed a sequence of packages in the Big Memory Project, including biganalytics, for exploiting the computational possibilities opened up by bigmemory. The bigmemory packages are built on the Boost Interprocess C++ library and were designed to facilitate parallel programming with foreach, snow, Rmpi and multicore, and to enable distributed computing from within R. The biganalytics package contains a wrapper function for bigglm() that enables building GLM models from very large files mapped to big.matrix objects with just a few lines of code.

The initial release in early August 2010 of the RevoScaleR package for Revolution R Enterprise included rxLogit(), a function for building logistic regression models on very massive data sets. rxLogit() was one of the first of RevoScaleR's Parallel External Memory Algorithms (PEMAs). These algorithms are designed specifically for high performance computing with large data sets on a variety of distributed platforms. In June 2012, Revolution Analytics followed up with rxGlm(), a PEMA that implements all of the standard GLM link/family pairs as well as Tweedie models and user-defined link functions. As with all of the PEMAs, scripts including rxGlm() may be run on different platforms just by changing a few lines of code that specify the user's compute context. For example, a statistician could test out a model on a local PC or cluster and then change the compute context to run it directly on a Hadoop cluster.

The only other Big Data GLM implementation accessible through an R package of which I am aware is the h2o.glm() function, part of 0xdata's JVM implementation of machine learning algorithms, which was announced in October 2013. As opposed to the external-memory R implementations described above, H2O functions run in the distributed memory created by the H2O process. Look here for h2o.glm() demo code.

And that's it; I think this brings us up to date with R-based (or R-accessible) functions for running GLMs on large data sets.

by Mike Bowles

*In two previous posts, A Thumbnail History of Ensemble Methods and Ensemble Packages in R, Mike Bowles, a machine learning expert and serial entrepreneur, laid out a brief history of ensemble methods and described a few of the many implementations in R. In this post Mike takes a detailed look at the Random Forests implementation in the RevoScaleR package that ships with Revolution R Enterprise.*

Revolution Analytics' rxDForest() function provides an ideal tool for developing ensemble models on very large data sets. It allows the data scientist to do prototyping on a single-CPU version of the random forest algorithm and to then shift with relative ease to a multi-core version for generating a higher-performance model on an extremely large data set. It's convenient that the single-CPU and multiple-CPU versions operate on the same data, have many of the same input parameters and deliver the same types of performance summaries and analyses. Revolution Analytics is one of a very small number of companies offering a true multi-core version of Random Forests. (I only know of one other.)*

The computationally intensive part of ensemble methods is training binary decision trees, and the computationally intense part of training a binary tree is split-point determination. Binary trees comprise a number of binary decisions of the form *(attributeX < some number)*. The nodes in the tree pose this binary question, and its answer determines whether an example goes to the left or right out of the node. To train the binary tree, every possible split point for every attribute has to be tried in order to pick the best one. It's easy to see why this split-point selection process consumes so much time, particularly on very large data sets. In a standard tree formulation (CART, for example) the number of tests is equal to the number of points in the data set: not just the number of examples (rows), but the number of rows times the number of attributes (columns). This issue has been the subject of research for the last ten or so years.

The Google PLANET paper discusses the sensible idea of approximating the split-point selection process by aggregating points into bins instead of checking every possible value. More recent researchers have developed methods for generating approximate data histograms on streaming data. These methods are well-suited to the map-reduce environment and implemented in the Revolution Analytics version of binary decision trees and Random Forests. Their incorporation makes the computation faster and introduces a “binning” parameter that may be unfamiliar to long-time users of single-CPU versions of random forests.
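To make the binning idea concrete, here is a base-R sketch (not RevoScaleR's actual code) of a histogram-based split search on a single attribute: candidate splits are evaluated only at bin boundaries, using per-bin counts, so the search over a node costs on the order of the number of bins rather than the number of rows. The simulated attribute, response and bin count are all made up for the example:

```r
set.seed(1)
n <- 1e5
x <- rnorm(n)                        # one numeric attribute
y <- as.integer(x + rnorm(n) > 0)    # noisy binary response; true threshold is 0

# Bucket the attribute into equal-width bins.
nbins <- 256
breaks <- seq(min(x), max(x), length.out = nbins + 1)
bin <- findInterval(x, breaks, rightmost.closed = TRUE)

# Per-bin sufficient statistics: row counts and class-1 counts.
n_bin  <- tabulate(bin, nbins)
y1_bin <- vapply(split(y, factor(bin, levels = 1:nbins)), sum, numeric(1))

# Evaluate every bin boundary as a split using cumulative sums only.
left_n   <- cumsum(n_bin)[-nbins]
left_y1  <- cumsum(y1_bin)[-nbins]
right_n  <- n - left_n
right_y1 <- sum(y) - left_y1

# Misclassification error if each side predicts its majority class.
err  <- (pmin(left_y1, left_n - left_y1) +
         pmin(right_y1, right_n - right_y1)) / n
best <- which.min(err)
breaks[best + 1]   # recovered split point lands close to the true threshold
```

Only the per-bin counts are needed to score every candidate boundary, which is what makes the approach a good fit for streaming and map-reduce settings: the bin statistics can be accumulated per data partition and summed.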

The screen shots below show the input and output for running the Revolution Analytics multi-core random forests on AWS. The software is being run through a server version of RStudio. Two CPUs are included in the cluster for building trees. The first screen shot shows the code input for building a predictive model on the UC Irvine data set for red wine taste scores. The code in the figure shows how little change is required for running Revolution Analytics' multi-core version versus using one of the random forest packages available through CRAN.

The next two screen shots show some of the familiar output that’s available. The first plot shows the oob-prediction error as a function of the number of trees in the ensemble. The second plot gives variable importance. Those familiar with the wine taste data set will recognize that alcohol is correctly identified as the most significant feature for predicting wine taste.

\* *Editor's Note: H2O from 0xdata contains a multi-core Random Forest implementation.*

by Joseph Rickert

R/Finance 2014 is just about a week away. Over the past four or five years this has become my favorite conference. It is small (300 people this year), exceptionally well-run, and always offers an eclectic mix of theoretical mathematics, efficient, practical computing, industry best practices and trading "street smarts". This clip of Blair Hull delivering a keynote speech at R/Finance 2012 is an example of the latter. It ought to resonate with anyone who has followed some of the hype surrounding Michael Lewis's recent book Flash Boys.

In any event, I thought it would be a good time to look at the relationship between R and Finance and to highlight some resources that are available to students, quants and data scientists looking to do computational finance with R.

First off, consider what computational finance has done for R. From the point of view of the development and growth of the R language, I think it is pretty clear that computational finance has played the role of the ultimate "Killer App" for R. This high-stakes, competitive environment, where a theoretical edge or a marginal computational advantage can mean big rewards, has led to R package development in several areas including time series, optimization, portfolio analysis, risk management, high performance computing and big data. Additionally, challenges and crises in the financial markets have helped accelerate R's growth into big data. In this podcast, Michael Kane talks about the analysis of the 2010 Flash Crash he did with Casey King and Richard Holowczak and describes using R with large financial datasets.

Conversely, I think that it is also clear that R has done quite a bit to further computational finance. R’s ability to facilitate rapid data analysis and visualization, its great number of available functions and algorithms and the ease with which it can interface to new data sources and other computing environments has made it a flexible tool that evolves and adapts at a pace that matches developments in the financial industry. The list of packages in the Finance Task View on CRAN indicates the symbiotic relationship between the development of R and the needs of those working in computational finance. On the one hand, there are over 70 packages under the headings Finance and Risk Management that were presumably developed to directly respond to a problem in computational finance. But, the task view also mentions that packages in the Econometrics, Multivariate, Optimization, Robust, SocialSciences and TimeSeries task views may also be useful to anyone working in computational finance. (The High Performance Computing and Machine Learning task views should probably also be mentioned.) The point is that while a good bit of R is useful to problems in computational finance, R has greatly benefited from the contributions of the computational finance community.

If you are just getting started with R and computational finance, have a look at John Nolan's R as a Tool in Computational Finance. Other resources for R and computational finance that you may find helpful are:

**Package Vignettes** Several of the Finance-related packages have very informative vignettes or associated websites. For example, have a look at those for the packages portfolio, rugarch, rquantlib (check out the cool rotating distributions), PerformanceAnalytics, and MarkowitzR.

**Data** Quandl has become a major source for financial data, which can be easily accessed from R.

**Websites** Relevant websites include the RMetrics site, The R Trader, Burns Statistics and Guy Yollin's repository of presentations.

**YouTube** Three videos that I found to be particularly interesting are recordings of the presentations “Finance with R” by Ronald Hochreiter, “Using R in Academic Finance” by Sanjiv Das and “Portfolio Construction in R” by Elliot Norma.

**Blogs** Over the past couple of years, RBloggers has posted quite a few finance-related applications. Prominent among these is the series on Quantitative Finance Applications in R by Daniel Harrison on the Revolutions Blog.

**Books** Books on R and Finance include the excellent RMetrics series of ebooks, Statistics and Data Analysis for Financial Engineering by David Ruppert, Financial Risk Modeling and Portfolio Optimization with R by Bernard Pfaff, Introduction to R for Quantitative Finance by Daróczi et al. and a brand new title, Computational Finance: An Introductory Course with R by Agrimiro Arratia.

**Coursera** This August, Eric Zivot will teach the course Introduction to Computational Finance and Financial Econometrics, which will emphasize R.

**The R Journal** The R Journal frequently publishes finance-related papers. The present issue (Volume 5/2, December 2013) contains three relevant papers: “Performance Attribution for Equity Portfolios” by Yang Lu and David Kane; “Temporal Disaggregation of Time Series” by Christoph Sax and Peter Steiner; and “betategarch: Simulation, Estimation and Forecasting of Beta-Skew-t-EGARCH Models” by Genaro Sucarrat.

**Conferences** In addition to R/Finance (Chicago) and useR! 2014 (Los Angeles), look for R-based computational finance expertise at the 8th R/RMetrics Workshop (Paris).

**Community** R-Sig-Finance is one of R's most active special interest groups.

In a new article for FastCoLabs, journalist Tina Amirtha follows up on last month's piece about R's impact on open science. This time, the focus is on how R is used at companies and is displacing legacy statistics software like SAS:

SAS is no match for the open-source language that pioneering data scientists use in academia, which is simply known as R.

Tina interviews R users from companies Facebook and DataSong (as well as me and data scientist Casey Herron from Revolution Analytics) and discovers some cool applications of how R is used:

Facebook, for example, uses a technique called power analysis in order to figure out whether it has collected enough relevant data when it studies how users interact with new features on the site. It is all thanks to research data scientists who have developed the appropriate statistical tools in R and made them available to everyone.
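Power analysis of this kind is straightforward in base R. As a minimal sketch (the effect size, significance level and metric here are hypothetical, not Facebook's actual numbers), here is how many users per condition would be needed to detect a 0.05-standard-deviation lift in an engagement metric with 80% power:

```r
# Required sample size per arm for a two-sample t-test:
# detect a lift of 0.05 SD at a 5% significance level with 80% power.
n_per_arm <- power.t.test(delta = 0.05, sd = 1,
                          sig.level = 0.05, power = 0.80)$n
ceiling(n_per_arm)   # several thousand users per arm
```

If the data collected for a feature study falls short of this sample size, the comparison is underpowered and more data is needed before drawing conclusions.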

The article points out the obvious reasons why R is good for business: it's open-source, it's great for data visualization, and because new research in statistics is done in R, it has far more capabilities (especially new and powerful algorithms) compared to proprietary tools. I wanted to point out one often-overlooked but, I believe, extremely important reason why R is good for businesses: **people.**

“I think the number one value to businesses [in using R] is access to talent,” says Smith. “So many businesses now are doing much more with data, especially with the big data revolution and doing much more with analytics. And because they’re hiring people coming out of school. They know R already.”

The data science talent shortage is a real problem for data-driven businesses, but those companies that have adopted R as their platform have a supply of ready-trained R users graduating from academia (and who likely already know other cutting-edge open-source technologies to boot).

In other R-in-the-media news, Revolution Analytics' new program providing technical support for R was covered in Datanami, Computerworld, insideBigData and ZDnet.

FastCoLabs: Why The R Programming Language Is Good For Business