The 14th annual KDnuggets poll measuring use of analytics software is open for voting. The poll asks, "What Predictive Analytics, Big Data, Data mining, Data Science software you used in the past 12 months for a real project?" and allows up to 20 choices from commercial software, open source software, and "big data" software. R was the leading choice in 2012, and currently leads the pack in the voting for 2013.
Even a casual glance at the R Community Calendar shows an impressive amount of R user group activity throughout the world: 45 events in April and 31 scheduled so far for May. New groups formed last month in Knoxville, Tennessee (The Knoxville R User Group: KRUG) and Sheffield in the UK (The Sheffield R Users). And this activity seems to be cumulative. This month, the Bay Area R User’s Group (BARUG) expects to hold its 52nd and 53rd meetups, while the Sydney Users of R Forum (SURF) will hold its 50th. Everywhere, R user groups are sponsoring high-quality presentations and making them available online, but the Orange County R User Group is pushing the envelope with respect to sophistication and reach. Last Friday, I attended a webinar organized by this group where Professor Trevor Hastie of Stanford University presented Sparse Linear Models with demonstrations using GLMNET. This was a world-class presentation, and it was quite a coup for Orange County to have Professor Hastie present.
The glmnet package, written by Jerome Friedman, Trevor Hastie and Rob Tibshirani, contains very efficient procedures for fitting lasso or elastic-net regularization paths for generalized linear models. So far the glmnet function can fit gaussian and multiresponse gaussian models, logistic regression, poisson regression, multinomial and grouped multinomial models and the Cox model. The efficiency of the glmnet algorithm comes from using cyclical coordinate descent in the optimization process and from Jerome Friedman's underlying Fortran code.
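To make the idea of cyclical coordinate descent concrete, here is a minimal base-R sketch for the plain lasso on a gaussian response. It is an illustration under stated assumptions, not glmnet's actual Fortran implementation: the function names `soft_threshold` and `lasso_cd`, the simulated data, and the lambda grid are all made up for this example, and the predictors are assumed to be centered and scaled with a centered response.

```r
# Soft-thresholding operator: shrinks z toward zero by gamma
soft_threshold <- function(z, gamma) sign(z) * pmax(abs(z) - gamma, 0)

# Lasso by cyclical coordinate descent for one fixed lambda (illustrative sketch).
# Minimizes (1/(2n)) * sum((y - X b)^2) + lambda * sum(abs(b)).
lasso_cd <- function(X, y, lambda, beta = rep(0, ncol(X)),
                     n_iter = 200, tol = 1e-8) {
  n <- nrow(X)
  col_ss <- colSums(X^2) / n          # per-column mean sum of squares
  r <- as.vector(y - X %*% beta)      # current residuals
  for (iter in seq_len(n_iter)) {
    max_change <- 0
    for (j in seq_len(ncol(X))) {
      # inner product of column j with the partial residual (column j left out)
      rho <- sum(X[, j] * r) / n + col_ss[j] * beta[j]
      new_bj <- soft_threshold(rho, lambda) / col_ss[j]
      delta <- new_bj - beta[j]
      if (delta != 0) {
        r <- r - X[, j] * delta       # keep residuals in sync with beta
        beta[j] <- new_bj
        max_change <- max(max_change, abs(delta))
      }
    }
    if (max_change < tol) break       # stop once a full cycle changes nothing
  }
  beta
}

# Simulated data (hypothetical): 3 of 8 predictors truly matter
set.seed(1)
n <- 100; p <- 8
X <- scale(matrix(rnorm(n * p), n, p))
true_beta <- c(3, -2, 0, 0, 1.5, 0, 0, 0)
y <- as.vector(X %*% true_beta + rnorm(n))
y <- y - mean(y)

# Smallest lambda for which every coefficient is zero, then a decreasing grid
lambda_max <- max(abs(colSums(X * y))) / n
lambdas <- lambda_max * 10^seq(0, -3, length.out = 20)

# Regularization path with warm starts: each fit starts from the previous solution
path <- matrix(0, p, length(lambdas))
beta <- rep(0, p)
for (k in seq_along(lambdas)) {
  beta <- lasso_cd(X, y, lambdas[k], beta)
  path[, k] <- beta
}
n_nonzero <- colSums(path != 0)       # coefficients "in the model" at each lambda
print(n_nonzero)
```

Warm starts along a decreasing lambda grid are one reason path algorithms of this kind can be fast: each solution is usually close to the previous one, so few cycles are needed per lambda.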
Although Professor Hastie’s presentation was primarily concerned with fitting models for the wide problem (the number of explanatory variables is much larger than the number of observations), the lasso and elastic-net algorithms are just as applicable to data sets with large numbers of observations. It is likely that in the future we will see glmnet implementations for variable selection on datasets with thousands of variables and hundreds of millions of observations. The following graph shows the regularization paths for the coefficients of a model fit to the HIV data from one of Professor Hastie’s examples.
Each curve represents a coefficient in the model. The x axis is a function of lambda, the regularization penalty parameter. The y axis gives the value of the coefficient. The graph shows how the coefficients “enter the model” (become non-zero) as lambda changes. The following code, based on an example from the webinar, produces the plot and also shows how easy it is to perform cross-validation.
library(glmnet)                                  # load the package
load("hiv.rda")                                  # HIV data
class(hiv.train)                                 # The data are stored as a list
names(hiv.train)                                 # The names of the list elements are x and y
dim(hiv.train$x)                                 # The explanatory data consists of 704 observations of
                                                 # 208 binary mutation variables
head(hiv.train[[1]])                             # Look at the explanatory data
head(hiv.train[[2]])                             # Look at the response data: changes in susceptibility to antiviral drugs

fit=glmnet(hiv.train$x,hiv.train$y)              # fit the model
plot(fit,xvar="lambda", main="HIV model coefficient paths")   # Plot the paths for the fit
fit                                              # look at the fit for each coefficient

cv.fit=cv.glmnet(hiv.train$x,hiv.train$y)        # Perform cross validation on the fitted model
plot(cv.fit)                                     # Plot the mean sq error for the cross validated fit as a function
                                                 # of lambda the shrinkage parameter
                                                 # First vertical line indicates minimal mse
                                                 # Second vertical line is one standard error from the minimal mse:
                                                 # indicates a smaller model is "almost as good" as the minimal mse model

tpred=predict(fit,hiv.test$x)                    # Predictions on the test data
mte=apply((tpred-hiv.test$y)^2,2,mean)           # Compute mse for the predictions
points(log(fit$lambda),mte,col="blue",pch="*")   # overlay the mse predictions on the plot
legend("topleft",legend=c("10 fold CV","Test"),pch="*",col=c("red","blue"))
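The two vertical lines in the cross-validation plot correspond to values stored on the object that cv.glmnet returns. The sketch below shows one way to pull them out and count the coefficients in each model. Because the HIV data file is not reproduced here, it uses simulated data; the variable names, dimensions and true coefficients are assumptions made purely for illustration.

```r
library(glmnet)

# Simulated stand-in for the HIV data (hypothetical): 50 predictors, 3 with real effects
set.seed(42)
n <- 200; p <- 50
x <- matrix(rnorm(n * p), n, p)
y <- as.vector(x[, 1:3] %*% c(2, -1.5, 1) + rnorm(n))

cv.fit <- cv.glmnet(x, y)      # 10-fold cross-validation by default

cv.fit$lambda.min              # lambda giving the minimal cross-validated error
cv.fit$lambda.1se              # largest lambda whose error is within one
                               # standard error of that minimum

# Coefficients at each choice of lambda (the first row is the intercept)
coef_min <- as.matrix(coef(cv.fit, s = "lambda.min"))
coef_1se <- as.matrix(coef(cv.fit, s = "lambda.1se"))
n_min <- sum(coef_min[-1, 1] != 0)
n_1se <- sum(coef_1se[-1, 1] != 0)
cat("Non-zero coefficients at lambda.min:", n_min, "\n")
cat("Non-zero coefficients at lambda.1se:", n_1se, "\n")
```

Choosing lambda.1se rather than lambda.min trades a small amount of estimated accuracy for a more heavily penalized, and typically sparser, model.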
Don’t be content with this partial example. Professor Hastie and The Orange County R User Group have graciously made the slides, code and data available at this link. The webinar is well worth watching in its entirety.
As you might expect, Professor Hastie gives a masterful presentation: lucid, clear and succinct. This is in spite of the fact that Professor Hastie begins the presentation by commenting that it was his first webinar ever and that he was a little uncomfortable talking to his screen. (I think anyone who has ever given a webinar can relate to this: you talk to the screen and no energy from the audience comes back. Nothing is more disruptive to efforts to be enthusiastic than silence.) Nevertheless, Professor Hastie presents a difficult topic with a clarity that carries his audience along, and he is completely unfazed by the inevitable glitch. Watch how he handles the upside-down slide. You can download his slides, R scripts and data from the link below.
In my presentation to the Strata Santa Clara 2013 conference earlier this year, my goal was to give a succinct (under 20 minutes!) explanation of three terms that are too often used as mere buzzwords: predictive analytics, real time, and big data.
In the talk I referenced the example of UpStream Software's marketing attribution model. Last week I posted some details of how they used R to create and deploy the model for clients like Williams Sonoma, so follow that link if you're interested in the details.
I didn't have much time to get into why I believe that the R language is the ideal environment for creating such models, but I go into more depth in this longer version of the presentation.
In this third installment (following part 1 and part 2) of Extending RevoScaleR for Mining Big Data we look at how to use the building blocks provided by RevoScaleR to create a Naive Bayes model.
Motivation: Fit a Naive Bayes model to big data.
Naive Bayes is a simple probabilistic classifier based on applying Bayes' theorem with strong (naive) independence assumptions. This is often a good benchmark for other more complicated data mining models.
Really you are just calculating proportions for categorical variables (with possible Laplace correction), and probabilities based on a normal distribution for numeric variables. The proportions are easily calculated using rxCrossTabs, and the normal probabilities are easily calculated given a mean and standard deviation which we can get from rxSummary.
We can use the existing e1071 code and replace the calculation of proportions and probabilities with big data versions. The resulting model object is not only small (it is not itself big data), but the existing methods also work on the object!
You can test this out yourself with the function rxNaiveBayes at github.
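The calculation described above can be sketched in a few lines of base R. This is only an illustration of the arithmetic, not the rxNaiveBayes code: `table()` and `tapply()` stand in for rxCrossTabs and rxSummary, and the toy data frame, the variable names and the `predict_nb` helper are invented for the example.

```r
# Toy data (made up): one categorical and one numeric predictor
train <- data.frame(
  outlook = factor(c("sunny", "sunny", "rain", "rain",
                     "overcast", "overcast", "sunny", "rain")),
  temp    = c(30, 28, 18, 20, 24, 25, 22, 17),
  play    = factor(c("no", "no", "yes", "yes", "yes", "yes", "yes", "no"))
)

laplace <- 1   # Laplace correction added to every cell count

# Class priors: the proportion of each class
class_counts <- table(train$play)
prior <- setNames(as.vector(class_counts) / sum(class_counts),
                  names(class_counts))

# Categorical predictor: class-conditional proportions from a cross-tabulation
# (the step rxCrossTabs would perform on big data)
counts <- table(train$play, train$outlook)
cond_outlook <- (counts + laplace) / (rowSums(counts) + laplace * ncol(counts))

# Numeric predictor: per-class mean and standard deviation
# (the step rxSummary would perform on big data)
temp_mean <- tapply(train$temp, train$play, mean)
temp_sd   <- tapply(train$temp, train$play, sd)

# Posterior class probabilities for a single new observation
predict_nb <- function(outlook, temp) {
  score <- prior * cond_outlook[, outlook] * dnorm(temp, temp_mean, temp_sd)
  score / sum(score)
}

post <- predict_nb("sunny", 29)
print(post)
```

Everything the model needs is a handful of counts, means and standard deviations, which is why the fitted object stays small however many rows went into it.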
Conclusions
It is pretty easy to extend RevoScaleR to do many tasks. These are only three examples, but there are more on the github page. I also have a few more complicated examples that should be up eventually.
If you have an interest in helping to extend the functionality of RevoScaleR or just want to test some of the things I have created, please have a look at RevoEnhancements on github.
Major retailers like Williams Sonoma use UpStream Software for marketing analytics, including revenue attribution, targeting, and optimization. In the video below, Tess Nesbitt (senior statistician at UpStream) describes how she uses Revolution R Enterprise and Hadoop to figure out the impact of various marketing channels (for example direct mail, email offers, and catalogs) on consumer retail sales.
(The slides for Tess's presentation, The Impact of Big Data on Marketing Analytics, are available for download.) Because Tess needs massive amounts of consumer behaviour data to tease out the effects of the different marketing channels, she switched from SAS/WPS and now uses the big-data capabilities of Revolution R Enterprise to fit a survival model to more than 36 million records in less than 4 minutes. (You can find some of the details of the model and the R code used to fit it in Tess's recent presentation to the Bay Area R User Group, Statistical Marketing Analytics with Big Data (PPT).) By reducing the time it takes to fit the model from an overnight wait to the time it takes to make a cup of coffee, Tess has more opportunity to refine and improve the model, as she describes in a recent Channel Reseller News article:
"Revolution R Enterprise 6.2 enables UpStream to build highly efficient statistical models on extremely large data. Because their parallelized algorithms are so efficient, it enables us to take multiple passes at the data, build iterative models, and it provides everything we need to glean as much information and build the best models we can for our customers."
These more powerful models in turn mean that retailers can finally understand what marketing activities are most likely to lead consumers to make purchases, as Mohan Namboodiri (VP Customer Analytics, Williams Sonoma) describes in this GigaOM Structure 2013 panel discussion.
On Friday I traveled to Boulder, CO to update the Boulder BI Brain Trust on the latest news and updates from Revolution R Enterprise. While I was there, I was interviewed by BBBT president Claudia Imhoff. In a wide-ranging chat, we discussed:
What's behind the Revolution Analytics momentum over the past year?
How Business Intelligence relates to Data Science
Case studies of Revolution R in production
The value of "black box" machine learning algorithms versus data science and the statistical modeling process
The components of Revolution R Enterprise, and the Revolution Analytics partner ecosystem
I've embedded the 15-minute interview below, and you can also listen to the BBBT audio podcast at the link at the end of the post.
Between the Strata conference and various announcements, last week was certainly a busy one for the crew here at Revolution Analytics. So I thought I'd take the opportunity to catch you up on some of the recent media articles you might have missed:
As a gamer, I was especially interested to see what Electronic Arts' Rajat Taneja had to say about big data challenges in video games. Here are some of the key stats from his talk at Strata Santa Clara 2013:
There are more than 2 billion gamers worldwide, generating 50 Tb of data per day.
AAA multiplayer titles like Battlefield generate about 1Tb of data per day from in-game telemetry.
Social games like Simpsons Tapped Out generate about 150Gb of data per day.
In a typical month, EA hosts about 2.5 billion game sessions, representing about 50 billion minutes of gameplay.
Taneja said that an important initiative at EA is to be able to move from descriptions of the collected data ("what happened") to predictions about the future ("what happens next"). They've designed an ongoing process of data distillation to extract relevant data from the 50Tb stream, and use that for predictive analytics. I've embedded the video of Taneja's presentation at Strata below:
At Tuesday's Data Driven Business Day at the Strata conference I gave my talk, Real-time Big Data Predictive Analytics: From Deployment to Production. My goal in the talk was to explain the buzz-phrases "real time", "big data" and "predictive analytics" in the context of a specific example: why are some web ads today uncannily targeted at our personal interests or needs?
I've embedded the slides below (and you can also find a PPT version here):
I'm not sure the message comes through without the narrative, but I'll post the video version when it's available. You can also watch an expanded version of this talk I gave as a Revolution Analytics webinar.
By the way, if you're new to RHadoop, here's RHadoop creator and project leader Antonio Piccolboni introducing RHadoop at last year's Strata CA conference.