For the past 7 years, Revolution Analytics has been the leading provider of R-based software and services to companies around the globe. Today, we're excited to announce a new, enhanced R distribution for everyone: Revolution R Open.

Revolution R Open is a downstream distribution of R from the R Foundation for Statistical Computing. It's built on the R 3.1.1 language engine, so it's 100% compatible with any scripts, packages or applications that work with R 3.1.1. It also comes with enhancements to improve your R experience, focused on performance and reproducibility:

- Revolution R Open is linked with the Intel Math Kernel Libraries (MKL). These replace the standard R BLAS/LAPACK libraries to improve the performance of R, especially on multi-core hardware. You don't need to modify your R code to take advantage of the performance improvements.

- Revolution R Open comes with the Reproducible R Toolkit. The default CRAN repository is a static snapshot of CRAN (taken on October 1). You can always access newer R packages with the checkpoint package, which comes pre-installed. These changes make it easier to share R code with other R users, confident that they will get the same results as you did when you wrote the code.
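To get a feel for the MKL speed-up mentioned above, you can time a BLAS-heavy operation such as a dense matrix cross-product under both stock R and Revolution R Open (a rough sketch; timings vary with hardware and core count):

```r
# Dense matrix cross-product: one of the BLAS operations the MKL accelerates.
# Run the same code under stock R and Revolution R Open to compare timings.
set.seed(42)
n <- 2000
m <- matrix(rnorm(n * n), nrow = n)
system.time(crossprod(m))  # computes t(m) %*% m via the BLAS
```

No changes to the code are needed to benefit: the MKL is picked up automatically wherever R calls into the BLAS/LAPACK.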

Today we are also introducing MRAN, a new website where you can find information about R, Revolution R Open, and R Packages. MRAN includes tools to explore R Packages and R Task Views, making it easy to find packages to extend R's capabilities. MRAN is updated daily.

Revolution R Open is available for download now. Visit mran.revolutionanalytics.com/download for binaries for Windows, Mac, Ubuntu, CentOS/Red Hat Linux and (of course) the GPLv2 source distribution.

With the new Revolution R Plus program, Revolution Analytics is offering technical support and open-source assurance for Revolution R Open and several other open source projects from Revolution Analytics (including DeployR Open, ParallelR and RHadoop). If you are interested in subscribing, you can find more information at www.revolutionanalytics.com/plus. And don't forget that big-data R capabilities are still available in Revolution R Enterprise.

We hope you enjoy using Revolution R Open, and that your workplace will be confident adopting R with the backing of technical support and open source assurance of Revolution R Plus. Let us know what you think in the comments!

The ability to create reproducible research is an important topic for many users of R. So important that several groups in the R community have tackled this problem, notably packrat from RStudio and gRAN from Genentech (see our previous blog post).

The **Reproducible R Toolkit** is a new open-source initiative from Revolution Analytics. It takes a simple approach to dealing with R package versions, consisting of an R package checkpoint, and an associated daily CRAN snapshot archive, checkpoint-server. Here's one illustration of the problem it solves (with apologies to xkcd):

To achieve reproducibility, we store daily snapshots of all CRAN packages. At midnight UTC each day we refresh the CRAN mirror and then store a snapshot of CRAN as it exists at that very moment. You can access these daily snapshots using the checkpoint package, which installs packages and consistently uses them just as they existed on the snapshot date. Daily snapshots exist starting from 2014-09-17.
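Independently of the checkpoint package, you can point an R session directly at a daily snapshot by setting the repository option; the snapshot URL pattern shown here is an assumption based on the MRAN site layout:

```r
# Install packages as they existed on 2014-09-17 (the first snapshot date)
options(repos = c(CRAN = "https://mran.revolutionanalytics.com/snapshot/2014-09-17"))
install.packages("ggplot2")
```

The checkpoint package automates this (and the management of a separate package library) for you.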

**checkpoint package**

The goal of the checkpoint package is to solve the problem of package reproducibility in R. Since packages get updated on CRAN all the time, it can be difficult to recreate an environment where all your packages are consistent with some earlier state. To solve this issue, checkpoint allows you to install packages locally as they existed on a specific date from the corresponding snapshot (stored on the checkpoint server) and it configures your R session to use only these packages. Together, the checkpoint package and the checkpoint server act as a "CRAN time machine", so that anyone using checkpoint can ensure the reproducibility of scripts or projects at any time.

**How to use checkpoint**

Once you have the checkpoint package installed, using the checkpoint() function is as simple as adding the following lines to the top of your script:
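For example, to pin a script to the October 1 snapshot (the date here is illustrative):

```r
library(checkpoint)
checkpoint("2014-10-01")
```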

Typically, you will use the date you created the script as the argument to checkpoint. The first time you run the script, checkpoint will inspect your script (and other R files in the same project folder) for the packages used, and install the required packages with versions as of the specified date. (The first time you run the script, it will take some time to download and install the packages, but subsequent runs will use the previously-installed package versions.)

The checkpoint package installs the packages in a folder specific to the current project, separate from your default package library, so your other projects are unaffected.

If you want to update the packages you use at a later date, just update the date in the checkpoint() call and checkpoint() will automatically update the locally-installed packages.

The checkpoint package is available on CRAN:
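It can be installed in the usual way:

```r
install.packages("checkpoint")
```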

**Worked example**
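As a minimal sketch (the snapshot date and package used here are illustrative), a checkpointed script looks like any other script with two extra lines at the top:

```r
# Pin this analysis to the CRAN snapshot of 2014-10-01
library(checkpoint)
checkpoint("2014-10-01")

# Packages referenced below are detected by checkpoint and,
# on first run, installed from the snapshot
library(ggplot2)
ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()
```

Anyone who runs this script later, with any intervening CRAN updates, still gets ggplot2 as it existed on 2014-10-01.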

The Reproducible R Toolkit was created by the Open Source Solutions group at Revolution Analytics. Special thanks go to Scott Chamberlain who helped with early development.

We'd love to know what you think about checkpoint. Leave comments here on the blog, or via the checkpoint GitHub page.

Rrrr! It's International Talk Like a Pirate day again, mateys, the day all landlubbers should talk in pirate lingo. (If you're unsure how, R can help.) It's also the day where you can pick up some great O'Reilly R books for half price.

The Revolution Analytics team has been celebrating with some great pirate costumes -- thanks Colleen, Dave and Amanda!

That's all for this week. See you on Monday, mateys!

As more companies explore the benefits that Hadoop may provide, the opportunities to better understand the technology are myriad and uneven. As a provider of in-Hadoop analytics, Revolution Analytics is participating in the upcoming Hortonworks seminar series. We will be on site to discuss how to deploy R-based analytics within Hadoop clusters using Revolution R Enterprise. The seminar series itself will host a number of different speakers to provide the attendees with the chance to:

- Gain insight into the business value derived from Hadoop
- Understand the technical role of Hadoop within your data center
- Look at the future of Hadoop as presented by the builders and architects of the platform

The cost to attend the seminar is $99.00 and includes breakfast, lunch and a cocktail hour; sessions run from 8:00 AM to 4:00 PM. We hope to see you in one of the cities that we are sponsoring:

*By Neera Talbert, VP Services and Ben Wiley, R Programmer at Revolution Analytics*

By now, everyone should be familiar with the data scientist boom. Simply logging onto LinkedIn reveals a seemingly infinite number of people with words and phrases like “Data Scientist”, “Big Data Specialist”, and “Analytics” in their title. A few weeks ago, an article floated around the internet about how R programmers are the highest paid software engineers in industry. But the career of a data scientist is hot not only because it’s highly lucrative; drawing conclusions from data is itself a rewarding process, since these conclusions often shape our future.

As anyone would expect in such an attractive, emerging field, a lot of people are noticing. So how do you distinguish yourself in a job application for an analytics position? Or, from a company's perspective, how can you sift through the numerous applications from individuals with analytics backgrounds and choose the one best suited to finish the project? One of the tough aspects of the data scientist role is that the definition is extremely broad. Upon closer inspection of LinkedIn profiles with analytics positions, backgrounds include a variety of fields like applied and computational math, statistics, computer science, and so forth. Even fields with analytical aspects, like biology or political science, appear in these searches.

In other words, no one particular background is required to become a data scientist. At the same time, however, this in no way implies that the path is easy. In fact, the loosely defined background requirements attract top talent from many fields, rather than from just one. Having expertise in statistics and computer science no doubt helps quite a bit, but sometimes this is not enough to distinguish yourself on an application. There are many popular programming languages used for data analysis, and because these are often new and emerging, it can be difficult to assess one's understanding of a particular language.

Certifications can be one effective way to convey to an employer that you truly know and understand a program or concept. Revolution Analytics now offers a professional certification that tests the most sought-after R analytics skills for the enterprise, and can be an effective way to assess which applicants possess the necessary backgrounds in R and ScaleR programming. With R being the most widely used statistical language today, the Revolution R Enterprise Professional Certification can be a sure way to attract attention in the job market.

Another option for standing out as a data scientist is to attend graduate school. Of course, this path is much longer than obtaining a certification, but the effects can be lucrative as well. While selecting a good data science grad program is a blog post in and of itself, the obviously attractive fields for a data scientist are statistics and computer science. (There are now also a number of graduate programs devoted specifically to data science.) That’s not to say that other fields aren’t good options either – fields like genomics, biology, physics, economics, and others that heavily rely on data can be attractive paths for the prospective data scientist as well. The only concern to consider is again verifying that the skills gained in a grad program reflect industry’s expectations.

Finally, experience also helps. Having multiple years in an analytics position is a great way to convey one's understanding of data science to employers, and is often a substantial consideration in a company's evaluation of a candidate. A background as a programmer or analyst can be a good way to step into a data science position. Oftentimes a lack of experience is the greatest hurdle to entering the analytics profession, since not everyone has the background described above.

Despite the difficulty in attracting an employer's attention, entering the field of data science is totally worth it. With articles about “Big Data” and “Cloud Computing” emerging every day on the internet, being a data scientist no doubt puts you at the edge of modern-day technological development, and gives you the ability to make a substantial contribution to society. Plus there's the pay…

Revolution Analytics AcademyR: Revolution R Enterprise Professional Certification

by Joseph Rickert

I think we can be sure that when American botanist Edgar Anderson meticulously collected data on three species of iris in the early 1930s he had no idea that these data would produce a computational storm that would persist well into the 21st century. The calculations started, presumably by hand, when R. A. Fisher selected this data set to illustrate the techniques described in his 1936 paper on discriminant analysis. However, they really got going in the early 1970s when the pattern recognition and machine learning community began using it to test new algorithms or illustrate fundamental principles. (The earliest reference I could find was: Gates, G.W. "The Reduced Nearest Neighbor Rule". IEEE Transactions on Information Theory, May 1972, 431-433.) Since then, the data set (or one of its variations) has been used to test hundreds, if not thousands, of machine learning algorithms. The UCI Machine Learning Repository, which contains what is probably the “official” iris data set, lists over 200 papers referencing the iris data.

So why has the iris data set become so popular? Like most success stories, randomness undoubtedly plays a huge part. However, Fisher's selecting it to illustrate a discrimination algorithm brought it to people's attention, and the fact that the data set contains three classes, only one of which is linearly separable from the other two, makes it interesting.

For similar reasons, the airlines data set used in the 2009 ASA Sections on Statistical Computing and Statistical Graphics Data Expo has gained a prominent place in the machine learning world and is well on its way to becoming the “iris data set for big data”. It shows up in all kinds of places. (In addition to this blog, it made its way into the RHIPE documentation and figures in several college course modeling efforts.)

Some key features of the airlines data set are:

- It is big enough to exceed the memory of most desktop machines. (The version of the airlines data set used for the competition contained just over 123 million records with twenty-nine variables.)
- The data set contains several different types of variables. (Some of the categorical variables have hundreds of levels.)
- There are interesting things to learn from the data set. (This exercise from Kane and Emerson, for example.)
- The data set is *tidy*, but not clean, making it an attractive tool for practicing big data wrangling. (The AirTime variable ranges from -3,818 minutes to 3,508 minutes.)
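A first wrangling step suggested by that AirTime range might look like this (a sketch, assuming the records have been read into a data frame named `air`, which is a hypothetical name):

```r
# Treat physically impossible flight times (negative, or over 24 hours)
# as missing rather than letting them distort summaries and models
air$AirTime[air$AirTime < 0 | air$AirTime > 1440] <- NA
summary(air$AirTime)
```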

An additional, really nice feature of the airlines data set is that it keeps getting bigger! RITA, the Research and Innovative Technology Administration of the Bureau of Transportation Statistics, continues to collect data, which can be downloaded in .csv files. For your convenience, we have made a 143M+ record version of the data set, containing all of the RITA records from 1987 through the end of 2012, available for download on the Revolution Analytics test data site.

The following analysis from Revolution Analytics’ Sue Ranney uses this large version of the airlines data set and illustrates how a good model, driven with enough data, can reveal surprising features of a data set.

```r
# Fit a Tweedie GLM
tm <- system.time(
  glmOut <- rxGlm(ArrDelayMinutes ~ Origin:Dest + UniqueCarrier + F(Year) +
                    DayOfWeek:F(CRSDepTime),
                  data = airData, family = rxTweedie(var.power = 1.15),
                  cube = TRUE, blocksPerRead = 30)
)
tm

# Build a data frame for three airlines: Delta (DL), Alaska (AS), Hawaiian (HA)
airVarInfo <- rxGetVarInfo(airData)
predData <- data.frame(
  UniqueCarrier = factor(rep(c("DL", "AS", "HA"), times = 168),
                         levels = airVarInfo$UniqueCarrier$levels),
  Year = as.integer(rep(2012, times = 504)),
  DayOfWeek = factor(rep(c("Mon", "Tues", "Wed", "Thur", "Fri", "Sat", "Sun"),
                         times = 72),
                     levels = airVarInfo$DayOfWeek$levels),
  CRSDepTime = rep(0:23, each = 21),
  Origin = factor(rep("SEA", times = 504), levels = airVarInfo$Origin$levels),
  Dest = factor(rep("HNL", times = 504), levels = airVarInfo$Dest$levels)
)

# Use the model to predict the arrival delay for the three airlines and plot
predDataOut <- rxPredict(glmOut, data = predData, outData = predData,
                         type = "response")
rxLinePlot(ArrDelayMinutes_Pred ~ CRSDepTime | UniqueCarrier,
           groups = DayOfWeek, data = predDataOut, layout = c(3, 1),
           title = "Expected Delay: Seattle to Honolulu by Departure Time, Day of Week, and Airline",
           xTitle = "Scheduled Departure Time", yTitle = "Expected Delay")
```

Here, rxGlm() fits a Tweedie generalized linear model that looks at arrival delay as a function of the interaction between origin and destination airports, carrier, year, and the interaction between day of the week and scheduled departure time. This function kicks off a considerable amount of number crunching, as Origin is a factor variable with 373 levels and Dest, also a factor, has 377 levels. The F() function makes Year and CRSDepTime factors "on the fly" as the model is being fit. The resulting model ends up with 140,852 coefficients, 8,626 of which are not NA. The calculation takes 12.6 minutes to run on a 5-node (4 cores and 16GB of RAM per node) IBM Platform LSF cluster.

The rest of the code uses the model to predict arrival delay for three airlines and plots the fitted values by day of the week and departure time.

It looks like Saturday is the best day to fly on these airlines. Note that none of the structure revealed in these curves was put into the model, in the sense that there are no polynomial terms in the model.

It will take a few minutes to download the zip file with the 143M airlines records, but please do, and let us know how your modeling efforts go.

So says CIO.com, in a recent article *11 Market Trends in Advanced Analytics*.

R, an open source programming language for computational statistics, visualization and data analysis, is becoming a ubiquitous tool in advanced analytics offerings.

Kirsch says nearly every top vendor of advanced analytics has integrated R into its offering, so that it can now import R models. This allows data scientists, statisticians and other sophisticated enterprise users to leverage R within their analytics package.

One of the big beneficiaries of this trend, Kirsch says, is Revolution Analytics, the leading provider of enterprise support for R. Kaufman and Kirsch also point to advanced analytics firm Predixion, which is focused on extending R beyond data scientists and statisticians to business users through a wizard interface.

The article is based on a recent analyst report, Advanced Analytics: The Hurwitz Victory Index Report 2014. Revolution Analytics was named a Leader in Go To Market Strength in the report.

Many thanks to everyone who attended Tuesday's webinar, Applications in R - Success and Lessons Learned from the Marketplace. We had a great turnout and a very lively Q&A session. I've already shared many of the slides describing how companies like Google, Facebook and the New York Times use R, but this is the first time the presentation has been recorded — you can watch in the embedded video below. Revolution Analytics' VP of Professional Services Neera Talbert also joined this presentation, to describe some of the lessons we've learned helping companies implement R in production environments.

You can download the slides (including links to all of the referenced applications) from the link below.

Revolution Analytics webinars: Applications in R - Success and Lessons Learned from the Marketplace

[Reposting to update with the new date for the webinar: Tuesday July 29.]

Just a quick heads-up that I'll be presenting with Neera Talbert (VP Professional Services, Revolution Analytics) in a free webinar on **Tuesday, July 29** on Applications in R: Success and Lessons Learned from the Marketplace. I'll describe several R applications from well-known companies (some of which can be seen in the presentation I gave at the China R User Conference), and Neera will present a few case studies of how the Revolution Analytics consulting group has helped companies using R in areas such as supply chain analytics, sensor data analysis, and R package validation and certification. Here's the abstract for the webinar:

Applications in R - Success and Lessons Learned from the Marketplace

Adoption of the R language has grown rapidly in the last few years, and R is ranked as the number-one data science language in several surveys. This accelerating R adoption curve has been driven by the Big Data revolution, and the fact that so many data scientists — having learned R at university — are actively unlocking the secrets hidden in these new, vast data troves.

In this webinar David Smith, Chief Community Officer, will take a look at the growth of R and the innovative uses of R in business, government and non-profit sectors. Then Neera Talbert, Vice President, Professional Services will take you into the trenches of recent customer deployments and share best practices and pitfalls to avoid in deploying or expanding your own R applications.

You can sign up for the webinar (with live Q&A with me and Neera) at the link below, which will also automatically send a link to the slides and replay when they're available after the live presentation.

Revolution Analytics webinars: Applications in R: Success and Lessons Learned from the Marketplace (10AM PDT, July 29)

Revolution Analytics, founded in 2007, was the first company devoted to the R project. Since then, we've been behind several R initiatives, including the RHadoop project and the network of R user groups around the world. I gave this short presentation today at the useR! 2014 conference in Los Angeles with some of the highlights from Revolution Analytics from 2007-2014.

Slideshare: Revolution Analytics: a 5-minute history