*By Neera Talbert, VP of Services, and Ben Wiley, R Programmer, at Revolution Analytics*

By now, everyone should be familiar with the data scientist boom. Simply logging onto LinkedIn reveals a seemingly infinite number of people with words and phrases like “Data Scientist”, “Big Data Specialist”, and “Analytics” in their title. A few weeks ago, an article floated around the internet about how R programmers are the highest paid software engineers in industry. But the career of a data scientist is hot not only because it’s highly lucrative; drawing conclusions from data is itself a rewarding process, since these conclusions often shape our future.

As anyone would expect in such an attractive, emerging field, a lot of people are noticing. So how do you distinguish yourself when applying for an analytics position? Or, from a company’s perspective, how do you sift through the numerous applications from individuals with analytics backgrounds and choose the one best suited to the project? One of the tough aspects of the data scientist role is that its definition is extremely broad. A closer inspection of LinkedIn profiles with analytics titles reveals backgrounds in a variety of fields: applied and computational math, statistics, computer science, and so forth. Even fields with analytical aspects, like biology or political science, appear in these searches.

In other words, no single background bars you from becoming a data scientist. At the same time, this in no way implies that the path is easy. In fact, the loosely defined background requirements serve to attract top talent from many fields rather than from just one. Expertise in statistics and computer science no doubt helps quite a bit, but sometimes this is not enough to distinguish yourself on an application. There are many popular programming languages used for data analysis, and because these are often new and emerging, it can be difficult for an employer to assess a candidate’s understanding of any particular one.

Certifications can be an effective way to convey to an employer that you truly know and understand a program or concept. Revolution Analytics now offers a professional certification that tests the most sought-after R analytics skills for the enterprise, and it can be an effective way to assess which applicants possess the necessary background in R and ScaleR programming. With R being the most widely used statistical language today, the Revolution R Enterprise Professional Certification can be a sure way to attract attention in the job market.

Another option for standing out as a data scientist is to attend graduate school. Of course, this path is much longer than obtaining a certification, but the payoff can be substantial as well. While selecting a good data science grad program is a blog post in and of itself, the obviously attractive fields for a data scientist are statistics and computer science. (There are now also a number of graduate programs devoted specifically to data science.) That’s not to say other fields aren’t good options: genomics, biology, physics, economics, and other fields that rely heavily on data can also be attractive paths for the prospective data scientist. The only caveat, again, is verifying that the skills gained in a grad program reflect industry’s expectations.

Finally, experience also helps. Multiple years in an analytics position are a great way to convey one’s understanding of data science to employers, and are often a substantial consideration in a company’s evaluation of a candidate. A background as a programmer or analyst can be a good way to step into a data science position. Of course, not everyone has the background described above; oftentimes a lack of experience is the greatest hurdle to entering the analytics profession.

Despite the difficulty in attracting an employer’s attention, entering the field of data science is well worth it. With articles about “Big Data” and “Cloud Computing” emerging every day on the internet, being a data scientist no doubt puts you at the edge of modern-day technological development, and gives you the ability to make a substantial contribution to society. Plus there’s the pay…

Revolution Analytics AcademyR: Revolution R Enterprise Professional Certification

by Joseph Rickert

I think we can be sure that when American botanist Edgar Anderson meticulously collected data on three species of iris in the early 1930s he had no idea that these data would produce a computational storm that would persist well into the 21st century. The calculations started, presumably by hand, when R. A. Fisher selected this data set to illustrate the techniques described in his 1936 paper on discriminant analysis. However, they really got going in the early 1970s when the pattern recognition and machine learning community began using it to test new algorithms or illustrate fundamental principles. (The earliest reference I could find is Gates, G.W., "The Reduced Nearest Neighbor Rule", IEEE Transactions on Information Theory, May 1972, 431-433.) Since then, the data set (or one of its variations) has been used to test hundreds, if not thousands, of machine learning algorithms. The UCI Machine Learning Repository, which contains what is probably the “official” iris data set, lists over 200 papers referencing the iris data.

So why has the iris data set become so popular? Like most success stories, randomness undoubtedly plays a huge part. However, Fisher’s selecting it to illustrate a discrimination algorithm brought it to people’s attention, and the fact that the data set contains three classes, only one of which is linearly separable from the other two, makes it interesting.

For similar reasons, the airlines data set used in the 2009 ASA Sections on Statistical Computing and Statistical Graphics Data Expo has gained a prominent place in the machine learning world and is well on its way to becoming the “iris data set for big data”. It shows up in all kinds of places. (In addition to this blog, it made its way into the RHIPE documentation and figures in several college course modeling efforts.)

Some key features of the airlines data set are:

- It is big enough to exceed the memory of most desktop machines. (The version of the airlines data set used for the competition contained just over 123 million records with twenty-nine variables.)
- The data set contains several different types of variables. (Some of the categorical variables have hundreds of levels.)
- There are interesting things to learn from the data set. (For example, this exercise from Kane and Emerson.)
- The data set is *tidy*, but not clean, making it an attractive tool for practicing big data wrangling. (The AirTime variable ranges from -3,818 minutes to 3,508 minutes.)
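The AirTime anomaly in the last bullet is easy to check yourself. A minimal sketch, assuming a CSV extract small enough for base R and the RITA column names; the rxSummary() line shows the equivalent check on a full RevoScaleR data source:

```r
# Base R on a manageable extract of the airlines data (file name assumed)
air <- read.csv("airlines_sample.csv")
range(air$AirTime, na.rm = TRUE)  # negative minimums flag records that need cleaning

# With RevoScaleR, the same summary runs over the full data set out of memory
rxSummary(~ AirTime, data = airData)
```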

An additional, really nice feature of the airlines data set is that it keeps getting bigger! RITA, the Research and Innovative Technology Administration of the Bureau of Transportation Statistics, continues to collect data, which can be downloaded in .csv files. For your convenience, we have a 143M+ record version of the data set on the Revolution Analytics test data site, containing all of the RITA records from 1987 through the end of 2012, available for download.

The following analysis from Revolution Analytics’ Sue Ranney uses this large version of the airlines data set and illustrates how a good model, driven with enough data, can reveal surprising features of a data set.

```r
# Fit a Tweedie GLM
tm <- system.time(
  glmOut <- rxGlm(ArrDelayMinutes ~ Origin:Dest + UniqueCarrier + F(Year) +
                    DayOfWeek:F(CRSDepTime),
                  data = airData,
                  family = rxTweedie(var.power = 1.15),
                  cube = TRUE, blocksPerRead = 30)
)
tm

# Build a data frame for three airlines: Delta (DL), Alaska (AS), Hawaiian (HA)
airVarInfo <- rxGetVarInfo(airData)
predData <- data.frame(
  UniqueCarrier = factor(rep(c("DL", "AS", "HA"), times = 168),
                         levels = airVarInfo$UniqueCarrier$levels),
  Year = as.integer(rep(2012, times = 504)),
  DayOfWeek = factor(rep(c("Mon", "Tues", "Wed", "Thur", "Fri", "Sat", "Sun"),
                         times = 72),
                     levels = airVarInfo$DayOfWeek$levels),
  CRSDepTime = rep(0:23, each = 21),
  Origin = factor(rep("SEA", times = 504), levels = airVarInfo$Origin$levels),
  Dest = factor(rep("HNL", times = 504), levels = airVarInfo$Dest$levels)
)

# Use the model to predict the arrival delay for the three airlines and plot
predDataOut <- rxPredict(glmOut, data = predData, outData = predData,
                         type = "response")
rxLinePlot(ArrDelayMinutes_Pred ~ CRSDepTime | UniqueCarrier,
           groups = DayOfWeek, data = predDataOut, layout = c(3, 1),
           title = "Expected Delay: Seattle to Honolulu by Departure Time, Day of Week, and Airline",
           xTitle = "Scheduled Departure Time",
           yTitle = "Expected Delay")
```

Here, rxGlm() fits a Tweedie generalized linear model that looks at arrival delay as a function of the interaction between origin and destination airports, carrier, year, and the interaction between day of the week and scheduled departure time. This function kicks off a considerable amount of number crunching: Origin is a factor variable with 373 levels and Dest, also a factor, has 377 levels. The F() function makes Year and CRSDepTime factors "on the fly" as the model is being fit. The resulting model ends up with 140,852 coefficients, 8,626 of which are not NA. The calculation takes 12.6 minutes to run on a 5-node (4 cores and 16GB of RAM per node) IBM Platform LSF cluster.

The rest of the code uses the model to predict arrival delay for three airlines and plots the fitted values by day of the week and departure time.

It looks like Saturday is the best day to fly for these airlines. Note that none of the structure revealed in these curves was put into the model, in the sense that there are no polynomial terms in the model.

It will take a few minutes to download the zip file with the 143M airlines records, but please do, and let us know how your modeling efforts go.

So says CIO.com, in a recent article *11 Market Trends in Advanced Analytics*.

R, an open source programming language for computational statistics, visualization and data analysis, is becoming a ubiquitous tool in advanced analytics offerings.

Kirsch says nearly every top vendor of advanced analytics has integrated R into their offering, so that they can now import R models. This allows data scientists, statisticians and other sophisticated enterprise users to leverage R within their analytics package.

One of the big beneficiaries of this trend, Kirsch says, is Revolution Analytics, the leading provider of enterprise support for R. Kaufman and Kirsch also point to advanced analytics firm Predixion, which is focused on extending R beyond data scientists and statisticians to business users through a wizard interface.

The article is based on a recent analyst report, Advanced Analytics: The Hurwitz Victory Index Report 2014. Revolution Analytics was named a Leader in Go To Market Strength in the report.

Many thanks to everyone who attended Tuesday's webinar, Applications in R - Success and Lessons Learned from the Marketplace. We had a great turnout and a very lively Q&A session. I've already shared many of the slides describing how companies like Google, Facebook and the New York Times use R, but this is the first time the presentation has been recorded — you can watch in the embedded video below. Revolution Analytics' VP of Professional Services Neera Talbert also joined this presentation, to describe some of the lessons we've learned helping companies implement R in production environments.

You can download the slides (including links to all of the referenced applications) from the link below.

Revolution Analytics webinars: Applications in R - Success and Lessons Learned from the Marketplace

[Reposting to update with the new date for the webinar: Tuesday July 29.]

Just a quick heads-up that I'll be presenting with Neera Talbert (VP Professional Services, Revolution Analytics) in a free webinar on **Tuesday, July 29** on Applications in R: Success and Lessons Learned from the Marketplace. I'll describe several R applications from well-known companies (some of which can be seen in the presentation I gave at the China R User Conference), and Neera will present a few case studies of how the Revolution Analytics consulting group has helped companies using R in areas such as supply chain analytics, sensor data analysis, and R package validation and certification. Here's the abstract for the webinar:

*Applications in R - Success and Lessons Learned from the Marketplace*

Adoption of the R language has grown rapidly in the last few years, and R is ranked as the number-one data science language in several surveys. This accelerating R adoption curve has been driven by the Big Data revolution, and the fact that so many data scientists — having learned R at university — are actively unlocking the secrets hidden in these new, vast data troves.

In this webinar David Smith, Chief Community Officer, will take a look at the growth of R and the innovative uses of R in business, government and non-profit sectors. Then Neera Talbert, Vice President, Professional Services will take you into the trenches of recent customer deployments and share best practices and pitfalls to avoid in deploying or expanding your own R applications.

You can sign up for the webinar (with live Q&A with me and Neera) at the link below, which will also automatically send a link to the slides and replay when they're available after the live presentation.

Revolution Analytics webinars: Applications in R: Success and Lessons Learned from the Marketplace (10AM PDT, July 29)

Revolution Analytics, founded in 2007, was the first company devoted to the R project. Since then, we've been behind several R initiatives, including the RHadoop project and the network of R user groups around the world. I gave this short presentation today at the useR! 2014 conference in Los Angeles with some of the highlights from Revolution Analytics from 2007-2014.

Slideshare: Revolution Analytics: a 5-minute history

by Joseph Rickert

useR! 2014 is just about two weeks away, and I am very much looking forward to meeting R users from around the world. It is a great time to catch up with old friends, hopefully make some new ones, and talk about R and R user groups. The number of R user groups continues to increase. Over the past six months, new groups have formed in Chennai (India), Exeter (UK), Miami (FL), Durham (NH), Albany (NY) and Charlotte (NC). There are now 141 user groups listed in the Revolutions User Group Directory.

Forty-nine groups are clustered in Europe.

At Revolution Analytics we are always looking for new ways to support R user groups, to do what we can to spot trends, and to highlight ways that R users help each other and share information. For example, many R user groups post the slides from talks and presentations on their websites. Taken together, these make a considerable source of reference material with respect to what's hot in R. Some groups are beginning to use their sites to present training material and share code. For example, have a look at the Resources page on the NH UseRs site, and the GitHub code on "*EXTRACTION DE DONNÉES SUR LE WEB*" ("Extracting Data from the Web") posted by R Addicts Paris.

If you are going to UCLA for useR!, please stop by the Revolution Analytics table to chat (I'll be the guy in the hat). We would very much like to hear your ideas about R user groups and what more we may be able to do to help.

"A growing body of evidence indicates that the most meaningful way to access predictive analytics and enhance the reputation of Data Science is through open source analytics, which greatly hinges upon the free open source programming language R," according to Dataversity in the recent article "The Relevance of Open Source (Advanced) Analytics". The article also includes several business use cases for R. I was interviewed for the article, and when asked why companies should invest in R as a data science platform, this was my reply:

“Investing in R, whether from the point of view of an individual Data Scientist or a company as a whole is always going to pay off because R is always available. If you’ve got a Data Scientist new to an organization, you can always use R. If you’re a company and you’re putting your practice on R, R is always going to be available. And, there’s also an ecosystem of companies built up around R including Revolution Enterprise to help organizations implement R into their machine critical production processes.”

You can read the entire article at the link below.

Dataversity: The Relevance of Open Source (Advanced) Analytics

Many companies are considering switching from SAS to R for statistical data analysis, and may be wondering how R compares in performance and data size scalability to the legacy SAS systems (base SAS and SAS/Stat) they are currently using. Performance and scalability for R is exactly what Revolution R Enterprise (RRE) was designed for. In a recent webinar, Thomas Dinsmore described a benchmarking process to compare performance of legacy SAS and RRE. (The benchmarking process is described in the white paper *Revolution R Enterprise: Faster Than SAS*, and you can see the code behind the benchmarking process here.) In the webinar, Thomas revealed the following results:

- RRE ran the tasks forty-two times faster than legacy SAS on the larger data set
- RRE outperformed legacy SAS on every task
- The RRE performance advantage ranged from 10X to 300X
- The RRE advantage increased when we tested on larger data sets
- SAS’ new HP PROC, where available, only marginally improved SAS performance

Also in the webinar, John Wallace, founder and CEO of DataSong, described how performance and scalability requirements led to the selection in 2011 of Revolution R Enterprise as the analytics engine in their software-as-a-service platform. DataSong's industry-leading marketing analytics system currently analyzes more than $3 billion in marketing spend by major retailers.

The slides from the webinar are embedded above, and you can watch and download the full webinar at the link below.

Revolution Analytics webinars: Is Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed

by Joseph Rickert

In last week’s post, I sketched out the history of Generalized Linear Models and their implementations. In this post I’ll attempt to outline how GLM functions evolved in R to handle large data sets.

The first function to make it possible to build GLM models with data sets too big to fit into memory was bigglm() from Thomas Lumley’s biglm package, which was released to CRAN in May 2006. bigglm() is an example of an external memory or “chunking” algorithm. This means that data is read from some source on disk and processed one chunk at a time. Conceptually, chunking algorithms work as follows: a program reads a chunk of data into memory, performs intermediate calculations to compute the required sufficient statistics, saves the results and reads the next chunk. The process continues until the entire data set is processed. Then, if necessary, the intermediate results are assembled into a final result.

According to the documentation trail, bigglm() is based on Alan Miller’s 1991 refinement (algorithm AS 274, implemented in Fortran 77) of W. Morven Gentleman’s 1975 Algol algorithm (AS 75). Both of these algorithms work by updating the Cholesky decomposition of the design matrix with new observations. For a model with p variables, only the p x p triangular Cholesky factor and a new row of data need to be in memory at any given time.

bigglm() does not do the chunking for you. Working with the algorithm requires figuring out how to feed it chunks of data from a file or a database that are small enough to fit into memory with enough room left for processing. (Have a look at the make.data() function defined on page 4 of the biglm pdf for the prototype example of chunking by passing a function to bigglm()’s data argument.) bigglm() and the biglm package offer few features for working with data. For example, bigglm() can handle factors, but it assumes that the factor levels are consistent across all chunks. This is very reasonable under the assumption that the appropriate place to clean and prepare the data for analysis is the underlying database.
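To make the chunking contract concrete, here is a minimal sketch in the style of the make.data() example from the biglm documentation. The file name, column names and model are assumptions; the key point is that bigglm() calls the data function with reset = TRUE to rewind the source, then repeatedly with reset = FALSE until the function returns NULL:

```r
library(biglm)

# Returns a closure that feeds bigglm() one chunk of a headerless CSV at a time
make.data <- function(filename, chunksize, col.names) {
  conn <- NULL
  function(reset = FALSE) {
    if (reset) {
      if (!is.null(conn)) close(conn)
      conn <<- file(filename, open = "r")
    } else {
      chunk <- try(read.csv(conn, nrows = chunksize, header = FALSE,
                            col.names = col.names), silent = TRUE)
      if (inherits(chunk, "try-error")) {  # no lines left: signal end of data
        close(conn)
        conn <<- NULL
        NULL
      } else chunk
    }
  }
}

airline.chunks <- make.data("airlines.csv", chunksize = 100000,
                            col.names = c("ArrDelay", "Distance", "DayOfWeek"))
fit <- bigglm(ArrDelay ~ Distance + factor(DayOfWeek), data = airline.chunks,
              family = gaussian(), chunksize = 100000)
```

Note that factor(DayOfWeek) here relies on the consistent-levels assumption described above.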

The next step in the evolution of building GLM models with R was the development of memory-mapped data structures, along with the appropriate machinery to feed bigglm() data stored on disk. In late 2007, Daniel Adler et al. released the ff package, which provides data structures that, from R's point of view, make data residing on disk appear as if it were in RAM. The basic idea is that only a chunk (pagesize) of the underlying data file is mapped into memory at a time, and this data can be fed to bigglm(). This strategy really became useful in 2011 when Edwin de Jonge, Jan Wijffels and Jan van der Laan released ffbase, a package of statistical functions designed to exploit ff’s data structures. ffbase contains quite a few functions, including some for basic data manipulation such as ffappend() and ffmatch(). For an excellent example of building a bigglm() model with a fairly large data set, have a look at the post from the folks at BNOSAC. This is one of the most useful, hands-on posts with working code for building models with R and large data sets to be found. (It may be a testimony to the power of provocation.)
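For comparison, the ff/ffbase route looks roughly like this (file name and variables are assumptions): read.csv.ffdf() memory-maps the file into an ffdf object, and ffbase supplies the method that lets bigglm() consume that object directly, without a hand-rolled chunking function:

```r
library(ff)
library(ffbase)
library(biglm)

# Memory-map the CSV into an ffdf; only a pagesize chunk resides in RAM at a time
airff <- read.csv.ffdf(file = "airlines.csv", header = TRUE)

# ffbase's bigglm method streams the memory-mapped data to biglm in chunks
fit <- bigglm(ArrDelay ~ Distance + DayOfWeek, data = airff,
              family = gaussian(), chunksize = 100000)
summary(fit)
```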

Not long after ff debuted (June 2008), Michael Kane, John Emerson and Peter Haverty released bigmemory, a package for working with large matrices backed by memory-mapped files. Thereafter followed a sequence of packages in the Big Memory Project, including biganalytics, for exploiting the computational possibilities opened up by bigmemory. The bigmemory packages are built on the Boost Interprocess C++ library and were designed to facilitate parallel programming with foreach, snow, Rmpi and multicore, and to enable distributed computing from within R. The biganalytics package contains a wrapper function for bigglm() that enables building GLM models from very large files mapped to big.matrix objects with just a few lines of code.
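A hedged sketch of that workflow (file names are hypothetical; note that a big.matrix holds a single numeric type, so categorical variables must arrive already coded as numbers):

```r
library(bigmemory)
library(biganalytics)

# Create a file-backed big.matrix from a purely numeric CSV
airmat <- read.big.matrix("airlines_numeric.csv", header = TRUE,
                          type = "double",
                          backingfile = "airlines.bin",
                          descriptorfile = "airlines.desc")

# The biganalytics wrapper streams chunks of the big.matrix to bigglm();
# fc names the columns to be treated as factors
fit <- bigglm.big.matrix(ArrDelay ~ Distance + DayOfWeek, data = airmat,
                         fc = "DayOfWeek",
                         family = gaussian(), chunksize = 100000)
```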

The initial release of the RevoScaleR package for Revolution R Enterprise, in early August 2010, included rxLogit(), a function for building logistic regression models on very massive data sets. rxLogit() was one of the first of RevoScaleR’s Parallel External Memory Algorithms (PEMAs). These algorithms are designed specifically for high performance computing with large data sets on a variety of distributed platforms. In June 2012, Revolution Analytics followed up with rxGlm(), a PEMA that implements all of the standard GLM link/family pairs as well as Tweedie models and user-defined link functions. As with all of the PEMAs, scripts that include rxGlm() may be run on different platforms just by changing the few lines of code that specify the user’s compute context. For example, a statistician could test out a model on a local PC or cluster and then change the compute context to run it directly on a Hadoop cluster.
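The compute-context switch works roughly as follows (data source names and cluster connection details here are hypothetical):

```r
library(RevoScaleR)

# Develop and test the model locally
rxSetComputeContext("local")
glmLocal <- rxGlm(ArrDelayMinutes ~ DayOfWeek, data = airDataSample,
                  family = rxTweedie(var.power = 1.15))

# Then point the same script at a Hadoop cluster by swapping the compute context
hadoopCC <- RxHadoopMR(sshUsername = "analyst",
                       sshHostname = "namenode.example.com")
rxSetComputeContext(hadoopCC)
glmCluster <- rxGlm(ArrDelayMinutes ~ DayOfWeek, data = airDataHdfs,
                    family = rxTweedie(var.power = 1.15))
```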

The only other Big Data GLM implementation accessible through an R package of which I am aware is the h2o.glm() function, part of 0xdata’s JVM implementation of machine learning algorithms, which was announced in October 2013. As opposed to the external memory R implementations described above, H2O functions run in the distributed memory created by the H2O process. Look here for h2o.glm() demo code.

And that's it: I think this brings us up to date with R-based (or R-accessible) functions for running GLMs on large data sets.