*By Neera Talbert, VP Services and Ben Wiley, R Programmer at Revolution Analytics*

By now, everyone should be familiar with the data scientist boom. Simply logging onto LinkedIn reveals a seemingly infinite number of people with words and phrases like “Data Scientist”, “Big Data Specialist”, and “Analytics” in their title. A few weeks ago, an article floated around the internet about how R programmers are the highest paid software engineers in industry. But the career of a data scientist is hot not only because it’s highly lucrative; drawing conclusions from data is itself a rewarding process, since these conclusions often shape our future.

As anyone would expect in such an attractive new and emerging field, a lot of people are noticing. So how do you distinguish yourself in a job application for an analytics position? Or, from a company’s perspective, how can you sift through the numerous applications from individuals with analytics backgrounds and choose the one best suited to finish the project? One of the tricky aspects of the data scientist role is that its definition is extremely broad. A closer inspection of LinkedIn profiles in analytics positions reveals backgrounds in a variety of fields: applied and computational math, statistics, computer science, and so forth. Even fields with analytical aspects, like biology or political science, appear in these searches.

In other words, lacking one particular background cannot bar you from being a data scientist. At the same time, however, this in no way implies that the path to becoming a data scientist is easy. In fact, the loosely defined background requirements serve to attract top talent from many fields rather than from just one. Expertise in statistics and computer science no doubt helps quite a bit, but sometimes even this is not enough to distinguish yourself on an application. There are many popular programming languages used for data analysis, and because these are often new and emerging, it can be difficult to assess someone’s understanding of a particular language.

Certifications can be one effective way to convey to an employer that you truly know and understand a program or concept. Revolution Analytics now offers a professional certification that tests the most sought-after R analytics skills for the enterprise, and it can be an effective way to assess which applicants possess the necessary background in R and ScaleR programming. With R being the most widely used statistical language today, the Revolution R Enterprise Professional Certification can be a sure way to attract attention in the job market.

Another option for standing out as a data scientist is to attend graduate school. Of course, this path is much longer than obtaining a certification, but the payoff can be lucrative as well. While selecting a good data science grad program is a blog post in and of itself, the obviously attractive fields for a data scientist are statistics and computer science. (There are now also a number of graduate programs devoted specifically to data science.) That’s not to say that other fields aren’t good options: genomics, biology, physics, economics, and other fields that rely heavily on data can be attractive paths for the prospective data scientist as well. The one caveat, again, is to verify that the skills gained in a grad program reflect industry’s expectations.

Finally, experience also helps. Multiple years in an analytics position are a great way to convey one’s understanding of data science to employers, and are often a substantial consideration in a company’s evaluation of a candidate. A background as a programmer or analyst can be a good way to step into a data science position. Not everyone has the background described above, and a lack of experience is often the greatest hurdle to entering the analytics profession.

Despite the difficulty in attracting an employer’s attention, entering the field of data science is totally worth it. With articles about “Big Data” and “Cloud Computing” emerging every day on the internet, being a data scientist no doubt puts you at the edge of modern-day technological development, and gives you the ability to make a substantial contribution to society. Plus there’s the pay…

Revolution Analytics AcademyR: Revolution R Enterprise Professional Certification

by James Peruvankal

There are plenty of options if you want to learn R and are looking for training: your college’s statistics department, massive open online courses like Coursera, Udacity, edX, Datacamp etc. SiliconANGLE recently published an article about top R-training companies.

Let’s talk about how to choose a good R-trainer.

- First and foremost is technical competency in R - In addition to having done a significant amount of R programming, the instructor should have an education in a quantitative field. The idea behind this is that the instructor will have had experience expressing non-trivial ideas in R. However, it is not necessarily the case that the most technically competent person is the best instructor.
- Experience in teaching statistics - Learning R invariably involves working with statistics. So knowing where students can go wrong in understanding statistical concepts is a skill that greatly increases the effectiveness of an R instructor. This skill only comes with experience. Joan Garfield's 1995 article in the International Statistical Review, *How Students Learn Statistics*, is an excellent reference on what can go wrong in learning statistics and how to correct it.
- Communication skills - The instructor should have the ability to clearly communicate complex topics in simple examples that students can relate to. We recommend Gelman and Nolan's book: Teaching Statistics: A Bag of Tricks which promotes an activity based approach to teaching.
- Evangelism - Passion generates passion. The enthusiasm of the instructor spreads to the students.
- Teaching style and philosophy - From our experience in teaching and based on decades of research on how people learn, we have come up with our teaching philosophy. The most important factor is that people ‘learn by doing’. Ensure that most of the time is spent on hands-on learning.

At Revolution Analytics we are guided by the teaching philosophy presented in the following chart:

So, if you are serious about learning R, brush up on your statistics, be prepared to jump right in and start doing things on your own, surround yourself with people who are passionate about statistics and R, and figure out how to make the whole process fun for you. If you are teaching R and want to join us in our mission to ‘take R to the Enterprise’, see if you can fit in with our team.

by Joseph Rickert

UseR! 2014 got under way this past Monday with a very impressive array of tutorials, delivered on a day when the conference organizers were struggling to cope with a record-breaking crowd. My guess is that conference attendance is somewhere in the 700 range. Moreover, this is the first year that I can remember that tutorials were free. The combination made for jam-packed tutorials.

The first thing that jumps out just by looking at the tutorial schedule is the effort that RStudio made to teach at the conference. Of the sixteen tutorials given, four were presented by the RStudio team. Winston Chang conducted an introduction to interactive graphics with ggvis; Yihui Xie presented dynamic documents with R and knitr; Garrett Grolemund presented interactive data display with Shiny and R; and Hadley Wickham taught data manipulation with dplyr. Shiny is still captivating R users. I was particularly struck by a conversation I had with an undergraduate Stats major who seemed to be genuinely pleased and excited about being able to build her own Shiny apps. Kudos to the RStudio team.

Bob Muenchen brought this same kind of energy to his introduction to managing data with R. Bob has extensive experience with SAS and Stata and he seems to have a gift for anticipating areas where someone proficient in one of these packages, but new to R, might have difficulty.

Matt Dowle presented his tutorial on data.table and, from the chatter I heard, added to his growing list of data.table converts. Dirk Eddelbuettel presented An Example-Driven, Hands-on Introduction to Rcpp and Romain Francois taught C++ and Rcpp11 for beginners. Both Dirk and Romain work hard to make interfacing to C++ a real option for people who are themselves willing to make an effort.

I came late to Martin Morgan’s tutorial on Bioconductor but couldn’t get in the room. Fortunately, Martin prepared an extensive set of materials for his course which I hope to be able to work through.

Max Kuhn taught an introduction to applied predictive modeling based on the recent book he wrote with Kjell Johnson. Both the slides and code are available. Virgilio Gomez Rubio presented a tutorial on applied spatial data analysis with R; his materials are available here. Ramnath Vaidyanathan presented Interactive Documents with R; have a look at both Ramnath’s slides, the map embedded in his abstract, and his screencast.

Drew Schmidt taught a workshop on Programming with Big Data in R based on the pbdR package. The course page points to a rich set of introductory resources.

I very much wanted to attend Søren Højsgaard’s tutorial on Graphical Models and Bayesian Networks with R, but couldn’t make it. I did attend John Nash's tutorial on Nonlinear parameter optimization and modeling in R and I am glad that I did. This is a new field for me and I was fortunate to see something of John’s meticulous approach to the subject.

I was disappointed that I didn’t get to attend Thomas Petzoldt’s tutorial on Simulating differential equation models in R. This is an area that is not usually associated with R. The webpage for the tutorial is really worth a look.

I don't know if the conference organizers planned it that way, but as it turned out, the tutorial subjects chosen are an excellent showcase for the depth and diversity of applications that can be approached through R. Many thanks to the tutorial teachers and congratulations to the UseR! 2014 conference organizers for a great start to the week. I hope to have more to say about the conference in future posts.

Revolution Analytics has just introduced a 10-module series of R courses in Singapore. If you'd like to learn how to do data analysis in R, already know data analysis in another language like SAS and want to transition to R, or just want to enhance your R skills in a specific area, one of these hands-on courses may be of interest. The available modules (which you can mix and match) are:

- Module 1: Introduction to R
- Module 2: Taking on Big Data using RevoScaleR
- Module 3: Using R with Hadoop
- Module 4: Implementing Web Services using RevoDeployR
- Module 5: Introduction to Rattle and Alteryx for Data Mining
- Module 6: Visualization in R with ggplot2
- Module 7: R for SAS Users
- Module 8: Introduction to Statistics and Analytics
- Module 9: Advanced Statistical Topics, Data Mining, and Machine Learning
- Module 10: System Administration

The next scheduled course on April 15 is **R for SAS users**, a 2-day R course in Singapore designed for existing SAS users. This quick-start for the R language builds on your existing knowledge of SAS to teach data handling, data visualization, statistical analysis and reporting with R.

If you're not in the Singapore area, you can find other R courses in North America, India, and Europe. And from anywhere in the world you can take Bob Muenchen's course **R for SAS, SPSS and Stata Users online**, April 21-23.

by James Paul Peruvankal, Senior Program Manager at Revolution Analytics

At Revolution Analytics, we are always interested in how people teach and learn R, and what makes R so popular, yet ‘quirky’ to learn. To get some insight from a real pro we interviewed Bob Muenchen. Bob is the author of *R for SAS and SPSS Users* and, with Joseph M. Hilbe, *R for Stata Users*. He is also the creator of r4stats.com, a popular web site devoted to analyzing trends in analytics software and helping people learn the R language. Bob is an Accredited Professional Statistician™ with 30 years of experience and is currently the manager of OIT Research Support (formerly the Statistical Consulting Center) at the University of Tennessee. He has conducted research for a variety of public and private organizations and has assisted on more than 1,000 graduate theses and dissertations. He has written or coauthored over 60 articles published in scientific journals and conference proceedings.

Bob has served on the advisory boards of SAS Institute, SPSS Inc., StatAce OOD, the Statistical Graphics Corporation and PC Week Magazine. His suggested improvements have been incorporated into SAS, SPSS, JMP, STATGRAPHICS and several R packages. His research interests include statistical computing, data graphics and visualization, text analysis, data mining, psychometrics and resampling.

**James: How did you get started teaching people how to use statistical software?**

Bob: When I came to UT in 1979, many people were switching from either FORTRAN or SPSS to SAS. There was quite a lot of demand for SAS training, and I enjoyed teaching the workshops. Back then SAS could save results, like residuals or predicted values, much more easily than SPSS, which drove the switch.

When the Windows version of SPSS came out, people started switching back. The SPSS user interface designer, Sheri Gilley, really understood what ease of use was all about, and the SAS folks didn’t get that until quite recently. I was just as happy teaching the SPSS workshops. However, many SPSS users at UT avoid programming, which I think is a big mistake. Pointing-and-clicking your way through an analysis can be a time-saving way to work, but I always keep the program so I have a record of what I did.

I started teaching R workshops in 2005 and attendance was quite sparse. Now it’s one of our Research Computing Support team’s most popular topics.

**James: Is there anything special about teaching people how to use R, any particular difficulties?**

Bob: In other analytics software, the focus is on variables. It sounds too simple to even bother saying: "Every procedure accepts variables." There are very few ways to specify them, such as by simple name (A, B, C) or with lists like A TO Z or A--Z.

Rather than just variables, R has a variety of objects such as vectors, factors and matrices. Some procedures (called functions in R) require particular kinds of objects and there are many more ways to specify which objects to use. From a new user's perspective that may seem like needless complexity. However it provides significant benefits. Once an R user has defined a categorical variable as a factor, analyses will then try to “do the right thing” with that variable. For instance, you could include it in a regression equation and R would create the indicator variables needed to handle a categorical variable automatically.
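A minimal sketch of the behavior Bob describes, using a tiny data set invented here for illustration: once `group` is a factor, `lm()` generates the indicator variables on its own.

```r
# A categorical variable stored as a factor: lm() builds the
# indicator (dummy) variables for the regression automatically.
df <- data.frame(
  y     = c(2.1, 3.9, 6.2, 4.0, 5.8, 8.1),
  group = factor(c("a", "b", "c", "a", "b", "c"))
)
fit <- lm(y ~ group, data = df)
coef(fit)  # (Intercept), groupb, groupc: one indicator per non-baseline level
```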

Another important benefit to R’s object orientation is that it allows a total merger of what would normally be a separate matrix language into the main language of R. This attracts developers, who are helping grow R’s capabilities very rapidly.

**James: How do you handle such a broad range of backgrounds in your classes?**

Bob: The workshop participants do come from a very wide range of fields, but they share a common set of knowledge: what a variable is, how to analyze data, and so on. So I save a great deal of time by not having to explain all that. Instead, I redirect it into pointing out where R is likely to surprise them. You can have variables that are *not* in a data set? That’s a bizarre concept to a SAS, SPSS or Stata user. You can have X in one data set and Y in another, but include both in the same regression model? That sounds very strange at first and, of course, it’s quite risky if you’re not careful. I introduce most topics with, “You’re expecting this, but here comes something very different…”. Different doesn’t necessarily mean better, of course. SAS, SPSS and Stata are all top-quality packages and they do some things with less effort. I love R, but I like to point out where I think the others do a better job.
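A quick sketch of the kind of surprise Bob mentions, on toy data invented here: the response and the predictor live in two different data frames, and `lm()` fits the model anyway.

```r
set.seed(42)
d1 <- data.frame(x = rnorm(20))                           # x lives in one data frame...
d2 <- data.frame(y = 3 + 2 * d1$x + rnorm(20, sd = 0.1))  # ...y in another
# lm() happily fits a model across the two -- but it is up to you to be
# sure that row i of d1 really corresponds to row i of d2.
fit <- lm(d2$y ~ d1$x)
coef(fit)  # close to the true intercept 3 and slope 2
```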

**James: How do you find teaching online compared to classroom courses?**

Bob: I teach my workshops in-person at The University of Tennessee and I’ve taught at the American Statistical Association’s Joint Statistical Meeting as well as the UseR! Conference. Teaching “live” is great fun, and being able to see the participants’ expressions is helpful in adjusting the presentation pace and knowing when to stop and ask for questions.

However, live workshops have major drawbacks. Travel costs can easily exceed the fee for a workshop, but worse, minimizing those expenses means cramming too much material into a short timeframe. That’s why I teach my webinars in half-day stretches skipping a day in between. We break every hour and fifteen minutes so people can relax. On their days off they can catch up on their regular work, review the workshop material, work on the exercises and email me with questions. At the end of a live workshop people are happy but exhausted and they leave quickly. At the end of a webinar-based workshop, they often stay for a long time afterwards asking questions. I stay online as long as it takes to answer them all.

**James: Some people like learning actively, with their hands on the keyboard. Others prefer to focus more on what’s being said and taking notes. How do you handle these styles?**

Bob: This is an excellent question! When I take a workshop myself, I usually prefer hands-on but sometimes I don’t. Each of my workshop attendees receives setup instructions a week early so their computer has the software installed and the files in the right place by the time we start. They’re ready for whichever learning style they prefer.

For hands-on learners, I use a single R program that contains the course notes as programming comments interspersed with executable code. Since the “slides” are right in front of them, they never need to take their eyes off their screens. The examples are designed to be easy to convert to their own projects. They build in a step-by-step fashion, going from simple to more complex to make sure no one gets lost. Participants can run each example as I cover it, and see the results on their own computers.

For people focused more on listening and taking notes, everyone also has a complete set of slides. The slides have the notes that describe each concept, then the code for it, followed by the output. The notes follow a numbering scheme that is used in both the program and the handout. This way, both types of learners stay in sync.

This dual approach has another benefit. It’s very easy to switch from one style to the other at any time. If someone gets tired of typing, or his or her computer malfunctions, switching to the notes is seamless. Conversely, if someone is following the printed notes and wants to switch to running an example, it’s very easy to find.

**James: What motivated you to start writing books?**

Bob: I’ve always enjoyed writing newsletter and journal articles. My books on R started out just as a set of notes that I kept for myself. When I put them online, they started getting thousands of hits and Springer called to ask if I could make it a book. I really didn’t think I had enough information, but it kept growing. The second edition of R for SAS and SPSS Users is 686 pages and I have notes on a few topics that I wish I had added. If I ever find time for a third edition, they’ll be in there.

**James: Thank you Bob for your time!**

If you are looking to learn R and are already familiar with software like SAS, SPSS or Stata, do check out Bob’s upcoming workshops here and here.

Statistical forecasting is a critical component of every modern business, and Rob J Hyndman, Professor of Statistics at Monash University, is an expert in the field. He's the co-author of several books on forecasting, including Forecasting: Principles and Practice, a free on-line book that provides a comprehensive introduction to forecasting methods. You can implement all of the methods in this book using open source R software, and Rob's forecast package. Using these techniques and software, he has solved many important forecasting problems in industry, including:

- Helping the Australian government improve forecasts of pharmaceutical sales, converting a 20% budgeting deficit into a forecast within 1% of actual expenditures,
- Helping fertilizer manufacturers reduce warehousing costs by accurately forecasting sales,
- Helping call centers better schedule call center staff by forecasting the rate of inbound calls, and
- Helping electricity companies forecast long-term electricity demand

If you'd like to learn about statistical forecasting, Rob Hyndman will be giving an on-line course on Statistical Forecasting in R in partnership with Revolution Analytics. Here's a short video describing the course:

The course assumes some knowledge of R and statistics, but assumes no background in statistical forecasting. The course consists of 2 online 1-hour sessions per week over a six-week period beginning on October 21. For more details about the course and to register, follow the link below.

Rob J Hyndman: Online course on forecasting using R

*The most recent edition of the Revolution Newsletter is now available. In case you missed it, the news section is below, and you can read the full September edition (with highlights from this blog and community events) online. You can subscribe to the Revolution Newsletter to get it monthly via email.*

September 19th is International Talk Like a Pirate Day! In honor of a day where we get to wander around saying, "Rrrrr!" we're hosting a photo contest for you to share with us your best Jolly Roger/Regina pose. You can tweet your photo with the hashtag #RPirate2013 and tag us @RevolutionR. We will choose 10 winners who will receive a pirate bandana, eyepatch, and an I <3 (Love) R t-shirt! Show us the cut of your jib!

**The Modern Data Architecture for Predictive Analytics with Revolution Analytics and Hortonworks Data Platform**
*September 24th, 09:00 AM - 10:00 AM PDT*

Hortonworks and Revolution Analytics have teamed up to bring the predictive analytics power of R to Hortonworks Data Platform. Hadoop, being a disruptive data processing framework, has made a large impact in the data ecosystems of today. Enabling business users to translate existing skills to Hadoop is necessary to encourage the adoption and allow businesses to get value out of their Hadoop investment quickly. R, being a prolific and rapidly growing data analysis language, now has a place in the Hadoop ecosystem. We will cover:

- Trends and business drivers for Hadoop
- How Hortonworks and Revolution Analytics play a role in the modern data architecture
- How you can run R natively in Hortonworks Data Platform to simply move your R-powered analytics to Hadoop

Did you miss Mario Inchiosa's webinar: High Performance Predictive Analytics in R and Hadoop last month? The recording is now available!

- **03Oct13** - R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers
- **10Oct13** - Using Time to Event Models for Prediction and Inference, with DataSong CEO, John Wallace

- Revolution R Enterprise was recognized with DataWeek's Top Innovator award for Data Visualization.
- Revolution Analytics has teamed with Cloudera to bring statistical models into Hadoop clusters from R
- Symphony Analytics partners with Revolution Analytics to develop solutions in healthcare, retail and telecommunications.
- PULSATE Announces Big Data Collaboration with Dell, Intel and Revolution Analytics
- Revolution Analytics CTO Greg Todd on The Value of Data: All You Really Need to Know You Learned in Kindergarten
- Should you be using R? 5 Key Considerations When Choosing Open Source Statistics Software
- Are you ready for Big Data Analytics? This Slidecast from Revolution Analytics’ Bill Jacobs gives a preview of the forthcoming in-Hadoop R capabilities Revolution R Enterprise 7

Are your R programming skills a little flabby? Sign up for our new Big Data Analytics Boot Camp!

Don’t miss Allan Englehardt’s Using Advanced Analytics in the Insurance Industry this month, September 23rd - 27th!

**New Courses have been added to our ONLINE COURSE CALENDAR!**

Advanced analytics for marketing applications, forecasting and Hadoop! Check out our dates and times and register soon to take advantage of our early bird discount.

**Featured Archive Video:** Actuarial Analytics in R, and the 5-minute version, too!

*Revolution Analytics is proud to sponsor and attend the following events.*

DataWeek 2013 - Sept 28 - Oct 3rd, 2013 - San Francisco, CA

DataWeek 2013 Conference + Festival is the second-annual 6-day, data-centric event in downtown San Francisco. Each day of DataWeek has a different business and technology focus ranging from the Saturday / Sunday hackathon activities, the Monday / Tuesday technical workshops, and the Wednesday / Thursday conference & expo. Revolution Analytics will be hosting 2 "R Workshops," on Oct 1st and 3rd as well as an R Meet-up at 6:00 PM Oct 2nd.

Customer Insights and Analytics in Banking Summit - Oct 1-2, 2013 - Atlanta, GA

The Customer Insights and Analytics Summit will focus in on winning strategies banks can employ to optimize the value of every customer. Revolution Analytics Chief Strategy Officer and VP of Product Marketing & Management, Michele Chambers, will be presenting “Apply Predictive Analytics to Web Data to Generate Business Critical Insights” in conjunction with Capital One.

Teradata Partner Conference & Expo - Oct 20-24, 2013 - Dallas, TX

Empower yourself with the knowledge to do more with data than you ever thought possible. Learn from the brightest executives, business analysts, data scientists and technologists as they showcase innovation and world-class methodologies for data integration, big data analytics, operations, integrated marketing management and more! Revolution Analytics will have 2 speakers at this year’s event: Bill Jacobs, Director of Product Marketing, and Lee Edlefsen, Chief Scientist.

Strata Conference + Hadoop World - Oct 28-30, 2013 - New York, NY

Strata + Hadoop World is where big data's most influential decision makers, architects, developers, and analysts gather to shape the future of their businesses and technologies. Since joining forces last year, Strata + Hadoop World is also one of the largest gatherings of the Apache Hadoop community in the world, with emphasis on hands-on and business sessions on the Hadoop ecosystem. Revolution Analytics is pleased to offer you a 20% discount on registration fees to attend this year’s event. Please use the following code: REVOLUTION20.

The massively-online open course (MOOC) platform Coursera has already delivered two essential free courses for anyone who wants to learn the R language. *Computing for Data Analysis*, presented by Roger Peng, covers the basics of R programming. The follow-up course *Data Analysis*, presented by Jeff Leek, covers statistical modeling and data visualization with R.

If you missed these courses, one way to catch up is to watch the course videos; we've posted YouTube playlists for Computing for Data Analysis videos and Data Analysis videos here on the blog. But if you want to participate in the full Coursera experience with paced online lectures, quizzes, data analysis assignments, and interaction with more than 100,000 fellow students, the courses will be running again soon. Computing for Data Analysis starts again on September 23 and runs for 4 weeks; Data Analysis starts on October 28 and runs for 8 weeks. Follow the link below for introductory videos for each course, and links to register at Coursera.

Simply Statistics: The return of the stat - Computing for Data Analysis & Data Analysis back on Coursera!

*by Joseph Rickert*

I was recently looking through upcoming Coursera offerings and came across the course *Coding the Matrix: Linear Algebra through Computer Science Applications* taught by Philip Klein from Brown University. This looks like a fine course; but why use Python to teach linear algebra? I suppose this is a blind spot of mine: MATLAB I can see. That software has a long tradition of being used in applied mathematics and engineering applications. The Linear Algebra course from MIT open courseware is based on MATLAB and half the linear algebra books published by SIAM use MATLAB.

I expected MATLAB, and Python seems like a stretch, but where do we stand in the R world vis-a-vis linear algebra? Well, from a pedagogical point of view there does not appear to be much material “out there” that specifically relates to teaching linear algebra with R. It seems that not much has changed since this 2009 post. A Google search yields some nice, short documents (Hyde, Højsgaard, Petris, and Carstensen) that look like the online residue from efforts to teach linear algebra using R. And, although most introductory R books have some material devoted to linear algebra (e.g. the extended Markov chain example in *The Art of R Programming*), one would be hard pressed to find a book entirely devoted to teaching linear algebra with R. *Hands-On Matrix Algebra Using R: Active and Motivated Learning with Applications* by Hrishikesh D. Vinod is the exception. (This looks like a gem waiting to be discovered.)

My guess, however, is that we will see much more introductory R material focused on linear algebra as scientists and engineers with computational needs outside of statistics proper discover R. R is, after all, well suited for doing the matrix computations associated with linear algebra, and here are some reasons:

(1) As a language designed for doing computational statistics, R is built on an efficient foundation of well-tested and trusted linear algebra code. From the very beginning, R was good at linear algebra. The vignette *2nd Introduction to the Matrix package* lays out a little of the history of R’s linear algebra underpinnings:

Initially the numerical linear algebra functions in R called underlying Fortran routines from the Linpack (Dongarra et al., 1979) and Eispack (Smith et al., 1976) libraries but over the years most of these functions have been switched to use routines from the Lapack (Anderson et al., 1999) library which is the state-of-the-art implementation of numerical dense linear algebra.

(For example, the base R functions for computing eigenvalues, eigen(), Cholesky decompositions, chol(), and singular value decompositions svd() all use LAPACK or LINPACK code.)

(2) R’s notation, indexing, and operators are very close to the matrix notation mathematicians normally use to express linear algebra, and there are basic R functions for several matrix operations (see Quick R for a summary).

(3) The way R functions operate on whole objects (vectors, matrices, arrays, etc.) models very closely the conceptual process of manipulating matrices as single entities.

(4) The seamless interplay between the data frame and matrix data structures in R makes it easy to populate matrices from the appropriate columns in heterogeneous data sets.

(5) R is an extensible language and there is a considerable amount of work being done in the R community to "go deep", to "go sparse" and to "go big".
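Points (2) and (3) can be seen in a few lines of base R; the small matrix below is made up for illustration.

```r
# Base R matrix notation tracks the math closely:
A <- matrix(c(2, 1, 1, 3), nrow = 2)  # a 2 x 2 symmetric matrix
b <- c(1, 2)
t(A)              # transpose
A %*% A           # matrix product
solve(A)          # inverse
x <- solve(A, b)  # solve the linear system A x = b
A %*% x           # recovers b (up to floating point)
eigen(A)$values   # eigenvalues, computed by LAPACK under the hood
```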

**Going Deep**

“Going Deep” means making it easy to access the computational resources that may be necessary for building production level applications. To this end, the Rcpp package makes it easy to write R functions that call C++ code to do the heavy lifting. Moreover, the RcppArmadillo and RcppEigen packages provide direct and efficient access to the C++ Armadillo and Eigen libraries for doing linear algebra.

**Going Sparse**

Modern statistical applications on even moderately large data sets often produce sparse matrices. One way to work with them in R is via Matrix, a “recommended” package that provides S4 classes for both dense and sparse matrices that extend R’s basic matrix data type. Methods for R functions that work on Matrix objects provide access to efficient linear algebra libraries including BLAS, LAPACK, CHOLMOD (including AMD and COLAMD), and CSparse. (Getting oriented in the world of linear algebra software is not an easy task; I found this chart from an authoritative source helpful.)

The following code, which provides a very first look at the Matrix package, shows a couple of notable features: (1) the Matrix() function evaluates a matrix to determine its class, and (2) once the Cholesky factorization is computed it automatically becomes part of the matrix object.

```r
# A very first look at the Matrix package
library(Matrix)
set.seed(1)
m <- 10; n <- 5                  # Set dimensions of the matrix
A <- matrix(runif(m*n), m, n)    # Construct a random matrix
B <- crossprod(A)                # Make a positive-definite symmetric matrix
A; B                             # Have a look

A.M <- Matrix(A)                 # Matrix() recognizes A as a dense matrix and
class(A.M)                       # automatically assigns A.M to class dgeMatrix
## [1] "dgeMatrix"

B.M <- Matrix(B)                 # Matrix() recognizes B as a symmetric, dense matrix
class(B.M)                       # and automatically assigns B.M to class dsyMatrix
## [1] "dsyMatrix"

chol(B.M)                        # Compute the Cholesky decomposition of B.M
str(B.M)                         # The Cholesky decomposition is now part of the B.M object
## Formal class 'dsyMatrix' [package "Matrix"] with 5 slots
##   ..@ x       : num [1:25] 3.94 3.09 2.05 2.87 3.07 ...
##   ..@ Dim     : int [1:2] 5 5
##   ..@ Dimnames:List of 2
##   .. ..$ : NULL
##   .. ..$ : NULL
##   ..@ uplo    : chr "U"
##   ..@ factors :List of 1
##   .. ..$ Cholesky:Formal class 'Cholesky' [package "Matrix"] with 5 slots
##   .. .. .. ..@ x       : num [1:25] 1.98 0 0 0 0 ...
##   .. .. .. ..@ Dim     : int [1:2] 5 5
##   .. .. .. ..@ Dimnames:List of 2
##   .. .. .. .. ..$ : NULL
##   .. .. .. .. ..$ : NULL
##   .. .. .. ..@ uplo    : chr "U"
##   .. .. .. ..@ diag    : chr "N"

B.M@factors                      # Access the factorization stored in B.M
## $Cholesky
## 5 x 5 Matrix of class "Cholesky"
##           [,1]      [,2]      [,3]      [,4]      [,5]
## [1,] 1.9845453 1.5548529 1.0323413 1.4475384 1.5451499
## [2,]         . 1.1677451 0.4304767 0.5193746 0.6326812
## [3,]         .         . 1.1606778 0.4354018 0.9637288
## [4,]         .         .         . 0.8844295 0.1572823
## [5,]         .         .         .         . 0.646782
```


Other contributed sparse matrix packages are SparseM, slam and spam. For more practical information, see the short tutorial from John Myles White and the *Sparse Models Vignette* by Martin Maechler.
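Here is a quick sparse-matrix sketch with the Matrix package itself (the values are invented for illustration): build a sparse matrix from triplets and solve a linear system without ever forming the dense matrix.

```r
library(Matrix)

# Triplet form: entry (i[k], j[k]) has value x[k]
i <- c(1, 2, 3, 1)
j <- c(1, 2, 3, 3)
x <- c(4, 5, 6, 1)
S <- sparseMatrix(i = i, j = j, x = x)  # 3 x 3 sparse matrix (class dgCMatrix)
S                                       # zeros print as "."

b <- c(1, 2, 3)
solve(S, b)                             # sparse solve of S %*% x = b
```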

**Going Big**

R has had the capability to work with big matrices for some time, and this continues to be an area of active development. The R packages bigmemory and biganalytics provide structures for working with matrices that are too large to fit into memory, and bigalgebra contains functions for doing linear algebra with bigmemory structures. The scidb package enables R users to do matrix computations on huge arrays stored in a SciDB database, and Revolution Analytics’ commercial distribution of R, Revolution R Enterprise (RRE), makes high-performance matrix calculations readily available from R. Because RRE is compiled with the Intel Math Kernel Library, most common R functions based on linear algebra calculations automatically get a significant performance boost. However, the real linear algebra benefit RRE provides comes from the ability to compute with very large matrices in seconds and seamlessly integrate the results into an R workflow. A post from 2011 shows the code for doing a principal components analysis on 50 years of stock data with over 9 million observations and 2,800 stocks. The function rxCor() took only 0.57 seconds on my 8GB, 2-core laptop to compute the correlation matrix.
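The same workflow can be sketched in miniature with base R (the returns here are simulated, and cor() stands in for rxCor(), which performs the same computation at much larger scale): compute a correlation matrix and feed it into a principal components analysis.

```r
# Simulated daily returns for 10 "stocks" over 1000 days
set.seed(42)
returns <- matrix(rnorm(1000 * 10), ncol = 10)

C   <- cor(returns)           # correlation matrix (rxCor() at big-data scale)
pca <- princomp(covmat = C)   # PCA driven by the correlation matrix
summary(pca)                  # variance explained by each component
```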

Of course, these three categories are not mutually exclusive. In this short video, Bryan Lewis shows how the winners of the Netflix Prize used R to go deep, go big and go sparse.

*By Revolution Analytics senior program manager James Peruvankal*

Big Data is driving immediate changes in the insurance industry whose long-term effects will extend well beyond the industry itself. Reacting to pressure from competitors and shareholders, insurance companies around the world are working to improve their analytical capabilities to deliver innovative and personalized products, both to acquire new customers and to ensure that the profitability of all their customers remains high.

Big Data allows companies like Tesco to supply retail loyalty card data from their stores to their insurance business. This data helps make pricing models more accurate across demographics and geographies. Large-scale data mining and machine learning techniques, developed for managing the loyalty card program, deliver insights that the insurance arm can use to improve its price and risk modeling, improving profitability while maintaining appropriate loss reserves. Big Data operations like this help organizations fight off industry newcomers and other competitors who fail to innovate with their data sources. Traditional insurers need to respond to the challenges posed by these innovative newcomers, or they risk going the way of the small local supermarket or independent bookstore (remember them?).

In the US, home insurers, especially in the wake of the 2008 financial crisis, face significant challenges in maintaining profitability. Pricing models are often inadequate, and there are significant regulatory constraints. For one insurer, a combination of data visualization and multiple machine learning models was used to develop a new pricing model that identifies seriously underpriced policies. They found that 5% of the poorly priced policies contributed 14% of the loss ratio. The new pricing model helped restore their profitability.

To learn more about how Big Data is impacting the use of advanced analytics in the insurance industry, you may want to subscribe to our online course, delivered by Allan Englehardt, this September 23rd–27th. You can enroll through the link below.

Revolution Analytics Training: Using Advanced Analytics in the Insurance Industry