*by Bob Horton, Data Scientist, Revolution Analytics*

From electronic medical records to genomic sequences, the data deluge is affecting all aspects of health care. The Master of Science in Health Informatics (MSHI) program at the University of San Francisco, now in its second year, is designed to help students develop the practical computing skills and quantitative perspicacity they need to manage and exploit this wealth of data in health care applications.

This spring, I am privileged to participate in this effort by developing and teaching a new course, “Statistical Computing for Biomedical Data Analytics”, intended to motivate and prepare students for further studies in data science, such as the intensive summer courses of the MSAN bootcamp. The syllabus is on github.

As you’ve probably guessed, we will be using R. Other courses in the curriculum use Python, which seems to be favored by engineers; in contrast, R was developed by and for statisticians. We want the students to be exposed to both perspectives, and to have the technical background needed to make use of the extensive repositories of code available from CRAN and Bioconductor.

Data science is an interdisciplinary endeavor born of the synergy between computing, statistics, data management, and visualization. This can make it challenging to get started, because you have to know so many things before you get to the good stuff. We’re going to try to ease into it by starting with computational explorations of mathematical and statistical concepts. R is a fantastic environment for this; you can see a bell-shaped curve emerge from an example as simple as

`plot(0:20, choose(n=20, k=0:20))`

Note the expressive power of the vector of k values, and the easy convenience of having a world of statistical functions at your fingertips. Imagine how this little plot would have delighted Sir Francis Galton.
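To make the vectorization explicit, here is a small sketch that turns the same coefficients into probabilities and overlays a normal curve. The variable names and the overlay are illustrative additions, not part of the course materials.

```r
# Binomial coefficients for n = 20, computed for every k in one call
n <- 20
k <- 0:n
coefs <- choose(n, k)

# Dividing by 2^n turns the counts into fair-coin binomial probabilities
probs <- coefs / 2^n

# Plot the probabilities and overlay the normal curve with matching
# mean (n / 2) and standard deviation (sqrt(n) / 2)
plot(k, probs, type = "h", xlab = "k", ylab = "probability")
curve(dnorm(x, mean = n / 2, sd = sqrt(n) / 2), add = TRUE, lty = 2)
```

Because `choose()` accepts the whole vector `k`, no loop is needed; the same values could also be obtained with `dbinom(k, n, 0.5)`.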

Data science is a journey. The enormous breadth of material and the rapid pace of development mean that the most important thing to learn is how to learn more. We’ll explore many fantastic resources for learning data science and R. For example, Coursera has excellent offerings, exemplified by the series of mini-courses from Johns Hopkins; our students will take at least one of these as a course project.

Of course, the R community itself is the biggest and most important resource. One class will be a field trip to a Bay Area useR Group (BARUG) meeting, and the comments in response to this post will be required reading. Ideas or suggestions regarding the syllabus or course materials from the github repository are welcome, as are observations or ruminations on the process of learning data science and R.

Finally, we are very interested in helping our students find outstanding internship opportunities in health-related organizations. Please don’t hesitate to contact me through community@revolutionanalytics.com if you are interested in working with us. Stay tuned for progress reports.

Harvard University is offering a free 5-week on-line course on Statistics and R for the Life Sciences on the edX platform. The course promises you will learn the basics of statistical inference and the basics of using R scripts to conduct reproducible research. You'll just need a background in basic math and programming to follow along and complete homework in the R language.

Because the course is new, I haven't seen any of the content, but the presenters, Rafael Irizarry and Michael Love, are active contributors to the Bioconductor project, so it should be good. The course begins January 19 and registration is open through 27 April at the link below.

by Joseph Rickert

The American Statistical Association (ASA) Undergraduate Guidelines Workgroup recently published the report Curriculum Guidelines for Undergraduate Programs in Statistical Science. Although intended for educators setting up or revamping Stats programs at colleges and universities, this concise, 17-page document should be good reading for anyone who wants to take charge of their own education in learning to "think with data". Whether you are just getting started with your education or you are a working professional contemplating what to learn next to expand your knowledge and update your skills, you should find the ASA report helpful.

The report places good statistical practice firmly on the foundation of the scientific method and locates statistical knowledge and skills squarely in the center of modern data analysis.

However, it is far from being a complacent panegyric to statistics. The ASA report challenges educators to help students see that the "discipline of statistics is more than a collection of unrelated tools" and explicitly calls for an increased emphasis on data science and big league computational skills. Graduates of statistical programs:

should be facile with professional statistical software and other appropriate tools for data exploration, cleaning, validation, analysis, and communication. They should be able to program in a higher-level language, to think algorithmically, to use simulation-based statistical techniques . . . Graduates should be able to manage and manipulate data, including joining data from different sources and formats and restructuring data into a form suitable for analysis.

The expectations for communication skills are particularly noteworthy. The report says:

Graduates should be expected to write clearly, speak fluently, and construct effective visual displays and compelling written summaries. They should demonstrate ability to collaborate in teams and to organize and manage projects.

One could argue about the details of the topics that should be included in an undergraduate program. But clearly, the committee is aiming for far more than producing minimally competent, employable graduates. They are outlining a way of life, a competent way of being in a data-driven world.

Hidden among the white papers listed on the ASA curriculum guidelines page is a treasure: Tim Hesterberg's paper on What Teachers Should Know about the Bootstrap: Resampling in the Undergraduate Curriculum. This is a lucid and fairly deep explication of bootstrapping and resampling techniques that deserves wide circulation. Tim writes that he had three goals in producing the paper: "(1) To show the enormous potential of bootstrapping and permutation tests to help students understand statistical concepts . . . (2) To dig deeper . . . (3) To change statistical practice . . ."

Point (3) may sound astoundingly ambitious. However, it is grounded in a revolution that has been quietly gaining strength and whose time has come. Textbooks that rely on R-based simulations to teach probability (e.g. Baclawski) and statistics (e.g. Matloff) have been available for some time, and Tim points out that undergraduate textbooks such as Chihara and Hesterberg, which use resampling as the fundamental unifying idea, are beginning to appear. Moreover, data scientists outside of the community of academic statisticians are well aware that programming skills more than compensate for a traditional statistics education that presents the subject as a collection of unrelated tests and techniques, as this Strata + Hadoop World presentation from John Rauser makes clear.
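For readers who have not seen resampling in action, here is a minimal bootstrap sketch in base R. It is not taken from Tim's paper; the simulated data, the seed, and the variable names are all assumptions made for illustration.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical sample: 50 draws from a skewed distribution
x <- rexp(50, rate = 1)

# Bootstrap: resample the data with replacement many times and
# recompute the statistic of interest (here, the mean) each time
B <- 2000
boot_means <- replicate(B, mean(sample(x, replace = TRUE)))

# The spread of the resampled means estimates the standard error
boot_se <- sd(boot_means)

# A simple percentile interval for the mean
boot_ci <- quantile(boot_means, c(0.025, 0.975))

# Compare with the usual formula-based standard error
formula_se <- sd(x) / sqrt(length(x))
print(c(bootstrap = boot_se, formula = formula_se))
print(boot_ci)
```

The two standard errors should be close, which is the pedagogical point: the resampling loop reproduces a textbook formula without requiring the student to derive it.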

It is very good, indeed, to see the ASA leading the charge for change.

by Jeremy Reynolds

Senior R Trainer, Revolution Analytics

Last week, Revolution Analytics released its first massive open online course through a partnership with datacamp.com: Introduction to Revolution R Enterprise for Big Data Analytics. You can sign up for the free course here.

This course provides a look at some of the tools provided by the RevoScaleR package that ships with Revolution R Enterprise. The course and the interactive training framework provided by the platform allow you to get a feel for how you can manipulate, visualize, and analyze large datasets with RevoScaleR.

There are four “chapters” within the course:

- Chapter 1 introduces the RevoScaleR package. Within the chapter, we discuss the challenges associated with big data and how the functions and algorithms in RevoScaleR address them. We walk through an example in which we demonstrate the use of several core RevoScaleR functions, and we provide exercises in which you can use RevoScaleR to create your first linear model on big data.
- Chapter 2 provides details for some of the RevoScaleR functions used to explore large datasets. We demonstrate how you can use these functions to summarize, cross-tabulate, and visualize variables in large datasets, and we provide hands-on exercises where you can practice with them.
- Chapter 3 covers the RevoScaleR functions used to manipulate and transform large datasets. We demonstrate how you can use these functions to perform simple and complex transformations. We provide a set of interactive exercises that allow you to practice creating transformed variables and to explore how chunking algorithms impact the ways in which you need to process data.
- Chapter 4 concentrates on the analysis of large data sets. We demonstrate how to use RevoScaleR functions to build statistical and machine learning models on large data sets. Here, we cover linear and logistic regression, k-means clustering, and decision tree estimation. For each kind of analysis, there are hands-on exercises in which you can explore some of the flexibility and power associated with the functions.
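To give a feel for why chunking changes how you process data, here is a rough base R sketch of a chunk-wise computation. It does not use RevoScaleR or reflect how its functions are implemented; the simulated vector, the chunk size, and all names are assumptions for illustration only.

```r
# Sketch of chunk-wise computation in base R. This is NOT RevoScaleR
# code; it only illustrates accumulating summary statistics one chunk
# at a time instead of holding all the data in memory at once.
set.seed(42)
x <- rnorm(1e5, mean = 10, sd = 2)  # stand-in for a column too big for memory

chunk_size <- 10000
starts <- seq(1, length(x), by = chunk_size)

n <- 0
sum_x <- 0
sum_x2 <- 0
for (s in starts) {
  chunk  <- x[s:min(s + chunk_size - 1, length(x))]
  n      <- n + length(chunk)       # running count
  sum_x  <- sum_x + sum(chunk)      # running sum
  sum_x2 <- sum_x2 + sum(chunk^2)   # running sum of squares
}

# Combine the accumulated pieces into the final statistics
chunk_mean <- sum_x / n
chunk_var  <- (sum_x2 - n * chunk_mean^2) / (n - 1)
print(c(mean = chunk_mean, var = chunk_var))
```

Means and variances can be assembled from per-chunk sums like this, whereas a statistic such as the exact median cannot, which is one reason chunked algorithms sometimes require rethinking a transformation or analysis.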

We are very excited about our partnership with datacamp.com — the platform provides a unique, hands-on training environment in which you can practice in a live R environment, and you can “learn by doing.” We are diligently working to create more extensive content, including courses on the Fundamentals of the R Programming Language, Introductory Statistics with Revolution R Enterprise, Predictive Modeling, and Advanced R Programming. Stay tuned for more!

DataCamp: Introduction to Revolution R Enterprise for Big Data Analytics

*By Neera Talbert, VP Services and Ben Wiley, R Programmer at Revolution Analytics*

By now, everyone should be familiar with the data scientist boom. Simply logging onto LinkedIn reveals a seemingly infinite number of people with words and phrases like “Data Scientist”, “Big Data Specialist”, and “Analytics” in their title. A few weeks ago, an article floated around the internet about how R programmers are the highest paid software engineers in industry. But the career of a data scientist is hot not only because it’s highly lucrative; drawing conclusions from data is itself a rewarding process, since these conclusions often shape our future.

As anyone would expect in such an attractive new and emerging field, a lot of people are noticing. So how do you distinguish yourself in a job application to an analytics position? Or, from a company’s perspective, how can you sift through the numerous applications of individuals with analytics backgrounds and choose the one that is best suited to finish the project? One of the tough aspects of the data scientist role is that its definition is extremely broad. A closer inspection of LinkedIn profiles for analytics positions reveals backgrounds in a variety of fields such as applied and computational math, statistics, and computer science. Because they have analytical aspects, even fields like biology or political science appear in these searches.

In other words, having one particular background cannot bar you from being a data scientist. At the same time, however, this in no way implies that the path to becoming a data scientist is easy. In fact, the loosely defined background requirements do more to attract top talent from many fields, rather than attracting only the talent from one. Having expertise in statistics and computer science no doubt helps quite a bit, but sometimes this is not enough to distinguish yourself on an application. There are many popular programming languages used for data analysis, and because these are often new and emerging, it can be difficult to assess one’s understanding of a particular language.

Certifications can be one effective way to convey to an employer that you truly know and understand a program or concept. Revolution Analytics now offers a professional certification that tests the most sought-after R analytics skills for the enterprise, and can be an effective way to assess which applicants possess the necessary background in R and ScaleR programming. With R being the most widely used statistical language today, the Revolution R Enterprise Professional Certification can be a sure way to attract attention in the job market.

Another option for standing out as a data scientist is to attend graduate school. Of course, this path is much longer than obtaining a certification, but the effects can be lucrative as well. While selecting a good data science grad program is a blog post in and of itself, the obviously attractive fields for a data scientist are statistics and computer science. (There are now also a number of graduate programs devoted specifically to data science.) That’s not to say that other fields aren’t good options either – fields like genomics, biology, physics, economics, and others that heavily rely on data can be attractive paths for the prospective data scientist as well. The only concern to consider is again verifying that the skills gained in a grad program reflect industry’s expectations.

Finally, experience also helps. Having multiple years in an analytics position is a great way to convey one’s understanding of data science to employers, and is often a substantial consideration in a company’s evaluation of a candidate. Having a background as a programmer or analyst can be a good way to step into the data science position. Oftentimes a lack of experience is the greatest hurdle to entering the analytics profession, and so not everyone has the background described above.

Despite the difficulty in attracting an employer’s attention, entering the field of data science is totally worth it. With articles about “Big Data” and “Cloud Computing” emerging every day on the internet, being a data scientist no doubt puts you at the edge of modern-day technological development, and gives you the ability to make a substantial contribution to society. Plus there’s the pay…

Revolution Analytics AcademyR: Revolution R Enterprise Professional Certification

by James Peruvankal

There are plenty of options if you want to learn R and are looking for training: your college’s statistics department, massive open online courses like Coursera, Udacity, edX, Datacamp etc. SiliconANGLE recently published an article about top R-training companies.

Let’s talk about how to choose a good R-trainer.

- First and foremost is technical competency in R - In addition to having done a significant amount of R programming, the instructor should have an education in a quantitative field. The idea behind this is that the instructor will have had experience expressing non-trivial ideas in R. However, it is not necessarily the case that the most technically competent person is the best instructor.
- Experience in teaching statistics - Learning R invariably involves working with statistics. So knowing where students can go wrong in understanding statistical concepts is a skill that greatly increases the effectiveness of an R Instructor. This skill only comes with experience. Joan Garfield's 1995 article in the International Statistical Review: How students learn Statistics is an excellent reference on what can go wrong in learning statistics and how to correct it.
- Communication skills - The instructor should have the ability to clearly communicate complex topics in simple examples that students can relate to. We recommend Gelman and Nolan's book: Teaching Statistics: A Bag of Tricks which promotes an activity based approach to teaching.
- Evangelism - Passion generates passion. The enthusiasm of the instructor spreads to the students.
- Teaching style and philosophy - From our experience in teaching and based on decades of research on how people learn, we have come up with our teaching philosophy. The most important factor is that people ‘learn by doing’. Ensure that hands-on learning is where most of the time is spent.

At Revolution Analytics we are guided by the teaching philosophy presented in the following chart:

So, if you are serious about learning R, brush up on your statistics, be prepared to jump right in and start doing things on your own, surround yourself with people who are passionate about statistics and R, and figure out how to make the whole process fun for you. If you are teaching R and want to join us in our mission to ‘take R to the Enterprise’, see if you can fit in with our team.

by Joseph Rickert

useR! 2014 got under way this past Monday with a very impressive array of tutorials, delivered on a day when the conference organizers were struggling to cope with a record-breaking crowd. My guess is that conference attendance is somewhere in the 700 range. Moreover, this is the first year that I can remember that tutorials were free. The combination made for jam-packed tutorials.

The first thing that jumps out just by looking at the tutorial schedule is the effort that RStudio made to teach at the conference. Of the sixteen tutorials given, four were presented by the RStudio team. Winston Chang conducted an introduction to interactive graphics with ggvis; Yihui Xie presented Dynamic Documents with R and knitr; Garrett Grolemund presented Interactive data display with Shiny and R; and Hadley Wickham taught data manipulation with dplyr. Shiny is still captivating R users. I was particularly struck by a conversation I had with an undergraduate Stats major who seemed to be genuinely pleased and excited about being able to build her own Shiny apps. Kudos to the RStudio team.

Bob Muenchen brought this same kind of energy to his introduction to managing data with R. Bob has extensive experience with SAS and Stata and he seems to have a gift for anticipating areas where someone proficient in one of these packages, but new to R, might have difficulty.

Matt Dowle presented his tutorial on data.table and, from the chatter I heard, added to his growing list of data.table converts. Dirk Eddelbuettel presented An Example-Driven, Hands-on Introduction to Rcpp and Romain Francois taught C++ and Rcpp11 for beginners. Both Dirk and Romain work hard to make interfacing to C++ a real option for people who themselves are willing to make an effort.

I came late to Martin Morgan’s tutorial on Bioconductor but couldn’t get in the room. Fortunately, Martin prepared an extensive set of materials for his course which I hope to be able to work through.

Max Kuhn taught an introduction to applied predictive modeling based on the recent book he wrote with Kjell Johnson. Both the slides and code are available. Virgilio Gomez Rubio presented a tutorial on applied spatial data analysis with R. His materials are available here. Ramnath Vaidyanathan presented Interactive Documents with R. Have a look at Ramnath’s slides, the map embedded in his abstract, and his screencast.

Drew Schmidt taught a workshop on Programming with Big Data in R based on the pbdR packages. The course page points to a rich set of introductory resources.

I very much wanted to attend Søren Højsgaard’s tutorial on Graphical Models and Bayesian Networks with R, but couldn’t make it. I did attend John Nash's tutorial on Nonlinear parameter optimization and modeling in R and I am glad that I did. This is a new field for me and I was fortunate to see something of John’s meticulous approach to the subject.

I was disappointed that I didn’t get to attend Thomas Petzoldt’s tutorial on Simulating differential equation models in R. This is an area that is not usually associated with R. The webpage for the tutorial is really worth a look.

I don't know if the conference organizers planned it that way, but as it turned out, the tutorial subjects chosen are an excellent showcase for the depth and diversity of applications that can be approached through R. Many thanks to the tutorial teachers and congratulations to the UseR! 2014 conference organizers for a great start to the week. I hope to have more to say about the conference in future posts.

Revolution Analytics has just introduced a 10-module series of R courses in Singapore. If you'd like to learn how to do data analysis in R, already know data analysis in another language like SAS and want to transition to R, or just want to enhance your R skills in a specific area, one of these hands-on courses may be of interest. The available modules (which you can mix and match) are:

- Module 1: Introduction to R
- Module 2: Taking on Big Data using RevoScaleR
- Module 3: Using R with Hadoop
- Module 4: Implementing Web Services using RevoDeployR
- Module 5: Introduction to Rattle and Alteryx for Data Mining
- Module 6: Visualization in R with ggplot2
- Module 7: R for SAS Users
- Module 8: Introduction to Statistics and Analytics
- Module 9: Advanced Statistical Topics, Data Mining, and Machine Learning
- Module 10: System Administration

The next scheduled course on April 15 is **R for SAS users**, a 2-day R course in Singapore designed for existing SAS users. This quick-start for the R language builds on your existing knowledge of SAS to teach data handling, data visualization, statistical analysis and reporting with R.

If you're not in the Singapore area, you can find other R courses in North America, India, and Europe. And from anywhere in the world you can take Bob Muenchen's course **R for SAS, SPSS and Stata Users online**, April 21-23.

by James Paul Peruvankal, Senior Program Manager at Revolution Analytics

At Revolution Analytics, we are always interested in how people teach and learn R, and what makes R so popular, yet ‘quirky’ to learn. To get some insight from a real pro we interviewed Bob Muenchen. Bob is the author of *R for SAS and SPSS Users* and, with Joseph M. Hilbe, *R for Stata Users*. He is also the creator of r4stats.com, a popular web site devoted to analyzing trends in analytics software and helping people learn the R language. Bob is an Accredited Professional Statistician™ with 30 years of experience and is currently the manager of OIT Research Support (formerly the Statistical Consulting Center) at the University of Tennessee. He has conducted research for a variety of public and private organizations and has assisted on more than 1,000 graduate theses and dissertations. He has written or coauthored over 60 articles published in scientific journals and conference proceedings.

Bob has served on the advisory boards of SAS Institute, SPSS Inc., StatAce OOD, the Statistical Graphics Corporation and PC Week Magazine. His suggested improvements have been incorporated into SAS, SPSS, JMP, STATGRAPHICS and several R packages. His research interests include statistical computing, data graphics and visualization, text analysis, data mining, psychometrics and resampling.

**James: How did you get started teaching people how to use statistical software?**

Bob: When I came to UT in 1979, many people were switching from either FORTRAN or SPSS to SAS. There was quite a lot of demand for SAS training, and I enjoyed teaching the workshops. Back then SAS could save results, like residuals or predicted values, much more easily than SPSS, which drove the switch.

When the Windows version of SPSS came out, people started switching back. The SPSS user interface designer, Sheri Gilley, really understood what ease of use was all about, and the SAS folks didn’t get that until quite recently. I was just as happy teaching the SPSS workshops. However, many SPSS users at UT avoid programming, which I think is a big mistake. Pointing-and-clicking your way through an analysis can be a time-saving way to work, but I always keep the program so I have a record of what I did.

I started teaching R workshops in 2005 and attendance was quite sparse. Now it’s one of our Research Computing Support team’s most popular topics.

**James: Is there anything special about teaching people how to use R, any particular difficulties?**

Bob: In other analytics software, the focus is on variables. It sounds too simple to even bother saying: "Every procedure accepts variables." There are very few ways to specify them, such as by simple name, A, B, C, or lists like A TO Z or A--Z.

Rather than just variables, R has a variety of objects such as vectors, factors and matrices. Some procedures (called functions in R) require particular kinds of objects and there are many more ways to specify which objects to use. From a new user's perspective that may seem like needless complexity. However it provides significant benefits. Once an R user has defined a categorical variable as a factor, analyses will then try to “do the right thing” with that variable. For instance, you could include it in a regression equation and R would create the indicator variables needed to handle a categorical variable automatically.

Another important benefit to R’s object orientation is that it allows a total merger of what would normally be a separate matrix language into the main language of R. This attracts developers, who are helping grow R’s capabilities very rapidly.
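Both points can be seen in a few lines of R. The toy data frame below is hypothetical, and the sketch is only meant to illustrate the behavior Bob describes.

```r
# Hypothetical toy data: a numeric response and a three-level group
d <- data.frame(
  y     = c(5.1, 4.9, 6.2, 6.0, 7.3, 7.1),
  group = factor(c("a", "a", "b", "b", "c", "c"))
)

# Because group is a factor, lm() expands it into indicator columns
fit <- lm(y ~ group, data = d)
print(coef(fit))

# model.matrix() shows the indicator (dummy) variables R built
X <- model.matrix(fit)
print(X)

# The matrix language is part of R itself: solving the normal
# equations by hand gives the same coefficients as lm()
beta <- solve(t(X) %*% X, t(X) %*% d$y)
print(beta)
```

The first level, `a`, becomes the baseline, so the model matrix has an intercept plus one indicator column each for `b` and `c`.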

**James: How do you handle such a broad range of backgrounds in your classes?**

Bob: The workshop participants do come from a very wide range of fields, but they share a common set of knowledge: what a variable is, how to analyze data, and so on. So I save a great deal of time by not having to explain all that. Instead, I redirect it into pointing out where R is likely to surprise them. You can have variables that are *not* in a data set? That’s a bizarre concept to a SAS, SPSS or Stata user. You can have X in one data set and Y in another, but include both in the same regression model? That sounds very strange at first and, of course, it’s quite risky if you’re not careful. I introduce most topics with, “You’re expecting this, but here comes something very different…”. Different doesn’t necessarily mean better, of course. SAS, SPSS and Stata are all top-quality packages and they do some things with less effort. I love R, but I like to point out where I think the others do a better job.
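A short sketch of the surprise Bob mentions, using made-up data frames and a free-standing vector (all names here are hypothetical):

```r
# Hypothetical data: x lives in one data frame, y in another
set.seed(7)
d1 <- data.frame(x = 1:10)
d2 <- data.frame(y = 2 * (1:10) + rnorm(10, sd = 0.1))

# A free-standing vector that belongs to no data frame at all
z <- rep(c(0, 1), times = 5)

# R accepts all three in one model. Rows are matched purely by
# position, which is the risk: nothing checks that row i of d1
# describes the same case as row i of d2
fit <- lm(d2$y ~ d1$x + z)
print(coef(fit))

# A safer habit: combine everything into one data frame first
d <- data.frame(x = d1$x, y = d2$y, z = z)
fit_safe <- lm(y ~ x + z, data = d)
print(coef(fit_safe))
```

Both fits give the same estimates here only because the rows happen to line up; if one data frame were sorted differently, the first model would still run without complaint and quietly produce the wrong answer.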

**James: How do you find teaching online compared to classroom courses?**

Bob: I teach my workshops in-person at The University of Tennessee and I’ve taught at the American Statistical Association’s Joint Statistical Meeting as well as the UseR! Conference. Teaching “live” is great fun, and being able to see the participants’ expressions is helpful in adjusting the presentation pace and knowing when to stop and ask for questions.

However, live workshops have major drawbacks. Travel costs can easily exceed the fee for a workshop, but worse, minimizing those expenses means cramming too much material into a short timeframe. That’s why I teach my webinars in half-day stretches skipping a day in between. We break every hour and fifteen minutes so people can relax. On their days off they can catch up on their regular work, review the workshop material, work on the exercises and email me with questions. At the end of a live workshop people are happy but exhausted and they leave quickly. At the end of a webinar-based workshop, they often stay for a long time afterwards asking questions. I stay online as long as it takes to answer them all.

**James: Some people like learning actively, with their hands on the keyboard. Others prefer to focus more on what’s being said and taking notes. How do you handle these styles?**

Bob: This is an excellent question! When I take a workshop myself, I usually prefer hands-on but sometimes I don’t. Each of my workshop attendees receives setup instructions a week early so their computer has the software installed and the files in the right place by the time we start. They’re ready for whichever learning style they prefer.

For hands-on learners, I use a single R program that contains the course notes as programming comments interspersed with executable code. Since the “slides” are right in front of them, they never need to take their eyes off their screens. The examples are designed to be easy to convert to their own projects. They build in a step-by-step fashion, going from simple to more complex to make sure no one gets lost. Participants can run each example as I cover it, and see the results on their own computers.

For people focused more on listening and taking notes, everyone also has a complete set of slides. The slides have the notes that describe each concept, then the code for it, followed by the output. The notes follow a numbering scheme that is used in both the program and the handout. This way, both types of learners stay in sync.

This dual approach has another benefit. It’s very easy to switch from one style to the other at any time. If someone gets tired of typing, or his or her computer malfunctions, switching to the notes is seamless. Conversely, if someone is following the printed notes and wants to switch to running an example, it’s very easy to find.

**James: What motivated you to start writing books?**

Bob: I’ve always enjoyed writing newsletter and journal articles. My books on R started out just as a set of notes that I kept for myself. When I put them online, they started getting thousands of hits and Springer called to ask if I could make it a book. I really didn’t think I had enough information, but it kept growing. The second edition of R for SAS and SPSS Users is 686 pages and I have notes on a few topics that I wish I had added. If I ever find time for a third edition, they’ll be in there.

**James: Thank you Bob for your time!**

If you are looking to learn R and are already familiar with software like SAS, SPSS or Stata, do check out Bob’s upcoming workshops here and here.

Statistical forecasting is a critical component of every modern business, and Rob J Hyndman, Professor of Statistics at Monash University, is an expert in the field. He's the co-author of several books on forecasting, including Forecasting: Principles and Practice, a free on-line book that provides a comprehensive introduction to forecasting methods. You can implement all of the methods in this book using open source R software, and Rob's forecast package. Using these techniques and software, he has solved many important forecasting problems in industry, including:

- Helping the Australian government improve forecasts of pharmaceutical sales, converting a 20% budgeting deficit into a forecast within 1% of actual expenditures,
- Helping fertilizer manufacturers reduce warehousing costs by accurately forecasting sales,
- Helping call centers better schedule call center staff by forecasting the rate of inbound calls, and
- Helping electricity companies forecast long-term electricity demand
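As a rough illustration of what a forecasting workflow looks like in R, here is a sketch that uses only base R: the built-in `AirPassengers` series and `stats::HoltWinters`. It deliberately does not use the forecast package or reproduce any of the projects listed above; the train/test split and variable names are assumptions for illustration.

```r
# Base R sketch using the built-in AirPassengers series (monthly
# totals, 1949-1960). This uses stats::HoltWinters, not the
# forecast package.
train <- window(AirPassengers, end = c(1959, 12))
test  <- window(AirPassengers, start = c(1960, 1))

# Exponential smoothing with trend and multiplicative seasonality
fit <- HoltWinters(train, seasonal = "multiplicative")

# Forecast the 12 held-out months
fc <- as.numeric(predict(fit, n.ahead = 12))

# Mean absolute percentage error on the held-out year
actual <- as.numeric(test)
mape <- mean(abs((actual - fc) / actual)) * 100
print(round(mape, 1))
```

Holding out the final year and scoring the forecast against it is a simple way to check a method before trusting it with real budgeting or staffing decisions.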

If you'd like to learn about statistical forecasting, Rob Hyndman will be giving an on-line course on Statistical Forecasting in R in partnership with Revolution Analytics. Here's a short video describing the course:

The course assumes some knowledge of R and statistics, but no background in statistical forecasting. The course consists of two online 1-hour sessions per week over a six-week period beginning on October 21. For more details about the course and to register, follow the link below.

Rob J Hyndman: Online course on forecasting using R