by Joseph Rickert

The Joint Statistical Meetings (JSM) get underway this weekend in Boston and Revolution Analytics is again proud to be a sponsor. More than 6,000 statisticians and data scientists from around the world are expected to attend and listen to thousands of presentations. It is true that many talks will be on specialized topics that only statisticians working in a particular field will have the interest and patience to sit through. However, there is evidence that the conference will have something exciting to offer data scientists and statisticians working in industry. Keyword searches yield 79 presentations for Big Data, 29 on Machine Learning, 17 on Data Science, 17 on Data Mining and 19 related to R. There is more than enough here to fill a data scientist’s dance card.

Three must-see presentations under the Big Data keyword are: Michael Franklin's presentation on Analyzing Data at Scale with the Berkeley Data Analytics Stack; Hui Jiang et al. on Implementation of Statistical Algorithms in Big Data Platforms; and Tim Hesterberg's talk on Simulation-Based Methods in Statistics Education, and Google Tools. Under the Data Science label, Bill Ruh’s invited talk Industrial Internet, an Opportunity for Statisticians to Become Data Scientists looks most inviting. There are also quite a few Data Science talks that indicate some soul searching within the academic community as to how the statistics curriculum ought to be changed. See, for example, Michael Rappa’s talk on Data Scientists: How Do We Prepare for the Future? and Johanna Hardin’s talk: Data Science and Statistics: How Should They Fit into Our Curriculum?

Here is the list of R related presentations:

**Saturday, August 2**

- 8:00 AM - 12:00 PM: Adaptive Tests of Significance Using R and SAS — Professional Development Continuing Education Course ASA Instructor: Tom O'Gorman

**Sunday, August 3**

- 8:30 AM - 5:00 PM: Adaptive Methods in Modern Clinical Trials — Professional Development Continuing Education Course ASA , Biometrics Section Instructors: Frank Bretz, Byron Jones, and Guosheng Yin
- 4:20 PM: Glassbox: An R Package for Visualizing Algorithmic Models: Max Ghenis and Ben Ogorek and Estevan Flores
- 4:45 PM: Bayesian Enrollment and Event Predictions in Clinical Trials Leveraging Literature Data: Aijun Gao and Fanni Natanegara and Govinda Weerakkody

**Monday, August 4**

- 8:55 AM: Thinking with Data in the Second Course: Nicholas J. Horton and Ben S. Baumer and Hadley Wickham
- 8:30 AM to 10:20 AM: Do You See What I See? Formal Usability Testing and Statistical Graphics: Marie C. Vendettuoli and Matthew Williams and Susan Ruth VanderPlas
- 8:35 AM: Preparing Students for Big Data Using R and RStudio: Randall Pruim
- 8:35 AM: Does R Provide What Customer Need?: Vipin Arora
- 8:55 AM: Doing Reproducible Research Unconsciously: Higher Standard, but Less Work: Yihui Xie
- 12:30 PM to 1:50 PM: Analyzing Umpire Performance Using PITCHf/x: Andrew Swift
- 3:30 PM: The Perfect Bracket: Machine Learning in NCAA Basketball: Sara Stoudt and Loren Santana and Ben S. Baumer

**Tuesday, August 5**

- 10:35 AM: Tools for Teaching R and Statistics Using Games: Brad Luen and Michael Higgins
- 2:00 PM: Multiple Treatment Groups: A Case Study with Health Care Practice and Policy Implications: Alexandra Hanlon and Karen Hirschman and Beth Ann Griffin and Mary Naylor
- 2:05 PM: glmmplus: An R Package for Messy Longitudinal Data: Ben Ogorek and Caitlin Hogan
- 3:30 PM: Give Me an Old Computer, a Blank DVD, and an Internet Connection and I'll Give You World-Class Analytics: Ty Henkaline

**Wednesday, August 6**

- 9:35 AM: Testing Packages for the R Language: Stephen Kaluzny and Lou Bajuk-Yorgan
- 9:50 AM: Using R Analytics on Streaming Data: Lou Bajuk-Yorgan and Stephen Kaluzny
- 10:35 AM: Shiny: Easy Web Applications in R: Joseph Cheng
- 10:30 AM to 12:20 PM: Classroom Demonstrations of Big Data: Eric A. Suess
- 11:00 AM: ggvis: Moving Toward a Grammar of Interactive Graphics: Hadley Wickham
- 3:05 PM: Accessing Data from the Census Bureau API: Alex Shum and Heike Hofmann

**Thursday, August 7**

- 9:20 AM: Predicting Dangerous E. Coli Levels at Erie, Pennsylvania, Beaches with Random Forests in R: Michael Rutter
- 9:25 AM: Beyond the Black Box: Flexible Programming of Hierarchical Modeling Algorithms for BUGS-Compatible Models Using NIMBLE: Perry de Valpine and Daniel Turek and Christopher J. Paciorek and Rastislav Bodik and Duncan Temple Lang

If you are going to JSM please come by booth #303 to say hello. You may also find the mobile apps (Apple or Android) that Revolution Analytics is sponsoring useful, and don't forget to fill out the survey for a chance to win an Apple TV.

Finally, I will be the program chair for Session 401, Monte Carlo Methods to be held Tuesday, 8/5/2014, from 2:00 PM to 3:50 PM in room CC-101. If you are interested in simulation be sure to drop in. I have seen the presentations and think they are well worth attending.

by Joseph Rickert

If I had to pick just one application to be the “killer app” for the digital computer I would probably choose Agent Based Modeling (ABM). Imagine creating a world populated with hundreds, or even thousands of agents, interacting with each other and with the environment according to their own simple rules. What kinds of patterns and behaviors would emerge if you just let the simulation run? Could you guess a set of rules that would mimic some part of the real world? This dream is probably much older than the digital computer, but according to Jan Thiele’s brief account of the history of ABMs that begins his recent paper, *R Marries NetLogo: Introduction to the RNetLogo Package* in the *Journal of Statistical Software,* academic work with ABMs didn’t really take off until the late 1990s.

Now, people are using ABMs for serious studies in economics, sociology, ecology, socio-psychology, anthropology, marketing and many other fields. No less of a complexity scientist than Doyne Farmer (of Dynamic Systems and Prediction Company fame) has argued in *Nature* for using ABMs to model the complexity of the US economy, and has published on using ABMs to drive investment models. In the following clip of a 2006 interview, Doyne talks about building ABMs to explain the role of subprime mortgages in the Housing Crisis. (Note that when asked about how one would calibrate such a model Doyne explains the need to collect massive amounts of data on individuals.)

Fortunately, the tools for building ABMs seem to be keeping pace with the ambition of the modelers. There are now dozens of platforms for building ABMs, and it is somewhat surprising that NetLogo, a tool with some whimsical terminology (e.g. agents are called turtles) that was designed for teaching children, has apparently become a de facto standard. NetLogo is Java based, has an intuitive GUI, ships with dozens of useful sample models, is easy to program, and is available under the GPL 2 license.

As you might expect, R is a perfect complement for NetLogo. Doing serious simulation work requires a considerable amount of statistics for calibrating models, designing experiments, performing sensitivity analyses, reducing data, exploring the results of simulation runs and much more. The recent *JASSS* paper *Facilitating Parameter Estimation and Sensitivity Analysis of Agent-Based Models: a Cookbook Using NetLogo and R* by Thiele and his collaborators describes the R / NetLogo relationship in great detail and points to a decade’s worth of reading. But the real fun is that Thiele’s RNetLogo package lets you jump in and start analyzing NetLogo models in a matter of minutes.

Here is part of an extended example from Thiele's *JSS* paper that shows R interacting with the Fire model that ships with NetLogo. Using some very simple logic, Fire models the progress of a forest fire.

Snippet of NetLogo Code that drives the Fire model

    to go
      if not any? turtles  ;; either fires or embers
        [ stop ]
      ask fires
        [ ask neighbors4 with [pcolor = green]
            [ ignite ]
          set breed embers ]
      fade-embers
      tick
    end

    ;; creates the fire turtles
    to ignite  ;; patch procedure
      sprout-fires 1
        [ set color red ]
      set pcolor black
      set burned-trees burned-trees + 1
    end

The general idea is that turtles representing the frontier of the fire run through a grid of randomly placed trees. Not shown in the above snippet is the logic showing that the entire model is controlled by a single parameter representing the density of the trees.

This next bit of R code shows how to launch the Fire model from R, set the density parameter, and run the model.

    # Launch RNetLogo and control an initial run of the
    # NetLogo Fire Model
    library(RNetLogo)
    nlDir <- "C:/Program Files (x86)/NetLogo 5.0.5"
    setwd(nlDir)
    nl.path <- getwd()
    NLStart(nl.path)
    model.path <- file.path("models", "Sample Models", "Earth Science", "Fire.nlogo")
    NLLoadModel(file.path(nl.path, model.path))
    NLCommand("set density 70")  # set density value
    NLCommand("setup")           # call the setup routine
    NLCommand("go")              # launch the model from R

Here we see the Fire model running in the NetLogo GUI after it was launched from RStudio.

This next bit of code tracks the progression of the fire as a function of time (model "ticks"), returns results to R and plots them. The plot shows the non-linear behavior of the system.

    # Investigate percentage of forest burned as simulation proceeds and plot
    library(ggplot2)
    NLCommand("set density 60")
    NLCommand("setup")
    burned <- NLDoReportWhile("any? turtles", "go",
                              c("ticks", "(burned-trees / initial-trees) * 100"),
                              as.data.frame = TRUE,
                              df.col.names = c("tick", "percent.burned"))
    # Plot with ggplot2
    p <- ggplot(burned, aes(x = tick, y = percent.burned))
    p + geom_line() + ggtitle("Non-linear forest fire progression with density = 60")

As with many dynamical systems, the Fire model displays a phase transition. Setting the density lower than 55 will not result in the complete destruction of the forest, while setting density above 75 will very likely result in complete destruction. The following plot shows this behavior.

RNetLogo makes it very easy to programmatically run multiple simulations and capture the results for analysis in R. The following two lines of code run the Fire model twenty times for each value of density between 55 and 65, the region surrounding the phase transition.

    d <- seq(55, 65, 1)    # vector of densities to examine
    res <- rep.sim(d, 20)  # Run the simulation
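The `rep.sim` helper comes from Thiele's paper and is not reproduced in this post. As a rough sketch of the shape such a helper might take, the base R code below swaps the NetLogo calls for a stand-in simulator. Both `sim_fire` and the `sim` argument are hypothetical names introduced here for illustration. They are not part of RNetLogo, and the stand-in only imitates the Fire model's rules (trees placed at random, fire starting on one edge and spreading to the four nearest neighbors).

```r
# Stand-in for the NetLogo Fire model (hypothetical, not part of RNetLogo).
# Each cell holds a tree with probability density/100. Trees in the first
# column are set alight and the fire spreads to the four nearest neighbours.
sim_fire <- function(density, size = 51) {
  forest <- matrix(runif(size * size) < density / 100, nrow = size)
  initial_trees <- sum(forest)
  if (initial_trees == 0) return(0)
  burned <- matrix(FALSE, nrow = size, ncol = size)
  burned[, 1] <- forest[, 1]
  front <- which(burned)                 # linear indices of the fire front
  while (length(front) > 0) {
    rows <- (front - 1) %% size + 1
    cols <- (front - 1) %/% size + 1
    nr <- c(rows - 1, rows + 1, rows, rows)   # up, down, left, right
    nc <- c(cols, cols, cols - 1, cols + 1)
    inside <- nr >= 1 & nr <= size & nc >= 1 & nc <= size
    idx <- unique((nc[inside] - 1) * size + nr[inside])
    front <- idx[forest[idx] & !burned[idx]]  # unburned trees next to the front
    burned[front] <- TRUE
  }
  100 * sum(burned) / initial_trees      # percent of trees burned
}

# Run the simulator `reps` times for each density; one vector per density.
rep.sim <- function(density, reps, sim = sim_fire) {
  lapply(density, function(d) replicate(reps, sim(d)))
}

set.seed(42)
d <- seq(55, 65, 1)
res <- rep.sim(d, 20)
names(res) <- d
sapply(res, median)   # median percent burned at each density
```

Because `res` is a named list of numeric vectors, `boxplot(res)` would be one quick way to look at the run-to-run variability at each density.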

The plot below shows the variability of the percent of trees burned as a function of density in the transition region.

My code to generate plots is available in the file: Download NelLogo_blog while all of the code from Thiele's JSS paper is available from the journal website.

Finally, here are a few more interesting links related to ABMs.

- On validating ABMs
- ABMs and

by James Peruvankal

There are plenty of options if you want to learn R and are looking for training: your college’s statistics department, massive open online courses like Coursera, Udacity, edX, Datacamp etc. SiliconANGLE recently published an article about top R-training companies.

Let’s talk about how to choose a good R-trainer.

- First and foremost is technical competency in R - In addition to having done a significant amount of R programming, the instructor should have an education in a quantitative field. The idea behind this is that the instructor will have had experience expressing non-trivial ideas in R. However, it is not necessarily the case that the most technically competent person is the best instructor.
- Experience in teaching statistics - Learning R invariably involves working with statistics. So knowing where students can go wrong in understanding statistical concepts is a skill that greatly increases the effectiveness of an R instructor. This skill only comes with experience. Joan Garfield's 1995 article in the International Statistical Review, How Students Learn Statistics, is an excellent reference on what can go wrong in learning statistics and how to correct it.
- Communication skills - The instructor should have the ability to clearly communicate complex topics in simple examples that students can relate to. We recommend Gelman and Nolan's book: Teaching Statistics: A Bag of Tricks which promotes an activity based approach to teaching.
- Evangelism - Passion generates passion. The enthusiasm of the instructor spreads to the students.
- Teaching style and philosophy - From our experience in teaching and based on decades of research on how people learn, we have come up with our teaching philosophy. The most important factor is that people ‘learn by doing’. Ensure that most of the time is spent on hands-on learning.

At Revolution Analytics we are guided by the teaching philosophy presented in the following chart:

So, if you are serious about learning R, brush up on your statistics, be prepared to jump right in and start doing things on your own, surround yourself with people who are passionate about statistics and R, and figure out how to make the whole process fun for you. If you are teaching R and want to join us in our mission to ‘take R to the Enterprise’, see if you can fit in with our team.

Statistics has many canonical data sets. For classification statistics, we have Fisher's iris data. For Big Data statistics, the canonical data set used in many examples is the Airlines data. And for dotplots, we have the barley data, first popularized by Bill Cleveland in the landmark 1993 text *Visualizing Data*. Cleveland's innovations in data visualization were hugely influential in the S language and (later) R's lattice and ggplot2 packages, and the panel chart of the barley data shown below is one of the best known.

The chart above shows the yields for several different varieties of barley (Trebi, Glabron and so on) planted at each of six different sites in Minnesota (Duluth, Grand Rapids, etc.) in the years 1931 (pink) and 1932 (blue). The reason this data set has become legendary appears in the "Morris" panel, where, unlike at all the other sites, the yields in 1932 exceeded those in 1931 for all barley varieties. This is a great demonstration of the power of dotplots and panel graphics. In his book, Cleveland said that "either an extraordinary natural event, such as disease or a local weather anomaly, produced a strange coincidence, or the years for Morris were inadvertently reversed", and "on the basis of the evidence, the mistake hypothesis would appear to be the more likely."
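For readers who want to reproduce a chart of this kind, here is a minimal sketch. It assumes the lattice package (a recommended package that normally ships with R) and its copy of the `barley` data. The layout and key settings are illustrative choices, not necessarily those used for the chart in this post.

```r
library(lattice)  # provides the barley data and dotplot()

# Mean yield per site and year (rows: sites, columns: years)
site_means <- with(barley, tapply(yield, list(site, year), mean))

# Sites where the 1932 mean yield is higher than the 1931 mean yield
flipped <- rownames(site_means)[site_means[, "1932"] > site_means[, "1931"]]
flipped

# One panel per site, one row per variety, the two years overlaid
barley_plot <- dotplot(variety ~ yield | site, data = barley, groups = year,
                       layout = c(1, 6), auto.key = list(space = "right"),
                       xlab = "Barley yield (bushels/acre)")
```

Calling `print(barley_plot)` draws the panels. The `flipped` vector is a quick numerical check on which site runs against the pattern seen at the others.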

But it now looks as though, despite Cleveland's suggestion, the data are correct after all. In a paper in the American Statistician published last year, Kevin Wright notes that in that time period local effects of weather (especially drought), insects and disease had greater impact on barley yields than any overall year-to-year effects on yield, and that the results at Morris were not surprising. Kevin offers as evidence extended barley yield data (available in his R package agridat) covering 10 years and 18 varieties. As you can see in the chart below, there is significant variation across years and within sites. Take a look at 1934 for example: a bounty of barley in Duluth, but a meagre crop in St Paul:

So it goes to show that in Cleveland's original example, it wasn't a data error that led to the "unusual" results at the Morris site. Rather, it's an expected consequence of the year-to-year variation of yields in each of the growing sites. But it's no less of an interesting data set to show off the power of dot plots and panel charts — as you can see from several other examples included in Kevin Wright's paper linked below. (With thanks to Kevin for describing this example to me at the useR! 2014 poster session. You can see a version of his poster here.)

Wright, K. (2013). Revisiting Immer's Barley Data. The American Statistician, 67(3), 129–133.

by Joseph Rickert

Broadly speaking, a meta-analysis is any statistical analysis that attempts to combine the results of several individual studies. The term was apparently coined by statistician Gene V Glass in a 1976 speech he made to the American Educational Research Association. Since that time, not only has meta-analysis become a fundamental tool in medicine, but it is also becoming popular in economics, finance, the social sciences and engineering. Organizations responsible for setting standards for evidence-based medicine such as the United Kingdom’s National Institute for Health and Care Excellence (NICE) make extensive use of meta-analysis.

The application of meta-analysis to medicine is intuitive and, on the surface, compelling. Clinical trials designed to test efficacy for some new treatment for a disease against the standard treatment tend to be based on relatively small samples. (For example, the largest four trials for Respiratory Tract Diseases currently listed on ClinicalTrials.gov has an estimated enrollment of 533 patients.) It would seem to be a “no brainer” to use “all of the information” to get more accurate results. However, as for so many things, the devil is in the details. The preliminary tasks of establishing a rigorous protocol for guiding the meta-analysis and the systematic review to search for relevant studies are themselves far from trivial. One has to work hard to avoid “selection bias”, “publication bias” and other even more subtle difficulties.

In my limited experience with meta-analysis, I found it extraordinarily difficult to determine whether patient populations from different clinical trials were sufficiently homogeneous to be included in the same meta-analysis. Even when working with well-written papers, published in quality journals, a considerable amount of medical expertise was required to interpret the data. I came away with the strong impression that a good meta-analysis requires collaboration from a team of experts.

Historically, it has probably been the case that most meta-analyses were conducted either with general tools such as Excel or specialized software like RevMan from the Cochrane Collaboration. However, R is the natural platform for meta-analysis both because of the myriad possibilities for statistical analyses that are not generally available through the specialized software, and because of the many packages devoted to various aspects of meta-analysis. The CRAN Meta Analysis Task View is exceptionally well-organized, listing R packages according to the different stages of conducting a meta-analysis and also calling out some specialized techniques such as meta-regression and network meta-analysis.

In a future post, I hope to be able to explore some of these packages more closely. For now, let’s look at a very simple analysis based on Thomas Lumley’s rmeta package, which has been a part of R since 1999. The following simple meta-analysis is written up very nicely in the book by Chen and Peace titled Applied Meta-Analysis with R.

The cochrane data set in the rmeta package contains the results from seven randomized clinical trials designed to test the effectiveness of corticosteroid therapy in preventing neonatal deaths in premature labor. The columns of the data set are: the name of the trial center, the number of deaths in the treatment group, the total number of patients in the treatment group, the number of deaths in the control group and the total number of patients in the control group.

The null hypothesis is that there is no difference between treatment and control. Following Chen and Peace, we fit both fixed effects and random effects models to look at the odds ratios.
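To make the two kinds of summary concrete, here is a base R sketch that computes a Mantel-Haenszel pooled odds ratio and a DerSimonian-Laird random effects summary by hand. It does not use rmeta, and it is not the code from Chen and Peace. The counts are typed in for illustration and should be checked against `data(cochrane)` in rmeta; the object names (`cochrane_counts`, `tab`, `or_mh` and so on) are my own.

```r
# Counts transcribed for illustration; verify against data(cochrane) in rmeta.
cochrane_counts <- data.frame(
  name    = c("Auckland", "Block", "Doran", "Gamsu",
              "Morrison", "Papageorgiou", "Tauesch"),
  ev.trt  = c(36, 1, 4, 14, 3, 1, 8),       # deaths, treatment group
  n.trt   = c(532, 69, 81, 131, 67, 71, 56),
  ev.ctrl = c(60, 5, 11, 20, 7, 7, 10),     # deaths, control group
  n.ctrl  = c(538, 61, 63, 137, 59, 75, 71)
)

# Cells of each 2x2 table: a/b = treated deaths/survivors,
# cc/d = control deaths/survivors
tab <- with(cochrane_counts, data.frame(
  a = ev.trt, b = n.trt - ev.trt, cc = ev.ctrl, d = n.ctrl - ev.ctrl))
n_total <- rowSums(tab)

# Per-study odds ratios and the Mantel-Haenszel pooled odds ratio
or_study <- tab$a * tab$d / (tab$b * tab$cc)
or_mh <- sum(tab$a * tab$d / n_total) / sum(tab$b * tab$cc / n_total)

# DerSimonian-Laird random effects summary on the log odds ratio scale
log_or <- log(or_study)
v <- 1 / tab$a + 1 / tab$b + 1 / tab$cc + 1 / tab$d   # variance of each log OR
w <- 1 / v                                            # inverse variance weights
q_stat <- sum(w * (log_or - sum(w * log_or) / sum(w))^2)
tau2 <- max(0, (q_stat - (length(w) - 1)) / (sum(w) - sum(w^2) / sum(w)))
w_re <- 1 / (v + tau2)
log_or_re <- sum(w_re * log_or) / sum(w_re)
se_re <- sqrt(1 / sum(w_re))
or_re <- exp(log_or_re)
ci_re <- exp(log_or_re + c(-1.96, 1.96) * se_re)

round(data.frame(name = cochrane_counts$name, OR = or_study)[, 2], 2)
c(MH = or_mh, DSL = or_re, lower = ci_re[1], upper = ci_re[2])
```

The between-study variance `tau2` is added to every study's variance, which is why a random effects interval can never be narrower than the corresponding inverse variance fixed effects interval.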

The summary for the fixed effects model shows that while only two studies, Auckland and Doran, individually show a significant effect, the overall confidence interval from the Mantel-Haenszel test does indicate a benefit from the treatment.

    Fixed effects ( Mantel-Haenszel ) meta-analysis
    Call: meta.MH(ntrt = n.trt, nctrl = n.ctrl, ptrt = ev.trt, pctrl = ev.ctrl,
        names = name, data = cochrane)
    ------------------------------------
                   OR (lower  95% upper)
    Auckland     0.58    0.38       0.89
    Block        0.16    0.02       1.45
    Doran        0.25    0.07       0.81
    Gamsu        0.70    0.34       1.45
    Morrison     0.35    0.09       1.41
    Papageorgiou 0.14    0.02       1.16
    Tauesch      1.02    0.37       2.77
    ------------------------------------
    Mantel-Haenszel OR =0.53 95% CI ( 0.39,0.73 )
    Test for heterogeneity: X^2( 6 ) = 6.9 ( p-value 0.3303 )

The summary for the random effects model for this data is identical except, as one would expect, the overall confidence interval is somewhat wider: SummaryOR= 0.53 95% CI ( 0.37,0.78 ). A slightly modified and enhanced version of the forest plot code provided by Chen and Peace (which works for both the fixed effects and random effects model objects) shows the typical way to present these results.

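    # Forest plot helper; needs rmeta for forestplot() and the cochrane data
    library(rmeta)

    CPplot <- function(model){
      # Text columns of the table drawn beside the plot
      c1 <- c("", "Study", model$names, NA, "Summary")
      c2 <- c("Deaths", "(Steroid)", cochrane$ev.trt, NA, NA)
      c3 <- c("Deaths", "(Placebo)", cochrane$ev.ctrl, NA, NA)
      c4 <- c("", "OR", format(exp(model[[1]]), digits = 2), NA,
              format(exp(model[[3]]), digits = 2))
      tableText <- cbind(c1, c2, c3, c4)
      # Log odds ratios and standard errors: studies first, then the summary
      mean   <- c(NA, NA, model[[1]], NA, model[[3]])
      stderr <- c(NA, NA, model[[2]], NA, model[[4]])
      low <- mean - 1.96 * stderr
      up  <- mean + 1.96 * stderr
      forestplot(tableText, mean, low, up, zero = 0,
                 is.summary = c(TRUE, TRUE, rep(FALSE, 8), TRUE),
                 clip = c(log(0.1), log(2.5)), xlog = TRUE)
    }

    # model.FE is the fixed effects fit (the meta.MH call shown in the summary above)
    CPplot(model.FE)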

The whole idea of meta-analysis is intriguing. However, because of the challenges I mentioned above, I would be remiss not to point out that it elicits considerable criticism. The article Meta-analysis and its problems by H J Eysenck captures the issues and is well worth reading. Also, have a look at the review article by Walker, Hernandez and Kattan writing in the Cleveland Clinic Journal of Medicine.

by Joseph Rickert

John Chambers opened UseR! 2014 by describing how the R language grew out of early efforts to give statisticians easier access to high quality statistical software. In 1976 computational statistics was a very active field, but most algorithms were compiled as Fortran subroutines. Building models with this software was not a trivial process. First you had to write a main Fortran program to implement the model and call the right subroutines, and then you had to write the job control language code to submit your job and get it executed. When John and his Bell Labs colleagues sat down on that May afternoon to work on what would become the first implementation of the S language they were thinking about how they could make this process easier. The top half of John’s famous diagram from that afternoon schematically indicates their intention to design a software interface so that one could call an arbitrary Fortran subroutine, ABC, by wrapping it in some simplified calling syntax: XABC( ).

The main idea was to bring the best computational facilities to the people doing the analysis. As John phrased it: “combine serious computational challenges with convenience”. In the end, the designers of both S, and its second incarnation, R, did much better than convenience. They built a tool to facilitate “flow”. When you are engaged in any mentally challenging work (including statistical analysis) at a high level of play, you want to be able to stay in the zone and not get knocked out by peripheral tasks that interrupt your thought processes. As engaging and meaningful as it is in its own right, writing code is not doing statistics. One of the big advantages of working with R is that you can do quite a bit of statistics with just a handful of functions and the simplest syntax. R is a tool that helps you keep moving forward. If you want to see something, then plot it. If the data are in the wrong format, then mutate them.

A second idea that flows from the idea of S as an interface is that S was not intended to be self sufficient. John was explicit that S was designed as an interface to the “best algorithms”, not as a “from the ground up programming language”. The idea of being able to make use of external computational resources is still compelling. There will always be high-quality stuff that we will want to get at. Moreover, as John elaborated: “unlike 38 years ago there are many possible interfaces to languages, to other computing models and to (specialized) hardware”. The challenge is to interface to applications that are “too diverse for one solution to fit them all”, and to do this “without losing the R that works in ‘ordinary’ circumstances”.

John offered three examples of R projects that extend the reach of R to leverage other computing environments.

- Rcpp - turns C++ into an R function by generating an interface to C++ with much less programming effort than .Call
- RLLVM - enables compiling R language code into specialized forms for efficiency and other purposes
- H2O - provides a compressed, efficient external version of a data frame for running statistical models on large data sets.

These examples, chosen to represent each of the three different kinds of interface targets that John called out, also represent projects of different scope and levels of integration. With a total of 226 reverse depends and reverse imports, Rcpp is already a great success. It is likely that ready access to C++ will form a permanent part of the R programmer's mindset.

RLLVM is a much more radical and ambitious project that would allow R to be the window to entirely different computing models. As best I understand it, the central idea is to use the R environment as the system interface to “any number of new languages” perhaps languages that have not yet been invented. RLLVM would “Use R syntax for commands to be interpreted in a different interpreter”. RLLVM seems to be a powerful idea and a direct generalization of the original XABC() idea.

The RH2O package is an example of providing R users with transparent access to data sets that are too large to fit into memory. It is one of many efforts underway (including those from Revolution Analytics) to integrate Hadoop, Teradata, Spark and other specialized computing platforms within the R environment. Some of these specialized platforms may indeed be long-lived, but it is not likely that all of them will be. From the point of view of doing statistics, it is the R interface that is likely to survive and persist; platforms will come and go.

An implication of the willingness of R developers to embrace diversity is that R is likely to always be a work in progress. There will be loose ends, annoying inconsistencies and unimplemented possibilities. I suppose that there are people who will never be comfortable with this state of affairs. It is not unreasonable to prefer a system where there is one best way to do something and where, within the bounds of some pre-established design, there is near perfect consistency. However, the pursuit of uniformity and consistency seems to me to doom designers to be at least one step behind, because it means continually starting over to get things right.

So what does this say about the future of R? John closed his talk by stating that “the best future would be one of variety, not uniformity”. I take this to mean that, for the near future anyway, whatever the next big thing is, it is likely that someone will write an R package to talk to it.

Some links regarding S and R History:

- John Chambers useR! 2006 slides
- Trevor Hastie's Interview with John Chambers
- Ross Ihaka: R: Past and Future History
- New York Times Article

by Joseph Rickert

useR! 2014 got under way this past Monday with a very impressive array of tutorials delivered on the day that the conference organizers were struggling to cope with a record breaking crowd. My guess is that conference attendance is somewhere in the 700 range. Moreover, this is the first year that I can remember that tutorials were free. The combination made for jam-packed tutorials.

The first thing that jumps out just by looking at the tutorial schedule is the effort that RStudio made to teach at the conference. Of the sixteen tutorials given, four were presented by the RStudio team. Winston Chang conducted an introduction to interactive graphics with ggvis; Yihui Xie presented Dynamic Documents with R and knitr; Garrett Grolemund presented Interactive data display with Shiny and R; and Hadley Wickham taught data manipulation with dplyr. Shiny is still captivating R users. I was particularly struck by a conversation I had with an undergraduate Stats major who seemed to be genuinely pleased and excited about being able to build her own Shiny apps. Kudos to the RStudio team.

Bob Muenchen brought this same kind of energy to his introduction to managing data with R. Bob has extensive experience with SAS and Stata and he seems to have a gift for anticipating areas where someone proficient in one of these packages, but new to R, might have difficulty.

Matt Dowle presented his tutorial on Data Table and, from the chatter I heard, increased his growing list of data table converts. Dirk Eddelbuettel presented An Example-Driven, Hands-on Introduction to Rcpp and Romain Francois taught C++ and Rcpp11 for beginners. Both Dirk and Romain work hard to make interfacing to C++ a real option for people who themselves are willing to make an effort.

I came late to Martin Morgan’s tutorial on Bioconductor but couldn’t get in the room. Fortunately, Martin prepared an extensive set of materials for his course which I hope to be able to work through.

Max Kuhn taught an introduction to applied predictive modeling based on the recent book he wrote with Kjell Johnson. Both the slides and code are available. Virgilio Gomez Rubio presented a tutorial on applied spatial data analysis with R. His materials are available here. Ramnath Vaidyanathan presented Interactive Documents with R. Have a look at Ramnath’s slides, the map embedded in his abstract and his screencast.

Drew Schmidt taught a workshop on Programming with Big Data in R based on the pbdR package. The course page points to a rich set of introductory resources.

I very much wanted to attend Søren Højsgaard’s tutorial on Graphical Models and Bayesian Networks with R, but couldn’t make it. I did attend John Nash's tutorial on Nonlinear parameter optimization and modeling in R and I am glad that I did. This is a new field for me and I was fortunate to see something of John’s meticulous approach to the subject.

I was disappointed that I didn’t get to attend Thomas Petzoldt’s tutorial on Simulating differential equation models in R. This is an area that is not usually associated with R. The webpage for the tutorial is really worth a look.

I don't know if the conference organizers planned it that way, but as it turned out, the tutorial subjects chosen are an excellent showcase for the depth and diversity of applications that can be approached through R. Many thanks to the tutorial teachers and congratulations to the UseR! 2014 conference organizers for a great start to the week. I hope to have more to say about the conference in future posts.

by Joseph Rickert

Predictive Modeling or “Predictive Analytics”, the term that appears to be gaining traction in the business world, is driving the new “Big Data” information economy. Predictably, there is no shortage of material to be found on this subject. Some discussion of predictive modeling is sure to be found in any reasonably technical presentation of business decision making, forecasting, data mining, machine learning, data science, statistical inference or just plain science. There are hundreds of books that have something worthwhile to say about predictive modeling. However, in my judgment, *Applied Predictive Modeling* by Max Kuhn and Kjell Johnson (Springer 2013) ought to be at the very top of the reading list of anyone who has some background in statistics, who is serious about building predictive models, and who appreciates rigorous analysis, careful thinking and good prose.

The authors begin their book by stating that “the practice of predictive modeling defines the process of developing a model in a way that we can understand and quantify the model’s prediction accuracy on future, yet-to-be-seen data”. They emphasize that predictive modeling is primarily concerned with making accurate predictions and not necessarily building models that are easily interpreted. Nevertheless, they are careful to point out that “the foundation of an effective predictive model is laid with intuition and deep knowledge of the problem context”. The book is a masterful exposition of the modeling process delivered at a high level of play, with the authors gently pushing the reader to understand the data, to carefully select models, to question and evaluate results, to quantify the accuracy of predictions and to characterize their limitations.

Kuhn and Johnson are intense but not oppressive. They come across like coaches who really, really want you to be able to do this stuff. They write simply and with great clarity. However, the material is not easy. I frequently found myself rereading a passage and almost always found it to be worth the effort. This mostly happened when reading a careful discussion of a familiar topic (i.e. something I thought I understood). For example, Chapter 14 on Classification Trees and Rule-Based Models contains what I thought to be an illuminating discussion on the difference between building trees with grouped categories and taking the trouble to decompose a categorical predictor into binary dummy variables, in effect forcing binary splits for the categories.

*Applied Predictive Modeling* begins with a chapter that introduces the case studies that are referenced throughout the book. Thereafter, chapters are organized into four parts (General Strategies, Regression Models, Classification Models and Other Considerations) followed by three appendices, including a brief introduction to R (too brief to teach someone R, but adequate to give a programmer new to R enough of an orientation to make sense of the R scripts included in the book). This organization has the virtue of allowing the authors to focus on the specifics of the various models while providing a natural way to repeat and reinforce fundamental principles. For example, Regression Trees and Classification Trees share a great deal in common and many authors treat them together. However, by splitting them into separate sections Kuhn and Johnson can focus on the performance measures that are peculiar to each kind of model while getting a second chance to explain fundamental principles and techniques such as bagging and boosting that are applicable to both kinds of models.

There are many ways to go about reading *Applied Predictive Modeling*. I can easily envision someone committed to mastering the material reading the text from cover to cover. However, the chapters are pretty much self-contained, and the authors are very diligent about providing back references to topics they have covered previously. You can pretty much jump in anywhere and find your way around. Additionally, the authors take the trouble to include quite a bit of “forward referencing”, which I found to be very helpful. As an example, in Section 3.6, where the authors mention credit scoring with respect to a discussion on adding predictors to a model, they point ahead to Section 4.5, which is a short discussion of the credit scoring case study. This section, in turn, points ahead to Section 11.2 and a discussion of evaluating predicted classes. These forward references encourage and facilitate latching on to a topic and then threading through the book to track it down.

Three major strengths of the book are its fundamental grounding in the principles of statistical inference, the thoroughness with which the case studies are presented, and its use of the R language. The statistical viewpoint is apparent both from the choice of topics presented and the authors’ overall approach to predictive modeling. Topics that are peculiar to a statistical approach include the presentation of stratified sampling and other sampling techniques in the discussion of data splitting, and the sections on partial least squares and linear discriminant analysis. The real statistical value of the text, however, is embedded in Kuhn and Johnson’s methodology. They take great care to examine the consequences of modeling decisions and continually encourage the reader to challenge the results of particular models. The chapters on data preparation and model evaluation do an excellent job of informally presenting a formal methodology for making inferences. *Applied Predictive Modeling* contains very few equations and very little statistical jargon but it is infused with statistical thinking. (A side effect of the text is to teach statistics without being too obvious about it. You will know you are catching on if you think the xkcd cartoon in chapter 19 is really funny.)

A nice feature of the case studies is that they are rich enough to illustrate several aspects of the model building process and are used effectively throughout the text. The discussion in Chapter 12 on preparing the University of Melbourne grant funding data set from the Kaggle contest is particularly thorough. This kind of “blow by blow” discussion of why the authors make certain modeling decisions is invaluable.

The R language comes into play in several ways in the text. The most obvious is the section on computing that closes most chapters. These sections contain R code that illustrates the major themes presented in the chapter. To some extent, these brief R statements substitute for the equations that are missing from the text. They provide concrete visual representations of the key ideas accessible to anyone who makes the effort to learn a little R syntax. The chapter-ending code is itself backed up with an R package available on CRAN, AppliedPredictiveModeling, that contains scripts to reproduce all of the analyses and plots in the text. (This feature makes the text especially well-suited for self study.)

*Applied Predictive Modeling* is resplendent with R graphs and plots, many of them in color, that are integral to the presentation of ideas but which also serve to illustrate how easily presentation-level graphs can be created in R. Form definitely follows function here, and it makes for a rather pretty book. One of my favorite plots is the first part of Figure 11.3, reproduced below, which shows the test set probabilities for a logistic regression model of the German Credit data set.

The authors point out that the estimates of bad credit in the right panel are skewed showing that most estimates predict very low probabilities for bad credit when the credit is, in fact, good - just what you want to happen. In contrast, the estimates of bad credit are flat in the left panel, “reflecting the model’s inability to distinguish bad credit cases”.

Finally, *Applied Predictive Modeling* can be viewed as an introduction to the caret package. There is great depth here. This is not a book that comes with a little bit of illustrative code, icing on a cake so to speak. Rather, the included code is just the tip of the iceberg. It provides a gateway to the caret package and the full functionality of R’s machine learning capabilities.

*Applied Predictive Modeling* is a remarkable text. At 600 pages, it is the succinct distillation of years of experience of two expert modelers working in the pharmaceutical industry. I expect that beginners and experienced model builders alike will find something of value here. On my shelf, it sits up there right next to Hastie, Tibshirani and Friedman’s *The Elements of Statistical Learning*.

by Wayne Smith, Ph.D. California State University, Northridge

*Editor's note: This post was abstracted from the monthly newsletter of the Southern California Chapter of the ASA.*

On May 13th and 14th, the Intel International Science and Engineering Fair (Intel ISEF), the world’s largest international pre-college competition, was held at the Los Angeles Convention Center.

I was blessed with the opportunity to represent the American Statistical Association (ASA). As one of approximately 30 statisticians, I helped judge the statistics-related elements of numerous prescient and empirical projects presented by high school students from around the world. These students had already won other local and regional science and engineering competitions. We selected first, second, and third place winners, but 16 student teams in total received special recognition and goodie bags filled with software, books, and other items.

The photograph below shows the first place winner, Soham Daga, from New York, who used Google Trends to develop a model to predict the likelihood of mortgage delinquency. An interview with Soham can be found here.

I have no doubt that a lasting affinity with statistical professionals and supporting organizations will be a tangible outcome for these motivated, young researchers.

I was energized and transformed by the breadth and depth of the research methods and concomitant inferential analysis applied to address pressing issues in areas as diverse as health care, energy, sustainability, material science, pharmacology, biochemistry, financial economics, and many others. Along with my ASA colleagues, I discussed projects with students as young as 15. As one might expect, many of the High School seniors are attending top research universities in the Fall. I was especially impressed with the rich diversity of students, including groups of students from Qatar, Egypt, Tunisia, Brazil, Japan, Russia, and historically underrepresented areas in the U.S. such as Fresno, CA. Some of the students' work has been ongoing for more than a year, and the students offered background literature (with references!), purposeful hypotheses, detailed analysis and results (occasionally with tool manifests and explanatory code), and integrated conclusions.

Of the 80 or so projects I reviewed, I observed applications of the general linear model; repeated measures; logistic regression; non-parametric measures; classification, feature extraction, and dimensionality reduction; sundry machine learning approaches; and Monte Carlo simulations. I was equally impressed by these students' abilities in fundamental research tasks such as locating and using open source software (e.g., R), understanding and coherently explaining potential I/O- and computational-bounds, finding and interpreting peer-reviewed literature, and seeking out the assistance of relevant industry professionals. Additionally, the students' ebullient entrepreneurial spirit in the design and execution of physical proof-of-concept prototypes and related statistical experiments was especially noteworthy. I came away from each project and each student/team discussion with a new understanding of a thorny issue, a vision for what the solution space and product and process possibilities might be, and perhaps most germane for a College instructor, a renewed calibration for the knowledge, skills, and abilities of a tapestry of young people in the broad areas of mathematical, statistical, and computational sciences. I felt visceral pride in the statistical calling of many of these young finalists, and I know that they will craft much social, intellectual, and economic value for many decades to come.

A side benefit of service at this event was the opportunity to interact with academic and professional colleagues representing a variety of statistical-education interests. In particular, I'd like to thank Madeline Bauer (USC/Keck), Theresa Utlaut (Intel), Jo Hardin (Pomona College), and Olga Korosteleva (CSULB) for their guidance in the judging process. At this event one can interact with professionals from dozens of other professional societies and technology firms as well.

This Intel-sponsored event circulates annually among three U.S. cities. I strongly recommend that individuals with a general interest in statistics and data science volunteer at this event and at local SCASA and OCLBASA events in the future.

Many, many thanks to all the statisticians who participated as judges and/or behind the scenes! Thanks to the ASA for the cash prizes and thanks to Chapman Hall/CRC, JMP, Minitab, O’Reilly Media, Revolution Analytics, Sage, Stata, and Taylor & Francis for the donated books, magazines, software and other items.

by Joseph Rickert

In last week’s post, I sketched out the history of Generalized Linear Models and their implementations. In this post I’ll attempt to outline how GLM functions evolved in R to handle large data sets.

The first function to make it possible to build GLM models with datasets that are too big to fit into memory was bigglm() from Thomas Lumley’s biglm package, which was released to CRAN in May 2006. bigglm() is an example of an external memory or “chunking” algorithm. This means that data is read from some source on disk and processed one chunk at a time. Conceptually, chunking algorithms work as follows: a program reads a chunk of data into memory, performs intermediate calculations to compute the required sufficient statistics, saves the results and reads the next chunk. The process continues until the entire dataset is processed. Then, if necessary, the intermediate results are assembled into a final result.
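The read, accumulate and assemble loop described above can be sketched in a few lines of base R. This is only an illustration of the chunking idea for ordinary least squares, where one pass over the data is enough; it is not the algorithm bigglm() uses, and a GLM with a non-identity link would generally need repeated passes. The simulated data frame, the `chunk_size` value and the variable names are all assumptions made for the example.

```r
# Simulated data standing in for a file too large to load at once (assumption).
set.seed(42)
n <- 1000
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - 3 * dat$x2 + rnorm(n)

chunk_size <- 100                       # hypothetical chunk size
starts <- seq(1, n, by = chunk_size)

# Running sufficient statistics for least squares: X'X and X'y.
xtx <- matrix(0, 3, 3)
xty <- matrix(0, 3, 1)

for (s in starts) {
  rows  <- s:min(s + chunk_size - 1, n)
  chunk <- dat[rows, ]                  # "read" one chunk into memory
  X <- cbind(1, chunk$x1, chunk$x2)     # design matrix for this chunk only
  xtx <- xtx + crossprod(X)             # accumulate X'X
  xty <- xty + crossprod(X, chunk$y)    # accumulate X'y
}

# Assemble the final result from the saved intermediate results.
beta_chunked <- drop(solve(xtx, xty))

# Reference fit using all of the data at once, for comparison.
beta_full <- unname(coef(lm(y ~ x1 + x2, data = dat)))
```

Only the 3 x 3 and 3 x 1 accumulators persist between chunks, so memory use depends on the number of predictors and the chunk size rather than on the number of rows.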

According to the documentation trail, bigglm() is based on Alan Miller’s 1991 refinement (algorithm AS 274 implemented in Fortran 77) to W. Morven Gentleman’s 1975 Algol algorithm (AS 75). Both of these algorithms work by updating a triangular factor of the design matrix (the Cholesky factor of its cross-product) as new observations arrive. For a model with p variables, only the p x p triangular Cholesky factor and a new row of data need to be in memory at any given time.
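To make the memory claim concrete, here is a minimal base R sketch of updating a p x p triangular factor one row at a time. It uses qr() to re-triangularize the stacked matrix, which is a convenient stand-in and not the published AS 75 or AS 274 routines. The simulated design matrix and all variable names are assumptions.

```r
set.seed(1)
p <- 3
X <- cbind(1, matrix(rnorm(200 * (p - 1)), ncol = p - 1))  # 200 x 3 design matrix

# Initialize the p x p triangular factor from the first p rows.
R <- qr.R(qr(X[1:p, ]))

# Fold in the remaining observations one row at a time. At each step only
# the current p x p factor and the single new row are needed.
for (i in (p + 1):nrow(X)) {
  new_row <- X[i, , drop = FALSE]
  R <- qr.R(qr(rbind(R, new_row)))     # re-triangularize a (p + 1) x p matrix
}

# The updated factor reproduces the full cross-product X'X, so it matches
# the Cholesky factor of X'X up to the signs of its rows.
xtx_from_R <- crossprod(R)
xtx_direct <- crossprod(X)
```

Because `crossprod(R)` equals `crossprod(X)`, the regression coefficients can be recovered from the factor (plus a similarly updated right-hand side) without ever holding the full design matrix in memory.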

bigglm() does not do the chunking for you. Working with the algorithm requires figuring out how to feed it chunks of data from a file or a database that are small enough to fit into memory with enough room left for processing. (Have a look at the make.data() function defined on page 4 of the biglm pdf for the prototype example of chunking by passing a function to bigglm()’s data argument.) bigglm() and the biglm package offer few features for working with data. For example, bigglm() can handle factors but it assumes that the factor levels are consistent across all chunks. This is very reasonable under the assumption that the appropriate place to clean and prepare the data for analysis is the underlying database.
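As a rough illustration of what such a data function looks like, the base R sketch below builds a closure that hands out successive chunks of an in-memory data frame. The contract it follows (a single `reset` argument, a return value of the next chunk, and NULL when the data is exhausted) is my reading of the biglm documentation. The helper name `make_chunk_reader`, the toy data and the chunk size are hypothetical, and a real reader would pull rows from a file or database connection instead.

```r
# Hypothetical helper: returns a function suitable for a chunked data source.
make_chunk_reader <- function(df, chunk_size) {
  pos <- 0                                   # number of rows handed out so far
  function(reset = FALSE) {
    if (reset) {                             # rewind to the start of the data
      pos <<- 0
      return(NULL)
    }
    if (pos >= nrow(df)) return(NULL)        # no more data
    rows <- (pos + 1):min(pos + chunk_size, nrow(df))
    pos <<- max(rows)
    df[rows, , drop = FALSE]                 # next chunk
  }
}

# Toy data standing in for a large table (assumption).
set.seed(7)
toy <- data.frame(x = rnorm(25))
toy$y <- rbinom(25, 1, plogis(toy$x))

reader <- make_chunk_reader(toy, chunk_size = 10)

# Walk through the data once, recording the size of each chunk.
reader(reset = TRUE)
sizes <- integer(0)
while (!is.null(chunk <- reader())) sizes <- c(sizes, nrow(chunk))

# With the biglm package installed, the reader could then be passed along:
# fit <- biglm::bigglm(y ~ x, data = reader, family = binomial())
```

Keeping the position in a closure means the fitting routine can rewind and re-read the data on every iteration without the caller managing any state.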

The next step in the evolution of building GLM models with R was the development of memory-mapped data structures along with the appropriate machinery to feed bigglm() data stored on disk. In late 2007, Dan Adler et al. released the ff package, which provides data structures that, from R's point of view, make data residing on disk appear as if it were in RAM. The basic idea is that only a chunk (pagesize) of the underlying data file is mapped into memory and this data can be fed to bigglm(). This strategy really became useful in 2011 when Edwin de Jonge, Jan Wijffels and Jan van der Laan released ffbase, a package of statistical functions designed to exploit ff’s data structures. ffbase contains quite a few functions including some for basic data manipulation such as ffappend() and ffmatch(). For an excellent example of building a bigglm() model with a fairly large data set, have a look at the post from the folks at BNOSAC. This is one of the most useful, hands-on posts with working code for building models with R and large data sets to be found. (It may be a testimony to the power of provocation.)

Not long after ff debuted, in June 2008, Michael Kane, John Emerson and Peter Haverty released bigmemory, a package for working with large matrices backed by memory-mapped files. Thereafter followed a sequence of packages in the Big Memory Project, including biganalytics, for exploiting the computational possibilities opened up by bigmemory. The bigmemory packages are built on the Boost Interprocess C++ library and were designed to facilitate parallel programming with foreach, snow, Rmpi and multicore and enable distributed computing from within R. The biganalytics package contains a wrapper function for bigglm() that enables building GLM models from very large files mapped to big.matrix objects with just a few lines of code.

The initial release in early August 2010 of the RevoScaleR package for Revolution R Enterprise included rxLogit(), a function for building logistic regression models on very massive data sets. rxLogit() was one of the first of RevoScaleR’s Parallel External Memory Algorithms (PEMA). These algorithms are designed specifically for high performance computing with large data sets on a variety of distributed platforms. In June 2012, Revolution Analytics followed up with rxGlm(), a PEMA that implements all of the standard GLM link/family pairs as well as Tweedie models and user-defined link functions. As with all of the PEMAs, scripts including rxGlm() may be run on different platforms just by changing a few lines of code that specify the user’s compute context. For example, a statistician could test out a model on a local PC or cluster and then change the compute context to run it directly on a Hadoop cluster.

The only other Big Data GLM implementation accessible through an R package of which I am aware is the h2o.glm() function that is part of 0xdata’s JVM implementation of machine learning algorithms, which was announced in October 2013. As opposed to the external memory R implementations described above, H2O functions run in the distributed memory created by the H2O process. Look here for h2o.glm() demo code.

And that's it. I think this brings us up to date with R-based (or R-accessible) functions for running GLMs on large data sets.