by Joseph Rickert

I think we can be sure that when American botanist Edgar Anderson meticulously collected data on three species of iris in the early 1930s he had no idea that these data would produce a computational storm that would persist well into the 21st century. The calculations started, presumably by hand, when R. A. Fisher selected this data set to illustrate the techniques described in his 1936 paper on discriminant analysis. However, they really got going in the early 1970s when the pattern recognition and machine learning community began using it to test new algorithms or illustrate fundamental principles. (The earliest reference I could find was: Gates, G.W. "The Reduced Nearest Neighbor Rule". IEEE Transactions on Information Theory, May 1972, 431-433.) Since then, the data set (or one of its variations) has been used to test hundreds, if not thousands, of machine learning algorithms. The UCI Machine Learning Repository, which contains what is probably the “official” iris data set, lists over 200 papers referencing the iris data.

So why has the iris data set become so popular? As with most success stories, randomness undoubtedly plays a huge part. However, Fisher’s selecting it to illustrate a discrimination algorithm brought it to people's attention, and the fact that the data set contains three classes, only one of which is linearly separable from the other two, makes it interesting.
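The separability claim is easy to probe with the copy of the data that ships with R as `iris`. The sketch below is only a one-variable check, chosen here for convenience (it is not Fisher's discriminant analysis): it asks whether a single threshold on petal length isolates setosa, and whether the versicolor and virginica ranges overlap on each of the four measurements.

```r
data(iris)

# Setosa vs. the rest: is there a gap in petal length?
is_setosa <- iris$Species == "setosa"
gap <- min(iris$Petal.Length[!is_setosa]) - max(iris$Petal.Length[is_setosa])
setosa_separable <- gap > 0

# Versicolor vs. virginica: do the ranges overlap on every measurement?
versicolor <- iris[iris$Species == "versicolor", 1:4]
virginica  <- iris[iris$Species == "virginica", 1:4]
ranges_overlap <- mapply(function(a, b) max(a) >= min(b) && max(b) >= min(a),
                         versicolor, virginica)

print(setosa_separable)
print(ranges_overlap)
```

Overlapping ranges on each variable do not by themselves prove that no linear combination separates the two species; that stronger statement is what a classifier would have to test.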

For similar reasons, the airlines data set used in the 2009 ASA Sections on Statistical Computing and Statistical Graphics Data Expo has gained a prominent place in the machine learning world and is well on its way to becoming the “iris data set for big data”. It shows up in all kinds of places. (In addition to this blog, it has made its way into the RHIPE documentation and figures in several college course modeling efforts.)

Some key features of the airlines data set are:

- It is big enough to exceed the memory of most desktop machines. (The version of the airlines data set used for the competition contained just over 123 million records with twenty-nine variables.)
- The data set contains several different types of variables. (Some of the categorical variables have hundreds of levels.)
- There are interesting things to learn from the data set. (See this exercise from Kane and Emerson, for example.)
- The data set is *tidy*, but not clean, making it an attractive tool to practice big data wrangling. (The AirTime variable ranges from -3,818 minutes to 3,508 minutes.)
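Out-of-range values like the AirTime extremes above are usually the first thing to screen for. Here is a minimal sketch in base R, using a small made-up vector and a hypothetical plausibility window of 0 to 1,440 minutes (both are assumptions for illustration, not rules taken from the data set's documentation):

```r
# Made-up AirTime values in minutes; the two extremes echo the range quoted above
air_time <- c(95, 210, -3818, 47, 3508, 330)

# Hypothetical plausibility window: positive and at most one day
max_minutes <- 24 * 60
implausible <- air_time <= 0 | air_time > max_minutes

# Mark implausible values as missing rather than dropping the rows
air_time_clean <- replace(air_time, implausible, NA)
print(air_time_clean)
```

Setting the values to NA keeps the rest of each record available for analysis; whether that or dropping the rows is appropriate depends on the model.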

An additional, really nice feature of the airlines data set is that it keeps getting bigger! The Research and Innovative Technology Administration (RITA) Bureau of Transportation Statistics continues to collect data, which can be downloaded as .csv files. For your convenience, we have made a 143M+ record version of the data set available for download from the Revolution Analytics test data site; it contains all of the RITA records from 1987 through the end of 2012.

The following analysis from Revolution Analytics’ Sue Ranney uses this large version of the airlines data set and illustrates how a good model, driven with enough data, can reveal surprising features of a data set.

    # Fit a Tweedie GLM
    tm <- system.time(
      glmOut <- rxGlm(ArrDelayMinutes ~ Origin:Dest + UniqueCarrier + F(Year) +
                        DayOfWeek:F(CRSDepTime),
                      data = airData,
                      family = rxTweedie(var.power = 1.15),
                      cube = TRUE, blocksPerRead = 30)
    )
    tm

    # Build a data frame for three airlines: Delta (DL), Alaska (AS), Hawaiian (HA)
    airVarInfo <- rxGetVarInfo(airData)
    predData <- data.frame(
      UniqueCarrier = factor(rep(c("DL", "AS", "HA"), times = 168),
                             levels = airVarInfo$UniqueCarrier$levels),
      Year = as.integer(rep(2012, times = 504)),
      DayOfWeek = factor(rep(c("Mon", "Tues", "Wed", "Thur", "Fri", "Sat", "Sun"),
                             times = 72),
                         levels = airVarInfo$DayOfWeek$levels),
      CRSDepTime = rep(0:23, each = 21),
      Origin = factor(rep("SEA", times = 504), levels = airVarInfo$Origin$levels),
      Dest = factor(rep("HNL", times = 504), levels = airVarInfo$Dest$levels)
    )

    # Use the model to predict the arrival delay for the three airlines and plot
    predDataOut <- rxPredict(glmOut, data = predData, outData = predData,
                             type = "response")
    rxLinePlot(ArrDelayMinutes_Pred ~ CRSDepTime | UniqueCarrier,
               groups = DayOfWeek, data = predDataOut, layout = c(3, 1),
               title = "Expected Delay: Seattle to Honolulu by Departure Time, Day of Week, and Airline",
               xTitle = "Scheduled Departure Time", yTitle = "Expected Delay")

Here, rxGlm() fits a Tweedie generalized linear model that looks at arrival delay as a function of the interaction between origin and destination airports, carriers, year, and the interaction between days of the week and scheduled departure time. This function kicks off a considerable amount of number crunching, as Origin is a factor variable with 373 levels and Dest, also a factor, has 377 levels. The F() function makes Year and CRSDepTime factors "on the fly" as the model is being fit. The resulting model ends up with 140,852 coefficients, 8,626 of which are not NA. The calculation takes 12.6 minutes to run on a 5-node (4 cores and 16GB of RAM per node) IBM Platform LSF cluster.
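To see where NA coefficients come from, here is a small sketch in base R rather than RevoScaleR. Everything in it is an assumption made for illustration: the routes and delays are simulated, `factor(Year)` plays the role that `F(Year)` plays above, and `quasipoisson` is swapped in for the Tweedie family, which base R's `glm()` does not provide. An interaction such as Origin:Dest gets one column per pair of levels, so every pair that never occurs in the data yields a coefficient that cannot be estimated.

```r
set.seed(123)

# Hypothetical routes: only 4 of the 9 possible Origin:Dest pairs ever occur
routes <- data.frame(
  Origin = c("SEA", "SEA", "SFO", "LAX"),
  Dest   = c("HNL", "OGG", "HNL", "KOA")
)
n <- 1000
idx <- sample(nrow(routes), n, replace = TRUE)
flights <- data.frame(
  Origin = factor(routes$Origin[idx]),
  Dest   = factor(routes$Dest[idx]),
  Year   = sample(2010:2012, n, replace = TRUE)   # stored as integer
)
flights$Delay <- rpois(n, lambda = 10)            # simulated delays in minutes

# factor(Year) plays the role of F(Year); quasipoisson stands in for Tweedie
fit <- glm(Delay ~ Origin:Dest + factor(Year),
           data = flights, family = quasipoisson(link = "log"))

n_coef <- length(coef(fit))
n_na   <- sum(is.na(coef(fit)))
cat(n_coef, "coefficients,", n_na, "of them NA\n")
```

With 373 origins and 377 destinations there are 140,621 possible pairs, far more than the number of routes actually flown, which is consistent with most of the coefficients in the airline model being NA.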

The rest of the code uses the model to predict arrival delay for three airlines and plots the fitted values by day of the week and departure time.

It looks like Saturday is the best time to fly for these airlines. Note that none of the structure revealed in these curves was put into the model, in the sense that there are no polynomial terms in the model.
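One detail of the prediction code is easy to miss: the rep() calls that build predData produce every carrier, day and hour combination only because the cycle lengths 3 and 7 share no common factor, so each 21-row block for a given hour covers all 21 carrier/day pairs. A sketch of a more explicit alternative uses base R's expand.grid(). The object names `grid` and `by_rep` and the helper `key()` are introduced here for illustration; factor levels would still have to be aligned with those in airVarInfo before calling rxPredict().

```r
carriers <- c("DL", "AS", "HA")
days     <- c("Mon", "Tues", "Wed", "Thur", "Fri", "Sat", "Sun")

# One row per carrier x day x scheduled departure hour
grid <- expand.grid(UniqueCarrier = carriers,
                    DayOfWeek     = days,
                    CRSDepTime    = 0:23,
                    stringsAsFactors = FALSE)
grid$Year   <- 2012L
grid$Origin <- "SEA"
grid$Dest   <- "HNL"

# The rep()-based construction used above, for comparison
by_rep <- data.frame(UniqueCarrier = rep(carriers, times = 168),
                     DayOfWeek     = rep(days, times = 72),
                     CRSDepTime    = rep(0:23, each = 21),
                     stringsAsFactors = FALSE)

# Same set of combinations, possibly in a different row order
key <- function(d) paste(d$UniqueCarrier, d$DayOfWeek, d$CRSDepTime)
same_combinations <- setequal(key(grid), key(by_rep))
print(nrow(grid))
print(same_combinations)
```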

It will take a few minutes to download the zip file with the 143M airlines records, but please do, and let us know how your modeling efforts go.

*The latest in a series by Daniel Hanson*

**Introduction**

Correlations between holdings in a portfolio are of course a key component in financial risk management. Borrowing a tool common in fields such as bioinformatics and genetics, we will look at how to use heat maps in R for visualizing correlations among financial returns, and examine behavior in both a stable and down market.

While base R contains its own heatmap(.) function, the reader will likely find the heatmap.2(.) function in the R package gplots to be a bit more user friendly. A very nicely written companion article entitled A short tutorial for decent heat maps in R (Sebastian Raschka, 2013), which covers more details and features, is available on the web; we will also refer to it in the discussion below.

We will present the topic in the form of an example.

**Sample Data**

As in previous articles, we will make use of R packages Quandl and xts to acquire and manage our market data. Here, in a simple example, we will use returns from the following global equity indices over the period 1998-01-05 to the present, and then examine correlations between them:

- S&P 500 (US)
- RUSSELL 2000 (US Small Cap)
- NIKKEI (Japan)
- HANG SENG (Hong Kong)
- DAX (Germany)
- CAC (France)
- KOSPI (Korea)

First, we gather the index values and convert to returns:

    library(xts)
    library(Quandl)

    my_start_date <- "1998-01-05"
    SP500.Q <- Quandl("YAHOO/INDEX_GSPC", start_date = my_start_date, type = "xts")
    RUSS2000.Q <- Quandl("YAHOO/INDEX_RUT", start_date = my_start_date, type = "xts")
    NIKKEI.Q <- Quandl("NIKKEI/INDEX", start_date = my_start_date, type = "xts")
    HANG_SENG.Q <- Quandl("YAHOO/INDEX_HSI", start_date = my_start_date, type = "xts")
    DAX.Q <- Quandl("YAHOO/INDEX_GDAXI", start_date = my_start_date, type = "xts")
    CAC.Q <- Quandl("YAHOO/INDEX_FCHI", start_date = my_start_date, type = "xts")
    KOSPI.Q <- Quandl("YAHOO/INDEX_KS11", start_date = my_start_date, type = "xts")

    # Depending on the index, the final price for each day is either
    # "Adjusted Close" or "Close Price". Extract this single column for each:
    SP500 <- SP500.Q[, "Adjusted Close"]
    RUSS2000 <- RUSS2000.Q[, "Adjusted Close"]
    DAX <- DAX.Q[, "Adjusted Close"]
    CAC <- CAC.Q[, "Adjusted Close"]
    KOSPI <- KOSPI.Q[, "Adjusted Close"]
    NIKKEI <- NIKKEI.Q[, "Close Price"]
    HANG_SENG <- HANG_SENG.Q[, "Adjusted Close"]

    # The xts merge(.) function will only accept two series at a time.
    # We can, however, merge multiple columns by downcasting to *zoo* objects.
    # Remark: "all = FALSE" uses an inner join to merge the data.
    z <- merge(as.zoo(SP500), as.zoo(RUSS2000), as.zoo(DAX), as.zoo(CAC),
               as.zoo(KOSPI), as.zoo(NIKKEI), as.zoo(HANG_SENG), all = FALSE)

    # Set the column names; these will be used in the heat maps:
    myColnames <- c("SP500", "RUSS2000", "DAX", "CAC", "KOSPI", "NIKKEI", "HANG_SENG")
    colnames(z) <- myColnames

    # Cast back to an xts object:
    mktPrices <- as.xts(z)

    # Next, calculate log returns:
    mktRtns <- diff(log(mktPrices), lag = 1)
    head(mktRtns)
    mktRtns <- mktRtns[-1, ]  # Remove resulting NA in the 1st row
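The return calculation at the end of the block is worth a closer look. `diff(log(p))` gives the log return log(p_t / p_{t-1}), which is close to the simple return for small moves and adds up across periods. Here is a toy check in base R, with made-up prices:

```r
prices <- c(100, 102, 99, 105)   # made-up closing prices

log_returns    <- diff(log(prices))
simple_returns <- diff(prices) / head(prices, -1)

# Log returns are additive: their sum is the log return over the whole period
total_log_return <- log(prices[length(prices)] / prices[1])
print(all.equal(sum(log_returns), total_log_return))

print(round(cbind(simple_returns, log_returns), 4))
```

Unlike base `diff()`, the xts method pads the first row with NA by default, which is why the block above drops that row.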

**Generate Heat Maps**

As noted above, heatmap.2(.) is the function in the gplots package that we will use. For convenience, we’ll wrap this function inside our own generate_heat_map(.) function, as we will call this parameterization several times to compare market conditions.

As for the parameterization, the comments should be self-explanatory, but we’re keeping things simple by eliminating the dendrogram, and leaving out the trace lines inside the heat map and the density plot inside the color legend. Note also the setting Rowv = FALSE; this ensures the ordering of the rows and columns remains consistent from plot to plot. We’re also just using the default color settings; for customized colors, see the Raschka tutorial linked above.

    library(gplots)

    generate_heat_map <- function(correlationMatrix, title) {
      heatmap.2(x = correlationMatrix,         # the correlation matrix input
                cellnote = correlationMatrix,  # places correlation value in each cell
                main = title,                  # heat map title
                symm = TRUE,                   # configure diagram as standard correlation matrix
                dendrogram = "none",           # do not draw dendrograms
                Rowv = FALSE,                  # keep ordering consistent
                trace = "none",                # turns off trace lines inside the heat map
                density.info = "none",         # turns off density plot inside color legend
                notecol = "black")             # set font color of cell labels to black
    }

Next, let’s calculate three correlation matrices using the data we have obtained:

- Correlations based on the entire data set from 1998-01-05 to the present
- Correlations of market indices during a reasonably calm period -- January through December 2004
- Correlations of falling market indices in the midst of the financial crisis - October 2008 through May 2009
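The code that builds these three matrices (corr1, corr2 and corr3 in the calls below) is not shown in the text. A minimal sketch of the idea in base R follows, using simulated returns so that it runs without Quandl or xts. The simulated data, the helper `window_corr()` and the scaling to percentages are all assumptions for illustration. With the xts object mktRtns from above, the same windows could be selected with xts date-range subsetting such as `mktRtns["2004-01/2004-12"]`.

```r
set.seed(2014)

# Simulated daily returns for three hypothetical indices, indexed by date
dates  <- seq(as.Date("1998-01-05"), as.Date("2009-12-31"), by = "day")
common <- rnorm(length(dates), sd = 0.01)            # shared market factor
rtns <- data.frame(
  A = common + rnorm(length(dates), sd = 0.01),
  B = common + rnorm(length(dates), sd = 0.01),
  C = common + rnorm(length(dates), sd = 0.01)
)

# Hypothetical helper: correlation matrix (in percent) over a date window
window_corr <- function(returns, dates, from, to) {
  in_window <- dates >= as.Date(from) & dates <= as.Date(to)
  round(cor(returns[in_window, ]) * 100, 2)
}

corr_all    <- round(cor(rtns) * 100, 2)
corr_calm   <- window_corr(rtns, dates, "2004-01-01", "2004-12-31")
corr_crisis <- window_corr(rtns, dates, "2008-10-01", "2009-05-31")
print(corr_crisis)
```

Rounding to two decimals keeps the cell labels in a heat map readable; the rounding precision is a choice made here, not something the text specifies.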

Now, let’s call our heat map function using the total market data set:

    generate_heat_map(corr1, "Correlations of World Market Returns, Jan 1998 - Present")

And then, examine the result:

As expected, we trivially have correlations of 100% down the main diagonal. Note that, as shown in the color key, the darker the color, the lower the correlation. By design, the title is set with the main = title parameter of heatmap.2(.), and the correlations are shown in black via the notecol="black" setting.

Next, let’s look at a period of relative calm in the markets, namely the year 2004:

    generate_heat_map(corr2, "Correlations of World Market Returns, Jan - Dec 2004")

This gives us:


Note that in this case, the darker colors in each of the cells show at a glance that the correlations are even lower than those from our entire data set. This may of course be verified by comparing the numerical values.

Finally, let’s look at the opposite extreme, during the upheaval of the financial crisis in 2008-2009:

    generate_heat_map(corr3, "Correlations of World Market Returns, Oct 2008 - May 2009")

This yields the following heat map:

Note that in this case, again just at first glance, we can tell the correlations have increased compared to 2004, by the colors changing from dark to light nearly across the board. While there are some correlations that do not increase all that much, such as the SP500/Nikkei and the Russell 2000/Kospi values, there are others across international and capitalization categories that jump quite significantly, such as the SP500/Hang Seng correlation going from about 21% to 41%, and that of the Russell 2000/DAX moving from 43% to over 57%. So, in other words, portfolio diversification can take a hit in down markets.

**Conclusion**

In this example, we only looked at seven market indices, but for a closer look at how correlations were affected during 2008-09 -- and how heat maps among a greater number of market sectors compared -- this article, entitled *Diversification is Broken*, is a recommended and interesting read.

by Joseph Rickert

Last week, I posted a list of sessions at the Joint Statistical Meetings related to R. As it turned out, that list was only the tip of the iceberg. In some areas of statistics, such as graphics, simulation and computational statistics, the use of R is so prevalent that people working in the field often don't think to mention it. For example, in the session New Approaches to Data Exploration and Discovery, which included the presentation on the Glassbox package that figured in my original list, R was important to the analyses underlying nearly all of the talks in one way or another. The following are synopses of the talks in that session along with some pointers to relevant R resources.

Exploring Huge Collections of Scatterplots

Statistics and visualization legend Leland Wilkinson of Skytree showed off ScagExplorer, a tool he built with Tuan Dang of the University of Illinois at Chicago to explore scagnostics (a contraction of “Scatter Plot Diagnostics” coined by John and Paul Tukey in the 1980s). ScagExplorer makes it possible to look for anomalies and search for similar distributions in a huge collection of scatter plots. (The example Leland showed contained 124K plots.) The ideas and many of the visuals for the talk can be found in the paper ScagExplorer: Exploring Scatterplots by Their Scagnostics. ScagExplorer is a Java-based tool, but R users can work with the scagnostics package written by Lee Wilkinson and Anushka Anand in 2007.

Glassbox: An R Package for Visualizing Algorithmic Models

Google’s Max Ghenis presented work he did with fellow Googlers Ben Ogorek and Estevan Flores. Glassbox is an R application that attempts to provide transparency to “blackbox” algorithmic models such as Random Forests. Among other things, it calculates and plots the collective importance of groups of variables in such a model. The slides for the presentation are available, as is the package itself. Google is using predictive modeling and tools such as glassbox to better understand the characteristics of its workforce and to ask important, reflective questions such as “How can we better understand diversity?” The company also does HR modeling to see if what they know about people can give them a competitive edge in hiring. For example, Google uses data collected from people who have interviewed at the company in the past, but who have not received offers from Google, to try and understand Google's future hiring needs. The coolest thing about this presentation was that these guys work for the Human Resources Department! If you think that you work for a tech company, go down to HR and see if you can get some help with Random Forests.

A Web Application for Efficient Analysis of Peptide Libraries

Eric Hare of Iowa State University introduced PeLica, work he did with colleagues Timo Sieber of University Medical Center Hamburg-Eppendorf and Heike Hofmann of Iowa State University. PeLica is an interactive, Shiny application to help assess the statistical properties of peptide libraries. PeLica’s creators refer to it as a Peptide Library Calculator that acts as a front end to the R package peptider which contains functions for evaluating the diversity of peptide libraries. The authors have done an exceptional job of using the documentation features available in Shiny to make their app a teaching tool.

To Merge or Not to Merge: An Interactive Visualization Tool for Local Merges of Mixture Model Components

Elizabeth Lorenzi of Carnegie Mellon showed the prototype for an interactive visualization tool that she is working on with Rebecca Nugent of Carnegie Mellon and Nema Dean of the University of Glasgow. The software calculates inter-component similarities of mixture model component trees and displays them as hierarchical dendrograms. Elizabeth and her colleagues are implementing this tool as an R package.

An Interactive Visualization Platform for Interpreting Topic Models

Carson Sievert of Iowa State University presented LDAvis, a general framework for visualizing topic models that he is building with Kenny Shirley of AT&T Labs. LDAvis is interactive R software that enables users to interpret and compare topics by highlighting keywords. The theory is nicely described in a recent paper, and the examples on Carson’s Github page are instructive and fun to play with. In the plot below, circle 26, representing a topic, has been selected. The bar chart on the right displays the 30 most relevant terms for this topic. The red bars represent the frequency of a term in a given topic (proportional to p(term | topic)), and the gray bars represent a term's frequency across the entire corpus (proportional to p(term)).

Gravicom: A Web-Based Tool for Community Detection in Networks

Andrea Kaplan showed off an interactive application that she and her Iowa State University team members, Heike Hofmann and Daniel Nordman, are building. Gravicom is an interactive web application based on Shiny and the D3 JavaScript library that lets a user manually collect nodes into clusters in a social network graph and then save this grouping information for subsequent processing. The idea is that eyeballing a large social network and selecting “obvious” groups may be an efficient way to initialize a machine learning algorithm. Have a look at a live demo.

Human Factors Influencing Visual Statistical Inference

Mahbubul Majumder of the University of Nebraska presented joint work done with Heike Hofmann and Dianne Cook, both of Iowa State University, on identifying key factors, such as demographics, experience, training, or even the placement of figures in an array of plots, that may be important for the human analysis of visual data.

I'm here at the JSM conference in Boston, the latest annual gathering of 6000+ statisticians from North America and around the world. (Revolution Analytics is a proud sponsor of the conference.) One of the great things to see is that the American Statistical Association, the organizer of the conference and the professional body for statisticians, is putting some effort into promoting Statistics as a discipline with its new website This is Statistics.

Statistics as a discipline has been overshadowed by its close sibling, Data Science. (I'm as guilty of this as anyone.) As statistician Terry Speed points out (hat tip: Stephanie Hicks), statisticians haven't been deeply involved in the Big Data revolution, despite having been trained in exactly the types of issues that are critical to extracting useful inferences from complex data sets. So it's great to see that the ASA is helping to promote the role of statisticians in today's data-centric workplace. My favorite video comes from Roger Peng, who has himself been instrumental in promoting statistical analysis in R via Coursera:

Check out the complete This is Statistics site at the link below, and share it with your friends so they can learn what Statistics is really about.

American Statistical Association: This is Statistics

by Joseph Rickert

The Joint Statistical Meetings (JSM) get underway this weekend in Boston and Revolution Analytics is again proud to be a sponsor. More than 6,000 statisticians and data scientists from around the world are expected to attend and listen to thousands of presentations. It is true that many talks will be on specialized topics that only statisticians working in a particular field will have the interest and patience to sit through. However, there is evidence that the conference will have something exciting to offer data scientists and statisticians working in industry. Keyword searches yield 79 presentations for Big Data, 29 on Machine Learning, 17 on Data Science, 17 on Data Mining and 19 related to R. There is more than enough here to fill a data scientist’s dance card.

Three must-see presentations under the Big Data keyword are: Michael Franklin's presentation on Analyzing Data at Scale with the Berkeley Data Analytics Stack; Hui Jiang et al. on Implementation of Statistical Algorithms in Big Data Platforms; and Tim Hesterberg's talk on Simulation-Based Methods in Statistics Education, and Google Tools. Under the Data Science label, Bill Ruh’s invited talk Industrial Internet, an Opportunity for Statisticians to Become Data Scientists looks most inviting. There are also quite a few Data Science talks that indicate some soul searching within the academic community as to how the statistics curriculum ought to be changed. See, for example, Michael Rappa’s talk on Data Scientists: How Do We Prepare for the Future? and Johanna Hardin’s talk: Data Science and Statistics: How Should They Fit into Our Curriculum?

Here is the list of R related presentations:

**Saturday, August 2**

- 8:00 AM - 12:00 PM: Adaptive Tests of Significance Using R and SAS — Professional Development Continuing Education Course ASA Instructor: Tom O'Gorman

**Sunday, August 3**

- 8:30 AM - 5:00 PM: Adaptive Methods in Modern Clinical Trials — Professional Development Continuing Education Course ASA , Biometrics Section Instructors: Frank Bretz, Byron Jones, and Guosheng Yin
- 4:20 PM: Glassbox: An R Package for Visualizing Algorithmic Models: Max Ghenis and Ben Ogorek and Estevan Flores
- 4:45 PM: Bayesian Enrollment and Event Predictions in Clinical Trials Leveraging Literature Data: Aijun Gao and Fanni Natanegara and Govinda Weerakkody

**Monday, August 4**

- 8:55 AM: Thinking with Data in the Second Course: Nicholas J. Horton and Ben S. Baumer and Hadley Wickham
- 8:30 AM to 10:20 AM: Do You See What I See? Formal Usability Testing and Statistical Graphics: Marie C. Vendettuoli and Matthew Williams and Susan Ruth VanderPlas
- 8:35 AM: Preparing Students for Big Data Using R and RStudio: Randall Pruim
- 8:35 AM: Does R Provide What Customer Need?: Vipin Arora
- 8:55 AM: Doing Reproducible Research Unconsciously: Higher Standard, but Less Work: Yihui Xie
- 12:30 PM to 1:50 PM: Analyzing Umpire Performance Using PITCHf/x: Andrew Swift
- 3:30 PM: The Perfect Bracket: Machine Learning in NCAA Basketball: Sara Stoudt and Loren Santana and Ben S. Baumer

**Tuesday, August 5**

- 10:35 AM: Tools for Teaching R and Statistics Using Games: Brad Luen and Michael Higgins
- 2:00 PM: Multiple Treatment Groups: A Case Study with Health Care Practice and Policy Implications: Alexandra Hanlon and Karen Hirschman and Beth Ann Griffin and Mary Naylor
- 2:05 PM: glmmplus: An R Package for Messy Longitudinal Data: Ben Ogorek and Caitlin Hogan
- 3:30 PM: Give Me an Old Computer, a Blank DVD, and an Internet Connection and I'll Give You World-Class Analytics: Ty Henkaline

**Wednesday, August 6**

- 9:35 AM: Testing Packages for the R Language: Stephen Kaluzny and Lou Bajuk-Yorgan
- 9:50 AM: Using R Analytics on Streaming Data: Lou Bajuk-Yorgan and Stephen Kaluzny
- 10:35 AM: Shiny: Easy Web Applications in R: Joseph Cheng
- 10:30 AM to 12:20 PM: Classroom Demonstrations of Big Data: Eric A. Suess
- 11:00 AM: ggvis: Moving Toward a Grammar of Interactive Graphics: Hadley Wickham
- 3:05 PM: Accessing Data from the Census Bureau API: Alex Shum and Heike Hofmann

**Thursday, August 7**

- 9:20 AM: Predicting Dangerous E. Coli Levels at Erie, Pennsylvania, Beaches with Random Forests in R: Michael Rutter
- 9:25 AM: Beyond the Black Box: Flexible Programming of Hierarchical Modeling Algorithms for BUGS-Compatible Models Using NIMBLE: Perry de Valpine and Daniel Turek and Christopher J. Paciorek and Rastislav Bodik and Duncan Temple Lang

If you are going to JSM please come by booth #303 to say hello. You may also find the mobile apps (Apple or Android) that Revolution Analytics is sponsoring useful, and don't forget to fill out the survey for a chance to win an Apple TV.

Finally, I will be the program chair for Session 401, Monte Carlo Methods to be held Tuesday, 8/5/2014, from 2:00 PM to 3:50 PM in room CC-101. If you are interested in simulation be sure to drop in. I have seen the presentations and think they are well worth attending.

by Joseph Rickert

If I had to pick just one application to be the “killer app” for the digital computer I would probably choose Agent Based Modeling (ABM). Imagine creating a world populated with hundreds, or even thousands of agents, interacting with each other and with the environment according to their own simple rules. What kinds of patterns and behaviors would emerge if you just let the simulation run? Could you guess a set of rules that would mimic some part of the real world? This dream is probably much older than the digital computer, but according to Jan Thiele’s brief account of the history of ABMs that begins his recent paper, *R Marries NetLogo: Introduction to the RNetLogo Package* in the *Journal of Statistical Software,* academic work with ABMs didn’t really take off until the late 1990s.

Now, people are using ABMs for serious studies in economics, sociology, ecology, socio-psychology, anthropology, marketing and many other fields. No less of a complexity scientist than Doyne Farmer (of Dynamic Systems and Prediction Company fame) has argued in *Nature* for using ABMs to model the complexity of the US economy, and has published on using ABMs to drive investment models. In the following clip of a 2006 interview, Doyne talks about building ABMs to explain the role of subprime mortgages in the Housing Crisis. (Note that when asked about how one would calibrate such a model, Doyne explains the need to collect massive amounts of data on individuals.)

Fortunately, the tools for building ABMs seem to be keeping pace with the ambition of the modelers. There are now dozens of platforms for building ABMs, and it is somewhat surprising that NetLogo, a tool with some whimsical terminology (e.g. agents are called turtles) that was designed for teaching children, has apparently become a de facto standard. NetLogo is Java based, has an intuitive GUI, ships with dozens of useful sample models, is easy to program, and is available under the GPL 2 license.

As you might expect, R is a perfect complement for NetLogo. Doing serious simulation work requires a considerable amount of statistics for calibrating models, designing experiments, performing sensitivity analyses, reducing data, exploring the results of simulation runs and much more. The recent *JASSS* paper *Facilitating Parameter Estimation and Sensitivity Analysis of Agent-Based Models: A Cookbook Using NetLogo and R* by Thiele and his collaborators describes the R / NetLogo relationship in great detail and points to a decade’s worth of reading. But the real fun is that Thiele’s RNetLogo package lets you jump in and start analyzing NetLogo models in a matter of minutes.

Here is part of an extended example from Thiele's *JSS* paper that shows R interacting with the Fire model that ships with NetLogo. Using some very simple logic, Fire models the progress of a forest fire.

Snippet of NetLogo Code that drives the Fire model

    to go
      if not any? turtles  ;; either fires or embers
        [ stop ]
      ask fires
        [ ask neighbors4 with [pcolor = green]
            [ ignite ]
          set breed embers ]
      fade-embers
      tick
    end

    ;; creates the fire turtles
    to ignite  ;; patch procedure
      sprout-fires 1
        [ set color red ]
      set pcolor black
      set burned-trees burned-trees + 1
    end

The general idea is that turtles representing the frontier of the fire run through a grid of randomly placed trees. Not shown in the above snippet is the setup logic, in which the entire model is controlled by a single parameter representing the density of the trees.

This next bit of R code shows how to launch the Fire model from R, set the density parameter, and run the model.

    # Launch RNetLogo and control an initial run of the
    # NetLogo Fire Model
    library(RNetLogo)
    nlDir <- "C:/Program Files (x86)/NetLogo 5.0.5"
    setwd(nlDir)
    nl.path <- getwd()
    NLStart(nl.path)
    model.path <- file.path("models", "Sample Models", "Earth Science", "Fire.nlogo")
    NLLoadModel(file.path(nl.path, model.path))
    NLCommand("set density 70")  # set density value
    NLCommand("setup")           # call the setup routine
    NLCommand("go")              # launch the model from R

Here we see the Fire model running in the NetLogo GUI after it was launched from RStudio.

This next bit of code tracks the progression of the fire as a function of time (model "ticks"), returns results to R and plots them. The plot shows the non-linear behavior of the system.

    # Investigate percentage of forest burned as simulation proceeds and plot
    library(ggplot2)
    NLCommand("set density 60")
    NLCommand("setup")
    burned <- NLDoReportWhile("any? turtles", "go",
                              c("ticks", "(burned-trees / initial-trees) * 100"),
                              as.data.frame = TRUE,
                              df.col.names = c("tick", "percent.burned"))
    # Plot with ggplot2
    p <- ggplot(burned, aes(x = tick, y = percent.burned))
    p + geom_line() + ggtitle("Non-linear forest fire progression with density = 60")

As with many dynamical systems, the Fire model displays a phase transition. Setting the density lower than 55 will not result in the complete destruction of the forest, while setting density above 75 will very likely result in complete destruction. The following plot shows this behavior.

RNetLogo makes it very easy to programmatically run multiple simulations and capture the results for analysis in R. The following two lines of code run the Fire model twenty times for each value of density between 55 and 65, the region surrounding the phase transition.

    d <- seq(55, 65, 1)    # vector of densities to examine
    res <- rep.sim(d, 20)  # Run the simulation

The plot below shows the variability of the percent of trees burned as a function of density in the transition region.

My code to generate plots is available in the file: Download NelLogo_blog while all of the code from Thiele's JSS paper is available from the journal website.
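For readers without NetLogo at hand, the transition can also be explored with a simplified stand-in written in base R. This is a sketch under stated assumptions, not the NetLogo Fire model and not the rep.sim() function used above: the function `simulate_fire()`, the 50 x 50 grid and the rule that the whole left edge starts burning are all choices made here for illustration. Fire spreads from a burning cell to any tree directly above, below, left or right of it.

```r
# Simplified stand-in for a forest-fire model (hypothetical, base R only)
simulate_fire <- function(density, n = 50) {
  # TRUE marks a tree; each cell is a tree with probability density / 100
  trees <- matrix(runif(n * n) < density / 100, nrow = n)
  trees[, 1] <- FALSE                  # the left column is the ignition line
  initial_trees <- sum(trees)
  if (initial_trees == 0) return(0)

  burning <- matrix(FALSE, n, n)
  burning[, 1] <- TRUE                 # fire starts along the left edge
  burned <- 0
  while (any(burning)) {
    # For each cell, is the neighbor below / above / right / left burning?
    from_below <- rbind(burning[-1, , drop = FALSE], FALSE)
    from_above <- rbind(FALSE, burning[-n, , drop = FALSE])
    from_right <- cbind(burning[, -1, drop = FALSE], FALSE)
    from_left  <- cbind(FALSE, burning[, -n, drop = FALSE])
    ignite <- (from_below | from_above | from_right | from_left) & trees
    trees[ignite] <- FALSE             # a burned tree cannot burn again
    burned <- burned + sum(ignite)
    burning <- ignite                  # only newly ignited cells spread next
  }
  100 * burned / initial_trees         # percent of initial trees burned
}

# Average percent burned over 20 runs for a range of densities
set.seed(1)
densities <- seq(45, 75, by = 5)
mean_burned <- sapply(densities,
                      function(d) mean(replicate(20, simulate_fire(d))))
print(data.frame(density = densities, mean_percent_burned = round(mean_burned, 1)))
```

Because the grid size and ignition rule differ from NetLogo's, the densities at which this toy version changes behavior need not match the values quoted above.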

Finally, here are a few more interesting links related to ABMs.

- On validating ABMs
- ABMs and

by James Peruvankal

There are plenty of options if you want to learn R and are looking for training: your college’s statistics department, or massive open online course providers like Coursera, Udacity, edX, DataCamp, etc. SiliconANGLE recently published an article about top R-training companies.

Let’s talk about how to choose a good R-trainer.

- First and foremost is technical competency in R - In addition to having done a significant amount of R programming, the instructor should have an education in a quantitative field. The idea behind this is that the instructor will have had experience expressing non-trivial ideas in R. However, it is not necessarily the case that the most technically competent person is the best instructor.
- Experience in teaching statistics - Learning R invariably involves working with statistics. So knowing where students can go wrong in understanding statistical concepts is a skill that greatly increases the effectiveness of an R instructor. This skill only comes with experience. Joan Garfield's 1995 article in the International Statistical Review, How Students Learn Statistics, is an excellent reference on what can go wrong in learning statistics and how to correct it.
- Communication skills - The instructor should have the ability to clearly communicate complex topics in simple examples that students can relate to. We recommend Gelman and Nolan's book: Teaching Statistics: A Bag of Tricks which promotes an activity based approach to teaching.
- Evangelism - Passion generates passion. The enthusiasm of the instructor spreads to the students.
- Teaching style and philosophy - From our experience in teaching and based on decades of research on how people learn, we have come up with our teaching philosophy. The most important factor is that people ‘learn by doing’. Ensure that most of the time is spent on hands-on learning.

At Revolution Analytics we are guided by the teaching philosophy presented in the following chart:

So, if you are serious about learning R, brush up on your statistics, be prepared to jump right in and start doing things on your own, surround yourself with people who are passionate about statistics and R, and figure out how to make the whole process fun for you. If you are teaching R and want to join us in our mission to ‘take R to the Enterprise’, see if you can fit in with our team.

Statistics has many canonical data sets. For classification statistics, we have Fisher's iris data. For Big Data statistics, the canonical data set used in many examples is the Airlines data. And for dotplots, we have the barley data, first popularized by Bill Cleveland in the landmark 1993 text *Visualizing Data*. Cleveland's innovations in data visualization were hugely influential in the S language and (later) R's lattice and ggplot2 packages, and the panel chart of the barley data shown below is one of the best known.

The chart above shows the yields for several different varieties of barley (Trebi, Glabron and so on) planted at each of six different sites in Minnesota (Duluth, Grand Rapids, etc.) in the years 1931 (pink) and 1932 (blue). The reason this data set has become legendary appears in the "Morris" panel, where, unlike at the other sites, the yields in 1932 exceeded those in 1931 for all barley varieties. This is a great demonstration of the power of dotplots and panel graphics. In his book, Cleveland said that "either an extraordinary natural event, such as disease or a local weather anomaly, produced a strange coincidence, or the years for Morris were inadvertently reversed", and "on the basis of the evidence, the mistake hypothesis would appear to be the more likely."
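For readers who want to redraw the panel chart themselves, here is a minimal sketch. It assumes the copy of the barley data that ships with the lattice package; the layout, legend placement and axis label are illustrative choices rather than a reproduction of Cleveland's exact figure, and `site_means` is a name made up for this example.

```r
library(lattice)  # recommended package bundled with R; provides the barley data

# One panel per site, varieties on the vertical axis, the two years overlaid.
# print() is used so the plot is also drawn when the code is run via source().
print(dotplot(variety ~ yield | site, data = barley, groups = year,
              layout = c(1, 6),                 # stack the six sites vertically
              auto.key = list(space = "right"), # legend for the two years
              xlab = "Barley yield"))

# Mean yield for each site and year, to compare the two years site by site
site_means <- aggregate(yield ~ site + year, data = barley, FUN = mean)
print(site_means)
```

Scanning the `site_means` table row by row is a quick numerical check on what the panels show for Morris relative to the other sites.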

But it now looks as though, despite Cleveland's suggestion, the data are correct after all. In a paper in the American Statistician published last year, Kevin Wright notes that in that time period local effects of weather (especially drought), insects and disease had greater impact on barley yields than any overall year-to-year effects on yield, and that the results at Morris were not surprising. Kevin offers as evidence extended barley yield data (available in his R package agridat) covering 10 years and 18 varieties. As you can see in the chart below, there is significant variation across years and within sites. Take a look at 1934 for example: a bounty of barley in Duluth, but a meagre crop in St Paul:

So it goes to show that in Cleveland's original example, it wasn't a data error that led to the "unusual" results at the Morris site. Rather, it's an expected consequence of the year-to-year variation of yields in each of the growing sites. But it's no less interesting a data set for showing off the power of dot plots and panel charts, as you can see from several other examples included in Kevin Wright's paper linked below. (With thanks to Kevin for describing this example to me at the useR! 2014 poster session. You can see a version of his poster here.)

Kevin Wright (2013). Revisiting Immer's Barley Data. The American Statistician, 67(3), 129–133.

by Joseph Rickert

Broadly speaking, a meta-analysis is any statistical analysis that attempts to combine the results of several individual studies. The term was apparently coined by statistician Gene V Glass in a 1976 speech he made to the American Educational Research Association. Since that time, not only has meta-analysis become a fundamental tool in medicine, but it is also becoming popular in economics, finance, the social sciences and engineering. Organizations responsible for setting standards for evidence-based medicine such as the United Kingdom’s National Institute for Health and Care Excellence (NICE) make extensive use of meta-analysis.

The application of meta-analysis to medicine is intuitive and, on the surface, compelling. Clinical trials designed to test efficacy for some new treatment for a disease against the standard treatment tend to be based on relatively small samples. (For example, the largest four trials for Respiratory Tract Diseases currently listed on ClinicalTrials.gov has an estimated enrollment of 533 patients.) It would seem to be a “no brainer” to use “all of the information” to get more accurate results. However, as for so many things, the devil is in the details. The preliminary tasks of establishing a rigorous protocol for guiding the meta-analysis and the systematic review to search for relevant studies are themselves far from trivial. One has to work hard to avoid “selection bias”, “publication bias” and other even more subtle difficulties.

In my limited experience with meta-analysis, I found it extraordinarily difficult to determine whether patient populations from different clinical trials were sufficiently homogeneous to be included in the same meta-analysis. Even when working with well-written papers, published in quality journals, a considerable amount of medical expertise was required to interpret the data. I came away with the strong impression that a good meta-analysis requires collaboration from a team of experts.

Historically, it has probably been the case that most meta-analyses were conducted either with general tools such as Excel or specialized software like RevMan from the Cochrane Collaboration. However, R is the natural platform for meta-analysis both because of the myriad possibilities for statistical analyses that are not generally available through the specialized software, and because of the many packages devoted to various aspects of meta-analysis. The CRAN Meta Analysis Task View is exceptionally well organized, listing R packages according to the different stages of conducting a meta-analysis and also calling out some specialized techniques such as meta-regression and network meta-analysis.

In a future post, I hope to be able to explore some of these packages more closely. For now, let’s look at a very simple analysis based on Thomas Lumley’s rmeta package, which has been a part of R since 1999. The following simple meta-analysis is written up very nicely in the book by Chen and Peace titled Applied Meta-Analysis with R.

The cochrane data set in the rmeta package contains the results from seven randomized clinical trials designed to test the effectiveness of corticosteroid therapy in preventing neonatal deaths in premature labor. The columns of the data set are: the name of the trial center, the number of deaths in the treatment group, the total number of patients in the treatment group, the number of deaths in the control group and the total number of patients in the control group.

The null hypothesis is that there is no difference between treatment and control. Following Chen and Peace, we fit both fixed effects and random effects models to look at the odds ratios.
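The two fits take only a few lines. This is a sketch that assumes the rmeta package is installed. The meta.MH call mirrors the one echoed in the fixed effects summary output; the meta.DSL call (DerSimonian-Laird random effects) and the object name model.RE are assumptions made here for illustration.

```r
library(rmeta)   # assumes rmeta is installed from CRAN
data(cochrane)   # columns: name, ev.trt, n.trt, ev.ctrl, n.ctrl

# Fixed effects model (Mantel-Haenszel), odds ratio scale by default
model.FE <- meta.MH(n.trt, n.ctrl, ev.trt, ev.ctrl,
                    names = name, data = cochrane)

# Random effects model (DerSimonian-Laird); assumed to be the random
# effects fit discussed in the text
model.RE <- meta.DSL(n.trt, n.ctrl, ev.trt, ev.ctrl,
                     names = name, data = cochrane)

# Per-study odds ratios, the pooled estimate and the heterogeneity test
print(summary(model.FE))
print(summary(model.RE))
```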

The summary for the fixed effects model shows that while only two studies, Auckland and Doran, individually show a significant effect, the overall confidence interval from the Mantel-Haenszel test does indicate a benefit from the treatment.

    Fixed effects ( Mantel-Haenszel ) meta-analysis
    Call: meta.MH(ntrt = n.trt, nctrl = n.ctrl, ptrt = ev.trt, pctrl = ev.ctrl,
        names = name, data = cochrane)
    ------------------------------------
                   OR (lower  95% upper)
    Auckland     0.58    0.38       0.89
    Block        0.16    0.02       1.45
    Doran        0.25    0.07       0.81
    Gamsu        0.70    0.34       1.45
    Morrison     0.35    0.09       1.41
    Papageorgiou 0.14    0.02       1.16
    Tauesch      1.02    0.37       2.77
    ------------------------------------
    Mantel-Haenszel OR =0.53 95% CI ( 0.39,0.73 )
    Test for heterogeneity: X^2( 6 ) = 6.9 ( p-value 0.3303 )

The summary for the random effects model for these data is identical except that, as one would expect, the overall confidence interval is somewhat wider: SummaryOR= 0.53 95% CI ( 0.37,0.78 ). A slightly enhanced version of the forest plot code provided by Chen and Peace (which works for both the fixed effects and random effects model objects) shows the typical way to present these results.

    # Requires library(rmeta) for forestplot() and the cochrane data
    CPplot <- function(model){
      # Text columns for the table drawn beside the plot
      c1 <- c("","Study",model$names,NA,"Summary")
      c2 <- c("Deaths","(Steroid)",cochrane$ev.trt,NA,NA)
      c3 <- c("Deaths","(Placebo)",cochrane$ev.ctrl,NA,NA)
      c4 <- c("","OR",format(exp(model[[1]]),digits=2),NA,format(exp(model[[3]]),digits=2))
      tableText <- cbind(c1,c2,c3,c4)

      # Log odds ratios and standard errors: per study, then the pooled summary
      mean   <- c(NA,NA,model[[1]],NA,model[[3]])
      stderr <- c(NA,NA,model[[2]],NA,model[[4]])
      low <- mean - 1.96*stderr   # approximate 95% limits on the log scale
      up  <- mean + 1.96*stderr

      forestplot(tableText,mean,low,up,zero=0,
                 is.summary=c(TRUE,TRUE,rep(FALSE,8),TRUE),
                 clip=c(log(0.1),log(2.5)),xlog=TRUE)
    }

    CPplot(model.FE)

The whole idea of meta-analysis is intriguing. However, because of the challenges I mentioned above, I would be remiss not to point out that it elicits considerable criticism. The article Meta-analysis and its problems by H J Eysenck captures the issues and is well worth reading. Also, have a look at the review article by Walker, Hernandez and Kattan writing in the Cleveland Clinic Journal of Medicine.

by Joseph Rickert

John Chambers opened UseR! 2014 by describing how the R language grew out of early efforts to give statisticians easier access to high quality statistical software. In 1976 computational statistics was a very active field, but most algorithms were compiled as Fortran subroutines. Building models with this software was not a trivial process. First you had to write a main Fortran program to implement the model and call the right subroutines, and then you had to write the job control language code to submit your job and get it executed. When John and his Bell Labs colleagues sat down on that May afternoon to work on what would become the first implementation of the S language they were thinking about how they could make this process easier. The top half of John’s famous diagram from that afternoon schematically indicates their intention to design a software interface so that one could call an arbitrary Fortran subroutine, ABC, by wrapping it in some simplified calling syntax: XABC( ).

The main idea was to bring the best computational facilities to the people doing the analysis. As John phrased it: “combine serious computational challenges with convenience”. In the end, the designers of both S, and its second incarnation, R, did much better than convenience. They built a tool to facilitate “flow”. When you are engaged in any mentally challenging work (including statistical analysis) at a high level of play, you want to be able to stay in the zone and not get knocked out by peripheral tasks that interrupt your thought processes. As engaging and meaningful as it is in its own right, writing code is not doing statistics. One of the big advantages of working with R is that you can do quite a bit of statistics with just a handful of functions and the simplest syntax. R is a tool that helps you keep moving forward. If you want to see something then plot it. If the data are in the wrong format, then mutate them.
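As a small illustration of that kind of flow, here is a sketch using only base R and the built-in mtcars data set. Neither the data set nor the variable choices come from John's talk, and the name `cars_kg` is made up for this example.

```r
# Illustrative only: mtcars ships with base R; the model is arbitrary.
fit <- lm(mpg ~ wt, data = mtcars)   # fuel economy as a function of weight
print(coef(fit))                     # estimated intercept and slope

plot(mpg ~ wt, data = mtcars)        # want to see something? plot it
abline(fit)                          # overlay the fitted line

# Data in the wrong format? Change it in one step:
# wt is recorded in units of 1000 lbs, so add a column in kilograms.
cars_kg <- transform(mtcars, wt_kg = wt * 453.592)
print(head(cars_kg[, c("mpg", "wt", "wt_kg")]))
```

Each step is a single function call on the object produced by the previous one, which is what keeps the analysis moving.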

A second idea that flows from the idea of S as an interface is that S was not intended to be self sufficient. John was explicit that S was designed as an interface to the “best algorithms”, not as a “from the ground up programming language”. The idea of being able to make use of external computational resources is still compelling. There will always be high-quality stuff that we will want to get at. Moreover, as John elaborated: “unlike 38 years ago there are many possible interfaces to languages, to other computing models and to (specialized) hardware”. The challenge is to interface to applications that are “too diverse for one solution to fit them all”, and to do this “without losing the R that works in ‘ordinary’ circumstances”.

John offered three examples of R projects that extend the reach of R to leverage other computing environments.

- Rcpp - turns C++ into an R function by generating an interface to C++ with much less programming effort than .Call
- RLLVM - enables compiling R language code into specialized forms for efficiency and other purposes
- H2O - provides a compressed, efficient external version of a data frame for running statistical models on large data sets.

These examples, chosen to represent each of the three different kinds of interface targets that John called out, also represent projects of different scope and levels of integration. With a total of 226 reverse depends and reverse imports, Rcpp is already a great success. It is likely that ready access to C++ will form a permanent part of the R programmer’s mindset.
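To give a flavor of that ready access to C++, here is a minimal sketch. It assumes the Rcpp package and a working C++ compiler are installed; the function name `sumSquares` is hypothetical and chosen only for illustration.

```r
library(Rcpp)  # assumes Rcpp and a C++ toolchain are available

# cppFunction() compiles the C++ source and generates the R wrapper,
# so no hand-written .Call glue code is needed.
cppFunction('
double sumSquares(NumericVector x) {
  double total = 0.0;
  for (int i = 0; i < x.size(); ++i) {
    total += x[i] * x[i];   // accumulate the squared elements
  }
  return total;
}')

# The compiled function is now callable like any other R function
print(sumSquares(c(1, 2, 3)))
```

After the call to cppFunction(), the C++ routine behaves like an ordinary R function, very much in the spirit of the original XABC( ) idea.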

RLLVM is a much more radical and ambitious project that would allow R to be the window to entirely different computing models. As best I understand it, the central idea is to use the R environment as the system interface to “any number of new languages”, perhaps languages that have not yet been invented. RLLVM would “Use R syntax for commands to be interpreted in a different interpreter”. RLLVM seems to be a powerful idea and a direct generalization of the original XABC() idea.

The RH2O package is an example of providing R users with transparent access to data sets that are too large to fit into memory. It is one of many efforts underway (including those from Revolution Analytics) to integrate Hadoop, Teradata, Spark and other specialized computing platforms within the R environment. Some of these specialized platforms may indeed be long-lived, but it is not likely that all of them will be. From the point of view of doing statistics, it is the R interface that is likely to survive and persist; platforms will come and go.

An implication of the willingness of R developers to embrace diversity is that R is likely to always be a work in progress. There will be loose ends, annoying inconsistencies and unimplemented possibilities. I suppose that there are people who will never be comfortable with this state of affairs. It is not unreasonable to prefer a system where there is one best way to do something and where, within the bounds of some pre-established design, there is near perfect consistency. However, the pursuit of uniformity and consistency seems to me to doom designers to be at least one step behind, because it means continually starting over to get things right.

So what does this say about the future of R? John closed his talk by stating that “the best future would be one of variety, not uniformity”. I take this to mean that, for the near future anyway, whatever the next big thing is, it is likely that someone will write an R package to talk to it.

Some links regarding S and R History:

- John Chambers useR! 2006 slides
- Trevor Hastie's Interview with John Chambers
- Ross Ihaka: R: Past and Future History
- New York Times Article