*by Bill Jacobs*

Revolution R Enterprise is the industry's first R-based analytics platform that supports a variety of parallel, grid and clustered systems such as Hadoop, Teradata database and Platform LSF Linux grids.

Last year, we enhanced Revolution R Enterprise (RRE) to support big data systems, with support for Hadoop. We continued expansion of RRE in 2014, adding support for Teradata EDWs in March, for Kerberos in May, and for MapR Hadoop in June.

We're continuing our commitment to big data analytics in R by releasing RRE Version 7.3. Available immediately, RRE V7.3 adds a number of new capabilities:

Algorithms:

- A new Stochastic Gradient Boosting algorithm, rxBTrees, creates boosted classification and regression trees. As in our Decision Forests algorithm (equivalent to Random Forest), trees are fitted to subsamples of the data, but here they are boosted: added to the model sequentially. At each iteration, a new regression tree is fitted to the current pseudo-residuals, further improving prediction accuracy.
- PMML export for Decision Forests including our new Stochastic Gradient Boosting algorithm.
- A production-tested API for writing custom parallelized algorithms in R with expanded support for EDWs and clustered systems like Hadoop.
- Performance and memory utilization improvements for our Decision Forest algorithm.
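The boosting idea described in the first item above (fit a small tree to the current pseudo-residuals on a random subsample, then add it to the model) can be sketched in a few lines of base R. This is only an illustrative sketch for squared-error loss using one-split "stumps"; it is not rxBTrees, and the function names `fit_stump`, `predict_stump` and `boost`, as well as the parameter values, are made up for the example.

```r
# Minimal sketch of stochastic gradient boosting for squared-error loss,
# using one-split regression stumps. Illustrative only; this is NOT rxBTrees.

# Fit a stump: pick the split on x that minimizes the residual sum of squares.
fit_stump <- function(x, r) {
  xs <- sort(unique(x))
  if (length(xs) < 2) {
    return(list(split = Inf, left = mean(r), right = mean(r)))
  }
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2   # candidate splits (midpoints)
  sse <- vapply(cuts, function(cut) {
    l <- r[x <= cut]
    g <- r[x > cut]
    sum((l - mean(l))^2) + sum((g - mean(g))^2)
  }, numeric(1))
  best <- cuts[which.min(sse)]
  list(split = best, left = mean(r[x <= best]), right = mean(r[x > best]))
}

predict_stump <- function(stump, x) {
  ifelse(x <= stump$split, stump$left, stump$right)
}

boost <- function(x, y, n_trees = 200, shrinkage = 0.1, subsample = 0.5) {
  pred <- rep(mean(y), length(y))              # start from the mean
  stumps <- vector("list", n_trees)
  for (m in seq_len(n_trees)) {
    resid <- y - pred                          # pseudo-residuals (squared error)
    idx <- sample(length(y), floor(subsample * length(y)))  # random subsample
    stumps[[m]] <- fit_stump(x[idx], resid[idx])
    pred <- pred + shrinkage * predict_stump(stumps[[m]], x)  # add sequentially
  }
  list(stumps = stumps, fitted = pred)
}

set.seed(42)
x <- runif(300, 0, 2 * pi)
y <- sin(x) + rnorm(300, sd = 0.2)
model <- boost(x, y)

mse_start <- mean((y - mean(y))^2)        # error of the constant model
mse_boost <- mean((y - model$fitted)^2)   # error after boosting
c(baseline = mse_start, boosted = mse_boost)
```

The shrinkage and subsample fraction are the usual knobs: smaller shrinkage needs more trees, and subsampling is what makes the procedure "stochastic".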

Updated and Improved Platform Support:

- Simplified and automated Hadoop installation processes including improved Cloudera Manager Parcels for CDH4 and CDH5.
- Improvements in performance and memory utilization for Teradata EDWs.
- Support for YARN on MapR Hadoop with support for MapR version 4.0.1.
- Certification of the new Teradata 15.0 and Cloudera Hadoop CDH 5.1 platforms.

Deployment and Integration:

- New RBroker Framework added to DeployR speeds integration to provide on-demand R analytics to JavaScript, Java, or .NET applications.
- Enriched output from data scoring by appending additional variables that simplify use of scored data by other applications.

Open Source:

- RRE now includes R version 3.1.1
- Revolution Analytics has released a version of DeployR as open source.

RRE V7.3 is available now. Existing users have been notified of the availability of the download and it’s recommended for all RRE users. If you’ve not received your letter, contact support or your sales team to get the information.

While it’s a “minor” revision, it’s an important one, especially for our big data users. For more information on RRE version 7.3, browse:

http://packages.revolutionanalytics.com/doc/7.3.0/README_RevoEnt_Windows_7.3.0.pdf

For more information on Revolution’s DeployR integration and deployment platform browse:

http://deployr.revolutionanalytics.com/

Have a look and tell us what you think in the comments.

by Joseph Rickert

Recently, I had the opportunity to present a webinar on R and Data Science. The challenge with attempting this sort of thing is to say something interesting that does justice to the subject while being suitable for an audience that may include both experienced R users and curious beginners. The approach I settled on had three parts. I decided to:

- show a few slides that indicate the status of R among data scientists
- offer some thoughts as to why R is such a popular and effective tool
- work through some code.

The "why" slides attempt to convey the great number of machine learning and statistical algorithms available in R, the visualization capabilities, the richness of the R programming language and its many tools for data manipulation. I tried to emphasize the great amount of effort that the R community continues to make in order to integrate R with other languages and computing platforms, and to scale R to handle massive data sets on Hadoop and other big data platforms.

The code examples presented in the webinar emphasize the machine learning algorithms organized in the caret package and the many tools available for working through the predictive modeling process, such as functions for searching through the parameter space of a model, performing cross validation, comparing models, etc. The code for the caret examples is available here.

Towards the end of the webinar I show the code for running a large Tweedie model with Revolution Analytics' rxGlm() function, and I also show what it looks like to run an rxLogit() model directly on Hadoop.

Click on the video to view the webinar, or go to the Revolution Analytics website to download the webinar and a pdf of the slides. All of the code is available on my GitHub repository.

by Joseph Rickert

While preparing for the DataWeek R Bootcamp that I conducted this week I came across the following gem. This code, based directly on a Max Kuhn presentation of a couple years back, compares the efficacy of two machine learning models on a training data set.

    library(caret)       # train(), trainControl(), resamples()
    library(doParallel)  # registerDoParallel(), getDoParWorkers()

    #-----------------------------------------
    # SET UP THE PARAMETER SPACE SEARCH GRID
    ctrl <- trainControl(method = "repeatedcv",             # use repeated 10-fold cross validation
                         repeats = 5,                       # do 5 repetitions of 10-fold cv
                         summaryFunction = twoClassSummary, # use AUC to pick the best model
                         classProbs = TRUE)

    # Note that the default search grid selects 3 values of each tuning parameter
    #
    grid <- expand.grid(.interaction.depth = seq(1, 7, by = 2), # look at tree depths from 1 to 7
                        .n.trees = seq(10, 100, by = 5),        # let iterations go from 10 to 100
                        .shrinkage = c(0.01, 0.1))              # try 2 values of the learning rate parameter

    # BOOSTED TREE MODEL
    # (trainData is assumed to be loaded already; see the full script)
    set.seed(1)
    names(trainData)
    trainX <- trainData[, 4:61]
    registerDoParallel(4)   # register a parallel backend for train
    getDoParWorkers()
    system.time(gbm.tune <- train(x = trainX, y = trainData$Class,
                                  method = "gbm",
                                  metric = "ROC",
                                  trControl = ctrl,
                                  tuneGrid = grid,
                                  verbose = FALSE))

    #---------------------------------
    # SUPPORT VECTOR MACHINE MODEL
    #
    set.seed(1)
    registerDoParallel(4, cores = 4)
    getDoParWorkers()
    system.time(
      svm.tune <- train(x = trainX,
                        y = trainData$Class,
                        method = "svmRadial",
                        tuneLength = 9,                   # 9 values of the cost function
                        preProc = c("center", "scale"),
                        metric = "ROC",
                        trControl = ctrl)                 # same as for gbm above
    )

    #-----------------------------------
    # COMPARE MODELS USING RESAMPLING
    # Having set the seed to 1 before running gbm.tune and svm.tune
    # we have generated paired samples for comparing models using resampling.
    #
    # The resamples function in caret collates the resampling results from the two models
    rValues <- resamples(list(svm = svm.tune, gbm = gbm.tune))
    rValues$values

    #---------------------------------------------
    # BOXPLOTS COMPARING RESULTS
    bwplot(rValues, metric = "ROC")   # boxplot

After setting up a grid to search the parameter space of a model, the train() function from the caret package is used to train a generalized boosted regression model (gbm) and a support vector machine (svm). Setting the seed produces paired samples and enables the two models to be compared using the resampling technique described in Hothorn et al., *"The design and analysis of benchmark experiments"*, Journal of Computational and Graphical Statistics (2005) vol 14 (3) pp 675-699.

The performance metric for the comparison is the area under the ROC curve (AUC). From examining the boxplots of the sampling distributions for the two models it is apparent that, in this case, the gbm has the advantage.
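To see why resetting the seed yields paired samples, here is a small base R sketch. It is not the caret code above: it uses simulated data, two simple `lm()` models and RMSE rather than ROC, and the helper `cv_rmse()` is a made-up name. The point is only that the same seed reproduces the same fold assignment, so the two models' scores can be compared fold by fold.

```r
# Sketch: the same seed gives the same folds, so per-fold scores are paired.
set.seed(123)
n <- 200
dat <- data.frame(x = runif(n, -2, 2))
dat$y <- dat$x^2 + rnorm(n, sd = 0.5)   # simulated data, true relation is quadratic

# Hypothetical helper: k-fold cross-validated RMSE for a model formula.
cv_rmse <- function(formula, data, k = 10, seed = 1) {
  set.seed(seed)                                        # same seed -> same folds
  fold <- sample(rep(seq_len(k), length.out = nrow(data)))
  vapply(seq_len(k), function(i) {
    fit <- lm(formula, data = data[fold != i, ])
    held_out <- data[fold == i, ]
    sqrt(mean((held_out$y - predict(fit, held_out))^2))
  }, numeric(1))
}

rmse_linear    <- cv_rmse(y ~ x, dat)
rmse_quadratic <- cv_rmse(y ~ x + I(x^2), dat)

# Because the folds match, the differences are taken fold by fold.
paired_diff <- rmse_linear - rmse_quadratic
t.test(rmse_linear, rmse_quadratic, paired = TRUE)
```

Pairing removes the fold-to-fold variation that both models share, which is why a paired comparison is usually more sensitive than comparing two unrelated sets of resamples.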

Also, notice that the call to registerDoParallel() permits parallel execution of the training algorithms. (The taskbar showed all four cores of my laptop maxed out at 100% utilization.)

I chose this example because I wanted to show programmers coming to R for the first time that the power and productivity of R comes not only from the large number of machine learning models implemented, but also from the tremendous amount of infrastructure that package authors have built up, making it relatively easy to do fairly sophisticated tasks in just a few lines of code.

All of the code for this example, along with the rest of my code from the DataWeek R Bootcamp, is available on GitHub.

by Joseph Rickert

The days are getting shorter here in California and the summer R conferences UseR!2014 and JSM are behind us, but there are still some very fine conferences for R users to look forward to before the year ends.

DataWeek starts in San Francisco on September 15th. I will be conducting a bootcamp for new R users, and on Wednesday the 17th Skylar Lyon and David Smith will talk about R in production during an R Case Studies Session.

The same week, halfway around the world, the EARL (Effective Applications of the R Language) Conference starts in London. Ben Goldacre, author of the two best sellers *Bad Science* and *Bad Pharma*, will be the keynote speaker. The technical program will include sessions from such R luminaries as Hadley Wickham, Patrick Burns, Matt Dowle, Andrie de Vries, Romain Francois, Tal Galili and more.

On October 5th through 9th, Predictive Analytics World - Healthcare will be held in Boston. Max Kuhn will be conducting a hands-on workshop on R for predictive modeling in Business and Healthcare.

One week later, October 15th, Strata and Hadoop World will kick off in New York City. The RStudio team will start the conference with an R Day. Hadley Wickham, Winston Chang, Garrett Grolemund, JJ Allaire and Yihui Xie will all be giving presentations. On the 16th, Amanda Cox, the force behind the Times' superb R-based graphics, will be giving a keynote address. Other R-related sessions include a presentation by Sunil Venkayala and Indrajit Roy on HP's Distributed R platform. Revolution Analytics will, once again, be sponsoring this conference. If you go, please drop by the Revolution Analytics booth.

That same week, PAZUR'14, the Polish Academic R User's Meeting, will be taking place. Workshops will be held on data visualization, the analysis of surveys, the exploration of geospatial data and much more. Revolution Analytics is pleased to be a sponsor here too.

On October 25th, the ACM will once again hold its very popular Data Science Bootcamp at eBay in San Jose. And, once again Revolution Analytics is proud to be a sponsor. I will be attending this event, so please look me up if you want to chat about R.

There is sure to be some R content at the ICSM, International Conference on Statistics and Mathematics and the Workshop on Bayesian Modeling to be held on November 24th through 26th in Surabaya, Indonesia.

Also, sometime in November, rOpenSci will be "bringing together a mix of developers and academics to hack on/create tools in the open data space using and for R". A date hasn't been set yet, but we will let you know when things are finalized. Here is the link to the Spring hackathon that took place in San Francisco, which was pretty impressive.

If I have missed anything, please let us know!

by Don Boyd, Senior Fellow, Rockefeller Institute of Government

The Rockefeller Institute of Government is excited to be developing models to simulate the finances of public pension funds, using R.

Public pension funds invest contributions from governments and public sector workers in an effort to ensure that they can pay all promised benefits when due. State and local government pension funds in the United States currently have more than $3 trillion invested, more than $2 trillion of which is in equity-like investments. For example, New York City has over $158 billion invested. Governments usually act as a backstop: if pension fund investment returns do better than expected, governments will be able to contribute less, but if investment returns fall short they will have to contribute more. When that happens, politicians must raise taxes or cut spending programs. These risks often are not well understood or widely discussed. (For a discussion of many of the most significant issues, see *Strengthening the Security of Public Sector Defined Benefit Plans*.)

We are building stochastic simulation models in R to help quantify the investment risks and their potential consequences. We are modeling the finances of specific pension plans, taking into account all of the main flows such as current and expected benefit payouts to workers, contributions from governments and from workers, and investment returns, and how they affect liabilities and investible assets. The models will take into account the changing demographics of the workforce and retiree populations. We are modeling investment returns stochastically, examining different return scenarios and different economic environments, as well as different governmental contribution policies. We will use these models to evaluate the risks currently being taken and to help provide policy advice to governments, pension funds, and others. (For a full description of our approach, see *Modeling and Disclosing Public Pension Fund Risk, and Consequences for Pension Funding Security*.)
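To give a flavor of what a stochastic simulation of this kind looks like in R, here is a deliberately tiny sketch. It is not the Institute's model: the function `simulate_fund()` and every number in it (starting assets, flat contributions and benefits, normally distributed returns) are hypothetical placeholders chosen only to show the mechanics of simulating many return paths at once.

```r
# Toy Monte Carlo sketch of a pension fund's assets. All inputs are
# hypothetical; a real model would use plan-specific cash flows and demographics.
simulate_fund <- function(n_sims = 1000, years = 30, assets0 = 100,
                          contribution = 4, benefits = 7,
                          mean_return = 0.07, sd_return = 0.12) {
  # One row per simulated path, one column per year of investment returns.
  returns <- matrix(rnorm(n_sims * years, mean_return, sd_return),
                    nrow = n_sims)
  assets <- rep(assets0, n_sims)
  for (t in seq_len(years)) {
    assets <- (assets + contribution - benefits) * (1 + returns[, t])
    assets <- pmax(assets, 0)   # assets cannot fall below zero in this sketch
  }
  assets
}

set.seed(2014)
final_assets <- simulate_fund()

quantile(final_assets, c(0.05, 0.5, 0.95))  # spread of outcomes after 30 years
mean(final_assets < 100)                    # share of paths ending below the start
```

Working on whole vectors of simulated paths, rather than looping over individual simulations, is the kind of R idiom that keeps such models fast.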

We have chosen R because:

- It is extremely flexible, allowing us to do data collection, data management, exploratory data analysis, and other essential non-modeling tasks.
- Manipulating matrices is easy.
- It has sophisticated tools for modeling investment returns and for analyzing and presenting results of simulations. And it has great tools for visualizing results.
- The work can be completely open and reproducible, which is essential to the success of this project.

All programming languages have weaknesses. R’s great flexibility means that it is easy to write ill-organized programs that are hard to understand and debug. And poorly written programs that do not take advantage of R’s strengths can be extremely slow. We believe we can compensate for these weaknesses by making our programs modular, using a consistent programming style with appropriate documentation, and by using R features smartly and speed-testing where appropriate.

R analysts and programmers interested in learning about the opportunity to work on this project should examine the programmer/analyst position description and related materials at the Rockefeller Institute’s web site.

by Joseph Rickert

If I had to pick just one application to be the “killer app” for the digital computer I would probably choose Agent Based Modeling (ABM). Imagine creating a world populated with hundreds, or even thousands of agents, interacting with each other and with the environment according to their own simple rules. What kinds of patterns and behaviors would emerge if you just let the simulation run? Could you guess a set of rules that would mimic some part of the real world? This dream is probably much older than the digital computer, but according to Jan Thiele’s brief account of the history of ABMs that begins his recent paper, *R Marries NetLogo: Introduction to the RNetLogo Package* in the *Journal of Statistical Software,* academic work with ABMs didn’t really take off until the late 1990s.

Now, people are using ABMs for serious studies in economics, sociology, ecology, socio-psychology, anthropology, marketing and many other fields. No less of a complexity scientist than Doyne Farmer (of Dynamic Systems and Prediction Company fame) has argued in *Nature* for using ABMs to model the complexity of the US economy, and has published on using ABMs to drive investment models. In the following clip of a 2006 interview, Doyne talks about building ABMs to explain the role of subprime mortgages in the Housing Crisis. (Note that when asked about how one would calibrate such a model, Doyne explains the need to collect massive amounts of data on individuals.)

Fortunately, the tools for building ABMs seem to be keeping pace with the ambition of the modelers. There are now dozens of platforms for building ABMs, and it is somewhat surprising that NetLogo, a tool with some whimsical terminology (e.g. agents are called turtles) that was designed for teaching children, has apparently become a de facto standard. NetLogo is Java based, has an intuitive GUI, ships with dozens of useful sample models, is easy to program, and is available under the GPL 2 license.

As you might expect, R is a perfect complement for NetLogo. Doing serious simulation work requires a considerable amount of statistics for calibrating models, designing experiments, performing sensitivity analyses, reducing data, exploring the results of simulation runs and much more. The recent *JASSS* paper *Facilitating Parameter Estimation and Sensitivity Analysis of Agent-Based Models: a Cookbook Using NetLogo and R* by Thiele and his collaborators describes the R / NetLogo relationship in great detail and points to a decade’s worth of reading. But the real fun is that Thiele’s RNetLogo package lets you jump in and start analyzing NetLogo models in a matter of minutes.

Here is part of an extended example from Thiele's *JSS* paper that shows R interacting with the Fire model that ships with NetLogo. Using some very simple logic, Fire models the progress of a forest fire.

Snippet of NetLogo Code that drives the Fire model

    to go
      if not any? turtles  ;; either fires or embers
        [ stop ]
      ask fires
        [ ask neighbors4 with [pcolor = green]
            [ ignite ]
          set breed embers ]
      fade-embers
      tick
    end

    ;; creates the fire turtles
    to ignite  ;; patch procedure
      sprout-fires 1
        [ set color red ]
      set pcolor black
      set burned-trees burned-trees + 1
    end

The general idea is that turtles, representing the frontier of the fire, run through a grid of randomly placed trees. Not shown in the snippet above is the logic by which a single parameter, the density of the trees, controls the entire model.
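For readers without NetLogo installed, the same logic can be approximated in plain R. The sketch below is my own simplified stand-in, not the NetLogo model: the function name `simulate_fire()`, the grid size and the choice to ignite the whole left edge are assumptions made for illustration. Each cell is a tree with probability `density / 100`, and at every step the fire spreads from burning cells to their four neighbors that still hold a tree.

```r
# Simplified base R stand-in for the Fire model (an illustrative sketch only).
simulate_fire <- function(density, size = 101) {
  tree <- matrix(runif(size * size) * 100 < density, nrow = size)
  burning <- matrix(FALSE, size, size)
  burning[, 1] <- TRUE      # assumption: the fire starts along the left edge
  tree[, 1] <- FALSE        # the ignition column is not counted as forest
  initial_trees <- sum(tree)
  burned <- 0
  while (any(burning)) {
    # Cells with a burning neighbor above, below, left or right (no wrapping).
    nb <- matrix(FALSE, size, size)
    nb[-1, ]    <- nb[-1, ]    | burning[-size, ]
    nb[-size, ] <- nb[-size, ] | burning[-1, ]
    nb[, -1]    <- nb[, -1]    | burning[, -size]
    nb[, -size] <- nb[, -size] | burning[, -1]
    ignite <- nb & tree          # green neighbors catch fire
    tree[ignite] <- FALSE
    burned <- burned + sum(ignite)
    burning <- ignite            # the old frontier burns out
  }
  100 * burned / initial_trees   # percent of the forest burned
}

set.seed(1)
sparse <- simulate_fire(30, size = 51)   # low density: the fire dies out early
dense  <- simulate_fire(80, size = 51)   # high density: most of the forest burns
c(sparse = sparse, dense = dense)
```

Running `simulate_fire()` over a range of densities gives a quick feel for how sharply the outcome changes with that single parameter.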

This next bit of R code shows how to launch the Fire model from R, set the density parameter, and run the model.

    # Launch RNetLogo and control an initial run of the
    # NetLogo Fire Model
    library(RNetLogo)
    nlDir <- "C:/Program Files (x86)/NetLogo 5.0.5"
    setwd(nlDir)

    nl.path <- getwd()
    NLStart(nl.path)

    model.path <- file.path("models", "Sample Models", "Earth Science", "Fire.nlogo")
    NLLoadModel(file.path(nl.path, model.path))

    NLCommand("set density 70")  # set density value
    NLCommand("setup")           # call the setup routine
    NLCommand("go")              # launch the model from R

Here we see the Fire model running in the NetLogo GUI after it was launched from RStudio.

This next bit of code tracks the progression of the fire as a function of time (model "ticks"), returns results to R and plots them. The plot shows the non-linear behavior of the system.

    # Investigate percentage of forest burned as simulation proceeds and plot
    library(ggplot2)
    NLCommand("set density 60")
    NLCommand("setup")
    burned <- NLDoReportWhile("any? turtles", "go",
                              c("ticks", "(burned-trees / initial-trees) * 100"),
                              as.data.frame = TRUE,
                              df.col.names = c("tick", "percent.burned"))

    # Plot with ggplot2
    p <- ggplot(burned, aes(x = tick, y = percent.burned))
    p + geom_line() + ggtitle("Non-linear forest fire progression with density = 60")

As with many dynamical systems, the Fire model displays a phase transition. Setting the density lower than 55 will not result in the complete destruction of the forest, while setting density above 75 will very likely result in complete destruction. The following plot shows this behavior.

RNetLogo makes it very easy to programmatically run multiple simulations and capture the results for analysis in R. The following two lines of code run the Fire model twenty times for each value of density between 55 and 65, the region surrounding the phase transition.

    # rep.sim() is a user-defined helper, not an RNetLogo function; see the full code
    d <- seq(55, 65, 1)     # vector of densities to examine
    res <- rep.sim(d, 20)   # run the simulation

The plot below shows the variability of the percent of trees burned as a function of density in the transition region.

My code to generate plots is available in the file: Download NelLogo_blog while all of the code from Thiele's JSS paper is available from the journal website.

Finally, here are a few more interesting links related to ABMs.

- On validating ABMs
- ABMs and

InsideBigData has published a new *Guide to Machine Learning*, in collaboration with Revolution Analytics. As the name suggests, the Guide provides an overview of machine learning techniques, with a focus on implementation with the R language and (for big-data applications) Revolution R Enterprise. You can download the Guide here (email registration required), or for a quick overview of the contents check out the series of posts by Daniel Gutierrez on the topics covered in the Guide:

- About the insideBigData Guide to Machine Learning
- Introduction to Machine Learning
- R – the Data Scientist’s Choice and Data Access
- Data Munging, Exploratory Data Analysis, and Feature Engineering
- Supervised Machine Learning
- Unsupervised Machine Learning
- Production Deployment with R
- Production Deployment Environments for R

insideBigData: Guide to Machine Learning

by Joseph Rickert

UseR! 2014 got under way this past Monday with a very impressive array of tutorials delivered on the day that the conference organizers were struggling to cope with a record-breaking crowd. My guess is that conference attendance is somewhere in the 700 range. Moreover, this is the first year that I can remember that tutorials were free. The combination made for jam-packed tutorials.

The first thing that jumps out just by looking at the tutorial schedule is the effort that RStudio made to teach at the conference. Of the sixteen tutorials given, four were presented by the RStudio team. Winston Chang conducted an introduction to interactive graphics with ggvis; Yihui Xie presented Dynamic Documents with R and knitr; Garrett Grolemund presented Interactive data display with Shiny and R; and Hadley Wickham taught data manipulation with dplyr. Shiny is still captivating R users. I was particularly struck by a conversation I had with an undergraduate Stats major who seemed to be genuinely pleased and excited about being able to build her own Shiny apps. Kudos to the RStudio team.

Bob Muenchen brought this same kind of energy to his introduction to managing data with R. Bob has extensive experience with SAS and Stata and he seems to have a gift for anticipating areas where someone proficient in one of these packages, but new to R, might have difficulty.

Matt Dowle presented his tutorial on Data Table and, from the chatter I heard, increased his growing list of data table converts. Dirk Eddelbuettel presented An Example-Driven, Hands-on Introduction to Rcpp and Romain Francois taught C++ and Rcpp11 for beginners. Both Dirk and Romain work hard to make interfacing to C++ a real option for people who themselves are willing to make an effort.

I came late to Martin Morgan’s tutorial on Bioconductor but couldn’t get in the room. Fortunately, Martin prepared an extensive set of materials for his course which I hope to be able to work through.

Max Kuhn taught an introduction to applied predictive modeling based on the recent book he wrote with Kjell Johnson. Both the slides and code are available. Virgilio Gomez Rubio presented a tutorial on applied spatial data analysis with R. His materials are available here. Ramnath Vaidyanathan presented Interactive Documents with R. Have a look at Ramnath’s slides, the map embedded in his abstract and his screencast.

Drew Schmidt taught a workshop on Programming with Big Data in R based on the pbdR package. The course page points to a rich set of introductory resources.

I very much wanted to attend Søren Højsgaard’s tutorial on Graphical Models and Bayesian Networks with R, but couldn’t make it. I did attend John Nash's tutorial on Nonlinear parameter optimization and modeling in R and I am glad that I did. This is a new field for me and I was fortunate to see something of John’s meticulous approach to the subject.

I was disappointed that I didn’t get to attend Thomas Petzoldt’s tutorial on Simulating differential equation models in R. This is an area that is not usually associated with R. The webpage for the tutorial is really worth a look.

I don't know if the conference organizers planned it that way, but as it turned out, the tutorial subjects chosen are an excellent showcase for the depth and diversity of applications that can be approached through R. Many thanks to the tutorial teachers and congratulations to the UseR! 2014 conference organizers for a great start to the week. I hope to have more to say about the conference in future posts.

by Joseph Rickert

Predictive Modeling or “Predictive Analytics”, the term that appears to be gaining traction in the business world, is driving the new “Big Data” information economy. Predictably, there is no shortage of material to be found on this subject. Some discussion of predictive modeling is sure to be found in any reasonably technical presentation of business decision making, forecasting, data mining, machine learning, data science, statistical inference or just plain science. There are hundreds of books that have something worthwhile to say about predictive modeling. However, in my judgment, *Applied Predictive Modeling* by Max Kuhn and Kjell Johnson (Springer 2013) ought to be at the very top of the reading list of anyone who has some background in statistics, who is serious about building predictive models, and who appreciates rigorous analysis, careful thinking and good prose.

The authors begin their book by stating that “the practice of predictive modeling defines the process of developing a model in a way that we can understand and quantify the model’s prediction accuracy on future, yet-to-be-seen data”. They emphasize that predictive modeling is primarily concerned with making accurate predictions and not necessarily building models that are easily interpreted. Nevertheless, they are careful to point out that “the foundation of an effective predictive model is laid with intuition and deep knowledge of the problem context”. The book is a masterful exposition of the modeling process delivered at a high level of play, with the authors gently pushing the reader to understand the data, to carefully select models, to question and evaluate results, to quantify the accuracy of predictions and to characterize their limitations.

Kuhn and Johnson are intense but not oppressive. They come across like coaches who really, really want you to be able to do this stuff. They write simply and with great clarity. However, the material is not easy. I frequently found myself rereading a passage and almost always found it to be worth the effort. This mostly happened when reading a careful discussion of a familiar topic (i.e. something I thought I understood). For example, Chapter 14 on Classification Trees and Rule-Based models contains what I thought to be an illuminating discussion on the difference between building trees with grouped categories and taking the trouble to decompose a categorical predictor into binary dummy variables, in effect forcing binary splits for the categories.
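For readers who have not met the distinction before, this small base R sketch (my own illustration, not code from the book) shows what decomposing a categorical predictor into binary dummy variables looks like. The variable `color` and its values are made up.

```r
# A categorical predictor with three levels (made-up example data).
color <- factor(c("red", "green", "blue", "green", "red"))

# Grouped categories: a tree may split the levels into any two non-empty groups,
# e.g. {red, blue} versus {green}. For k levels there are 2^(k - 1) - 1 such splits.
n_grouped_splits <- 2^(nlevels(color) - 1) - 1

# Dummy variables: one 0/1 column per level, so every split on a single
# column separates one level from all the others.
dummies <- model.matrix(~ color - 1)
dummies
```

With dummy variables the tree sees three separate binary predictors; with grouped categories it sees one predictor and searches over groupings of its levels.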

*Applied Predictive Modeling* begins with a chapter that introduces the case studies that are referenced throughout the book. Thereafter, chapters are organized into four parts: General Strategies, Regression Models, Classification Models and Other Considerations, followed by three appendices, including a brief introduction to R (too brief to teach someone R, but adequate to give a programmer new to R enough of an orientation to make sense of the R scripts included in the book). This organization has the virtue of allowing the authors to focus on the specifics of the various models while providing a natural way to repeat and reinforce fundamental principles. For example, Regression Trees and Classification Trees share a great deal in common and many authors treat them together. However, by splitting them into separate sections Kuhn and Johnson can focus on the performance measures that are peculiar to each kind of model while getting a second chance to explain fundamental principles and techniques such as bagging and boosting that are applicable to both kinds of models.

There are many ways to go about reading *Applied Predictive Modeling*. I can easily envision someone committed to mastering the material reading the text from cover to cover. However, the chapters are pretty much self-contained, and the authors are very diligent about providing back references to topics they have covered previously. You can pretty much jump in anywhere and find your way around. Additionally, the authors take the trouble to include quite a bit of “forward referencing” which I found to be very helpful. As an example, in section 3.6, where the authors mention credit scoring with respect to a discussion on adding predictors to a model, they point ahead to section 4.5, which is a short discussion of the credit scoring case study. This section, in turn, points ahead to section 11.2 and a discussion of evaluating predicted classes. These forward references encourage and facilitate latching on to a topic and then threading through the book to track it down.

Three major strengths of the book are its fundamental grounding in the principles of statistical inference, the thoroughness with which the case studies are presented, and its use of the R language. The statistical viewpoint is apparent both from the choice of topics presented and the authors’ overall approach to predictive modeling. Topics that are peculiar to a statistical approach include the presentation of stratified sampling and other sampling techniques in the discussion of data splitting, and the sections on partial least squares and linear discriminant analysis. The real statistical value of the text, however, is embedded in Kuhn and Johnson’s methodology. They take great care to examine the consequences of modeling decisions and continually encourage the reader to challenge the results of particular models. The chapters on data preparation and model evaluation do an excellent job of informally presenting a formal methodology for making inferences. *Applied Predictive Modeling* contains very few equations and very little statistical jargon but it is infused with statistical thinking. (A side effect of the text is to teach statistics without being too obvious about it. You will know you are catching on if you think the xkcd cartoon in chapter 19 is really funny.)

A nice feature of the case studies is that they are rich enough to illustrate several aspects of the model building process and are used effectively throughout the text. The discussion in Chapter 12 on preparing the University of Melbourne grant funding data set from the Kaggle contest is particularly thorough. This kind of “blow by blow” discussion of why the authors make certain modeling decisions is invaluable.

The R language comes into play in several ways in the text. The most obvious is the section on computing that closes most chapters. These sections contain R code that illustrates the major themes presented in the chapter. To some extent, these brief R statements substitute for the equations that are missing from the text. They provide concrete visual representations of the key ideas accessible to anyone who makes the effort to learn a little R syntax. The chapter-ending code is itself backed up with an R package available on CRAN, AppliedPredictiveModeling, that contains scripts to reproduce all of the analyses and plots in the text. (This feature makes the text especially well-suited for self study.)

*Applied Predictive Modeling* is resplendent with R graphs and plots, many of them in color, that are integral to the presentation of ideas but which also serve to illustrate how easily presentation-level graphs can be created in R. Form definitely follows function here, and it makes for a rather pretty book. One of my favorite plots is the first part of Figure 11.3, reproduced below, which shows the test set probabilities for a logistic regression model of the German Credit data set.

The authors point out that the estimates of bad credit in the right panel are skewed showing that most estimates predict very low probabilities for bad credit when the credit is, in fact, good - just what you want to happen. In contrast, the estimates of bad credit are flat in the left panel, “reflecting the model’s inability to distinguish bad credit cases”.

Finally, *Applied Predictive Modeling* can be viewed as an introduction to the caret package. There is great depth here. This is not a book that comes with a little bit of illustrative code as icing on the cake, so to speak; rather, the included code is just the tip of the iceberg. It provides a gateway to the caret package and the full functionality of R’s machine learning capabilities.

*Applied Predictive Modeling* is a remarkable text. At 600 pages, it is the succinct distillation of years of experience of two expert modelers working in the pharmaceutical industry. I expect that beginners and experienced model builders alike will find something of value here. On my shelf, it sits up there right next to Hastie, Tibshirani and Friedman’s *The Elements of Statistical Learning*.

by Wayne Smith, Ph.D., California State University, Northridge

*Editor's note: This post was abstracted from the monthly newsletter of the Southern California Chapter of the ASA.*

On May 13th and 14th, the Intel International Science and Engineering Fair (Intel ISEF), the world’s largest international pre-college competition, was held at the Los Angeles Convention Center.

I was blessed with the opportunity to represent the American Statistical Association (ASA). As one of approximately 30 statisticians, I helped judge the statistics-related elements of numerous prescient and empirical projects presented by high school students from around the world. These students had already won other local and regional science and engineering competitions. We selected first, second, and third place winners, but 16 student teams in total received special recognition and goodie bags filled with software, books, and other items.

The photograph below shows the first place winner, Soham Daga, from New York, who used Google Trends to develop a model to predict the likelihood of mortgage delinquency. An interview with Soham can be found here.

I have no doubt that a lasting affinity with statistical professionals and supporting organizations will be a tangible outcome for these motivated, young researchers.

I was energized and transformed by the breadth and depth of the research methods and concomitant inferential analysis applied to address pressing issues in areas as diverse as health care, energy, sustainability, material science, pharmacology, biochemistry, financial economics, and many others. Along with my ASA colleagues, I discussed projects with students as young as 15. As one might expect, many of the high school seniors are attending top research universities in the fall. I was especially impressed with the rich diversity of students, including groups of students from Qatar, Egypt, Tunisia, Brazil, Japan, Russia, and historically underrepresented areas in the U.S. such as Fresno, CA. Some of the students' work has been ongoing for more than a year, and the students offered background literature (with references!), purposeful hypotheses, detailed analysis and results (occasionally with tool manifests and explanatory code), and integrated conclusions.

Of the 80 or so projects I reviewed, I observed applications of the general linear model; repeated measures; logistic regression; non-parametric measures; classification, feature extraction, and dimensionality reduction; sundry machine learning approaches; and Monte Carlo simulations. I was equally impressed by these students' abilities in fundamental research tasks such as locating and using open source software (e.g., R), understanding and coherently explaining potential I/O and computational bounds, finding and interpreting peer-reviewed literature, and seeking out the assistance of relevant industry professionals. Additionally, the students' ebullient entrepreneurial spirit in the design and execution of physical proof-of-concept prototypes and related statistical experiments was especially noteworthy. I came away from each project and each student/team discussion with a new understanding of a thorny issue, a vision for what the solution space and product and process possibilities might be, and, perhaps most germane for a college instructor, a renewed calibration for the knowledge, skills, and abilities of a tapestry of young people in the broad areas of mathematical, statistical, and computational sciences. I felt visceral pride in the statistical calling of many of these young finalists, and I know that they will craft much social, intellectual, and economic value for many decades to come.

A side benefit of service at this event was the opportunity to interact with academic and professional colleagues representing a variety of statistical-education interests. In particular, I'd like to thank Madeline Bauer (USC/Keck), Theresa Utlaut (Intel), Jo Hardin (Pomona College), and Olga Korosteleva (CSULB) for their guidance in the judging process. At this event one can interact with professionals from dozens of other professional societies and technology firms as well.

This Intel-sponsored event circulates annually among three U.S. cities. I strongly recommend that individuals with a general interest in statistics and data science volunteer at this event and at local SCASA and OCLBASA events in the future.

Many, many thanks to all the statisticians who participated as judges and/or behind the scenes! Thanks to the ASA for the cash prizes, and thanks to Chapman & Hall/CRC, JMP, Minitab, O’Reilly Media, Revolution Analytics, Sage, Stata, and Taylor & Francis for the donated books, magazines, software, and other items.