The Mango Solutions team have done it again: another excellent Effective Applications of the R Language (EARL) conference just wrapped up here in London. The conference was attended by almost 400 R users from companies all around the world, and was a really fun experience. I was honored to deliver a keynote presentation, alongside keynotes from Joe Cheng and Garrett Grolemund (RStudio), Lou Bajuk (Tibco), Gabor Csardi (R Consortium's R Hub project) and a thought-provoking presentation on machine learning and AI from the Economist's Kenneth Cukier. There was also a fantastic social program, including a dinner inside the 1000-year-old White Tower at the Tower of London. (Many thanks to Microsoft for sponsoring this memorable evening!)

As always though, the focus of the EARL conference is on applications of the R language, and this week's event showcased many real-world uses of R in major companies. Some application highlights include:

- The **Financial Times** uses R for quantitative journalism
- The **British Museum** uses R to predict attendance at special exhibitions
- **Beazley** uses R with SQL Server to analyze insurance data
- **Investec**'s IT department supports R usage for asset management with a standardized suite including RStudio, Revolution R Open and DeployR
- **Syngenta** uses R to improve global food security
- **Maersk** uses R to optimize capacity for its shipping lines
- **ATASS Sports** uses R to predict the probability of wins, draws and losses in the UK Premier League
- **BCA** is using the Azure ML cloud to deploy R models to production
- **Microsoft** uses R to heat buildings on campus more efficiently
- **eBay** uses R to quantify the effect and efficiency of TV, radio and print marketing

... and that's just a selection from the talks that I got to see! With three parallel tracks, there were many great talks that I missed, but you can check out the abstracts (with links to slides to come) at the EARL website.

The next EARL Conference will be held in Boston, November 7-9. The call for presentations is still open, so if you have an interesting story to tell about how R is used at your organization, why not come along?

EARL London 2016: Speakers

I'm excited to share that one of my data science heroes will be a presenter at the Microsoft Data Science Summit in Atlanta, September 26-27. Edward Tufte, the data visualization pioneer, will deliver a keynote address on the future of data analysis and how to draw more credible conclusions from data.

If you're not familiar with Tufte, a great place to start is his seminal book The Visual Display of Quantitative Information. First published in 1983 — well before the advent of mainstream data visualization software — this is the book that introduced and/or popularized many concepts familiar in data visualization today, such as small multiples, sparklines, and the data-ink ratio. Check out this 2011 Washington Monthly profile for more background on Tufte's career and influence. Tufte's work also influenced R: you can easily recreate many of Tufte's graphics in the R graphics system, including this famous weather chart.
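To give a flavor of that influence, here's a small base-graphics sketch in the spirit of Tufte's data-ink principles (the dataset and styling choices are mine, not a recreation of any particular Tufte chart): marginal rug marks stand in for heavy axes.

```r
# A minimal "dot-dash" scatterplot in the spirit of Tufte's data-ink ratio,
# drawn with base R graphics; mtcars is just a stand-in dataset
f <- tempfile(fileext = ".pdf")
pdf(f)
plot(mtcars$wt, mtcars$mpg, pch = 16, axes = FALSE,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
rug(mtcars$wt, side = 1)   # marginal ticks show each variable's distribution
rug(mtcars$mpg, side = 2)  # and double as minimal axes
dev.off()
```

Packages like ggthemes offer ready-made Tufte-style themes, but it is instructive how far plain `plot()` plus `rug()` will get you.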

The program for the Data Science Summit looks fantastic, and will also include keynote presentations from Microsoft CEO Satya Nadella and Data Group CVP Joseph Sirosh. Also there's a fantastic crop of Microsoft data scientists (plus yours truly) giving a wealth of practical presentations on how to use Microsoft tools and open-source software for data science. Here's just a sample:

- Jennifer Marsman will speak about building intelligent applications with the Cognitive Services APIs
- Danielle Dean will describe deploying real-world predictive maintenance solutions based on sensor data
- Brandon Rohrer will give a live presentation of his Data Science for Absolutely Everybody series
- Frank Seide will introduce CNTK, Microsoft's open source deep learning toolkit
- Maxim Likuyanov will share some best practices for interactive data analysis and scalable machine learning with Apache Spark
- Rafal Lukawiecki will explain how to apply data science in a business context
- Debraj GuhaThakurta and Max Kaznady will demonstrate statistical modeling on huge data sets with Microsoft R Server and Spark
- David Smith (that's me!) will give some examples of how data science at Microsoft (and R!) is being used to improve the lives of disabled people
- ... and many many more!

Check out the agenda for the breakout sessions on the Data Science Summit page for more. I hope to see you there: it will be a great opportunity to meet with Microsoft's data science team and see some great talks as well. To register, follow the link below.

Microsoft Data Science Summit, September 26-27, Atlanta GA: Register Now

Microsoft has a brand-new conference, exclusively for data scientists, big data engineers, and machine learning practitioners. The Microsoft Data Science Summit, to be held in Atlanta GA, September 26-27, will feature talks and lab sessions from Microsoft engineers and thought leaders on using data science techniques and Microsoft technology, applied to real-world problems.

Included in the agenda are several topics of direct interest to R users, including:

- The Data Science Virtual Machine, which includes R
- Deploying Predictive Maintenance solutions with Cortana Intelligence
- Using the Cognitive Services framework, which you can use to incorporate machine intelligence into R scripts

Other topics of interest include building with bot frameworks, deep learning, Internet of Things applications, and in-depth Data Science topics.

To register for the conference, follow the link below. Discounted day passes to Microsoft Ignite on September 28-29 are also available to Microsoft Data Science Summit registrants.

Microsoft Events: Microsoft Data Science Summit, September 26-27

by Joseph Rickert

Last week, I mentioned a few of the useR tutorials that I had the opportunity to attend. Here are the links to the slides and code for all but one of the tutorials:

Regression Modeling Strategies and the rms Package - Frank Harrell

Using Git and GitHub with R, RStudio, and R Markdown - Jennifer Bryan

Effective Shiny Programming - Joe Cheng

Missing Value Imputation with R - Julie Josse

Extracting data from the web APIs and beyond - Ram, Grolemund & Chamberlain

Ninja Moves with data.table - Learn by Doing in a Cookbook Style Workshop - Matt Dowle

Never Tell Me the Odds! Machine Learning with Class Imbalances - Max Kuhn

MoRe than woRds, Text and Context: Language Analytics in Finance with R - Das & Mokashi

Time-to-Event Modeling as the Foundation of Multi-Channel Revenue Attribution - Tess Calvez

Handling and Analyzing Spatial, Spatiotemporal and Movement Data - Edzer Pebesma

Machine Learning Algorithmic Deep Dive - Erin LeDell

Introduction to SparkR Part 1, Part 2 - Venkataraman & Falaki

Using R with Jupyter Notebooks for Reproducible Research - de Vries & Harris

Understanding and Creating Interactive Graphics Part 1, Part 2 - Hocking & Ekstrom

Genome-Wide Association Analysis and Post-Analytic Interrogation Part 1, Part 2 - Foulkes

An Introduction to Bayesian Inference using R Interfaces to Stan - Ben Goodrich

Small Area Estimation with R - Virgilio Gómez Rubio

Dynamic Documents with R Markdown - Yihui Xie

Granted, since the tutorials were not videotaped, they mostly fall into the category of a "you had to be there" experience. However, many of the presenters put significant effort into preparing their talks, and collectively the materials comprise a rich resource that is worth a good look. Here are just a couple of examples of what is to be found.

The first comes from Julie Josse's Missing Data tutorial where a version of the ozone data set with missing values is used to illustrate a basic principle of exploratory data analysis: visualize your data and look for missing values. If there are missing values try to determine if there are any patterns in their location.

```
         maxO3   T9  T12  T15 Ne9 Ne12 Ne15     Vx9    Vx12    Vx15 maxO3v
20010601    87 15.6 18.5   NA   4    4    8  0.6946 -1.7101 -0.6946     84
20010602    82   NA   NA   NA   5    5    7 -4.3301 -4.0000 -3.0000     87
20010603    92 15.3 17.6 19.5   2   NA   NA  2.9544      NA  0.5209     82
20010604   114 16.2 19.7   NA   1    1    0      NA  0.3473 -0.1736     92
20010605    94   NA 20.5 20.4  NA   NA   NA -0.5000 -2.9544 -4.3301    114
20010606    80 17.7 19.8 18.3   6   NA    7 -5.6382 -5.0000 -6.0000     94
```

The first two plots, made with the aggr() function in the VIM package, show the proportion of missing values for each variable and the relationships of missingness among all of the variables.

The next plot shows a scatter plot of two variables, with boxplots along the margins showing the distributions of the missing values for each variable. (Here blue represents data that are present and red the missing values.) The code to do this and many more advanced analyses is included on the tutorial page.

It looks like the missing values are spread fairly evenly throughout the data.
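The gist of those missingness summaries can also be computed directly in base R. Using the columns from the ozone excerpt above, here is a rough, VIM-free approximation of what aggr() displays:

```r
# A slice of the ozone data shown above, with its missing values
ozone <- data.frame(
  maxO3 = c(87, 82, 92, 114, 94, 80),
  T9    = c(15.6, NA, 15.3, 16.2, NA, 17.7),
  T12   = c(18.5, NA, 17.6, 19.7, 20.5, 19.8),
  T15   = c(NA, NA, 19.5, NA, 20.4, 18.3)
)

# Proportion of missing values per variable (aggr()'s left panel)
print(colMeans(is.na(ozone)))

# Rows sharing each missingness pattern (aggr()'s right panel)
print(table(apply(is.na(ozone), 1, paste, collapse = "/")))
```

If several rows share the same TRUE/FALSE pattern, that's a hint the values may not be missing completely at random.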

Frank Harrell's tutorial provides a modern look at regression analysis from a statistician's point of view. The following plot comes from the section of his tutorial on Modeling and Testing Complex Interactions. If you haven't paid much attention to the theory behind interpreting linear models in a while, you may find this interesting.

Finally, I had one of those "Aha" moments right at the beginning of Ben Goodrich's presentation on Bayesian modeling. MCMC methods work by simulating draws from a Markov chain whose limiting distribution is the distribution of interest. This technique works best when the simulated draws are able to explore the entire space of the target distribution. In the following figure, the target is the bivariate normal distribution on the far right. Neither the Metropolis nor Gibbs sampling algorithms come close to sampling from the entire target distribution space, but the Hamiltonian Monte Carlo "NUTS" algorithm in Stan displays very good coverage.
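The contrast is easy to feel even with a toy sampler. Here is a minimal random-walk Metropolis sketch of my own (not from the tutorial) targeting a standard bivariate normal; with a small proposal step it explores the space slowly, which is exactly the weakness the NUTS comparison highlights:

```r
# Random-walk Metropolis targeting a standard bivariate normal
set.seed(42)
log_target <- function(x) -0.5 * sum(x^2)   # log density, up to a constant
n_iter <- 5000
draws <- matrix(0, n_iter, 2)
x <- c(0, 0)
accepted <- 0
for (i in 1:n_iter) {
  proposal <- x + rnorm(2, sd = 0.5)        # small symmetric proposal
  if (log(runif(1)) < log_target(proposal) - log_target(x)) {
    x <- proposal
    accepted <- accepted + 1
  }
  draws[i, ] <- x
}
cat("acceptance rate:", accepted / n_iter, "\n")
cat("sample means:", round(colMeans(draws), 3), "\n")
```

Plotting the first few hundred draws shows the characteristic slow random-walk wandering; Hamiltonian methods make much larger, informed moves per iteration.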

For reasons I described last week I believe that this year's useR tutorial speakers have raised the bar on both content and presentation. I am going to do my best to work through these before attending next year's conference in Brussels.

Before there was R, there was S. R was modeled on a language developed at AT&T Bell Labs starting in 1976 by Rick Becker and John Chambers (and, later, Alan Wilks) along with Doug Dunn, Jean McRae, and Judy Schilling.

At last week's useR! conference, Rick Becker gave a fascinating keynote address, Forty Years of S. His talk recounts the history of S's genesis and its development on an Interdata 8/32 with just 3MB of disk. Rick's talk includes numerous tidbits that explain many characteristics of R, including the philosophy behind the graphics system and the origin of the arrow <- assignment operator. The story is also colored with anecdotes from various other luminaries at Bell Labs at the time, including John Tukey (the pioneer of exploratory data analysis and the coiner of the words "software" and "bit"), and Kernighan and Ritchie (who were upstairs designing Unix and the C language at the same time S was being developed).

Here's Rick's talk, with an introduction by Trevor Hastie. (Many thanks to Microsoft for recording and making the video available.)

For more on the history of S, see this interview with another of the creators of S, John Chambers.

by Joseph Rickert

Over the years I have seen several excellent tutorials at useR! conferences that were not only very satisfying "you had to be there" experiences but were also backed up with meticulously prepared materials of lasting value. This year, quite a few useR! 2016 tutorials measure up to this level of quality. My take on why things turned out this way is that GitHub, Markdown, and Jupyter notebooks have been universally adopted as workshop / tutorial creation tools, and that having the right tools encourages creativity and draws out one's best efforts.

Jenny Bryan's tutorial Using Git and GitHub with R, RStudio, and R Markdown and the tutorial by Andrie de Vries and Micheleen Harris, Using R with Jupyter notebooks for reproducible research, are two superb, Escheresque self-referencing examples of what I am talking about. Bryan's tutorial, which uses GitHub and R Markdown to teach GitHub and R Markdown, is an impressive introduction to these two essential resources. And the tutorial by de Vries and Harris makes very effective use of GitHub and Jupyter Notebooks. Moreover, this tutorial sets the gold standard for how to set up a system for interactive user participation. Harris and de Vries staged their tutorial on Microsoft's Azure Data Science VM. The Linux version of this VM comes provisioned with JupyterHub, a set of processes that enables a multi-user Jupyter Notebook server. Once the VM is loaded with the training materials, it's only a matter of giving students a username and password to grant them immediate access to the interactive workshop materials. Have a look at notebook 06 to see how to set all of this up.

After seeing this, and comparing it to other tutorials where instructors wasted the better part of an hour trying to get students up and running with local copies of their course materials, I can't see why everyone wouldn't opt for a cloud solution to this problem. When word gets out, the Data Science VM is going to be the standard for delivering technical workshops.

Unfortunately, I couldn't get around to see all of the tutorials, but two more that I can heartily recommend are MoRe than woRds, Text and Context: Language Analytics in Finance with R, the introduction to text mining by Sanjiv Das and Karthik Mokashi, and Machine Learning Algorithmic Deep Dive by Erin LeDell. Sanjiv Das is an inspired educator, and I have never seen a presentation at R/Finance, useR! or even BARUG, the Bay Area useR Group, where he wasn't on his game and super prepared. The tutorial he and Karthik gave this year at useR! 2016 is a self-contained course in text mining.

Erin LeDell also came prepared with more tutorial material than she could ever present in three hours. But because of her thoughtful use of GitHub, Markdown and notebooks we have a machine learning resource that is well worth studying. Just being introduced to this incredible visualization of decision trees by Tony Chu and Stephanie Yee made my day.

LeDell is also a gifted teacher who anticipates where her audience may have difficulties. Her historical approach to understanding gradient boosting machines provides an opportunity to clarify the differences between various versions of the boosting algorithms. Sometimes understanding how something came to be is halfway towards understanding how it works.

The bar for presenting lectures, tutorials and workshops has been set pretty high. Anyone who is serious about delivering a high quality education probably needs to develop some skills with GitHub, Markdown and Notebooks. Studying the tutorial materials from useR! 2016 is a good place to start.

The useR! 2016 conference, the annual gathering of R users from around the world, is already underway at Stanford University. Today is a day of interactive tutorials, and the presentation program begins tomorrow.

But don't worry if you weren't able to make it to California, or if you missed out on getting a ticket for the conference. (You're not alone: registrations sold out within 2 weeks.) As a service to R users everywhere, Microsoft has sent down a camera crew to the conference, and will be livestreaming many of the sessions on Channel 9, the Microsoft video portal. Visit the R User Conference 2016 streaming page to explore the 200+ sessions that will be recorded, beginning with Rick Becker's keynote Forty Years of S at 9AM Pacific time on June 28.

Even if you can't catch the sessions live, all of the presentations will be preserved on Channel 9 to watch at any time. You'll also be able to download and watch any of the presentations in any medium you choose, as all the presentations will be made freely available under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported license.

Several of the data science team from Microsoft, myself included, are at useR! 2016 this week, so if you are at the conference please come and say hi, and visit the Microsoft booth to pick up some R swag. You might also be interested to check out the talks from the Microsoft team on the livestream. (Click on the dates to add a link to your calendar.)

- Changing lives with Data Science at Microsoft (David Smith)
- R at Microsoft (David Smith)
- Exploring the R / SQL boundary (Gopi Kumar)
- Forecasting Revenue for S&P 500 Companies Using the baselineforecast Package (Konstantin Golyaev)
- A Lap Around R Tools for Visual Studio (John Lam)
- When will this machine fail? (Xinwei Xue)

Whether you're attending the conference in person or watching the sessions via livestream, I hope you enjoy useR! 2016 as much as I'm looking forward to it!

Channel 9: Livestreaming: https://aka.ms/user2016conference; Recordings: useR! 2016 international R User conference

by Joseph Rickert

Just about two and a half years ago I wrote about some resources for doing Bayesian statistics in R. Motivated by the tutorial Modern Bayesian Tools for Time Series Analysis by Harte and Weylandt that I attended at R/Finance last month, and the upcoming tutorial An Introduction to Bayesian Inference using R Interfaces to Stan that Ben Goodrich is going to give at useR!, I thought I'd look into what's new. Well, Stan is what's new! Yes, Stan has been under development and available for some time. But somehow, while I wasn't paying close attention, two things happened: (1) the rstan package evolved to make the mechanics of doing Bayesian analysis in R really easy and (2) the Stan team produced and/or organized an amazing amount of documentation.

My impressions of doing Bayesian analysis in R were set in the WinBUGS era. The separate WinBUGS installation was always tricky, and then moving between the BRugs and R2WinBUGS packages presented some additional challenges. My recent Stan experience was nothing like this. I had everything up and running in just a few minutes. The directions for getting started with rstan are clear and explicit about making sure that you have the right tool chain in place for your platform. Since I am running R 3.3.0 on Windows 10 I installed Rtools34. This went quickly and as expected, except that C:\Rtools\gcc-4.x-y\bin did not show up in my path variable. Not a big deal: I used the menus in the Windows System Properties box to edit the Path statement by hand. After this, rstan installed like any other R package and I was able to run the 8schools example from the package vignette. The following 10-minute video by Ehsan Karim takes you through the install process and the vignette example.
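If you want to confirm from within R that the toolchain is visible before installing rstan, a quick check is something like the following (a Windows-centric sketch; the results depend entirely on your machine):

```r
# Check that the C++ toolchain rstan needs is visible to R
# (on Windows, g++ should resolve to a path under the Rtools gcc bin directory)
compiler <- Sys.which("g++")
rtools_on_path <- grepl("Rtools", Sys.getenv("PATH"), ignore.case = TRUE)
cat("g++:", if (nzchar(compiler)) compiler else "<not found>", "\n")
cat("Rtools on PATH:", rtools_on_path, "\n")
```

If g++ doesn't resolve, editing the Path as described above and restarting R usually fixes it.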

The Stan documentation includes four major components: (1) the Stan Language Manual, (2) examples of fully worked out problems, (3) contributed case studies and (4) both slides and video tutorials. This is an incredibly rich cache of resources that makes a very credible case for the ambitious project of teaching people with some R experience both Bayesian statistics and Stan at the same time. The "trick" here is that the documentation operates at multiple levels of sophistication, with entry points for students with different backgrounds. For example, a person with some R and the modest statistics background required for approaching Gelman and Hill's extraordinary text, Data Analysis Using Regression and Multilevel/Hierarchical Models, can immediately begin running rstan code for the book's examples. To run the rstan version of the example in section 5.1, Logistic Regression with One Predictor, with no changes, a student needs only to copy the R scripts and data into her local environment. In this case, she would need the R script 5._LogisticRegressionWithOnePredictor.R, the data nes1992_vote.data.R and the Stan code nes_logit.stan. The Stan code for this simple model is about as straightforward as it gets: variable declarations, parameter identification and the model itself.

```stan
data {
  int<lower=0> N;
  vector[N] income;
  int<lower=0,upper=1> vote[N];
}
parameters {
  vector[2] beta;
}
model {
  vote ~ bernoulli_logit(beta[1] + beta[2] * income);
}
```

Running the script will produce the iconic logistic regression plot:
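For intuition about what the fit produces, the same model can be estimated by maximum likelihood with base R's glm(); the income values and coefficients below are simulated stand-ins for the NES data, not the actual data set:

```r
# Simulate data shaped like the Stan program's inputs (made-up values)
set.seed(123)
N <- 500
income <- sample(1:5, N, replace = TRUE)   # 5-point income scale
p <- plogis(-1.4 + 0.33 * income)          # hypothetical "true" curve
vote <- rbinom(N, 1, p)

# Maximum-likelihood fit; a Bayesian fit with flat priors should be close
fit <- glm(vote ~ income, family = binomial())
print(coef(fit))
```

Plotting plogis(coef(fit)[1] + coef(fit)[2] * x) over the jittered data points reproduces the familiar S-shaped curve of the logistic regression plot.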

I'll wind down by curbing my enthusiasm just a little by pointing out that Stan is not the only game in town. JAGS is a popular alternative, and there is plenty that can be done with unaugmented R code alone as the Bayesian Inference Task View makes abundantly clear.

If you are a book person and new to Bayesian statistics, I highly recommend Bayesian Essentials with R by Jean-Michel Marin and Christian Robert. The authors provide a compact introduction to Bayesian statistics that is backed up with numerous R examples. Also, the new book by Richard McElreath, Statistical Rethinking: A Bayesian Course with Examples in R and Stan looks like it is going to be an outstanding read. The online supplements to the book are certainly worth a look.

Finally, if you are a Bayesian or a thinking about becoming one and you are going to useR!, be sure to catch the following talks:

- Bayesian analysis of generalized linear mixed models with JAGS, by Martyn Plummer
- bamdit: An R Package for Bayesian meta-Analysis of diagnostic test data by Pablo Emilio Verde
- Fitting complex Bayesian models with R-INLA and MCMC by Virgilio Gómez-Rubio
- bayesboot: An R package for easy Bayesian bootstrapping by Rasmus Arnling Bååth
- An Introduction to Bayesian Inference using R Interfaces to Stan by Ben Goodrich
- DiLeMMa - Distributed Learning with Markov Chain Monte Carlo Algorithms Using the ROAR Package by Ali Zaidi

by Joseph Rickert

It is always a delight to discover a new and useful R package, and it is especially nice when the discovery comes with context and a testimonial to its effectiveness. It is also satisfying to be able to check in once in a while and get an idea of what people think is hot, current or trending in the R world. The schedule for the upcoming useR! conference at Stanford is a touchstone for both of these purposes. It lists 128 contributed talks over 3 days. Vetted for both content and quality by the program committee, these talks represent a snapshot of topics and research areas that are of current interest to R developers and researchers, and also catalog the packages and tools that support the research. As best as I can determine, the abstracts for the talks explicitly call out 154 unique packages.

Some of the talks introduce new R packages that have been very recently released or are still under development on GitHub. Others mention the packages that represent the current suite of tools for a particular research area. Obviously, the abstracts do not list all of the packages employed by the researchers in their work, but presumably they do mention the packages that the speakers think are most important in describing their work and attracting an audience.

The following table maps packages to talks. There are multiple lines for packages when they map to more than one talk, and vice versa. For me, browsing a list like this is akin to perusing a colleague's bookshelves, noting old favorites and discovering the occasional new gem.

I am reluctant to draw any conclusions about the importance of the packages to the talks from the abstracts alone, before the talks are given. However, it is interesting to note that all of the packages mentioned more than twice are from the RStudio suite. Moreover, shiny and rmarkdown are apparently still new and shiny: mentioning them in an abstract still conveys some cachet. I suspect that ggplot2 will figure in more analyses than both of these packages put together, but ggplot2 has become so much a part of the fabric of R that there is no premium in promoting it.

If you are looking forward to useR! 2016, whether you plan to attend in person or to watch the videos afterwards, gaining some familiarity with the R packages highlighted in the contributed talks is sure to enhance your experience.

by Joseph Rickert

R / Finance 2016 lived up to expectations and provided the quality networking and learning experience that longtime participants have come to value. Eight years is a long time for a conference to keep its sparkle and pizzazz. But, the conference organizers and the UIC have managed to create a vibe that keeps people coming back. The fact that invited keynote speakers (e.g. Bernhard Pfaff 2012, Sanjiv Das 2013, and Robert McDonald 2014) regularly submit papers in subsequent years is a testimony to the quality and networking importance of the event. My guess is that the single track format, quality presentations, intense compact schedule and pleasant venues comprise a winning formula.

Since I have recently written about the content of this year's conference in preparation for the event, and since most of the presentations are already online for you to examine directly, I'll just present a few personal highlights here.

My favorite single visual from the conference is Bryan Lewis' depiction of corporate "Big Data" architectures as a manifestation of the impulse for completeness, control and dominance that once drove Soviet style central planning. (If you don't read Russian, run google translate on the text in the first panel.)

In his presentation, R in practice, at scale, Bryan presents a lightweight, R-centric architecture built around Redis that is adequate for many "big data" tasks.

Matt Dziubinski's talk on Getting the Most out of Rcpp, High-Performance C++ in Practice, is probably not a talk I would have elected to attend in a multi-track conference, and I would have missed seeing a virtuoso performance. Matt got through over 120 of his 153 prepared slides in a single, lucid stream of clear (but loud) explanations in only 20 minutes. Never stopping to pause, he gave a mini-course in computer science performance evaluation (both hardware and software aspects) that addressed the Why, What and How of it all.

Ryan Hafen's presentation, Interactively Exploring Financial Trades in R, showed how to use a tool chain built around Tessera and the NxCore R package to perform exploratory data analysis on a large NxCore data set containing approximately 1.25 billion records of 47 variables without leaving the R environment. The following slide provides an example of the kinds of insights that are possible.

In his presentation, Quantitative Analysis of Dual Moving Average Indicators in Automated Trading, Douglas Service showed how to use stochastic differential equations and the Itô calculus to derive a closed form solution for expected Log returns under the Luxor trading strategy and a baseline set of simplifying assumptions. If you like seeing the Math you will be pleased to see that Doug provides all of the details.

Michael Kane (glmnetlib: A Low-Level Library for Regularized Regression) discussed the motivations for continuing to improve linear model implementations and showed the progress he is making on re-implementing glmnet which, although very efficient, does not support arbitrary link/family combinations or out-of-memory calculations, and is written in the obscure Mortran dialect of Fortran. Kane's goal with his new package (renamed pirls: Penalized, Iteratively Reweighted Least Squares Regression) is to rectify these deficiencies while producing something fast enough to use.
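The iteratively reweighted least squares idea at the heart of pirls can be sketched in a few lines of base R. This unpenalized logistic-regression version (the function name and data are mine, not from Kane's package) should agree with glm() to high precision:

```r
# Minimal IRLS for logistic regression (no penalty term)
irls_logit <- function(X, y, tol = 1e-8, max_iter = 25) {
  beta <- rep(0, ncol(X))
  for (i in 1:max_iter) {
    eta <- drop(X %*% beta)
    mu  <- plogis(eta)
    w   <- mu * (1 - mu)                  # working weights
    z   <- eta + (y - mu) / w             # working response
    beta_new <- solve(crossprod(X, w * X), crossprod(X, w * z))
    if (max(abs(beta_new - beta)) < tol) { beta <- beta_new; break }
    beta <- beta_new
  }
  drop(beta)
}

# Check against glm() on simulated data (intercept column included in X)
set.seed(1)
X <- cbind(1, rnorm(200))
y <- rbinom(200, 1, plogis(0.5 - X[, 2]))
print(irls_logit(X, y))
print(coef(glm(y ~ X[, 2], family = binomial())))
```

A penalized version replaces the weighted least squares solve at each step with a ridge- or lasso-adjusted one, which is where the "P" in pirls comes in.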

In his presentation Community Finance Teaching Resources with R/Shiny, Matt Brigida showed off some open-source resources for teaching quantitative finance that are based on the new paradigm of GitHub as the place for tech-savvy people to hang out and Shiny as the teaching / presentation tool for making calculations come alive. Check out some of Matt's 5-minute mini lessons. Here is an example from his What is Risk module:

There is much more than I have presented here on the R / Finance conference site. If you are interested in deep Finance and not just the tools I have highlighted, be sure to check out the presentations by Sanjiv Das, Bernhard Pfaff, Matthew Dixon and others. There is plenty of useful R code to be mined in these presentations too.

I would be remiss not to mention Patrick Burns' keynote presentation, which was highly entertaining, novel and thought-provoking on many levels: everything a keynote should be. Pat launched his talk by referring to the Sapir-Whorf hypothesis, which posits that language controls how we think, and assigned a similar role to model building. He went on to describe his agent-inspired R simulation model and showed how he calibrated it to provide a useful tool for investigating ideas such as risk parity, variance targeting and strategies for taxing market pollution. The code for Pat's model is available here, but since his slides are not yet up on the conference site, and I was apparently too mesmerized to take useful notes, we will have to wait for Pat to post more on his website. (Pat's slides should be available soon.)

Finally, I would like to note that Doug Service and Sanjiv Das won the best paper prizes. This is the second year in a row for Sanjiv to win an R / Finance award. Congratulations to both Doug and Sanjiv!