by Joseph Rickert

Last week, I mentioned a few of the useR tutorials that I had the opportunity to attend. Here are the links to the slides and code for all but one of the tutorials:

Regression Modeling Strategies and the rms Package - Frank Harrell

Using Git and GitHub with R, RStudio, and R Markdown - Jennifer Bryan

Effective Shiny Programming - Joe Cheng

Missing Value Imputation with R - Julie Josse

Extracting data from the web APIs and beyond - Ram, Grolemund & Chamberlain

Ninja Moves with data.table - Learn by Doing in a Cookbook Style Workshop - Matt Dowle

Never Tell Me the Odds! Machine Learning with Class Imbalances - Max Kuhn

MoRe than woRds, Text and Context: Language Analytics in Finance with R - Das & Mokashi

Time-to-Event Modeling as the Foundation of Multi-Channel Revenue Attribution - Tess Calvez

Handling and Analyzing Spatial, Spatiotemporal and Movement Data - Edzer Pebesma

Machine Learning Algorithmic Deep Dive - Erin LeDell

Introduction to SparkR Part 1, Part 2 - Venkataraman & Falaki

Using R with Jupyter Notebooks for Reproducible Research - de Vries & Harris

Understanding and Creating Interactive Graphics Part 1, Part 2 - Hocking & Ekstrom

Genome-Wide Association Analysis and Post-Analytic Interrogation Part 1, Part 2 - Foulkes

An Introduction to Bayesian Inference using R Interfaces to Stan - Ben Goodrich

Small Area Estimation with R - Virgilio Gómez Rubio

Dynamic Documents with R Markdown - Yihui Xie

Granted, since the tutorials were not videotaped, they mostly fall into the category of a "you had to be there" experience. However, many of the presenters put significant effort into preparing their talks, and collectively the materials comprise a rich resource that is worth a good look. Here are just a couple of examples of what is to be found.

The first comes from Julie Josse's Missing Data tutorial, where a version of the ozone data set with missing values is used to illustrate a basic principle of exploratory data analysis: visualize your data and look for missing values. If there are missing values, try to determine whether there are any patterns in their location.

         maxO3   T9  T12  T15 Ne9 Ne12 Ne15     Vx9    Vx12    Vx15 maxO3v
20010601    87 15.6 18.5   NA   4    4    8  0.6946 -1.7101 -0.6946     84
20010602    82   NA   NA   NA   5    5    7 -4.3301 -4.0000 -3.0000     87
20010603    92 15.3 17.6 19.5   2   NA   NA  2.9544      NA  0.5209     82
20010604   114 16.2 19.7   NA   1    1    0      NA  0.3473 -0.1736     92
20010605    94   NA 20.5 20.4  NA   NA   NA -0.5000 -2.9544 -4.3301    114
20010606    80 17.7 19.8 18.3   6   NA    7 -5.6382 -5.0000 -6.0000     94

The first two plots, made with the aggr() function in the VIM package, show the proportion of missing values for each variable and the relationship of missing values among all of the variables.
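To make the two quantities behind those plots concrete, here is a minimal base R sketch that computes them for the six rows printed above. It is not the tutorial's code: the object names (ozone_head, prop_missing, pattern_counts) are mine, and six rows are far too few to say anything about the full data set. The optional aggr() call at the end assumes the VIM package is installed and only runs in an interactive session.

```r
# The six rows shown above, entered by column; row names are the dates.
ozone_head <- data.frame(
  maxO3  = c(87, 82, 92, 114, 94, 80),
  T9     = c(15.6, NA, 15.3, 16.2, NA, 17.7),
  T12    = c(18.5, NA, 17.6, 19.7, 20.5, 19.8),
  T15    = c(NA, NA, 19.5, NA, 20.4, 18.3),
  Ne9    = c(4, 5, 2, 1, NA, 6),
  Ne12   = c(4, 5, NA, 1, NA, NA),
  Ne15   = c(8, 7, NA, 0, NA, 7),
  Vx9    = c(0.6946, -4.3301, 2.9544, NA, -0.5000, -5.6382),
  Vx12   = c(-1.7101, -4.0000, NA, 0.3473, -2.9544, -5.0000),
  Vx15   = c(-0.6946, -3.0000, 0.5209, -0.1736, -4.3301, -6.0000),
  maxO3v = c(84, 87, 82, 92, 114, 94),
  row.names = c("20010601", "20010602", "20010603",
                "20010604", "20010605", "20010606")
)

# Proportion of missing values in each variable.
prop_missing <- colMeans(is.na(ozone_head))
print(round(prop_missing, 2))

# Missingness pattern of each row (1 = missing, 0 = observed),
# then how often each distinct pattern occurs.
pattern <- apply(is.na(ozone_head), 1,
                 function(r) paste(as.integer(r), collapse = ""))
pattern_counts <- table(pattern)
print(pattern_counts)

# Optional: draw the same summaries with VIM, if it is available.
if (interactive() && requireNamespace("VIM", quietly = TRUE)) {
  VIM::aggr(ozone_head, numbers = TRUE, sortVars = TRUE)
}
```

The first summary corresponds to the per-variable proportions and the second to the combinations of variables that are missing together.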

The next plot shows a scatter plot of two variables, along with boxplots in the margins that show the distributions of missing values for each variable. (Here blue represents data that are present and red the missing values.) The code to do this and many more advanced analyses is included on the tutorial page.
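The comparison that the marginal boxplots encode can be sketched in base R by splitting one variable according to whether the other is missing. The choice of T9 and T15 below is an assumption for illustration (I am not claiming these are the two variables in the plot), the values are copied from the six rows printed earlier, and the marginplot() call assumes the VIM package is installed.

```r
# Two columns copied from the six rows shown earlier (illustrative choice).
two_vars <- data.frame(
  T9  = c(15.6, NA, 15.3, 16.2, NA, 17.7),
  T15 = c(NA, NA, 19.5, NA, 20.4, 18.3)
)

# Split T15 by whether T9 is missing: the "TRUE" group is what a red
# marginal boxplot summarizes, the "FALSE" group the blue one.
t15_by_t9_missing <- split(two_vars$T15, is.na(two_vars$T9))

# Compare the two groups by their medians, ignoring T15's own NAs.
t15_medians <- sapply(t15_by_t9_missing, median, na.rm = TRUE)
print(t15_medians)

# Optional: the scatter plot with marginal boxplots from VIM.
if (interactive() && requireNamespace("VIM", quietly = TRUE)) {
  VIM::marginplot(two_vars)
}
```

If the two groups look very different on the full data, that suggests the missingness in one variable is related to the values of the other.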

It looks like the missing values are spread fairly evenly throughout the data.

Frank Harrell's tutorial provides a modern look at regression analysis from a statistician's point of view. The following plot comes from the section of his tutorial on Modeling and Testing Complex Interactions. If you haven't paid much attention to the theory behind interpreting linear models in a while, you may find this interesting.

Finally, I had one of those "Aha" moments right at the beginning of Ben Goodrich's presentation on Bayesian modeling. MCMC methods work by simulating draws from a Markov chain whose limiting distribution is the distribution of interest. This technique works best when the simulated draws are able to explore the entire space of the target distribution. In the following figure, the target is the bivariate normal distribution on the far right. Neither the Metropolis nor the Gibbs sampling algorithm comes close to sampling from the entire target distribution space, but the Hamiltonian Monte Carlo "NUTS" algorithm used by Stan displays very good coverage.
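For readers who want to see the mechanism rather than the figure, here is a minimal base R sketch of a random-walk Metropolis sampler, the simplest of the three algorithms named above, targeting a correlated bivariate normal. It is not Goodrich's code and does not reproduce his figure: the correlation (rho = 0.9), the proposal step size, the starting point and the number of draws are all assumptions of mine, and the function names are hypothetical.

```r
set.seed(1)

# Assumed target: standard bivariate normal with correlation rho.
rho <- 0.9

# Log density of the target, up to an additive constant.
log_target <- function(x) {
  -(x[1]^2 - 2 * rho * x[1] * x[2] + x[2]^2) / (2 * (1 - rho^2))
}

# Random-walk Metropolis: propose a Gaussian step from the current
# point and accept it with probability min(1, target ratio).
metropolis <- function(n_draws, init = c(0, 0), step = 0.5) {
  draws <- matrix(NA_real_, nrow = n_draws, ncol = 2)
  current <- init
  accepted <- 0
  for (i in seq_len(n_draws)) {
    proposal <- current + rnorm(2, sd = step)
    if (log(runif(1)) < log_target(proposal) - log_target(current)) {
      current <- proposal
      accepted <- accepted + 1
    }
    draws[i, ] <- current  # a rejected proposal repeats the current point
  }
  list(draws = draws, accept_rate = accepted / n_draws)
}

fit <- metropolis(5000)

# Rough checks on how well the chain has explored the target.
print(colMeans(fit$draws))      # compare with the target mean (0, 0)
print(cor(fit$draws)[1, 2])     # compare with rho
print(fit$accept_rate)

if (interactive()) {
  plot(fit$draws, pch = ".", xlab = "x1", ylab = "x2")
}
```

Shrinking step or shortening the chain makes the draws cluster in one part of the ellipse, which is the kind of poor coverage the paragraph above describes.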

For reasons I described last week, I believe that this year's useR tutorial speakers have raised the bar on both content and presentation. I am going to do my best to work through these before attending next year's conference in Brussels.