by Andrie de Vries

A few weeks ago I wrote about the growth of CRAN packages, where I demonstrated how to scrape CRAN archives to get an estimate of the number of packages over time. In this post I briefly mentioned that the Ecdat package contains a dataset, CRANpackages, with snapshots recorded by John Fox and Spencer Graves.

Here is a plot of the data they collected. The dataset contains data through 2014, so I manually added the package count as of today (8,329).

In my previous post, I asked the question: "are there indications that the contribution rate is steady, accelerating or decelerating?"

This hints at an analysis by John Fox, who writes: "The number of packages on CRAN ... has grown roughly exponentially, with residuals from the exponential trend ... showing a recent decline in the rate of growth" (Fox, 2009).

In my previous post Using segmented regression to analyse world record running times I used segmented regression to estimate a model that is piece-wise linear.

I used the same process to fit a segmented regression line through the CRAN package data.

By default the segmented package fits a single break point through the data. The results of this analysis indicate a break point occurring some time during 2008. This is entirely consistent with the observation by John Fox that the rate of growth is slowing down.

However, note that the segmented regression line doesn't fit the data very well during the period 2008 to 2012.

With a small amount of extra work you can fit segmented models with multiple break points. To do this, you simply have to specify initial values for the search. Here I show the results of a simple model with two break points. This model finds the first break point during 2007 and the second break point during 2011.
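The segmented fits above aren't reproduced as code here, but the underlying idea is easy to sketch in base R: a piecewise-linear ("broken stick") model is just a linear model with a hinge term. In this sketch the break point is fixed at a known value on toy data; what the segmented package adds is *estimating* that break point (the commented `segmented()` call shows the shape of that API).

```r
# Piecewise-linear ("broken stick") regression, a base-R sketch of what
# the segmented package automates. Toy data with a known break at x = 5.
set.seed(1)
x <- seq(0, 10, by = 0.1)
y <- ifelse(x < 5, 2 * x, 10 + 0.5 * (x - 5)) + rnorm(length(x), sd = 0.3)

brk <- 5                             # break point, fixed for illustration
fit <- lm(y ~ x + pmax(x - brk, 0))  # hinge term adds a slope change after brk
coef(fit)                            # slope ~2 before the break; hinge ~ -1.5

# With the segmented package the break point itself is estimated, e.g.:
# library(segmented)
# segmented(lm(y ~ x), seg.Z = ~x, psi = 4)
```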

Natural systems cannot maintain exponential growth forever. There are always some limits on the system that will ultimately inhibit any further growth. This is why many systems display some kind of sigmoid curve, or S-curve.

Although the growth curve of CRAN packages shows signs of slowing down, it does not seem as if there is an inflexion point in the data. An inflexion point is where the curve transitions from being convex to being concave.

Thus it seems the growth of CRAN packages will appear to be exponential for quite some time to come!

As usual, here is the R code I used.

You might think literary criticism is no place for statistical analysis, but given digital versions of the text you can, for example, use sentiment analysis to infer the dramatic arc of an Oscar Wilde novel. Now you can apply similar techniques to the works of Jane Austen thanks to Julia Silge's R package janeaustenr (available on CRAN). The package includes the full text of the 6 Austen novels, including *Pride and Prejudice* and *Sense and Sensibility*.

With the novels' text in hand, Julia then applied Bing sentiment analysis (as implemented in R's syuzhet package), shown here with annotations marking the major dramatic turns in the book:

There's quite a lot of noise in that chart, so Julia took the elegant step of using a low-pass Fourier transform to smooth the sentiment for all six novels, which allows for a comparison of the dramatic arcs:

An apparent Austen aficionada, Julia interprets the analysis:

This is super interesting to me.

*Emma* and *Northanger Abbey* have the most similar plot trajectories, with their tales of immature women who come to understand their own folly and grow up a bit. *Mansfield Park* and *Persuasion* also have quite similar shapes, which also is absolutely reasonable; both of these are more serious, darker stories with main characters who are a little melancholic. *Persuasion* also appears unique in starting out with near-zero sentiment and then moving to more dramatic shifts in plot trajectory; it is a markedly different story from Austen’s other works.

For more on the techniques of the analysis, including all the R code (plus some clever Austen-based puns), check out Julia's complete post linked below.

data science ish: If I Loved Natural Language Processing Less, I Might Be Able to Talk About It More

by Andrie de Vries

Every once in a while somebody asks me how many packages are on CRAN. (More than 8,000 in April, 2016). A year ago, in April 2015, there were ~6,200 packages on CRAN.

This poses a second question: what is the historical growth of CRAN packages?

One source of information is Bob Muenchen's blog R Now Contains 150 Times as Many Commands as SAS, which contains a graphic showing packages from 2002 through 2014. (Bob fitted a quadratic curve through the data, which fits quite well, except that the model estimates too high in the very early years.)

But where does this data come from? Bob's article references an earlier article by John Fox in the R Journal, Aspects of the Social Organization and Trajectory of the R Project. (This is a fascinating article, and I highly recommend you read it.) The analysis by John Fox contains a graphic showing data from 2001 through 2009. John fits an exponential growth curve through the data, which again fits very well:

I was particularly interested in trying to see if I can find the original source of the data. The original graphic contains a caption with references to the R source code on SVN, but I could only find the release dates of historical R releases, not the package counts.

Next I put the search term "john fox 2009 cran package data" into my favourite search engine and came across the dataset CRANpackages in the package Ecdat. The Ecdat package contains data sets for econometrics, compiled by Spencer Graves.

I promptly installed the package and inspected the data:

> library(Ecdat)
> head(CRANpackages)
  Version       Date Packages            Source
1     1.3 2001-06-21      110          John Fox
2     1.4 2001-12-17      129          John Fox
3     1.5 2002-05-29      162          John Fox
4     1.6 2002-10-01      163 John Fox, updated
5     1.7 2003-05-27      219          John Fox
6     1.8 2003-11-16      273          John Fox

> tail(CRANpackages)
   Version       Date Packages         Source
24    2.15 2012-07-07     4000       John Fox
25    2.15 2012-11-01     4082 Spencer Graves
26    2.15 2012-12-14     4210 Spencer Graves
27    2.15 2013-10-28     4960 Spencer Graves
28    2.15 2013-11-08     5000 Spencer Graves
29     3.1 2014-04-13     5428 Spencer Graves

This data is exactly what I was after, but what is the origin?

> ?CRANpackages

Data casually collected on the number of packages on the Comprehensive R Archive Network (CRAN) at different dates.

So it seems this gets compiled and updated by hand, originally by John Fox, and more recently by Spencer Graves himself.

This set me thinking. Can we do better and automate this process by scraping CRAN?

This is in fact possible, and you can find the source data at CRAN for older, archived releases (R-1.7 in 2004 through R-2.10 in 2010) as well as more recent releases.

However, you will have to scrape the dates from a list of package release dates for each historic release (you can find my code at the bottom of this blog).

I get the following result. Note that the rug marks indicate the release date and number of packages for each release. The data is linear, not log, but the rug marks give the illusion of a logarithmic scale.

I took a few shortcuts in the analysis:

- For each release, the actual data is a list of packages, as well as the publication date for each package. I took the date of the "release" as the very last package publication date. This means my estimate for the "release date" will be wrong. Specifically, in each case, the actual release would have occurred earlier.
- I made no attempt to find the data prior to 2004.

The analysis could really benefit from fitting some curves through the data. Specifically, I would like to fit an exponential growth curve to see, for example, whether there are indications that the contribution rate is steady, accelerating or decelerating. Might an S-curve fit the data better?
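As a rough sketch of what that exponential fit looks like, here is a log-linear regression using only the head and tail rows of the CRANpackages data shown above (a 12-point subset, so the implied growth rate is only indicative):

```r
# Log-linear (exponential-growth) fit, a sketch using only the head and
# tail rows of CRANpackages shown earlier -- the full dataset would do better
dates  <- as.Date(c("2001-06-21", "2001-12-17", "2002-05-29", "2002-10-01",
                    "2003-05-27", "2003-11-16", "2012-07-07", "2012-11-01",
                    "2012-12-14", "2013-10-28", "2013-11-08", "2014-04-13"))
counts <- c(110, 129, 162, 163, 219, 273, 4000, 4082, 4210, 4960, 5000, 5428)

fit <- lm(log(counts) ~ as.numeric(dates))
growth_per_year <- exp(coef(fit)[[2]] * 365.25) - 1
round(growth_per_year, 2)  # annual growth rate implied by the exponential trend
```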

The plot itself needs additional labels for the dot releases.

I hope to address these in a follow-up post.

by Joseph Rickert

Packages continue to flood into CRAN at a rate that challenges the sanity of anyone trying to keep up with what's new. So far this month, more than 190 packages have been added. Here is my view of what's interesting in this March madness.

The launch_tutorial() function from the RtutoR package by Anup Nair launches a Shiny-based interactive R tutorial that, so far, includes sections on basic operations on a data set, data manipulation, loops and functions, and basic model development. The following screen shows the page for selecting columns from a data set. Notice that the example code offers two different dplyr based alternatives. The interface is far from perfect, but it's quite workable. Interactive tutorials launched directly from the command line may very well be the next generation of R documentation.

It also looks like the idea of using an R package to launch a shiny application may indicate a trend. The lavaan.shiny package by William Kyle Hamilton, also new this month, contains a single function to launch an interactive tutorial on latent variable analysis based on the lavaan package.

Time series aficionados will want to have a look at the dCovTS package from Pitsillou and Fokianos, which implements the distance covariance and correlation metrics for univariate and multivariate time series. These are relatively new metrics published by Z. Zhou in a 2012 paper in which he adapted the distance correlation metric developed by Székely et al. to measure non-linear dependence in time series. The following plot shows the data, ACF, PACF and Auto-Distance Correlation Function (ADCF) for a time series of monthly deaths from bronchitis, emphysema and asthma for males in the UK between 1974 and 1979.

The ADCF plot produced by the function ADCFplot(mdeaths,method="Wild",b=100) uses the "Wild Bootstrap", a relatively new re-sampling technique for stationary time series.
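The distance correlation underlying these plots is simple enough to sketch in base R. The following is the i.i.d. sample version of Székely et al.'s statistic, not the time-series ADCF machinery that dCovTS adds on top:

```r
# Sample distance correlation (Szekely, Rizzo & Bakirov 2007), base-R sketch
dcor <- function(x, y) {
  n <- length(x)
  dcenter <- function(v) {  # double-centred Euclidean distance matrix
    d <- as.matrix(dist(v))
    d - rowMeans(d) - rep(colMeans(d), each = n) + mean(d)
  }
  A <- dcenter(x); B <- dcenter(y)
  sqrt(mean(A * B) / sqrt(mean(A * A) * mean(B * B)))
}

# Distance correlation detects the nonlinear (quadratic) dependence
x <- seq(-1, 1, length.out = 100)
dcor(x, x^2)  # clearly positive, even though Pearson cor(x, x^2) is ~0
```

This is why the metric is attractive for time series: unlike the ordinary ACF, it picks up non-linear dependence.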

If you are working with generalized linear mixed models you may be interested in two new packages that provide a few enhancements for lme4. glmmsr by Helen Ogden provides some alternatives to the Laplace method for approximating likelihood functions (The vignette does a good job of explaining the new alternatives) and GLMMRR from Fox, Klotzke and Veen fits GLMM models to binary, randomized response data and provides Cauchit, Log-log, Logistic and Probit link functions.

Machine Learning enthusiasts may find a few new packages interesting. The MultivariateRandomForest package by Raziur Rahman contains functions to fit multivariate Random Forests models and make predictions. The hclust2() function in Gagolewski, Bartoszuk and Cena's genie package clusters data using the Gini index. hclust2() is a hierarchical clustering technique that is billed as being outlier resistant. The package kmlShape by Genolini and Guichard contains functions to do hierarchical clustering on longitudinal data using the Frechet's distance metric to group trajectories. The following plot shows clusters identified for the artificial data generated in example 2 for the kmlShape() function.

deepboost from Marcous and Sandbank provides an interface to Google's Deep Boosting algorithm as described in this paper by Cortes et al. It provides functions for training, evaluation, prediction and hyperparameter optimization using grid search and cross-validation.

The last package I'll mention today is rEDM from Ye, Clark and Deyle that brings empirical dynamic modeling (EDM) to R. EDM uses time series data to reconstruct the state space of a dynamic system using Takens’ Theorem (1981), which implies that the reconstruction can be accomplished using lags of a time series in place of the unknown or unobserved variables. The vignette makes a nice case for why attractors and chaos belong in R.

by Joseph Rickert

In a post late last year, my colleague and fellow blogger, Andrie de Vries described enhancements to the AzureML R package that make it easy to publish R functions that consume data frames as Azure Web Services. A very nice consequence is that it is now feasible to develop predictive models in R and enable the Excel-powered business analysts in your organization to use your model to generate predictions with new data. This is made possible by an Azure feature that integrates a published web service into an Excel workbook. Once you publish your R model as a web service and set up the Excel workbook, anybody you give the workbook to will be able to score new data copied into it.

Now, I'll walk through the steps required assuming you have already set up an Azure ML account. The AzureML package vignette gives a detailed example of publishing a new model. For convenience, I reproduce the necessary code from the vignette here:

The first part of the code fits a generalized boosted regression model (gbm) to the Boston data set from the MASS package that contains features characterizing the housing values for suburban Boston. The prediction function, mypredict() is set up to take in a data frame containing new data and use the gbm model to predict the median value of owner-occupied homes. Notice that the function includes the statement require(gbm). This ensures that the Azure environment will have access to the gbm package when making predictions.

The rest of the code "tests" the prediction using a data frame containing the first five lines of the Boston data set and then publishes the prediction function as a web service using the function publishWebService().
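The vignette code itself is not reproduced here. As a rough local stand-in, the sketch below uses lm() in place of the vignette's gbm model (so it runs with just base R plus MASS), and shows the publish step only as commented-out, illustrative calls with placeholder credentials:

```r
# Local sketch of the vignette's workflow: fit a model on the Boston data,
# wrap prediction in a function that consumes a data frame, test it locally.
# NOTE: lm() stands in for the vignette's gbm model in this sketch.
library(MASS)
data(Boston)

fit <- lm(medv ~ ., data = Boston)  # median home value vs. all features

# Prediction function: takes a data frame of new rows, returns predictions
mypredict <- function(newdata) {
  predict(fit, newdata = newdata)
}

# "Test" the function on the first five rows, as the post describes
print(mypredict(Boston[1:5, ]))

# Publishing (requires an AzureML workspace; these calls are illustrative):
# library(AzureML)
# ws <- workspace(id = "<workspace id>", auth = "<auth token>")
# publishWebService(ws, fun = mypredict, name = "mypredict",
#                   inputSchema = Boston[1:5, ])
```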

Once you have gotten this far there are only a few more steps to set up the Excel workbook. Log in to your Azure Machine Learning account and go to the web services page. You should see something like this:

Notice that the name used in the publishWebService() call appears on the list of available services. Clicking on this will bring you to a page like this next one. Go ahead and select Excel 2013 or later workbook in the REQUEST/RESPONSE row.

This should bring you to the Sample Data screen below that pretty much explains what you are about to do.

Now we're almost there. A couple more clicks will bring you to an empty workbook like the one below. To get this screen I manually pasted in test data from the Boston file.

Then I used the input boxes on the right to set the range for the input data and select range of cell to place the output.

Once the file is built you can distribute it to your colleagues to begin making predictions. You can download my Excel file here Download AzureML-vignette-gbm-3_15_2016 12_13_33 AM and begin making predictions.

by Hong Ooi, Sr. Data Scientist, Microsoft

I’m pleased to announce the release of version 0.62 of the dplyrXdf package, a backend to dplyr that allows the use of pipeline syntax with Microsoft R Server’s Xdf files. This update adds a new verb (`persist`), fills some holes in support for dplyr verbs, and fixes various bugs.

The `persist` verb

A side-effect of dplyrXdf handling file management is that passing the output from one pipeline into subsequent pipelines can have unexpected results. Consider the following example:

# pipeline 1
output1 <- flightsXdf %>%
    mutate(delay=(arr_delay + dep_delay)/2)

# use the output from pipeline 1
output2 <- output1 %>%
    group_by(carrier) %>%
    summarise(delay=mean(delay))

# reuse the output from pipeline 1 -- WRONG
output3 <- output1 %>%
    group_by(dest) %>%
    summarise(delay=mean(delay))

The problem with this code is that the second pipeline will overwrite or delete its input, so the third pipeline will fail. This is consistent with dplyrXdf’s philosophy of only saving the most recent output of a pipeline, where a pipeline is defined as *all operations starting from a raw xdf file.* However, in this case it isn’t what’s desired.

Similarly, dplyrXdf stores its output files in R’s temporary directory, so when you close your R session, these files will be deleted. This saves you having to manually delete files that are no longer in use, but it means that you must copy the output of your pipeline to a permanent location if you want to keep it around.

The new `persist` verb is meant to address these issues. It saves a pipeline’s output to a permanent location and also resets the status of the pipeline, so that subsequent operations will know not to overwrite the data.

# pipeline 1 -- use persist to save the data to the working directory
output1 <- flightsXdf %>%
    mutate(delay=(arr_delay + dep_delay)/2) %>%
    persist("output1.xdf")

# use the output from pipeline 1
output2 <- output1 %>%
    group_by(carrier) %>%
    summarise(delay=mean(delay))

# reuse the output from pipeline 1 -- this works as expected
output3 <- output1 %>%
    group_by(dest) %>%
    summarise(delay=mean(delay))

Specifying levels in a `factorise` call

You can now specify the levels for a factor created by `factorise`, using the standard name=value syntax:

factorise(data, x=c("a", "b", "c"))

This will convert the variable `x` into a factor with levels `a`, `b` and `c`. Any values that don’t match the given levels will be turned into NAs. If `x` is already a factor, its levels will be changed to match those specified.
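On an ordinary vector, base R's factor() has the same semantics, which makes the behaviour easy to check:

```r
# Base-R analogue of factorise(data, x = c("a", "b", "c")):
# values that don't match a given level become NA
x <- c("a", "b", "q", "c", NA)
f <- factor(x, levels = c("a", "b", "c"))
f          # "q" becomes <NA>
levels(f)  # "a" "b" "c"
```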

`semi_join` and `anti_join`

The `semi_join` and `anti_join` verbs have been implemented. As these types of joins aren’t internally supported by `rxMerge`, they are done using a combination of other verbs:

# same as semi_join(a, b, by="x")
# select everything in 'a' that matches a value of 'x' in 'b'
semi <- inner_join(a,
    select(b, x) %>% distinct,
    by="x")

# same as anti_join(a, b, by="x")
# select everything in 'a' that doesn't match a value of 'x' in 'b'
anti <- left_join(a,
    transmute(b, x, .ones=rep(1, .rxNumRows)) %>% distinct,
    by="x") %>%
    filter(is.na(.ones))

`do` and `doXdf`

You can now use unnamed arguments with `do` and `doXdf`, like the native `dplyr::do`. In both cases, the output has to be coercible to a data frame (again, like `dplyr::do`).

# example of unnamed argument to do
do_unnamed <- flightsXdf %>%
    group_by(carrier) %>%
    do(data.frame(quantile=sprintf("%d%%", seq(0, 100, by=25)),
                  quant_arr=quantile(.$arr_delay, na.rm=TRUE),
                  quant_dep=quantile(.$dep_delay, na.rm=TRUE)))

# example of unnamed argument to doXdf
do_unnamedXdf <- flightsXdf %>%
    group_by(carrier) %>%
    doXdf(rxSummary(~ arr_delay, .)$sDataFrame)

A number of bug fixes have been implemented. In particular, joining tables on factor variables should now work even when the factor levels in the two tables aren’t exactly the same. The `mutate_each`, `summarise_each`, `count` and `tally` verbs have also been verified to work correctly for Xdf files.

If you encounter any bugs or issues with dplyrXdf, please contact me at hongooi@microsoft.com.

With the US election season in full swing, you can hardly browse a newspaper website without seeing some kind of map showing election or polling results, like this one from the New York Times.

With election data (usually) accessible online, and a wealth of mapping tools available in the R language, you can fairly easily make similar maps yourself by following this 10-step guide from Computerworld's Sharon Machlis. The guide covers reading the raw data into R and defining a variable to map, combining the data with a shape file of geographic boundaries, creating a static map with color-coding, and making and publishing an interactive map with data pop-ups, layers, and navigation controls. The tutorial makes heavy use of RStudio's leaflet package, which makes it easy to access the capabilities of the leaflet Javascript library from R.

To learn how to create interactive versions of maps like the one above, check out the tutorial at the link below.

Computerworld: Create maps in R in 10 (fairly) easy steps

The first official release of R, R version 1.0.0, was released on February 29, 2000. The anniversary was marked on Twitter by Thomas Lumley, a member of the R Core Group: 20 leading statisticians and computer scientists (and 4 alums) from around the world without whom the R Project would not exist. That makes it 16 years — sixteen! — that the R language has faithfully served statisticians, bioinformaticians, quantitative analysts, data scientists and others solving problems with data.

But in fact, the R project is even more venerable than that. The project itself began in 1993, 7 years before the first "official" R release was made available to the public, as a research project initiated by Ross Ihaka and Robert Gentleman. Here's a brief timeline of the history of R:

**1993**: Research project in Auckland, NZ
**1995**: R released as open-source software
**1997**: R Core Group formed
**2000**: R 1.0.0 released (February 29)
**2003**: R Foundation founded
**2004**: First international user conference in Vienna
**2015**: R Consortium founded

The project is still as strong and as active as ever. (A new update for R, version 3.2.4, is scheduled for March 10.) Likewise, the community around R continues to grow rapidly, as evidenced by user-created contributions to R hosted on the Comprehensive R Archive Network, CRAN. CRAN is a repository where anyone can contribute an extension to R (called a "package"), as long as it meets the quality and licensing requirements set by the CRAN maintainers. On February 29, R's official 16th anniversary, there were exactly 8,000 packages — *eight thousand* — hosted on CRAN. (You can explore those packages at MRAN, Microsoft's historical archive of CRAN.)

That number doesn't even count packages that have been accepted to CRAN but have since been retired (either through obsolescence or lack of updates by the author when they fail to pass checks with new versions of R). Gergely Daróczi analyzed the CRAN logs to show that submissions have increased exponentially over time, and more than 9,000 distinct packages have been accepted to CRAN:

By the way, hosting and maintaining CRAN is a huge effort, run by volunteers from the R Core Group. Every package submitted to CRAN is automatically tested on a wide variety of platforms by the CRAN build system, and volunteers spend a significant amount of time personally interacting with package authors to resolve problems that arise. On top of that, the CRAN system runs checks of R packages with each nightly build of R, a process that takes 90 days of computing time (37 of which are on the Solaris system, which R still supports). Again, problems that arise in this process often result in manual notifications to package authors by the CRAN volunteers. R packages are an incredibly important part of the value of R, and it's thanks to the CRAN system (and its volunteers) that all R users have access to such an amazingly rich source of capabilities for R.

So this week of R's 16th official anniversary marks a great time to thank the R Core Group and the CRAN volunteers for providing their time and expertise to create the most useful ecosystem for data science the world has ever known. Thank you all!

By Joseph Rickert

The ability to generate synthetic data with a specified correlation structure is essential to modeling work. As you might expect, R’s toolbox of packages and functions for generating and visualizing data from multivariate distributions is impressive. The basic function for generating multivariate normal data is mvrnorm() from the MASS package included in base R, although the mvtnorm package also provides functions for simulating both multivariate normal and t distributions. (For tutorial on how to use R to simulate from multivariate normal distributions from first principles using some linear algebra and the Cholesky decomposition see the astrostatistics tutorial on Multivariate Computations.)

The following block of code generates 5,000 draws from a bivariate normal distribution with mean (0,0) and the covariance matrix Sigma shown in the code. The function kde2d(), also from the MASS package, generates a two-dimensional kernel density estimate of the distribution's probability density function.

# SIMULATING MULTIVARIATE DATA
# https://stat.ethz.ch/pipermail/r-help/2003-September/038314.html
# lets first simulate a bivariate normal sample
library(MASS)

# Simulate bivariate normal data
mu <- c(0, 0)                        # Mean
Sigma <- matrix(c(1, .5, .5, 1), 2)  # Covariance matrix
# > Sigma
#      [,1] [,2]
# [1,]  1.0  0.5
# [2,]  0.5  1.0

# Generate sample from N(mu, Sigma)
bivn <- mvrnorm(5000, mu = mu, Sigma = Sigma)  # from MASS package
head(bivn)

# Calculate kernel density estimate
bivn.kde <- kde2d(bivn[,1], bivn[,2], n = 50)  # from MASS package

R offers several ways of visualizing the distribution. These next two lines of code overlay a contour plot on a "heat map" that maps the density of points to a gradient of colors.
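Those two lines are not reproduced above; a minimal self-contained sketch (regenerating the kernel density estimate so it runs on its own) looks like this:

```r
# Contour plot overlaid on a heat map of the kernel density estimate
library(MASS)
set.seed(1)
bivn <- mvrnorm(5000, mu = c(0, 0), Sigma = matrix(c(1, .5, .5, 1), 2))
bivn.kde <- kde2d(bivn[, 1], bivn[, 2], n = 50)

image(bivn.kde, col = heat.colors(32))  # density mapped to a colour gradient
contour(bivn.kde, add = TRUE)           # contour lines on top
```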

This plots the irregular contours of the simulated data. The code below, which uses the ellipse() function from the ellipse package, generates the classical bivariate normal distribution plot that graces many a textbook.

# Classic Bivariate Normal Diagram
library(ellipse)
rho <- cor(bivn)
y_on_x <- lm(bivn[,2] ~ bivn[,1])  # Regression Y ~ X
x_on_y <- lm(bivn[,1] ~ bivn[,2])  # Regression X ~ Y
plot_legend <- c("99% CI green", "95% CI red", "90% CI blue",
                 "Y on X black", "X on Y brown")

plot(bivn, xlab = "X", ylab = "Y",
     col = "dark blue",
     main = "Bivariate Normal with Confidence Intervals")
lines(ellipse(rho), col = "red")  # ellipse() from ellipse package
lines(ellipse(rho, level = .99), col = "green")
lines(ellipse(rho, level = .90), col = "blue")
abline(y_on_x)
abline(x_on_y, col = "brown")
legend(3, 1, legend = plot_legend, cex = .5, bty = "n")

The next bit of code generates a couple of three dimensional surface plots. The second of which is an rgl plot that you will be able to rotate and view from different perspectives on your screen.
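The surface-plot code itself is not shown above; a static base-graphics sketch is below, with the rotatable rgl version indicated in the commented lines:

```r
# Static 3D surface of the kernel density estimate with base graphics
library(MASS)
set.seed(1)
bivn <- mvrnorm(5000, mu = c(0, 0), Sigma = matrix(c(1, .5, .5, 1), 2))
bivn.kde <- kde2d(bivn[, 1], bivn[, 2], n = 50)

persp(bivn.kde, phi = 30, theta = 30, shade = 0.2, border = NA,
      xlab = "X", ylab = "Y", zlab = "density")

# Interactive, rotatable version:
# library(rgl)
# persp3d(x = bivn.kde, col = "skyblue")
```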

Next, we have some code to unpack the grid coordinates produced by the kernel density estimator and get x, y, and z values to plot the surface using the new scatterplot3js() function from the threejs package (an htmlwidgets interface to the three.js JavaScript library). This visualization does not render the surface with the same level of detail as the rgl plot. Nevertheless, it does show some of the salient features of the pdf and has the distinct advantage of being easily embedded in web pages. I expect that html widget plots will keep getting better and easier to use.

# threejs Javascript plot
library(threejs)

# Unpack data from kde grid format
x <- bivn.kde$x; y <- bivn.kde$y; z <- bivn.kde$z

# Construct x,y,z coordinates
xx <- rep(x, times = length(y))
yy <- rep(y, each = length(x))
zz <- z; dim(zz) <- NULL

# Set up color range
ra <- ceiling(16 * zz / max(zz))
col <- rainbow(16, 2/3)

# 3D interactive scatter plot
scatterplot3js(x = xx, y = yy, z = zz, size = 0.4, color = col[ra], bg = "black")

The code that follows uses the rtmvt() function from the tmvtnorm package to generate a bivariate t distribution. The rgl plot renders the kernel density estimate of the surface in impressive detail.

# Draw from multi-t distribution without truncation
library(tmvtnorm)
Sigma <- matrix(c(1, .1, .1, 1), 2)  # Covariance matrix
X1 <- rtmvt(n = 1000, mean = rep(0, 2), sigma = Sigma, df = 2)  # from tmvtnorm package
t.kde <- kde2d(X1[,1], X1[,2], n = 50)  # from MASS package
col2 <- heat.colors(length(bivn.kde$z))[rank(bivn.kde$z)]
persp3d(x = t.kde, col = col2)

The real value of the multivariate distribution functions from the data science perspective is to simulate data sets with many more than two variables. The functions we have been considering are up to the task, but there are some technical considerations and, of course, we don't have the same options for visualization. The following code snippet generates 10 variables from a multivariate normal distribution with a specified covariance matrix. Note that I've used the genPositiveDefMat() function from the clusterGeneration package to generate the covariance matrix. This is because mvrnorm() will throw an error, as theory says it should, if the covariance matrix is not positive definite, and guessing a combination of matrix elements to make a high dimensional matrix positive definite would require quite a bit of luck along with some serious computation time.
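The snippet itself is not reproduced above. As a base-R sketch of the same idea, crossprod() guarantees a positive (semi-)definite covariance matrix without needing clusterGeneration's genPositiveDefMat():

```r
# Simulating 10 correlated normals, a sketch. crossprod(A) = t(A) %*% A is
# always positive semi-definite, and positive definite when A has full rank
# (which a random Gaussian matrix has with probability 1).
library(MASS)
set.seed(123)
d <- 10
A     <- matrix(rnorm(d * d), d)  # random square matrix
Sigma <- crossprod(A)             # valid covariance matrix for mvrnorm()
mu    <- rep(0, d)

sim <- mvrnorm(1000, mu = mu, Sigma = Sigma)
dim(sim)                        # 1000 draws x 10 variables
round(cor(sim)[1:3, 1:3], 2)    # a corner of the sample correlation matrix
```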

After generating the matrix, I use the corrplot() function from the corrplot package to produce an attractive pairwise correlation plot that is coded both by shape and color. corrplot() scales pretty well with the number of variables and will give a decent chart with 40 to 50 variables. (Note that now ggcorrplot will do this for ggplot2 plots.) Other plotting options would be to generate pairwise scatter plots and R offers many alternatives for these.
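corrplot() itself requires the corrplot package; a rough base-graphics stand-in, just to make the idea concrete, is an image() of the correlation matrix (the data here are 10 illustrative random normal variables, not the simulated data above):

```r
# Base-graphics stand-in for a correlation plot: image() of cor(m)
set.seed(42)
m  <- matrix(rnorm(200 * 10), ncol = 10)  # 10 illustrative variables
cm <- cor(m)

image(1:10, 1:10, cm[, 10:1], zlim = c(-1, 1),
      col = hcl.colors(21, "Blue-Red", rev = TRUE),
      xlab = "", ylab = "", main = "Pairwise correlations")
```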

Finally, what about going beyond the multivariate normal and t distributions? R does have a few functions like rlnorm() from the compositions package which generates random variates from the multivariate lognormal distribution that are as easy to use as mvrnorm(), but you will have to hunt for them. I think a more fruitful approach if you are serious about probability distributions is to get familiar with the copula package.

by Joseph Rickert

Earlier this month the Bay Area useR Group (BARUG) held its annual lightning talk meeting. This is by far our most popular meeting format: eight 15-minute talks (12 minutes speaking and 3 minutes of Q&A while the next speaker sets up) packed into a two-hour time slot. The intensity seems to really energize the speakers and engage the audience.

Bradley Shanrock-Solberg kicked off the event with a delightful example of an R Monte Carlo simulation based on his wildpoker package that you can find on CRAN. I have never seen a more prepared lightning talk presenter: high energy, a royal flush presentation and a four-color printed handout just in case you have trouble keeping up with him for the 12 minutes. In a series of well-conceived plots Bradley showed how, for a number of different poker variations, the best hand changes as the game progresses. The number of players who start the game, the number who stay until the showdown, wildcards and many more contingent events dynamically change the value of your hand. Bradley is definitely the guy for your next trip to Vegas.

If you are a poker player, you will definitely want to check out his package and supplementary material: his paper on Winning More at Dealers Choice Poker and his example of why rules of thumb fail.

William Sundstrom, professor of Economics at Santa Clara University, gave an entertaining and thought provoking presentation on teaching R to undergraduate Econometrics students. One interesting observation that generated some discussion was that even though today's students are "digital natives" having grown up using intelligent devices of all kinds, many of them are nevertheless "digital naïfs". The following slide, a reprint of an email from one of Professor Sundstrom's students, captures some of their frustration.

David Ouyang MD, a Stanford resident, a guy who sometimes puts in 73+ hour work weeks, presented some explorations of the Epic electronic medical records data set he is analyzing in his spare time. The following plot shows the distribution of physicians' interactions with the Epic system over the course of a day.

Dennis Noren, a long time BARUG member and contributor, showed some results from a recommendation app he is building on top of The Movie Database. The following slide from his presentation shows a Shiny dashboard he built to drive a parallel plot. The interactivity really makes the plot useful.

If you are thinking about betting on the Academy Awards you might want to consider Keith Everett's predictions based on his GLMNet Model.

Nelson Auner talked about Modern KPI Tracking in R. My favorite slide describes the behavior of Execs who don't work for companies that sell BI Software:

And that wasn't all of it! We also had an update on data.table from Matt Dowle himself, and an introduction to exploring large data sets with Apache Spark from Hossein Falaki.

Many thanks to Earl Hubbell and the folks at Thermo Fisher Scientific who hosted the meeting.