by Mark Malter

After reading the book *Analyzing Baseball Data with R* by Max Marchi and Jim Albert, I decided to expand on some of their ideas relating to runs created and put them into an R Shiny app.

The Server and UI code are linked at the bottom of the Introduction tab.

I downloaded the Retrosheet play-by-play data for every game played in the 2011-2014 seasons in every park and aggregated every plate appearance by one of the 24 bases/outs states (ranging from nobody on/nobody out to bases loaded/two outs). With the Retrosheet data, I wrote code to track the batter, bases, outs, runs scored over the remainder of the inning, current game score, and inning. I also used the R Lahman package and databases for individual player information. Below is a brief explanation of the function of each tab on the app.

**Potential Runs by bases/outs state**: Matrix of all 24 possible bases/outs states, both with expected runs over the remainder of an inning, and the probability of scoring at least one run over the remainder of the inning (for late innings of close games). I used this table to analyze several types of plays, as shown below. Notice that, assuming average hitters, the analysis below shows why sacrifice bunts are always a bad idea. The Runs Created stat for a plate appearance is defined as:

end state – start state + runs scored on the play.
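To make the formula concrete, here is a small worked example in R. The expected-run values are approximate average-hitter figures of the kind quoted in the sacrifice bunt discussion below, used purely for illustration:

```r
# Runs Created (RC) for a single plate appearance:
#   RC = E[runs | end state] - E[runs | start state] + runs scored on the play
# Approximate expected-run values (illustrative, average hitters):
start_state <- 0.85  # runner on first, no outs
end_state   <- 0.66  # runner on second, one out
runs_scored <- 0     # no run scores on a successful sacrifice bunt
rc <- end_state - start_state + runs_scored
rc  # -0.19: the "successful" bunt costs roughly a fifth of a run
```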

I first became serious about this after watching the last inning of the 2014 World Series. Down 3-2 with two outs and nobody on base, Alex Gordon singled to center and advanced to third on a two-base error. As Gordon was heading into third base, Giants shortstop Brandon Crawford was taking the relay throw in short left field. Had Gordon been sent home, Crawford would likely have thrown him out at the plate. However, the runs matrix shows only a 26% chance of scoring a run with a man on third and two outs, and with Madison Bumgarner on the mound, it was even less likely that on-deck hitter Salvador Perez would be able to drive in Gordon. So even though sending Gordon would likely have ended the game (and the series), it still may have been the optimal play. This would be similar to hitting 16 against a dealer's ten in blackjack: you'll probably lose, but you're making an optimal play. For equivalency, see the Tag from Third analysis below, as this play would have been equivalent to tagging from third after a catch for the second out.

**Runs Created All Regular MLB Players**: I filtered out all players with fewer than 400 plate appearances and created an interactive chart (built with rCharts) showing each player's runs potential by runs created. I placed the following filters in the UI: year, innings (1-3, 4-6, 7-extras), run differential at the time of the at bat (0-1, 2-3, 4+), position, team, bats, age range, and weight. Hovering over a point shows the player and his salary. For example, Mike Trout created 58 runs out of a potential of 332 in 2014. Filtering 2013 for second basemen under the age of 30 and weighing less than 200 pounds, we see Jason Kipnis created 27 runs out of a potential of 300.

**Player Runs Table**: Same as above, but this shows each player (> 400 plate appearances for the selected season), broken down by each of the eight bases states. For example, in 2014 Jose Abreu created 43.5 runs on a potential of 291, and was most efficient with a runner on second base, where he created 10.3 runs on a potential of only 36.

The following tabs show run expectancies for various offensive plays, from the start state to the expected end state, based on the Baserunning Success rate selected in the UI. For each play, I created a graphical as well as a table tab. For the graphical tabs, there is a UI control to switch between views of expected runs and scoring probability.

**Stolen Bases Graphic/Table**: For each of fifteen different base-stealing situations, I show the start state, end state (based on the UI-selected success rate), and the break-even success rate for the given situation. We see that rather than one generic rule of thumb for breaking even, the situational break-evens vary widely, ranging from 91% with a runner on second and two outs, to 54% for a double steal with runners on first and second and one out (I assume that any out is the lead runner). Notice though that if only the runner on second attempts to steal, the break-even jumps from 54% to 72%.
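The break-even rates come from a simple expected-value balance. As a sketch (the expected-run numbers below are hypothetical placeholders, not values taken from the app), the break-even success rate p is the rate at which attempting the steal leaves expected runs unchanged:

```r
# Solve p * E[success] + (1 - p) * E[failure] = E[start] for p:
#   p = (E[start] - E[failure]) / (E[success] - E[failure])
breakeven <- function(e_start, e_success, e_fail) {
  (e_start - e_fail) / (e_success - e_fail)
}
# Hypothetical expected runs: start = runner on first, no outs;
# success = runner on second, no outs; failure = bases empty, one out
breakeven(e_start = 0.85, e_success = 1.10, e_fail = 0.25)  # about 0.71
```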

**Tag from Third Graphic/Table**: I broke down every situation where a fly ball was caught with a runner on third, where the catch was either the first or second out. I tracked the attempt frequency and success rate for each situation, based on the outs and whether there were trailing runners. Surprisingly, I found that almost every success rate is well over 95%, meaning runners are only tagging when they're almost certain to score. However, the break-evens range from 40% with runners on first and third with two outs (after the catch) to 77% with runners on second and third with one out. I believe this shows a gray area between the break-even and success rates where runners are being far too cautious.

The following tabs show whether a base runner should attempt to advance two bases on a single. Again, of course it depends on the situation.

**First to Third Graphic/Table**: Here we see that the attempt frequencies are very low and, as expected, lowest on balls hit to left field. However, as with the tag plays above, runners are almost always safe, showing another gray area between attempts and break-evens. For example, on a single to right field with one out, runners only attempt to advance to third base 42.1% of the time, and are safe 97.3% of the time. If we place the UI Success Rate slider at 0.85, we see that the attempt increases the run expectancy from 0.87 to 0.99.

**Second to Home Graphic/Table**: Here we see the old adage, "don't make the first or second out at the plate," is not necessarily true. Attempting to score from second on a single depends not only on the outs, but also on whether there is a trailing runner. The break-evens range from 93% with no outs and no trailing runner on first, to 40% with two outs and no runner on first. Once again, the success rates are almost always higher than the break-even rate, showing too much caution.

**Sacrifice Bunt Graphic/Table**: These tabs show that unless we have a hitter far below average, the sacrifice should never be attempted. For example, in going from a runner on first and no outs to a runner on second with one out, or going from a runner on second with no outs to a runner on third with one out, we drop from 0.85 runs to 0.66 runs and from 1.10 runs to 0.94 runs respectively. Worse, I’m assuming that the bunt is always successful with the lead runner never being thrown out. The only situation where the bunt might be wise is in a late inning and the team is playing for one run after a leadoff double. Getting the runner from second and no outs to third with one out increases the probability of scoring from 0.61 to 0.65, IF the bunt is successful. Even here, it is a poor play if the success rate is less than 90%. The graphic tab allows the user to see how the expected end state changes as the UI success rate slider is altered.

UI code: https://github.com/malter61/retrosheets/blob/master/ui.R

Server code: https://github.com/malter61/retrosheets/blob/master/server.R

*Mark Malter is a data scientist currently working for Houghton Mifflin Harcourt, as well as the consulting firm Channel Pricing, specializing in building predictive models, cluster analysis, and visualizing data. He is also a sixteen-year veteran stock options market-maker at the Chicago Board Options Exchange. He has a BS degree in electrical engineering, an MBA, and is currently working on an MS degree in Predictive Analytics. Mark also spent 14 years as a director and coach of his local youth baseball league.*

by Joseph Rickert

Tracking R user group meetings is a good way to stay informed about what's happening in the R world. On Tuesday the Bay Area useR Group (BARUG) met at AdRoll in San Francisco. It was a mini-conference with 6 talks:

- Bryan Galvin, our host at AdRoll (many thanks for the pizza and beer), kicked off the evening by showing how his company has created Sparkle, a big-league workflow for deploying Shiny apps.
- Shivaram Venkataraman of Amplab gave an overview of SparkR, highlighting new features and giving a preview of what's coming.
- Perry de Valpine presented NIMBLE, a new system for building BUGS models in R and compiling them to C++, which Bayesian R users should find of great interest.
- Hamed Dagour presented three scenarios of how Personal Capital uses R for financial applications.
- Ramnath Vaidyanathan delivered a high-energy overview of htmlwidgets for R. htmlwidgets may very well be "the next big thing".
- Bill Grosso finished up the evening by describing how his team at Scientific Revenue uses R to build dynamic pricing models for internet game companies.

SparkR has been generating quite a bit of interest in the R community at large. Shivaram revealed that SparkR is now going mainstream within the Spark project. SparkR will be merged into Spark in the upcoming 1.4 release. This means that there will be a Spark API for R.

The following slide illustrates the data frame methods that have been developed for SparkR:
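In code, those methods look roughly like this. This is a hedged sketch based on the SparkR 1.4 preview API; initialization and function names may differ in the released version:

```r
library(SparkR)

# Start a local Spark context and a SQL context (1.4-preview style API)
sc <- sparkR.init(master = "local")
sqlContext <- sparkRSQL.init(sc)

# Convert a local R data frame to a distributed SparkR DataFrame
df <- createDataFrame(sqlContext, faithful)

# Familiar data frame verbs run distributed on the cluster
head(filter(df, df$waiting > 70))
head(summarize(groupBy(df, df$waiting), count = n(df$waiting)))
```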

Future SparkR work which is highlighted in Shivaram's presentation includes:

- High-level APIs for machine learning algorithms
- Pipelines with "featurizers"
- Extended models and summary methods
- APIs for streaming and time series analysis
- Distributed matrix operations

htmlwidgets is another project generating some buzz in the R world. The idea here is to create R bindings to JavaScript libraries so that interactive JavaScript plots can be used in stand-alone applications or embedded in markdown files or Shiny apps.

Ramnath's presentation highlighted some brand new work in htmlwidgets. Especially notable was his live demonstration of cooperating widgets. Using message passing to transmit state, Ramnath showed a cluster of graphs changing on the fly to reflect new information from graphs they were linked to. This gif link gives an idea of the kinds of cool things people are doing with htmlwidgets.
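To get a feel for how little code an htmlwidget requires, here is a minimal example using dygraphs, one of the early htmlwidgets packages (assuming it is installed from CRAN):

```r
library(dygraphs)

# Two monthly time series from the built-in UK lung deaths data
lungDeaths <- cbind(mdeaths, fdeaths)

# One call produces an interactive JavaScript chart that works in the
# RStudio viewer, R Markdown documents, and Shiny apps alike
dygraph(lungDeaths) %>% dyRangeSelector()
```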

In other R user group meetings around the world this week: the New Delhi useR Group gave R tutorials, and the Taiwan R User Group explored the MLDM package and its applications. Statistical Programming DC looked into R and Shiny. MadR looked into making R code easier and more reliable. The Portland R User Group got a glimpse of new features coming to RStudio. The Barcelona R Users Group will look into dplyr. Dublin R will look at using R with medical statistics method comparison studies. DataPhilly will investigate data science in health care and look into using R with Plotly. The Denver R Users Group will discuss websites with GitHub and RStudio in the context of continuous delivery systems. The Atlanta R Users Group will look into building a better data stack.

This is only a small sample of 15 talks, but they do emphasize that Shiny and the use of R in production-level work are both trending topics in the R community. Bryan's talk illustrates the practical intersection of the two trends.

In today's data-oriented world, just about every retailer has amassed a huge database of purchase transactions. Each transaction consists of a number of products that have been purchased together. A natural question you could answer from this database is: what products are typically purchased together? This is called Market Basket Analysis (or Affinity Analysis). A closely related question is: can we find relationships between certain products that indicate the purchase of other products? For example, if someone purchases avocados and salsa, it's likely they'll purchase tortilla chips and limes as well. This is called association rule learning, a data mining technique used by retailers to improve product placement, marketing, and new product development.

R has an excellent suite of algorithms for market basket analysis in the arules package by Michael Hahsler and colleagues. It includes support for both the Apriori algorithm and ECLAT (the equivalence class transformation algorithm). You can find an in-depth description of both techniques (including several examples) in the Introduction to arules vignette. The slides below, by Yanchang Zhao, provide a nice overview, and you can find further examples at RDataMining.com.
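For a quick taste, the sketch below mines association rules from the Groceries data set that ships with arules; the support and confidence thresholds are arbitrary choices for illustration:

```r
library(arules)

# 9,835 grocery-store transactions bundled with the package
data(Groceries)

# Mine association rules with the Apriori algorithm
rules <- apriori(Groceries, parameter = list(supp = 0.001, conf = 0.5))

# Inspect the three rules with the highest lift
inspect(head(sort(rules, by = "lift"), 3))
```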

by Ari Lamstein

Today I will walk through an analysis of San Francisco zip code demographics using my new R package choroplethrZip. This package creates choropleth maps of US zip codes and connects to US Census Bureau data. A choropleth is a map that shows boundaries of regions (such as zip codes) and colors those regions according to some metric (such as population).

Zip codes are a common geographic unit for businesses to work with, but rendering them is difficult. Official zip codes are maintained by the US Postal Service, but they exist solely to facilitate mail delivery. The USPS does not release a map of them, they change frequently and, in some cases, are not even polygons. The most authoritative map I could find of US zip codes was the Census Bureau's map of Zip Code Tabulation Areas (ZCTAs). Despite shipping with only a simplified version of this map (60MB instead of 500MB), choroplethrZip is still too large for CRAN. It is instead hosted on github, and you can install it from an R console like this:

```r
# install.packages("devtools")
library(devtools)
install_github('arilamstein/choroplethrZip@v1.1.1')
```

The package vignettes (1, 2) explain basic usage. In this article I’d like to demonstrate a more in depth example: showing racial and financial characteristics of each zip code in San Francisco. The data I use comes from the 2013 American Community Survey (ACS) which is run by the US Census Bureau. If you are new to the ACS you might want to view my vignette on Mapping US Census Data.

One table that deals with race and ethnicity is B03002 - Hispanic or Latino Origin by Race. Many people will be surprised by the large number of categories. This is because the US Census Bureau has a complex framework for categorizing race and ethnicity. Since my purpose here is to demonstrate technology, I will simplify the data by dealing with only a handful of the values: Total Hispanic or Latino, White (not Hispanic), Black (not Hispanic), and Asian (not Hispanic).

The R code for getting this data into a data.frame can be viewed here, and the code for generating the graphs in this post can be viewed here. Here is a boxplot of the ethnic breakdown of the 27 ZCTAs in San Francisco.

This boxplot shows that there is wide variation in the racial and ethnic breakdown of San Francisco ZCTAs. For example, the percentage of White people in each ZCTA ranges from 7% to 80%. The percentages for Black and Hispanic residents have tighter ranges, but also contain outliers. Also, while Asian Americans make up only about 5% of the total US population, the median ZCTA in SF is 30% Asian.

Viewing this data with choropleth maps allows us to associate locations with these values.
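As a sketch of what producing such a map takes (the two values below are made up; real ones would come from the ACS tables discussed above), choroplethrZip expects a data frame with a region column of ZCTA strings and a value column:

```r
library(choroplethrZip)

# Hypothetical per capita income values for two San Francisco ZCTAs;
# ZCTAs in the zoom that are absent from the data are left uncolored
df <- data.frame(region = c("94105", "94110"),
                 value  = c(144400, 40000))

zip_choropleth(df,
               msa_zoom = "San Francisco-Oakland-Hayward, CA",
               title    = "Per Capita Income (illustrative data)",
               legend   = "Dollars")
```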

When discussing demographics, people often ask about per capita income. Here is a boxplot of the per capita income of San Francisco ZCTAs.

The range of this dataset - from $15,960 to $144,400 - is striking. Equally striking is the outlier at the top. We can use a continuous scale choropleth to highlight the outlier. We can also use a four color choropleth to show the locations of the quartiles.

The outlier for income is zip 94105, which is where a large number of tech companies are located. The zips in the southern part of the city tend to have a low income.

After viewing this analysis readers might wish to do a similar analysis for the city where they live. To facilitate this I have created an interactive web application. The app begins by showing a choropleth map of a random statistic of the Zips in a random Metropolitan Statistical Area (MSA). You can choose another statistic, zoom in, or select another MSA.

In the event that the application does not load (for example, if I reach my monthly quota at my hosting company) then you can run the app from source, which is available here.

I hope that you have enjoyed this exploration of Zip code level demographics with choroplethrZip. I also hope that it encourages more people to use R for demographic statistics.

by Sherri Rose

Assistant Professor of Health Care Policy

Harvard Medical School

Targeted learning methods build machine-learning-based estimators of parameters defined as features of the probability distribution of the data, while also providing influence-curve or bootstrap-based confidence intervals. The theory offers a general template for creating targeted maximum likelihood estimators for a data structure, nonparametric or semiparametric statistical model, and parameter mapping. These estimators of causal inference parameters are doubly robust and have a variety of other desirable statistical properties.

Targeted maximum likelihood estimation builds on the loss-based "super learning" system so that lower-dimensional parameters can be targeted (e.g., a marginal causal effect); the remaining bias for the (low-dimensional) target feature of the probability distribution is removed. Targeted learning for effect estimation and causal inference allows for the complete integration of machine learning advances in prediction while providing statistical inference for the target parameter(s) of interest. Further details about these methods can be found in the many targeted learning papers as well as the 2011 targeted learning book.

Practical tools for the implementation of targeted learning methods for effect estimation and causal inference have developed alongside the theoretical and methodological advances. While some work has been done to develop computational tools for targeted learning in proprietary programming languages, such as SAS, the majority of the code has been built in R.

Of key importance are the two R packages SuperLearner and tmle. Ensembling with SuperLearner allows us to use many algorithms to generate an ideal prediction function that is a weighted average of all the algorithms considered. The SuperLearner package, authored by Eric Polley (NCI), is flexible, allowing for the integration of dozens of prespecified potential algorithms found in other packages as well as a system of wrappers that provide the user with the ability to design their own algorithms, or include newer algorithms not yet added to the package. The package returns multiple useful objects, including the cross-validated predicted values, final predicted values, vector of weights, and fitted objects for each of the included algorithms, among others.

Below is sample code with the ensembling prediction package SuperLearner using a small simulated data set.

```r
library(SuperLearner)

## Generate simulated data ##
set.seed(27)
n <- 500
data <- data.frame(W1 = runif(n, min = .5, max = 1),
                   W2 = runif(n, min = 0, max = 1),
                   W3 = runif(n, min = .25, max = .75),
                   W4 = runif(n, min = 0, max = 1))
# add W5 dependent on W2, W3
data <- transform(data, W5 = rbinom(n, 1, 1 / (1 + exp(1.5 * W2 - W3))))
# add Y dependent on W1, W2, W4, W5
data <- transform(data,
  Y = rbinom(n, 1, 1 / (1 + exp(-(-.2 * W5 - 2 * W1 + 4 * W5 * W1 - 1.5 * W2 + sin(W4))))))
summary(data)

## Specify a library of algorithms ##
SL.library <- c("SL.nnet", "SL.glm", "SL.randomForest")

## Run the super learner to obtain predicted values for the super learner
## as well as CV risk for algorithms in the library ##
fit.data.SL <- SuperLearner(Y = data[, 6], X = data[, 1:5],
                            SL.library = SL.library, family = binomial(),
                            method = "method.NNLS", verbose = TRUE)

## Run the cross-validated super learner to obtain its CV risk ##
fitSL.data.CV <- CV.SuperLearner(Y = data[, 6], X = data[, 1:5], V = 10,
                                 SL.library = SL.library, verbose = TRUE,
                                 method = "method.NNLS", family = binomial())

## Cross-validated risks ##
mean((data[, 6] - fitSL.data.CV$SL.predict)^2)  # CV risk for super learner
fit.data.SL  # CV risks for algorithms in the library
```

The final lines of code return the cross-validated risks for the super learner as well as each algorithm considered within the super learner. While a trivial example with a small data set and few covariates, these results demonstrate that the super learner, which takes a weighted average of the algorithms in the library, has the smallest cross-validated risk and outperforms each individual algorithm.

The tmle package, authored by Susan Gruber (Reagan-Udall Foundation), allows for the estimation of both average treatment effects and parameters defined by a marginal structural model in cross-sectional data with a binary intervention. This package also includes the ability to incorporate missingness in the outcome and the intervention, use SuperLearner to estimate the relevant components of the likelihood, and use data with a mediating variable. Additionally, TMLE and collaborative TMLE R code specifically tailored to answer quantitative trait loci mapping questions, such as those discussed in Wang et al 2011, is available in the supplementary material of that paper.

The multiPIM package, authored by Stephan Ritter (Omicia, Inc.), is designed specifically for variable importance analysis, and estimates an attributable-risk-type parameter using TMLE. This package also allows the use of SuperLearner to estimate nuisance parameters and produces additional estimates using estimating-equation-based estimators and g-computation. The package includes its own internal bootstrapping function to calculate standard errors if this is preferred over the use of influence curves, or influence curves are not valid for the chosen estimator.

Four additional prediction-focused packages are casecontrolSL, cvAUC, subsemble, and h2oEnsemble, all primarily authored by Erin LeDell (Berkeley). The casecontrolSL package relies on SuperLearner and performs subsampling in a case-control design with inverse-probability-of-censoring-weighting, which may be particularly useful in settings with rare outcomes. The cvAUC package is a tool kit to evaluate area under the ROC curve estimators when using cross-validation. The subsemble package was developed based on a new approach to ensembling that fits each algorithm on a subset of the data and combines these fits using cross-validation. This technique can be used in data sets of all size, but has been demonstrated to be particularly useful in smaller data sets. A new implementation of super learner can be found in the Java-based h2oEnsemble package, which was designed for big data. The package uses the H2O R interface to run super learning in R with a selection of prespecified algorithms.

Another TMLE package is ltmle, primarily authored by Joshua Schwab (Berkeley). This package mainly focuses on parameters in longitudinal data structures, including the treatment-specific mean outcome and parameters defined by a marginal structural model. The package returns estimates for TMLE, g-computation, and estimating-equation-based estimators.

*The text above is a modified excerpt from the chapter "Targeted Learning for Variable Importance" by Sherri Rose in the forthcoming Handbook of Big Data (2015) edited by Peter Buhlmann, Petros Drineas, Michael John Kane, and Mark Van Der Laan to be published by CRC Press.*

Thanks to all who attended my webinar earlier this week, Reproducibility with Revolution R Open and the Checkpoint Package. If you missed the live session, you can catch up with the slides and video replay which I've embedded below.

If you just want to check out the demo of the checkpoint package, it starts at 18:30 in the video below. If you want to follow along at home, you can download the demo script here.

Revolution Analytics webinars: Reproducibility with Revolution R Open and the Checkpoint Package

by Herman Jopia

**What is Binning?**

**Binning** is the term used in scoring modeling for what is also known in Machine Learning as **Discretization**: the process of transforming a continuous characteristic into a finite number of intervals (the bins), which allows for a better understanding of its distribution and its relationship with a binary variable. The bins generated by this process will eventually become the attributes of a predictive characteristic, the key component of a Scorecard.

**Why Binning?**

Though there is some reticence about it [1], the benefits of binning are pretty straightforward:

- It allows missing data and other special values (e.g., division by zero) to be included in the model.
- It controls or mitigates the impact of outliers over the model.
- It solves the issue of having different scales among the characteristics, making the weights of the coefficients in the final model comparable.

**Unsupervised Discretization**

Unsupervised Discretization divides a continuous feature into groups (bins) without taking into account any other information. It is basically a partition with two options: equal length intervals and equal frequency intervals.

**Equal length intervals**

- Objective: Understand the distribution of a variable.
- Example: The classic histogram, whose bins have equal length that can be calculated using different rules (Sturges, Rice, and others).
- Disadvantage: The number of records in a bin may be too small to allow for a valid calculation, as shown in Table 1.

Table 1. Time on Books and Credit Performance. Bin 6 has no bads, producing indeterminate metrics.

**Equal frequency intervals**

- Objective: Analyze the relationship with a binary target variable through metrics like bad rate.
- Example: Quartiles or percentiles.
- Disadvantage: The cutpoints selected may not maximize the difference between bins when mapped to a target variable, as shown in Table 2.

Table 2. Time on Books and Credit Performance. Different cutpoints may improve the Information Value (0.4969).

**Supervised Discretization**

Supervised Discretization divides a continuous feature into groups (bins) mapped to a target variable. The central idea is to find those cutpoints that maximize the difference between the groups.

In the past, analysts used to iteratively move from Fine Binning to Coarse Binning, a very time consuming process of finding manually and visually the right cutpoints (if ever). Nowadays with algorithms like ChiMerge or Recursive Partitioning, two out of several techniques available [2], analysts can quickly find the optimal cutpoints in seconds and evaluate the relationship with the target variable using metrics such as Weight of Evidence and Information Value.
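For reference, Weight of Evidence and Information Value are easy to compute by hand once the bins are fixed. The bin counts below are made up for illustration:

```r
# WoE per bin = ln(share of goods in bin / share of bads in bin)
# IV = sum over bins of (share of goods - share of bads) * WoE
goods <- c(80, 150, 170)  # good accounts per bin (illustrative)
bads  <- c(40,  30,  10)  # bad accounts per bin (illustrative)
dist_g <- goods / sum(goods)
dist_b <- bads / sum(bads)
woe <- log(dist_g / dist_b)
iv  <- sum((dist_g - dist_b) * woe)
round(woe, 3)  # -0.916  0.000  1.224
round(iv, 3)   # 0.642
```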

**An Example With 'smbinning'**

Using the 'smbinning' package and its data (chileancredit), whose documentation can be found on its supporting website, the characteristic Time on Books is grouped into bins taking into account the Credit Performance (Good/Bad) to establish the optimal cutpoints and get meaningful and statistically different groups. The R code below, Table 3, and Figure 1 show the result of this application, which clearly surpasses the previous methods with the highest Information Value (0.5353).

```r
# Load package and its data
library(smbinning)
data(chileancredit)

# Training and testing samples
chileancredit.train <- subset(chileancredit, FlagSample == 1)
chileancredit.test  <- subset(chileancredit, FlagSample == 0)

# Run and save results
result <- smbinning(df = chileancredit.train, y = "FlagGB", x = "TOB", p = 0.05)
result$ivtable

# Relevant plots (2x2 page)
par(mfrow = c(2, 2))
boxplot(chileancredit.train$TOB ~ chileancredit.train$FlagGB,
        horizontal = TRUE, frame = FALSE, col = "lightgray", main = "Distribution")
mtext("Time on Books (Months)", 3)
smbinning.plot(result, option = "dist",    sub = "Time on Books (Months)")
smbinning.plot(result, option = "badrate", sub = "Time on Books (Months)")
smbinning.plot(result, option = "WoE",     sub = "Time on Books (Months)")
```

Table 3. Time on Books cutpoints mapped to Credit Performance.

Figure 1. Plots generated by the package.

In the middle of the "data era," it is critical to speed up the development of scoring models. Binning, and more specifically automated binning, helps significantly reduce the time-consuming process of generating predictive characteristics, which is why companies like SAS and FICO have developed their own proprietary algorithms to implement this functionality in their respective software. For analysts who do not have these specific tools or modules, the R package 'smbinning' offers a statistically robust alternative to run their analysis faster.

For more information about binning, the package's documentation available on CRAN lists some references related to the algorithm behind it, and its supporting website lists some references for scoring model development.

**References**

[1] Dinero, T. (1996). Seven Reasons Why You Should Not Categorize Continuous Data. Journal of Health & Social Policy, 8(1), 63-72.

[2] Garcia, S. et al (2013) A Survey of Discretization Techniques: Taxonomy and Empirical Analysis in Supervised Learning. IEEE Transactions on Knowledge and Data Engineering, Vol. 25, No. 4, April 2013.

A new version of the checkpoint package for R has just been released on CRAN. With the checkpoint package, you can easily:

- Write R scripts or projects using CRAN package versions from a specific point in time;
- Share R scripts with others that will automatically install the appropriate package versions (no need to manually install CRAN packages);
- Write R scripts that use older versions of packages, or packages that are no longer available on CRAN;
- Install packages (or package versions) visible only to a specific project, without affecting other R projects or R users on the same system;
- Manage multiple projects that use different package versions;
- Write and share R code whose results can be reproduced, even if new (and possibly incompatible) package versions are released later.

The biggest change with this new version of checkpoint is that it checks whether package versions have already been installed for your project: running a checkpoint'ed script for the second time has virtually no overhead now. There's also better feedback during the scanning phase while it determines which (if any) packages need to be installed. You can also now control where packages are installed on your system (the default is a .checkpoint folder in your home directory), and check the specific version of R in use when checkpoint is called.
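In practice, pinning a script to a snapshot takes one extra line at the top. The snapshot date below is arbitrary; you would use the date on which your script was last known to work:

```r
library(checkpoint)

# Scan this project's scripts for library()/require() calls, install the
# matching package versions from the MRAN snapshot of the given date into
# a project-specific library, and point .libPaths() at it
checkpoint("2015-04-01")

# Subsequent loads now resolve against the pinned snapshot
library(ggplot2)
```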

If you're interested in learning more about how checkpoint works, you can read the new vignette Using checkpoint for reproducible research, or follow its development at the checkpoint Github page. You can install checkpoint from CRAN now by running install.packages("checkpoint") at the R command line.

MRAN: checkpoint: Install Packages from Snapshots on the Checkpoint Server for Reproducibility

by Joseph Rickert

What will you be doing at 26 minutes and 53 seconds past 9 this coming Saturday morning? I will probably be running simulations. I have become obsessed with an astounding result from number theory and have been trying to devise Monte Carlo simulations to get at it. The result, well known to number theorists, says: choose two integers at random; the probability that they will be coprime is 6/*π*^{2}! Here, *π* materializes out of thin air. Who could have possibly guessed this? Well, Leonhard Euler, apparently, and this sort of magic seems to be quite common in number theory.

More formally, the theorem Euler proved goes something like this: let P_{N} be the probability that two randomly chosen integers in {1, 2, ... N} are coprime. Then, as N goes to infinity, P_{N} goes to 6/*π*^{2}. Well, this seems to be a little different. You don't actually have to sample from an infinite set. So, I asked myself, would a person who is not familiar with this result, but who is allowed to do some Monte Carlo simulations, have a reasonable chance of guessing the answer? I know that going to infinity would be quite a trip, but I imagined that I could take a few steps and see something interesting. How about this: choose some boundary, N, and draw lots of pairs of random numbers from within it. Count the number of coprime pairs, M, out of D draws. Then sqrt(6 / (M/D)) will give an estimate of *π*. As N gets bigger and bigger you should see the digits of the average of the estimates marching closer and closer to *π*: 3.1, 3.14, 3.141, etc.

Before you go off and try this, let me warn you: there is no parade of digits. For a modest 100,000 draws with an N around 10,000,000 you should get an estimate of *π* close to 3.14. Then, even with letting N get up around 1e13 with a million draws, you won't do much better. The following code (compliments of a colleague who introduced me to mapply) performs 100 simulations, each with 100,000 draws as the range varies from +/- 1,000,000 to +/- 1e13. (The code runs pretty quickly on my laptop.)

```r
# Monte Carlo estimate of pi
library(numbers)
library(ggplot2)

set.seed(123)
bigRange <- seq(1e6, 1e13, by = 1e11)
M <- length(bigRange)
draws <- 1e5
prob  <- numeric(M)  # initialize result vectors (missing in the original)
piEst <- numeric(M)

for (i in 1:M) {
  maxRange <- bigRange[i]
  print(bigRange[i])
  min <- -maxRange
  max <- maxRange
  r1 <- round(runif(n = draws, min = min, max = max))
  r2 <- round(runif(n = draws, min = min, max = max))
  system.time(coprimeTests <- mapply(coprime, r1, r2))
  prob[i] <- sum(coprimeTests) / draws
  print(prob[i])
  piEst[i] <- sqrt(6 / prob[i])
  print(piEst[i])
}

piRes2 <- data.frame(bigRange, prob, piEst)
p2 <- ggplot(piRes2, aes(bigRange, piEst))
p2 + geom_line() + geom_point() +
  xlab("Half Range for Random Draws") +
  ylab("Estimate of Pi") +
  ggtitle("Expanding Range Simulation") +
  geom_smooth(method = "lm", se = FALSE, color = "black")
```

Here is the "time series" plot of the results.

The mean is 3.142959. The slight upward trend is most likely a random number induced illusion.

The problem, as I have framed it, is probably beyond the reach of a naive Monte Carlo approach. Nevertheless, on Saturday, when I have some simulation time, I will try running 100,000,000 draws. This should get me another digit of *π*, since the accuracy of the mean increases as sqrt(N).

The situation, however, is not as dismal as I have been making it out. Don't imagine that just because the problem is opaque to a brute force Monte Carlo effort that it is without the possibility of computational illumination. Euler's proof, mentioned above, turns on recognizing that the expression for the probability of two randomly chosen integers being coprime may be expressed in terms of the Riemann zeta function: the proof yields *π*^{2} = 6ζ(2). The wizards of R have reduced this calculation to the trivial. The Rmpfr package, which allows the use of arbitrarily precise numbers instead of R's double precision numbers, includes the function zeta(x)! So, here, splendidly arrayed, is *π*.
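For example, assuming the Rmpfr package is installed, Euler's identity hands us π in one line:

```r
library(Rmpfr)

# pi^2 = 6 * zeta(2), evaluated with 120 bits of precision
sqrt(6 * zeta(mpfr(2, precBits = 120)))
# 3.14159265358979...
```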

Happy Pi Day!

It's fair to say that Hadley Wickham, chief scientist at RStudio and a new member of the R Foundation, has made great contributions to the R community. Not only is he the author of several R-related books including Advanced R, Hadley is also the author of dozens of R packages which have transformed the way that data scientists work with R. His packages include an entirely new graphics system (actually, two), a powerful language for manipulating data, tools for importing data files, and developer tools for other R package authors that have influenced R packages way beyond his own.

If you use R it's extremely likely you've used at least some of Hadley's packages. For everything else, check out The Hitchhiker's Guide to the Hadleyverse, Adolfo Álvarez's comprehensive guide to all 55 packages with Hadley's contributions (to date). Or for an overview of the packages in slide form, take a look at Barry Rowlingson's presentation, "Packages of the Hadleyverse: Power for your R".

Thanks Hadley for all your contributions!