by Joseph Rickert

A recent post by David Smith included a map that shows the locations of R user groups around the world. While it is exhilarating to see how R user groups span the globe, the map does not give any idea of the size of the community at each location. The following plot, constructed from information on the websites of the groups listed in Revolution Analytics' Local R User Group Directory (the same source used for the map), shows the membership size for the largest 25 groups.

There are 11 groups with over a thousand members and a couple more that are close to achieving that milestone. With the possible exception of the Groupe des utilisateurs du logiciel R, I believe all of the groups in the top 25 hold regular, face-to-face meetings, so the plot gives some idea of the size, location and density of the "social" R community.

There are, however, quite a few problems with the data that make it less than optimal for even the limited goal of characterizing the number of R users who regularly participate in face-to-face R events.

- Only 102 of the 159 groups make it easy to find membership information. (Most of these are groups that have webpages through meetup.com which prominently display membership information.)
- It is not clear that membership information is up to date. As an R user group organizer, I know that it is difficult to keep track of active members.
- People join multiple groups. It is very likely, for example, that there is significant overlap between the members of the Berkeley R Language Beginner Study Group and the Bay Area useR Group.

Nevertheless, I think this plot along with the map mentioned above do give some idea of where the action is in the R world.

Note that the data used to build the plot may be obtained here: Download RUGS_WW_4_7_15 and the code is here: Download RUG_Bar_Chart

Also note that it is still a good time to start a new R user group. The deadline for funding for large user groups through Revolution Analytics 2015 R User Group Sponsorship Program has passed. However, we will be taking applications for new groups until the end of September. The $120 grant for Vector level groups should be enough to finance a site on meetup.com. If you are thinking of starting a new group have a look at the link above as well as our tips for getting started.

In today's data-oriented world, just about every retailer has amassed a huge database of purchase transactions. Each transaction consists of a number of products that have been purchased together. A natural question you could answer from this database is: What products are typically purchased together? This is called Market Basket Analysis (or Affinity Analysis). A closely related question is: Can we find products whose purchase indicates the likely purchase of other products? For example, if someone purchases avocados and salsa, it's likely they'll purchase tortilla chips and limes as well. This is called association rule learning, a data mining technique used by retailers to improve product placement, marketing, and new product development.

R has an excellent suite of algorithms for market basket analysis in the arules package by Michael Hahsler and colleagues. It includes support for both the Apriori algorithm and the ECLAT (Equivalence CLAss Transformation) algorithm. You can find an in-depth description of both techniques (including several examples) in the Introduction to arules vignette. The slides below, by Yanchang Zhao, provide a nice overview, and you can find further examples at RDataMining.com.
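As a quick sketch of what this looks like in practice, the snippet below mines rules from the Groceries data set that ships with arules; the support and confidence thresholds are arbitrary choices for illustration, not recommendations.

```r
# Mine association rules from the Groceries transactions bundled
# with arules; the thresholds here are illustrative choices.
library(arules)
data(Groceries)
rules <- apriori(Groceries,
                 parameter = list(support = 0.01, confidence = 0.5))
# Show the three rules with the highest lift
inspect(sort(rules, by = "lift")[1:3])
```

Each rule has the form {LHS} => {RHS}: customers whose baskets contain the left-hand-side items are likely (at the stated confidence) to also buy the right-hand-side item.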

by Ari Lamstein

Today I will walk through an analysis of San Francisco Zip Code Demographics using my new R package choroplethrZip. This package creates choropleth maps of US Zip Codes and connects to the US Census Bureau. A choropleth is a map that shows boundaries of regions (such as zip codes) and colors those regions according to some metric (such as population).

Zip codes are a common geographic unit for businesses to work with, but rendering them is difficult. Official zip codes are maintained by the US Postal Service, but they exist solely to facilitate mail delivery. The USPS does not release a map of them, they change frequently and, in some cases, are not even polygons. The most authoritative map of US zip codes I could find was the Census Bureau's map of Zip Code Tabulation Areas (ZCTAs). Despite shipping with only a simplified version of this map (60MB instead of 500MB), choroplethrZip is still too large for CRAN. It is instead hosted on github, and you can install it from an R console like this:

```r
# install.packages("devtools")  # if devtools is not already installed
library(devtools)
install_github('arilamstein/choroplethrZip@v1.1.1')
```

The package vignettes (1, 2) explain basic usage. In this article I'd like to demonstrate a more in-depth example: showing racial and financial characteristics of each zip code in San Francisco. The data I use comes from the 2013 American Community Survey (ACS), which is run by the US Census Bureau. If you are new to the ACS you might want to view my vignette on Mapping US Census Data.

One table that deals with race and ethnicity is B03002 - Hispanic or Latino Origin by Race. Many people will be surprised by the large number of categories. This is because the US Census Bureau has a complex framework for categorizing race and ethnicity. Since my purpose here is to demonstrate technology, I will simplify the data by dealing with only a handful of the values: Total Hispanic or Latino, White (not Hispanic), Black (not Hispanic) and Asian (not Hispanic).

The R code for getting this data into a data.frame can be viewed here, and the code for generating the graphs in this post can be viewed here. Here is a boxplot of the ethnic breakdown of the 27 ZCTAs in San Francisco.

This boxplot shows that there is wide variation in the racial and ethnic breakdown of San Francisco ZCTAs. For example, the percentage of White residents in each ZCTA ranges from 7% to 80%. The percentages of Black and Hispanic residents have tighter ranges, but also contain outliers. Also, while Asian Americans make up only about 5% of the total US population, the median ZCTA in SF is 30% Asian.

Viewing this data with choropleth maps allows us to associate locations with these values.
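As a sketch of how such a map might be drawn (assuming the statistic of interest has been put in a data frame with the `region` and `value` columns that choroplethrZip expects):

```r
# Sketch: map one demographic statistic for San Francisco ZCTAs.
# df_zip_demographics ships with choroplethrZip; the county FIPS
# code 6075 (San Francisco) and the choice of statistic are
# illustrative assumptions.
library(choroplethrZip)
data(df_zip_demographics)
df <- df_zip_demographics[, c("region", "percent_white")]
names(df) <- c("region", "value")
zip_choropleth(df,
               county_zoom = 6075,
               title = "Percent White by ZCTA, San Francisco")
```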

When discussing demographics, people often ask about per capita income. Here is a boxplot of the per capita income of San Francisco ZCTAs.

The range of this dataset - from $15,960 to $144,400 - is striking. Equally striking is the outlier at the top. We can use a continuous scale choropleth to highlight the outlier. We can also use a four color choropleth to show the locations of the quartiles.

The outlier for income is zip 94105, which is where a large number of tech companies are located. The zips in the southern part of the city tend to have a low income.

After viewing this analysis readers might wish to do a similar analysis for the city where they live. To facilitate this I have created an interactive web application. The app begins by showing a choropleth map of a random statistic of the Zips in a random Metropolitan Statistical Area (MSA). You can choose another statistic, zoom in, or select another MSA.

In the event that the application does not load (for example, if I reach my monthly quota at my hosting company) then you can run the app from source, which is available here.

I hope that you have enjoyed this exploration of Zip code level demographics with choroplethrZip. I also hope that it encourages more people to use R for demographic statistics.

by Sherri Rose

Assistant Professor of Health Care Policy

Harvard Medical School

Targeted learning methods build machine-learning-based estimators of parameters defined as features of the probability distribution of the data, while also providing influence-curve or bootstrap-based confidence intervals. The theory offers a general template for creating targeted maximum likelihood estimators for a data structure, nonparametric or semiparametric statistical model, and parameter mapping. These estimators of causal inference parameters are doubly robust and have a variety of other desirable statistical properties.

Targeted maximum likelihood estimation was built on the loss-based "super learning" system such that lower-dimensional parameters could be targeted (e.g., a marginal causal effect) and the remaining bias for the (low-dimensional) target feature of the probability distribution removed. Targeted learning for effect estimation and causal inference allows for the complete integration of machine learning advances in prediction while providing statistical inference for the target parameter(s) of interest. Further details about these methods can be found in the many targeted learning papers as well as the 2011 targeted learning book.

Practical tools for the implementation of targeted learning methods for effect estimation and causal inference have developed alongside the theoretical and methodological advances. While some work has been done to develop computational tools for targeted learning in proprietary programming languages, such as SAS, the majority of the code has been built in R.

Of key importance are the two R packages SuperLearner and tmle. Ensembling with SuperLearner allows us to use many algorithms to generate an ideal prediction function that is a weighted average of all the algorithms considered. The SuperLearner package, authored by Eric Polley (NCI), is flexible, allowing for the integration of dozens of prespecified potential algorithms found in other packages as well as a system of wrappers that provide the user with the ability to design their own algorithms, or include newer algorithms not yet added to the package. The package returns multiple useful objects, including the cross-validated predicted values, final predicted values, vector of weights, and fitted objects for each of the included algorithms, among others.

Below is sample code with the ensembling prediction package SuperLearner using a small simulated data set.

```r
library(SuperLearner)

## Generate simulated data ##
set.seed(27)
n <- 500
data <- data.frame(W1 = runif(n, min = .5, max = 1),
                   W2 = runif(n, min = 0, max = 1),
                   W3 = runif(n, min = .25, max = .75),
                   W4 = runif(n, min = 0, max = 1))
# Add W5, dependent on W2 and W3
data <- transform(data, W5 = rbinom(n, 1, 1 / (1 + exp(1.5 * W2 - W3))))
# Add Y, dependent on W1, W2, W4, W5
data <- transform(data,
  Y = rbinom(n, 1, 1 / (1 + exp(-(-.2 * W5 - 2 * W1 + 4 * W5 * W1 -
                                  1.5 * W2 + sin(W4))))))
summary(data)

## Specify a library of algorithms ##
SL.library <- c("SL.nnet", "SL.glm", "SL.randomForest")

## Run the super learner to obtain predicted values for the super
## learner as well as CV risks for algorithms in the library ##
fit.data.SL <- SuperLearner(Y = data[, 6], X = data[, 1:5],
                            SL.library = SL.library, family = binomial(),
                            method = "method.NNLS", verbose = TRUE)

## Run the cross-validated super learner to obtain its CV risk ##
fitSL.data.CV <- CV.SuperLearner(Y = data[, 6], X = data[, 1:5], V = 10,
                                 SL.library = SL.library, verbose = TRUE,
                                 method = "method.NNLS", family = binomial())

## Cross-validated risks ##
mean((data[, 6] - fitSL.data.CV$SL.predict)^2)  # CV risk for super learner
fit.data.SL                          # CV risks for algorithms in the library
```

The final lines of code return the cross-validated risks for the super learner as well as for each algorithm considered within the super learner. While this is a trivial example with a small data set and few covariates, the results demonstrate that the super learner, which takes a weighted average of the algorithms in the library, has the smallest cross-validated risk and outperforms each individual algorithm.

The tmle package, authored by Susan Gruber (Reagan-Udall Foundation), allows for the estimation of both average treatment effects and parameters defined by a marginal structural model in cross-sectional data with a binary intervention. This package also includes the ability to incorporate missingness in the outcome and the intervention, use SuperLearner to estimate the relevant components of the likelihood, and use data with a mediating variable. Additionally, TMLE and collaborative TMLE R code specifically tailored to answer quantitative trait loci mapping questions, such as those discussed in Wang et al 2011, is available in the supplementary material of that paper.
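To give a feel for the package, here is a minimal sketch of estimating an average treatment effect with tmle on simulated data; the data-generating process, sample size, and effect sizes are purely illustrative assumptions.

```r
# Sketch: average treatment effect with tmle on toy simulated data.
# Everything about this example (covariates, effect size) is made up.
library(tmle)
set.seed(1)
n <- 250
W <- data.frame(W1 = rnorm(n), W2 = rbinom(n, 1, 0.5))  # covariates
A <- rbinom(n, 1, plogis(0.2 * W$W1))                   # binary intervention
Y <- rbinom(n, 1, plogis(-1 + A + 0.5 * W$W1))          # binary outcome
fit <- tmle(Y = Y, A = A, W = W, family = "binomial")
fit$estimates$ATE$psi  # point estimate of the additive treatment effect
```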

The multiPIM package, authored by Stephan Ritter (Omicia, Inc.), is designed specifically for variable importance analysis, and estimates an attributable-risk-type parameter using TMLE. This package also allows the use of SuperLearner to estimate nuisance parameters and produces additional estimates using estimating-equation-based estimators and g-computation. The package includes its own internal bootstrapping function to calculate standard errors if this is preferred over the use of influence curves, or influence curves are not valid for the chosen estimator.

Four additional prediction-focused packages are casecontrolSL, cvAUC, subsemble, and h2oEnsemble, all primarily authored by Erin LeDell (Berkeley). The casecontrolSL package relies on SuperLearner and performs subsampling in a case-control design with inverse-probability-of-censoring-weighting, which may be particularly useful in settings with rare outcomes. The cvAUC package is a tool kit to evaluate area under the ROC curve estimators when using cross-validation. The subsemble package was developed based on a new approach to ensembling that fits each algorithm on a subset of the data and combines these fits using cross-validation. This technique can be used in data sets of all sizes, but has been demonstrated to be particularly useful in smaller data sets. A new implementation of super learner can be found in the Java-based h2oEnsemble package, which was designed for big data. The package uses the H2O R interface to run super learning in R with a selection of prespecified algorithms.

Another TMLE package is ltmle, primarily authored by Joshua Schwab (Berkeley). This package mainly focuses on parameters in longitudinal data structures, including the treatment-specific mean outcome and parameters defined by a marginal structural model. The package returns estimates for TMLE, g-computation, and estimating-equation-based estimators.

*The text above is a modified excerpt from the chapter "Targeted Learning for Variable Importance" by Sherri Rose in the forthcoming Handbook of Big Data (2015) edited by Peter Buhlmann, Petros Drineas, Michael John Kane, and Mark Van Der Laan to be published by CRC Press.*

by Herman Jopia

**What is Binning?**

**Binning** is the term used in scoring modeling for what is also known in Machine Learning as **Discretization**: the process of transforming a continuous characteristic into a finite number of intervals (the bins), which allows for a better understanding of its distribution and its relationship with a binary variable. The bins generated by this process will eventually become the attributes of a predictive characteristic, the key component of a Scorecard.

**Why Binning?**

Though there is some reticence about it [1], the benefits of binning are pretty straightforward:

- It allows missing data and other special calculations (e.g. divided by zero) to be included in the model.
- It controls or mitigates the impact of outliers over the model.
- It solves the issue of having different scales among the characteristics, making the weights of the coefficients in the final model comparable.

**Unsupervised Discretization**

Unsupervised Discretization divides a continuous feature into groups (bins) without taking into account any other information. It is basically a partition with two options: equal length intervals and equal frequency intervals.

**Equal length intervals**

- Objective: Understand the distribution of a variable.
- Example: The classic histogram, whose bins have equal length that can be calculated using different rules (Sturges, Rice, and others).
- Disadvantage: The number of records in a bin may be too small to allow for a valid calculation, as shown in Table 1.

Table 1. Time on Books and Credit Performance. Bin 6 has no bads, producing indeterminate metrics.

**Equal frequency intervals**

- Objective: Analyze the relationship with a binary target variable through metrics like bad rate.
- Example: Quartiles or Percentiles.
- Disadvantage: The cutpoints selected may not maximize the difference between bins when mapped to a target variable, as shown in Table 2.

Table 2. Time on Books and Credit Performance. Different cutpoints may improve the Information Value (0.4969).

**Supervised Discretization**

Supervised Discretization divides a continuous feature into groups (bins) mapped to a target variable. The central idea is to find those cutpoints that maximize the difference between the groups.

In the past, analysts used to iterate from Fine Binning to Coarse Binning, a very time-consuming process of manually and visually finding the right cutpoints (if ever). Nowadays, with algorithms like ChiMerge or Recursive Partitioning, two out of several available techniques [2], analysts can find the optimal cutpoints in seconds and evaluate the relationship with the target variable using metrics such as Weight of Evidence and Information Value.
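For reference, Weight of Evidence and Information Value can be computed directly from the good/bad counts per bin. The sketch below uses made-up counts purely for illustration:

```r
# Weight of Evidence (WoE) and Information Value (IV) for one binned
# characteristic. The good/bad counts per bin are made-up numbers.
woe_iv <- function(goods, bads) {
  pg  <- goods / sum(goods)    # distribution of goods across bins
  pb  <- bads  / sum(bads)     # distribution of bads across bins
  woe <- log(pg / pb)          # WoE per bin
  iv  <- sum((pg - pb) * woe)  # total Information Value
  list(woe = woe, iv = iv)
}
res <- woe_iv(goods = c(100, 250, 400), bads = c(50, 40, 30))
round(res$iv, 4)
```

A bin where goods and bads are distributed in the same proportion has WoE of zero and contributes nothing to the IV; the larger the separation, the larger the IV.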

**An Example With 'smbinning'**

Using the 'smbinning' package and its data (chileancredit), whose documentation can be found on its supporting website, the characteristic Time on Books is grouped into bins, taking into account the Credit Performance (Good/Bad), to establish the optimal cutpoints and obtain meaningful, statistically different groups. The R code below, Table 3, and Figure 1 show the result of this application, which clearly surpasses the previous methods with the highest Information Value (0.5353).

```r
# Load package and its data
library(smbinning)
data(chileancredit)

# Training and testing samples
chileancredit.train <- subset(chileancredit, FlagSample == 1)
chileancredit.test  <- subset(chileancredit, FlagSample == 0)

# Run and save results
result <- smbinning(df = chileancredit.train, y = "FlagGB", x = "TOB", p = 0.05)
result$ivtable

# Relevant plots (2x2 page)
par(mfrow = c(2, 2))
boxplot(chileancredit.train$TOB ~ chileancredit.train$FlagGB,
        horizontal = TRUE, frame = FALSE, col = "lightgray",
        main = "Distribution")
mtext("Time on Books (Months)", 3)
smbinning.plot(result, option = "dist", sub = "Time on Books (Months)")
smbinning.plot(result, option = "badrate", sub = "Time on Books (Months)")
smbinning.plot(result, option = "WoE", sub = "Time on Books (Months)")
```

Table 3. Time on Books cutpoints mapped to Credit Performance.

Figure 1. Plots generated by the package.

In the middle of the "data era", it is critical to speed up the development of scoring models. Binning, and more specifically automated binning, helps to significantly reduce the time-consuming process of generating predictive characteristics, which is why companies like SAS and FICO have developed their own proprietary algorithms to implement this functionality in their respective software. For analysts who do not have these specific tools or modules, the R package 'smbinning' offers a statistically robust alternative for running their analyses faster.

For more information about binning, the package's documentation available on CRAN lists some references related to the algorithm behind it and its supporting website some references for scoring modeling development.

**References**

[1] Dinero, T. (1996). Seven Reasons Why You Should Not Categorize Continuous Data. Journal of Health & Social Policy, 8(1), 63-72.

[2] Garcia, S. et al (2013) A Survey of Discretization Techniques: Taxonomy and Empirical Analysis in Supervised Learning. IEEE Transactions on Knowledge and Data Engineering, Vol. 25, No. 4, April 2013.

by Joseph Rickert

What will you be doing at 26 minutes and 53 seconds past 9 this coming Saturday morning? I will probably be running simulations. I have become obsessed with an astounding result from number theory and have been trying to devise Monte Carlo simulations to get at it. The result, well known to number theorists, says: choose two integers at random; the probability that they will be coprime is 6/*π*^{2}! Here, *π* materializes out of thin air. Who could possibly have guessed this? Well, Leonhard Euler, apparently, and this sort of magic seems to be quite common in number theory.

More formally, the theorem Euler proved goes something like this: Let P_{N} be the probability that two randomly chosen integers in {1, 2, ... N} are coprime. Then, as N goes to infinity, P_{N} goes to 6/π^{2}. Well, this seems to be a little different. You don't actually have to sample from an infinite set. So, I asked myself, would a person who is not familiar with this result, but who is allowed to do some Monte Carlo simulations, have a reasonable chance of guessing the answer? I know that going to infinity would be quite a trip, but I imagined that I could take a few steps and see something interesting. How about this: choose some boundary, draw N pairs of random numbers within it, and count the number of coprime pairs, M. Then sqrt(6 / (M/N)) will give an estimate of π. As the boundary gets bigger and bigger you should see the digits of the average of the estimates marching closer and closer to *π*: 3.1, 3.14, 3.141, etc.

Before you go off and try this, let me warn you: there is no parade of digits. For a modest 100,000 draws with a boundary around 10,000,000 you should get an estimate of *π* close to 3.14. Then, even letting the boundary get up around 1e13 with a million draws, you won't do much better. The following code (compliments of a colleague who introduced me to mapply) performs 100 simulations, each with 100,000 draws, as the range varies from +/- 1,000,000 to +/- 1e13. (The code runs pretty quickly on my laptop.)

```r
# Monte Carlo estimate of pi
library(numbers)
library(ggplot2)

set.seed(123)
bigRange <- seq(1e6, 1e13, by = 1e11)
M <- length(bigRange)
draws <- 1e5
prob <- numeric(M)   # initialize result vectors
piEst <- numeric(M)

for (i in 1:M) {
  maxRange <- bigRange[i]
  print(maxRange)
  r1 <- round(runif(n = draws, min = -maxRange, max = maxRange))
  r2 <- round(runif(n = draws, min = -maxRange, max = maxRange))
  coprimeTests <- mapply(coprime, r1, r2)
  prob[i] <- sum(coprimeTests) / draws
  print(prob[i])
  piEst[i] <- sqrt(6 / prob[i])
  print(piEst[i])
}

piRes2 <- data.frame(bigRange, prob, piEst)
p2 <- ggplot(piRes2, aes(bigRange, piEst))
p2 + geom_line() + geom_point() +
  xlab("Half Range for Random Draws") +
  ylab("Estimate of Pi") +
  ggtitle("Expanding Range Simulation") +
  geom_smooth(method = "lm", se = FALSE, color = "black")
```

Here is the "time series" plot of the results.

The mean is 3.142959. The slight upward trend is most likely a random-number-induced illusion.

The problem, as I have framed it, is probably beyond the reach of a naive Monte Carlo approach. Nevertheless, on Saturday, when I have some simulation time, I will try running 100,000,000 draws. This should get me another digit of *π*, since the accuracy of the mean increases as sqrt(N).

The situation, however, is not as dismal as I have been making it out to be. Don't imagine that just because the problem is opaque to a brute-force Monte Carlo effort it is without the possibility of computational illumination. Euler's proof, mentioned above, turns on recognizing that the expression for the probability of two randomly chosen integers being coprime may be written in terms of the zeta function: the proof yields *π*^{2} = 6ζ(2). The wizards of R have reduced this calculation to the trivial. The Rmpfr package, which allows the use of arbitrarily precise numbers instead of R's double precision numbers, includes the function zeta(x)! So, here, splendidly arrayed, is *π*.
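A one-liner along these lines recovers π from ζ(2); the 256-bit working precision is an arbitrary choice.

```r
# Recover pi from zeta(2) at high precision with Rmpfr;
# precBits = 256 is an arbitrary choice of working precision.
library(Rmpfr)
piEst <- sqrt(6 * zeta(mpfr(2, precBits = 256)))  # pi^2 = 6 * zeta(2)
piEst
```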

Happy Pi Day!

by Joseph Rickert

Distcomp, a new R package available on GitHub from a group of Stanford researchers, has the potential to significantly advance the practice of collaborative computing with large data sets distributed over separate sites that may be unwilling to explicitly share data. The fundamental idea is to be able to rapidly set up a web service, based on Shiny and OpenCPU technology, that manages and performs a series of master/slave computations which require sharing only intermediate results. The particular target application for distcomp is any group of medical researchers who would like to fit a statistical model using the data from several data sets, but face daunting difficulties with data aggregation or are constrained by privacy concerns. Distcomp and its methodology, however, ought to be of interest to any organization with data spread across multiple heterogeneous database environments.

Setting up the distcomp environment requires some preliminary work and out-of-band communication among the collaborators. In the first step, the lead investigator uses a distcomp function to invoke a browser-based Shiny application to describe the location of her data set, the variables to be used in the computation, the model formula and other metadata necessary to describe the computation.

Next, the investigator invokes another distcomp function to move the metadata and a copy of the local data set to a computation server with a unique identifier. Once the master server is in place, collaborating investigators at remote locations perform a similar process to set up slave computation servers at their sites. When the lead investigator receives the URLs pointing to the slave servers, she is ready to kick off the computation.

All of the details of this setup process are described in this paper by Narasimhan et al. The paper also describes two non-trivial computations, a distributed rank-k singular value decomposition and a distributed, stratified Cox model, that are of interest in their own right. The algorithm and code for the stratified Cox model ought to be useful to data scientists in a number of fields working on time-to-event models. A really nice feature of the algorithm is that it only requires each site to independently optimize the partial likelihood function using its local data. The master process uses the partial likelihood information from all of the sites to compute a final estimate of the coefficients and their variances.

There are several nice aspects to this work:

- It builds on the cumulative work of the R community to provide a big league, big data application around open source R.
- It provides a flexible paradigm for implementing distributed / parallel applications that leverages existing R algorithms (e.g. the Cox model makes use of code in the survival package)
- It illustrates the ease with which R projects can be deployed in web services applications with Shiny and other R centric software such as DeployR
- It provides an alternative to building out infrastructure and aggregating data before realizing the benefits of a big data computation. (Prototyping calculations with distcomp might also serve to justify the expense and effort of developing centralized infrastructure.)
- It recognizes that privacy and other social concerns are important in big data applications and provides a model for respecting some of the social requirements for dealing with sensitive data.

Distcomp is new work and the developers acknowledge several limitations. (So far, they have only built out two algorithms and they don’t have a way to easily deal with factor data across the distributed data sets.) Nevertheless, the project appears to show great promise.

By David R. Morganstein, ASA President

Raise your hand if you recently read an article in a newspaper or online about a new scientific discovery and were surprised by how the journalist reported the data or a statistical concept.

You’re not alone! We may both want reporters to be more statistically literate.

More of them will be, thanks to a new American Statistical Association (ASA) and Sense About Science USA (SAS USA) initiative whose goal is to connect with journalists and their editors and help them become more statistically savvy.

Last year, the ASA began working with the newly formed SAS USA to re-launch STATS.org, a statistics informational and resource hub for journalists — and anyone interested in how numbers shape science and society. The project will help reporters gain access to the statistical, data-driven perspective of stories on which they are working.

Through STATS.org, journalists are connected with statisticians who are experts on specific topics to provide them with understandable statistical advice and explanation. This connection is critical to raising the media's understanding of the statistical issues in their stories. For the first time, many reporters will have access to a statistical expert who can help them interpret a scientific study and convey that meaning to the public.

In addition, ASA member statisticians provide background information and a statistical-science perspective on timely news stories, and STATS.org writers produce articles about quantitative concepts in a very readable, easily understandable style. These articles are featured on the STATS.org website for reporters and others interested in statistical matters to read and learn about statistics.

The project, which publicly launched January 1, is already hard at work. The website is populated with articles covering medicines, nutrition science, the vaccination debate and climate change.

The latter is in response to a recent *New York Times* opinion article that advocated climate scientists use a less stringent standard to assess global warming. Michael Lavine, an ASA member and professor of mathematics and statistics at the University of Massachusetts Amherst, wrote an article clarifying the statistical components of the opinion article.

“Unfortunately, to make her argument, the author confuses several different aspects of confidence, evidence, belief, and decision-making. The purpose here is to point out the confusion and clarify the statistical issues. This article is not about climate change; it’s about statistics. Oreskes’ mistaken interpretation of these statistical ideas do not imply that climate change is under question; the evidence for climate change consists of mechanistic as well as statistical arguments, and has little to do with the topic under discussion here: a misinterpretation of what is called the p-value,” wrote Lavine.

You, too, can help ASA and STATS.org. The next time you read a news article that misstates or misinterprets statistical data or concepts, take a minute to forward a copy of it to the ASA or STATS.org. Also, if you see a key statistical concept that consistently is misreported in the media, let us know about it as well.

STATS.org: Because Numbers Count

The World Cup of Cricket starts this week. (C'mon Aussie!) Cricket isn't well-known amongst many of my American friends or colleagues, so when I'm asked about it I usually point them to this video, which gives a good sense of the game:

Actually, this Vox article and this ESPN video do a much better job of describing the game. One thing the ESPN video doesn't mention (besides not listing all 11 ways to be out) is the possibility of a draw. In test match cricket, it's entirely possible for a match lasting five days to end without a winner. The reason is that a test match consists of four innings (each team gets to bat twice), and is also limited to five days. If the time limit ends before both innings are complete, *and* the trailing team is still at bat, the game is declared a draw. (The idea is that the trailing team might have caught up if only the game could continue.) The strategy for the trailing team, if they don't think they can achieve outright victory, is to instead play for time and go for the draw. (The winner of a test series is the team with the most wins over five five-day test matches.)

Playing for five days without a definitive outcome can try the patience of the modern sporting fan, so one-day cricket was born. Here, there are just two innings per game: each team is given a fixed number of overs (an over is a set of six balls) with which to score, after which the innings is automatically over and the other team has an opportunity to bat. As the name suggests, the game is over in a day and one team or the other will be declared the winner (unless there is an exact tie in scores).

There's an interesting statistical angle here, which is related to interruptions in the game. Let's say we're halfway through the second innings and Australia is at bat with 142 runs to England's 204. Normally, Australia would need to score 63 runs (the "target") to win. Now suppose it starts to rain, and the game is suspended for an hour. To keep the game from running long, Australia will be given fewer overs to bat, and their target will be reduced as well. But the target isn't reduced in exact proportion to the overs removed, to reflect the fact that more runs are generally scored in the latter part of an innings. The exact calculation is based on statistical analysis of cricket games, and is a great example of censored data analysis. (The basic idea is to be able to forecast what the final score *would have been* in games that are interrupted.) The calculation is known as the Duckworth-Lewis method, named after the two British statisticians who devised it. People often talk about statistics and baseball in the same breath, but this is the only example I can think of where statistical *modeling* is such an important part of a sport. (If you can think of others, let me know in the comments!)
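To make the principle concrete, here is a toy sketch of the Duckworth-Lewis idea, emphatically NOT the official method: the chasing team's target is scaled by the ratio of the "resources" (overs and wickets) it has available relative to the team that batted first. The resource percentages below are made-up placeholders, not the published D/L tables.

```r
# Toy illustration of the Duckworth-Lewis principle. The resource
# fractions r1 and r2 are made-up placeholders, not official values.
dl_revised_target <- function(first_innings_score, r1 = 1.00, r2 = 0.85) {
  par_score <- first_innings_score * r2 / r1  # scaled "par" score
  floor(par_score) + 1                        # must beat par to win
}
dl_revised_target(204)  # a reduced target after an interruption
```

The real method replaces the fixed fractions with a statistically estimated resource table indexed by overs remaining and wickets lost.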

Well, that's all for this week — I'm off to watch the cricket! See you back here on Monday.

From the "statistician humour" department, today's xkcd cartoon will ring a bell for anyone who's ever published (or read!) a scientific article including a P-value for a statistical test:

If finding P-value excuses is a common activity for you (and let's hope not!) then R has you covered with the Significantly Improved Significance Test. This R code from Rasmus Bååth will automatically annotate your P-values between 0.05 and 0.12 with excuses like "suggestive of statistical significance", "weakly non-significant" or "quasi-significant". Bonus points for links in the comments to real journal articles that actually use these excuses!
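In the same spirit, here is a hedged re-creation of the idea (the exact phrases and cutoffs below are illustrative, not Bååth's actual table):

```r
# Annotate a p-value with a face-saving phrase. The cutoffs and
# wording here are illustrative, not Rasmus Baath's actual code.
p_excuse <- function(p) {
  if (p <= 0.05) "significant"
  else if (p <= 0.07) "suggestive of statistical significance"
  else if (p <= 0.09) "weakly non-significant"
  else if (p <= 0.12) "quasi-significant"
  else "not significant"
}
p_excuse(0.06)
```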