by Joseph Rickert

Quite a few times over the past few years I have highlighted presentations posted by R user groups on their websites and recommended these sites as a source for interesting material, but I have never thought to see what the user groups were doing on GitHub. As you might expect, many people who make presentations at R user group meetings make their code available on GitHub. However, as best I can tell, only a few R user groups are maintaining GitHub sites under the user group name.

The Indy UseR Group is one that seems to be making very good use of its GitHub site. Here is the link to a very nice tutorial from Shankar Vaidyaraman on using the rvest package to do some web scraping with R. The following code, which scrapes the first page from Springer's Use R! series to produce a short list of books, comes from Shankar's simple example.

# load libraries
library(rvest)
library(dplyr)
library(stringr)

# link to Use R! titles at Springer site
useRlink = "http://www.springer.com/?SGWID=0-102-24-0-0&series=Use+R&sortOrder=relevance&searchType=ADVANCED_CDA&searchScope=editions&queryText=Use+R"

# Read the page
userPg = useRlink %>% read_html()

## Get info of books displayed on the page
booktitles = userPg %>% html_nodes(".productGraphic img") %>% html_attr("alt")
bookyr = userPg %>% html_nodes(xpath = "//span[contains(@class,'renditionDescription')]") %>% html_text()
bookauth = userPg %>% html_nodes("span[class = 'displayBlock']") %>% html_text()
bookprice = userPg %>% html_nodes(xpath = "//div[@class = 'bookListPriceContainer']//span[2]") %>% html_text()

pgdf = data.frame(title = booktitles, pubyr = bookyr, auth = bookauth, price = bookprice)

pgdf

This plot, which shows a list of books ranked by number of downloads, comes from Shankar's extended recommender example.

The Ann Arbor R User Group meetup site has done an exceptional job of creating an aesthetically pleasing and informative web property on their GitHub site.

I am particularly impressed with the way they have integrated news, content and commentary into their "News" section. Scroll down the page and have a look at the care taken to describe and document the presentations made to the group. I found the introduction and slides for Bob Carpenter's RStan presentation very well done.

Other RUGs active on GitHub include:

- Cambridge R User Group: meetup site and GitHub site
- Inland Northwest R User Group: GitHub only
- Twin Cities R User Group: meetup site and GitHub site
- UVa R Users Group: meetup site and GitHub site

If your R user group is on GitHub and I have not included you in my short list, please let me know about it. I think RUG GitHub sites have the potential for creating a rich code sharing experience among user groups. If you would like some help getting started with GitHub, have a look at the tutorials on the Murdoch University R User Group webpage.

by Joseph Rickert

In a previous post, I showed how some elementary properties of discrete time Markov Chains could be calculated, mostly with functions from the markovchain package. In this post, I would like to show a little bit more of the functionality available in that package by fitting a Markov Chain to some data. In this first block of code, I load the gold data set from the forecast package, which contains daily morning gold prices in US dollars from January 1, 1985 through March 31, 1989. Next, since there are a few missing values in the sequence, I impute them with a simple "ad hoc" process by substituting the previous day's price for one that is missing. There are two statements in the loop because there are a number of instances where there are two missing values in a row. Note that some kind of imputation is necessary because I will want to compute the autocorrelation of the series, and like many R functions acf() does not like NAs. (It doesn't make sense to compute with NAs.)

library(forecast)

library(markovchain)

data(gold) # Load gold time series

# Impute missing values

gold1 <- gold

for(i in 1:length(gold)){

gold1[i] <- ifelse(is.na(gold[i]),gold[i-1],gold[i])

gold1[i] <- ifelse(is.na(gold1[i]),gold1[i-1],gold1[i])

}

plot(gold1, xlab = "Days", ylab = "US$", main = "Gold prices 1/1/85 - 3/31/89")

This is an interesting series with over 1,000 points, but it is definitely not stationary, so it is not a good candidate for modeling as a Markov Chain. The series produced by taking first differences is more reasonable. That series is flat, oscillating about a mean (0.07) slightly above zero, and the autocorrelation trails off as one might expect for a stationary series.

# Take first differences to try and get stationary series

goldDiff <- diff(gold1)

par(mfrow = c(1,2))

plot(goldDiff,ylab="",main="1st differences of gold")

acf(goldDiff)

Next, we set up for modeling by constructing a series of labels. In this analysis, I settle for constructing a two state chain that reflects whether the series of differences assumes positive or non-positive values. Note that I have introduced a few zero values into the series because of my crude imputation process. Here, I just lump them in with the negatives.

# construct a series of labels

goldSign <- vector(length=length(goldDiff))

for (i in 1:length(goldDiff)){

goldSign[i] <- ifelse(goldDiff[i] > 0, "POS","NEG")

}

Next, we can use some of the statistical tests built into the markovchain package to assess our assumptions so far. The function verifyMarkovProperty() attempts to verify that a sequence satisfies the Markov property by performing Chi squared tests on a series of contingency tables where the columns are sequences of past to present to future transitions and the rows are sequences of state transitions. Large p values indicate that one should not reject the null hypothesis that the Markov property holds for a specific transition. The output of verifyMarkovProperty() is a list with an entry for each possible state transition. The package vignette An introduction to the markovchain package presents the details of how this works. The following shows the output for the NEG to POS transitions.

# Verify the Markov property

vmp <- verifyMarkovProperty(goldSign)

vmp[2]

# $NEGPOS

# $NEGPOS$statistic

# X-squared

# 0

#

# $NEGPOS$parameter

# df

# 1

#

# $NEGPOS$p.value

# [1] 1

#

# $NEGPOS$method

# [1] "Pearson's Chi-squared test with Yates' continuity correction"

#

# $NEGPOS$data.name

# [1] "table"

#

# $NEGPOS$observed

# SSO TSO-SSO

# NEG 138 116

# POS 164 139

#

# $NEGPOS$expected

# SSO TSO-SSO

# NEG 137.7163 116.2837

# POS 164.2837 138.7163

#

# $NEGPOS$residuals

# SSO TSO-SSO

# NEG 0.02417181 -0.02630526

# POS -0.02213119 0.02408452

#

# $NEGPOS$stdres

# SSO TSO-SSO

# NEG 0.04843652 -0.04843652

# POS -0.04843652 0.04843652

#

# $NEGPOS$table

# SSO TSO

# NEG 138 254

# POS 164 303
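The test behind the NEG to POS entry can be reproduced with base R. This is only a sketch: it assumes the `$observed` table printed above is the whole input to the test, and the object name `observed` is mine, not the package's.

```r
# Observed counts copied from the $NEGPOS$observed output above
observed <- matrix(c(138, 116,
                     164, 139),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("NEG", "POS"), c("SSO", "TSO-SSO")))

# Pearson chi-squared test; for 2 x 2 tables chisq.test() applies
# Yates' continuity correction by default
test <- chisq.test(observed)

test$statistic           # test statistic
test$p.value             # p-value
round(test$expected, 4)  # expected counts under independence
```

Because every observed count is within half a unit of its expected count, the continuity correction shrinks the statistic all the way to zero, which is why the reported p-value is 1.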

The assessOrder() function uses a Chi squared test to test the hypothesis that the sequence is consistent with a first order Markov Chain.

assessOrder(goldSign)

# The assessOrder test statistic is: 0.4142521

# the Chi-Square d.f. are: 2 the p-value is: 0.8129172

# $statistic

# [1] 0.4142521

#

# $p.value

# [1] 0.8129172

There is an additional function, assessStationarity(), to test for the stationarity of the sequence. However, in this case, the chisq.test() function at the heart of things reports that the p-values are unreliable.

Next, we use the markovchainFit() function to fit a Markov Chain to the data using the maximum likelihood estimator explained in the vignette mentioned above. The output from the function includes the estimated transition matrix, the estimated error and lower and upper endpoint transition matrices which provide a confidence interval for the transition matrix.

goldMC <- markovchainFit(data = goldSign, method="mle", name = "gold mle")

goldMC$estimate

# gold mle

# A 2 - dimensional discrete Markov Chain with following states

# NEG POS

# The transition matrix (by rows) is defined as follows

# NEG POS

# NEG 0.4560144 0.5439856

# POS 0.5519126 0.4480874

goldMC$standardError

# NEG POS

# NEG 0.02861289 0.03125116

# POS 0.03170655 0.02856901

goldMC$confidenceInterval

# [1] 0.95

#

# $lowerEndpointMatrix

# NEG POS

# NEG 0.4089504 0.4925821

# POS 0.4997599 0.4010956

#

# $upperEndpointMatrix

# NEG POS

# NEG 0.5030784 0.5953892

# POS 0.6040652 0.4950793

The transition matrix does show some interesting structure: the chain is more likely to go from a negative to a positive value than to stay negative. And, once it is positive, the chain is more likely to go negative than to stay positive.
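For intuition about where these numbers come from: the maximum likelihood estimate for a first order chain is just the table of observed transitions with each row divided by its total. Here is a minimal base R sketch on a made-up eight-day sequence; the vector `signs` is hypothetical and is not the gold data.

```r
# Hypothetical label sequence, for illustration only
signs <- c("NEG", "POS", "POS", "NEG", "POS", "NEG", "NEG", "POS")

# Pair each state with its successor
from <- head(signs, -1)
to   <- tail(signs, -1)

# Count the transitions, then divide each row by its total
counts <- table(from, to)
P_hat  <- prop.table(counts, margin = 1)
P_hat
```

Applied to goldSign, the same few lines should agree with goldMC$estimate up to rounding.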

Finally, we use the predict() function to produce a three day, look ahead forecast for the situation where the series has been positive for the last two days.

predict(object = goldMC$estimate, newdata = c("POS","POS"),n.ahead=3)

#"NEG" "POS" "NEG"
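The forecast is easy to trace by hand. The sketch below assumes that predict() simply steps to the most probable next state each day; that is my reading of the output above, not something documented in this post. The matrix `P` is copied from goldMC$estimate.

```r
# Transition matrix copied from goldMC$estimate above
P <- matrix(c(0.4560144, 0.5439856,
              0.5519126, 0.4480874),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("NEG", "POS"), c("NEG", "POS")))

state <- "POS"          # last observed state
path  <- character(3)   # three day look ahead

for (k in 1:3) {
  # move to the most probable successor of the current state
  state   <- colnames(P)[which.max(P[state, ])]
  path[k] <- state
}

path
```

Under that assumption only the most recent state matters, which is exactly the first order Markov property.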

I still have not exhausted what is in the markovchain package. Perhaps in another post I will look at the functions for continuous time chains. What is presented here, though, should be enough to have some fun hunting for Markov Chains in all kinds of data.

The scientific process has been going through a welcome period of introspection recently, with a focus on understanding just how reliable the results of scientific studies are. We're not talking here about scientific fraud, but about how the scientific process itself and the focus on p-values (which not even statisticians can easily explain) as the criterion for a positive result lead to a surprisingly large number of false positives being published. On top of that, there's the issue of publication bias (especially in the pharmaceutical industry), an area where Ben Goldacre has taken a lead. The whole issue is wrapped in the concept of reproducibility (the idea that independent researchers should be able to replicate the results of published studies), for which David Spiegelhalter gives a great primer in the video below.

So it's welcome news that one of the top science breakthroughs of 2015 according to *Science* and *Nature* is Brian Nosek's project to reproduce the results of 100 scientific studies published in psychology journals. The detailed methodology is described in this paper, but in short Nosek recruited replication teams to recreate the studies as described in the carefully-selected papers, and analyze the data they collect:

Moreover, to maximize reproducibility and accuracy, the analyses for every replication study were reproduced by another analyst independent of the replication team using the R statistical programming language and a standardized analytic format. A controller R script was created to regenerate the entire analysis of every study and recreate the master data file.

R is a natural fit for a reproducibility project like this: as a scripting language, the R script itself provides a reproducible documentation of every step of the process. (Revolution R Open, Microsoft's enhanced R distribution, additionally includes features to facilitate reproducibility when using R packages.) The R script used for the psychology replication project describes and executes the process for checking the results of the papers.

Of the 100 studies, 97 reported statistically significant effects. (This is itself a reflection of publication bias; studies where there is no effect rarely get published.) Yet of those 97 papers, in 61 cases the reported significant results could not be replicated when the study was repeated. Their conclusion:

A large portion of replications produced weaker evidence for the original findings despite using materials provided by the original authors, review in advance for methodological fidelity, and high statistical power to detect the original effect sizes.

A study like this of the scientific method itself can only improve the scientific process, and is deserving of its accolade as a breakthrough. Read more about the project and the replicated studies at the link below.

Open Science Framework: Estimating the Reproducibility of Psychological Science (via Solomon Messing)

by Joseph Rickert

There are a number of R packages devoted to sophisticated applications of Markov chains. These include msm and SemiMarkov for fitting multistate models to panel data, mstate for survival analysis applications, TPmsm for estimating transition probabilities for 3-state progressive disease models, heemod for applying Markov models to health care economic applications, HMM and depmixS4 for fitting Hidden Markov Models and mcmc for working with Markov chain Monte Carlo. All of these assume some considerable knowledge of the underlying theory. To my knowledge only DTMCPack and the relatively recent package, markovchain, were written to facilitate basic computations with Markov chains.

In this post, we’ll explore some basic properties of discrete time Markov chains using the functions provided by the markovchain package supplemented with standard R functions and a few functions from other contributed packages. Chapter 11 of Snell’s online probability book will be our guide. The calculations displayed here illustrate some of the theory developed in that document. In the text below, section numbers refer to that document.

A large part of working with discrete time Markov chains involves manipulating the matrix of transition probabilities associated with the chain. This first section of code replicates the Oz transition probability matrix from section 11.1 and uses the plotmat() function from the diagram package to illustrate it. Then, the efficient operator %^% from the expm package is used to raise the Oz matrix to the third power. Finally, left matrix multiplication of Oz^3 by the distribution vector u = (1/3, 1/3, 1/3) gives the weather forecast three days ahead.

library(expm)

library(markovchain)

library(diagram)

library(pracma)

stateNames <- c("Rain","Nice","Snow")

Oz <- matrix(c(.5,.25,.25,.5,0,.5,.25,.25,.5),

nrow=3, byrow=TRUE)

row.names(Oz) <- stateNames; colnames(Oz) <- stateNames

Oz

# Rain Nice Snow

# Rain 0.50 0.25 0.25

# Nice 0.50 0.00 0.50

# Snow 0.25 0.25 0.50

plotmat(Oz,pos = c(1,2),

lwd = 1, box.lwd = 2,

cex.txt = 0.8,

box.size = 0.1,

box.type = "circle",

box.prop = 0.5,

box.col = "light yellow",

arr.length=.1,

arr.width=.1,

self.cex = .4,

self.shifty = -.01,

self.shiftx = .13,

main = "")

Oz3 <- Oz %^% 3

round(Oz3,3)

# Rain Nice Snow

# Rain 0.406 0.203 0.391

# Nice 0.406 0.188 0.406

# Snow 0.391 0.203 0.406

u <- c(1/3, 1/3, 1/3)

round(u %*% Oz3,3)

#0.401 0.198 0.401
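If expm is not installed, the same forecast can be checked with plain matrix multiplication. The helper `mat_pow()` below is a hypothetical name of my own, a naive loop rather than the efficient algorithm behind %^%.

```r
# Hypothetical helper: k-th power of a square matrix by repeated multiplication
mat_pow <- function(M, k) {
  out <- diag(nrow(M))                 # start from the identity matrix
  for (i in seq_len(k)) out <- out %*% M
  dimnames(out) <- dimnames(M)
  out
}

stateNames <- c("Rain", "Nice", "Snow")
Oz <- matrix(c(.5, .25, .25,
               .5,  0,  .5,
               .25, .25, .5),
             nrow = 3, byrow = TRUE,
             dimnames = list(stateNames, stateNames))

u <- c(1/3, 1/3, 1/3)                  # today's distribution over states
forecast <- u %*% mat_pow(Oz, 3)       # distribution three days ahead
round(forecast, 3)
```

Note that the starting distribution multiplies from the left because the rows of Oz, not the columns, sum to one.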

The igraph package can also be used to draw Markov chain diagrams, but I prefer the “drawn on a chalkboard” look of plotmat.

This next block of code reproduces the 5-state Drunkard’s walk example from section 11.2 which presents the fundamentals of absorbing Markov chains. First, the transition matrix describing the chain is instantiated as an object of the S4 class markovchain. Then, functions from the markovchain package are used to identify the absorbing and transient states of the chain and place the transition matrix, P, into canonical form.

p <- c(.5,0,.5)

dw <- c(1,rep(0,4),p,0,0,0,p,0,0,0,p,rep(0,4),1)

DW <- matrix(dw,5,5,byrow=TRUE)

DWmc <-new("markovchain",

transitionMatrix = DW,

states = c("0","1","2","3","4"),

name = "Drunkard's Walk")

DWmc

# Drunkard's Walk

# A 5 - dimensional discrete Markov Chain with following states

# 0 1 2 3 4

# The transition matrix (by rows) is defined as follows

# 0 1 2 3 4

# 0 1.0 0.0 0.0 0.0 0.0

# 1 0.5 0.0 0.5 0.0 0.0

# 2 0.0 0.5 0.0 0.5 0.0

# 3 0.0 0.0 0.5 0.0 0.5

# 4 0.0 0.0 0.0 0.0 1.0

# Determine transient states

transientStates(DWmc)

#[1] "1" "2" "3"

# determine absorbing states

absorbingStates(DWmc)

#[1] "0" "4"

In canonical form, the transition matrix, P, is partitioned into the Identity matrix, I, a matrix of 0’s, the matrix, Q, containing the transition probabilities among the transient states and a matrix, R, containing the probabilities of moving from the transient states to the absorbing states.

Next, we find the Fundamental Matrix, N, by inverting (I – Q). For transient states i and j, n_{ij} gives the expected number of times the process is in state j given that it started in state i. u_{i} is the expected time until absorption given that the process starts in state i. Finally, we compute the matrix B, where b_{ij} is the probability that the process will be absorbed in state j given that it starts in state i.
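In symbols, and assuming the absorbing states are listed first (which is the ordering the getRQ() helper below relies on), the quantities just described are:

$$
P = \begin{pmatrix} I & 0 \\ R & Q \end{pmatrix}, \qquad
N = (I - Q)^{-1} = \sum_{k=0}^{\infty} Q^{k}, \qquad
u = N\mathbf{1}, \qquad
B = NR
$$

Here $\mathbf{1}$ is a column vector of ones, so $u_i$ is the sum of row $i$ of $N$. Some texts write the canonical form with the transient block first; the formulas for $N$, $u$ and $B$ do not change.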

# Find Matrix Q

getRQ <- function(M,type="Q"){

if(length(absorbingStates(M)) == 0) stop("Not Absorbing Matrix")

tm <- M@transitionMatrix

d <- diag(tm)

m <- max(which(d == 1))

n <- length(d)

# "Q" selects the transient-to-transient block; anything else selects R

A <- if (type == "Q") tm[(m+1):n, (m+1):n] else tm[(m+1):n, 1:m]

return(A)

}

# Put DWmc into Canonical Form

P <- canonicForm(DWmc)

P

Q <- getRQ(P)

# Find Fundamental Matrix

I <- diag(dim(Q)[2])

N <- solve(I - Q)

N

# 1 2 3

# 1 1.5 1 0.5

# 2 1.0 2 1.0

# 3 0.5 1 1.5

# Calculate time to absorption

c <- rep(1,dim(N)[2])

u <- N %*% c

u

# 1 3

# 2 4

# 3 3

R <- getRQ(P, "R")

B <- N %*% R

B

# 0 4

# 1 0.75 0.25

# 2 0.50 0.50

# 3 0.25 0.75
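As a cross-check that does not depend on the markovchain package or on the row order returned by canonicForm(), the same three results can be obtained in base R by indexing the transition matrix with state names. This is a sketch with variable names of my own choosing.

```r
# Drunkard's walk on states 0..4; states 0 and 4 are absorbing
states <- as.character(0:4)
P <- matrix(0, 5, 5, dimnames = list(states, states))
P["0", "0"] <- 1
P["4", "4"] <- 1
for (s in 1:3) {
  P[as.character(s), as.character(s - 1)] <- 0.5   # step left
  P[as.character(s), as.character(s + 1)] <- 0.5   # step right
}

# A state is absorbing when it returns to itself with probability 1
absorbing <- states[diag(P) == 1]
transient <- setdiff(states, absorbing)

Q <- P[transient, transient]   # transient -> transient block
R <- P[transient, absorbing]   # transient -> absorbing block

N     <- solve(diag(length(transient)) - Q)  # fundamental matrix
steps <- rowSums(N)                          # expected steps to absorption
B     <- N %*% R                             # absorption probabilities

N
steps
B
```

Selecting blocks by name avoids the assumption, built into getRQ(), that the absorbing states occupy the first rows.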

For section 11.3, which deals with regular and ergodic Markov chains, we return to Oz and provide four options for calculating the steady state, or limiting probability distribution, for this regular transition matrix. The first three options involve standard methods which are readily available in R. Method 1 uses %^% to raise the matrix Oz to a sufficiently high power. Method 2 calculates the eigenvector for the eigenvalue 1, and method 3 uses the nullspace() function from the pracma package to compute the null space, or kernel, of the linear transformation associated with the matrix. The fourth method uses the steadyStates() function from the markovchain package. To use this function, we first convert Oz into a markovchain object.

# 11.3 Ergodic Markov Chains

# Four methods to get steady states

# Method 1: compute powers on Matrix

round(Oz %^% 6,2)

# Rain Nice Snow

# Rain 0.4 0.2 0.4

# Nice 0.4 0.2 0.4

# Snow 0.4 0.2 0.4

# Method 2: Compute eigenvector of eigenvalue 1

eigenOz <- eigen(t(Oz))

ev <- eigenOz$vectors[,1] / sum(eigenOz$vectors[,1])

ev

# Method 3: compute null space of (P - I)

I <- diag(3)

ns <- nullspace(t(Oz - I))

ns <- round(ns / sum(ns),2)

ns

# Method 4: use function in markovchain package

OzMC<-new("markovchain",

states=stateNames,

transitionMatrix=

matrix(c(.5,.25,.25,.5,0,.5,.25,.25,.5),

nrow=3,

byrow=TRUE,

dimnames=list(stateNames,stateNames)))

steadyStates(OzMC)
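All four methods solve the same linear system, so it can also be handed straight to solve(). In this base R sketch (not one of the four methods above; the names `A` and `w` are mine) the stationarity equations are written out and one redundant equation is swapped for the constraint that the probabilities sum to one.

```r
Oz <- matrix(c(.5, .25, .25,
               .5,  0,  .5,
               .25, .25, .5),
             nrow = 3, byrow = TRUE)

# Stationarity: t(Oz) %*% w = w, that is (t(Oz) - I) %*% w = 0
A <- t(Oz) - diag(3)

# One of these three equations is redundant; replace it with sum(w) = 1
A[3, ] <- 1

w <- solve(A, c(0, 0, 1))
w
```

The result can be compared with the rows of the matrix power in Method 1.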

The steadyStates() function seems to be reasonably efficient for fairly large Markov Chains. The following code creates a 5,000 row by 5,000 column regular Markov matrix. On my modest Lenovo ThinkPad ultrabook it took a little less than 2 minutes to create the markovchain object and about 11 minutes to compute the steady state distribution.

# Create a large random regular matrix

randReg <- function(N){

M <- matrix(runif(N^2,min=1,max=N),nrow=N,ncol=N)

rowS <- rowSums(M)

regM <- M/rowS

return(regM)

}

N <- 5000

M <-randReg(N)

#rowSums(M)

system.time(regMC <- new("markovchain", states = as.character(1:N),

transitionMatrix = M,

name = "M"))

# user system elapsed

# 98.33 0.82 99.46

system.time(ss <- steadyStates(regMC))

# user system elapsed

# 618.47 0.61 640.05

We conclude this little Markov Chain excursion by using the rmarkovchain() function to simulate a trajectory from the process represented by this large random matrix and plot the results. It seems that this is a reasonable method for simulating a stationary time series in a way that makes it easy to control the limits of its variability.

#sample from regMC

regMCts <- rmarkovchain(n=1000,object=regMC)

regMCtsDf <- as.data.frame(regMCts,stringsAsFactors = FALSE)

regMCtsDf$index <- 1:1000

regMCtsDf$regMCts <- as.numeric(regMCtsDf$regMCts)

library(ggplot2)

p <- ggplot(regMCtsDf,aes(index,regMCts))

p + geom_line(colour="dark red") +

xlab("time") +

ylab("state") +

ggtitle("Random Markov Chain")
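To see what a simulation like this involves, here is a bare-bones base R sketch. It is not how rmarkovchain() is implemented; `simulate_chain()` is a hypothetical helper, and the 5-state matrix is a small stand-in for the 5,000-state one above.

```r
# Hypothetical helper: simulate n steps of a chain with transition matrix P
simulate_chain <- function(P, n, start = 1) {
  path  <- integer(n)
  state <- start
  for (step in seq_len(n)) {
    # draw the next state from the row belonging to the current state
    state      <- sample.int(nrow(P), size = 1, prob = P[state, ])
    path[step] <- state
  }
  path
}

set.seed(123)
M <- matrix(runif(25, min = 1, max = 5), nrow = 5)
M <- M / rowSums(M)            # each row now sums to 1

traj <- simulate_chain(M, 1000)
table(traj) / length(traj)     # share of time spent in each state
```

Over a long run the share of time spent in each state should settle near the steady state distribution of M.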

For more on the capabilities of the markovchain package do have a look at the package vignette. For more theory on both discrete and continuous time Markov processes illustrated with R see Norm Matloff's book: From Algorithms to Z-Scores: Probabilistic and Statistical Modeling in Computer Science.

by Andrie de Vries

A week ago my high school friend, @XLRunner, sent me a link to the article "How Zach Bitter Ran 100 Miles in Less Than 12 Hours". Zach's effort was rewarded with the American record for the 100 mile event.

This reminded me of some analysis I did, many years ago, of the world record speeds for various running distances. The International Amateur Athletics Federation (IAAF) keeps track of world records for distances from 100m up to the marathon (42km). The distances longer than 42km do not fall in the IAAF event list, but these are also tracked by various other organisations.

You can find a list of IAAF world records at Wikipedia, and a list of ultramarathon world best times at Wikipedia.

I extracted only the men's running events from these lists, and used R to plot the average running speeds for these records:

You can immediately see that the speed declines very rapidly from the sprint events. Perhaps it would be better to plot this using a logarithmic x-scale, adding some labels at the same time. I also added some colour for what I call standard events, where "standard" is the type of distance you would see regularly at a world championships or Olympic Games. Thus the mile is "standard", but the 2,000m race is not.

Now our data points are in somewhat more of a straight line, meaning we could consider fitting a linear regression.

However, it seems that there might be two kinks in the line:

- The first kink occurs somewhere between the 800m distance and the mile. It seems that the sprinting distances (and the 800m is sometimes called a long sprint) have different dynamics from the events up to the marathon.
- And then there is another kink for the ultra-marathon distances. The standard marathon is 42.2km, and distances longer than this are called ultramarathons.

Also, note that the speed for the 100m is actually slower than for the 200m. This indicates the transition effect of getting started from a standing start - clearly this plays a large role in the very short sprint distance.

For the analysis below, I excluded the data for:

- The 100m sprint (transition effects play too large a role)
- The ultramarathon distances (they get raced less frequently, thus something strange seems to be happening in the data for the 50km race in particular).

To fit a regression line with kinks, more properly known as a segmented regression (or sometimes called piecewise regression), you can use the segmented package, available on CRAN.

The `segmented()` function allows you to modify a fitted object of class `lm` or `glm`, specifying which of the independent variables should have segments (kinks). In my case, I fitted a linear model with a single variable (log of distance), and allowed `segmented()` to find a single kink point.

My analysis indicates that there is a kink point at 1.13km (10^0.055 = 1.13), i.e. between the 800m event and the 1,000m event.

`> summary(sfit)`

`***Regression Model with Segmented Relationship(s)***`

`Call: `

`segmented.lm(obj = lfit, seg.Z = ~logDistance)`

`Estimated Break-Point(s):`

` Est. St.Err `

` 0.055 0.021`

`Meaningful coefficients of the linear terms:`

` Estimate Std. Error t value Pr(>|t|) `

`(Intercept) 27.2064 0.1755 155.04 < 2e-16 ***`

`logDistance -15.1305 0.4332 -34.93 1.94e-13 ***`

`U1.logDistance 11.2046 0.4536 24.70 NA `

`---`

`Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1`

`Residual standard error: 0.2373 on 12 degrees of freedom`

`Multiple R-Squared: 0.9981, Adjusted R-squared: 0.9976`

`Convergence attained in 4 iterations with relative change -4.922372e-16 `
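To see what a single kink point means in model terms, here is a brute-force sketch in base R. It is not the algorithm the segmented package uses: it simply tries a grid of candidate break points, fits a line plus a hinge term at each one, and keeps the candidate with the smallest residual sum of squares. The data are simulated, and every name in the block is hypothetical.

```r
set.seed(42)

# Simulated data with one kink at x = 0.5: slope -15 before, -4 after
x <- seq(-1, 2, length.out = 60)
true_break <- 0.5
y <- 27 - 15 * x + 11 * pmax(x - true_break, 0) + rnorm(length(x), sd = 0.2)

# Residual sum of squares for each candidate break point
candidates <- seq(-0.5, 1.5, by = 0.01)
rss <- sapply(candidates, function(psi) {
  fit <- lm(y ~ x + pmax(x - psi, 0))   # hinge term is zero left of psi
  sum(resid(fit)^2)
})

best_break <- candidates[which.min(rss)]
final_fit  <- lm(y ~ x + pmax(x - best_break, 0))

best_break
coef(final_fit)
```

The coefficient on the hinge term plays the same role as `U1.logDistance` in the summary above: it is the change in slope after the break point, not the slope itself.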

The final plot shows the same data, but this time with the segmented regression line also displayed.

I conclude:

- It is really easy to fit a segmented linear regression model using the segmented package
- There seems to be a different physiological process for the sprint events and the middle distance events. The segmented regression finds this kink point between the 800m event and the 1,000m event
- The ultramarathon distances have a completely different dynamic. However, it's not clear to me whether this is due to inherent physiological constraints, or vastly reduced competition in these "non-standard" events.
- The 50km world record seems too "slow". Perhaps the competition for this event is less intense than for the marathon?

Here is my code for the analysis:

by John Mount Ph.D.

Data Scientist at Win-Vector LLC

Our most recent article was a dynamic programming solution to the A/B test problem. Explicitly solving such dynamic programs is a long and tedious process, so you are well served by finding and introducing clever invariants to track (something better than just raw win-rates). This clever idea, called "sequential analysis", was introduced by Abraham Wald (whom we have written about before). If you have ever heard of a test plan such as "first process to get more than 30 wins ahead of the other is the one we choose" you have seen methods derived from Wald's sequential analysis technique.

Wald's famous airplane armor problem

In this "statistics as it should be" article we will discuss Wald's sequential analysis.

A particularly compelling (and unusual) type of test plan is the graphical sequential inspection procedure designed by Abraham Wald. Wald treats A/B testing as a "stopping problem." Wald asks the business partner for four parameters: u0, u1 (the desired bounds on relative error allowed in the estimate); and alpha, beta (the significance and power goals). With these parameters Wald designs an inspection plan that is a single chart (shown below).

The chart is used as follows: we send traffic to processes 1 and 2 in matched pairs. We get back pair measurements (c1,c2) where c1 = 1 if process 1 paid off (0 otherwise) and c2 = 1 if process 2 paid off (zero otherwise). Only look at pairs (0,1) or (1,0) (we discard all (0,0) and (1,1) pairs, they are not allowed to update the chart). Start at the origin of the inspection chart. When you see a (0,1) or (1,0) pair move on the inspection chart one unit to the right; if the observation was a (0,1) also move one unit up (a process 1 "win"). When you cross one of the decision lines you are done ("Reject process 1" means go with 2, and "Accept process 1" means go with 1).

This is completely procedural, but brilliant. It is a chart you can actually print, laminate, and post on the machine shop floor. The idea is so brilliant, it is easier to demonstrate than to describe (so we suggest skipping ahead and checking out our worked example here).

Wald's example was to run two manufacturing processes and then shut down the one that seems to have a lower success rate. Our internet age example would be running two advertisement campaigns and shutting down the one with lower conversion rate. In all cases the procedure is designed in terms of continuously looking at results (something people are likely to do), instead of prescribing a batch interval and hoping nobody spoils the statistical significance by peeking early (the classic approach).

Wald's sequential inspection procedure is itself a purely prescriptive procedure: move along the chart as he prescribes and you have a fairly good decision procedure. So any fretting or worrying about business trade-offs needs to have been worked through when specifying the u0, u1, alpha, beta needed to design the chart. This means these values need to be tried and the business person needs to see if the expected length of the inspection plans is acceptable (and revise their selection of u0, u1, alpha, beta if it is not). However, due to the extremely clever framework, the chart is simple (two parallel lines) and all calculation is done when producing the chart (no calculation remains while making decisions).

The decision boundaries are two parallel lines with common slope determined by four user supplied parameters. Notice that there is no "experiment length" in Wald's sequential analysis procedure. How long the experiment runs is a random variable determined by the user parameters and the results seen. You don't spoil a sequential analysis by "peeking" or "stopping early" (which would spoil a standard A/B test design).
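To make the two parallel lines concrete, here is a sketch using the textbook form of Wald's sequential probability ratio test for a success probability. It is an assumption-laden stand-in, not the worksheet linked from this article: `p0` and `p1` are hypothetical design values for the chance that an informative pair moves the chart up, and the mapping from u0, u1 to such probabilities is not spelled out here, so I do not attempt it. The function names are mine.

```r
# Hypothetical helper: slope and intercepts of the two decision lines
# for testing success probability p0 against p1 (p1 > p0)
sprt_lines <- function(p0, p1, alpha, beta) {
  g <- log(p1 / p0) + log((1 - p0) / (1 - p1))   # log odds ratio
  list(slope = log((1 - p0) / (1 - p1)) / g,
       lower = log(beta / (1 - alpha)) / g,
       upper = log((1 - beta) / alpha) / g)
}

# Hypothetical helper: walk the chart; ups[i] is 1 if informative pair i
# moves the point up, 0 if it only moves the point right
run_chart <- function(ups, chart) {
  n      <- seq_along(ups)
  height <- cumsum(ups)
  above  <- which(height >= chart$upper + chart$slope * n)
  below  <- which(height <= chart$lower + chart$slope * n)
  first_above <- if (length(above) > 0) above[1] else Inf
  first_below <- if (length(below) > 0) below[1] else Inf
  if (is.infinite(first_above) && is.infinite(first_below)) {
    return(list(decision = "keep sampling", n = length(ups)))
  }
  if (first_above < first_below) {
    list(decision = "upper line crossed", n = first_above)
  } else {
    list(decision = "lower line crossed", n = first_below)
  }
}

chart <- sprt_lines(p0 = 0.45, p1 = 0.55, alpha = 0.05, beta = 0.05)
chart$slope

set.seed(7)
ups <- rbinom(500, size = 1, prob = 0.6)   # simulated informative pairs
run_chart(ups, chart)
```

With design values placed symmetrically around one half, the common slope is exactly one half, and the number of pairs needed before a line is crossed varies from run to run.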

One fascinating property of Wald's procedure is that it isn't running off a traditional sufficient statistic. If the data were all exchangeable then the sufficient statistic would be how many times process 1 was tried, how many process 1 successes, how many times process 2 was tried, and how many process 2 successes. Wald's position chart isn't a function of these sufficient statistics; it also depends on the exact pairing of process 1 and process 2 trials. Wald derives his decision surface from how many times process 1 performed differently from process 2 for a single given pairing of the process 1 and process 2 trials. Oddly enough, for comparing the rates of rare events (such as advertisement conversion) this "loss of statistical efficiency" is tiny.

Part of the brilliance of the plan is that throwing out (0,0) and (1,1) measurements (not allowing them to move the point on the sequential analysis chart) makes the chart independent of the rate at which we see (1,0) and (0,1) events, or largely independent of the magnitude of the unknown rates we are trying to compare. That is, Wald's chart is roughly a pivotal summary of the experiment to date, as the estimate is roughly independent of the unknown rates (to be estimated) once conditioned on the observations (a property shared by sampling distributions such as Student's t-distribution). By retaining only the (0,1) and (1,0) measurements Wald converted an arbitrary comparison of rates experiment to a specific "is the rate 50/50 or not" experiment. These advantages are likely why Optimizely switched to sequential hypothesis testing (from standard power/significance planners).

We are sharing an R implementation of Wald's sequential analysis test and also plot the simulated application of the test over many repetitions. This yields the following figure, which shows process 2 being reliably selected in simulation (when it is in fact the higher return process).

Wald's sequential test is easy to implement and we demonstrate it as an R knitr worksheet here. The technique is best seen in action, so we really recommend checking it out.

Here is the list of previous Win-Vector LLC posts in the series about A/B testing that combine elements from both operations research and statistical points of view.

A dynamic programming solution to A/B test design

Why does designing a simple A/B test seem so complicated?

A clear picture of power and significance in A/B tests

Bandit Formulations for A/B Tests: Some Intuition

Bayesian/loss-oriented: New video course: Campaign Response Testing

Bob Horton

Sr Data Scientist, Microsoft

Wikipedia describes Simpson’s paradox as “a trend that appears in different groups of data but disappears or reverses when these groups are combined.” Here is the figure from the top of that article (you can click on the image in Wikipedia then follow the “more details” link to find the R code used to generate it. There is a lot of R in Wikipedia).

I rearranged it a bit to put the values in a dataframe, to make it a bit easier to think of the “color” column as a confounding variable:

x | y | color |
---|---|---|
1 | 6 | 1 |
2 | 7 | 1 |
3 | 8 | 1 |
4 | 9 | 1 |
8 | 1 | 2 |
9 | 2 | 2 |
10 | 3 | 2 |
11 | 4 | 2 |

If we do not consider this confounder, we find that the coefficient of x is negative (the dashed line in the figure above):

`coefficients(lm(y ~ x, data=simpson_data))`

```
## (Intercept) x
## 8.3333333 -0.5555556
```

If we do take the confounder into account, we see the coefficient of x is positive:

`coefficients(lm(y ~ x + color, data=simpson_data))`

```
## (Intercept) x color
## 17 1 -12
```

In his book *Causality*, Judea Pearl makes a more sweeping statement regarding Simpson’s paradox: “Any statistical relationship between two variables may be reversed by including additional factors in the analysis.” [Pearl2009]

That sounds fun; let’s try it.

First we’ll make variables `x` and `y` with a simple linear relationship. I’ll use the same slopes and intercepts as in the Wikipedia figure, both to show the parallel and to demonstrate the incredible cosmic power I have to bend coefficients to my will.

```
set.seed(1)
N <- 3000
x <- rnorm(N)
m <- -0.5555556
b <- 8.3333333
y <- m * x + b + rnorm(length(x))
plot(x, y, col="gray", pch=20, asp=1)
fit <- lm(y ~ x)
abline(fit, lty=2, lwd=2)
```

When we look at the slope of the regression line determined by fitting the model, it is almost exactly equal to the constant `m` that we used to determine `y`.

`coefficients(fit)`

```
## (Intercept) x
## 8.3284021 -0.5358175
```

We get out what we put in; the coefficient of x is essentially the slope we originally gave `y` when we generated it (-0.5555556). This is the ‘effect’ of `x`, in that a one unit increase in `x` apparently changes `y` by this amount (a decrease, since the coefficient is negative).

Now think about how to concoct a confounding variable to reverse the coefficient of `x`. This figure shows one way to approach the problem – group the points into a set of parallel stripes, with the stripes sloping in a different direction from the overall dataset:

```
m_new <- 1 # the new coefficient we want x to have
cdf <- confounded_data_frame(x, y, m_new, num_grp=10) # see function below
striped_scatterplot(y ~ x, cdf) # also see below
```

The stripes were made by specifying a reference line with a slope equal to the x-coefficient we want to achieve, and calculating the distance to that line for each point. Putting these distances into categories (by rounding off some multiple of the distance) then groups the points into stripes (shown as colors in the figure). A regression line was then fitted separately to the set of points within each stripe. The regression lines for the stripes on the very ends can be a bit wild, since these groups are very small and scattered, but the ones near the center, representing the majority of the data points, have a quite consistent slope.

The equation for determining the distance from a point to a line is (of course) right there in Wikipedia.

With a little rearranging to express the line in terms of y-intercept (`b`) and slope (`m`), and leaving off the absolute value so that points below the line have negative distances (and thus end up in a different group from the stripe with a positive distance of the same magnitude), we get this function:

```
point_line_distance <- function(b, m, x, y)
  (y - (m*x + b))/sqrt(m^2 + 1)
```
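A quick sanity check of the sign convention, using a reference line and three points picked purely for illustration (the line y = x, so `b = 0` and `m = 1`):

```r
# Same definition as above, repeated so this snippet runs on its own.
point_line_distance <- function(b, m, x, y)
  (y - (m*x + b))/sqrt(m^2 + 1)

d_above <- point_line_distance(0, 1, x = 0, y = 1)  # point above the line
d_below <- point_line_distance(0, 1, x = 1, y = 0)  # mirror image, below the line
d_on    <- point_line_distance(0, 1, x = 2, y = 2)  # point on the line
c(above = d_above, below = d_below, on = d_on)
```

The two mirror-image points get distances of equal magnitude and opposite sign, which is what keeps them in different stripes.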

Here are functions for putting the points into stripewise groups, determining the regression coefficients for each group, and putting it all together into a figure:

```
confounded_data_frame <- function(x, y, m, num_grp){
  b <- 0 # intercept doesn't matter
  d <- point_line_distance(b, m, x, y)
  d_scaled <- 0.0005 + 0.999 * (d - min(d))/(max(d) - min(d)) # avoid 0 and 1
  data.frame(x=x, y=y,
    group=as.factor(sprintf("grp%02d", ceiling(num_grp*(d_scaled)))))
}

find_group_coefficients <- function(data){
  coef <- t(sapply(levels(data$group),
    function(grp) coefficients(lm(y ~ x, data=data[data$group==grp,]))))
  coef[!is.na(coef[,1]) & !is.na(coef[,2]),]
}

striped_scatterplot <- function(formula, grouped_data){
  # blue on top and red on bottom, to match the Wikipedia figure
  colors <- rev(rainbow(length(levels(grouped_data$group)), end=2/3))
  plot(formula, grouped_data, bg=colors[grouped_data$group], pch=21, asp=1)
  grp_coef <- find_group_coefficients(grouped_data)
  # if some coefficients get dropped, colors won't match exactly
  for (r in 1:nrow(grp_coef))
    abline(grp_coef[r,1], grp_coef[r,2], col=colors[r], lwd=2)
}
```

Note that the regression lines for each group are not exactly parallel to the stripes. This is because linear regression is about minimizing the squared error on the y-axis, not the distance of points from the line. However, the thinner the stripes are, the closer the group regression lines are to our target slope. If we make a large number of thin stripes, the coefficient of `x` when the groups are taken into account is essentially the same as the slope of the reference line we used to orient the stripes:

```
cdf100 <- confounded_data_frame(x, y, m_new, num_grp=100)
# without confounder
coefficients(lm(y ~ x, cdf100))['x']
```

```
## x
## -0.5358175
```

```
# with confounder
coefficients(lm(y ~ x + group, cdf100))['x']
```

```
## x
## 0.9961566
```
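To see the effect of stripe width more directly, the same fit can be repeated for several stripe counts. This is a sketch rather than part of the original analysis: the particular counts are arbitrary, and the helper definitions are repeated (lightly condensed) so the snippet runs on its own.

```r
# Definitions repeated from above so this snippet is self-contained.
point_line_distance <- function(b, m, x, y)
  (y - (m*x + b))/sqrt(m^2 + 1)

confounded_data_frame <- function(x, y, m, num_grp){
  d <- point_line_distance(0, m, x, y)
  d_scaled <- 0.0005 + 0.999 * (d - min(d))/(max(d) - min(d))
  data.frame(x=x, y=y,
    group=as.factor(sprintf("grp%02d", ceiling(num_grp*d_scaled))))
}

# Same simulated data as before.
set.seed(1)
N <- 3000
x <- rnorm(N)
y <- -0.5555556 * x + 8.3333333 + rnorm(N)

# Coefficient of x with the stripe factor in the model, per stripe count.
stripe_counts <- c(5, 10, 25, 50, 100)  # arbitrary illustrative values
x_coef <- sapply(stripe_counts, function(k) {
  cdf <- confounded_data_frame(x, y, m=1, num_grp=k)
  coefficients(lm(y ~ x + group, cdf))[["x"]]
})
data.frame(num_grp=stripe_counts, x_coef=round(x_coef, 3))
```

With only a few wide stripes, much of the original negative trend survives inside each stripe, so the coefficient falls well short of the target; it approaches the target slope of 1 as the stripes get thinner.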

This approach gives us the power to synthesize simulated confounders that can change the coefficient of `x` to pretty much any value we choose when a model is fitted with the confounder taken into account. Plus, it makes pretty rainbows.

While Simpson’s Paradox is typically described in terms of categorical confounders, the same reversal principle applies to continuous confounders. But that’s a topic for another post.

[Pearl2009]: Pearl, J. *Causality: Models, Reasoning and Inference* (2nd ed.). Cambridge University Press, New York, 2009.

by Andrie de Vries

Back in 2011, I asked a question on StackOverflow: "How to make a great R reproducible example?".

This question attracted some great answers, including answers by Hadley Wickham and Joris Meys (co-author of R for Dummies).

In June of this year Tyler Rinker added a new answer. Tyler published the wakefield package. In his own words:

I am developing the wakefield package to address this need to quickly share reproducible data, sometimes dput() works fine for smaller data sets but many of the problems we deal with are much larger, sharing such a large data set via dput() is impractical.

I think it is a brilliant idea to create a package that allows you to easily create data with a specified structure.
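For context, this is roughly what sharing data with `dput()` looks like in base R. The tiny data frame is made up for illustration and has nothing to do with wakefield itself:

```r
# A small data frame standing in for "your data" (made up for illustration).
small <- data.frame(id = 1:3, score = c(90.5, 72, 88.25))

# dput() prints R code that rebuilds the object; that text is what gets pasted
# into a question.
dput(small)

# Round trip: capture the printed text, evaluate it, and compare.
txt <- paste(capture.output(dput(small)), collapse = "\n")
rebuilt <- eval(parse(text = txt))
all.equal(small, rebuilt)
```

For three rows the printed text is short. For a data set with thousands of rows it becomes unwieldy, which is the gap Tyler describes.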

The package has some very clever ideas. It contains functions that "know" about certain data types, e.g. age() generates age ranges and coin() generates a Bernoulli sample, to name just a few. You can also specify correlation between variables - a helpful feature if you want to demonstrate a specific statistical model.

The package is not yet on CRAN, but is extensively documented on GitHub.

wakefield is designed to quickly generate random data sets. The user passes `n` (number of rows) and predefined vectors to the `r_data_frame` function to produce a `dplyr::tbl_df` object.

Here is an example from the documentation (modified only very slightly):

This produces the following plot. Notice the correlation in the data - people with high initial grades tend to maintain high grades over time, and vice versa.

To install the package, uncomment the first two lines of code and try the examples:

by Nina Zumel

Principal Consultant Win-Vector LLC

We've just finished off a series of articles on some recent research results applying differential privacy to improve machine learning. Some of these results are pretty technical, so we thought it was worth working through concrete examples. And some of the original results are locked behind academic journal paywalls, so we've tried to touch on the highlights of the papers, and to play around with variations of our own.

**A Simpler Explanation of Differential Privacy**: Quick explanation of epsilon-differential privacy, and an introduction to an algorithm for safely reusing holdout data, recently published in *Science* (Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, Aaron Roth, “The reusable holdout: Preserving validity in adaptive data analysis”, *Science*, vol 349, no. 6248, pp. 636-638, August 2015). Note that Cynthia Dwork, one of the inventors of differential privacy, originally used it in the analysis of sensitive information.

**Using differential privacy to reuse training data**: Specifically, how differential privacy helps you build efficient encodings of categorical variables with many levels from your training data without introducing undue bias into downstream modeling.

**A simple differentially private procedure**: The bootstrap as an alternative to Laplace noise to introduce differential privacy.

Our R code and experiments are available on GitHub here, so you can try some experiments and variations yourself.

*Editor's Note: The R code includes an example of using vtreat, a package for preparing and cleaning data frames based on level-based feature pruning.*

by Joseph Rickert

We all "know" that correlation does not imply causation, that unmeasured and unknown factors can confound a seemingly obvious inference. But, who has not been tempted by the seductive quality of strong correlations?

Fortunately, it is also well known that a well-done randomized experiment can account for the unknown confounders and permit valid causal inferences. But what can you do when it is impractical, impossible or unethical to conduct a randomized experiment? (For example, we wouldn't want to ask a randomly assigned cohort of people to go through life with less education to prove that education matters.) One way of coping with confounders when randomization is infeasible is to introduce what economists call instrumental variables. This is a devilishly clever and apparently fragile notion that takes some effort to wrap one's head around.

On Tuesday October 20th, we at the Bay Area useR Group (BARUG) had the good fortune to have Hyunseung Kang describe the work that he and his colleagues at the Wharton School have been doing to extend the usefulness of instrumental variables. Hyunseung's talk started with elementary notions, such as explaining the effectiveness of randomized experiments, then described the essential notion of instrumental variables and developed the background necessary for understanding the new results in this area. The slides from Hyunseung's talk are available for download in two parts from the BARUG website. As with most presentations, these slides are little more than the mute residue of the talk itself. Nevertheless, Hyunseung makes such imaginative use of animation and build slides that the deck is worth working through.

The following slide from Hyunseung's presentation captures the essence of the instrumental approach.

The general idea is that one or more variables, the instruments, are added to the model for the purpose of inducing randomness into the exposure. This has to be done in a way that conforms with the three assumptions mentioned in the figure. The first assumption, A1, is that the instrumental variables are relevant to the process. The second assumption, A2, states that randomness is only induced into the exposure variables and not also into the outcome. The third assumption, A3, is a strong one: there are no unmeasured confounders. The claim is that if these three assumptions are met, then causal effects can be estimated with coefficients for the exposure variables that are consistent and asymptotically unbiased.
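A small simulation can make the mechanics concrete. This is a sketch under stated assumptions, not Hyunseung's method or data: every coefficient is made up, the instrument `z` is generated independently of the hidden confounder `u`, and `z` enters the outcome `y` only through the exposure `d`. The estimator is plain two-stage least squares done by hand with `lm`.

```r
set.seed(42)
n <- 10000
u <- rnorm(n)                  # confounder we pretend not to observe
z <- rnorm(n)                  # instrument: independent of u by construction
d <- 0.8 * z + u + rnorm(n)    # exposure depends on the instrument and on u
y <- 2 * d + 3 * u + rnorm(n)  # outcome; the true effect of d is 2

# Naive regression: biased because u drives both d and y.
ols <- coefficients(lm(y ~ d))[["d"]]

# Two-stage least squares by hand:
# stage 1 keeps only the variation in d that comes from z,
# stage 2 regresses the outcome on that variation.
d_hat <- fitted(lm(d ~ z))
tsls  <- coefficients(lm(y ~ d_hat))[["d_hat"]]

c(ols = ols, tsls = tsls)
```

The naive estimate lands well above 2, while the two-stage estimate should sit close to it. Note that the standard errors reported by the second-stage `lm` are not the correct two-stage standard errors, so this hand-rolled version is only good for the point estimate.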

In the education example developed by Hyunseung, the instrumental variables are the subject's proximity to two-year and four-year colleges. Here is where the "rubber meets the road", so to speak. Assessing the relevance of the instrumental variables and interpreting their effects are subject to the kinds of difficulties described by Andrew Gelman in his post of a few years back.

In the second part of his presentation Hyunseung presents new work: (1) two methods that provide robust confidence intervals when assumption A1 is violated, (2) a method for implementing a sensitivity analysis to assess the sensitivity of an instrumental variable model to violations of assumptions A2 and A3, and (3) the R package ivmodel that ties it all together.

To delve even deeper into this topic, have a look at the paper: Instrumental Variables Estimation With Some Invalid Instruments and its Application to Mendelian Randomization.