*by Shaheen Gauher, PhD, Data Scientist at Microsoft*

At the heart of a classification model is the ability to assign a class to an object based on its description or features. When we build a classification model, we often have to show that it is significantly better than random guessing. How do we know if our machine learning model performs better than a classifier built by assigning labels arbitrarily (through a random guess, a weighted guess, etc.)? I will call the latter *non-machine learning classifiers*, as they do not learn from the data. A machine learning classifier should be smarter and should not be making just lucky guesses! At the least, it should do a better job of detecting the difference between classes and should have higher accuracy than these trivial alternatives. In the sections below, I will show three different ways to build a non-machine learning classifier and compute their accuracy. The purpose is to establish some baseline metrics against which we can evaluate our classification model.

In the examples below, I will assume we are working with data with population size \(n\). The data is divided into two groups, with \(x\%\) of the rows or instances belonging to one class (labeled positive or \(P\)) and \((1-x)\%\) belonging to another class (labeled negative or \(N\)). We will also assume that the majority of the data is labeled \(N\). (This is easily extended to data with more than two classes, as I show in the paper here). This is the ground truth.

Population Size \(=n\)

Fraction of instances labelled positive \(=x\)

Fraction of instances labelled negative \(=(1-x)\)

Number of instances labelled positive \((P)\) \(=xn\)

Number of instances labelled negative \((N)\) \(=(1-x)n\)

The confusion matrix with rows reflecting the ground truth and columns reflecting the machine learning classifier classifications looks like:

We can define some simple non-machine learning classifiers that assign labels based simply on the proportions found in the training data:

- **Random Guess Classifier**: randomly assign half of the labels to \(P\) and the other half to \(N\).
- **Weighted Guess Classifier**: randomly assign \(x\%\) of the labels to \(P\) and the remaining \((1-x)\%\) to \(N\).
- **Majority Class Classifier**: assign all of the labels to \(N\) (the majority class in the data).

The confusion matrices for these trivial classifiers would look like:

The standard performance metrics for evaluating classifiers are **accuracy**, **recall** and **precision**. (In a previous post, we included definitions of these metrics and how to compute them in R.) In this paper I algebraically derive the performance metrics for these non-machine learning classifiers. They are shown in the table below, and provide baseline metrics for comparing the performance of machine learning classifiers:

| Classifier     | Accuracy              | Recall   | Precision |
| -------------- | --------------------- | -------- | --------- |
| Random Guess   | \(0.5\)               | \(0.5\)  | \(x\)     |
| Weighted Guess | \(x^2 + (1-x)^2\)     | \(x\)    | \(x\)     |
| Majority Class | \(1-x\)               | \(0\)    | \(0\)     |
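These baseline values can be checked numerically. The sketch below (the function name and the example fraction are mine, not from the post) computes them for a given positive-class fraction \(x\):

```r
# Baseline metrics for the three trivial classifiers, given the
# positive-class fraction x (illustrative sketch)
baseline_metrics <- function(x) {
  data.frame(
    classifier = c("Random Guess", "Weighted Guess", "Majority Class"),
    accuracy   = c(0.5, x^2 + (1 - x)^2, 1 - x),
    recall     = c(0.5, x, 0),
    precision  = c(x, x, 0)
  )
}
baseline_metrics(0.24)  # e.g. roughly 24% positive instances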

In this experiment on the Cortana Analytics Gallery, you can follow a binary classification model built on the Census Income data set and see how the confusion matrices for the models compare with each other. An accuracy of 76% can be achieved simply by assigning all instances to the majority class.

Fig.: Confusion matrices for a binary classification model using a Boosted Decision Tree on the Census Income data, alongside models based on a random guess, a weighted guess, and all instances assigned to the majority class. The experiment can be found here.

For a **multiclass classification** with \(k\) classes, with \(x_i\) being the fraction of instances belonging to class \(i\) (\(i = 1\) to \(k\)), it can similarly be shown that:

**Random Guess**: Accuracy will be \(\frac{1}{k}\). The precision for a class \(i\) will be \(x_i\), the fraction of instances in the data with class \(i\). Recall for each class will be \(\frac{1}{k}\). In the language of probability, the accuracy is simply the probability of selecting a class, which for two classes (binary classification) is 0.5 and for \(k\) classes is \(\frac{1}{k}\).

**Weighted Guess**: Accuracy is equal to \(\sum_{i=1}^k x_{i}^2\). Precision and recall for a class are both equal to \(x_i\), the fraction of instances in the data with class \(i\). In the language of probability, \(\frac{xn}{n}\) or \(x\) is the probability of a label being positive in the data (for a negative label, the probability is \(\frac{(1-x)n}{n}\) or \((1-x)\)). If there are more negative instances in the data, the model has a higher probability of assigning an instance as negative. The probability of the model producing a true positive is \(x \cdot x\) (for a true negative it is \((1-x)^2\)). The accuracy is simply the sum of these two probabilities.

**Majority Class**: Accuracy will be equal to the fraction of instances belonging to the majority class (in the binary case above, \((1-x)\), since the negative label is assumed to be the majority). Recall will be 1 for the majority class and 0 for all other classes. Precision for the majority class will be equal to the fraction of instances belonging to it, and 0 for all other classes. In the language of probability, the probability of the model assigning a positive label to an instance is zero and the probability of assigning a negative label is 1.
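The multiclass formulas above are easy to verify numerically; in this sketch the class fractions are arbitrary illustrative values of my own choosing:

```r
# Numerical check of the multiclass baseline accuracies
x <- c(0.5, 0.3, 0.2)       # arbitrary class fractions summing to 1
k <- length(x)
random_acc   <- 1 / k       # random guess
weighted_acc <- sum(x^2)    # weighted guess: sum of squared fractions
majority_acc <- max(x)      # majority class
c(random = random_acc, weighted = weighted_acc, majority = majority_acc)
```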

*We can use these simple, non-machine learning classifiers as benchmarks with which to compare the performance of machine learning models.*

**MRS code** to compute the baseline metrics for a classification model (Decision Forest) using the Census Income data set can be found below. The corresponding **R code** can be found here.

Acknowledgement: Special thanks to Danielle Dean, Said Bleik, Dmitry Pechyoni and George Iordanescu for their input.

by Joseph Rickert

Random Forests, the "go to" classifier for many data scientists, is a fairly complex algorithm with many moving parts that introduces randomness at different levels. Understanding exactly how the algorithm operates requires some work, and assessing how good a Random Forests model fits the data is a serious challenge. In the pragmatic world of machine learning and data science, assessing model performance often comes down to calculating the area under the ROC curve (or some other convenient measure) on a hold out set of test data. If the ROC looks good then the model is good to go.

Fortunately, however, goodness of fit issues have a kind of nagging persistence that just won't leave statisticians alone. In a gem of a paper (and here) that sparkles with insight, the authors (Wager, Hastie and Efron) take considerable care to make things clear to the reader while showing how to calculate confidence intervals for Random Forests models.

Using the high ground approach favored by theorists, Wager et al. obtain their Random Forests result by solving a more general problem first: they derive estimates of the variance of bagged predictors that can be computed from the same bootstrap replicates that give the predictors. After pointing out that these estimators suffer from two distinct sources of noise:

- Sampling noise - noise resulting from the randomness of data collection
- Monte Carlo noise - noise that results from using a finite number of bootstrap replicates

they produce bias corrected versions of jackknife and infinitesimal jackknife estimators. A very nice feature of the paper is the way the authors illustrate the theory with simulation experiments and then describe the simulations in enough detail in an appendix for readers to replicate the results. I generated the following code and figure to replicate Figure 1, their first experiment, using the authors' GitHub based package, randomForestCI.

Here, I fit a randomForest model to eight features from the UCI MPG data set and use the randomForestInfJack() function to calculate the infinitesimal Jackknife estimator. (The authors use seven features, but the overall shape of the result is the same.)

```r
# Random Forest Confidence Intervals
install.packages("devtools")
library(devtools)
install_github("swager/randomForestCI")
library(randomForestCI)
library(dplyr)         # For data manipulation
library(randomForest)  # For random forest ensemble models
library(ggplot2)

# Fetch data from the UCI Machine Learning Repository
url <- paste0("https://archive.ics.uci.edu/ml/machine-learning-databases/",
              "auto-mpg/auto-mpg.data")
mpg <- read.table(url, stringsAsFactors = FALSE, na.strings = "?")
# Variable names are documented in:
# https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.names
names(mpg) <- c("mpg","cyl","disp","hp","weight","accel","year","origin","name")
head(mpg)

# Look at the data and reset some of the data types
dim(mpg); summary(mpg)
sapply(mpg, class)
mpg <- mutate(mpg, hp = as.numeric(hp),
              year = as.factor(year),
              origin = as.factor(origin))
head(mpg, 2)

# Function to divide data into training and test sets
index <- function(data = data, pctTrain = 0.7)
{
  # fcn to create indices to divide data into random
  # training and testing data sets
  N <- nrow(data)
  train <- sample(N, pctTrain * N)
  test <- setdiff(seq_len(N), train)
  Ind <- list(train = train, test = test)
  return(Ind)
}

set.seed(123)
ind <- index(mpg, 0.8)
length(ind$train); length(ind$test)

form <- formula("mpg ~ cyl + disp + hp + weight + accel + year + origin")
rf_fit <- randomForest(formula = form, data = na.omit(mpg[ind$train,]),
                       keep.inbag = TRUE)  # Build the model

# Plot the error as the number of trees increases
plot(rf_fit)

# Plot the important variables
varImpPlot(rf_fit, col = "blue", pch = 2)

# Calculate the variance
X <- na.omit(mpg[ind$test, -1])
var_hat <- randomForestInfJack(rf_fit, X, calibrate = TRUE)

# Have a look at the variance
head(var_hat); dim(var_hat); plot(var_hat)

# Plot the fit
df <- data.frame(y = mpg[ind$test,]$mpg, var_hat)
df <- mutate(df, se = sqrt(var.hat))
head(df)

p1 <- ggplot(df, aes(x = y, y = y.hat))
p1 + geom_errorbar(aes(ymin = y.hat - se, ymax = y.hat + se), width = .1) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, linetype = 2) +
  xlab("Reported MPG") +
  ylab("Predicted MPG") +
  ggtitle("Error Bars for Random Forests")
```

An interesting feature of the plot is that Random Forests doesn't appear to have the same confidence in all of its estimates, sometimes being less confident about estimates closer to the diagonal than those further away.

Don't forget to include confidence intervals with your next Random Forests model.

By Joseph Rickert

The ability to generate synthetic data with a specified correlation structure is essential to modeling work. As you might expect, R’s toolbox of packages and functions for generating and visualizing data from multivariate distributions is impressive. The basic function for generating multivariate normal data is mvrnorm() from the MASS package, which ships with the standard R distribution, although the mvtnorm package also provides functions for simulating both multivariate normal and t distributions. (For a tutorial on how to use R to simulate from multivariate normal distributions from first principles using some linear algebra and the Cholesky decomposition, see the astrostatistics tutorial on Multivariate Computations.)

The following block of code generates 5,000 draws from a bivariate normal distribution with mean (0,0) and the covariance matrix Sigma shown in the code. The function kde2d(), also from the MASS package, generates a two-dimensional kernel density estimate of the distribution's probability density function.

```r
# SIMULATING MULTIVARIATE DATA
# https://stat.ethz.ch/pipermail/r-help/2003-September/038314.html
# Let's first simulate a bivariate normal sample
library(MASS)

# Simulate bivariate normal data
mu <- c(0, 0)                        # Mean
Sigma <- matrix(c(1, .5, .5, 1), 2)  # Covariance matrix
# > Sigma
#      [,1] [,2]
# [1,]  1.0  0.5
# [2,]  0.5  1.0

# Generate sample from N(mu, Sigma)
bivn <- mvrnorm(5000, mu = mu, Sigma = Sigma)  # from MASS package
head(bivn)

# Calculate kernel density estimate
bivn.kde <- kde2d(bivn[,1], bivn[,2], n = 50)  # from MASS package
```

R offers several ways of visualizing the distribution. The next two lines of code overlay a contour plot on a "heat map" that maps the density of points to a gradient of colors.
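The two lines themselves are not reproduced in this excerpt; a minimal sketch using base graphics (the setup is repeated here so the snippet stands alone; the color count is my choice) might look like:

```r
# Recreate the kernel density estimate from the earlier steps
library(MASS)
set.seed(1)
bivn <- mvrnorm(5000, mu = c(0, 0), Sigma = matrix(c(1, .5, .5, 1), 2))
bivn.kde <- kde2d(bivn[, 1], bivn[, 2], n = 50)

# Overlay contour lines on a color "heat map" of the density
image(bivn.kde, col = heat.colors(32))  # density mapped to a color gradient
contour(bivn.kde, add = TRUE)           # contour lines drawn on top
```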

This plots the irregular contours of the simulated data. The code below, which uses the ellipse() function from the ellipse package, generates the classical bivariate normal distribution plot that graces many a textbook.

```r
# Classic Bivariate Normal Diagram
library(ellipse)
rho <- cor(bivn)
y_on_x <- lm(bivn[,2] ~ bivn[,1])  # Regression Y ~ X
x_on_y <- lm(bivn[,1] ~ bivn[,2])  # Regression X ~ Y
plot_legend <- c("99% CI green", "95% CI red", "90% CI blue",
                 "Y on X black", "X on Y brown")
plot(bivn, xlab = "X", ylab = "Y", col = "dark blue",
     main = "Bivariate Normal with Confidence Intervals")
lines(ellipse(rho), col = "red")                # ellipse() from ellipse package
lines(ellipse(rho, level = .99), col = "green")
lines(ellipse(rho, level = .90), col = "blue")
abline(y_on_x)
abline(x_on_y, col = "brown")
legend(3, 1, legend = plot_legend, cex = .5, bty = "n")
```

The next bit of code generates a couple of three-dimensional surface plots, the second of which is an rgl plot that you can rotate and view from different perspectives on your screen.
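The surface-plot code itself is not shown in this excerpt; a minimal sketch of the two plots (setup repeated so the snippet stands alone; the viewing angles and shading are my choices) could be:

```r
# Recreate the density estimate from the earlier steps
library(MASS)
set.seed(1)
bivn <- mvrnorm(5000, mu = c(0, 0), Sigma = matrix(c(1, .5, .5, 1), 2))
bivn.kde <- kde2d(bivn[, 1], bivn[, 2], n = 50)

# Static 3-D surface of the density estimate with base persp()
persp(bivn.kde, phi = 30, theta = 30, shade = 0.2, border = NA,
      xlab = "X", ylab = "Y", zlab = "Density")

# Interactive, rotatable surface via rgl, if it is installed
options(rgl.useNULL = TRUE)  # avoid opening a window when running headless
if (requireNamespace("rgl", quietly = TRUE)) {
  col1 <- heat.colors(length(bivn.kde$z))[rank(bivn.kde$z)]
  rgl::persp3d(x = bivn.kde, col = col1)
}
```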

Next, we have some code to unpack the grid coordinates produced by the kernel density estimator and get x, y, and z values to plot the surface using the new scatterplot3js() function from the threejs package, an htmlwidgets-based JavaScript library. This visualization does not render the surface with the same level of detail as the rgl plot. Nevertheless, it does show some of the salient features of the pdf and has the distinct advantage of being easily embedded in web pages. I expect that html widget plots will keep getting better and easier to use.

```r
# threejs JavaScript plot
library(threejs)

# Unpack data from the kde grid format
x <- bivn.kde$x; y <- bivn.kde$y; z <- bivn.kde$z

# Construct x, y, z coordinates
xx <- rep(x, times = length(y))
yy <- rep(y, each = length(x))
zz <- z; dim(zz) <- NULL

# Set up color range
ra <- ceiling(16 * zz / max(zz))
col <- rainbow(16, 2/3)

# 3D interactive scatter plot
scatterplot3js(x = xx, y = yy, z = zz, size = 0.4, color = col[ra], bg = "black")
```

The code that follows uses the rtmvt() function from the tmvtnorm package to generate a bivariate t distribution. The rgl plot renders the kernel density estimate of the surface in impressive detail.

```r
# Draw from a multivariate t distribution without truncation
library(tmvtnorm)
library(rgl)  # for persp3d()
Sigma <- matrix(c(1, .1, .1, 1), 2)  # Covariance matrix
X1 <- rtmvt(n = 1000, mean = rep(0, 2), sigma = Sigma, df = 2)  # from tmvtnorm package
t.kde <- kde2d(X1[,1], X1[,2], n = 50)  # from MASS package
col2 <- heat.colors(length(t.kde$z))[rank(t.kde$z)]
persp3d(x = t.kde, col = col2)
```

The real value of the multivariate distribution functions from the data science perspective is to simulate data sets with many more than two variables. The functions we have been considering are up to the task, but there are some technical considerations and, of course, we don't have the same options for visualization. The following code snippet generates 10 variables from a multivariate normal distribution with a specified covariance matrix. Note that I've used the genPositiveDefMat() function from the clusterGeneration package to generate the covariance matrix. This is because mvrnorm() will throw an error, as theory says it should, if the covariance matrix is not positive definite, and guessing a combination of matrix elements to make a high dimensional matrix positive definite would require quite a bit of luck along with some serious computation time.

After generating the matrix, I use the corrplot() function from the corrplot package to produce an attractive pairwise correlation plot that is coded both by shape and color. corrplot() scales pretty well with the number of variables and will give a decent chart with 40 to 50 variables. (Note that now ggcorrplot will do this for ggplot2 plots.) Other plotting options would be to generate pairwise scatter plots and R offers many alternatives for these.
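The snippet itself is not included in this excerpt; a minimal sketch along the lines described (the dimension, sample size, and corrplot style are my choices) might be:

```r
library(MASS)               # mvrnorm()
library(clusterGeneration)  # genPositiveDefMat()
library(corrplot)           # corrplot()

set.seed(42)
# Generate a random 10 x 10 positive definite covariance matrix
S <- genPositiveDefMat(dim = 10, covMethod = "unifcorrmat")$Sigma

# Draw 1,000 observations from N(0, S)
sim <- mvrnorm(1000, mu = rep(0, 10), Sigma = S)

# Pairwise correlation plot coded by both shape (ellipse) and color
corrplot(cor(sim), method = "ellipse")
```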

Finally, what about going beyond the multivariate normal and t distributions? R does have a few functions, like rlnorm() from the compositions package, which generates random variates from the multivariate lognormal distribution and is as easy to use as mvrnorm(), but you will have to hunt for them. I think a more fruitful approach, if you are serious about probability distributions, is to get familiar with the copula package.

*by Daniel Moore, Director of Applied Statistics Engineering, Console Development, Microsoft*

In Xbox Hardware, we are interested in the various ways that our hardware is used, and we are especially interested in how that usage changes over time. We employ several time series analysis techniques that help us get a holistic view of usage of the Xbox console. The actual data we look at could be usage of an individual game or app, or of specific features of the console. For all of these, there are a few parts of the time series that we are interested in. The first is the trend of usage over time. Many games are very popular when they are first released and then lose popularity as they age; some games remain steady in their popularity. The consoles themselves may show increased usage during holiday periods. All of these would be reflected in the overall trend of the data. We are also interested in the weekly cycle of usage. As you can imagine, use of a gaming console goes up on the weekend and down during the week. This is probably even more so among children than adults, as they have more time (and permission!) to play on non-school days than on school days.

We use R extensively to perform time series analysis. In this post we’ll explore the initial analysis and decomposition of the time series into its component parts. Though some packages offer more complete time series analysis options, the base version of R has some good built-in features for this initial analysis. The data for this example is the usage of a single game over more than a year on the Xbox One. This is done with the following code:

```r
data <- read.csv("dataset.csv")  # load the data set
plot(data, type = "l")  # plot the data first to get a look at it: a line plot
                        # showing weekly periodicity over a trend

# Define a time series from the data, with the first observation
# set at 1 and a weekly frequency
tsdata <- ts(data, start = 1, freq = 7)

decomposeddata <- stl(tsdata, s.window = 7)
plot(decomposeddata)  # four panes of line charts: the data, the sinusoidal
                      # seasonal fluctuations, the trend, and the remainder
```

The object “decomposeddata” is of class “stl” with several components useful for time series analysis. This uses loess to decompose the time series and many smoothing and other settings are available, depending on specific needs and the analysis being performed.

To highlight something that can be illuminated with this, look at the seasonal component of the chart below.

You’ll notice a period (highlighted) where the seasonal fluctuations dropped significantly. This is the summer time when school is out and weekdays/weekends blend together for kids. It’s always fun to find those artifacts in the data!

by Joseph Rickert

Quite a few times over the past few years I have highlighted presentations posted by R user groups on their websites and recommended these sites as a source for interesting material, but I have never thought to see what the user groups were doing on GitHub. As you might expect, many people who make presentations at R user group meetings make their code available on GitHub. However, as best I can tell, only a few R user groups are maintaining GitHub sites under the user group name.

The Indy UseR Group is one that seems to be making very good use of their GitHub site. Here is the link to a very nice tutorial from Shankar Vaidyaraman on using the rvest package to do some web scraping with R. The following code, which scrapes the first page from Springer's Use R! series to produce a short list of books, comes from Shankar's simple example.

```r
# Load libraries
library(rvest)
library(dplyr)
library(stringr)

# Link to Use R! titles at the Springer site
useRlink <- "http://www.springer.com/?SGWID=0-102-24-0-0&series=Use+R&sortOrder=relevance&searchType=ADVANCED_CDA&searchScope=editions&queryText=Use+R"

# Read the page
userPg <- useRlink %>% read_html()

# Get info on the books displayed on the page
booktitles <- userPg %>% html_nodes(".productGraphic img") %>% html_attr("alt")
bookyr <- userPg %>%
  html_nodes(xpath = "//span[contains(@class,'renditionDescription')]") %>%
  html_text()
bookauth <- userPg %>% html_nodes("span[class = 'displayBlock']") %>% html_text()
bookprice <- userPg %>%
  html_nodes(xpath = "//div[@class = 'bookListPriceContainer']//span[2]") %>%
  html_text()

pgdf <- data.frame(title = booktitles, pubyr = bookyr,
                   auth = bookauth, price = bookprice)
pgdf
```

This plot, which shows a list of books ranked by number of downloads, comes from Shankar's extended recommender example.

The Ann Arbor R User Group meetup site has done an exceptional job of creating an aesthetically pleasing and informative web property on their GitHub site.

I am particularly impressed with the way they have integrated news, content and commentary into their "News" section. Scroll down the page and have a look at the care taken to describe and document the presentations made to the group. I found the introduction and slides for Bob Carpenter's RStan presentation very well done.

Other RUGs active on GitHub include:

- Cambridge R User Group: meetup site and GitHub site
- Inland Northwest R User Group: GitHub only
- Twin Cities R User Group: meetup site and GitHub site
- UVa R Users Group: meetup site and GitHub site

If your R user group is on GitHub and I have not included you in my short list, please let me know about it. I think RUG GitHub sites have the potential for creating a rich code sharing experience among user groups. If you would like some help getting started with GitHub, have a look at the tutorials on the Murdoch University R User Group webpage.

by Joseph Rickert

In a previous post, I showed how some elementary properties of discrete time Markov chains could be calculated, mostly with functions from the markovchain package. In this post, I would like to show a little bit more of the functionality available in that package by fitting a Markov chain to some data. In this first block of code, I load the gold data set from the forecast package, which contains daily morning gold prices in US dollars from January 1, 1985 through March 31, 1989. Next, since there are a few missing values in the sequence, I impute them with a simple "ad hoc" process by substituting the previous day's price for one that is missing. There are two statements in the loop because there are a number of instances where there are two missing values in a row. Note that some kind of imputation is necessary because I will want to compute the autocorrelation of the series, and like many R functions, acf() does not like NAs (nor does it make sense to compute an autocorrelation with them).

```r
library(forecast)
library(markovchain)

data(gold)  # Load gold time series

# Impute missing values
gold1 <- gold
for (i in 1:length(gold)) {
  gold1[i] <- ifelse(is.na(gold[i]), gold[i-1], gold[i])
  gold1[i] <- ifelse(is.na(gold1[i]), gold1[i-1], gold1[i])
}

plot(gold1, xlab = "Days", ylab = "US$",
     main = "Gold prices 1/1/85 - 3/31/89")
```

This is an interesting series with over 1,000 points, but it is definitely not stationary, so it is not a good candidate for modeling as a Markov chain. The series produced by taking first differences is more reasonable. That series is flat, oscillating about a mean (0.07) slightly above zero, and the autocorrelation trails off as one might expect for a stationary series.

```r
# Take first differences to try to get a stationary series
goldDiff <- diff(gold1)
par(mfrow = c(1, 2))
plot(goldDiff, ylab = "", main = "1st differences of gold")
acf(goldDiff)
```

Next, we set up for modeling by constructing a series of labels. In this analysis, I settle for constructing a two state chain that reflects whether the series of differences assumes positive or non-positive values. Note that I have introduced a few zero values into the series because of my crude imputation process. Here, I just lump them in with the negatives.

```r
# Construct a series of labels
goldSign <- vector(length = length(goldDiff))
for (i in 1:length(goldDiff)) {
  goldSign[i] <- ifelse(goldDiff[i] > 0, "POS", "NEG")
}
```

Next, we can use some of the statistical tests built into the markovchain package to assess our assumptions so far. The function verifyMarkovProperty() attempts to verify that a sequence satisfies the Markov property by performing Chi squared tests on a series of contingency tables where the columns are sequences of past to present to future transitions and the rows are sequences of state transitions. Large p values indicate that one should not reject the null hypothesis that the Markov property holds for a specific transition. The output of verifyMarkovProperty() is a list with an entry for each possible state transition. The package vignette An introduction to the markovchain package presents the details of how this works. The following shows the output for the NEG to POS transitions.

```r
# Verify the Markov property
vmp <- verifyMarkovProperty(goldSign)
vmp[2]
# $NEGPOS
# $NEGPOS$statistic
# X-squared
#         0
#
# $NEGPOS$parameter
# df
#  1
#
# $NEGPOS$p.value
# [1] 1
#
# $NEGPOS$method
# [1] "Pearson's Chi-squared test with Yates' continuity correction"
#
# $NEGPOS$data.name
# [1] "table"
#
# $NEGPOS$observed
#     SSO TSO-SSO
# NEG 138     116
# POS 164     139
#
# $NEGPOS$expected
#          SSO  TSO-SSO
# NEG 137.7163 116.2837
# POS 164.2837 138.7163
#
# $NEGPOS$residuals
#             SSO     TSO-SSO
# NEG  0.02417181 -0.02630526
# POS -0.02213119  0.02408452
#
# $NEGPOS$stdres
#             SSO     TSO-SSO
# NEG  0.04843652 -0.04843652
# POS -0.04843652  0.04843652
#
# $NEGPOS$table
#     SSO TSO
# NEG 138 254
# POS 164 303
```

The assessOrder() function uses a chi-squared test to test the hypothesis that the sequence is consistent with a first order Markov chain.

```r
assessOrder(goldSign)
# The assessOrder test statistic is:  0.4142521
# the Chi-Square d.f. are:  2  the p-value is:  0.8129172
# $statistic
# [1] 0.4142521
#
# $p.value
# [1] 0.8129172
```

There is an additional function, assessStationarity(), to test for the stationarity of the sequence. However, in this case the chisq.test() function at the heart of things reports that the p-values are unreliable.
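For completeness, the call looks like this (a sketch; it assumes the goldSign sequence built above, and the choice of 5 blocks is mine):

```r
library(markovchain)
# assessStationarity() splits the sequence into blocks (here 5) and uses
# chi-squared tests to compare transition frequencies across the blocks
assessStationarity(goldSign, 5)
```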

Next, we use the markovchainFit() function to fit a Markov chain to the data using the maximum likelihood estimator explained in the vignette mentioned above. The output from the function includes the estimated transition matrix, the estimated standard errors, and lower and upper endpoint transition matrices that provide a confidence interval for the transition matrix.

```r
goldMC <- markovchainFit(data = goldSign, method = "mle", name = "gold mle")
goldMC$estimate
# gold mle
# A 2 - dimensional discrete Markov Chain with following states
# NEG POS
# The transition matrix (by rows) is defined as follows
#           NEG       POS
# NEG 0.4560144 0.5439856
# POS 0.5519126 0.4480874

goldMC$standardError
#            NEG        POS
# NEG 0.02861289 0.03125116
# POS 0.03170655 0.02856901

goldMC$confidenceInterval
# [1] 0.95
#
# $lowerEndpointMatrix
#           NEG       POS
# NEG 0.4089504 0.4925821
# POS 0.4997599 0.4010956
#
# $upperEndpointMatrix
#           NEG       POS
# NEG 0.5030784 0.5953892
# POS 0.6040652 0.4950793
```

The transition matrix does show some interesting structure, with the chain more likely to go from a negative to a positive value than to stay negative. And, once it is positive, the chain is more likely to go negative than to stay positive.

Finally, we use the predict() function to produce a three day, look ahead forecast for the situation where the series has been positive for the last two days.

```r
predict(object = goldMC$estimate, newdata = c("POS", "POS"), n.ahead = 3)
# "NEG" "POS" "NEG"
```

I still have not exhausted what is in the markovchain package. Perhaps in another post I will look at the functions for continuous time chains. What is presented here, though, should be enough to have some fun hunting for Markov chains in all kinds of data.

The scientific process has been going through a welcome period of introspection recently, with a focus on understanding just how reliable the results of scientific studies are. We're not talking here about scientific fraud, but about how the scientific process itself and the focus on p-values (which not even statisticians can easily explain) as the criterion for a positive result lead to a surprisingly large number of false positives being published. On top of that, there's the issue of publication bias (especially in the pharmaceutical industry), an area where Ben Goldacre has taken a lead. The whole issue is wrapped in the concept of reproducibility — the idea that independent researchers should be able to replicate the results of published studies — for which David Spiegelhalter gives a great primer in the video below.

So it's welcome news that one of the top science breakthroughs of 2015 according to *Science* and *Nature* is Brian Nosek's project to reproduce the results of 100 scientific studies published in psychology journals. The detailed methodology is described in this paper, but in short Nosek recruited replication teams to recreate the studies as described in the carefully-selected papers, and analyze the data they collect:

Moreover, to maximize reproducibility and accuracy, the analyses for every replication study were reproduced by another analyst independent of the replication team using the R statistical programming language and a standardized analytic format. A controller R script was created to regenerate the entire analysis of every study and recreate the master data file.

R is a natural fit for a reproducibility project like this: as a scripting language, the R script itself provides a reproducible documentation of every step of the process. (Revolution R Open, Microsoft's enhanced R distribution, additionally includes features to facilitate reproducibility when using R packages.) The R script used for the psychology replication project describes and executes the process for checking the results of the papers.

Of the 100 studies, 97 reported statistically significant effects. (This is itself a reflection of publication bias; studies where there is no effect rarely get published.) Yet of those 97 papers, in 61 cases the reported significant results could not be replicated when the study was repeated. Their conclusion:

A large portion of replications produced weaker evidence for the original findings despite using materials provided by the original authors, review in advance for methodological fidelity, and high statistical power to detect the original effect sizes.

Studying the scientific method itself in this way can only improve the scientific process, and the project is deserving of its accolade as a breakthrough. Read more about the project and the replicated studies at the link below.

Open Science Framework: Estimating the Reproducibility of Psychological Science (via Solomon Messing)

by Joseph Rickert

There are a number of R packages devoted to sophisticated applications of Markov chains. These include msm and SemiMarkov for fitting multistate models to panel data, mstate for survival analysis applications, TPmsm for estimating transition probabilities for 3-state progressive disease models, heemod for applying Markov models to health care economic applications, HMM and depmixS4 for fitting Hidden Markov Models, and mcmc for working with Markov chain Monte Carlo. All of these assume considerable knowledge of the underlying theory. To my knowledge, only DTMCPack and the relatively recent markovchain package were written to facilitate basic computations with Markov chains.

In this post, we'll explore some basic properties of discrete time Markov chains using the functions provided by the markovchain package, supplemented with standard R functions and a few functions from other contributed packages. Chapter 11 of Snell's online probability book will be our guide: the calculations displayed here illustrate some of the theory developed there, and section numbers in the text below refer to that document.

A large part of working with discrete time Markov chains involves manipulating the matrix of transition probabilities associated with the chain. This first section of code replicates the Oz transition probability matrix from section 11.1 and uses the plotmat() function from the diagram package to illustrate it. Then, the efficient operator %^% from the expm package is used to raise the Oz matrix to the third power. Finally, left-multiplying Oz^3 by the distribution vector u = (1/3, 1/3, 1/3) gives the weather forecast three days ahead.

```r
library(expm)
library(markovchain)
library(diagram)
library(pracma)

stateNames <- c("Rain", "Nice", "Snow")
Oz <- matrix(c(.5, .25, .25, .5, 0, .5, .25, .25, .5),
             nrow = 3, byrow = TRUE)
row.names(Oz) <- stateNames; colnames(Oz) <- stateNames
Oz
#      Rain Nice Snow
# Rain 0.50 0.25 0.25
# Nice 0.50 0.00 0.50
# Snow 0.25 0.25 0.50

plotmat(Oz, pos = c(1, 2),
        lwd = 1, box.lwd = 2,
        cex.txt = 0.8,
        box.size = 0.1,
        box.type = "circle",
        box.prop = 0.5,
        box.col = "light yellow",
        arr.length = .1,
        arr.width = .1,
        self.cex = .4,
        self.shifty = -.01,
        self.shiftx = .13,
        main = "")

Oz3 <- Oz %^% 3
round(Oz3, 3)
#       Rain  Nice  Snow
# Rain 0.406 0.203 0.391
# Nice 0.406 0.188 0.406
# Snow 0.391 0.203 0.406

u <- c(1/3, 1/3, 1/3)
round(u %*% Oz3, 3)
# 0.401 0.198 0.401
```
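As a quick sanity check, the same three-day forecast can be reproduced in a few lines of plain Python. This is purely an illustrative cross-check of the arithmetic, not part of the original R workflow:

```python
# Multiply the Oz matrix by itself twice to get Oz^3, then left-multiply by
# the uniform distribution u = (1/3, 1/3, 1/3), matching the R result above.
def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

Oz = [[.5, .25, .25], [.5, 0, .5], [.25, .25, .5]]
Oz3 = mat_mul(mat_mul(Oz, Oz), Oz)
u = [1/3, 1/3, 1/3]
forecast = [round(sum(u[i] * Oz3[i][j] for i in range(3)), 3) for j in range(3)]
print(forecast)  # [0.401, 0.198, 0.401]
```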

The igraph package can also be used to draw Markov chain diagrams, but I prefer the "drawn on a chalkboard" look of plotmat.

This next block of code reproduces the 5-state Drunkard's Walk example from section 11.2, which presents the fundamentals of absorbing Markov chains. First, the transition matrix describing the chain is instantiated as an object of the S4 class markovchain. Then, functions from the markovchain package are used to identify the absorbing and transient states of the chain and to place the transition matrix, P, into canonical form.

```r
p <- c(.5, 0, .5)
dw <- c(1, rep(0, 4), p, 0, 0, 0, p, 0, 0, 0, p, rep(0, 4), 1)
DW <- matrix(dw, 5, 5, byrow = TRUE)
DWmc <- new("markovchain",
            transitionMatrix = DW,
            states = c("0", "1", "2", "3", "4"),
            name = "Drunkard's Walk")
DWmc
# Drunkard's Walk
# A 5 - dimensional discrete Markov Chain with following states
# 0 1 2 3 4
# The transition matrix (by rows) is defined as follows
#     0   1   2   3   4
# 0 1.0 0.0 0.0 0.0 0.0
# 1 0.5 0.0 0.5 0.0 0.0
# 2 0.0 0.5 0.0 0.5 0.0
# 3 0.0 0.0 0.5 0.0 0.5
# 4 0.0 0.0 0.0 0.0 1.0

# Determine transient states
transientStates(DWmc)
# [1] "1" "2" "3"

# Determine absorbing states
absorbingStates(DWmc)
# [1] "0" "4"
```

In canonical form, the transition matrix, P, is partitioned into the identity matrix, I, a matrix of 0's, a matrix, Q, containing the transition probabilities among the transient states, and a matrix, R, containing the transition probabilities from transient states to absorbing states.

Next, we find the fundamental matrix, N, by inverting (I - Q). For transient states i and j, n_{ij} gives the expected number of times the process is in state j given that it started in state i. The row sum u_{i} is the expected time until absorption given that the process starts in state i. Finally, we compute the matrix B = NR, where b_{ij} is the probability that the process will be absorbed in state j given that it starts in state i.

```r
# Extract the Q (or R) block from a canonical-form transition matrix
getRQ <- function(M, type = "Q") {
  if (length(absorbingStates(M)) == 0) stop("Not Absorbing Matrix")
  tm <- M@transitionMatrix
  d <- diag(tm)
  m <- max(which(d == 1))
  n <- length(d)
  if (type == "Q") {
    A <- tm[(m+1):n, (m+1):n]
  } else {
    A <- tm[(m+1):n, 1:m]
  }
  return(A)
}

# Put DWmc into canonical form
P <- canonicForm(DWmc)
P
Q <- getRQ(P)

# Find the fundamental matrix
I <- diag(dim(Q)[2])
N <- solve(I - Q)
N
#     1 2   3
# 1 1.5 1 0.5
# 2 1.0 2 1.0
# 3 0.5 1 1.5

# Calculate time to absorption
c <- rep(1, dim(N)[2])
u <- N %*% c
u
# 1 3
# 2 4
# 3 3

R <- getRQ(P, "R")
B <- N %*% R
B
#      0    4
# 1 0.75 0.25
# 2 0.50 0.50
# 3 0.25 0.75
```
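The fundamental-matrix arithmetic above is easy to verify independently. Here is a small plain-Python sketch (an illustrative cross-check only; the Q and R blocks are copied from the canonical form above) that recomputes N, the expected times to absorption, and B:

```python
# Recompute N = (I - Q)^-1, t = N*1 and B = N*R for the Drunkard's Walk.
def inv(A):
    # Gauss-Jordan inversion with partial pivoting
    n = len(A)
    M = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        M[c] = [v / M[c][c] for v in M[c]]
        for r in range(n):
            if r != c:
                M[r] = [a - M[r][c] * b for a, b in zip(M[r], M[c])]
    return [row[n:] for row in M]

def mat_mul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

Q = [[0, .5, 0], [.5, 0, .5], [0, .5, 0]]   # transient-to-transient block
R = [[.5, 0], [0, 0], [0, .5]]              # transient-to-absorbing block
I = [[1., 0, 0], [0, 1., 0], [0, 0, 1.]]
N = inv([[I[i][j] - Q[i][j] for j in range(3)] for i in range(3)])
t = [sum(row) for row in N]                 # expected steps to absorption
B = mat_mul(N, R)                           # absorption probabilities
print([round(v, 3) for v in t])  # [3.0, 4.0, 3.0]
```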

For section 11.3, which deals with regular and ergodic Markov chains, we return to Oz and provide four options for calculating the steady state, or limiting, probability distribution for this regular transition matrix. The first three options involve standard methods readily available in R. Method 1 uses %^% to raise the matrix Oz to a sufficiently high power. Method 2 computes the eigenvector corresponding to the eigenvalue 1, and method 3 uses the nullspace() function from the pracma package to compute the null space, or kernel, of the linear transformation associated with the matrix. The fourth method uses the steadyStates() function from the markovchain package. To use this function, we first convert Oz into a markovchain object.

```r
# 11.3 Ergodic Markov Chains
# Four methods to get steady states

# Method 1: compute powers of the matrix
round(Oz %^% 6, 2)
#      Rain Nice Snow
# Rain  0.4  0.2  0.4
# Nice  0.4  0.2  0.4
# Snow  0.4  0.2  0.4

# Method 2: compute the eigenvector for eigenvalue 1
eigenOz <- eigen(t(Oz))
ev <- eigenOz$vectors[,1] / sum(eigenOz$vectors[,1])
ev

# Method 3: compute the null space of t(P - I)
I <- diag(3)
ns <- nullspace(t(Oz - I))
ns <- round(ns / sum(ns), 2)
ns

# Method 4: use the steadyStates() function in the markovchain package
OzMC <- new("markovchain",
            states = stateNames,
            transitionMatrix =
              matrix(c(.5, .25, .25, .5, 0, .5, .25, .25, .5),
                     nrow = 3,
                     byrow = TRUE,
                     dimnames = list(stateNames, stateNames)))
steadyStates(OzMC)
```
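The limiting behaviour that all four methods exploit can also be seen directly by power iteration: repeatedly applying the transition matrix to any starting distribution converges to the steady state. A plain-Python illustration (not part of the original R workflow):

```python
# Power iteration on the Oz matrix: starting from certainty of "Rain",
# the distribution converges to the steady state (0.4, 0.2, 0.4),
# the same answer the four R methods above produce.
Oz = [[.5, .25, .25], [.5, 0, .5], [.25, .25, .5]]
pi = [1.0, 0.0, 0.0]
for _ in range(50):
    pi = [sum(pi[i] * Oz[i][j] for i in range(3)) for j in range(3)]
print([round(p, 3) for p in pi])  # [0.4, 0.2, 0.4]
```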

The steadyStates() function seems to be reasonably efficient for fairly large Markov chains. The following code creates a 5,000 row by 5,000 column regular Markov matrix. On my modest Lenovo ThinkPad ultrabook it took a little less than 2 minutes to create the markovchain object and about 11 minutes to compute the steady state distribution.

```r
# Create a large random regular matrix
randReg <- function(N){
  M <- matrix(runif(N^2, min = 1, max = N), nrow = N, ncol = N)
  rowS <- rowSums(M)
  regM <- M / rowS          # normalize rows to sum to 1
  return(regM)
}

N <- 5000
M <- randReg(N)
# rowSums(M)

system.time(regMC <- new("markovchain", states = as.character(1:N),
                         transitionMatrix = M,
                         name = "M"))
#   user  system elapsed
#  98.33    0.82   99.46

system.time(ss <- steadyStates(regMC))
#   user  system elapsed
# 618.47    0.61  640.05
```

We conclude this little Markov chain excursion by using the rmarkovchain() function to simulate a trajectory from the process represented by this large random matrix and plotting the results. This seems to be a reasonable method for simulating a stationary time series in a way that makes it easy to control the limits of its variability.

```r
# Sample a trajectory from regMC and plot it
library(ggplot2)
regMCts <- rmarkovchain(n = 1000, object = regMC)
regMCtsDf <- as.data.frame(regMCts, stringsAsFactors = FALSE)
regMCtsDf$index <- 1:1000
regMCtsDf$regMCts <- as.numeric(regMCtsDf$regMCts)

p <- ggplot(regMCtsDf, aes(index, regMCts))
p + geom_line(colour = "dark red") +
  xlab("time") +
  ylab("state") +
  ggtitle("Random Markov Chain")
```

For more on the capabilities of the markovchain package do have a look at the package vignette. For more theory on both discrete and continuous time Markov processes illustrated with R see Norm Matloff's book: From Algorithms to Z-Scores: Probabilistic and Statistical Modeling in Computer Science.

*by Andrie de Vries*

A week ago my high school friend, @XLRunner, sent me a link to the article "How Zach Bitter Ran 100 Miles in Less Than 12 Hours". Zach's effort was rewarded with the American record for the 100 mile event.

This reminded me of some analysis I did, many years ago, of the world record speeds for various running distances. The International Amateur Athletics Federation (IAAF) keeps track of world records for distances from 100m up to the marathon (42km). The distances longer than 42km do not fall in the IAAF event list, but these are also tracked by various other organisations.

You can find a list of IAAF world records at Wikipedia, and a list of ultramarathon world best times also at Wikipedia.

I extracted only the men's running events from these lists, and used R to plot the average running speeds for these records:

You can immediately see that speed declines very rapidly after the sprint events. Perhaps it would be better to plot this using a logarithmic x-scale, adding some labels at the same time. I also added some colour for what I call standard events, where "standard" means the type of distance you would see regularly at a world championships or Olympic Games. Thus the mile is "standard", but the 2,000m race is not.

Now our data points lie in somewhat more of a straight line, meaning we could consider fitting a linear regression.

However, it seems that there might be two kinks in the line:

- The first kink occurs somewhere between the 800m distance and the mile. It seems that the sprinting distances (the 800m is sometimes called a long sprint) have different dynamics from the events up to the marathon.
- And then there is another kink for the ultra-marathon distances. The standard marathon is 42.2km, and distances longer than this are called ultramarathons.

Also, note that the speed for the 100m is actually slower than for the 200m. This reflects the cost of getting going from a standing start - clearly this plays a large role in the very short sprint distances.

For the analysis below, I excluded the data for:

- The 100m sprint (transition effects play too large a role)
- The ultramarathon distances (they are raced less frequently, and something strange seems to be happening in the data for the 50km race in particular).

To fit a regression line with kinks, more properly known as a segmented regression (sometimes called piecewise regression), you can use the segmented package, available on CRAN.

The `segmented()` function allows you to modify a fitted object of class `lm` or `glm`, specifying which of the independent variables should have segments (kinks). In my case, I fitted a linear model with a single variable (log of distance), and allowed `segmented()` to find a single kink point.
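The idea behind a segmented fit can be sketched in a few lines: regress y on an ordinary linear term plus a hinge term max(0, x − ψ), and choose the breakpoint ψ that minimises the residual sum of squares. The plain-Python sketch below does this by grid search on synthetic data with a known kink at x = 4; it is purely illustrative (the segmented package uses a more efficient iterative estimation algorithm, and the record data are not reproduced here):

```python
# Broken-stick regression: y ~ b0 + b1*x + b2*max(0, x - psi),
# with the breakpoint psi chosen by grid search over candidates.
def solve3(A, b):
    # Gaussian elimination (partial pivoting) for a 3x3 system
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(3):
        p = max(range(c, 3), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(3):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * v for a, v in zip(M[r], M[c])]
    return [M[i][3] / M[i][i] for i in range(3)]

def fit(x, y, psi):
    # Least squares via the normal equations for the 3-column design matrix
    X = [[1.0, xi, max(0.0, xi - psi)] for xi in x]
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
    Xty = [sum(r[i] * yk for r, yk in zip(X, y)) for i in range(3)]
    beta = solve3(XtX, Xty)
    sse = sum((yk - sum(b * v for b, v in zip(beta, r))) ** 2
              for r, yk in zip(X, y))
    return beta, sse

x = [i * 0.5 for i in range(21)]                         # 0, 0.5, ..., 10
y = [2 + xi + (2 * (xi - 4) if xi > 4 else 0.0) for xi in x]  # kink at x = 4
grid = [i * 0.25 for i in range(4, 37)]                  # candidates 1.0 .. 9.0
best = min(((psi,) + fit(x, y, psi) for psi in grid), key=lambda t: t[2])
print(round(best[0], 2))  # 4.0
```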

My analysis indicates that there is a kink point at 1.13km (10^0.055 ≈ 1.13), i.e. between the 800m event and the 1,000m event.

```r
> summary(sfit)

***Regression Model with Segmented Relationship(s)***

Call:
segmented.lm(obj = lfit, seg.Z = ~logDistance)

Estimated Break-Point(s):
   Est. St.Err
  0.055  0.021

Meaningful coefficients of the linear terms:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)     27.2064     0.1755  155.04  < 2e-16 ***
logDistance    -15.1305     0.4332  -34.93 1.94e-13 ***
U1.logDistance  11.2046     0.4536   24.70       NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2373 on 12 degrees of freedom
Multiple R-Squared: 0.9981, Adjusted R-squared: 0.9976

Convergence attained in 4 iterations with relative change -4.922372e-16
```

The final plot shows the same data, but this time with the segmented regression line also displayed.

I conclude:

- It is really easy to fit a segmented linear regression model using the segmented package.
- There seems to be a different physiological process for the sprint events and the middle distance events. The segmented regression finds this kink point between the 800m event and the 1,000m event.
- The ultramarathon distances have a completely different dynamic. However, it's not clear to me whether this is due to inherent physiological constraints, or vastly reduced competition in these "non-standard" events.
- The 50km world record seems too "slow". Perhaps the competition for this event is less intense than for the marathon?

Here is my code for the analysis: