If you're looking to get started with data science in R, a great place to start is OnePageR by Graham Williams. (Graham is the creator of Rattle, author of Data Mining with Rattle and R, and Director of Data Science at Microsoft.) This free (CC-licensed) resource is a series of hands-on mini-chapters and associated R code, organized into four main topic areas:

- **Data Science**: introductions to data science, data mining, literate programming, and the R language
- **Dealing with Data**: reading data files and open access data, basic explorations and visualizations, and two case studies
- **Building Models**: tutorials for many kinds of models, including association analysis, ensemble models, and multivariate adaptive regression splines
- **Advanced R and Analytics**: topics including writing functions, parallel processing, and text mining

Example data files are provided for use in all chapters. There are also two very useful appendix chapters: an R Style Guide and a guide to creating an R package.

OnePageR is a continual work in progress, and is regularly updated to incorporate advances in R and the R package ecosystem. To download the materials, follow the link below.

Togaware: OnePageR

A great way to learn is by doing, so if you've been thinking about how to enable R-based computations within SQL Server, a new tutorial will take you through all the steps of building an intelligent application. In a few simple steps, you'll set up all the necessary software and code to build a live service that predicts demand for a ski rental shop.

The tutorial "Create a predictive model in SQL Server R Services" will step you through:

- Installing SQL Server, Microsoft R Client, and an IDE for R (RTVS or RStudio) on your Windows machine. All of the components are available for free from Visual Studio Dev Essentials.
- Loading the provided ski-rental data, exploring the data, and building a model to predict number of rentals in R.
- Creating a stored procedure to predict rentals from the date and the amount of snowfall.

The tutorial also links to a number of SQL Server Management Studio custom reports which will be helpful to R developers to manage R packages on the server, monitor R execution statistics, and more.

To get started with the tutorial, simply follow the link below.

SQL Server: Build an intelligent app with SQL Server and R

Today is Talk Like A Pirate Day, the perfect day to learn R, the programming language of pirates (arrr, matey!). If you have two-and-a-bit hours to spare, Nathaniel Phillips has created a video tutorial, YaRrr! The Pirate's Guide to R, which will take you through the basics: installation, basic R operations, and the matrix and data frame objects.

For a more in-depth study of R, there's also a 250-page e-book YaRrr! The Pirate’s Guide to R which goes into the basics in more depth, and covers more advanced topics including data visualization, statistical analysis, and writing your own functions.

There's also an accompanying package to the video and book called (appropriately) yarrr that includes datasets from the course and also an interesting "Pirate Plot" data visualization that combines raw data, summary statistics, a "bean plot" distribution, and a confidence interval.

For more on The Pirate's Guide to R (and to tip him a beer), follow the link to Nathaniel's blog below.

Nathaniel Phillips: YaRrr! The Pirate’s Guide to R

by Joseph Rickert

My guess is that a good many statistics students first encounter the bivariate Normal distribution as one or two hastily covered pages in an introductory text book, and then don't think much about it again until someone asks them to generate two random variables with a given correlation structure. Fortunately for R users, a little searching on the internet will turn up several nice tutorials with R code explaining various aspects of the bivariate Normal. For this post, I have gathered together a few examples and tweaked the code a little to make comparisons easier.

Here are five different ways to simulate random samples from a bivariate Normal distribution with a given mean and covariance matrix.

To set up for the simulations, this first block of code defines N, the number of random samples to simulate, the means of the random variables, and the covariance matrix. It also provides a small function for drawing confidence ellipses on the simulated data.

library(mixtools) #for ellipse

N <- 200 # Number of random samples

set.seed(123)

# Target parameters for univariate normal distributions

rho <- -0.6

mu1 <- 1; s1 <- 2

mu2 <- 1; s2 <- 8

# Parameters for bivariate normal distribution

mu <- c(mu1,mu2) # Mean

sigma <- matrix(c(s1^2, s1*s2*rho, s1*s2*rho, s2^2),

2) # Covariance matrix

# Function to draw ellipse for bivariate normal data

ellipse_bvn <- function(bvn, alpha){

Xbar <- apply(bvn,2,mean)

S <- cov(bvn)

ellipse(Xbar, S, alpha = alpha, col="red")

}

The first method, the way to go if you just want to get on with it, is to use the mvrnorm() function from the MASS package.

library(MASS)

bvn1 <- mvrnorm(N, mu = mu, Sigma = sigma ) # from MASS package

colnames(bvn1) <- c("bvn1_X1","bvn1_X2")

It takes so little code to do the simulation it might be possible to tweet in a homework assignment.

A look at the source code for mvrnorm() shows that it uses eigenvectors to generate the random samples. The documentation for the function states that this method was selected because it is more stable than the alternative of using a Cholesky decomposition, which might be faster.

For the second method, let's go ahead and directly generate bivariate Normal random variates with the Cholesky decomposition. Remember that the Cholesky decomposition of sigma (a positive definite matrix) yields a matrix M such that M times its transpose gives sigma back again. Multiplying M by a matrix of standard Normal random variates and adding the desired mean gives a matrix of the desired random samples. A lecture from Colin Rundel covers some of the theory.
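To make the eigenvector idea concrete, here is a minimal base-R sketch of sampling through an eigendecomposition. It is a simplified illustration, not the MASS source code; the names A, Z and bvn_eig, the seed, and the sample size are assumptions made for this example.

```r
# Simplified sketch of eigendecomposition sampling (not the MASS implementation)
set.seed(2016)                            # arbitrary seed for this sketch
mu    <- c(1, 1)
sigma <- matrix(c(4, -9.6, -9.6, 64), 2)  # same target covariance as in the post

e <- eigen(sigma, symmetric = TRUE)
# With sigma = V diag(lambda) t(V), the matrix A = V diag(sqrt(lambda))
# satisfies A %*% t(A) = sigma
A <- e$vectors %*% diag(sqrt(pmax(e$values, 0)))

n <- 1e5
Z <- matrix(rnorm(2 * n), nrow = 2)       # 2 x n standard Normal draws
bvn_eig <- t(A %*% Z + mu)                # mu is recycled down each column

round(A %*% t(A), 6)                      # reproduces sigma
round(cov(bvn_eig), 1)                    # close to sigma for large n
```

Clamping the eigenvalues at zero with pmax() is one simple way to guard against tiny negative values from rounding error.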

M <- t(chol(sigma))

# M %*% t(M)

Z <- matrix(rnorm(2*N),2,N) # 2 rows, N columns

bvn2 <- t(M %*% Z) + matrix(rep(mu,N), byrow=TRUE,ncol=2)

colnames(bvn2) <- c("bvn2_X1","bvn2_X2")
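The Cholesky construction is easy to sanity check. The sketch below, which uses only base R, confirms that the transposed factor reproduces sigma and that the simulated samples have roughly the right mean and covariance. The seed and the sample size n are arbitrary choices for this illustration.

```r
# Check the Cholesky construction with base R only
set.seed(42)                              # arbitrary seed for this sketch
mu    <- c(1, 1)
sigma <- matrix(c(4, -9.6, -9.6, 64), 2)

U <- chol(sigma)                          # chol() returns the UPPER triangular factor
M <- t(U)                                 # lower triangular, so M %*% t(M) = sigma
stopifnot(isTRUE(all.equal(M %*% t(M), sigma)))

n <- 1e5
Z <- matrix(rnorm(2 * n), nrow = 2)       # 2 rows, n columns
X <- t(M %*% Z + mu)                      # n x 2 matrix of samples

round(colMeans(X), 2)                     # close to mu
round(cov(X), 1)                          # close to sigma
```

Because chol() returns the upper triangular factor, forgetting the transpose is an easy mistake to make: U %*% t(U) is generally not equal to sigma.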

For the third method we make use of a special property of the bivariate Normal that is discussed in almost all of those elementary textbooks. If X_{1} and X_{2} are jointly Normal random variables with means m_{1} and m_{2}, standard deviations s_{1} and s_{2}, and correlation r, then the conditional distribution of X_{2} given X_{1} is itself Normal with mean = m_{2} + r(s_{2}/s_{1})(X_{1} - m_{1}) and variance = (1 - r^{2})s_{2}^{2}.

Hence, a sample from a bivariate Normal distribution can be simulated by first simulating a point from the marginal distribution of one of the random variables and then simulating from the second random variable conditioned on the first. A brief proof of the underlying theorem is available here.

rbvn<-function (n, mu1, s1, mu2, s2, rho) # argument names now match those used in the body

{

X1 <- rnorm(n, mu1, s1)

X2 <- rnorm(n, mu2 + (s2/s1) * rho *

(X1 - mu1), sqrt((1 - rho^2)*s2^2))

cbind(X1, X2)

}

bvn3 <- rbvn(N,mu1,s1,mu2,s2,rho)

colnames(bvn3) <- c("bvn3_X1","bvn3_X2")
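One way to convince yourself that the conditional construction is right is to regress the simulated X2 on X1: the fitted slope should be close to rho * s2 / s1, and the residual standard deviation close to sqrt(1 - rho^2) * s2. The following is a small base-R sketch of that check; the seed and sample size are arbitrary assumptions.

```r
# Check the conditional-distribution method with a simple regression
set.seed(1)                                # arbitrary seed for this sketch
n   <- 1e5
mu1 <- 1; s1 <- 2
mu2 <- 1; s2 <- 8
rho <- -0.6

X1 <- rnorm(n, mu1, s1)                                   # marginal draw
X2 <- rnorm(n, mu2 + rho * (s2 / s1) * (X1 - mu1),        # conditional mean
            sqrt(1 - rho^2) * s2)                         # conditional sd

fit <- lm(X2 ~ X1)
slope     <- coef(fit)[["X1"]]       # compare with rho * s2 / s1
resid_sd  <- summary(fit)$sigma      # compare with sqrt(1 - rho^2) * s2
c(slope = slope, target = rho * s2 / s1)
c(resid_sd = resid_sd, target = sqrt(1 - rho^2) * s2)
cor(X1, X2)                          # compare with rho
```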

The fourth method, my favorite, comes from Professor Darren Wilkinson's Gibbs Sampler tutorial. This is a very nice idea: using the familiar bivariate Normal distribution to illustrate the basics of the Gibbs Sampling Algorithm. Note that this looks very much like the previous method, except that now we are alternately sampling from the full conditional distributions.

gibbs<-function (n, mu1, s1, mu2, s2, rho)

{

mat <- matrix(ncol = 2, nrow = n)

x <- 0

y <- 0

mat[1, ] <- c(x, y)

for (i in 2:n) {

x <- rnorm(1, mu1 +

(s1/s2) * rho * (y - mu2), sqrt((1 - rho^2)*s1^2))

y <- rnorm(1, mu2 +

(s2/s1) * rho * (x - mu1), sqrt((1 - rho^2)*s2^2))

mat[i, ] <- c(x, y)

}

mat

}

bvn4 <- gibbs(N,mu1,s1,mu2,s2,rho)

colnames(bvn4) <- c("bvn4_X1","bvn4_X2")
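Unlike the other methods, successive Gibbs draws are correlated, and a chain started far from the target needs a few iterations to forget its starting point. The sketch below is a variant of the sampler written for illustration only: the function name gibbs_bvn, the start argument, the deliberately bad starting point, and the burn-in length are all assumptions, not part of the original tutorial.

```r
# Gibbs sampler variant with a configurable starting point (illustrative sketch)
gibbs_bvn <- function(n, mu1, s1, mu2, s2, rho, start = c(0, 0)) {
  out <- matrix(NA_real_, nrow = n, ncol = 2)
  x <- start[1]
  y <- start[2]
  for (i in seq_len(n)) {
    # Alternate draws from the two full conditional distributions
    x <- rnorm(1, mu1 + rho * (s1 / s2) * (y - mu2), sqrt(1 - rho^2) * s1)
    y <- rnorm(1, mu2 + rho * (s2 / s1) * (x - mu1), sqrt(1 - rho^2) * s2)
    out[i, ] <- c(x, y)
  }
  out
}

set.seed(2)                                  # arbitrary seed for this sketch
chain   <- gibbs_bvn(5000, 1, 2, 1, 8, -0.6, start = c(50, -200))
burn_in <- 100                               # hypothetical burn-in length
kept    <- chain[-seq_len(burn_in), ]

round(colMeans(kept), 2)                     # close to (1, 1)
sample_rho <- cor(kept)[1, 2]                # close to -0.6
lag1 <- acf(kept[, 1], plot = FALSE)$acf[2]  # lag-1 autocorrelation of X1
c(sample_rho = sample_rho, lag1 = lag1)
```

For this sampler each coordinate follows an AR(1)-like recursion with coefficient rho^2, so the lag-1 autocorrelation should be near 0.36 here. That is mild, which is part of why the Gibbs output in this post looks so much like the independent samples.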

The fifth and final way uses the rmvnorm() function from the mvtnorm package with the singular value decomposition method selected. The functions in this package are overkill for what we are doing here, but mvtnorm is probably the package you would want to use if you are calculating probabilities from high dimensional multivariate distributions. It implements numerical methods for carefully calculating the high dimensional integrals involved that are based on some papers by Professor Alan Genz dating from the early '90s. These methods are briefly explained in the package vignette.

library (mvtnorm)

bvn5 <- mvtnorm::rmvnorm(N,mu,sigma, method="svd")

colnames(bvn5) <- c("bvn5_X1","bvn5_X2")

Note that I have used the :: operator here to make sure that R uses the rmvnorm() function from the mvtnorm package. There is also a rmvnorm() function in the mixtools package that I used to get the ellipse function. Loading the packages in the wrong order could lead to the rookie mistake of having the function you want inadvertently overwritten.
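The same masking effect can be reproduced without installing anything. In this small sketch a user-defined function (hypothetical, created only for the demonstration) shadows sd() from the stats package, and the :: operator still reaches the original.

```r
# Demonstrate masking and the :: operator with base R only
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

sd <- function(v) "masked version"   # hypothetical function that shadows stats::sd
masked_result <- sd(x)               # calls the function defined just above
true_result   <- stats::sd(x)        # :: always reaches the version in stats

masked_result
true_result

rm(sd)                               # remove the mask; sd() is stats::sd again
```

Calling conflicts(detail = TRUE) after loading your packages is a quick way to list which names are currently masked.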

Next, we plot the results of drawing just N = 200 random samples for each method. This allows us to see how the algorithms spread data over the sample space as they are just getting started.

bvn <- list(bvn1,bvn2,bvn3,bvn4,bvn5)

par(mfrow=c(3,2))

plot(bvn1, xlab="X1",ylab="X2",main= "All Samples")

for(i in 2:5){

points(bvn[[i]],col=i)

}

for(i in 1:5){

item <- paste("bvn",i,sep="")

plot(bvn[[i]],xlab="X1",ylab="X2",main=item, col=i)

ellipse_bvn(bvn[[i]],.5)

ellipse_bvn(bvn[[i]],.05)

}

par(mfrow=c(1,1))

The first plot shows the samples from all five methods together, color coded by the method with which they were generated. The remaining plots show the samples generated by each method separately. In each of these plots the ellipses mark the 0.5 and 0.95 probability regions, i.e. the areas within the ellipses should contain roughly 50% and 95% of the points respectively. Note that bvn4, which uses the Gibbs sampling algorithm, looks like all of the rest. In most use cases for the Gibbs sampler it takes the algorithm some time to converge to the target distribution. In our case, we start out with a pretty good guess.

Finally, a word about accuracy: nice coverage of the sample space is not sufficient to produce accurate results. A little experimentation will show that, for all of the methods outlined above, regularly achieving a sample covariance matrix that is close to the target, sigma, requires something on the order of 10,000 samples, as is illustrated below.

> sigma

[,1] [,2]

[1,] 4.0 -9.6

[2,] -9.6 64.0

for(i in 1:5){

print(round(cov(bvn[[i]]),1))

}

bvn1_X1 bvn1_X2

bvn1_X1 4.0 -9.5

bvn1_X2 -9.5 63.8

bvn2_X1 bvn2_X2

bvn2_X1 3.9 -9.5

bvn2_X2 -9.5 64.5

bvn3_X1 bvn3_X2

bvn3_X1 4.1 -9.8

bvn3_X2 -9.8 63.7

bvn4_X1 bvn4_X2

bvn4_X1 4.0 -9.7

bvn4_X2 -9.7 64.6

bvn5_X1 bvn5_X2

bvn5_X1 4.0 -9.6

bvn5_X2 -9.6 65.3
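The "little experimentation" can be sketched in a few lines of base R. The code below uses the Cholesky method to draw samples of increasing size and reports the average worst-case error in the sample covariance matrix. The helper max_err(), the sample sizes, and the 20 replications are choices made for this illustration only.

```r
# How the sample covariance error shrinks as the number of samples grows
set.seed(123)                                # arbitrary seed for this sketch
sigma <- matrix(c(4, -9.6, -9.6, 64), 2)
M     <- t(chol(sigma))

# Largest absolute entrywise error of the sample covariance for n draws
max_err <- function(n) {
  X <- t(M %*% matrix(rnorm(2 * n), nrow = 2))
  max(abs(cov(X) - sigma))
}

sizes <- c(100, 1000, 10000, 100000)
errs  <- sapply(sizes, function(n) mean(replicate(20, max_err(n))))
data.frame(n = sizes, mean_max_abs_error = round(errs, 2))
```

The error falls roughly in proportion to 1/sqrt(n), so each extra digit of accuracy costs about one hundred times as many samples.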

Many people coming to R for the first time find it disconcerting to realize that there are several ways to do some fundamental calculation in R. My take is that rather than being a point of frustration, having multiple options indicates the richness of the R language. A close look at the package documentation will often show that yet another method to do something is a response to some subtle need that was not previously addressed. Enjoy the diversity!

If you've heard about Data Science but don't really understand what it's all about, you might want to check out the 5-part video series Data Science for Beginners presented by my colleague Brandon Rohrer, senior data scientist at Microsoft. Each video is short (5-10 minutes) and explains an aspect of data science without any assumed knowledge or technical jargon. The videos are embedded below (follow the links for a transcript and additional details). If you're puzzled by predictive models, check out Video 4 in particular, where Brandon explains the linear regression model (and confidence intervals!) without using a single equation. To try some of the models out yourself, Video 5 explains how to search the Cortana Intelligence Gallery to find pre-built experiments implementing techniques you may be interested in.

Video 1: The 5 questions data science answers

Video 2: Is your data ready for data science?

Video 3: Ask a question you can answer with data

Video 4: Predict an answer with a simple model

Video 5: Copy other people's work to do data science

There's no denying that for a language as popular as R, it has more than its fair share of quirks. If you've ever wondered *why*, for example, R has a non-standard assignment operator, or why periods are allowed in symbols (and don't signify method calls), or why character data imports as factors (not strings) by default, then this blog post by Oliver Keyes is for you. If you're new to R, it's worth checking out for some common traps in R's syntax. And if you're a longtime R user, it's an interesting (and entertaining) look into R's history, and how various influences (including ancient keyboards) have shaped the design of the world's most popular language for data science.

Oliver Keyes: Rbitrary Standards
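For readers who have not met these quirks yet, here is a tiny base-R sketch of the three mentioned above. The variable names are made up for the example. The stringsAsFactors argument is passed explicitly because its default was TRUE when this was written and changed to FALSE in R 4.0.0.

```r
# Three R quirks in a few lines (illustrative names only)
x <- 5                   # the usual assignment operator
5 -> y                   # assignment also works from left to right
z = 5                    # "=" is accepted at the top level too

my.var <- x + y + z      # the period is just part of the name, not a method call

# Character data as factors versus strings, made explicit so the result
# does not depend on the R version
df_factor <- data.frame(g = c("a", "b", "a"), stringsAsFactors = TRUE)
df_string <- data.frame(g = c("a", "b", "a"), stringsAsFactors = FALSE)
class(df_factor$g)
class(df_string$g)
```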

If you're new to the concept of predictive models, or just want to review the background on how data scientists learn from past data to predict the future, you may be interested in my talk from the Data Insights Summit, Introduction to Real-Time Predictive Modeling.

In the talk above I gave a brief introduction to the R language and mentioned several applications using R. If you'd like to get started with R, you might like to follow along with my co-blogger Joseph Rickert's beginners workshop, Supercharge Your Data Analysis With R:

You can follow along with Joe's session by downloading Microsoft R Open and using the scripts in this GitHub repository.

by Joseph Rickert

If you are an R user and work for an enterprise where Microsoft SQL Server is important, it is extremely helpful to have easy access to SQL Server databases. Over the past year, we have run several posts on this topic, including a comprehensive four-part series from Microsoft's Gregory Vandenbrouck on using various flavors of SQL with Azure as a data source (Part1, Part2, Part3 and Part4), as well as several posts on using the advanced features of Microsoft R Server (formerly Revolution R Enterprise) with SQL Server 2016. (See, for example, this recent post: Credit Card Fraud Detection with SQL Server 2016 R Services.)

In this post, I would just like to describe how to connect to an Azure SQL database from your local R session. Setting up an Azure hosted database is the easiest way I know for an R user to get started with Microsoft SQL Server. You don't have to install the database, and very little SQL knowledge is necessary to begin working with a database. All of the heavy lifting is done by the Azure platform.

The only prerequisite is to get an Azure account. If you don't already have an account, signing up for a free trial account, which you can do here, will give you enough credits to experiment with working with SQL Server from R.

The first step after getting an account is to log in to the Azure portal. You should see a screen that looks like the figure at right.

From here, clicking on the SQL database icon and selecting "New" should bring you to a screen that looks similar to the one below. The first time you create a database you will be asked to provide a server admin logon name and a password. Remember these because they will be necessary to form your complete connection string.

Next, select a name for your database, and click on the tab "blank database". This will bring you to a screen that looks like the figure below. Copy the text in the "ODBC" box. This is everything that you need to form a complete connection string except for the password you created above. Note that the text in the "ODBC" box will show your logon ID but not your password.

Now that the preliminaries are out of the way, you should be able to populate your database with the following code. The RODBC package is used to communicate with SQL Server. The nycflights13 package provides a convenient test data set with over 300,000 rows and 16 columns. The connection string is what I copied from the Azure ODBC text box described above, except that I have replaced my login ID and password with the placeholder text "your_logon_ID" and "your_password_here". (Let this serve as a gentle reminder not to store your credentials in the source code.) The command odbcDriverConnect() opens the ODBC connection to the database, but the real work is done by sqlSave(), which populates the database with the flights data. This took about 40 minutes running from an R script in RStudio on my Lenovo ThinkPad.

# CONNECT TO AN AZURE SQL DATABASE

library(RODBC) # Provides database connectivity

library(nycflights13) # Some sample data

library(dplyr) # only used for the nice format of the head() output here

# The connection string comes from the Azure ODBC text box

connectionString <- "Driver={SQL Server Native Client 11.0};Server=tcp:hzgi1l8nwn.database.windows.net,1433;Database=Test_R2;Uid=your_logon_ID@hzgi1l8nwn;Pwd={your_password_here};Encrypt=yes;TrustServerCertificate=no;Connection Timeout=30;"

# Open your RODBC connection

myconn <- odbcDriverConnect(connectionString)

# Get some data

# We use the New York City 2013 flight data from the package nycflights13

dim(flights)

#[1] 336776 16

head(flights)

# Source: local data frame [6 x 16]

#

# year month day dep_time dep_delay arr_time arr_delay carrier tailnum

# (int) (int) (int) (int) (dbl) (int) (dbl) (chr) (chr)

# 1 2013 1 1 517 2 830 11 UA N14228

# 2 2013 1 1 533 4 850 20 UA N24211

# 3 2013 1 1 542 2 923 33 AA N619AA

# 4 2013 1 1 544 -1 1004 -18 B6 N804JB

# 5 2013 1 1 554 -6 812 -25 DL N668DN

# 6 2013 1 1 554 -4 740 12 UA N39463

# Variables not shown: flight (int), origin (chr), dest (chr), air_time (dbl),

# distance (dbl), hour (dbl), minute (dbl)

# Save the table to the database

sqlSave(channel=myconn, dat=flights, tablename = "flightsTbl")
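One way to act on the reminder about credentials is to keep the logon ID and password in environment variables and assemble the connection string at run time. The sketch below uses only base R. The function name build_conn_string() and the variable names AZURE_SQL_UID and AZURE_SQL_PWD are hypothetical, chosen for this example, and the demo values stand in for real credentials.

```r
# Build the connection string from environment variables (hypothetical names)
build_conn_string <- function(server, database,
                              uid = Sys.getenv("AZURE_SQL_UID"),
                              pwd = Sys.getenv("AZURE_SQL_PWD")) {
  if (!nzchar(uid) || !nzchar(pwd)) {
    stop("Set AZURE_SQL_UID and AZURE_SQL_PWD before connecting")
  }
  paste0("Driver={SQL Server Native Client 11.0};",
         "Server=tcp:", server, ",1433;",
         "Database=", database, ";",
         "Uid=", uid, ";Pwd={", pwd, "};",
         "Encrypt=yes;TrustServerCertificate=no;Connection Timeout=30;")
}

# Demo values only; in practice set these outside the script, e.g. in ~/.Renviron
Sys.setenv(AZURE_SQL_UID = "demo_user@demo_server", AZURE_SQL_PWD = "demo_password")
conn_string <- build_conn_string("demo_server.database.windows.net", "Test_R2")
nchar(conn_string) > 0
```

The resulting string can then be passed to odbcDriverConnect() in place of a hard-coded one.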

The following code executes a SQL query and displays some of the data returned. odbcCloseAll() closes the ODBC connection.

# Fetch the flights from January and February (month < 3) from the Azure SQL DB

sqlQuery_m1 <- "SELECT * FROM flightsTbl WHERE month < 3"

m1 <- sqlQuery(myconn, sqlQuery_m1)

head(m1)

# rownames year month day dep_time dep_delay arr_time arr_delay carrier

# 1 1 2013 1 1 517 2 830 11 UA

# 2 2 2013 1 1 533 4 850 20 UA

# 3 3 2013 1 1 542 2 923 33 AA

# 4 4 2013 1 1 544 -1 1004 -18 B6

# 5 5 2013 1 1 554 -6 812 -25 DL

# 6 6 2013 1 1 554 -4 740 12 UA

# tailnum flight origin dest air_time distance hour minute

# 1 N14228 1545 EWR IAH 227 1400 5 17

# 2 N24211 1714 LGA IAH 227 1416 5 33

# 3 N619AA 1141 JFK MIA 160 1089 5 42

# 4 N804JB 725 JFK BQN 183 1576 5 44

# 5 N668DN 461 LGA ATL 116 762 5 54

# 6 N39463 1696 EWR ORD 150 719 5 54

dim(m1)

# 51955 17

odbcCloseAll()

To go further, have a look at this tutorial on writing Microsoft SQL queries, or get a more in depth introduction to Microsoft SQL Server here.

We had a fantastic turnout to last week's webinar, Introduction to Microsoft R Open. If you missed it, you can watch the replay below. In the talk, I give some background on the R language and its applications, describe the performance and reproducibility benefits of Microsoft R Open, and give a demonstration of the basics of the R language along with a more in-depth demo of producing a beautiful weather data chart with R.

You can also check out the slides (and follow the embedded links) below:

Check out the other webinars in the Microsoft R Series as well.

Microsoft Webinars: Introduction to Microsoft R Open

by Joseph Rickert

There are a number of R packages devoted to sophisticated applications of Markov chains. These include msm and SemiMarkov for fitting multistate models to panel data, mstate for survival analysis applications, TPmsm for estimating transition probabilities for 3-state progressive disease models, heemod for applying Markov models to health care economic applications, HMM and depmixS4 for fitting Hidden Markov Models, and mcmc for working with Markov Chain Monte Carlo. All of these assume some considerable knowledge of the underlying theory. To my knowledge only DTMCPack and the relatively recent package, markovchain, were written to facilitate basic computations with Markov chains.

In this post, we’ll explore some basic properties of discrete time Markov chains using the functions provided by the markovchain package supplemented with standard R functions and a few functions from other contributed packages. Chapter 11 of Grinstead and Snell’s online probability book will be our guide. The calculations displayed here illustrate some of the theory developed in this document. In the text below, section numbers refer to this document.

A large part of working with discrete time Markov chains involves manipulating the matrix of transition probabilities associated with the chain. This first section of code replicates the Oz transition probability matrix from section 11.1 and uses the plotmat() function from the diagram package to illustrate it. Then, the efficient operator %^% from the expm package is used to raise the Oz matrix to the third power. Finally, left matrix multiplication of Oz^3 by the distribution vector u = (1/3, 1/3, 1/3) gives the weather forecast three days ahead.

library(expm)

library(markovchain)

library(diagram)

library(pracma)

stateNames <- c("Rain","Nice","Snow")

Oz <- matrix(c(.5,.25,.25,.5,0,.5,.25,.25,.5),

nrow=3, byrow=TRUE)

row.names(Oz) <- stateNames; colnames(Oz) <- stateNames

Oz

# Rain Nice Snow

# Rain 0.50 0.25 0.25

# Nice 0.50 0.00 0.50

# Snow 0.25 0.25 0.50

plotmat(Oz,pos = c(1,2),

lwd = 1, box.lwd = 2,

cex.txt = 0.8,

box.size = 0.1,

box.type = "circle",

box.prop = 0.5,

box.col = "light yellow",

arr.length=.1,

arr.width=.1,

self.cex = .4,

self.shifty = -.01,

self.shiftx = .13,

main = "")

Oz3 <- Oz %^% 3

round(Oz3,3)

# Rain Nice Snow

# Rain 0.406 0.203 0.391

# Nice 0.406 0.188 0.406

# Snow 0.391 0.203 0.406

u <- c(1/3, 1/3, 1/3)

round(u %*% Oz3,3)

#0.401 0.198 0.401
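If the expm package is not available, the same three-day forecast can be reproduced with plain matrix multiplication. The helper mat_pow() below is a hypothetical name for a naive repeated-multiplication loop, written for this illustration; it is not meant to compete with %^% for large powers.

```r
# Three-day-ahead forecast for the Land of Oz with base R only
stateNames <- c("Rain", "Nice", "Snow")
Oz <- matrix(c(.5, .25, .25,
               .5,  0,  .5,
               .25, .25, .5),
             nrow = 3, byrow = TRUE,
             dimnames = list(stateNames, stateNames))

# Naive matrix power by repeated multiplication (hypothetical helper)
mat_pow <- function(P, k) {
  out <- diag(nrow(P))
  for (i in seq_len(k)) out <- out %*% P
  dimnames(out) <- dimnames(P)
  out
}

u <- c(1/3, 1/3, 1/3)                  # starting distribution over the states
forecast <- round(u %*% mat_pow(Oz, 3), 3)
forecast
```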

The igraph package can also be used to draw Markov chain diagrams, but I prefer the “drawn on a chalkboard” look of plotmat.

This next block of code reproduces the 5-state Drunkard’s Walk example from section 11.2, which presents the fundamentals of absorbing Markov chains. First, the transition matrix describing the chain is instantiated as an object of the S4 class markovchain. Then, functions from the markovchain package are used to identify the absorbing and transient states of the chain and place the transition matrix, P, into canonical form.

p <- c(.5,0,.5)

dw <- c(1,rep(0,4),p,0,0,0,p,0,0,0,p,rep(0,4),1)

DW <- matrix(dw,5,5,byrow=TRUE)

DWmc <-new("markovchain",

transitionMatrix = DW,

states = c("0","1","2","3","4"),

name = "Drunkard's Walk")

DWmc

# Drunkard's Walk

# A 5 - dimensional discrete Markov Chain with following states

# 0 1 2 3 4

# The transition matrix (by rows) is defined as follows

# 0 1 2 3 4

# 0 1.0 0.0 0.0 0.0 0.0

# 1 0.5 0.0 0.5 0.0 0.0

# 2 0.0 0.5 0.0 0.5 0.0

# 3 0.0 0.0 0.5 0.0 0.5

# 4 0.0 0.0 0.0 0.0 1.0

# Determine transient states

transientStates(DWmc)

#[1] "1" "2" "3"

# determine absorbing states

absorbingStates(DWmc)

#[1] "0" "4"

In canonical form, the transition matrix, P, is partitioned into the identity matrix, I, a matrix of 0’s, the matrix, Q, containing the transition probabilities among the transient states, and a matrix, R, containing the probabilities of moving from the transient states to the absorbing states.

Next, we find the Fundamental Matrix, N, by inverting (I – Q). Each entry n_{ij} of N gives the expected number of times the process is in transient state j given that it started in transient state i. Multiplying N by a column vector of ones gives u, where u_{i} is the expected time until absorption given that the process starts in state i. Finally, we compute the matrix B = NR, where b_{ij} is the probability that the process will be absorbed in state j given that it starts in state i.

# Find Matrix Q

getRQ <- function(M,type="Q"){

if(length(absorbingStates(M)) == 0) stop("Not Absorbing Matrix")

tm <- M@transitionMatrix

d <- diag(tm)

m <- max(which(d == 1))

n <- length(d)

# Use if/else rather than ifelse(): type is a single string and we want one sub-matrix back

A <- if (type == "Q") tm[(m+1):n,(m+1):n] else tm[(m+1):n,1:m]

return(A)

}

# Put DWmc into Canonical Form

P <- canonicForm(DWmc)

P

Q <- getRQ(P)

# Find Fundamental Matrix

I <- diag(dim(Q)[2])

N <- solve(I - Q)

N

# 1 2 3

# 1 1.5 1 0.5

# 2 1.0 2 1.0

# 3 0.5 1 1.5

# Calculate time to absorption

c <- rep(1,dim(N)[2])

u <- N %*% c

u

# 1 3

# 2 4

# 3 3

R <- getRQ(P,"R")

B <- N %*% R

B

# 0 4

# 1 0.75 0.25

# 2 0.50 0.50

# 3 0.25 0.75
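The whole absorbing-chain calculation can also be written with base R alone, which makes the roles of Q, R, N and B easy to see. The sketch below rebuilds the Drunkard's Walk matrix and selects the blocks by state name, so it does not depend on the ordering produced by canonicForm(). The variable names are choices made for this example.

```r
# Absorbing chain quantities for the Drunkard's Walk with base R only
states <- as.character(0:4)
P <- matrix(0, 5, 5, dimnames = list(states, states))
P["0", "0"] <- 1                             # absorbing barrier at 0
P["4", "4"] <- 1                             # absorbing barrier at 4
for (s in 1:3) {
  P[states[s + 1], states[s]]     <- 0.5     # step left
  P[states[s + 1], states[s + 2]] <- 0.5     # step right
}

absorbing <- states[diag(P) == 1]
transient <- setdiff(states, absorbing)

Q <- P[transient, transient]                 # transient to transient
R <- P[transient, absorbing]                 # transient to absorbing

N     <- solve(diag(length(transient)) - Q)  # fundamental matrix
steps <- N %*% rep(1, length(transient))     # expected time to absorption
B     <- N %*% R                             # absorption probabilities

N
steps
B
```

The results agree with the output shown above: expected absorption times of 3, 4 and 3 steps, and a 0.75 chance of ending at 0 when starting from state 1.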

For section 11.3, which deals with regular and ergodic Markov chains, we return to Oz and provide four options for calculating the steady state, or limiting probability distribution, for this regular transition matrix. The first three options involve standard methods which are readily available in R. Method 1 uses %^% to raise the matrix Oz to a sufficiently high power. Method 2 calculates the eigenvector for the eigenvalue 1, and method 3 uses the nullspace() function from the pracma package to compute the null space, or kernel, of the linear transformation associated with the matrix. The fourth method uses the steadyStates() function from the markovchain package. To use this function, we first convert Oz into a markovchain object.

# 11.3 Ergodic Markov Chains

# Four methods to get steady states

# Method 1: compute powers on Matrix

round(Oz %^% 6,2)

# Rain Nice Snow

# Rain 0.4 0.2 0.4

# Nice 0.4 0.2 0.4

# Snow 0.4 0.2 0.4

# Method 2: Compute eigenvector of eigenvalue 1

eigenOz <- eigen(t(Oz))

ev <- eigenOz$vectors[,1] / sum(eigenOz$vectors[,1])

ev

# Method 3: compute null space of (P - I)

I <- diag(3)

ns <- nullspace(t(Oz - I))

ns <- round(ns / sum(ns),2)

ns

# Method 4: use function in markovchain package

OzMC<-new("markovchain",

states=stateNames,

transitionMatrix=

matrix(c(.5,.25,.25,.5,0,.5,.25,.25,.5),

nrow=3,

byrow=TRUE,

dimnames=list(stateNames,stateNames)))

steadyStates(OzMC)
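A further option, shown here only as a sketch, is to treat the steady state as the solution of a small linear system: the row vector w must satisfy wP = w and its entries must sum to 1. Stacking those conditions gives four equations in three unknowns, which base R's qr.solve() can handle as a least squares problem. The names A, b and w are assumptions made for this example.

```r
# Steady state of Oz from a linear system, base R only
stateNames <- c("Rain", "Nice", "Snow")
Oz <- matrix(c(.5, .25, .25,
               .5,  0,  .5,
               .25, .25, .5),
             nrow = 3, byrow = TRUE,
             dimnames = list(stateNames, stateNames))

# w %*% Oz = w  is equivalent to  (t(Oz) - I) %*% w = 0; add the constraint sum(w) = 1
A <- rbind(t(Oz) - diag(3), rep(1, 3))
b <- c(0, 0, 0, 1)

w <- qr.solve(A, b)        # least squares solution of the consistent 4 x 3 system
names(w) <- stateNames
round(w, 3)
```

The solution matches the limiting rows from Method 1: 0.4 for Rain, 0.2 for Nice, and 0.4 for Snow.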

The steadyStates() function seems to be reasonably efficient for fairly large Markov Chains. The following code creates a 5,000 row by 5,000 column regular Markov matrix. On my modest Lenovo ThinkPad ultrabook it took a little less than 2 minutes to create the markovchain object and about 11 minutes to compute the steady state distribution.

# Create a large random regular matrix

randReg <- function(N){

M <- matrix(runif(N^2,min=1,max=N),nrow=N,ncol=N)

rowS <- rowSums(M)

regM <- M/rowS

return(regM)

}

N <- 5000

M <-randReg(N)

#rowSums(M)

system.time(regMC <- new("markovchain", states = as.character(1:N),

transitionMatrix = M,

name = "M"))

# user system elapsed

# 98.33 0.82 99.46

system.time(ss <- steadyStates(regMC))

# user system elapsed

# 618.47 0.61 640.05

We conclude this little Markov Chain excursion by using the rmarkovchain() function to simulate a trajectory from the process represented by this large random matrix and plot the results. It seems that this is a reasonable method for simulating a stationary time series in a way that makes it easy to control the limits of its variability.

#sample from regMC

regMCts <- rmarkovchain(n=1000,object=regMC)

regMCtsDf <- as.data.frame(regMCts,stringsAsFactors = FALSE)

regMCtsDf$index <- 1:1000

regMCtsDf$regMCts <- as.numeric(regMCtsDf$regMCts)

library(ggplot2)

p <- ggplot(regMCtsDf,aes(index,regMCts))

p + geom_line(colour="dark red") +

xlab("time") +

ylab("state") +

ggtitle("Random Markov Chain")
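To see what a function like rmarkovchain() has to do under the hood, here is a minimal base-R simulation of the much smaller Oz chain. The function simulate_chain() is a hypothetical helper written for this sketch, not the package's implementation. With a long enough run, the fraction of time spent in each state should approach the steady state distribution.

```r
# Simulate a trajectory from the Oz chain with base R only
stateNames <- c("Rain", "Nice", "Snow")
Oz <- matrix(c(.5, .25, .25,
               .5,  0,  .5,
               .25, .25, .5),
             nrow = 3, byrow = TRUE,
             dimnames = list(stateNames, stateNames))

# Hypothetical helper: draw each new state from the row of the current state
simulate_chain <- function(P, n, start = 1) {
  idx <- integer(n)
  idx[1] <- start
  for (t in 2:n) {
    idx[t] <- sample.int(nrow(P), 1, prob = P[idx[t - 1], ])
  }
  rownames(P)[idx]
}

set.seed(11)                                   # arbitrary seed for this sketch
path  <- simulate_chain(Oz, 20000)
freqs <- table(factor(path, levels = stateNames)) / length(path)
round(freqs, 3)                                # compare with 0.4, 0.2, 0.4
```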

For more on the capabilities of the markovchain package do have a look at the package vignette. For more theory on both discrete and continuous time Markov processes illustrated with R see Norm Matloff's book: From Algorithms to Z-Scores: Probabilistic and Statistical Modeling in Computer Science.