by Joseph Rickert

A frequent question that we get here at Microsoft about MRO (Microsoft R Open) is: can it be used with RStudio? The short answer is absolutely yes! In fact, more than just being compatible, MRO is the perfect complement for the RStudio environment. MRO is a downstream distribution of open source R that supports multiple operating systems and provides features that enhance the performance and reproducible use of the R language. RStudio, being much more than a simple IDE, provides several features, such as the tight integration of knitr, R Markdown and Shiny, that promote literate programming, the creation of reproducible code, and sharing and collaboration. Together, MRO and RStudio make a powerful combination. Before elaborating on this theme, I should first make it clear how to select MRO from the RStudio IDE. After you have installed MRO on your system, open RStudio, go to the "Tools" tab at the top, and select "Global Options". You should see a couple of pop-up windows like the screen capture below. If RStudio is not already pointing to MRO (as it is in the screen capture), browse to it and click "OK".

One feature of MRO that dovetails nicely with RStudio is the way that MRO is tied to a fixed repository. Every day, at precisely midnight UTC, the infrastructure that supports the MRO distribution takes a snapshot of CRAN and stores it on Microsoft's MRAN site. (You can browse through the snapshots, back to September 17, 2014, from the CRAN Time Machine.) Each MRO release is pre-configured to point to a particular CRAN snapshot. MRO 3.2.3, for example, points to CRAN as it was on January 1, 2016. Everyone who downloads MRO is guaranteed to start from a common baseline that reflects CRAN and all of its packages as they existed at a particular point in time. This provides an enormous advantage for corporations and collaborating teams of R programmers, who can be sure that they are at least starting off on the same page, all working with the same CRAN release and a consistent view of the universe of R packages.
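You can see this baseline from any MRO session, since MRO pre-sets the repos option to its fixed MRAN snapshot. A minimal check (the snapshot URL shown in the comment is illustrative for MRO 3.2.3; your release may point elsewhere):

```r
# Inspect the repository this R session is configured to use;
# under MRO it is a fixed MRAN snapshot rather than a CRAN mirror
getOption("repos")
# e.g. "https://mran.revolutionanalytics.com/snapshot/2016-01-01"
```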

However, introducing the discipline of a fixed repository into the RStudio workflow is not completely frictionless. Occasionally, the stars don’t line up perfectly and an RStudio user, or any other user who needs a particular version of a CRAN package for some reason, may have to take some action. For example, I recently downloaded MRO 3.2.3, fired up RStudio and thought “sure, why not” when reminded that a newer version of RStudio was available. Then, I clicked to create a new R Markdown file and was immediately startled by an error message saying that the available rmarkdown package was not the version required by RStudio. The easy fix, of course, was to point to a repository containing a more recent version of rmarkdown than the one associated with the default snapshot date. If this happens to you, either of the following will take care of things:

To get the latest version of the rmarkdown package, use:

install.packages("rmarkdown", repos = "https://cran.revolutionanalytics.com")

To get version 0.9.2 of the rmarkdown package, use:

install.packages("rmarkdown", repos = "https://mran.revolutionanalytics.com/snapshot/2016-01-02")

Apparently, by chance, we missed setting a snapshot date for MRO that would have been convenient for RStudio users by just one day.

A second way that MRO fits into RStudio is the way that the checkpoint package, which installs with MRO, can enhance the reproducibility power of RStudio’s project directory structure. If you choose a new directory when setting up a new RStudio project, and then run the checkpoint() function from that project, checkpoint will set up a local repository in a subdirectory of the project directory. For example, executing the following two lines of code from a script in the MyProject directory will install all packages required by your project as they were at midnight UTC on the specified date.

library(checkpoint)

checkpoint("2016-01-29")

Versions of all of the packages that are called out by scripts in your MyProject directory, as they existed on CRAN on January 29, 2016, will be installed in a subfolder for MyProject underneath ~/.checkpoint. Unless you use the same checkpoint date for other projects, the packages for MyProject will be independent of packages installed for those other projects. This kind of project-specific structure is very helpful for keeping things straight. It provides a reproducibility and code-sharing layer on top of (or maybe underneath) RStudio's GitHub integration and other reproducibility features. When you want to share code with a colleague, they don't need to manually install all of the packages ahead of time. Just have them clone your GitHub repository, or put your code into their own RStudio project in some other way, and then run checkpoint() from there. checkpoint will search through the scripts in their project and install the versions of the packages they need.

Finally, I should mention that MRO can enhance any project by providing multi-threaded processing for the code underlying many of the R functions you will be using. R functions that rely on linear algebra operations under the hood, such as matrix multiplication or Cholesky decomposition, will get a considerable performance boost. (Look here for some benchmarks.) On Linux and Windows platforms, users can enable multi-threaded processing by downloading and installing the Intel Math Kernel Library (MKL) when they install MRO from the MRAN site. Mac OS X users get multithreading automatically because MRO comes pre-configured to use the Mac Accelerate Framework.
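To get a feel for the difference multi-threading makes, you can time a large matrix multiplication yourself. A quick sketch (the matrix size is arbitrary; absolute timings will vary by machine):

```r
# Time a 2000 x 2000 matrix multiplication; with MKL or the
# Accelerate Framework enabled, elapsed time drops noticeably
# compared to single-threaded R
set.seed(1)
m <- matrix(rnorm(2000 * 2000), nrow = 2000)
system.time(p <- m %*% m)
dim(p)  # 2000 2000
```

Running the same code under plain open source R and under MRO with MKL is a simple way to see the speedup on your own hardware.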

Let us know if you use RStudio with MRO.

Like many modern cities, New York offers a public pick-up/drop-off bicycle service (called Citi Bike). Subscribing Citi Bike members can grab a bike from almost 500 stations scattered around the city, hop on and ride to their destination, and drop the bike at a nearby station. (Visitors to the city can also purchase day passes.) The Citi Bike program shares data with the public about the operation of the service: time and location of pick-ups and drop-offs, and basic demographic data (age and gender) of subscriber riders.

Data Scientist Todd Schneider has followed up on his tour-de-force analysis of Taxi Rides in NYC with a similar analysis of the Citi Bike data. Check out the wonderful animation of bike rides on September 16 below. While the Citi Bike data doesn't include actual trajectories (just the pick-up and drop-off locations), Todd has "interpolated" these points using Google Maps biking directions. Though these may not match actual routes (and give extra weight to roads with bike lanes), it's nonetheless an elegant visualization of bike commuter patterns in the city.

Check out in particular the rush hours of 7-9AM and 4-6PM. September 16 was a Wednesday, but as Todd shows in the chart below, biking patterns are very different on the weekends as the focus switches from commuting to pleasure rides.

Todd also matched the biking data with NYC weather data to take a look at its effect on biking patterns. Unsurprisingly, low temperatures and rain both have a dampening effect (pun intended!) on ridership: one inch of rain deters as many riders as a 24-degree (F) drop in temperature. Surprisingly, snow doesn't have such a dramatic effect: an inch of snow depresses ridership like a 1.4 degree drop in temperature. (However, Todd's data doesn't include the recent blizzard in New York, from which many Citi Bike stations are still waiting to be dug out.)

Todd conducted all of the analysis and data visualization with the R language (he shares the R code on Github). He mainly used the RPostgreSQL package for data extraction, the dplyr package for data manipulation, the ggplot2 package for graphics, and the minpack.lm package for the nonlinear least squares analysis of the weather impact.

There's plenty more detail to the analysis, including the effects of age and gender on cycling speed. For the complete analysis and lots more interesting charts, follow the link to the blog post below.

Todd W. Schneider: A Tale of Twenty-Two Million Citi Bikes: Analyzing the NYC Bike Share System

Astronomer and budding data scientist Julia Silge has been using R for less than a year, but based on the posts on her blog, she has already become very proficient at using R to analyze some interesting data sets. She has posted detailed analyses of water consumption data and health care indicators from the Utah Open Data Catalog, religious affiliation data from the Association of Statisticians of American Religious Bodies, and demographic data from the American Community Survey (that's the same dataset we mentioned on Monday).

In a two-part series, Julia analyzed another interesting dataset: her own archive of 10,000 tweets. (Julia provides all the R code for her analyses, so you can download your own Twitter archive and follow along.) In part one, Julia imports her Twitter archive into R — in fact, that takes just one line of R code:

`tweets <- read.csv("./tweets.csv", stringsAsFactors = FALSE)`

She then uses the lubridate package to clean up the timestamps, and the ggplot2 package to create some simple charts of her Twitter activity. One such chart takes just a few lines of R code and shows her Twitter activity over time, categorized by type of tweet (direct tweets, replies, and retweets).
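That step can be sketched in a few lines. Here I assume the archive's time column is named timestamp; the column names and date format in your own archive may differ from this sketch:

```r
library(lubridate)
library(ggplot2)

# Parse the character timestamps into POSIXct date-times
tweets$timestamp <- ymd_hms(tweets$timestamp)

# A simple histogram of tweet volume over time
ggplot(tweets, aes(x = timestamp)) +
  geom_histogram(bins = 60) +
  xlab("Time") + ylab("Number of tweets")
```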

The really interesting part of the analysis comes in part two, where Julia uses the tm package (which provides a number of text mining functions to R) and the syuzhet package (which includes the NRC Word-Emotion Association Lexicon algorithm) to analyze the sentiment of her tweets. Categorizing all 10,000 tweets as representing "anger", "fear", "surprise" and other sentiments, and generating a positive and negative sentiment score for each, is as simple as this one line of R code:

`mySentiment <- get_nrc_sentiment(tweets$text)`

Using those sentiment scores, Julia was easily able to summarize the sentiments expressed in her tweet history:

and create this time series chart showing her negative and positive sentiment scores over time:

If you've been thinking about applying sentiment analysis to some text data, you might find that with R it's easier than you think! Try it using your own Twitter archive by following along with Julia's posts linked below.

data science ish: Ten Thousand Tweets ; Joy to the World, and also Anticipation, Disgust, Surprise...

by Joseph Rickert

In a previous post, I showed how some elementary properties of discrete time Markov Chains could be calculated, mostly with functions from the markovchain package. In this post, I would like to show a little bit more of the functionality available in that package by fitting a Markov Chain to some data. In this first block of code, I load the gold data set from the forecast package, which contains daily morning gold prices in US dollars from January 1, 1985 through March 31, 1989. Next, since there are a few missing values in the sequence, I impute them with a simple "ad hoc" process: substituting the previous day's price for one that is missing. There are two statements in the loop because there are a number of instances where there are two missing values in a row. Note that some kind of imputation is necessary because I will want to compute the autocorrelation of the series, and like many R functions, acf() does not like NAs.

library(forecast)

library(markovchain)

data(gold) # Load gold time series

# Impute missing values

gold1 <- gold

for(i in 1:length(gold)){

gold1[i] <- ifelse(is.na(gold[i]),gold[i-1],gold[i])

gold1[i] <- ifelse(is.na(gold1[i]),gold1[i-1],gold1[i])

}

plot(gold1, xlab = "Days", ylab = "US$", main = "Gold prices 1/1/85 - 3/31/89")

This is an interesting series with over 1,000 points, but it is definitely not stationary, so it is not a good candidate for modeling as a Markov Chain. The series produced by taking first differences is more reasonable. The differenced series is flat, oscillating about a mean (0.07) slightly above zero, and the autocorrelation trails off as one might expect for a stationary series.

# Take first differences to try and get stationary series

goldDiff <- diff(gold1)

par(mfrow = c(1,2))

plot(goldDiff,ylab="",main="1st differences of gold")

acf(goldDiff)

Next, we set up for modeling by constructing a series of labels. In this analysis, I settle for constructing a two-state chain that reflects whether the series of differences assumes positive or non-positive values. Note that I have introduced a few zero values into the series because of my crude imputation process. Here, I just lump them in with the negatives.

# construct a series of labels

goldSign <- vector(length=length(goldDiff))

for (i in 1:length(goldDiff)){

goldSign[i] <- ifelse(goldDiff[i] > 0, "POS","NEG")

}
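As an aside, the labeling loop above can also be written as a single vectorized expression, which is more idiomatic R and produces the same result:

```r
# Equivalent to the loop: positive differences become "POS",
# everything else (including the imputed zeros) becomes "NEG"
goldSign <- ifelse(goldDiff > 0, "POS", "NEG")
```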

Next, we can use some of the statistical tests built into the markovchain package to assess our assumptions so far. The function verifyMarkovProperty() attempts to verify that a sequence satisfies the Markov property by performing Chi-squared tests on a series of contingency tables, where the columns are sequences of past-to-present-to-future transitions and the rows are sequences of state transitions. Large p-values indicate that one should not reject the null hypothesis that the Markov property holds for a specific transition. The output of verifyMarkovProperty() is a list with an entry for each possible state transition. The package vignette An introduction to the markovchain package presents the details of how this works. The following shows the output for the NEG to POS transitions.

# Verify the Markov property

vmp <- verifyMarkovProperty(goldSign)

vmp[2]

# $NEGPOS

# $NEGPOS$statistic

# X-squared

# 0

#

# $NEGPOS$parameter

# df

# 1

#

# $NEGPOS$p.value

# [1] 1

#

# $NEGPOS$method

# [1] "Pearson's Chi-squared test with Yates' continuity correction"

#

# $NEGPOS$data.name

# [1] "table"

#

# $NEGPOS$observed

# SSO TSO-SSO

# NEG 138 116

# POS 164 139

#

# $NEGPOS$expected

# SSO TSO-SSO

# NEG 137.7163 116.2837

# POS 164.2837 138.7163

#

# $NEGPOS$residuals

# SSO TSO-SSO

# NEG 0.02417181 -0.02630526

# POS -0.02213119 0.02408452

#

# $NEGPOS$stdres

# SSO TSO-SSO

# NEG 0.04843652 -0.04843652

# POS -0.04843652 0.04843652

#

# $NEGPOS$table

# SSO TSO

# NEG 138 254

# POS 164 303

The assessOrder() function uses a Chi-squared test to test the hypothesis that the sequence is consistent with a first order Markov Chain.

assessOrder(goldSign)

# The assessOrder test statistic is: 0.4142521

# the Chi-Square d.f. are: 2 the p-value is: 0.8129172

# $statistic

# [1] 0.4142521

#

# $p.value

# [1] 0.8129172

There is an additional function, assessStationarity(), to test for the stationarity of the sequence. However, in this case the chisq.test() function at the heart of things reports that the p-values are unreliable.

Next, we use the markovchainFit() function to fit a Markov Chain to the data using the maximum likelihood estimator explained in the vignette mentioned above. The output from the function includes the estimated transition matrix, the estimated error and lower and upper endpoint transition matrices which provide a confidence interval for the transition matrix.

goldMC <- markovchainFit(data = goldSign, method="mle", name = "gold mle")

goldMC$estimate

# gold mle

# A 2 - dimensional discrete Markov Chain with following states

# NEG POS

# The transition matrix (by rows) is defined as follows

# NEG POS

# NEG 0.4560144 0.5439856

# POS 0.5519126 0.4480874

goldMC$standardError

# NEG POS

# NEG 0.02861289 0.03125116

# POS 0.03170655 0.02856901

goldMC$confidenceInterval

# [1] 0.95

#

# $lowerEndpointMatrix

# NEG POS

# NEG 0.4089504 0.4925821

# POS 0.4997599 0.4010956

#

# $upperEndpointMatrix

# NEG POS

# NEG 0.5030784 0.5953892

# POS 0.6040652 0.4950793

The transition matrix does show some interesting structure: the chain is more likely to move from a negative value to a positive one than to stay negative, and, once it is positive, it is also more likely to flip back to negative than to stay positive. In other words, the chain tends to alternate signs.

Finally, we use the predict() function to produce a three day look-ahead forecast for the situation where the series has been positive for the last two days.

predict(object = goldMC$estimate, newdata = c("POS","POS"),n.ahead=3)

# "NEG" "POS" "NEG"
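With the fitted chain in hand, you can also simulate from it or compute its long-run behavior with the same markovchain functions; a short sketch continuing the script above (output not shown):

```r
# Simulate ten days of POS/NEG signs from the fitted chain
set.seed(123)
rmarkovchain(n = 10, object = goldMC$estimate)

# The stationary distribution implied by the estimated
# transition matrix
steadyStates(goldMC$estimate)
```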

I still have not exhausted what is in the markovchain package. Perhaps, in another post, I will look at the functions for continuous time chains. What is presented here, though, should be enough to have some fun hunting for Markov Chains in all kinds of data.

by John Mount Ph.D.

Data Scientist at Win-Vector LLC

Let's talk about the use and benefits of parallel computation in R.

IBM's Blue Gene/P massively parallel supercomputer (Wikipedia).

"Parallel computing is a type of computation in which many calculations are carried out simultaneously."

Wikipedia quoting: Gottlieb, Allan; Almasi, George S. (1989). Highly parallel computing

The reason we care is: by making the computer work harder (perform many calculations simultaneously) we wait less time for our experiments and can run more experiments. This is especially important when doing data science (as we often do using the R analysis platform) as we often need to repeat variations of large analyses to learn things, infer parameters, and estimate model stability. Typically, to get the computer to work harder, the analyst, programmer, or library designer must themselves work a bit harder to arrange calculations in a parallel-friendly manner. In the best circumstances somebody has already done this for you:

- Good parallel libraries, such as the multi-threaded BLAS/LAPACK libraries included in Revolution R Open (RRO, now Microsoft R Open) (see here).
- Specialized parallel extensions that supply their own high performance implementations of important procedures such as rx methods from RevoScaleR or h2o methods from h2o.ai.
- Parallelization abstraction frameworks such as Thrust/Rth (see here).
- Using R application libraries that deal with parallelism on their own (examples include gbm, boot and our own vtreat). (Some of these libraries do not attempt parallel operation until you specify a parallel execution environment.)

In addition to having a task ready to "parallelize" you need a facility willing to work on it in a parallel manner. Examples include:

- Your own machine. Even a laptop computer usually now has four or more cores. Potentially running four times faster, or equivalently waiting only one fourth the time, is big.
- Graphics processing units (GPUs). Many machines have one or more powerful graphics cards already installed. For some numerical tasks these cards are 10 to 100 times faster than the basic Central Processing Unit (CPU) you normally use for computation (see here).
- Clusters of computers (such as Amazon ec2, Hadoop backends and more).

Obviously, parallel computation with R is a vast and specialized topic. It can seem impossible to learn quickly how to use all this magic to speed up your own calculations. In this tutorial we will demonstrate how to speed up a calculation of your own choosing using basic R. To read on, please click here.
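As a small taste of the "your own machine" case above, the parallel package that ships with base R lets you spread independent tasks across cores. A minimal sketch (the toy task and the core count are arbitrary choices, not part of the tutorial):

```r
library(parallel)

# A toy task: each call does some independent work
slowTask <- function(i) mean(sample(1:1e6, 1e5, replace = TRUE))

# Serial timing with lapply
system.time(serial <- lapply(1:100, slowTask))

# Parallel timing, using all but one core
cl <- makeCluster(max(1, detectCores() - 1))
system.time(par <- parLapply(cl, 1:100, slowTask))
stopCluster(cl)
```

Both calls return a list of 100 results; on a multi-core machine the parLapply() elapsed time should be a fraction of the serial time once the task is large enough to outweigh the cluster overhead.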

by Joseph Rickert

Over the past few months, a number of new CRAN packages have appeared that make it easier for R users to gain access to curated data. Most of these provide interfaces to a RESTful API written by the data publishers, while a few just wrap the data set inside the package. Some of the new packages are very simple, one-function wrappers around the API. Others offer more features, providing functions to control searches and the format of the returned data. Some of the packages require a user to first obtain login credentials, while others don’t.

Here are 17 packages that connect to data sources of all sorts. The list is by no means complete; new packages in this class seem to be arriving daily at CRAN.

ameco V 0.1: Contains the entire European Commission Annual macro-economic (AMECO) database. The vignette shows a nice example of data munging to get a plot of population data.

censusr V0.0.2: Provides an interface to the US Census Data API. The vignette shows exactly how to go about getting the key to use the API.

ckanr V 0.1.0: Provides an interface to the Comprehensive Knowledge Archive Network CKAN, which bills itself as the “world’s leading open-source data portal platform”. The vignette walks you through installation and provides some examples.

dieZeit V 0.1.0: Provides access to Die Zeit's online content. This includes archives going back to 1946! The vignette shows how to get API access with a limit of 10,000 accesses per day.

ecb V0.1: Provides an interface to the European Central Bank's Statistical Warehouse API. The following plot from the vignette shows “headline” and core Harmonized Index of Consumer Prices (HICP) inflation numbers.

gesis V0.1: Provides an interface to the GESIS Catalogue of more than 5,000 data sets maintained by the Leibniz-Institute of the Social Sciences. The vignette shows you how to get started.

gtrendsR V1.3.0: Provides functions to perform and display Google Trend queries.

hdr V0.1: Provides an interface to the United Nations Development Program Human Development Report API. The vignette provides an example of accessing and plotting some data.

inegiR V1.0.2: Provides functions to download and parse information from the official Mexican statistics agency, INEGI.

maddison V0.1: Contains the Maddison Project database which provides estimates of GDP per capita for all countries between AD 1 and 2010. The following plot from the vignette shows estimated GDP per capita going back to 1800. Look at the WW II years.

mldr.datasets V0.3.1: Provides tools for the manipulation and exploration of multi-label data sets, and contains a large collection of them. The vignette, which is very well done, contains some theory, illustrative R code and some spectacular visualizations. Here is a pretty cool plot in one line of code:

plot(genbase, labelIndices = genbase$labels$index[1:11])

pageviews V0.1.1: Provides an API client for Wikimedia traffic data. The following code, adapted from the vignette, plots views of the article R (programming language) for 2015.

library(pageviews)

library(ggplot2)

res <- article_pageviews(project = "en.wikipedia",

article = "R_(programming_language)",

start = "2015010100", end = "2015123124")

# Fiddle with the string to get it into proper format for forming a date.

# The regular expression comes from those kind folks at stackoverflow: http://bit.ly/1RKC9iQ

date <- gsub('^(.{7})(.*)$', '\\1-\\2', gsub('^(.{4})(.*)$', '\\1-\\2', res$timestamp))

res$date <- as.Date(substr(date,1,10))

p <- ggplot(res, aes(date, views)) + geom_line() +

geom_point(colour="red",size = 0.5) +

xlab("2015") + ylab("Daily Views") +

ggtitle("Wikipedia article: R (programming language)")

p
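As an alternative to the regular expression above, note that the API's timestamps have the fixed form "YYYYMMDDHH", so the first eight characters can be handed straight to as.Date:

```r
# Parse "2015010100"-style timestamps without a regex
res$date <- as.Date(substr(res$timestamp, 1, 8), format = "%Y%m%d")
```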

pangaear V0.1.0: Provides tools for interacting with the PANGAEA database. This should be of interest to environmental scientists.

prism V0.0.7: Allows you to download and visualize climate data from Oregon State's PRISM Project. The vignette is maintained on GitHub.

rstatscn V1.0: Provides functions to query Chinese National Data.

SocialMediaLab V0.19.0: Provides tools to collect data from Instagram, Facebook, Twitter and YouTube, construct networks and plot them.

wordbankr V0.1: Contains functions for connecting to Wordbank, Stanford's database of children's development vocabulary that spans 14 languages. There is a vignette.

Of course, there are countless sources for data that are easily accessible from R. Our data page on MRAN lists quite a few, and there are even more on Mango Solutions' data set page. Please let us know about any others that you think we should track.

If you've ever created a scatterplot with text labels using the text function in R, or the geom_text function in the ggplot2 package, you've probably found that the text labels can easily overlap, rendering some of them unreadable. Now, thanks to the new extensibility capabilities of the ggplot2 package, R user Kamil Slowikowski has created an R package, ggrepel, that adds alternative text labeling functions to ggplot2 that "repel" labels from data points and other labels to avoid overlapping. The new geom_text_repel replaces the standard geom_text for plain text labels, and you can also use geom_label_repel instead of geom_label for these rounded and color-coded labels:

The resulting plot is definitely more attractive, and with more readable labels, than the standard version using geom_text:

You can see more examples of ggrepel in action here. A word of caution, though: if you're relying on the text labels as the fundamental element of your visualization, this does have the effect of moving your *data* around, and that could change your interpretation of the plot. (Case in point: the spread of data appears greater in the first plot than the "messy" one just above, even though it's the exact same data being presented both times.) But if your main goal is not interpretation, or if you just want to label a few points in particular (and ensure the labels are readable), this new ggrepel package is well worth a look. The ggrepel package is available on CRAN now, and you can follow its development on Github at the link below.
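Getting started takes only a swap of geoms; a minimal sketch using the built-in mtcars data (the point colour and text size here are arbitrary choices):

```r
library(ggplot2)
library(ggrepel)

# geom_text_repel() is a drop-in replacement for geom_text();
# labels are pushed away from points and from each other
ggplot(mtcars, aes(wt, mpg, label = rownames(mtcars))) +
  geom_point(colour = "red") +
  geom_text_repel(size = 3)
```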

github (slowkow): ggrepel

by Joseph Rickert

There are a number of R packages devoted to sophisticated applications of Markov chains. These include msm and SemiMarkov for fitting multistate models to panel data, mstate for survival analysis applications, TPmsm for estimating transition probabilities for 3-state progressive disease models, heemod for applying Markov models to health care economic applications, HMM and depmixS4 for fitting Hidden Markov Models, and mcmc for working with Markov Chain Monte Carlo. All of these assume some considerable knowledge of the underlying theory. To my knowledge, only DTMCPack and the relatively recent markovchain package were written to facilitate basic computations with Markov chains.

In this post, we’ll explore some basic properties of discrete time Markov chains using the functions provided by the markovchain package, supplemented with standard R functions and a few functions from other contributed packages. Chapter 11 of Snell’s online probability book will be our guide. The calculations displayed here illustrate some of the theory developed in that document, and in the text below, section numbers refer to it.

A large part of working with discrete time Markov chains involves manipulating the matrix of transition probabilities associated with the chain. This first section of code replicates the Oz transition probability matrix from section 11.1 and uses the plotmat() function from the diagram package to illustrate it. Then, the efficient operator %^% from the expm package is used to raise the Oz matrix to the third power. Finally, left matrix multiplication of Oz^3 by the distribution vector u = (1/3, 1/3, 1/3) gives the weather forecast three days ahead.

library(expm)

library(markovchain)

library(diagram)

library(pracma)

stateNames <- c("Rain","Nice","Snow")

Oz <- matrix(c(.5,.25,.25,.5,0,.5,.25,.25,.5),

nrow=3, byrow=TRUE)

row.names(Oz) <- stateNames; colnames(Oz) <- stateNames

Oz

# Rain Nice Snow

# Rain 0.50 0.25 0.25

# Nice 0.50 0.00 0.50

# Snow 0.25 0.25 0.50

plotmat(Oz,pos = c(1,2),

lwd = 1, box.lwd = 2,

cex.txt = 0.8,

box.size = 0.1,

box.type = "circle",

box.prop = 0.5,

box.col = "light yellow",

arr.length=.1,

arr.width=.1,

self.cex = .4,

self.shifty = -.01,

self.shiftx = .13,

main = "")

Oz3 <- Oz %^% 3

round(Oz3,3)

# Rain Nice Snow

# Rain 0.406 0.203 0.391

# Nice 0.406 0.188 0.406

# Snow 0.391 0.203 0.406

u <- c(1/3, 1/3, 1/3)

round(u %*% Oz3,3)

#0.401 0.198 0.401

The igraph package can also be used to draw Markov chain diagrams, but I prefer the “drawn on a chalkboard” look of plotmat.

This next block of code reproduces the 5-state Drunkard's Walk example from section 11.2, which presents the fundamentals of absorbing Markov chains. First, the transition matrix describing the chain is instantiated as an object of the S4 class markovchain. Then, functions from the markovchain package are used to identify the absorbing and transient states of the chain and to place the transition matrix, P, into canonical form.

p <- c(.5,0,.5)

dw <- c(1,rep(0,4),p,0,0,0,p,0,0,0,p,rep(0,4),1)

DW <- matrix(dw,5,5,byrow=TRUE)

DWmc <-new("markovchain",

transitionMatrix = DW,

states = c("0","1","2","3","4"),

name = "Drunkard's Walk")

DWmc

# Drunkard's Walk

# A 5 - dimensional discrete Markov Chain with following states

# 0 1 2 3 4

# The transition matrix (by rows) is defined as follows

# 0 1 2 3 4

# 0 1.0 0.0 0.0 0.0 0.0

# 1 0.5 0.0 0.5 0.0 0.0

# 2 0.0 0.5 0.0 0.5 0.0

# 3 0.0 0.0 0.5 0.0 0.5

# 4 0.0 0.0 0.0 0.0 1.0

# Determine transient states

transientStates(DWmc)

#[1] "1" "2" "3"

# determine absorbing states

absorbingStates(DWmc)

#[1] "0" "4"

In canonical form, the transition matrix, P, is partitioned into the Identity matrix, I, a matrix of 0’s, the matrix, Q, containing the transition probabilities for the transient states and a matrix, R, containing the transition probabilities for the absorbing states.

Next, we find the Fundamental Matrix, N, by inverting (I – Q). For each transient state j, n_{ij} gives the expected number of times the process is in state j given that it started in transient state i. u_{i} is the expected time until absorption given that the process starts in state i. Finally, we compute the matrix B, where b_{ij} is the probability that the process will be absorbed in state j given that it starts in state i.

# Find Matrix Q

getRQ <- function(M,type="Q"){

if(length(absorbingStates(M)) == 0) stop("Not Absorbing Matrix")

tm <- M@transitionMatrix

d <- diag(tm)

m <- max(which(d == 1))

n <- length(d)

ifelse(type=="Q",

A <- tm[(m+1):n,(m+1):n],

A <- tm[(m+1):n,1:m])

return(A)

}

# Put DWmc into Canonical Form

P <- canonicForm(DWmc)

P

Q <- getRQ(P)

# Find Fundamental Matrix

I <- diag(dim(Q)[2])

N <- solve(I - Q)

N

# 1 2 3

# 1 1.5 1 0.5

# 2 1.0 2 1.0

# 3 0.5 1 1.5

# Calculate time to absorption

c <- rep(1,dim(N)[2])

u <- N %*% c

u

# 1 3

# 2 4

# 3 3

R <- getRQ(P,"R")

B <- N %*% R

B

# 0 4

# 1 0.75 0.25

# 2 0.50 0.50

# 3 0.25 0.75

For section 11.3, which deals with regular and ergodic Markov chains, we return to Oz and provide four options for calculating the steady state, or limiting, probability distribution for this regular transition matrix. The first three options involve standard methods which are readily available in R. Method 1 uses %^% to raise the matrix Oz to a sufficiently high power. Method 2 calculates the eigenvector associated with the eigenvalue 1, and method 3 uses the nullspace() function from the pracma package to compute the null space, or kernel, of the linear transformation associated with the matrix. The fourth method uses the steadyStates() function from the markovchain package. To use this function, we first convert Oz into a markovchain object.

# 11.3 Ergodic Markov Chains
# Four methods to get steady states

# Method 1: compute powers of the matrix
round(Oz %^% 6, 2)
#      Rain Nice Snow
# Rain  0.4  0.2  0.4
# Nice  0.4  0.2  0.4
# Snow  0.4  0.2  0.4

# Method 2: compute the eigenvector for eigenvalue 1
eigenOz <- eigen(t(Oz))
ev <- eigenOz$vectors[,1] / sum(eigenOz$vectors[,1])
ev

# Method 3: compute the null space of (P - I)
I <- diag(3)
ns <- nullspace(t(Oz - I))
ns <- round(ns / sum(ns), 2)
ns

# Method 4: use the steadyStates() function in the markovchain package
OzMC <- new("markovchain",
            states = stateNames,
            transitionMatrix =
              matrix(c(.5, .25, .25,
                       .5,   0,  .5,
                       .25, .25, .5),
                     nrow = 3,
                     byrow = TRUE,
                     dimnames = list(stateNames, stateNames)))

steadyStates(OzMC)
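As a quick sanity check, the estimates from the different methods can be compared directly; for the Land of Oz matrix they all agree on (0.4, 0.2, 0.4). The following self-contained sketch (which re-creates the Oz matrix so it runs on its own) compares the matrix-power and eigenvector approaches:

```r
library(expm)   # provides the %^% matrix power operator

# Land of Oz transition matrix (same values used above)
stateNames <- c("Rain", "Nice", "Snow")
Oz <- matrix(c(.5, .25, .25,
               .5,   0,  .5,
               .25, .25, .5),
             nrow = 3, byrow = TRUE,
             dimnames = list(stateNames, stateNames))

pi1 <- (Oz %^% 50)[1, ]          # Method 1: row of a high matrix power
v   <- eigen(t(Oz))$vectors[, 1] # Method 2: eigenvector for eigenvalue 1
pi2 <- Re(v / sum(v))            # normalize so the entries sum to 1

round(rbind(pi1, pi2), 2)        # both rows: 0.4 0.2 0.4
```

The same comparison could be extended to the nullspace() and steadyStates() results computed above.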

The steadyStates() function seems to be reasonably efficient even for fairly large Markov chains. The following code creates a 5,000 row by 5,000 column regular Markov matrix. On my modest Lenovo ThinkPad ultrabook it took a little less than 2 minutes to create the markovchain object and about 11 minutes to compute the steady state distribution.

# Create a large random regular matrix
randReg <- function(N){
  M <- matrix(runif(N^2, min = 1, max = N), nrow = N, ncol = N)
  rowS <- rowSums(M)
  regM <- M / rowS   # normalize each row to sum to 1
  return(regM)
}

N <- 5000
M <- randReg(N)
#rowSums(M)

system.time(regMC <- new("markovchain", states = as.character(1:N),
                         transitionMatrix = M,
                         name = "M"))
#  user  system elapsed
# 98.33    0.82   99.46

system.time(ss <- steadyStates(regMC))
#   user  system elapsed
# 618.47    0.61  640.05

We conclude this little Markov chain excursion by using the rmarkovchain() function to simulate a trajectory from the process represented by this large random matrix and plotting the results. This appears to be a reasonable way to simulate a stationary time series while keeping easy control over the limits of its variability.

# Sample from regMC
regMCts <- rmarkovchain(n = 1000, object = regMC)
regMCtsDf <- as.data.frame(regMCts, stringsAsFactors = FALSE)
regMCtsDf$index <- 1:1000
regMCtsDf$regMCts <- as.numeric(regMCtsDf$regMCts)

library(ggplot2)
p <- ggplot(regMCtsDf, aes(index, regMCts))
p + geom_line(colour = "dark red") +
    xlab("time") +
    ylab("state") +
    ggtitle("Random Markov Chain")

For more on the capabilities of the markovchain package, do have a look at the package vignette. For more theory on both discrete and continuous time Markov processes illustrated with R, see Norm Matloff's book From Algorithms to Z-Scores: Probabilistic and Statistical Modeling in Computer Science.

Although the ggplot2 project, the most popular data visualization package for R, has been in maintenance mode, RStudio's Hadley Wickham has given the R community a surprise gift: a version 2.0.0 update for ggplot2. According to Hadley, this is a "huge" update with more than 100 fixes and improvements.

The most significant addition is that ggplot2 now has a formal extension mechanism, which means that package authors can now create their own geoms (plot types), statistics (data aggregation/transformation methods) and themes. There are also a number of smaller improvements, including making it easy to draw curved lines between points with geom_curve, a way to suppress overlapping text labels, and a way to add labels with rounded enclosing boxes to plots. There are also a few minor changes to the appearance of plots, with some changes to colors and text sizes to improve readability and (my personal favorite) the elimination of the diagonal line that used to appear in the color boxes in legends.
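A minimal sketch of those three smaller additions, using some made-up data (the data frame and coordinates below are purely illustrative):

```r
library(ggplot2)   # requires ggplot2 >= 2.0.0

# Two arbitrary points to connect and label
df <- data.frame(x = c(1, 4), y = c(1, 3), label = c("start", "end"))

p <- ggplot(df, aes(x, y)) +
  # New in 2.0.0: a curved line between two points
  geom_curve(x = 1, y = 1, xend = 4, yend = 3, curvature = 0.2) +
  # check_overlap = TRUE suppresses text labels that would overprint earlier ones
  geom_text(aes(label = label), check_overlap = TRUE, vjust = -1) +
  # New in 2.0.0: labels drawn with rounded enclosing boxes
  geom_label(aes(label = label), vjust = 2)

print(p)
```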

If you're new to ggplot2, getting to grips with its Grammar of Graphics system can be a steep learning curve, but it's well worth it for long-term productivity in creating beautiful graphics with R. Hadley Wickham's book, ggplot2: Elegant Graphics for Data Analysis, is a great place to get started. Experienced ggplot2 users will also appreciate the freshly-updated ggplot2 cheatsheet from the RStudio team.

This new update is mostly compatible with older versions of ggplot2, but it may require updates to existing code in some cases. (If you want to try the new ggplot2 but still have access to old versions for compatibility, take a look at the checkpoint package.) The updated ggplot2 package is now available for download via CRAN with install.packages("ggplot2"), or via the ggplot2 Github repository.
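If you go the checkpoint route, pinning a project to a pre-2.0.0 state is essentially a one-liner; the snapshot date below is just an illustrative choice of a date before the ggplot2 2.0.0 release:

```r
library(checkpoint)

# Install and load packages exactly as they appeared on CRAN on this date,
# so library(ggplot2) below picks up the pre-2.0.0 version from the snapshot
checkpoint("2015-12-15")

library(ggplot2)
```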

RStudio Blog: ggplot2 2.0.0

Google Trends is a useful way to compare changes in popularity of certain search terms over time, and Google Trends data can be used as a proxy for all sorts of difficult-to-measure quantities like economic activity and disease propagation. If you'd like to use Google Trends data in your own analyses, the gtrendsR package for R is now available on CRAN. This package, by Philippe Massicotte and Dirk Eddelbuettel, provides functions to connect with your Google account and download Trends data for one or more search terms at daily or weekly resolution over a specified period of time.

For example, this code shows the relative prevalence of searches including the terms "data is" and "data are" over the past 10 years:

library(gtrendsR)

usr <- "<Google account email>"
psw <- "<Google account password>"
gconnect(usr, psw)

lang_trend <- gtrends(c("data is", "data are"), res = "week")
plot(lang_trend)

And here's the resulting plot:

In addition to the trends data (which is only useful to compare with the other terms in your query, and not for absolute popularity), the result object also includes data on the top geographic regions that requested the search terms, and the top complete queries that contained them.

You can install the gtrendsR package from CRAN, or find the latest version on GitHub.

Thinking Inside the Box: gtrends 1.3.0 now on CRAN: Google Trends in R