by Joseph Rickert

Pasha Roberts, Chief Scientist at Talent Analytics, is writing a series of articles on Employee Churn for the Predictive Analytics Times that comprise a really instructive and valuable example of using R to do some basic predictive modeling. So far, Pasha has published Employee Churn 201, in which he makes a case for the importance of modeling employee churn, and Employee Churn 202, where he builds a fairly sophisticated interactive model from first principles using only RStudio and basic R functions. And, while the series is not yet complete, I think it is going to be unique because it works on multiple levels.

In Churn 201, Pasha uses R almost incidentally, to produce the following plot that illustrates the concepts involved in understanding the costs and benefits contributed by a single employee.

At the lowest level, this is a nice example of what might be called a “programming literate essay”. R clearly isn’t necessary just to create a graphic. (Note the use of ggplot's annotate() capability.) But, if you look at the R code behind the scenes, you will see that Pasha has gone a bit further. In a few lines of annotated code he has sketched out a self-documenting model that someone else could use to get “back of the envelope” results for their business. The exercise is roughly at the level of what a business analyst might attempt in an Excel spreadsheet.

In Employee Churn 202, Pasha goes still further, moving the series from essays alone to a modeling effort. He uses basic survival analysis ideas and simple R functions to create a sophisticated decision model that computes several performance measures, including something he calls Expected Cumulative Net Benefit. This measures the net benefit to the corporation of employees who leave for both “good” and “bad” reasons.

The following figure shows the simulation running in RStudio complete with interactive tools built with the manipulate() function to perform "what if" analyses and display the results.

Running the simulation is easy. All of the code is available on GitHub, where the file churn202.md provides details on how things work. Once you have run the code in churn202.R, or issued the command source("churn202.R") from the console, running the function manipSim202() will produce the simulation. (Note that it might be necessary to click the “gear” icon in the upper left-hand corner of the Plots panel to make the slider controls appear.) The function runSensitivityTests() varies each of the parameters in the simulation through a reasonable range of values, while holding the other parameters fixed, to show the sensitivity of Expected Cumulative Net Benefit to each parameter. The function runHistograms() produces histograms of the synthetic data that drive the simulation and hints at the data collection effort that would be required to run the simulation for real.

By placing the code on GitHub and inviting feedback, comments and pull requests, Pasha has raised his literary efforts to the status of an open source employee churn project without compromising the clarity of his exposition. I, for one, am looking forward to the rest of this series.

Here's something fun you can do with R and its interface to Twitter, the twitteR package. An R script by CMU student Mark Patterson downloads your Twitter profile picture, counts the number of Twitter followers you have, and then creates a pointillist version of your profile picture with as many dots as you have followers. Here's mine:

Note that to use Mark's script you will need to install the BioConductor package EBImage (follow the instructions in yellow on that page) and create a Twitter app and authenticate it. Once you've got those set up, call general.func("yourTwitterHandle") to create your own!

Decisions and R: Visualizing Twitter Followers Using Pointillism

Ari Lamstein has updated his choroplethr package with a new capability for creating animated data maps. I can't embed the animated version here, but click the image below to see an animation of US counties by average household income, from the richest to the poorest by percentile. (The code behind the animation is available on github.)

The choroplethr package is also now available on CRAN, so you can install the latest version (including the new choroplethr_animate function) with the command install.packages("choroplethr"). In addition to playing animations from start to finish, you can also step through each frame using the + and - buttons.

This version of choroplethr was created during Trulia’s latest innovation week. Ari Lamstein wrote most of the R code, and the animation code was written by Brian P Johnson.

Google Groups choroplethr: choroplethr v1.4.0 is now available

by Joseph Rickert

Norman Matloff, professor of computer science at UC Davis and founding member of the UCD Dept. of Statistics, has begun posting as Mad(Data)Scientist. (You may know Norm from his book, *The Art of R Programming*, NSP, 2011.) In his second post (out today) on the new R package freqparcoord, which he wrote with Yingkang Xie, Norm looks into outliers in baseball data.

> library(freqparcoord)
> data(mlb)
> freqparcoord(mlb, -3, 4:6, 7)

We would like to welcome Norm as a new R blogger and we are looking forward to future posts!

Mad(Data)Scientist: More on freqparcoord

I'm a big fan of Wes Anderson's movies. I love the quirky characters and stories, the distinctive cinematography, and the unique visual style. Now you can bring some of that style to your own R charts, by making use of these Wes Anderson inspired palettes. Just choose your favourite Wes Anderson film or short:

Install the wesanderson palettes package, created by Karthik Ram:

And convert your plain old ggplot2 charts like this:

Into this:

Here's the command for you cut-and-pasters:

ggplot(diamonds, aes(clarity, fill=cut)) + geom_bar() + scale_fill_manual(values = wes.palette(5, "Cavalcanti"))

The colors don't quite match up to the originals, but you can find other Wes Anderson-inspired palettes and adjust to your heart's content using the source code available from Karthik Ram's github page linked below.

Karthik Ram: Wes Anderson Palettes

*by Joseph Rickert*

Recently, I was trying to remember how to make a 3D scatter plot in R when it occurred to me that the documentation on how to do this is scattered all over the place. Hence, this short organizational note that you may find useful.

First of all, for the benefit of newcomers, I should mention that R has three distinct graphics systems: (1) the “traditional” graphics system, (2) the grid graphics system and (3) ggplot2. The traditional graphics system refers to the graphics and plotting functions in base R. According to Paul Murrell’s authoritative R Graphics, these functions implement the graphics facilities of the S language, and up to about 2005 they comprised most of the graphics functionality of R. The grid graphics system, developed by Paul Murrell, provides the low-level foundation for Deepayan Sarkar’s lattice package, which implements the functionality of Bill Cleveland’s Trellis graphics. Finally, ggplot2, Hadley Wickham’s package based on Wilkinson's Grammar of Graphics, took shape between 2007 and 2009, when ggplot2: Elegant Graphics for Data Analysis appeared.

There is considerable overlap of the functionality of R’s three graphics systems, but each has its own strengths and weaknesses. For example, although ggplot2 is currently probably the most popular R package for doing presentation quality plots it does not offer 3D plots. To work effectively in R I think it is necessary to know your way around at least two of the graphics systems. To really gain a command of the visualizations that can be done in R, a person would have to be familiar with all three systems as well as the many packages for specialized visualizations: maps, social networks, arc diagrams, animations, time series etc.
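To make the comparison concrete, here is a small sketch of the same scatter plot drawn in two of the three systems, using the built-in iris data. It uses only base graphics and lattice (which ships with every R installation; ggplot2 would need to be installed separately):

```r
# The same scatter plot in two of R's graphics systems, to show the
# stylistic difference between them.
library(lattice)

# Traditional (base) graphics: draws immediately on the current device
plot(Sepal.Length ~ Petal.Length, data = iris,
     main = "Traditional graphics")

# Grid/lattice graphics: returns a "trellis" object, drawn when printed
p <- xyplot(Sepal.Length ~ Petal.Length, data = iris,
            main = "Lattice (grid) graphics")
print(p)
```

Note the design difference: base graphics functions draw as a side effect, while lattice functions return an object that you can store, modify and print later.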

But back to the relatively tame task of 3D plots: the generic function persp() in the base graphics package draws perspective plots of a surface over the x–y plane. Typing demo(persp) at the console will give you an idea of what this function can do.
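For a self-contained taste of what persp() does, here is a minimal sketch of my own (a toy "sombrero" surface, not one of the demo plots) drawn over the x-y plane:

```r
# Minimal persp() sketch: the surface z = 10*sin(r)/r over the x-y plane
x <- seq(-10, 10, length.out = 50)
y <- x
sombrero <- function(x, y) {
  r <- sqrt(x^2 + y^2)
  ifelse(r == 0, 10, 10 * sin(r) / r)  # use the limit of sin(r)/r at r = 0
}
z <- outer(x, y, sombrero)  # 50 x 50 matrix of surface heights
persp(x, y, z, theta = 30, phi = 25, expand = 0.6,
      col = "lightblue", xlab = "x", ylab = "y", zlab = "z")
```

The theta and phi arguments control the viewing angles; experimenting with them is the quickest way to get a feel for the function.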

The plot3D package from Karline Soetaert builds on persp() to provide functions for both 2D and 3D plotting. The vignette for plot3D shows some very impressive plots. Load the package and type the commands example(persp3D), example(surf3D) and example(scatter3D) at the console to see examples of 3D surface and scatter plots. Also, try this code to see a cut-away view of a torus.

# 3D Plot of Half of a Torus
library(plot3D)
par(mar = c(2, 2, 2, 2))
par(mfrow = c(1, 1))
R <- 3
r <- 2
x <- seq(0, 2*pi, length.out = 50)
y <- seq(0, pi, length.out = 50)
M <- mesh(x, y)
alpha <- M$x
beta <- M$y
surf3D(x = (R + r*cos(alpha)) * cos(beta),
       y = (R + r*cos(alpha)) * sin(beta),
       z = r * sin(alpha),
       colkey = FALSE, bty = "b2",
       main = "Half of a Torus")


The scatterplot3d package from R core members Uwe Ligges and Martin Mächler is the "go-to" package for 3D scatter plots. The vignette for this package shows a rich array of plots. Load this package and type example(scatterplot3d) at the console to see examples of spirals, surfaces and 3D scatterplots.

The lattice package has its own distinctive look. Once you see one lattice plot it should be pretty easy to distinguish plots made with this package from base graphics plots. Load the package and type example(cloud) in the console to see a 3D graph of a volcano, and 3D surface and scatter plots.

rgl from Daniel Adler and Duncan Murdoch and Rcmdr from John Fox et al. both allow interactive 3D visualizations. Load the rgl package and type example(plot3d) to see a very cool, OpenGL, 3D scatter plot that you can grab with your mouse and rotate.

For additional references, see the scatterplot page of Robert Kabacoff's always helpful Quick-R site, and Paul E. Johnson's 3D Plotting presentation.

There's no shortage of web sites listing the current medal standings at Sochi, not least the official Winter Olympics Medal Tally. And here's the same tally, rendered with R:

Click through to see a real-time version of the chart, created with RStudio's Shiny by Ty Henkaline. (By the way, does anyone know if it's possible to embed a live version of the chart in a blog post like this?) If you're looking to create similar real-time charts of Web-based tables, be sure to check out the underlying code by Tyler Rinker that grabs the medal table from the Sochi website, cleans up the data, and plots the medal tally as a chart.

[**Updated**: The interactive chart by Ty Henkaline was mistakenly attributed to Ramnath Vaidyanathan. Apologies for the error.]

TRinker's R Blog: Sochi Olympic Medals

*by Daniel Hanson, QA Data Scientist, Revolution Analytics*

Last time, we included a couple of examples of plotting a single xts time series using the plot(.) function (i.e., the version of the function included in the xts package). Today, we’ll look at some quick and easy methods for plotting overlays of multiple xts time series in a single graph. As this information is not explicitly covered in the examples provided with xts and base R, this discussion may save you a bit of time.

To start, let’s look at five sets of cumulative returns for the following ETFs:

SPY SPDR S&P 500 ETF Trust

QQQ PowerShares NASDAQ QQQ Trust

GDX Market Vectors Gold Miners ETF

DBO PowerShares DB Oil Fund (ETF)

VWO Vanguard FTSE Emerging Markets ETF

We first obtain the data using quantmod, going back to January 2007:

library(quantmod)

tckrs <- c("SPY", "QQQ", "GDX", "DBO", "VWO")

getSymbols(tckrs, from = "2007-01-01")

Then, extract just the closing prices from each set:

SPY.Close <- SPY[,4]

QQQ.Close <- QQQ[,4]

GDX.Close <- GDX[,4]

DBO.Close <- DBO[,4]

VWO.Close <- VWO[,4]

What we want is the set of cumulative returns for each, in the sense of the cumulative value of $1 over time. To do this, it is simply a case of dividing each daily price in the series by the price on the first day of the series. As SPY.Close[1], for example, is itself an xts object, we need to coerce it to numeric in order to carry out the division:

SPY1 <- as.numeric(SPY.Close[1])

QQQ1 <- as.numeric(QQQ.Close[1])

GDX1 <- as.numeric(GDX.Close[1])

DBO1 <- as.numeric(DBO.Close[1])

VWO1 <- as.numeric(VWO.Close[1])

Then, it’s a case of dividing each series by the price on the first day, just as one would divide an R vector by a scalar. For convenience of notation, we’ll just save these results back into the original ETF ticker names and overwrite the original objects:

SPY <- SPY.Close/SPY1

QQQ <- QQQ.Close/QQQ1

GDX <- GDX.Close/GDX1

DBO <- DBO.Close/DBO1

VWO <- VWO.Close/VWO1

We then merge all of these xts time series into a single xts object (*à la* a matrix):

basket <- cbind(SPY, QQQ, GDX, DBO, VWO)

Note that is.xts(basket) returns TRUE. We can also have a look at the data and its structure:

> head(basket)

SPY.Close QQQ.Close GDX.Close DBO.Close VWO.Close

2007-01-03 1.0000000 1.000000 1.0000000 NA 1.0000000

2007-01-04 1.0021221 1.018964 0.9815249 NA 0.9890886

2007-01-05 0.9941289 1.014107 0.9682540 1.0000000 0.9614891

2007-01-08 0.9987267 1.014801 0.9705959 1.0024722 0.9720154

2007-01-09 0.9978779 1.019889 0.9640906 0.9929955 0.9487805

2007-01-10 1.0012025 1.031915 0.9526412 0.9517923 0.9460847

> tail(basket)

SPY.Close QQQ.Close GDX.Close DBO.Close VWO.Close

2014-01-10 1.302539 NA 0.5727296 1.082406 0.5118100

2014-01-13 1.285209 1.989130 0.5893833 1.068809 0.5053915

2014-01-14 1.299215 2.027058 0.5750716 1.074166 0.5110398

2014-01-15 1.306218 2.043710 0.5826177 1.092707 0.5109114

2014-01-16 1.304520 2.043941 0.5886027 1.089411 0.5080873

2014-01-17 1.299003 2.032377 0.6070778 1.090647 0.5062901

Note that we have a few NA values here. This will not be of any significant consequence for demonstrating plotting functions, however.
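If you do want to see where the NAs fall before plotting, a one-liner over the merged object does it. Here is the idea on a small made-up matrix standing in for basket:

```r
# Counting NAs per column, shown on a toy matrix standing in for `basket`
m <- cbind(SPY = c(1.00, 1.00, 0.99),
           DBO = c(NA,   NA,   1.00))  # DBO started trading later
colSums(is.na(m))   # number of NAs in each series
```

The same call works directly on an xts object, since xts objects are matrices underneath.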

We will now look at how we can plot all five series, overlaid on a single graph. In particular, we will look at the plot(.) functions in both the zoo and xts packages.

The xts package is an extension of the zoo package, so coercing our xts object basket to a zoo object is a simple task:

zoo.basket <- as.zoo(basket)

Looking at head(zoo.basket) and tail(zoo.basket), we will get output that looks the same as what we got for the original xts basket object, as shown above; the date to data mapping is preserved. The plot(.) function provided in zoo is very simple to use, as we can use the whole zoo.basket object as input, and the plot(.) function will overlay the time series and scale the vertical axis for us with the help of a single parameter setting, namely the screens parameter.

Let’s now look at the code and the resulting plot in the following example, and then explain what’s going on:

# Set a color scheme:

tsRainbow <- rainbow(ncol(zoo.basket))

# Plot the overlaid series

plot(x = zoo.basket, ylab = "Cumulative Return", main = "Cumulative Returns",

col = tsRainbow, screens = 1)

# Set a legend in the upper left hand corner to match color to return series

legend(x = "topleft", legend = c("SPY", "QQQ", "GDX", "DBO", "VWO"),

lty = 1,col = tsRainbow)

We started by setting a color scheme, using the rainbow(.) command that is included in the base R installation. It is convenient as R will take in an arbitrary positive integer value and select a sequence of distinct colors up to the number specified. This is a nice feature for the impatient or lazy among us (yes, guilty as charged) who don’t want to be bothered with picking out colors and just want to see the result right away.
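For example, rainbow(n) simply returns n hexadecimal color codes, evenly spaced in hue around the color wheel:

```r
# rainbow(n) returns n hex color strings, evenly spaced in hue
rainbow(3)
# "#FF0000" "#00FF00" "#0000FF"   (red, green, blue)
rainbow(5)   # five distinct colors, one per time series in our example
```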

Next, in the plot(.) command, we assign to x our “matrix” of time series in the zoo.basket object, along with a label for the vertical axis (ylab), a title for the graph (main), and the colors (col). Last, but crucial, is the parameter setting screens = 1, which tells the plot command to overlay each series in a single graph.

Finally, we include the legend(.) command to place a color legend at the upper left hand corner of the graph. The position (x) may be chosen from the list of keywords "bottomright", "bottom", "bottomleft", "left", "topleft", "top", "topright", "right" and "center"; in our case, we chose "topleft". The legend parameter is simply the list of ticker names. The lty parameter refers to “line type”, and by setting it to 1, the lines in the legend are shown as solid lines, and as in the plot(.) function, the same color scheme is assigned to the parameter col.

Back to the color scheme, we may at some point need to show our results to a manager or a client, so in that case, we probably will want to choose colors that are easier on the eye. In this case, one can just store the colors into a vector, and then use it as an input parameter. For example, set

myColors <- c("red", "darkgreen", "goldenrod", "darkblue", "darkviolet")

Then, just replace col = tsRainbow with col = myColors in the plot and legend commands:

plot(x = zoo.basket, xlab = "Time", ylab = "Cumulative Return",

main = "Cumulative Returns", col = myColors, screens = 1)

legend(x = "topleft", legend = c("SPY", "QQQ", "GDX", "DBO", "VWO"),

lty = 1, col = myColors)

We then get a plot that looks like this:

While the plot(.) function in zoo gave us a quick and convenient way of plotting multiple time series, it didn’t give us much control over the scale used along the horizontal axis. Using plot(.) in xts remedies this; however, it involves doing more work. In particular, we can no longer input the entire “matrix” object; we must add each series separately in order to layer the plots. We also need to specify the scale along the vertical axis, as in the xts case, the function will not do this on the fly as it did for us in the zoo case.

We will use individual columns from our original xts object, basket. By using basket rather than zoo.basket, we tell R to use the xts version of the function rather than the zoo version (à la an overloaded function in traditional object-oriented programming). Let’s again look at an example and the resulting plot, and then discuss how it works:

plot(x = basket[,"SPY.Close"], xlab = "Time", ylab = "Cumulative Return",

main = "Cumulative Returns", ylim = c(0.0, 2.5), major.ticks= "years",

minor.ticks = FALSE, col = "red")

lines(x = basket[,"QQQ.Close"], col = "darkgreen")

lines(x = basket[,"GDX.Close"], col = "goldenrod")

lines(x = basket[,"DBO.Close"], col = "darkblue")

lines(x = basket[,"VWO.Close"], col = "darkviolet")

legend(x = 'topleft', legend = c("SPY", "QQQ", "GDX", "DBO", "VWO"),

lty = 1, col = myColors)

As mentioned, we need to add each time series separately in this case in order to get the desired overlays. If one were to try x = basket in the plot function, the graph would only display the first series (SPY), and a warning message would be returned to the R session. So, we first use the SPY series as input to the plot(.) function, and then add the remaining series with the lines(.) command. The color for each series is also included at each step (the same colors in our myColors vector).

As for the remaining arguments in the plot command, we use the same axis and title settings in xlab, ylab, and main. We set the scale of the vertical axis with the ylim parameter; noting from our previous example that VWO hovered near zero at the low end, and that DBO reached almost as high as 2.5, we set this range from 0.0 to 2.5. Two new arguments here are the major.ticks and minor.ticks settings. The major.ticks argument represents the periods in which we wish to chop up the horizontal axis; it is chosen from the set

{"years", "months", "weeks", "days", "hours", "minutes", "seconds"}

In the example above, we chose "years". The minor.ticks parameter can take values of TRUE/FALSE, and as we don’t need this for the graph, we choose FALSE. The same legend command that we used in the zoo case can be used here as well (using myColors to indicate the color of each time series plot). Just to compare, let’s change the major.ticks parameter to "months" in the previous example. The result is as follows:

A new package, called xtsExtra, includes a new plot(.) function that provides added functionality, including a legend generator. However, while it is available on R-Forge, it has not yet made it into the official CRAN repository. More sophisticated time series plotting capability can also be found in the quantmod and ggplot2 packages, and we will look at the ggplot2 case in an upcoming post. However, for plotting xts objects quickly and with minimal fuss, the plot(.) function in the zoo package fills the bill, and with a little more effort, we can refine the scale along the horizontal axis using the xts version of plot(.). R help files for each of these can be found by selecting plot.zoo and plot.xts respectively in help searches.

Choropleth maps are a popular way of representing spatial or geographic data, where a statistic of interest (say, income, voting results or crime rate) is color-coded by region. R includes all of the necessary tools for creating choropleth maps, but Trulia's Ari Lamstein has made the process even easier with the new choroplethr package now available on github. With a couple of lines of code, you can easily convert a data frame of values coded by country, state, county or zip code into a choropleth like this:

The chart above shows the US zip codes with the highest per-capita incomes, based on data from the US Census Bureau's American Community Survey. The choroplethr package also includes an interface to ACS data, so if you know the data code you're looking for, you can create a choropleth of your favourite US demographic statistic with just a single line of code, like this:

choroplethr_acs(tableId="B19301", lod="zip")

You can read more information about the capabilities of the choroplethr package at the Trulia Tech+Design Blog. Many thanks to Trulia and Ari Lamstein for supporting the development of this useful package!

Trulia Tech+Design Blog: The choroplethr package for R

by Joseph Rickert

When I was in graduate school in the mid '70s, Mathematics departments were still under the spell of abstraction for its own sake. At that time, Algebraic Topology, which uses concepts from Abstract Algebra to study topological spaces, was a major gateway to the realm of abstraction. On my first visit, it was not at all clear that any of the exotic creatures to be found there (simplices, homotopy groups, homology groups, etc.) would have anything to say about the Real line, let alone the real world. So it is astounding to me that over the last several years the masters of mathematical abstraction have turned Topological Data Analysis (TDA), a subfield of Computational Topology, into an exciting technique for dealing with high-dimensional data sets, one that shows great promise.

One fundamental idea of TDA is to consider a data set to be a sample or “point cloud” taken from a manifold in some high-dimensional topological space. The sample data are used to construct simplices, generalizations of intervals, which, in turn, are “glued” together to form a kind of wireframe approximation of the manifold. This manifold represents the “shape” of the data, and once you have it, the tools of Algebraic Topology can be used to construct a class of groups, homology groups, that are algebraic analogues of certain properties of the manifold. For example, knowing the number of holes in a manifold at various dimensions characterizes the topology of the manifold. (It is a single hole that makes a torus different from a sphere but makes a coffee cup topologically equivalent to a donut.) Homology groups describe the algebraic analogues of the holes. So, investigating the properties of the homology groups can tell you something about the topology of the manifold and, hopefully, something useful about your data.
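As a toy illustration of a "point cloud" sampled from a manifold, here is a base-R sketch of my own (not from any TDA package) drawing a noisy sample from a circle, which is exactly the kind of data set whose single one-dimensional hole persistent homology would detect:

```r
# Toy point cloud: a noisy sample from the unit circle, a manifold with
# one 1-dimensional hole that persistent homology could recover
set.seed(42)
n <- 200
theta <- runif(n, 0, 2 * pi)
cloud <- cbind(x = cos(theta) + rnorm(n, sd = 0.05),
               y = sin(theta) + rnorm(n, sd = 0.05))
plot(cloud, asp = 1, pch = 20,
     main = "Point cloud sampled from a circle")
```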

All of this so far is just basic Algebraic Topology. A recent breakthrough idea, the notion of "persistent homology", has helped to make TDA a practical tool for data analysis. Persistent homology algorithms look for topological invariants across various scales of a topological manifold. All of the several methods for constructing the simplicial complex that constitutes the wireframe described above involve a scale parameter e, but knowing the number of holes and other structural features that appear at any particular scale is not enough to characterize the manifold. What you really want to know is which features persist over the full range of the parameter e. Efficient algorithms have been developed both for computing the homology groups themselves and for visualizing them. A “barcode” plot is a qualitative visualization of a homology group that looks like a collection of horizontal line segments. The x axis represents the parameter e and the y axis is an arbitrary ordering of homology generators. (See Ghrist for the details.) When examining a barcode plot you are looking for lines that span a good portion of the x axis. Short lines are most likely just topological noise; long lines that persist across a good bit of the e scale may tell you something about your data. The following barcode plot for the iris data set was drawn with functions from the R package phom.

library(phom)
data <- as.matrix(iris[, -5])
head(data)
max_dim <- 0
max_f <- 1
irisInt0 <- pHom(data,
                 dimension = max_dim,           # maximum dimension of persistent homology computed
                 max_filtration_value = max_f,  # maximum filtration value of the complex
                 mode = "vr",                   # type of filtration complex (Vietoris-Rips)
                 metric = "euclidean")
plotBarcodeDiagram(irisInt0, max_dim, max_f,
                   title = "H0 Barcode plot of Iris Data")


I interpret the plot to tell me that there are probably 3 or 4 clusters in the data set. This is a lot of heavy-duty mathematical machinery to say something a bit vague about the iris data set; however, while this small example may not be particularly exciting, I hope it serves to whet your appetite. To investigate further, have a look at CompTop, Stanford's group for Applied and Computational Algebraic Topology, and the resources page on the Ayasdi website. This Palo Alto start-up, founded by Stanford professor Gunnar Carlsson, a pioneer in Computational Topology, along with two other Stanford-trained mathematicians, is on a mission to make TDA mainstream. Be sure to take a look at Professor Carlsson's video, which is entertaining, informative and well worth watching.