Thanks to everyone at the Chicago R User Group for giving me such a warm welcome for my presentation last night. In my talk, I gave an introduction to Revolution R Open, with a focus on how the checkpoint package makes sharing R code in a reproducible way easy:

If you'd like to try out the checkpoint package, it's available on CRAN now. You can also install a newer, more efficient version from GitHub as follows:

library(devtools)

install_github("RevolutionAnalytics/checkpoint")

You can't see my demo in the slides, but I ran this script in RStudio and demonstrated what happens when you adjust the checkpoint date. You can learn more about checkpoint and the Reproducible R Toolkit here.
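In case it's useful, here's a minimal sketch of the kind of script I ran; the snapshot date and the package loaded afterwards are just examples, not the exact demo code:

```r
# install.packages("checkpoint")  # from CRAN
library(checkpoint)

# Use CRAN exactly as it was on this date; changing the date changes
# which package versions get installed and loaded for this project
checkpoint("2015-01-15")

# Packages loaded below come from the dated snapshot, so anyone running
# this script with the same checkpoint date gets the same versions
library(ggplot2)
ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()
```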

The team at RStudio has just released an update to the immensely useful dplyr package, making it even more powerful for manipulating data frames in R. The new 0.4.0 version adds new "verbs" to the syntax for mutating joins (left join, right join, etc.), filtering joins, and set operations (intersection and union). There's also some new documentation to help you get started with dplyr, including a vignette on using data frames with dplyr and a printable cheatsheet on data wrangling with dplyr and tidyr. Check out all the updates at the RStudio blog post linked below.

RStudio blog: dplyr 0.4.0
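A quick illustration of the new verbs, using toy data of my own (not taken from the release post):

```r
library(dplyr)

bands   <- data.frame(name = c("Mick", "John", "Paul"),
                      band = c("Stones", "Beatles", "Beatles"))
artists <- data.frame(name  = c("John", "Paul", "Keith"),
                      plays = c("guitar", "bass", "guitar"))

# Mutating join: add columns from artists, keeping every row of bands
left_join(bands, artists, by = "name")

# Filtering join: rows of bands with no match in artists
anti_join(bands, artists, by = "name")

# Set operation: distinct rows appearing in either data frame
union(bands[1:2, ], bands[2:3, ])
```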

Bioconductor is a project to develop and curate a collection of R packages used for analysis of genetic data (specifically, analysis and comprehension of high-throughput genomic data). With the wealth of genetic data on humans and animals now available, Bioconductor is widely used in medical research to understand how genes influence our health, and to develop new therapies and drugs. (It was used in this recent Nature Molecular Psychiatry article, for example.)

The project currently includes 935 R packages — a tally that's not normally included in the count of available R packages (the count of packages on CRAN currently stands at 6179). There are also 895 packages of genetic data (including many annotated genomes), and 223 packages of experimental data.
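If you'd like to try it, Bioconductor packages are (at the time of writing) installed with the project's own biocLite() installer rather than install.packages(). A sketch, with limma as just one example package:

```r
# Fetch the Bioconductor installer script, then use it to install a package
source("http://bioconductor.org/biocLite.R")
biocLite("limma")

# Installed Bioconductor packages load like any other R package
library(limma)
```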

The latest issue of the Bioconductor newsletter shared some impressive statistics on the growth of the project:

- A search for Bioconductor citations on Google Scholar yields more than 27,000 hits. (You can see a list of recent articles citing Bioconductor here, including articles in *Nature*, *Genome Biology*, and *Statistical Science*.)
- Visits to the Bioconductor website increased by 23% in 2014.
- 63 packages were added in the last 3 months of 2014.
- Package downloads have increased by 9%.

You can find much more news about the Bioconductor project in the current and previous Bioconductor newsletters. Start with the most recent issue at the link below.

Bioconductor Newsletter: January 2015

For Twitter, finding anomalies — sudden spikes or dips — in a time series is important to keep the microblogging service running smoothly. A sudden spike in shared photos may signify a "trending" event, whereas a sudden dip in posts might represent a failure in one of the back-end services that needs to be addressed. To detect such anomalies, the engineering team at Twitter created the AnomalyDetection R package, which they recently released as open source. (Late last year Twitter released a separate but related R package to detect "breakouts" in time series.)

Finding spikes and dips is relatively easy when they are extreme enough to extend beyond the natural seasonal variation in the time series. (Twitter calls these "global anomalies".) The real trick is in identifying "local anomalies": small variations on the seasonal trend, but which don't extend beyond the usual range of values.

The AnomalyDetection package uses the Seasonal Hybrid ESD (S-H-ESD) algorithm, which combines seasonal decomposition with robust statistical methods to identify local and global anomalies. The package can also be used to detect anomalies in non-time-series (unordered) data, though in this case the concept of "local" anomalies doesn't apply. You can find out more information about the package and how it's used at Twitter at the link below, or install it from Github for use with R.
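Usage follows the pattern shown in the package README; a sketch, assuming the example dataset `raw_data` bundled with the package:

```r
# devtools::install_github("twitter/AnomalyDetection")
library(AnomalyDetection)

# Example time series bundled with the package (timestamp, count)
data(raw_data)

# Flag at most 2% of the points as anomalies, in either direction
res <- AnomalyDetectionTs(raw_data, max_anoms = 0.02,
                          direction = "both", plot = TRUE)

res$anoms   # data frame of detected anomalies
res$plot    # ggplot of the series with anomalies highlighted
```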

Twitter Engineering Blog: Introducing practical and robust anomaly detection in a time series

The choice of colors you use in a statistical graphic isn't just about making your chart look good: the colors you choose are often critical to interpretation. For example, you wouldn't want to use a scale like this to represent, say, average income on a map:

That palette would be suitable for qualitative data without implicit ordering (say, political parties), but for a continuous variable the viewer has no reason to assume that "pink" should be more than "orange". Even when choosing a continuous scale, the colors you use may have cultural meanings (imagine a map of temperatures with "hot" represented by blue shades and "cold" represented by red). Still, bad color choices happen all the time.

One person who has done more than most to promote good practices in color use is the cartographer Cindy Brewer, who was recently the subject of a feature article in Wired. Her ColorBrewer2 website helps you choose the appropriate color scale for your map depending on your data type: qualitative, sequential, or diverging (with a neutral color between two extremes). It can even give you colorblind-safe and print-friendly options.

For R users, the RColorBrewer package makes it easy to use such palettes in R charts. After loading the package, you can see the palettes you have to choose from with the display.brewer.all() function:

(**Update** Dec 8: display.brewer.all now has an option to display colorblindness-friendly palettes.) You can use the brewer.pal function to select one of these palettes with any number of colors. And if you're a ggplot2 user, you can use the scale_colour_brewer and scale_fill_brewer scales to apply a ColorBrewer palette.
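Putting those pieces together, a brief sketch (the palette names here are just examples):

```r
library(RColorBrewer)

# Show only the palettes that are safe for colorblind viewers
display.brewer.all(colorblindFriendly = TRUE)

# Pull 7 colors from the sequential "YlGnBu" palette (as hex codes)
brewer.pal(7, "YlGnBu")

# In ggplot2, apply a ColorBrewer palette to a discrete fill scale
library(ggplot2)
ggplot(mpg, aes(x = class, fill = class)) +
  geom_bar() +
  scale_fill_brewer(palette = "Set2")
```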

On the other hand, if your chart is more whimsical than scientific, there's always the Wes Anderson palettes.

Wired: The Cartographer Who’s Transforming Map Design (via Trevor A. Branch)

With so many more devices and instruments connected to the "Internet of Things" these days, there's a whole lot more time series data available to analyze. But time series are typically quite noisy: how do you distinguish a short-term tick up or down from a true change in the underlying signal? To solve this problem, Twitter created the BreakoutDetection package for R, which decomposes a time series into a series of segments of one of three types:

- **Steady state**: The time series follows a fixed mean (with random noise around the mean);
- **Mean shift**: The time series jumps directly from one steady state to another;
- **Ramp up / down**: The time series transitions linearly from one steady state to another, over a fixed period of time.

Given a univariate time series (and a few tuning parameters), the breakout function will return a list of **breakout points**: times when these state transitions are detected. It uses a non-parametric algorithm (E-Divisive with Medians) to detect the breakout points, so no assumptions are made about the underlying distribution of the time series.
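A sketch on simulated data of my own, with a single mean shift at the halfway point:

```r
# devtools::install_github("twitter/BreakoutDetection")
library(BreakoutDetection)

# Simulate a series that jumps from one steady state to another
set.seed(42)
z <- c(rnorm(150, mean = 0), rnorm(150, mean = 3))

# Require at least 30 points per segment; allow multiple breakouts
res <- breakout(z, min.size = 30, method = "multi", plot = TRUE)

res$loc    # estimated breakout locations in the series
res$plot   # plot of the series with breakout points marked
```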

Twitter uses this R package to monitor the user experience on the Twitter network and detect when things are "Breaking Bad". Data scientist Randy Zwitch used the package to identify the dates of blog posts or references on Hacker News from his blog traffic data. (He also compared the algorithm to anomaly detection with the Adobe Analytics API.) And the University of Louisville School of Medicine has also looked at using the package to identify past influenza outbreaks from CDC data:

For more information about the BreakoutDetection package, check out Twitter's blog post linked below. You can download the BreakoutDetection R package itself from GitHub.

Twitter Engineering blog: Breakout detection in the wild (via FlowingData)

by Joseph Rickert

The igraph package has become a fundamental tool for the study of graphs and their properties, the manipulation and visualization of graphs, and the statistical analysis of networks. To get an idea of just how firmly igraph is embedded in the R package ecosystem, consider that it currently lists 72 reverse depends, 59 reverse imports, and 24 reverse suggests. The following plot, made with functions from the igraph and miniCRAN packages, indicates the complexity of this network.

library(miniCRAN)
library(igraph)

pk <- c("igraph", "agop", "bc3net", "BDgraph", "c3net", "camel", "cccd",
        "CDVine", "CePa", "CINOEDV", "cooptrees", "corclass", "cvxclustr",
        "dcGOR", "ddepn", "dils", "dnet", "dpa", "ebdbNet", "editrules",
        "fanovaGraph", "fastclime", "FisHiCal", "flare", "G1DBN", "gdistance",
        "GeneNet", "GeneReg", "genlasso", "ggm", "gRapfa", "hglasso", "huge",
        "igraphtosonia", "InteractiveIGraph", "iRefR", "JGL", "lcd",
        "linkcomm", "locits", "loe", "micropan", "mlDNA", "mRMRe", "nets",
        "netweavers", "optrees", "packdep", "PAGI", "pathClass", "PBC",
        "phyloTop", "picasso", "PoMoS", "popgraph", "PROFANCY", "qtlnet",
        "RCA", "ReliabilityTheory", "rEMM", "restlos", "rgexf", "RNetLogo",
        "ror", "RWBP", "sand", "SEMID", "shp2graph", "SINGLE", "spacejam",
        "TDA", "timeordered", "tnet")

dg <- makeDepGraph(pk)
plot(dg, main = "Network of reverse depends for igraph", cex = .4, vertex.size = 8)

The igraph package itself is a tour de force with over 200 functions. Learning the package can be a formidable task especially if you are trying to learn graph theory and network analysis at the same time. To help myself through this process, I sorted the functions in the igraph package into seven rough categories:

- Create Graph
- Describe Graph
- Environment
- Find Communities
- Operate on a Graph
- Plot
- Statistics

The following shows a portion of the table for the first 10 functions related to creating graphs.

| | Function | Description | Category of Function |
|---|----------|-------------|----------------------|
| 1 | aging.prefatt.game | Generate an evolving random graph with preferential attachment and aging | Create Graph |
| 2 | barabasi.game | Generate scale-free graphs according to the Barabasi-Albert model | Create Graph |
| 3 | bipartite.random.game | Generate bipartite graphs using the Erdos-Renyi model | Create Graph |
| 4 | degree.sequence.game | Generate a random graph with a given degree sequence | Create Graph |
| 5 | erdos.renyi.game | Generate random graphs according to the Erdos-Renyi model | Create Graph |
| 6 | forest.fire.game | Grow a network that simulates how a fire spreads by igniting trees | Create Graph |
| 7 | graph.adjacency | Create an igraph from an adjacency matrix | Create Graph |
| 8 | graph.bipartite | Create a bipartite graph | Create Graph |
| 9 | graph.complementer | Create the complementary graph for a given graph | Create Graph |
| 10 | graph.empty | Create an empty graph | Create Graph |

The entire table may be downloaded here: Download Igraph_functions.

infomap.community() is an intriguing function in the Find Communities category: it looks for structure in a network by minimizing the expected description length of a random-walk trajectory. The abstract of the paper by Rosvall and Bergstrom that introduced this method states:

To comprehend the multipartite organization of large-scale biological and social systems, we introduce an information theoretic approach that reveals community structure in weighted and directed networks. We use the probability flow of random walks on a network as a proxy for information flows in the real system and decompose the network into modules by compressing a description of the probability flow. The result is a map that both simplifies and highlights the regularities in the structure and their relationships.

I am not willing to claim that the 37 communities found by the algorithm represent meaningful structure; however, the idea of partitioning the network based on information flow does seem relevant to the package-building process. Anybody looking for a research project?

imc <- infomap.community(dg)
imc
# Graph community structure calculated with the infomap algorithm
# Number of communities: 37
# Modularity: 0.5139813
# Membership vector:

Some additional resources for working with igraph are:

- The igraph home page
- A presentation by Gábor Csárdi, igraph's principal author
- An old post with some pointers to additional resources
- A couple of nice tutorials: here and here.
- The new book Statistical Analysis of Network Data with R by Eric D. Kolaczyk and Gábor Csárdi is a very good read.

by Joseph Rickert

I recently had the opportunity to look at the data used for the 2009 KDD Cup competition. There are actually two sets of files that are still available from this competition. The "large" file is a series of five .csv files that when concatenated form a data set with 50,000 rows and 15,000 columns. The "small" file also contains 50,000 rows but only 230 columns. "Target" files are also provided for both the large and small data sets. The target files contain three sets of labels for "appetency", "churn" and "upselling" so that the data can be used to train models for three different classification problems.
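Concatenating the pieces of the large file is just a row-bind. Here's a self-contained sketch using two tiny stand-in chunks written to a temporary directory (the real KDD chunk file names differ):

```r
# Create two small stand-in chunks; with the real data you would list the
# five downloaded chunk files instead
files <- file.path(tempdir(), sprintf("chunk%d.csv", 1:2))
write.csv(data.frame(a = 1:3, b = 4:6),   files[1], row.names = FALSE)
write.csv(data.frame(a = 7:9, b = 10:12), files[2], row.names = FALSE)

# Read each chunk and stack them row-wise into one data frame
chunks <- lapply(files, read.csv)
DF <- do.call(rbind, chunks)
dim(DF)  # 6 rows x 2 columns here; 50,000 x 15,000 for the real data
```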

The really nice feature of both the large and small data sets is that they are extravagantly ugly, containing large numbers of missing values, factor variables with thousands of levels, factor variables with only one level, numeric variables with constant values, and correlated independent variables. To top it off, the targets are severely unbalanced, containing very low proportions of positive examples for all three of the classification problems. These are perfect files for practice.

Oftentimes, the most difficult part about working with data like this is just knowing where to begin. Since taking a good look at the data is usually the place to start, let's look at a couple of R tools that I found helpful for that first dive into messy data.

The mi package takes a sophisticated approach to multiple imputation and provides some very advanced capabilities. However, it also contains simple and powerful tools for looking at data. The function missing.pattern.plot() lets you see the pattern of missing values. The following call provides a gestalt for the small data set.

library(mi)

missing.pattern.plot(DF, clustered = FALSE,
                     xlab = "observations",
                     main = "KDD 2009 Small Data Set")

Observations (rows) go from left to right and variables from bottom to top. Red indicates missing values. Looking at just the first 25 variables makes it easier to see what the plot is showing.

The function mi.info(), also in the mi package, provides a tremendous amount of information about a data set. Here is the output for the first 10 variables. The first thing the function does is list the variables with no data and the variables that are highly correlated with each other. Thereafter, the function lists a row for each variable that includes the number of missing values and the variable type. This is remarkably useful information that would otherwise take a little bit of work to discover.

> mi.info(DF)
variable(s) Var8, Var15, Var20, Var31, Var32, Var39, Var42, Var48, Var52, Var55,
Var79, Var141, Var167, Var169, Var175, Var185 has(have) no observed value,
and will be omitted.
following variables are collinear
[[1]]
[1] "Var156" "Var66"  "Var9"

[[2]]
[1] "Var104" "Var105"

[[3]]
[1] "Var111" "Var157" "Var202" "Var33"  "Var61"  "Var71"  "Var91"

    names include order number.mis all.mis                type      collinear
1    Var1     Yes     1       49298      No         nonnegative             No
2    Var2     Yes     2       48759      No              binary             No
3    Var3     Yes     3       48760      No         nonnegative             No
4    Var4     Yes     4       48421      No ordered-categorical             No
5    Var5     Yes     5       48513      No         nonnegative             No
6    Var6     Yes     6        5529      No         nonnegative             No
7    Var7     Yes     7        5539      No         nonnegative             No
8    Var8      No    NA       50000     Yes          proportion             No
9    Var9      No    NA       49298      No         nonnegative  Var156, Var66
10  Var10     Yes     8       48513      No         nonnegative             No

For Revolution R Enterprise users, the function rxGetInfo() is a real workhorse. It applies to data frames as well as data stored in .xdf files. For data in these files there is essentially no limit to how many observations can be analyzed. rxGetInfo() is an example of an external-memory algorithm that reads only a chunk of data at a time from the file, so there is no need to try to stuff all of the data into memory.

The following is a portion of the output from running the function with the getVarInfo flag set to TRUE.

rxGetInfo(DF, getVarInfo=TRUE)

Data frame: DF
Number of observations: 50000
Number of variables: 230
Variable information:
Var 1: Var1, Type: numeric, Low/High: (0.0000, 680.0000)
Var 2: Var2, Type: numeric, Low/High: (0.0000, 5.0000)
Var 3: Var3, Type: numeric, Low/High: (0.0000, 130668.0000)
...
Var 187: Var187, Type: numeric, Low/High: (0.0000, 910.0000)
Var 188: Var188, Type: numeric, Low/High: (-6.4200, 628.6200)
Var 189: Var189, Type: numeric, Low/High: (6.0000, 642.0000)
Var 190: Var190, Type: numeric, Low/High: (0.0000, 230427.0000)
Var 191: Var191 2 factor levels: r__I
Var 192: Var192 362 factor levels: _hrvyxM6OP _v2gUHXZeb _v2rjIKQ76 _v2TmBftjz ... zKnrjIPxRp ZlOBLJED1x ZSNq9atbb6 ZSNq9aX0Db ZSNrjIX0Db
Var 193: Var193 51 factor levels: _7J0OGNN8s6gFzbM 2Knk1KF 2wnefc9ISdLjfQoAYBI 5QKIjwyXr4MCZTEp7uAkS8PtBLcn 8kO9LslBGNXoLvWEuN6tPuN59TdYxfL9Sm6oU ... X1rJx42ksaRn3qcM X2uI6IsGev yaM_UXtlxCFW5NHTcftwou7BmXcP9VITdHAto z3s4Ji522ZB1FauqOOqbkl zPhCMhkz9XiOF7LgT9VfJZ3yI
Var 194: Var194 4 factor levels: CTUH lvza SEuy
Var 195: Var195 23 factor levels: ArtjQZ8ftr3NB ArtjQZmIvr94p ArtjQZQO1r9fC b_3Q BNjsq81k1tWAYigY ... taul TnJpfvsJgF V10_0kx3ZF2we XMIgoIlPqx ZZBPiZh
Var 196: Var196 4 factor levels: 1K8T JA1C mKeq z3mO
Var 197: Var197 226 factor levels: _8YK _Clr _vzJ 0aHy ... ZEGa ZF5Q ZHNR ZNsX ZSv9
Var 198: Var198 4291 factor levels: _0Ong1z _0OwruN _0OX0q9 _3J0EW7 _3J6Cnn ... ZY74iqB ZY7dCxx ZY7YHP2 ZyTABeL zZbYk2K
Var 199: Var199 5074 factor levels: _03fc1AIgInD8 _03fc1AIgL6pC _03jtWMIkkSXy _03wXMo6nInD8 ... zyR5BuUrkb8I9Lth ZZ5
...

rxGetInfo() doesn't provide all of the information that mi.info() does, but it does do a particularly nice job on factor data, giving the number of levels and showing the first few. The two functions are complementary.

For a full listing of the output shown above, download the file: Download Mi_info_output.

The latest update to the world's most popular statistical data analysis software is now available. R 3.1.2 (codename: "Pumpkin Helmet") makes a number of minor improvements and bug fixes to the R language engine. You can see the complete list of changes here, which include improvements for the log-Normal distribution function, improved axis controls for histograms, a fix to the nlminb optimizer which was causing rare crashes on Windows (and traced to a bug in the gcc compiler), and some compatibility updates for the Yosemite release of OS X on Macs.

This latest update comes on the heels of another major R milestone: the CRAN package repository now features more than 6,000 user-contributed packages! (CRAN actually hit that milestone two days ago; as of this writing there are 6,004.) These packages are all ready to use with R 3.1.2 — the CRAN system automatically checks to make sure the packages pass all of their tests with the latest R version. You can search and explore R packages on MRAN, or simply browse R packages by topic area.

Revolution R Open will be updated to include R 3.1.2 very soon: the next update is in testing now and should be ready in a couple of weeks.

r-devel mailing list: R 3.1.2 is released