Google has just released a new package for R: CausalImpact. Amongst many other things, this package allows you to resolve a classical conundrum: how can we assess the impact of an intervention (for example, the effect of an advertising campaign on website clicks) when we can't know what would have happened if we *hadn't* run the campaign? For a marketer, the worry is that a spike in clicks was partially or wholly the result of something unrelated (say, a general increase in web traffic) rather than the campaign.

The CausalImpact package uses Bayesian structural time-series models to resolve this question. All you need is a *second* time series to act as a "virtual" control, one which is unaffected by your actions but which is still subject to the extraneous effects you're worried about. (For the marketing example, you might choose web clicks from a region where the campaign didn't run.) Then, you can model the extraneous effects and subtract them from your actual results, to see how things would have played out had the intervention *not* occurred.
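The interface is compact. Here is a minimal sketch using simulated data in place of real campaign results; the series, coefficients, and period indices below are invented for illustration (the +10 "lift" is injected at time point 71, so we know the true answer the model should recover):

```r
# Sketch only: simulated control (x) and response (y) series
library(CausalImpact)
set.seed(1)
x <- 100 + arima.sim(model = list(ar = 0.999), n = 100)  # control series
y <- 1.2 * x + rnorm(100)                                # response series
y[71:100] <- y[71:100] + 10                              # simulated campaign effect
impact <- CausalImpact(cbind(y, x),
                       pre.period = c(1, 70), post.period = c(71, 100))
summary(impact)  # estimated absolute and relative effect
plot(impact)     # observed vs. counterfactual, pointwise and cumulative effect
```

The pre-period is used to learn the relationship between the two series; the post-period prediction is the counterfactual that gets subtracted from the observed results.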

In the chart below (from the Google Open Source blog post) you can see the results of the campaign in black, with the campaign launch at the dotted line. The blue line shows the estimated results had the campaign *not* run, clearly showing that it was effective.

Google uses R and the CausalImpact package to measure the return-on-investment on advertising campaigns its customers run:

We've been testing and applying structural time-series models for some time at Google. For example, we've used them to better understand the effectiveness of advertising campaigns and work out their return on investment. We've also applied the models to settings where a randomised experiment was available, to check how similar our effect estimates would have been without an experimental control.

Now that Google has made the CausalImpact package available, you can do the same for any kind of intervention, as long as you have a time series of results before and after the intervention, and a "control" time series where the intervention had no effect. Read more about the package and the methodology behind it at the link below.

Google Open Source Blog: CausalImpact: A new open-source package for estimating causal effects in time series

If you're looking for just the right package to solve your R problem, you could always browse through the list of available packages on CRAN. But with almost 6000 entries, that's not going to be the most efficient process. And even then, many very useful packages aren't found on CRAN: there are more than 800 packages hosted on BioConductor and more than 200 commonly-used packages on GitHub (with more than 3 GitHub stars).

Rdocumentation.org, a handy website from the team at DataCamp, makes it much easier to find just the function you're looking for, whether it's found in a package on CRAN, BioConductor or GitHub. Just search for a keyword, and it returns all of the matching packages, plus the full help pages for each of the matching functions. Or, you can just browse the most popular packages, based on their download statistics from the RStudio CRAN mirror.

(Also check out this poster for an analysis of the top-ranked packages from July 2014.)

GitHub and bioconductor: New! Search GitHub and BioConductor packages on RDocumentation

by Joseph Rickert

I think we can be sure that when American botanist Edgar Anderson meticulously collected data on three species of iris in the early 1930s he had no idea that these data would produce a computational storm that would persist well into the 21st century. The calculations started, presumably by hand, when R. A. Fisher selected this data set to illustrate the techniques described in his 1936 paper on discriminant analysis. However, they really got going in the early 1970s when the pattern recognition and machine learning community began using it to test new algorithms or illustrate fundamental principles. (The earliest reference I could find was: Gates, G.W. "The Reduced Nearest Neighbor Rule". IEEE Transactions on Information Theory, May 1972, 431-433.) Since then, the data set (or one of its variations) has been used to test hundreds, if not thousands, of machine learning algorithms. The UCI Machine Learning Repository, which contains what is probably the “official” iris data set, lists over 200 papers referencing the iris data.

So why has the iris data set become so popular? Like most success stories, randomness undoubtedly plays a huge part. However, Fisher’s selecting it to illustrate a discrimination algorithm brought it to people’s attention, and the fact that the data set contains three classes, only one of which is linearly separable from the other two, makes it interesting.

For similar reasons, the airlines data set used in the 2009 ASA Sections on Statistical Computing and Statistical Graphics Data expo has gained a prominent place in the machine learning world and is well on its way to becoming the “iris data set for big data”. It shows up in all kinds of places. (In addition to this blog, it made its way into the RHIPE documentation and figures in several college course modeling efforts.)

Some key features of the airlines data set are:

- It is big enough to exceed the memory of most desktop machines. (The version of the airlines data set used for the competition contained just over 123 million records with twenty-nine variables.)
- The data set contains several different types of variables. (Some of the categorical variables have hundreds of levels.)
- There are interesting things to learn from the data set. (See this exercise from Kane and Emerson, for example.)
- The data set is *tidy*, but not clean, making it an attractive tool for practicing big data wrangling. (The AirTime variable ranges from -3,818 minutes to 3,508 minutes.)

An additional, really nice feature of the airlines data set is that it keeps getting bigger! RITA, the Research and Innovative Technology Administration's Bureau of Transportation Statistics, continues to collect data, which can be downloaded in .csv files. For your convenience, we have a 143M+ record version of the data set on the Revolution Analytics test data site, containing all of the RITA records from 1987 through the end of 2012, available for download.

The following analysis from Revolution Analytics’ Sue Ranney uses this large version of the airlines data set and illustrates how a good model, driven with enough data, can reveal surprising features of a data set.

```
# Fit a Tweedie GLM
tm <- system.time(
  glmOut <- rxGlm(ArrDelayMinutes ~ Origin:Dest + UniqueCarrier + F(Year) +
                    DayOfWeek:F(CRSDepTime),
                  data = airData, family = rxTweedie(var.power = 1.15),
                  cube = TRUE, blocksPerRead = 30)
)
tm

# Build a data frame for three airlines: Delta (DL), Alaska (AS), Hawaiian (HA)
airVarInfo <- rxGetVarInfo(airData)
predData <- data.frame(
  UniqueCarrier = factor(rep(c("DL", "AS", "HA"), times = 168),
                         levels = airVarInfo$UniqueCarrier$levels),
  Year = as.integer(rep(2012, times = 504)),
  DayOfWeek = factor(rep(c("Mon", "Tues", "Wed", "Thur", "Fri", "Sat", "Sun"),
                         times = 72),
                     levels = airVarInfo$DayOfWeek$levels),
  CRSDepTime = rep(0:23, each = 21),
  Origin = factor(rep("SEA", times = 504), levels = airVarInfo$Origin$levels),
  Dest = factor(rep("HNL", times = 504), levels = airVarInfo$Dest$levels)
)

# Use the model to predict the arrival delay for the three airlines and plot
predDataOut <- rxPredict(glmOut, data = predData, outData = predData,
                         type = "response")
rxLinePlot(ArrDelayMinutes_Pred ~ CRSDepTime | UniqueCarrier,
           groups = DayOfWeek, data = predDataOut, layout = c(3, 1),
           title = "Expected Delay: Seattle to Honolulu by Departure Time, Day of Week, and Airline",
           xTitle = "Scheduled Departure Time", yTitle = "Expected Delay")
```

Here, rxGlm() fits a Tweedie generalized linear model that looks at arrival delay as a function of the interaction between origin and destination airports, carriers, year, and the interaction between days of the week and scheduled departure time. This function kicks off a considerable amount of number crunching, as Origin is a factor variable with 373 levels and Dest, also a factor, has 377 levels. The F() function makes Year and CRSDepTime factors "on the fly" as the model is being fit. The resulting model ends up with 140,852 coefficients, 8,626 of which are not NA. The calculation takes 12.6 minutes to run on a 5 node (4 cores and 16GB of RAM per node) IBM Platform LSF cluster.

The rest of the code uses the model to predict arrival delay for three airlines and plots the fitted values by day of the week and departure time.

It looks like Saturday is the best day to fly on these airlines. Note that none of the structure revealed in these curves was put into the model, in the sense that there are no polynomial terms in the model.

It will take a few minutes to download the zip file with the 143M airlines records, but please do, and let us know how your modeling efforts go.

by Andrie de Vries

In my previous post I wrote about how to identify and visualize package dependencies. Within hours, Duncan Murdoch (member of R-core) identified some discrepancies between my list of dependencies and the visualisation. Since then, I have fixed the discrepancies. In this blog post I attempt to clarify the issues involved in listing package dependencies.

In miniCRAN I expose two functions that provide information about dependencies:

- The function **pkgDep()** returns a character vector with the names of dependencies. Internally, pkgDep() is a wrapper around **tools::package_dependencies()**, a base R function that, well, tells you about package dependencies. My new function is in one way a convenience, but more importantly it sets different defaults (more about this later).
- The function **makeDepGraph()** creates an **igraph** representation of the dependencies.

Take a look at some examples. I illustrate with the package **chron**, because chron neatly illustrates the different roles of Imports, Suggests and Enhances:

- chron **Imports** the base packages **graphics** and **stats**. This means that chron internally makes use of graphics and stats and will always load these packages.
- chron **Suggests** the packages **scales** and **ggplot2**. This means that chron uses some functions from these packages in examples or in its vignettes. However, these functions are not necessary to use chron.
- chron **Enhances** the package **zoo**, meaning that it adds something to the zoo package. These enhancements are made available to you if you have zoo installed.

The function **pkgDep()** exposes not only these dependencies, but also all recursive dependencies. In other words, it answers the question: which packages need to be installed to satisfy all dependencies of dependencies?

This means that the algorithm is as follows:

- First retrieve a list of Suggests and Enhances, using a non-recursive dependency search
- Next, perform a recursive search for all Imports, Depends and LinkingTo

The resulting list of packages should then contain the complete list necessary to satisfy all dependencies. In code:

```
> library(miniCRAN)
> tags <- "chron"
> pkgDep(tags, suggests = FALSE, enhances = FALSE, includeBasePkgs = TRUE)
[1] "chron"    "graphics" "stats"

> pkgDep(tags, suggests = TRUE, enhances = FALSE)
 [1] "chron"        "RColorBrewer" "dichromat"    "munsell"      "plyr"         "labeling"
 [7] "colorspace"   "Rcpp"         "digest"       "gtable"       "reshape2"     "scales"
[13] "proto"        "MASS"         "stringr"      "ggplot2"

> pkgDep(tags, suggests = TRUE, enhances = TRUE)
 [1] "chron"        "RColorBrewer" "dichromat"    "munsell"      "plyr"         "labeling"
 [7] "colorspace"   "Rcpp"         "digest"       "gtable"       "reshape2"     "scales"
[13] "proto"        "MASS"         "stringr"      "lattice"      "ggplot2"      "zoo"
```


To create an igraph plot of the dependencies, you can use the function **makeDepGraph()** and plot the results:
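A call along these lines produces such a plot for chron; the seed and vertex.size are illustrative choices, not values from the original post, while the legend arguments follow the plot call shown later for the StackOverflow tags:

```r
# Build the dependency graph for chron, including Suggests and Enhances,
# then plot it with miniCRAN's plot method for the graph object
dg <- makeDepGraph("chron", suggests = TRUE, enhances = TRUE)
set.seed(1)
plot(dg, legendPosEdge = c(-1, -1), legendPosVertex = c(1, -1), vertex.size = 20)
```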


Note how the dependencies expand to zoo (enhanced), scales and ggplot (suggested) and then recursively from there to get all the Imports and LinkingTo dependencies.

In my previous post I tried to plot the most popular package tags on StackOverflow. Using the updated functionality in the miniCRAN functions, it is easier to understand the structure of the dependencies:

```
> tags <- c("ggplot2", "data.table", "plyr", "knitr",
+           "shiny", "xts", "lattice")
> pkgDep(tags, suggests = TRUE, enhances = FALSE)
 [1] "ggplot2"      "data.table"   "plyr"         "knitr"        "shiny"        "xts"
 [7] "lattice"      "digest"       "gtable"       "reshape2"     "scales"       "proto"
[13] "MASS"         "Rcpp"         "stringr"      "RColorBrewer" "dichromat"    "munsell"
[19] "labeling"     "colorspace"   "evaluate"     "formatR"      "highr"        "markdown"
[25] "mime"         "httpuv"       "caTools"      "RJSONIO"      "xtable"       "htmltools"
[31] "bitops"       "zoo"          "SparseM"      "survival"     "Formula"      "latticeExtra"
[37] "cluster"      "maps"         "sp"           "foreign"      "mvtnorm"      "TH.data"
[43] "sandwich"     "nlme"         "Matrix"       "bit"          "codetools"    "iterators"
[49] "timeDate"     "quadprog"     "Hmisc"        "BH"           "quantreg"     "mapproj"
[55] "hexbin"       "maptools"     "multcomp"     "testthat"     "mgcv"         "chron"
[61] "reshape"      "fastmatch"    "bit64"        "abind"        "foreach"      "doMC"
[67] "itertools"    "testit"       "rgl"          "XML"          "RCurl"        "Cairo"
[73] "timeSeries"   "tseries"      "its"          "fts"          "tis"          "KernSmooth"
> set.seed(1)
> plot(makeDepGraph(tags, includeBasePkgs = FALSE, suggests = TRUE, enhances = TRUE),
+      legendPosEdge = c(-1, -1), legendPosVertex = c(1, -1), vertex.size = 10, cex = 0.5)
```


After my previous post, Duncan Murdoch pointed out that the package **rgl**, suggested by **knitr**, appeared in the list but not in the plot. The new version of the function fixes this bug, which was introduced because I retrieved the suggested dependencies incorrectly.

EDIT:

A few hours ago miniCRAN went live on CRAN. Find miniCRAN at http://cran.r-project.org/web/packages/miniCRAN/index.html

by Joseph Rickert

If I had to pick just one application to be the “killer app” for the digital computer I would probably choose Agent Based Modeling (ABM). Imagine creating a world populated with hundreds, or even thousands of agents, interacting with each other and with the environment according to their own simple rules. What kinds of patterns and behaviors would emerge if you just let the simulation run? Could you guess a set of rules that would mimic some part of the real world? This dream is probably much older than the digital computer, but according to Jan Thiele’s brief account of the history of ABMs that begins his recent paper, *R Marries NetLogo: Introduction to the RNetLogo Package* in the *Journal of Statistical Software,* academic work with ABMs didn’t really take off until the late 1990s.

Now, people are using ABMs for serious studies in economics, sociology, ecology, socio-psychology, anthropology, marketing and many other fields. No less of a complexity scientist than Doyne Farmer (of dynamical systems and Prediction Company fame) has argued in *Nature* for using ABMs to model the complexity of the US economy, and has published on using ABMs to drive investment models. In the following clip from a 2006 interview, Doyne talks about building ABMs to explain the role of subprime mortgages in the housing crisis. (Note that when asked how one would calibrate such a model, Doyne explains the need to collect massive amounts of data on individuals.)

Fortunately, the tools for building ABMs seem to be keeping pace with the ambition of the modelers. There are now dozens of platforms for building ABMs, and it is somewhat surprising that NetLogo, a tool with some whimsical terminology (e.g. agents are called turtles) that was designed for teaching children, has apparently become a de facto standard. NetLogo is Java based, has an intuitive GUI, ships with dozens of useful sample models, is easy to program, and is available under the GPL 2 license.

As you might expect, R is a perfect complement to NetLogo. Doing serious simulation work requires a considerable amount of statistics for calibrating models, designing experiments, performing sensitivity analyses, reducing data, exploring the results of simulation runs and much more. The recent *JASS* paper *Facilitating Parameter Estimation and Sensitivity Analysis of Agent-Based Models: A Cookbook Using NetLogo and R* by Thiele and his collaborators describes the R / NetLogo relationship in great detail and points to a decade's worth of reading. But the real fun is that Thiele's RNetLogo package lets you jump in and start analyzing NetLogo models in a matter of minutes.

Here is part of an extended example from Thiele's *JSS* paper that shows R interacting with the Fire model that ships with NetLogo. Using some very simple logic, Fire models the progress of a forest fire.

Snippet of NetLogo Code that drives the Fire model

```
to go
  if not any? turtles  ;; either fires or embers
    [ stop ]
  ask fires
    [ ask neighbors4 with [pcolor = green]
        [ ignite ]
      set breed embers ]
  fade-embers
  tick
end

;; creates the fire turtles
to ignite  ;; patch procedure
  sprout-fires 1
    [ set color red ]
  set pcolor black
  set burned-trees burned-trees + 1
end
```

The general idea is that turtles represent the frontier of the fire as it runs through a grid of randomly placed trees. Not shown in the above snippet is the logic by which the entire model is controlled by a single parameter representing the density of the trees.

This next bit of R code shows how to launch the Fire model from R, set the density parameter, and run the model.

```
# Launch RNetLogo and control an initial run of the NetLogo Fire model
library(RNetLogo)
nlDir <- "C:/Program Files (x86)/NetLogo 5.0.5"
setwd(nlDir)
nl.path <- getwd()
NLStart(nl.path)
model.path <- file.path("models", "Sample Models", "Earth Science", "Fire.nlogo")
NLLoadModel(file.path(nl.path, model.path))
NLCommand("set density 70")  # set density value
NLCommand("setup")           # call the setup routine
NLCommand("go")              # launch the model from R
```

Here we see the Fire model running in the NetLogo GUI after it was launched from RStudio.

This next bit of code tracks the progression of the fire as a function of time (model "ticks"), returns results to R and plots them. The plot shows the non-linear behavior of the system.

```
# Investigate percentage of forest burned as the simulation proceeds, and plot
library(ggplot2)
NLCommand("set density 60")
NLCommand("setup")
burned <- NLDoReportWhile("any? turtles", "go",
                          c("ticks", "(burned-trees / initial-trees) * 100"),
                          as.data.frame = TRUE,
                          df.col.names = c("tick", "percent.burned"))
# Plot with ggplot2
p <- ggplot(burned, aes(x = tick, y = percent.burned))
p + geom_line() +
  ggtitle("Non-linear forest fire progression with density = 60")
```

As with many dynamical systems, the Fire model displays a phase transition. Setting the density lower than 55 will not result in the complete destruction of the forest, while setting density above 75 will very likely result in complete destruction. The following plot shows this behavior.

RNetLogo makes it very easy to programmatically run multiple simulations and capture the results for analysis in R. The following two lines of code run the Fire model twenty times for each value of density between 55 and 65, the region surrounding the phase transition.

```
d <- seq(55, 65, 1)    # vector of densities to examine
res <- rep.sim(d, 20)  # run the simulation 20 times at each density
```
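The rep.sim() helper comes from Thiele's paper and isn't reproduced in this post; a hypothetical sketch with the same shape, built from the RNetLogo calls used above, might look like this (the function name and return format are assumptions):

```r
# Hypothetical rep.sim(): run the Fire model 'reps' times at each density
# and collect the percentage of trees burned in a data frame.
rep.sim <- function(densities, reps) {
  do.call(rbind, lapply(densities, function(d) {
    percent.burned <- sapply(seq_len(reps), function(i) {
      NLCommand("set density", d)
      NLCommand("setup")
      NLDoCommandWhile("any? turtles", "go")  # run until no fires or embers remain
      NLReport("(burned-trees / initial-trees) * 100")
    })
    data.frame(density = d, percent.burned = percent.burned)
  }))
}
```

See Thiele's JSS paper and supplementary code for the original implementation.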

The plot below shows the variability of the percent of trees burned as a function of density in the transition region.

My code to generate plots is available in the file: Download NelLogo_blog while all of the code from Thiele's JSS paper is available from the journal website.

Finally, here are a few more interesting links related to ABMs.

- On validating ABMs
- ABMs and

R is a functional language, which means that your code often contains a lot of ( parentheses ). And complex code often means nesting those parentheses together, which makes code hard to read and understand. But there's a very handy R package — magrittr, by Stefan Milton Bache — which lets you transform nested function calls into a simple pipeline of operations that's easier to write and understand.

Hadley Wickham's dplyr package benefits from the %>% pipeline operator provided by magrittr. Hadley showed at useR! 2014 an example of a data transformation operation using traditional R function calls:
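The slide itself isn't reproduced here, but the nested version reads roughly like this (a reconstruction following the well-known flights example, not Hadley's exact code):

```r
# Nested calls: you have to read from the inside out
filter(
  summarise(
    group_by(
      filter(flights, !is.na(dep_delay)),
      date, hour
    ),
    delay = mean(dep_delay),
    n = n()
  ),
  n > 10
)
```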

Here's the same code, but rather than nesting one function call inside the next, data is passed from one function to the next using the %>% operator:
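A reconstruction of that pipelined version (again following the flights example rather than the exact slide):

```r
flights %>%
  filter(!is.na(dep_delay)) %>%
  group_by(date, hour) %>%
  summarise(delay = mean(dep_delay), n = n()) %>%
  filter(n > 10)
```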

You can read this version aloud to easily get a sense of what it does: the flights data frame is filtered (to remove missing values of the dep_delay variable) and grouped by hours within days; the mean delay is calculated within groups; and the mean delay is returned for those hours with more than 10 flights.

You can use the %>% operator with standard R functions — and even your own functions — too. The rules are simple: the object on the left hand side is passed as the first argument to the function on the right hand side. So:

- my.data %>% my.function is the same as my.function(my.data)
- my.data %>% my.function(arg=value) is the same as my.function(my.data, arg=value)

It's even possible to pass in data to something other than the first argument of the function using a . (dot) operator to mark the place where the object goes — see the magrittr vignette for details.
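These rules are easy to check at the console; a couple of small, self-contained examples:

```r
library(magrittr)

10 %>% seq_len()                        # same as seq_len(10)
c(1, 5, 3) %>% sort(decreasing = TRUE)  # same as sort(c(1, 5, 3), decreasing = TRUE)
"hello" %>% gsub("l", "L", .)           # dot marks the slot: gsub("l", "L", "hello")
```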

This new "pipelining" operation is a really useful addition to the R language, and R developers are starting to use it to make their code simpler to write and maintain. Hadley Wickham's newest R package, tidyr, makes it easy to clean up data sets for analysis by stringing together operations like "gather" and "spread" using the %>% operator.

And speaking of pipelining, you may have been wondering where the name "magrittr" comes from. Here's the answer:

The only other question is: will Stefan be making this coffee mug available?

magrittr vignette: Ceci n'est pas un pipe

by Joseph Rickert

Broadly speaking, a meta-analysis is any statistical analysis that attempts to combine the results of several individual studies. The term was apparently coined by statistician Gene V Glass in a 1976 speech he made to the American Education Research Association. Since that time, not only has meta-analysis become a fundamental tool in medicine, but it is also becoming popular in economics, finance, the social sciences and engineering. Organizations responsible for setting standards for evidence-based medicine such as the United Kingdom’s National Institute for Health and Care Excellence (NICE) make extensive use of meta-analysis.

The application of meta-analysis to medicine is intuitive and, on the surface, compelling. Clinical trials designed to test the efficacy of some new treatment for a disease against the standard treatment tend to be based on relatively small samples. (For example, the largest four trials for Respiratory Tract Diseases currently listed on ClinicalTrials.gov have an estimated enrollment of 533 patients.) It would seem to be a “no brainer” to use “all of the information” to get more accurate results. However, as with so many things, the devil is in the details. The preliminary tasks of establishing a rigorous protocol for guiding the meta-analysis and the systematic review to search for relevant studies are themselves far from trivial. One has to work hard to avoid “selection bias”, “publication bias” and other even more subtle difficulties.

In my limited experience with meta-analysis, I found it extraordinarily difficult to determine whether patient populations from different clinical trials were sufficiently homogeneous to be included in the same meta-analysis. Even when working with well-written papers, published in quality journals, a considerable amount of medical expertise was required to interpret the data. I came away with the strong impression that a good meta-analysis requires collaboration from a team of experts.

Historically, it has probably been the case that most meta-analyses were conducted either with general tools such as Excel or specialized software like RevMan from the Cochrane Collaboration. However, R is the natural platform for meta-analysis, both because of the myriad possibilities for statistical analyses that are not generally available through the specialized software, and because of the many packages devoted to various aspects of meta-analysis. The CRAN Meta-Analysis Task View is exceptionally well organized, listing R packages according to the different stages of conducting a meta-analysis and also calling out some specialized techniques such as meta-regression and network meta-analysis.

In a future post, I hope to be able to explore some of these packages more closely. For now, let’s look at a very simple analysis based on Thomas Lumley’s rmeta package, which has been a part of R since 1999. The following simple meta-analysis is written up very nicely in the book by Chen and Peace titled Applied Meta-Analysis with R.

The cochrane data set in the rmeta package contains the results from seven randomized clinical trials designed to test the effectiveness of corticosteroid therapy in preventing neonatal deaths in premature labor. The columns of the data set are: the name of the trial center, the number of deaths in the treatment group, the total number of patients in the treatment group, the number of deaths in the control group and the total number of patients in the control group.

The null hypothesis is that there is no difference between treatment and control. Following Chen and Peace, we fit both fixed effects and random effects models to look at the odds ratios.
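The two fits take one call each in rmeta. The meta.MH() call below matches the one echoed in the summary that follows; meta.DSL() is rmeta's DerSimonian-Laird random effects analogue, and the object name model.FE is the one used in the plotting code later in this post:

```r
library(rmeta)
data(cochrane)

# Fixed effects (Mantel-Haenszel) fit
model.FE <- meta.MH(ntrt = n.trt, nctrl = n.ctrl, ptrt = ev.trt,
                    pctrl = ev.ctrl, names = name, data = cochrane)
# Random effects (DerSimonian-Laird) fit
model.RE <- meta.DSL(ntrt = n.trt, nctrl = n.ctrl, ptrt = ev.trt,
                     pctrl = ev.ctrl, names = name, data = cochrane)
summary(model.FE)
summary(model.RE)
```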

The summary for the fixed effects model shows that while only two studies, Auckland and Doran, individually show a significant effect, the overall confidence interval from the Mantel-Haenszel test does indicate a benefit from the treatment.

```
Fixed effects ( Mantel-Haenszel ) meta-analysis
Call: meta.MH(ntrt = n.trt, nctrl = n.ctrl, ptrt = ev.trt, pctrl = ev.ctrl,
    names = name, data = cochrane)
------------------------------------
               OR (lower  95% upper)
Auckland     0.58    0.38      0.89
Block        0.16    0.02      1.45
Doran        0.25    0.07      0.81
Gamsu        0.70    0.34      1.45
Morrison     0.35    0.09      1.41
Papageorgiou 0.14    0.02      1.16
Tauesch      1.02    0.37      2.77
------------------------------------
Mantel-Haenszel OR = 0.53  95% CI ( 0.39, 0.73 )
Test for heterogeneity: X^2( 6 ) = 6.9 ( p-value 0.3303 )
```

The summary for the random effects model for these data is identical except that, as one would expect, the overall confidence interval is somewhat wider: SummaryOR = 0.53, 95% CI ( 0.37, 0.78 ). A slight modification to enhance the forest plot code provided by Chen and Peace (which works for both the fixed effects and random effects model objects) shows the typical way to present these results.

```
CPplot <- function(model){
  c1 <- c("", "Study", model$names, NA, "Summary")
  c2 <- c("Deaths", "(Steroid)", cochrane$ev.trt, NA, NA)
  c3 <- c("Deaths", "(Placebo)", cochrane$ev.ctrl, NA, NA)
  c4 <- c("", "OR", format(exp(model[[1]]), digits = 2), NA,
          format(exp(model[[3]]), digits = 2))
  tableText <- cbind(c1, c2, c3, c4)
  mean <- c(NA, NA, model[[1]], NA, model[[3]])
  stderr <- c(NA, NA, model[[2]], NA, model[[4]])
  low <- mean - 1.96 * stderr
  up <- mean + 1.96 * stderr
  forestplot(tableText, mean, low, up, zero = 0,
             is.summary = c(TRUE, TRUE, rep(FALSE, 8), TRUE),
             clip = c(log(0.1), log(2.5)), xlog = TRUE)
}
```

CPplot(model.FE)

The whole idea of meta-analysis is intriguing. However, because of the challenges I mentioned above, I would be remiss not to point out that it elicits considerable criticism. The article Meta-analysis and its problems by H J Eysenck captures the issues and is well worth reading. Also, have a look at the review article by Walker, Hernandez and Kattan writing in the Cleveland Clinic Journal of Medicine.

With the growing popularity of R, there is an associated increase in the popularity of online forums to ask questions. One of the most popular sites is StackOverflow, where more than 60 thousand questions have been asked and tagged to be related to R.

On the same page, you can also find related tags. Among the top 15 tags associated with R, several are also packages you can find on CRAN:

- ggplot2
- data.table
- plyr
- knitr
- shiny
- xts
- lattice

It is very easy to install these packages directly from CRAN using the R function install.packages(), but doing so will also install all of these packages' dependencies.

This leads to the question: How can one determine all these dependencies?

It is possible to do this using the function available.packages() and then query the resulting object.
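For example, one could pull the dependency fields for a single package straight out of the matrix that available.packages() returns (a quick sketch; the output will vary with the state of CRAN):

```r
# Query the package matrix directly; row names are package names
pkgs <- available.packages()
pkgs["ggplot2", "Depends"]  # e.g. the version requirement on R itself
pkgs["ggplot2", "Imports"]  # direct imports only, not recursive
```

Chasing the recursive dependencies of dependencies this way quickly becomes tedious, which is the problem the functions below address.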

But it is easier to answer this question using the functions in a new package, called miniCRAN, that I am working on. I have designed miniCRAN to allow you to create a mini version of CRAN behind a corporate firewall. You can use some of the functions in miniCRAN to list packages and their dependencies, in particular:

- pkgAvail()
- pkgDep()
- makeDepGraph()

I illustrate these functions in the following scripts.

Start by loading miniCRAN and retrieving the available packages on CRAN. Use the function pkgAvail() to do this:

```
library(miniCRAN)
pkgdata <- pkgAvail(repos = c(CRAN = "http://cran.revolutionanalytics.com"),
                    type = "source")
head(pkgdata[, c("Depends", "Suggests")])

##             Depends                                  Suggests
## A3          "R (>= 2.15.0), xtable, pbapply"         "randomForest, e1071"
## abc         "R (>= 2.10), nnet, quantreg, MASS"      NA
## abcdeFBA    "Rglpk,rgl,corrplot,lattice,R (>= 2.10)" "LIM,sybil"
## ABCExtremes "SpatialExtremes, combinat"              NA
## ABCoptim    NA                                       NA
## ABCp2       "MASS"                                   NA
```

Next, use the function pkgDep() to get dependencies of the 7 popular tags on StackOverflow:

```
tags <- c("ggplot2", "data.table", "plyr", "knitr", "shiny", "xts", "lattice")
pkgList <- pkgDep(tags, availPkgs = pkgdata, suggests = TRUE)
pkgList

##  [1] "abind"        "bit64"        "bitops"       "Cairo"
##  [5] "caTools"      "chron"        "codetools"    "colorspace"
##  [9] "data.table"   "dichromat"    "digest"       "evaluate"
## [13] "fastmatch"    "foreach"      "formatR"      "fts"
## [17] "ggplot2"      "gtable"       "hexbin"       "highr"
## [21] "Hmisc"        "htmltools"    "httpuv"       "iterators"
## [25] "itertools"    "its"          "KernSmooth"   "knitr"
## [29] "labeling"     "lattice"      "mapproj"      "maps"
## [33] "maptools"     "markdown"     "MASS"         "mgcv"
## [37] "mime"         "multcomp"     "munsell"      "nlme"
## [41] "plyr"         "proto"        "quantreg"     "RColorBrewer"
## [45] "Rcpp"         "RCurl"        "reshape"      "reshape2"
## [49] "rgl"          "RJSONIO"      "scales"       "shiny"
## [53] "stringr"      "testit"       "testthat"     "timeDate"
## [57] "timeSeries"   "tis"          "tseries"      "XML"
## [61] "xtable"       "xts"          "zoo"
```

Wow, look how these 7 packages have dependencies on 63 other packages!

You can graphically visualise these dependencies in a graph, by using the function makeDepGraph():

```
p <- makeDepGraph(pkgList, availPkgs = pkgdata)
library(igraph)
plotColours <- c("grey80", "orange")
topLevel <- as.numeric(V(p)$name %in% tags)
par(mai = rep(0.25, 4))
set.seed(50)
vColor <- plotColours[1 + topLevel]
plot(p, vertex.size = 8, edge.arrow.size = 0.5, vertex.label.cex = 0.7,
     vertex.label.color = "black", vertex.color = vColor)
legend(x = 0.9, y = -0.9, legend = c("Dependencies", "Initial list"),
       col = c(plotColours, NA), pch = 19, cex = 0.9)
text(0.9, -0.75, expression(xts %->% zoo), adj = 0, cex = 0.9)
text(0.9, -0.8, "xts depends on zoo", adj = 0, cex = 0.9)
title("Package dependency graph")
```

So, if you wanted to install the 7 most popular R packages (according to StackOverflow), R will in fact download and install up to 63 different packages!

The annual worldwide user conference useR! 2014 is underway at UCLA, beginning with a full day of tutorials. This year's useR! conference is a record-breaker with more than 700 attendees, so most of the tutorial sessions have been jam-packed. The tutorials cover a diverse array of R applications: data management, visualization, statistics and biostatistics, programming, and interactive applications. Follow the links below for more details about the packages and methods covered — some authors have already provided slides for their tutorials (and those that haven't probably will soon).

- Applied Predictive Modeling in R, Max Kuhn
- Interactive graphics with ggvis, Winston Chang
- Dynamic Documents with R and knitr, Yihui Xie
- C++ and Rcpp11 for beginners, Romain Francois
- Managing Data with R, Bob Muenchen
- Introduction to data.table, Matt Dowle
- Applied Spatial Data Analysis with R, Virgilio Gomez Rubio
- Bioconductor, Martin Morgan
- Data manipulation with dplyr, Hadley Wickham
- Interactive data display with Shiny and R, Garrett Grolemund
- Programming with Big Data in R, Drew Schmidt
- Graphical Models and Bayesian Networks with R, Søren Højsgaard
- Nonlinear parameter optimization and modeling in R [slides], John Nash
- An Example-Driven Hands-on Introduction to Rcpp, Dirk Eddelbuettel
- Interactive Documents with R, Ramnath Vaidyanathan
- Simulating differential equation models in R, Thomas Petzoldt

useR! 2014: Tutorials

Hadley Wickham's been working on the next-generation update to ggplot2 for a while, and now it's available on CRAN. The ggvis package is completely new, and combines a chaining syntax reminiscent of dplyr with the grammar of graphics concepts of ggplot2. The resulting charts are web-ready in scalable SVG format, and can easily be made interactive thanks to RStudio's shiny package.

For example, here's the code to create a scatterplot with a smoothing line from the mtcars data set:

```
mtcars %>%
ggvis(~wt, ~mpg) %>%
layer_points() %>%
layer_smooths()
```

And here's the corresponding SVG image:

SVG graphics are great online, because they're compact (this one's just 25Kb) and look great at whatever size they're displayed (it's a vector format, so you never get pixellation). The only thing SVG doesn't work well for is charts with millions of elements (points, lines, etc.), because then the files can be large and slow to render. (The only other downside is that our blogging platform, TypePad, doesn't support SVG with its image tools, so I had to insert an <image> element into the HTML directly.)

You can easily add interactivity to a chart, by specifying parameters as input controls rather than numbers. Here's the code for the same chart, with a slider to specify the smoothing parameter and point size:

```
mtcars %>%
ggvis(~wt, ~mpg) %>%
layer_smooths(span = input_slider(0.5, 1, value = 1)) %>%
layer_points(size := input_slider(100, 1000, value = 100))
```

If you run that code in RStudio you'll get an interactive chart, or go here to see the same interactivity on a web page, rendered with RStudio's Shiny. For more details, check out the ggvis website linked below.

RStudio: ggvis 0.3 overview