by Joseph Rickert

I recently rediscovered the Timely Portfolio post on R Financial Time Series Plotting. If you are not familiar with this gem, it is well worth stopping to have a look at it now. Not only does it contain some useful examples of time series plots mixing different combinations of time series packages (ts, zoo, xts) with multiple plotting systems (base R, lattice, etc.), but it also provides an instructive historical perspective that illustrates the nonlinear nature of progress in software development: new code is written to solve certain technical problems with the current software. Progress is made, and the new code makes it possible to do some things that couldn't be done before, but there are tradeoffs: design choices for the new system make it a little more difficult to do something that was easy before. The net result is that all of the software continues to advance in a messy mix, confusing the newcomer and providing critics with the opportunity to complain that there is not just one way to solve a problem.

Because this is turning out to be a week when more than a few people are likely to be plotting financial time series, I thought it would be helpful to call attention to this time series resource and also to take a look at the current state of the R art for performing a relatively simple task: plotting closing prices for two stocks on the same chart.

The following code just reads stock price data from Yahoo Finance for both IBM and LinkedIn from 8/24/2010 through 8/24/2015 and picks out the closing prices. I cheated a little here because I already knew the URLs for the two series. I picked these two stocks because they both traded in about the same range for the period in question, and because I wanted to see whether the fact that one stock, LinkedIn, wasn't trading at the beginning of the selected period caused any problems.

# Time Series Plotting
library(ggplot2)
library(xts)
library(dygraphs)

# Get IBM and LinkedIn stock data from Yahoo Finance
ibm_url <- "http://real-chart.finance.yahoo.com/table.csv?s=IBM&a=07&b=24&c=2010&d=07&e=24&f=2015&g=d&ignore=.csv"
lnkd_url <- "http://real-chart.finance.yahoo.com/table.csv?s=LNKD&a=07&b=24&c=2010&d=07&e=24&f=2015&g=d&ignore=.csv"

yahoo.read <- function(url){
  dat <- read.table(url, header = TRUE, sep = ",")
  df <- dat[, c(1, 5)]
  df$Date <- as.Date(as.character(df$Date))
  return(df)
}

ibm  <- yahoo.read(ibm_url)
lnkd <- yahoo.read(lnkd_url)

To my mind, the "go to" method for a simple plot that you will show to someone else is ggplot(). The following code, suggested by Didzis Elferts in answer to a StackOverflow question, accomplishes the task with great economy, using just a few more features than the defaults would give you.

ggplot(ibm, aes(Date, Close)) +
  geom_line(aes(color = "ibm")) +
  geom_line(data = lnkd, aes(color = "lnkd")) +
  labs(color = "Legend") +
  scale_colour_manual("", breaks = c("ibm", "lnkd"),
                      values = c("blue", "brown")) +
  ggtitle("Closing Stock Prices: IBM & Linkedin") +
  theme(plot.title = element_text(lineheight = .7, face = "bold"))

This next plot, which uses the dygraphs package, represents the new frontier for creating interactive time series plots in R.

# Plot with the htmlwidget dygraphs
# dygraph() needs xts time series objects
ibm_xts  <- xts(ibm$Close,  order.by = ibm$Date,  frequency = 365)
lnkd_xts <- xts(lnkd$Close, order.by = lnkd$Date, frequency = 365)
stocks <- cbind(ibm_xts, lnkd_xts)

dygraph(stocks, ylab = "Close",
        main = "IBM and Linkedin Closing Stock Prices") %>%
  dySeries("..1", label = "IBM") %>%
  dySeries("..2", label = "LNKD") %>%
  dyOptions(colors = c("blue", "brown")) %>%
  dyRangeSelector()

Building on the work done for rCharts profiled in the Timely Portfolio piece, the dygraphs R package provides an interface to the dygraphs JavaScript library. With just a few lines of R code, and no knowledge of JavaScript, it is now possible to produce charts that approach the polished look of the professional stock charting services.

*by Ari Lamstein, a consultant specializing in software engineering and data analysis and author of the free email course Learn to Map Census Data in R.*

One of my favorite things about R is that it allows me to follow up on interesting news stories. Consider this interview on EconTalk about the history of fracking in America. Russ Roberts interviewed Gregory Zuckerman about his book The Frackers. One thing that struck me were the stories of how North Dakota is being transformed by the fracking boom. North Dakota sits on the Bakken formation which, due to fracking, is now able to be monetized.

Here are two maps I made which demonstrate North Dakota’s recent demographic changes. The first shows that between 2010 and 2013 North Dakota’s Per Capita Income grew at a rate of 15%, significantly above any other US state. The second one shows that North Dakota’s Median Age decreased by 2%, significantly below any other US state. Today I will demonstrate how to create these maps in R.

We’ll use the choroplethr and choroplethrMaps packages to create these maps. To install and load them, type the following from an R console:

install.packages(c("choroplethr", "choroplethrMaps"))

library(choroplethr)

library(choroplethrMaps)

The data we’ll be mapping comes from the US Census Bureau’s American Community Survey (ACS). To access their API you need an API key, which you can get for free here. Once you have a key, type the following from an R console:

library(acs)

api.key.install("<your key>")

The ACS began in 2005, but only data from 2010 through 2013 is available via the API. We can get the 2010 and 2013 data by typing the following:

demo_2010 = get_state_demographics(2010)

demo_2013 = get_state_demographics(2013)

To see the values that are available, look at the column names:

colnames(demo_2013)

[1] "region" "total_population" "percent_white" "percent_black"

[5] "percent_asian" "percent_hispanic" "per_capita_income"

[8] "median_rent" "median_age"

Before calculating the percent change, merge the two data frames together by region. Note that “region” here is just a state name:

demo_all = merge(demo_2010, demo_2013, by="region")

colnames(demo_all)

[1] "region" "total_population.x" "percent_white.x"

[4] "percent_black.x" "percent_asian.x" "percent_hispanic.x"

[7] "per_capita_income.x" "median_rent.x" "median_age.x"

[10] "total_population.y" "percent_white.y" "percent_black.y"

[13] "percent_asian.y" "percent_hispanic.y" "per_capita_income.y"

[16] "median_rent.y" "median_age.y"

The 2010 values now have .x appended to them, and the 2013 values have .y appended to them.

We’ll use the function **state_choropleth** to create choropleth maps of the demographic changes. First we need to create a column called **value** which represents the percent change in Per Capita Income between the two years:

demo_all$value = (demo_all$per_capita_income.y - demo_all$per_capita_income.x) / demo_all$per_capita_income.x * 100

Now we can create the map:

state_choropleth(demo_all)

The main problem with this map is the scale: it does not allow us to distinguish between negative and positive values. A more appropriate scale is scale_fill_gradient2, which makes 0 white, negative values red, and positive values blue. To change the scale we need to use the object-oriented features of choroplethr. Here is code to change the scale as well as do a few other things, such as remove the state labels.

choro = StateChoropleth$new(demo_all)

choro$title = "State Per Capita Income\n2010-2013 Percent Change"

min = min(demo_all$value)

max = max(demo_all$value)

choro$show_labels = FALSE

choro$set_num_colors(1)

choro$ggplot_scale = scale_fill_gradient2(name="Percent Change", limits = c(min, max))

income_map = choro$render()

income_map

We can create a similar map for changes in the median age using the same pattern:

demo_all$value = (demo_all$median_age.y - demo_all$median_age.x) / demo_all$median_age.x * 100

choro = StateChoropleth$new(demo_all)

choro$title = "State Median Age\n2010-2013 Percent Change"

choro$set_num_colors(1)

min = min(demo_all$value)

max = max(demo_all$value)

choro$show_labels = FALSE

choro$ggplot_scale = scale_fill_gradient2(name="Percent Change", limits=c(min, max))

age_map = choro$render()

age_map

Finally, we can merge the two maps using the **gridExtra** package:

library(gridExtra)

grid.arrange(income_map, age_map)

In closing, I want to point out that the ACS provides estimates only. For more information on the reliability of the estimates, please see the acs package created by Ezra Haber Glenn.

To experiment with ideas in this post have a look at the Shiny version of the plots.

*Ari Lamstein is a consultant who specializes in software engineering and data analysis. He is the author of the free email course Learn to Map Census Data in R.*

by Joseph Rickert

One great beauty of the R ecosystem, and perhaps the primary reason for R's phenomenal growth, is the system for contributing new packages. This, coupled with the rock-solid stability of CRAN, R's primary package repository, gives R a great advantage. However, anyone with enough technical know-how to formulate a proper submission can contribute a package to CRAN. Just being on CRAN is no great indicator of merit: a fact that newcomers to R, and to open source, often find troubling. It takes some time and effort working with R in a disciplined way to appreciate how the organic meritocracy of the package system leads to high-quality, integrated software. Nevertheless, even for relative newcomers it is not difficult to discover the bedrock packages that support the growth of the R language. The packages that reliably add value to the R language are readily apparent in plots of CRAN's package dependency network.

Finding new packages that may ultimately prove to be useful is another matter. In the spirit of discovery, here are five relatively new packages that I think may ultimately prove to be interesting to data scientists. None of these have been on CRAN long enough to be battle tested, so please explore them with caution in mind.

AzureML V0.1.1

Cloud computing is, or will be, important to every practicing data scientist. Microsoft's Azure ML is a particularly rich machine learning environment for R (and Python) programmers. If you are not yet an Azure user, this new package goes a long way towards overcoming the inertia involved in getting started. It provides functions to push R code from your local environment up to the Azure cloud and to publish functions and models as web services. The vignette walks you step by step from getting a trial account and the necessary credentials to publishing your first simple examples.

distcomp V0.25.1

Distributed computing with large data sets is always tricky, especially in environments where it is difficult or impossible to share data among collaborators. A clever partial-likelihood algorithm implemented in the distcomp package (see the paper by Narasimhan et al.) makes it possible to build sophisticated statistical models on unaggregated data sets. Have a look at this previous blog post for more detail.

rotationForest V0.1

The random forests algorithm is the "go to" ensemble method for many data scientists because it consistently performs well on diverse data sets. This new variation, based on performing Principal Component Analysis on random subsets of the feature space, shows great promise. See the paper by Rodriguez et al. for an explanation of how the PCA amounts to rotating the feature space, and for a comparison of the rotation forest algorithm with standard random forests and the AdaBoost algorithm.
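As a quick illustration, here is a minimal sketch of fitting the package's main function on a two-class subset of the iris data (the function expects a binary 0/1 factor response); argument and function names follow the package documentation and should be checked against your installed version:

```r
library(rotationForest)

# Two-class subset of iris; rotationForest expects a binary factor response
dat <- iris[iris$Species != "virginica", ]
x <- dat[, 1:4]
y <- factor(as.integer(dat$Species == "versicolor"))

# Fit an ensemble of L = 10 trees, each grown on a PCA-rotated
# random partition of the feature space (K feature subsets)
fit <- rotationForest(x, y, K = 2, L = 10)

# Predicted class probabilities for the training data
head(predict(fit, newdata = x))
```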

rpca V0.2.3

Given a matrix that is a superposition of a low-rank component and a sparse component, rpca uses a robust PCA method to recover these components. Netflix data scientists publicized this algorithm, which is based on a paper by Candes et al., Robust Principal Component Analysis, earlier this year when they reported spectacular success using robust PCA in an outlier detection problem.
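A minimal sketch of the idea on synthetic data, assuming the package's rpca() function and the L (low-rank) and S (sparse) output components described in its documentation:

```r
library(rpca)

# Synthetic data: a rank-2 matrix corrupted by a few large outliers
set.seed(42)
L0 <- matrix(rnorm(100 * 2), 100, 2) %*% matrix(rnorm(2 * 50), 2, 50)
S0 <- matrix(0, 100, 50)
S0[sample(length(S0), 100)] <- 10   # sparse corruption
M <- L0 + S0

# Recover the components: res$L should approximate L0,
# res$S should approximate S0
res <- rpca(M)
```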

SwarmSVM V0.1

The support vector machine is another mainstay machine learning algorithm. SwarmSVM, which is based on a clustering approach described in a paper by Gu and Han, provides three ensemble methods for training support vector machines. The vignette that accompanies the package provides a practical introduction to the method.

by Joseph Rickert

We can declare 2015 the year that R went mainstream at the JSM. There is no doubt about it: the calculations, visualizations and deep thinking of a great many of the world's statisticians are rendered or expressed in R, and the JSM is with the program. In 2013, I was happy to have stumbled into a talk where an FDA statistician confirmed that R was indeed a much-used and trusted tool. Last year, while preparing to attend the conference, I was delighted to find a substantial list of R and data science related talks. This year, talks not only mentioned R: they were *about* R.

The conference began with several R-focused pre-conference tutorials including Statistical Analysis of Financial Data Using R, The Art and Science of Data Visualization Using R, and Hadley Wickham’s sold out Advanced R. The Sunday afternoon session on Advances in R Software played to a full room. Highlights of that session included Gabe Becker’s presentation on the switchr package for reproducible research, Mark Seligman’s update on the new work being done on the Arborist implementation of the random forest algorithm, and my colleague Andrie de Vries’ presentation of some work we did on the network structure of R packages. (See yesterday’s post.)

The enthusiasm expressed by the overflowing crowd for Monday’s invited session on Recent Advances in Interactive Graphics for Data Analysis was contagious. Talks revolved around several packages linking R graphics to d3 and JavaScript in order to provide interactive graphics which are not only visually stunning but also open up new possibilities for exploratory data analysis. Hadley Wickham, the substitute chair for the session, characterized the various approaches to achieving interactive graphics in R with a bit of humor and much insight that I think brings some clarity to this chaotic whorl of development. Hadley places current efforts to provide interactive R graphics in one of three categories:

- Speaking in tongues: interfacing to low level specialized languages (examples: iplots and rggobi)
- Hacking existing graphics (examples: Animint and using ggplot2 with Shiny)
- Abusing the browser (examples: R/qtlcharts, leaflet and htmlwidgets)

Other highlights of the session included Kenny Shirley’s presentation on interactively visualizing trees with his summarytrees package that interfaces R to D3, Susan VanderPlas’ presentation of Animint (this package adds interactive aesthetics to ggplot2; here is a nice tutorial), and Karl Broman’s discussion of visualizing high-dimensional genomic data (see qtlcharts and d3examples).

In addition to visualization, education was another thread that stitched together various R related topics. Waller's talk, Evaluating Data Science Contributions in Teaching and Research, in the invited paper session The Statistics Identity Crisis: Are We Really Data Scientists, provided some advice on how software developed by academics could be “packaged” to look like the work products traditionally valued for academic advancement. Progress along these lines would go a long way towards helping some of the most productive R contributors achieve career-advancing recognition. There was also considerable discussion about the kind of practical R and data science skills that should supplement the theoretical training of statisticians to help them be effective in academia as well as in industry. To get some insight into the relevant issues, have a look at Jennifer Bryan’s slides for her talk Teach Data Science and They Will Come.

The following is a list of JSM talks with interesting package, educational, or application-oriented R content.

- Animint: Interactive Web-Based Animations Using Ggplot2's Grammar of Graphics (Susan Ruth VanderPlas, Iowa State University; Carson Sievert, Iowa State University; Toby Hocking, McGill University)
- Applying the R Language in Streaming and Business Intelligence Applications (Louis Bajuk, TIBCO Software Inc.)
- A Bayesian Test of Independence of Two Categorical Variables with Covariates (Dilli Bhatta, Truman State University)
- Comparison of R and Vowpal Wabbit for Click Prediction in Display Advertising (Jaimyoung Kwon, AOL Advertising; Bin Ren, AOL Platforms; Rajasekhar Cherukuri, AOL Platforms; Marius Holtan, AOL Platforms)
- Demonstration of Statistical Concepts with Animated Graphics and Simulations in R (Andrej Blejec, National Institute of Biology)
- The Dendextend R Package for Manipulation, Visualization, and Comparison of Dendrograms (Tal Galili, Tel Aviv University)
- Enhancing Reproducibility and Collaboration via Management of R Package Cohorts (Gabriel Becker, Genentech Research; Cory Barr, Anticlockwork Arts; Robert Gentleman, Genentech Research; Michael Lawrence, Genentech Research)
- GMM Versus GQL Logistic Regression Models for Multi-Level Correlated Data (Bei Wang, Arizona State University; Jeffrey Wilson, W. P. Carey School of Business/Arizona State University)
- Increasing the Accuracy of Gene Expression Classifiers by Incorporating Pathway Information: A Latent Group Selection Approach (Yaohui Zeng, The University of Iowa; Patrick Breheny, The University of Iowa)
- Learning Statistics with R, from the Ground Up (Xiaofei Wang)
- Mining an R Bug Database with R (Stephen Kaluzny, TIBCO Software Inc.)
- Multinomial Regression for Correlated Data Using the Bootstrap in R (Jennifer Thompson, Vanderbilt University; Timothy Girard, Vanderbilt University Medical Center; Pratik Pandharipande, Vanderbilt University Medical Center; E. Wesley Ely, Vanderbilt University Medical Center; Rameela Chandrasekhar, Vanderbilt University)
- The Network Structure of R Packages (Andrie de Vries, Revolution Analytics Limited; Joseph Rickert)
- Online PCA in High Dimension: A Comparative Study (David Degras, DePaul University; Hervé Cardot, Université de Bourgogne)
- Perils and Solutions for Comparative Effectiveness Research in Massive Observational Databases (Marc A. Suchard, UCLA)
- R Package PRIMsrc: Bump Hunting by Patient Rule Induction Method for Survival, Regression, and Classification (Jean-Eudes Dazard, Case Western Reserve University; Michael Choe, Case Western Reserve University; Michael LeBlanc, Fred Hutchinson Cancer Research Center; J. Sunil Rao, University of Miami)
- An R Package That Collects and Archives Files and Other Details to Support Reproducible Computing (Stan Pounds, St. Jude Children's Research Hospital; Zhifa Liu, St. Jude Children's Research Hospital)
- Simcausal R Package: Conducting Transparent and Reproducible Simulation Studies of Causal Effect Estimation with Complex Longitudinal Data (Oleg Sofrygin, Kaiser Permanente Northern California/UC Berkeley; Mark Johannes van der Laan, UC Berkeley; Romain Neugebauer, Kaiser Permanente Northern California)
- Statistical Computation Using Student Collaborative Work (John D. Emerson, Middlebury College)
- Teaching Introductory Regression with R Using Package Regclass (Adam Petrie)
- Using Software to Search for Optimal Cross-Over Designs (Byron Jones)

by Andrie de Vries

This week at JSM 2015, the annual conference of the American Statistical Association, Joseph Rickert and I gave a presentation on "The network structure of CRAN and BioConductor" (link to abstract).

Our work tested the hypothesis that one can detect statistical differences in the network graphs formed by the dependencies between packages. In a dependency graph, each package is a vertex and each dependency is an edge connecting two vertices.

This presentation combines earlier work that we have discussed in blog posts during the year:

- The network structure of CRAN
- A simple statnet model of CRAN
- Finding the essential R packages using the pagerank algorithm
- Contracting and simplifying a network graph
- Finding clusters of CRAN packages using igraph
- Creating network graphs using JavaScript directly from R

Before starting the work, we formed the hypothesis that CRAN and BioConductor have discernibly different package network structures.

This hypothesis is based on the intuition that these two repositories have different management structures:

- On CRAN, packages of almost any type are welcome. The CRAN maintainers have some strict policies on how a package should behave to get on CRAN (have documentation, have examples, build without warnings, etc.). However, CRAN does not prescribe anything about the subject matter or content of any package.
- In contrast, BioConductor is more focused and centrally managed. Packages must add something to the topic of high-throughput genomic data. For a great introduction, read Peter Hickey's contributed blog post, A Short Introduction to Bioconductor.

First, we used the igraph package to compute descriptive network statistics. Among these, we found the clustering coefficient and the degree distribution the most illuminating.

We found that BioConductor has a higher clustering coefficient than CRAN. The clustering coefficient (also called transitivity) measures the probability that the adjacent vertices of a given vertex are themselves connected.

You can see this visually in the network graphs. It appears as if the BioConductor graph is more compact, while the CRAN graph has many packages on the perimeter that are only loosely connected to the rest of the graph.

We used a simple bootstrapping algorithm to simulate the local clustering coefficient of induced subgraphs. In this plot, CRAN (in red) has a much lower distribution of clustering coefficient than BioConductor (in blue).
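The bootstrap comparison can be sketched with igraph; this assumes the two dependency graphs are already loaded as igraph objects named cran_graph and bioc_graph (the names are illustrative):

```r
library(igraph)

# Repeatedly sample vertices, take the induced subgraph, and record
# its clustering coefficient (transitivity)
boot_transitivity <- function(g, size = 1000, reps = 500) {
  replicate(reps, {
    v <- sample(V(g), size)
    transitivity(induced.subgraph(g, v), type = "global")
  })
}

cc_cran <- boot_transitivity(cran_graph)
cc_bioc <- boot_transitivity(bioc_graph)

# Compare the simulated distributions
plot(density(cc_bioc), col = "blue", main = "Clustering coefficient")
lines(density(cc_cran), col = "red")
```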

The second statistical summary is the degree distribution. The degree of a node is the number of adjacent edges. Note in particular the nodes of degree zero, i.e. unconnected nodes.

BioConductor has a much lower fraction of packages with zero connections. It seems that the BioConductor policy encourages package authors to re-use existing material and write packages that work well together.
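These fractions can be read directly off the degree distribution; again assuming igraph objects cran_graph and bioc_graph for the two repositories:

```r
library(igraph)

# degree.distribution() returns relative frequencies for
# degrees 0, 1, 2, ...; element 1 is the degree-zero fraction
dd_cran <- degree.distribution(cran_graph)
dd_bioc <- degree.distribution(bioc_graph)

# Fraction of unconnected packages in each repository
c(cran = dd_cran[1], bioc = dd_bioc[1])
```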

The presentation is available on slideshare.

The scripts we used are available on GitHub. We think this is an important topic to study, since it could help to discover:

- Better search algorithms for finding packages that are useful to solve a specific problem
- Recommendations for packages to use

by Joseph Rickert

In a recent post on creating JavaScript network graphs directly from R, my colleague and fellow blogger, Andrie de Vries, included a link to a saved graph of CRAN. Here, I will use that same graph (network) to build a simple exponential random graph model using functions from the igraph package, and the network and ergm packages included in the statnet suite of R packages. Each node (vertex) in the saved graph represents a package on CRAN, and a directed link or edge between two nodes A -> B indicates that package A depends on package B. Since Andrie's CRAN graph does not have any external attributes associated with either the nodes or edges, the idea is to see if we can develop a model using only structural aspects of the network itself as predictors. In general, this is not an easy thing to do and we are only going to have limited success here. However, the process will illustrate some basic concepts.

A fundamental statistic associated with each node in a network is its degree, the number of edges that connect to it. (Each edge connects a node to some other node in the network, forming a tie.) The degree distribution (the distribution of the degree associated with each node of the network) is one way to think about the local connectivity structure of the network. The following plots illustrate different aspects of CRAN's degree distribution.

The top left histogram shows that although most nodes have degree less than 5, there is a long tail, with some nodes having hundreds of incident edges. (Of the 6,867 nodes in the network, 1,156 have degree greater than 5.) Additionally, we have:

summary(cran_deg)

# Min. 1st Qu. Median Mean 3rd Qu. Max.

# 0.000 1.000 2.000 4.296 4.000 764.000

The bottom two histograms provide greater detail for the beginning and tail-end of the degree distribution.

The purpose of the fourth plot in the panel, the log-log plot, is to look for evidence that the CRAN network may exhibit the kind of "scale free" structure that is common to social network graphs. (A straight line would provide some evidence that the network follows a power law distribution.) That is not quite what is going on here. Nevertheless, the structure exhibited in the degree distribution and the clustering clearly visible in CRAN package network plots indicate that we are not dealing with a purely random graph either.
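The log-log check can be sketched from the cran_deg degree vector used above, tabulating the empirical degree frequencies:

```r
# Tabulate degree frequencies and plot on log-log axes; points
# falling near a straight line would suggest a power-law distribution
deg_tab <- table(cran_deg)
deg <- as.numeric(names(deg_tab))
keep <- deg > 0                       # log(0) is undefined
plot(log(deg[keep]), log(as.numeric(deg_tab)[keep]),
     xlab = "log(degree)", ylab = "log(count)",
     main = "CRAN degree distribution (log-log)")
```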

The following shows the code to specify and fit an ergm model to the CRAN network data with just two predictors, edges and degree(c(1,2)).

# Model
form <- formula(cran_net ~ edges + degree(c(1,2)))
summary.statistics(form)
# edges degree1 degree2
# 14293    1400    1119

# Fit ergm
fit <- ergm(form, control = control.ergm(MCMLE.maxit = 60))
summary.ergm(fit)
# ==========================
# Summary of model fit
# ==========================
#
# Formula:   cran_net ~ edges + degree(c(1, 2))
#
# Iterations:  22 out of 60
#
# Monte Carlo MLE Results:
#         Estimate Std. Error MCMC % p-value
# edges   -6.89814    0.01124      1  <1e-04 ***
# degree1  2.46835    0.04519      0  <1e-04 ***
# degree2  1.25087    0.03884      0  <1e-04 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#     Null Deviance: 32681073 on 23574411 degrees of freedom
# Residual Deviance:   238828 on 23574408 degrees of freedom
#
# AIC: 238834 BIC: 238879 (Smaller is better.)

Fitting an ergm with just edges as a predictor would be an attempt to fit a random graph to the data. (See Hunter et al. for details.) Adding the term degree(c(1,2)) attempts to control for the nodes of degrees 1 and 2, which appear to deviate from the straight line in the log-log plot. The small p-values associated with the coefficients, and the fact that both the AIC and BIC are slightly smaller than those resulting from a fit to a random model (AIC = 240349, BIC = 240364), indicate that the simple model has some explanatory power. But how good is the fit?

The ergm package provides a method of assessing goodness of fit through the computationally intensive scheme of simulating networks from the model and comparing statistics from the simulated networks with those calculated from the actual network. The following plot illustrates this comparison for the degree distribution, edge-wise shared partners distribution and the proportion of dyads at values of minimum geodesic distance. The fit for the first two distributions appears to be quite good, and the fit for the third distribution, while far from perfect, is not bad either.

(In the plots, the solid black lines represent the statistics computed from the CRAN graph; the light lines with box plots are the simulated data.)
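This simulation-based comparison is produced by the ergm package's gof() function. A sketch, using the fit object from the model above (the statistics simulated by default should be checked against the ergm documentation):

```r
# Simulate networks from the fitted model and compare their degree,
# edgewise shared partner, and geodesic distance distributions with
# those of the observed CRAN network
cran_gof <- gof(fit)
par(mfrow = c(2, 2))
plot(cran_gof)
```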

The takeaway here is that a very simple model appears to have a surprisingly good fit. Constructing a better-fitting and more interesting model would need to take into account characteristics of the package design and construction process. These features would then be used as additional predictors. I think a hint as to how one might go about this is contained in Andrie's observation that many of the packages of very high degree, such as Rcpp, MASS and ggplot2, are concerned with providing tools or reusable infrastructure. Exploring this idea, however, and modeling other package networks such as Bioconductor, are subjects for future work.

The code used for this post may be obtained here: Download CRAN_ergm_blog

To go further into ergm modeling with statnet and R, have a look at Benjamin Lind's most informative and entertaining analysis of Grey's Anatomy "hook-ups".

by Andrie de Vries

In a previous post, I used page rank and community structure to create a plot of CRAN. This plot used vibrant colours to allow us to see some of the underlying structure of CRAN.

However, much of this structure was still obscured by the amount of detail. Concretely, the large number of dots (packages) made it difficult to easily see the community structure.

Some more investigation followed, and I discovered a few beautiful functions in the igraph package that allow you to contract and simplify a graph:

- The function contract.vertices() merges several vertices into one. By computing the community structure first, one can control how this merging happens. At the conclusion of the contraction, two vertices can be connected by multiple edges.
- The equivalent step for edges is simplify(). A simplified graph contains only a single edge between two nodes. The simplification step can compute summary statistics for the combined edges, for example the sum of edge weights.

Here is the code:
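A minimal sketch of the two steps above, assuming the CRAN dependency graph is already loaded as an igraph object g (the variable names are illustrative):

```r
library(igraph)

# Community detection gives a membership vector: vertex -> community id
comm <- walktrap.community(g)

# Contract all packages in the same community into a single vertex
g2 <- contract.vertices(g, mapping = membership(comm))

# After contraction there may be many edges between two communities;
# simplify() collapses them, here summing edge weights into one edge
E(g2)$weight <- 1
g3 <- simplify(g2, remove.multiple = TRUE, remove.loops = TRUE,
               edge.attr.comb = list(weight = "sum"))
```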

by Peter Hickey (@PeteHaitch)

One of the keys to R's success as a software environment for data analysis is the availability of user-contributed packages. Most useRs will be familiar with (and very grateful for) the Comprehensive R Archive Network (CRAN). The packages available on CRAN, nearly 7000 at last count, cover common data analysis tasks, such as importing data and plotting, through to more specialised tasks, such as packages for parsing data from the web, analysing financial time series data, or analysing data from clinical trials. What may be less familiar to useRs is another large R package repository and software development project, Bioconductor.

Bioconductor is an open source, open development software project that focuses on providing tools for the analysis of high-throughput genomic data, an area of research known variously as bioinformatics or computational biology. Examples of these data are sequencing the DNA of human genomes or measuring the level of expression of genes in hundreds of tumours. Recent advances in technology mean that such data are a central part of modern biological research, be it medical, agricultural, or basic science.

The Bioconductor project began in 2001 and was initiated by Robert Gentleman, one of the originators of the R language. Nowadays there is a core team of nine developers, led by Martin Morgan, who develop some of the important core packages and maintain the infrastructure of the project. As with CRAN, it is the user-contributed packages that make the Bioconductor project the valuable resource that it is. There are more than 1000 software packages in the most recent Bioconductor release. In addition to these packages, Bioconductor includes more than 900 annotation packages and 200 experiment data packages. Annotation packages help streamline the oft-tedious bookkeeping and annotation of data associated with bioinformatics research, while the experiment data packages contain processed data and are a valuable teaching resource.

Since its establishment, two of the main goals of the Bioconductor project have been reproducible research and high-quality documentation. In support of these aims, Bioconductor releases packages on a biannual schedule, which is tied to the most recent 'release' version of R, and each Bioconductor software package must contain a vignette. A vignette is a document that provides a task-oriented description of package functionality, more like a book chapter than the technical and often terse function-level documentation accessible via `?` or `help()` at the R console. Some of these vignettes, such as the User's Guide that accompanies the `limma` package (pdf), include multiple case studies and carefully explain the statistical foundations of the methods implemented in the package. There is also a dedicated support forum containing many years' worth of questions on common problems, with answers from experts in the field.
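For readers new to vignettes, they can be listed and opened from the R console with base utilities:

```r
# List the vignettes available across all installed packages
vignette()

# Browse a specific package's vignettes (including documents like
# the limma User's Guide) in a web browser
browseVignettes("limma")
```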

Bioconductor has recently begun publishing separate "workflows", along with teaching materials used in courses and conferences, to help users learn how to analyse high-throughput biological data. These are excellent resources for those wishing to learn more about what is available in Bioconductor and how to get the most from the project. The website also hosts detailed instructions on installing Bioconductor on your local machine or trying out a preconfigured setup using Amazon Machine Images or Docker images.
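For reference, the installation route documented by the project at the time of writing looks like the following sketch: source the `biocLite` script from the Bioconductor site, then use `biocLite()` to install packages.

```r
# Source the biocLite installation script from the Bioconductor site,
# then use biocLite() to install packages (requires network access).
source("https://bioconductor.org/biocLite.R")
biocLite()          # installs the core Bioconductor packages
biocLite("limma")   # installs a specific package and its dependencies
```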

The teaching resources have been further bolstered by material from the recent Bioconductor meeting, held in Seattle, USA on July 21-22. This annual meeting is a great mix of basic science and data analysis methodology talks, presentations on interesting Bioconductor packages, and afternoon workshops where you can learn from the developers themselves. All the workshop materials, and most of the slides from the presentations, can be found here. The meeting was preceded by Developer Day, a less formal get-together including talks and brainstorming sessions about the current state and future directions of the Bioconductor project. There is also an annual European Bioconductor meeting and, for the first time, an Asia-Pacific Bioconductor Developer's Meeting and workshop, to be held as part of GIW/InCoB 2015 in Tokyo, Japan on September 8-11.

With its more specialised focus than CRAN, Bioconductor strongly encourages package developers to make use of the excellent infrastructure provided by existing Bioconductor packages. The intention is to reduce the number of times the wheel is re-invented, as well as increasing the interoperability of objects and methods from different packages. The source code of these core packages can make useful reading for R developers, particularly those wishing to learn more about the S4 object-oriented system. This source code can be accessed using Subversion or via the GitHub mirror of all Bioconductor packages.
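For readers unfamiliar with S4, here is a minimal sketch of the style used throughout Bioconductor's core packages: a formal class, a generic, and a method. The class and function names below are invented for illustration.

```r
library(methods)

# Define a formal (S4) class with a typed slot; the class name
# "GeneSet" here is purely illustrative.
setClass("GeneSet", representation(ids = "character"))

# Define a generic and a method dispatching on GeneSet
setGeneric("nGenes", function(x) standardGeneric("nGenes"))
setMethod("nGenes", "GeneSet", function(x) length(x@ids))

gs <- new("GeneSet", ids = c("BRCA1", "TP53", "MYC"))
nGenes(gs)  # 3
```

Formal classes like this are what allow objects and methods from different Bioconductor packages to interoperate.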

Bioconductor has been, and continues to be, an incredibly useful resource for people analysing high-throughput genomic data. The development and maintenance of the project is a considerable undertaking, and a great debt is owed to those who established the project and continue its day-to-day running. But just as important is the community of users and developers; it is this community that makes such a project succeed and makes it exciting to be a part of.

by Joseph Rickert

New R packages just keep coming. The following plot, constructed with information from the monthly files on Dirk Eddelbuettel's CRANberries site, shows the number of new packages released to CRAN each month between January 1, 2013 and July 27, 2015 (not quite 31 months).
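A hedged sketch of the tally behind such a plot, assuming a vector `pkg_dates` holding one `Date` per new-package announcement collected from the CRANberries monthly files (toy data shown here):

```r
# Toy stand-in for the dates scraped from CRANberries' monthly files
pkg_dates <- as.Date(c("2013-01-05", "2013-01-20", "2013-02-11",
                       "2013-02-14", "2013-02-27"))

# Count new packages per calendar month and plot the result
monthly <- table(format(pkg_dates, "%Y-%m"))
barplot(monthly, xlab = "Month", ylab = "New CRAN packages")
```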

This is amazing growth! The mean rate is about 125 new packages a month. How can anyone keep up? The direct approach, of course, would be to become an avid, frequent reader of CRANberries. Every day the CRAN:New link presents the relentless roll call of new arrivals. However, that level of tedium is not for everyone.

At MRAN we are attempting to provide some help with the problem of keeping up with what's new through the old-fashioned (pre-machine-learning) practice of making some idiosyncratic, but not completely capricious, human-generated recommendations. With every new release of RRO we publish on the Package Spotlight page brief descriptions of packages in three categories: New Packages, Updated Packages and GitHub packages. None of these lists is intended to be comprehensive.

The New Packages list includes new packages that have been released to CRAN since the previous release of RRO. My general rules for selecting packages for this list are: (1) they should be tools or infrastructure packages likely to be useful to a wide audience, or (2) they should involve a new algorithm or statistical technique that I think will interest statisticians and data scientists working in many different areas. The following two packages respectively illustrate these two selection rules:

metricsgraphics V0.8.5: provides an htmlwidgets interface to the MetricsGraphics.js D3 JavaScript library for plotting time series data. The vignette shows what it can do.

rotationForest V0.1: provides an implementation of the new Rotation Forest binary ensemble classifier described in the paper by Rodriguez et al.
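As a taste of the first package, here is a hedged sketch of the `mjs_*` builder functions from metricsgraphics; the data frame is invented, and the widget-building lines run only if the package is installed.

```r
# Invented daily series to plot as a time series line chart
df <- data.frame(
  date  = seq(as.Date("2015-01-01"), by = "day", length.out = 30),
  value = cumsum(rnorm(30))
)

if (requireNamespace("metricsgraphics", quietly = TRUE)) {
  library(metricsgraphics)
  p <- mjs_plot(df, x = date, y = value)  # build the htmlwidget
  p <- mjs_line(p)                        # render the series as a line
}
```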

I also tend to favor packages that are backed by a vignette, paper or url that provides additional explanatory material.

Of course, any scheme like this is limited by the knowledge and biases of the curator. I am particularly worried about missing packages targeted towards biotech applications that may indeed have broader appeal. The way to mitigate the shortcomings of this approach is to involve more people. So if you come across a new package that you think may have broad appeal send us a note and let us know why (open@revolutionanalytics.com).

The Updated Package list is constructed with the single criterion that the fact that the package was updated should convey news of some sort. Most of the very popular and useful packages are updated frequently, some approaching monthly updates. So, even though they are important packages, the fact that they have been updated is generally not news at all. It is also the case that package authors generally do not put much effort into describing their updates. In my experience poking around CRAN, I have found that the NEWS directories for packages go mostly unused. (An exemplary exception is the NEWS for ggplot2.)

Finally, the GitHub list is mostly built from repositories that are trending on GitHub with a few serendipitous finds included.

We would be very interested in learning how you keep up with new R packages. Please leave us a comment.

Post Script:

Note that the information from CRANberries about CRAN's new, updated and removed packages is also available as an RSS feed: Download Index.

The code for generating the plot may be found here: Download New_packages

Also, we have written quite a few posts over the last year or so about the difficulties of searching for relevant packages on CRAN. Here are links to three recent posts:

How many packages are there really on CRAN?

Fishing for packages in CRAN

Working with R Studio CRAN Logs