by Joseph Rickert
The igraph package has become a fundamental tool for the study of graphs and their properties, the manipulation and visualization of graphs and the statistical analysis of networks. To get an idea of just how firmly igraph has become embedded into the R package ecosystem consider that currently igraph lists 72 reverse depends, 59 reverse imports and 24 reverse suggests. The following plot, which was made with functions from the igraph and miniCRAN packages, indicates the complexity of this network.
library(miniCRAN) library(igraph) pk < c("igraph","agop","bc3net","BDgraph","c3net","camel", "cccd", "CDVine", "CePa", "CINOEDV", "cooptrees","corclass", "cvxclustr", "dcGOR", "ddepn","dils", "dnet", "dpa", "ebdbNet", "editrules", "fanovaGraph", "fastclime", "FisHiCal", "flare", "G1DBN", "gdistance", "GeneNet", "GeneReg", "genlasso", "ggm", "gRapfa", "hglasso", "huge", "igraphtosonia", "InteractiveIGraph", "iRefR", "JGL", "lcd", "linkcomm", "locits", "loe", "micropan", "mlDNA", "mRMRe", "nets", "netweavers", "optrees", "packdep", "PAGI", "pathClass", "PBC", "phyloTop", "picasso", "PoMoS", "popgraph", "PROFANCY", "qtlnet", "RCA", "ReliabilityTheory", "rEMM", "restlos", "rgexf", "RNetLogo", "ror", "RWBP", "sand", "SEMID", "shp2graph", "SINGLE", "spacejam", "TDA", "timeordered", "tnet") dg < makeDepGraph(pk) plot(dg,main="Network of reverse depends for igraph",cex=.4,vertex.size=8)
The igraph package itself is a tour de force with over 200 functions. Learning the package can be a formidable task especially if you are trying to learn graph theory and network analysis at the same time. To help myself through this process, I sorted the functions in the igraph package into seven rough categories:
The following shows a portion of the table for the first 10 functions related to creating graphs.
Function  Description  Category of Function  
1  aging.prefatt.game  Generate an evolving random graph with preferential attachment and aging  Create Graph 
2  barabasi.game  generate scalefree graphs according to the BarabasiAlbert model  Create Graph 
3  bipartite.random.game  Generate bipartite graphs using the ErdosRenyi model  Create Graph 
4  degree.sequence.game  Generate a random graph with a given degree degree sequence  Create Graph 
5  erdos.renyi.game  Generate random graphs according to the ErdosRenyi model  Create Graph 
6  forest.fire.game  Grow a network that simulates how a fire spreads by igniting trees  Create Graph 
7  graph.adjacency  Create an igraph from an adjacency matrix  Create Graph 
8  graph.bipartite  Create a bipartite graph  Create Graph 
9  graph.complementer  Create the complementary graph for a given graph  Create Graph 
10  graph.empty  Create an empty graph  Create Graph 
The entire table may be downloaded here: Download Igraph_functions.
infomap.community() is an intriguing function listed under the Finding Communities category that looks for structure in a network by minimizing the expected description length of a random walker trajectory. The abstract to the paper by Rosvall and Bergstrom that introduced this method states:
To comprehend the multipartite organization of largescale biological and social systems, we introduce an information theoretic approach that reveals community structure in weighted and directed networks. We use the probability flow of random walks on a network as a proxy for information flows in the real system and decompose the network into modules by compressing a description of the probability flow. The result is a map that both simplifies and highlights the regularities in the structure and their relationships.
I am not willing to claim that the 37 communities found by the algorithm represent meaningful structure, however, the idea of partitioning the network based on information flow does seem relevant to the package building process. Anybody looking for a research project?
imc < infomap.community(dg) imc # Graph community structure calculated with the infomap algorithm # Number of communities: 37 # Modularity: 0.5139813 # Membership vector:
Some additional resources for working with igraph are:
by Joseph Rickert
I recently had the opportunity to look at the data used for the 2009 KDD Cup competition. There are actually two sets of files that are still available from this competition. The "large" file is a series of five .csv files that when concatenated form a data set with 50,000 rows and 15,000 columns. The "small" file also contains 50,000 rows but only 230 columns. "Target" files are also provided for both the large and small data sets. The target files contain three sets of labels for "appentency", "churn" and "upselling" so that the data can be used to train models for three different classification problems.
The really nice feature of both the large and small data sets is that they are extravagantly ugly, containing large numbers of missing variables, factor variables with thousands of levels, factor variables with only one level, numeric variables with constant values, and correlated independent variables. To top it off, the targets are severely unbalanced containing very low proportions of positive examples for all three of the classification problems. These are perfect files for practice.
Often times, the most difficult part about working with data like this is just knowing where to begin. Since getting a good look is usually a good place to start, let's look at a couple of R tools that I found to be helpful for taking that first dive into messy data.
The mi package takes a sophisticated approach to multiple imputation and provides some very advance capabilities. However, it also contains simple and powerful tools for looking at data. The function missing.pattern.plot() lets you see the pattern of missing values. The following line of code provides a gestalt for the small data set.
missing.pattern.plot(DF,clustered=FALSE,
xlab="observations",
main = "KDD 2009 Small Data Set")
Observations (rows) go from left to right and variables from bottom to top. Red indicates missing values. Looking at just the first 25 variables makes it easier to see what the plot is showing.
The function mi.info(), also in the mi package, provides a tremendous amount of information about a data set. Here is the output for the first 10 variables. The first thing the function does is list the variables with no data and the variables that are highly correlated with each other. Thereafter, the function lists a row for each variable that includes the number of missing values and the variable type. This is remarkably useful information that would otherwise take a little bit of work to discover.
> mi.info(DF) variable(s) Var8, Var15, Var20, Var31, Var32, Var39, Var42, Var48, Var52, Var55, Var79, Var141, Var167, Var169, Var175, Var185 has(have) no observed value, and will be omitted. following variables are collinear [[1]] [1] "Var156" "Var66" "Var9" [[2]] [1] "Var104" "Var105" [[3]] [1] "Var111" "Var157" "Var202" "Var33" "Var61" "Var71" "Var91" names include order number.mis all.mis type collinear 1 Var1 Yes 1 49298 No nonnegative No 2 Var2 Yes 2 48759 No binary No 3 Var3 Yes 3 48760 No nonnegative No 4 Var4 Yes 4 48421 No orderedcategorical No 5 Var5 Yes 5 48513 No nonnegative No 6 Var6 Yes 6 5529 No nonnegative No 7 Var7 Yes 7 5539 No nonnegative No 8 Var8 No NA 50000 Yes proportion No 9 Var9 No NA 49298 No nonnegative Var156, Var66 10 Var10 Yes 8 48513 No nonnegative No
For Revolution R Enterprise users the function rxGetINfo() is a real workhorse. It applies to data frames as well as data stored in .xdf files. For data in these files there is essentially no limit to how many observations can be analysed. rxGetInfo() is an example of an external memory algorithm that only reads a chunk of data at a time from the file. Hence, there is no need to try and stuff all of the data into memory.
The following is a portion of the output from running the function with the getVarinfo flag set to TRUE.
rxGetInfo(DF, getVarInfo=TRUE)
Data frame: DF Number of observations: 50000 Number of variables: 230 Variable information: Var 1: Var1, Type: numeric, Low/High: (0.0000, 680.0000) Var 2: Var2, Type: numeric, Low/High: (0.0000, 5.0000) Var 3: Var3, Type: numeric, Low/High: (0.0000, 130668.0000)
.
.
. Var 187: Var187, Type: numeric, Low/High: (0.0000, 910.0000) Var 188: Var188, Type: numeric, Low/High: (6.4200, 628.6200) Var 189: Var189, Type: numeric, Low/High: (6.0000, 642.0000) Var 190: Var190, Type: numeric, Low/High: (0.0000, 230427.0000) Var 191: Var191 2 factor levels: r__I Var 192: Var192 362 factor levels: _hrvyxM6OP _v2gUHXZeb _v2rjIKQ76 _v2TmBftjz ... zKnrjIPxRp ZlOBLJED1x ZSNq9atbb6 ZSNq9aX0Db ZSNrjIX0Db Var 193: Var193 51 factor levels: _7J0OGNN8s6gFzbM 2Knk1KF 2wnefc9ISdLjfQoAYBI 5QKIjwyXr4MCZTEp7uAkS8PtBLcn 8kO9LslBGNXoLvWEuN6tPuN59TdYxfL9Sm6oU ... X1rJx42ksaRn3qcM X2uI6IsGev yaM_UXtlxCFW5NHTcftwou7BmXcP9VITdHAto z3s4Ji522ZB1FauqOOqbkl zPhCMhkz9XiOF7LgT9VfJZ3yI Var 194: Var194 4 factor levels: CTUH lvza SEuy Var 195: Var195 23 factor levels: ArtjQZ8ftr3NB ArtjQZmIvr94p ArtjQZQO1r9fC b_3Q BNjsq81k1tWAYigY ... taul TnJpfvsJgF V10_0kx3ZF2we XMIgoIlPqx ZZBPiZh Var 196: Var196 4 factor levels: 1K8T JA1C mKeq z3mO Var 197: Var197 226 factor levels: _8YK _Clr _vzJ 0aHy ... ZEGa ZF5Q ZHNR ZNsX ZSv9 Var 198: Var198 4291 factor levels: _0Ong1z _0OwruN _0OX0q9 _3J0EW7 _3J6Cnn ... ZY74iqB ZY7dCxx ZY7YHP2 ZyTABeL zZbYk2K Var 199: Var199 5074 factor levels: _03fc1AIgInD8 _03fc1AIgL6pC _03jtWMIkkSXy _03wXMo6nInD8 ... zyR5BuUrkb8I9Lth ZZ5
.
.
.
rxGetInfo() doesn't provide all of the information that mi.info() does, but is does do a particularly nice job on factor data, giving the number of levels and showing the first few. The two functions are complementary.
For a full listing of the output shown above down load the file: Download Mi_info_output.
The latest update to the world's most popular statistical data analysis software is now available. R 3.1.2 (codename: "Pumpkin Helmet") makes a number of minor improvements and bug fixes to the R language engine. You can see the complete list of changes here, which include improvements for the logNormal distribution function, improved axis controls for histograms, a fix to the nlminb optimizer which was causing rare crashes on Windows (and traced to a bug in the gcc compiler), and some compatibility updates for the Yosemite release of OS X on Macs.
This latest update comes on the heels of another major R milestone: the CRAN package repository now features more than 6,000 usercontributed packages! (CRAN actually hit that milestone two days ago; as of this writing there are actually 6,004.) These packages are all ready to use with R 3.1.2 — the CRAN system automatically checks to make sure the packages pass all of their test with the latest R version. You can search and explore R packages on MRAN, or simply browse R packages by topic area.
Revolution R Open will be updated to include R 3.1.2 very soon: the next update is in testing now and should be ready in a couple of weeks.
rdevel mailing list: R 3.1.2 is released
by Joseph Rickert
One of the most interesting R related presentations at last week’s Strata Hadoop World Conference in New York City was the session on Distributed R by Sunil Venkayala and Indrajit Roy, both of HP Labs. In short, Distributed R is an open source project with the end goal of running R code in parallel on data that is distributed across multiple machines. The following figure conveys the general idea.
A master node controls multiple worker nodes each of which runs multiple R processes in parallel.
As I understand it, the primary use case for the Distributed R software is to move data quickly from a database into distributed data structures that can be accessed by multiple, independent R instances for coordinated, parallel computation. The Distributed R infrastructure automatically takes care of the extraction of the data and the coordination of the calculations, including the occasional movement of data from a worker node to the master node when required by the calculation. The user interface to the Distributed R mechanism is through R functions that have been designed and optimized to work with the distributed data structures, and through a special “Distributed R aware” foreach() function that allow users to write their own distributed functions using ordinary R functions.
To make all of this happen, Distributed R platform contains several components that may be briefly described as follows:
The distributed R package contains:
A really nice feature of the distributed data structures is that they can be populated and accessed by rows, columns and blocks making it possible to write efficient algorithms tuned to the structure of particular data sets. For example, data cleaning for wide data sets (many more columns than rows) can be facilitated by preprocessing individual features.
vRODBC is an ODBC client that provides R with database connectivity. This is the connection mechanism that permits the parallel loading of data from various sources data including HP’s Vertica database.
The HPdata package contains the functions that allow you to actually load distributed data structures from various data sources
The HPDGLM package implements a parallel, distributed GLM models (Presently only linear regression, logistic regression and Poisson regression models are available), The package also contains functions for cross validation and splitsample validation.
The HPdclassifier package is intended to contain several distributed classification algorithms. It currently contains a parallel distributed implementation of the random forests algorithm.
The HPdcluster package contains a parallel, distributed kmeans algorithm.
The HPdgraph package is intended to contain distributed algorithms for graph analytics. It currently contains a parallel, distributed implementation of the pagerank algorithm for directed graphs.
The following sample code, taken directly from the HPdclassifier User Guide, but modified slightly for presentation here, is similar to the examples that Venkayala and Roy showed in their presentation. Note, that after the distributed arrays are set up they are loaded in parallel with data using the the foreach function from the distributedR package.
library(HPdclassifier) # loading the library
Loading required package: distributedR
Loading required package: Rcpp
Loading required package: RInside
Loading required package: randomForest
distributedR_start() # starting the distributed environment
Workers registered  1/1.
All 1 workers are registered.
[1] TRUE
nparts < sum(ds$Inst) # number of available distributed instances
# Describe the data
nSamples < 100 # number of samples
nAttributes < 5 # number of attributes of each sample
nSpllits < 1 # number of splits in each darray
# Create the distributed arrays
dax < darray(c(nSamples,nAttributes),c(round(nSamples/nSpllits),nAttributes))
day < darray(c(nSamples,1), c(round(nSamples/nSpllits),1))
# Load the distributed arrays
foreach(i, 1:npartitions(dax),
function(x=splits(dax,i),y=splits(day,i),id=i){
x < matrix(runif(nrow(x)*ncol(x)), nrow(x),ncol(x))
y < matrix(runif(nrow(y)), nrow(y), 1)
update(x)
update(y)
})
# Fit the Random Forest Model
myrf < hpdrandomForest(dax, day, nExecutor=nparts)
# prediction
dp < predictHPdRF(myrf, dax)
Notwithstanding all of its capabilities, Distributed R is still clearly work in progress. It is only available on Linux platforms. Algorithms and data must be resident in memory. Distributed R is not available on CRAN, and even with an excellent Installation Guide, installing the platform is a bit of an involved process.
Nevertheless, Distributed R is impressive, and I think a valuable contribution to open source R. I expect that users with distributed data will find the platform to be a viable way to begin high performance computing with R.
Note that the Distributed R project discussed in this post is an HP initiative and is not in anyway related to http://www.revolutionanalytics.com/revolutionrenterprisedistributedr.
Many R scripts depend on CRAN packages, and most CRAN packages in turn depend on other CRAN packages. If you install an R package, you'll also be installing its dependencies to make it work, and possibly other packages as well to enable its full functionality.
My colleague Andrie posted some R code to map package dependencies a couple of months ago, but now you can easily explore the dependencies of any CRAN package at MRAN. Simply search for a package and click the Dependencies Graph tab. Here's a very simple one: the foreach package.
The foreach package depends on two others: iterators and codetools, which will be automatically installed for you by install.packages when you install foreach. (We'll duscuss the use of "Suggests" — as here with randomForest — later.) Now let's look at a more complex example: the caret package.
The caret package provides an interface to many of the predictive modeling packages on CRAN, and so it has several dependencies (nine, in fact — you can see the list by clicking on the Dependencies Table tab). But it also Suggests many more packages — these are packages that are not required to run caret, but if you do have them, there are more model types you can use within the caret framework.
Here's a quick overview of the types of dependencies you'll find in the charts and tables on MRAN:
MRAN is updated daily, and so the Dependencies Graph is always uptodate with the latest CRAN packages and their connections. Start exploring at the link below.
MRAN: Explore Packages
The ability to create reproducible research is an important topic for many users of R. So important, that several groups in the R community have tackled this problem. Notably, packrat from RStudio, and gRAN from Genentech (see our previous blog post).
The Reproducible R Toolkit is a new opensource initiative from Revolution Analytics. It takes a simple approach to dealing with R package versions, consisting of an R package checkpoint, and an associated daily CRAN snapshot archive, checkpointserver. Here's one illustration of the problem it solves (with apologies to xkcd):
To achieve reproducibility, we store daily snapshots of all CRAN packages. At midnight UTC each day we refresh the CRAN mirror and then store a snapshot of CRAN as it exists at that very moment. You can access these daily snapshots using the checkpoint package, which installs and consistently use these packages just as they existed at the snapshot date. Daily snapshots exist starting from 20140917.
checkpoint package
The goal of the checkpoint package is to solve the problem of package reproducibility in R. Since packages get updated on CRAN all the time, it can be difficult to recreate an environment where all your packages are consistent with some earlier state. To solve this issue, checkpoint allows you to install packages locally as they existed on a specific date from the corresponding snapshot (stored on the checkpoint server) and it configures your R session to use only these packages. Together, the checkpoint package and the checkpoint server act as a "CRAN time machine", so that anyone using checkpoint can ensure the reproducibility of scripts or projects at any time.
How to use checkpoint
One you have the checkpoint package installed, using the checkpoint() function is as simple as adding the following lines to the top of your script:
Typically, you will use the date you created the script as the argument to checkpoint. The first time you run the script, checkpoint will inspect your script (and other R files in the same project folder) for the packages used, and install the required packages with versions as of the specified date. (The first time you run the script, it will take some time to download and install the packages, but subsequent runs will use the previouslyinstalled package versions.)
The checkpoint package installs the packages in a folder specific to the current project (in a subfolder of
If you want to update the packages you use at a later date, just update the date in the checkpoint() call and checkpoint() will automatically update the locallyinstalled packages.
The checkpoint package is available on CRAN:
Worked example
The Reproducible R Toolkit was created by the Open Source Solutions group at Revolution Analytics. Special thanks go to Scott Chamberlain who helped with early development.
We'd love to know what you think about checkpoint. Leave comments here on the blog, or via the checkpoint GitHub page.
by Andrie deVries
One of the reasons that R is so popular is the CRAN archive of useful packages. However, with more than 5,900 packages on CRAN, many organisations need to maintain a private mirror of CRAN with only a subset of packages that are relevant to them.
The package miniCRAN makes this possible by determining the dependency tree for a given set of packages, then downloading all of the package dependencies. (My previous post showed how to do this.)
There are many reasons for not creating a complete mirror CRAN using rsync:
The ambition of miniCRAN is to eventually satisfy many of these considerations. For example, the github version of miniCRAN already allows you to draw a dependency graph using packages on CRAN as well as github. In due course we plan to extend the package to also download packages from any public repository or private file location, as well as github packages.
You can find miniCRAN on CRAN. To install the package, use
install.packages("miniCRAN") library("miniCRAN")
During September gave a presentation about miniCRAN at the first EARL conference in London:
For more information about miniCRAN, take a look at this Introduction to using miniCRAN. To suggest improvements or other features, please visit my miniCRAN page on GitHub. I hope you find it useful!
Hadley Wickham's dplyr package is a great toolkit for getting data ready for analysis in R. If you haven't yet taken the plunge to using dplyr, Kevin Markham has put together a great handson video tutorial for his Data School blog, which you can see below. The video covers the five main datamanipulation "verbs" that dplyr provides: filter, select, arrange, mutate and summarise/group_by. (It also introduces the glimpse function, a handy alternative to str, that I had overlooked before.)
The video also provides an introduction to the %>% ("then") operator from magrittr, which you'll likely fund useful for many other applications in addition to dplyr. Also, Kevin's video works from an Rmarkdown script to show how dplyr works, and so serves as a minitutorial for Rmarkdown as well. It's well worth 40 minutes of your time. Also, check out Kevin's blog post linked below for links to many other useful dplyr resources.
Data School: Handson dplyr tutorial for faster data manipulation in R (via Peter Aldhous)
by Joseph Rickert
One of the most difficult things about R, a problem that is particularly vexing to beginners, is finding things. This is an unintended consequence of R's spectacular, but mostly uncoordinated, organic growth. The R core team does a superb job of maintaining the stability and growth of the R language itself, but the innovation engine for new functionality is largely in the hands of the global R communty.
Several structures have been put in place to address various apsects of the finding things problem. For example, Task Views represent a monumental effort to collect and classify R packages. The RSeek site is an effective tool for web searches. RBloggers is a good place to go for R applications and CRANberries let's you know what's new. But, how do you find things that you didn't even know you were looking for?For this, the so called "misc packages" can be very helpful. Whereas the majority of R packages are focused on a particular type of analysis or class of models, or special tool, misc packages tend to be collections of functions that facilitate common tasks. (Look below for a partial list).
DescTools is a new entry to the misc package scene that I think could become very popular. The description for the package begins:
DescTools contains a bunch of basic statistic functions and convenience wrappers for efficiently describing data, creating specific plots, doing reports using MS Word, Excel or PowerPoint. The package's intention is to offer a toolbox, which facilitates the (notoriously time consuming) first descriptive tasks in data analysis, consisting of calculating descriptive statistics, drawing graphical summaries and reporting the results. Many of the included functions can be found scattered in other packages and other sources written partly by Titans of R.
So far, of the 380 functions in this collection the Desc function has my attention. This function provides very nice tabular and graphic summaries of the variables in a data frame with output that is specific to the data type. The d.pizza data frame that comes with the package has a nice mix of data types
head(d.pizza) index date week weekday area count rabate price operator driver delivery_min temperature wine_ordered wine_delivered 1 1 20140301 9 6 Camden 5 TRUE 65.655 Rhonda Taylor 20.0 53.0 0 0 2 2 20140301 9 6 Westminster 2 FALSE 26.980 Rhonda Butcher 19.6 56.4 0 0 3 3 20140301 9 6 Westminster 3 FALSE 40.970 Allanah Butcher 17.8 36.5 0 0 4 4 20140301 9 6 Brent 2 FALSE 25.980 Allanah Taylor 37.3 NA 0 0 5 5 20140301 9 6 Brent 5 TRUE 57.555 Rhonda Carter 21.8 50.0 0 0 6 6 20140301 9 6 Camden 1 FALSE 13.990 Allanah Taylor 48.7 27.0 0 0 wrongpizza quality 1 FALSE medium 2 FALSE high 3 FALSE <NA> 4 FALSE <NA> 5 FALSE medium 6 FALSE low
Here is some of the voluminous output from the function. The data frame as a whole is summarized as follows
'data.frame': 1209 obs. of 16 variables: 1 $ index : int 1 2 3 4 5 6 7 8 9 10 ... 2 $ date : Date, format: "20140301" "20140301" "20140301" "20140301" ... 3 $ week : num 9 9 9 9 9 9 9 9 9 9 ... 4 $ weekday : num 6 6 6 6 6 6 6 6 6 6 ... 5 $ area : Factor w/ 3 levels "Brent","Camden",..: 2 3 3 1 1 2 2 1 3 1 ... 6 $ count : int 5 2 3 2 5 1 4 NA 3 6 ... 7 $ rabate : logi TRUE FALSE FALSE FALSE TRUE FALSE ... 8 $ price : num 65.7 27 41 26 57.6 ... 9 $ operator : Factor w/ 3 levels "Allanah","Maria",..: 3 3 1 1 3 1 3 1 1 3 ... 10 $ driver : Factor w/ 7 levels "Butcher","Carpenter",..: 7 1 1 7 3 7 7 7 7 3 ... 11 $ delivery_min : num 20 19.6 17.8 37.3 21.8 48.7 49.3 25.6 26.4 24.3 ... 12 $ temperature : num 53 56.4 36.5 NA 50 27 33.9 54.8 48 54.4 ... 13 $ wine_ordered : int 0 0 0 0 0 0 1 NA 0 1 ... 14 $ wine_delivered: int 0 0 0 0 0 0 1 NA 0 1 ... 15 $ wrongpizza : logi FALSE FALSE FALSE FALSE FALSE FALSE ... 16 $ quality : Ord.factor w/ 3 levels "low"<"medium"<..: 2 3 NA NA 2 1 1 3 3 2 ...
The factor variable driver gets a table and a plot.
10  driver (factor) length n NAs levels unique dupes 1'209 1'204 5 7 7 y level freq perc cumfreq cumperc 1 Carpenter 272 .226 272 .226 2 Carter 234 .194 506 .420 3 Taylor 204 .169 710 .590 4 Hunter 156 .130 866 .719 5 Miller 125 .104 991 .823 6 Farmer 117 .097 1108 .920 7 Butcher 96 .080 1204 1.000
and so does the numeric variable delivery.
11  delivery_min (numeric) length n NAs unique 0s mean meanSE 1'209 1'209 0 384 0 25.653 0.312 .05 .10 .25 median .75 .90 .95 10.400 11.600 17.400 24.400 32.500 40.420 45.200 rng sd vcoef mad IQR skew kurt 56.800 10.843 0.423 11.268 15.100 0.611 0.095 lowest : 8.8 (3), 8.9, 9 (3), 9.1 (5), 9.2 (3) highest: 61.9, 62.7, 62.9, 63.2, 65.6 ShapiroWilks normality test p.value : 2.2725e16
Pretty nice for an automatic first look at the data.
For some more R treasure hunting have a look into the following short list of misc packages.
Package 
Description 
Tools for manipulating data (No 1 package downloaded for 2013) 

Convenience wrappers for functions for manipulating strings 

One of the most popular R packages of all time: functions for data analysis, graphics, utilities and much more 

Package development tools 

The “go to” package for machine learning, classification and regression training 

Good svm implementation and other machine learning algorithms 

Tools for describing data and descriptive statistics 

Tools for plotting decision trees 

Functions for numerical analysis, linear algebra, optimization, differential equations and some special functions 

Contains different highlevel graphics functions for displaying large datasets 

Relatively new package with various functions for survival data extending the methods available in the survival package. 

New this year: miscellaneous R tools to simplify the working with data types and formats including functions for working with data frames and character strings 

Some functions for Kalman filters 

Misc 3d plots including isosurfaces 

New package with utilities for producing maps 

Various programming tools like ASCIIfy() to convert characters to ASCII and checkRVersion() to see if a newer version of R is available 

A grab bag of utilities including progress bars and function timers 
by Joseph Rickert
While preparing for the DataWeek R Bootcamp that I conducted this week I came across the following gem. This code, based directly on a Max Kuhn presentation of a couple years back, compares the efficacy of two machine learning models on a training data set.
# # SET UP THE PARAMETER SPACE SEARCH GRID ctrl < trainControl(method="repeatedcv", # use repeated 10fold cross validation repeats=5, # do 5 repititions of 10fold cv summaryFunction=twoClassSummary, # Use AUC to pick the best model classProbs=TRUE) # Note that the default search grid selects 3 values of each tuning parameter # grid < expand.grid(.interaction.depth = seq(1,7,by=2), # look at tree depths from 1 to 7 .n.trees=seq(10,100,by=5), # let iterations go from 10 to 100 .shrinkage=c(0.01,0.1)) # Try 2 values of the learning rate parameter # BOOSTED TREE MODEL set.seed(1) names(trainData) trainX <trainData[,4:61] registerDoParallel(4) # Registrer a parallel backend for train getDoParWorkers() system.time(gbm.tune < train(x=trainX,y=trainData$Class, method = "gbm", metric = "ROC", trControl = ctrl, tuneGrid=grid, verbose=FALSE)) # # SUPPORT VECTOR MACHINE MODEL # set.seed(1) registerDoParallel(4,cores=4) getDoParWorkers() system.time( svm.tune < train(x=trainX, y= trainData$Class, method = "svmRadial", tuneLength = 9, # 9 values of the cost function preProc = c("center","scale"), metric="ROC", trControl=ctrl) # same as for gbm above ) # # COMPARE MODELS USING RESAPMLING # Having set the seed to 1 before running gbm.tune and svm.tune we have generated paired samplesfor comparing models using resampling. # # The resamples function in caret collates the resampling results from the two models rValues < resamples(list(svm=svm.tune,gbm=gbm.tune)) rValues$values # # BOXPLOTS COMPARING RESULTS bwplot(rValues,metric="ROC") # boxplot
After setting up a grid to search the parameter space of a model, the train() function from the caret package is used used to train a generalized boosted regression model (gbm) and a support vector machine (svm). Setting the seed produces paired samples and enables the two models to be compared using the resampling technique described in Hothorn at al, "The design and analysis of benchmark experiments", Journal of Computational and Graphical Statistics (2005) vol 14 (3) pp 675699
The performance metric for the comparison is the ROC curve. From examing the boxplots of the sampling distributions for the two models it is apparent that, in this case, the gbm has the advantage.
Also, notice that the call to registerDoParallel() permits parallel execution of the training algorithms. (The taksbar showed all foru cores of my laptop maxed out at 100% utilization.)
I chose this example because I wanted to show programmers coming to R for the first time that the power and productivity of R comes not only from the large number of machine learning models implemented, but also from the tremendous amount of infrastructure that package authors have built up, making it relatively easy to do fairly sophisticated tasks in just a few lines of code.
All of the code for this example along with the rest of the my code from the Datweek R Bootcamp is available on GitHub.