by Joseph Rickert
One of the most difficult things about R, a problem that is particularly vexing to beginners, is finding things. This is an unintended consequence of R's spectacular, but mostly uncoordinated, organic growth. The R core team does a superb job of maintaining the stability and growth of the R language itself, but the innovation engine for new functionality is largely in the hands of the global R community.
Several structures have been put in place to address various aspects of the finding-things problem. For example, Task Views represent a monumental effort to collect and classify R packages. The RSeek site is an effective tool for web searches. R-bloggers is a good place to go for R applications, and CRANberries lets you know what's new. But how do you find things that you didn't even know you were looking for? For this, the so-called "misc packages" can be very helpful. Whereas the majority of R packages are focused on a particular type of analysis, class of models, or special tool, misc packages tend to be collections of functions that facilitate common tasks. (Look below for a partial list.)
DescTools is a new entry to the misc package scene that I think could become very popular. The description for the package begins:
DescTools contains a bunch of basic statistic functions and convenience wrappers for efficiently describing data, creating specific plots, doing reports using MS Word, Excel or PowerPoint. The package's intention is to offer a toolbox, which facilitates the (notoriously time consuming) first descriptive tasks in data analysis, consisting of calculating descriptive statistics, drawing graphical summaries and reporting the results. Many of the included functions can be found scattered in other packages and other sources written partly by Titans of R.
So far, of the 380 functions in this collection, the Desc function has my attention. This function provides very nice tabular and graphic summaries of the variables in a data frame, with output that is specific to the data type. The d.pizza data frame that comes with the package has a nice mix of data types:
head(d.pizza)
  index       date week weekday        area count rabate  price operator  driver delivery_min temperature wine_ordered wine_delivered
1     1 2014-03-01    9       6      Camden     5   TRUE 65.655   Rhonda  Taylor         20.0        53.0            0              0
2     2 2014-03-01    9       6 Westminster     2  FALSE 26.980   Rhonda Butcher         19.6        56.4            0              0
3     3 2014-03-01    9       6 Westminster     3  FALSE 40.970  Allanah Butcher         17.8        36.5            0              0
4     4 2014-03-01    9       6       Brent     2  FALSE 25.980  Allanah  Taylor         37.3          NA            0              0
5     5 2014-03-01    9       6       Brent     5   TRUE 57.555   Rhonda  Carter         21.8        50.0            0              0
6     6 2014-03-01    9       6      Camden     1  FALSE 13.990  Allanah  Taylor         48.7        27.0            0              0
  wrongpizza quality
1      FALSE  medium
2      FALSE    high
3      FALSE    <NA>
4      FALSE    <NA>
5      FALSE  medium
6      FALSE     low
Here is some of the voluminous output from the function. The data frame as a whole is summarized as follows:
'data.frame': 1209 obs. of 16 variables:
  1 $ index         : int  1 2 3 4 5 6 7 8 9 10 ...
  2 $ date          : Date, format: "2014-03-01" "2014-03-01" "2014-03-01" "2014-03-01" ...
  3 $ week          : num  9 9 9 9 9 9 9 9 9 9 ...
  4 $ weekday       : num  6 6 6 6 6 6 6 6 6 6 ...
  5 $ area          : Factor w/ 3 levels "Brent","Camden",..: 2 3 3 1 1 2 2 1 3 1 ...
  6 $ count         : int  5 2 3 2 5 1 4 NA 3 6 ...
  7 $ rabate        : logi TRUE FALSE FALSE FALSE TRUE FALSE ...
  8 $ price         : num  65.7 27 41 26 57.6 ...
  9 $ operator      : Factor w/ 3 levels "Allanah","Maria",..: 3 3 1 1 3 1 3 1 1 3 ...
 10 $ driver        : Factor w/ 7 levels "Butcher","Carpenter",..: 7 1 1 7 3 7 7 7 7 3 ...
 11 $ delivery_min  : num  20 19.6 17.8 37.3 21.8 48.7 49.3 25.6 26.4 24.3 ...
 12 $ temperature   : num  53 56.4 36.5 NA 50 27 33.9 54.8 48 54.4 ...
 13 $ wine_ordered  : int  0 0 0 0 0 0 1 NA 0 1 ...
 14 $ wine_delivered: int  0 0 0 0 0 0 1 NA 0 1 ...
 15 $ wrongpizza    : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
 16 $ quality       : Ord.factor w/ 3 levels "low"<"medium"<..: 2 3 NA NA 2 1 1 3 3 2 ...
The factor variable driver gets a table and a plot.
10 - driver (factor)

  length      n    NAs levels unique  dupes
   1'209  1'204      5      7      7      y

       level  freq  perc  cumfreq  cumperc
1  Carpenter   272  .226      272     .226
2     Carter   234  .194      506     .420
3     Taylor   204  .169      710     .590
4     Hunter   156  .130      866     .719
5     Miller   125  .104      991     .823
6     Farmer   117  .097     1108     .920
7    Butcher    96  .080     1204    1.000
and so does the numeric variable delivery_min.
11 - delivery_min (numeric)

  length       n     NAs  unique      0s    mean  meanSE
   1'209   1'209       0     384       0  25.653   0.312

     .05     .10     .25  median     .75     .90     .95
  10.400  11.600  17.400  24.400  32.500  40.420  45.200

     rng      sd   vcoef     mad     IQR    skew    kurt
  56.800  10.843   0.423  11.268  15.100   0.611   0.095

lowest : 8.8 (3), 8.9, 9 (3), 9.1 (5), 9.2 (3)
highest: 61.9, 62.7, 62.9, 63.2, 65.6

Shapiro-Wilks normality test p.value : 2.2725e-16
Pretty nice for an automatic first look at the data.
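For readers curious about what goes into the factor summary, here is a minimal base R sketch that rebuilds the frequency part of the driver table. It is not how Desc is implemented; the counts are copied from the output above, and the driver vector and the freq_tab name are made up for illustration.

```r
# Counts copied from the driver summary shown earlier; the vector is rebuilt
# for illustration and is not the original d.pizza column.
counts <- c(Carpenter = 272, Carter = 234, Taylor = 204, Hunter = 156,
            Miller = 125, Farmer = 117, Butcher = 96)
driver <- factor(rep(names(counts), times = counts))

# Frequencies sorted from most to least common level
freq <- sort(table(driver), decreasing = TRUE)

# Frequency, share, and their running totals, one row per level
freq_tab <- data.frame(
  level   = names(freq),
  freq    = as.vector(freq),
  perc    = round(as.vector(freq) / sum(freq), 3),
  cumfreq = cumsum(as.vector(freq)),
  cumperc = round(cumsum(as.vector(freq)) / sum(freq), 3)
)
print(freq_tab)
```

The same pattern (table, then cumsum) works for any factor column, although it ignores the missing values that Desc reports separately.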
For some more R treasure hunting, have a look at the following short list of misc packages.
Package 
Description 
Tools for manipulating data (No 1 package downloaded for 2013) 

Convenience wrappers for functions for manipulating strings 

One of the most popular R packages of all time: functions for data analysis, graphics, utilities and much more 

Package development tools 

The “go to” package for machine learning, classification and regression training 

Good svm implementation and other machine learning algorithms 

Tools for describing data and descriptive statistics 

Tools for plotting decision trees 

Functions for numerical analysis, linear algebra, optimization, differential equations and some special functions 

Contains different high-level graphics functions for displaying large datasets 

Relatively new package with various functions for survival data extending the methods available in the survival package. 

New this year: miscellaneous R tools to simplify working with data types and formats, including functions for working with data frames and character strings 

Some functions for Kalman filters 

Misc 3d plots including isosurfaces 

New package with utilities for producing maps 

Various programming tools like ASCIIfy() to convert characters to ASCII and checkRVersion() to see if a newer version of R is available 

A grab bag of utilities including progress bars and function timers 
by James Peruvankal
There are plenty of options if you want to learn R and are looking for training: your college’s statistics department, massive open online courses from providers like Coursera, Udacity, edX and DataCamp, and so on. SiliconANGLE recently published an article about top R training companies.
Let’s talk about how to choose a good R trainer.
At Revolution Analytics we are guided by the teaching philosophy presented in the following chart:
So, if you are serious about learning R, brush up on your statistics, be prepared to jump right in and start doing things on your own, surround yourself with people who are passionate about statistics and R, and figure out how to make the whole process fun for you. If you are teaching R and want to join us in our mission to ‘take R to the Enterprise’, see if you can fit in with our team.
Facebook is a company that deals with a lot of data — more than 500 terabytes a day — and R is widely used at Facebook to visualize and analyze that data. Applications of R at Facebook include user behaviour, content trends, human resources and even graphics for the IPO prospectus. Now, four R users at Facebook (Moira Burke, Chris Saden, Dean Eckles and Solomon Messing) share their experiences using R at Facebook in a new Udacity online course, Exploratory Data Analysis.
As the name suggests, this online course uses R (via RStudio) and the ggplot2 package to provide an introduction to Exploratory Data Analysis. The R Basics chapter gives a general overview of R: installing, starting, getting help, and language basics. Then the course covers visualizing data sets of one, two and multiple variables. There's even an introduction to predictive modeling at the end of the course.
A free Udacity account is required to watch the course videos. Get started at the link below.
Udacity: Exploratory Data Analysis
Looking for a fun and useful intro to R for first-timers, with bonus cat pictures? Look no further than R for Cats, from Scott Chamberlain of ROpenSci. In addition to very helpful tips on R syntax, data structures, and an excellent list of dos and don'ts, it also shows you how to do this with R:
Check it out: R for cats and cat lovers
by Joseph Rickert
R/Finance 2014 is just about a week away. Over the past four or five years this has become my favorite conference. It is small (300 people this year), exceptionally well-run, and always offers an eclectic mix of theoretical mathematics, efficient, practical computing, industry best practices and trading “street smarts”. This clip of Blair Hull delivering a keynote speech at R/Finance 2012 is an example of the latter. It ought to resonate with anyone who has followed some of the hype surrounding Michael Lewis’s recent book Flash Boys.
In any event, I thought it would be a good time to look at the relationship between R and Finance and to highlight some resources that are available to students, quants and data scientists looking to do computational finance with R.
First off, consider what computational finance has done for R. From the point of view of the development and growth of the R language, I think it is pretty clear that computational finance has played the role of the ultimate “Killer App” for R. This high-stakes, competitive environment, where a theoretical edge or a marginal computational advantage can mean big rewards, has led to R package development in several areas including time series, optimization, portfolio analysis, risk management, high performance computing and big data. Additionally, challenges and crises in the financial markets have helped accelerate R’s growth into big data. In this podcast, Michael Kane talks about the analysis of the 2010 Flash Crash he did with Casey King and Richard Holowczak and describes using R with large financial datasets.
Conversely, I think that it is also clear that R has done quite a bit to further computational finance. R’s ability to facilitate rapid data analysis and visualization, its great number of available functions and algorithms, and the ease with which it can interface to new data sources and other computing environments have made it a flexible tool that evolves and adapts at a pace that matches developments in the financial industry. The list of packages in the Finance Task View on CRAN indicates the symbiotic relationship between the development of R and the needs of those working in computational finance. On the one hand, there are over 70 packages under the headings Finance and Risk Management that were presumably developed to directly respond to a problem in computational finance. On the other hand, the task view also mentions that packages in the Econometrics, Multivariate, Optimization, Robust, SocialSciences and TimeSeries task views may be useful to anyone working in computational finance. (The High Performance Computing and Machine Learning task views should probably also be mentioned.) The point is that while a good bit of R is useful for problems in computational finance, R has greatly benefited from the contributions of the computational finance community.
If you are just getting started with R and computational finance, have a look at John Nolan’s R as a Tool in Computational Finance. Other resources for R and computational finance that you may find helpful are:
Package Vignettes
Several of the finance-related packages have very informative vignettes or associated websites. For example, have a look at those for the packages portfolio, rugarch, RQuantLib (check out the cool rotating distributions), PerformanceAnalytics, and MarkowitzR.
Data
Quandl has become a major source for financial data, which can be easily accessed from R.
Websites
Relevant websites include the RMetrics site, The R Trader, Burns Statistics and Guy Yollin’s repository of presentations.
YouTube
Three videos that I found to be particularly interesting are recordings of the presentations “Finance with R” by Ronald Hochreiter, “Using R in Academic Finance” by Sanjiv Das and “Portfolio Construction in R” by Elliot Norma.
Blogs
Over the past couple of years, R-bloggers has posted quite a few finance-related applications. Prominent among these is the series on Quantitative Finance Applications in R by Daniel Harrison on the Revolutions Blog.
Books
Books on R and Finance include the excellent RMetrics series of ebooks, Statistics and Data Analysis for Financial Engineering by David Ruppert, Financial Risk Modelling and Portfolio Optimization with R by Bernhard Pfaff, Introduction to R for Quantitative Finance by Daróczi et al., and a brand new title, Computational Finance: An Introductory Course with R by Argimiro Arratia.
Coursera
This August, Eric Zivot will teach the course Introduction to Computational Finance and Financial Econometrics, which will emphasize R.
The R Journal
The R Journal frequently publishes finance-related papers. The present issue, Volume 5/2 (December 2013), contains three relevant papers: Performance Attribution for Equity Portfolios by Yang Lu and David Kane; Temporal Disaggregation of Time Series by Christoph Sax and Peter Steiner; and betategarch: Simulation, Estimation and Forecasting of Beta-Skew-t-EGARCH Models by Genaro Sucarrat.
Conferences
In addition to R/Finance (Chicago) and useR! 2014 (Los Angeles), look for R-based computational finance expertise at the 8th R/RMetrics Workshop (Paris).
Community
R-SIG-Finance is one of R’s most active special interest groups.
by Joseph Rickert
Worldwide R user group activity for the first quarter of 2014 appears to be way up compared to previous years, as the following plot shows.
The plot was built by counting the meetings on Revolution Analytics’ R Community Calendar. R users continue to value live, in-person events and face-to-face meetings with their peers. Moreover, if you peruse the details of the meetings listed on the calendar you will see that there have been some fantastic presentations so far this year. And although, as with any live event, if you missed it you will never know how good it really was, some of the R user groups have left traces of what happened by posting presentation slides on public areas of their websites.
I am sorry that I missed Harold Baize’s presentation to the Berkeley R language Beginner Study Group on Using R in a Microsoft Office world. I think the following slide from Harold’s presentation indicates a pragmatic approach coupled with a wry sense of humor.
The presentation goes on to discuss reading and writing to Excel, Tableau, the R2DOCX package, markdown and more. To get the presentation, download the file RinMSOffice from the Berkeley site.
Jeroen Janssens’ presentation on Command Line Data Science to the New York Open Statistical Programming Meetup is a very nice introduction to command line essentials for either Linux or Mac OS. If you are not comfortable with cat, grep or awk this might be a place to start.
In a similar spirit of helping new users get up to speed with R, the New Hampshire R Users Group (NH UserRs) has posted a number of tutorials by David Hocking. Introduction to Linear Regression and ANOVA in R is a useful 10-minute first look. Newcomers might also find it valuable to follow up with the David Lillis presentation, Data Analysis Tips in R, that has been posted on the Wellington R User Group (WRUG) site. This presentation contains a short discussion on calculating tetrachoric and polychoric correlations in R that quite a few people might find valuable.
DublinR has graciously made available all of the materials from a Bayesian Data Analysis workshop that Mick Cooney delivered earlier this year. A zipped file containing R scripts, JAGS files and data can be downloaded by clicking on “bdasingle2014” at this link.
Finally, I am glad that I was present at Megan Price’s talk to the Bay Area UseR Group, How a Small Non-Profit Human Rights Group Uses R. This was an inspirational talk where you really had to be there. Nevertheless, Megan’s slides do portray the big picture of how statistical analysis can make a difference in human rights investigations. The following photo of a cache of documents containing records of those who disappeared during Guatemala's civil war indicates the magnitude of the statistical sampling problem that Megan and her colleagues at HRDAG faced.
This coming Monday, March 31, 2014, is the last day for applying to Revolution Analytics’ R User Group Sponsorship Program for Matrix or Array level sponsorship. So if you are the organizer of an established R user group and you think a cash grant and a box of R-related “goodies” could help you grow, please apply before midnight PST 3/31/14.
by Joseph Rickert
Recently, I had the opportunity to be a member of a job panel for Mathematics, Economics and Statistics students at my alma mater, CSUEB (California State University East Bay). In the context of preparing for a career in data science a student at the event asked: “Where can I find good data sets?”. This triggered a number of thoughts: the first being that it was time to update the list of data sets that I maintain and blog about from time to time. So, thanks to that reminder I have added a few new links to the page, including a new section called Data Science Practice that links to some of the data sets used as examples in Doing Data Science by Rachel Schutt and Cathy O’Neil. Additionally, I have provided a direct link to the BigData Tag on infochimps and pointed out that multiple song data sets are available.
However, to do justice to the student’s question it is necessary to give some thought to exactly what a “good” practice data set might look like. Here are three characteristics that I think a practice data set should have to be good:
Here are three data sets that meet these criteria in ascending order of degree of difficulty:
The first suggestion is the MovieLens data set which contains a million ratings applied to over 10,000 movies by more than 71,000 users. The download comes in two sizes: the full set and a 100K subset. Both versions require working with multiple files.
Near the top of anybody’s list of practice data sets, and second on my little list because of degree of difficulty, is the airlines data set from the 2009 ASA challenge. This data set, which contains the arrival and departure information for all domestic flights from 1987 to 2008, has become the “iris” data set for Big Data. With over 123M rows it is too big to fit into your laptop’s memory, and with 29 variables of different types it is rich enough to suggest several analyses. Moreover, although the version of the data set maintained on the ASA website is fixed and therefore perfect for benchmarking, the Research and Innovative Technology Administration Bureau of Transportation Statistics continues to add to the data on a monthly basis. Go to RITA to get all of the data collected since the ASA competition ended.
Last on my short list is the Million Song data set. This contains features and metadata for one million songs which were originally provided by the music intelligence company Echo Nest. The data is in the specialized HDF5 format, which makes it somewhat of a challenge to access. The data set maintainers do provide wrapper functions to facilitate downloading the data and avoiding some of the complexities of the HDF5 format. However, there are no R wrappers! The last I checked, the maintainers had a paragraph about there being a problem with their code along with an invitation for R experts to contact them. (This would clearly be for extra points.) For more details about the contents of the data set look here.
As a final note, it is much easier to use R to analyze the Public Data Sets available through Amazon Web Services now that you can run Revolution R Enterprise in the Amazon Cloud. We hope to have more to say about exactly how to go about doing this in a future post. However, everything you need to get started is in place, including a 14-day free trial (Amazon charges apply) for Revolution R Enterprise. All you need is your own Amazon account.
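When a file is too large to load at once, one common base R approach is to stream it through an open connection in chunks and keep only running totals. The sketch below is an illustration only, not something taken from the ASA challenge materials: it writes a tiny stand-in CSV to a temporary file (the column names Year and ArrDelay mirror two of the airline variables, but the values are made up) and accumulates a mean arrival delay chunk by chunk.

```r
# Hypothetical stand-in for a large flight file: a small temporary CSV
tmp <- tempfile(fileext = ".csv")
flights <- data.frame(Year = rep(1987:1990, each = 25),
                      ArrDelay = rep(c(5, -3, 12, 0, 20), times = 20))
write.csv(flights, tmp, row.names = FALSE)

# Open the file once and read the header line by hand
con <- file(tmp, open = "r")
col_names <- gsub('"', "", strsplit(readLines(con, n = 1), ",")[[1]])

chunk_size <- 30    # rows per chunk; far larger for a real file
total_delay <- 0
n_rows <- 0
repeat {
  # read.csv() continues from the current position of the open connection
  # and raises an error once no lines remain
  chunk <- tryCatch(
    read.csv(con, header = FALSE, nrows = chunk_size, col.names = col_names),
    error = function(e) NULL
  )
  if (is.null(chunk) || nrow(chunk) == 0) break
  total_delay <- total_delay + sum(chunk$ArrDelay, na.rm = TRUE)
  n_rows <- n_rows + nrow(chunk)
}
close(con)

mean_delay <- total_delay / n_rows
```

Only one chunk is held in memory at a time, so the same loop scales to files much larger than RAM, at the cost of restricting yourself to statistics that can be updated incrementally.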
Please let me know if you have additional links to useful, publicly available data sets that I have missed. We very much appreciate the contributions blog readers have made to the list of data sets.
by Joseph Rickert
I am a book person. I collect books on all sorts of subjects that interest me and consequently I have a fairly extensive collection of R books, many of which I find to be of great value. Nevertheless, when I am asked to recommend an R book to someone new to R I am usually flummoxed. R is growing at a fantastic rate, and people coming to R for the first time span a wide range of sophistication. And besides, owning a book is kind of personal. It is one thing to go out and buy a technical book because it is required for a course, but quite another to make a commitment to a book all on your own. Not only must it have the right content, at the right level for you, and be written in a way that you will actually read it, a book must feel right, be typeset to appeal to your sense of aesthetics, have diagrams and illustrations to draw you in, and contain enough white space to seem approachable. Furthermore, there is a burden to owning a book. There is nothing worse than making a poor selection and having a totally incomprehensible text stare at you from a shelf. Moreover, even an old friend can impose obligations. I have read deeply from The Elements of Statistical Learning, but not everything, so there it sits: admonishing me.
Recently, however, while crawling around CRAN, it occurred to me that there is a tremendous amount of high quality material on a wide range of topics in the Contributed Documentation page that would make a perfect introduction to all sorts of people coming to R. Maybe, all it needs is a little marketing and reorganization. So, from among this treasure cache (and a few other online sources), I have assembled an R “meta” book in the following table that might be called: An R Based Introduction to Probability and Statistics with Applications.
     Content                             Document   Author
 1   Basic Probability and Statistics               G. Jay Kerns
 2   Fitting Probability Distributions              Vito Ricci
 3                                                  Julian J. Faraway
 4   Experimental Design                            Vikneswaran
 5   Survival Analysis                              John Fox
 6   Generalized Linear Models                      Virasakdi Chongsuvivatwong
 7
 8   Time Series                                    McLeod, Yu and Mahdi
 9                                                  Kim Seefeld and Ernst Linder
10   Machine Learning                               Yanchang Zhao
11   Bioinformatics                                 Wim P. Krijnen
12   Forecasting                                    Hyndman and Athanasopoulos
13   Structural Equation Models                     John Fox
14   Credit Scoring                                 Dhruv Sharma
The content column lists the topics that I think ought to be included in a good introductory probability and statistics textbook. With a little searching, you will be able to find a discussion of each topic in the document listed to its right. Obviously, there is a lot of overlap among the documents listed, since most of them are substantial works that cover much more than the few topics that I have listed.
Finally, I don’t mean to imply that the documents in my table are the best assembled in the Contributed Documentation page. The table just represents my idiosyncratic way of organizing some of the material in a way that I hope newcomers will find useful. I think that collectively the contributed documents have everything one might look for in a first date with R. They are available, approachable, contain superb content written by R experts, and are replete with examples and R code. And, with a little effort, a casual first encounter could lead to a long-term relationship.
by Joseph Rickert
Recently, I was trying to remember how to make a 3D scatter plot in R when it occurred to me that the documentation on how to do this is scattered all over the place. Hence, this short organizational note that you may find useful.
First of all, for the benefit of newcomers, I should mention that R has three distinct graphics systems: (1) the “traditional” graphics system, (2) the grid graphics system and (3) ggplot2. The traditional graphics system refers to the graphics and plotting functions in base R. According to Paul Murrell’s authoritative R Graphics, these functions implement the graphics facilities of the S language, and up to about 2005 they comprised most of the graphics functionality of R. The grid graphics system is the foundation for Deepayan Sarkar’s lattice package, which implements the functionality of Bill Cleveland’s Trellis graphics. Finally, ggplot2, Hadley Wickham’s package based on Wilkinson's Grammar of Graphics, took shape between 2007 and 2009, when ggplot2: Elegant Graphics for Data Analysis appeared.
There is considerable overlap in the functionality of R’s three graphics systems, but each has its own strengths and weaknesses. For example, although ggplot2 is currently probably the most popular R package for doing presentation-quality plots, it does not offer 3D plots. To work effectively in R I think it is necessary to know your way around at least two of the graphics systems. To really gain a command of the visualizations that can be done in R, a person would have to be familiar with all three systems as well as the many packages for specialized visualizations: maps, social networks, arc diagrams, animations, time series etc.
But back to the relatively tame task of 3D plots: the generic function persp() in the base graphics package draws perspective plots of a surface over the x–y plane. Typing demo(persp) at the console will give you an idea of what this function can do.
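As a small illustration of persp(), the following sketch draws a surface using base R only. The surface, sin(r)/r, and the helper name sinc are arbitrary choices made for this example.

```r
# Grid over the x-y plane
x <- seq(-10, 10, length.out = 40)
y <- x

# Illustrative surface: sin(r)/r, with the limit value 1 at the origin
sinc <- function(x, y) {
  r <- sqrt(x^2 + y^2)
  ifelse(r == 0, 1, sin(r) / r)
}
z <- outer(x, y, sinc)   # matrix of heights, one row per x, one column per y

# theta and phi set the viewing angles; persp() invisibly returns the
# viewing transformation matrix, which trans3d() can use to add points
pmat <- persp(x, y, z, theta = 30, phi = 30, expand = 0.5,
              col = "lightblue", shade = 0.5, ticktype = "detailed",
              xlab = "x", ylab = "y", zlab = "z",
              main = "persp() sketch")
```

Changing theta rotates the surface about the vertical axis, which is usually the quickest way to find a readable view.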
The plot3D package from Karline Soetaert builds on persp() to provide functions for both 2D and 3D plotting. The vignette for plot3D shows some very impressive plots. Load the package and type the following commands at the console: example(persp3D), example(surf3D) and example(scatter3D) to see examples of 3D surface and scatter plots. Also, try this code to see a cutaway view of a torus.
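# 3D plot of half of a torus
library(plot3D)   # provides mesh() and surf3D()
par(mar = c(2, 2, 2, 2))
par(mfrow = c(1, 1))
R <- 3            # distance from the centre of the torus to the centre of the tube
r <- 2            # radius of the tube
x <- seq(0, 2*pi, length.out = 50)
y <- seq(0, pi, length.out = 50)   # half a revolution only, hence the cutaway
M <- mesh(x, y)
alpha <- M$x
beta <- M$y
surf3D(x = (R + r*cos(alpha)) * cos(beta),
       y = (R + r*cos(alpha)) * sin(beta),
       z = r * sin(alpha),
       colkey = FALSE, bty = "b2", main = "Half of a Torus")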
The scatterplot3d package from R core members Uwe Ligges and Martin Mächler is the "go-to" package for 3D scatter plots. The vignette for this package shows a rich array of plots. Load this package and type example(scatterplot3d) at the console to see examples of spirals, surfaces and 3D scatterplots.
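A spiral is about the simplest thing to try. This sketch assumes that scatterplot3d has been installed from CRAN and skips the plot quietly if it has not; the helix data are generated on the spot.

```r
# Points on a helix: x and y trace a unit circle while z increases
z <- seq(-10, 10, by = 0.05)
x <- cos(z)
y <- sin(z)

# Only draw if the scatterplot3d package is installed
if (requireNamespace("scatterplot3d", quietly = TRUE)) {
  scatterplot3d::scatterplot3d(x, y, z,
                               highlight.3d = TRUE,  # shade points by their y coordinate
                               pch = 20, angle = 40,
                               main = "Helix")
}
```

The angle argument controls the angle between the x and y axes on the page, so it is the first thing to adjust when points overlap.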
The lattice package has its own distinctive look. Once you see one lattice plot it should be pretty easy to distinguish plots made with this package from base graphics plots. Load the package and type example(cloud) at the console to see a 3D graph of a volcano, and 3D surface and scatter plots.
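For a quick taste without running the full set of examples, the sketch below uses two data sets that ship with R, iris and volcano. The choice of variables and viewing angles is arbitrary.

```r
library(lattice)   # a recommended package, included with standard R installations

# 3D scatter plot of three iris measurements, coloured by species
scatter <- cloud(Sepal.Length ~ Petal.Length * Petal.Width,
                 data = iris, groups = Species,
                 auto.key = TRUE,
                 screen = list(z = 20, x = -70))

# 3D surface of the built-in volcano elevation matrix
surface <- wireframe(volcano, shade = TRUE,
                     aspect = c(61 / 87, 0.4),
                     xlab = "row", ylab = "column", zlab = "height")

# lattice functions return "trellis" objects; printing them draws the plot
print(scatter)
print(surface)
```

Because the plots are ordinary R objects, they can be stored, modified with update(), and printed later, which is one of the practical differences from base graphics.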
rgl from Daniel Adler and Duncan Murdoch and Rcmdr from John Fox et al. both allow interactive 3D visualizations. Load the rgl package and type example(plot3d) to see a very cool, OpenGL, 3D scatter plot that you can grab with your mouse and rotate.
For additional references, see the scatterplot page of Robert Kabacoff's always helpful Quick-R site, and Paul E. Johnson's 3D Plotting presentation.
If you're new to the R language but keen to get started with linear modeling or logistic regression in the language, take a look at this "Introduction to R" PDF, by Princeton's Germán Rodríguez. (There's also a browsable HTML version.)
In a crisp 35 pages it begins by taking you through the basics of R: simple objects, importing data, and graphics. Then, it works through several examples of linear models (formula basics, fitting a model, model diagnostics, analysis of variance and even regression splines). Finally, there's a section on Generalized Linear Models, with a focus on logistic regression. The document doesn't attempt to explain all of the capabilities of R, but instead works through a series of examples to teach by demonstration. All of the datasets used in the guide are available online, so it's easy to follow along from home.
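To give a flavor of the kind of workflow such a guide walks through, here is a minimal sketch using R's built-in mtcars data. The data set and the choice of variables are assumptions made for illustration; they are not examples taken from the guide itself.

```r
# Linear model: fuel economy as a function of weight and horsepower
fit <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit)   # coefficients, standard errors, R-squared
anova(fit)     # sequential analysis-of-variance table

# Standard diagnostic plots (residuals vs fitted, normal Q-Q, and so on)
par(mfrow = c(2, 2))
plot(fit)
par(mfrow = c(1, 1))

# Logistic regression: probability of a manual transmission given weight
logit <- glm(am ~ wt, data = mtcars, family = binomial)
summary(logit)

# Predicted probability for a hypothetical car weighing 3000 lbs
# (wt is recorded in units of 1000 lbs)
p_manual <- predict(logit, newdata = data.frame(wt = 3), type = "response")
```

The formula interface is the same for lm() and glm(); only the family argument changes, which is much of what makes the step from linear to generalized linear models painless in R.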
Also by the same author: a guide to R for Stata users.
Germán Rodríguez: Introducing R