by Joseph Rickert
Worldwide R user group activity for the first quarter of 2014 appears to be way up compared to previous years, as the following plot shows.
The plot was built by counting the meetings on Revolution Analytics' R Community Calendar. R users continue to value live, in-person events and face-to-face meetings with their peers. Moreover, if you peruse the details of the meetings listed on the calendar, you will see that there have been some fantastic presentations so far this year. And although, as with any live event, if you missed it you will never know how good it really was, some of the R user groups have left traces of what happened by posting presentation slides on public areas of their websites.
I am sorry that I missed Harold Baize’s presentation to the Berkeley R language Beginner Study Group on Using R in a Microsoft Office world. I think the following slide from Harold’s presentation indicates a pragmatic approach coupled with a wry sense of humor.
The presentation goes on to discuss reading and writing to Excel, Tableau, the R2DOCX package, markdown and more. To get the presentation, download the file RinMSOffice from the Berkeley site.
Jeroen Janssens’ presentation on Command Line Data Science to the New York Open Statistical Programming Meetup is a very nice introduction to command line essentials for either Linux or Mac OS. If you are not comfortable with cat, grep or awk this might be a place to start.
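If cat, grep and awk are new to you, here is a tiny, self-contained taste of what they do; the movies.csv file below is an invented stand-in for real data:

```shell
# Create a small sample file to play with (three comma-separated fields).
printf 'Toy Story,1995,4\nHeat,1995,5\nFargo,1996,5\n' > movies.csv

cat movies.csv                 # print the whole file
grep '1995' movies.csv         # keep only the lines containing "1995"
awk -F, '{ sum += $3 } END { print sum }' movies.csv   # sum the third field
```

Chaining little tools like these with pipes is the core idea of command-line data science, and Jeroen's slides build from exactly these basics.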
In a similar spirit of helping new users get up to speed with R, the New Hampshire R Users Group (NH UserRs) has posted a number of tutorials by David Hocking. Introduction to Linear Regression and ANOVA in R is a useful 10-minute first look. Newcomers might also find it valuable to follow up with David Lillis's presentation, Data Analysis Tips in R, which has been posted on the Wellington R User Group (WRUG) site. This presentation contains a short discussion of calculating tetrachoric and polychoric correlations in R that quite a few people might find valuable.
DublinR has graciously made available all of the materials from a Bayesian Data Analysis workshop that Mick Cooney delivered earlier this year. A zipped file containing R scripts, JAGS files and data can be downloaded by clicking on “bdasingle2014” at this link.
Finally, I am glad that I was present at Megan Price's talk, How a Small Non-Profit Human Rights Group Uses R, to the Bay Area useR Group. This was an inspirational talk where you really had to be there. Nevertheless, Megan's slides do portray the big picture of how statistical analysis can make a difference in human rights investigations. The following photo of a cache of documents containing records of those who disappeared during Guatemala's civil war indicates the magnitude of the statistical sampling problem that Megan and her colleagues at HRDAG faced.
This coming Monday, March 31, 2014, is the last day to apply to Revolution Analytics' R User Group Sponsorship Program for Matrix or Array level sponsorship. So if you are the organizer of an established R user group and you think a cash grant and a box of R-related “goodies” could help you grow, please apply before midnight PST 3/31/14.
by Joseph Rickert
Recently, I had the opportunity to be a member of a job panel for Mathematics, Economics and Statistics students at my alma mater, CSUEB (California State University East Bay). In the context of preparing for a career in data science, a student at the event asked: “Where can I find good data sets?” This triggered a number of thoughts, the first being that it was time to update the list of data sets that I maintain and blog about from time to time. So, thanks to that reminder, I have added a few new links to the page, including a new section called Data Science Practice that links to some of the data sets used as examples in Doing Data Science by Rachel Schutt and Cathy O’Neil. Additionally, I have provided a direct link to the Big Data tag on infochimps and pointed out that multiple song data sets are available.
However, to do justice to the student's question it is necessary to give some thought to exactly what a “good” practice data set might look like. Here are three characteristics that I think a practice data set should have to be good:
Here are three data sets that meet these criteria in ascending order of degree of difficulty:
The first suggestion is the MovieLens data set which contains a million ratings applied to over 10,000 movies by more than 71,000 users. The download comes in two sizes, the full set, and a 100K subset. Both versions require working with multiple files.
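To see how even a "small" practice set makes you work a little, here is a minimal sketch of reading a MovieLens-style ratings file, in which fields are separated by "::". The two sample lines below are invented stand-ins for the real download:

```r
# The "::" separator trips up read.table, so read the lines and split by hand.
# These two lines mimic the user::movie::rating::timestamp layout.
writeLines(c("1::1193::5::978300760",
             "1::661::3::978302109"), "ratings.dat")

fields  <- strsplit(readLines("ratings.dat"), "::", fixed = TRUE)
ratings <- data.frame(
  user   = as.integer(sapply(fields, `[`, 1)),
  movie  = as.integer(sapply(fields, `[`, 2)),
  rating = as.integer(sapply(fields, `[`, 3))
)
```

With the ratings in a data frame, joining them against the separate movies and users files is a good first exercise in its own right.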
Near the top of anybody's list of practice data sets, and second on my little list because of degree of difficulty, is the airlines data set from the 2009 ASA challenge. This data set, which contains the arrival and departure information for all domestic flights from 1987 to 2008, has become the “iris” data set for Big Data. With over 123M rows it is too big to fit into your laptop's memory, and with 29 variables of different types it is rich enough to suggest several analyses. Moreover, although the version of the data set maintained on the ASA website is fixed and therefore perfect for benchmarking, the Research and Innovative Technology Administration's Bureau of Transportation Statistics continues to add to the data on a monthly basis. Go to RITA to get all of the data collected since the ASA competition ended.
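One way to get a feel for data that will not fit in memory is to stream it through a connection in fixed-size chunks, accumulating a summary as you go. The sketch below is just one base-R approach, not the challenge's recommended workflow, and it runs on a small mock file standing in for the real thing:

```r
# Mock "airline" file: 500 rows, alternating on-time and 30-minute-late flights.
writeLines(c("Year,ArrDelay",
             paste(1987, rep(c(0, 30), 250), sep = ",")), "flights.csv")

con <- file("flights.csv", open = "r")
hdr <- strsplit(readLines(con, n = 1), ",")[[1]]   # consume the header line
total <- 0; late <- 0
repeat {
  chunk <- tryCatch(
    read.csv(con, header = FALSE, col.names = hdr, nrows = 100),
    error = function(e) NULL)                      # NULL once input is exhausted
  if (is.null(chunk)) break
  total <- total + nrow(chunk)
  late  <- late  + sum(chunk$ArrDelay > 15)
  if (nrow(chunk) < 100) break
}
close(con)
late / total   # proportion of flights more than 15 minutes late
```

The same loop structure scales to the full 123M-row files, since only one chunk is ever in memory at a time.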
Last on my short list is the Million Song data set. This contains features and metadata for one million songs, originally provided by the music intelligence company Echo Nest. The data is in the specialized HDF5 format, which makes it somewhat of a challenge to access. The data set maintainers do provide wrapper functions to facilitate downloading the data and avoiding some of the complexities of the HDF5 format. However, there are no R wrappers! Last I checked, the maintainers had a paragraph about there being a problem with their code along with an invitation for R experts to contact them. (This would clearly be for extra points.) For more details about the contents of the data set look here.
As a final note, it is much easier to use R to analyze the Public Data Sets available through Amazon Web Services now that you can run Revolution R Enterprise in the Amazon Cloud. We hope to have more to say about exactly how to go about doing this in a future post. However, everything you need to get started is in place, including a 14-day free trial (Amazon charges apply) for Revolution R Enterprise. All you need is your own Amazon account.
Please let me know if you have additional links to useful, publicly available data sets that I have missed. We very much appreciate the contributions blog readers have made to the list of data sets.
by Joseph Rickert
I am a book person. I collect books on all sorts of subjects that interest me, and consequently I have a fairly extensive collection of R books, many of which I find to be of great value. Nevertheless, when I am asked to recommend an R book to someone new to R I am usually flummoxed. R is growing at a fantastic rate, and people coming to R for the first time span a wide range of sophistication. And besides, owning a book is kind of personal. It is one thing to go out and buy a technical book because it is required for a course, but quite another to make a commitment to a book all on your own. Not only must it have the right content, at the right level for you, and be written in a way that you will actually read it; a book must feel right, be typeset to appeal to your sense of aesthetics, have diagrams and illustrations to draw you in, and contain enough white space to seem approachable. Furthermore, there is a burden to owning a book. There is nothing worse than making a poor selection and having a totally incomprehensible text stare at you from a shelf. Moreover, even an old friend can impose obligations. I have read deeply from The Elements of Statistical Learning, but not everything, so there it sits: admonishing me.
Recently, however, while crawling around CRAN, it occurred to me that there is a tremendous amount of high quality material on a wide range of topics in the Contributed Documentation page that would make a perfect introduction to all sorts of people coming to R. Maybe, all it needs is a little marketing and reorganization. So, from among this treasure cache (and a few other online sources), I have assembled an R “meta” book in the following table that might be called: An R Based Introduction to Probability and Statistics with Applications.
|    | Content | Document | Author |
|----|---------|----------|--------|
| 1  | Basic Probability and Statistics | | G. Jay Kerns |
| 2  | Fitting Probability Distributions | | Vito Ricci |
| 3  | | | Julian J. Faraway |
| 4  | Experimental Design | | Vikneswaran |
| 5  | Survival Analysis | | John Fox |
| 6  | Generalized Linear Models | | Virasakdi Chongsuvivatwong |
| 7  | | | |
| 8  | Time Series | | McLeod, Yu and Mahdi |
| 9  | | | Kim Seefeld and Ernst Linder |
| 10 | Machine Learning | | Yanchang Zhao |
| 11 | Bioinformatics | | Wim P. Krijnen |
| 12 | Forecasting | | Hyndman and Athanasopoulos |
| 13 | Structural Equation Models | | John Fox |
| 14 | Credit Scoring | | Dhruv Sharma |
The content column lists the topics that I think ought to be included in a good introductory probability and statistics textbook. With a little searching, you will be able to find a discussion of each topic in the document listed to its right. Obviously, there is a lot of overlap among the documents listed, since most of them are substantial works that cover much more than the few topics that I have listed.
Finally, I don’t mean to imply that the documents in my table are the best assembled in the Contributed Documentation page. The table just represents my idiosyncratic way of organizing some of the material in a way that I hope newcomers will find useful. I think that collectively the contributed documents have everything one might look for in a first date with R. They are available, approachable, contain superb content written by R experts, and are replete with examples and R code. And, with a little effort, a casual first encounter could lead to a long-term relationship.
by Joseph Rickert
Recently, I was trying to remember how to make a 3D scatter plot in R when it occurred to me that the documentation on how to do this is scattered all over the place. Hence, this short organizational note that you may find useful.
First of all, for the benefit of newcomers, I should mention that R has three distinct graphics systems: (1) the “traditional” graphics system, (2) the grid graphics system and (3) ggplot2. The traditional graphics system refers to the graphics and plotting functions in base R. According to Paul Murrell’s authoritative R Graphics, these functions implement the graphics facilities of the S language, and up to about 2005 they comprised most of the graphics functionality of R. The grid graphics system is based on Paul Murrell’s grid package; Deepayan Sarkar’s lattice package, which implements the functionality of Bill Cleveland’s Trellis graphics, is built on top of grid. Finally, ggplot2, Hadley Wickham’s package based on Wilkinson's Grammar of Graphics, took shape between 2007 and 2009, when ggplot2: Elegant Graphics for Data Analysis appeared.
There is considerable overlap in the functionality of R’s three graphics systems, but each has its own strengths and weaknesses. For example, although ggplot2 is currently probably the most popular R package for producing presentation-quality plots, it does not offer 3D plots. To work effectively in R I think it is necessary to know your way around at least two of the graphics systems. To really gain a command of the visualizations that can be done in R, a person would have to be familiar with all three systems as well as the many packages for specialized visualizations: maps, social networks, arc diagrams, animations, time series and so on.
But back to the relatively tame task of 3D plots: the generic function persp() in the base graphics package draws perspective plots of a surface over the x–y plane. Typing demo(persp) at the console will give you an idea of what this function can do.
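If you would rather start with a concrete snippet than the demo, here is the classic sombrero surface drawn with persp() using only base R:

```r
# Evaluate sin(r)/r on a grid and draw it as a perspective surface.
x <- seq(-10, 10, length.out = 50)
y <- x
f <- function(x, y) {
  r <- sqrt(x^2 + y^2)
  ifelse(r == 0, 1, sin(r) / r)   # the limit at r = 0 is 1
}
z <- outer(x, y, f)
persp(x, y, z, theta = 30, phi = 30, expand = 0.5,
      col = "lightblue", xlab = "x", ylab = "y", zlab = "sin(r)/r")
```

The theta and phi arguments set the viewing angles; changing them is the quickest way to get an intuition for how persp() positions the observer.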
The plot3D package from Karline Soetaert builds on persp() to provide functions for both 2D and 3D plotting. The vignette for plot3D shows some very impressive plots. Load the package and type the following commands at the console: example(persp3D), example(surf3D) and example(scatter3D) to see examples of 3D surface and scatter plots. Also, try this code to see a cutaway view of a torus.
```r
# 3D plot of half of a torus (requires the plot3D package)
library(plot3D)
par(mar = c(2, 2, 2, 2))
par(mfrow = c(1, 1))
R <- 3
r <- 2
x <- seq(0, 2*pi, length.out = 50)
y <- seq(0, pi, length.out = 50)
M <- mesh(x, y)
alpha <- M$x
beta  <- M$y
surf3D(x = (R + r*cos(alpha)) * cos(beta),
       y = (R + r*cos(alpha)) * sin(beta),
       z = r * sin(alpha),
       colkey = FALSE, bty = "b2", main = "Half of a Torus")
```
The scatterplot3d package from R core members Uwe Ligges and Martin Mächler is the "go-to" package for 3D scatter plots. The vignette for this package shows a rich array of plots. Load this package and type example(scatterplot3d) at the console to see examples of spirals, surfaces and 3D scatterplots.
The lattice package has its own distinctive look. Once you have seen one lattice plot it should be pretty easy to distinguish plots made with this package from base graphics plots. Load the package and type example(cloud) in the console to see a 3D graph of a volcano, and 3D surface and scatter plots.
rgl from Daniel Adler and Duncan Murdoch and Rcmdr from John Fox et al. both allow interactive 3D visualizations. Load the rgl package and type example(plot3d) to see a very cool OpenGL 3D scatter plot that you can grab with your mouse and rotate.
For additional references, see the scatterplot page of Robert Kabacoff's always helpful QuickR site, and Paul E. Johnson's 3D Plotting presentation.
If you're new to the R language but keen to get started with linear modeling or logistic regression in the language, take a look at this "Introduction to R" PDF, by Princeton's Germán Rodríguez. (There's also a browsable HTML version.)
In a crisp 35 pages it begins by taking you through the basics of R: simple objects, importing data, and graphics. Then, it works through several examples of linear models (formula basics, fitting a model, model diagnostics, analysis of variance and even regression splines). Finally, there's a section on Generalized Linear Models, with a focus on logistic regression. The document doesn't attempt to explain all of the capabilities of R, but instead works through a series of examples to teach by demonstration. All of the datasets used in the guide are available online, so it's easy to follow along from home.
Also by the same author: a guide to R for Stata users.
Germán Rodríguez: Introducing R
by Stephen Weller, Senior Support Engineer at Revolution Analytics, and Joseph Rickert
For someone trying to learn any new technology getting help with a problem on a public forum can be stressful. Knowing where to go, deciding how to pose a question and figuring out how to deal with a response can be challenging. Moreover, an unpleasant interaction could be ego bruising and a real setback to learning. Before posting a question on an internet forum do everything you can to make it a positive experience for everyone involved. Here are some recommendations Steve and I have for getting help with R questions.
Preliminary Work
Two of the most novice-friendly places to go for help in the R world are the R-help mailing list on CRAN and the R section of Stack Overflow. Both of these forums are monitored by experts who are very willing to patiently answer questions, but who are not always well disposed towards mind-reading. Maximize your chances of getting a quick, positive response by formulating your question or problem as clearly as possible, with minimum ambiguity. And then: do your homework. The R-project posting guide shows several ways to search for R help, lists the common mistakes people make in posting questions and provides a host of details on the resources available for getting help and the mechanics of using the various R mailing lists.
Stack Overflow provides some excellent suggestions on posting questions. Doing the work to thoroughly research your question is also at the top of their list. Moreover, they point out that taking the trouble to do this makes you a valuable contributor to the R community. They write:
Sharing your research helps everyone. Tell us what you found ...and why it didn’t meet your needs. This demonstrates that you’ve taken the time to try to help yourself, it saves us from reiterating obvious answers, and above all, it helps you get a more specific and relevant answer!
Posting Your Question
When it comes time to post your question you may find Steve's guidelines helpful. These are based on years of troubleshooting problems as a member of Revolution's Technical Support organization.
Finally, here are three examples from the Revolution Technical support archives that illustrate good and bad posts. Two are examples of what Steve calls "pretty well framed support questions" and one is an example of a question that lacks needed information.
#1: a good post
A few folks here are trying to load the rJava library, they have JAVA_HOME set to their 64 bit java (1.6) but are getting this error.
```
call: inDL(x, as.logical(local), as.logical(now), ...)
error: unable to load shared library 'z:/R/win64library/2.11/rJava/libs/x64/rJava.dll':
  LoadLibrary failure: The specified module could not be found.
```

We get this when trying to load the library in the console. I am sure we’re missing something, but we are not sure what.
thanks
What makes this a good post is that it provides information on the versions of Java and R being run and provides a complete error message.
#2: a good post
Platform: Windows (32-bit)
I am working with a 1.9 GB SPSS data file with 99 variables and 4,684,587 cases. When I try to read the file into Revolution Analytics R using the following commands:

```r
inDataFileR3C <- "D:/2012 Base Year/RevolutionR/RandomVariables3C.sav"
reaValExtData <- rxImport(inData = inDataFileR3C,
                          outFile = "D:/2012 Base Year/RevolutionR/RandomVariables3C.xdf",
                          stringsAsFactors = TRUE, rowsPerRead = 50000)
```

I get the following error message:

```
Rows Read: 50000, Total Rows Processed: 2550000, Total Chunk Time: 12.152 seconds
Rows Read: 50000, Total Rows Processed: 2600000
Failed to allocate 15300000 bytes.
Error in rxCall("Rx_ImportDataSource", params) : bad allocation
```

However, if I break the SPSS data file into 2 parts, one with 2,300,000 cases and the second with 2,384,857 cases, both parts can be read into R successfully.
Thank you,
This post provides very specific information on the error involved and on what the user did to troubleshoot the problem.
#3: not a good post
Hi,
On a number of occasions I have been importing fairly large CSVs (2-3 million rows). I know these are properly formatted (e.g. data is encapsulated by double quotes) and have row counts from using the wc command in a Unix environment. When I import these using rxImport, fewer rows are imported. Is there any reason why this might occur? No errors are reported and the job seems to complete successfully. Changing the number of rowsPerRead doesn't seem to make any difference.
Thanks in advance for any advice.
This question is missing the key information required to reproduce and troubleshoot the problem:
Steve estimates that roughly 50% of the time the support engineers at Revolution Analytics have to ask for more information. When you post a request for help do your best to become part of the solution.
R can do a lot of really amazing things, but to use just about any of R's many features you need to first import your data and get it into the appropriate shape. For R beginners, this "data wrangling" task can be daunting. Fortunately, ComputerWorld's Sharon Machlis has created an in-depth tutorial for many data preparation tasks, which is well worth working through to get a sense of data handling in R.
This 8-page tutorial provides step-by-step instructions in the R language for adding columns to a data set, aggregating data by subgroup, sorting data, and reshaping data (converting "wide" data sets to "long" data sets, and vice versa). Unlike older R tutorials, this guide uses newer contributed R packages (including Hadley Wickham's reshape2 package) for many tasks. That's a good choice: especially at the earlier stages of learning R, it's well worth learning these modern data manipulation tools rather than the more complicated standard R syntax. Check out the full tutorial at the link below.
ComputerWorld: 4 data wrangling tasks in R for advanced beginners
If you learned statistics using Stata software but have an interest in learning the R language, it's worth checking out R~Stata: Notes on Exploring Data by Princeton's Oscar Torres-Reyna. D-Lab's Laura Nelson provides an overview, but in short it's a collection of 30 PDF slides that introduces R for Stata users, and provides translation tables like the one below converting R and Stata code for various tasks:
Stata users should also check out Bob Muenchen's book R for Stata Users and the companion online training course. Princeton also has a good list of other resources for getting started with R for data analysis.
D-Lab: A Quick and Easy Way to Turn Your Stata Knowledge into R Knowledge
The .Rprofile file is a great way to customize your R session every time you start it up. You can use it to change R's defaults, define handy command-line functions, automatically load your favourite packages — anything you like! The Getting Genetics Done blog has a nice example .Rprofile file to give you some inspiration on what to do. One popular setting is options(stringsAsFactors=FALSE), which prevents R from converting character data into factor objects when you import data frames.
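As a starting point, here is a small example .Rprofile. Everything in it is a matter of taste, and the peek() helper is just an invented illustration of defining your own convenience functions:

```r
# Example ~/.Rprofile -- sourced automatically at the start of every session.
options(stringsAsFactors = FALSE)   # don't auto-convert strings to factors
options(repos = c(CRAN = "http://cran.rstudio.com"))  # default CRAN mirror
options(digits = 4)                 # print fewer significant digits

# A personal helper available in every session: show the top of an object
# along with its structure.
peek <- function(x, n = 5) {
  print(head(x, n))
  str(x)
}

.First <- function() cat("Session started:", date(), "\n")
```

The .First function, if defined, runs once at startup; .Last (not shown) runs at exit.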
One word of warning: if you often share R scripts with others, don't get too reliant on your .Rprofile file. Your script may be assuming default settings that your colleagues may not share. Be sure to check your script still runs correctly when you start R with R --no-init-file before you share it. Check help(Startup) in R for details.
Getting Genetics Done: Customize your .Rprofile and Keep Your Workspace Clean
If you're an absolute beginner to the R language, this Intro to R video series from Google Developers is a great place to get started. Just download R for your system, start the playlist below, and follow along with the on-screen examples. (The video uses the Mac OS X version of R, but you should be able to follow along just fine on Windows as well.)
Each video is between 2 and 4 minutes long, and clearly covers a specific topic about R. The series starts in Section 1 with the absolute basics (entering commands, vectors, variables and arithmetic). Sections 2 and 3 progress to slightly more complex object types (data frames and lists) and programming constructs (loops and flow control). By the time you complete Section 4 you'll be able to write and use your own functions in R. I've embedded the complete series below:
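To give a flavor of where the playlist starts and ends, here is roughly the span of material in a few lines of R (the variable names are mine, not the videos'):

```r
# Section 1: vectors, variables and arithmetic
x <- c(1, 2, 3, 4)
doubled <- x * 2                       # arithmetic is vectorized

# Sections 2-3: data frames and flow control
df <- data.frame(value = x, big = x > 2)
for (v in x) if (v > 2) print(v)       # loops and conditionals

# Section 4: writing your own functions
square <- function(v) v^2
squares <- square(x)
```

If lines like these already make sense to you, you can probably skip ahead to the later sections.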
Note that the videos also feature closed captioning, which can be helpful to follow along with new function names and technical terms about R.
If you'd like to take a look at the individual videos in the series, you can find the playlist index at the link below.
Google Developers (YouTube): Intro to R (playlist)