by Joseph Rickert

Early October: somewhere the leaves are turning brilliant colors, temperatures are cooling down and that back to school feeling is in the air. And for more people than ever before, it is going to seem to be a good time to commit to really learning R. I have some suggestions for R courses below, but first: What does it mean to learn R anyway? My take is that the answer depends on a person's circumstances and motivation.

I find the following graphic to be helpful in sorting things out.

The X axis is time on Malcolm Gladwell's "Outliers" scale. His idea is that it takes 10,000 hours of real effort to master anything, whether R, Python or Rock and Roll Guitar. The Y axis lists increasingly difficult R tasks, and the arrows within the plot area label increasingly proficient types of R users.

The point I want to make here is that a significant amount of very productive R work happens in the area around the red ellipse. So, while there is no avoiding "10,000" hours of hard work to become an R Jedi knight, a curious and motivated person can master enough R to accomplish his/her programming goals with a more modest commitment. There are three main reasons for this:

- R's functional programming style is very well suited for statistical modeling, data visualization and data science tasks
- The 7,000+ packages available in the R ecosystem provide tens of thousands of functions that make it possible to accomplish quite a bit without having to write much code
- Numerous, high quality books and online material devoted to teaching statistical theory and data science with R

If you have some background in some area of statistics or data science a viable strategy for learning R is to identify a resource that works for you and just jump into the middle of things, picking up R as you go along.

The lists below link to courses that can either start you on a formal programming path, or help you become a productive R user in a particular application area. Some of the courses are "live events" that you take with a cohort of students, others are set up for self study.

The courses devoted to teaching R as a programming language are

- The Data Scientist’s toolbox
- R Programming
- Introduction to R Programming
- Introduction to R
- R Programming - Introduction 1
- Introduction a la programacion estadistica con R
- O’Reilly Code School

The first two courses above are from Coursera's Data Science Specialization sequence. Taught by Roger Peng, Jeff Leek and Brian Caffo, they are probably the gold standard for MOOC R courses. I am a little late with this post: The Data Scientist's Toolbox started this past Monday, but there is still time to catch up. The third course, Introduction to R Programming, is a relatively new edX course from Microsoft's online offerings that is getting great reviews. The fourth course on the list is a solid introduction to R from DataCamp. R Programming - Introduction 1 is a beginner's introduction to R taught by Paul Murrell or Tal Galili. Next listed are a Spanish language introduction to R from Coursera and O'Reilly's interactive Code School course.

These next three lists contain courses from DataCamp and statistics.com and online resources from RStudio that introduce more advanced features of R by building on basic R programming skills. Note that the final course on the DataCamp list introduces Big Data features of Revolution R Enterprise, which is available in the Azure Marketplace.

- Intermediate R
- Data Visualization in R with ggvis
- Data Manipulation with dplyr
- Data Analysis in R, the data.table Way
- Reporting with R Markdown
- Big data Analysis with Revolution R Enterprise

This next section lists courses from the major MOOCs, and from the non-MOOCs DataCamp and statistics.com, that use R to teach various quantitative disciplines.

**Coursera Courses**

- Data Analysis and Statistical Inference
- Developing Data Products
- Exploratory Data Analysis
- Getting and Cleaning Data
- Introduction to Computational Finance and Financial Econometrics
- Measuring Causal Effects in the Social Sciences
- Regression Models
- Reproducible Research
- Statistical Inference
- Statistics One

**edX Courses**

- Data Analysis for Life Sciences 1: Statistics and R
- Data Analysis for Life Sciences 2: Introduction to Linear Models and Matrix Algebra
- Data Analysis for Life Sciences 6: High-performance Computing for Reproducible Genomics
- Explore Statistics with R
- Sabermetrics 101: Introduction to Baseball Analytics

**Udacity Course**

DataCamp

statistics.com

Finally, here are a couple of google apps and Swirl, a new platform for teaching and learning R that may be useful for learning on the go.

It's time to "go back to school" and make some headway against those 10,000 hours.

by Jens Carl Streibig, Professor Emeritus at University of Copenhagen

*Editor's introduction: for background on the miniCRAN package, see our previous blog posts:*

MiniCRAN saves my neck when out in regions where a seamlessly running internet is an exception rather than the rule. R is definitely the programme to offer universities and research institutions in agriculture because it is open source, no money is involved, and the help, although sometimes a bit nerdy, is easy to access. I usually tell my students not to buy books on specific topics because R is dynamic and within a couple of years some of the functions in a book are obsolete, which discourages the average user. Look at the documentation at r-project.org or at rseek.org.

I have recently been teaching in Turkey and Iran. Sometimes the internet is OK; other times it is not. Before, it was a struggle to get the particular packages downloaded and installed via RStudio. In a workshop in Iran we could not download the essential packages. A shrewd student downloaded the dependencies and distributed the zip files to her fellow students. After some glitches we got everything up and running.

When I became aware of miniCRAN at the useR! 2015 meeting, almost all my R problems were solved. With help from the maintainer, Andrie de Vries at Revolution Analytics, we got it to work when giving a workshop on dose-response, also in Iran, two weeks ago. Everything went all right for those students who could not install the packages at home. Some Windows versions were in a poor state of repair, so they could not run RStudio and we had to provide all the dependencies, but no problem: they were all in the miniCRAN repository.
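As a rough illustration of the workflow described above, the following sketch builds a local repository while online and installs from it while offline. It assumes the miniCRAN package is installed and uses its `pkgDep()` and `makeRepo()` functions; the helper names, the package name "drc", the mirror URL and the path `E:/miniCRAN` are hypothetical choices for illustration, not details from the workshop.

```r
# Hypothetical helper: build a local miniCRAN repository (e.g. on a USB stick).
# Requires the miniCRAN package and an internet connection when it is called.
build_offline_repo <- function(pkgs, path,
                               repos = c(CRAN = "https://cloud.r-project.org"),
                               type = "win.binary") {
  # Resolve the full dependency tree for the requested packages
  deps <- miniCRAN::pkgDep(pkgs, repos = repos, type = type, suggests = FALSE)
  dir.create(path, recursive = TRUE, showWarnings = FALSE)
  # Download the packages and write the PACKAGES index files
  miniCRAN::makeRepo(deps, path = path, repos = repos, type = type)
  invisible(deps)
}

# Hypothetical helper: turn a local directory into a file:// repository URL
local_repo_url <- function(path) {
  p <- normalizePath(path, winslash = "/", mustWork = FALSE)
  if (startsWith(p, "/")) paste0("file://", p) else paste0("file:///", p)
}

# Usage while online, before the trip (package name is an assumption):
#   build_offline_repo("drc", "E:/miniCRAN")
# Usage offline, on a student's machine:
#   install.packages("drc", repos = local_repo_url("E:/miniCRAN"),
#                    type = "win.binary")
```

Because the repository contains every dependency, a single `install.packages()` call against the local path should be enough on a machine with no internet access.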

by Andrie de Vries

Every once in a while I try to remember how to do interpolation using R. This is not something I do frequently in my workflow, so I do the usual sequence of finding the appropriate help page:

??interpolate

Help pages:

stats::approx Interpolation Functions

stats::NLSstClosestX Inverse Interpolation

stats::spline Interpolating Splines

So, the help tells me to use approx() to perform linear interpolation. This is an interesting function, because the help page also describes approxfun() that does the same thing as approx(), except that approxfun() returns a function that does the interpolation, whilst approx() returns the interpolated values directly.

(In other words, approxfun() acts a little bit like a predict() method for approx().)
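A small sketch of the difference, using made-up data points (the values of `x` and `y` are only for illustration):

```r
x <- c(1, 2, 4)
y <- c(10, 20, 40)

# approx() returns the interpolated values directly, as a list with x and y
direct <- approx(x, y, xout = 3)$y

# approxfun() returns a function that can be evaluated later at any point
f <- approxfun(x, y)
later <- f(c(1.5, 2.5, 3))

direct  # 30
later   # 15 25 30
```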

The help page for approx() also points to stats::spline() to do spline interpolation and from there you can find smooth.spline() for smoothing splines.

Talking about smoothing, base R also contains the function smooth(), an implementation of running median smoothers (algorithm proposed by Tukey).

Finally I want to mention loess(), a function that fits local polynomial regressions. (The function loess() underlies stat_smooth() as one of the defaults in the package ggplot2.)

I set up a little experiment to see how the different functions behave. To do this, I simulate some random data in the shape of a sine wave. Then I use each of these functions to interpolate or smooth the data.

On my generated data, the interpolation functions approx() and spline() give a quite ragged interpolation. The smoothed median function smooth() doesn't do much better - there simply is too much variance in the data.

The smooth.spline() function does a great job at finding a smoother using default values.

The last two plots illustrate loess(), the local regression estimator. Notice that loess() needs a tuning parameter (span). The lower the value of span, the smaller the share of the points used in each local fit. Thus with a value of 0.5 you can see a much smoother fit than at a value of 0.1.

Here is the code:
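A minimal sketch of such an experiment might look like the following. The seed, sample size and noise level are assumptions, not the values behind the plots described above, and the plotting itself is left out.

```r
set.seed(1)                        # assumed seed, for reproducibility
n <- 200                           # assumed sample size
x <- seq(0, 2 * pi, length.out = n)
y <- sin(x) + rnorm(n, sd = 0.5)   # noisy sine wave; sd is an assumption

fit_approx  <- approx(x, y, n = 50)       # linear interpolation
fit_spline  <- spline(x, y, n = 50)       # interpolating spline
fit_median  <- smooth(y)                  # Tukey running median smoother
fit_sspline <- smooth.spline(x, y)        # smoothing spline, default settings
fit_lo_01   <- loess(y ~ x, span = 0.1)   # local regression, narrow window
fit_lo_05   <- loess(y ~ x, span = 0.5)   # local regression, wide window

# Mean squared distance of a set of fitted values from the true sine curve
mse <- function(fitted) mean((fitted - sin(x))^2)

errors <- c(raw_data      = mse(y),
            smooth.spline = mse(predict(fit_sspline, x)$y),
            loess_0.1     = mse(predict(fit_lo_01)),
            loess_0.5     = mse(predict(fit_lo_05)))
errors
```

Each fitted object can be drawn over the raw points with `plot(x, y)` followed by `lines()`.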

If you've thought about learning the R language but didn't know how to start, there's a new, free course on edX that starts you from the R basics and lets you learn R by trying R as you go.

Presented by DataCamp and Microsoft, the course starts from the very basics of R (arithmetic on the command line, creating variables), progresses through the basic data types (vector, matrix, factor, list and data frame) and ends with a module on data visualization. The course consists of lecture-style videos interspersed with quizzes to test your knowledge. And best of all, you can try out what you've learned at the R command line using the DataCamp online interface -- so you don't even have to install R yourself! A browser is all you need.

It's perfect for newcomers to R, even if you don't have experience in other programming languages. If you want to get a (paid) certification you'll need to complete the course by September 1, but you can view all of the course materials, quizzes and labs anytime for free. You can learn more about the course at the DataCamp blog or get started now at the link below.

by John Mount

Data Scientist, Win-Vector LLC

R has a number of very good packages for manipulating and aggregating data (plyr, sqldf, RevoScaleR, data.table, and more), but when it comes to accumulating results the beginning R user is often at sea. The R execution model is a bit exotic so many R users are very uncertain which methods of accumulating results are efficient and which are inefficient.

In this latest "R as it is" we will quickly become expert at efficiently accumulating results in R. To read more please click here.
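To give a flavor of the problem, here is a small sketch of two ways to accumulate rows in base R. It is a generic illustration with made-up function names, not code from the article itself.

```r
n <- 500

# Inefficient: rbind() copies the whole data frame on every iteration
grow_rbind <- function(n) {
  res <- data.frame()
  for (i in seq_len(n)) {
    res <- rbind(res, data.frame(i = i, sq = i^2))
  }
  res
}

# Usually much faster: collect the pieces in a pre-allocated list, bind once
collect_list <- function(n) {
  pieces <- vector("list", n)
  for (i in seq_len(n)) {
    pieces[[i]] <- data.frame(i = i, sq = i^2)
  }
  do.call(rbind, pieces)
}

a <- grow_rbind(n)
b <- collect_list(n)
```

Both functions return the same rows; wrapping each call in `system.time()` shows how the gap widens as `n` grows.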

by Joseph Rickert

The XXXV Sunbelt Conference of the International Network for Social Network Analysis (INSNA) was held last month at Brighton beach in the UK. (And I am still bummed out that I was not there.)

A run of 35 conferences is impressive indeed, but the social network analysts have been at it for an even longer time than that:

and today they are still on the cutting edge of the statistical analysis of networks. The conference presentations have not been posted yet, but judging from the conference workshops program there was plenty of R action in Brighton.

Social network analysis at this level involves some serious statistics and mastering a very specialized vocabulary. However, it seems to me that some knowledge of this field will become important to everyone working in data science. Supervised learning models and statistical models that assume independence among the predictors will most likely represent only the first steps that data scientists will take in exploring the complexity of large data sets.

And, maybe of equal importance is the fact that working with network data is great fun. Moreover, software tools exist in R and other languages that make it relatively easy to get started with just a few pointers.

From a statistical inference point of view, what you need to know is that Exponential Random Graph Models (ERGMs) are at the heart of modern social network analysis. An ERGM is a statistical model that enables one to predict the probability of observing a given network from a specified class of networks, based on both observed structural properties of the network and covariates associated with the vertices of the network. The exponential part of the name comes from the exponential family of functions used to specify the form of these models. ERGMs are analogous to generalized linear models except that ERGMs take into account the dependency structure of ties (edges) between vertices. For a rigorous definition of ERGMs see sections 3 and 4 of the paper by Hunter et al. in the 2008 special issue of the JSS, or Chapter 6 in Kolaczyk and Csárdi's book *Statistical Analysis of Network Data with R*. (I have found this book to be very helpful and highly recommend it. Not only does it provide an accessible introduction to ERGMs, it also begins with basic network statistics and the igraph package and then goes on to introduce some more advanced topics such as modeling processes that take place on graphs and network flows.)
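For orientation, the general form of an ERGM as it is usually written is sketched below. The notation is the conventional one rather than anything taken from this post: $y$ is a network from the class $\mathcal{Y}$, $X$ holds vertex covariates, $g(y, X)$ is a vector of network statistics (edge counts, triangle counts, covariate effects), $\theta$ is the vector of coefficients, and $\kappa(\theta)$ is the normalizing constant.

```latex
P_{\theta}(Y = y \mid X) = \frac{\exp\{\theta^{\top} g(y, X)\}}{\kappa(\theta)},
\qquad
\kappa(\theta) = \sum_{y' \in \mathcal{Y}} \exp\{\theta^{\top} g(y', X)\}
```

The sum in $\kappa(\theta)$ runs over every network in the class, which is why fitting these models generally relies on simulation rather than direct evaluation.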

In the R world, the place to go to work with ERGMs is statnet.org. statnet is a suite of 15 or so CRAN packages that provide a complete infrastructure for working with ERGMs. statnet.org is a real gem of a site that contains documentation for all of the statnet packages along with tutorials, presentations from past Sunbelt conferences and more.

I am particularly impressed with the Shiny based GUI for learning how to fit ERGMs. Try it out on the Shiny webpage or in the box below. Click the **Get Started** button. Then select "built-in network" and "ecoli 1" under **File type**. After that, click the right arrow in the upper right corner. You should see a plot of the ecoli graph.


You will be fitting models in no time. And since the commands used to drive the GUI are similar to specifying the parameters for the functions in the ergm package you will be writing your own R code shortly after that.

by Bill Jacobs, Director Technical Sales, Microsoft Advanced Analytics

In the course of working with our Hadoop users, we are often asked, what's the best way to integrate R with Hadoop?

The answer, in nearly all cases, is: it depends.

Alternatives range from open source R on workstations to parallelized commercial products like Revolution R Enterprise, with many steps in between. Between these extremes lies a range of options that differ in data scale, performance, capability and ease of use.

And so, the right choice or choices depend on your data size, budget, skill, patience and governance limitations.

In this post, I’ll summarize the alternatives using pure open source R and some of their advantages. In a subsequent post, I’ll describe the options for achieving even greater scale, speed, stability and ease of development by combining open source and commercial technologies.

These two posts are written to help current R users who are novices at Hadoop understand and select solutions to evaluate.

As with most things open source, the first consideration is of course monetary. Isn’t it always? The good news is that there are multiple alternatives that are free, and additional capabilities under development in various open source projects.

We generally see four options for building R to Hadoop integration using entirely open source stacks.

This baseline approach’s greatest advantage is simplicity and cost. It’s free. End to end free. What else in life is?

Through packages Revolution contributed to open source including rhdfs and rhbase, R users can directly ingest data from both the hdfs file system and the hbase database subsystems in Hadoop. Both connectors are part of the RHadoop package created and maintained by Revolution and are a go-to choice.

Additional options exist as well. The RHive package executes Hive’s HQL SQL-like query language directly from R, and provides functions for retrieving metadata from Hive such as database names, table names, column names, etc.

The rhive package, in particular, has the advantage that its data operations allow some work to be pushed down into Hadoop, avoiding data movement and parallelizing operations for big speed increases. Similar “push-down” can be achieved with rhbase as well. However, neither is a particularly rich environment, and invariably, complex analytical problems will reveal some gaps in capability.

Beyond the somewhat limited push-down capabilities, R’s best at working on modest data sampled from hdfs, hbase or hive, and in this way, current R users can get going with Hadoop quickly.

Once you tire of R’s memory barriers on your laptop, the obvious next path is a shared server. With today’s technologies, you can equip a powerful server for only a few thousand dollars, and easily share it between a few users. Using Windows or Linux with 256GB or 512GB of RAM, R can be used to analyze files into the hundreds of gigabytes, albeit perhaps not as fast as you’d like.

Like option 1, R on a shared server can also leverage push-down capabilities of the rhbase and rhive packages to achieve parallelism and avoid data movement. However, as with workstations, the pushdown capabilities of rhive and rhbase are limited.

And of course, while lots of RAM keeps the dreaded out-of-memory exhaustion at bay, it does little for compute performance, and depends on sharing skills learned [or perhaps not learned] in kindergarten. For these reasons, consider a shared server to be a great add-on to R on workstations but not a complete substitute.

Replacing the CRAN download of R with the Revolution R Open (RRO) distribution enhances performance further. RRO is, like R itself, open source, 100% R and free for the download. It accelerates math computations using the Intel Math Kernel Libraries (MKL) and is 100% compatible with the algorithms in CRAN and other repositories like BioConductor. No changes are required to R scripts, and the acceleration the MKL libraries offer varies from negligible to an order of magnitude for scripts making intensive use of certain math and linear algebra primitives. You can anticipate that RRO can double your average performance if you’re doing math operations in the language.

As with options 1 and 2, Revolution R Open can be used with connectors like rhdfs, and can connect and push work down into Hadoop through rhbase and rhive.

Once you find that your problem set is too big, or your patience is being taxed on a workstation or server and the limitations of rhbase and rhive push down are impeding progress, you’re ready for running R inside of Hadoop.

The open source RHadoop project that includes rhdfs, rhbase and plyrmr also includes a package rmr2 that enables R users to build Hadoop map and reduce operations using R functions. Using mappers, R functions are applied to all of the data blocks that compose an hdfs file, an hbase table or other data sets, and the results can be sent to a reducer, also an R function, for aggregation or analysis. All work is conducted inside of Hadoop but is built in R.

Let’s be clear. Applying R functions on each hdfs file segment is a great way to accelerate computation. But for most, it is the avoidance of moving data that really accentuates performance. To do this, rmr2 applies R functions to the data residing on Hadoop nodes rather than moving the data to where R resides.
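The idea can be sketched in plain R. The following is only an analogy of the map/reduce pattern on a single machine, not the rmr2 API; the data and the names `mapper` and `reducer` are made up for illustration.

```r
# Plain-R analogy of the map/reduce pattern; this is NOT the rmr2 API.
# "blocks" stands in for the data blocks of an hdfs file (hypothetical data).
set.seed(42)
values <- rnorm(1000, mean = 5)
blocks <- split(values, rep(1:4, each = 250))

# Mapper: runs on each block where it lives and emits a small summary
mapper <- function(block) c(n = length(block), total = sum(block))

# Reducer: combines the per-block summaries into one result
reducer <- function(a, b) a + b

partial  <- lapply(blocks, mapper)    # only the summaries leave the "nodes"
combined <- Reduce(reducer, partial)
overall_mean <- combined[["total"]] / combined[["n"]]
```

Only two numbers per block travel to the reducer, which is the point made above: the computation moves to the data rather than the data to R.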

While rmr2 gives essentially unlimited capabilities, as a data scientist or statistician, your thoughts will soon turn to computing entire algorithms in R on large data sets. Using rmr2 in this way complicates development for the R programmer, because he or she must write the entire logic of the desired algorithm or adapt existing CRAN algorithms. She or he must then validate that the algorithm is accurate and reflects the expected mathematical result, and write code for the myriad corner cases such as missing data.

rmr2 requires coding on your part to manage parallelization. This may be trivial for data transformation operations, aggregates, etc., or quite tedious if you’re trying to train predictive models or build classifiers on large data.

While rmr2 can be more tedious than other approaches, it is not untenable, and most R programmers will find rmr2 much easier than resorting to Java-based development of Hadoop mappers and reducers. While somewhat tedious, it is a) fully open source, b) helps to parallelize computation to address larger data sets, c) skips painful data movement, d) is broadly used so you’ll find help available, and e), is free. Not bad.

rmr2 is not the only option in this category – a similar package called rhipe is also available and provides similar capabilities. rhipe is described here and here and is downloadable from GitHub.

The range of open source-based options for using R with Hadoop is expanding. The Apache Spark community, for example, is rapidly improving R integration via the predictably named SparkR. Today, SparkR provides access to Spark from R much as rmr2 and rhipe do for Hadoop MapReduce.

We expect that, in the future, the SparkR team will add support for Spark’s MLLIB machine learning algorithm library, providing execution directly from R. Availability dates haven’t been widely published.

Perhaps the most exciting observation is that R has become “table stakes” for platform vendors. Our partners at Cloudera, Hortonworks, MapR and others, along with database vendors and others, are all keenly aware of the dominance of R among the large and growing data science community, and R’s importance as a means to extract insights and value from the burgeoning data repositories built atop Hadoop.

In a subsequent post, I’ll review the options for creating even greater performance, simplicity, portability and scale available to R users by expanding the scope from open source only solutions to those like Revolution R Enterprise for Hadoop.

by Joseph Rickert

It is incredibly challenging to keep up to date with R packages. As of today (6/16/15), there are 6,789 listed on CRAN. Of course, the CRAN Task Views are probably the best resource for finding what's out there. A tremendous amount of work goes into maintaining and curating these pages and we should all be grateful for the expertise, dedication and efforts of the task view maintainers. But R continues to grow at a tremendous rate. (Have a look at the growth curve in Bob Muenchen's 5/22/15 post R Now Contains 150 Times as Many Commands as SAS.) CRANberries, a site that tracks new packages and package updates, indicates that over the last few months the list of R packages has been growing by about 100 packages per month. How can anybody hope to keep current?

So, on any given day, expect that finding out what R packages exist that may pertain to any particular topic will require some work. What follows is a beginner's guide to fishing for packages in CRAN. This example looks for "Bayesian" packages using some simple web page scraping and elementary text mining.

The Bayesian Inference Task View lists 144 packages. This is probably everything that is really important, but let's see what else is to be found that has anything at all to do with Bayesian Inference. In the first block of code, R's available.packages() function fetches the list of packages available from my Windows PC. (This is an extremely interesting function and I don't do justice to it here.) Then, this list is used to scrape the package descriptions from the various package webpages. The loop takes some time to run, so I saved the package descriptions both in a csv file and in a .RData workspace.

    library(svTools)
    library(RCurl)
    library(tm)

    #-----------------------------------------
    # TWO HELPER FUNCTIONS

    # Function to get package description from CRAN package page
    getDesc <- function(package){
      l1 <- regexpr("</h2>",package)
      ind1 <- as.integer(l1[[1]]) + 9
      l2 <- regexpr("Version",package)
      ind2 <- as.integer(l2[[1]]) - (46 + nchar("package"))
      desc <- substring(package,ind1,ind2)
      return(desc)
    }

    # Function to get CRAN package page
    getPackage <- function(name){
      url <- paste("http://cran.r-project.org/web/packages/",name,"/index.html",sep="")
      txt <- getURL(url,ssl.verifypeer=FALSE)
      return(txt)
    }

    #--------------------------------------------
    # SCRAPE PACKAGE DATA FROM CRAN

    # Get the list of R packages
    packages <- as.data.frame(available.packages())
    head(packages)
    dim(packages)
    pkgNames <- rownames(packages)
    rm(packages) # Don't need this any more

    # Fetch each package page and extract its description (slow)
    pkgDesc <- character(length(pkgNames))
    for (i in seq_along(pkgNames)){
      pkgDesc[i] <- getDesc(getPackage(pkgNames[i]))
    }
    length(pkgDesc) #6598

    #----------------------------------------------
    # SOME HOUSEKEEPING
    # cranP <- data.frame(pkgNames,pkgDesc)
    # write.csv(cranP,"C:/DATA/CRAN/CRAN_pkgs_6_15_15")
    # save.image("pkgs.RData")
    # load("pkgs.RData")

When I did this a few days ago, 6,598 packages were available. The next section of code turns the vector of package descriptions into a document corpus and creates a document term matrix with a row for each package and 20,781 terms. Taking the transpose of the term matrix makes it easier to see what is going on. The matrix is extremely sparse (only one 1 shows up), as this small portion of the matrix illustrates, and all of the terms are pretty much useless. Removing the sparse terms cuts the matrix down to only 372 terms.

    # SOME SIMPLE TEXT MINING

    # Make a corpus out of package descriptions
    pCorpus <- VCorpus(VectorSource(pkgDesc))
    pCorpus
    inspect(pCorpus[1:3])

    # Function to prepare corpus
    prepC <- function(corpus){
      c <- tm_map(corpus, stripWhitespace)
      c <- tm_map(c,content_transformer(tolower))
      c <- tm_map(c,removeWords,stopwords("english"))
      c <- tm_map(c,removePunctuation)
      c <- tm_map(c,removeNumbers)
      return(c)}

    pCorpusPrep <- prepC(pCorpus)

    #------------------------------------------------------------
    # Create the document term matrix
    dtm <- DocumentTermMatrix(pCorpusPrep)
    dtm
    # <<DocumentTermMatrix (documents: 6598, terms: 20781)>>
    # Non-/sparse entries: 142840/136970198
    # Sparsity           : 100%
    # Maximal term length: 83
    # Weighting          : term frequency (tf)

    # Work with the transpose to list keywords as rows
    inspect(t(dtm[100:105,90:105]))
    #               Docs
    # Terms          100 101 102 103 104 105
    #   accomodated    0   0   0   0   0   0
    #   accompanied    0   0   0   0   0   0
    #   accompanies    0   0   0   0   0   0
    #   accompany      0   0   0   0   0   0
    #   accompanying   0   0   0   0   0   0
    #   accomplished   0   0   0   0   0   0
    #   accomplishes   0   0   0   0   0   0
    #   accordance     0   0   0   0   0   0
    #   according      0   0   1   0   0   0
    #   accordingly    0   0   0   0   0   0
    #   accordinglyp   0   0   0   0   0   0
    #   account        0   0   0   0   0   0
    #   accounted      0   0   0   0   0   0
    #   accounting     0   0   0   0   0   0
    #   accountp       0   0   0   0   0   0
    #   accounts       0   0   0   0   0   0

    # Reduce the number of sparse terms
    dtms <- removeSparseTerms(dtm,0.99)
    dim(dtms) # 6598 372

I am pretty much counting on some luck here, hoping that "Bayesian" will be one of the remaining 372 terms. This last bit of code finds 229 packages associated with the keyword "Bayesian".

    # Find the Bayesian packages
    dtmsT <- t(dtms)
    keywords <- row.names(dtmsT)
    bi <- which(keywords == "bayesian") # Find the index of an interesting keyword
    bayes <- as.matrix(dtmsT)[bi,]      # as.matrix() avoids inspect() printing to the console
    bayes_packages_index <- names(bayes[bayes==1])

    # Here are the "Bayesian" packages
    bayes_packages <- pkgNames[as.numeric(bayes_packages_index)]
    length(bayes_packages) #229

    # Here are the descriptions of the "Bayesian" packages
    bayes_pkgs_desc <- pkgDesc[bayes==1]
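To see what the sparse-term filter and the keyword lookup are doing, here is a toy version in base R. The four package descriptions are made up, and the matrix is built by hand rather than with tm; the filter follows the rule that removeSparseTerms() applies, keeping a term only when the share of documents that lack it is below the threshold.

```r
# Toy package descriptions (made up for illustration)
desc <- c(pkgA = "Bayesian inference for linear models",
          pkgB = "Tools for plotting time series",
          pkgC = "Bayesian networks and graphical models",
          pkgD = "Fast linear algebra routines")

# Tokenize: lower-case, then split on anything that is not a letter
tokens <- setNames(strsplit(tolower(desc), "[^a-z]+"), names(desc))

# Document-term matrix of 0/1 indicators: one row per package
terms <- sort(unique(unlist(tokens)))
dtm <- t(vapply(tokens, function(tk) as.integer(terms %in% tk),
                integer(length(terms))))
colnames(dtm) <- terms

# Sparsity of a term = share of documents that do NOT contain it.
# Keep only the terms whose sparsity is below the threshold.
sparsity <- 1 - colMeans(dtm)
dtms <- dtm[, sparsity < 0.6, drop = FALSE]

kept_terms <- colnames(dtms)                           # surviving terms
bayes_pkgs <- rownames(dtms)[dtms[, "bayesian"] == 1]  # packages with the keyword
```

Here only the terms that occur in at least two of the four descriptions survive, and the lookup returns `pkgA` and `pkgC`.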

Here is the list of packages found.

Not all of these "fish" are going to be worth keeping, but at least we have reduced the search to something manageable. In 10 or 15 minutes of fishing you might catch something interesting.

R is an environment for programming with data, so unless you're doing a simulation study you'll need some data to work with. If you don't have data of your own, we've made a list of open data sets you can use with R to accompany the latest release of Revolution R Open.

At the Data Sources on the Web page on MRAN, you can find links to dozens of open data sources both large and small. You'll find some classics of data science and machine learning, like the Enron emails data set, and the famous Airlines data. You can find official statistics on economics and government from countries around the world, including links to every country's official data repositories at UNdata. There are links to scientific data, including several sources from the social sciences. And of course you'll find links to various financial data sources (but not all of these are 100% free to use).

Many of the data sets are indicated as ready-to-use in R format; for the others, you can use R's various data import tools to access the data (for which there is a great guide at ComputerWorld).

Got other suggestions for great open data sources? Let us know in the comments below, or send an email to mran@revolutionanalytics.com.

MRAN: Data Sources on the Web

Computerworld's Sharon Machlis today published a very useful list of R packages that every R user should know. The list covers packages for data import, data wrangling, data visualization and package development, but for beginning R users the biggest challenge is usually just dealing with data. To that end, I thought it was worth listing the packages for data access and manipulation, which I thoroughly endorse:

- **Data import/access**: readr (text data files), rio (many binary data file formats), readxl (Excel spreadsheets), googlesheets (Google Sheets), RMySQL (MySQL databases), quantmod (economic and financial data sources)
- **Data manipulation**: dplyr (general data frame processing), data.table (aggregation and filtering), tidyr (tidying messy data into row/col format), sqldf (SQL queries on data frames), zoo (time series data wrangling)

Check out Sharon's complete list below for details on these and many other useful R packages.

ComputerWorld: Great R packages for data import, wrangling & visualization