by Andrie de Vries
Last week we announced the availability of Revolution R Open, an enhanced distribution of R. One of the enhancements is the inclusion of high performance linear algebra libraries, specifically the Intel MKL. This library significantly speeds up many statistical calculations, e.g. the matrix algebra that forms the basis of many statistical algorithms.
Several years ago, David Smith wrote a blog post about multithreaded R, where he explored the benefits of the MKL, in particular on Windows machines.
In this post I explore whether anything has changed.
To make the best use of the power available in today's machines, Revolution R Open is installed by default with the Intel Math Kernel Library (MKL), which provides the BLAS and LAPACK library functions used by R. The Intel MKL allows many common R operations to use all of the processing power available.
The MKL's default behavior is to use as many parallel threads as there are available cores. There’s nothing you need to do to benefit from this performance improvement — not a single change to your R script is required.
However, you can still control or restrict the number of threads using the setMKLthreads() function from the Revobase package delivered with Revolution R Open. For example, you might want to limit the number of threads to reserve some of the processing capacity for other activities, or if you're doing explicit parallel programming with the ParallelR suite or other parallel programming tools.
You can set the maximum number of threads as follows:
setMKLthreads(<value>)
where <value> is the maximum number of parallel threads, not to exceed the number of available cores.
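For example, a quick sketch of capping the thread count before a large matrix computation (this assumes you are running Revolution R Open with the Revobase package available):

```r
require(Revobase)  # ships with Revolution R Open

setMKLthreads(2)       # reserve the remaining cores for other work
m <- matrix(rnorm(1e6), nrow = 1000)
system.time(m %*% m)   # this multiply now uses at most 2 MKL threads

setMKLthreads(4)       # restore full parallelism on a 4-core machine
```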
Compared to open source R, the MKL offers significant performance gains, particularly on Windows.
Here are the results of 5 tests on matrix operations, run on a Samsung laptop with an Intel i7 4-core CPU. From the graphic you can see that a matrix multiplication runs 27 times faster with the MKL than without, and linear discriminant analysis is 3.6 times faster.
You can replicate the same tests by using this code:
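A base-R sketch of this kind of timing comparison looks like the following (the specific operations and matrix sizes here are my own choices, not the original benchmark script); run it once under R and once under RRO and compare the elapsed times:

```r
set.seed(42)
n <- 1000
A <- matrix(rnorm(n * n), nrow = n)

# Time a few common matrix operations. With the MKL these run
# multithreaded; with the reference BLAS they use a single core.
timings <- c(
  multiply  = unname(system.time(A %*% A)["elapsed"]),
  crossprod = unname(system.time(crossprod(A))["elapsed"]),
  cholesky  = unname(system.time(chol(crossprod(A)))["elapsed"]),
  solve     = unname(system.time(solve(A))["elapsed"])
)
print(round(timings, 2))
```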
Another famous benchmark was published by Simon Urbanek, one of the members of R-core. You can find his code at Simon's benchmark page. His benchmark consists of three different classes of test:
I compared the total execution time of the benchmark script in RRO (with MKL) and R. Using Revolution R Open, the benchmark tests completed in 47.7 seconds. This compared to ~176 seconds using R-3.1.1 on the same machine.
To replicate these results, you can use the following script, which runs (sources) his code directly from the URL and captures the total execution time:
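A sketch of such a script (the URL below is where the benchmark has historically been hosted, but check Simon's benchmark page for the current location):

```r
# Source Simon Urbanek's benchmark directly from the web and
# capture the total execution time.
benchmark_url <- "http://r.research.att.com/benchmarks/R-benchmark-25.R"
total <- system.time(source(benchmark_url))
cat("Total benchmark time:", round(total["elapsed"], 1), "seconds\n")
```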
Here is a summary of each of the individual tests:
| Test (time in seconds) | R-3.1.1 | RRO | Performance gain |
|---|---|---|---|
| **I. Matrix calculation** | | | |
| Create, transpose and deform matrix | 1.01 | 1.01 | 0.0 |
| Matrix computation | 0.40 | 0.40 | 0.0 |
| Sort random values | 0.72 | 0.74 | 0.0 |
| Cross product | 11.50 | 0.42 | 26.4 |
| Linear regression | 5.56 | 0.25 | 20.9 |
| **II. Matrix functions** | | | |
| Fast Fourier Transform | 0.45 | 0.47 | 0.0 |
| Compute eigenvalues | 0.74 | 0.39 | 0.9 |
| Calculate determinant | 2.87 | 0.24 | 10.8 |
| Cholesky decomposition | 4.50 | 0.25 | 16.8 |
| Matrix inverse | 2.71 | 0.25 | 9.9 |
| **III. Programming** | | | |
| Vector calculation | 0.67 | 0.67 | 0.0 |
| Matrix calculation | 0.26 | 0.26 | 0.0 |
| Recursion | 0.95 | 1.06 | -0.1 |
| Loops | 0.43 | 0.43 | 0.0 |
| Mixed control flow | 0.41 | 0.37 | 0.1 |
| **Total test time** | 165.60 | 47.72 | 2.5 |
The Intel MKL makes a notable difference for many matrix computations. When running the Urbanek benchmark using the MKL on Windows, you can expect a performance gain of ~2.5x.
The caveat is that the standard R distribution uses different math libraries on different operating systems. For example, R on Mac OS X uses the ATLAS BLAS, which gives performance comparable to the MKL.
To find out more about Revolution R Open, go to http://mran.revolutionanalytics.com/open/
You can download RRO at http://mran.revolutionanalytics.com/download/
Many R scripts depend on CRAN packages, and most CRAN packages in turn depend on other CRAN packages. If you install an R package, you'll also be installing its dependencies to make it work, and possibly other packages as well to enable its full functionality.
My colleague Andrie posted some R code to map package dependencies a couple of months ago, but now you can easily explore the dependencies of any CRAN package at MRAN. Simply search for a package and click the Dependencies Graph tab. Here's a very simple one: the foreach package.
The foreach package depends on two others: iterators and codetools, which will be automatically installed for you by install.packages when you install foreach. (We'll discuss the use of "Suggests" — as here with randomForest — later.) Now let's look at a more complex example: the caret package.
The caret package provides an interface to many of the predictive modeling packages on CRAN, and so it has several dependencies (nine, in fact — you can see the list by clicking on the Dependencies Table tab). But it also Suggests many more packages — these are packages that are not required to run caret, but if you do have them, there are more model types you can use within the caret framework.
Here's a quick overview of the types of dependencies you'll find in the charts and tables on MRAN:
MRAN is updated daily, and so the Dependencies Graph is always up-to-date with the latest CRAN packages and their connections. Start exploring at the link below.
MRAN: Explore Packages
My second-favourite keynote from yesterday's Strata Hadoop World conference was this one, from Pinterest's John Rauser. To many people (especially in the Big Data world), Statistics is a series of complex equations, but just a little intuition goes a long way toward really understanding data. John illustrates this wonderfully using an example of data collected to determine whether consuming beer causes mosquitoes to bite you more:
The big lesson here, IMO, is that so many statistical problems can seem complex, but you can actually get a lot of insight by recognizing that your data is just one possible instance of a random process. If you have a hypothesis for what that process is, you can simulate it, and get an intuitive sense of how surprising your data is. R has excellent tools for simulating data, and a couple of hours spent writing code to simulate data can often give insights that will be valuable for the formal data analysis to come.
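For instance, here is a sketch of that simulation idea in R, using made-up bite counts (illustrative only, not the data from the talk): shuffle the beer/water labels many times to see how often chance alone produces a group difference as large as the one observed.

```r
set.seed(1)
# Hypothetical bite counts -- illustrative only, not the study data
beer  <- c(27, 20, 21, 26, 27, 31, 24, 21, 20, 19, 23, 24)
water <- c(21, 22, 15, 12, 21, 16, 19, 15, 22, 24, 19, 23)
observed <- mean(beer) - mean(water)

# Under the "no effect" hypothesis the labels are arbitrary, so
# shuffle them and recompute the difference in means many times
all_bites <- c(beer, water)
diffs <- replicate(10000, {
  shuffled <- sample(all_bites)
  mean(shuffled[1:12]) - mean(shuffled[13:24])
})

# The fraction of shuffles at least as extreme as the observed
# difference approximates a one-sided p-value
mean(diffs >= observed)
```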
(By the way, my favourite keynote from the conference was Amanda Cox's talk on data visualization at the New York Times, which featured several examples developed in R. Sadly, though, it wasn't recorded.)
O'Reilly Strata: Statistics Without the Agonizing Pain
by Joseph Rickert
There is something about R user group meetings that both encourages and nourishes a certain kind of "after hours" creativity. Maybe it is the pressure of having to make a presentation about stuff you do at work interesting to a general audience, or maybe it is just the desire to reach a high level of play. But R user group presentations often manage to make some obscure area of computational statistics seem not only accessible, but also relevant and fun. Here are a couple of examples of what I mean.
Recently Xiaocun Sun conducted an Image processing workshop for KRUG, the Knoxville R User's Group. As the following slide indicates, he used the EBImage Bioconductor package, a package that I imagine few people who don't do medical imaging for a living would be likely to stumble upon by accident, to illustrate the basics of image processing.
Xiaocun's presentation, along with R code, is available for download from the KRUG site.
As a second example, consider the presentation that Antonio Piccolboni recently made to the Bay Area useR Group (BARUG): 10 Eigenmaps of the United States of America. Inspired by an article in the New York Times, Antonio decided to undertake his own idiosyncratic tour through the Census data and look at socio-economic trends in the United States. His analysis is both thought provoking and visually compelling. For example, concerning the following map Antonio writes:
This map shows a very interesting ring pattern around some cities, including Atlanta, Dallas and Minneapolis. The red areas show strong population increase, including migration, and an increase in available housing and high median income. The blue areas have a higher death rate, more Federal Government payments to individuals, more widows, more single-person households, and more older people receiving social security.
Antonio's presentation might well illustrate the theme: "Data Scientist reads the Sunday paper and finds data to begin a conversation about what he read with his quantitative, R-literate friends".
This kind of active reading fits nicely with ideas about responsible, quantitative journalism that Chris Wiggins expresses in a presentation he recently made to the New York Open Statistical Programming Meetup. Here, Chris provides some insight into the role of Data Science at the New York Times and offers advice on using data to study relevant issues and clearly communicate findings. One major point in Chris' presentation is that data science plus clear communication can have a very positive influence on shaping our culture.
It is not an exaggeration to say that the kind of work that Xiaocun, Antonio and other R user group presenters undertake in their spare time "for fun" is valuable and important beyond the immediate goals of learning and teaching R.
For the past 7 years, Revolution Analytics has been the leading provider of R-based software and services to companies around the globe. Today, we're excited to announce a new, enhanced R distribution for everyone: Revolution R Open.
Revolution R Open is a downstream distribution of R from the R Foundation for Statistical Computing. It's built on the R 3.1.1 language engine, so it's 100% compatible with any scripts, packages or applications that work with R 3.1.1. It also comes with enhancements to improve your R experience, focused on performance and reproducibility:
Today we are also introducing MRAN, a new website where you can find information about R, Revolution R Open, and R Packages. MRAN includes tools to explore R Packages and R Task Views, making it easy to find packages to extend R's capabilities. MRAN is updated daily.
Revolution R Open is available for download now. Visit mran.revolutionanalytics.com/download for binaries for Windows, Mac, Ubuntu, CentOS/Red Hat Linux and (of course) the GPLv2 source distribution.
With the new Revolution R Plus program, Revolution Analytics is offering technical support and open-source assurance for Revolution R Open and several other open source projects from Revolution Analytics (including DeployR Open, ParallelR and RHadoop). If you are interested in subscribing, you can find more information at www.revolutionanalytics.com/plus . And don't forget that big-data R capabilities are still available in Revolution R Enterprise.
We hope you enjoy using Revolution R Open, and that your workplace will be confident adopting R with the backing of technical support and open source assurance of Revolution R Plus. Let us know what you think in the comments!
Part 2 of a series
by Daniel Hanson, with contributions by Steve Su (author of the GLDEX package)
In our previous article, we introduced the four-parameter Generalized Lambda Distribution (GLD) and looked at fitting a 20-year set of returns from the Wilshire 5000 Index, comparing the results of two methods, namely the Method of Moments, and the Method of Maximum Likelihood.
Errata: One very important omission in Part 1, however, was not putting
require(GLDEX)
prior to the examples shown. Many thanks to a reader who pointed this out in the comments section last time.
Let’s also recall the code we used for obtaining returns from the Wilshire 5000 index, and the first four moments of the data (details are in Part 1):
require(quantmod) # quantmod package must be installed
getSymbols("VTSMX", from = "1994-09-01")
VTSMX.Close <- VTSMX[,4] # Closing prices
VTSMX.vector <- as.vector(VTSMX.Close)
# Calculate log returns
Wilsh5000 <- diff(log(VTSMX.vector), lag = 1)
Wilsh5000 <- 100 * Wilsh5000[-1] # Remove the NA in the first position,
# and put in percent format
# Moments of Wilshire 5000 market returns:
fun.moments.r(Wilsh5000,normalise="Y") # normalise="Y" -- subtracts the 3
# from the normal dist value.
# Results:
# mean variance skewness kurtosis
# 0.02824676 1.50214916 -0.30413445 7.82107430
Finally, in Part 1, we looked at two methods for fitting a GLD to this data, namely the Method of Moments (MM), and the Method of Maximum Likelihood (ML). We found that MM gave us a near perfect match in mean, variance, skewness, and kurtosis, but goodness of fit measures showed that we could not conclude that the market data was drawn from the fitted distribution. On the other hand, ML gave us a much better fit, but it came at the price of skewness being way off compared to that of the data, and kurtosis not being determined by the fitting algorithm (NA).
Method of L-Moments (LM)
Steve Su, in his contributions to this article series, suggested the option of a "third way", namely the Method of L-Moments. As mentioned in the paper L-moments and TL-moments of the generalized lambda distribution (William H. Asquith, 2006),
“The method of L-moments is an alternative technique, which is suitable and popular for heavy-tailed distributions. The method of L-moments is particularly useful for distributions, such as the generalized lambda distribution (GLD), that are only expressible in inverse or quantile function form.”
Additional details on the method and algorithm for computing it can be found in this paper, noted above.
As we will see in the example that follows, the result is essentially a compromise between our first two results, but the goodness of fit is still far preferable to that of the Method of Moments.
We follow the same approach as above, but using the GLDEX function fun.RMFMKL.lm(.) to calculate the fitted distribution:
# Fit the LM distribution:
require(GLDEX) # Remembered it this time!
wshLambdaLM = fun.RMFMKL.lm(Wilsh5000)
# Compute the associated moments
fun.theo.mv.gld(wshLambdaLM[1], wshLambdaLM[2], wshLambdaLM[3], wshLambdaLM[4],
param = "fmkl", normalise="Y")
# The results are:
# mean variance skewness kurtosis
# 0.02824678 1.56947022 -1.32265715 291.58852044
As was the case with the maximum likelihood fit, the mean and variance are reasonably close, but the skewness and kurtosis do not match the empirical data. However, the skew is not as far off as in the ML case, and we are at least able to calculate a kurtosis value.
Looking at our goodness of fit tests based on the KS statistic:
fun.diag.ks.g(result = wshLambdaLM, data = Wilsh5000, no.test = 1000,
param = "fmkl")
# Result: 740/1000
ks.gof(Wilsh5000,"pgl", wshLambdaLM, param="fmkl")
# D = 0.0201, p-value = 0.03383
In the first case, our result of 740/1000 suggests a much better fit than the Method of Moments (53/1000) in Part 1, while falling slightly short of the ratio we obtained with the Method of Maximum Likelihood. In the second test, a p-value of 0.03383 is not overwhelmingly convincing, but it does mean we cannot reject the hypothesis that our data is drawn from the fitted distribution at a significance level of α = 0.025 or 0.01.
Perhaps more interesting is just looking at the plot:
bins <- nclass.FD(Wilsh5000) # We get 158 (Freedman-Diaconis Rule)
fun.plot.fit(fit.obj = wshLambdaLM, data = Wilsh5000, nclass = bins, param = "fmkl",
xlab = "Returns", main = "Method of L-Moments")
Again, compared to the result for the Method of Moments in Part 1, the plot suggests that we have a better fit.
The QQ plot, however, is not much different than what we got in the Maximum Likelihood case; in particular, losses in the left tail are not underestimated by the fitted distribution as they are in the MM case:
qqplot.gld(fit = wshLambdaLM, data = Wilsh5000, param = "fmkl", type = "str.qqplot",
           main = "Method of L-Moments")
Which Option is “Best”?
Steve Su points out that there is no one “best” solution, as there are trade-offs and competing constraints involved in the algorithms, and it is one reason why, in addition to the methods described above, so many different functions are available in the GLDEX package. On one point, however, there is general agreement in the literature, and that is the Method of Moments -- even with the appeal of matching the moments of the empirical data -- is an inferior method to others that result in a better fit. This is also discussed in the paper by Asquith, namely, that the method of moments “generally works well for light-tailed distributions. For heavy-tailed distributions, however, use of the method of moments can be questioned.”
Comparison with the Normal Distribution
For a strawman comparison, we can fit the Wilshire 5000 returns to a normal distribution in R, and run the KS test as follows:
require(MASS)  # fitdistr() is in the MASS package
f <- fitdistr(Wilsh5000, densfun = "normal")
ks.test(Wilsh5000, "pnorm", f$estimate[1], f$estimate[2], alternative = "two.sided")
The results are as follows:
# One-sample Kolmogorov-Smirnov test
# data: Wilsh5000
# D = 0.0841, p-value < 2.2e-16
# alternative hypothesis: two-sided
With a p-value that small, we can firmly reject the returns data as being drawn from a fitted normal distribution.
We can also get a look at the plot of the implied normal distribution overlaid upon the fit we obtained with the method of L-moments, as follows:
x <- seq(min(Wilsh5000), max(Wilsh5000), length.out = bins)
# Chop the domain into bins = 158 intervals to get sample points
# from the approximated normal distribution (Freedman-Diaconis)
fun.plot.fit(fit.obj = wshLambdaLM, data = Wilsh5000, nclass = bins,
param = "fmkl", xlab = "Returns")
curve(dnorm(x, mean=f$estimate[1], sd=f$estimate[2]), add=TRUE,
col = "red", lwd = 2) # Normal curve in red
Although it may be a little difficult to see, note that between -3 and -4 on the horizontal axis, the tail of the normal fit (in red) falls below that of the GLD (in blue), and it is along this left tail where extreme events can occur in the markets. The normal distribution implies a lower probability of these “black swan” events than the more representative GLD.
This is further confirmed by looking at the QQ plot vs a normal distribution fit. Note how the theoretical fit (horizontal axis in this case, using the Base R function qqnorm(.); i.e., the axes are switched compared to those in our previous QQ plots) vastly underestimates losses in the left tail.
qqnorm(Wilsh5000, main = "Normal Q-Q Plot")
qqline(Wilsh5000)
In summary, from these plots, we can see that the GLD fit, particularly using ML or LM, is a superior alternative to what we get with the normal distribution fit when estimating potential index losses.
Conclusion
We have seen, using R and the GLDEX package, how a four parameter distribution such as the Generalized Lambda Distribution can be used to fit a more realistic distribution to market data as compared to the normal distribution, particularly considering the fat tails typically present in returns data that cannot be captured by a normal distribution. While the Method of Moments as a fitting algorithm is highly appealing due to its preserving the moments of the empirical distribution, we sacrifice goodness of fit that can be obtained using other methods such as Maximum Likelihood, and L-Moments.
The GLD has been demonstrated in financial texts and research literature as a suitable distributional fit for determining market risk measures such as Value at Risk (VaR), Expected Shortfall (ES), and other metrics. We will look at examples in an upcoming article.
Again, very special thanks are due to Dr Steve Su for his contributions and guidance in presenting this topic.
The ability to create reproducible research is an important topic for many users of R. So important that several groups in the R community have tackled this problem, notably packrat from RStudio and gRAN from Genentech (see our previous blog post).
The Reproducible R Toolkit is a new open-source initiative from Revolution Analytics. It takes a simple approach to dealing with R package versions, consisting of an R package checkpoint, and an associated daily CRAN snapshot archive, checkpoint-server. Here's one illustration of the problem it solves (with apologies to xkcd):
To achieve reproducibility, we store daily snapshots of all CRAN packages. At midnight UTC each day we refresh the CRAN mirror and then store a snapshot of CRAN as it exists at that very moment. You can access these daily snapshots using the checkpoint package, which installs these packages and ensures you consistently use them just as they existed on the snapshot date. Daily snapshots exist starting from 2014-09-17.
checkpoint package
The goal of the checkpoint package is to solve the problem of package reproducibility in R. Since packages get updated on CRAN all the time, it can be difficult to recreate an environment where all your packages are consistent with some earlier state. To solve this issue, checkpoint allows you to install packages locally as they existed on a specific date from the corresponding snapshot (stored on the checkpoint server) and it configures your R session to use only these packages. Together, the checkpoint package and the checkpoint server act as a "CRAN time machine", so that anyone using checkpoint can ensure the reproducibility of scripts or projects at any time.
How to use checkpoint
Once you have the checkpoint package installed, using the checkpoint() function is as simple as adding the following lines to the top of your script:
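For example (the snapshot date here is illustrative; use whatever date suits your project):

```r
library(checkpoint)
checkpoint("2014-09-17")  # scan the project, then install and use
                          # packages as they existed on this date
```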
Typically, you will use the date you created the script as the argument to checkpoint. The first time you run the script, checkpoint will inspect your script (and other R files in the same project folder) for the packages used, and install the required packages with versions as of the specified date. (The first time you run the script, it will take some time to download and install the packages, but subsequent runs will use the previously-installed package versions.)
The checkpoint package installs the packages in a folder specific to the current project (in a subfolder of
If you want to update the packages you use at a later date, just update the date in the checkpoint() call and checkpoint() will automatically update the locally-installed packages.
The checkpoint package is available on CRAN:
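Install it in the usual way:

```r
install.packages("checkpoint")
```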
Worked example
The Reproducible R Toolkit was created by the Open Source Solutions group at Revolution Analytics. Special thanks go to Scott Chamberlain who helped with early development.
We'd love to know what you think about checkpoint. Leave comments here on the blog, or via the checkpoint GitHub page.
The Fantasy Football Analytics blog shares these 14 reasons why R is better than Excel for data analysis:
The two most important in my mind are #2 (automation) and #7 (reproducibility), reasons that apply to any GUI-driven tool. The ability to use code to repeat your analyses and reproduce the results consistently cannot be overstated.
For more detailed background behind each of these reasons, and four situations where it's best to use Excel, check out the complete blog post linked below.
Fantasy Football Analytics: Why R is Better Than Excel for Fantasy Football (and most other) Data Analysis
by Joseph Rickert
In a recent post I talked about the information that can be developed by fitting a Tweedie GLM to a 143 million record version of the airlines data set. Since I started working with them about a year ago, I now see Tweedie models everywhere. Basically, any time I come across a histogram that looks like it might be a sample from a gamma distribution except for a big spike at zero, I see a candidate for a Tweedie model. (Having a Tweedie hammer makes lots of things look like Tweedie nails.) Nevertheless, apparently lots of people are seeing Tweedie these days. Even the scholarly citations for Maurice Tweedie's original paper are up.
Tweedie distributions are a subset of what are called Exponential Dispersion Models (EDMs). EDMs are two-parameter distributions from the linear exponential family that also have a dispersion parameter φ. Statistician Bent Jørgensen solidified the concept of EDMs in a 1987 paper, and named the following class of EDMs after Tweedie.
An EDM random variable Y follows a Tweedie distribution if
var(Y) = φ · V(μ)
where μ is the mean of the distribution, φ is the dispersion parameter, V is a function describing the mean/variance relationship of the distribution, and p is a constant such that:
V(μ) = μ^p
Some very familiar distributions fall into the Tweedie family. Setting p = 0 gives a normal distribution. p = 1 is Poisson. p = 2 gives a gamma distribution and p = 3 yields an inverse Gaussian. However, much of the action for fitting Tweedie GLMs is for values of p between 1 and 2. In this interval, closed form distribution functions don’t exist, but as it turns out, Tweedies in this interval are compound Poisson distributions. (A compound Poisson random variable Y is the sum of N independent gamma random variables where N follows a Poisson distribution and N and the gamma random variates are independent.)
This last fact helps to explain why Tweedies are so popular. For example, one might model the insurance claims for a customer as a series of independent gamma random variables and the number of claims in some time interval as a Poisson random variable. Or, the gamma random variables could be models for precipitation, and the total rainfall resulting from N rainstorms would follow a Tweedie distribution. The possibilities are endless.
R has quite a few resources for working with Tweedie models. Here are just a few. You can fit a Tweedie GLM with the tweedie family function in the statmod package:
# Fit an inverse-Gaussian GLM with log link
require(statmod)  # provides the tweedie() family function
glm(y ~ x, family = tweedie(var.power = 3, link.power = 0))
The tweedie package has several interesting functions for working with Tweedie models, including a function to generate random samples. The following graph shows four different Tweedie histograms as the power parameter moves from 1.2 to 1.9.
It is apparent that increasing the power shifts mass away from zero towards the right.
(The code for producing these plots, which includes some nice code from Stephen Turner for putting all four ggplots on a single graph, is available: Download Plot_tweedie.)
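If you'd like to generate samples yourself, here is a minimal sketch using the tweedie package's random number generator (the spike at zero appears for powers between 1 and 2):

```r
require(tweedie)
set.seed(123)
# Draw from a compound Poisson-gamma Tweedie (1 < p < 2)
y <- rtweedie(10000, power = 1.5, mu = 1, phi = 1)
mean(y == 0)   # proportion of exact zeros: the spike at zero
hist(y, breaks = 100, main = "Tweedie samples, p = 1.5")
```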
Package poistweedie also provides functions for simulating Tweedie models. Package HDtweedie implements an iteratively reweighted least squares algorithm for computing solution paths for grouped lasso and grouped elastic net Tweedie models. And, package cplm provides both likelihood and Bayesian functions for working with compound Poisson models. Be sure to have a look at the vignette for this package to see compound Poisson distributions in action.
Finally, two very readable references for both the math underlying Tweedie models and the algorithms to compute them are a couple of papers by Dunn and Smyth: here and here.
In case you missed them, here are some articles from September of particular interest to R users.
Norm Matloff argues that T-tests shouldn't be part of the Statistics curriculum and questions the "star system" for p-values in R.
A nice video introduction to the dplyr package and the %>% operator, presented by Kevin Markham.
An animation of police militarization in the US, created with R and open data published by the New York Times.
An overview of the miscellaneous R functions in the DescTools package.
Some guidance from Will Stanton on becoming a "data hacker" using R and Hadoop.
A tutorial on publishing ggplot2 graphics to the web with plotly.
A Shiny app that implements the Traveling Salesman Problem and animates the simulated annealing algorithm behind the solution.
R code for comparing performance of machine learning models.
Presentations at DataWeek on applications of R at companies.
Announcing new members for the R Foundation and the R Core team.
A graduate student uses R to look at the popularity of posts on Reddit.
Google introduces the CausalImpact package for R, and uses it to evaluate performance of marketing campaigns.
A review of several recent and upcoming conferences that include R-related tracks.
More presentations and video interviews from the useR! 2014 conference, from DataScience.LA.
A detailed Rcpp example based on the Collatz Conjecture.
Use Rmarkdown to create documents combining text, mathematics, and R graphical and tabular output.
A very early example of data analysis: Nile floods in 450 BC.
The Rockefeller Institute of Government uses R to simulate the finances of public sector pension funds.
General interest stories (not related to R) in the past month included: ET for the Atari 2600, Talk Like a Pirate day photos, a parody lifestyle magazine for data scientists and the spread of the Ice Bucket Challenge.
As always, thanks for the comments and please send any suggestions to me at david@revolutionanalytics.com. Don't forget you can follow the blog using an RSS reader, via email using blogtrottr, or by following me on Twitter (I'm @revodavid). You can find roundups of previous months here.