For his PhD at Delft University of Technology's Faculty of Mechanical Engineering, Thomas Geijtenbeek created robots that learned how to walk. These were virtual robots — simulations in a computer system — but with realistic muscles, joints and mass that behave in real-life ways.
When you see computer-generated figures move around in movies or computer games, the motion is either hand-generated (using a process not very different from that of animating old Looney Tunes cartoons), or the motion is captured from a human actor. The former method is very time consuming, and limited by the skill and time of the animator. The latter motion-capture method is more common these days, but doesn't lend itself well to non-human characters (like extrapolating a man's motions to that of a giant ape) or situations that are dangerous or impossible for an actor.
Thomas Geijtenbeek's approach is very different: rather than programming the motion of the robots in advance, he created a genetic algorithm to explore all the possible sequences of virtual muscle movements that result in a target motion of, say, a brisk walking speed. As you can see in the video below, early generations of the algorithm fail miserably. But over thousands of generations, a successful sequence of muscle movements evolves:
Because the simulation includes realistic power and torque for the virtual muscles (even neural delay is incorporated!), the result is a very realistic walking action. Better yet, when he sets the target speed to different numbers — say, 10 km/hr for a run — different gaits for the motion naturally emerge, rather than just a speeded-up walking motion. Even a hopping motion emerges naturally for a kangaroo-shaped robot! (You can see different gaits in action at around 1:33 in this longer video.)
Now, if we can just combine these genetic algorithms with real-world robots with independent power ... well, let's just hope our robot overlords are friendly!
That's all for this week — we'll be back on Monday. See you then!
If you haven't heard the buzz about Docker, but you often need to spin up Linux-based VMs for testing, simulations, and the like, then you should check it out. In short, Docker rocks: we use it for testing our Linux-based distros of Revolution R Open. If you want to use R and Docker together, Dirk Eddelbuettel and Carl Boettiger have made it easy with Rocker, and have also provided a nice explanation of Docker itself:
While its use (superficially) resembles that of virtual machines, it is much more lightweight as it operates at the level of a single process (rather than an emulation of an entire OS layer). This also allows it to start almost instantly, require very little resources and hence permits an order of magnitude more deployments per host than a virtual machine.
Rocker provides pre-built Docker images for the currently-released base R distribution, as well as the in-progress R-devel build. It also provides a container for an RStudio Server instance.
You can find more info about Rocker at the blog post below, or at the Rocker Github page.
Dirk Eddelbuettel: Introducing Rocker: Docker for R
by Joseph Rickert
One of the most interesting R related presentations at last week’s Strata Hadoop World Conference in New York City was the session on Distributed R by Sunil Venkayala and Indrajit Roy, both of HP Labs. In short, Distributed R is an open source project with the end goal of running R code in parallel on data that is distributed across multiple machines. The following figure conveys the general idea.
A master node controls multiple worker nodes each of which runs multiple R processes in parallel.
As I understand it, the primary use case for the Distributed R software is to move data quickly from a database into distributed data structures that can be accessed by multiple, independent R instances for coordinated, parallel computation. The Distributed R infrastructure automatically takes care of the extraction of the data and the coordination of the calculations, including the occasional movement of data from a worker node to the master node when required by the calculation. The user interface to the Distributed R mechanism is through R functions that have been designed and optimized to work with the distributed data structures, and through a special "Distributed R aware" foreach() function that allows users to write their own distributed functions using ordinary R functions.
To make all of this happen, the Distributed R platform contains several components that may be briefly described as follows:
The distributedR package contains the distributed data structures:
A really nice feature of the distributed data structures is that they can be populated and accessed by rows, columns and blocks making it possible to write efficient algorithms tuned to the structure of particular data sets. For example, data cleaning for wide data sets (many more columns than rows) can be facilitated by preprocessing individual features.
vRODBC is an ODBC client that provides R with database connectivity. This is the connection mechanism that permits the parallel loading of data from various data sources, including HP's Vertica database.
The HPdata package contains the functions that allow you to actually load distributed data structures from various data sources.
The HPDGLM package implements parallel, distributed GLM models (presently only linear regression, logistic regression and Poisson regression are available). The package also contains functions for cross validation and split-sample validation.
The HPdclassifier package is intended to contain several distributed classification algorithms. It currently contains a parallel distributed implementation of the random forests algorithm.
The HPdcluster package contains a parallel, distributed kmeans algorithm.
The HPdgraph package is intended to contain distributed algorithms for graph analytics. It currently contains a parallel, distributed implementation of the pagerank algorithm for directed graphs.
The following sample code, taken directly from the HPdclassifier User Guide but modified slightly for presentation here, is similar to the examples that Venkayala and Roy showed in their presentation. Note that after the distributed arrays are set up, they are loaded in parallel with data using the foreach function from the distributedR package.
library(HPdclassifier) # loading the library
Loading required package: distributedR
Loading required package: Rcpp
Loading required package: RInside
Loading required package: randomForest
distributedR_start() # starting the distributed environment
Workers registered - 1/1.
All 1 workers are registered.
[1] TRUE
ds <- distributedR_status() # cluster status, one row per worker
nparts <- sum(ds$Inst) # number of available distributed instances
# Describe the data
nSamples <- 100 # number of samples
nAttributes <- 5 # number of attributes of each sample
nSplits <- 1 # number of splits in each darray
# Create the distributed arrays
dax <- darray(c(nSamples, nAttributes), c(round(nSamples/nSplits), nAttributes))
day <- darray(c(nSamples, 1), c(round(nSamples/nSplits), 1))
# Load the distributed arrays
foreach(i, 1:npartitions(dax),
function(x=splits(dax,i),y=splits(day,i),id=i){
x <- matrix(runif(nrow(x)*ncol(x)), nrow(x),ncol(x))
y <- matrix(runif(nrow(y)), nrow(y), 1)
update(x)
update(y)
})
# Fit the Random Forest Model
myrf <- hpdrandomForest(dax, day, nExecutor=nparts)
# prediction
dp <- predictHPdRF(myrf, dax)
Notwithstanding all of its capabilities, Distributed R is still clearly a work in progress. It is only available on Linux platforms. Algorithms and data must be resident in memory. Distributed R is not available on CRAN, and even with an excellent Installation Guide, installing the platform is a bit of an involved process.
Nevertheless, Distributed R is impressive, and I think a valuable contribution to open source R. I expect that users with distributed data will find the platform to be a viable way to begin high performance computing with R.
Note that the Distributed R project discussed in this post is an HP initiative and is not in any way related to http://www.revolutionanalytics.com/revolution-r-enterprise-distributedr.
by Andrie de Vries
Last week we announced the availability of Revolution R Open, an enhanced distribution of R. One of the enhancements is the inclusion of high performance linear algebra libraries, specifically the Intel MKL. This library significantly speeds up many statistical calculations, e.g. the matrix algebra that forms the basis of many statistical algorithms.
Several years ago, David Smith wrote a blog post about multithreaded R, where he explored the benefits of the MKL, in particular on Windows machines.
In this post I explore whether anything has changed.
To make the best use of the power available in today's machines, Revolution R Open is installed by default with the Intel Math Kernel Library (MKL), which provides the BLAS and LAPACK library functions used by R. The Intel MKL makes it possible for many common R operations to use all of the processing power available.
The MKL's default behavior is to use as many parallel threads as there are available cores. There’s nothing you need to do to benefit from this performance improvement — not a single change to your R script is required.
However, you can still control or restrict the number of threads using the setMKLthreads() function from the Revobase package delivered with Revolution R Open. For example, you might want to limit the number of threads to reserve some of the processing capacity for other activities, or if you're doing explicit parallel programming with the ParallelR suite or other parallel programming tools.
You can set the maximum number of threads as follows:
setMKLthreads(<value>)
where <value> is the maximum number of parallel threads, not to exceed the number of available cores.
Compared to open source R, the MKL offers significant performance gains, particularly on Windows.
Here are the results of 5 tests on matrix operations, run on a Samsung laptop with an Intel i7 4-core CPU. From the graphic you can see that a matrix multiplication runs 27 times faster with the MKL than without, and linear discriminant analysis is 3.6 times faster.
You can replicate the same tests yourself with a few lines of R.
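The original benchmark script isn't reproduced here, but a minimal sketch in base R (illustrative only; the matrix size is an assumption, not the one used in the graphic) of timing one of these operations looks like this:

```r
# Illustrative sketch (not the original benchmark script): time a large
# matrix cross-product, one of the operations the MKL accelerates most.
set.seed(42)
n <- 1000
m <- matrix(rnorm(n * n), nrow = n)

elapsed <- system.time(cp <- crossprod(m))["elapsed"]
cat(sprintf("crossprod of a %d x %d matrix: %.2f seconds\n", n, n, elapsed))
```

With a multithreaded BLAS like the MKL, timings for operations such as crossprod() drop dramatically; with the reference BLAS, they dominate the run time.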
Another famous benchmark was published by Simon Urbanek, one of the members of R-core. You can find his code at Simon's benchmark page. His benchmark consists of three different classes of test:
I compared the total execution time of the benchmark script in RRO (with MKL) and R. Using Revolution R Open, the benchmark tests completed in 47.7 seconds. This compared to ~176 seconds using R-3.1.1 on the same machine.
To replicate these results, you can use a script that runs (sources) his code directly from the URL and captures the total execution time.
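A sketch of such a script follows; the URL below is Simon Urbanek's published benchmark location at the time of writing and may have moved since, and the benchmark itself requires the SuppDists package:

```r
# Sketch: source a script and capture its total elapsed execution time.
time_source <- function(path) {
  unname(system.time(source(path))["elapsed"])
}

# To time Simon Urbanek's benchmark (network connection required):
# total <- time_source("http://r.research.att.com/benchmarks/R-benchmark-25.R")
```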
Here is a summary of each of the individual tests:
| Test | R-3.1.1 (sec) | RRO (sec) | Performance gain |
|------|--------------:|----------:|-----------------:|
| **I. Matrix calculation** | | | |
| Create, transpose and deform matrix | 1.01 | 1.01 | 0.0 |
| Matrix computation | 0.40 | 0.40 | 0.0 |
| Sort random values | 0.72 | 0.74 | 0.0 |
| Cross product | 11.50 | 0.42 | 26.4 |
| Linear regression | 5.56 | 0.25 | 20.9 |
| **II. Matrix functions** | | | |
| Fast Fourier Transform | 0.45 | 0.47 | 0.0 |
| Compute eigenvalues | 0.74 | 0.39 | 0.9 |
| Calculate determinant | 2.87 | 0.24 | 10.8 |
| Cholesky decomposition | 4.50 | 0.25 | 16.8 |
| Matrix inverse | 2.71 | 0.25 | 9.9 |
| **III. Programming** | | | |
| Vector calculation | 0.67 | 0.67 | 0.0 |
| Matrix calculation | 0.26 | 0.26 | 0.0 |
| Recursion | 0.95 | 1.06 | -0.1 |
| Loops | 0.43 | 0.43 | 0.0 |
| Mixed control flow | 0.41 | 0.37 | 0.1 |
| **Total test time** | 165.60 | 47.72 | 2.5 |
The Intel MKL makes a notable difference for many matrix computations. When running the Urbanek benchmark using the MKL on Windows, you can expect a performance gain of ~2.5x.
The caveat is that the standard R distribution uses different math libraries on different operating systems. For example, R on Mac OS X uses the ATLAS BLAS, which gives comparable performance to the MKL.
To find out more about Revolution R Open, go to http://mran.revolutionanalytics.com/open/
You can download RRO at http://mran.revolutionanalytics.com/download/
by Jamie F Olson
Professional Services Consultant, Revolution Analytics
One challenge in transitioning R code into a production environment is ensuring consistency and reliability. These challenges span a wide variety of issues, but runtime behavior is an especially important operational characteristic. Specifically, production code should have a consistent, predictable runtime on a given computational infrastructure. Among other things, this makes it possible to plan and scale IT infrastructure based on operational requirements.
Analytics in general, and R in particular, possess certain characteristics that can make this challenging. Many statistical models don't have a single consistent runtime cost but are instead based on iterative algorithms that continue until some convergence criterion is met. This means that the actual run time for any given model can depend significantly on the data being modeled. This is particularly important considering that in production analytics workflows, the data is continuously changing.
Let's consider the TBATS exponential smoothing multi-seasonal state space model, fit by the tbats function in the forecast package. We repeatedly take random subsets of the "taylor" data to demonstrate the variability in run time:
library(forecast)
subsample_ts <- function(x, N, f = frequency(x)) {
x_i <- floor(runif(1, 1, length(x) - N))
msts(x[x_i:(x_i + N)], f)
}
times <- replicate(10, system.time(tbats(subsample_ts(taylor,
100, c(48, 336)), use.parallel = FALSE)))
sd(times["elapsed", ])
## [1] 0.9392113
summary(times["elapsed", ])
##  Min. 1st Qu. Median   Mean 3rd Qu.   Max.
## 3.963 4.816 5.568 5.522 6.140 7.085
As you can see, certain configurations of data can require more time to model. To circumvent this, we may want to ensure that any particular model has a pre-defined maximum run time. This does not prevent the run time from varying, but it does ensure an upper bound on run time performance.
There are a variety of tools we can use in R to implement this. For example, there is a base::setTimeLimit function that allows us to set certain time limits for any "top-level computation".
system.time(Sys.sleep(5))
##  user  system elapsed
## 0.000 0.000 5.005
system.time(local({
setTimeLimit(elapsed = 1, transient = TRUE)
Sys.sleep(5)
}))
## Error in Sys.sleep(5): reached elapsed time limit
## Timing stopped at: 0 0 5.006
This simple function can be extremely useful in many scenarios, but it may not be enough in production environments. The setTimeLimit documentation describes a key limitation: Time limits are checked ... "only at points in compiled C and Fortran code identified by the code author."
Note: It's important to mention that just because a function uses C code doesn't mean that setTimeLimit won't work as expected. The "top-level" R code in such functions can be interrupted, as can additional points in the C code that are explicitly checked in the source code, but it's up to the package author to implement these checks. In situations where this is not a concern, you may also be interested in the withTimeout function in the R.utils package.
This can create a problem for statistical models that offload the computation to C code, like the neural network models for autoregressive processes estimated by the nnetar function from the forecast package. In these cases setTimeLimit may not have the intended effect:
system.time(nnetar(as.ts(taylor)))
##  user  system elapsed
## 50.261 0.017 50.328
system.time(local({
setTimeLimit(elapsed = 4, transient = TRUE)
nnetar(as.ts(taylor))
}))
## Error in NROW(x): reached elapsed time limit
## Timing stopped at: 5.144 0.005 5.155
The OpenCPU project contains an eval_fork function which provides an alternative method of controlling runtime on non-Windows platforms. A slightly simplified version is reproduced below.
The core idea is to use fork to run the desired command in a separate process that is then "killed" after the desired amount of time has elapsed. The expression is launched with the parallel::mcparallel function, and parallel::mccollect then returns after "timeout" seconds with either a result, if the expression completed, or NULL.
eval_fork <- function(..., timeout = 60) {
myfork <- parallel::mcparallel({
eval(...)
}, silent = FALSE)
# wait max n seconds for a result.
myresult <- parallel::mccollect(myfork, wait = FALSE, timeout = timeout)
# kill fork after collect has returned
tools::pskill(myfork$pid, tools::SIGKILL)
tools::pskill(-1 * myfork$pid, tools::SIGKILL)
# clean up:
parallel::mccollect(myfork, wait = FALSE)
# timeout?
if (is.null(myresult))
stop("reached elapsed time limit")
# move this to distinguish between timeout and NULL returns
myresult <- myresult[[1]]
# send the buffered response
return(myresult)
}
system.time(eval_fork(nnetar(as.ts(taylor)), timeout = 4))
## Error in eval_fork(nnetar(as.ts(taylor)), timeout = 4): reached elapsed time limit
## Timing stopped at: 0.005 0.023 4.032
We can safely capture the results from eval_fork using try:
mynnet <- try(eval_fork(nnetar(as.ts(taylor)), timeout = 4),
silent = TRUE)
These techniques make it possible to achieve consistent operational requirements in an environment of changing data and computationally unpredictable algorithms. In the next post on R in Production, we'll talk more about how to use try and similar functions to capture and safely handle warnings and errors.
Many R scripts depend on CRAN packages, and most CRAN packages in turn depend on other CRAN packages. If you install an R package, you'll also be installing its dependencies to make it work, and possibly other packages as well to enable its full functionality.
My colleague Andrie posted some R code to map package dependencies a couple of months ago, but now you can easily explore the dependencies of any CRAN package at MRAN. Simply search for a package and click the Dependencies Graph tab. Here's a very simple one: the foreach package.
The foreach package depends on two others: iterators and codetools, which will be automatically installed for you by install.packages when you install foreach. (We'll discuss the use of "Suggests" — as here with randomForest — later.) Now let's look at a more complex example: the caret package.
The caret package provides an interface to many of the predictive modeling packages on CRAN, and so it has several dependencies (nine, in fact — you can see the list by clicking on the Dependencies Table tab). But it also Suggests many more packages — these are packages that are not required to run caret, but if you do have them, there are more model types you can use within the caret framework.
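You can also query these relationships directly from R with tools::package_dependencies(). The toy package database below is hand-made for illustration (in practice you would pass db = available.packages(), which requires a network connection):

```r
# Toy illustration of tools::package_dependencies(); the db rows here are
# hand-made stand-ins for CRAN metadata, not live values.
db <- rbind(
  c(Package = "foreach", Version = "1.4.2",
    Depends = "R (>= 2.5.0)", Imports = "codetools, utils, iterators",
    LinkingTo = NA, Suggests = "randomForest", Enhances = NA),
  c(Package = "iterators", Version = "1.0.7",
    Depends = "R (>= 2.5.0), utils", Imports = NA,
    LinkingTo = NA, Suggests = NA, Enhances = NA)
)
deps <- tools::package_dependencies("foreach", db = db,
                                    which = c("Depends", "Imports"))
deps$foreach
```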
Here's a quick overview of the types of dependencies you'll find in the charts and tables on MRAN:
MRAN is updated daily, and so the Dependencies Graph is always up-to-date with the latest CRAN packages and their connections. Start exploring at the link below.
MRAN: Explore Packages
It's been a super-busy time at Strata this week, so I'm taking the easy route for Because it's Friday this week: funny dog and cat videos. If you're not one of the 10 million people who have seen Sad Dog Diary, well, now's your chance:
And if you're more of a cat person, there's also Sad Cat Diary:
That's all for this week! Have a great weekend, and we'll be back on Monday.
My second-favourite keynote from yesterday's Strata Hadoop World conference was this one, from Pinterest's John Rauser. To many people (especially in the Big Data world), Statistics is a series of complex equations, but just a little intuition goes a long way toward really understanding data. John illustrates this wonderfully using an example of data collected to determine whether consuming beer causes mosquitoes to bite you more:
The big lesson here, IMO, is that so many statistical problems can seem complex, but you can actually get a lot of insight by recognizing that your data is just one possible instance of a random process. If you have a hypothesis for what that process is, you can simulate it, and get an intuitive sense of how surprising your data is. R has excellent tools for simulating data, and a couple of hours spent writing code to simulate data can often give insights that will be valuable for the formal data analysis to come.
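For instance, a permutation simulation in the spirit of the beer-and-mosquitoes example takes only a few lines of R (the bite counts below are made up for illustration, not the study's actual data):

```r
# Toy permutation test: is the observed difference in mean mosquito-bite
# counts between beer and water drinkers surprising under random assignment?
set.seed(1)
beer  <- c(27, 20, 21, 26, 27, 31, 24, 21, 20, 19)
water <- c(21, 22, 15, 12, 21, 16, 19, 15, 22, 24)
obs <- mean(beer) - mean(water)

# Shuffle the pooled counts into two random groups, many times over
pooled <- c(beer, water)
perm_diffs <- replicate(10000, {
  idx <- sample(length(pooled), length(beer))
  mean(pooled[idx]) - mean(pooled[-idx])
})

# One-sided p-value: how often does random shuffling beat the observed gap?
p_value <- mean(perm_diffs >= obs)
```

If the shuffled differences rarely reach the observed one, the data would be surprising under pure chance, and that is the whole argument, with no equations required.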
(By the way, my favourite keynote from the conference was Amanda Cox's keynote on data visualization at the New York Times, which featured several examples developed in R. Sadly, though, it wasn't recorded.)
O'Reilly Strata: Statistics Without the Agonizing Pain
by Joseph Rickert
There is something about R user group meetings that both encourages and nourishes a certain kind of "after hours" creativity. Maybe it is the pressure of having to make a presentation about stuff you do at work interesting to a general audience, or maybe it is just the desire to reach a high level of play. But R user group presentations often manage to make some obscure area of computational statistics seem not only accessible, but also relevant and fun. Here are a couple of examples of what I mean.
Recently Xiaocun Sun conducted an image processing workshop for KRUG, the Knoxville R User's Group. As the following slide indicates, he used the EBImage Bioconductor package, a package that I imagine few people who don't do medical imaging for a living would be likely to stumble upon by accident, to illustrate the basics of image processing.
Xiaocun's presentation, along with R code, is available for download from the KRUG site.
As a second example, consider the presentation that Antonio Piccolboni recently made to the Bay Area useR Group (BARUG): 10 Eigenmaps of the United States of America. Inspired by an article in the New York Times, Antonio decided to undertake his own idiosyncratic tour through the Census data and look at socio-economic trends in the United States. His analysis is both thought provoking and visually compelling. For example, concerning the following map Antonio writes:
This map shows a very interesting ring pattern around some cities, including Atlanta, Dallas and Minneapolis. The red areas show strong population increase, including migration, and an increase in available housing and high median income. The blue areas have a higher death rate, more Federal Government payments to individuals, more widows, single-person households and older people receiving social security.
Antonio's presentation might well illustrate the theme: "Data Scientist reads the Sunday paper and finds data to begin a conversation about what he read with his quantitative, R-literate friends".
This kind of active reading fits nicely with ideas about responsible, quantitative journalism that Chris Wiggins expresses in a presentation he recently made to the New York Open Statistical Programming Meetup. Here, Chris provides some insight into the role of Data Science at the New York Times and offers advice on using data to study relevant issues and clearly communicate findings. One major point in Chris' presentation is that data science plus clear communication can have a very positive influence on shaping our culture.
It is not an exaggeration to say that the kind of work that Xiaocun, Antonio and other R user group presenters undertake in their spare time "for fun" is valuable and important beyond the immediate goals of learning and teaching R.
For the past 7 years, Revolution Analytics has been the leading provider of R-based software and services to companies around the globe. Today, we're excited to announce a new, enhanced R distribution for everyone: Revolution R Open.
Revolution R Open is a downstream distribution of R from the R Foundation for Statistical Computing. It's built on the R 3.1.1 language engine, so it's 100% compatible with any scripts, packages or applications that work with R 3.1.1. It also comes with enhancements to improve your R experience, focused on performance and reproducibility:
Today we are also introducing MRAN, a new website where you can find information about R, Revolution R Open, and R Packages. MRAN includes tools to explore R Packages and R Task Views, making it easy to find packages to extend R's capabilities. MRAN is updated daily.
Revolution R Open is available for download now. Visit mran.revolutionanalytics.com/download for binaries for Windows, Mac, Ubuntu, CentOS/Red Hat Linux and (of course) the GPLv2 source distribution.
With the new Revolution R Plus program, Revolution Analytics is offering technical support and open-source assurance for Revolution R Open and several other open source projects from Revolution Analytics (including DeployR Open, ParallelR and RHadoop). If you are interested in subscribing, you can find more information at www.revolutionanalytics.com/plus . And don't forget that big-data R capabilities are still available in Revolution R Enterprise.
We hope you enjoy using Revolution R Open, and that your workplace will be confident adopting R with the backing of technical support and open source assurance of Revolution R Plus. Let us know what you think in the comments!