Changes are coming to the internals of the R engine which promise to improve performance and reduce memory use, with dramatic impacts in some circumstances. The changes were first proposed by Gabe Becker at the DSC Conference in 2016 (and updated in 2017), and the implementation by Luke Tierney and Gabe Becker is now making its way into the development branches of the R sources.
ALTREP is an alternative way of representing R objects internally. (The details are described in this draft vignette, and there are a few more examples in this presentation.) It's easiest to illustrate by example.
Today, a vector of numbers in R is represented as a contiguous block of memory in RAM. In other words, if you create the sequence of a million integers 1:1000000
, R creates a block of memory 4Mb in size (4 bytes per number) to store each number in the sequence. But what if we could use a more efficient representation just for sequences like this? WIth ALTREP, a sequence like this is instead represented by just its start and end values, which takes up almost no memory at all. That means, in the future you'll be able to write loops like this:
for (i in 1:1e10) do_something()
without getting errors like this:
Error: cannot allocate vector of size 74.5 Gb
ALTREP has the potential to make many other operations faster or even instantaneous, even on very large objects. Here are a few examples of functions that could be sped up:
is.na(x)
— ALTREP will keep track of whether a vector includes NAs or not, so that R no longer has to inspect the entire vectorsort(x)
— ALTREP will keep track of whether a vector is already sorted, and sorting will be instantaneous in this casex < 5
— knowing that the vector is already sorted, ALTREP can find the breakpoint very quickly (in O(log n) time), and return a "sequence" logical vector that consumes basically no memorymatch(y,x)
— if ALTREP knows that x is already sorted, matching is also much fasteras.character(numeric_vector)
— ALTREP can defer converting a numeric vector to characters until the character representation is actually needed
That last benefit will likely have a large impact on the handling of data frames, which carry around a column of character row labels which start out as a numeric sequence. Development builds are already demonstrating a huge performance gain in the linear regression function lm()
as a result:
> n <- 10000000 > x <- rnorm(n) > y <- rnorm(n) # With R 3.4 > system.time(lm(y ~ x)) user system elapsed 9.225 0.703 9.929 # With ALTREP > system.time(lm(y ~ x)) user system elapsed 1.886 0.610 2.496
The ALTREP framework is designed to be extensible, so that package authors can define their own alternative representations of standard R objects. For example, an R vector could be represented as a distributed object as in a system like Spark, while still behaving like an ordinary R vector to the R engine. An example package, simplemmap, illustrates this concept by defining an implementation of vectors as memory-mapped objects that live on disk instead of in RAM.
There's no definite date yet when ALTREP will be in an official R release, and my guess is that there's likely to be an extensive testing period to shake out any bugs caused by changing the internal representation of R objects. But the fact that the implementation is already making its way into the R sources is hugely promising, and I look forward to testing out the real-world impacts. You can read more about the current state of ALTREP in the draft vignette by Luke Tierney, Gabe Becker and Tomas Kalibera linked below.
R-Project.org: ALTREP: Alternative Representations for R Objects
Comments
You can follow this conversation by subscribing to the comment feed for this post.