by Seth Mottaghinejad, Data Scientist at Microsoft
R and big data
There are many R packages dedicated to letting users (or useRs, if you prefer) deal with big data in R. (We will intentionally avoid using proper case for 'big data', because (1) the term has become somewhat hackneyed, and (2) for the sake of this article we can think of big data as any dataset too large to fit into memory as a data.frame so that standard R functions can run on it.) Even without third-party packages, base R still puts some tools at our disposal, which boil down to doing one of two things: we can either format the data more economically so that it can still be squeezed into memory, or we can deal with the data piecemeal, bypassing the need to load it into memory all at once.
An example of the first approach is to format character vectors as factor, when doing so is appropriate, because a factor is stored as an integer vector under the hood, which takes less space than the strings it represents. There are of course many other advantages to using factor, but let's not digress. An example of the second approach consists of processing the data only a certain number of rows at a time, i.e. chunk by chunk, where each chunk can fit into memory and be brought into R as a data.frame.
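Both approaches can be sketched in a few lines of base R. The snippet below is only an illustration under assumed conditions: the state names, the temporary CSV file, and the chunk size of 250 rows are made up for the example, and a small file stands in for a dataset that would really be too large to load at once.

```r
# Approach 1: store a repetitive character vector as a factor.
set.seed(1)
states <- sample(c("California", "Washington", "New York"), 1e5, replace = TRUE)
size_chr <- object.size(states)          # character vector
size_fct <- object.size(factor(states))  # integer codes plus a few levels

# Approach 2: process a file chunk by chunk instead of loading it whole.
# A small temporary CSV stands in for a large dataset (hypothetical data).
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:1000, y = (1:1000) * 2), tmp, row.names = FALSE)

chunk_size <- 250                        # rows per chunk (arbitrary choice)
con <- file(tmp, open = "r")

# Read the header line once and strip the quotes write.csv adds.
col_names <- gsub('"', "", strsplit(readLines(con, n = 1), ",")[[1]])

total_y <- 0
n_rows <- 0
repeat {
  lines <- readLines(con, n = chunk_size)
  if (length(lines) == 0) break          # end of file reached
  # Each chunk is an ordinary data.frame that fits in memory.
  chunk <- read.csv(text = lines, header = FALSE, col.names = col_names)
  total_y <- total_y + sum(chunk$y)      # keep only a running summary
  n_rows <- n_rows + nrow(chunk)
}
close(con)

mean_y <- total_y / n_rows               # mean of y without loading all rows
```

Comparing `size_chr` with `size_fct` shows the memory saved by the factor representation, and `mean_y` is computed from running totals only, so no more than one chunk is ever held in memory at a time.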
RevoScaleR and big data
The aforementioned chunk-wise processing of data is what the RevoScaleR package does behind the scenes. For example, if we run the rxLinMod function (the counterpart to base R's lm function), we can fit a regression model on a very large dataset, presumably one that is too large to fit into memory. Even if the dataset could still fit into memory (servers nowadays can easily have 500GB of RAM), processing it using lm would take considerably longer than rxLinMod because the latter is a parallel algorithm, meaning that it breaks up the data into chunks and decides what intermediate results are kept from each chunk and how they are aggregated at the end to produce a single result, which in the case of rxLinMod is a linear model. (What those intermediate results are and how they are to be aggregated is what makes developing parallel algorithms a challenging and interesting problem. A great deal of effort is spent on taking a non-parallel algorithm and rewriting it to work in a parallel way. But that's a topic for another day.)
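To make the idea of intermediate results concrete, here is a minimal base R sketch of a chunk-wise linear regression. It is not how rxLinMod is actually implemented, which this article does not describe; it simply assumes one textbook choice of intermediate results, the cross-products X'X and X'y, which can be summed across chunks and then solved once at the end. The simulated data, variable names, and chunk size are all hypothetical.

```r
# Simulated data standing in for a large dataset (hypothetical example).
set.seed(42)
n <- 10000
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- 1 + 2 * df$x1 - 3 * df$x2 + rnorm(n)

chunk_size <- 1000                       # rows per chunk (arbitrary choice)
starts <- seq(1, n, by = chunk_size)

# Intermediate results: running sums of X'X (3 x 3) and X'y (3 x 1).
xtx <- matrix(0, 3, 3)
xty <- matrix(0, 3, 1)

for (s in starts) {
  rows <- s:min(s + chunk_size - 1, n)
  X <- model.matrix(~ x1 + x2, data = df[rows, ])  # intercept, x1, x2
  y <- df$y[rows]
  xtx <- xtx + crossprod(X)              # add this chunk's X'X
  xty <- xty + crossprod(X, y)           # add this chunk's X'y
}

# Aggregation step: solve the normal equations once, after all chunks.
beta_chunked <- drop(solve(xtx, xty))

# Reference fit on the full data, for comparison only.
beta_lm <- coef(lm(y ~ x1 + x2, data = df))
```

The chunked coefficients should agree with `beta_lm` up to floating-point error. Because each chunk contributes independently to the sums, the chunks could in principle be processed in parallel. Note that solving the normal equations directly is less numerically stable than the QR decomposition lm uses, so this is a teaching sketch and not a production approach.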
On a local machine, there are two kinds of data types that work with RevoScaleR: