At last month's useR! 2011 conference at Warwick University, there were two talks on the RevoScaleR package for big data statistics in R.
For the past several decades the rising tide of technology -- especially the increasing speed of single processors -- has allowed the same data analysis code to run faster and on bigger data sets. That happy era is ending. The size of data sets is increasing much more rapidly than the speed of single cores, of I/O, and of RAM. To deal with this, we need software that can use multiple cores, multiple hard drives, and multiple computers.
That is, we need scalable data analysis software. It needs to scale from small data sets to huge ones, from using one core and one hard drive on one computer to using many cores and many hard drives on many computers, and from using local hardware to using remote clouds.
R is the ideal platform for scalable data analysis software. It is easy to add new functionality in the R environment, and easy to integrate it into existing functionality. R is also powerful, flexible and forgiving. I will discuss the approach to scalability we have taken at Revolution Analytics with our package RevoScaleR. A key part of this approach is to efficiently operate on "chunks" of data -- sets of rows of data for selected columns. I will discuss this approach from the point of view of:
- Storing data on disk
- Importing data from other sources
- Reading and writing of chunks of data
- Handling data in memory
- Using multiple cores on single computers
- Using multiple computers
- Automatically parallelizing "external memory" algorithms
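To make the "chunks" idea concrete, here is a minimal sketch in base R rather than RevoScaleR itself. It assumes a small synthetic CSV file written to a temporary path, and the column names (`id`, `x`, `y`) and chunk size are invented for illustration. The file is read a block of rows at a time, only the selected column is kept, and running totals are updated so the full data set never has to sit in memory at once.

```r
# Chunk-wise mean and variance in base R (illustration only; not RevoScaleR code).
# Write a small synthetic CSV so the example is self-contained.
set.seed(1)
path <- tempfile(fileext = ".csv")
full <- data.frame(id = 1:1000, x = rnorm(1000), y = runif(1000))
write.csv(full, path, row.names = FALSE)

chunk_size <- 128                      # rows per chunk (arbitrary choice)
con <- file(path, open = "r")
header <- gsub('"', "", strsplit(readLines(con, n = 1), ",")[[1]])

n <- 0; s <- 0; ss <- 0                # running count, sum, and sum of squares of x
repeat {
  lines <- readLines(con, n = chunk_size)
  if (length(lines) == 0) break        # end of file reached
  chunk <- read.csv(text = lines, header = FALSE, col.names = header)
  x <- chunk[["x"]]                    # keep only the selected column
  n  <- n + length(x)
  s  <- s + sum(x)
  ss <- ss + sum(x^2)
}
close(con)

chunk_mean <- s / n
chunk_var  <- (ss - n * chunk_mean^2) / (n - 1)
```

Because each chunk contributes only a count, a sum, and a sum of squares, partial results from different chunks can simply be added together. That is the property that lets this kind of "external memory" computation be split across cores or computers.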
The second presentation was from Revolution Analytics VP of Development Susan Ranney, who presented an application of RevoScaleR to merging and analyzing census data: It's a boy! An analysis of 96 million birth records using R.
The fact that more boys than girls are born each year is well established – across time and across cultures. But there are variations in the degree to which this is true. For example, there is evidence that the sex ratio at birth declines as the age of the mother increases, and babies of a higher birth order are more likely to be girls. Different sex ratios at birth are seen for different racial groups, and a downward trend in the sex ratio in the United States since the early 1970s has been observed.
Although these effects are very small at the individual level, their impact on the number of “excess” males born each year can be large. To analyze the role of these factors in the sex ratio at birth, it is appropriate to use data on many individual births over multiple years.
Such data are in fact readily available. Public-use data sets containing information on all births in the United States are available on an annual basis from 1985 to 2008. But, as Joseph Adler points out in R in a Nutshell, “The natality files are gigantic; they’re approximately 3.1 GB uncompressed. That’s a little larger than R can easily process.” An additional challenge to using these data files is that the format and contents of the data sets often change from year to year.
Using the RevoScaleR package, these hurdles are overcome and the power and flexibility of the R language can be applied to the analysis of birth records. Relevant variables from each year are read from the fixed format data files into RevoScaleR’s .xdf file format using R functions. Variables are then recoded in R where necessary in order to create a set of variables common across years. The data are combined into a single, multi-year .xdf file containing tens of millions of birth records with information such as the sex of the baby, the birth order, the age and race of the mother, and the year of birth.
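The recoding step can be pictured with a small base R sketch. Everything here is hypothetical: the two toy data frames, their column names, and their codings are invented to stand in for yearly extracts whose layouts differ, and plain data frames are used in place of .xdf files.

```r
# Hypothetical yearly extracts; column names and codes are invented for illustration.
births_a <- data.frame(SEX = c(1, 2, 1), MAGE = c(24, 31, 28))   # sex coded 1/2
births_b <- data.frame(sex = c("F", "M", "M"), mager = c(35, 22, 29),
                       stringsAsFactors = FALSE)                 # sex coded M/F

# Map each layout onto one common set of variables.
harmonize_a <- function(d, year) {
  data.frame(year = year,
             sex = factor(ifelse(d$SEX == 1, "Male", "Female"),
                          levels = c("Female", "Male")),
             mother_age = d$MAGE)
}
harmonize_b <- function(d, year) {
  data.frame(year = year,
             sex = factor(ifelse(d$sex == "M", "Male", "Female"),
                          levels = c("Female", "Male")),
             mother_age = d$mager)
}

# Stack the harmonized years into a single multi-year table.
combined <- rbind(harmonize_a(births_a, 1985), harmonize_b(births_b, 2008))
```

Once every year shares the same variable names and factor levels, the yearly pieces can be appended one after another, which is what makes a single multi-year file possible.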
Detailed tabular data can be quickly extracted from the .xdf file and easily visualized using lattice graphics. Trends in births, and more specifically the sex ratio at birth, are examined across time and demographic characteristics. Finally, logistic regressions are computed on the full .xdf file examining the conditioned effects of factors such as the age of the mother, birth order, and race on the probability of having a boy baby.
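As a rough illustration of this kind of model, the base R sketch below fits a logistic regression for the probability of a boy. The data are simulated, not taken from the natality files: the age groups, the sample size, and the 0.51 probability are arbitrary assumptions, and `glm()` stands in for whatever RevoScaleR function was used in the talk. The sketch also shows why tabulating first is attractive with large data: fitting on per-group counts gives the same coefficients as fitting on individual records.

```r
# Simulated births (NOT real data): one row per birth, boy = 1 or 0.
set.seed(42)
n_births <- 5000
births <- data.frame(
  age_group = factor(sample(c("<25", "25-34", "35+"), n_births, replace = TRUE),
                     levels = c("<25", "25-34", "35+")),
  boy = rbinom(n_births, size = 1, prob = 0.51)   # arbitrary simulation setting
)

# Fit on individual records.
fit_rows <- glm(boy ~ age_group, family = binomial, data = births)

# Tabulate to one row per age group, then fit on (boys, girls) counts.
boys  <- tapply(births$boy, births$age_group, sum)
total <- tapply(births$boy, births$age_group, length)
counts <- data.frame(
  age_group = factor(names(boys), levels = levels(births$age_group)),
  boys  = as.vector(boys),
  girls = as.vector(total - boys)
)
fit_counts <- glm(cbind(boys, girls) ~ age_group, family = binomial, data = counts)

# Both fits estimate the same coefficients.
coef_rows   <- coef(fit_rows)
coef_counts <- coef(fit_counts)
```

The grouped fit works from three rows instead of thousands, which hints at how a small table of counts extracted from a large file can carry everything the model needs when all predictors are categorical.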
Both presentations are available for download from the link below.
SlideShare: Revolution Analytics presentations