This post from Stephen Weller is part of a series from members of the Revolution Analytics Engineering team. Learn more about the RevoScaleR package, available free to academics as part of Revolution R Enterprise — ed.
The RevoScaleR package, installed with Revolution R Enterprise, offers parallel external memory algorithms that help R break through memory and performance limitations.
RevoScaleR contains:
- The .xdf data file format, designed for fast processing of blocks of data, and
- A growing number of external memory implementations of the statistical algorithms most commonly used with large data sets
Here is a sample RevoScaleR analysis that uses a subset of the airline on-time data reported each month to the U.S. Department of Transportation (DOT) and Bureau of Transportation Statistics (BTS) by the 16 U.S. air carriers. This data contains three columns: two numeric variables, ArrDelay and CRSDepTime, and a categorical variable, DayOfWeek. It is located in the SampleData folder of the RevoScaleR package, so you can easily run this example in your Revolution R Enterprise session.
- Import the sample airline data from a comma-delimited text file to an .xdf file. When we import the data, we convert the string variable to a (categorical) factor variable using stringsAsFactors:
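A minimal sketch of this import step. It assumes the sample file is named AirlineDemoSmall.csv, that missing values in it are coded as "M", and that rxImport() is available in your RevoScaleR version (older releases offered rxTextToXdf() for the same job). The guard skips the import where RevoScaleR is not installed:

```r
# Sketch only: needs the RevoScaleR package from Revolution R Enterprise.
outFile <- "airline.xdf"  # same file name the rxLinMod() call reads from

if (requireNamespace("RevoScaleR", quietly = TRUE)) {
  library(RevoScaleR)
  # Assumption: the sample CSV is AirlineDemoSmall.csv in the SampleData folder.
  inFile <- file.path(rxGetOption("sampleDataDir"), "AirlineDemoSmall.csv")
  rxImport(inData = inFile, outFile = outFile,
           stringsAsFactors = TRUE,   # DayOfWeek is stored as a factor
           missingValueString = "M",  # assumption: "M" marks missing values
           overwrite = TRUE)
}
```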
Next, we fit a linear model of arrival delay. We use the RevoScaleR F() function here, which tells rxLinMod() to treat a variable as a factor (categorical) variable. We also use the transforms argument to create a new variable, Weekend, on the fly:
test.linmod.fit <- rxLinMod(ArrDelay ~ F(Weekend) : F(CRSDepTime),
    transforms = list(Weekend = (DayOfWeek == "Saturday") | (DayOfWeek == "Sunday")),
    cube = TRUE, data = "airline.xdf")
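To see what the Weekend transform and the factor conversion produce, here is a rough base R analogue on a few made-up values. It assumes CRSDepTime is recorded in decimal hours and that F() applied to a numeric variable creates one level per integer value; the name DepHour and the toy vectors are hypothetical, not part of the sample data:

```r
# Hypothetical toy values, not taken from the airline data
DayOfWeek  <- factor(c("Monday", "Saturday", "Sunday", "Friday"))
CRSDepTime <- c(6.25, 13.5, 23.75, 0.1)  # assumption: decimal hours

# Same expression as in the transforms argument: TRUE on Saturday or Sunday
Weekend <- (DayOfWeek == "Saturday") | (DayOfWeek == "Sunday")

# Rough analogue of F(CRSDepTime): one factor level per whole hour
DepHour <- factor(floor(CRSDepTime), levels = 0:23)

print(data.frame(DayOfWeek, Weekend, CRSDepTime, DepHour))
```

Crossing the two factors, as the formula does with the `:` operator, gives one model term per combination of weekend status and departure hour.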
The test.linmod.fit$countDF component contains the group means and cell counts. Since the independent variables in our regression were all categorical, the group means are the same as the coefficients. We can do a quick check by taking the sum of the differences, which should be zero up to rounding error:
linModDF <- test.linmod.fit$countDF
sum(linModDF$ArrDelay - coef(test.linmod.fit))
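The same property can be checked in base R without RevoScaleR. The sketch below uses simulated data (the toy data frame and its column names are made up for illustration) and fits a model with one dummy variable per cell and no intercept using lm(), then compares the coefficients with the plain group means:

```r
set.seed(1)
n <- 2000

# Simulated stand-in for the airline data: two categorical predictors
toy <- data.frame(
  Weekend = factor(sample(c(FALSE, TRUE), n, replace = TRUE)),
  DepHour = factor(sample(0:5, n, replace = TRUE))
)
toy$ArrDelay <- rnorm(n, mean = 10 + 2 * as.integer(toy$DepHour), sd = 30)

# One dummy per Weekend x DepHour cell, no intercept (a cell-means model)
toy$cell <- interaction(toy$Weekend, toy$DepHour, drop = TRUE)
fit <- lm(ArrDelay ~ 0 + cell, data = toy)

# Plain group means, in the same cell order as the coefficients
groupMeans <- tapply(toy$ArrDelay, toy$cell, mean)

# Largest absolute difference between coefficients and group means
maxAbsDiff <- max(abs(unname(coef(fit)) - as.vector(groupMeans)))
print(maxAbsDiff)
```

Taking the largest absolute difference, rather than the sum, guards against positive and negative differences cancelling each other out.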
The output from our linear model estimation includes standard errors of the coefficient estimates. We can use these to create approximate 95% confidence bounds (plus or minus two standard errors) around the estimated coefficients. Let's add them as additional variables in our data frame:
linModDF$coef.std.error <- as.vector(test.linmod.fit$coef.std.error)
linModDF$lowerConfBound <- linModDF$ArrDelay - 2*linModDF$coef.std.error
linModDF$upperConfBound <- linModDF$ArrDelay + 2*linModDF$coef.std.error
A line plot of the estimates and their confidence bounds against departure time is informative: it clearly shows that our estimates of arrival delays in the early hours of the morning are not very precise because of the small number of observations.
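The link between cell counts and precision can be sketched in base R. In a model with one coefficient per cell, the standard error of each coefficient is the residual standard deviation divided by the square root of the cell count, so sparsely populated cells get wide bounds. The toy data below is simulated and hypothetical, with one deliberately small early-morning cell:

```r
set.seed(42)

# Hypothetical toy data: hour 3 has far fewer flights than the other hours
hours <- rep(c(3, 9, 15, 21), times = c(20, 500, 500, 500))
toy <- data.frame(DepHour  = factor(hours),
                  ArrDelay = rnorm(length(hours), mean = 10, sd = 30))

# One coefficient per departure hour, no intercept
fit <- lm(ArrDelay ~ 0 + DepHour, data = toy)
est <- summary(fit)$coefficients

bounds <- data.frame(
  DepHour        = levels(toy$DepHour),
  Counts         = as.vector(table(toy$DepHour)),
  ArrDelay       = as.vector(est[, "Estimate"]),
  coef.std.error = as.vector(est[, "Std. Error"])
)

# Same construction as above: estimate plus or minus two standard errors
bounds$lowerConfBound <- bounds$ArrDelay - 2 * bounds$coef.std.error
bounds$upperConfBound <- bounds$ArrDelay + 2 * bounds$coef.std.error
bounds$width <- bounds$upperConfBound - bounds$lowerConfBound

print(bounds)
```

The interval for the 20-observation cell is five times as wide as for the 500-observation cells, because sqrt(500 / 20) = 5.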
A comparison of timing between open-source R and proprietary Revolution R would have been nice, especially since 600K x 3 data can be crunched in GNU R anyway. And what about a real big data problem, e.g. 3000 columns instead of 3 and 6 million rows instead of 600K? I'd like to see that.
Posted by: nick | May 25, 2011 at 10:29
I second Nick's comments.
Posted by: FC | May 31, 2011 at 02:45
You can run the same analysis on the large airline data, which contains 123,534,969 observations and 30 variables.
The large airline data can be downloaded here:
http://www.revolutionanalytics.com/subscriptions/datasets/AirlineData87to08.zip
Once extracted from the zip archive, the XDF data file is 13.4 GB.
Posted by: Stephen Weller | June 15, 2011 at 15:51