This post announces some new features for the dplyrXdf package, which provides tools for manipulating Xdf data files for use with the RevoScaleR library included with Microsoft R Client and Microsoft R Server.
Extraction operators
Sometimes it's useful to be able to extract variables from an Xdf file. With a data frame, you can do this with the $ and [[ operators: for example, iris$Species and iris[["Species"]] both return the Species column (as a vector) from the iris dataset. This update to dplyrXdf implements the same functionality for Xdf files:
sampDir <- system.file("sampleData", package="RevoScaleR")
airline <- RxXdfData(file.path(sampDir, "AirlineDemoSmall.xdf"))
ArrDelay <- airline$ArrDelay
head(ArrDelay)
# [1]   6  -8  -2   1  -2 -14
By default, the entire column is returned, so you should be careful using these operators when you have very large Xdf files.
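The two operators differ in one practical way: [[ accepts a column name held in a variable, while $ needs the name written out. Below is a minimal sketch of the data-frame behaviour described above, using only base R and the built-in iris dataset. The variable col_name is hypothetical and introduced only for illustration. Since the post says the Xdf operators follow the same semantics, airline[[col_name]] would presumably work the same way, but that is an assumption not checked here.

```r
# Base R only; the iris dataset ships with R.

# Both operators return the column as a plain vector (here, a factor).
by_dollar  <- iris$Species
by_bracket <- iris[["Species"]]
same_result <- identical(by_dollar, by_bracket)

# `[[` also accepts a column name stored in a variable, which `$` does not.
# `col_name` is a hypothetical variable used only for this illustration.
col_name <- "Sepal.Length"
sepal_length <- iris[[col_name]]

# The whole column comes back: one value per row of the data.
length(sepal_length)
```

Because the full column is materialised in memory, the length of the result equals the number of rows in the source, which is why the warning about very large files matters.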
The subset verb
In dplyr, subsetting data is handled by two verbs: filter for subsetting by rows, and select for subsetting by columns. This is fine for data frames, where everything runs in memory, and for SQL databases, where the hard work is done by the database. For Xdf files, however, this is suboptimal, as each verb translates into a separate I/O step where the data is read from disk, subset, then written out again. This can waste a lot of time with large datasets.
You can get around this by using the .rxArgs argument in a verb to pass commands directly to the underlying RevoScaleR functions. For example, filter(xdf, .rxArgs = list(varsToKeep=*)) would subset by rows, and simultaneously use the varsToKeep parameter to tell rxDataStep to subset by columns. But this is inelegant. It would be better if there were a verb that could natively subset in both dimensions, without having to rely on workarounds.
As it turns out, base R has a subset generic which (as the name says) performs subsetting on both rows and columns. You've probably used it with data frames:
subset(iris, Species == "setosa", c(Sepal.Length, Sepal.Width))
#   Sepal.Length Sepal.Width
# 1          5.1         3.5
# 2          4.9         3.0
# 3          4.7         3.2
# 4          4.6         3.1
# 5          5.0         3.6
# ...
Here, the first argument after the dataset specifies the rows to keep, and the next argument specifies the columns to return. The code for Xdf files works along the same lines:
airSubset <- subset(airline, DayOfWeek == "Monday", c(ArrDelay, CRSDepTime))
head(airSubset)
#   ArrDelay CRSDepTime
# 1        6   9.666666
# 2       -8  19.916666
# 3       -2  13.750000
# 4        1  11.750000
# 5       -2   6.416667
# 6      -14  13.833333
You can also use the same helper functions to choose columns as with select:
airSubset2 <- subset(airline, , starts_with("A"))
names(airSubset2)
# [1] "ArrDelay"
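To make the two-dimensional subsetting concrete, here is a small sketch of what subset does on an ordinary data frame, compared with plain bracket indexing. It uses only base R and the built-in iris dataset, so it illustrates the data-frame semantics that the Xdf method is described as following and does not exercise dplyrXdf itself. The names cols, via_subset, via_brackets and all_rows are introduced only for this illustration.

```r
# Base R only; illustrates subset() on a data frame, not on an Xdf file.

# Rows and columns selected in a single call.
via_subset <- subset(iris, Species == "setosa", c(Sepal.Length, Sepal.Width))

# The same selection written with bracket indexing.
cols <- c("Sepal.Length", "Sepal.Width")
via_brackets <- iris[iris$Species == "setosa", cols]

# Both approaches should give the same data frame.
all.equal(via_subset, via_brackets)

# Leaving the row argument empty keeps every row and subsets columns only,
# mirroring the empty second argument in the starts_with() call above.
all_rows <- subset(iris, , c(Sepal.Length, Sepal.Width))
dim(all_rows)
```

For a data frame the two forms are interchangeable. The benefit for Xdf files, as the post explains, is that doing both selections in one verb avoids a second pass over the data on disk.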
Getting it
You can get dplyrXdf from GitHub, most easily with the devtools package: devtools::install_github("RevolutionAnalytics/dplyrXdf"). Please send bug reports/feedback/praise/raspberries to me, at [email protected].