This post announces some new features in the dplyrXdf package, which provides tools for manipulating Xdf data files with the RevoScaleR library included in Microsoft R Client and Microsoft R Server.
Sometimes it's useful to be able to extract a single variable from an Xdf file. With a data frame, you can do this with the $ and [[ operators: for example, iris$Species and iris[["Species"]] both return the Species column (as a vector) from the iris dataset. This update to dplyrXdf implements the same functionality for Xdf files.
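As a small sketch of the data frame case, with a guess at how the Xdf equivalent might look: the Xdf part assumes dplyrXdf and RevoScaleR are installed (it is skipped otherwise), and the file name iris.xdf is only an illustrative choice.

```r
# With a data frame, both operators return the column as a vector
species1 <- iris$Species
species2 <- iris[["Species"]]
identical(species1, species2)

# Sketch of the Xdf equivalent; runs only if dplyrXdf and RevoScaleR are available
if (requireNamespace("dplyrXdf", quietly = TRUE) &&
    requireNamespace("RevoScaleR", quietly = TRUE)) {
    library(dplyrXdf)

    # "iris.xdf" is an illustrative file name, written to a temporary directory
    xdfPath <- file.path(tempdir(), "iris.xdf")
    RevoScaleR::rxDataStep(iris, outFile = xdfPath, overwrite = TRUE)
    irisXdf <- RevoScaleR::RxXdfData(xdfPath)

    # Pulls the whole column into memory, so take care with large files
    speciesXdf <- irisXdf[["Species"]]
}
```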
By default, the entire column is returned, so you should be careful using these operators when you have very large Xdf files.
In dplyr, subsetting data is handled by two verbs: filter for subsetting by rows, and select for subsetting by columns. This is fine for data frames, where everything runs in memory; and for SQL databases, where the hard work is done by the database. For Xdf files, however, this is suboptimal, as each verb translates into a separate I/O step where the data is read from disk, subsetted, then written out again. This can waste a lot of time with large datasets.
You can get around this by using the .rxArgs argument in a verb to pass commands directly to the underlying RevoScaleR functions. For example, filter(xdf, .rxArgs = list(varsToKeep=*))) would subset by rows, and simultaneously use the varsToKeep parameter to tell rxDataStep to subset by columns. But this is inelegant. It would be better if there were a verb that could natively subset in both dimensions, without having to rely on workarounds.
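To make the difference concrete, here is a hedged sketch of the two approaches on the iris data. The filter condition, the column names in keepVars and the file name are illustrative assumptions, not values from the original post, and the Xdf code runs only if dplyrXdf and RevoScaleR are installed.

```r
# Illustrative choice of columns to keep. The filter variable (Species) is
# included here as a precaution, so it will also appear in the one-pass output.
keepVars <- c("Species", "Sepal.Length", "Sepal.Width")

if (requireNamespace("dplyrXdf", quietly = TRUE) &&
    requireNamespace("RevoScaleR", quietly = TRUE)) {
    library(dplyrXdf)

    xdfPath <- file.path(tempdir(), "iris.xdf")
    RevoScaleR::rxDataStep(iris, outFile = xdfPath, overwrite = TRUE)
    irisXdf <- RevoScaleR::RxXdfData(xdfPath)

    # Two verbs: each one reads the data from disk and writes a new file
    rowsOnly <- dplyr::filter(irisXdf, Species == "setosa")
    twoStep <- dplyr::select(rowsOnly, Sepal.Length, Sepal.Width)

    # One verb: the row filter and the column selection happen in a single pass,
    # with varsToKeep handed through to rxDataStep via .rxArgs
    oneStep <- dplyr::filter(irisXdf, Species == "setosa",
                             .rxArgs = list(varsToKeep = keepVars))
}
```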
As it turns out, base R has a subset generic which (as the name says) performs subsetting on both rows and columns. You've probably used it with data frames.
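For instance, a minimal example with the built-in iris data (the particular rows and columns chosen are purely illustrative):

```r
# Rows: setosa flowers only; columns: the two sepal measurements
setosaSepals <- subset(iris, Species == "setosa", c(Sepal.Length, Sepal.Width))

dim(setosaSepals)      # number of rows and columns kept
head(setosaSepals, 3)  # first few rows of the result
```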
Here, the first argument to subset after the data specifies the rows to keep, and the second specifies the columns to return. The code for Xdf files works along the same lines.
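A sketch of what the Xdf version might look like, alongside the data frame result for comparison. It assumes dplyrXdf and RevoScaleR are installed (the Xdf part is skipped otherwise) and that subset returns an Xdf data source; the file name and the chosen rows and columns are illustrative.

```r
# Reference result on the in-memory data frame
expected <- subset(iris, Species == "setosa", c(Sepal.Length, Sepal.Width))

if (requireNamespace("dplyrXdf", quietly = TRUE) &&
    requireNamespace("RevoScaleR", quietly = TRUE)) {
    library(dplyrXdf)

    xdfPath <- file.path(tempdir(), "iris.xdf")
    RevoScaleR::rxDataStep(iris, outFile = xdfPath, overwrite = TRUE)
    irisXdf <- RevoScaleR::RxXdfData(xdfPath)

    # Same call shape as for the data frame: row condition, then columns,
    # done in a single pass over the file
    setosaXdf <- subset(irisXdf, Species == "setosa", c(Sepal.Length, Sepal.Width))

    # Inspect the result without pulling all of it into memory
    # (assumes the result is an Xdf data source)
    RevoScaleR::rxGetInfo(setosaXdf, getVarInfo = TRUE)
}
```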
You can also use the same helper functions to choose columns as with
You can get dplyrXdf from GitHub, most easily with the devtools package: devtools::install_github("RevolutionAnalytics/dplyrXdf"). Please send bug reports/feedback/praise/raspberries to me, at firstname.lastname@example.org.