by Hong Ooi
Sr. Data Scientist, Microsoft
In a previous post I introduced the dplyrXdf package, which combines the ease of use and simplicity of dplyr with the scalability of Revolution Analytics' xdf data format. Most of dplyrXdf works the same way as dplyr, so if you know how to use the latter, then you also (mostly) know how to use the former. However, there are some places in which the two packages are different. These will be described below.
Single-table verbs
As with the previous post, I’ll use the nycflights13 flights data to demonstrate the use of dplyrXdf.
The .rxArgs
parameter
The RevoScaleR functions typically have several arguments beyond those used by dplyrXdf verbs. While usually you don’t need to touch these, it can sometimes be useful to do so. For example, when using mutate
or transmute
, you could specify more complicated transformations via a transformFunc
(see the help for rxTransform
). Similarly, rather than chaining together a mutate
and a summarise
— which would involve creating an intermediate file — you could incorporate the variable transformation into the summarise
itself. More low-level uses of such arguments include setting the block size for an xdf file, changing the compression level, limiting the number of rows, and so on.
Most of the one-table dplyrXdf verbs accept an .rxArgs
argument as a way of transmitting these extra arguments to the underlying RevoScaleR code. This should be a named list specifying the names and values of the arguments to be passed. The exact arguments will vary depending on the verb in question; here is a list of the verbs and the underlying RevoScaleR function that they call:
filter
andselect
:rxDataStep
mutate
andtransmute
:rxDataStep
summarise
: depending on the method chosen,rxCube
orrxSummary
arrange
:rxSort
distinct
:rxDataStep
factorise
:rxFactors
doXdf
:rxDataStep
You should use the .rxArgs
argument with caution, as some verbs may modify the data as part of their normal functioning, so the results you get back may not be as expected. It’s also easy to write convoluted code that makes your dplyrXdf pipelines harder to read. However, when working with big datasets this feature can help save a lot of processing time by avoiding unnecessary disk traffic.
See the help for each of the RevoScaleR functions for which arguments they support, and that you can pass via.rxArgs
.
Transforming data with mutate
and transmute
These verbs work much the same way as in dplyr. You can also use the .rxArgs
parameter to specify additional transformation arguments to rxDataStep
. Here are some examples to illustrate the use of mutate
and transmute
:
# a simple transformation flightsMut <- mutate(flightsXdf, delay = (dep_delay + arr_delay)/2) head(flightsMut) #> year month day dep_time dep_delay arr_time arr_delay carrier tailnum #> 1 2013 1 1 517 2 830 11 UA N14228 #> 2 2013 1 1 533 4 850 20 UA N24211 #> 3 2013 1 1 542 2 923 33 AA N619AA #> 4 2013 1 1 544 -1 1004 -18 B6 N804JB #> 5 2013 1 1 554 -6 812 -25 DL N668DN #> 6 2013 1 1 554 -4 740 12 UA N39463 #> flight origin dest air_time distance hour minute delay #> 1 1545 EWR IAH 227 1400 5 17 6.5 #> 2 1714 LGA IAH 227 1416 5 33 12.0 #> 3 1141 JFK MIA 160 1089 5 42 17.5 #> 4 725 JFK BQN 183 1576 5 44 -9.5 #> 5 461 LGA ATL 116 762 5 54 -15.5 #> 6 1696 EWR ORD 150 719 5 54 4.0 # a more complex transformation involving a transformFunc flightsTrans <- transmute(flightsXdf, .rxArgs=list( transformFunc=function(varlist) with(varlist, { delay <- (dep_delay + arr_delay)/2 date <- as.Date(sprintf("%d-%02d-%02d", year, month, day)) weekday <- weekdays(date) weekendDelay <- ifelse(weekday %in% c("Saturday", "Sunday"), delay, NA) list(delay=delay, weekday=weekday, weekendDelay=weekendDelay) }) ) ) head(flightsTrans) #> delay weekday weekendDelay #> 1 6.5 Tuesday NA #> 2 12.0 Tuesday NA #> 3 17.5 Tuesday NA #> 4 -9.5 Tuesday NA #> 5 -15.5 Tuesday NA #> 6 4.0 Tuesday NA # fit a model using open source R, and then score the training dataset # we pass the model object via transformObjects, and the package to load # via transformPackages library(rpart) flightsModel <- rpart(arr_delay ~ dep_delay + carrier + hour, data=flights) flightsScores <- transmute(flightsXdf, pred=predict(model, data.frame(dep_delay, carrier, hour)), .rxArgs=list( transformObjects=list(model=flightsModel), transformPackages="rpart" ) ) head(flightsScores) #> pred #> 1 -8.433385 #> 2 -8.433385 #> 3 -8.433385 #> 4 -8.433385 #> 5 -8.433385 #> 6 -8.433385
It’s important to note that any arguments passed in a .transformFunc
are processed after the transformations in the main call to mutate
and transmute
. For example, this won’t work:
mutate(flightsXdf, delayHrs=delay/60, .rxformArgs=list( transformFunc=function(varlist) with(varlist, { delay <- (dep_delay + arr_delay)/2 list(delay=delay) }), transformVars=c("dep_delay", "arr_delay") ) ) #> Error in doTryCatch(return(expr), name, parentenv, handler) : #> Error in executing R code: object 'delay' not found
Here, the variable delayHrs
depends on delay
, which is defined in a transformFunc
. However, when dplyrXdf tries to compute delayHrs
, the transformFunc
has not yet been executed and so the code fails.
Summarising data with summarise
You can incorporate much of the functionality of filter
and mutate
by including selection and transformation parameters in the .rxArgs
argument. Here is an example:
flightsSmry <- flightsXdf %>% group_by(carrier) %>% summarise(sumdist=sum(dist_km), .rxArgs=list(rowSelection=month > 6, transforms=list(dist_km=distance * 1.6093)) ) head(flightsSmry) #> carrier sumdist #> 1 UA 74549477 #> 2 AA 35197578 #> 3 B6 47741787 #> 4 DL 48677012 #> 5 EV 25681489 #> 6 MQ 12132540
For best performance, when using summarise
you should request only those summary statistics supported byrxCube
and/or rxSummary
: sum, mean, min, max, sd, var and n (the count of observations). If you request something else, dplyrXdf will split the dataset into multiple data frames, one per group, and call dplyr::summarise
on each data frame; this will generally work as intended, but may be slow.
The dplyrXdf version of summarise
can choose from a number of methods for computing the summary statistics. While it’s usually smart enough to choose the best method, you can set this manually with the .method
argument, which takes a number from 1 to 5:
- Use
rxCube
- Use
rxSummary
- Use
rxSummary
but create the groups by concatenating the grouping variables together; this is to work around a limitation in the RevoScaleR functions on the maximum number of cells in a cube - Split the dataset into multiple data frames, call
dplyr::summarise
on each - Split the dataset into multiple xdf files, call
rxSummary
on each
Only methods 1 and 2 support the use of .rxArgs
.
In addition, dplyrXdf summarise
doesn’t support (yet) expressions as summary statistics.
Selecting variables with select
The dplyrXdf version of select
doesn’t support renaming variables as part of the selection. Instead, use a rename
following the select
to rename variables. The rename
verb is very fast since it only modifies the metadata portion of an xdf file, rather than the data itself.
Creating factors with factorise
Many RevoScaleR functions are optimised to work best with factors, or require factors as input. dplyrXdf provides a simple shell to the rxFactors
function to convert non-factor variables to factors. The syntax is as follows:
factorise(data, x1, x2, ...)
where x1
, x2
, … are the variables to convert. Note that the generated factor variables will overwrite the originals. For performance reasons, the levels of the generated factors are not sorted in alphabetical order.
The verbs in dplyrXdf will usually create factors on the fly as needed, so you shouldn’t need to call factorise
very often. However, should you need it, factorise
provides an explicit way to create factors within the framework of dplyrXdf and pipelines.
There are a number of ways to specify the variables to convert, in addition to naming them explicitly. The functions all_character()
, all_numeric()
and all_integer()
will convert all the variables falling under these categories. A logical variable counts as integer for this purpose. You can also use the helper functions available to dplyr::select_vars
to choose variables based on their names.
By default, if no variables are specified in the factorise
call, then all character variables will be converted to factors. As with select
, renaming variables as part of the factor conversion is not supported.
Executing code with do
and doXdf
The do
verb is an exception to the rule that dplyrXdf
verbs write their output as xdf files. This is because do
executes arbitrary R code, and can return arbitrary R objects; while a data frame is capable of storing these objects, an xdf file is limited to character and numeric vectors only.
The doXdf
verb is similar to do
, but where do
splits its input into one data frame per group, doXdf
splits it into one xdf file per group. This allows do
-like functionality with grouped data where each group can be arbitrarily large. The syntax for the two functions is essentially the same, although the code passed to doXdf
must obviously know how to handle xdfs. Both do
and doXdf
only support named arguments.
# fit a regression model by carrier, using rxLinMod flightsMods <- flightsXdf %>% group_by(carrier) %>% doXdf(model=rxLinMod(arr_delay ~ dep_delay + hour, data=.)) #> Rows Read: 336776, Total Rows Processed: 336776, Total Chunk Time: 0.313 seconds flightsMods$model[[1]] #> Call: #> rxLinMod(formula = arr_delay ~ dep_delay + hour, data = .) #> #> Linear Regression Results for: arr_delay ~ dep_delay + hour #> Data: . (RxXdfData Data Source) #> File name: #> C:\Users\hongooi\AppData\Local\Temp\RtmpA5JLqv\filea2c24997434..group..9E.xdf #> Dependent variable(s): arr_delay #> Total independent variables: 3 #> Number of valid observations: 17294 #> Number of missing observations: 1166 #> #> Coefficients: #> arr_delay #> (Intercept) -3.8654201 #> dep_delay 1.0335003 #> hour -0.3950968
Two-table verbs
dplyrXdf currently supports the main table-join verbs from dplyr: left_join
, right_join
, inner_join
and full_join
. The syntax is the same as for the dplyr versions, including joining on non-matching column names. It’s also possible to join xdf files with data frames.
The somewhat more obscure semi_join
and anti_join
verbs are not (yet) supported, mainly because there is no corresponding functionality for these types of joins in rxMerge
. You can mimic semi_join
with the following workaround, which is an inner_join
call combined with a distinct
to remove duplicated rows:
planesXdf <- rxDataFrameToXdf(planes, "planes.xdf", overwrite=TRUE) # same as semi_join(flights, planes, by="tailnum") flightsSemi <- inner_join(flightsXdf, select(planesXdf, tailnum) %>% distinct, by="tailnum") #> Rows Read: 3322, Total Rows Processed: 3322, Total Chunk Time: 0.025 seconds head(flightsSemi) #> year month day dep_time dep_delay arr_time arr_delay carrier tailnum #> 1 2013 1 10 626 -4 802 2 EV N10156 #> 2 2013 1 10 1120 48 1320 40 EV N10156 #> 3 2013 1 10 1619 39 1831 47 EV N10156 #> 4 2013 1 11 632 -2 810 -12 EV N10156 #> 5 2013 1 11 1116 -4 1328 -8 EV N10156 #> 6 2013 1 11 1845 26 1959 27 EV N10156 #> flight origin dest air_time distance hour minute #> 1 4560 EWR PIT 60 319 6 26 #> 2 4269 EWR CHS 99 628 11 20 #> 3 4667 EWR MSP 175 1008 16 19 #> 4 4334 EWR CMH 81 463 6 32 #> 5 4298 EWR MCI 171 1092 11 16 #> 6 4520 EWR PWM 49 284 18 45
Similarly, you can mimic anti_join
as follows:
# same as anti_join(flights, planes, by="tailnum") flightsAnti <- left_join(flightsXdf, transmute(planesXdf, tailnum, pl=rep(1, .rxNumRows)) %>% distinct, by="tailnum") %>% filter(is.na(pl)) #> Rows Read: 3322, Total Rows Processed: 3322, Total Chunk Time: 0.025 seconds head(flightsAnti) #> year month day dep_time dep_delay arr_time arr_delay carrier tailnum #> 1 2013 1 2 NA NA NA NA AA <NA> #> 2 2013 1 2 NA NA NA NA UA <NA> #> 3 2013 1 3 NA NA NA NA UA <NA> #> 4 2013 1 3 NA NA NA NA UA <NA> #> 5 2013 1 4 NA NA NA NA 9E <NA> #> 6 2013 1 4 NA NA NA NA 9E <NA> #> flight origin dest air_time distance hour minute pl #> 1 133 JFK LAX NA 2475 NA NA NA #> 2 623 EWR ORD NA 719 NA NA NA #> 3 714 EWR MIA NA 1085 NA NA NA #> 4 719 EWR DFW NA 1372 NA NA NA #> 5 3405 JFK DCA NA 213 NA NA NA #> 6 3716 EWR DTW NA 488 NA NA NA
The set operations intersect
, union
and setdiff
are also not supported. A workaround for union
is to use rxMerge
with type="union"
(which does a UNION ALL, in SQL lingo), followed by a distinct
:
# same as union(flightsXdf, flightsXdf) flightsUnion <- rxMerge(flightsXdf, flightsXdf, outFile="flightsUnion.xdf", type="union", overwrite=TRUE) %>% distinct #> Rows Read: 336776, Total Rows Processed: 336776, Total Chunk Time: 1.936 seconds #> Rows Read: 336776, Total Rows Processed: 673552, Total Chunk Time: 1.992 seconds nrow(flightsXdf) #> [1] 336776 nrow(flightsUnion) # same as nrow(flightsXdf) #> [1] 336776
Comments