by Hong Ooi, Sr. Data Scientist, Microsoft
Version 0.90 of the dplyrXdf package has just been released. dplyrXdf is a package that brings dplyr pipelines and data transformation verbs to Microsoft R Server’s xdf files. This version includes several changes, mostly to address performance and efficiency concerns, which I’ll detail below.
The .outFile argument
All dplyrXdf verbs now support a special argument, .outFile, which determines how the output data is handled. If you don’t specify a value for this argument, the data will be saved to a tbl_xdf which will be managed by dplyrXdf. This supports the default behaviour, whereby data files are automatically created and deleted inside a pipeline. There are two other options for .outFile:
If you specify .outFile = NULL, the data will be returned in memory as a data frame.
If .outFile is a character string giving a file name, the data will be saved to an xdf file at that location, and a persistent xdf data source will be returned.
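The three behaviours can be sketched side by side. This is only an illustration: the data source carsXdf, the file names and the choice of the filter verb are assumptions made for the example, and it needs Microsoft R Server (RevoScaleR) with dplyrXdf installed.

```r
library(dplyr)
library(dplyrXdf)  # requires Microsoft R Server's RevoScaleR package

# Hypothetical input: copy the built-in mtcars data frame into an xdf file
carsXdf <- rxDataStep(mtcars, "mtcars.xdf", overwrite = TRUE)

# No .outFile given: the result is a tbl_xdf whose file is managed by dplyrXdf
managed <- carsXdf %>% filter(cyl == 4)

# .outFile = NULL: the result is returned in memory as a data frame
inMemory <- carsXdf %>% filter(cyl == 4, .outFile = NULL)

# .outFile = a file name: the result is written to that xdf file and
# a persistent xdf data source is returned
saved <- carsXdf %>% filter(cyl == 4, .outFile = "mtcars_cyl4.xdf")
```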
This should improve the efficiency of pipelines with large datasets, by reducing the amount of I/O. Previously, to save the output of a pipeline, you had to call the persist verb at the end:
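A minimal sketch of that older pattern, assuming a hypothetical xdf data source carsXdf built from the built-in mtcars data frame and a hypothetical output file name:

```r
library(dplyr)
library(dplyrXdf)  # requires Microsoft R Server's RevoScaleR package

# Hypothetical input: copy the built-in mtcars data frame into an xdf file
carsXdf <- rxDataStep(mtcars, "mtcars.xdf", overwrite = TRUE)

# mutate() writes its result to a temporary xdf file managed by dplyrXdf;
# persist() then saves that result to the requested location
output1 <- carsXdf %>%
    mutate(kpl = 0.425 * mpg) %>%   # miles per gallon to km per litre (approx.)
    persist("output1.xdf")
```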
In this example, mutate would save a temporary xdf file in dplyrXdf’s working directory, and persist would then copy that file to the final output location. Now, you can save the output directly to the final location as follows:
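A sketch of the direct form, again assuming a hypothetical xdf data source carsXdf created from mtcars and a hypothetical output file name:

```r
library(dplyr)
library(dplyrXdf)  # requires Microsoft R Server's RevoScaleR package

# Hypothetical input: copy the built-in mtcars data frame into an xdf file
carsXdf <- rxDataStep(mtcars, "mtcars.xdf", overwrite = TRUE)

# The last verb in the pipeline writes straight to the final file,
# so no temporary file or follow-up persist() call is involved
output2 <- carsXdf %>%
    mutate(kpl = 0.425 * mpg, .outFile = "output2.xdf")
```

In a longer pipeline, only the final verb would need .outFile; earlier steps can keep the default managed behaviour.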
This omits a redundant file save and copy, thus speeding things up.
The persist verb remains available, for situations where you have already run a pipeline and want to save its output after the fact.
Setting the dplyrXdf working directory
By default, dplyrXdf will save the data files it creates into the R working directory. On some systems, this may be located on a drive or filesystem that is relatively small; this is rarely an issue with open source R, but can be problematic when working with large xdf files. You can now change the location of the xdf tbl directory with the