If you want to get data out of R and into another application or system, simply copying the data as it resides in memory generally isn't an option. Instead you have to serialize the data (into a file, usually), which the other application can then deserialize to recreate the original data. R has several options to serialize data frames:
- You can serialize (export) data to comma-separated values (CSV) files, which can be imported by just about any application. R has several packages to read and write CSVs, including fwrite and fread from the data.table package. The downside is that as an ASCII format, CSV is inefficient, particularly for numeric data.
- The base R function saveRDS (and its deserialization counterpart, readRDS) can write any R object to a file. This is a fairly efficient binary representation of the data, but not many applications can read RDS files.
- The feather package provides the functions read_feather and write_feather, an efficient binary format based on the open Apache Arrow framework.
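The three options above can be sketched in a few lines of R. This is a minimal round-trip example; the file names are arbitrary, and the data.table and feather calls are guarded so the base-R part runs on its own:

```r
df <- data.frame(x = 1:5, y = letters[1:5], stringsAsFactors = FALSE)

# Base R RDS: binary, no extra packages needed
saveRDS(df, "df.rds")
df_rds <- readRDS("df.rds")

# CSV via data.table (text format, widely interoperable)
if (requireNamespace("data.table", quietly = TRUE)) {
  data.table::fwrite(df, "df.csv")
  df_csv <- data.table::fread("df.csv")
}

# Feather (binary, readable from other languages via Apache Arrow)
if (requireNamespace("feather", quietly = TRUE)) {
  feather::write_feather(df, "df.feather")
  df_feather <- feather::read_feather("df.feather")
}
```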
And now there's a new package to add to the list: the fst package. Like the data.table package (the fast data.frame replacement for R), the primary focus of the fst package is speed. The chart below compares the speed of reading and writing data to/from CSV files (with fwrite/fread), feather, fst, and the native R RDS format. The vertical axis is throughput in megabytes per second — more is better. As you can see, fst outperforms the other options for both reading (orange) and writing (green).
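Using fst looks much like the other serialization packages. A minimal sketch, assuming the fst package is installed (note that early versions of the package used the names write.fst/read.fst instead):

```r
if (requireNamespace("fst", quietly = TRUE)) {
  df <- data.frame(x = runif(1e5),
                   y = sample(letters, 1e5, replace = TRUE))

  # Write the data frame to disk and read it back
  fst::write_fst(df, "df.fst")
  df2 <- fst::read_fst("df.fst")

  # fst can also read a subset of rows without loading the whole file
  first_rows <- fst::read_fst("df.fst", from = 1, to = 100)
}
```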
The fst package achieves these impressive benchmarks thanks to the magic of compression. Even in this modern age of fast, solid-state storage, it's still (usually) faster to spend CPU time compressing the data first than to write a larger file to disk. (The same applies to decompression, and because that's an easier task than compression, there are even more performance gains to be had when reading.) The benefits depend on the data itself, though: data without a lot of repetition (or, in the worst case, truly random numerical data) won't see performance gains like this.
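You can see the data-dependence of compression directly. A sketch, assuming the fst package is installed: write_fst takes a compress argument (0–100), and a highly repetitive column shrinks far more than a random one:

```r
if (requireNamespace("fst", quietly = TRUE)) {
  repetitive <- data.frame(x = rep(1:10, 1e5))   # compresses very well
  random     <- data.frame(x = runif(1e6))       # nearly incompressible

  fst::write_fst(repetitive, "rep.fst", compress = 50)
  fst::write_fst(random, "rand.fst", compress = 50)

  # The repetitive file is much smaller on disk than the random one
  file.size("rep.fst")
  file.size("rand.fst")
}
```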
Nonetheless, fst looks like it will be a useful package for applications that need to move data out of one R session and into another as quickly as possible. (The fst format isn't supported by any systems other than R, as far as I know.) The package is still in its early days — the authors warn that the file format is likely to change in the future — but it likely has a place in high-performance R applications that rely on data transfer.
fst package: Lightning Fast Serialization of Data Frames for R
Add brotli compression package to the test as well.
Posted by: MySchizoBuddy | February 03, 2017 at 05:24
What about the compressed/serialized file size?
Posted by: MySchizoBuddy | February 03, 2017 at 05:25
The fst home page provides benchmarks for different methods that include file size.
Posted by: David Smith | February 03, 2017 at 06:28
Thanks, David, for a nice blog on my package fst. In reaction to the comment of MySchizoBuddy: adding brotli compression would be an option, especially for text compression. The built-in dictionary might help achieve higher compression ratios at reasonable speeds — thanks for the suggestion.
Posted by: MarcusKlik | February 06, 2017 at 05:39
I'm surprised to see data.table can write at 20x read - can someone confirm that is accurate?
Also, why aren't all the tests done on the same file (see table size in MB from original fst source website)?
Posted by: Mike Bishop | February 07, 2017 at 11:16