R user John Christie points out a handy feature introduced in R 2.10: you can now read directly from a text file compressed using gzip or other file-compression tools. He notes:
R added transparent decompression for certain kinds of compressed files in the latest version (2.10). If your files are compressed with bzip2, xz, or gzip, they can be read into R as if they were plain text files. The files should have the proper filename extensions.
The command...
myData <- read.table('myFile.gz')
# gzip-compressed files have a "gz" extension
will work just as if 'myFile.gz' were the raw text file.
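The same transparent decompression should apply to the other formats John mentions. For example (a sketch with hypothetical filenames, assuming the files were compressed with bzip2 and xz respectively):
otherData <- read.table('myFile.bz2')   # bzip2-compressed file (hypothetical name)
moreData <- read.table('myFile.xz')     # xz-compressed file (hypothetical name)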
Compressing a large ASCII data file can certainly save disk space: for a file containing mostly numbers, a 50%+ reduction in file size is typical. (John's example of reducing a 100Mb file to 500Kb is surprising to me, though -- perhaps it was binary data?) But does this space saving come at a cost in speed when reading the file back in? On the downside, the CPU has to decompress the file before R can read it. On the other hand, CPUs are pretty fast these days, and perhaps the time required to decompress the file is less than the additional time it would take to read the uncompressed data from disk. Let's try it out and see.
First, let's create a big object in R and write it to a file. To optimize the potential for compression, we'll use purely numeric data.
X <- matrix(rnorm(1e7), ncol=10)
write.table(X, file="bigdata.txt", sep=",", row.names=FALSE, col.names=FALSE)
Next, let's compress the file with gzip (making a copy first, so we can retain the uncompressed version):
system("cp bigdata.txt bigdata-compressed.txt")
system("rm bigdata-compressed.txt.gz")
system("gzip bigdata-compressed.txt")
Compressing the file reduces its uncompressed 182Mb size on disk by a little more than half (about 53.5%):
> compr <- file.info("bigdata-compressed.txt.gz")$size
> big <- file.info("bigdata.txt")$size
> print(c(big, compr))
[1] 181819246 84528258
> print(1-compr/big)
[1] 0.5350973
So now for the acid test: is it quicker to read the compressed or uncompressed file?
> system.time(read.table("bigdata.txt", sep=","))
user system elapsed
170.901 1.996 192.137
> system.time(read.table("bigdata-compressed.txt.gz", sep=","))
user system elapsed
65.511 0.937 66.198
There you go: reading the compressed file is nearly three times faster, despite the time required by the CPU to decompress it first. Impressive! By the way, you get similar (if less striking) results using the low-level (and faster) function scan instead of read.table:
> system.time(scan("bigdata-compressed.txt.gz", sep=",", what=rep(0,10)))
Read 10000000 items
user system elapsed
19.582 0.310 20.071
> system.time(scan("bigdata.txt", sep=",", what=rep(0,10)))
Read 10000000 items
user system elapsed
30.781 0.270 31.369
For comparison, writing the uncompressed file in the first place took about 30 seconds. All of these timings were done on a fairly powerful dual-core MacBook Pro, so as always your mileage may vary. But it does seem that on modern hardware where the CPU performance exceeds the disk performance, compressing files is the way to go.
r-help mailing list: Dec 1 2009 Tip of the Day (John Christie)
For gzip, at least, this has been available in R since considerably before 2.10 - but you did need to use the gzfile connection.
Posted by: Hadley | December 03, 2009 at 10:15
The following works too (in older versions of R):
d <- read.table( gzfile("yourfile.tab"), header=TRUE, sep="\t")
Posted by: Avram | December 03, 2009 at 10:23
I made an error in the tip of the day: I had a 5Mb file, not 500Kb. I get 20:1 compression because the random part of the data is discrete rather than continuous. As well, a lot of the data is highly redundant factor coding and survey data.
(it reads faster too)
R can also write out the gzip file without calling gzip, by using a gzfile connection (see ?connections).
Posted by: John Christie | December 03, 2009 at 10:24
Oh, sweet! I'm compressing my data files with 7zip, though. Is there a connection for 7z-compressed file streams?
Posted by: Michael R. Head | December 03, 2009 at 15:44
Hello,
Compressing files provides some time savings, since the CPU (used to decompress) is much faster than the disk (used to read).
SAS uses this trick when you give it the option to compress your datasets. SAS, however, internally compresses rows, not the whole table. It provides time and space savings when your data has many columns, not just a few.
The main bottleneck in R is that R stores information column-wise, not row-wise (unlike SAS and most DBMSs). Zipped or not, you still need to "transpose" your data, and that is very time consuming.
The "right" approach would be to use a data input system that stores columns... columnwise. That is the spirit of my "colbycol" package on CRAN. I guess there are huge potential improvements if you use R together with a columnwise DBMS such as LucidDB or similar... Just an idea for a future post!
Posted by: Carlos J. Gil Bellosta | December 04, 2009 at 05:55
I've noticed that Stata (.dta) files seem more compact and faster to load into R than either text or workspace (.RData) files. So this post attracted my attention. Below are the results of a comparison of the performance of text, compressed, and Stata files.
First, I tried to replicate the results in the original post. The results were about the same.
> X <- matrix(rnorm(1e7), ncol=10)
# 181.8 MB
> system.time(read.table("bigdata.txt", sep=","))
user system elapsed
161.986 1.646 162.375
# 84.5 MB
> system.time(read.table("bigdata-compressed.txt.gz", sep=","))
user system elapsed
87.909 1.077 88.308
# 181.8 MB
> system.time(scan("bigdata.txt", sep=",", what=rep(0,10)))
Read 10000000 items
user system elapsed
31.860 0.211 31.983
# 84.5 MB
> system.time(scan("bigdata-compressed.txt.gz", sep=",", what=rep(0,10)))
Read 10000000 items
user system elapsed
20.467 0.297 20.704
Second, I computed the loading time for an uncompressed Stata (.dta) file.
> library(foreign)
> write.dta(as.data.frame(X), file = "X.dta", version = 7)
The results are pretty dramatic:
# 80.0 MB
> system.time(read.dta("X.dta"))
user system elapsed
1.414 0.218 1.623
I've gotten similar results with datasets which have both numeric and non-numeric data. I'm guessing the difference may be due to the fact that Stata files are binary. However, at this point, I think that replication of these results would be interesting and helpful.
Posted by: Peter M. Li | December 04, 2009 at 22:33
I think you are being a bit too naive. How can a factor of two in file size explain a factor of three in speed-up?
If I run the experiment several times the difference seems to go away:
> system.time(read.table('bigdata.txt', sep=','))
user system elapsed
169.48 0.55 170.58
> system.time(read.table('bigdata.txt', sep=','))
user system elapsed
96.75 0.22 100.43
> system.time(read.table('bigdata.txt', sep=','))
user system elapsed
106.78 0.89 126.79
> system.time(read.table('bigdata-compressed.txt.gz', sep=','))
user system elapsed
105.89 0.22 110.94
> system.time(read.table('bigdata-compressed.txt.gz', sep=','))
user system elapsed
93.68 0.17 94.04
> system.time(read.table('bigdata-compressed.txt.gz', sep=','))
user system elapsed
97.16 0.16 97.66
Also, if I restart R and do the experiments in the opposite order I get the opposite result:
> system.time(read.table('bigdata-compressed.txt.gz', sep=','))
user system elapsed
163.35 0.37 163.89
> system.time(read.table('bigdata.txt', sep=','))
user system elapsed
89.71 0.28 95.51
I know absolutely nothing about the internal workings of R, but I would guess some kind of library or extension needs to be loaded the first time you run read.table, and what you are seeing is simply the penalty of the first run. Maybe some kind of cache somewhere also influences your result.
I have experienced compressed data speeding up computations in other situations, but I think you need to do more thorough experiments and statistics to show it in this instance.
Posted by: Jesper Nielsen | December 09, 2009 at 10:48
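Following up on Jesper's suggestion of more thorough experiments, here is a minimal sketch of one way to do the comparison, assuming the bigdata.txt and bigdata-compressed.txt.gz files created above are still on disk. Each file is read several times in a randomized order, so that warm-up and disk-cache effects don't all fall on whichever file happens to be read first, and the per-file distributions of elapsed times are compared rather than a single run:
files <- c(plain = "bigdata.txt", gz = "bigdata-compressed.txt.gz")
runs <- sample(rep(names(files), each = 3))   # randomized interleaving of reads
elapsed <- sapply(runs, function(f)
  system.time(read.table(files[[f]], sep = ","))["elapsed"])
tapply(elapsed, runs, summary)                # per-file summaries of elapsed time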