
December 03, 2009

Comments


For gzip, at least, this has been available in R since well before 2.10, but you did need to use the gzfile connection.

The following works too (in older versions of R):
d <- read.table( gzfile("yourfile.tab"), header=TRUE, sep="\t")


I made an error in the tip of the day: I had a 5 MB file, not 500 KB. I get 20:1 compression because the random part of the data isn't continuous but discrete. In addition, a lot of the data is highly redundant factor coding and survey data.

(it reads faster too)

R can write out the gzip file without calling gzip by using a gzip connection (?connections).
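For example, a minimal sketch of writing a tab-delimited table straight to a gzipped file through a gzfile() connection (the file and object names here are just placeholders):

# open a gzip connection for writing, write the table through it, then close it
gz <- gzfile("yourfile.tab.gz", "w")
write.table(d, gz, sep="\t", row.names=FALSE)
close(gz)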

Oh, sweet! I'm compressing my data files with 7zip, though. Is there a connection for 7z-compressed file streams?

Hello,

Compressing files provides some time savings, since the CPU (used to decompress) is much faster than the disk (used to read).

SAS uses this trick when you give it the option to compress your datasets. SAS, however, internally compresses rows, not the whole table. It provides time and space savings when your data has many columns, not just a few.

The main bottleneck in R is that R stores information column-wise, not row-wise (unlike SAS and most DBMSs). Zipped or not, you still need to "transpose" your data, and that is very time-consuming.

The "right" approach would be to use a data input system that stores columns... column-wise. That is the spirit of my "colbycol" package on CRAN. I suspect there are huge potential improvements if you combine R with a column-wise DBMS such as LucidDB or similar... Just an idea for a future post!

I've noticed that Stata (.dta) files seem more compact and faster to load into R than either text or workspace (.RData) files, so this post caught my attention. Below are the results of a comparison of the performance of text, compressed text, and Stata files.

First, I tried to replicate the results in the original post. The results were about the same.

> X <- matrix(rnorm(1e7), ncol=10)

# 181.8 MB
> system.time(read.table("bigdata.txt", sep=","))
user system elapsed
161.986 1.646 162.375

# 84.5 MB
> system.time(read.table("bigdata-compressed.txt.gz", sep=","))
user system elapsed
87.909 1.077 88.308

# 181.8 MB
> system.time(scan("bigdata.txt", sep=",", what=rep(0,10)))
Read 10000000 items
user system elapsed
31.860 0.211 31.983

# 84.5 MB
> system.time(scan("bigdata-compressed.txt.gz", sep=",", what=rep(0,10)))
Read 10000000 items
user system elapsed
20.467 0.297 20.704

Second, I computed the loading time for an uncompressed Stata (.dta) file.

> library(foreign)
> write.dta(as.data.frame(X), file = "X.dta", version = 7)

The results are pretty dramatic:

# 80.0 MB
> system.time(read.dta("X.dta"))
user system elapsed
1.414 0.218 1.623

I've gotten similar results with datasets that contain both numeric and non-numeric data. I'm guessing the difference may be due to the fact that Stata files are binary. At this point, though, I think replication of these results would be interesting and helpful.
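For anyone who wants to try that replication, here is a rough sketch (file names are just examples) that writes the same matrix to all three formats and times reading each back in:

library(foreign)
X <- matrix(rnorm(1e7), ncol=10)
# plain text copy
write.table(X, "bigdata.txt", sep=",", row.names=FALSE, col.names=FALSE)
# gzipped copy written through a gzfile() connection
gz <- gzfile("bigdata-compressed.txt.gz", "w")
write.table(X, gz, sep=",", row.names=FALSE, col.names=FALSE)
close(gz)
# Stata copy
write.dta(as.data.frame(X), file="X.dta", version=7)

system.time(read.table("bigdata.txt", sep=","))
system.time(read.table("bigdata-compressed.txt.gz", sep=","))
system.time(read.dta("X.dta"))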

I think you are being a bit too naive. How can a factor of two in file size explain a factor of three in speed-up?

If I run the experiment several times the difference seems to go away:

> system.time(read.table('bigdata.txt', sep=','))
user system elapsed
169.48 0.55 170.58
> system.time(read.table('bigdata.txt', sep=','))
user system elapsed
96.75 0.22 100.43
> system.time(read.table('bigdata.txt', sep=','))
user system elapsed
106.78 0.89 126.79
> system.time(read.table('bigdata-compressed.txt.gz', sep=','))
user system elapsed
105.89 0.22 110.94
> system.time(read.table('bigdata-compressed.txt.gz', sep=','))
user system elapsed
93.68 0.17 94.04
> system.time(read.table('bigdata-compressed.txt.gz', sep=','))
user system elapsed
97.16 0.16 97.66

Also, if I restart R and do the experiments in the opposite order I get the opposite result:

> system.time(read.table('bigdata-compressed.txt.gz', sep=','))
user system elapsed
163.35 0.37 163.89
> system.time(read.table('bigdata.txt', sep=','))
user system elapsed
89.71 0.28 95.51

I know absolutely nothing about the internal workings of R, but I would guess some kind of library or extension needs to be loaded the first time you run read.table, and what you are seeing is simply the penalty of the first run. Maybe some kind of cache somewhere also influences your result.

I have experienced compressed data speeding up computations in other situations, but I think you need to do more thorough experiments and statistics to show it in this instance.
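One way to make the comparison fairer: repeat the reads in random order and compare medians, so that no single file always pays the cold-cache penalty of the first run. A rough sketch (five repetitions per file, file names as in the timings above):

# interleave the two files in random order, time each read, compare medians
files <- sample(rep(c("bigdata.txt", "bigdata-compressed.txt.gz"), each=5))
elapsed <- sapply(files, function(f) system.time(read.table(f, sep=","))["elapsed"])
tapply(elapsed, files, median)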
