In a post earlier this month, it seemed as though compressing a data file before reading it into R could save you some time. With some feedback from readers and further experimentation, we might need to revisit that conclusion.
To recap, in our previous experiment it took 170 seconds to read a 182 MB text file into R. But if we compressed the file first, it only took 65 seconds. Apparently, the benefits of reducing the amount of disk access (by dealing with a smaller file) far outweighed the CPU time required to decompress the file for reading.
In that experiment, though, each file was only read once. If you simply repeat the read statement on the uncompressed file, you see a sudden decrease in the time required to read it:
> system.time(read.table("bigdata.txt", sep=","))
user system elapsed
165.042 1.316 165.807
> system.time(read.table("bigdata.txt", sep=","))
user system elapsed
94.248 0.934 94.673
(This was on MacOS, using the R GUI. I also tried using R from the terminal on MacOS, and also from the R GUI in Windows, using both regular R and REvolution R. There were some slight variations in the timings, but in general I got similar results.)
So what's going on here (other than my embarrassing failure as a statistician to replicate my measurements the first time round)? One possibility is that we're seeing the effects of disk cache: when you access data on a hard drive, most modern drives will temporarily store some of the data in high-speed memory. This makes it faster to access the file in subsequent attempts, for as long as the file data remains in the cache. But that doesn't explain why we don't see a similar speedup in repeated readings of the compressed file:
> system.time(read.table("bigdata-compressed.txt.gz", sep=","))
user system elapsed
89.464 0.868 90.436
> system.time(read.table("bigdata-compressed.txt.gz", sep=","))
user system elapsed
97.651 1.035 98.887
I'd expect the second reading to be faster if disk cache had an effect, so I don't think disk cache is the culprit here. More revealing is the fact that the first use of read.table in any R session takes longer than subsequent ones. Reading from the gzipped file is slower than reading from the uncompressed file if it's the first read of the session:
> system.time(read.table("bigdata-compressed.txt.gz", sep=","))
user system elapsed
150.429 1.304 152.447
> system.time(read.table("bigdata.txt", sep=","))
user system elapsed
78.717 0.986 79.773
(This was using R from the terminal under MacOS; I got similar results using the R GUI on MacOS.) So what's going on here? I don't have a good explanation, frankly. Maybe the additional time is required by R to load libraries or to page in the R executable (but why would it scale with the file size, then?). Note that we got the speed benefits from reading the uncompressed file second, which rules out disk cache having any significant benefits. If anyone has any good explanations, I'd love to hear them.
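For anyone who wants to check the first-read effect on their own machine, here is a minimal sketch of the repeated-read measurement. The helper name `time_reads` and the small generated file are my own stand-ins (the original test used a 182 MB file), so treat the numbers it prints as illustrative only.

```r
# Hypothetical helper (not from the original experiment): time n
# consecutive read.table() calls on the same file and return the
# elapsed seconds for each call, in order.
time_reads <- function(path, n = 3, ...) {
  vapply(seq_len(n), function(i) {
    system.time(read.table(path, ...))[["elapsed"]]
  }, numeric(1))
}

# Small stand-in data set so the sketch runs quickly.
set.seed(1)
demo <- as.data.frame(matrix(rnorm(1e5), ncol = 10))
txt_path <- tempfile(fileext = ".txt")
write.table(demo, txt_path, sep = ",", row.names = FALSE, col.names = FALSE)

# The first element is the first read; compare it with the rest.
elapsed <- time_reads(txt_path, n = 3, sep = ",")
print(elapsed)
```

To see a first-read penalty, this has to be run in a fresh R session, since any earlier call to read.table would already have paid it.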
So what file type is the fastest for reading into R? Reader Peter M. Li took a much more systematic approach to answering that question than I did, running fifty trials for compressed and uncompressed files using both read.table and scan. (We can safely assume that this level of replication nullifies any first-read or caching effects.) He also tested Stata files (an open, binary data file format that R can both read and write). Peter also tested different file sizes for each file type, with files containing one thousand, 10 thousand, 100 thousand, one million and 10 million observations. His results are summarized in the graph below, with log(file size) on the X axis and log(time to read) on the Y axis:
So, what can we conclude from all of this? Let's see:
- In general, compressing your text data doesn't speed up reading it into R. If anything, it's slower.
- The only time compressing files might be beneficial is for large files read with read.table (but not scan).
- There's a speed penalty the first time you use read.table in an R session.
- Reading data from Stata files has significant performance benefits compared to text-based files.
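A scaled-down sketch of this kind of comparison is below. It is not Peter's code: the file size, the number of trials, and the helper `median_elapsed` are assumptions chosen so that it runs in a few seconds, and it covers only read.table and scan on plain and gzipped text (no Stata files).

```r
# Generate a small numeric data set and write it as plain and gzipped text.
set.seed(42)
demo <- matrix(rnorm(1e4 * 10), ncol = 10)

txt_path <- tempfile(fileext = ".txt")
gz_path  <- tempfile(fileext = ".txt.gz")
write.table(demo, txt_path, sep = ",", row.names = FALSE, col.names = FALSE)
con <- gzfile(gz_path, "w")
write.table(demo, con, sep = ",", row.names = FALSE, col.names = FALSE)
close(con)

# The two readers being compared; both accept a gzipped file path directly.
readers <- list(
  read.table = function(path) read.table(path, sep = ","),
  scan       = function(path) scan(path, what = numeric(), sep = ",", quiet = TRUE)
)

# Hypothetical helper: median elapsed time over several trials, so that
# a slow first read does not dominate the result.
median_elapsed <- function(reader, path, trials = 5) {
  median(replicate(trials, system.time(reader(path))[["elapsed"]]))
}

paths <- c(txt = txt_path, gz = gz_path)
results <- expand.grid(reader = names(readers), file = names(paths),
                       stringsAsFactors = FALSE)
results$seconds <- mapply(
  function(r, f) median_elapsed(readers[[r]], paths[[f]]),
  results$reader, results$file, USE.NAMES = FALSE
)
print(results)
```

Increasing the number of rows and trials should bring it closer to the original setup, at the cost of a much longer run.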



I'm the original poster of this idea to R help (originally just meant for our local user group) and I reiterate the claim that the read is faster for compressed files, but it obviously must be qualified.
I knew of exactly the kinds of problems listed here and so was more than aware that I had to replicate my results, and I did. For the data I'm working with right now I get about a 50% slowdown with the uncompressed file (4.6 s v. 6.8 s).
However, my real data file, as opposed to random numbers (the one I'm working with now and just retested on), is 28 MB uncompressed and 1.6 MB compressed. Those are much larger differences than you're getting with randomly generated datasets. That's because if randomness increases then compression (or compressibility) decreases. Compression relies on non-random properties of data. So, real data with factors of various kinds and sorting of various kinds is not random across all those properties (even if the response variable is). It compresses more and the advantage appears.
It will also depend on the particular computer. I regularly use a laptop. It only has a slightly slower CPU computation capability than a desktop machine but has much slower disk access. I'm thinking a 30% speedup might turn into a wash on a desktop.
I tried it with random data as well, but generated it a different way...
x <- matrix(sample(1:3, 1e7, TRUE), ncol=100)
That file sees less compression than I got (20 MB txt v. 3.4 MB gzipped). But, it's more than you get from rnorm because, even though it's random, it has more redundancies. That also loads faster for me compressed (5.2 gz v. 6.6 txt).
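The difference in compressibility can be checked with a short sketch like the one below. The helper `size_pair` is made up for illustration, and the data are 100 times smaller than the 1e7 values used above so that it runs quickly; the absolute sizes will differ, but the ratios should point the same way.

```r
# Hypothetical helper: write a matrix as text and as gzipped text,
# and return both file sizes in bytes.
size_pair <- function(x) {
  txt_path <- tempfile(fileext = ".txt")
  gz_path  <- tempfile(fileext = ".txt.gz")
  write.table(x, txt_path, row.names = FALSE, col.names = FALSE)
  con <- gzfile(gz_path, "w")
  write.table(x, con, row.names = FALSE, col.names = FALSE)
  close(con)
  c(txt = file.size(txt_path), gz = file.size(gz_path))
}

set.seed(1)
n <- 1e5
few_values  <- matrix(sample(1:3, n, TRUE), ncol = 100)  # random, but highly redundant
many_values <- matrix(rnorm(n), ncol = 100)              # random, little redundancy

sizes <- rbind(sample_1to3 = size_pair(few_values),
               rnorm       = size_pair(many_values))
ratio <- sizes[, "txt"] / sizes[, "gz"]  # larger ratio means better compression
print(cbind(sizes, ratio))
```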
If you end up with lots of these files, the space savings, and the faster transport to others (regardless of method), make the gz option a no-brainer for me. It's much easier to email a 1.6 MB file than a 28 MB one. For others, on machines with RAID arrays and no issues with storage or transferring files, uncompressed might be the way to go.
It's great to see the dta as yet another option but it's probably more general than that. I'm sure other formats that contain information about the data within them and that use more compact, but not compressed, binary formats, will also be faster to read.
John Christie
Posted by: JC | December 29, 2009 at 09:16
So, after looking at my last statement in my prior post I decided to test load() and save() commands. If I'm going to make it a binary format perhaps R's native object format would be a good one to select.
It turns out that, as per usual, the binary format is more compact. I noted that, for randomly generated integers from a small sequence, the difference between rdata and gzipped wasn't very large. I used the same data generated in my prior post from the sample command and the rdata file is 3.6MB v. 3.4MB gz. However, the speedup was dramatic in reading. It is now 1.33 seconds down from 5.2.
So, I decided to test this on my real data. Recall that that data compressed much more than random data. But, it's possible an rdata file will be even smaller. That's because R already internally compresses factors in an intelligent way and I had many factors. The results are again compelling in favour of load() and save(). My original read times were 6.8s txt, and 4.5s gz. Also, the file sizes were 28MB and 1.6MB. Now, the rdata file is 0.258 MB and the read time 0.6s!!!!
I'm pretty convinced that it's unlikely you'll beat the native R data format if you want to go binary for both size and speed. If you care about pure text (like ensuring you can read in other programs) then compression may give you benefits.
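A minimal sketch of that save()/load() comparison, assuming a smaller version of the sample() data from my prior post (1e5 values instead of 1e7) and temporary file names chosen for illustration:

```r
set.seed(1)
x <- matrix(sample(1:3, 1e5, TRUE), ncol = 100)

# Write the same object as plain text and in R's native format.
txt_path   <- tempfile(fileext = ".txt")
rdata_path <- tempfile(fileext = ".RData")
write.table(x, txt_path, row.names = FALSE, col.names = FALSE)
save(x, file = rdata_path)

# Time each read; load() restores into a separate environment so the
# original x is not overwritten.
t_txt   <- system.time(read.table(txt_path))[["elapsed"]]
loaded  <- new.env()
t_rdata <- system.time(load(rdata_path, envir = loaded))[["elapsed"]]

print(c(txt_bytes = file.size(txt_path), rdata_bytes = file.size(rdata_path)))
print(c(txt_seconds = t_txt, rdata_seconds = t_rdata))
```

With data this small the timings are noisy, so the file sizes are the more reliable part of the output.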
Posted by: JC | December 29, 2009 at 09:28
You'd have to go to the code to find out for sure, but my guess would be some sort of system memory allocation penalty that gets paid on the first read.table, and then on subsequent calls you don't need to malloc, just return a pointer to a (conveniently sized) chunk of RAM you haven't released back to the system yet.
Posted by: Byron | December 29, 2009 at 13:59
How often does one read the same data file more than once in a session?
Posted by: GB | December 29, 2009 at 22:57