
December 29, 2009

Comments


I'm the original poster of this idea to R-help (it was originally just meant for our local user group), and I reiterate the claim that reads are faster for compressed files, but it obviously must be qualified.

I knew of exactly the kinds of problems listed here, so I was more than aware that I had to replicate my results, and I did. For the data I'm working with right now I get about a 50% slowdown with the uncompressed file (4.6 s compressed vs. 6.8 s uncompressed).
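
In case anyone wants to run the same comparison themselves, something along these lines works (the file names and header setting here are just placeholders; read.table() will read through a gzfile() connection):

system.time(read.table("mydata.txt", header=TRUE))            # uncompressed text
system.time(read.table(gzfile("mydata.txt.gz"), header=TRUE)) # gzipped text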

However, my real data file (the one I'm working with now and just retested on), as opposed to random numbers, is 28 MB uncompressed and 1.6 MB compressed. Those are much larger differences than you're getting with randomly generated datasets. That's because as randomness increases, compressibility decreases: compression relies on non-random properties of the data. Real data, with factors of various kinds and sorting of various kinds, is not random across all those properties (even if the response variable is), so it compresses more and the advantage appears.

It will also depend on the particular computer. I regularly use a laptop, which has only slightly slower CPU performance than a desktop machine but much slower disk access. I'm thinking a 30% speedup on the laptop might turn into a wash on a desktop.

I tried it with random data as well, but generated it a different way...

x <- matrix(sample(1:3, 1e7, TRUE), ncol=100)  # 1e7 integers drawn from 1:3, in 100 columns

That file compresses less than my real data did (20 MB txt vs. 3.4 MB gzipped), but more than what you get from rnorm() because, even though it's random, it has more redundancy. It also loads faster for me compressed (5.2 s gz vs. 6.6 s txt).
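
Roughly, that comparison looks like this, using the x generated above (the file names are placeholders; write.table() and read.table() both accept a gzfile() connection):

write.table(x, "x.txt", row.names=FALSE, col.names=FALSE)             # plain text on disk
write.table(x, gzfile("x.txt.gz"), row.names=FALSE, col.names=FALSE)  # gzipped text on disk
file.info(c("x.txt", "x.txt.gz"))$size                                # compare the file sizes
system.time(read.table("x.txt"))                                      # time the uncompressed read
system.time(read.table(gzfile("x.txt.gz")))                           # time the compressed read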

If you end up with lots of these files, the space savings and the faster transfer to others (regardless of method) make the gz option a no-brainer for me. It's much easier to email a 1.6 MB file than a 28 MB one. For others, on machines with RAID arrays and no issues with storage or file transfer, uncompressed might be the way to go.

It's great to see the .dta format as yet another option, but the point is probably more general than that. I'm sure other formats that carry information about the data within them, and that use a more compact (but not compressed) binary representation, will also be faster to read.

John Christie

So, after looking at the last statement in my prior post, I decided to test the load() and save() commands. If I'm going to use a binary format, perhaps R's native object format would be a good one to select.

It turns out that, as per usual, the binary format is more compact. For randomly generated integers from a small sequence, the difference between the .rdata file and the gzipped text isn't very large: using the same data generated with sample() in my prior post, the .rdata file is 3.6 MB vs. 3.4 MB gz. However, the speedup in reading was dramatic: it is now 1.33 seconds, down from 5.2.

So I decided to test this on my real data. Recall that that data compressed much more than the random data, but it's possible an .rdata file will be even smaller, because R already stores factors internally in a compact, intelligent way and I had many factors. The results are again compelling in favour of load() and save(). My original read times were 6.8 s txt and 4.5 s gz, and the file sizes were 28 MB and 1.6 MB. Now the .rdata file is 0.258 MB and the read time is 0.6 s!
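
Roughly, the save()/load() side of that comparison looks like this (the object and file names are placeholders; save() compresses its output by default, which presumably helps explain how small the .rdata file ends up):

save(d, file="mydata.rdata")        # d is the data frame read in earlier (placeholder name)
file.info("mydata.rdata")$size      # compare with the txt and gz file sizes
system.time(load("mydata.rdata"))   # load() restores the object under its original name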

I'm pretty convinced that you're unlikely to beat R's native data format for both size and speed if you want to go binary. If you care about keeping pure text (like ensuring you can read the data in other programs), then compression may give you benefits.

You'd have to go to the code to find out for sure, but my guess would be some sort of system memory allocation penalty that gets paid on the first read, and then on subsequent calls you don't need to malloc; you just get back a pointer to a (conveniently sized) chunk of RAM that hasn't been released back to the system yet.

How often does one read the same data file more than once in a session?

