(Hadley Wickham, author of ggplot2 and several other R packages, guest blogs today about forthcoming big-data improvements to his R graphics package -- ed.)
Hi! I'm Hadley Wickham and I'm guest posting on the Revolutions blog to give you a taste of some of the visualisation work that my research team and I worked on this summer. This work has been generously funded by Revolution Analytics and while, as you'll see, it works particularly well with RevoScaleR, it's also contributing to changes that will help all ggplot2 users.
This summer three undergrads (James Rigby, Jonathan Stewart and Hyun Bin Kang) and one grad student, Ben White, have been working on the answer to an important question: how can we make a scatterplot work when you have hundreds of millions of points? Scatterplots are one of the most important tools of exploratory data analysis, but they start to break down even with relatively small datasets because of overplotting: you can't tell how many points are plotted at each location. They also get slower and slower the more points you try to draw.
The answer to both these problems is relatively simple: instead of plotting the raw data, plot densities, or distributions. These can be generated simply (by binning and counting the number of points in each bin), or with more sophistication (by smoothing the bin counts to get a kernel density estimate). RevoScaleR makes this process incredibly fast: you can bin tens of millions of observations in a few seconds on commodity hardware, and a kernel density estimate only takes a fraction more to compute.
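To make the two steps concrete, here is a minimal sketch of binning followed by smoothing. It is written in plain Python rather than R, and it is not the RevoScaleR or ggplot2 implementation: the function names bin_2d and smooth_counts, the bin widths, the Gaussian kernel and the simulated data are all assumptions chosen for illustration.

```python
import math
import random
from collections import Counter


def bin_2d(xs, ys, x_width, y_width):
    """Count points per rectangular bin; keys are integer (column, row) indices."""
    counts = Counter()
    for x, y in zip(xs, ys):
        counts[(math.floor(x / x_width), math.floor(y / y_width))] += 1
    return counts


def smooth_counts(counts, bandwidth=1.0, radius=3):
    """Spread each bin count over nearby bins with a truncated Gaussian kernel.

    bandwidth and radius are measured in bins (hypothetical defaults).
    The kernel weights are normalised, so the total count is preserved.
    """
    offsets = range(-radius, radius + 1)
    weights = {
        (dx, dy): math.exp(-(dx * dx + dy * dy) / (2 * bandwidth ** 2))
        for dx in offsets
        for dy in offsets
    }
    total_weight = sum(weights.values())
    smoothed = Counter()
    for (i, j), n in counts.items():
        for (dx, dy), w in weights.items():
            smoothed[(i + dx, j + dy)] += n * w / total_weight
    return smoothed


# Simulated data standing in for a large dataset (an assumption, not the post's data).
random.seed(1)
xs = [random.gauss(0, 1) for _ in range(100_000)]
ys = [random.gauss(0, 1) for _ in range(100_000)]

counts = bin_2d(xs, ys, x_width=0.25, y_width=0.25)
density = smooth_counts(counts)
```

The point of the design is that everything after bin_2d works on the bin counts, not the raw points, so the cost of smoothing and plotting depends on the number of bins rather than the number of observations.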
Once you have the density, what can you do with it? The following plots show two of the ideas that we came up with. The examples show the diamonds data set from ggplot2, but the beauty of these techniques is that they'll work regardless of how much data you have - the extra complexity is taken care of in the density computation stage.
A scatterplot of depth and table coloured by the z dimension is uninformative because of the extreme amount of overplotting:
One way to make it better is to bin and smooth in 3d and then plot the conditional distribution of z at multiple values of depth and table.
The shape shows the distribution of z, and the colour shows the total density at that location -- higher values mean there are more data points in that location. This plot reveals much more than the previous one: most of the data points are concentrated near depth 56 and table 60 where the distribution of z is skewed towards smaller values. As depth increases, the average value of z also seems to increase.
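As a rough illustration of what "the conditional distribution of z at each location" means, the sketch below bins on two variables and keeps a normalised histogram of the third in each cell, together with the cell's total count. It is plain Python, it skips the smoothing step, and the function name conditional_histograms, the bin widths, the z intervals and the four data points are all made up for the example.

```python
import bisect
import math
from collections import defaultdict


def conditional_histograms(xs, ys, zs, x_width, y_width, z_breaks):
    """For each (x, y) cell return (total count, share of points per z interval).

    z_breaks are the sorted edges of the z intervals; each interval is
    closed on the left and open on the right.
    """
    n_z = len(z_breaks) - 1
    cells = defaultdict(lambda: [0] * n_z)
    for x, y, z in zip(xs, ys, zs):
        if not (z_breaks[0] <= z < z_breaks[-1]):
            continue  # ignore z values outside the chosen range
        k = bisect.bisect_right(z_breaks, z) - 1
        cells[(math.floor(x / x_width), math.floor(y / y_width))][k] += 1
    result = {}
    for cell, hist in cells.items():
        total = sum(hist)
        result[cell] = (total, [h / total for h in hist])
    return result


# Four invented points (depth, table, z); not taken from the diamonds data.
depth = [56.1, 56.4, 56.8, 61.2]
table = [60.2, 60.5, 60.9, 57.3]
z = [2.5, 2.9, 4.1, 3.5]

out = conditional_histograms(depth, table, z, 1.0, 1.0, z_breaks=[2, 3, 4, 5])
```

In a plot like the one described above, the list of shares would give the shape drawn in each cell and the total would drive the colour.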
A scatterplot of carat vs. price coloured by the colour of the diamond:
There is some hint that J colours are relatively cheaper (for a given size they have lower prices) but it's hard to see anything else because of the overplotting. Binning the data and displaying the distribution of colour in each bin makes the important patterns much easier to see.
Colours D, E and F are on the more expensive side, and H, I and particularly J are cheaper. Only bins containing more than 100 points are included to avoid drawing the eye to regions with little data.
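The same idea with a categorical variable can be sketched as follows: bin on two continuous variables, work out the share of each category within every bin, and drop bins that do not have more than 100 points. This is plain Python, not the code behind the plot; the function name category_shares, the bin widths and the synthetic points are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict


def category_shares(xs, ys, labels, x_width, y_width, min_count=100):
    """Share of each label per (x, y) bin, keeping bins with more than min_count points."""
    cells = defaultdict(Counter)
    for x, y, label in zip(xs, ys, labels):
        cells[(math.floor(x / x_width), math.floor(y / y_width))][label] += 1
    shares = {}
    for cell, tally in cells.items():
        total = sum(tally.values())
        if total > min_count:
            shares[cell] = {label: n / total for label, n in tally.items()}
    return shares


# Synthetic points (carat, price, colour): 150 in one bin, 50 in another.
carat = [0.3] * 90 + [0.4] * 60 + [2.1] * 50
price = [500] * 90 + [700] * 60 + [15000] * 50
colour = ["D"] * 90 + ["J"] * 60 + ["J"] * 50

shares = category_shares(carat, price, colour, x_width=0.5, y_width=1000)
```

Here the sparse bin is dropped and only the bin with 150 points survives, which mirrors the filtering described above.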
I'm currently working with another student, Yue Hu, to turn our research into a robust R package.
Great post Hadley. These are really some innovative and creative plots.
On the first plot: I love the spark lines. They give the viewer a picture of the whole conditional distribution of z in each cell. What about overlaying the spark lines over a heat map in order to better show trends in central tendency? I guess you would lose the ability to show cell density though...
On the second plot: The total count is difficult to see. Instead of shading just the border of the cell, might it be easier to see if the background of the cell was shaded instead? Or, you could just show the unnormalized bar charts.
Posted by: Ian Fellows | October 21, 2011 at 11:09
Oh, yes, please.
One question/comment on the second plot. Is the bottom row center item the only one with a high count, and if so, which count bin does it fall in? The outlining just doesn't show that for me.
Posted by: Richard Careaga | October 21, 2011 at 19:22
It would be nice to see zoo among these packages.
Posted by: skan | October 26, 2011 at 09:19
Thanks for these plots. Can someone provide or point to the code that created the continuous-density plot?
thanks
Posted by: Peter Crosta | December 02, 2011 at 10:24
This looks great! Any news on this? Is there a package available now providing this functionality?
Posted by: Markus Spatz | November 30, 2012 at 05:12
How to create the last graph in R is outlined at
http://stackoverflow.com/questions/7714677/r-scatterplot-with-too-many-points/16122003#16122003
Posted by: MM | April 21, 2013 at 15:29