*Today's guest post comes from Garrett Grolemund, a software developer at RStudio. (Ed.)*

I think of graphs as a type of visual summary for data. Yet I rarely see graphs used this way within visualizations. Consider tile plots. They group data into 2d bins and then summarize each group with a number. This approach is a go-to tool for understanding overplotted data, but it discards a lot of information. Since we’re already using graphs, why not summarize the data in each bin visually? In the same space that we devote to a single colored tile, we can draw a subplot that retains enough information to display interesting patterns. Take, for example, this visualization of the WikiLeaks Afghanistan War Diary. It replaces each tile with a bar graph that shows the number of casualties by type for the specified region. We still get a sense of where the highest frequencies of casualties occur, but we can also see trends. For example, civilian casualties outnumber combatant casualties in the capital city of Kabul.
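To make the contrast concrete, here is a minimal sketch in base R of the data behind each kind of display. The `events` data frame, its columns, and the grid size are invented for illustration (they are not the War Diary data): a tile plot keeps one number per bin, while an embedded bar graph keeps a small table of counts per bin.

```r
# Hypothetical data: 1,000 events, each with a location and a category.
set.seed(1)
n <- 1000
events <- data.frame(
  x    = runif(n, 0, 10),
  y    = runif(n, 0, 10),
  type = sample(c("a", "b", "c"), n, replace = TRUE)
)

# Assign each event to a cell of a 5 x 4 grid of 2D bins.
events$xbin <- cut(events$x, breaks = seq(0, 10, length.out = 6),
                   include.lowest = TRUE)
events$ybin <- cut(events$y, breaks = seq(0, 10, length.out = 5),
                   include.lowest = TRUE)

# Tile plot: a single number per bin (here, the number of events).
tile_counts <- table(events$xbin, events$ybin)

# Embedded plot: a small table per bin (counts by type), which is
# what a bar-graph subplot drawn in that bin would display.
subplot_counts <- table(events$xbin, events$ybin, events$type)

dim(tile_counts)     # 5 4
dim(subplot_counts)  # 5 4 3
```

Summing `subplot_counts` over its third dimension recovers `tile_counts`, so the embedded version carries everything the tile plot shows plus the breakdown within each bin.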

I refer to this type of chart as an embedded plot, because it embeds subplots into a larger plot. Embedded subplots have been around for a long time. Charles Minard was embedding pie graphs into maps of France as early as 1862. Glyph plots, facetted plots, and other exotic graphs also rely on principles of embedded plots. However, embedded subplots seem to be under-utilized when we consider how useful they can be.

First, embedded plots can make patterns clear in the presence of overplotting. The diamonds data set in ggplot2 contains nearly 54,000 observations. If we try to explore the data with a scatterplot, points occlude each other and hide patterns. Binning with embedded plots makes patterns visible, and this would be true even if the data contained 100,000, a million, or even a trillion points.

Embedded plots are also useful for displaying spatio-temporal data, as in this illustration of daily temperatures in the western hemisphere from 1995-201. Because embedded plots provide additional axes we can plot longitude (x), latitude (y), and time (theta) and still have graphical power left over for variables of interest. Here daily temperature is mapped to r and the mean temperature for each region is mapped to the fill color.
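As a rough illustration of how the inner axes nest inside the outer ones, the base R sketch below places one polar subplot at a given longitude and latitude. The temperature values, the cell position, and the min-max scaling of the radius are all assumptions made for the example; ggsubplot may scale and position its subplots differently.

```r
# Hypothetical monthly temperatures (degrees C) for a single grid cell.
temps  <- c(24, 25, 26, 27, 28, 28, 27, 27, 26, 26, 25, 24)
months <- seq_along(temps)

# Outer axes: where the subplot sits in the larger plot (assumed values).
lon <- -100
lat <- 20
cell_radius <- 1  # largest radius the subplot may occupy, in degrees

# Inner axes: time is mapped to the angle, temperature to the radius.
theta <- 2 * pi * (months - 1) / length(months)
r <- cell_radius * (temps - min(temps)) / (max(temps) - min(temps))

# Convert inner polar coordinates to outer plot coordinates.
# sin() for x and cos() for y starts at 12 o'clock and runs clockwise.
glyph <- data.frame(
  month = months,
  x = lon + r * sin(theta),
  y = lat + r * cos(theta)
)
```

Each row of `glyph` is a point in the coordinates of the outer map, so four variables (longitude, latitude, time, temperature) end up on a two-dimensional page, with fill color still free for a fifth.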

Embedded plots can also show multidimensional relationships and interaction effects. The same subplots created above can be reorganized on new axes to show the relationship of seasonality to maximum and minimum temperatures. Surprisingly, the hottest places in the western hemisphere are not those near the equator.

Embedded plots may be under-used because they are difficult to make. Some programs, like Gauguin or the lattice and ggplot2 packages in R, can make one or two specific types of embedded plots, but this doesn’t leave much flexibility when exploring a complicated data set. Things do not have to be this way. Embedded plots fit into the grammar of graphics quite nicely if we recognize that geoms are (very simple) subplots and subplots are (somewhat sophisticated) geoms. This realization creates some tantalizing insights about graphics. For example, graphs are hierarchical, or recursive. Also, facets are a type of geom (subplots) plotted against two categorical variables.

The ggsubplot package extends ggplot2 to allow subplots to be used as a geom. Each of the graphs above was made with ggsubplot, and the ggsubplot syntax closely follows that of ggplot2. For example, the graph of Afghanistan above is made with the following code (plus regular ggplot2 methods for tweaking color palettes and appearance):

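```
library(ggplot2)
library(ggsubplot)
# map_afghanistan is assumed to be an existing ggplot2 layer that draws the map
ggplot(casualties, aes(lon, lat)) +
  map_afghanistan +
  geom_subplot2d(aes(subplot = geom_bar(aes(victim, ..count.., fill = victim))),
    bins = c(15, 12), ref = NULL, width = rel(0.8), height = rel(1))
```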
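The earlier remark that facets are subplots plotted against two categorical variables can be sketched without any plotting package. The example below uses the `mtcars` data that ships with R, with `cyl` and `gear` standing in for the two categorical variables; the choice of data set and the `panel_summary` name are illustrative only.

```r
# Split the data into one piece per (cyl, gear) combination.
# drop = TRUE discards combinations that contain no rows.
cells <- split(mtcars, list(mtcars$cyl, mtcars$gear), drop = TRUE)

# Each element of `cells` is the data for one subplot, and its name
# ("cyl.gear") gives its position on the categorical grid.
panel_summary <- data.frame(
  panel     = names(cells),
  n         = vapply(cells, nrow, integer(1)),
  mean_mpg  = vapply(cells, function(d) mean(d$mpg), numeric(1)),
  row.names = NULL
)

length(cells)  # 8 non-empty panels out of 3 x 3 possible
```

Seen this way, a facet panel and a binned subplot are the same kind of object: a subset of the data drawn at a position determined by two grouping variables, categorical in one case and binned in the other.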

Complete details about the ggsubplot package can be found in the package vignette. ggsubplot is available from CRAN. Development versions of ggsubplot can be obtained at GitHub.

Now for a word of caution: I would not recommend using embedded plots when a simpler graph would suffice. They can be hard to interpret. But when embedded plots are necessary, use them with confidence. They do not violate any data-to-ink ratio; embedded plots increase data in proportion to ink. And they organize multiple levels of information in an admirably intuitive way. Embedded plots take a little longer to comprehend than simpler graphs, but they also contain more data to be comprehended. Once a viewer has processed all of the relevant information, embedded plots display patterns with the same "interocular impact" that Tukey prized in simpler graphs. In return for a little patience, embedded plots make it easy to see relationships that would be difficult or impossible to perceive otherwise.

Garrett Grolemund has recently left academia to develop software and course content for RStudio. With his dissertation adviser Hadley Wickham, he has worked to refine and promote R, an open-source computer language used for statistical computing and graphics. Grolemund’s research focuses on data analysis, statistical computation, statistics education and visualization. With Wickham, he co-authored the lubridate R package which provides methods to parse, manipulate, and do arithmetic with date-times. Grolemund earned a B.A. in psychology and a master’s degree in statistics, both in 2003, from Harvard University and a PhD in statistics from Rice University in 2012. He spent a year as a teaching fellow at Harvard University, another year as a clinical trials coordinator at Massachusetts General Hospital and, before coming to Rice, a year as a researcher at the UCLA School of Law Library. At Rice he has taught such classes as Statistics 405: “Introduction to Data Analysis,” and “Visualization in R with ggplot2.”

For base graphics the TeachingDemos package has had the subplot and my.symbols functions for a few years.

Posted by: Greg Snow | September 14, 2012 at 08:39

Hello Garrett,

This is an excellent approach to plotting large data sets. The data for the War Diary on the WikiLeaks link seems to be down. Would it be possible for you to post the .csv of the data so I can try out your examples?

Thanks,

John

Posted by: John Thompson | September 15, 2012 at 10:21

Great Stuff! Would you mind posting the code you wrote? (Unless I've missed it!).

Posted by: Isaiah | September 15, 2012 at 16:05

John,

There is a cleaned version of the data set in the ggsubplot package. It's saved as casualties.

Isaiah,

I put the code here for you: http://github.com/garrettgman/ggsubplot/issues/2

I also updated the package vignette (in the development version) to include the missing graphs. It is here: http://github.com/garrettgman/ggsubplot/blob/master/inst/doc/manual.pdf (click "View Raw" to download)

Posted by: Garrett | September 18, 2012 at 06:58

Garrett,

Nice package. However, it requires R v2.15 while Revolution R runs 2.14. Is it possible to build your package under R 2.14?

Posted by: Antonio | October 03, 2012 at 22:41

I noticed ggsubplot was removed from the CRAN repository. Or maybe I am missing something? Any chance it will be back soonish? I am running R version 2.15.3. Thanks.

Posted by: David | October 10, 2013 at 07:55