by Vidisha Vachharajani
Freelance Statistical Consultant
R offers several useful clustering tools, but one that seems particularly powerful is the marriage of hierarchical clustering with a visual display of its results in a heatmap. The term “heatmap” is often confusing, because it is used in two senses: a "colorful visual representation of data in a matrix", or "a (thematic) map in which areas are represented in patterns ("heat" colors) that are proportionate to the measurement of some information being displayed on the map". For clustering purposes, the former meaning is the appropriate one; the latter is better called a choropleth.
The reason to link a heatmap with hierarchical clustering is the heatmap's ability to lucidly represent the information in a hierarchical clustering (HC) output, so that it is easily understood and more visually appealing. It is also (through the heatmap.2() function in the gplots package) a mechanism for applying HC to both the rows and the columns of a data matrix, so that it yields meaningful groups whose members share certain features (within the same group) and are differentiated from each other (across different groups).
Consider the following simple example, which uses the "States" data set in the car package (in recent versions of car, the data are supplied by the carData package, which car loads). States contains the following features:
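To make the "both rows and columns" idea concrete, here is a minimal base-R sketch of the two clusterings that a heatmap combines. It assumes the built-in USArrests data as a stand-in (it is not the data used in this post) and relies on the defaults of dist() and hclust(), Euclidean distance and complete linkage, which are also the default distance and clustering functions of heatmap.2().

```r
# Stand-in data (assumption): USArrests ships with base R, so nothing needs installing.
x <- scale(USArrests)          # put all columns on a comparable scale

row_hc <- hclust(dist(x))      # cluster the rows (states)
col_hc <- hclust(dist(t(x)))   # cluster the columns (attributes) via the transpose

# Leaf order of each dendrogram, i.e. the arrangement a heatmap would start from
# (heatmap.2() may additionally flip branches according to row/column means).
rownames(x)[row_hc$order]
colnames(x)[col_hc$order]
```

The heatmap simply draws the matrix with its rows and columns rearranged according to these two dendrograms, so that similar rows (and similar columns) end up next to each other.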
- region: U. S. Census regions. A factor with levels: ENC, East North Central; ESC, East South Central; MA, Mid-Atlantic; MTN, Mountain; NE, New England; PAC, Pacific; SA, South Atlantic; WNC, West North Central; WSC, West South Central.
- pop: Population: in 1,000s.
- SATV: Average score of graduating high-school students in the state on the verbal component of the Scholastic Aptitude Test (a standard university admission exam).
- SATM: Average score of graduating high-school students in the state on the math component of the Scholastic Aptitude Test.
- percent: Percentage of graduating high-school students in the state who took the SAT exam.
- dollars: State spending on public education, in $1000s per student.
- pay: Average teacher's salary in the state, in $1000s.
We wish to use all but the first column (region) to create groups of states that are similar with respect to the different pieces of information we have about them. For instance, which states are similar vis-a-vis exam scores vs. state education spending? Instead of doing just a hierarchical clustering, we can implement both the HC and the visualization in one step, using the heatmap.2() function in the gplots package.
# R CODE (output = "initial_plot.png")
library(gplots)   # contains the heatmap.2() function
library(car)      # provides the States data
States[1:3,]      # look at the data
scaled <- scale(States[,-1])   # scale all but the first column to make information comparable
heatmap.2(scaled,              # specify the (scaled) data to be used in the heatmap
          cexRow=0.5, cexCol=0.95,   # decrease font size of row/column labels
          scale="none",              # we have already scaled the data
          trace="none")              # cleaner heatmap
This initial heatmap gives us a lot of information about the potential state grouping. We have a classic HC dendrogram on the far left of the plot (the output we would have gotten from plotting an "hclust()" result). However, to get an even cleaner look, and have the groups fall right out of the plot, we can add row and column separators, giving an "all-the-information-in-one-glance" look. The placement of the separators comes from the HC dendrograms (both row and column). Let's also play around with the colors to get a "red-yellow-green" effect for the scaling, which will render the underlying information even more clearly. Finally, we'll also eliminate the dendrograms, so we simply have a clean color plot with the underlying groups (this option can easily be undone in the code below).
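Separator positions like these can be read off the dendrogram by eye, but they can also be computed. The following is only a sketch of one way to do that, not necessarily how the values used in this post were obtained: it cuts the row dendrogram into k groups with cutree() and finds where the group changes along the leaf order. The USArrests data and the choice k = 4 are stand-in assumptions so that the block runs with base R alone.

```r
# Stand-in data (assumption): built-in USArrests instead of the States data.
demo_scaled <- scale(USArrests)

# Euclidean distance + complete linkage, the defaults of dist() and hclust().
hc <- hclust(dist(demo_scaled))

k <- 4                               # hypothetical number of row groups
groups <- cutree(hc, k = k)          # cluster id for each row, in data order
leaf_groups <- groups[hc$order]      # cluster ids in dendrogram leaf order

# A group boundary sits wherever the cluster id changes between adjacent leaves.
boundaries <- which(diff(leaf_groups) != 0)
boundaries
```

To use such positions as rowsep, the plotted row order has to match hc$order, for example by passing Rowv = as.dendrogram(hc) to heatmap.2(). As far as I can tell, heatmap.2() draws the first leaf at the bottom and counts rowsep from the top, so the values may need to be flipped (nrow(demo_scaled) - boundaries); it is worth checking the result against the plot.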
# R CODE (output = "final_plot.png")
# Use color brewer
library(RColorBrewer)
my_palette <- colorRampPalette(c('red','yellow','green'))(256) # colorRampPalette() itself comes from grDevices
scaled <- scale(States[,-1]) # scale all but the first column to make information comparable
heatmap.2(scaled, # specify the (scaled) data to be used in the heatmap
cexRow=0.5,
cexCol=0.95, # decrease font size of row/column labels col = my_palette, # arguments to read in custom colors
colsep=c(2,4,5), # Adding on the separators that will clarify plot even more
rowsep = c(6,14,18,25,30,36,42,47),
sepcolor="black",
sepwidth=c(0.01,0.01),
scale="none", # we have already scaled the data
dendrogram="none", # no need to see dendrograms in this one
trace="none") # cleaner heatmap
This plot gives us a nice, clear picture of the groups that come out of the HC implementation, as well as in the context of the column (attribute) groups. For instance, while Idaho, Oklahoma, Missouri and Arkansas perform well on the verbal and math SAT components, state spending on education and average teacher salary are much lower than in the other states. These attributes are reversed for Connecticut, New Jersey, DC, New York, Pennsylvania and Alaska.
This hierarchical-clustering/heatmap partnership is a useful, productive one, especially when one is digging through massive data, trying to glean some useful cluster-based conclusions, and render the conclusions in a clean, pretty, easily interpretable fashion.
Useful, but a red-blue palette (e.g. 'RdBu' in brewer.pal) is probably preferred, especially for people with color vision deficiencies.
Posted by: Pete | April 22, 2015 at 05:44
Thank you for the post Vidisha.
I wanted to mention another package (which I authored), for also controlling the dendrograms in the heatmap. It is called "dendextend".
It has a detailed vignette here:
http://cran.r-project.org/web/packages/dendextend/vignettes/introduction.html
The relevant section is under: Enhancing other packages -> gplots.
Here is a figure which gives a nice example of what can be done:
Best,
Tal
Posted by: Tal Galili | April 22, 2015 at 07:09
Nice tool, but how to color the branches and tick labels? E.g. if I cut the tree to have four groups like this heatmap:
http://i.stack.imgur.com/y4qvO.jpg
Is it possible?
Best regards :)
Darwin
Posted by: Darwin Alexander | April 22, 2015 at 08:59
Interesting tool! However, I am new to R, and I notice that the data basically has numerical variables. I have a database that has categorical variables that are used to define a record. Basically customer info which determines age group, sex, description of the customer etc. Can you suggest a suitable clustering algorithm to group this customer set into different groups?
Posted by: Aakash | May 13, 2015 at 21:41