by Andrie de Vries
This week at JSM2015, the annual conference of the American Statistical Association, Joseph Rickert and I gave a presentation on the topic of "The network structure of CRAN and BioConductor" (link to abstract).
Our work tested the hypothesis that one can detect statistical differences in the network graph formed by the dependencies between packages. In the dependency graph, each package is a vertex and each dependency is an edge connecting two vertices.
This presentation combines earlier work that we have discussed in blog posts during the year:
Before starting the work, we formed a hypothesis that CRAN and BioConductor have discernibly different package network structures.
This hypothesis is based on the intuition that these two repositories have different management structures:
Firstly, we used the igraph package to compute descriptive network statistics. Among these, we found the clustering coefficient and the degree distribution most illuminating.
First, we found that BioConductor has a higher clustering coefficient than CRAN. The clustering coefficient (also called transitivity) measures the probability that the adjacent vertices of a vertex are connected.
You can see this visually in the network graphs. It appears as if the BioConductor graph is more compact, while the CRAN graph has many packages on the perimeter that are only loosely connected to the rest of the graph.
We used a simple bootstrapping algorithm to simulate the local clustering coefficient of induced subgraphs. In this plot, CRAN (in red) has a much lower distribution of clustering coefficient than BioConductor (in blue).
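A sketch of this computation with igraph follows; since the actual CRAN and BioConductor graphs are not reproduced here, a small random graph stands in for the real dependency graph:

```r
library(igraph)

set.seed(42)
# Stand-in for the real dependency graph: a sparse random graph
g <- sample_gnp(200, 0.03)

# Global clustering coefficient (transitivity) of the whole graph
transitivity(g, type = "global")

# Bootstrap the clustering coefficient of induced subgraphs:
# repeatedly sample vertices and compute the transitivity of the
# subgraph they induce
boot_cc <- replicate(100, {
  v <- sample(V(g), 50)
  transitivity(induced_subgraph(g, v), type = "global")
})
summary(boot_cc)
```

The real analysis would load the saved CRAN and BioConductor graphs in place of the simulated one and overlay the two bootstrap distributions.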
The second statistical summary is the degree distribution. The degree of a node is the number of adjacent edges. Note in particular the nodes of degree zero, i.e. unconnected nodes.
BioConductor has a much lower fraction of packages with zero connections. It seems that the BioConductor policy encourages package authors to re-use existing material and write packages that work better together.
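This fraction is a one-liner in igraph; again a random stand-in graph is used here in place of the real repository graphs:

```r
library(igraph)

set.seed(1)
# Sparse stand-in graph, so many vertices end up with no edges at all
g <- sample_gnp(500, 0.002)

# Fraction of unconnected (degree-zero) vertices
frac_isolated <- sum(degree(g) == 0) / vcount(g)
frac_isolated
```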
The presentation is available on slideshare.
The scripts we used are available on GitHub. We think this is an important topic to study, since it could help to discover:
by Nina Zumel
Data Scientist Win-Vector LLC
An all too common approach to modeling in data science is to throw all possible variables at a modeling procedure and "let the algorithm sort it out." This is tempting when you are not sure what the true causes or predictors of the phenomenon you are interested in are, but it presents dangers, too. Very wide data sets are computationally difficult for some modeling procedures; and more importantly, they can lead to overfit models that generalize poorly on new data. In extreme cases, wide data can fool modeling procedures into finding models that look good on training data, even when that data has no signal. We showed some examples of this previously in our "Bad Bayes" blog post.
In this latest "Statistics as it should be" article, we will look at a heuristic to help determine which of your input variables have signal.
by Joseph Rickert
In a recent post on creating JavaScript network graphs directly from R, my colleague and fellow blogger, Andrie de Vries, included a link to a saved graph of CRAN. Here, I will use that same graph (network) to build a simple exponential random graph model using functions from the igraph package, and the network and ergm packages included in the statnet suite of R packages. Each node (vertex) in the saved graph represents a package on CRAN, and a directed link or edge between two nodes A -> B indicates that package A depends on package B. Since Andrie's CRAN graph does not have any external attributes associated with either the nodes or edges, the idea is to see if we can develop a model using only structural aspects of the network itself as predictors. In general, this is not an easy thing to do and we are only going to have limited success here. However, the process will illustrate some basic concepts.
A fundamental statistic associated with each node in a network is its degree, the number of edges that connect to it. (Each edge connects a node to some other node in the network, forming a tie.) The degree distribution (distribution of the degree associated with each node of the network) is one way to think about the local connectivity structure of the network. The following plots illustrate different aspects of CRAN's degree distribution.
The top left histogram shows that although most nodes have degree less than 5, there is a long tail, with some nodes having hundreds of incident edges. (Of the 6,867 nodes in the network, 1,156 have degree greater than 5.) Additionally, we have:
summary(cran_deg)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 0.000 1.000 2.000 4.296 4.000 764.000
The bottom two histograms provide greater detail for the beginning and tail-end of the degree distribution.
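A panel of this kind can be sketched with igraph; here a preferential-attachment random graph stands in for the actual CRAN graph (loading the saved graph instead is left to the linked scripts):

```r
library(igraph)

set.seed(7)
# Heavy-tailed stand-in for the CRAN dependency graph
g <- sample_pa(1000, power = 1, directed = FALSE)
d <- degree(g)

op <- par(mfrow = c(2, 2))
hist(d, main = "Degree distribution", xlab = "degree")
hist(d[d <= 5], main = "Degree <= 5", xlab = "degree")
hist(d[d > 5], main = "Degree > 5", xlab = "degree")

# Log-log plot of degree frequencies; an approximately straight
# line is consistent with a power-law ("scale free") distribution
dd   <- degree_distribution(g)
deg  <- seq_along(dd) - 1
keep <- deg > 0 & dd > 0
plot(deg[keep], dd[keep], log = "xy",
     xlab = "degree", ylab = "relative frequency")
par(op)
```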
The purpose of the fourth plot in the panel, the log-log plot, is to look for evidence that the CRAN network may exhibit the kind of "scale free" structure that is common to social network graphs. (A straight line would provide some evidence that the network follows a power law distribution.) That is not quite what is going on here. Nevertheless, the structure exhibited in the degree distribution and the clustering clearly visible in CRAN package network plots indicate that we are not dealing with a purely random graph either.
The following shows the code to specify and fit an ergm model to the data comprising the CRAN network with just two predictors, edges and degree(c(1,2)).
# Model
form <- formula(cran_net ~ edges + degree(c(1,2)))
summary.statistics(form)
# edges degree1 degree2
# 14293    1400    1119

# Fit ergm
fit <- ergm(form, control=control.ergm(MCMLE.maxit = 60))
summary.ergm(fit)
# ==========================
# Summary of model fit
# ==========================
#
# Formula:   cran_net ~ edges + degree(c(1, 2))
#
# Iterations:  22 out of 60
#
# Monte Carlo MLE Results:
#          Estimate Std. Error MCMC % p-value
# edges    -6.89814    0.01124      1  <1e-04 ***
# degree1   2.46835    0.04519      0  <1e-04 ***
# degree2   1.25087    0.03884      0  <1e-04 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#      Null Deviance: 32681073 on 23574411 degrees of freedom
#  Residual Deviance:   238828 on 23574408 degrees of freedom
#
# AIC: 238834   BIC: 238879   (Smaller is better.)
Fitting an ergm with just edges as a predictor would be an attempt to fit a random graph to the data. (See Hunter et al. for details.) Adding the term degree(c(1,2)) attempts to control for the nodes of degrees 1 and 2, which appear to deviate from the straight line in the log-log plot. The small p-values associated with the coefficients, and the fact that both the AIC and BIC are slightly smaller than those resulting from a fit to a random model (AIC = 240349, BIC = 240364), indicate that the simple model has some explanatory power. But how good is the fit?
The ergm package provides a method of assessing goodness of fit through the computationally intensive scheme of simulating networks from the model and comparing statistics from the simulated networks with those calculated from the actual network. The following plot illustrates this comparison for the degree distribution, edge-wise shared partners distribution and the proportion of dyads at values of minimum geodesic distance. The fit for the first two distributions appears to be quite good, and the fit for the third distribution, while far from perfect, is not bad either.
(In the plots, the solid black lines represent the statistics computed from the CRAN graph; the light lines with box plots represent the simulated data.)
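The goodness-of-fit machinery itself amounts to a single call in the ergm package. As a self-contained sketch, the small florentine marriage network that ships with ergm stands in here for the CRAN graph:

```r
library(ergm)

data(florentine)                  # loads flomarriage, a 16-node network
fit <- ergm(flomarriage ~ edges)  # baseline random-graph (edges-only) model

# Simulate networks from the fitted model and compare their degree,
# shared-partner and geodesic-distance distributions to the observed ones
fit_gof <- gof(fit)
par(mfrow = c(2, 2))
plot(fit_gof)
```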
The take-away here is that a very simple model appears to have a surprisingly good fit. Constructing a better fitting and more interesting model would need to take into account characteristics of the package design and construction process. These features would then be used as additional predictors. I think a hint as to how one might go about this is contained in Andrie's observation that many of the packages of very high degree, such as Rcpp, MASS and ggplot2, are concerned with providing tools or reusable infrastructure. Exploring this idea, however, and modeling other package networks such as Bioconductor are subjects for future work.
The code used for this post may be obtained here: Download CRAN_ergm_blog
To go further into ergm modeling with statnet and R, have a look at Benjamin's most informative and entertaining analysis of Grey's Anatomy "hook-ups".
by Joseph Rickert
The XXXV Sunbelt Conference of the International Network for Social Network Analysis (INSNA) was held last month at Brighton beach in the UK. (And I am still bummed out that I was not there.)
A run of 35 conferences is impressive indeed, but the social network analysts have been at it for an even longer time than that:
and today they are still on the cutting edge of the statistical analysis of networks. The conference presentations have not been posted yet, but judging from the conference workshops program there was plenty of R action in Brighton.
Social network analysis at this level involves some serious statistics and mastering a very specialized vocabulary. However, it seems to me that some knowledge of this field will become important to everyone working in data science. Supervised learning models and statistical models that assume independence among the predictors will most likely represent only the first steps that data scientists will take in exploring the complexity of large data sets.
And, maybe of equal importance, is the fact that working with network data is great fun. Moreover, software tools exist in R and other languages that make it relatively easy to get started with just a few pointers.
From a statistical inference point of view, what you need to know is that Exponential Random Graph Models (ERGMs) are at the heart of modern social network analysis. An ERGM is a statistical model that enables one to predict the probability of observing a given network, from a specified class of networks, based on both observed structural properties of the network and covariates associated with the vertices of the network. The exponential part of the name comes from the exponential family of functions used to specify the form of these models. ERGMs are analogous to generalized linear models, except that ERGMs take into account the dependency structure of ties (edges) between vertices. For a rigorous definition of ERGMs see sections 3 and 4 of the paper by Hunter et al. in the 2008 special issue of the JSS, or Chapter 6 in Kolaczyk and Csárdi's book Statistical Analysis of Network Data with R. (I have found this book to be very helpful and highly recommend it. Not only does it provide an accessible introduction to ERGMs, it also begins with basic network statistics and the igraph package and then goes on to introduce some more advanced topics such as modeling processes that take place on graphs and network flows.)
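For reference, these models take the exponential-family form below, where g(y) is a vector of network statistics (for example the edge count or degree counts), θ is the vector of coefficients, and κ(θ) is the normalizing constant obtained by summing over the whole class of candidate networks:

```latex
P_{\theta}(Y = y) = \frac{\exp\{\theta^{\top} g(y)\}}{\kappa(\theta)},
\qquad
\kappa(\theta) = \sum_{y' \in \mathcal{Y}} \exp\{\theta^{\top} g(y')\}
```

The dependence among ties enters through the choice of statistics g(y); with g(y) equal to the edge count alone, the model reduces to a random (Erdős–Rényi style) graph.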
In the R world, the place to go to work with ERGMs is statnet.org. statnet is a suite of 15 or so CRAN packages that provide a complete infrastructure for working with ERGMs. statnet.org is a real gem of a site that contains documentation for all of the statnet packages along with tutorials, presentations from past Sunbelt conferences and more.
I am particularly impressed with the Shiny based GUI for learning how to fit ERGMs. Try it out on the Shiny webpage or in the box below. Click the Get Started button. Then select "built-in network" and "ecoli 1" under File type. After that, click the right arrow in the upper right corner. You should see a plot of the ecoli graph.
You will be fitting models in no time. And since the commands used to drive the GUI are similar to specifying the parameters for the functions in the ergm package you will be writing your own R code shortly after that.
by Joseph Rickert
June was a hot month for extreme statistics and R. Not only did we close out the month with useR! 2015, but two small conferences in the middle of the month brought experts together from all over the world to discuss two very difficult areas of statistics that generate quite a bit of R code.
The Extreme Value Analysis conference is a prestigious event that is held every two years in different parts of the world. This year, over 230 participants from 26 countries met from June 15th through 19th at the University of Michigan, Ann Arbor for EVA 2015. The program included theoretical advances as well as novel applications of Extreme Value Theory in fields including finance,
economics, insurance, hydrology, traffic safety, terrorism risk, climate and environmental extremes. You can get a good idea of the topics discussed at the EVA from the book of abstracts, which includes an author index as well as a keyword index. The conference organizers are in the process of obtaining permission to post the slides from the talks. These should be available soon.
In the meantime, have a look at the slides from two excellent presentations from the Workshop on Statistical Computing which was held the day before the main conference. Eric Gilleland's Introduction to Extreme Value Analysis provides a gentle introduction for anyone willing to look at some math. Eric begins with some motivating examples, develops some key concepts and illustrates them with R and even provides some history along the way. This quote from Emil Gumbel, a founding giant in the field, should be every modeler's mantra: “Il est impossible que l’improbable n’arrive jamais”. ("It's impossible for the improbable to never occur" -- ed)
In Modeling spatial extremes with the SpatialExtremes package, Mathieu Ribatet works through a complete example in R by fitting and evaluating a model and running simulations. This motivating slide from the presentation describes the kind of problems he is considering.
In our world of climate extremes and financial black swans there are probably few topics of more immediate concern to statisticians than EVA, but the vexing problem of dealing with missing values might be one of them. So, it was not surprising that at nearly the same time (June 18th and 19th) 150 people or so gathered on the other side of the world in Rennes, France for missData 2015.
Over the years, R developers have expended considerable energy creating routines to handle missing values. The transcan function in the Hmisc package "automatically transforms continuous and categorical variables to have maximum correlation with the best linear combination of the other variables". mice provides functions for Fully Conditional Specification using the MICE algorithm. (See the slides from Stef van Buuren's presentation Fully Conditional Specification: Past, present and beyond for a perspective on FCS and the reading list at left.) mi provides functions for missing value imputation in a Bayesian framework, as does the BaBooN package, and VIM provides tools for visualizing the structure of missing values. Slides for almost all of the talks are available online at the conference program page and videos will be available soon. Have a look at the slides from the lightning talk by Matthias Templ and Alexander Kowarik to see what the VIM package can do.
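As a minimal sketch of the FCS workflow with the mice package (using the small nhanes example data set that ships with mice; the model below is illustrative only):

```r
library(mice)

data(nhanes)               # small example data set with missing values
md.pattern(nhanes)         # inspect the pattern of missingness

# Fully Conditional Specification: impute each incomplete variable
# conditional on the others, m times over
imp <- mice(nhanes, m = 5, seed = 1, printFlag = FALSE)

# Fit a model to each completed data set and pool the results
fit <- with(imp, lm(bmi ~ age))
summary(pool(fit))
```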
Revolution Analytics was very pleased to have been able to sponsor both of these conferences. For the next EVA mark your calendars to visit Delft, the Netherlands in 2017.
by Joseph Rickert
I had barely begun reading Statistics Done Wrong: the Woefully Complete Guide by Alex Reinhart (no starch press 2015) when I started to wonder about the origin of the aphorism "Don't shoot the messenger." It occurred to me that this might be a reference to a primitive emotion that wells up unbidden when you hear bad news in such a way that you know things are not going to get better any time soon.
It was on page 4 that I read: "Even properly done statistics can't be trusted." Ouch! Now, to be fair, the point the author is trying to make here is that it is often not possible, based solely on the evidence contained in a scientific paper, to determine if an author sifted through his data until he turned up something interesting. But, coming as it does after mentioning J.P.A. Ioannidis' conclusion that most published research findings are probably false, that the average scores of medical school faculty on tests of basic statistical knowledge don’t get much better than 75%, and that both pharmaceutical companies and the scientific journals themselves bias research by failing to publish studies with negative results, Reinhart’s sentence really stings. Moreover, Reinhart is so zealous in his efforts to expose the numerous ways a practicing scientist can go wrong in attempting to "employ statistics" that it is reasonable (despite the optimism he expresses in the final chapters) for a reader in the book’s target demographic of practicing scientists with little formal training in statistics to conclude that the subject is just insanely difficult.
Is the practice of statistics just too difficult? Before permitting myself a brief comment on this I’ll start with an easier and more immediate question: Is this book worth reading? To this question, the answer is an unqualified yes.
Anyone starting out on a journey would like to know ahead of time where the road is dangerous, where the hard climbs are, and most of all: where be the dragons? Statistics Done Wrong is as good a map to the traps lurking in statistical analysis adventures as you are ever likely to find. In less than 150 pages it covers the pitfalls of p-values, the perils of being underpowered, the disappointments of false discoveries, the follies of mistaking correlation for causation, the evils of torturing data and the need for exploratory analysis to avoid Simpson’s paradox.
About three quarters of the way into the book (Chapter 8), Reinhart moves beyond basic hypothesis testing to consider some of the problems associated with fitting linear models. There follows a succinct but lucid presentation of some essential topics including overfitting, unnecessary dichotomization, variable selection via stepwise regression, the subtle ways in which one can be led into mistaking correlation for causation, the need for clarity in dealing with missing data and the difficulties of recognizing and accounting for bias.
That is a lot of ground to cover, but Reinhart manages it with some style and with an eye for relevant contemporary issues. For example, in his discussion on statistical significance Reinhart says:
And because any medication or intervention usually has some real effect, you can always get a statistically significant result by collecting so much data that you detect extremely tiny but relatively unimportant differences (p9).
And then, he follows up with a very amusing quote from Bruce Thomson's 1992 paper that wryly explains that significance tests on large data sets are often little more than confirmations of the fact that a lot of data was collected. Here we have a “Big Data” problem, deftly dealt with in 1992, but in a journal that no data scientist is ever likely to have read.
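Thompson's point is easy to reproduce in a couple of lines of R (the sample size and the deliberately trivial effect size below are arbitrary):

```r
set.seed(123)
n <- 1e6                      # very large samples
a <- rnorm(n, mean = 0)       # control group
b <- rnorm(n, mean = 0.01)    # shifted by 1% of a standard deviation

# The difference is practically meaningless, yet highly "significant"
t.test(a, b)$p.value
```

With a million observations per group, even a shift of one hundredth of a standard deviation yields a p-value far below conventional thresholds.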
The bibliography contained in the notes to each chapter of Statistics Done Wrong is a major strength of the book. Nearly every transgression recorded and every lamentable tale of the sorry state of statistical practice is backed up with a reference to the literature. This impressive exercise in scholarly research adds some weight and depth to the book’s contents and increases its usefulness as a guide.
Also, to my surprise and great delight, Reinhart manages a short discussion that elucidates the differences between R. A. Fisher’s conception of p-values and the treatment given by Neyman and Pearson in their formal theory of hypothesis testing. The confounding of these two very different approaches in what Gigerenzer et al. call the “Null Ritual” is perhaps the root cause of most of the misuse and abuse of significance testing in the scientific literature. However, you can examine dozens of the most popular text books on elementary statistics and find no mention of it.
In the closing chapters of Statistics Done Wrong, Reinhart effects a change of tone and discusses some of the structural difficulties with the practice of statistics in the medical and health sciences that have contributed to the present pandemic of the publication of false, misleading or just plain useless results. Topics include the lack of incentives for researchers to publish inconclusive and negative results, the reluctance of many researchers to share data and the willingness of some to attempt to game the system by deliberately publishing “doctored” results. Reinhart handles these topics nicely and uses them to motivate contemporary work on reproducible research and the need to cultivate a culture of reproducible and open research. Reinhart ends the book with recommendations for the new researcher that allow him to finish on a surprisingly upbeat note. The bearer of bad news concludes by offering hope.
I highly recommend Statistics Done Wrong to be read as the author intended: as supplementary material. In the preface, Reinhart writes:
But this is not a textbook, so I will not teach you how to use these techniques in any technical detail. I only hope to make you aware of the most common problems so you are able to pick the statistical technique best suited to your question.
Statistics Done Wrong is the kind of study guide that I think could benefit almost anyone slogging through a statistical analysis for the first time. It seems to me that the author achieves his stated goal with admirable economy and just a few shortcomings. The book, which entirely avoids the use of mathematical symbolism, would have benefited from precise definitions of the key concepts presented (p-values, confidence intervals etc.) and from a little R code to back up these definitions. These are, however, relatively minor failings.
Now, back to the big question: is the practice of statistics just too difficult? Yes, I think that the catalogue of errors and numerous opportunities for going wrong documented by Reinhart indicates that the practice of statistics is more difficult than it needs to be. My take on why this is so is expressed (perhaps inadvertently) by Reinhart in the statement of his goal for the book quoted above. As long as statistics is conceived and taught as the process of selecting the right technique to answer isolated questions, rather than as an integrated system for thinking with data, we are all going to have a difficult time of it.
Mixed models (which include random effects, essentially parameters drawn from a random distribution) are tricky beasts. Throw non-Normal distributions into the mix for Generalized Linear Mixed Models (GLMMs), or go non-linear, and things get trickier still. It was a new field of Statistics when I was working on the Oswald package for S-PLUS, and even 20 years later some major questions have yet to be fully answered (like, how do you calculate the degrees of freedom for a significance test?).
These days lme4, nlme and MCMCglmm are the go-to R packages for mixed models, and if you're using them you likely have questions. The r-sig-mixed-models FAQ is a good compendium of answers, and includes plenty of references for further reading. You can also join in the discussions on mixed models at the r-sig-mixed-models mailing list.
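As a minimal sketch with lme4, using the sleepstudy data that ships with the package (random intercept and slope for each subject):

```r
library(lme4)

data(sleepstudy)   # reaction times over days of sleep deprivation

# Fixed effect for Days; correlated random intercept and slope per Subject
fit <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)
summary(fit)

# Note that the summary reports no p-values: the degrees-of-freedom
# question mentioned above is exactly why lme4 declines to compute them
fixef(fit)
```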
John Mount Ph. D.
Data Scientist at Win-Vector LLC
An A/B test is a very simple controlled experiment where one group is subject to a new treatment (often group "B") and the other group (often group "A") is considered a control group. The classic example is attempting to compare defect rates of two production processes (the current process, and perhaps a new machine).
In our time an A/B test typically compares the conversion to sales rate of different web-traffic sources or different web-advertising creatives (like industrial defects, a low rate process). An A/B test uses a randomized "at the same time" test design to help mitigate the impact of any possible interfering or omitted variables. So you do not run "A" on Monday and then "B" on Tuesday, but instead continuously route a fraction of your customers to each treatment. Roughly, a complete "test design" is: how much traffic to route to A, how much traffic to route to B, and how to choose between A and B after the results are available. A/B testing is one of the simplest controlled experimental design problems possible (and one of the simplest examples of a Markov decision process). And that is part of the problem: it is likely the first time a person will need to truly worry about:
All of these are technical terms we will touch on in this article. However, we argue the biggest sticking point of A/B testing is: it requires a lot more communication between the business partner (sponsoring the test) and the analyst (designing and implementing the test) than a statistician or data scientist would care to admit. In this first article of a new series called "statistics as it should be" we will discuss some of the essential issues in planning A/B tests. To continue please click here.
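As a taste of the analysis step, here is how a finished A/B test is often summarized in R; the traffic and conversion counts below are made up for illustration:

```r
# Hypothetical results: visitors routed to each treatment, and conversions
visitors    <- c(A = 10000, B = 10000)
conversions <- c(A = 110,   B = 145)

# Two-sample test of equal conversion rates
prop.test(conversions, visitors)
```

The output reports a p-value for the hypothesis that the two conversion rates are equal, plus a confidence interval on their difference, which is usually what the business partner actually wants to see.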
by B. W. Lewis
This note warns about potentially misleading results when using the use="pairwise.complete.obs" and related options in R’s cor and cov functions. Pitfalls are illustrated using a very simple pathological example, followed by a brief list of alternative ways to deal with missing data and some references about them.
R includes excellent facilities for handling missing values for all native data types. Perhaps counterintuitively, marking a value as missing conveys the information that the value is not known. Donald Rumsfeld might call it a “known unknown^{1}.” Upon encountering a missing value, we can deal with it by simply omitting it, imputing it somehow, or through several other possible approaches. R does a good job of making our choice of missing value approach explicit.
The cov and cor functions in the R programming language include several options for dealing with missing data. The use="pairwise.complete.obs" option is particularly confusing, and can easily lead to faulty comparisons. This note explains and warns against its use.
Consider the following tiny example:
(x = matrix(c(-2,-1,0,1,2,1.5,2,0,1,2,NA,NA,0,1,2),5))
## [,1] [,2] [,3]
## [1,] -2 1.5 NA
## [2,] -1 2.0 NA
## [3,] 0 0.0 0
## [4,] 1 1.0 1
## [5,] 2 2.0 2
The function V = cov(x) computes the symmetric covariance matrix V with entries V[i,j] defined by the pairwise covariance of the columns of x, cov(x[,i], x[,j]), where i, j = 1, 2, 3 in this example. The function cor(x) similarly computes the symmetric correlation matrix with entries defined by the pairwise correlation of the columns of x. For example:
cov(x)
## [,1] [,2] [,3]
## [1,] 2.5 0.0 NA
## [2,] 0.0 0.7 NA
## [3,] NA NA NA
cor(x)
## [,1] [,2] [,3]
## [1,] 1 0 NA
## [2,] 0 1 NA
## [3,] NA NA 1
Due to missing values in the third column of x we know that we don’t know the covariance between x[,3] and anything else. Thanks to an arguably questionable^{2} choice in R’s cov2cor function, R reports that the correlation of x[,3] with itself is one, but we don’t know the correlation between x[,3] and the other columns.
The use="complete" option is one way to deal with missing values. It simply removes rows of the matrix x with missing observations. Since the columns of the third through fifth rows of our example matrix are all identical, we expect perfect correlation across the board, and indeed:
cor(x, use="complete")
## [,1] [,2] [,3]
## [1,] 1 1 1
## [2,] 1 1 1
## [3,] 1 1 1
Reasonable people might question this approach. Deleting two observations has a huge effect on the correlation between x[,1] and x[,2] in this example, mostly because of the large change in x[,1]. The result really says that we should collect more observations!
The use="pairwise.complete.obs" option is an even less reasonable way to deal with missing values. When specified, R computes correlations for each pair of columns using vectors formed by omitting rows with missing values on a pairwise basis. Thus each column vector may vary depending on its pairing, resulting in correlation values that are not even comparable. Consider our simple example again:
cor(x, use="pairwise.complete.obs")
## [,1] [,2] [,3]
## [1,] 1 0 1
## [2,] 0 1 1
## [3,] 1 1 1
By this bizarre measurement, the correlation of x[,1] and x[,2] is zero (as we saw above in the first example), and yet cor claims that x[,3] is perfectly correlated with both x[,1] and x[,2]. In other words, the result is nonsense. As Rumsfeld might say, we’ve converted known unknowns into unknown knowns.
What’s going on here is that the reported correlations are not comparable because they are computed against different vectors: all of x[,1] and x[,2] are compared to each other, but only parts of x[,1] and x[,2] are compared to x[,3].
The bad result is obvious for our small example. But the danger here is in large matrices with lots of missing values, where it may be impossible to use the pairwise option in a meaningful way.
If you want to run correlations on lots of vectors with missing values, consider simply using the R default of use="everything" and propagating missing values into the correlation matrix. This makes it clear what you don’t know.
If you really don’t want to do that, consider imputing the missing values. The simplest method replaces missing values in each column with the mean of the non-missing values in the respective column:
m = mean(na.omit(x[,3]))
xi = x
xi[is.na(x)] = m
cor(xi)
## [,1] [,2] [,3]
## [1,] 1.0000000 0.0000000 0.4472136
## [2,] 0.0000000 1.0000000 0.8451543
## [3,] 0.4472136 0.8451543 1.0000000
This can be done really efficiently when “centering” a matrix by simply replacing missing values of the centered matrix with zero.
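On the example matrix that trick looks like this: center each column, then set the missing entries of the centered matrix to zero, which amounts to mean imputation in the original scale.

```r
x <- matrix(c(-2,-1,0,1,2, 1.5,2,0,1,2, NA,NA,0,1,2), 5)

# Center each column by its mean over the non-missing values
xc <- scale(x, center = colMeans(x, na.rm = TRUE), scale = FALSE)

# A missing value in the centered matrix becomes zero, i.e. the column mean
xc[is.na(xc)] <- 0

# Correlation is unchanged by centering, so this reproduces the
# mean-imputation result above
cor(xc)
```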
Sometimes it might make more sense to use a piecewise constant interpolant, referred to as “last observation carry forward” especially when dealing with time series and ordered data. In yet other cases a known default value (perhaps from a much larger population than the one under study) might be more appropriate.
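A last-observation-carried-forward fill takes only a few lines of base R (the na.locf function in the zoo package provides a battle-tested version):

```r
locf <- function(v) {
  # Index of the most recent non-missing observation at each position
  idx <- cumsum(!is.na(v))
  idx[idx == 0] <- NA            # leading NAs have nothing to carry forward
  v[!is.na(v)][idx]
}

locf(c(NA, 3, NA, NA, 7, NA))    # NA 3 3 3 7 7
```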
Another basic approach bootstraps the missing values from the non-missing ones:
i = is.na(x[,3])
N = sum(i)
b = replicate(500, {
  x[i,3] = sample(x[!i,3], size=N, replace=TRUE)
  cor(x[,1:2], x[,3])
})
# Average imputed values of cor(x[,1],x[,3]) and cor(x[,2],x[,3])
apply(b,1,mean)
## [1] 0.3722048 0.6845172
# Standard deviation of imputed values of cor(x[,1],x[,3]) and cor(x[,2],x[,3])
apply(b,1,sd)
## [1] 0.3361684 0.2156523
If you have lots of observations, consider partitioning them with a basic clustering algorithm first and then imputing the missing values from their respective cluster cohorts. Or consider a matching method or a regression-based imputation method. See the references below for many more details.
There are of course many excellent R packages and references on missing data. I recommend consulting the following packages and references:
The cov2cor function used by cor always puts ones along the diagonal of a correlation matrix; that choice is valid only if all unknowns may assume bounded and valid numeric values, which is actually pretty reasonable. But in a rare example of inconsistency in R, cor(x[,3], x[,3]) returns NA. Yikes!
by Vidisha Vachharajani
Freelance Statistical Consultant
R showcases several useful clustering tools, but the one that seems particularly powerful is the marriage of hierarchical clustering with a visual display of its results in a heatmap. The term “heatmap” is often confusing, making most wonder – which is it? A "colorful visual representation of data in a matrix" or "a (thematic) map in which areas are represented in patterns ("heat" colors) that are proportionate to the measurement of some information being displayed on the map"? For our sole clustering purpose, the former meaning of a heatmap is more appropriate, while the latter is a choropleth.
The reason why we would want to link the use of a heatmap with hierarchical clustering is the former’s ability to lucidly represent the information in a hierarchical clustering (HC) output, so that it is easily understood and more visually appealing. It is also (via the heatmap.2() function in the gplots package) a mechanism for applying HC to both rows and columns in a data matrix, so that it yields meaningful groups that share certain features (within the same group) and are differentiated from each other (across different groups).
Consider the following simple example, which uses the States data set in the car package. States contains the following features:
We wish to account for all but the first column (region) to create groups of states that are common with respect to the different pieces of information we have about them. For instance, what states are similar vis-a-vis exam scores vs. state education spending? Instead of doing just a hierarchical clustering, we can implement both the HC and the visualization in one step, using the heatmap.2() function in the gplots package.
# R CODE (output = "initial_plot.png")
library(gplots)               # contains the heatmap.2 function
library(car)                  # contains the States data
States[1:3,]                  # look at the data
scaled <- scale(States[,-1])  # scale all but the first column to make information comparable
heatmap.2(scaled,             # specify the (scaled) data to be used in the heatmap
          cexRow=0.5, cexCol=0.95,  # decrease font size of row/column labels
          scale="none",       # we have already scaled the data
          trace="none")       # cleaner heatmap
This initial heatmap gives us a lot of information about the potential state grouping. We have a classic HC dendrogram on the far left of the plot (the output we would have gotten from an "hclust()" rendering). However, in order to get an even cleaner look, and have groups fall right out of the plot, we can induce row and column separators, rendering an "all-the-information-in-one-glance" look. Placement information for the separators comes from the HC dendrograms (both row and column). Let's also play around with the colors to get a "red-yellow-green" effect for the scaling, which will render the underlying information even more clearly. Finally, we'll also eliminate the underlying dendrograms, so we simply have a clean color plot with underlying groups (this option can be easily undone from the code below).
# R CODE (output = "final_plot.png")
# Use color brewer
library(RColorBrewer)
my_palette <- colorRampPalette(c('red','yellow','green'))(256)
scaled <- scale(States[,-1])      # scale all but the first column to make information comparable
heatmap.2(scaled,                 # specify the (scaled) data to be used in the heatmap
          cexRow=0.5,
          cexCol=0.95,            # decrease font size of row/column labels
          col=my_palette,         # read in the custom colors
          colsep=c(2,4,5),        # add separators that clarify the plot even more
          rowsep=c(6,14,18,25,30,36,42,47),
          sepcolor="black",
          sepwidth=c(0.01,0.01),
          scale="none",           # we have already scaled the data
          dendrogram="none",      # no need to see dendrograms in this one
          trace="none")           # cleaner heatmap
This plot gives us a nice, clear picture of the groups that come off of the HC implementation, as well as in context of column (attribute) groups. For instance, while Idaho, Oklahoma, Missouri and Arkansas perform well on the verbal and math SAT components, the state spending on education and average teacher salary is much lower than in the other states. These attributes are reversed for Connecticut, New Jersey, DC, New York, Pennsylvania and Alaska.
This hierarchical-clustering/heatmap partnership is a useful, productive one, especially when one is digging through massive data, trying to glean some useful cluster-based conclusions, and render the conclusions in a clean, pretty, easily interpretable fashion.