by Joseph Rickert
The XXXV Sunbelt Conference of the International Network for Social Network Analysis (INSNA) was held last month in Brighton, UK. (And I am still bummed out that I was not there.)
A run of 35 conferences is impressive indeed, but the social network analysts have been at it for even longer than that, and today they are still on the cutting edge of the statistical analysis of networks. The conference presentations have not been posted yet, but judging from the conference workshop program there was plenty of R action in Brighton.
Social network analysis at this level involves some serious statistics and mastering a very specialized vocabulary. However, it seems to me that some knowledge of this field will become important to everyone working in data science. Supervised learning models and statistical models that assume independence among the predictors will most likely represent only the first steps that data scientists will take in exploring the complexity of large data sets.
And maybe of equal importance is the fact that working with network data is great fun. Moreover, software tools exist in R and other languages that make it relatively easy to get started with just a few pointers.
From a statistical inference point of view, what you need to know is that Exponential Random Graph Models (ERGMs) are at the heart of modern social network analysis. An ERGM is a statistical model that enables one to predict the probability of observing a given network from a specified class of networks, based on both observed structural properties of the network and covariates associated with the vertices of the network. The exponential part of the name comes from the exponential family of functions used to specify the form of these models. ERGMs are analogous to generalized linear models, except that ERGMs take into account the dependency structure of ties (edges) between vertices. For a rigorous definition of ERGMs see sections 3 and 4 of the paper by Hunter et al. in the 2008 special issue of the JSS, or Chapter 6 in Kolaczyk and Csárdi's book Statistical Analysis of Network Data with R. (I have found this book to be very helpful and highly recommend it. Not only does it provide an accessible introduction to ERGMs, it also begins with basic network statistics and the igraph package and then goes on to introduce some more advanced topics, such as modeling processes that take place on graphs and network flows.)
In the R world, the place to go to work with ERGMs is statnet.org. statnet is a suite of 15 or so CRAN packages that provide a complete infrastructure for working with ERGMs. statnet.org is a real gem of a site, containing documentation for all of the statnet packages along with tutorials, presentations from past Sunbelt conferences and more.
I am particularly impressed with the Shiny-based GUI for learning how to fit ERGMs. Try it out on the Shiny webpage: click the Get Started button, then select "built-in network" and "ecoli 1" under File type. After that, click the right arrow in the upper right corner. You should see a plot of the ecoli graph.
You will be fitting models in no time. And since the commands used to drive the GUI are similar to specifying the parameters for the functions in the ergm package you will be writing your own R code shortly after that.
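To give a feel for that syntax, here is a minimal sketch of fitting an ERGM with the ergm package, using the Florentine marriage network that ships with it. The model terms here are illustrative, not a recommended specification:

```r
# A minimal ERGM sketch using data that ships with the ergm package
library(ergm)      # part of the statnet suite; also loads the network package
data(florentine)   # provides the flomarriage network of Florentine families

# Model tie probability with an edge (density) term plus a vertex
# covariate term for family wealth
fit <- ergm(flomarriage ~ edges + nodecov("wealth"))
summary(fit)       # coefficients are on the log-odds scale
```

The formula interface above is essentially what the GUI is driving behind the scenes, which is why skills learned in the GUI transfer so directly.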
by Joseph Rickert
June was a hot month for extreme statistics and R. Not only did we close out the month with useR! 2015, but two small conferences in the middle of the month brought experts together from all over the world to discuss two very difficult areas of statistics that generate quite a bit of R code.
The Extreme Value Analysis conference is a prestigious event that is held every two years in different parts of the world. This year, over 230 participants from 26 countries met from June 15th through 19th at the University of Michigan, Ann Arbor for EVA 2015. The program included theoretical advances as well as novel applications of Extreme Value Theory in fields including finance,
economics, insurance, hydrology, traffic safety, terrorism risk, climate and environmental extremes. You can get a good idea of the topics discussed at EVA from the book of abstracts, which includes an author index as well as a keyword index. The conference organizers are in the process of obtaining permissions to post the slides from the talks. These should be available soon.
In the meantime, have a look at the slides from two excellent presentations from the Workshop on Statistical Computing which was held the day before the main conference. Eric Gilleland's Introduction to Extreme Value Analysis provides a gentle introduction for anyone willing to look at some math. Eric begins with some motivating examples, develops some key concepts and illustrates them with R and even provides some history along the way. This quote from Emil Gumbel, a founding giant in the field, should be every modeler's mantra: “Il est impossible que l’improbable n’arrive jamais”. ("It's impossible for the improbable to never occur" -- ed)
In Modeling spatial extremes with the SpatialExtremes package, Mathieu Ribatet works through a complete example in R by fitting and evaluating a model and running simulations. This motivating slide from the presentation describes the kind of problems he is considering.
In our world of climate extremes and financial black swans there are probably few topics of more immediate concern to statisticians than EVA, but the vexing problem of dealing with missing values might be one of them. So, it was not surprising that at nearly the same time (June 18th and 19th) 150 people or so gathered on the other side of the world in Rennes, France for missData 2015.
Over the years, R developers have expended considerable energy creating routines to handle missing values. The transcan function in the Hmisc package "automatically transforms continuous and categorical variables to have maximum correlation with the best linear combination of the other variables". mice provides functions for multiple imputation using Fully Conditional Specification via the MICE algorithm. (See the slides from Stef van Buuren's presentation Fully Conditional Specification: Past, present and beyond for a perspective on FCS.) mi provides functions for missing value imputation in a Bayesian framework, as does the BaBooN package, and VIM provides tools for visualizing the structure of missing values. Slides for almost all of the talks are available online at the conference program page and videos will be available soon. Have a look at the slides from the lightning talk by Matthias Templ and Alexander Kowarik to see what the VIM package can do.
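As a sketch of what the mice workflow looks like in practice (default settings, using the airquality data set that ships with R, which has genuine missing Ozone and Solar.R values):

```r
library(mice)  # multiple imputation via Fully Conditional Specification

# airquality ships with R and contains real missing values
imp <- mice(airquality, m = 5, seed = 1, printFlag = FALSE)  # 5 imputed data sets
fit <- with(imp, lm(Ozone ~ Wind + Temp))  # refit the model on each imputed set
summary(pool(fit))  # pool the five fits with Rubin's rules
```

The pooled standard errors reflect both within-imputation and between-imputation variability, which is the point of multiple imputation over a single fill-in.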
Revolution Analytics was very pleased to have been able to sponsor both of these conferences. For the next EVA, mark your calendars to visit Delft, the Netherlands, in 2017.
by Joseph Rickert
I had barely begun reading Statistics Done Wrong: The Woefully Complete Guide by Alex Reinhart (No Starch Press, 2015) when I started to wonder about the origin of the aphorism "Don't shoot the messenger." It occurred to me that the saying might refer to the primitive emotion that wells up unbidden when you hear bad news and know that things are not going to get better any time soon.
It was on page 4 that I read: "Even properly done statistics can't be trusted." Ouch! Now, to be fair, the point the author is trying to make here is that it is often not possible, based solely on the evidence contained in a scientific paper, to determine if an author sifted through his data until he turned up something interesting. But, coming as it does after mentioning J.P.A. Ioannidis' conclusion that most published research findings are probably false, that the average scores of medical school faculty on tests of basic statistical knowledge don't get much better than 75%, and that both pharmaceutical companies and the scientific journals themselves bias research by failing to publish studies with negative results, Reinhart's sentence really stings. Moreover, Reinhart is so zealous in his efforts to expose the numerous ways a practicing scientist can go wrong in attempting to "employ statistics" that it is reasonable (despite the optimism he expresses in the final chapters) for a reader in the book's target demographic of practicing scientists with little formal training in statistics to conclude that the subject is just insanely difficult.
Is the practice of statistics just too difficult? Before permitting myself a brief comment on this I’ll start with an easier and more immediate question: Is this book worth reading? To this question, the answer is an unqualified yes.
Anyone starting out on a journey would like to know ahead of time where the road is dangerous, where the hard climbs are, and most of all: where be the dragons? Statistics Done Wrong is as good a map to the traps lurking in statistical analysis adventures as you are ever likely to find. In less than 150 pages it covers the pitfalls of p-values, the perils of being underpowered, the disappointments of false discoveries, the folly of mistaking correlation for causation, the evils of torturing data and the need for exploratory analysis to avoid Simpson's paradox.
About three quarters of the way into the book (Chapter 8), Reinhart moves beyond basic hypothesis testing to consider some of the problems associated with fitting linear models. There follows a succinct but lucid presentation of some essential topics, including overfitting, unnecessary dichotomization, variable selection via stepwise regression, the subtle ways in which one can be led into mistaking correlation for causation, the need for clarity in dealing with missing data and the difficulties of recognizing and accounting for bias.
That is a lot of ground to cover, but Reinhart manages it with some style and with an eye for relevant contemporary issues. For example, in his discussion on statistical significance Reinhart says:
And because any medication or intervention usually has some real effect, you can always get a statistically significant result by collecting so much data that you detect extremely tiny but relatively unimportant differences (p9).
And then, he follows up with a very amusing quote from Bruce Thompson's 1992 paper that wryly explains that significance tests on large data sets are often little more than confirmations of the fact that a lot of data was collected. Here we have a "Big Data" problem, deftly dealt with in 1992, but in a journal that no data scientist is ever likely to have read.
The bibliography contained in the notes to each chapter of Statistics Done Wrong is a major strength of the book. Nearly every transgression recorded and every lamentable tale of the sorry state of statistical practice is backed up with a reference to the literature. This impressive exercise in scholarly research adds weight and depth to the book's contents and increases its usefulness as a guide.
Also, to my surprise and great delight, Reinhart manages a short discussion that elucidates the differences between R. A. Fisher’s conception of p-values and the treatment given by Neyman and Pearson in their formal theory of hypothesis testing. The confounding of these two very different approaches in what Gigerenzer et al. call the “Null Ritual” is perhaps the root cause of most of the misuse and abuse of significance testing in the scientific literature. However, you can examine dozens of the most popular text books on elementary statistics and find no mention of it.
In the closing chapters of Statistics Done Wrong, Reinhart effects a change of tone and discusses some of the structural difficulties with the practice of statistics in the medical and health sciences that have contributed to the present epidemic of publication of false, misleading or just plain useless results. Topics include the lack of incentives for researchers to publish inconclusive and negative results, the reluctance of many researchers to share data and the willingness of some to attempt to game the system by deliberately publishing "doctored" results. Reinhart handles these topics nicely and uses them to motivate contemporary work on reproducible research and the need to cultivate a culture of reproducible and open research. Reinhart ends the book with recommendations for the new researcher that allow him to finish on a surprisingly upbeat note. The bearer of bad news concludes by offering hope.
I highly recommend Statistics Done Wrong to be read as the author intended: as supplementary material. In the preface, Reinhart writes:
But this is not a textbook, so I will not teach you how to use these techniques in any technical detail. I only hope to make you aware of the most common problems so you are able to pick the statistical technique best suited to your question.
Statistics Done Wrong is the kind of study guide that I think could benefit almost anyone slogging through a statistical analysis for the first time. It seems to me that the author achieves his stated goal with admirable economy and just a few shortcomings. The book, which entirely avoids the use of mathematical symbolism, would have benefited from precise definitions of the key concepts presented (p-values, confidence intervals etc.) and from a little R code to back up these definitions. These are, however, relatively minor failings.
Now, back to the big question: is the practice of statistics just too difficult? Yes, I think that the catalogue of errors and numerous opportunities for going wrong documented by Reinhart indicates that the practice of statistics is more difficult than it needs to be. My take on why this is so is expressed (perhaps inadvertently) by Reinhart in the statement of his goal for the book quoted above. As long as statistics is conceived and taught as the process of selecting the right technique to answer isolated questions, rather than as an integrated system for thinking with data, we are all going to have a difficult time of it.
Mixed models (which include random effects, essentially parameters drawn from a random distribution) are tricky beasts. Throw non-Normal distributions into the mix for Generalized Linear Mixed Models (GLMMs), or go non-linear, and things get trickier still. It was a new field of Statistics when I was working on the Oswald package for S-PLUS, and even 20 years later some major questions have yet to be fully answered (like, how do you calculate the degrees of freedom for a significance test?).
These days lme4, nlme and MCMCglmm are the go-to R packages for mixed models, and if you're using them you likely have questions. The r-sig-mixed-models FAQ is a good compendium of answers, and includes plenty of references for further reading. You can also join in the discussions on mixed models at the r-sig-mixed-models mailing list.
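For the curious, here is a minimal sketch of what fitting a mixed model with lme4 looks like, using the sleepstudy data that ships with the package:

```r
library(lme4)

# sleepstudy ships with lme4: reaction times over days of sleep deprivation.
# Fixed effect for Days, plus a random intercept and slope per Subject.
fit <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)
summary(fit)  # note the absence of p-values, for the degrees-of-freedom
              # reasons mentioned above
```

The `(Days | Subject)` term is exactly where the "parameters drawn from a distribution" idea enters: each subject gets its own intercept and slope, treated as draws from a bivariate normal whose variances lmer estimates.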
John Mount Ph. D.
Data Scientist at Win-Vector LLC
An A/B test is a very simple controlled experiment where one group is subject to a new treatment (often group "B") and the other group (often group "A") is considered a control group. The classic example is attempting to compare defect rates of two production processes (the current process, and perhaps a new machine).
In our time an A/B test typically compares the conversion to sales rate of different web-traffic sources or different web-advertising creatives (like industrial defects, a low rate process). An A/B test uses a randomized "at the same time" test design to help mitigate the impact of any possible interfering or omitted variables. So you do not run "A" on Monday and then "B" on Tuesday, but instead continuously route a fraction of your customers to each treatment. Roughly, a complete "test design" is: how much traffic to route to A, how much traffic to route to B, and how to choose between A and B after the results are available. A/B testing is one of the simplest controlled experimental design problems possible (and one of the simplest examples of a Markov decision process). And that is part of the problem: it is likely the first time a person will need to truly worry about:
All of these are technical terms we will touch on in this article. However, we argue the biggest sticking point of A/B testing is: it requires a lot more communication between the business partner (sponsoring the test) and the analyst (designing and implementing the test) than a statistician or data scientist would care to admit. In this first article of a new series called "statistics as it should be" we will discuss some of the essential issues in planning A/B tests. To continue please click here.
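As a concrete starting point, the final analysis step of a simple A/B test can be sketched in base R with a two-sample test of proportions (the counts below are made up for illustration):

```r
# Hypothetical results: conversions out of visitors routed to each treatment
conversions <- c(A = 200,   B = 260)
visitors    <- c(A = 10000, B = 10000)

# Base R two-sample test of equal conversion rates (no packages required)
test <- prop.test(conversions, visitors)
test$estimate   # observed rates: 0.020 for A, 0.026 for B
test$p.value    # small p-value: evidence the two rates really differ
```

Of course, as argued above, deciding the traffic split and the decision rule before any data arrives matters at least as much as this final computation.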
by B. W. Lewis
This note warns about potentially misleading results when using the use=pairwise.complete.obs and related options in R's cor and cov functions. Pitfalls are illustrated using a very simple pathological example, followed by a brief list of alternative ways to deal with missing data and some references about them.
R includes excellent facilities for handling missing values for all native data types. Perhaps counterintuitively, marking a value as missing conveys the information that the value is not known. Donald Rumsfeld might call it a “known unknown^{1}.” Upon encountering a missing value, we can deal with it by simply omitting it, imputing it somehow, or through several other possible approaches. R does a good job of making our choice of missing value approach explicit.
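For example, base R propagates missingness by default and makes the choice to drop it explicit:

```r
x <- c(1, 2, NA, 4)
mean(x)                # NA: the mean is unknown because one value is unknown
mean(x, na.rm = TRUE)  # explicitly choose to omit the missing value
is.na(x)               # FALSE FALSE TRUE FALSE: locate the known unknowns
```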
The cov and cor functions in the R programming language include several options for dealing with missing data. The use="pairwise.complete.obs" option is particularly confusing and can easily lead to faulty comparisons. This note explains and warns against its use.
Consider the following tiny example:
(x = matrix(c(-2,-1,0,1,2,1.5,2,0,1,2,NA,NA,0,1,2),5))
## [,1] [,2] [,3]
## [1,] -2 1.5 NA
## [2,] -1 2.0 NA
## [3,] 0 0.0 0
## [4,] 1 1.0 1
## [5,] 2 2.0 2
The function V=cov(x) computes the symmetric covariance matrix V with entries V[i,j] defined by the pairwise covariance of columns i and j of x, where i,j = 1,2,3 in this example. The function cor(x) similarly computes the symmetric correlation matrix with entries defined by the pairwise correlation of the columns of x. For example:
cov(x)
## [,1] [,2] [,3]
## [1,] 2.5 0.0 NA
## [2,] 0.0 0.7 NA
## [3,] NA NA NA
cor(x)
## [,1] [,2] [,3]
## [1,] 1 0 NA
## [2,] 0 1 NA
## [3,] NA NA 1
Due to missing values in the third column of x we know that we don’t know the covariance between x[,3] and anything else. Thanks to an arguably questionable^{2} choice in R’s cov2cor function, R reports that the correlation of x[,3] with itself is one, but we don’t know the correlation between x[,3] and the other columns.
The use="complete" option is one way to deal with missing values. It simply removes rows of the matrix x with missing observations. Since the columns of the third through fifth rows of our example matrix are all identical, we expect perfect correlation across the board, and indeed:
cor(x, use="complete")
## [,1] [,2] [,3]
## [1,] 1 1 1
## [2,] 1 1 1
## [3,] 1 1 1
Reasonable people might question this approach. Deleting two observations has a huge effect on the correlation between x[,1] and x[,2] in this example, mostly because of the large change in x[,1]. The result really says that we should collect more observations!
The use="pairwise.complete.obs" option is an even less reasonable way to deal with missing values. When specified, R computes correlations for each pair of columns using vectors formed by omitting rows with missing values on a pairwise basis. Thus each column vector may vary depending on its pairing, resulting in correlation values that are not even comparable. Consider our simple example again:
cor(x, use="pairwise.complete.obs")
## [,1] [,2] [,3]
## [1,] 1 0 1
## [2,] 0 1 1
## [3,] 1 1 1
By this bizarre measurement, the correlation of x[,1] and x[,2] is zero (as we saw above in the first example), and yet cor claims that x[,3] is perfectly correlated with both x[,1] and x[,2]. In other words, the result is nonsense. As Rumsfeld might say, we’ve converted known unknowns into unknown knowns.
What’s going on here is that the reported correlations are not comparable because they are computed against different vectors: all of x[,1] and x[,2] are compared to each other, but only parts of x[,1] and x[,2] are compared to x[,3].
The bad result is obvious for our small example. But the danger here is in large matrices with lots of missing values, where it may be impossible to use the pairwise option in a meaningful way.
If you want to run correlations on lots of vectors with missing values, consider simply using the R default of use="everything" and propagating missing values into the correlation matrix. This makes it clear what you don’t know.
If you really don’t want to do that, consider imputing the missing values. The simplest method replaces missing values in each column with the mean of the non-missing values in the respective column:
m = mean(na.omit(x[,3]))  # mean of the non-missing values in column 3
xi = x
xi[is.na(x)] = m          # fill in the missing entries with that mean
cor(xi)
## [,1] [,2] [,3]
## [1,] 1.0000000 0.0000000 0.4472136
## [2,] 0.0000000 1.0000000 0.8451543
## [3,] 0.4472136 0.8451543 1.0000000
This can be done really efficiently when “centering” a matrix by simply replacing missing values of the centered matrix with zero.
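A sketch of that shortcut on the example matrix from above. Correlation is unchanged by centering, so this reproduces the mean-imputed correlations exactly:

```r
x <- matrix(c(-2,-1,0,1,2,1.5,2,0,1,2,NA,NA,0,1,2), 5)

# Column-center, then set missing entries to 0: on the centered scale,
# 0 is the column mean, so this is mean imputation in disguise
xc <- scale(x, center = TRUE, scale = FALSE)  # column means computed with na.rm
xc[is.na(xc)] <- 0
cor(xc)  # matches the mean-imputed correlations shown above
```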
Sometimes it might make more sense to use a piecewise constant interpolant, referred to as “last observation carry forward” especially when dealing with time series and ordered data. In yet other cases a known default value (perhaps from a much larger population than the one under study) might be more appropriate.
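A bare-bones sketch of the carry-forward idea in base R (in practice the na.locf function in the zoo package does this robustly):

```r
# Carry the last non-missing value forward through a vector
locf <- function(v) {
  for (k in seq_along(v)[-1]) {
    if (is.na(v[k])) v[k] <- v[k - 1]
  }
  v
}
locf(c(5, NA, NA, 7, NA))  # 5 5 5 7 7
```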
Another basic approach bootstraps the missing values from the non-missing ones:
i = is.na(x[,3])   # which entries of column 3 are missing
N = sum(i)
b = replicate(500, {
  x[i,3] = sample(x[!i,3], size=N, replace=TRUE)  # resample from the non-missing values
  cor(x[,1:2], x[,3])
})
# Average imputed values of cor(x[,1],x[,3]) and cor(x[,2],x[,3])
apply(b,1,mean)
## [1] 0.3722048 0.6845172
# Standard deviation of imputed values of cor(x[,1],x[,3]) and cor(x[,2],x[,3])
apply(b,1,sd)
## [1] 0.3361684 0.2156523
If you have lots of observations, consider partitioning them with a basic clustering algorithm first and then imputing the missing values from their respective cluster cohorts. Or consider a matching method or a regression-based imputation method. See the references below for many more details.
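Here is a hedged sketch of the cluster-then-impute idea using base R's kmeans. The data, the number of clusters, and the choice of clustering variables are all arbitrary here:

```r
set.seed(1)
d <- data.frame(a = rnorm(100), b = rnorm(100))
d$c <- d$a + rnorm(100, sd = 0.1)    # a third column correlated with a
d$c[sample(100, 10)] <- NA           # knock out ten values

# Cluster on the fully observed columns...
cl <- kmeans(d[, c("a", "b")], centers = 4)$cluster

# ...then impute each missing value from its own cluster's mean
for (g in unique(cl)) {
  miss <- cl == g & is.na(d$c)
  d$c[miss] <- mean(d$c[cl == g], na.rm = TRUE)
}
anyNA(d$c)  # every gap is now filled from a local neighborhood
```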
There are of course many excellent R packages and references on missing data. I recommend consulting the following packages and references:
^{2} The cov2cor function used by cor always puts ones along the diagonal of a correlation matrix; that choice is valid only if all unknowns may assume bounded and valid numeric values, which is actually pretty reasonable. (But in a rare example of inconsistency in R, cor(x[,3],x[,3]) returns NA. Yikes!)
by Vidisha Vachharajani
Freelance Statistical Consultant
R showcases several useful clustering tools, but the one that seems particularly powerful is the marriage of hierarchical clustering with a visual display of its results in a heatmap. The term “heatmap” is often confusing, making most wonder – which is it? A "colorful visual representation of data in a matrix" or "a (thematic) map in which areas are represented in patterns ("heat" colors) that are proportionate to the measurement of some information being displayed on the map"? For our sole clustering purpose, the former meaning of a heatmap is more appropriate, while the latter is a choropleth.
The reason why we would want to link the use of a heatmap with hierarchical clustering is the former’s ability to lucidly represent the information in a hierarchical clustering (HC) output, so that it is easily understood and more visually appealing. It also (via the heatmap.2 function in the gplots package) provides a mechanism for applying HC to both the rows and columns of a data matrix, so that it yields meaningful groups that share certain features (within the same group) and are differentiated from each other (across different groups).
Consider the following simple example, which uses the States data set in the car package. States contains the following features:
We wish to account for all but the first column (region) to create groups of states that are common with respect to the different pieces of information we have about them. For instance, what states are similar vis-a-vis exam scores vs. state education spending? Instead of doing just a hierarchical clustering, we can implement both the HC and the visualization in one step, using the heatmap.2() function in the gplots package.
# R CODE (output = "initial_plot.png")
library(gplots)               # contains the heatmap.2 function
library(car)                  # contains the States data
States[1:3,]                  # look at the data
scaled <- scale(States[,-1])  # scale all but the first column to make information comparable
heatmap.2(scaled,             # specify the (scaled) data to be used in the heatmap
          cexRow=0.5, cexCol=0.95,  # decrease font size of row/column labels
          scale="none",       # we have already scaled the data
          trace="none")       # cleaner heatmap
This initial heatmap gives us a lot of information about the potential state grouping. We have a classic HC dendrogram on the far left of the plot (the output we would have gotten from an hclust() rendering). However, in order to get an even cleaner look, and have groups fall right out of the plot, we can add row and column separators, rendering an "all-the-information-in-one-glance" look. Placement information for the separators comes from the HC dendrograms (both row and column). Let's also play around with the colors to get a "red-yellow-green" effect for the scaling, which will render the underlying information even more clearly. Finally, we'll also eliminate the underlying dendrograms, so we simply have a clean color plot with underlying groups (this option can be easily undone in the code below).
# R CODE (output = "final_plot.png")
# Use color brewer
library(RColorBrewer)
my_palette <- colorRampPalette(c('red','yellow','green'))(256)
scaled <- scale(States[,-1])    # scale all but the first column to make information comparable
heatmap.2(scaled,               # specify the (scaled) data to be used in the heatmap
          cexRow=0.5,
          cexCol=0.95,          # decrease font size of row/column labels
          col=my_palette,       # read in the custom colors
          colsep=c(2,4,5),      # add separators that clarify the plot even more
          rowsep=c(6,14,18,25,30,36,42,47),
          sepcolor="black",
          sepwidth=c(0.01,0.01),
          scale="none",         # we have already scaled the data
          dendrogram="none",    # no need to see dendrograms in this one
          trace="none")         # cleaner heatmap
This plot gives us a nice, clear picture of the groups that come out of the HC implementation, as well as in the context of column (attribute) groups. For instance, while Idaho, Oklahoma, Missouri and Arkansas perform well on the verbal and math SAT components, their state spending on education and average teacher salaries are much lower than those of the other states. These attributes are reversed for Connecticut, New Jersey, DC, New York, Pennsylvania and Alaska.
This hierarchical-clustering/heatmap partnership is a useful, productive one, especially when one is digging through massive data, trying to glean some useful cluster-based conclusions, and render the conclusions in a clean, pretty, easily interpretable fashion.
by Joseph Rickert
A recent post by David Smith included a map that shows the locations of R user groups around the world. While it is exhilarating to see how R user groups span the globe, the map does not give any idea about the size of the community at each location. The following plot, constructed from information on the websites of the groups listed in Revolution Analytics' Local R User Group Directory (the same source as for the map), shows the membership size for the largest 25 groups.
There are 11 groups with over a thousand members and a couple more who are close to achieving that milestone. With the possible exception of Groupe des utilisateurs du logiciel R, I believe all of the groups in the top 25 hold regular, face-to-face meetings: so the plot gives some idea of the size, location and density of the "social" R community.
There are, however, quite a few problems with the data that make it less than optimal for even the limited goal of characterizing the number of R users who regularly participate in face-to-face R events.
Nevertheless, I think this plot along with the map mentioned above do give some idea of where the action is in the R world.
Note that the data used to build the plot may be obtained here: Download RUGS_WW_4_7_15 and the code is here: Download RUG_Bar_Chart
Also note that it is still a good time to start a new R user group. The deadline for funding for large user groups through Revolution Analytics' 2015 R User Group Sponsorship Program has passed. However, we will be taking applications for new groups until the end of September. The $120 grant for Vector level groups should be enough to finance a site on meetup.com. If you are thinking of starting a new group have a look at the link above as well as our tips for getting started.
In today's data-oriented world, just about every retailer has amassed a huge database of purchase transactions. Each transaction consists of a number of products that have been purchased together. A natural question that you could answer from this database is: What products are typically purchased together? This is called Market Basket Analysis (or Affinity Analysis). A closely related question is: Can we find relationships between certain products that indicate the purchase of other products? For example, if someone purchases avocados and salsa, it's likely they'll purchase tortilla chips and limes as well. This is called association rule learning, a data mining technique used by retailers to improve product placement, marketing, and new product development.
R has an excellent suite of algorithms for market basket analysis in the arules package by Michael Hahsler and colleagues. It includes support for both the Apriori algorithm and ECLAT (the equivalence class transformation algorithm). You can find an in-depth description of both techniques (including several examples) in the Introduction to arules vignette. The slides below, by Yanchang Zhao, provide a nice overview, and you can find further examples at RDataMining.com.
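To give a flavor of the package, here is a short sketch using the Groceries data set that ships with arules; the support and confidence thresholds are illustrative, not recommendations:

```r
library(arules)
data("Groceries")  # real point-of-sale transactions, included with the package

# Apriori: mine association rules with at least 1% support and 50% confidence
rules <- apriori(Groceries,
                 parameter = list(supp = 0.01, conf = 0.5))

inspect(head(sort(rules, by = "lift")))  # strongest associations first
```

Raising supp trades completeness for speed, while a lift well above 1 flags item sets that are purchased together more often than independence would predict.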
by Ari Lamstein
Today I will walk through an analysis of San Francisco Zip Code Demographics using my new R package choroplethrZip. This package creates choropleth maps of US Zip Codes and connects to the US Census Bureau. A choropleth is a map that shows boundaries of regions (such as zip codes) and colors those regions according to some metric (such as population).
Zip codes are a common geographic unit for businesses to work with, but rendering them is difficult. Official zip codes are maintained by the US Postal Service, but they exist solely to facilitate mail delivery. The USPS does not release a map of them; they change frequently and, in some cases, are not even polygons. The most authoritative map I could find of US zip codes was the Census Bureau's map of ZIP Code Tabulation Areas (ZCTAs). Despite shipping with only a simplified version of this map (60MB instead of 500MB), choroplethrZip is still too large for CRAN. It is instead hosted on github, and you can install it from an R console like this:
# install.packages("devtools")
library(devtools)
install_github('arilamstein/choroplethrZip@v1.1.1')
The package vignettes (1, 2) explain basic usage. In this article I’d like to demonstrate a more in depth example: showing racial and financial characteristics of each zip code in San Francisco. The data I use comes from the 2013 American Community Survey (ACS) which is run by the US Census Bureau. If you are new to the ACS you might want to view my vignette on Mapping US Census Data.
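For readers who want to jump straight to a map, here is a sketch of the kind of call involved. Function and argument names follow the package vignettes, and the data frame contents are invented for illustration: choroplethr-style input is a data frame with a region column (ZCTA strings) and a value column (the statistic to plot).

```r
library(choroplethrZip)

# Invented per capita income values for three San Francisco ZCTAs
df <- data.frame(region = c("94105", "94110", "94112"),
                 value  = c(144400, 81000, 15960))

# Draw just the ZCTAs that appear in df
zip_choropleth(df,
               zip_zoom = df$region,
               title    = "Per Capita Income (illustrative values)",
               legend   = "Dollars")
```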
One table that deals with race and ethnicity is B03002 - Hispanic or Latino Origin by Race. Many people will be surprised by the large number of categories. This is because the US Census Bureau has a complex framework for categorizing race and ethnicity. Since my purpose here is to demonstrate technology, I will simplify the data by dealing with only a handful of the values: Total Hispanic or Latino, White (not Hispanic), Black (not Hispanic) and Asian (not Hispanic).
The R code for getting this data into a data.frame can be viewed here, and the code for generating the graphs in this post can be viewed here. Here is a boxplot of the ethnic breakdown of the 27 ZCTAs in San Francisco.
This boxplot shows that there is wide variation in the racial and ethnic breakdown of San Francisco ZCTAs. For example, the percentage of White residents in each ZCTA ranges from 7% to 80%. The percentages of Black and Hispanic residents have tighter ranges, but also contain outliers. Also, while Asian Americans make up only about 5% of the total US population, the median ZCTA in SF is 30% Asian.
Viewing this data with choropleth maps allows us to associate locations with these values.
When discussing demographics, people often ask about per capita income. Here is a boxplot of the per capita income of San Francisco ZCTAs.
The range of this dataset - from $15,960 to $144,400 - is striking. Equally striking is the outlier at the top. We can use a continuous scale choropleth to highlight the outlier. We can also use a four color choropleth to show the locations of the quartiles.
The outlier for income is zip 94105, which is where a large number of tech companies are located. The zips in the southern part of the city tend to have a low income.
After viewing this analysis readers might wish to do a similar analysis for the city where they live. To facilitate this I have created an interactive web application. The app begins by showing a choropleth map of a random statistic of the Zips in a random Metropolitan Statistical Area (MSA). You can choose another statistic, zoom in, or select another MSA.
In the event that the application does not load (for example, if I reach my monthly quota at my hosting company) then you can run the app from source, which is available here.
I hope that you have enjoyed this exploration of Zip code level demographics with choroplethrZip. I also hope that it encourages more people to use R for demographic statistics.