Walter Mebane has updated the analysis we looked at on Tuesday. The analysis now incorporates town-level polling data from the 2005 election, leading Mebane to conclude that there is "moderately strong support for a diagnosis that the 2009 election was afflicted by significant fraud". Pollster.com has a good review of the updated report; the report itself and the updated R scripts and data are also available.
Update: You might ask whether it's relevant that this analysis was done in R: couldn't it have been done in any statistics package? In this case, I don't think so. Here we have a real-time news event, where data are trickling in on a daily, even hourly basis, and yet deep statistical analysis is required. R has three key strengths in this context. First, it's a language designed for rapid implementation of analyses: despite messy data, data-matching problems, and unusual data manipulations (second-digit extraction, anyone?), it enabled Mebane to complete a sophisticated analysis very soon after the data became available. Second, R is backed by a library of thousands of statistical routines for every application imaginable: Mebane relied heavily on his already-published multinomRob package to generate robust estimates (which in turn revealed those suspicious outliers). Third, R is open source: Mebane published his code and data in the knowledge that anyone could inspect and verify his analysis without being restricted by software licenses. I don't think any other software is capable of this level of analysis, this quickly and this openly. (Octave comes close, I suppose, as in this analysis, but there the statistics had to be hand-coded from the equations, which limits it to analyses you can code from scratch yourself.)
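For a flavor of the second-digit ("2BL") test at the heart of the analysis, here's a minimal sketch in R. (This is my own illustration, not Mebane's actual script; it assumes votes is a vector of vote counts, each at least 10.)

# Expected second-digit probabilities under Benford's Law:
# P(d) = sum over k = 1..9 of log10(1 + 1/(10k + d))
benford2 <- sapply(0:9, function(d) sum(log10(1 + 1/(10 * (1:9) + d))))
# Extract second digits (see the comments below for alternative implementations)
second_digit <- function(x) as.numeric(substring(x, 2, 2))
observed <- table(factor(second_digit(votes), levels = 0:9))
# Compare observed digit frequencies to the Benford expectation
chisq.test(observed, p = benford2)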
Pollster.com: Mebane: "Moderately Strong Support" for Iran Fraud
Another analysis, somewhat different, supports the fraud hypothesis: here.
Posted by: Jon Baron | June 18, 2009 at 17:43
# extract the second digit arithmetically (assumes x >= 10):
second_digit <- floor(x * 10^-ceiling(log(x,10)-2) - 10*floor(x * 10^-ceiling(log(x,10)-1)))
Posted by: Corey | June 18, 2009 at 21:36
Thanks, Corey. Now, for comparison, how would one do that in Excel or SAS? :)
Incidentally, I was wondering why Mebane analyzed the second digit in the Benford's Law analysis, rather than the first as Roukema did (see also Gelman's comments on that paper). I found the answer (see the comment by "tomi"):
"Another important issue concerns whether Benford's Law should be expected to apply to all the digits in reported vote counts. In particular, for precinct-level data there are good reasons to doubt that the first digits of vote counts will satisfy Benford's Law. Brady (2005) develops a version of this argument. The basic point is that often precincts are designed to include roughly the same number of voters. If a candidate has roughly the same level of support in all the precincts, which means the candidate's share of the votes is roughly the same in all the precincts, then the vote counts will have the same first digit in all of the precincts. Imagine a situation where all precincts contain about 1,000 voters each, and a candidate has the support of roughly fifty percent of the voters in every precinct. Then most of the precinct vote totals for the candidate will begin with the digits `4' or '5.' This result will hold no matter how mixed the processes may be that get the candidate to roughly fifty percent support in each precinct. For Benford's Law to be satisfied for the first digits of vote counts clearly depends on the occurrence of a fortuitous distribution of precinct sizes and in the alignment of precinct sizes with each candidate's support. It is difficult to see how there might be some connection to generally occurring political processes. So we may turn to the second significant digits of the vote counts, for which at least there is no similar knock down contrary argument." (From a 1996 paper by Mebane.)
Posted by: David Smith | June 22, 2009 at 10:53
A more readable implementation might be
# take the second character of x as printed (assumes x >= 10, not in scientific notation):
second_digit <- as.numeric(substring(x, 2, 2))
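For example, both versions return the same digits on a quick check:

x <- c(10, 99, 456, 2009)
floor(x * 10^-ceiling(log(x,10)-2) - 10*floor(x * 10^-ceiling(log(x,10)-1)))  # 0 9 5 0
as.numeric(substring(x, 2, 2))  # 0 9 5 0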
Posted by: Felix Andrews | June 28, 2009 at 00:39