In June 2013, the conflict between opposition and government forces around the Syrian city of Aleppo had intensified. Rockets struck residential districts, and car bombs exploded near key facilities.
Many people died. But as is common in conflict areas, the reported death toll varied with the source of the information. While some agencies reported a surge in casualties in the Aleppo area around June 2013, others did not.
The true number of casualties in conflicts like the Syrian war seems unknowable, but the mission of the Human Rights Data Analysis Group (HRDAG) is to make sense of such information, clouded as it is by the fog of war. They do this not by nominating one source of information as the "best", but instead with statistical modeling of the differences between sources.
In a fascinating talk at Strata Santa Clara in February, HRDAG's Director of Research Megan Price explained the statistical technique she used to make sense of the conflicting information. Each of the four agencies shown in the chart above published a list of identified victims. By painstakingly linking the records between the different agencies (no simple task, given incomplete information about each victim and variations in capturing names, ages etc.), HRDAG can get a more complete sense of the total number of casualties. But the real insight comes from recognizing that some victims were reported by no agency at all. By looking at the rates at which some known victims were not reported by all of the agencies, HRDAG can estimate the number of victims that were identified by nobody, and thereby get a more accurate count of total casualties. (The specific statistical technique used was Random Forests, using the R language. You can read more about the methodology here.)
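To make the idea of estimating unreported victims from the overlap between lists concrete, here is a minimal sketch in Python. It is not HRDAG's method: the post does not spell out the estimator, so this assumes the simplest textbook case, a two-list capture-recapture estimate (Chapman's bias-corrected form of the Lincoln-Petersen estimator). It further assumes that records have already been linked perfectly, that the two lists are independent, and that every victim is equally likely to be reported. The function names and the victim IDs are hypothetical and invented purely for illustration.

```python
def chapman_estimate(n_a: int, n_b: int, n_both: int) -> float:
    """Estimate the total population from two overlapping lists.

    n_a    -- number of victims on list A
    n_b    -- number of victims on list B
    n_both -- number of victims appearing on both lists
    Uses Chapman's bias-corrected Lincoln-Petersen formula.
    """
    if n_both > min(n_a, n_b):
        raise ValueError("overlap cannot exceed the size of either list")
    return (n_a + 1) * (n_b + 1) / (n_both + 1) - 1


def estimate_unreported(list_a: set, list_b: set) -> dict:
    """Summarise observed, estimated total, and estimated unreported counts.

    Assumes the sets contain already-linked victim identifiers, so the
    same person has the same ID on both lists.
    """
    observed = len(list_a | list_b)   # victims reported by at least one list
    n_both = len(list_a & list_b)     # victims reported by both lists
    total = chapman_estimate(len(list_a), len(list_b), n_both)
    return {
        "observed": observed,
        "estimated_total": total,
        "estimated_unreported": total - observed,
    }


# Made-up example data: 59 IDs on list A, 79 on list B, 19 shared.
agency_a = {f"victim-{i}" for i in range(59)}
agency_b = {f"victim-{i}" for i in range(40, 119)}

print(estimate_unreported(agency_a, agency_b))
```

The intuition is that a small overlap between two large lists suggests that each list is missing many victims, and therefore that many victims appear on neither. With four lists, imperfect linkage, and agencies that tend to cover the same areas, the real problem calls for more elaborate models than this two-list sketch.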
HRDAG is doing a noble and difficult job of understanding the facts of war from incomplete data. "If we base our conclusions about what's happening in Syria on the observed data, on the reporting rates, we get those questions wrong", said Megan in her Strata talk. "When we estimate what is missing, we have a much more accurate estimate of reality."
Strata: Record Linkage and Other Statistical Models for Quantifying Conflict Casualties in Syria
> By looking at the rates at which some known victims were not reported by all of the agencies, HRDAG can estimate the number of victims that were identified by nobody, and thereby get a more accurate count of total casualties. (The specific statistical technique used was Random Forests, using the R language. You can read more about the methodology here.)
This description seems to be a bit off from the presentation: she says they use random forests for the de-duplication process, but doesn't go into any apparent detail on what they do with the final dataset to estimate how many victims are missing. My guess would be they're using capture-recapture analysis (possibly as implemented in Rcapture) for estimating missingness because their problem would be ideal for that technique, but I could be wrong.
Posted by: gwern | March 15, 2014 at 16:46