There's growing awareness that the data we collect, and in particular the variables we include as factors in our predictive models, can lead to unwanted bias in outcomes: from loan applications to law enforcement, and in many other areas. In some instances, such bias is even directly regulated by laws like the Fair Housing Act in the US. But even if we explicitly remove "obvious" variables like sex, age, or ethnicity from our predictive models, unconscious bias can still creep into our predictions through highly correlated proxy variables that remain in the model.
As a result, we need to be aware of the biases in our model and take steps to address them. For an excellent general overview of the topic, I highly recommend watching the recent presentation by Rachel Thomas, "Analyzing and Preventing Bias in ML". And for a practical demonstration of one way you can go about detecting proxy bias in R, take a look at the vignette created by my colleague Paige Bailey for the ROpenSci conference, "Ethical Machine Learning: Spotting and Preventing Proxy Bias".
The vignette details general principles you can follow to identify proxy bias in an analysis, in the context of a case study analyzed using R. The case study considers data and a predictive model that might be used by a bank manager to determine the creditworthiness of a loan applicant. Even though race was not explicitly included in the adaptive boosting model (fit with the C5.0 function from the C50 package), the predictions are still biased by race.
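To make that kind of audit concrete, here is a minimal R sketch of the general idea (not the vignette's actual code): fit a boosted C5.0 model that excludes race from the predictors, then tabulate the predictions against race anyway. The data frame `loans` and its column names (`creditworthy`, `race`, `zipcode`) are hypothetical stand-ins for the case-study data.

```r
library(C50)

# Exclude both the outcome and the protected attribute from the predictors
predictors <- setdiff(names(loans), c("creditworthy", "race"))

# creditworthy is assumed to be a two-level factor; trials > 1 enables boosting
fit <- C5.0(x = loans[, predictors], y = loans$creditworthy, trials = 10)

# Audit: even though race was never a predictor, compare predicted
# outcomes across racial groups
pred <- predict(fit, newdata = loans[, predictors])
prop.table(table(race = loans$race, predicted = pred), margin = 1)
```

If the rows of that table differ markedly, the model is reproducing a racial disparity it was never explicitly given.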
That's because zipcode, a variable highly associated with race, was included in the model. Read the complete vignette linked below to see how Paige modified the model to ameliorate that bias while still maintaining its predictive power. All of the associated R code is available in the accompanying Jupyter notebook.
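As a rough illustration of one possible mitigation (the vignette's actual approach may differ), you could refit the model without the suspected proxy and check how much the group-level disparity and the overall accuracy change. This sketch continues the hypothetical `loans` and `predictors` objects from above.

```r
# Drop the suspected proxy (zipcode) and refit the boosted model
predictors_noproxy <- setdiff(predictors, "zipcode")
fit2 <- C5.0(x = loans[, predictors_noproxy], y = loans$creditworthy, trials = 10)

pred2 <- predict(fit2, newdata = loans[, predictors_noproxy])

# Has accuracy held up, and have the predicted rates by race evened out?
# (Ideally evaluate on held-out data rather than the training set.)
mean(pred2 == loans$creditworthy)
prop.table(table(race = loans$race, predicted = pred2), margin = 1)
```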
GitHub (ropenscilabs): Ethical Machine Learning: Spotting and Preventing Proxy Bias (Paige Bailey)
While I applaud these efforts to address this bias, are there other ways to fix the potential historical bias that exists due to red-lining in the training data? To me this is the biggest issue with recidivism models.
With credit models, it seems possible to use experts to re-label the data in an unbiased way before training. I don't see how one can do that in the context of recidivism.
Posted by: Chris | June 17, 2018 at 18:38