While there are many admirable efforts to increase participation by women in STEM fields, in many programming teams men still outnumber women, often by a significant margin. Specifically by how much is a fraught question, and accurate statistics are hard to come by. Another interesting question is whether the gender disparity varies by language, and how to define a "typical programmer" for a given language.
Jeff Allen from Trestle Tech recently took an interesting approach using R to gather data on gender ratios for programmers: get a list of the top coders for each programming language, and then count the number of men and women in each list. Neither task is trivial. For a list of coders, Jeff scraped GitHub's list of trending repositories over the past month by programming language, and then extracted the avatars of the listed contributors. He then used the Microsoft Cognitive Services Face API on each avatar to determine the contributor's apparent gender, and tallied up the results. You can find the R code he used on GitHub.
I used Jeff's code to re-run his results based on GitHub's latest monthly rankings. The first thing I needed to do was to request an API key; a trial key is free with a Microsoft account. (The trial key limits the number of requests per second, but the R code is written to throttle the rate of requests accordingly.) I limited my search to the languages C++, C#, Java, JavaScript, Python, R and Ruby. The percentage of contributors identified as female, within each language, is shown below:
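To give a sense of the final tallying step, here is a minimal sketch in R. The data frame below is made up for illustration (it is not Jeff's data; his code populates these labels from the Face API responses), and avatars where no face was detected are excluded from the denominator:

```r
# Hypothetical example data: one row per contributor avatar, with the
# apparent gender label ("male"/"female") or NA when no face was detected.
faces <- data.frame(
  language = c("R", "R", "R", "C++", "C++", "Python", "Python", "Python"),
  gender   = c("female", "male", "male", "male", NA, "female", "male", "male"),
  stringsAsFactors = FALSE
)

# Drop avatars with no detected face, then compute the percentage of
# female-classified faces within each language.
detected   <- faces[!is.na(faces$gender), ]
pct_female <- sapply(split(detected$gender, detected$language),
                     function(g) 100 * mean(g == "female"))
print(round(pct_female, 1))
```

Note that denominators this small are exactly why the caveats below matter: a single avatar can move a language's percentage by several points.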
According to this analysis, none of the contributors to top C++ projects on GitHub are female; by contrast, almost 10% of contributors to R projects are female.
Now, these data need to be taken with a grain of salt. The main issue is numbers: fewer than 100 programmers per language are identified as "top programmers" via this method, and sometimes significantly fewer (just 45 top C++ contributors were identified). Part of the reason for this is that not all programmers use their face as an avatar; those that used a symbol, logo or cartoon were not counted. Furthermore, it's reasonable to assume that there's a disparity in the rate at which women use their own face as an avatar compared to men, which would add bias to the above results in addition to the variability from the small numbers. Finally, the gender determination is based on an algorithm which classifies faces as only male or female, and isn't guaranteed to match the gender identity of the programmer (or their avatar).
Nonetheless, it's an interesting example of using social network data in conjunction with cognitive APIs to conduct demographic studies. You can find examples of using other data from the facial analysis, including apparent happiness by language, at the link below.
(Update June 15: re-ran the analysis and updated the chart above to actually display percentages, not ratios, on the y-axis. The numbers changed slightly as the GitHub data changed. The old chart is here.)
Trestle Tech: EigenCoder: Programming Stereotypes
"According to this analysis, none of the contributors top C++ projects on GitHub are male; by contrast, almost 10% of contributors to R projects are female." ... do you mean ' none of the top C++ projects on Github are by female;' ?
Posted by: Debajyoti Nag | June 16, 2016 at 03:40
I think the most interesting story is the male bias in programming, and I would like to see the vertical axis go from 0 to 1. That way the bias is evident, and since these are proportional data we could still see differences between languages. Thanks for the post.
Posted by: Chris | June 16, 2016 at 04:42
Thanks for pointing that out, @Debajyoti. I've corrected the error in the post above.
Posted by: David Smith | June 16, 2016 at 06:30
The reasons you state this should be taken "with a grain of salt" I think are much worse than just a grain of salt. The implication is that these figures represent actual trends with some small error for the stated reasons, but it is possible those reasons are enough to obliterate the reported trends entirely. Although it is interesting to use resources like this to do some analysis, I don't think we should ignore methodological rigor when reporting results.
Posted by: Brian Stamper | June 17, 2016 at 07:06
Are the percentages above the percentage of all avatars that were identified as female or the percentage of those where the gender could be identified?
It would be interesting to see the percentages of male, female and unknown.
Posted by: Stephen Dragoni | June 21, 2016 at 07:05
The ratio is not surprising; however, it would be nice to see the actual counts of these categories. A non-probabilistic sample (top 100) on top of a probabilistic sample (gender) can really skew results.
Posted by: Ralph winters | July 09, 2016 at 07:23