Now that Spain has won the World Cup, it's interesting to go back and look at some metrics from the matches and see if we can tease out what characteristics made for a winning Cup team this time around. Fortunately, the Guardian's Data Blog has made a wealth of World Cup statistics available, with data on every player of every team (position, shots at goal, passes, tackles made, and saves), plus aggregate statistics for each team (goals, % shots on target, fouls, and much more). The data are ripe for analysis in R, especially given that you can download the data directly from the cloud as an R object with the following commands:
players <- read.csv("http://spreadsheets.google.com/pub?key=tOM2qREmPUbv76waumrEEYg&single=true&gid=1&range=A1%3AH596&output=csv")
teams <- read.csv("http://spreadsheets.google.com/pub?key=tOM2qREmPUbv76waumrEEYg&single=true&gid=0&range=A1%3AAG15&output=csv")
The method I've described before for accessing a Google Spreadsheet from R didn't quite apply here, as those instructions assume you own the document (and have access to the Publish menu). But some experimentation and tweaking of the spreadsheet URL made it work: the key parameters seem to be "&gid=" (sheet number), "&range=" (cell range, with %3A encoding the colon) and "&output=csv" (to download in CSV format). It would be nice if Google published the specs for forming URLs like these, but as far as I know they don't.
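To make the role of each parameter easier to see, here is a minimal sketch that assembles the same two URLs from their parts. The helper name `sheet_csv_url` is my own invention for illustration, not part of any package, and the parameter meanings are only the ones inferred above by experimentation.

```r
# Hypothetical helper: build the CSV-export URL from its parts.
# Parameter meanings are inferred from experimentation, not from
# any published Google specification.
sheet_csv_url <- function(key, gid, range) {
  paste0(
    "http://spreadsheets.google.com/pub",
    "?key=", key,
    "&single=true",
    "&gid=", gid,                                      # sheet number
    "&range=", gsub(":", "%3A", range, fixed = TRUE),  # encode the colon
    "&output=csv"                                      # ask for CSV
  )
}

key <- "tOM2qREmPUbv76waumrEEYg"
players_url <- sheet_csv_url(key, gid = 1, range = "A1:H596")
teams_url   <- sheet_csv_url(key, gid = 0, range = "A1:AG15")
```

Passing `players_url` or `teams_url` to `read.csv()` is then equivalent to the two commands shown earlier.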
Anyway, a couple of bloggers have used these data to great effect to express the results of the World Cup visually using R graphics. For example, the R Charts blog used ggplot2 to look at the number of fouls committed by each team during the tournament:
(Personally, I would have sorted the rows by descending number of fouls, rather than alphabetically.) Interesting to see that Cup champions Spain are in the middle of the pack on fouls, whereas runners-up Netherlands lead this table (boosted heavily by their performance in the final).
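As a rough sketch of the sorting I have in mind, the snippet below orders a small data frame by descending fouls. The team names and counts are placeholders rather than the real figures, and the column names `team` and `fouls` are assumptions, so adjust them to whatever the downloaded data actually uses.

```r
# Toy data: placeholder names and foul counts, NOT the real figures.
toy_fouls <- data.frame(
  team  = c("Team A", "Team B", "Team C", "Team D"),
  fouls = c(40, 95, 62, 28),
  stringsAsFactors = FALSE
)

# Order the rows from most to fewest fouls.
sorted <- toy_fouls[order(toy_fouls$fouls, decreasing = TRUE), ]

# Plots follow factor level order, not row order. The levels are
# reversed here so that a horizontal chart that draws the first level
# at the bottom puts the team with the most fouls at the top.
sorted$team <- factor(sorted$team, levels = rev(sorted$team))
```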
Blogger Jason Priem also took a look at the data, this time with a scatterplot of goals per game by fouls per game, related to how far each team advanced in the competition:
(Download Jason's code for this chart here.) Again it's interesting to see the positions of the two finalists here, with Netherlands on the extreme frontier for both fouls and goals, while Spain is moderate on goals per game and near the lowest on fouls per game.
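Per-game rates like the ones on these axes are just totals divided by games played. Here is a minimal sketch with made-up numbers; the column names `games`, `fouls` and `goals` are assumptions for illustration, not necessarily the names in the Guardian data.

```r
# Toy data: placeholder values, NOT the real tournament figures.
toy_rates <- data.frame(
  team  = c("Team A", "Team B"),
  games = c(3, 7),
  fouls = c(45, 98),
  goals = c(3, 14)
)

# Divide totals by games played so that teams eliminated early can be
# compared with teams that went deep into the tournament.
toy_rates$fouls_per_game <- toy_rates$fouls / toy_rates$games
toy_rates$goals_per_game <- toy_rates$goals / toy_rates$games
```

In this toy example Team B commits more fouls in total but fewer per game than Team A, which is the kind of difference raw totals can hide.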
It's a rich dataset and I'm sure other Revolutions readers could come up with some equally interesting visualizations. If you do, tell us about it in the comments.
Guardian Data Blog: World Cup 2010 statistics: every match and every player in data
I wonder if there is data available for the "area of play" shown as a heatmap at nyt.
Posted by: apeescape | July 12, 2010 at 18:16
I think the first chart is a bit misleading because it doesn't adjust the number of fouls for the number of games played. Netherlands with over 90 fouls in the tournament seems extreme but in terms of fouls per game (vertical axis of the second chart) they are not.
Posted by: Douglas Bates | July 13, 2010 at 07:38
Useless chart. 16 of the 32 teams played only 3 games, 4 teams (Spain, Netherlands, Uruguay and Germany) played 7 games, the other teams had numbers in between. So the absolute number of fouls is completely meaningless. Applying statistical algorithms w/o theoretical foundation is always a bad idea.
Posted by: clattr | July 13, 2010 at 11:35
Agree with both comments above -- but that's the beauty of having the data available, you can always visualize it another way. The teams data set does have the fouls per game metric, which is comparable between teams playing different numbers of games.
Posted by: David Smith | July 13, 2010 at 11:37
Enough with the visualization craze! There are a bunch of graphic designers/pseudo artists who think that because they use Processing or can program some R they are producing better ways to interpret data. They don't have a grasp of statistics at all and they are confusing people with pseudo-scientific crap.
Posted by: Tufed Wardte | July 13, 2010 at 11:54
The first chart is total crap. Even a basic understanding of statistics is lacking.
Posted by: Georgina | July 13, 2010 at 12:52