Last month the Australian people signaled their approval of legalizing same-sex marriage by a 62%:38% margin in a national survey. (On a personal note, I was elated and relieved by the result: my husband and I have discussed eventually retiring to Australia, and with this decision our marriage would be recognized there.) While fears of a surprise Brexit-like electoral backlash proved unfounded, researchers including R user Miles McBain explored the results for correlations to demographic variables. This process wasn't as simple as it might have been though: the Australian Bureau of Statistics released the results as a pair of Excel files that violate just about every good practice for sharing data in spreadsheets:
Miles shares the R code he used to extract useful data from this spreadsheet as a blog post that makes a great case study in dealing with messy data using R. The post demonstrates how he used the read_excel
function (from readxl package) to extract specific sub-tables from the spreadsheet by specifying row and column ranges, and then use the dplyr package to clean up and merge the data. If you want to explore the data yourself, you can find the R code and the source data in this Github repository.
In a follow-up post, Miles combines the same-sex marriage survey data with Australian Census data to explore various demographic relationships. Unlike the US Census data (which is easily accessible in R thanks to the tidycensus package), there's no interface package for Australian Census data. (Selected tables are available in the Census2016 package, however.) Instead, Miles demonstrates how to use R to download and extract data from the the "Census DataPacks" (CSV data files and Excel data dictionaries) provided by the Australian Bureau of Statistics. Yet more data wrangling allows Miles to create summary charts of the responses, such as this chart of proportion voting No by percent of the district population declaring a religious affiliation, broken down by state. As you may expect, those districts with more religious populations voted No at greater rates.
Both of these post provide great examples of working with government data, which is often provided in inconvenient formats with messy structures. Follow the links below for step-by-step guides, including the R code used to extract the data, structure it for analysis, and create useful charts.
Medium (Miles McBain): Tidying the Australian Same Sex Marriage Postal Survey Data with R; Combining Australian Census data with the Same Sex Marriage Postal Survey in R
I agree with your assessment of ABS spreadsheets in general. There are even more things that are annoying, like changes made by hand with different releases.
However in this case you may be slightly at fault. I say slightly because it's not obvious. I had not looked at the survey results. When I did I noticed that it was disseminated using ABSStat.
This means you have a choice for the format of your data. You could have saved it as a csv file. What is not obvious is that unlike the Excel format it does not retain the structure of the data, but flattens it. Also you have to choose the csv string option. So no not really your fault.
Posted by: Tom M | December 01, 2017 at 19:34
Hey Tom,
They might have been produced by the ABSStat system but they don't appear to be accessible from there as far as I can tell.
I got them from this page: https://marriagesurvey.abs.gov.au/results/downloads.html
Right now xlsx is the only available option.
Posted by: Miles McBain | December 02, 2017 at 03:41
Hi Miles,
The AMLPS data is on ABS.Stat and available via REST web service call or csv export under "Snapshots of Australia | Australian Marriage Law Postal Survey". This was also released on the 15th.
It would be interesting to see your clean-up and Census combination work using this data source.
Posted by: Julian | December 04, 2017 at 22:56