Spreadsheets remain an important way for people to share and work with data. Among other providers, Google has provided the ability to create online spreadsheets and other documents.
Back in 2009, David Smith posted a blog entry on how to use R, and specifically the XML package to import data from a Google Spreadsheet. Once you marked your Google sheet as exported, it took about two lines of code to import your data into a data frame.
But things have changed
More recently, it seems that Google changed and improved the Spreadsheet product. Google's own overview of changes lists some changes, but one change isn't on this list. In the previous version, it was possible to publish a sheet as a csv file. In the new version it is still possible to publish a sheet, but the ability to do this as csv is no longer there.
On April 5, 2014 somebody asked a question on StackOverflow on how to deal with this.
Because I had the same need to import data from a spreadsheet shared in our team, I set out to find and answer.
Quick overview of publishing a sheet in the new version of Google docs
To publish a Google Docs spreadsheet is really as simple as following these three steps:
- Create a google sheet
- Publish to web
- Copy the document link
R code to read the Google data
Here is the code. You will need to load the XML package before using this.
The function readGoogleSheet()
returns a list of data frames, one for each table found on the Google sheet.
Next, we need a function to clean the individual tables.
cleanGoogleTable()
removes empty lines inserted by Google, removes the row names (if they exist) and allows you to skip empty lines before the table starts:
See it in action
I created a Google Spreadsheet with data about the periodic table of elements. I then published this sheet. If you follow the link, you will notice that the document contains two sheets. Sheet1 contains the periodic table data.
To read the table, try the following three lines of code:
gdoc <- "https://docs.google.com/spreadsheets/d/1MQ50_tn76GqQAOpFigcHms4zFqkoM_JS4sOittv_vgA/pubhtml"
elem <- readGoogleSheet(gdoc)
m <- cleanGoogleTable(elem, table=1)
head(m)
Atomic no Name Symbol Group Period Block State at STP Occurrence
1 1 Hydrogen H 1 1 s Gas Primordial
2 2 Helium He 18 1 s Gas Primordial
3 3 Lithium Li 1 2 s Solid Primordial
4 4 Beryllium Be 2 2 s Solid Primordial
5 5 Boron B 13 2 p Solid Primordial
6 6 Carbon C 14 2 p Solid Primordial
tail(m)
Atomic no Name Symbol Group Period Block State at STP Occurrence
113 113 (Ununtrium) Uut 13 7 p Synthetic
114 114 (Ununquadium) Uuq 14 7 p Synthetic
115 115 (Ununpentium) Uup 15 7 p Synthetic
116 116 (Ununhexium) Uuh 16 7 p Synthetic
117 117 (Ununseptium) Uus 17 7 p Synthetic
118 118 (Ununoctium) Uuo 18 7 p Synthetic
You can actually force googledocs to give you a csv, the option is just not in the menus.
https://docs.google.com/spreadsheets/d/KEY/export?format=csv&id=KEY
inserting your unique key inplace of 'KEY'. You can then just pull it into R with RCurl:
require(RCurl)
url <- https://docs.google.com/spreadsheets/d/KEY/export?format=csv&id=KEY
myCsv <- getURL(url,.opts=list(ssl.verifypeer=FALSE))
test <- read.csv(textConnection(myCsv))
Good alternative method though. Especially as I haven't figured out how to tab through multiple sheets within a single document.
Posted by: Jack | June 03, 2014 at 09:11
Hi, Jack
Yes, I'm aware of this method. But this also seemed to have changed with the new version of Google Spreadsheets.
I tried this with my URL:
url <- "https://docs.google.com/spreadsheets/d/1MQ50_tn76GqQAOpFigcHms4zFqkoM_JS4sOittv_vgA/export?format=csv&id=KEY"
myCsv <- getURL(url,.opts=list(ssl.verifypeer=FALSE))
test <- read.csv(textConnection(myCsv))
test
The result is not a useful data frame:
X.HTML.
1
2 Moved Temporarily
3
4
5 Moved Temporarily
6 The document has moved here.
7
8
If you find a way of fixing this, I'd be most interested!
Andrie
Posted by: Andrie de Vries | June 03, 2014 at 09:18
not working as mentioned please help.
Satish
Posted by: Satish | June 04, 2014 at 01:39
Hi Andrie,
Interesting. Did you include your KEY in the URL twice?? (I noticed your URL still has the words KEY in it at the end). For some reason you need to include the key twice. I've just tried it on my system and it is working, and producing a useable data frame.
url <- "https://docs.google.com/spreadsheets/d/1MQ50_tn76GqQAOpFigcHms4zFqkoM_JS4sOittv_vgA/export?format=csv&id=1MQ50_tn76GqQAOpFigcHms4zFqkoM_JS4sOittv_vgA"
Your method is much slicker though!!
Jack
Posted by: Jack | June 04, 2014 at 06:25