Sometimes the data we need isn't packaged up nicely into a simple comma-separated file or database. It's out there, but only in unstructured (or semi-structured) form: displayed as a table on a Web page, for example. With the RCurl package, some regular expressions, and a little knowledge of HTML, it's possible to extract (or scrape) the structured data you need. Programming R gives a simple example of scraping the r-help listserv archives to tabulate the most prolific posters. (Incidentally, the same techniques are used in the O'Reilly Short Cut Data Mashups with R, in the context of a much more detailed example.)
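To give a feel for the regular-expression step, here is a minimal sketch in base R. It is not the Programming R example: the HTML fragment, the poster names, and the counts are all made up for illustration, and the page text is hard-coded so the snippet runs without a network connection. In a real script that text might come from RCurl's getURL().

```r
# Hypothetical HTML fragment standing in for a downloaded page.
# (Assumption: in practice this text would be fetched, e.g. with RCurl.)
html <- paste(
  "<table>",
  "<tr><th>Poster</th><th>Posts</th></tr>",
  "<tr><td>Alice</td><td>12</td></tr>",
  "<tr><td>Bob</td><td>7</td></tr>",
  "</table>",
  sep = "\n"
)

# A data row is a <tr> holding exactly two <td> cells.
# The header row uses <th>, so it is not matched.
row_pattern <- "<tr><td>([^<]*)</td><td>([^<]*)</td></tr>"
rows <- regmatches(html, gregexpr(row_pattern, html))[[1]]

# Pull the two captured groups out of each matched row.
posters <- data.frame(
  poster = sub(row_pattern, "\\1", rows),
  posts  = as.integer(sub(row_pattern, "\\2", rows)),
  stringsAsFactors = FALSE
)

# Sort by post count, most prolific first.
posters <- posters[order(-posters$posts), ]
print(posters)
```

A pattern like this is tied to the exact markup of the page, so even a small change in the HTML (an added attribute, extra whitespace) will make it stop matching.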



If web scraping should be a last resort, what alternatives do you recommend? I operate a website that requires historical information on the yield of Treasury securities with various maturities. We get this from the Treasury's website, and several times in the past few years they have made a slight change that broke our application. The only alternatives I can think of are to (1) forget about an automated tool altogether, and "hand copy" the rates from the Treasury site into a web application; or (2) try out my web-scraping application every day (or maybe several times a day) just to make sure it is still working. Neither alternative is very attractive to me.
Posted by: John S. | July 20, 2009 at 14:10
I wrote a post on multiple ways to use R and other tools to web scrape. Web scraping is useful for downloading public (and free) data for data augmentation. The best tool was a Firefox add-on called iMacros. The post is here:
http://www.decisionstats.com/2009/01/26/
Note there is another way: use the RDCOM package in R.
Posted by: Ajay Ohri | July 20, 2009 at 15:09
John, short of asking the site owner to also offer the data in a structured format (like CSV or XML), there's little one can do. It's especially frustrating in the case of government data, which as citizens we've already paid to have collected. That's why the open government initiative is so important.
Posted by: David Smith | July 20, 2009 at 17:10
Visit http://www.website-scraping.com. I used their services for my project, and these guys know what they are doing.
Posted by: Vibhu | September 13, 2009 at 09:21
Several times in the past few years they have made a slight change that broke our application.
Posted by: qings blog | December 01, 2009 at 21:31
David
Thanks for the tip. I wanted to get spectra data for greenhouse gases to generate a five-panel chart.
I found a good site for the source data, the NIST Chemistry WebBook. Unfortunately, the data was buried in web pages with variable record lengths, depending on the gas.
I wrote an R script to read in, reformat and plot each series in a separate panel.
Here's a link to my post. It includes a link to my R script, which I posted on Google Docs for easy download.
Please keep the R script tips coming.
D Kelly O'Day
http://chartgraphs.wordpress.com
Posted by: D Kelly O'Day | December 07, 2009 at 07:29
Try ScrapePro Web Scraper Designer instead of the useless XPath method. http://www.scrapepro.com
Posted by: csharpp | September 22, 2010 at 14:09
The only alternatives I can think of are to (1) forget about an automated tool altogether, and "hand copy" the rates from the Treasury site into a web application.
Posted by: koency | November 15, 2010 at 01:35