
July 20, 2009

Comments


If web scraping should be a last resort, what alternatives do you recommend? I operate a website that requires historical information on the yields of Treasury securities with various maturities. We get this from the Treasury's website, and several times in the past few years they have made a slight change that broke our application. The only alternatives I can think of are to (1) forget about an automated tool altogether and "hand copy" the rates from the Treasury site into a web application, or (2) try out my web-scraping application every day (or maybe several times a day) just to make sure it is still working. Neither alternative is very attractive to me.
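One way to make alternative (2) less painful is to have the scrape validate itself on a schedule instead of being checked by hand. A minimal sketch in R; the column name and the plausibility bounds are assumptions about what the scraper returns, not anything from the Treasury site:

check_rates <- function(rates) {
  # 'rates' is assumed to be the data frame the scraper produced,
  # with a numeric 'yield' column; the 0-25 bounds are a guess at a
  # plausible range for percentage yields, not a Treasury rule
  ok <- is.data.frame(rates) && nrow(rates) > 0 &&
    all(!is.na(rates$yield)) &&
    all(rates$yield >= 0 & rates$yield < 25)
  if (!ok)
    warning("Treasury scrape failed validation at ", Sys.time())
  ok
}

Run from cron after each scrape, a check like this at least turns silent breakage into an alert.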

I wrote a post on multiple ways to use R and other tools for web scraping. Web scraping is useful for downloading public (and free) data for data augmentation. The best tool I found was a Firefox add-on called iMacros. The post is here:
http://www.decisionstats.com/2009/01/26/

Note there is another way: use the RDCOM package in R.
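For readers who want to stay inside R for this, a minimal sketch using the XML package's readHTMLTable(), which pulls HTML tables out of a page into data frames; the URL below is a placeholder:

library(XML)  # provides readHTMLTable()

# point this at the page holding the table you need (URL is a placeholder)
tables <- readHTMLTable("http://example.com/rates.html",
                        stringsAsFactors = FALSE)
str(tables)          # inspect the list to find the table of interest
rates <- tables[[1]]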

John, short of asking the site owner to also offer the data in a structured format (like CSV or XML), there's little one can do. It's especially frustrating in the case of government data, which as citizens we've already paid to have collected. That's why the open government initiative is so important.
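As a sketch of how much simpler the structured route is, here's what reading a CSV feed directly looks like in R (the URL is a placeholder, not a real Treasury endpoint):

# read.csv accepts URLs directly; the format is a contract rather than
# a side effect of page layout, so it doesn't break when the HTML changes
rates <- read.csv("http://example.gov/treasury-yields.csv",
                  stringsAsFactors = FALSE)
head(rates)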

Visit http://www.website-scraping.com. I used their services for my project, and these guys know what they are doing.


David,

Thanks for the tip. I wanted to get spectra data for greenhouse gases to generate a 5-panel chart.

I found a good site for the source data, the NIST Chemistry WebBook, but unfortunately the data was buried in web pages with variable record lengths, depending on the gas.

I generated an R script to read in, reformat, and plot each series in a separate panel (a generic sketch of that pattern appears after this comment).

Here's a link to my post; there's a link to my R script in it, posted on Google Docs for easy download.

Please keep the R script tips coming.

D Kelly O'Day
http://chartgraphs.wordpress.com
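A generic sketch of the read/reformat/plot-in-panels pattern Kelly describes, in base R; the gas names and data below are placeholders, not the NIST series from his post:

gases <- c("CO2", "CH4", "N2O", "O3", "H2O")
# one row per gas, so each series lands in its own panel
par(mfrow = c(length(gases), 1), mar = c(2, 4, 2, 1))
for (g in gases) {
  # stand-in for reading and reformatting one gas's variable-length records
  d <- data.frame(wavenumber = seq(500, 4000, by = 50),
                  absorbance = runif(71))
  plot(d$wavenumber, d$absorbance, type = "l",
       main = g, xlab = "", ylab = "absorbance")
}

The par(mfrow = ...) call splits the graphics device into stacked panels, which is the simplest base-R route to a multi-panel chart like the one described.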

Try ScrapePro Web Scraper Designer instead of the useless XPath method. http://www.scrapepro.com


The comments to this entry are closed.
