This past Tuesday was Talk Like A Pirate Day, the unofficial holiday of R (aRRR!) users worldwide. In recognition of the day, Bob Rudis used R to create this map of worldwide piracy incidents from 2013 to 2017.
The post provides a practical example of extracting data from a website without an API, otherwise known as "scraping" data. In this case the website was that of the International Chamber of Commerce, and the provided R code demonstrates several useful R packages for scraping:
- rvest, for extracting data from a web page's HTML source
- purrr, and specifically the safely function, for streamlining the process of iterating over pages that may return errors
- purrrlyr, for iterating over the rows of scraped data
- splashr, for automating the process of taking a screenshot of a webpage
- and robotstxt, for checking whether automated downloads of the website content are allowed
That last package is an important one, because while it's almost always technically possible to automate the process of extracting data from a website, it's not always allowed or even legal. Inspecting the robots.txt file is one check you should definitely make, and you should also check the website's terms of service, which may also prohibit the practice. Even then, you should be respectful and take care not to overload the server with frequent or repeated calls, as Bob demonstrates by spacing requests 5 seconds apart. Finally -- and most importantly! -- even if scraping isn't expressly forbidden, using scraped data may not be ethical, especially when the data is about people who are unable to give their individual consent to your use of it. This Forbes article about analyzing data scraped from a dating website offers a cautionary tale in that regard.
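To make those precautions concrete, here is a minimal sketch of how the pieces might fit together. It is not Bob's code: the helper name polite_fetch, its arguments, and the assumption that pages are served over https:// are all hypothetical and for illustration only. The sketch checks robots.txt with robotstxt::paths_allowed, wraps rvest::read_html in purrr's safely, and pauses between requests.

```r
library(purrr)  # safely(), map(), set_names()

# Hypothetical helper (not from the original post): fetch the permitted pages
# of a site one at a time, pausing between requests. `check` and `read`
# default to the robotstxt and rvest functions, but can be swapped out,
# e.g. to try the logic without network access.
polite_fetch <- function(paths, domain, delay = 5,
                         check = robotstxt::paths_allowed,
                         read  = rvest::read_html) {
  # 1. Ask robots.txt which paths automated clients may fetch.
  allowed <- check(paths = paths, domain = domain)

  # 2. Wrap the reader so that a failing page returns an error object
  #    instead of aborting the whole run.
  safe_read <- safely(read)

  # 3. Build URLs (assumption: the site is served over https) and keep
  #    only the permitted ones.
  urls <- paste0("https://", domain, paths)[allowed]

  # 4. Request each page, sleeping `delay` seconds before every request.
  #    Each element of the returned list has a `result` and an `error`.
  map(set_names(urls), function(url) {
    Sys.sleep(delay)
    safe_read(url)
  })
}
```

A call such as polite_fetch(c("/page-1", "/page-2"), "example.com") (a placeholder site and placeholder paths) would return one entry per permitted URL, and map(pages, "result") would then pull out the pages that were read successfully. None of this replaces reading the terms of service or thinking through the ethics of using the data.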
This piracy data example, however, provides a case study of using websites and the data they provide in the right way. Follow the link below for all the details.