Those of us who work with R dream of just one thing: getting clean data, ready for analysis. Rarely does this happen, of course. As Steve Lohr wrote in the New York Times in 2014, those working with data:
“spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.”
Sometimes messy data comes in, ahem, creatively organized spreadsheets. Other times it comes as numeric values that Excel has “helpfully” made out of dates (of course, there’s a function to deal with this).
When things are really dire, it doesn’t come to you at all. Instead, it lives on a website. And not a website that has CSVs ready for download. A website website. As in, it’s HTML all the way down.
I faced this situation recently on a project to build a food resource map for the Asian Pacific Islander community of San Francisco. I needed data on food pantries in the city, but unfortunately it existed only on a website, with no way to download it.
Enter web scraping. While some people hear the term and think it involves nefarious practices, it is actually a common way to gather data that is only available in this less structured form (though it is important to check a website’s terms of service to make sure they don’t prohibit it). R has a package to help with web scraping called rvest.
In a recent office hours session, I walked through code that I developed to bring in data on food pantries. The video is below and the code is available on GitHub.
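To give a feel for the general pattern, here is a minimal sketch of scraping with rvest. It is not the code from the session: the HTML snippet, the pantry names and addresses, and the class names (`pantry`, `name`, `address`) are all made up for illustration, and a real site will need its own CSS selectors. It assumes rvest 1.0.0 or later.

```r
library(rvest)

# Hypothetical HTML standing in for a web page that lists food pantries.
# The structure and class names are invented for this example.
page <- minimal_html('
  <div class="pantry">
    <h2 class="name">Example Pantry A</h2>
    <p class="address">123 Example St</p>
  </div>
  <div class="pantry">
    <h2 class="name">Example Pantry B</h2>
    <p class="address">456 Sample Ave</p>
  </div>
')

# Select one node per pantry, then pull each field out of every node.
pantry_nodes <- html_elements(page, ".pantry")

pantries <- data.frame(
  name    = html_text2(html_element(pantry_nodes, ".name")),
  address = html_text2(html_element(pantry_nodes, ".address"))
)

pantries
```

For a live site, you would swap `minimal_html(...)` for `read_html()` with the page’s URL. Selecting the container nodes first and then using `html_element()` on each one keeps the columns aligned, since it returns exactly one result per pantry even when a field is missing.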
When life gives you lemons, as the saying goes, make lemonade. For data, that means making do with the data that is available. And when that data is a website, using the rvest package is a great way to bring it into R!