I’ve spent my last few weeks as a Knight Lab student fellow exploring web scraping for non-programmers through an open source browser plugin called Haystax.
As a journalism student who picked up computer science, I love scraping: you create a program that acts like a reporter, tracking down the information you want from the web pages you specify. It’s a useful technique that saves journalists time copying and pasting data from an organization’s website, and learning to scrape builds a working knowledge of HTML and a scripting language like Python, Ruby or PHP.
But for a reporter on deadline, a newsroom might not be the most comfortable place to learn Python the hard way. What if journalists could scrape data from the web without writing code?
Reporter’s Lab at Duke University has written consumer reviews of commercial and open source projects that let users scrape web pages with graphical interfaces instead of code. The lab gave extra attention to proprietary and open source scraping tools for journalists after Google decommissioned Needlebase, according to Tyler Dukes, a reporter at WRAL in Raleigh, North Carolina, who served as the lab’s managing editor.
Dukes attended a hackathon in San Francisco last summer and rallied a team of journalists, civic activists and developers around the idea of a visual scraper: a browser plugin that scrapes data when a user highlights text. Haystax was born in a weekend as a fork of Web X-Ray Goggles, a Mozilla tool that lets users select and modify HTML by hovering over elements on a web page.
Developer Randall Leeds added commands to select an HTML table and scrape its rows and columns into a downloadable CSV by pressing the ‘S’ key. If a table spans multiple pages, like an employee database with hundreds of pages of entries, Haystax can scrape the HTML tables from each page into one spreadsheet.
Haystax scrapes using XPath, a language for describing where elements sit in a web page’s HTML. When a user selects the table they want to scrape, Haystax stores the table’s location on the page as an XPath expression and reuses that expression to find the same table on the other pages it visits as it scrapes.
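To make that concrete, here is a minimal Python sketch of the idea, not Haystax’s actual code: it assumes a saved XPath and a paginated database at an invented URL, and pulls the same table from each page into one CSV.

```python
# A minimal sketch (not Haystax's code) of reusing one saved XPath to pull
# the same table from a run of sequential pages into a single CSV.
# The URL pattern and XPath below are stand-ins for whatever a user selects.
import csv
import requests
from lxml import html

TABLE_XPATH = "/html/body/div[2]/table[1]"          # hypothetical XPath saved from the user's selection
PAGE_URL = "http://example.gov/employees?page={n}"  # hypothetical paginated database

with open("employees.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for n in range(1, 6):                            # pages 1-5, just for the example
        page = html.fromstring(requests.get(PAGE_URL.format(n=n)).content)
        tables = page.xpath(TABLE_XPATH)
        if not tables:
            continue                                 # the table may not appear on every page
        for row in tables[0].xpath(".//tr"):
            if n > 1 and row.xpath("./th"):
                continue                             # skip repeated header rows after the first page
            cells = [cell.text_content().strip() for cell in row.xpath("./th|./td")]
            writer.writerow(cells)
```

The point of the sketch is that once the XPath is captured, the looping and CSV writing are mechanical, which is exactly the part a visual tool can automate for a non-programmer.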
However, Haystax and its table extraction have their limits. Its HTML-table-to-CSV conversion assumes column descriptions appear as headers across the top row of a table, but some sites label their data down the side instead. More importantly, data is not always presented in HTML tables at all. And although Haystax can scrape sequential pages, web databases often require users to enter search terms or specify query parameters before they will show any data.
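That last limitation is the hard one: a point-and-click tool has no obvious way to fill in a search form. A scraper written in code handles it by submitting the query parameters itself. The sketch below, with an invented endpoint and invented parameter names, shows what that looks like.

```python
# Hypothetical example of the query-driven case a point-and-click scraper struggles with.
# The search endpoint and parameter names here are invented for illustration.
import requests
from lxml import html

SEARCH_URL = "http://example.gov/teachers/search"    # hypothetical search endpoint

def search(last_name):
    """Submit a search query and return the rows of the first results table, if any."""
    response = requests.get(SEARCH_URL, params={"last_name": last_name, "status": "active"})
    page = html.fromstring(response.content)
    tables = page.xpath("//table")
    if not tables:
        return []
    return [
        [cell.text_content().strip() for cell in row.xpath("./th|./td")]
        for row in tables[0].xpath(".//tr")
    ]

# A reporter working through a list of names would simply loop over them:
for name in ["Smith", "Jones", "Lee"]:
    print(search(name))
```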
Programmers often write custom scraper scripts tailored to one specific dataset, but an open source tool like Haystax needs to generalize: a scraper can run into anything from unusual data formats to authentication hurdles.
Each scraper reviewed by Reporter’s Lab is tested on the same four datasets, which include a public directory of South Dakota lobbyists and a database of British Columbia teachers. The datasets came from newsrooms that ran into them in the course of their reporting, Dukes said.
Haystax does well with its intended use case, the South Dakota lobbyists, but can’t handle query-based scraping. Some of the commercial scrapers Reporter’s Lab reviewed, like Helium Scraper or Outwit Hub, succeed in more tests but also cost between $50 and $150.
Because Haystax is open source (visit the GitHub repo), part of my research was to see if we could improve it and overcome some of these limitations. However, it seemed that its most appealing features, simplicity and close integration with the browser, might also be the source of its limitations. It’s challenging to write a complex application within the confines of Firefox’s API, and it’s quite a feat to design an interface that supports the many special cases without overwhelming non-technical journalists.
“I think the goal really here is can we get a suite of tools that can take care of a lot of the cases,” Dukes said. “There's no silver bullet, but if you can give people simple tools for tackling simple problems, it frees time for hard problems.”
It’s a mentality that ScraperWiki seems to be adopting with its new beta version, which offers tools for different skill levels. The original ScraperWiki let users write and publish scrapers and store data on a profile page, but gave advanced programmers and beginners the same options. The new version gives developers more choice over their programming environments while also letting users scrape Tweets without writing code.
“New ScraperWiki is built for two markets,” says the beta release page. “The coders get the advantage of simple tools to automate the boring bits, and non-coders get to look under the hood and learn from (and maybe even improve) the tools they use every day.”
Scraping is a valuable tool for reporters of all kinds, making open projects like Haystax and ScraperWiki important to the journalism community.