Web scrapers for journalists: Haystax and other graphical interface systems

I’ve spent my last weeks as a Knight Lab student fellow exploring web scrapers for non-programmers through an open source browser plugin called Haystax.

As a journalism student who picked up computer science, I love scraping because you create a program that acts like a reporter, tracking the information you want from web pages you specify. It’s a technique that saves journalists hours of copying and pasting data from an organization’s website, and learning to scrape builds a working knowledge of HTML and a scripting language like Python, Ruby or PHP.

But for a reporter on deadline, a newsroom might not be the most comfortable place to learn Python the hard way. What if journalists could scrape data from the web without writing code?

Reporter’s Lab at Duke University has written consumer reviews of commercial and open source projects that let users scrape web pages with graphical interfaces instead of writing code. The folks at Reporter’s Lab gave extra attention to proprietary and open source scraping tools for journalists when Google decommissioned Needlebase, according to Tyler Dukes, a reporter at WRAL in Raleigh, North Carolina, who served as managing editor for Reporter’s Lab.

Dukes attended a hackathon in San Francisco last summer and rallied a team of journalists, civic activists and developers around the idea of a visual scraper, a browser plugin that scrapes data when a user highlights text. Haystax was born in a weekend as a fork of Web X-Ray Goggles, a Mozilla tool that lets users select and modify HTML by hovering over elements on a web page.

Developer Randall Leeds added commands to select an HTML table and scrape its rows and columns into a downloadable CSV by pressing the ‘S’ key. If a table spans multiple pages, like an employee database with hundreds of pages of entries, Haystax can scrape the HTML tables from each page into one spreadsheet.
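That conversion is conceptually simple. Here is a rough sketch in Python (Haystax itself runs inside Firefox, so this is an illustration of the idea rather than its actual code): walk the HTML, collect cell text row by row, and write the rows out as CSV.

```python
import csv
import io
from html.parser import HTMLParser

class TableScraper(HTMLParser):
    """Collect the rows of every <table> on a page as lists of cell text."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._cell = [], None, None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []            # start a new row
        elif tag in ("td", "th"):
            self._cell = []           # start collecting a cell's text

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

def table_to_csv(html):
    """Scrape every table row from the page and return it as CSV text."""
    scraper = TableScraper()
    scraper.feed(html)
    out = io.StringIO()
    csv.writer(out).writerows(scraper.rows)
    return out.getvalue()

page = ("<table><tr><th>Name</th><th>Client</th></tr>"
        "<tr><td>A. Smith</td><td>Acme</td></tr></table>")
print(table_to_csv(page))
```

Scraping a multi-page database, as Haystax does, amounts to feeding each page through the same routine and appending the rows to one spreadsheet.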

Haystax scrapes using XPath, a query language that describes where elements sit within a page’s HTML. When a user identifies the table they want to scrape, Haystax stores the table’s location on the page as an XPath expression, then uses that expression to find matching tables on the other pages it visits as it scrapes.
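To make the idea concrete, here is a minimal sketch using Python’s standard-library `xml.etree.ElementTree`, which supports a subset of XPath. The page markup and the path expression are invented for illustration; real HTML is usually messy enough to need an HTML-aware parser.

```python
import xml.etree.ElementTree as ET

# Two pages with the same layout but different data, as paginated
# search results typically have. (Simplified, well-formed markup.)
page1 = ("<html><body><div><p>intro</p></div>"
         "<div><table><tr><td>Alice</td></tr></table></div></body></html>")
page2 = ("<html><body><div><p>intro</p></div>"
         "<div><table><tr><td>Bob</td></tr></table></div></body></html>")

# The location a visual scraper would record once: "the table
# inside the second <div>". The [2] is a 1-based position predicate.
table_path = "./body/div[2]/table"

def scrape(page, path):
    """Apply the stored path to a page and return its cell text."""
    table = ET.fromstring(page).find(path)
    return [td.text for tr in table for td in tr]

print(scrape(page1, table_path))  # ['Alice']
print(scrape(page2, table_path))  # ['Bob']
```

Because the path describes *where* the table lives rather than *what* it contains, the same expression keeps working as the scraper pages through a database.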

However, Haystax and its table extraction have their limits. Its HTML-table-to-CSV conversion assumes column descriptions appear as headers in the top row of a table, but some sites display data descriptions down the side. More importantly, data is not always presented in HTML tables at all. And although Haystax can scrape sequential pages, web databases often require users to enter search terms or specify query parameters to access data.
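The side-label case at least has a mechanical workaround: transpose the table before export so the labels end up on the top row. A hypothetical helper (not a Haystax feature):

```python
def transpose(rows):
    """Turn a table with labels down the first column into
    headers-on-top form, so the usual CSV export applies."""
    return [list(col) for col in zip(*rows)]

sideways = [["Name", "A. Smith"], ["Client", "Acme"]]
print(transpose(sideways))  # [['Name', 'Client'], ['A. Smith', 'Acme']]
```

The harder limitations, non-tabular markup and query forms, have no such one-line fix.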

Programmers often write custom scraper scripts to collect a specific dataset, but an open source tool like Haystax should generalize across the many situations a scraper can encounter, from unusual data formats to authentication hurdles.

Reporter’s Lab tests each scraper it reviews on the same four datasets, which include a public directory of South Dakota lobbyists and a database of British Columbia teachers. Newsrooms provided the datasets after encountering them in the course of their reporting, Dukes said.

Haystax does well with its intended use case, the South Dakota lobbyists, but can’t handle query-based scraping. Some of the commercial scrapers Reporter’s Lab reviewed, like Helium Scraper or Outwit Hub, succeed in more tests but also cost between $50 and $150.

Because Haystax is open source (visit the GitHub repo), part of my research was to see if we could improve it and overcome some of these limitations. However, it seemed that its most appealing features, simplicity and close integration with the browser, might also be the source of its limitations. It's challenging to write a complex application within the confines of Firefox's API. It's also quite a feat to design an interface that can support the many special cases without overwhelming non-technical journalists.

“I think the goal really here is can we get a suite of tools that can take care of a lot of the cases,” Dukes said. “There's no silver bullet, but if you can give people simple tools for tackling simple problems, it frees time for hard problems.”

It’s a mentality that ScraperWiki seems to be adopting with its new beta version, which offers tools for different skill levels. The original ScraperWiki lets users write and publish scrapers and store data on a profile page, but it gives advanced programmers and beginners the same options. The new version gives developers more choices for their programming environments but also lets users scrape tweets without writing code.

“New ScraperWiki is built for two markets,” says the beta release page. “The coders get the advantage of simple tools to automate the boring bits, and non-coders get to look under the hood and learn from (and maybe even improve) the tools they use every day.”

Scraping is a valuable tool for reporters of all kinds, making open projects like Haystax and ScraperWiki important to the journalism community.

About the author

Dan Keemahill

Undergraduate Fellow
