Web scrapers for journalists: Haystax and other graphical interface systems

I’ve spent my last weeks as a Knight Lab student fellow exploring web scrapers for non-programmers through an open source browser plugin called Haystax.

As a journalism student who picked up computer science, I love scraping: you create a program that acts like a reporter, tracking down the information you want from web pages you specify. It’s a useful technique that saves journalists time copying and pasting data from an organization’s website, and learning to scrape builds a working knowledge of HTML and a scripting language like Python, Ruby or PHP.

But for a reporter on deadline, a newsroom might not be the most comfortable place to learn Python the hard way. What if journalists could scrape data from the web without writing code?

Reporter’s Lab at Duke University has written consumer reviews of commercial and open source projects that let users scrape web pages with graphical interfaces instead of writing code. The lab gave extra attention to proprietary and open source scraping tools for journalists when Google decommissioned Needlebase, according to Tyler Dukes, a reporter at WRAL in Raleigh, North Carolina, who served as managing editor for Reporter’s Lab.

Dukes attended a hackathon in San Francisco last summer and rallied a team of journalists, civic activists and developers around the idea of a visual scraper, a browser plugin that scrapes data when a user highlights text. Haystax was born in a weekend as a fork of Web X-Ray Goggles, a Mozilla tool that lets users select and modify HTML by hovering over elements on a web page.

Developer Randall Leeds added commands to select an HTML table and scrape its rows and columns into a downloadable CSV by pressing the ‘S’ key. If a table spans multiple pages, like an employee database with hundreds of pages of entries, Haystax can scrape the HTML tables from each page into one spreadsheet.
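
The core idea — walk a page’s table markup, collect the rows, and emit CSV — can be sketched in a few lines. This is not Haystax’s actual code (Haystax is a Firefox plugin written in JavaScript); it’s a minimal illustration in Python using only the standard library:

```python
import csv
import io
from html.parser import HTMLParser

class TableScraper(HTMLParser):
    """Collects the rows of the first <table> on a page."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row currently being built
        self._cell = None   # cell currently being built

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data.strip())

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._row is not None:
            self._row.append(" ".join(filter(None, self._cell)))
            self._cell = None
        elif tag == "tr" and self._row:
            self.rows.append(self._row)
            self._row = None

def table_to_csv(html):
    """Scrape the table rows from an HTML snippet into a CSV string."""
    scraper = TableScraper()
    scraper.feed(html)
    out = io.StringIO()
    csv.writer(out).writerows(scraper.rows)
    return out.getvalue()

page = """<table>
  <tr><th>Name</th><th>Title</th></tr>
  <tr><td>Jane Doe</td><td>Reporter</td></tr>
</table>"""
print(table_to_csv(page))  # prints "Name,Title" then "Jane Doe,Reporter"
```

Scraping a multi-page database is then a loop: fetch each page in the sequence, feed it through the same parser, and append the rows to one growing spreadsheet.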

Haystax scrapes using XPath, a language that describes where elements are on a webpage. When users identify the HTML table they want to scrape with Haystax, the program stores the location of the table on the page as an XPath and can use that location to search for tables when it visits other pages as it scrapes.
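
To make the XPath idea concrete, here’s an illustrative sketch (again, not Haystax’s implementation) using Python’s `xml.etree.ElementTree`, which supports a limited XPath subset. It builds a positional path to the table a user selected on one page, then reuses that stored path to find the matching table on the next page:

```python
import xml.etree.ElementTree as ET

def path_to(root, target):
    """Build a positional XPath-style location for `target` by
    walking down from `root`, e.g. './body/div[2]/table'."""
    def walk(node, steps):
        for child in node:
            # XPath positions are 1-based, counted among same-tag siblings
            same = [c for c in node if c.tag == child.tag]
            step = child.tag if len(same) == 1 else f"{child.tag}[{same.index(child) + 1}]"
            if child is target:
                return steps + [step]
            found = walk(child, steps + [step])
            if found:
                return found
        return None
    return "./" + "/".join(walk(root, []))

page1 = ET.fromstring(
    "<html><body><div>intro</div><div><table><tr><td>page 1 data</td></tr></table></div></body></html>"
)
page2 = ET.fromstring(
    "<html><body><div>intro</div><div><table><tr><td>page 2 data</td></tr></table></div></body></html>"
)

# Locate the table the user highlighted on page 1...
table = page1.find(".//table")
xpath = path_to(page1, table)
print(xpath)  # ./body/div[2]/table

# ...then reuse the stored path to find the same table on page 2.
print(page2.find(xpath)[0][0].text)  # page 2 data
```

Because the two pages share a layout, the path recorded on the first page lands on the right table on every subsequent page — which is exactly why this approach works well for paginated databases and breaks down when a site’s structure varies.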

However, Haystax and its table extraction have their limits. Its HTML-table-to-CSV conversion assumes column descriptions appear as headers in the top row of a table, but some sites display data descriptions down the side. More importantly, data is not always presented in HTML tables. And although Haystax can scrape sequential pages, web databases often require users to enter search terms or specify query parameters to access data.

Programmers often write custom scraper scripts to collect a specific dataset, but an open source tool like Haystax should be scalable and useful in different situations. There are lots of situations a scraper could encounter, from different data types to authentication hurdles.

Each scraper reviewed by Reporter’s Lab is tested on the same four datasets, which include a public directory of South Dakota lobbyists and a database of British Columbia teachers. The examples were provided to Reporter’s Lab by newsrooms that encountered them in the course of their reporting, Dukes said.

Haystax does well with its intended use case, the South Dakota lobbyists, but can’t handle query-based scraping. Some of the commercial scrapers Reporter’s Lab reviewed, like Helium Scraper or Outwit Hub, succeed in more tests but also cost between $50 and $150.

Because Haystax is open source (visit the GitHub repo), part of my research was to see if we could improve it and overcome some of these limitations. However, it seemed that its most appealing features, simplicity and close integration with the browser, might also be the source of its limitations. It’s challenging to write a complex application within the confines of Firefox’s API. It’s also quite a feat to design an interface that can support the many special cases without overwhelming non-technical journalists.

“I think the goal really here is can we get a suite of tools that can take care of a lot of the cases,” Dukes said. “There's no silver bullet, but if you can give people simple tools for tackling simple problems, it frees time for hard problems.”

It’s a mentality that ScraperWiki seems to be adopting with its new beta version, which offers tools for different skill levels. The original ScraperWiki let users write and publish scrapers and store data on a profile page, but gave advanced programmers and beginners the same options. The new version gives developers more options for their programming environments but also lets users scrape tweets without writing code.

“New ScraperWiki is built for two markets,” says the beta release page. “The coders get the advantage of simple tools to automate the boring bits, and non-coders get to look under the hood and learn from (and maybe even improve) the tools they use every day.”

Scraping is a valuable tool for reporters of all kinds, making open projects like Haystax and ScraperWiki important to the journalism community.
