Web scrapers for journalists: Haystax and other graphical interface systems

I’ve spent my last few weeks as a Knight Lab student fellow exploring web scrapers for non-programmers through an open source browser plugin called Haystax.

As a journalism student who picked up computer science, I love scraping because it lets you build a program that acts like a reporter, tracking down the information you want from the web pages you specify. It’s a useful technique that saves journalists the time of copying and pasting data from an organization’s website, and learning to scrape builds a working knowledge of HTML and a scripting language like Python, Ruby or PHP.

But for a reporter on deadline, a newsroom might not be the most comfortable place to learn Python the hard way. What if journalists could scrape data from the web without writing code?

Reporter’s Lab at Duke University has written consumer reviews of commercial and open source projects that let users scrape web pages with graphical interfaces instead of writing code. The folks at Reporter’s Lab gave extra attention to proprietary and open source scraping tools for journalists when Google decommissioned Needlebase, according to Tyler Dukes, a reporter at WRAL in Raleigh, North Carolina, who served as managing editor for Reporter’s Lab.

Dukes attended a hackathon in San Francisco last summer and rallied a team of journalists, civic activists and developers around the idea of a visual scraper: a browser plugin that scrapes data when a user highlights text. Haystax was born in a weekend as a fork of Web X-Ray Goggles, a Mozilla tool that lets users select and modify HTML by hovering over elements on a web page.

Developer Randall Leeds added commands to select an HTML table and scrape its rows and columns into a downloadable CSV by pressing the ‘S’ key. If a table spans multiple pages, like an employee database with hundreds of pages of entries, Haystax can scrape the HTML tables from each page into one spreadsheet.
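Under the hood, that kind of table extraction amounts to walking a table’s rows and cells and writing them out, page by page. Here’s a rough Python sketch of the idea (an illustration, not Haystax’s actual code); it assumes the third-party requests and BeautifulSoup packages, and the URL and ?page= parameter are hypothetical stand-ins for a paginated employee database.

```python
# Rough sketch of HTML-table-to-CSV scraping, NOT Haystax's actual code.
# Assumes the third-party "requests" and "beautifulsoup4" packages;
# the URL and ?page= parameter are hypothetical.
import csv

import requests
from bs4 import BeautifulSoup

def table_to_rows(url):
    """Return the first HTML table on a page as a list of rows."""
    page_html = requests.get(url).text
    table = BeautifulSoup(page_html, "html.parser").find("table")
    return [
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")
    ]

# Stitch sequential pages into one spreadsheet, as Haystax does.
with open("employees.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for page in range(1, 4):
        rows = table_to_rows(f"https://example.com/employees?page={page}")
        if page > 1:
            rows = rows[1:]  # drop the header row repeated on later pages
        writer.writerows(rows)
```

Haystax hides this pattern behind a single keystroke; the sketch just shows how little machinery the simple case needs.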

Haystax scrapes using XPath, a language that describes where elements sit within a web page. When a user identifies the HTML table they want to scrape, Haystax stores the table’s location on the page as an XPath expression and reuses that expression to find matching tables on the other pages it visits as it scrapes.
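To make that concrete, here’s a minimal sketch of the XPath approach, again an illustration rather than Haystax’s implementation: record a table’s path once, then apply the same path on every page. It assumes the third-party lxml and requests packages; the XPath expression and URLs are hypothetical.

```python
# Minimal illustration of locating a table by XPath, not Haystax's code.
# Uses the third-party "lxml" and "requests" packages; the XPath
# expression and URL are hypothetical.
import requests
from lxml import html

# One XPath expression pinpoints the table among everything else on a page.
TABLE_XPATH = "//div[@id='results']/table[1]"

def scrape_table(url):
    tree = html.fromstring(requests.get(url).text)
    table = tree.xpath(TABLE_XPATH)[0]  # the same location works on each page
    return [
        [cell.text_content().strip() for cell in row.xpath("./th|./td")]
        for row in table.xpath(".//tr")
    ]

rows = scrape_table("https://example.com/lobbyists?page=1")
```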

However, Haystax and its table extraction have their limits. Its HTML-to-CSV conversion assumes column headers appear in the top row of a table, but some sites display them along the side. More importantly, data is not always presented in HTML tables at all. And although Haystax can scrape sequential pages, web databases often require users to enter search terms or specify query parameters before giving up their data.
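A hand-written scraper can clear that hurdle by submitting the search itself before parsing the results, which is exactly the step a point-and-click tool struggles to automate. A small sketch, with a hypothetical URL and parameter names:

```python
# Submitting a query to a search-driven database -- the step Haystax's
# point-and-click model can't do. URL and parameter names are hypothetical.
import requests

response = requests.get(
    "https://example.gov/lobbyists/search",
    params={"year": "2013", "last_name": "Smith"},  # -> ?year=2013&last_name=Smith
)
print(response.url)   # the fully assembled query URL
# response.text now holds the results page, ready for table extraction as above.
```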

Programmers often write custom scraper scripts to collect a specific dataset, but an open source tool like Haystax should scale to many different situations. A scraper can run into all kinds of obstacles, from unusual data formats to authentication hurdles.

Each scraper reviewed by Reporter’s Lab is tested on the same four datasets, which include a public directory of South Dakota lobbyists and a database of British Columbia teachers. The datasets were provided by newsrooms that encountered them in the course of their reporting, Dukes said.

Haystax does well with its intended use case, the South Dakota lobbyists, but can’t handle query-based scraping. Some of the commercial scrapers Reporter’s Lab reviewed, like Helium Scraper or Outwit Hub, succeed in more tests but also cost between $50 and $150.

Because Haystax is open source (visit the GitHub repo), part of my research was to see if we could improve it and overcome some of these limitations. But it seemed that its most appealing features, simplicity and close integration with the browser, might also be the source of its limitations. It’s challenging to write a complex application within the confines of Firefox’s API, and it’s quite a feat to design an interface that can support the many special cases without overwhelming non-technical journalists.

“I think the goal really here is can we get a suite of tools that can take care of a lot of the cases,” Dukes said. “There's no silver bullet, but if you can give people simple tools for tackling simple problems, it frees time for hard problems.”

It’s a mentality that ScraperWiki seems to be adopting with its new beta version, which offers tools for different skill levels. The original ScraperWiki let users write and publish scrapers and store data on a profile page, but it gave advanced programmers and beginners the same options. The new version gives developers more choice of programming environments while also letting users scrape tweets without writing code.

“New ScraperWiki is built for two markets,” says the beta release page. “The coders get the advantage of simple tools to automate the boring bits, and non-coders get to look under the hood and learn from (and maybe even improve) the tools they use every day.”

Scraping is a valuable tool for reporters of all kinds, making open projects like Haystax and ScraperWiki important to the journalism community.
