Five data scraping tools for would-be data journalists

This past Fall, I spent time with the NPR News Apps team (now known as NPR Visuals) coding up some projects, working mainly as a visual/interaction designer. But in the last few months, I’ve been working on a project that involves scraping newspaper articles and Twitter APIs for data.

I was a relative beginner with Python — I’d pair coded a bit with others and made some basic programs, but nothing too complicated. I knew I needed to develop a more in-depth knowledge of web scraping and data parsing skills and of course took to the web to help. Along the way, I found a few tools that were exceptionally useful for expanding my knowledge. So if you too are just starting out with scraping, here are five of the most useful tools I’ve encountered while learning and working on my project.

1. Scraper

A free Chrome browser extension, Scraper is a handy, easy-to-use tool when you need to extract plain text from websites. After downloading and installing it, just highlight the part of the webpage you’d like to scrape, right click, and then choose “Scrape similar…”. And just like that, a window will pop-up with information that’s similar to what you just highlighted, already rendered and ready to be exported to Google Docs.

Scraper in action.

Scraper is best for plain-text extraction, so don’t expect to be able to scrape images or more complicated objects. It also doesn’t perform great on a huge volume of text, but it’s very easy and fast to use, especially for a beginner. Heads up: It uses XPath to determine which parts of the page’s structure to scrape, though the developers have also included jQuery selectors recently. You can get by without that knowledge, but it’s more powerful if you have a decent grasp of the code.

2. Outwit Hub

Outwit Hub is another browser extension you can get for free, though this time for Firefox. It’s a pretty robust product among the free ones that exist, especially because it can work for both advanced and beginner users. More advanced users can specify exactly what they need to extract, but beginner users can simply choose to download all the PDFs, images, or documents, etc. listed on a given page.

Outwit Hub returns scraped data visually, which makes it easy for beginners to see what's been collected.

It also returns the scraped data in a visual presentation, so complete non-coders will have an easy time understanding what’s being returned. Extracted data can be exported into a variety of formats, and images/documents can be saved directly to your hard drive. They also have some tutorials online for those who need additional help.

3. Scraperwiki (Classic version here)

Scraperwiki has updated its platform recently. While they still allow experienced coders to run their own code in-browser, they’ve moved more into custom- and pre-made tools for beginner coders (e.g. Pre-made tools to scrape Twitter).

These tools can be pretty useful if you don’t have any coding knowledge, but what happens if you’re in that in-between stage? You know, the one where you’ve coded before, but you’d still like a guide before coding a scraper on your own. Well, that’s where classic Scraperwiki comes in.

Classic Scraperwiki allows you to browse scrapers others have written, which can save you time in writing scrapers that target the same data or learn how others put together code.

Classic Scraperwiki allows you to browse scrapers that others have written, which means you can also grab scrapers that target the same data you want, saving you some time. But that archive is most useful if you need a solid way to see how others put together code. It’s a great resource for learning how to do your own scrapes and one day writing your own code from scratch.

4. BeautifulSoup

A parsing library for Python, BeautifulSoup delves more into code than the previous options, but still does so with clear, easy-to-understand methods for navigating, searching, dissecting, and finally extracting the data that you need. It doesn’t take too much code to grab some data, and the installation’s pretty swift too. You’ll need to use the command line. But if you’ve never used your terminal before, don’t get intimidated! Here’s how you can do it. I’m going to assume you’ve already installed Python. First, we’re going to install pip, which is almost like an app store for Python code. With that in mind, open up your terminal and type in:

$ sudo easy_install pip

Press enter. Voila! Pip should be installed. Then,  install BeautifulSoup with pip:

$ pip install beautifulsoup4

Ultimately, it does a pretty good job of fetching contents from a given URL and allows you to parse data without much hassle. You’ll want to come into this with some knowledge of Python, but if you’re looking to move on from ready-made tools and to create code that grabs exactly what you need, BeautifulSoup is a good place to start.

5. Scrapy

Similar to BeautifulSoup, Scrapy also delves further into writing your own code. However, Scrapy is more robust and acts as a full-scale web-spider or web scraper framework. On the other hand, BeautifulSoup can be limited by your designated URL unless you set up an infinite loop manually.

Scrapy is a Python package that can be installed via pip just like BeautifulSoup:

$ pip install Scrapy

In my opinion, Scrapy has a steeper learning curve than BeautifulSoup, but it does have more features. For example, since it’s a full framework, it has full unicode, redirection, handling, etc. It also has incredibly thorough documentation, so if you’re willing to push your code a bit further, you can do a lot with Scrapy.

Now that you have these tools, start digging around the Internet to find something you’re interested in. Whether you just want a simple way to compile info into a Google doc or to start manipulating data yourself, there’s always a tool to help you. Have fun, data and hacker journalists!

Latest Posts

  • A Google Spreadsheets change affecting TimelineJS users

    Google recently changed something about their Sheets service which is causing many people to run into an error when they are making a new timeline. Note: there should be no impact on existing timelines! After this change, many of you click on the "preview" and get this message: An unexpected error occurred trying to read your spreadsheet data [SyntaxError] Timeline configuration has no events. There is a straightforward work-around, but it requires those of you who have...

    Continue Reading

  • How Americans think and feel about gun violence

    A man killed his wife, then himself. I want you to see his face and learn that he enjoyed fishing with his grandchildren. A small-time drug dealer is shot by two men in a parking lot. I find his Facebook profile and a photo shows him striking a playfully irreverent pose, giving the camera the middle finger. The photo’s comments take a mournful turn after a certain date. “Rest easy bro ???” Gun Memorial runs...

    Continue Reading

  • Software developers interested in journalism: Northwestern and The Washington Post want you!

    Northwestern University and The Washington Post are offering a unique opportunity for two talented software developers interested in applying their programming skills in media and journalism. Here’s the proposition: (1) a full-tuition scholarship to earn a master’s degree in journalism at Northwestern University, followed by (2) a six-month paid internship with The Post’s world-class engineering team, with the possibility of subsequent full-time employment. These opportunities are made possible by the John S. and James L....

    Continue Reading

  • What happened when Gun Memorial let anyone contribute directly to victim profiles

    If you’re reporting local or niche news, there’s a good chance that your audience collectively knows more about the story than you do. That’s especially true for us at Gun Memorial, a small publication with a nationwide mission of covering every American who is shot dead. In our latest, mostly successful, experiment, we let readers add to our stories without editor intervention. This article shares some lessons from that experience. Asking for reader contributions A...

    Continue Reading

  • How conversational interfaces make the internet more accessible for everyone

    This story is part of a series on bringing the journalism we produce to as many people as possible, regardless of language, access to technology, or physical capability. Find the series introduction, as well as a list of published stories here. In 2004, human-computer interaction professor Alan Dix published the third edition of Human-Computer Interaction along with his colleagues, Janet Finley, Gregory Abowd, and Russell Beale. In a chapter called “The Interaction,” the authors wrote...

    Continue Reading

  • Three tools to help you make colorblind-friendly graphics

    This story is part of a series on bringing the journalism we produce to as many people as possible, regardless of language, access to technology, or physical capability. Find the series introduction, as well as a list of published stories here. I am one of the 8% of men of Northern European descent who suffers from red-green colorblindness. Specifically, I have a mild case of protanopia (also called protanomaly), which means that my eyes lack...

    Continue Reading

Storytelling Tools

We build easy-to-use tools that can help you tell better stories.

View More