A beginner's guide to collecting Twitter data (and a bit of web scraping)

As a student fellow at the Knight Lab, I get the opportunity to work on a variety of different projects. Recently, I’ve been working with Larry Birnbaum, a Knight Lab co-founder, and Shawn O’Banion, a computer science Ph.D. student, to build an application that takes a user’s Twitter handle, analyzes their activity and returns a list of celebrities that they tweet most like.

It’s not an earth-shattering project, but it is a fun way for Twitter users to see who they tweet like and perhaps discover a few interesting things about themselves in the process. It also gave me a great excuse to experiment with the tools available in the open source community for web scraping and mining Twitter data, which you can read about below.

The tools listed here are primarily for Python, but equivalent versions of these libraries exist in other languages — just search around!

Who’s a celebrity, exactly?

The first step in building this project was to gather a list of celebrities to compare users against. To do this, I searched the web for sites that had celebrity information. IMDb was the perfect solution, as it had an extensive list of celebrities (actors, movie directors, singers, sports figures, etc.) and provided the information in a structured format that was straightforward to collect using a web scraping tool.

Tools used:

  • Beautiful Soup — A useful Python library for scraping web pages that has extensive documentation and community support. Choosing elements to save from a page is as simple as writing a CSS selector (see the sketch just below).
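
A minimal sketch of the scraping step is below. The URL and CSS selector are hypothetical stand-ins; inspect the actual list pages you want to scrape and adapt both.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical list page; substitute the page you actually want to scrape.
URL = "https://www.imdb.com/list/example-celebrity-list/"

response = requests.get(URL, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(response.text, "html.parser")

# The selector is an assumption about the page's markup; use your
# browser's inspector to find the element that holds each name.
names = [a.get_text(strip=True) for a in soup.select(".lister-item-header a")]
print(names)
```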

Collecting tweets

After gathering a list of celebrities, I needed to find them on Twitter and save their handles. Twitter’s API provides a straightforward way to query for users and returns results in JSON, which makes them easy to parse in a Python script. One wrinkle when dealing with celebrities is that fake accounts use similar or identical names and can be difficult to detect. Luckily, Twitter includes a handy data field in each user object that indicates whether the account is verified, which I checked before saving the handle.
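
Here’s a minimal sketch of that lookup using tweepy, one of several Python Twitter wrappers (the “Twitter API” library listed below may differ in details); the credentials are placeholders.

```python
import tweepy

# Placeholder credentials; create a Twitter App to get real ones.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth)

def find_verified_handle(name):
    """Return the handle of the first verified account matching a name."""
    for user in api.search_users(name):
        if user.verified:  # skip fan and impostor accounts
            return user.screen_name
    return None
```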

Once the celebrity name was associated with a Twitter handle, the next step was to again use Twitter’s API to download the user’s tweets and save them into a database.
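
Sketched with tweepy and PyMongo below, reusing the api object from the previous snippet; the database and collection names are arbitrary choices for the example.

```python
import tweepy
from pymongo import MongoClient

client = MongoClient()  # assumes a MongoDB instance running locally
tweets = client.celebrity_db.tweets  # hypothetical database/collection names

def save_timeline(api, handle, limit=1000):
    """Download a user's recent tweets and store each one as a document."""
    cursor = tweepy.Cursor(api.user_timeline, screen_name=handle, count=200)
    for tweet in cursor.items(limit):
        tweets.insert_one({
            "handle": handle,
            "tweet_id": tweet.id,
            "text": tweet.text,
            "created_at": tweet.created_at,
        })
```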

When gathering data you will often encounter the “rate limit exceeded” error message. This is because Twitter imposes a limit on the number of API calls a single app can make in a set “window” of time (currently 15 minutes). To get around this problem, you can either create multiple Twitter Apps and request additional OAuth credentials, or set up a cronjob task to run every 15 minutes. Doing so allows your scripts to run at scheduled times or intervals in the background, leaving you free to perform other tasks.

A few tips for writing cronjob tasks that I found extremely helpful when collecting data:

  • Construct your scripts in a way that cycles through your API keys to stay within the rate limit (see the sketch after this list).
  • Be sure to catch exceptions that may occur when accessing Twitter’s API and write them to an error file for later review. This lets your scripts run unattended without a single error crashing the entire program.
  • Run your scripts on a remote computer (unless you want to keep your computer on the entire time the scripts are running!).
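
To make the first two tips concrete, here is a rough sketch that rotates through several sets of OAuth credentials and logs failures instead of crashing. The credential list is a placeholder, save_timeline comes from the earlier snippet, and tweepy.TweepError is that library’s generic API exception in the 3.x releases (newer versions renamed it TweepyException).

```python
import itertools
import tweepy

# Placeholder credentials for several registered Twitter Apps.
CREDENTIALS = [
    ("KEY_1", "SECRET_1", "TOKEN_1", "TOKEN_SECRET_1"),
    ("KEY_2", "SECRET_2", "TOKEN_2", "TOKEN_SECRET_2"),
]

def make_api(key, secret, token, token_secret):
    auth = tweepy.OAuthHandler(key, secret)
    auth.set_access_token(token, token_secret)
    return tweepy.API(auth)

# Cycle through the apps so each call draws on a different rate limit.
apis = itertools.cycle([make_api(*c) for c in CREDENTIALS])

def collect(handles):
    for handle in handles:
        try:
            save_timeline(next(apis), handle)
        except tweepy.TweepError as e:
            # Log the error and move on so one bad handle
            # doesn't bring down the whole run.
            with open("errors.log", "a") as f:
                f.write("%s: %s\n" % (handle, e))
```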


Tools used:

  • Twitter API — A Python wrapper for performing API requests such as searching for users and downloading tweets. This library handles all of the OAuth and API queries for you and exposes the results through a simple Python interface. Be sure to create a Twitter App and get your OAuth keys — you will need them to access Twitter’s API.
  • MongoDB — An open source document database and the go-to “NoSQL” option. It makes working with a database feel like working with JavaScript.
  • PyMongo — A Python wrapper for interfacing with a MongoDB instance. This library lets you connect your Python scripts with your database and read/insert records.
  • Cronjobs — A time-based job scheduler that lets you run scripts at designated times or intervals (e.g., always at 12:01 a.m. or every 15 minutes); an example entry follows this list.
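
For the cron piece, an entry like the one below (added with crontab -e) runs a collection script every 15 minutes; the paths are hypothetical.

```
# min hour day-of-month month day-of-week  command
*/15 * * * * /usr/bin/python /home/me/collect_tweets.py >> /home/me/collector.log 2>&1
```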


Once the tweets have been successfully stored in your database, you can manipulate the data to fit the needs of your project. For my project, I removed common words and created an index on the text of the collected tweets to perform the similarity comparisons.
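
Here is a rough sketch of one way to do that preparation with PyMongo; the stopword list is a tiny placeholder (a real one would be much longer), and the field names match the earlier snippet.

```python
import pymongo
from pymongo import MongoClient

# A tiny placeholder stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "and", "to", "of", "in", "is", "it", "rt"}

client = MongoClient()
tweets = client.celebrity_db.tweets  # same hypothetical collection as above

def clean(text):
    """Lowercase a tweet and strip common words before indexing."""
    return " ".join(w for w in text.lower().split() if w not in STOPWORDS)

# Store a cleaned copy of each tweet alongside the original.
for doc in tweets.find():
    tweets.update_one({"_id": doc["_id"]},
                      {"$set": {"clean_text": clean(doc["text"])}})

# A MongoDB text index on the cleaned field backs the similarity queries.
tweets.create_index([("clean_text", pymongo.TEXT)])
```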

Accessing the Firehose

If you’re ready to go beyond the data limits that Twitter imposes on free access, you can upgrade to Twitter’s Firehose API, which provides nearly unlimited access to Twitter’s data stream via one of the various data providers that Twitter partners with, including Dataminr (CNN recently partnered with Dataminr to build an application that alerts journalists in newsrooms to breaking news and emerging trends), Datasift, Gnip, Lithium, and Topsy.

What now?

While the number of projects you could build using Twitter data is close to infinite, there are a few cool and fun civic-minded projects already out there. NoHomophobes.com gives you a glimpse of how prevalent homophobic speech is on Twitter. Closer to home, Knight Lab has developed a number of different projects using the tools above: twXplorer, BookRx, and NeighborhoodBuzz, to name a few. While the scope of these projects ranges from text aggregation to recommendation engines to sentiment analysis, they all leverage various open source tools to access Twitter data and build applications on top of it.

About the author

Allen Zeng

Undergraduate Fellow
