Beyond spreadsheets for CAR reporters: Algorithms

The lightning talks at NICAR are often the highlight of the computer-assisted reporting conference, but Chase Davis (who recently did a Q&A with us) really grabbed my attention with his “Five Algorithms in Five Minutes” talk, complete with a mic drop. So much so that three months later I'm still thinking about it and all of the ways that I might put these algorithms to use.

NICAR coincided with my internship at The Sacramento Bee, my hometown paper, where I was spending lots of time (and eating lots of Chipotle) with computer-assisted reporter Phillip Reese. Reese has become the go-to data expert in the Bee newsroom by helping reporters find numbers to back up their stories. He knows how to not screw up data, and he bulletproofs his spreadsheets by keeping track of his records, making backups and asking the experts. I admire his ability to find newsworthy trends and outliers with averages, medians, percent change and sorted spreadsheets so much that I share his articles on Twitter with #PhillipReeseFanClub.

So when Davis showed NICAR how algorithms can help reporters dig through data, I thought of our two-person data team in Sacramento. My machine learning research at the Knight Lab and study of algorithms in computer science classes have further shown me how we could have applied these data techniques at the Bee.

Reese and I worked on several projects involving data for all 58 counties in California, so when the state finance department dropped a report predicting how each county’s population will change by 2050, we dove in to analyze (and map) its findings for the next day’s paper. We looked at percent change and tried to find interesting outliers, but that meant keeping track of dozens of demographic variables.

Principal Component Analysis (PCA) could have really helped us out that day. It compresses the correlated variables in a dataset into a few components so that the interesting variation stands out. In that story we found that the state predicted almost every county to grow due to growth in the Hispanic population; PCA could have helped single out the counties that bucked the trend.
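To make that concrete, here is a minimal sketch of PCA using numpy's singular value decomposition. It is not Davis's script, and the county labels and percentages are invented purely for illustration. It standardizes each column, projects the rows onto the leading components, and flags the row with the largest score on the first component as a candidate worth a closer look.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Project the rows of X onto the top principal components."""
    X = np.asarray(X, dtype=float)
    # Standardize each column so no variable dominates because of its units.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # The rows of Vt are the principal directions, strongest first.
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = Z @ Vt[:n_components].T
    explained = (S ** 2) / np.sum(S ** 2)
    return scores, explained[:n_components]

# Hypothetical counties and made-up projected percent changes in
# total population, Hispanic population and residents 65 and older.
counties = ["A", "B", "C", "D", "E"]
X = [
    [20.0, 35.0, 60.0],
    [25.0, 40.0, 65.0],
    [18.0, 30.0, 58.0],
    [22.0, 37.0, 62.0],
    [-5.0, 10.0, 90.0],
]

scores, explained = pca_scores(X)
# The row farthest from the pack along the first component.
outlier = counties[int(np.argmax(np.abs(scores[:, 0])))]
print(outlier, explained)
```

A large score is only a prompt for reporting, not a finding: it says a county's mix of numbers differs from the rest, and the reporter still has to work out why.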

Reese localized the gun violence discussion in a Sunday A1 story by profiling Lemon Hill, the Sacramento neighborhood with the most reported assaults with a firearm and shootings into a building. Among the ornery comments on the Sacramento Bee website, some noted that police report data could be influenced by variables like population size, because it’s logical that more shootings occur where more people live. Lemon Hill leads all neighborhoods in raw gun-crime counts; Multidimensional Scaling, a technique similar to PCA, could help check whether that ranking holds once factors like population are taken into account.
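One way this could look, as a rough sketch with invented neighborhoods and counts: convert raw counts to rates per 10,000 residents (that conversion, not the scaling itself, is the step that accounts for population), then use classical Multidimensional Scaling to lay the neighborhoods out in two dimensions so that similar ones sit close together. The `classical_mds` helper below is a hypothetical name and is written with numpy rather than taken from any particular library.

```python
import numpy as np

def classical_mds(X, n_components=2):
    """Classical (Torgerson) MDS on standardized columns of X."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Squared Euclidean distances between every pair of rows.
    diff = Z[:, None, :] - Z[None, :, :]
    D2 = (diff ** 2).sum(axis=2)
    # Double-center the squared distances to get an inner-product matrix.
    n = len(Z)
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ D2 @ J
    # The largest eigenvalues and their eigenvectors give the coordinates.
    vals, vecs = np.linalg.eigh(B)
    top = np.argsort(vals)[::-1][:n_components]
    return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0))

# Hypothetical neighborhoods with made-up populations and report counts.
names = ["N1", "N2", "N3", "N4", "N5"]
population = np.array([12000, 30000, 8000, 20000, 15000])
gun_assaults = np.array([24, 45, 6, 20, 9])
building_shootings = np.array([12, 15, 8, 10, 3])

# Rates per 10,000 residents put large and small neighborhoods on one scale.
rates = np.column_stack([gun_assaults, building_shootings]) / population[:, None] * 10000
coords = classical_mds(rates)
for name, (x, y) in zip(names, coords):
    print(name, round(float(x), 2), round(float(y), 2))
```

Plotting `coords` as a scatter chart would show which neighborhoods have similar per-capita profiles and which sit far from everyone else.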

When rumors began at the start of my internship that Sacramento’s professional basketball team would move to Seattle, Reese found population data to compare Sacramento to other NBA cities. An implementation of the nearest neighbors algorithm could create a similar comparison, but using more variables — like income or geographic size — to find cities that are comparable in more ways than population size.
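A nearest neighbors comparison along those lines might look like the sketch below. The city names and figures are placeholders, not real data, and `nearest_neighbors` is a hypothetical helper: it standardizes each column so that population does not swamp income or area, then ranks the other cities by straight-line distance from the target.

```python
import numpy as np

def nearest_neighbors(names, X, target, k=3):
    """Return the k rows closest to `target` after standardizing columns."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    i = names.index(target)
    # Euclidean distance from the target row to every row.
    dist = np.sqrt(((Z - Z[i]) ** 2).sum(axis=1))
    order = np.argsort(dist)
    return [(names[j], float(dist[j])) for j in order if j != i][:k]

# Placeholder cities with made-up values for population (thousands),
# median household income (thousands of dollars) and land area (sq mi).
cities = ["Hometown", "City A", "City B", "City C", "City D"]
features = [
    [470, 50, 100],
    [480, 52, 98],
    [2700, 55, 230],
    [650, 95, 47],
    [1300, 45, 520],
]

closest = nearest_neighbors(cities, features, "Hometown", k=2)
for name, distance in closest:
    print(name, round(distance, 2))
```

Adding a variable is just adding a column, which is the appeal over comparing one sorted spreadsheet column at a time.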

I spent hours of my internship building a hexagon map of homicides and shootings in Sacramento. Hexagonal binning is a popular mapping technique because it groups nearby points into clusters, isn’t too difficult to render in browsers, and just plain looks cool. But I also could have used the DBSCAN algorithm to show concentrations of shootings. Davis’s Python script takes latitude and longitude pairs and creates clusters based on a provided distance.
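The sketch below shows the general shape of that approach in plain Python. It is an illustration of DBSCAN, not Davis's script, and the coordinates are invented. The assumptions are mine: distance is measured in meters with the haversine formula, `eps_m` is the neighborhood radius, and `min_pts` counts the point itself.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(a, b):
    """Great-circle distance in meters between two (lat, lon) pairs."""
    lat1, lon1, lat2, lon2 = map(radians, (a[0], a[1], b[0], b[1]))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371000 * asin(sqrt(h))

def dbscan(points, eps_m, min_pts):
    """Label each point with a cluster number, or -1 for noise."""
    labels = [None] * len(points)  # None means not yet visited

    def region(i):
        # Every point within eps_m of point i, including i itself.
        return [j for j in range(len(points))
                if haversine_m(points[i], points[j]) <= eps_m]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region(i)
        if len(neighbors) < min_pts:
            labels[i] = -1  # noise for now; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # core point: keep growing
    return labels

# Made-up incident locations: two tight groups and one isolated point.
incidents = [
    (38.5000, -121.4500), (38.5003, -121.4502), (38.5001, -121.4497),
    (38.6000, -121.5000), (38.6002, -121.5003), (38.5998, -121.4998),
    (38.5500, -121.3000),
]
labels = dbscan(incidents, eps_m=100, min_pts=3)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

This version compares every pair of points, which is fine for a few thousand incidents; for larger datasets a library implementation with a spatial index would be the better choice.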

I also spent hours of my internship examining more than 100,000 PDFs from the state transportation agency for an investigation of construction on the Bay Bridge. PDF scraping technologies like Tabula could have saved me lots of time, but I would have loved to run a Locality Sensitive Hashing script to find similarities in the text.
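For a sense of what such a script does, here is a small sketch of one common form of Locality Sensitive Hashing, MinHash with banding, using only the standard library. The memo texts are invented, the function names are hypothetical, and the settings (three-word shingles, 16 bands of 4 rows) are assumptions chosen for the example rather than recommendations.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

def shingles(text, size=3):
    """Set of overlapping word sequences; text needs at least `size` words."""
    words = text.lower().split()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}

def _hash(seed, shingle):
    # A seeded 64-bit hash, so each seed acts as a different hash function.
    digest = hashlib.blake2b(f"{seed}:{shingle}".encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def minhash_signature(shingle_set, num_hashes=64):
    """Smallest hash value per seed; similar sets share many of these."""
    return [min(_hash(seed, s) for s in shingle_set) for seed in range(num_hashes)]

def candidate_pairs(docs, bands=16, rows=4):
    """Pairs of documents whose signatures agree on at least one whole band."""
    buckets = defaultdict(set)
    for name, text in docs.items():
        sig = minhash_signature(shingles(text), bands * rows)
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(name)
    pairs = set()
    for names in buckets.values():
        pairs.update(combinations(sorted(names), 2))
    return pairs

# Invented documents: memo_a and memo_b differ only in the final word.
docs = {
    "memo_a": "inspectors found cracked welds on the eastern span and recommended "
              "that the contractor repeat the ultrasonic tests before the next concrete pour begins",
    "memo_b": "inspectors found cracked welds on the eastern span and recommended "
              "that the contractor repeat the ultrasonic tests before the next concrete pour starts",
    "memo_c": "the quarterly budget summary lists travel reimbursements office supplies "
              "and software licenses approved by regional managers during the spring review period",
}
pairs = candidate_pairs(docs)
print(pairs)
```

The point of the banding step is that documents are only compared when they land in the same bucket, so near-duplicates surface without checking every one of the billions of possible pairs in a 100,000-document pile. The text would still need to be extracted from the PDFs first.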

Computer-assisted reporters know that performing operations on data requires clean, structured spreadsheets. These algorithms are no exception. Davis’s scripts for machine learning algorithms all use Python libraries like numpy and scikit-learn that apply algorithms to data in CSV files. Installing the necessary libraries and editing the Python code to run these scripts on your own datasets means flexing some programming muscles, but if computer-assisted reporters can find front-page stories in seas of census data, they can wield a command prompt and take the next step in data analysis.

After spreadsheets, algorithms are the logical next step.
