Beyond spreadsheets for CAR reporters: Algorithms

The lightning talks at NICAR are often the highlight of the computer-assisted reporting conference, but Chase Davis (who recently did a Q&A with us) really grabbed my attention with his “Five Algorithms in Five Minutes” talk, complete with a mic drop. So much so that, three months later, I'm still thinking about it and all of the ways I might put these algorithms to use.

NICAR coincided with my internship at The Sacramento Bee, my hometown paper, where I was spending lots of time (and eating lots of Chipotle) with computer-assisted reporter Phillip Reese. Reese has become the go-to data expert in the Bee newsroom by helping reporters find numbers to back up their stories. He knows how to not screw up data and how to bulletproof his spreadsheets by keeping track of his records, making backups and asking the experts. I admire his ability to find newsworthy trends and outliers with averages, medians, percent change and sorted spreadsheets so much that I share his articles on Twitter with #PhillipReeseFanClub.

So when Davis showed NICAR how algorithms can help reporters dig through data, I thought of our two-person data team in Sacramento. My machine learning research at the Knight Lab and study of algorithms in computer science classes have further shown me how we could have applied these data techniques at the Bee.

Reese and I worked on several projects involving data for all 58 counties in California, so when the state finance department dropped a report predicting how each county’s population will change by 2050, we dove in to analyze (and map) its findings for the next day’s paper. We looked at percent change and tried to find interesting outliers, but that meant keeping track of dozens of demographic variables.

Principal Component Analysis (PCA) could have really helped us out that day. It compresses the correlated variables in a dataset into a few components so that the most interesting variation stands out. In that story we found that the state predicted almost every county to grow because of growth in the Hispanic population; PCA could have helped single out the counties that bucked the trend.
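As a rough illustration of the idea, the sketch below finds the first principal component of a tiny table and scores each county along it. The county names and numbers are invented for the example and are not figures from the state report. It uses power iteration in plain Python so it runs without extra installs; in practice a library routine such as scikit-learn's `PCA` would be the more usual choice.

```python
import math

# Made-up projected percent changes for five hypothetical counties:
# [total population, Hispanic population, residents 65 and older]
counties = {
    "County A": [30, 60, 90],
    "County B": [25, 50, 78],
    "County C": [35, 70, 104],
    "County D": [28, 55, 85],
    "County E": [-5, 10, 40],
}

def standardize(rows):
    """Center each column at 0 and scale it to unit variance."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / (n - 1))
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

def correlation_matrix(z_rows):
    """Covariance of standardized columns, i.e. the correlation matrix."""
    n, d = len(z_rows), len(z_rows[0])
    return [[sum(r[i] * r[j] for r in z_rows) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_component(matrix, iterations=500):
    """Power iteration: repeatedly multiply and renormalize a vector.

    Assumes the starting vector is not orthogonal to the leading
    eigenvector, which holds for positively correlated columns.
    """
    d = len(matrix)
    vec = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return vec

names = list(counties)
z_rows = standardize(list(counties.values()))
component = first_component(correlation_matrix(z_rows))

# Score each county along the main axis of variation.
scores = {name: sum(w * x for w, x in zip(component, row))
          for name, row in zip(names, z_rows)}

# The county with the largest absolute score sits furthest from the pack.
outlier = max(scores, key=lambda name: abs(scores[name]))
print(outlier, round(scores[outlier], 2))
```

With dozens of variables instead of three, the same scores give one number per county to sort on instead of dozens of columns.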

Reese localized the gun violence discussion in a Sunday A1 story by profiling Lemon Hill, the Sacramento neighborhood with the most reported assaults with a firearm and shootings into buildings. Among the ornery comments on the Sacramento Bee website, some noted that police report data could be influenced by variables like population size, because it’s logical that more shootings occur where more people live. Although Lemon Hill leads all neighborhoods in gun crimes, Multidimensional Scaling, a technique related to PCA, could help verify its rank as a dangerous neighborhood by taking factors like population into account.

When rumors began at the start of my internship that Sacramento’s professional basketball team would move to Seattle, Reese found population data to compare Sacramento to other NBA cities. An implementation of the nearest neighbors algorithm could create a similar comparison, but using more variables — like income or geographic size — to find cities that are comparable in more ways than population size.
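A minimal sketch of that kind of comparison follows. The city names and figures are placeholders invented for illustration, not real NBA-market data. Each variable is converted to z-scores first so that population, measured in much larger numbers, does not swamp income or area in the distance calculation.

```python
import math

# Hypothetical cities: [population (thousands), median income ($ thousands),
# land area (square miles)]. All values are made up for this example.
cities = {
    "Home city": [500, 55, 100],
    "City A": [520, 57, 95],
    "City B": [480, 90, 100],
    "City C": [3900, 60, 470],
    "City D": [650, 52, 140],
}

def zscores(table):
    """Rescale every column to mean 0 and standard deviation 1."""
    cols = list(zip(*table.values()))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c))
            for c, m in zip(cols, means)]
    return {name: [(x - m) / s for x, m, s in zip(row, means, stds)]
            for name, row in table.items()}

def nearest(table, target, k=2):
    """Return the k rows closest to `target` by Euclidean distance."""
    scaled = zscores(table)
    distances = [(math.dist(scaled[target], vec), name)
                 for name, vec in scaled.items() if name != target]
    return [name for _, name in sorted(distances)[:k]]

print(nearest(cities, "Home city"))
```

In this toy table, City B ties City A as the closest match on population alone, but its much higher income pushes it down the list once all three variables count.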

I spent hours of my internship building a hexagon map of homicides and shootings in Sacramento. Hexagonal binning is a popular mapping technique because it groups clustered points into bins, isn’t too difficult to render in browsers, and just plain looks cool. But I also could have used the DBSCAN algorithm to show concentrations of shootings. Davis’s Python script takes latitude and longitude pairs and creates clusters based on a provided distance.
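The sketch below shows the general shape of DBSCAN on latitude and longitude pairs. It is not Davis’s script: it is a plain-Python version written for illustration, the coordinates are invented points in the Sacramento area, and the `eps_km` and `min_points` parameter names are my own. A library version such as scikit-learn's `DBSCAN` would be the practical choice for real data.

```python
import math

def haversine_km(a, b):
    """Great-circle distance in kilometers between two (lat, lon) pairs."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def dbscan(points, eps_km, min_points):
    """Label each point with a cluster number, or -1 for noise."""
    NOISE = -1
    labels = [None] * len(points)

    def neighbors(i):
        # Every point within eps_km of point i (including i itself).
        return [j for j in range(len(points))
                if haversine_km(points[i], points[j]) <= eps_km]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_points:
            labels[i] = NOISE          # may be claimed later as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point: join, but do not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_points:
                queue.extend(j_nbrs)   # core point: keep growing the cluster
    return labels

# Invented incident locations: two tight groups and one isolated point.
incidents = [
    (38.500, -121.450), (38.501, -121.451), (38.502, -121.449),
    (38.600, -121.500), (38.601, -121.501), (38.599, -121.499),
    (38.700, -121.300),
]
labels = dbscan(incidents, eps_km=0.5, min_points=3)
print(labels)
```

Unlike hexagonal bins, the clusters here follow the data: their shape and number come from where the points fall, and isolated incidents are left out as noise.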

I also spent hours of my internship examining more than 100,000 PDFs from the state transportation agency for an investigation of construction on the Bay Bridge. PDF scraping technologies like Tabula could have saved me lots of time, but I would have loved to run a Locality Sensitive Hashing script to find similarities in the text.
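One common form of Locality Sensitive Hashing for text is MinHash with banding, sketched below under several assumptions: the text has already been extracted from the PDFs, the document names and sentences are invented, and the shingle size, number of hashes and number of bands are arbitrary illustrative settings. Documents that land in the same bucket for any band become candidate near-duplicates, so most pairs never have to be compared directly.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

def shingles(text, size=3):
    """Break text into overlapping runs of `size` words."""
    words = text.lower().split()
    return {" ".join(words[i:i + size])
            for i in range(max(1, len(words) - size + 1))}

def minhash(shingle_set, num_hashes=32):
    """For each salted hash function, keep the smallest hash value seen."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingle_set))
    return signature

def candidate_pairs(docs, num_hashes=32, bands=8):
    """Pair up documents whose signatures match on at least one band."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for name, text in docs.items():
        sig = minhash(shingles(text), num_hashes)
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(name)
    pairs = set()
    for names in buckets.values():
        pairs.update(combinations(sorted(names), 2))
    return pairs

# Invented snippets standing in for text pulled from PDFs.
docs = {
    "report_1": "the contractor reported cracked welds on the eastern span of the bridge",
    "report_2": "the contractor reported cracked welds on the eastern span of the bridge",
    "memo_1": "city council approves new budget for parks and libraries next year",
}
print(candidate_pairs(docs))
```

More bands with fewer rows each make the filter more forgiving, catching looser matches at the cost of more false candidates to review.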

Computer-assisted reporters know that performing operations on data requires clean, structured spreadsheets. These algorithms are no exception. Davis’s scripts for machine learning algorithms all use Python libraries like numpy and scikit-learn that apply algorithms to data in CSV files. Installing the necessary libraries and editing the Python code to run these scripts on your own datasets means flexing some programming muscles, but if computer-assisted reporters can find front-page stories in seas of census data, they can wield a command prompt and take the next step in data analysis.

After spreadsheets, algorithms are the logical next step.

About the author

Dan Keemahill

Undergraduate Fellow
