Beyond spreadsheets for CAR reporters: Algorithms

The lightning talks at NICAR are often the highlight of the computer-assisted reporting conference, but Chase Davis (who recently did a Q&A with us) really grabbed my attention with his “Five Algorithms in Five Minutes” talk, complete with a mic drop. So much so that three months later I'm still thinking about it and all of the ways that I might put these algorithms to use.

NICAR coincided with my internship at The Sacramento Bee, my hometown paper, where I was spending lots of time with (and eating lots of Chipotle with) computer-assisted reporter Phillip Reese. Reese has become the go-to data expert in the Bee newsroom by helping reporters find numbers to back up their stories. He knows how to avoid screwing up data and how to bulletproof his spreadsheets by keeping track of his records, making backups and asking the experts. I admire his ability to find newsworthy trends and outliers with averages, medians, percent change and sorted spreadsheets so much that I share his articles on Twitter with #PhillipReeseFanClub.

So when Davis showed NICAR how algorithms can help reporters dig through data, I thought of our two-person data team in Sacramento. My machine learning research at the Knight Lab and study of algorithms in computer science classes have further shown me how we could have applied these data techniques at the Bee.

Reese and I worked on several projects involving data for all 58 counties in California, so when the state finance department dropped a report predicting how each county’s population will change by 2050, we dove in to analyze (and map) its findings for the next day’s paper. We looked at percent change and tried to find interesting outliers, but that meant keeping track of dozens of demographic variables.

Principal Component Analysis (PCA) could have really helped us out that day. It compresses the correlated variables in a dataset into a few components, so that records that don't follow the main pattern stand out. In that story we found that the state predicted almost every county would grow because of growth in the Hispanic population; PCA could have singled out the counties that bucked the trend.
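To make the idea concrete, here is a minimal sketch of a PCA-style outlier check in plain Python. The five counties and their numbers are invented for illustration, and the function names are my own. It finds only the first principal component, using power iteration on standardized data, then ranks counties by how far they sit from that main trend. It is not Davis's script. A library such as scikit-learn would normally do this work.

```python
import math

# Invented numbers for illustration only: projected percent change by 2050
# in (total population, Hispanic population) for five hypothetical counties.
projections = {
    "County A": [10.0, 20.0],
    "County B": [20.0, 40.0],
    "County C": [30.0, 60.0],
    "County D": [40.0, 80.0],
    "County E": [25.0, 10.0],  # grows, but not on the back of Hispanic growth
}


def standardize(rows):
    """Rescale every column to mean 0 and standard deviation 1."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c))
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


def first_component(z, iterations=200):
    """Leading eigenvector of the covariance matrix, found by power iteration.

    Assumes the starting vector is not orthogonal to the answer, which holds
    for this toy data but is not guaranteed in general.
    """
    n, k = len(z), len(z[0])
    cov = [[sum(row[i] * row[j] for row in z) / n for j in range(k)]
           for i in range(k)]
    vec = [1.0] + [0.0] * (k - 1)
    for _ in range(iterations):
        vec = [sum(cov[i][j] * vec[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    return vec


def off_trend_distance(row, pc):
    """Distance between a standardized row and its projection onto pc."""
    score = sum(a * b for a, b in zip(row, pc))
    return math.sqrt(sum((a - score * b) ** 2 for a, b in zip(row, pc)))


names = list(projections)
z = standardize([projections[name] for name in names])
pc = first_component(z)
ranked = sorted(
    zip(names, (off_trend_distance(row, pc) for row in z)),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, distance in ranked:
    print(f"{name}: {distance:.2f}")
```

The county at the top of the printed list is the one least explained by the shared "everything grows together" pattern, which makes it a candidate for a closer look.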

Reese localized the gun violence discussion in a Sunday A1 story by profiling Lemon Hill, the Sacramento neighborhood with the most reported assaults with a firearm and shootings into a building. Among the ornery comments on the Sacramento Bee website, some noted that police report data could be influenced by variables like population size, because it’s logical that more shootings occur where more people live. Although Lemon Hill leads all neighborhoods in gun crimes, Multidimensional Scaling, which is similar to PCA, could help confirm its rank as a dangerous neighborhood by accounting for factors like population.
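The commenters' objection can be checked with arithmetic long before reaching for Multidimensional Scaling. The sketch below is not MDS. It is the simpler per-capita normalization, computed on invented counts and populations for three hypothetical neighborhoods, and it shows how a ranking can change once population is taken into account.

```python
# Invented figures for illustration only:
# neighborhood: (reported gun incidents, residents)
neighborhoods = {
    "Neighborhood A": (60, 15000),
    "Neighborhood B": (30, 5000),
    "Neighborhood C": (45, 20000),
}


def rate_per_1000(incidents, population):
    """Incidents per 1,000 residents."""
    return 1000 * incidents / population


# Rank once by raw count and once by population-adjusted rate.
by_count = sorted(neighborhoods, key=lambda n: neighborhoods[n][0], reverse=True)
by_rate = sorted(neighborhoods, key=lambda n: rate_per_1000(*neighborhoods[n]),
                 reverse=True)

print("Most incidents:", by_count[0])
print("Highest rate:  ", by_rate[0])
for name in by_rate:
    print(f"  {name}: {rate_per_1000(*neighborhoods[name]):.2f} per 1,000")
```

If a neighborhood leads both lists, the population objection does not change the story. If it leads only the raw count, the story needs a caveat.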

When rumors began at the start of my internship that Sacramento’s professional basketball team would move to Seattle, Reese found population data to compare Sacramento to other NBA cities. An implementation of the nearest neighbors algorithm could create a similar comparison, but using more variables — like income or geographic size — to find cities that are comparable in more ways than population size.
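A nearest-neighbors comparison along those lines might look like the sketch below. The cities and figures are invented placeholders (population in thousands, median income in thousands of dollars, area in square miles), and the function names are my own. Each column is standardized first so that no single variable dominates the distance.

```python
import math

# Invented figures for illustration only:
# [population (thousands), median income ($ thousands), area (sq. miles)]
cities = {
    "Home city": [480, 55, 100],
    "City A": [500, 57, 98],
    "City B": [470, 90, 300],
    "City C": [2700, 52, 230],
    "City D": [650, 50, 140],
}


def standardize(table):
    """Rescale every column to mean 0 and standard deviation 1."""
    names = list(table)
    cols = list(zip(*(table[n] for n in names)))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c))
           for c, m in zip(cols, means)]
    return {n: [(v - m) / s for v, m, s in zip(table[n], means, sds)]
            for n in names}


def nearest(table, target, k):
    """Names of the k rows closest to target by Euclidean distance."""
    distances = [(math.dist(table[target], values), name)
                 for name, values in table.items() if name != target]
    return [name for _, name in sorted(distances)[:k]]


# Comparison on population alone, then on all three variables.
population_only = {name: values[:1] for name, values in cities.items()}
by_population = nearest(standardize(population_only), "Home city", k=2)
by_all = nearest(standardize(cities), "Home city", k=2)

print("Closest by population:", by_population)
print("Closest by all variables:", by_all)
```

In this toy data the closest match by population alone is a much wealthier, more spread-out city, while the closest match across all three variables is a different one. That difference is the argument for using more than one variable.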

I spent hours of my internship building a hexagon map of homicides and shootings in Sacramento. Hexagonal binning is a popular mapping technique because it groups nearby points into clusters, isn’t too difficult to render in browsers, and just plain looks cool. But I also could have used the DBSCAN algorithm to show concentrations of shootings. Davis’s Python script takes latitude and longitude pairs and creates clusters based on a provided distance.
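The sketch below is not Davis's script. It is a small plain-Python DBSCAN written for illustration, run on invented coordinates. It takes latitude and longitude pairs, a distance in meters (`eps_m`) and a minimum cluster size (`min_samples`), and it labels each point with a cluster number, or -1 for noise. The parameter names follow scikit-learn's `DBSCAN`, which is what I would use on real data.

```python
import math

EARTH_RADIUS_M = 6371000


def haversine_m(a, b):
    """Great-circle distance in meters between two (lat, lon) pairs."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def neighbors_within(points, i, eps_m):
    """Indexes of all points within eps_m of point i (including i itself)."""
    return [j for j, p in enumerate(points) if haversine_m(points[i], p) <= eps_m]


def dbscan(points, eps_m, min_samples):
    """Return one label per point: 0, 1, 2... for clusters, -1 for noise."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors_within(points, i, eps_m)
        if len(seeds) < min_samples:
            labels[i] = -1  # noise, unless a later cluster claims it
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors_within(points, j, eps_m)
            if len(reachable) >= min_samples:
                queue.extend(reachable)  # j is a core point, keep expanding
    return labels


# Invented coordinates for illustration only: two tight groups and one stray.
incidents = [
    (38.5200, -121.4500), (38.5205, -121.4503),
    (38.5198, -121.4496), (38.5210, -121.4510),
    (38.6000, -121.5000), (38.6004, -121.4995), (38.5996, -121.5006),
    (38.7000, -121.3000),
]
labels = dbscan(incidents, eps_m=300, min_samples=3)
print(labels)
```

Unlike hexagonal bins, the clusters here have no fixed shape or grid. The tradeoff is that the results depend heavily on the distance and minimum-size settings, so both need to be reported alongside any map.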

I also spent hours of my internship examining more than 100,000 PDFs from the state transportation agency for an investigation of construction on the Bay Bridge. PDF scraping technologies like Tabula could have saved me lots of time, but I would have loved to run a Locality Sensitive Hashing script to find similarities in the text.
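One common form of Locality Sensitive Hashing for text is MinHash with banding, and a rough sketch of it follows. The three documents are invented stand-ins for extracted PDF text. The shingle size, the number of hashes and the number of bands are arbitrary choices for illustration, not tuned values. The point of the technique is that documents are compared only when they land in the same bucket, so 100,000 files never have to be checked pair by pair.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Set of overlapping k-word phrases in the text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash(shingle_set, num_hashes=32):
    """Signature: for each seeded hash function, the smallest hash value seen."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(),
                "big")
            for s in shingle_set))
    return signature


def candidate_pairs(signatures, bands=8):
    """Pairs of documents whose signatures agree on at least one whole band."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for doc_id, signature in signatures.items():
        for b in range(bands):
            buckets[(b, tuple(signature[b * rows:(b + 1) * rows]))].add(doc_id)
    pairs = set()
    for doc_ids in buckets.values():
        pairs.update(combinations(sorted(doc_ids), 2))
    return pairs


# Invented text for illustration only; doc_b is an exact duplicate of doc_a.
documents = {
    "doc_a": "inspection of the anchor rods found cracks in several of the bolts",
    "doc_b": "inspection of the anchor rods found cracks in several of the bolts",
    "doc_c": "the quarterly budget summary lists paving contracts by district",
}
signatures = {doc_id: minhash(shingles(text)) for doc_id, text in documents.items()}
pairs = candidate_pairs(signatures)
print(sorted(pairs))
```

Exact duplicates always share every band, so they are always flagged. Near-duplicates are flagged with a probability that rises with their similarity, which is why any pair the script surfaces still needs a human read.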

Computer-assisted reporters know that performing operations on data requires clean, structured spreadsheets. These algorithms are no exception. Davis’s scripts for machine learning algorithms all use Python libraries like numpy and scikit-learn that apply algorithms to data in CSV files. Installing the necessary libraries and editing the Python code to run these scripts on your own datasets means flexing some programming muscles, but if computer-assisted reporters can find front-page stories in seas of census data, they can wield a command prompt and take the next step in data analysis.
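As a rough sketch of that first step, the snippet below reads a CSV with Python's standard `csv` module and keeps only the rows whose numeric columns are complete. The column names and values are invented, and `load_numeric_rows` is a hypothetical helper, not part of Davis's scripts. An in-memory string stands in for a file so the example runs on its own.

```python
import csv
import io

# Stand-in for a real file. With a file on disk you would instead use
# open("your_file.csv", newline="") and pass that handle to the function.
raw = io.StringIO(
    "county,total_change,hispanic_change\n"
    "County A,10,20\n"
    "County B,20,\n"
    "County C,30,60\n"
)


def load_numeric_rows(handle, label, columns):
    """Split a CSV into labels, numeric rows, and labels of rows we skipped."""
    names, rows, skipped = [], [], []
    for record in csv.DictReader(handle):
        try:
            values = [float(record[column]) for column in columns]
        except (TypeError, ValueError):
            # Blank or non-numeric cell: set the row aside instead of guessing.
            skipped.append(record[label])
            continue
        names.append(record[label])
        rows.append(values)
    return names, rows, skipped


names, rows, skipped = load_numeric_rows(
    raw, label="county", columns=["total_change", "hispanic_change"])
print(names)
print(rows)
print("Skipped:", skipped)
```

Keeping a list of skipped rows matters as much as the clean ones, because it tells you which records dropped out of the analysis and need checking.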

After spreadsheets, algorithms are the logical next step.
