The lightning talks at NICAR are often the highlight of the computer-assisted reporting conference, but Chase Davis (who recently did a Q&A with us) really grabbed my attention with his “Five Algorithms in Five Minutes” talk, complete with a mic drop. So much so that three months later I'm still thinking about it and all of the ways I might put these algorithms to use.
NICAR coincided with my internship at The Sacramento Bee, my hometown paper, where I was spending lots of time with—and eating plenty of Chipotle with—computer-assisted reporter Phillip Reese. Reese has become the go-to data expert in the Bee newsroom by helping reporters find numbers to back up their stories. He knows how to not screw up data and bulletproof his spreadsheets by keeping track of his records, making backups and asking the experts. I admire his ability to find newsworthy trends and outliers with averages, medians, percent change and sorted spreadsheets so much that I share his articles on Twitter with #PhillipReeseFanClub.
So when Davis showed NICAR how algorithms can help reporters dig through data, I thought of our two-person data team in Sacramento. My machine learning research at the Knight Lab and study of algorithms in computer science classes have further shown me how we could have applied these data techniques at the Bee.
Reese and I worked on several projects involving data for all 58 counties in California, so when the state finance department dropped a report predicting how each county’s population would change by 2050, we dove in to analyze (and map) its findings for the next day’s paper. We looked at percent change and tried to find interesting outliers, but that meant keeping track of dozens of demographic variables.
Principal Component Analysis (PCA) could have really helped us out that day. It compresses the correlated variables in a dataset so that the interesting ones stand out. In that story we found that the state predicted almost every county would grow because of growth in its Hispanic population; PCA would have singled out the counties that bucked the trend.
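Here’s a rough sense of what that looks like in practice. This is only a sketch, not Davis’s script: the counties.csv file and its column names are placeholders for whatever demographic projections you’re working with.

```python
# Sketch: PCA on a hypothetical counties.csv with one row per county
# and numeric demographic columns (projected growth by group, etc.).
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("counties.csv")            # hypothetical file name
features = df.drop(columns=["county"])      # keep only the numeric columns

# Standardize so no single variable dominates, then project onto two components.
scaled = StandardScaler().fit_transform(features)
components = PCA(n_components=2).fit_transform(scaled)

# Counties that sit far from the pack on the first components are the outliers
# worth a closer look.
df["pc1"], df["pc2"] = components[:, 0], components[:, 1]
print(df.sort_values("pc1").head(10))
```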
Reese localized the gun violence discussion in a Sunday A1 story by profiling Lemon Hill, the Sacramento neighborhood with the most reported assaults with a firearm and shootings into buildings. Among the ornery comments on the Sacramento Bee website, some noted that police report data could be influenced by variables like population size, because it’s logical that more shootings occur where more people live. Although Lemon Hill leads all neighborhoods in gun crimes, Multidimensional Scaling, similar to PCA, could back up its rank as a dangerous neighborhood by controlling for factors like population.
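A sketch of that idea with scikit-learn’s MDS, assuming a hypothetical neighborhoods.csv; the column names are mine, not the Bee’s:

```python
# Sketch: place neighborhoods on a 2-D "map" where similar neighborhoods
# (by crime rate, density, income) sit close together.
import pandas as pd
from sklearn.manifold import MDS
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("neighborhoods.csv")       # hypothetical file name
features = df[["gun_crimes_per_1000", "population_density", "median_income"]]

coords = MDS(n_components=2, random_state=0).fit_transform(
    StandardScaler().fit_transform(features)
)

# Lemon Hill's nearest points on this map are its true peers, population
# and all, rather than just the neighborhoods with the biggest raw counts.
df["x"], df["y"] = coords[:, 0], coords[:, 1]
print(df[["neighborhood", "x", "y"]].head())
```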
When rumors began at the start of my internship that Sacramento’s professional basketball team would move to Seattle, Reese found population data to compare Sacramento to other NBA cities. An implementation of the nearest neighbors algorithm could create a similar comparison, but using more variables — like income or geographic size — to find cities that are comparable in more ways than population size.
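Something like scikit-learn’s NearestNeighbors could handle that comparison; the nba_cities.csv file and its columns below are stand-ins for whatever city-level variables you pull together:

```python
# Sketch: find the NBA cities most similar to Sacramento across several
# variables at once, not just population.
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("nba_cities.csv")          # hypothetical file name
features = StandardScaler().fit_transform(
    df[["population", "median_income", "land_area_sq_mi"]]
)

# Ask for four neighbors because the closest neighbor of each row is itself.
nn = NearestNeighbors(n_neighbors=4).fit(features)
distances, indices = nn.kneighbors(features)

sac = df.index[df["city"] == "Sacramento"][0]
print(df.iloc[indices[sac][1:]]["city"])    # the three cities most like Sacramento
```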
I spent hours of my internship building a hexagon map of homicides and shootings in Sacramento. Hexagonal binning is a popular mapping technique because it clusters points neatly, isn’t too difficult to render in browsers, and just plain looks cool. But I also could have used a DBSCAN algorithm to show concentrations of shootings. Davis’s Python script takes latitude and longitude pairs and creates clusters based on a provided distance.
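I won’t reproduce Davis’s script here, but a minimal DBSCAN run with scikit-learn looks something like this; shootings.csv, its column names, and the eps distance are assumptions you’d tune for real data:

```python
# Sketch: cluster shooting locations by density.
import pandas as pd
from sklearn.cluster import DBSCAN

df = pd.read_csv("shootings.csv")           # hypothetical file name
coords = df[["latitude", "longitude"]].values

# Points within eps of each other (with at least min_samples neighbors)
# form a cluster; stray incidents get the label -1 (noise).
# eps is in the same units as the coordinates (degrees here), so tune it.
labels = DBSCAN(eps=0.005, min_samples=5).fit_predict(coords)

df["cluster"] = labels
print(df[df["cluster"] != -1].groupby("cluster").size().sort_values(ascending=False))
```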
I also spent hours of my internship examining more than 100,000 PDFs from the state transportation agency for an investigation of construction on the Bay Bridge. PDF scraping technologies like Tabula could have saved me lots of time, but I would have loved to run a Locality Sensitive Hashing script to find similarities in the text.
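The core trick behind locality sensitive hashing is MinHash: boil each document down to a short signature and compare signatures instead of full texts. Here’s a toy version of that idea using only the standard library; the file names and shingle size are placeholders, and a real run over 100,000 PDFs would want a proper LSH library:

```python
# Sketch: MinHash signatures for estimating how similar two documents are.
import hashlib
from itertools import combinations

def shingles(text, k=5):
    # Overlapping runs of k words, the unit of comparison between documents.
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash(shingle_set, num_hashes=100):
    # For each of num_hashes salted hash functions, keep the smallest value.
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set
        ))
    return sig

docs = {name: open(name).read() for name in ["doc1.txt", "doc2.txt", "doc3.txt"]}
sigs = {name: minhash(shingles(text)) for name, text in docs.items()}

# The fraction of matching signature positions estimates Jaccard similarity.
for a, b in combinations(sigs, 2):
    match = sum(x == y for x, y in zip(sigs[a], sigs[b])) / len(sigs[a])
    print(a, b, round(match, 2))
```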
Computer-assisted reporters know that performing operations on data requires clean, structured spreadsheets. These algorithms are no exception. Davis’s scripts for machine learning algorithms all use Python libraries like numpy and scikit-learn that apply algorithms to data in CSV files. Installing the necessary libraries and editing the Python code to run these scripts on your own datasets means flexing some programming muscles, but if computer-assisted reporters can find front-page stories in seas of census data, they can wield a command prompt and take the next step in data analysis.
After spreadsheets, algorithms are the logical next step.