NICAR 2015: Machine learning lessons for journalists

Machine learning is certainly not a new concept in journalism, but it seemed to enjoy plenty of prominence at NICAR this year — fantastic news for newbies to the field like me. I attended several sessions on it, both theoretical and technical, and a few key concepts came up repeatedly. Whether this year’s conference was your first exposure to machine learning, or you’re a seasoned pro, here are four takeaways worth reviewing:

Machine learning is the starting point for a story, not the end.

This critical reminder came from Sarah Cohen during the panel “Machine Learning in the Wild” (with Steven Rich and Janet Roberts). Machine learning is an incredibly powerful tool: it can help you clean up data, sort through thousands of documents, and make telling predictions.

But as Cohen cautioned, the results of your model are not your story. Instead, they provide helpful insight and serve as a launch pad for additional reporting. It’s important to recognize the limitations of your data. Otherwise, as Cohen pointed out, journalists fall squarely in Drew Conway’s danger zone: we know just enough about handling data to be dangerous, but not enough to handle it responsibly.

“Nothing that comes out of these algorithms is the ‘truth,’” machine learning researcher Hanna Wallach said in her session “Lessons from Computational Social Sciences.” Your results have to be contextualized to be handled responsibly. And that takes time and plenty of due diligence because...

Your first model will suck.

Sadly, that’s just the way things are. Roberts presented a real-life scenario: in 2014, she and her team at Reuters investigated how a handful of lawyers with close connections to Supreme Court justices were routinely succeeding in getting their clients’ cases heard in the highest court. They used machine learning (specifically, latent Dirichlet allocation) to identify the topics in 14,400 petitions for Supreme Court hearings and to categorize briefs, which would provide data about, among other things, who these lawyers were representing (largely corporate interests). But the first model Roberts’ team tried returned results with a dismal 36 percent accuracy.

Roberts put up pictures of the nine Supreme Court justices — some of the most powerful people in the U.S. — and discussed the elite lawyers’ connections to them. She was not, she said, going to publish a story about these people with that kind of accuracy.

But for a first model, this sort of failure is normal. In fact, machine learning is “a lot of failure,” Rich said. “I’ve never had a win without a fail (or five) first.”

There are ways to fix it. Make adjustments. Evaluate. Repeat.

Model adjustment is a feedback loop, said Chase Davis during his session “Hands-on with Machine Learning.” We constantly have to ask, “How can we get this score to go up? How can we do better?” Nick Diakopoulos covered some text processing methods in his session “Text Analysis and Visualization,” including:

  • converting all text to lowercase
  • consolidating various plurals and tenses of a word into its root form (called "stemming")
  • taking "meaningless" words, like prepositions, out of consideration (called "stop word removal")
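To make those three steps concrete, here is a minimal sketch in plain Python. It is not the code Diakopoulos presented: the stop word list is a tiny illustrative sample, and `crude_stem` is a hypothetical, deliberately naive suffix stripper standing in for a real stemmer (such as the Porter stemmer), which handles far more cases.

```python
import re

# A tiny illustrative stop word list (an assumption for this sketch);
# real projects use much longer lists.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "for", "with", "by", "at"}

# Suffixes removed by the naive stemmer below, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Strip at most one common suffix, keeping at least three letters.

    This is a hypothetical stand-in for a real stemming algorithm.
    """
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase the text, drop stop words, and stem what remains."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(token) for token in tokens if token not in STOP_WORDS]


print(preprocess("The lawyers petitioned the court for hearings"))
# prints ['lawyer', 'petition', 'court', 'hearing']
```

The point of each step is the same: collapse superficially different strings ("Lawyers", "lawyer") into one token so the model counts them together.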


Roberts and her team employed some of these methods and experimented with the number of topics they were extracting (too few, and they would lose precision — too many, and topics would be hyperspecific). Eventually, their model returned the top topic with 93 percent accuracy.

Knowing exactly how poorly (or wonderfully) your model did is not as valuable as understanding why.

“You’re going to live or die by your ability to evaluate your model,” Davis said, and it’s especially true when you’re working with your first few iterations. But as Wallach pointed out, digging past your accuracy percentage will provide crucial insight. Understand the math of your model if you can, and think carefully about what it got right and what it got wrong.

Wallach calls this an “end user concept.” For computer scientists working in computational social sciences, being able to present information about what caused your algorithm to return a certain result helps social scientists understand how much they should trust the data. And, Wallach says, for journalists trying to make editors understand that the data will never be 100 percent accurate, understanding what’s behind an accuracy result can help them explain that “the fact that we’re getting it 30 percent wrong doesn’t invalidate the fact that we’re getting it 70 percent right.”
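One simple way to dig past a single accuracy number is to break the results down by category and count which labels get mistaken for which. The sketch below assumes you have hand-labeled a sample of documents; the labels and the `evaluate` helper are hypothetical, made up for illustration, and do not come from any of the sessions.

```python
from collections import Counter, defaultdict


def evaluate(true_labels, predicted_labels):
    """Return overall accuracy, per-label accuracy, and a count of confusions."""
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("need two non-empty label lists of equal length")

    totals = defaultdict(int)    # how many documents truly carry each label
    correct = defaultdict(int)   # how many of those the model got right
    confusions = Counter()       # (true label, predicted label) for each miss

    for truth, guess in zip(true_labels, predicted_labels):
        totals[truth] += 1
        if truth == guess:
            correct[truth] += 1
        else:
            confusions[(truth, guess)] += 1

    accuracy = sum(correct.values()) / len(true_labels)
    per_label = {label: correct[label] / totals[label] for label in totals}
    return accuracy, per_label, confusions


# Hypothetical hand-labeled sample: what each document is really about...
truth = ["business", "business", "criminal", "criminal", "criminal",
         "civil_rights", "civil_rights", "business", "criminal", "civil_rights"]
# ...and what an imaginary model guessed.
guesses = ["business", "business", "criminal", "business", "criminal",
           "criminal", "civil_rights", "business", "criminal", "criminal"]

accuracy, per_label, confusions = evaluate(truth, guesses)
print(accuracy)                  # 0.7 overall
print(per_label["business"])     # 1.0
print(per_label["criminal"])     # 0.75
print(confusions.most_common(1)) # [(('civil_rights', 'criminal'), 2)]
```

In this made-up sample, the overall 70 percent hides the fact that the model is reliable on business cases and mostly wrong on civil rights cases, which it tends to file as criminal. That is the kind of detail that tells you where more reporting, or more model adjustment, is needed.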

For more on NICAR’s Machine Learning sessions, check out Stephen Suen’s summary of Hanna Wallach’s computational social sciences talk, Rich’s slides on machine learning wins and fails, Diakopoulos’s slides on text processing, and Davis’s materials for his hands-on demo.

About the author

Anushka Patil

Undergraduate Fellow
