Fact checking Chicago Public Schools using algorithms, statistics and data mining


Some students take it easy for the spring semester of their senior year; I loaded up on Introduction to Algorithms and Statistical Methods for Data Mining. The stats class covered theoretical foundations for data mining techniques like logistic regression and neural networks and finished with an open-ended group project assignment.

As it happened, the class coincided with Chicago Public Schools' decision to close 49 schools. The move drew ferocious criticism from community groups (including the Chicago Teachers Union), which claimed CPS unfairly selected the schools to be closed. Protesters and internet memes accused the district of racism.

Those accusations turned out to be the perfect fact-checking project for my algorithms and stats class. Classmates Jim Garrison, Jaya Sah and I used three data mining techniques (logistic regression, neural networks and classification trees) to determine whether racial demographics predicted school closings.

Feel free to read our 15-page report or look at the slides from our in-class presentation, but if you don't have the time, read on for five observations about data that I made during the project.

1) Get the basics

Before we opened SPSS to test whether race is a predictor of school closings, we ran some basic averages to get an initial read on the protesters’ claims. Indeed, the Chicago school closures disproportionately affect black students, who make up 40% of the Chicago student population but 90% of the student body in closed schools.
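For readers who want to run this kind of first-pass check outside SPSS, here is a minimal Python sketch. The rows and field names (`enrollment`, `black_students`, `closed`) are made up for illustration; they are not the report card fields or real CPS figures.

```python
def share(schools, group_key, total_key="enrollment"):
    """Return the fraction of all enrolled students who belong to one group."""
    total = sum(s[total_key] for s in schools)
    group = sum(s[group_key] for s in schools)
    return group / total if total else 0.0

# Hypothetical rows for illustration only, not real CPS data.
schools = [
    {"name": "A", "enrollment": 400, "black_students": 380, "closed": True},
    {"name": "B", "enrollment": 600, "black_students": 120, "closed": False},
    {"name": "C", "enrollment": 500, "black_students": 100, "closed": False},
]

# Compare the group's share district-wide with its share in closed schools.
overall = share(schools, "black_students")
in_closed = share([s for s in schools if s["closed"]], "black_students")
print(f"overall: {overall:.0%}, closed schools: {in_closed:.0%}")
```

Note that the shares are weighted by enrollment, so a large school counts for more than a small one. Averaging per-school percentages instead would answer a slightly different question.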

Throughout our analysis we used the Illinois State Report Card data, which is released annually and includes all kinds of statistics about campus demographics, test scores and resources. But it was a lot of data. Wrangling the 9,655 columns of data for each school created problems for our analysis, and I spent the majority of my time reducing its complexity. Luckily I was working at the Knight Lab with Joe Germuska, who is very familiar with the Illinois data thanks to his work with the Chicago Tribune report cards news app.

2) CSVkit rules

Thanks to Joe, I could use a schema to convert the 225-megabyte, semicolon-delimited 2012 report card file to a CSV. Even then, SPSS couldn’t import 9,655 data attributes at once, and understanding what was in the dataset was a struggle.
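The conversion step can be sketched in a few lines of Python. This is not the script used for the project. It assumes the raw file has no header row and that the schema supplies the column names in order, and the column names and rows below are hypothetical.

```python
import csv
import io

def convert_to_csv(src, dst, column_names):
    """Copy semicolon-delimited rows from src to dst as comma-delimited CSV,
    writing column_names as the header row first."""
    reader = csv.reader(src, delimiter=";")
    writer = csv.writer(dst, lineterminator="\n")
    writer.writerow(column_names)
    for row in reader:
        writer.writerow(row)

# Hypothetical schema and rows; the real file has 9,655 columns.
column_names = ["school_id", "school_name", "attendance_rate"]
raw = io.StringIO("001;Example Elementary, North;94.2\n002;Sample School;91.7\n")
out = io.StringIO()

convert_to_csv(raw, out, column_names)
print(out.getvalue())
```

Using the `csv` module rather than a plain find-and-replace matters because a value can contain a comma, as the first school name does here. The writer quotes that value so the output still parses correctly.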

CSVkit was a godsend. Chris Groskopf’s tool allowed us to examine and slice the data just how we wanted before importing it to SPSS. (More on our data cleaning methods below.) With CSVkit we could quickly test CSVs with different sets of attributes, so we only stayed in the library until 2 a.m., not 5 a.m.


3) Have a dirty data plan

Neural network of Chicago Public Schools.

Being aware of and dealing with “dirty data” – information that is inaccurate or incomplete – is vital for any data-driven project, but it was especially crucial to our data mining techniques. Neural networks and logistic regression fail with missing data, so we had to clean our dataset.

We observed that the report cards included fields for high school graduation and high school test scores, even though the majority of our closed schools were elementary schools to which those fields did not apply. We used CSVkit to exclude those fields. For other fields with missing values, we could also have used SPSS procedures that fill the gaps with computed averages.
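Both cleaning strategies, dropping fields that don't apply and filling remaining gaps with an average, can be sketched in plain Python. This is an illustration of the general idea rather than the CSVkit or SPSS steps themselves, and the field names and values are hypothetical.

```python
from statistics import mean

def drop_columns(rows, unwanted):
    """Return copies of rows without the columns named in unwanted."""
    return [{k: v for k, v in row.items() if k not in unwanted} for row in rows]

def impute_mean(rows, column):
    """Replace missing values (None) in column with the mean of the observed
    values. Raises statistics.StatisticsError if no value is observed."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = mean(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]

# Hypothetical elementary-school rows: the high school field never applies.
rows = [
    {"school": "A", "attendance_rate": 94.0, "hs_graduation_rate": None},
    {"school": "B", "attendance_rate": None, "hs_graduation_rate": None},
    {"school": "C", "attendance_rate": 90.0, "hs_graduation_rate": None},
]

cleaned = impute_mean(drop_columns(rows, {"hs_graduation_rate"}), "attendance_rate")
print(cleaned)
```

Mean imputation keeps every school in the dataset, but it also shrinks the variance of the filled column, so it is worth recording which values were filled in.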

4) Another dimension, another dimension

The thousands of data attributes for each school also posed a challenge for SPSS and prevented our data mining methods from drawing conclusions. Data with lots of columns is called “high-dimensional data,” and it's an increasingly common situation in our data-driven world. Although techniques like “feature screening” and “multi-dimensional scaling” address this issue algorithmically, we came up with our own approach. We called it “the bracket,” or “March Madness.” Read our report if you’re curious.
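To give a sense of what feature screening can look like, here is a minimal Python sketch that ranks columns by the absolute value of their correlation with the closed/open label and keeps the top few. It is one simple variant of screening, not the “bracket” approach described in the report, and the column names and values are invented.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists; 0.0 if either is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def screen_features(columns, target, keep):
    """Return the names of the `keep` columns most correlated with target."""
    scores = {name: abs(pearson(vals, target)) for name, vals in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:keep]

# Hypothetical columns for four schools; closed is 1 for a closed school.
columns = {
    "attendance_rate": [95.0, 94.0, 88.0, 87.0],
    "constant_field": [1.0, 1.0, 1.0, 1.0],
    "noise": [3.0, 1.0, 3.0, 1.0],
}
closed = [0, 0, 1, 1]

top = screen_features(columns, closed, keep=1)
print(top)
```

Screening one column at a time is fast enough for thousands of attributes, but it can miss variables that only matter in combination with others.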

5) Consider the source

Wrangling the state dataset was one of the most difficult aspects of this data mining project, so in hindsight we could have considered using other sources or mashing several together. A few possibilities for our project, which use various data sources and employ different levels of data analysis:


So what, finally, did logistic regression, neural networks and classification trees say about the CPS school closures?

Our tests largely found attendance rate – “the aggregate days of student attendance, divided by the sum of the aggregate days of student attendance and aggregate days of student absence” – to be the best at predicting whether a school would close.
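The quoted definition translates directly into a small calculation. The function below is a sketch of that formula, and the day counts in the example are made up.

```python
def attendance_rate(days_attended, days_absent):
    """Aggregate days attended divided by aggregate days attended plus absent."""
    total = days_attended + days_absent
    if total == 0:
        raise ValueError("no attendance or absence days recorded")
    return days_attended / total

# Hypothetical school: 17,100 aggregate days attended and 900 days absent.
rate = attendance_rate(17_100, 900)
print(f"{rate:.1%}")
```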

That result supports CPS's claim that schools were closed based on its own utilization metric. However, as Ramsin Canon writes, just because a decision isn’t based on race doesn’t necessarily mean its effects aren’t discriminatory.
