Fact checking Chicago Public Schools using algorithms, statistics and data mining


Some students take it easy for the spring semester of their senior year; I loaded up on Introduction to Algorithms and Statistical Methods for Data Mining. The stats class covered theoretical foundations for data mining techniques like logistic regression and neural networks and finished with an open-ended group project assignment.

As it happened, the class coincided with Chicago Public Schools' decision to close 49 schools. The move drew ferocious criticism from community groups (including the Chicago Teachers Union), which claimed CPS unfairly selected the schools to be closed. Protesters and internet memes accused the district of racism.

Those accusations turned out to be the perfect fact-checking project for my algorithms and stats class. Classmates Jim Garrison, Jaya Sah and I used three data mining techniques (logistic regression, neural networks and classification trees) to determine whether racial demographics predicted school closings.

Feel free to read our 15-page report or look at the slides from our in-class presentation, but if you don't have the time, read on for five observations about data that I made during the project.

1) Get the basics

Before we opened SPSS to test whether race predicted school closings, we ran some basic averages for an initial check of the protesters’ claims. Indeed, the Chicago school closures disproportionately affect black students, who make up 40% of the Chicago student population but 90% of the student body in closed schools.
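As a rough illustration of that first pass, here is a minimal Python sketch that compares the enrollment-weighted share of black students district-wide with the share in closed schools. The field names (`enrollment`, `pct_black`, `closed`) and the three sample rows are hypothetical; they are not the report card's actual columns or figures.

```python
def black_share(schools):
    """Enrollment-weighted percentage of black students across the given schools."""
    total = sum(s["enrollment"] for s in schools)
    black = sum(s["enrollment"] * s["pct_black"] / 100 for s in schools)
    return 100 * black / total

# Hypothetical rows; real values would come from the report card data.
schools = [
    {"name": "A", "enrollment": 300, "pct_black": 95.0, "closed": True},
    {"name": "B", "enrollment": 500, "pct_black": 20.0, "closed": False},
    {"name": "C", "enrollment": 200, "pct_black": 50.0, "closed": False},
]

district = black_share(schools)
closed = black_share([s for s in schools if s["closed"]])
print(f"district: {district:.1f}%  closed schools: {closed:.1f}%")
```

Weighting by enrollment matters here: a plain average of per-school percentages would let a tiny school count as much as a large one.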

We used the Illinois State Report Cards throughout our analysis. The dataset is released annually and includes all kinds of statistics about campus demographics, test scores, and resources. But it was a lot of data. Wrangling the 9,655 columns of data for each school created problems for our analysis, and I spent the majority of my time reducing its complexity. Luckily I was working at the Knight Lab with Joe Germuska, who is very familiar with the Illinois data thanks to his work with the Chicago Tribune report cards news app.

2) CSVkit rules

Thanks to Joe, I could use a schema to convert the 225-megabyte, semicolon-delimited 2012 report card file to a CSV. But SPSS still couldn’t import 9,655 data attributes at once, and understanding what was in the dataset was a struggle.
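For readers who want to see what such a conversion involves, here is a small sketch using Python's standard `csv` module. It is not the process we actually ran. It assumes the raw file has no header row and that the schema supplies the column names; the sample rows and the names `school_id`, `school_name` and `attendance_rate` are made up for illustration.

```python
import csv
import io

def convert(src, dst, column_names):
    """Copy semicolon-delimited rows from src to dst as comma-delimited CSV with a header."""
    reader = csv.reader(src, delimiter=";")
    writer = csv.writer(dst, lineterminator="\n")
    writer.writerow(column_names)  # header taken from the schema
    count = 0
    for row in reader:
        writer.writerow(row)  # fields containing commas are quoted automatically
        count += 1
    return count

# Hypothetical two-school sample standing in for the real report card file.
raw = io.StringIO("0001;Example Elementary;93.5\n0002;Sample, School;91.0\n")
out = io.StringIO()
n = convert(raw, out, ["school_id", "school_name", "attendance_rate"])
print(out.getvalue())
```

With real files you would pass handles opened with `open(path, newline="")` instead of the in-memory `StringIO` objects.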

CSVkit was a godsend. Chris Groskopf’s tool allowed us to examine and slice the data just how we wanted before importing it to SPSS. (More on our data cleaning methods below.) CSVkit gave us a smooth workflow for testing CSVs with different attributes, so we only stayed in the library until 2 a.m., not 5 a.m.


3) Have a dirty data plan

Neural network of Chicago Public Schools.

Being aware of and dealing with “dirty data” – information that is inaccurate or incomplete – is vital for any data-driven project, but it was especially crucial to our data mining techniques. Neural networks and logistic regression fail with missing data, so we had to clean our dataset.

We observed that the report cards included fields for high school graduation and high school test scores, even though the majority of our closed schools were elementary schools for which those did not apply. We used CSVkit to exclude those fields. For other fields with missing data, we could also have used SPSS procedures that fill in missing values with computed averages.
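Both cleaning steps can be sketched in a few lines of Python. This is an illustration only, not our CSVkit or SPSS workflow: the `hs_` prefix, the field names and the sample rows are hypothetical, and mean imputation is just one simple way to fill gaps.

```python
def drop_columns(rows, prefixes):
    """Remove every field whose name starts with one of the given prefixes."""
    return [
        {k: v for k, v in row.items() if not k.startswith(tuple(prefixes))}
        for row in rows
    ]

def impute_mean(rows, field):
    """Replace missing (None) values of a field with the mean of the observed values."""
    observed = [row[field] for row in rows if row[field] is not None]
    mean = sum(observed) / len(observed)  # assumes at least one observed value
    return [
        {**row, field: mean if row[field] is None else row[field]}
        for row in rows
    ]

# Hypothetical elementary schools: high school fields are empty by design.
rows = [
    {"school": "A", "hs_grad_rate": None, "attendance": 94.0},
    {"school": "B", "hs_grad_rate": None, "attendance": None},
    {"school": "C", "hs_grad_rate": None, "attendance": 90.0},
]
cleaned = impute_mean(drop_columns(rows, ["hs_"]), "attendance")
print(cleaned[1])
```

Dropping a column suits fields that do not apply at all, while imputation suits fields that apply but are occasionally blank.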

4) Another dimension, another dimension

The thousands of data attributes for each school also posed a challenge for SPSS and prevented our data mining methods from drawing conclusions. Having lots of columns in your CSV is known as “high-dimensional data,” and it's an increasingly common situation in our data-driven world. Although techniques like “feature screening” and “multi-dimensional scaling” algorithmically address this issue, we came up with our own approach. We called it “the bracket,” or “March Madness.” Read our report if you’re curious.
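To give a flavor of what feature screening means, here is a generic sketch that ranks columns by the absolute value of their correlation with the closed/open outcome and keeps the top few. To be clear, this is not our “bracket” method (that is described only in the report), and the feature names and numbers are invented for the example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)

def screen(features, target, keep):
    """Return the `keep` feature names most correlated (in absolute value) with target."""
    ranked = sorted(
        features,
        key=lambda name: abs(pearson(features[name], target)),
        reverse=True,
    )
    return ranked[:keep]

# Hypothetical data: 1 means the school closed, 0 means it stayed open.
closed = [1, 1, 0, 0, 0]
features = {
    "attendance": [88.0, 89.0, 95.0, 96.0, 94.0],
    "enrollment": [300, 500, 320, 480, 400],
    "constant": [1, 1, 1, 1, 1],
}
print(screen(features, closed, keep=1))
```

Screening each column on its own is fast, but it can miss attributes that only matter in combination, which is one reason other approaches exist.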

5) Consider the source

Wrangling the state dataset was one of the most difficult aspects of this data mining project, so we could have considered using other sources or mashing several together. A few possibilities for our project, which use various data sources and employ different levels of data analysis:


So what, finally, did logistic regressions, neural networks and classification trees say about the CPS school closures?

Our tests largely found attendance rate – “the aggregate days of student attendance, divided by the sum of the aggregate days of student attendance and aggregate days of student absence” – to be the best at predicting whether a school would close.
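That definition translates directly into a one-line calculation. The sketch below returns the rate as a fraction; the function name and the sample day counts are made up for illustration.

```python
def attendance_rate(days_attended, days_absent):
    """Aggregate days attended divided by aggregate days attended plus aggregate days absent."""
    return days_attended / (days_attended + days_absent)

# Hypothetical school: 52,200 student-days attended, 1,800 student-days absent.
rate = attendance_rate(days_attended=52_200, days_absent=1_800)
print(f"{rate:.1%}")
```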

This supports CPS claims that schools were closed based on its own utilization metric. However, as Ramsin Canon writes, just because a decision isn’t based on race doesn’t necessarily mean its effects aren’t discriminatory.
