Fact checking Chicago Public Schools using algorithms, statistics and data mining


Some students take it easy for the spring semester of their senior year; I loaded up on Introduction to Algorithms and Statistical Methods for Data Mining. The stats class covered theoretical foundations for data mining techniques like logistic regression and neural networks and finished with an open-ended group project assignment.

As it happened, the class coincided with Chicago Public Schools' decision to close 49 schools. The move drew ferocious criticism from community groups (including the Chicago Teachers Union), which claimed CPS unfairly selected the schools to be closed. Protesters and internet memes accused the district of racism.

Those accusations turned out to be the perfect fact-checking project for my algorithms and stats class. Classmates Jim Garrison, Jaya Sah and I used three data mining techniques — logistic regression, neural networks and classification trees — to determine whether racial demographics predicted the school closings.

Feel free to read our 15-page report or look at the slides from our in-class presentation, but if you don't have the time, read on for five data observations I made during the project.

1) Get the basics

Before we opened SPSS to apply our data mining methods to the question of whether race predicted school closings, we ran some basic averages to get an initial read on the protesters’ claims. Indeed, the Chicago school closures disproportionately affect black students, who make up 40% of the Chicago student population but 90% of the student body in closed schools.
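
For illustration, here’s a minimal sketch of that kind of sanity check in Python with pandas — not our actual workflow, and the file name and the `enrollment`, `pct_black` and `closed` columns are hypothetical stand-ins for the real report card fields:

```python
# Hedged sketch (not our actual workflow): enrollment-weighted share of
# black students, district-wide vs. in the schools slated to close.
# File and column names are hypothetical.
import pandas as pd

schools = pd.read_csv("report_card_2012.csv")

def black_share(df):
    """Enrollment-weighted percentage of black students."""
    return (df["enrollment"] * df["pct_black"] / 100).sum() / df["enrollment"].sum() * 100

print(f"All CPS schools: {black_share(schools):.0f}%")
print(f"Closed schools:  {black_share(schools[schools['closed'] == 1]):.0f}%")
```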

We used the Illinois State Report Card data throughout our analysis; the report cards are released annually and include all kinds of statistics about campus demographics, test scores and resources. But it was a lot of data. Wrangling the 9,655 columns of data for each school created problems for our analysis, and I spent the majority of my time reducing that complexity. Luckily I was working at the Knight Lab with Joe Germuska, who is very familiar with the Illinois data thanks to his work on the Chicago Tribune report cards news app.

2) CSVkit rules

Thanks to Joe, I could use a schema to convert the 225-megabyte, semicolon-delimited 2012 report card file to a CSV. SPSS still couldn’t import 9,655 data attributes at once, and understanding what was in the dataset was a struggle.

CSVkit was a godsend. Chris Groskopf’s tool allowed us to examine and splice the data just how we wanted before importing it to SPSS. (More on our data cleaning methods below.) CSVkit gave us a smooth workflow for testing CSVs with different sets of attributes, so we only stayed in the library until 2 a.m., not 5 a.m.
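
CSVkit is a set of command-line tools, so none of this required writing code. For readers who prefer a script, here is roughly the same delimiter conversion and column slicing in plain Python — the file names and column names are hypothetical, and it assumes the export has a header row (the real file needed Joe’s schema to name the columns):

```python
# Rough Python equivalent of what csvkit did for us on the command line:
# read the semicolon-delimited report card export and write a standard CSV
# containing only the columns we cared about. Names are hypothetical.
import csv

KEEP = ["SCHOOL_NAME", "PCT_BLACK", "ATTENDANCE_RATE"]

with open("rc12.txt", newline="", encoding="latin-1") as src, \
     open("rc12_slim.csv", "w", newline="") as dst:
    reader = csv.DictReader(src, delimiter=";")
    writer = csv.DictWriter(dst, fieldnames=KEEP)
    writer.writeheader()
    for row in reader:
        writer.writerow({col: row[col] for col in KEEP})
```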


3) Have a dirty data plan

(Figure: neural network model of Chicago Public Schools data.)

Being aware of and dealing with “dirty data” – information that is inaccurate or incomplete – is vital for any data-driven project, but it was especially crucial to our data mining techniques. Neural networks and logistic regression fail with missing data, so we had to clean our dataset.

We observed that the report cards included fields for high school graduation and high school test scores, even though the majority of the closed schools were elementary schools to which those fields did not apply. We used CSVkit to exclude them. For other fields with missing values, we could also have used SPSS routines that substitute the column average for the missing entries.
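
In pandas terms, the two strategies look roughly like this — our actual cleaning used CSVkit and SPSS, and the file and column names here are hypothetical:

```python
# Sketch of the two cleaning strategies, using pandas rather than CSVkit/SPSS.
# File and column names are hypothetical.
import pandas as pd

schools = pd.read_csv("rc12_slim.csv")

# 1) Drop fields that only apply to high schools and are blank for the
#    elementary schools that make up most of the closure list.
hs_only = [col for col in schools.columns if col.startswith("HS_")]
schools = schools.drop(columns=hs_only)

# 2) Mean-substitute any remaining gaps in numeric fields -- the same idea
#    as imputing the series mean in SPSS.
numeric = schools.select_dtypes(include="number").columns
schools[numeric] = schools[numeric].fillna(schools[numeric].mean())
```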

4) Another dimension, another dimension

The thousands of data attributes for each school also posed a challenge for SPSS and prevented our data mining methods from drawing conclusions. A dataset with that many columns is called “high-dimensional,” an increasingly common situation in our data-driven world. Although techniques like “feature screening” and “multi-dimensional scaling” address this issue algorithmically, we came up with our own approach. We called it “the bracket,” or “March Madness.” Read our report if you’re curious.
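
The bracket is described in the report; for comparison, here is what a conventional feature-screening step looks like with scikit-learn. The cleaned input file and the `closed` column are hypothetical:

```python
# A conventional alternative to our "bracket": univariate feature screening
# with scikit-learn. The input file and "closed" column are hypothetical.
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

schools = pd.read_csv("rc12_clean.csv")
y = schools["closed"]                                   # 1 if CPS closed the school
X = schools.drop(columns=["closed"]).select_dtypes(include="number")

# Keep the 20 attributes with the strongest univariate association with closure.
selector = SelectKBest(score_func=f_classif, k=20)
X_reduced = selector.fit_transform(X, y)
print(list(X.columns[selector.get_support()]))
```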

5) Consider the source

Wrangling the state dataset was one of the most difficult aspects of this data mining project, so we could have considered using other sources or mashing several. A few possibilities for our project, which use various data sources and employ different levels of data analysis:


So what, finally, did logistic regressions, neural networks and classification trees say about the CPS school closures?

Our tests largely found attendance rate – “the aggregate days of student attendance, divided by the sum of the aggregate days of student attendance and aggregate days of student absence” – to be the best at predicting whether a school would close.
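
Our models were built in SPSS, but for readers who want to poke at the question themselves, a logistic regression along these lines can be sketched in a few lines of scikit-learn. The file and column names are hypothetical:

```python
# Illustrative logistic regression, not our SPSS models: standardize a few
# candidate predictors and compare coefficient magnitudes.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

schools = pd.read_csv("rc12_clean.csv")
features = ["ATTENDANCE_RATE", "PCT_BLACK", "PCT_LOW_INCOME"]

X = StandardScaler().fit_transform(schools[features])
y = schools["closed"]

model = LogisticRegression().fit(X, y)
for name, coef in zip(features, model.coef_[0]):
    print(f"{name}: {coef:+.2f}")  # larger magnitude = stronger standardized predictor
```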

This supports CPS claims that schools were closed based on its own utilization metric. However, as Ramsin Canon writes, just because a decision isn’t based on race doesn’t necessarily mean its effects aren’t discriminatory.

About the author

Dan Keemahill

Undergraduate Fellow
