A Northwestern University joint initiative of the Medill School of Journalism, Media, Integrated Marketing Communications and the Robert R. McCormick School of Engineering & Applied Science.

Fact checking Chicago Public Schools using algorithms, statistics and data mining

Some students take it easy for the spring semester of their senior year; I loaded up on Introduction to Algorithms and Statistical Methods for Data Mining. The stats class covered theoretical foundations for data mining techniques like logistic regression and neural networks and finished with an open-ended group project assignment.

As it happened, the class coincided with Chicago Public Schools’ decision to close 49 schools. The move drew ferocious criticism from community groups (including the Chicago Teachers Union), which claimed CPS unfairly selected the schools to be closed. Protesters and internet memes accused the district of racism.

Those accusations turned out to be the perfect fact checking project for my algorithms and stats classes. Classmates Jim Garrison, Jaya Sah and I used three data mining techniques — logistic regression, neural networks and classification trees — to determine whether racial demographics predicted school closings.

Feel free to read our 15-page report or look at the slides from our in-class presentation, but if you don’t have the time, read on for five data observations I made during the project.

1) Get the basics

Before we opened SPSS to apply our data mining methods and determine whether race was a predictor of school closings, we ran some basic averages to get an initial evaluation of the protesters’ claims. Indeed, the Chicago school closures disproportionately affect black students, who make up 40% of the Chicago student population but 90% of the student body in closed schools.
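A check like that needs nothing fancier than an enrollment-weighted average. Here is a minimal sketch of the arithmetic; the rows below are made up for illustration, not the actual CPS figures:

```python
# Sanity check: what share of students in the closed schools are black,
# versus the district overall? Rows are illustrative, not real CPS data.
schools = [
    # (enrollment, pct_black, closed)
    (500, 95.0, True),
    (400, 88.0, True),
    (800, 30.0, False),
    (600, 15.0, False),
]

def pct_black(rows):
    """Enrollment-weighted percentage of black students across rows."""
    total = sum(n for n, _, _ in rows)
    black = sum(n * p / 100 for n, p, _ in rows)
    return 100 * black / total

overall = pct_black(schools)
closed = pct_black([r for r in schools if r[2]])
print(f"district: {overall:.1f}%  closed schools: {closed:.1f}%")
```

The gap between the two numbers is the protesters’ claim in its simplest form; the data mining methods below ask whether race still predicts closure once other variables are in the model.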

We used the Illinois State Report Card throughout our analysis; it is released annually and includes all kinds of statistics about campus demographics, test scores and resources. But it was a lot of data. Wrangling the 9,655 columns of data for each school created problems for our analysis, and I spent the majority of my time reducing its complexity. Luckily I was working at the Knight Lab with Joe Germuska, who is very familiar with the Illinois data thanks to his work on the Chicago Tribune report cards news app.

2) CSVkit rules

Thanks to Joe I could use a schema to convert the 225 megabyte, semicolon-delimited 2012 report card file to a CSV. SPSS still couldn’t import 9,655 data attributes at once, and understanding what was in the dataset was a struggle.
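Stripped to its essentials, that conversion is just a streaming rewrite of the file with a different delimiter, which keeps memory flat even on a ~225 MB input. A minimal Python sketch (filenames are hypothetical; in practice csvkit’s in2csv with a delimiter flag does this in one shell command):

```python
import csv

def convert_delimiter(src, dst, src_delim=";"):
    """Stream a semicolon-delimited file into a standard comma-delimited CSV.
    Processing row by row avoids loading the whole file into memory."""
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        reader = csv.reader(fin, delimiter=src_delim)
        writer = csv.writer(fout)
        for row in reader:
            writer.writerow(row)
```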

CSVkit was a godsend. Chris Groskopf’s tool let us examine and splice the data exactly how we wanted before importing it into SPSS. (More on our data cleaning methods below.) CSVkit gave us a smooth workflow for testing CSVs with different attribute sets, so we only stayed in the library until 2 a.m., not 5 a.m.
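The operation we leaned on most was column selection, which csvkit’s csvcut handles from the shell. The same idea in plain Python, as a sketch with hypothetical column names:

```python
import csv

def cut_columns(src, dst, keep):
    """Write a copy of src containing only the columns named in `keep`,
    much like selecting columns with csvkit's csvcut."""
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        reader = csv.DictReader(fin)
        writer = csv.DictWriter(fout, fieldnames=keep)
        writer.writeheader()
        for row in reader:
            # Drop every field not in `keep` before writing the row out.
            writer.writerow({k: row[k] for k in keep})
```

Cutting a 9,655-column file down to a few dozen attributes this way is what made repeated SPSS import attempts tolerable.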

3) Have a dirty data plan
Neural network of Chicago Public Schools.

Being aware of and dealing with “dirty data” – information that is inaccurate or incomplete – is vital for any data-driven project, but it was especially crucial to our data mining techniques. Neural networks and logistic regression fail with missing data, so we had to clean our dataset.

We observed that the report cards included fields for high school graduation rates and high school test scores, even though most of our closed schools were elementary schools to which those fields did not apply. We used CSVkit to exclude them. For other fields with missing values, we could instead have used SPSS procedures that substitute the column mean for the missing entries.
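Mean substitution is simple enough to sketch directly. This is an illustrative stand-in for the SPSS option, with made-up field names, treating None as a missing value:

```python
def impute_means(rows):
    """Replace missing (None) numeric fields with that column's mean,
    mirroring mean-substitution options in stats packages like SPSS.
    rows: list of dicts sharing the same keys."""
    cols = rows[0].keys()
    means = {}
    for c in cols:
        vals = [r[c] for r in rows if r[c] is not None]
        means[c] = sum(vals) / len(vals) if vals else None
    return [
        {c: (r[c] if r[c] is not None else means[c]) for c in cols}
        for r in rows
    ]

# Hypothetical report-card fragment with two holes.
data = [
    {"attendance": 92.0, "mobility": 15.0},
    {"attendance": None, "mobility": 25.0},
    {"attendance": 88.0, "mobility": None},
]
```

Mean substitution keeps every school in the sample, at the cost of shrinking the apparent variance of the imputed columns — a trade-off worth noting in any writeup.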

4) Another dimension, another dimension

The thousands of data attributes for each school also posed a challenge for SPSS and prevented our data mining methods from drawing conclusions. A dataset with a very large number of columns is called “high-dimensional data,” and it’s an increasingly common situation in our data-driven world. Although techniques like “feature screening” and “multi-dimensional scaling” algorithmically address this issue, we came up with our own approach. We called it “the bracket,” or “March Madness.” Read our report if you’re curious.
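Our bracket isn’t reproduced here (it’s in the report), but to make “feature screening” concrete, here is one standard variant: drop near-constant columns, since a column that barely varies can’t help a classifier tell schools apart. Column names and the threshold are illustrative:

```python
def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def screen_features(columns, threshold=1e-3):
    """Variance screening: keep only columns whose variance exceeds
    the threshold. columns: dict of name -> list of numeric values."""
    return {
        name: vals
        for name, vals in columns.items()
        if variance(vals) > threshold
    }

cols = {
    "attendance": [92.1, 88.4, 95.0, 76.2],
    "is_public": [1.0, 1.0, 1.0, 1.0],   # constant column -> screened out
}
```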

5) Consider the source

Wrangling the state dataset was one of the most difficult aspects of this data mining project, so we could have considered using other sources, or mashing up several, each offering different data and different levels of built-in analysis.

So what, finally, did logistic regressions, neural networks and classification trees say about the CPS school closures?

Our tests largely found attendance rate – “the aggregate days of student attendance, divided by the sum of the aggregate days of student attendance and aggregate days of student absence” – to be the best at predicting whether a school would close.
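To show the mechanics of one of these tests, here is a one-feature logistic regression fit by gradient descent. The attendance numbers are invented and a package like SPSS or scikit-learn would normally do the fitting; this sketch just makes the relationship visible:

```python
import math

def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """One-feature logistic regression fit by gradient descent.
    xs: feature values (e.g. attendance rate as a fraction),
    ys: labels (1 = closed, 0 = stayed open)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = 1 / (1 + math.exp(-(w * x + b)))  # predicted P(closed)
            w -= lr * (p - y) * x / n
            b -= lr * (p - y) / n
    return w, b

# Toy data: schools with lower attendance were more likely to close.
attendance = [0.97, 0.95, 0.93, 0.90, 0.86, 0.82]
closed     = [0,    0,    0,    1,    1,    1]
w, b = fit_logistic(attendance, closed)

def predict(x):
    """P(closed) for a school with attendance rate x."""
    return 1 / (1 + math.exp(-(w * x + b)))
```

On data like this the fitted weight on attendance comes out negative — lower attendance, higher predicted probability of closure — which is the shape of the relationship our tests found in the real report-card data.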

This supports CPS claims that schools were closed based on its own utilization metric. However, as Ramsin Canon writes, just because a decision isn’t based on race doesn’t necessarily mean its effects aren’t discriminatory.

Like what you see?

Northwestern University Knight Lab advances news media innovation through exploration, experimentation and education. The Lab's free publishing tools help to make information more meaningful and promote quality storytelling on the Internet.

About the author

Posted on July 19, 2013 by

Dan Hill

Developer, Journalism (2013)
Journalism and computer science student. Organizer for the Online News Association at NU. Likes investigative reporting and bluegrass jams.
  • Jeanne Marie Olson

    Unfortunately, my comment of yesterday has disappeared, but let me try to recreate it.

    This is a wonderfully detailed analysis and I appreciate so much the detail as it will allow others a chance to “peek behind the curtain” of how to crunch such data.

    However, it raises some important questions for creators of Civic Apps and volunteers in Open Gov:

    WHAT IF the data you have to analyze isn’t the data you need in order to fully understand the story?

    WHAT IF the most important variables that should be used to determine influence or cause/effect aren’t available in public data?

    The school cuts data is a very important example of this.

    CPSApples2Apples took a look at many of these same variables around school closings (did not address it as finely as this, kudos to the Knight Lab!) But we also went into the field to determine what data was missing.

    One of the most important data sets that we were unable to obtain reflected one of the most interesting trends that we began to see: many of the schools on the final list were the locations of the CPS Special Education cluster programs for the District. These were schools with a higher than normal number of self-contained special education classrooms. These self-contained classrooms are legally and logistically held to significantly smaller class sizes than the District’s “ideal” average class size of 30 students. Many of them had class sizes between 6 and 13 students, depending upon the complexity of the student’s Individualized Education Plan. And yet, the CPS Utilization Formula determined that their capacity should still be 30 students. Thus, many of these schools had little to no control over being determined as under-utilized. Students eligible for these special education cluster programs were still being transferred by the District into at least one of the now-closed schools up until the end of the 2012-2013 school year.

    I was able to obtain a detailed “real use” audit of one school–Trumbull–from a member of the LSC, and illustrated the dramatic difference it made to the school’s utilization rating for WTTW. (I’ll post the link in a different comment, in case a link is what prevented my earlier comment from being posted.)

    How do the complex physical and mental health needs of the students in these special education cluster programs affect attendance scores and, therefore, this analysis? I’m not sure. I would need the data.

    But it is important to identify where data is missing along with analyzing the data we have in order to reach for as robust an analysis as possible.

    P.S. Huge fan of Knight Lab at NU! Love the program, love the faculty, love the students. Thanks for posting this important work out there for others to learn from!