A Northwestern University joint initiative of Medill School of Journalism, Media, Integrated Marketing Communications and the Robert R. McCormick School of Engineering & Applied Science.

Questions and consequences when publishing public data

Over the past few months something unusual has happened to public data projects: they’ve made national headlines.

For journalists, the best-known project was the gun permit holder map that the Journal News in White Plains, New York, published late last year, featuring the names and addresses of all registered gun owners in two New York counties.

The map was controversial and inspired journalists and journalism pundits to weigh in on the project’s virtues and faults before it was ultimately removed late last month.

The controversy — especially in light of recent and proposed legislation — got us thinking about how newsroom developers should best handle public data. What solutions are best suited to deal with data that is potentially invasive? Are there differences when dealing with data online versus in print? And what repercussions might news organizations face following controversial publishing of public data?

Questions that come with data

At its core, publishing data requires editorial judgment not all that different from the judgment journalists have honed in print over the past few decades.

“Every data set is like a human source,” said Derek Willis, an interactive news developer at The New York Times. “You weigh whether to publish the information you get from it, in what context and to what end.”

Still, there are some unique questions that come with data and digital distribution.

Of course there’s the issue of permanence. Stories and data last much longer online than they do in print and have the potential to follow the people mentioned in the data for years to come with potentially negative effects.

There’s also the issue of accuracy. Rich Gordon, a Knight Lab co-founder and former digital director at the Miami Herald, recalls working on projects for the print edition of the Herald years ago in which every line of data printed was double-checked by a person.

Online, Gordon contends, there’s a greater tendency to present all data of a particular set. That tendency allows for much more depth, but the volume makes it difficult to double-check for accuracy.

It also represents a shift from looking for stories within data to data being the story. That shift isn’t necessarily problematic, but it does make journalists less likely to find mistakes or inaccuracies in the data.

“Because we spent a lot of time with the data in search of the news before we published, we were more likely to find trouble with the data,” Gordon said.

In fact, data accuracy was one of the reasons the Journal News’ publisher cited for taking the map down, according to the publisher’s note that announced the removal. It also appears to be one of the reasons the Roanoke Times removed a similar database shortly after introducing it in 2007, according to a note from the publisher of that paper.

More challenging than mere accuracy with large data sets is that what’s accurate one day might be inaccurate the next, which was again a factor in the Journal News’ decision to pull down the map.

The mug shot data dilemma

The changing nature of accuracy was one of the key concerns the New Products Development Team at the Tampa Bay Times faced when developing a mug shot site back in 2009, said Matt Waite, who was part of the team and today is a journalism professor at the University of Nebraska.

“You have to ask yourself, how long is your data valid,” Waite said, “how long is it good.”

Waite and his colleagues didn’t have a reliable answer to that question when it came to arrest records. Though the record of each arrest was unlikely to change, it seemed unfair to publish an arrest record and then neglect to follow the case through the court system simply because it was a challenge they couldn’t handle programmatically, he said.

Out of a desire to avoid building a “background check tool” for the Tampa area, the team decided to take some steps to protect the privacy of those arrested.

Waite and his team told web crawlers not to scrape the mug shot pages. They did it twice, in fact: first in the site’s robots.txt file and again in the HTML of the individual pages. Then they came up with a way to house the names of arrestees in the JavaScript for each page, a non-standard way to handle names and one not likely to be picked up by bots, Waite said.
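For developers who want to see what those two crawler instructions might look like, here is a minimal sketch. It is not the Tampa Bay Times’ actual configuration: the `/mugshots/` path, the example.com URLs and the function name are assumptions made for illustration. Python’s standard `urllib.robotparser` module is used only to check how a crawler that honors robots.txt would read the rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules; the /mugshots/ path is an assumption.
ROBOTS_TXT = """\
User-agent: *
Disallow: /mugshots/
"""

# Second layer: a robots meta tag placed in the <head> of each individual page.
NOINDEX_META = '<meta name="robots" content="noindex, nofollow">'


def crawler_may_fetch(url: str, user_agent: str = "*") -> bool:
    """Return True if a crawler that honors robots.txt may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Mug shot pages are off limits; the rest of the site is not.
    print(crawler_may_fetch("https://example.com/mugshots/12345/"))
    print(crawler_may_fetch("https://example.com/news/"))
```

Both instructions are requests, not locks. Well-behaved crawlers follow them, but a scraper that ignores the convention can still fetch the pages, which is likely part of why the team also moved the names into JavaScript.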

As a final protection both against publishing inaccurate data and against creating an undue burden on those arrested, all photos are deleted from the site after 60 days.
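A retention rule like that is simple to express in code. The sketch below assumes each record is a dictionary with a `booked_on` date, which is a hypothetical field name; only the 60-day window comes from the team’s description, and how the site actually purges photos isn’t specified.

```python
from datetime import date, timedelta

RETENTION_DAYS = 60  # the window the team describes


def unexpired(records, today):
    """Keep only records booked within the last RETENTION_DAYS days.

    A record exactly RETENTION_DAYS old is still kept; anything older is dropped.
    """
    cutoff = today - timedelta(days=RETENTION_DAYS)
    return [record for record in records if record["booked_on"] >= cutoff]


if __name__ == "__main__":
    sample = [
        {"id": 1, "booked_on": date(2013, 2, 1)},
        {"id": 2, "booked_on": date(2012, 12, 1)},
    ]
    # Only the recent booking survives the purge.
    print(unexpired(sample, today=date(2013, 2, 20)))
```

Run on a schedule, a filter like this means stale records leave the site without anyone having to remember to remove them.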

Waite’s team demonstrates that the privacy risks public data poses to private citizens are a challenge, but not an insurmountable one. Creativity allows you to illustrate a story without trampling on privacy.

“You have options as a developer,” said Ben Welsh, a database producer at the Los Angeles Times.

For example, the Memphis Commercial Appeal publishes a database of handgun carry permit holders, including full name, city, and zip code. The information provided is not all that different from what the Journal News presented.

The difference is that the database is searchable and doesn’t ever appear in one piece. It works well and searches return broad results. When I enter “Graff” in the last name field, for example, it returns not only exact matches, but also Pendergraff and DeGraffenreaid.
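The search behavior described above resembles a case-insensitive substring match on last name. Here is a rough sketch of that idea, not the Commercial Appeal’s actual code: the `last_name` field, the function name and the three-character minimum query length are all assumptions. The minimum is there so that an empty or one-letter search can’t return the whole database in one piece.

```python
def search_last_names(records, query, min_length=3):
    """Return records whose last name contains the query, ignoring case.

    Queries shorter than min_length return nothing, so the full data set
    cannot be pulled with an empty or one-letter search.
    """
    needle = query.strip().lower()
    if len(needle) < min_length:
        return []
    return [record for record in records if needle in record["last_name"].lower()]


if __name__ == "__main__":
    permits = [
        {"last_name": "Graff"},
        {"last_name": "Pendergraff"},
        {"last_name": "DeGraffenreaid"},
        {"last_name": "Smith"},
    ]
    for match in search_last_names(permits, "Graff"):
        print(match["last_name"])
```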

Another potential solution is to carefully choose what data to publish, which is exactly what the Commercial Appeal did with its decision to publish zip codes, but not addresses.
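In code, that kind of choice often comes down to an explicit list of fields that are allowed out the door. The sketch below is only an illustration: the field names and the sample record are hypothetical, though the published fields (name, city, zip code) mirror the ones the Commercial Appeal shows.

```python
# Fields cleared for publication; anything not listed here is dropped.
PUBLIC_FIELDS = ("name", "city", "zip_code")


def publishable(record):
    """Return a copy of the record containing only the approved fields."""
    return {field: record[field] for field in PUBLIC_FIELDS if field in record}


if __name__ == "__main__":
    raw = {
        "name": "Jane Doe",  # made-up sample record
        "street_address": "123 Example St.",
        "city": "Memphis",
        "zip_code": "38103",
    }
    # The street address never reaches the published record.
    print(publishable(raw))
```

Listing what may be published, rather than what must be hidden, means a new column added to the source data stays private until someone decides otherwise.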

And therein lies the challenge — “threading the needle,” as Welsh said. The idea is to provide enough information to be useful, but not so much that you’re invading the privacy of ordinary citizens.

Consequences?

The consequences for journalists and others in the news industry who publish data in ways the public deems reckless are real. Lawmakers in New York passed legislation soon after the Journal News’ map that allowed permit holders to request confidentiality. Just last week, a group of lawmakers in Maine tried to pressure the Bangor Daily News into withdrawing a request for concealed weapon permits.

Also last week a Florida lawmaker introduced a bill that would require all websites to remove mug shots within 15 days of being notified that an arrest did not result in conviction. The bill was reportedly inspired by a so-called extortion mug shot site, but makes no distinction between those sites and traditional news sites.

“If news organizations want to separate themselves from the mug shot racket they need to be conscientious about how they handle public data,” Waite said.

It’s a good lesson, and one that journalists can heed with some creativity and, perhaps, restraint.

The real key in publishing data, as in other journalism, is to add context and nuance to it.

“To republish something without any insight or analysis is a low form of journalism,” Welsh said. “With data, as with all things in journalism, we should strive not to be stenographers.”

Posted on February 20, 2013 by Ryan Graff, Communications

News nerd ecstatic for the future of news. Formerly a Colorado-based reporter and magazine writer. Presently the Lab’s editor, and handler of marketing and outreach.