Data journalism

This module addresses these ethical issues:

How can organizations guard the accuracy of data journalism?
How should we evaluate the sources of data?
What ethical dangers lurk in how data is presented?
Is there a “right to reply” to data?

Ethical considerations in data journalism are no different from those in any other area of journalism. But the issues most likely to come into play are those around accuracy, and around balancing the right to privacy against the public interest.

Accuracy is perhaps the central concern of journalists working with any form of data. Numbers, charts and maps possess an air of authority that other types of information often lack – and yet they are equally subject to manipulation.

Journalists need to be careful both about the credibility they give to numerical and graphical sources and about the way they present their own stories numerically and graphically.

As an increasing proportion of journalists’ sources involve data, “numeracy” becomes as important as literacy: confusing percentage increases with percentage point increases should be as shameful as spelling someone’s name wrong. We should be as concrete in our language regarding data as we are concrete in describing events and people, where there is no room for vagueness or confusion.
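The difference between a percentage increase and a percentage point increase can be shown with a few lines of arithmetic. The following is a minimal sketch in Python; the function name `describe_change` and the figures are invented for illustration.

```python
def describe_change(old_rate: float, new_rate: float) -> tuple[float, float]:
    """Compare two rates expressed in percent (e.g. 10.0 means 10%).

    Returns (percentage_point_change, relative_percent_change).
    """
    if old_rate == 0:
        raise ValueError("relative change is undefined when the old rate is 0")
    point_change = new_rate - old_rate
    relative_change = point_change * 100 / old_rate
    return point_change, relative_change


# Hypothetical figures: a rate rising from 10% to 12%.
points, relative = describe_change(10.0, 12.0)
print(f"Up {points:.0f} percentage points, which is a {relative:.0f}% increase")
```

The same change can honestly be reported either way, but the two figures are not interchangeable: a story should say which one it is using.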

Ethical journalists dealing with data ask the same questions of that data as they would of any source. What is the vested interest of the person giving me this? How has this information been collected, and what or who (or when or where) might be missing from it? How were the questions phrased, and what earlier questions were used to frame them? Can I find a second independent source of the same information, or a different interpretation? What is the margin of error? Do I have the knowledge to ask the right questions of all these sources? See the end of this piece for a wide range of literature for journalists to use to learn about the subject, covering everything from sports to politics.
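For the margin-of-error question, a rough check is possible when a source gives only a sample size. The sketch below uses the standard approximation for a proportion from a simple random sample at roughly 95% confidence. Real polls often use weighting or other designs that widen the margin, so treat the result as a lower bound. The function name `margin_of_error` and the poll figures are assumptions made for illustration.

```python
import math


def margin_of_error(sample_size: int, proportion: float = 0.5, z: float = 1.96) -> float:
    """Approximate margin of error, in percentage points, for a sample proportion.

    Assumes a simple random sample; z = 1.96 corresponds to roughly 95% confidence.
    proportion = 0.5 gives the widest (most cautious) margin.
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    standard_error = math.sqrt(proportion * (1 - proportion) / sample_size)
    return z * standard_error * 100


# Hypothetical poll: 1,000 respondents, two candidates on 48% and 46%.
moe = margin_of_error(1000)
lead = 48 - 46
print(f"Margin of error: +/- {moe:.1f} points")
print("Lead is within the margin of error" if lead <= moe else "Lead exceeds the margin of error")
```

Note that the margin on the gap between two figures from the same poll is larger than the margin on each figure alone, so a lead that falls inside the single-figure margin is certainly not a safe one.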

In presenting the data, context is key: big numbers alone tell us nothing about whether those numbers are higher or lower than they should be, going up or down, or the best or worst in the region, country or world. Presenting them within a historical context, or per person or per day, helps make the numbers more meaningful. Personalization, however, can present problems of its own: if you tell users how things affect them, ensure they also have a sense of the bigger picture.
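One way to put a big number in context is to convert it to a rate per head of population. This is a minimal sketch; the helper `rate_per` and the regional figures are invented for illustration.

```python
def rate_per(count: float, population: float, per: int = 100_000) -> float:
    """Express a raw count as a rate per `per` people."""
    if population <= 0:
        raise ValueError("population must be positive")
    return count * per / population


# Hypothetical figures, invented for illustration only.
regions = {
    "Region A": {"incidents": 5000, "population": 2_000_000},
    "Region B": {"incidents": 1200, "population": 300_000},
}

rates = {
    name: rate_per(r["incidents"], r["population"])
    for name, r in regions.items()
}
for name, rate in rates.items():
    print(f"{name}: {regions[name]['incidents']} incidents, {rate:.0f} per 100,000 people")
```

In this made-up case the region with the larger raw total has the lower rate, which is exactly the kind of reversal a headline number can hide.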

Visual representations of data can be particularly subject to manipulation by both source and journalist: baselines that don’t begin from zero can be particularly misleading (one bar can be twice as high as another, but only 1 percentage point greater in reality). Line charts that begin from the lowest or highest point can suggest a much bigger drop or rise than the long-term reality.
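The effect of a truncated baseline can be quantified by comparing the heights of the bars as drawn with the values they stand for. The sketch below is pure arithmetic and draws nothing; the function name `visual_ratio` and the two values are assumptions chosen to mirror the example above.

```python
def visual_ratio(smaller: float, larger: float, baseline: float = 0.0) -> float:
    """Ratio of drawn bar heights when the axis starts at `baseline`."""
    if smaller <= baseline:
        raise ValueError("baseline must be below both values")
    return (larger - baseline) / (smaller - baseline)


# Hypothetical values: 51% and 52%.
honest = visual_ratio(51, 52)                   # axis starts at zero
truncated = visual_ratio(51, 52, baseline=50)   # axis starts at 50
print(f"Zero baseline: taller bar is {honest:.2f}x the height of the shorter one")
print(f"Baseline at 50: taller bar is {truncated:.2f}x the height of the shorter one")
```

A quick check of this kind, applied to a chart supplied by a source or produced in the newsroom, shows how far the picture departs from the numbers.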

The use of 3D effects can actively distort proportions in a chart. A pie chart that recedes into the distance suffers from the same problem as anything in the distance: a slice that is further away is smaller than the same slice in the foreground. 3D as a whole adds meaningless noise to a chart, so it is best avoided.

Data about people often involves concerns over privacy and accuracy. The Tampa Bay Mug Shots site, for example, pulled a feed of people charged with a crime from police websites, while The Journal News in Westchester County, New York, turned publicly available locations of pistol permit owners into a map.

In both cases, the newspapers merely relayed the information to a wider audience. But both faced questions about their ethics, primarily around minimizing harm and balancing privacy against public interest. Questions should be asked about what role the journalist should play in providing context, updates and corrections when required, and in particular what level of detail is actually required to tell the story you are trying to report. Aggregate, less personal, information may provide a clearer story about broader trends, for example, while random checks of the validity of the data may turn up stories about flaws in such publicly available information.

Sometimes when publishing raw data, journalists may not be able to check every row and column. In these cases, a judgment call needs to be made about the practicalities of providing a “right to reply” to the source or subjects, and the language framing the information guiding users about how reliable it is. Part of this includes making clear, as Adrian Holovaty explains, “which parts of the data might be out of date, how often it’s updated, which bits of the data are updated … and any other peculiarities about your process … Any application that repurposes data from another source has an obligation to explain how it gets the data … The more transparent you are about it, the better.”

It should at least be easy for users to notify publishers of likely errors in the data that warrant further checks, especially when the data relates to them individually. Where the data is drawn from a publicly available source, the original publisher should also be notified. In these cases, there may be a follow-up story about flaws in the data itself.

Journalists may also need to be careful about protecting sources in the way that they publish leaked data. Metadata stored in files about the date and location of access, the computers and accounts used, and other data, can be used to identify the source.
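As one illustration of what such metadata looks like, the sketch below lists the core properties embedded in an Office Open XML file (.docx, .xlsx or .pptx). These files are ZIP archives that usually carry a `docProps/core.xml` part recording fields such as the creator and the last person to modify the file. It uses only the Python standard library; the function name `read_core_properties` and the sample values are invented. Other formats, such as PDFs and images, store metadata differently and are not covered, so this is a starting point for a check rather than a complete one.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def read_core_properties(source) -> dict[str, str]:
    """Return the core document properties stored inside an Office Open XML file.

    `source` may be a file path or a binary file-like object.
    Returns an empty dict if the file has no docProps/core.xml part.
    """
    with zipfile.ZipFile(source) as archive:
        if "docProps/core.xml" not in archive.namelist():
            return {}
        root = ET.fromstring(archive.read("docProps/core.xml"))
    # Tags look like "{namespace}creator"; keep only the local name.
    return {
        child.tag.split("}")[-1]: (child.text or "")
        for child in root
    }


# Build a tiny stand-in file in memory so the sketch runs without a real document.
# The names and date are invented.
core_xml = (
    '<cp:coreProperties '
    'xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties" '
    'xmlns:dc="http://purl.org/dc/elements/1.1/" '
    'xmlns:dcterms="http://purl.org/dc/terms/">'
    '<dc:creator>Example Author</dc:creator>'
    '<cp:lastModifiedBy>Example Editor</cp:lastModifiedBy>'
    '<dcterms:modified>2013-01-01T09:00:00Z</dcterms:modified>'
    '</cp:coreProperties>'
)
sample = io.BytesIO()
with zipfile.ZipFile(sample, "w") as archive:
    archive.writestr("docProps/core.xml", core_xml)
sample.seek(0)

properties = read_core_properties(sample)
for name, value in properties.items():
    print(f"{name}: {value}")
```

Seeing which fields a file carries is only the first step: anything that could point to the source needs to be removed or the content re-created in a clean file before publication.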

The main author of this section is Paul Bradshaw.

Additional Resources
How To Lie With Statistics, Darrell Huff
Bad Science, Ben Goldacre
Bad Pharma, Ben Goldacre
Numbers Rule Your World, Kaiser Fung
The Tiger That Isn’t, Blastland and Dilnot
The Victory Lab, Sasha Issenberg
The Numbers Game, Anderson and Sally
The Signal and The Noise, Nate Silver
Distrust Your Data: Jacob Harris on Six Ways to Make Mistakes with Data