Common arguments against publishing data
Almost everyone in the community can tell stories about struggling with government officials to obtain transactional spending data in a machine-readable format. Often publishers simply do not know that civil society wants data in a particular format, but there are also deliberate obstructions. In this FAQ we list the most typical excuses for refusing to release data in computer-friendly formats.
… in machine-readable format
“PDFs are on my computer - therefore they are machine-readable”
FALSE: The fact that they are on your computer means they are electronic copies, not that they are machine-readable. A PDF is essentially a set of instructions telling a printer how to print a page. PDFs look nice and appealing to the human eye, but to a computer they are little more than a picture.
PDFs go from bad to worse from the perspective of someone trying to do data work:
- Better PDFs are machine-generated, typically an Excel spreadsheet or a structured Word document converted into a PDF (see example). Often you can copy and paste information from them, but there may be some formatting issues.
- Worse PDFs are typically scanned documents. Often, to add to the misery, they will be copies of faxes that are smudged, speckled, tea-, water- or mould-stained, or crooked (sometimes all of the above).
- Image files are not machine-readable for the same reasons.
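By contrast, a machine-readable format can be loaded and analysed in a few lines. The following is a minimal sketch in Python, assuming the spending data has been published as CSV; the column names (`date`, `supplier`, `amount`) and the sample rows are invented for illustration and do not come from any real publisher.

```python
import csv
import io
from decimal import Decimal

# Hypothetical spending records, as they might appear in a published CSV file.
SAMPLE_CSV = """date,supplier,amount
2013-01-15,Acme Ltd,1200.50
2013-01-20,Example Co,300.00
2013-02-02,Acme Ltd,99.50
"""


def total_by_supplier(csv_text):
    """Sum the 'amount' column for each supplier."""
    totals = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        supplier = row["supplier"]
        # Decimal avoids floating-point rounding problems with money.
        totals[supplier] = totals.get(supplier, Decimal("0")) + Decimal(row["amount"])
    return totals


totals = total_by_supplier(SAMPLE_CSV)
for supplier, amount in sorted(totals.items()):
    print(supplier, amount)
```

Getting the same totals out of a scanned PDF would mean retyping every figure by hand or running it through optical character recognition and then checking the result.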
“If we publish in machine-readable, open formats - someone will alter the data and use it to discredit us.”
Again, FALSE. If someone wants to use data badly enough, they will use it even if they have to get it out of documents manually. And if they have to extract it manually, mistakes could be introduced. Publishing the data in machine-readable format simply allows the user to start working with the data straight away.
Our advice would be the following:
- Publish both machine-readable and non-machine-readable formats. We insist on the former for analysis, but the latter can also be useful, e.g. for cross-referencing numbers and as an easily readable form for reading and sharing reports.
- Encourage users of the data to show their working. A good data project will usually:
- Link back to the original source data
- Link to any modified data with an explanation of how it was changed, with the calculations and any underlying working clearly visible. When you provide such a clear audit trail, others will be able to replicate your work and verify that everything was done without errors. In journalism this is sometimes known as the “nerd box”.
- Offer the data source the chance to comment on calculations from the data in order to clear up misunderstandings.
- This allows anyone to check the accuracy of the working and verify the results.
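One way a data project might record such an audit trail is sketched below in Python. This is only an illustration under stated assumptions: the source URL is a placeholder, the sample bytes stand in for a downloaded file, and the helper names `fingerprint` and `build_audit_trail` are hypothetical. The SHA-256 fingerprint lets a reader confirm they are working from exactly the same source file, and the list of steps documents how the data was changed.

```python
import hashlib
import json

# Hypothetical source data, standing in for a file downloaded from a publisher.
SOURCE_BYTES = b"supplier,amount\nAcme Ltd,1200.50\nExample Co,300.00\n"


def fingerprint(data):
    """Return the SHA-256 hex digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def build_audit_trail(source_url, source_bytes, steps):
    """Bundle the source link, its fingerprint and the processing steps."""
    return {
        "source_url": source_url,
        "source_sha256": fingerprint(source_bytes),
        "steps": list(steps),
    }


trail = build_audit_trail(
    "https://example.org/spending.csv",  # placeholder URL, not a real dataset
    SOURCE_BYTES,
    ["Removed rows with an empty amount", "Summed amount by supplier"],
)
print(json.dumps(trail, indent=2))
```

Publishing a record like this alongside the results means that anyone who downloads the source can recompute the fingerprint, and any change to the file, however small, will produce a different value.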