UK Departmental Spending
Every entity within the UK central government has to report its expenditure once a month. However, the data released through this mechanism is patchy: some national bodies and local councils report all of their spending in great detail, while others report nothing or export their data in forms which are hard to interpret. This leads to a skewed picture of government finances, because data users cannot easily tell whether data is missing or whether no transactions took place.
This is why we decided to initiate a collaboration between OpenSpending and the team behind data.gov.uk to help fix this issue. We’ve built a data cleansing and validation toolkit which makes the data available at regular intervals while making it easier to spot departments that are not keeping up with their publication requirements.
The tool lists all public bodies registered as data publishers on data.gov.uk and details how well they have followed the HM Treasury reporting guidelines. It will also make the whole of the reported data available for search and analysis - either within the OpenSpending platform or as a bulk download.
The clean-up and integration of more than 6,000 spreadsheet documents required a number of stages, ranging from retrieval to validation. Below we provide a brief overview of the technical stages involved in the process:
- Going through the index on data.gov.uk, we tried to retrieve all of the linked resources. Many of them turned out to be missing, either because they had been invalid in the first place, or because the data had been removed since their publication.
- Next, the format of the downloaded files had to be detected. While the Treasury mandates that files should be released as CSV (comma-separated values), many entities publish their spending in Excel or OpenOffice formats. Some departments still publish PDF files, which cannot be analyzed automatically.
- Once the format of the data was understood, we needed to match the column headers found in the data with the field names mandated by the Treasury. While the guidance names 16 headers, few departments actually report on all of them. On the other hand, many add their own data, such as project identifiers or non-standard classification schemes. To ensure data quality, many of the column headers had to be matched manually, using the nomenklatura web interface.
- Within some of the required fields, we decided to further apply cleansing and integration tools. This included simple tools to interpret dates and numbers, but also OpenCorporates to reconcile supplier names and nomenklatura for department family names.
- Finally, data had to pass validation. Out of a total of 7 million extracted transactions, only 4 million met the minimum requirement of having both an amount and a date associated with them.
- We then created a report which detailed any issues in getting the data. Data.gov.uk supported this by giving us information on which departments were core (thus had to report), and which were just recommended to do so.
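To make the stages above more concrete, here is a minimal Python sketch of format detection, header matching, value cleansing and validation. It is an illustration only, not the code of the actual toolkit: the function names, the `HEADER_ALIASES` table and the list of date formats are hypothetical, and the real process relied on nomenklatura and OpenCorporates rather than a hard-coded lookup.

```python
import csv
import io
from datetime import datetime
from decimal import Decimal, InvalidOperation

# Hypothetical alias table: normalised source header -> canonical field name.
# In the real process this mapping was curated by hand.
HEADER_ALIASES = {
    "amount": "amount", "amount gbp": "amount",
    "date": "date", "payment date": "date",
    "supplier": "supplier", "supplier name": "supplier",
}
# Assumed date layouts; real files contain many more variants.
DATE_FORMATS = ("%d/%m/%Y", "%Y-%m-%d", "%d-%b-%y")


def detect_format(head: bytes) -> str:
    """Guess the file type from its leading bytes."""
    if head.startswith(b"%PDF"):
        return "pdf"
    if head.startswith(b"PK\x03\x04"):       # ZIP container: .xlsx or .ods
        return "zip"
    if head.startswith(b"\xd0\xcf\x11\xe0"):  # OLE2 container: legacy .xls
        return "xls"
    return "csv"  # assumption: anything else is treated as delimited text


def normalise_header(header: str) -> str:
    """Lower-case a header and collapse underscores and whitespace."""
    return " ".join(header.lower().replace("_", " ").split())


def parse_amount(text):
    """Return a finite Decimal, or None if the value is not a number."""
    cleaned = (text or "").replace("£", "").replace(",", "").strip()
    try:
        value = Decimal(cleaned)
    except InvalidOperation:
        return None
    return value if value.is_finite() else None


def parse_date(text):
    """Return a date using the first matching format, or None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime((text or "").strip(), fmt).date()
        except ValueError:
            continue
    return None


def clean_rows(csv_text: str):
    """Map headers, clean values, and keep rows with an amount and a date."""
    reader = csv.DictReader(io.StringIO(csv_text))
    mapping = {h: HEADER_ALIASES.get(normalise_header(h))
               for h in (reader.fieldnames or [])}
    valid, rejected = [], 0
    for raw in reader:
        # Keep only the columns that map to a canonical field.
        row = {mapping[k]: v for k, v in raw.items() if mapping.get(k)}
        amount = parse_amount(row.get("amount"))
        date = parse_date(row.get("date"))
        if amount is None or date is None:
            rejected += 1  # fails the minimum validation requirement
            continue
        valid.append({**row, "amount": amount, "date": date})
    return valid, rejected
```

Calling `clean_rows` on the text of a CSV file returns the rows that pass validation together with a count of rejected rows, which is the kind of figure a per-department report could be built on. Columns that match no known field, such as project identifiers, are simply dropped in this sketch.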
In all, the process is still very prone to errors, and the messiness of the input data makes a strong case for the enforcement (and technical implementation) of a standard for transactional spending data.