Open Knowledge

Getting and cleaning data

As you look into a new, data-driven project, you'll find it easy to imagine all the analysis and comparisons that could be done to prove a particular point or to explore some area of government activity. As you then look into the availability of data on the topic, disillusionment can quickly set in: information may be hard to find, unstructured or simply not available to the public at all. As you search for data, there are a number of possible techniques you may want to investigate. Some depend on the government's cooperation, while others use technology to bridge the gap between how information is shared and what a data-driven advocacy project needs.

Getting data

Data processing pipeline

An increasing number of governments have accepted the need for pro-active transparency. When releasing open data, governments commit to making machine-readable information available on the web, in a form that enables anyone to use, re-use and re-distribute the data without legal or technical restrictions. Such releases should happen in a timely manner and include detailed, disaggregated data. Many countries have set up dedicated open data portals which provide easy access to all published datasets and relevant metadata (i.e. information about when a dataset was last updated, who published it and where documentation regarding format and contents can be found). The benefits of using open data, where it is available, include not only ease of access but usually also a more authoritative basis for any further analysis.

In some cases, open data is provided through an application programming interface (API), often a web-based method for retrieving, searching or even updating the available information dynamically. APIs provide up-to-date data in a granular and filtered form, removing the need to repeatedly process and update source files.

A common use case for APIs is relatively time-sensitive information, such as procurement calls and contracts which are released every day. In the UK, BusinessLink (http://www.contractsfinder.businesslink.gov.uk/data-feed.aspx) provides a number of data feeds which contain information about procurement notices. Similarly, the USASpending portal provides a set of APIs that can be used to retrieve up-to-date grants information for the US federal government (http://usaspending.gov/data).
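To illustrate how such an API might be queried, here is a minimal sketch in Python. The endpoint, parameters and field names are hypothetical placeholders - consult the documentation of the portal you are actually using (such as the data feed pages linked above) for the real ones.

```python
# Minimal sketch of pulling records from a web API; the URL, parameters
# and field names below are illustrative assumptions, not a real feed.
import requests

API_URL = "https://api.example.gov/awards"          # hypothetical endpoint
params = {"fiscal_year": 2012, "format": "json", "page": 1}

response = requests.get(API_URL, params=params, timeout=30)
response.raise_for_status()                         # fail loudly on HTTP errors
payload = response.json()                           # parse the JSON body

for record in payload.get("results", []):
    print(record.get("recipient"), record.get("amount"))
```

Because the API returns current data on every call, a small script like this can be scheduled to run regularly, removing the need to re-download and re-process complete source files.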

If you are looking for a list of data catalogues from around the world, DataCatalogs.org is curated by experts in the field. Search for your country and see what you find.

Freedom of information

Even before the rise of open data, many countries decided to increase the transparency of their governments by introducing freedom of information (FoI) legislation. Such laws enable every citizen to request documents and other material from government bodies, as long as the material does not merit special protection (e.g. due to concerns over privacy, national security or commercial confidentiality).

Journalists, activists and CSOs have long had channels for acquiring information. Sometimes, having a good relationship with a press officer or a civil servant is enough, and making a formal request for information is unnecessary (your friendly press officer may even feel slightly offended if you don't ask them nicely first). FoI requests generate a lot of paperwork (hence grumpy civil servants), so if you do have the contacts, it may be a good idea to ask nicely first!

Freedom of information requests often require some degree of preparation: the documents or databases you request should be clearly identified, you should know which department or unit is in charge of them, and you should be ready to address possible concerns over privacy or commercial confidentiality in your request.

While freedom of information legislation is in force in many countries, much of it was drafted before the need for structured data became apparent, so many laws do not allow citizens to specify a particular format. Many governments choose to release information on paper rather than in a structured, digital form, making the data processing step more painful. Still, the legally binding character of freedom of information requests often makes them an invaluable tool in the process of gaining access to financial data.

FoI requests may be necessary when you want to get more detail on the projects that government money is funding. Often the transactional spending data released will include only a brief description of the project, if any. To get more information, you might need to submit an FoI request. For instance, if you have the high-level payment information for a contract that includes the recipient, location and total amount, but you want to know the details of the contract deliverables, you will probably need to submit an FoI request for the full contract.

A good example of this process is the Sunlight Foundation's request for information on the Airport Improvement Program in the United States. The program accepts applications from airports around the country for infrastructure improvement grants, such as repaving a runway. Each project is assigned a safety priority rating and is prioritized in a queue. The high-level spending information for this program was available on USASpending.gov, but since the priority ratings are specific to this program and not to spending data in general, they were not included in that dataset. The Sunlight Foundation submitted an FoI request for the full dataset, including the priority ratings. With that data, they were able to determine when airports with low-priority projects were getting money, and how often. So the lesson is: if you see some interesting patterns in your high-level spending data, don't be afraid to dig deeper and ask for more detailed program information.

Want to submit a request, but not sure where to start, who to address your request to or how to write it? Access Info (http://www.access-info.org/) is an organisation that works to help people obtain the information they require from the public bodies that hold it. They have also produced a toolkit (http://www.legalleaks.info/toolkit.html) on using FoI requests. It is primarily aimed at journalists, but most of the tips are equally relevant for CSOs.

Data scraping

Unlike open data or freedom of information requests, data scraping does not rely on the cooperation of government authorities for acquiring machine-readable documents. Scraping refers to the process of transforming unstructured documents - online database interfaces, PDF files or even printed documents - into a more structured form (see the section below on optical character recognition for tips on handling printed material). Many tools and techniques enable such refinement by re-interpreting documents generated for humans into a more structured form which can be analysed, aggregated and distributed in new ways.

While there is an increasing number of easy-to-use scraping tools which do not require much technical knowledge, more complex tasks - such as the automated scraping of thousands or millions of web pages, or the mass interpretation of PDF files - require some programming so that the process can be fully automated and produce reliable results.
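To give a flavour of what a programmed scraper looks like, here is a minimal sketch using the Python libraries requests and BeautifulSoup. The URL and the assumption that the page presents its records in a single HTML table are illustrative; a real scraper has to be adapted to the structure of the site in question.

```python
# Minimal scraping sketch: fetch a (hypothetical) page listing procurement
# notices in an HTML table and save the rows to a CSV file.
import csv
import requests
from bs4 import BeautifulSoup

URL = "https://www.example.gov/notices"        # illustrative URL, not a real feed

html = requests.get(URL, timeout=30).text
soup = BeautifulSoup(html, "html.parser")

rows = []
for tr in soup.select("table tr")[1:]:         # skip the header row
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:
        rows.append(cells)

# Write the extracted rows out for further cleaning and analysis.
with open("notices.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)
```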

In some cases, the only way to gain access to a set of figures is through the digitization of printed material. While scanners and optical character recognition (OCR) software can be used to import such documents, the high cost and low data quality of this approach often make it an unattractive option.

A very accessible guide to scraping, Scraping for Journalists (https://leanpub.com/scrapingforjournalists), has been published. Again, it is aimed at journalists (they're getting a little spoiled, aren't they?), but scraping is scraping, so it is a worthwhile read for CSOs too.

ScraperWiki (https://scraperwiki.com/) is another great source of tutorials, as well as being a scraping tool in its own right. Using it has the additional benefit that most of the data you scrape goes into a public data store, making it available to others.


Getting data out of scanned documents

When you deal with scanned documents, the crucial step in the extraction process is to have the computer attempt to recognize any characters - letters, numbers and other signs. Optical character recognition (OCR) software is built to do this, accepting scanned pictures and PDF documents as an input.

There are both commercial software products for OCR (such as ABBYY FineReader, http://finereader.abbyy.com/), and some open-source software packages, such as Google's Tesseract (http://code.google.com/p/tesseract-ocr/). In general, the quality of all automatic recognition is limited, and you should make sure to cross check any numbers coming from scanned material against the printed documents.
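As a rough sketch of how this works in practice, the open-source Tesseract engine can be driven from Python through the pytesseract wrapper (both need to be installed separately); the file name below is just a placeholder.

```python
# Rough OCR sketch: extract text from a scanned page with Tesseract via
# the pytesseract wrapper. "scanned_page.png" is an illustrative file name.
from PIL import Image
import pytesseract

text = pytesseract.image_to_string(Image.open("scanned_page.png"))
print(text)

# OCR output is error-prone: always cross-check any extracted figures
# against the printed original before using them in analysis.
```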

Keeping the data around

As you retrieve data from the government (or other sources), it is easy to treat the websites it was released on as a permanent resource. Experience has shown, however, that data does go away: governments redesign their websites, new policies retract transparency rules, or systems simply fail.

Fortunately, downloading complete copies of web sites - a process called mirroring - is a fairly well-established technique that can easily be deployed by civil society organisations. Mirroring involves an automated computer program (for a list, see http://en.wikipedia.org/wiki/Web_crawler) harvesting all the web pages linked from a specified starting page, e.g. a ministry home page. In most cases, it is also possible to find old versions of web sites via the Internet Archive's Wayback Machine (http://archive.org/web/web.php), a project that aims to create up-to-date copies of all public web sites and archive them forever.
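The sketch below shows the basic idea behind mirroring: fetch a starting page, follow its links within the same site, and keep local copies. It deliberately ignores the politeness, retry and depth handling that dedicated mirroring tools provide, and the starting URL is a placeholder.

```python
# Very small mirroring sketch: save a starting page and the pages it links
# to on the same host. Dedicated tools (e.g. wget or HTTrack) do this far
# more robustly; this only illustrates the principle.
import os
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

START = "https://www.example.gov/"             # illustrative starting page

def save(url, content, folder="mirror"):
    """Store the raw response body under a file name derived from the URL."""
    os.makedirs(folder, exist_ok=True)
    name = urlparse(url).path.strip("/").replace("/", "_") or "index"
    with open(os.path.join(folder, name + ".html"), "wb") as f:
        f.write(content)

start_page = requests.get(START, timeout=30)
save(START, start_page.content)

soup = BeautifulSoup(start_page.text, "html.parser")
for a in soup.find_all("a", href=True):
    url = urljoin(START, a["href"])
    if urlparse(url).netloc == urlparse(START).netloc:   # stay on the same host
        save(url, requests.get(url, timeout=30).content)
```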

Cleaning data

Data processing pipeline

As you acquire government spending information, you will notice that such data often has many inconsistencies: program names are used inconsistently, financial amounts are stated in badly formatted numbers, and some data may not be usable at all because of corrupted files. In short: data always needs to be cleaned and processed. In fact, processing, augmenting and cleaning the data is very likely to be the most time- and labour-intensive aspect of your project.

Processing stages for data projects

While there are many different types of data, almost all processing can be expressed as a set of incremental stages. The most common stages include data acquisition, extraction, cleaning, transformation, integration, analysis and presentation. Of course, with many smaller projects, not each of these stages may be necessary.

In this process, each stage fulfils a unique purpose:

  • Acquisition describes gaining access to data, either through any of the methods mentioned above or by generating fresh data, e.g. through a survey or observations.
  • In the extraction stage, data is converted from whatever input format has been acquired (e.g. XLS files, PDFs or even plain text documents) into a form that can be used for further processing and analysis. This often involves loading data into a database system, such as MySQL or PostgreSQL.
  • Cleaning and transforming the data often involves removing invalid records and translating all the columns to use a sane set of values. You may also combine two different datasets into a single table, remove duplicate entries or apply any number of other normalizations.
  • Analysis of data to answer particular questions is not something we will describe in detail in the following chapters of this book: we presume that you are already the experts in working with your data and in using, for example, economic models to answer your questions. The aspects of analysis we do hope to cover here are automated and large-scale techniques - tips and tricks for getting and using data, and for having a machine do a lot of the work, for example through network analysis or natural language processing.
  • Presentation comes last: data only has impact when it is packaged in an appropriate way for the audiences it is aimed at.

As you model a data pipeline, it is important to take care that each step is well documented, granular and - if at all possible - automated. This is particularly true when processing more complex datasets, such as transactional government expenditure which has been scraped from a government website.
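As a concrete (and deliberately simplified) illustration, the sketch below splits a small pipeline into separate extraction, cleaning and transformation functions. The file and column names are assumptions made for the example; the point is that each step is explicit, documented and can be re-run on its own.

```python
# Sketch of a data pipeline broken into small, documented, repeatable steps.
# File and column names are illustrative placeholders.
import pandas as pd

def extract(path="raw_budget.xlsx"):
    """Extraction: read the acquired source file into a table."""
    return pd.read_excel(path)

def clean(df):
    """Cleaning: drop empty rows and normalize column names."""
    df = df.dropna(how="all")
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    return df

def transform(df):
    """Transformation: make amounts numeric so they can be aggregated."""
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
    return df

if __name__ == "__main__":
    transform(clean(extract())).to_csv("budget_clean.csv", index=False)
```

Keeping each stage in a separate, named function (or a separate script) makes it much easier to document what was done and to re-run only the step that changed.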

Data provenance

Good documentation of data provenance (the origin and history of a dataset) can be compared to the chain of custody maintained in criminal investigations: each previous owner of a dataset must be identified, and they are held accountable for the processing and cleaning operations they have performed on the data. For Excel spreadsheets this means writing down the steps taken in transforming the data, while more advanced data tools (such as Open Refine, formerly Google Refine) often provide ways of exporting a machine-readable record of the processing history. Any programs written to process the data should be available to users when they access your end result, and shared as open-source code on a public code-sharing site such as GitHub.

Tools for documenting your data work

Documenting the transformations you perform on your data can be as simple as a detailed prose explanation and a series of spreadsheets that represent key intermediate steps. But there are also a few products out there that are specifically geared towards helping you do this. Socrata is one platform that helps you perform transforms on spreadsheet-like data and share them with others easily. You can also use the Data Hub, an open-source platform that allows several versions of a spreadsheet to be collected together into one dataset, and which auto-generates an API to boot.

Normalizing data

Data that comes from the government is often generated across multiple departments by hand. This can result in inconsistencies in what kinds of values or formats are used to describe the same meaning. Normalizing values to be consistent across a dataset is therefore a common activity.

Start by finding all of the distinct values in each column of your dataset. You can do this with a database query language (such as SQL's DISTINCT), or simply by using the 'filter' feature of a spreadsheet program.

For example, if you have a spreadsheet with contracting data, and one column is 'Competed?', you would expect the values to be 'yes' or 'no'. But if this spreadsheet is an amalgam of spreadsheet data from multiple users and departments, your values could vary among the following: 'Y', 'YES', 'yes', 1, 'True', 'T', 't', 'N', 'NO', 'no', 0, 'False', 'F', 'f', etc. Limiting all of these potential values to two clear options will make it easier to analyse the data, and also easier for those who follow in your footsteps.
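A sketch of this clean-up in pandas follows; the file name and the 'Competed?' column are taken from the example above, while the exact mapping of raw values is an assumption you should adapt to what you actually find in the data.

```python
# Collapse the messy yes/no values in a 'Competed?' column into True/False.
import pandas as pd

df = pd.read_csv("contracts.csv")              # illustrative file name

mapping = {
    "y": True, "yes": True, "1": True, "true": True, "t": True,
    "n": False, "no": False, "0": False, "false": False, "f": False,
}

# Lower-case and strip the raw values, then map them onto True/False;
# anything unrecognised becomes NaN, which you can then inspect by hand.
df["competed"] = (
    df["Competed?"].astype(str).str.strip().str.lower().map(mapping)
)
print(df["competed"].value_counts(dropna=False))
```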

Especially with financial data, numbers can be formatted several different ways. For example, are your negative values represented with a '-' or placed inside '( )' or possibly even highlighted in red? Not all of these values will be easily read by a computer program (especially the color), so you'll want to pick something clear and consistent to convert all your negative values to (probably the negative sign).

Is all your numerical data reported in single units, or is it abbreviated in thousands? Especially with budget data, order-of-magnitude errors are not uncommon when one department thinks it is reporting in thousands or millions by default while others expand their data all the way to the ones place. Are some values in scientific notation (e.g. 10e3938)? Make sure all your values are consistent, otherwise your analysis could contain serious errors.
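The small helper below sketches how a few of these formats might be normalized into plain floating-point numbers: parentheses for negatives, thousands separators and an optional 'reported in thousands' multiplier. The multiplier is an assumption that must be confirmed against the dataset's documentation.

```python
# Normalize a few common amount formats into plain numbers.
def parse_amount(raw, reported_in_thousands=False):
    value = str(raw).strip().replace(",", "")            # drop thousands separators
    negative = value.startswith("(") and value.endswith(")")
    if negative:
        value = value[1:-1]                              # strip the parentheses
    number = float(value)                                # also handles scientific notation
    if negative:
        number = -number
    return number * (1000 if reported_in_thousands else 1)

print(parse_amount("(1,250.50)"))                          # -1250.5
print(parse_amount("3.2e3", reported_in_thousands=True))   # 3200000.0
```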

A column of data requiring name normalization

Data quality

Data cleaning and data quality go hand in hand. You may find it helpful to conduct data quality tests after your analysis in order to measure your confidence in the data, or before it to decide whether you want to do the analysis at all. Data quality can mean a few different things, but here are some key factors to consider.

Completeness

Is all your data there? If you have a time series of some data, do you have approximately the same number of rows for each year or are there major outliers? An extremely high or low number of rows that looks out of place may indicate duplicate or lost rows. Or it could indicate a change in how data was reported at that time. Are there values for each column where values are required (according to the documentation)?
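One simple way to spot such gaps is to count records per time period and look for outliers, as in the sketch below (file and column names are illustrative).

```python
# Completeness check: count records per year and flag suspicious years.
import pandas as pd

df = pd.read_csv("spending.csv", parse_dates=["date"])   # illustrative names
per_year = df["date"].dt.year.value_counts().sort_index()
print(per_year)

# Flag years whose record count is less than half the median year's count.
suspicious = per_year[per_year < per_year.median() * 0.5]
print("Years that may be incomplete:", list(suspicious.index))
```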

Accuracy

Does your dataset match up with other available data? This can be difficult to confirm, but it is often useful to employ a sampling methodology. Select a significant sample of your records (3-5%) and decide which values you can check for accuracy. If your data is transactional contract data, you could request the full contracts for your sample records and compare the structured data you have with the prose of the contract to confirm that it is correct. You can sometimes check contract and grant amounts against journalistic reports and even company websites.
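Drawing such a sample can be as simple as the sketch below; the fixed random seed means colleagues can reproduce and re-check exactly the same records (the file name is a placeholder).

```python
# Draw a reproducible 4% random sample of records for manual accuracy checks.
import pandas as pd

df = pd.read_csv("contracts.csv")                        # illustrative file name
sample = df.sample(frac=0.04, random_state=42)           # fixed seed = reproducible
sample.to_csv("accuracy_check_sample.csv", index=False)
print(f"Selected {len(sample)} of {len(df)} records for manual review")
```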

Timeliness

Sometimes governments lag behind in their reporting. Even if a dataset claims to cover a specific time period, do you notice the records thinning out towards the end of that period? If so, you may want to restrict your analysis to the subset of the data where you have a healthy sample.

Case studies

Some projects have gone beyond simple assessments of data quality as a means of building confidence in an underlying result - they are projects entirely about data quality themselves!

The Open Knowledge Foundation recently created a data quality monitor for UK transactional expenditure data (http://openspending.org/resources/gb-spending), developed in cooperation with the data.gov.uk team. Each imported file is documented, highlighting any mistakes in the structure or formal representation of the data.

The Sunlight Foundation has run a yearly report called Clearspending (http://sunlightfoundation.com/clearspending/) for the past three years that assesses the quality of the data reported in the US federal spending dashboard, USASpending.gov. The results haven't been very good but have highlighted the need for data standardization and oversight of the data release process.

As an even more thorough technique, little programs and tests can be written to check data validity along the processing pipeline by testing certain assumptions about each record in the dataset. If you do not have the option of creating such tests, you may still be able to create a spreadsheet in which you record known defects and issues with the data, including source files that did not open and inconsistent column values.
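Such checks can be as lightweight as a handful of assertions run after each processing step. The sketch below tests a few illustrative rules (the column names and thresholds are assumptions to be replaced by whatever your dataset's documentation actually promises).

```python
# Simple validity tests to run along the processing pipeline.
import pandas as pd

def check(df):
    problems = []
    if df["amount"].isna().any():
        problems.append("some amounts could not be parsed as numbers")
    if df["recipient"].isna().any():
        problems.append("some records are missing a recipient name")
    if (df["amount"].abs() > 1e12).any():
        problems.append("implausibly large amounts - check the units")
    return problems

df = pd.read_csv("spending_clean.csv")          # illustrative file name
for problem in check(df):
    print("WARNING:", problem)
```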

The most powerful data quality tool is much simpler, though: sharing the data. Without access to your source data, others will have to take your word blindly. Sharing both the source and processed forms of your dataset is therefore essential to any discussion of data quality.

Categorization and reference data

One of the most powerful ways of making data more meaningful for analysis is to combine it with reference data and code sheets. Unlike transactional data - such as statistical time series or budget figures - reference data does not describe observations about reality; it merely contains additional details on the category schemes, government programmes, persons, companies or geographies mentioned in the data.

For example, in the German federal budget, each line item is identified through an eleven-digit code. This code includes three-digit identifiers for the functional and economic purpose of the allocation. By extending the budget data with the titles and descriptions of each economic and functional taxonomy entry, two additional dimensions become available that enable queries such as the overall pension commitments of the government, or the sum of all programmes with defence functions.
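In practice, this kind of enrichment is usually a join between the budget table and a code sheet. The sketch below shows the general pattern in pandas; the file names, column names and the assumption that the functional identifier sits in the first three digits of the composite code are purely illustrative and do not reflect the actual layout of the German budget code.

```python
# Enrich budget line items with a functional classification code sheet.
import pandas as pd

budget = pd.read_csv("budget.csv", dtype={"code": str})
functions = pd.read_csv("functional_codes.csv", dtype={"function_code": str})

# Illustrative assumption: the functional identifier is the first three
# digits of the composite line-item code.
budget["function_code"] = budget["code"].str[:3]

enriched = budget.merge(functions, on="function_code", how="left")
print(enriched.groupby("function_title")["amount"].sum().sort_values())
```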

The main groups of reference data that are used with government finance include code sheets, geographic identifiers and identifiers for companies and other organizations:

Classification reference data

Reference data are the dictionaries for the categorizations included in financial datasets. They may include descriptions of government programmes; economic, functional or institutional classification schemes; charts of accounts; and many other types of schemes used to classify and allocate expenditure.

Some such schemes are also standardized beyond individual countries, such as the UN's classification of functions of government (COFOG) and the OECD DAC Sector codes (http://www.oecd.org/dac/aidstatistics/dacandcrscodelists.htm). Still, the large majority of governments use their own code sheets to allocate and classify expenditure. In such cases, it is often advisable to request access to the code list versions used internally by government, including revisions over time that may have changed how certain programmes were classified.

A library of reference data that can be re-used across different projects is a valuable asset for any organization working with government finance. Sharing such data with others is crucial, as it helps to enable comparable outputs and opens up options for future data integration. Existing repositories include the IATI Standard (http://iatistandard.org/) and datahub.io.

Geographic identifiers

Geographic identifiers are used to describe administrative boundaries or specific locations identified in a dataset. While some regional classifications (such as the EU NUTS) are released on the web, there is also an increasing number of open databases which contain geographic names - including geonames.org and the recently developed world.db.

A related technique is geo-coding: translating a human-readable address into a pair of coordinates. Services like Nominatim (http://nominatim.openstreetmap.org/) not only enable users to generate precise maps of projects in a region; through reverse geo-coding they can also return the responsible administrative boundary for a given pair of coordinates. This means that projects which are given by precise address can also be aggregated by state, region or any other geographic unit.
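As a sketch of how this can be automated, the snippet below sends an address to the public Nominatim web service and prints the coordinates of the best match. Check Nominatim's usage policy before running this at any volume - it expects a descriptive User-Agent and a low request rate - and note that the address used here is just an example.

```python
# Geo-code a single address with the public Nominatim service.
import requests

params = {"q": "10 Downing Street, London", "format": "json", "limit": 1}
headers = {"User-Agent": "spending-data-example/0.1"}    # identify your project

resp = requests.get("https://nominatim.openstreetmap.org/search",
                    params=params, headers=headers, timeout=30)
results = resp.json()
if results:
    print(results[0]["lat"], results[0]["lon"])
```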

Additionally, many countries have shapefiles of their political and geographic districts available (usually through the census or interior bureaus) that can be imported into custom mapping applications, like TileMill (http://mapbox.com/tilemill/).

Company and organisational identifiers

As you look into spending data that includes recipients outside the government, you'll find companies which act as suppliers to government, but also other types of entities - including charities, associations, foreign governments, political parties and even individuals - who act as recipients of direct assistance.

Identifying such entities is notoriously hard, since the only information kept by government is often a simple name (which may not uniquely identify the beneficiary, for example "MS"). While most (if not all) countries maintain company registers which assign some type of unique identifier to a company, these databases are often not accessible in bulk and not used coherently across different parts of government. Alternative identifiers - such as tax identifiers and company IDs from private business information suppliers (such as Dun & Bradstreet in the US) - further complicate this process.

As an alternative, open registries are beginning to compile organisational identifiers in a form that is easy to re-use, which enables the sharing of databases that have been augmented with such information. OpenCorporates.com (http://opencorporates.com) is a startup that collects information on companies worldwide and provides a convenient API for matching datasets against its list of known companies. The IATI project for aid transparency is working towards similar standards for other organisations, such as foreign governments and charities active in the development space.
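The sketch below shows what a lookup against the OpenCorporates search API might look like. The endpoint path, response structure and rate limits should be checked against the current API documentation (heavier use may require an API token); the supplier name is made up for the example.

```python
# Look up a supplier name against the OpenCorporates company search API.
# Endpoint and response fields reflect the documented API at the time of
# writing; verify them before relying on this sketch.
import requests

name = "Acme Holdings Ltd"                      # illustrative supplier name
resp = requests.get("https://api.opencorporates.com/companies/search",
                    params={"q": name}, timeout=30)
resp.raise_for_status()

for item in resp.json()["results"]["companies"]:
    company = item["company"]
    print(company["name"], company["jurisdiction_code"], company["company_number"])
```

Matching by name alone is fuzzy, so treat the results as candidates to be confirmed by hand (or with additional fields such as address or jurisdiction) rather than as definitive identifications.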