|Data available online but not downloadable (e.g. in HTML tables on webpages).
|| For simple sites (information on an individual webpage): Google Spreadsheets with the ImportHTML function, or the Google Scraper browser extension (basic).
For more complex webpages (information spread across numerous pages), a scraper will be required. Scrapers are tools that extract structured information from websites using code. ScraperWiki is a useful online tool that makes this easier (advanced).
||At the basic level, anyone who can use a spreadsheet and its functions can do this. ImportHTML is not, however, a well-known function, and awareness needs to be spread about how it can be used (people are often daunted because they presume all scraping involves code). Scraping using code is advanced and requires knowledge of at least one programming language.
The need to be able to scrape was mentioned in every country interview in the Athens to Berlin series.
For more information, or to learn to start scraping, see the School of Data course on Scraping.
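To illustrate what code-based scraping involves, the sketch below pulls the rows out of an HTML table using only the Python standard library. The table markup and column names are invented for the example; a real scraper (e.g. one hosted on ScraperWiki) would fetch the page over HTTP first and save the rows somewhere.

```python
from html.parser import HTMLParser

class TableScraper(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped into rows."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

# Invented sample markup; a real scraper would download the page first.
page = """<table>
  <tr><th>Department</th><th>Amount</th></tr>
  <tr><td>Health</td><td>1200</td></tr>
</table>"""

scraper = TableScraper()
scraper.feed(page)
print(scraper.rows)  # [['Department', 'Amount'], ['Health', '1200']]
```

Even this toy version shows why scraping is classed as advanced: real pages have nested tables, inconsistent markup and pagination, all of which the code must handle explicitly.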
|Data available only in PDFs (or worse, images)
||A variety of tools are available to extract this information. The most promising non-code options are ABBYY FineReader (not free) and Tabula (new software, still somewhat buggy, and it requires people to host it themselves).
||Most tools require knowledge of coding, though some progress is being made on non-technical options. For more information, and to see some of the advanced methods, see the [School of Data course](http://schoolofdata.org/handbook/courses/extracting-data-from-pdf/).
Note: these tools are still imperfect, and it remains vastly preferable to advocate for data in the correct formats rather than to teach people how to extract it.
Recently published guidelines from governments in the UK and US can now be cited as examples when requesting data in the required formats.
|Data not published at all; must be obtained from whistleblowers.
||Several projects have made use of secure dropboxes and whistleblower submission services.
||Advanced - security is of the utmost concern.
||For example: MagyarLeaks.
## Stage 2: Cleaning, Working with and Analyzing Data
|Messy data, typos, blanks (various)
||Spreadsheets, OpenRefine, powerful text editors (e.g. TextWrangler) plus knowledge of regular expressions.
||Basic -> Advanced
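As a small illustration of the regular-expression approach, the sketch below normalises messy currency strings of the kind that turn up in published spending data. The sample values and the `clean_amount` helper are invented for the example; a real cleaner would also handle decimal commas, negative amounts and so on.

```python
import re

def clean_amount(raw):
    """Strip currency symbols, thousands separators and whitespace;
    return a float, or None for blanks and non-numeric junk."""
    digits = re.sub(r"[^\d.]", "", raw)  # keep only digits and the decimal point
    return float(digits) if digits else None

messy = ["£1,200.50", " 300 ", "EUR 45,000", "n/a", ""]
print([clean_amount(v) for v in messy])
# [1200.5, 300.0, 45000.0, None, None]
```

The same pattern works interactively in OpenRefine or a text editor's find-and-replace: one expression fixes thousands of cells at once.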
|Need to reconcile entities against one another to answer questions such as, "what is company X?", "Is company X Ltd. the same as company X?" (ditto for other types of entities e.g. departments, people).
||Nomenklatura, OpenCorporates, [PublicBodies.org](http://publicbodies.org/)
Reconciling entities is complicated, both because of the tools needed and because of the often inaccurate state of the data.
Working without common identifiers, and with data of poor quality, makes entity reconciliation highly complicated and can leave big gaps in analysis.
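To make the problem concrete, the sketch below shows a crude first step often taken before proper reconciliation with tools such as Nomenklatura: normalising company names so that "Company X Ltd." and "COMPANY X" compare equal. The suffix list and the `normalise` helper are invented for the example and would need extending for real jurisdictions.

```python
import re

# Legal-form suffixes to ignore when matching (an invented, partial list).
SUFFIXES = r"\b(ltd|limited|plc|llc|inc|gmbh)\b\.?"

def normalise(name):
    """Crude canonical form for company-name matching."""
    name = re.sub(SUFFIXES, "", name.lower())
    name = re.sub(r"[^\w\s]", "", name)  # drop punctuation
    return " ".join(name.split())        # collapse whitespace

print(normalise("Company X Ltd.") == normalise("COMPANY X"))  # True
```

This is exactly why unique identifiers matter: string tricks like this inevitably both miss true matches and create false ones, which is where the gaps in analysis come from.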
|Need to be able to conceptualize networks and relationships between entities (See dedicated section on Network Mapping below).
||Intermediate - advanced.
| Need to be able to work with very large numbers of rows (too big to fit in Excel).
|| OpenSpending.org, other database software (PostgreSQL, MySQL), command-line tools
|| OpenSpending.org is easy for basic upload, search and interrogation; in OpenSpending and other databases, some advanced queries may require knowledge of coding.
|| Note: as few countries currently release transaction-level data, this is not yet a frequent problem, but it is already an issue in places such as Brazil, the US and the UK. As we push for greater disclosure, this capability will be needed ever more.
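To show what the database route looks like, the sketch below uses Python's built-in sqlite3 module as a stand-in for a PostgreSQL or MySQL instance; the table and figures are invented for the example.

```python
import sqlite3

# An in-memory SQLite database stands in for PostgreSQL/MySQL here.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE spending (supplier TEXT, amount REAL)")
con.executemany("INSERT INTO spending VALUES (?, ?)",
                [("Acme Ltd", 1200.0), ("Acme Ltd", 800.0), ("Bravo plc", 500.0)])

# Aggregations like this stay fast on millions of rows that
# would be far too many for a spreadsheet to open.
rows = con.execute(
    "SELECT supplier, SUM(amount) FROM spending "
    "GROUP BY supplier ORDER BY SUM(amount) DESC").fetchall()
print(rows)  # [('Acme Ltd', 2000.0), ('Bravo plc', 500.0)]
```

The same `GROUP BY` query runs unchanged on a real transaction-level dataset; only the connection line differs between database engines.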
|Performing repetitive tasks or modelling
||Excel macros
||Basic - Intermediate.
|Entity Extraction (e.g. from large bodies of documents)
|| Open Calais, Yahoo/YQL Content Analysis API, TSO data enrichment service
|| This is far from a perfect method, and it would be vastly easier to answer questions about entities if they were codified with unique identifiers.
|Analysis needs to be performed on datasets that are published in different languages (e.g. in India)
||To some extent: Google Translate for web based data.
||Still searching for a solution to automatically translate offline spreadsheets.
| Figures change in data after publication
For non-machine-readable data this is tricky to detect.
For simple, machine-readable file formats such as CSV, version control is a possibility.
For web-based data, some scrapers can be configured to trigger an action (e.g. email someone) whenever a field changes.
|| Intermediate to advanced
|| Future projects that are likely to tackle this problem: DeDupe.
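The trigger idea above can be sketched in a few lines: hash each download of a dataset, and when the hash changes, diff the rows to find the edited cells. The helper names and sample figures are invented, and the diff assumes the first column holds a stable row identifier.

```python
import hashlib

def fingerprint(rows):
    """Stable hash of a dataset, so a fresh download can be compared cheaply."""
    h = hashlib.sha256()
    for row in rows:
        h.update(("\x1f".join(row) + "\n").encode("utf-8"))
    return h.hexdigest()

def changed_cells(old, new, key=0):
    """List (row_key, column, old_value, new_value) for every edited cell.
    Assumes column `key` holds a stable row identifier."""
    old_by_key = {row[key]: row for row in old}
    diffs = []
    for row in new:
        before = old_by_key.get(row[key])
        if before is None:
            continue  # newly added row; handle separately if needed
        for i, (a, b) in enumerate(zip(before, row)):
            if a != b:
                diffs.append((row[key], i, a, b))
    return diffs

v1 = [["Health", "1200"], ["Defence", "900"]]
v2 = [["Health", "1250"], ["Defence", "900"]]  # a figure changed after publication
if fingerprint(v1) != fingerprint(v2):
    print(changed_cells(v1, v2))  # [('Health', 1, '1200', '1250')]
```

A scheduled job running this comparison and emailing the diff is essentially what the scraper-trigger setups described above do.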
Note on SPSS and R: our impression is that interviewees had largely been trained to use SPSS. R is nonetheless important to mention, as it offers free access to a broad selection of the same models, albeit through a programming interface.
A few examples of analysis on spending data that can be done with statistical software such as SPSS or R:
a) Hidden Markov models: originally developed for finding patterns in bioinformatics, Hidden Markov models have turned out to be useful for predicting fraudulent and corrupt behaviour. Using them requires high-quality data; they were, for instance, used to analyse around 50 million transactions of spending data in the Slovenian platform Supervizor.
b) Benford's law: Benford's law compares the distribution of leading digits in your data against how it should look theoretically. Deviations from the expected distribution can help detect fraudulent reporting (e.g. if companies tend to report earnings just under $500 million in order to fit a particular regulation, Benford's law could be a tool to detect that). See this example using Benford's law to test the release of all Danish corporate tax filings, and this R blog post on the topic.
Finally, a few notes on the differences between SPSS and R: though SPSS is fairly easy to get started with, it can be difficult to collaborate around, as it uses its own SPSS data format, and some models may be unavailable in the basic SPSS package. R is the free alternative: it uses a programming interface, all extensions are accessible, and community support and code samples are widely available. One possible compromise, bridging the convenience of SPSS and the wide usability of R, is the proprietary Revolution R software.
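Benford's law itself is simple enough to sketch without statistical software: the expected probability of leading digit d is log10(1 + 1/d), which can be compared against observed frequencies. The payment amounts below are invented for the example; a real test would use a full dataset and a proper goodness-of-fit statistic.

```python
import math
from collections import Counter

def leading_digit(x):
    """First significant digit (simplified: no scientific-notation handling)."""
    return int(str(abs(x)).lstrip("0.")[0])

def benford_expected(d):
    """Benford's law: P(leading digit = d) = log10(1 + 1/d)."""
    return math.log10(1 + 1 / d)

# Invented payment amounts; a real test needs a full dataset.
amounts = [1200, 130, 1900, 210, 2500, 310, 470, 110, 160, 9800]
observed = Counter(leading_digit(a) for a in amounts)
for d in range(1, 10):
    print(d, observed[d] / len(amounts), round(benford_expected(d), 3))
```

Under Benford's law a 1 should lead roughly 30% of values and a 9 under 5%; a dataset where the frequencies diverge sharply from that curve warrants a closer look.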
## Stage 3: Presenting Data
| Finding statistical patterns in spending data (such analysis depends on high data quality)
|| R (free), SPSS (proprietary) and other statistical software for clustering and anomaly detection (also see note).
|| Examples: data from Supervizor has been used to track changes in spending on contractors across changes in government.
A note on statistical analysis software can be found above.
## Publishing Data
| Basic visualisation, time series, bar charts
|| DataWrapper, Tableau Public, Many Eyes, Google Tools
| More advanced visualisation
|| TileMill, Fusion Tables, Kartograph, QGIS
|| Basic -> Advanced
| Creating a citizen’s budget
|| OpenSpending.org, off-the-shelf tools listed above. A Disqus commenting module was added to OpenSpending for commenting and feedback.
|| OpenSpending.org - making a custom visualisation - basic. Making a custom site enabling discussion - advanced.
|| Used in e.g. OpenBudgetOakland
See also the resources section in the Spending Data Handbook
Note: many of these tools have difficulty working in Internet Explorer (especially older versions), but we have yet to find more powerful tools that also work there.
## A note on Network Analysis
As you will see from the case studies in the videos, Network Analysis is an area that more and more people are looking into with regard to public procurement and contracts.
Network visualisations are commonly offered as a solution to this problem; however, we offer a note of caution: use them sparingly. Because of the amount of data they are often applied to, they can be overwhelming, and the average non-expert can find them hard to interpret.
Often the kinds of question a network visualisation can answer - e.g. "who is best connected?" or "are there links between person X and person Y?" - could more easily be answered with a searchable database of connections.
It may also be wise to distinguish tools suitable for investigating the data from tools used to present it. In the latter case, clarity and not overloading the visualisation will most likely yield a clearer result, so this is one area in which custom infographics may win out in terms of delivering value.
### Existing solutions for network mapping:
For producing network visualisations there are currently several open source solutions:
* [Gephi](https://gephi.org/) (Again, note that Gephi has non-visualisation functions for exploring the data, which at times may be more useful for examining interconnections than the visualisations themselves.)
* [Mapa 76](http://mapa76.com/) - This is also interesting due to the function which is being developed to extract individual entities.
* [RelFinder](http://www.visualdataweb.org/relfinder.php) - Based on DBpedia, this tool structures and maps out relations between entities based on the Wikipedia articles in which they feature.
* Google Fusion Tables has a network function
* NodeXL is a powerful network toolkit for Excel.
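The questions mentioned above ("who is best connected?", "are there links between X and Y?") can often be answered without any visualisation at all. The sketch below builds an adjacency map from a toy, invented contract-award dataset and answers both with a degree count and a breadth-first search.

```python
from collections import deque

# Invented contract-award links between officials and companies.
edges = [("Official A", "Acme Ltd"), ("Official A", "Bravo plc"),
         ("Official B", "Bravo plc"), ("Official C", "Acme Ltd"),
         ("Official A", "Cedar Inc")]

# Build an undirected adjacency map.
graph = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

# "Who is best connected?" is just a degree count.
best = max(graph, key=lambda node: len(graph[node]))
print(best)  # Official A

def connected(start, goal):
    """Breadth-first search: is there any chain of links from start to goal?"""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in graph[node] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return False

print(connected("Official B", "Official C"))  # True
```

This is the searchable-database approach in miniature: the same queries that a dense network diagram makes the reader work out by eye come back as direct answers.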
## Some favourite examples of non-network ways of presenting hierarchies, relationships and complex systems:
* [Connected China (Reuters)](http://connectedchina.reuters.com/) - enables the user to easily see family connections, political coalitions, leaders and connections. Additionally - it gives a detailed organisational diagram of the Communist Party of China, as well as timelines of people’s rise to power.
* [Little Sis](http://littlesis.org/). This is an American database of political connections, including party donations, career histories and family members. Read their About Page for more details of the questions they seek to answer.
### Further reading:
| Need a place online to store and manage raw data, especially from Freedom of Information requests.
|| DataNest, CKAN, Socrata - various Data Portal Software options.
|| Basic to use.
|| Can require a programmer to get running and set up a new instance.
|Individual storage of and online collaboration around datasets
|| Google Spreadsheets, Google Fusion Tables, Github
|| 1-3 Basic. Github - intermediate.
- A pipeline for local councils to address privacy concerns about publishing transaction-level data. In the UK, despite clear guidelines about what should be removed from data before publication, a few councils have published sensitive data over the past year. Some companies are looking at maintaining suppression lists for this data; ideally, however, this should be done within government, prior to publication, so workflows need to be developed for this.
- Tools to help spot absence of publication as it happens. Currently, civil-society led initiatives such as the Open Budget Survey can only monitor publication of key budget documents retrospectively, and using large amounts of people power. There are a couple of possibilities which spring to mind:
- In the UK, the OpenSpending team have been working with the data.gov.uk team to produce automated reports that help those enforcing transparency obligations see which departments are not compliant with the regulations. The reports check timeliness as well as the structure and format of the data. This proved very successful at prompting data release: departments were given advance warning that the tool would be featured and that any department without up-to-date data would be flagged in red, and by launch date nearly all departments had updated their data. This is possible where:
- The data are published via a central platform (e.g. data.gov.uk)
- The data are machine readable, so a computer program can quickly ascertain whether the required fields are present.
- There is a standard layout for the data, so a computer can quickly verify whether column headings are correct and all present.
- Introducing a calendar of expected publication dates for a particular dataset, so that organisations know when a document is due and can enforce its release. This could be done either on a country-by-country basis or simply by aligning with internationally recognised best-practice guidelines.
- Tools which help to remove duplication of effort. For example, if one organisation has already cleaned up or extracted data from a PDF, encouraging them to share that data so another organisation does not have to waste time doing the same.
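The automated compliance checks described above largely boil down to verifying that a machine-readable file has the expected columns. A minimal sketch, assuming a required layout of date/supplier/amount/department (the real data.gov.uk checks are more elaborate and also test timeliness):

```python
import csv
import io

# Assumed required layout; real guidelines specify the actual columns.
REQUIRED = ["date", "supplier", "amount", "department"]

def check_layout(csv_text):
    """Return (missing, unexpected) column names versus the required layout."""
    headers = next(csv.reader(io.StringIO(csv_text)))
    headers = [h.strip().lower() for h in headers]
    missing = [c for c in REQUIRED if c not in headers]
    extra = [h for h in headers if h not in REQUIRED]
    return missing, extra

sample = "Date,Supplier,Amount\n2013-01-01,Acme Ltd,1200\n"
print(check_layout(sample))  # (['department'], [])
```

Run over every file on a central platform, a check like this is what makes the red-flag compliance reports possible: machine-readable data plus a standard layout means non-compliance can be detected automatically rather than by large amounts of people power.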