Spending Data: The Tool Ecosystem
There is a set of staple tools that can be used to tackle many of the issues highlighted by the organisations in this report. For each one, we’ve outlined what the tool is, what it’s useful for and what the barrier to entry is.
We continue to hunt for more and better tools to do the job and hope that some of the problems, such as governments publishing their data in PDFs or HTML, will soon be irrelevant, so that we can all focus on more important things.
If you would like to suggest a tool to be added to this ecosystem, please email info [at] openspending.org.
For each tool, we’ve outlined its use and the barrier to entry. Here’s a guide to the rough categorisation we used:
Basic = An off-the-shelf tool that can be learned and used independently within 1 day. No installation on servers etc. required.
Intermediate = Takes between 1 day and 1 week to master basic functionality. May require tweaking existing code, but not writing new code.
Advanced = Requires code.
## Stage 1: Extracting and getting data
|Data not available||Freedom of Information portals (e.g. What Do They Know, Frag den Staat).||Basic - though some education may be required to inform people that they have the right to ask, how to phrase an FOI request, whether it is possible to submit these requests electronically etc.||While Freedom of Information portals are a good way of getting data, results often end up scattered. It would be useful to have results structured into data directories, so that successful responses could be searched together with proactively released data and there was one common source for data.|
|Data available online but not downloadable (e.g. in HTML tables on webpages).||For simple sites (information on an individual webpage): Google Spreadsheets and the ImportHTML function, or the Google scraper extension (basic). For more complex sites (information spread across numerous pages), a scraper will be required. Scrapers are ways to extract structured information from websites using code. Scraperwiki is a useful online tool that makes this easier (advanced).||For the basic level, anyone who can use a spreadsheet and functions can use it. It is not, however, a well-known command, and awareness must be spread about how it can be used. (People are often daunted because they presume scraping involves code.) Scraping using code is advanced and requires knowledge of at least one programming language.||The need to be able to scrape was mentioned in every country we interviewed in the Athens to Berlin Series. For more information, or to learn to start scraping, see the School of Data course on Scraping.|
|Data available only in PDFs (or worse, images)||A variety of tools are available to extract this information. The most promising non-code options are ABBYY FineReader (not free) and Tabula (new software, still a bit buggy, and users must be able to host it themselves).||Most require knowledge of coding, though some progress is being made on non-technical tools. For more information and to see some of the advanced methods, see the <a href="http://schoolofdata.org/handbook/courses/extracting-data-from-pdf/">School of Data course</a>.||Note: these tools are still imperfect, and it is still vastly preferable to advocate for data in the correct formats rather than teach people how to extract it. Recently published guidelines coming directly from government in the UK and US can now be cited as examples when asking for data in the required formats.|
|Leaked data||Several projects made use of secure dropboxes and services for whistleblowers.||Advanced - security of utmost concern.||For example: MagyarLeaks|
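To make the "advanced" scraping route in the table above a little less daunting, here is a minimal sketch of what a scraper for a single HTML table can look like in Python, using only the standard library. It is an illustration rather than a recommended implementation: the class and function names and the sample page are made up for this example, and it assumes a simple table with no nested tables or merged cells. For a single page, the basic route does much the same job with the Google Spreadsheets formula `=IMPORTHTML(url, "table", index)`.

```python
import csv
import io
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the rows of every <table> in an HTML document.

    Hypothetical helper for illustration; nested tables are not handled.
    """

    def __init__(self):
        super().__init__()
        self.tables = []   # each table is a list of rows
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def _flush_cell(self):
        # Store the finished cell with its whitespace collapsed.
        if self._cell is not None and self._row is not None:
            self._row.append(" ".join("".join(self._cell).split()))
        self._cell = None

    def _flush_row(self):
        # Store the finished row in the most recent table.
        self._flush_cell()
        if self._row is not None and self.tables:
            self.tables[-1].append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._flush_row()
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._flush_row()
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._flush_cell()  # copes with cells that were never closed
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._flush_cell()
        elif tag in ("tr", "table"):
            self._flush_row()


def html_table_to_csv(html, index=0):
    """Return the table at position `index` (0-based) as CSV text."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(parser.tables[index])
    return buffer.getvalue()


# Made-up sample page standing in for a published spending table.
SAMPLE_HTML = """
<table>
  <tr><th>Department</th><th>Amount</th></tr>
  <tr><td>Health</td><td>1200</td></tr>
  <tr><td>Education</td><td>950</td></tr>
</table>
"""

if __name__ == "__main__":
    print(html_table_to_csv(SAMPLE_HTML))
```

In practice the HTML would be fetched from the live page (for example with `urllib.request.urlopen`) rather than pasted in, and a scraper for data spread across many pages would loop over those pages and append the rows to one file.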
## Stage 2: Cleaning, Working with and Analyzing Data