With all preconditions in place, the data can be collected. Where should you start? What is relevant? This chapter goes into the details of collecting and identifying data. Collecting data can be approached from two angles: quick wins and thorough data management. Which angle fits best depends largely on the infrastructural choices within your organisation. Look at your strategy: Where will the data be managed? Will it be managed centrally or processed by multiple units?
General collection process
Create a process for collecting data that suits your situation. The steps described below may come in handy when designing it: mapping the currently available datasets, prioritising them, practising the process, choosing which topics to publish, and deciding on publishing categories.
Different steps of the collection process
Map the currently available datasets
Start your Open Data initiative by creating an overview of the data that is already available in your organisation. This is a quick win: the data is there and you will have a list of all data and where it is managed. Ask your data-managing colleagues to help you with this.
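The overview described above can be kept in something as simple as a spreadsheet. As a minimal sketch, the snippet below records each dataset together with where it is managed and exports the list as CSV. The field names, example entries and contact addresses are illustrative assumptions, not a prescribed schema.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class DatasetRecord:
    """One row of the inventory; all field names are illustrative."""
    title: str
    managing_unit: str  # where in the organisation the data is managed
    contact: str        # data-managing colleague who can answer questions
    location: str       # system, file share or database holding the data


def inventory_to_csv(records):
    """Serialise the inventory to CSV text so it can be shared and reviewed."""
    buffer = io.StringIO()
    column_names = [f.name for f in fields(DatasetRecord)]
    writer = csv.DictWriter(buffer, fieldnames=column_names)
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buffer.getvalue()


# Hypothetical example entries.
inventory = [
    DatasetRecord("Public transport timetables", "Mobility unit",
                  "mobility@example.org", "planning database"),
    DatasetRecord("List of schools", "Education unit",
                  "education@example.org", "shared spreadsheet"),
]

print(inventory_to_csv(inventory))
```

Keeping the inventory in a plain, exportable format makes it easy for colleagues to review and extend the list.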
Prioritise the datasets
Not all datasets are relevant to publish right away. To prioritise your list, look at the following criteria:
The datasets that meet these requirements should be prioritised: these are your quick wins. With this list, you have a complete overview of the data and you have identified what can be published, what cannot, and what should be published first. Later on, you can choose to prioritise by demand or other parameters.
Recommendation: Create Quick Wins and start with those. Practise your collection process first to get acquainted with it. You will be able to improve it and to answer questions that are asked about it.
Go through the collection process. What steps did you take? Who is responsible for the next part of the process? What is the standard process for collecting and prioritising data? What will happen if new data is created or a dataset is updated? Learn by doing and document the steps.
The Irish Best Practice Handbook describes a best practice for auditing your existing data and suggests how to become aware of the datasets that are currently available within the organisation. See the Best Practice Statement below.
Types of data to publish: the G8 Open Data Charter
Data is created, stored, and distributed across a large variety of topics and categories. However, not all types of data are of equal relevance. In 2013, the G8 came together to discuss governmental transparency, innovation and accountability. This discussion led to the creation of the “G8 Open Data Charter” (Cabinet Office, 2013): a summary of visions and principles for creating a transparent government, for opening up data, and for the quality and quantity of that data.
Part of this charter holds valuable and useful guidelines concerning topics, data types and formats, and quality. The most relevant, high-value topics are summarised in the following 14 categories:
|Data Category|Example of Datasets|
|Companies|Company/business register|
|Crime and Justice|Crime statistics, safety|
|Earth observation|Meteorological/weather, agriculture, forestry, fishing, and hunting|
|Education|List of schools, performance of schools, digital skills|
|Energy and Environment|Pollution levels, energy consumption|
|Finance and contracts|Transaction spend, contracts let, call for tender, future tenders, local budget, national budget (planned and spent)|
|Geospatial|Topography, postcodes, national maps, local maps|
|Global Development|Aid, food security, extractives, land|
|Government Accountability and Democracy|Government contact points, election results, legislation and statutes, salaries (pay scales), hospitality/gifts|
|Health|Prescription data, performance data|
|Science and Research|Genome data, research and educational activity, experiment results|
|Statistics|National Statistics, Census, infrastructure, wealth, skills|
|Social mobility and welfare|Housing, health insurance and unemployment benefits|
|Transport and Infrastructure|Public transport timetables, access points, broadband penetration|
The G8 High Value categories of data
The purpose of this list of categories is to ensure that Data Holders focus on the release of the right and most relevant types of data. This does not mean that other categories of data cannot be published. The list above gives an indication of the topics that should have the highest priority, as these datasets are indicated as datasets with the highest potential value.
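One way to apply this guidance is to tag each dataset in your inventory with a category and move those in a high-value category to the top of the publishing list. The snippet below is a minimal sketch of that idea. The dataset entries, the `category` field and the shortened category set are assumptions made for illustration; in practice the set would hold every category from the table above.

```python
# A shortened selection of the high-value categories, for brevity.
HIGH_VALUE_CATEGORIES = {
    "Education",
    "Health",
    "Statistics",
    "Transport and Infrastructure",
}


def prioritise(datasets):
    """Return datasets with high-value categories first.

    sorted() is stable, so datasets keep their original relative order
    within the high-value group and within the remaining group.
    """
    return sorted(
        datasets,
        key=lambda dataset: dataset["category"] not in HIGH_VALUE_CATEGORIES,
    )


# Hypothetical inventory entries.
datasets = [
    {"title": "Office canteen menu", "category": "Internal"},
    {"title": "List of schools", "category": "Education"},
    {"title": "Staff parking plan", "category": "Internal"},
    {"title": "Public transport timetables",
     "category": "Transport and Infrastructure"},
]

for dataset in prioritise(datasets):
    print(dataset["title"])
```

Datasets outside the high-value categories stay in the list. They are only moved down, which matches the point that other categories can still be published.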
Besides the categories used for gathering data, there are the categories used for publishing it. You might want to publish your data under a different set of categories than the G8 list. Other portals have created their own sets of categories as well. Consider your own data: under which categories are you going to publish it?
To give you an idea of how to categorise your data, here is an example. Look at the categorisation from a re-user's point of view: imagine that you are looking for a single file. How would you navigate towards it? Both a large and a small number of categories have pros and cons. Try to find out what suits your purpose best and which structure you, imagining yourself as a re-user, find most logical. The one requirement is that the categorisation is automated through metadata. The figure below shows an example of the categorisation used by the European Data Portal, linked to the DCAT Application Profile detailed in the next sections of this chapter.
Example from http://www.europeandataportal.eu/
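Automating categorisation through metadata means that the category listing is derived from each dataset's own description rather than maintained by hand. The snippet below is a minimal sketch of that idea, assuming each dataset carries a list of theme labels in a `themes` field. The field name, the labels and the `Uncategorised` fallback are illustrative assumptions and are not taken from the European Data Portal or the DCAT Application Profile.

```python
from collections import defaultdict


def group_by_theme(datasets):
    """Build a category -> dataset titles listing from dataset metadata.

    A dataset may carry several themes and is then listed under each one.
    Datasets without any theme end up under "Uncategorised".
    """
    groups = defaultdict(list)
    for dataset in datasets:
        themes = dataset.get("themes") or ["Uncategorised"]
        for theme in themes:
            groups[theme].append(dataset["title"])
    return dict(groups)


# Hypothetical metadata records.
datasets = [
    {"title": "Public transport timetables", "themes": ["Transport"]},
    {"title": "School bus routes", "themes": ["Transport", "Education"]},
    {"title": "List of schools", "themes": ["Education"]},
    {"title": "Legacy export", "themes": []},
]

for theme, titles in sorted(group_by_theme(datasets).items()):
    print(theme, titles)
```

Because the listing is computed from the metadata, changing the set of categories only requires updating the theme labels, not rearranging files by hand.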