Now that the data has been collected, it should be prepared for publication. Raw data is rarely suitable for publication as-is, so a set of actions is needed to get it ready. This section provides more information about preparing the data, covering four topics: Quality, Technical Openness, Legal Openness and Metadata.
Steps to prepare data
Ensuring data quality
This chapter highlights the following aspects of data quality: content quality, timeliness, and consistency.
Aspects of data quality
The usefulness of data is largely determined by its quality. Next to discoverability, quality is one of the biggest influences on the success of Open Data. Quality concerns many aspects; this chapter covers the completeness, cleanness and accuracy of data.
Is the data complete?
Is your data set complete? Completeness concerns various aspects. Every data set should:
Is the data clean?
Is your data set clean? Cleanness concerns various aspects. Check the following aspects:
Always check your data on these points and make sure that your data set does not violate any of the legal constraints mentioned in “Legal Openness”.
Is the data accurate?
Is your data set accurate? Accuracy concerns various aspects. The most important aspects regarding accuracy:
Questions regarding accuracy
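The completeness and cleanness checks described above can also be automated. Below is a minimal sketch in Python; the required field names and the sample records are illustrative assumptions, not part of any standard.

```python
# A minimal sketch of automated quality checks on tabular records.
# The required field names below are illustrative, not prescribed.
REQUIRED_FIELDS = {"name", "date", "value"}

def missing_fields(record):
    """Completeness: return required fields that are absent or empty."""
    return {f for f in REQUIRED_FIELDS if not record.get(f)}

def find_duplicates(records):
    """Cleanness: return records that occur more than once."""
    seen, duplicates = set(), []
    for record in records:
        key = tuple(sorted(record.items()))
        if key in seen:
            duplicates.append(record)
        seen.add(key)
    return duplicates

records = [
    {"name": "Waste collected", "date": "2015-01", "value": "42"},
    {"name": "Waste collected", "date": "2015-01", "value": "42"},  # exact duplicate
    {"name": "Waste recycled", "date": ""},                         # incomplete
]

print(missing_fields(records[2]))    # fields to fix before publication
print(len(find_duplicates(records)))
```

Checks like these will not replace a manual review, but they catch the mechanical problems cheaply before every release.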
Data changes over time. Historical data remains stable, but recent data is updated over time. It is therefore important to check data regularly with regard to its timeliness. For consistency purposes, it is wise to set up an update process that keeps the data up-to-date. Make sure the data contains a notion of its timeliness. This topic is closely related to the maintenance of datasets.
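As a sketch of such an update process, the check below flags a dataset as stale when its last-modified date has fallen outside its declared update interval. The dates and the 30-day interval are illustrative assumptions.

```python
# A sketch of a timeliness check for an update process.
from datetime import date, timedelta

def is_stale(last_modified, update_every_days, today):
    """True when the dataset has missed its expected update window."""
    return today - last_modified > timedelta(days=update_every_days)

# A monthly dataset last updated three months ago is overdue:
print(is_stale(date(2015, 1, 1), 30, date(2015, 4, 1)))  # True
```

Running such a check on a schedule gives the maintenance process a concrete trigger instead of relying on memory.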
Reading through the quality aspects of data, the consistency of the presentation of your data is of major importance. Imagine re-users correlating data from various sources whose datasets all differ in accuracy, use of terms and timeframe. For example, if you change the field names of the waste-management data you collect each year, the data cannot be compiled from one year to the next. This makes datasets difficult to use, as it requires a large manipulation effort. Therefore, use standards and be consistent in publishing datasets of equal quality.
Preparing data: technical openness
Once the data has been prepared in terms of quality, it can be made technically open. This chapter introduces several concepts: Linked Data, metadata and the 5-Star Open Data Model. To understand the 5-Star Model, you first need to understand the basics of Linked Data.
The concept of Linked Data increases the interoperability and discoverability of datasets. Linked Data is not the same as Open Data: whereas Open Data concerns the openness of the data itself, Linked Data concerns the way data is published. Open Data can be published as Linked Data, or datasets can be enriched with linked metadata. This is where it gets more technical. The definition of Linked Data:
“Linked Data is a set of design principles for sharing machine-readable data on the web to be used by public administrations, business and citizens” (Berners-Lee, 2013)
Linked Data consists of pieces of information that are linked through a graph structure. As opposed to other relational descriptions of data, in Linked Data a machine can walk through the graph and understand the content. This is seen as a revolution in the area of data storage and sharing: a computer can, to some extent, qualitatively interpret the data. This is possible because the data is enriched with uniform descriptors. By means of these descriptors, the data is no longer a set of static content, but is described and can therefore be interpreted, regardless of distinguishing factors such as language or file type.
We will provide you with a comprehensible example from the Educational Curriculum for the usage of Linked Data (EUCLID) module 1 (EUCLID, 2014) and we will explain the basic concepts attached to Linked Data through this example.
Datasets usually encode facts about individual objects and events, such as the following two facts about the Beatles (shown here in English rather than a database format):
The Beatles are a music group
The Beatles are a group
There is something odd about this pair of facts: having said that the Beatles are a music group, why must we add the more generic fact that they are a group? Must we list these two facts for all music groups, not to mention all groups of acrobats or actors, etc.? Must we also add all other consequences of being a music group, such as performing music and playing musical instruments?
Ontologies allow more efficient use of data by encoding generic facts about classes (or types of object), such as the following:
Every music group is a group
Every theatre group is a group
It is now sufficient to state that the Beatles (and the Rolling Stones, etc.) are music groups, and the more general fact that they are groups can be derived through inference. Ontologies thus enhance the value of data by allowing a computer application to automatically infer many essential facts that may be obvious to a person, but not to a program.
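The inference described above can be sketched in a few lines of Python. The class names follow the Beatles example; the two dictionaries stand in for a real ontology.

```python
# A sketch of ontology-style inference: generic class facts let a program
# derive "The Beatles are a group" without that fact being stated directly.
instance_of = {"The Beatles": "music group", "The Rolling Stones": "music group"}
subclass_of = {"music group": "group", "theatre group": "group"}

def classes_of(thing):
    """Return the direct class plus every superclass reachable by inference."""
    classes = set()
    cls = instance_of.get(thing)
    while cls is not None:
        classes.add(cls)
        cls = subclass_of.get(cls)  # walk up the class hierarchy
    return classes

print(classes_of("The Beatles"))  # {'music group', 'group'}
```

The generic fact is stated once in `subclass_of` and applies to every music group, which is exactly the economy that ontologies provide.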
Linked Data makes use of several techniques, among which are RDF, vocabularies and URIs. Many data catalogues and Open Data portals that aim to publish Linked Open Data use predetermined vocabularies in order to remain uniform. The European Union's catalogue specification is called DCAT-AP; it is therefore recommended to use the DCAT Application Profile. A brief description of these terms is presented below.
The Resource Description Framework (RDF) is the basic principle of Linked Data. It is the general syntax for representing data on the web. A statement in this syntax is built from three descriptors (subject, predicate and object), which together are called a triple; each descriptor is typically a link in the form of a URI (Uniform Resource Identifier).
By describing an object with such a triple, it becomes linked. As the same term can carry different meanings, the structured description through the RDF triple overcomes this ambiguity. Furthermore, as many terms are described through RDF terms, they can be linked to each other.
The idea of an RDF Triple
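As an illustration, a triple can be written down as a (subject, predicate, object) tuple. The DBpedia, RDF and FOAF URIs below are real vocabulary terms, but the tiny two-triple graph itself is only a sketch.

```python
# A sketch of RDF triples as (subject, predicate, object) tuples.
triples = [
    ("http://dbpedia.org/resource/The_Beatles",
     "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
     "http://dbpedia.org/ontology/Band"),
    ("http://dbpedia.org/resource/The_Beatles",
     "http://xmlns.com/foaf/0.1/name",
     "The Beatles"),
]

def objects(subject, predicate):
    """Walk the graph: all objects linked to a subject by a predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]

print(objects("http://dbpedia.org/resource/The_Beatles",
              "http://xmlns.com/foaf/0.1/name"))  # ['The Beatles']
```

Because subject and predicate are URIs rather than free text, any machine that knows the vocabularies can query the graph the same way.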
A basic introduction to RDF: http://www.linkeddatatools.com/introducing-rdf
A frequently used RDF technique on data portals is RDFa: embedding RDF in HTML. A comprehensible quick presentation on RDFa:
Always publish your metadata embedded in HTML with RDFa. An example:
An RDFa embedded in HTML example
The term URI stands for Uniform Resource Identifier and can take the form of a Uniform Resource Name (URN) or a Uniform Resource Locator (URL). Its main function is to identify something. In the case of Linked Data, the URIs used in triples generally take the form of a URL (http://www.europeandataportal.eu/) or of vocabulary-specific identifiers. Detailed information about URIs:
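Python's standard library can take a URL-shaped URI apart into its components, which shows what each part identifies. The dataset path in the example is hypothetical.

```python
# Dissecting a URL-shaped URI with the standard library.
from urllib.parse import urlparse

uri = "http://www.europeandataportal.eu/data/datasets/example"  # path is hypothetical
parts = urlparse(uri)
print(parts.scheme)  # 'http'
print(parts.netloc)  # 'www.europeandataportal.eu'
print(parts.path)    # '/data/datasets/example'
```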
A deep-dive presentation about Linked Data:
European Data Portal eLearning Module on Linked Data
Metadata has a large influence on the re-use of Open Data: it increases the discoverability and re-use of your data. Therefore, take the time to inform re-users about the quality of the data set by providing rich metadata. This improves the usability of the data set. Metadata has been defined by the W3C as (W3C, 2015):
“Metadata is structured information that describes, explains, locates or otherwise makes it easier to retrieve, use, or manage an information resource. Metadata is often called ‘data about data’.”
In a nutshell, metadata helps:
Important reasons to add Metadata
Recommendation: For a full description of best practices with regard to metadata, please go to the W3C website:
Here is an example of the metadata that would be used to describe the Beatles:
Describing the Beatles as metadata
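A sketch of what such a metadata record could look like, using Dublin Core-style field names; the values are illustrative.

```python
# "Data about data": a descriptive metadata record for a Beatles dataset.
# Field names follow Dublin Core terms; the values are illustrative.
metadata = {
    "title": "The Beatles",
    "description": "Discography of the music group The Beatles",
    "subject": "music group",
    "language": "en",
    "format": "text/csv",
    "modified": "2015-06-01",
}

# None of these entries are part of the data itself; each one helps a
# re-user retrieve, interpret or manage the resource.
print(metadata["subject"])
```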
Datasets can be enriched by descriptions, making their interpretation easier. Metadata within the context of Linked Data has even more value: by enriching metadata with URIs, the data can be linked. This enhances the discoverability and interoperability of data considerably. If you publish metadata with your data, it is recommended to enrich the metadata with URIs. Note that metadata is a necessity if you want your data to be harvested by portals such as the European Data Portal.
Recommendation: Always publish your metadata as Linked Data. This increases the discoverability and interoperability of your datasets
Metadata Best Practices
Providing high-quality metadata is a complex but necessary practice. The W3C has developed guidelines and best practices to support data holders. Furthermore, interoperability with the European Data Portal is crucial, as it avoids costly crosswalks and mappings between datasets. Hence, the use of DCAT-AP is strongly encouraged. To summarise: publish the metadata together with the data, use a machine-readable format and standard terms to define the metadata, and describe the overall features of the data set with information about local parameters, licence, origin and quality.
Summary of metadata best practices
Although a metadata set is closely related to the data set it describes, it can sometimes be useful to provide the metadata in multiple places. This enables different Open Data portals to address different audiences. For example, the European Data Portal strives to be a place where all Open Data portals of the EU Member States share the metadata of their Open Data. In this way, citizens and businesses have one single place to access the metadata, in other words the data about the data, available from all over Europe. The participating countries' own Open Data portals stay active as portals hosting the data and responsible for their respective regions, enabling users with specific needs to use a fitting Open Data portal. An even more fine-grained approach can be seen in small Open Data portals maintained by cities or other administrative areas.
Thus, you will often find sets of metadata in different Open Data portals describing the same underlying data set. Sometimes these are copies from the portal the data was originally provided on; other times less metadata is shown, because the metadata schema applied does not allow the display of the full metadata. Providing the metadata to different portals can either be done technically by the Open Data portal itself (called harvesting, explained next) or manually.
Steps to publish metadata
Large Open Data portals often act as aggregators of smaller Open Data portals. They regularly check for new data in smaller Open Data portals and copy the metadata found so that users will find it there as well. This process is called harvesting. Open Data portals usually do this in the background without the user noticing it. Open Data portals often provide an API with which they provide their data or metadata in a machine-readable format. These APIs can be used by other Open Data portals, or any other user, to read the data and copy it into their own database. Sometimes the data has to be transformed to a different format, because a different categorisation is used.
Depending on the API protocol, the harvesting entity can apply filters if, for example, only a subset of the data from the harvested portal is desired. By using harvesting, portals will have a greater database and can address a bigger or more specific audience without having to rely on users providing the metadata manually.
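A harvesting step might be sketched as follows: read another portal's machine-readable catalogue, apply a filter, and map the records to the local schema. The field names, the keyword filter and the sample records are all illustrative assumptions, not a real portal API.

```python
# A sketch of a harvesting step over an already-parsed remote catalogue.
def harvest(remote_records, keyword=None):
    """Copy remote metadata records, optionally filtered by a keyword."""
    harvested = []
    for rec in remote_records:
        if keyword and keyword not in rec.get("keywords", []):
            continue  # filter: only a subset of the remote catalogue is desired
        harvested.append({
            "title": rec["title"],           # map remote fields to the local schema
            "source_portal": rec["portal"],
        })
    return harvested

remote = [
    {"title": "Air quality 2015", "portal": "city-portal", "keywords": ["environment"]},
    {"title": "Bus timetable", "portal": "city-portal", "keywords": ["transport"]},
]
print(harvest(remote, keyword="environment"))
```

In practice the remote catalogue would be fetched over the portal's API and the mapping would follow a shared schema such as DCAT-AP, but the filter-and-map structure is the same.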
As an example, discover the requirements of the European Data Portal http://ppe.paneuropeandataportal.eu/en/content/providing-data/how-to-be-...
Mapping the metadata
For easy metadata inclusion, map the metadata: use the standard Linked Open Data vocabularies (DCAT-AP) to create a table of properties and URIs, so the metadata can easily be added to the file. Distinguish between metadata about the data set itself (title, description, licence) and metadata about the distribution (URL, format, status).
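Such a mapping could be sketched as two small property tables, one for the dataset and one for the distribution. The property URIs come from the Dublin Core, DCAT and ADMS vocabularies used by DCAT-AP; the values are illustrative.

```python
# Dataset-level metadata, keyed by vocabulary property URIs (values illustrative).
dataset_metadata = {
    "http://purl.org/dc/terms/title": "Waste collection 2015",
    "http://purl.org/dc/terms/description": "Monthly collected waste per district",
    "http://purl.org/dc/terms/license": "https://creativecommons.org/licenses/by/4.0/",
}

# Distribution-level metadata: how one concrete file of the dataset is obtained.
distribution_metadata = {
    "http://www.w3.org/ns/dcat#downloadURL": "http://example.org/waste-2015.csv",  # illustrative URL
    "http://purl.org/dc/terms/format": "text/csv",
    "http://www.w3.org/ns/adms#status": "Completed",
}
```

Keeping the two levels separate means one dataset can carry several distributions (CSV, JSON, ...) without duplicating the descriptive fields.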
The European Commission has created a Linked Open Data vocabulary specification called DCAT-AP. This increases interoperability between all data portals in Europe. For instance, when talking about file formats, there are various standards.
Please look at the list of the most recent DCAT-AP publications to learn more about controlled vocabularies:
For more general information and training about metadata, please look at the following links:
For international interoperability, it is useful to make use of multilingual thesauri: standard vocabulary sets of words that can be translated to other languages more easily. Eurovoc is such a multilingual thesaurus. Please see: http://eurovoc.europa.eu/
Publishing the metadata
Most portal software solutions come with integrated metadata creation modules. In this case, metadata is created by filling in predetermined fields or by choosing from drop-down lists.
The 5-Star Open Data model
Publishing high-quality Open Data requires some effort. The W3C has created a basic model for Open Data with regard to quality: the 5-Star Open Data model. The 5 stages of Open Data are:
| Stars | Description |
| --- | --- |
| ★ | Make your stuff available on the web (whatever format) under an open licence |
| ★★ | Make it available as structured data (e.g. Excel instead of image scan of a table) |
| ★★★ | Use non-proprietary formats (e.g. CSV instead of Excel) |
| ★★★★ | Use URIs to denote things, so that people can point at your stuff |
| ★★★★★ | Link your data to other data to provide context |
Descriptions of all stages of the 5-star Open Data Model
The 1-Star Stage: Publishing your data
Stage 1 of the 5-Star Open Data model is achieved by publishing your data. This can be done in various ways: via download, bulk download or APIs.
The 2-Star Stage: Making it available as structured data
The power of Open Data lies in its re-usability, which stimulates the interoperability of systems and services. Data formats can be clustered into 2 categories:
Structured data is developed to be processed by machines and is thus different from digitally accessible information. Structured data is machine-readable and more interoperable. See Table 3 for a shortlist of machine-readable formats.
| Machine-readable | Geodata machine-readable | Less readable |
| --- | --- | --- |
| JSON | Shapefile | RTF (for text) |
| CSV | KML | PDF (for text) |
The 3-Star Stage: Using non-proprietary formats
Non-proprietary means: not bound to specific software or a specific vendor. For example, an Excel file (.xls) might seem very open, but it is not: it is bound to Microsoft Excel, which means that anyone without Microsoft Office is unable to open the file. Such files are in proprietary formats.
Recommendation: Aim to reach 3-star or higher quality data. It is a process. Do not start with 5-stars. Begin with the quick wins. Any star is good to start with.
To obtain a high-quality data set, it is widely recommended to convert proprietary and non-machine-readable files into open and machine-readable formats.
| Machine-readable | Geodata machine-readable | Less readable | Closed |
| --- | --- | --- | --- |
| JSON | Shapefile | PDF (for text) | Images (PNG, JPG) |
Technical Openness of files
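As a sketch of such a conversion, the snippet below writes in-memory rows out as a 3-star, non-proprietary CSV file using only Python's standard library. The rows and field names are illustrative.

```python
# Producing an open, machine-readable CSV file from in-memory rows.
import csv
import io

rows = [
    {"district": "Centre", "waste_kg": 1200},
    {"district": "North", "waste_kg": 950},
]

buffer = io.StringIO()  # in a real export this would be a file opened for writing
writer = csv.DictWriter(buffer, fieldnames=["district", "waste_kg"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

The same approach works when the rows originate from a spreadsheet export or a database query: once the data is in plain dictionaries, the CSV module handles the open format.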
The 4-Star Stage: Use URIs to denote things
If you publish your data as 4-Star Open Data, you use URIs to denote things. In practice, this means you convert your files to RDF format and/or enrich your metadata with URIs. This is the first step towards Linked Data.
The 5-Star Stage: Link your data to other data to provide context
This is a very advanced stage of Open Data. In this stage, the data is linked to other data in order to provide context, which makes the data highly interoperable and easily discoverable. For more information and examples of all stages of the 5-Star Data model, go to http://5stardata.info/
Preparing data: legal openness
The implementation of Open Data has to be in line with current legislation, and datasets should be published under an open licence, as discussed in the chapter about licensing. These legal implications are of major importance for any stakeholder trying to make use of Open Data.
Your policy should clearly establish a licensing procedure and take into account national and supra-national legislative matters. In addition to the legal aspects covered in your policy, every data set should be published with a licence of its own.
Recommendation: Check the licensing assistant on the European Data Portal for more guidance with regard to choosing a licence. See the European Data Portal website for more information
Consult the legal department of your organisation to make sure your data is legally open or to check if your policy is compliant. It is the responsibility of the publishing organisation to be up to date with all legislative and legal rules.
For more information about why you need to license, please also consult the online training module about this topic: http://europeandataportal.eu/elearning/en/module4/#/id/co-01
Summary of the eLearning Module: Why do we need to license?
In order for data to be open, it should be accessible (this usually means being published online) and licensed for anyone to access, use and share.
In this module we’ll explore the following:
- Why open data needs to be licensed
- How licences unlock the value of open data
- What type of licence suits open data?
- How to provide for open data licensing in the tender, procurement and contracting lifecycle
Even in cases where data has been made available as a public domain dedication without conditions on reuse, an explicit statement is required together with the data to provide users with legal clarity.
The Final Check
For a final check, use a data set preparation checklist:
Go to “Appendix 7 – Publishing best practices” to read some different examples of publishing best practices.
Data set Preparation Checklist