European Data Portal

Preparing data




Now that the data has been collected, it must be prepared for publication. Raw data is rarely fit for publication as-is, so a set of actions is needed to get it ready. This section provides more information about preparing the data, covering four topics: quality, technical openness, legal openness and metadata. 


Steps to prepare data


Ensuring data quality 

This chapter highlights the following aspects of data quality: content quality, timeliness, and consistency. 


Aspects of data quality


Content quality 

The usefulness of data is largely determined by its quality. Alongside discoverability, the quality of Open Data is one of the strongest drivers of its success. Quality has many aspects; this chapter covers the completeness, cleanness and accuracy of data. 

Is the data complete?

Is your data set complete? Completeness concerns various aspects. Every data set should:

  • Contain a header row with a clear description of what each column shows. Once a data set structure is in place, it should not change when new sources are added; the header should also be described in the metadata
  • Be labelled with a version number. Every update should result in a new version number, so that the audience can keep track of changes 
  • Contain information about its origin: what is the data about, where does it come from and for what purpose has it been published?
  • Be given a status: draft, validated or final

Is the data clean?

Is your data set clean? Cleanness concerns various aspects. Check the following points:

  • Empty fields
  • Dummy data and default values: are they correct? 
  • Wrong values
  • Double entries 
  • Privacy sensitive information 
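The cleanness checks above can be sketched in code. The following is a minimal sketch in Python using only the standard library; the function name, field names and the set of dummy values are illustrative assumptions, not part of any standard:

```python
import csv
import io

def check_cleanness(csv_text):
    """Run basic cleanness checks on CSV data: empty fields,
    suspicious dummy/default values and double (duplicate) entries."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    issues = []
    seen = set()
    for i, row in enumerate(rows, start=1):
        for field, value in row.items():
            if value is None or value.strip() == "":
                issues.append(f"row {i}: empty field '{field}'")
            if value in {"N/A", "TODO", "-999"}:  # illustrative dummy values
                issues.append(f"row {i}: dummy value '{value}' in '{field}'")
        key = tuple(row.values())
        if key in seen:
            issues.append(f"row {i}: double entry")
        seen.add(key)
    return issues

data = "name,amount\nAlice,10\nBob,\nAlice,10\n"
for issue in check_cleanness(data):
    print(issue)
```

Privacy-sensitive information cannot be detected this mechanically; that check remains a manual, legal review.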

Always check your data on these points and make sure that your data set does not violate any of the legal constraints mentioned in “Legal Openness”. 

Is the data accurate?

Is your data set accurate? Accuracy concerns various aspects. The most important aspects regarding accuracy:

  • Is the data accurate enough for its potential purpose?
  • Does its accuracy affect its reliability?
  • Are the choices concerning intervals described?
  • Does the data need aggregation or disaggregation? 


Questions regarding accuracy



Timeliness 

Data changes over time. Historical data remains stable, but recent data is updated over time. It is therefore important to check data regularly with regard to its timeliness. For consistency purposes, it is wise to create an update process that keeps the data up to date. Make sure the data contains a notion of its timeliness. This topic is closely related to the maintenance of datasets.


Consistency 

Across all the quality aspects above, the consistency of the presentation of your data is of major importance. Imagine re-users correlating data from various sources, where the datasets all differ in accuracy, terminology and timeframe. For example, if the field names of the data collected for managing waste change each year, the data cannot be combined from one year to the next. This makes the datasets difficult to use, as it requires a large manipulation effort. Therefore, make sure you use standards and are consistent in publishing datasets of equal quality.

Preparing data: technical openness 


The data has now been prepared in terms of quality. In this chapter, several concepts are introduced: Linked Data, metadata and the 5-Star Open Data Model. To understand the 5-Star Model, you first need to understand the basics of Linked Data. 

Linked Data

The concept of Linked Data increases the interoperability and discoverability of datasets. Linked Data is not the same as Open Data: whereas Open Data concerns the openness of the data itself, Linked Data concerns the way data is published, for example by publishing Open Data in a linked format or by enriching datasets with linked metadata. This is where it gets more technical. The definition of Linked Data:

“Linked Data is a set of design principles for sharing machine-readable data on the web to be used by public administrations, business and citizens” (Berners-Lee, 2013) 

Linked Data consists of pieces of information that are linked through a graph connection. As opposed to other relational descriptions of data, in Linked Data a machine can walk through the graph and understand the content. This is seen as a revolution in the area of data storage and sharing: a computer can, to some extent, qualitatively interpret the data. This is possible because the data is enriched with uniform descriptors. By means of these descriptors, the data is no longer a set of static content; it is described and can therefore be interpreted, regardless of distinguishing factors such as language or file type.

We will provide you with a comprehensible example from the Educational Curriculum for the usage of Linked Data (EUCLID) module 1 (EUCLID, 2014) and we will explain the basic concepts attached to Linked Data through this example.


Datasets usually encode facts about individual objects and events, such as the following two facts about the Beatles (shown here in English rather than a database format): 

The Beatles are a music group 
The Beatles are a group 

There is something odd about this pair of facts: having said that the Beatles are a music group, why must we add the more generic fact that they are a group? Must we list these two facts for all music groups, not to mention all groups of acrobats or actors, etc.? Must we also add all other consequences of being a music group, such as performing music and playing musical instruments? 

Ontologies allow more efficient use of data by encoding generic facts about classes (or types of object), such as the following: 

Every music group is a group 
Every theatre group is a group 

It is now sufficient to state that the Beatles (and the Rolling Stones, etc.) are music groups, and the more general fact that they are groups can be derived through inference. Ontologies thus enhance the value of data by allowing a computer application to automatically infer many essential facts that may be obvious to a person, but not to a program. 
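The Beatles example can be made concrete in a few lines of code. The following is a minimal sketch in plain Python (not a real RDF library); it represents facts as subject-predicate-object triples, as RDF does, and derives the generic fact through inference:

```python
# Facts: (subject, predicate, object) triples, as in RDF.
facts = {("The Beatles", "is_a", "music group"),
         ("The Rolling Stones", "is_a", "music group")}

# Ontology: class-level rules, e.g. "every music group is a group".
subclass_of = {"music group": "group", "theatre group": "group"}

def infer(facts, subclass_of):
    """Derive additional is_a facts by walking the class hierarchy."""
    inferred = set(facts)
    changed = True
    while changed:
        changed = False
        for subj, pred, obj in list(inferred):
            if pred == "is_a" and obj in subclass_of:
                new_fact = (subj, "is_a", subclass_of[obj])
                if new_fact not in inferred:
                    inferred.add(new_fact)
                    changed = True
    return inferred

all_facts = infer(facts, subclass_of)
print(("The Beatles", "is_a", "group") in all_facts)  # True
```

Note how the generic fact "The Beatles are a group" is never stated explicitly; the program derives it from the class-level rule, which is exactly what an ontology enables.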

Linked Data makes use of several techniques, including RDF, vocabularies and URIs. Many data catalogues and Open Data portals that aim to publish Linked Open Data use predetermined vocabularies in order to remain uniform. The European Union's catalogue specification is called DCAT-AP (DCAT Application Profile), and its use is recommended. A brief description of these terms is presented below. 


The Resource Description Framework (RDF) is the basic building block of Linked Data: a general syntax for representing data on the web. An RDF statement consists of three parts (a subject, a predicate and an object), which together are called a triple; each part is typically identified by a URI (Uniform Resource Identifier). 
By describing an object with such a triple, it becomes linked. Because the same term can carry different meanings, describing it in the structured form of an RDF triple removes this ambiguity. Furthermore, as many terms are described through RDF, they can be linked to each other.  


The idea of an RDF Triple


A basic introduction to RDF:  http://www.linkeddatatools.com/introducing-rdf 


A frequently used RDF technique on data portals is RDFa: embedding RDF in HTML. A comprehensible quick presentation on RDFa: 


Always publish your metadata embedded in HTML with RDFa. An example:

An RDFa embedded in HTML example
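The example figure is not reproduced here; the sketch below is an illustrative assumption, not taken from the original. It embeds RDF properties in HTML attributes (the vocabulary URL and the property names title and creator come from Dublin Core; the values are invented), and a few lines of standard-library Python show that a machine can read them back out:

```python
from html.parser import HTMLParser

# A minimal RDFa snippet: RDF properties embedded in HTML attributes.
rdfa_html = """
<div vocab="http://purl.org/dc/terms/">
  <span property="title">Waste collection 2015</span>
  <span property="creator">City of Example</span>
</div>
"""

class RDFaExtractor(HTMLParser):
    """Collect the (property, value) pairs a machine can read from RDFa."""
    def __init__(self):
        super().__init__()
        self.properties = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        # Remember the RDFa 'property' attribute of the opened tag, if any.
        self._current = dict(attrs).get("property")

    def handle_data(self, data):
        # Pair the tag's text content with the remembered property.
        if self._current and data.strip():
            self.properties.append((self._current, data.strip()))
            self._current = None

parser = RDFaExtractor()
parser.feed(rdfa_html)
print(parser.properties)
```

The human reader sees an ordinary web page; the machine sees structured metadata. That is the point of RDFa.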


The term URI stands for Uniform Resource Identifier and can refer to a Uniform Resource Name (URN) or a Uniform Resource Locator (URL). Its main function is to identify something. In the case of Linked Data, URIs generally take the form of a URL (http://www.europeandataportal.eu/) or of vocabulary-specific identifiers. Detailed information about URIs:


Additional information

A deep-dive presentation about Linked Data: 





European Data Portal eLearning Module on Linked Data




Metadata

It is important to ensure that your data can be found. The term usually applied to this is the discoverability of data. Metadata is essential for discoverability: it describes the data set itself (e.g. date of creation, title, content, author, type, size). This information about the data needs to be added to catalogues to help users discover the data. If it is published as Linked Data, the discoverability of the data is greatly increased. 

Metadata has a large influence on the re-use of Open Data, as it increases both the discoverability and the re-use of your data. Therefore, take the time to inform the re-user about the quality of the data set by providing rich metadata; this improves the usability of the data set. Metadata has been defined by the W3C Foundation as (W3C Foundation, 2015): 

“Metadata is structured information that describes, explains, locates or otherwise makes it easier to retrieve, use, or manage an information resource. Metadata is often called ‘data about data’.” 

In a nutshell, Metadata helps: 


Important reasons to add Metadata


Recommendation: For a full description of best practices with regard to metadata, please go to the website of the W3C Foundation:

Here is an example of the metadata that would be used to describe the Beatles:


Describing the Beatles as metadata
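The figure itself is not reproduced here, but metadata of this kind can be sketched as a simple structured record. The field names below loosely follow Dublin Core/DCAT terms and the values are illustrative assumptions:

```python
import json

# Metadata: structured "data about data" describing a data set.
metadata = {
    "title": "The Beatles discography",
    "description": "Albums and singles released by the Beatles",
    "creator": "Example Records",   # illustrative publisher name
    "issued": "1970-05-08",
    "format": "CSV",
    "licence": "CC-BY-4.0",
}

# Serialising the metadata in a machine-readable format (JSON).
print(json.dumps(metadata, indent=2))
```

Each field answers a re-user's question (what is this, who made it, when, in which format, under which terms) without the re-user having to open the data itself.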


Datasets can be enriched with descriptions, making their interpretation easier. Metadata within the context of Linked Data has even more value: by enriching metadata with URIs, the data can be linked, which greatly enhances its discoverability and interoperability. If you publish metadata with your data, it is recommended to enrich the metadata with URIs. Note that metadata is a necessity if you want your data to be harvested by data portals such as the European Data Portal.

Recommendation: Always publish your metadata as Linked Data. This increases the discoverability and interoperability of your datasets

Metadata Best Practices

Providing high-quality metadata is a complex but necessary practice. The W3C Foundation has developed guidelines and best practices to support data holders. Furthermore, interoperability with the European Data Portal is crucial, as it avoids costly crosswalks and mappings between datasets; hence the use of DCAT-AP is strongly encouraged. To summarise: publish the metadata with the data, using a machine-readable format and standard terms to define the metadata. In addition, describe the overall features of the data set with information about local parameters, licence, origin and quality. 


Summary of metadata best practices


Publishing metadata

Although a metadata set is closely related to the data set it describes, it can sometimes be useful to provide the metadata in multiple places. This enables different Open Data portals to address different audiences. For example, the European Data Portal strives to be the place where all Open Data portals of the EU Member States share the metadata of their Open Data. In this way, citizens and businesses have one single place to access metadata, or in other words, the data about the data available from all over Europe. The participating countries' own Open Data portals stay active as portals hosting the data and responsible for their respective regions, enabling users with specific needs to use a fitting Open Data portal. An even more fine-grained approach can be seen in small Open Data portals maintained by cities or other administrative areas. 

Thus, you will often find sets of metadata in different Open Data portals describing the same underlying data set. Sometimes these are copies from the portal the data was originally provided on; at other times less metadata is provided, because the metadata schema applied does not allow the display of the full metadata. Providing the metadata to different portals can either be done automatically by the Open Data portal itself (called harvesting, explained next) or manually.


Steps to publish metadata


Harvesting metadata

Large Open Data portals often act as aggregators of smaller Open Data portals. They regularly check the smaller portals for new data and copy the metadata found, so that users will find it there as well. This process is called harvesting. Open Data portals usually do this in the background, without the user noticing it. Portals often provide an API that exposes their data or metadata in a machine-readable format. These APIs can be used by other Open Data portals, or any other user, to read the data and copy it into their own database. Sometimes the data has to be transformed to a different format, because a different categorisation is used. 

Depending on the API protocol, the harvesting entity can apply filters if, for example, only a subset of the data from the harvested portal is desired. By using harvesting, portals build a larger database and can address a bigger or more specific audience without having to rely on users providing the metadata manually. 

As an example, discover the requirements of the European Data Portal http://ppe.paneuropeandataportal.eu/en/content/providing-data/how-to-be-...

Mapping the metadata

For easy metadata inclusion, map the metadata. To do so, use the standard Linked Open Data vocabularies (DCAT-AP) to create a table of properties and URIs, enabling the metadata to be added to the file easily. Distinguish between metadata about the data set itself (title, description, licence) and metadata about the distribution (URL, format, status). 
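Such a mapping table can be sketched as a simple lookup from local field names to standard property URIs. In the sketch below the local field names are illustrative assumptions; the property URIs are the standard Dublin Core and DCAT ones used by DCAT-AP:

```python
# Mapping from local metadata fields to standard property URIs.
# Data-set-level properties use Dublin Core (dct:), the distribution
# access URL uses DCAT (dcat:), as in DCAT-AP.
PROPERTY_MAP = {
    # the data set itself
    "title":       "http://purl.org/dc/terms/title",
    "description": "http://purl.org/dc/terms/description",
    "licence":     "http://purl.org/dc/terms/license",
    # the distribution
    "url":         "http://www.w3.org/ns/dcat#accessURL",
    "format":      "http://purl.org/dc/terms/format",
}

def map_metadata(local_metadata):
    """Rename local metadata fields to their standard property URIs."""
    return {PROPERTY_MAP[key]: value
            for key, value in local_metadata.items()
            if key in PROPERTY_MAP}

mapped = map_metadata({"title": "Waste collection 2015", "format": "CSV"})
print(mapped)
```

Once every local field is expressed through a shared property URI, any portal that understands DCAT-AP can interpret the metadata without a custom crosswalk.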

Controlled vocabularies

The European Commission has created a Linked Open Data vocabulary specification called DCAT-AP. This increases interoperability between all data portals in Europe. For instance, when talking about file formats, there are various standards.

  • DCAT, Data Catalogue Vocabulary
  • DCAT Application Profile is not a vocabulary, but a specification for metadata descriptions of EU governmental data and portals 

Please look at the list of the most recent DCAT-AP publications to learn more about controlled vocabularies:


For more general information and training about metadata, please look at the following links:



Multilingual Thesauri

For international interoperability, it is useful to make use of multilingual thesauri. This means using a standard set of vocabulary terms that can be translated to other languages more easily. Eurovoc is such a multilingual thesaurus. Please see: http://eurovoc.europa.eu/ 

Publishing the metadata

Most portal software solutions come with integrated metadata creation modules. In this case, metadata is created by filling in predetermined fields or by choosing from drop-down lists.

Recommendation: Always publish your metadata as Linked Data. This increases the discoverability and interoperability of your datasets

The 5-Star Open Data model

Publishing high-quality Open Data requires some effort. The W3C Foundation has created a basic model for Open Data with regard to quality: the 5-Star Open Data model. The 5 stages of Open Data are:

★ Make your stuff available on the web (whatever format) under an open licence 
★★ Make it available as structured data (e.g. Excel instead of image scan of a table) 
★★★ Use non-proprietary formats (e.g. CSV instead of Excel) 
★★★★ Use URIs to denote things, so that people can point at your stuff 
★★★★★ Link your data to other data to provide context 

Descriptions of all stages of the 5-star Open Data Model

The 1-Star Stage: Publishing your data

Stage 1 of the 5-Star Open Data model is achieved by publishing your data. This can be done in various ways: via download, bulk download or APIs. 

The 2-Star Stage: Making it available as structured data

The power of Open Data lies in its re-usability, which stimulates the interoperability of systems and services. Data formats can be clustered into two categories: 

  • Structured data (machine- and human-readable) 
  • Unstructured data (human-readable only)

Structured data is designed to be processed by machines and is thus different from merely digitally accessible information. Structured data is machine-readable and more interoperable. See Table 3 for a shortlist of machine-readable formats.

JSON      Shapefile      RTF (for text) 
XML      GeoJSON      HTML 
RDF      GML      Excel 
CSV      KML      PDF (for text) 
TSV     WKT      
ODF     KMZ      

Machine-readable formats


The 3-Star Stage: Using non-proprietary formats

Non-proprietary means: not bound to specific software or a specific vendor. For example, an Excel file (.xls) might seem very open, but it is not: it is bound to Microsoft Excel, so anyone who does not own Microsoft Office may be unable to open the file. Such files are said to be in proprietary formats.

Recommendation: Aim for 3-star quality data or higher. It is a process: do not start with 5 stars, begin with the quick wins. Any star is a good starting point.

It is widely recommended to convert proprietary and non-machine-readable files into open, machine-readable formats in order to obtain a high-quality, linkable data set.
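Converting between open, machine-readable formats needs no proprietary software at all. A minimal sketch using only the Python standard library, converting CSV to JSON (both are open, 2/3-star-friendly formats; the example data is invented):

```python
import csv
import io
import json

def csv_to_json(csv_text):
    """Convert CSV text into a JSON array of objects, one per row."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return json.dumps(rows, indent=2)

csv_text = "city,waste_tonnes\nExampleville,120\nSampleton,95\n"
print(csv_to_json(csv_text))
```

The reverse direction, or conversion to other open formats such as XML, follows the same pattern: parse with one standard-library module, serialise with another.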

Machine-readable   Geodata (machine-readable)   Less readable    Closed
JSON               Shapefile                    PDF (for text)   Images (PNG, JPG) 
XML                GeoJSON                      HTML             Charts 
RDF                GML                          Excel      
CSV                KML                          Word      
TSV                WKT            

Technical Openness of files


The 4-Star Stage: Use URIs to denote things

If you publish your data as 4-Star Open Data, you use URIs to denote things. In practice, this means converting your files to RDF format and/or enriching your metadata with URIs. This is the first step towards Linked Data. 
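Denoting things with URIs can be sketched concretely: the snippet below turns one record into RDF statements in the N-Triples syntax. The base URIs and field names are illustrative assumptions, not an official scheme:

```python
def row_to_ntriples(row_id, row):
    """Serialise one record as N-Triples, denoting things with URIs."""
    subject = f"<http://data.example.org/dataset/{row_id}>"  # illustrative base URI
    triples = []
    for field, value in row.items():
        predicate = f"<http://data.example.org/property/{field}>"
        triples.append(f'{subject} {predicate} "{value}" .')
    return "\n".join(triples)

print(row_to_ntriples("row1", {"city": "Exampleville", "waste_tonnes": "120"}))
```

Because the subject and predicates are now URIs, other people can point at exactly this record and exactly these properties, which is what the 4-star stage asks for.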

The 5-Star Stage: Link your data to other data to provide context

This is a very advanced stage of Open Data: the data is linked to other data in order to provide context, which makes it highly interoperable and easily discoverable. For more information and examples of all stages of the 5-Star Open Data model, go to http://5stardata.info/

Preparing data: legal openness


The data is now re-usable in terms of quality, and it is technically open. It is time for the final preparation step: legally opening up the data. If the data is not legally open, there is no legal right to re-use it, and re-users cannot use the data at all. Legal openness is the basic principle of Open Data. 

The implementation of Open Data has to be in line with current legislation, and datasets should be published under an open licence, as discussed already in the chapter about licensing. These legal implications are of major importance for any stakeholder trying to make use of Open Data. 

Your policy should clearly establish a licensing procedure and take into account national and supra-national legislative matters. Next to the presence of legal aspects in your policy, every data set should be published individually using a licence. 

Recommendation: Check the licensing assistant on the European Data Portal for more guidance with regard to choosing a licence. See the European Data Portal website for more information

Consult the legal department of your organisation to make sure your data is legally open or to check if your policy is compliant. It is the responsibility of the publishing organisation to be up to date with all legislative and legal rules.

For more information about why you need to licence, please also consult the online training module about this topic: http://europeandataportal.eu/elearning/en/module4/#/id/co-01 

Summary of the eLearning Module: Why do we need to license?

In order for data to be open, it should be accessible (this usually means being published online) and licensed for anyone to access, use and share.

In this module we explore the following:

  • Why open data needs to be licensed
  • How licences unlock the value of open data
  • What type of licence suits open data
  • How to provide for open data licensing in the tender, procurement and contracting lifecycle

Even in cases where data has been made available as a public domain dedication without conditions on re-use, an explicit statement is required together with the data to provide users with legal clarity.

The Final Check

For a final check, use a data set preparation checklist:

  • Check the quality of the data set
  • Check the data for timeliness and consistency
  • Check the use of standards
  • Add metadata
  • Check that the metadata is described as Linked Data
  • Check the technical openness of the data set
  • Check the legal openness of the data set; if it is not open, choose an appropriate licence and apply it to the file
  • Provide licence information and information about the data's origins
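The checklist can be sketched as a final gate before publication. The required field names below are illustrative assumptions, not a formal standard:

```python
# Metadata fields every data set should carry before publication
# (illustrative selection, loosely following the checklist above).
REQUIRED_METADATA = {"title", "description", "licence", "origin", "format"}

def final_check(metadata, data_is_open):
    """Return the list of remaining issues before publication."""
    issues = [f"missing metadata field: {field}"
              for field in sorted(REQUIRED_METADATA - set(metadata))]
    if not data_is_open:
        issues.append("data set is not legally open: choose and apply a licence")
    return issues

print(final_check({"title": "Waste collection 2015", "licence": "CC-BY-4.0"},
                  data_is_open=True))
```

An empty result means the data set is ready to publish; anything else points at a checklist item that still needs attention.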

Go to “Appendix 7 – Publishing best practices” to read some different examples of publishing best practices.

Data set Preparation Checklist

