The datasets stored in the portal need to be of appropriate quality in terms of:
- DCAT-AP compliant mapping,
- Available distributions,
- Usage of machine-readable distribution formats,
- Usage of known open source licences.
To check the datasets against these quality indicators, the Metadata Quality Assurance (MQA) tool was developed. The MQA runs as a periodic process in parallel with the harvesting. The harvesting process fills CKAN and Virtuoso with metadata. Because CKAN cannot store DCAT-AP formatted datasets directly, the datasets are mapped into a DCAT-AP compliant JSON schema. The MQA uses this schema to check each dataset for DCAT-AP mapping compliance. If any compliance issue is detected, for instance a missing mandatory field, the dataset is considered not DCAT-AP compliant.
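The compliance rule described above (one missing mandatory field makes a dataset non-compliant) can be sketched as follows. This is a minimal illustration, not the MQA implementation: the key names `title`, `description`, `distributions` and `access_url` are hypothetical stand-ins for whatever the actual JSON schema uses, and only a few mandatory DCAT-AP properties are covered.

```python
# Hypothetical key names; the JSON schema used by the MQA may name them differently.
MANDATORY_DATASET_FIELDS = ("title", "description")
MANDATORY_DISTRIBUTION_FIELDS = ("access_url",)


def find_violations(dataset: dict) -> list:
    """Return a list of human-readable violations; an empty list means compliant."""
    violations = []
    for field in MANDATORY_DATASET_FIELDS:
        # Missing keys and empty values are both treated as "not provided".
        if not dataset.get(field):
            violations.append(f"dataset: missing mandatory field '{field}'")
    for index, distribution in enumerate(dataset.get("distributions", [])):
        for field in MANDATORY_DISTRIBUTION_FIELDS:
            if not distribution.get(field):
                violations.append(
                    f"distribution {index}: missing mandatory field '{field}'"
                )
    return violations


def is_compliant(dataset: dict) -> bool:
    """A dataset is compliant only if no violation was found."""
    return not find_violations(dataset)
```

Collecting the violations as a list, rather than returning a single boolean, would also make it possible to count which violations occur most often across all datasets.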
The MQA uses the CKAN API to collect information about all harvested catalogues. It runs through all CKAN catalogues in parallel, collecting the information required for the quality checks. During this process, several checks are performed for each dataset. The results are stored in the MQA database and published on the MQA page of the portal or as downloadable spreadsheets and PDF documents. Downloadable MQA documents are only updated after an MQA run has finished. One run takes a couple of days, because the MQA checks the availability of each distribution of each dataset. Checking a single distribution may take several seconds; with almost 800,000 datasets and 2 to 50 distributions per dataset, this adds up.
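To illustrate why the availability check dominates the run time, the following sketch shows one possible way to test a distribution URL and a back-of-the-envelope duration estimate. It is an assumption-laden illustration: the function names are hypothetical, the use of a HEAD request and the rule "2xx or 3xx counts as accessible" are assumptions, and the MQA may check availability differently.

```python
import urllib.error
import urllib.request
from typing import Callable, Optional


def fetch_status(url: str, timeout: float = 10.0) -> Optional[int]:
    """Send a HEAD request; return the HTTP status code, or None if unreachable."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError, ValueError):
        return None  # DNS failure, timeout, refused connection or malformed URL


def is_accessible(url: str, fetch: Callable[[str], Optional[int]] = fetch_status) -> bool:
    """Assumption: any 2xx or 3xx status counts as an accessible distribution."""
    status = fetch(url)
    return status is not None and 200 <= status < 400


def estimated_run_days(datasets: int, distributions_per_dataset: float,
                       seconds_per_check: float, parallel_checks: int = 1) -> float:
    """Rough duration of one full availability pass, in days."""
    total_seconds = datasets * distributions_per_dataset * seconds_per_check
    return total_seconds / parallel_checks / 86_400  # 86,400 seconds per day
```

With purely illustrative inputs of 800,000 datasets, 2 distributions each and 2 seconds per check, `estimated_run_days(800_000, 2, 2.0)` gives roughly 37 days for a strictly sequential pass, which suggests why the checks have to run in parallel to finish within a couple of days.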
The MQA presents its results in two views:
- The "Global Dashboard", which is the landing page. This view shows aggregated results for the entire EDP portal, i.e. the quality details for all catalogues.
- The "Catalogue Dashboard". This view lets you select a specific catalogue and display its quality details.
The current quality indicators include the following:
- Distribution Statistics
  - Accessible Distributions
  - Error Status Codes
  - DownloadURL existence
  - Top 20 catalogues with most accessible distributions (*)
  - Ratio of machine-readable datasets
  - Most used distribution formats
  - Top 20 catalogues mostly using common machine-readable datasets (*)
- Dataset Compliance Statistics
  - Top Violation Occurrences
  - Compliant Datasets
  - Top 20 catalogues with most DCAT-AP compliant datasets (*)
- Dataset Licence Usage
  - Ratio of known to unknown licences
  - Most used licences
  - Top 20 catalogues with most datasets of known licences (*)
(*) The Top 20 indicators are only available in the Global Dashboard view.
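As an illustration of how the licence indicators could be derived from harvested metadata, the sketch below computes the known-to-unknown counts and a Top N catalogue ranking. Everything in it is an assumption for illustration: the `catalogue` and `licence` keys, the contents of `KNOWN_LICENCES`, and the choice to rank catalogues by their share of known licences rather than by absolute counts.

```python
from collections import Counter

# Hypothetical identifiers; the list of licences the MQA treats as known is not given here.
KNOWN_LICENCES = {"cc-by", "cc-zero", "odc-pddl"}


def known_unknown_counts(datasets):
    """Return (known, unknown) dataset counts for the licence ratio chart."""
    known = sum(1 for d in datasets if d.get("licence") in KNOWN_LICENCES)
    return known, len(datasets) - known


def top_catalogues_by_known_licences(datasets, limit=20):
    """Rank catalogues by their share of datasets with a known licence."""
    totals = Counter()
    known = Counter()
    for dataset in datasets:
        catalogue = dataset["catalogue"]
        totals[catalogue] += 1
        if dataset.get("licence") in KNOWN_LICENCES:
            known[catalogue] += 1
    shares = {name: known[name] / totals[name] for name in totals}
    # Highest share first; ties are broken alphabetically for a stable order.
    return sorted(shares.items(), key=lambda item: (-item[1], item[0]))[:limit]
```

The same pattern (count per catalogue, divide by the catalogue total, sort, cut off at 20) would apply to the other Top 20 indicators with a different predicate.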
Most MQA results are presented as charts (pie charts and bar charts). If you need further information about a chart, click the "i" icon in its upper right corner for additional help. Some charts show the label "?" on the x-axis. This label aggregates entities that are unknown or not set in the data, for instance distributions for which no format is provided in a chart of the most used distribution formats.
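The "?" label can be thought of as a catch-all bucket applied while counting. The following sketch shows the idea for the distribution format chart; the `format` key and the upper-case normalisation are assumptions for illustration and need not match what the MQA does.

```python
from collections import Counter

UNKNOWN_LABEL = "?"


def format_counts(distributions):
    """Count distributions per format, grouping missing or empty formats under '?'."""
    counts = Counter()
    for distribution in distributions:
        # None, a missing key and whitespace-only strings all count as "not set".
        fmt = (distribution.get("format") or "").strip()
        counts[fmt.upper() if fmt else UNKNOWN_LABEL] += 1
    return counts.most_common()  # most frequent format first
```

For example, a catalogue with three CSV distributions, one JSON distribution and two distributions without a format would yield a "?" bar of height 2 next to the CSV and JSON bars.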