DCAT-AP: How to manage duplicates?
Issue
Duplicates can occur when an aggregator harvests descriptions of datasets from various sources. Two situations can arise:
1. The harvested data contains two or more descriptions of the same physical data file or API endpoint. In this case, the download or access URLs in the descriptions are the same.
2. One or more of the harvested sources describe a copy of the data file or API endpoint. In this case, the descriptions refer to different physical files.
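The first situation can be detected mechanically, because the descriptions share a download or access URL. A minimal sketch, assuming the harvested descriptions have already been reduced to a mapping from dataset URI to its URLs (the `harvested` mapping, the file URLs, and the function name `group_by_access_url` are hypothetical and only serve the illustration):

```python
from collections import defaultdict


def group_by_access_url(descriptions):
    """Return the access/download URLs that more than one description points to."""
    by_url = defaultdict(set)
    for dataset_uri, urls in descriptions.items():
        for url in urls:
            # Strip stray whitespace so that identical URLs compare equal.
            by_url[url.strip()].add(dataset_uri)
    # A URL referenced by several descriptions indicates situation 1.
    return {url: uris for url, uris in by_url.items() if len(uris) > 1}


# Hypothetical harvested descriptions: dataset URI -> access/download URLs.
harvested = {
    "http://data.city.eu/datasets/12345": ["http://data.city.eu/files/harbour.csv"],
    "http://data.region.eu/datasets/34567": ["http://data.city.eu/files/harbour.csv"],
    "http://data.region.eu/datasets/99999": ["http://data.region.eu/files/copy.csv"],
}
shared = group_by_access_url(harvested)
```

The second situation cannot be found this way, since a copy has its own URL; it needs a shared identifier or some other matching strategy.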
Current situation
Many DCAT-AP implementers struggle to identify and handle duplicate datasets. Duplicates are a particular problem when a central data portal or aggregator, for example at national level, harvests datasets from other data portals, for example regional ones. When the same dataset exists on several regional portals and is not identified by the same stable identifier, it is difficult for the national data portal to identify the duplicates automatically.
Recommendation
Rationale
The existence of duplicate datasets within and across data portals causes interoperability problems. When representations of one dataset exist on several portals, it is difficult for a data consumer to identify which is the original source, which might be necessary to find the original licence statement, provenance information, linked datasets, etc.
Example
The example below describes the same dataset on three data portals:
1. The original dataset is uploaded to the data portal of a city. Because the dataset is first published here, a stable identifier is defined as the value of dct:identifier.
<rdf:Description rdf:about="http://data.city.eu/datasets/12345">
  <rdf:type rdf:resource="http://www.w3.org/ns/dcat#Dataset"/>
  <dct:title xml:lang="en">Companies located in the city harbour</dct:title>
  <dct:identifier>http://data.city.eu/datasets/12345</dct:identifier>
</rdf:Description>
2. The dataset is harvested by a regional data portal. A local identifier, specific to the regional portal, can be added as a value of adms:identifier. The global identifier (dct:identifier) remains unchanged.
<rdf:Description rdf:about="http://data.region.eu/datasets/34567">
  <rdf:type rdf:resource="http://www.w3.org/ns/dcat#Dataset"/>
  <dct:title xml:lang="en">Companies located in the city harbour</dct:title>
  <dct:identifier>http://data.city.eu/datasets/12345</dct:identifier>
  <adms:identifier rdf:parseType="Resource">
    <skos:notation>10.1000/182</skos:notation>
  </adms:identifier>
</rdf:Description>
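The harvesting rule in this step (add a local identifier, leave the global one alone) can be sketched as a small function. This is only an illustration under stated assumptions: records are modelled as plain dictionaries keyed by property name, and the function name `harvest` is hypothetical rather than part of any DCAT-AP tooling.

```python
def harvest(record, local_id):
    """Return a harvested copy of record with local_id added as adms:identifier.

    The global dct:identifier is copied unchanged; the source record is not modified.
    """
    return {
        "dct:identifier": record["dct:identifier"],
        "adms:identifier": [*record.get("adms:identifier", []), local_id],
    }


# The city's original record, as in step 1 of the example.
city = {"dct:identifier": "http://data.city.eu/datasets/12345"}

# The regional portal adds its own local identifier when harvesting.
regional = harvest(city, "10.1000/182")
```

Because each harvest appends to the list instead of overwriting it, local identifiers accumulate along the harvesting chain while `dct:identifier` stays the same.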
3. A national data portal harvests both the city and the regional portal. Because the global, primary identifier remains unchanged, the national data portal can automatically detect that the two sources refer to the same dataset, so the dataset is not duplicated on the national data portal.
<rdf:Description rdf:about="http://data.country.eu/datasets/56789">
  <rdf:type rdf:resource="http://www.w3.org/ns/dcat#Dataset"/>
  <dct:title xml:lang="en">Companies located in the city harbour</dct:title>
  <dct:identifier>http://data.city.eu/datasets/12345</dct:identifier>
  <adms:identifier rdf:parseType="Resource">
    <skos:notation>10.1000/182</skos:notation>
  </adms:identifier>
  <adms:identifier rdf:parseType="Resource">
    <skos:notation>138472638</skos:notation>
  </adms:identifier>
</rdf:Description>
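The duplicate check performed by the national portal can be sketched as grouping harvested descriptions by their dct:identifier value. The sketch below is an assumption about how an aggregator might do this, not a prescribed implementation: it uses Python's standard-library XML parser on a simplified RDF/XML document (the `HARVESTED` string and the function name `group_by_identifier` are hypothetical), whereas a production harvester would more likely use a full RDF library, since RDF/XML allows several equivalent serializations.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dct": "http://purl.org/dc/terms/",
}
# ElementTree exposes namespaced attributes in {namespace}name form.
RDF_ABOUT = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}about"

# Simplified harvest result: the city and regional descriptions from the example.
HARVESTED = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dct="http://purl.org/dc/terms/">
  <rdf:Description rdf:about="http://data.city.eu/datasets/12345">
    <rdf:type rdf:resource="http://www.w3.org/ns/dcat#Dataset"/>
    <dct:identifier>http://data.city.eu/datasets/12345</dct:identifier>
  </rdf:Description>
  <rdf:Description rdf:about="http://data.region.eu/datasets/34567">
    <rdf:type rdf:resource="http://www.w3.org/ns/dcat#Dataset"/>
    <dct:identifier>http://data.city.eu/datasets/12345</dct:identifier>
  </rdf:Description>
</rdf:RDF>"""


def group_by_identifier(rdf_xml):
    """Map each dct:identifier value to the URIs of the descriptions carrying it."""
    root = ET.fromstring(rdf_xml)
    groups = defaultdict(list)
    for desc in root.findall("rdf:Description", NS):
        # Strip whitespace: a stray space would make equal identifiers look different.
        ident = desc.findtext("dct:identifier", default="", namespaces=NS).strip()
        if ident:
            groups[ident].append(desc.get(RDF_ABOUT))
    return dict(groups)


groups = group_by_identifier(HARVESTED)
duplicates = {ident: uris for ident, uris in groups.items() if len(uris) > 1}
```

Any identifier that maps to more than one description marks a set of duplicates that the aggregator can merge into a single entry.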