
INSPIRE metadata: metadata provenance

Published on: 10/05/2013 Discussion Archived

The INSPIRE metadata schema consists of two classes of metadata "elements": resource metadata and metadata on metadata. As far as the latter class is concerned, three metadata fields are recommended:

  • metadata point of contact
  • metadata date
  • metadata language
Metadata point of contact
The organisation publishing metadata about a resource may not be the same as the one publishing the resource.
Metadata date
The issue/creation date of a resource is frequently different from the one of the corresponding metadata.
Metadata language
The resource language and the metadata language may not be the same. We have several examples of this in INSPIRE - e.g., the resource language is the official one of the originating EU Member State, whereas metadata are provided in English.

Note that the first two fields are related to metadata provenance and accountability, whereas the metadata language gives important information on how to process metadata - e.g., in case multilingual or cross-language search is supported. Also, this information can be re-used when displaying metadata in a human-readable way. In such a case, if you know the metadata language, you can use existing tools which can be integrated into an HTML page to automatically translate metadata into a different language.

Note that these "metadata on metadata" do not correspond to the DCAT notion of catalogue record.

In RDF terms, "metadata on metadata" can be represented as a distinct rdf:Description block, linked to the relevant dataset metadata by using foaf:primaryTopic.
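A minimal Turtle sketch of this idea follows. The `ex:` resources, the dates, and the title are invented for illustration, and the mapping of the three INSPIRE fields to dct:publisher, dct:modified and dct:language is an assumption, not something this post prescribes.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <http://example.org/> .

# Resource metadata: the description of the dataset itself
# (hypothetical dataset, described in the Member State's language).
ex:dataset-d a dcat:Dataset ;
    dct:title "Dataset di esempio"@it ;
    dct:language <http://publications.europa.eu/resource/authority/language/ITA> ;
    dct:issued "2012-11-20"^^xsd:date .

# "Metadata on metadata": a distinct block, linked to the dataset
# description through foaf:primaryTopic.
ex:dataset-d-metadata
    foaf:primaryTopic ex:dataset-d ;
    dct:publisher ex:metadata-publisher ;   # metadata point of contact (assumed mapping)
    dct:modified "2013-05-10"^^xsd:date ;   # metadata date (assumed mapping)
    dct:language <http://publications.europa.eu/resource/authority/language/ENG> .  # metadata language
```

Here the dataset language (Italian) and the metadata language (English) differ, which is the situation described for INSPIRE above.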

Component: Documentation

Category: feature

Comments

Makx DEKKERS Fri, 10/05/2013 - 14:34

Andrea, I do not understand why these meta-metadata properties cannot be seen as properties of the DCAT CatalogRecord. In my mind, the kind of information you refer to (metadata about metadata) is very similar to the information defined for CatalogRecord, even if DCAT proper does not specify dct:language and dct:publisher for CatalogRecord. We could add those properties to the CatalogRecord in the AP if others agree.
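As a rough illustration of this suggestion, a catalogue record carrying the two extra properties might look like the Turtle sketch below. The `ex:` resources and the date are invented, and the use of dct:language and dct:publisher on dcat:CatalogRecord is the extension being proposed here, not something DCAT itself defines.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <http://example.org/> .

ex:dataset-d a dcat:Dataset .

# Catalogue record extended with language and publisher
# (hypothetical Application Profile extension).
ex:record-d a dcat:CatalogRecord ;
    foaf:primaryTopic ex:dataset-d ;
    dct:modified "2013-05-10"^^xsd:date ;
    dct:language <http://publications.europa.eu/resource/authority/language/ENG> ;
    dct:publisher ex:metadata-publisher .
```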

Wouldn't it be making things unnecessarily complex if we were to add a different class for the properties you describe, alongside CatalogRecord, while both CatalogRecord and that new class serve a very similar purpose?

Andrea PEREGO Fri, 10/05/2013 - 18:19

No, these are two different things.

A catalogue record describes an entry in the catalogue. "Metadata on metadata" describe who originally created the metadata.

Note that this distinction is particularly important when dealing with distributed catalogues or catalogue federations (the actual scope of DCAT-AP). Just to give you an example, the INSPIRE Geoportal [1] harvests metadata about datasets (and services) from catalogues operated by EU Member States. The catalogue record in the INSPIRE Geoportal keeps track of when the metadata was acquired, etc., but the actual creator, maintainer, etc. of the metadata is another entity. Note that such an entity does not necessarily correspond to the organisation running the catalogue hosting it, or even to the creator, maintainer, etc. of the dataset described by the metadata.

I understand your concern about adding unnecessary complexity, but the point is that we already have thousands of datasets from EU Member States that include information on metadata provenance (the INSPIRE ones). So, it would be important to give guidelines in DCAT-AP about how to map this information.

Note that I'm not saying that we need to adopt the INSPIRE solution for the actual content of such "metadata on metadata". As I mentioned in my original message, it may be enough to say something like: if "metadata on metadata" are present, they should be represented by a distinct rdf:Description block, and properties foaf:primaryTopic / foaf:primaryTopicOf should be used to link "metadata on metadata" to the metadata they are describing. We needn't specify what should be included there (metadata language, publisher, etc.) - of course, unless we have other similar use cases in the WG that may help identify a minimal set of common properties.

  1. http://inspire-geoportal.ec.europa.eu/discovery/
Makx DEKKERS Fri, 10/05/2013 - 20:34

Andrea, this gets a bit complicated. I do not really understand what the difference is between a catalogue record that says things about the description of a dataset and properties of the metadata of the dataset. To me they fit into the same category. I also don't understand what a 'distinct rdf:Description block' is. Such a 'block' would in fact contain a description of an instance of some class, wouldn't it? Why couldn't that class be CatalogRecord -- possibly extended with properties like language and publisher which are not defined for DCAT proper?

Andrea PEREGO Fri, 10/05/2013 - 22:02

Hi, Makx.

The difference between a catalogue record and statements on metadata provenance is that the latter exist on their own - they don't have to be in a catalogue and they do not describe a dataset as an entry in a catalogue. E.g., in INSPIRE, they are acquired into a catalogue along with the resource metadata they describe, and they are distinct from catalogue records.

I will try to explain the rationale for this through an example:

Suppose a catalogue C, operated by organisation X, is used by a set of organisations O1, O2, ..., On to publish dataset metadata, because they do not operate a catalogue of their own.

Organisation O1 creates/updates a metadata record MD about dataset D at a given time instant t1, and then publishes it in catalogue C at a subsequent moment t2. Consequently:

  • metadata on MD will state that MD was created/updated at time instant t1, and that its publisher is organisation O1.
  • the catalogue record will use time instant t2 to specify when the dataset was created/updated in catalogue C.
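The two statements above can be sketched in Turtle as follows. All `ex:` identifiers and both dates are invented placeholders for C, X, O1, D, MD, t1 and t2, and the choice of dct:issued for t2 is an assumption (dct:modified would fit an update).

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <http://example.org/> .

# Catalogue C, operated by organisation X.
ex:catalogue-c a dcat:Catalog ;
    dct:publisher ex:org-x ;
    dcat:dataset ex:dataset-d ;
    dcat:record ex:record-d .

ex:dataset-d a dcat:Dataset .

# Catalogue record: the entry for D in catalogue C, dated t2.
ex:record-d a dcat:CatalogRecord ;
    foaf:primaryTopic ex:dataset-d ;
    dct:issued "2013-05-02"^^xsd:date .      # t2 (placeholder value)

# Metadata on MD: created/updated by O1 at t1, independent of any catalogue.
ex:md a rdfs:Resource ;
    foaf:primaryTopic ex:dataset-d ;
    dct:publisher ex:org-o1 ;
    dct:modified "2013-04-15"^^xsd:date .    # t1 (placeholder value)
```

The two resources point at the same dataset but carry different dates and different responsible organisations.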

The example can be even more complex - for instance, when the organisation in charge of creating the metadata is not the one maintaining the datasets.

Note that the scenario above is not specific to INSPIRE, and it is more common than the one where the same organisation maintains the datasets and the corresponding metadata, and operates the catalogue where they are published. In the latter case, the information in catalogue records and "metadata on metadata" may match - but they are still two different things.

About rdf:Description, what I meant is that we needn't define a new class - "metadata on metadata" can just be an instance of rdfs:Resource.

Makx DEKKERS Mon, 13/05/2013 - 10:23

Andrea, I see your point.

As far as I understand, there are two underlying, implicit, assumptions in your scenario:

  1. what is created at t1 is a particular metadata 'record' (a set of statements that belong together); the "statements on metadata provenance" that you describe are in fact statements about that 'record'  
  2. in a chain of exchanges, you are assuming that the various catalogues downstream do not make changes to the 'record' most of the time

Assumption 1 means that in fact you think of the metadata 'record' as an entity of interest; therefore it should be of a particular class (rdfs:Resource sounds a little too general to me) and an instance of it should be identified by a URI. Then you can very easily say:

:O1sMetadataRecord a foo:MetadataRecord ;
    adms:contactPoint :O1sAddress ;
    dct:modified "T1"^^xsd:date ;
    dct:language :someLanguage ;
    dcat:dataset :datasetD .

Assumption 2 really determines how far downstream this particular metadata record travels. If catalogues that harvest and load dataset descriptions add, change or delete triples in the dataset description (even adding a local keyword or a local identifier), T1 loses its significance. And in an open linked data environment, can you really expect the data to be immutable for long?

Andrea PEREGO Mon, 13/05/2013 - 16:03

Good point, Makx!

Actually, the INSPIRE Geoportal follows assumption (2) (basically, the idea is that you cannot change metadata you don't own). However, when a federation is operated outside a legal framework, such a rule can be implemented only as a recommendation. On the other hand, in a given context, modifying the original metadata may also be considered an added value - e.g., metadata can be validated, revised, and even enriched.

Addressing all the possible scenarios requires a comprehensive framework to represent provenance, such as the one offered by the W3C PROV-O ontology [1]. I recognise this needs work in an area outside the specific scope of DCAT-AP.

So, my proposal is that we explicitly mention this issue in the spec (maybe in Section 11.2), saying that:

  • data exchange may or may not result in a modification of the original metadata;
  • in scenarios where information on metadata provenance is required, this can be addressed by using existing "tools", such as the PROV-O ontology - or even DCTerms in the simplest case.
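As a hedged illustration of the second point, the following Turtle sketch uses PROV-O to describe an original metadata record and a version enriched downstream. The `ex:` resources and timestamps are invented, and this is only one of several possible ways to apply PROV-O here.

```turtle
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <http://example.org/> .

ex:org-o1 a prov:Organization .   # original metadata creator (hypothetical)
ex:org-x  a prov:Organization .   # harvesting catalogue operator (hypothetical)

# Original metadata record, as created by O1.
ex:md-original a prov:Entity ;
    foaf:primaryTopic ex:dataset-d ;
    prov:wasAttributedTo ex:org-o1 ;
    prov:generatedAtTime "2013-04-15T09:00:00Z"^^xsd:dateTime .

# Version modified downstream (e.g. validated or enriched) by X.
ex:md-enriched a prov:Entity ;
    foaf:primaryTopic ex:dataset-d ;
    prov:wasRevisionOf ex:md-original ;
    prov:wasAttributedTo ex:org-x ;
    prov:generatedAtTime "2013-05-02T14:30:00Z"^^xsd:dateTime .
```

When no modification takes place, the simplest case mentioned above would need only DCTerms properties such as dct:publisher and dct:modified on the metadata resource.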

We'll probably include an example of how this can be done in our report on INSPIRE metadata.

  1. http://www.w3.org/TR/prov-o/
Makx DEKKERS Tue, 09/07/2013 - 12:09

Consideration    

These elements are mostly relevant for cases where metadata remains unchanged across exchanges. In a linked open data environment, the information loses significance as soon as someone changes (corrects, enhances, mixes) the metadata.

 

Proposed resolution    

Add text to point implementers to W3C’s PROV Ontology

 

Proposed action  

Update specification, adding paragraph to section 11.3: “This Application Profile does not consider requirements for tracking provenance of metadata or data, other than providing information about the publisher of the data. If additional provenance information is required, implementers are encouraged to consider the use of W3C PROV Ontology [footnote: http://www.w3.org/TR/prov-o/] to capture and exchange such information.”