Description
From: http://joinup.ec.europa.eu/mailman/archives/dcat_application_profile/2015-February/000123.html
Data lineage is information about the data life-cycle, e.g., the data collection / creation methodology and workflow. Lineage is usually present in the metadata of scientific datasets, and it is also not uncommon in the public sector. For example, specifying lineage is a legal requirement for INSPIRE metadata. It would be desirable to add "lineage" as an optional element in DCAT-AP 2.0. Lineage can be modelled by using dct:provenance.
An example of how to use it is provided in the reference specification of the INSPIRE profile of DCAT-AP. In that example, lineage is specified through a free-text description. For a machine-readable representation, an option is to use PROV-O. This can be one of the use cases mentioned in issue JRC.8.4.
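As an illustration of the free-text option, here is a minimal sketch that emits a Turtle snippet attaching a lineage statement to a dataset through dct:provenance. The dataset IRI, the lineage text and the helper names (`turtle_literal`, `lineage_turtle`) are hypothetical, and the use of a dct:ProvenanceStatement carrying an rdfs:label is one common pattern, not something this issue prescribes.

```python
def turtle_literal(text: str) -> str:
    """Escape a string for use as a double-quoted Turtle literal."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def lineage_turtle(dataset_iri: str, lineage_text: str, lang: str = "en") -> str:
    """Return a Turtle snippet with free-text lineage attached to a dataset."""
    return (
        "@prefix dcat: <http://www.w3.org/ns/dcat#> .\n"
        "@prefix dct:  <http://purl.org/dc/terms/> .\n"
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .\n"
        "\n"
        f"<{dataset_iri}> a dcat:Dataset ;\n"
        "    dct:provenance [\n"
        "        a dct:ProvenanceStatement ;\n"
        f"        rdfs:label {turtle_literal(lineage_text)}@{lang}\n"
        "    ] .\n"
    )


# Hypothetical dataset IRI and lineage text, for illustration only.
EXAMPLE = lineage_turtle(
    "http://example.org/dataset/roads",
    'Digitised from 1:10000 "paper" maps.',
)
print(EXAMPLE)
```

The lineage stays human-readable only; nothing in this snippet links the dataset to its inputs in a way a machine could follow.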
Proposed solution
Add a new property to express lineage.
Comments
A possible option is dct:provenance - see the solution proposed in the working draft on the alignment of INSPIRE metadata with DCAT-AP.
Yes, using dct:provenance for lineage info would align with GeoDCAT-AP developments.
dct:provenance, adms:version and adms:versionNotes would be quite expressive by themselves.
However, there would not be a formal link between datasets. PROV-O would provide this, but might be overkill. This links back to the 'relationship between datasets' discussion, and the possible use of dct:isVersionOf.
In practice, sometimes distributions/resources are used as versions, especially in the case of dynamic datasets. For example, one dataset, with a different distribution for each year. Is this compliant with DCAT-AP?
Deirdre, as far as I understand the answer to your question "Is this compliant with DCAT-AP?" is No. The idea is that the distributions of a particular dataset all contain the same information in different formats. A time series would be modelled as a collection of related datasets. See also https://joinup.ec.europa.eu/discussion/pr5-add-new-property-relate-datasets-time-series
I wonder whether we could propose a number of possible options, depending on the use case.
For example:
For any more complex use case, the recommendation can be to use PROV. For instance, PROV can also be used to link to the model used to generate the dataset from the input dataset(s), to represent a more complex workflow (e.g., one consisting of a model "chain"), to describe data that are collected directly from instruments or sensors, etc.
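To make the PROV option concrete, here is a minimal sketch that represents a small workflow as plain triples and walks the prov:wasGeneratedBy / prov:used links to find everything a dataset was produced from. The `http://example.org/` resources and the function names are hypothetical; only the PROV-O property names are taken from the vocabulary.

```python
PROV = "http://www.w3.org/ns/prov#"
EX = "http://example.org/"  # hypothetical namespace, for illustration only

# A dataset generated by a model run that used two input datasets and a model.
TRIPLES = {
    (EX + "flood-map", PROV + "wasGeneratedBy", EX + "model-run-1"),
    (EX + "model-run-1", PROV + "used", EX + "elevation-data"),
    (EX + "model-run-1", PROV + "used", EX + "rainfall-data"),
    (EX + "model-run-1", PROV + "used", EX + "hydrological-model"),
    (EX + "rainfall-data", PROV + "wasGeneratedBy", EX + "sensor-collection"),
}


def used_by_generating_activities(entity, triples):
    """Entities used by any activity that generated `entity`."""
    activities = {
        o for s, p, o in triples
        if s == entity and p == PROV + "wasGeneratedBy"
    }
    return {
        o for s, p, o in triples
        if s in activities and p == PROV + "used"
    }


def upstream(entity, triples):
    """All entities reachable by repeatedly following generation/usage links."""
    seen = set()
    stack = [entity]
    while stack:
        current = stack.pop()
        for source in used_by_generating_activities(current, triples):
            if source not in seen:
                seen.add(source)
                stack.append(source)
    return seen


INPUTS = upstream(EX + "flood-map", TRIPLES)
```

Here `INPUTS` contains the two input datasets and the model, which is the kind of formal link between resources that a free-text dct:provenance statement cannot provide.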
In my understanding, PROV can be proposed as the reference vocabulary for data provenance / lineage, and it may be replaced by less complex "solutions" (dct:source, dct:provenance) for specific (and simple) use cases.
These options needn't be mutually exclusive, and there shouldn't be interoperability issues here, since we already have a mapping between dct:source and prov:wasDerivedFrom and between dct:provenance and prov:has_provenance (see http://www.w3.org/TR/prov-dc/).
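A minimal sketch of how such a mapping could be applied in practice: each statement using a DC Terms property is complemented with the corresponding PROV statement, so consumers that only understand PROV still see the lineage. The two property pairs are the ones named in the comment above; the function name, the triple representation and the `http://example.org/` resources are assumptions made for illustration.

```python
DCT = "http://purl.org/dc/terms/"
PROV = "http://www.w3.org/ns/prov#"

# Property pairs as stated in the comment above (see the PROV-DC note).
DCT_TO_PROV = {
    DCT + "source": PROV + "wasDerivedFrom",
    DCT + "provenance": PROV + "has_provenance",
}


def add_prov_equivalents(triples):
    """Return the input triples plus the PROV statements they imply."""
    inferred = {
        (s, DCT_TO_PROV[p], o)
        for s, p, o in triples
        if p in DCT_TO_PROV
    }
    return set(triples) | inferred


# Hypothetical resources, for illustration only.
SIMPLE = {
    ("http://example.org/flood-map", DCT + "source", "http://example.org/rainfall-data"),
    ("http://example.org/flood-map", DCT + "title", "Flood map"),
}
ENRICHED = add_prov_equivalents(SIMPLE)
```

The original dct:source statement is kept alongside the inferred prov:wasDerivedFrom one, so the simple and the PROV-based views of the same record can coexist.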