
Enable syndication of dataset metadata?

Anonymous (not verified)
Published on: 03/04/2013 Discussion Archived

In order to support the use case of cross-catalogue search of datasets, there needs to be a mechanism for transferring DCAT data from one catalog to an aggregator. Given that there may be many catalogs interested in aggregating DCAT data, it would be useful to have a mechanism that supports harvesting of data.

The aggregator needs information about new datasets, updated datasets and deleted datasets (the CUD in CRUD) to make sure it doesn't provide incorrect information.

Proposed solution

One practical way to solve this is to use the Atom Syndication Format (RFC 4287) to carry links to DCAT metadata. Atom is comparable to RSS, is widely used and has multiple implementations on various platforms, which makes it easy to implement and debug. However, Atom only supports information about created and updated objects and needs to be extended with a method to describe deleted items. A proposal for a "tombstones" extension of Atom is available here.

A catalog provides a feed for aggregators to harvest. An aggregator harvests the feed periodically and builds its own index for searching etc.
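To make the harvesting flow concrete, below is a minimal, illustrative sketch of an aggregator-side pass over such a feed. The feed URL is hypothetical, and the sketch assumes deletions are announced with the proposed Atom "tombstones" extension (at:deleted-entry elements); neither assumption is mandated by the proposal above.

    # Sketch of harvesting an Atom feed that advertises DCAT metadata.
    # Assumptions (not part of the proposal above): the feed URL is hypothetical
    # and deletions are announced via the Atom tombstones extension.
    import urllib.request
    import xml.etree.ElementTree as ET

    FEED_URL = "https://example.org/catalog/feed.atom"  # hypothetical
    NS = {
        "atom": "http://www.w3.org/2005/Atom",
        "at": "http://purl.org/atompub/tombstones/1.0",  # tombstones extension
    }

    def harvest(feed_url=FEED_URL):
        """Return (created_or_updated, deleted) from one pass over the feed."""
        with urllib.request.urlopen(feed_url) as resp:
            feed = ET.fromstring(resp.read())

        created_or_updated = []
        for entry in feed.findall("atom:entry", NS):
            entry_id = entry.findtext("atom:id", default="", namespaces=NS)
            updated = entry.findtext("atom:updated", default="", namespaces=NS)
            # Each entry is assumed to link to the DCAT description of a dataset.
            links = [link.get("href") for link in entry.findall("atom:link", NS)]
            created_or_updated.append((entry_id, updated, links))

        # Tombstones announce entries that were deleted from the catalog.
        deleted = [(d.get("ref"), d.get("when"))
                   for d in feed.findall("at:deleted-entry", NS)]
        return created_or_updated, deleted

The aggregator would then dereference the entry links to fetch the DCAT descriptions and apply the creates, updates and deletes to its own index.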

Component

Documentation

Category

support

Comments

Anonymous (not verified) Wed, 03/04/2013 - 18:00
Makx DEKKERS Wed, 03/04/2013 - 21:08

I think there are two issues here:

1. the "what", i.e. defining the set of relevant classes and properties that are exchanged between data portals

2. the "how", i.e. the mechanisms by which data can be exchanged.

Obviously, both these aspects are important. However, the charter of the Working Group as defined under the objectives on the description page currently focuses on the first, the "what".

It seems to me that the two issues are orthogonal: you can define the content independently from the transport mechanism and logistics. I would prefer we concentrate our efforts on the content first. We can maybe look at the logistics later.

Anonymous (not verified) Tue, 16/04/2013 - 13:39

According to the Process and Methodology for Developing Core Vocabularies, although there are a number of well-established underlying technologies for sharing data, and without prejudice towards other technologies, the process and methodology we are following expressly encourages the use of Linked Data, an application of the Semantic Web for which RDF is the technological foundation.

In this specific case we have adopted DCAT, an RDF vocabulary, as the basis for this work. RDF, like every technology, has several "pros and cons" and, from my experience working with several local, regional and national Governments in Open Data initiatives, one of the main "cons" is the lack of expertise and understanding of RDF and other Semantic Web or Linked Data related technologies and concepts.

Once we have opted to support RDF technologies, some consequences may arise, especially if we really want this work to result in a graph of connected, interoperable concepts with a common framework for their URI space (again, as per the methodology).

I agree this group shouldn't focus on the HOW, but I also think a minimal orientation to connect the WHAT and the HOW is needed (even just a paragraph about the importance of not just sharing common metadata but also publishing it in the right way, pointing to some reference materials such as the Best Practices for Publishing Linked Data).

stijngoedertier (not verified) Tue, 16/04/2013 - 23:35

I agree that protocols for exchanging description metadata are not in scope for this WG. However, there might be a reason to look a bit deeper into dcat:CatalogRecord.

  • Without a dcat:CatalogRecord: Without information on the dcat:CatalogRecords, it is only possible to exchange description metadata as a snapshot. This can work pretty well. In this virtual meeting we discussed the requirements for exchanging description metadata in the context of ADMS and the federated collections of interoperability assets on Joinup. A snapshot-based approach was taken to lower the entry barrier for our federation partners. The latest snapshot with ALL description metadata is periodically sent by our federation partners, or periodically harvested from a harvest URL by Joinup. The aggregator (here Joinup) figures out for itself what has changed and creates, updates, and deletes entries accordingly. This approach has worked well so far because files with description metadata are relatively small (at most a few megabytes) to process and the latency in reflecting changes is acceptable.
  • With a dcat:CatalogRecord: With information regarding the dcat:CatalogRecord, more intelligent exchange protocols are possible, yet they impose more requirements on the data portals. More specifically, the property dct:modified of the catalogue record allows exchanging only those records and associated dataset descriptions that have been created or updated after the last harvest (a minimal sketch follows this list); note that it is not clear how deleted records can be treated in DCAT (see OAI-PMH's take on this). This approach, however, requires the data portals to be aware of all metadata changes.
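As an illustration of the second option, here is a minimal sketch of an incremental harvest driven by dct:modified on catalogue records. The harvest URL and the stored last-harvest timestamp are assumptions made for the example; the sketch also assumes the records expose dct:modified as typed xsd:dateTime values.

    # Sketch of an incremental harvest based on dcat:CatalogRecord / dct:modified.
    from rdflib import Graph, Literal
    from rdflib.namespace import XSD

    QUERY = """
    PREFIX dcat: <http://www.w3.org/ns/dcat#>
    PREFIX dct:  <http://purl.org/dc/terms/>
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT ?record ?dataset ?modified WHERE {
      ?record a dcat:CatalogRecord ;
              foaf:primaryTopic ?dataset ;
              dct:modified ?modified .
      FILTER (?modified > ?since)
    }
    """

    def changed_since(harvest_url, last_harvest_iso):
        """Return catalogue records (and their datasets) modified after the
        previous harvest. Assumes records carry dct:modified as xsd:dateTime."""
        g = Graph()
        g.parse(harvest_url)  # RDF/XML, Turtle, ... as served by the portal
        since = Literal(last_harvest_iso, datatype=XSD.dateTime)
        return list(g.query(QUERY, initBindings={"since": since}))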

Bottom line: I guess it is a good thing that dcat:CatalogRecord is optional in the current application profile (Draft 2).

Enric STAROMIEJSKI Wed, 17/04/2013 - 11:15

OK, the HOW is out of our scope, but then who should take on defining the HOW? I think we shouldn't at least miss the opportunity of issuing a recommendation to ISA aimed at integrating this interoperability paradigm into its interoperability strategy and actions, and probably at updating the EIF and guides. Would it be out of scope for this Working Group to write a single-paragraph recommendation (as Carlos says) to ISA?

Makx DEKKERS Wed, 17/04/2013 - 12:02

I do like Carlos' suggestion to add a paragraph pointing to best practice for publishing Linked Data, maybe in combination with the pointers to URI policies.

stijngoedertier (not verified) Thu, 18/04/2013 - 14:42

Just another note to address Peter's concerns with regard to created, updated, and deleted datasets. 

In the context of the Open Data Support project, we are developing a metadata broker platform. The platform will expose metadata harvested from various data portals via its SPARQL endpoint and will harmonise all description metadata according to the DCAT Application Profile. We will try to put in some intelligence so that the dcat:CatalogRecord will contain appropriate provenance data, including (exact or approximate) information about when the record was created (dct:issued) and modified (dct:modified), even when the data portal of origin does not provide this data. However, how can we reflect in DCAT whether a catalog record was deleted?
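Purely as an illustration of the issued/modified part (the deletion question remains open), a broker could fall back to its own harvest time when the portal of origin supplies no dates. The fallback timestamps and the way the record URI is obtained are assumptions made for the example, not a description of the actual Open Data Support platform.

    # Sketch: attach approximate provenance to a harvested catalogue record
    # when the portal of origin does not provide dct:issued / dct:modified.
    from datetime import datetime, timezone
    from rdflib import Graph, URIRef, Literal
    from rdflib.namespace import DCTERMS, XSD

    def ensure_provenance(g: Graph, record: URIRef) -> None:
        now = Literal(datetime.now(timezone.utc).isoformat(), datatype=XSD.dateTime)
        if (record, DCTERMS.issued, None) not in g:
            # Approximation: the first time the broker saw the record.
            g.add((record, DCTERMS.issued, now))
        if (record, DCTERMS.modified, None) not in g:
            # Approximation: the time of the latest harvest.
            g.add((record, DCTERMS.modified, now))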

Makx DEKKERS Mon, 22/04/2013 - 21:46

No deployment issues will be included in the specification, other than pointers to best practice in publishing open linked data and URI policies.

stijngoedertier (not verified) Tue, 30/04/2013 - 10:55

I have written out our requirements for metadata exchange as a use case, to see to what extent it could affect the Application Profile. Here it is:

When data portals exchange description metadata, they need a mechanism to keep the exchanged metadata up-to-date. Otherwise, outdated description metadata might pollute the “federation of data portals”. For example, without a proper mechanism, deleted datasets continue to be listed on the websites of aggregators.  This mechanism can be based on the exchange of catalog records (A) or on the exchange of an entire snapshot (B).


Mechanism A. Exchange based on catalog records: A set of catalog records that have been created, updated, or deleted during a specific time interval – typically the last update period – is exchanged between a Data Portal and a Metadata Broker. This happens in the following steps (a minimal sketch of the broker-side update follows the list):

  1. Recordkeeping by Data Portal: The Data Portal keeps track of catalog records that represent the latest create, update, and delete transactions to its metadata;
  2. Exchange (push or pull): Periodically, the Data Portal pushes the catalog records that have been created, updated, or deleted to the Metadata Broker. Alternatively, the Metadata Broker periodically pulls (metadata harvesting) the metadata records that have been created, updated, or deleted from the Data Portal.
  3. Update by Metadata Broker: The Metadata Broker updates its own metadata to reflect the changes indicated in the catalog records.
    • Created records: It will create the metadata for all catalog records that have been created. For example, if a new dataset was added to the collection of the Data Portal, the Metadata Broker will incorporate its description metadata;
    • Updated records: It will reflect updates to the metadata for all catalog records that indicate an update of metadata. For example, if the description metadata of a dataset was updated on the Data Portal, the Metadata Broker will reflect all changes;
    • Deleted records: It will delete the metadata for all catalog records that indicate a deletion of metadata. For example, if a dataset was removed from the collection of a Data Portal, the Metadata Broker will reflect this.
  4. Recordkeeping by Metadata Broker: The Metadata Broker uses the same catalog records as the Data Portal. In turn, it can offer a CatalogRecord-based exchange of metadata.
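A minimal sketch of step 3 (the broker-side update) is given below. The record structure – an identifier, a change type, and the associated description metadata – is assumed purely for illustration; the use case does not prescribe a particular serialisation.

    # Sketch of Mechanism A, step 3: applying exchanged catalog records.
    # A record is assumed to be a dict with "change", "dataset_id" and
    # (for creates/updates) "description"; this structure is illustrative.
    def apply_catalog_records(broker_index, records):
        """broker_index maps dataset identifiers to description metadata."""
        for record in records:
            change, dataset_id = record["change"], record["dataset_id"]
            if change in ("create", "update"):
                broker_index[dataset_id] = record["description"]
            elif change == "delete":
                broker_index.pop(dataset_id, None)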


Mechanism B. Snapshot-based exchange: A metadata snapshot is exchanged between a Data Portal and a Metadata Broker that contains all metadata exactly as it appears at a specific point in time.

  1. No recordkeeping by the Data Portal: The Data Portal does not (need to) keep track of catalog records.
  2. Exchange (push or pull): Periodically, the Data Portal pushes a snapshot of all its metadata to the Metadata Broker. Alternatively, the Metadata Broker pulls (metadata harvesting) a snapshot from the Data Portal.
  3. Update by Metadata Broker: The Metadata Broker updates its own metadata but also incorporates catalog records to reflect creates, updates, and deletes to the metadata. The latter can be achieved if the Metadata Broker compares the snapshot with the previous snapshot for the Data Portal (a minimal sketch of this comparison follows the list).
    • Unchanged metadata: The Metadata Broker updates neither the metadata nor the corresponding catalog records. For example, if a description of a dataset remains unchanged between the current and the previous snapshot, no updates are needed.
    • Created metadata: The Metadata Broker adds metadata which has been added to the snapshot and also creates a catalog record to reflect this. For example, if a description of a dataset was added to the current snapshot that was not present in the previous snapshot, the Metadata Broker will also incorporate this description metadata and it will create a catalog record to reflect the creation.
    • Updated metadata: The Metadata Broker updates metadata which has been updated and also updates the modification date of the catalog record to reflect this. For example, if the title of a dataset is updated, the Metadata Broker will apply this update and update the modification date of the corresponding catalog record to reflect this.
    • Deleted metadata: By comparing the snapshot with a previous snapshot, the Metadata Broker detects that some metadata has been removed; it removes the metadata but leaves a catalog record to reflect the deletion. For example, if a dataset is removed from the collection of a Data Portal, the Metadata Broker will delete the Dataset and include information about the “deleted entry” in its catalog records.
  4. Recordkeeping by Metadata Broker: The Metadata Broker can now offer a CatalogRecord-based exchange of metadata.
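The comparison in step 3 can be sketched as follows; representing a snapshot as a map from dataset identifier to description metadata is an assumption made only for illustration.

    # Sketch of Mechanism B, step 3: deriving creates, updates and deletes
    # by comparing the current snapshot with the previous one.
    def diff_snapshots(previous, current):
        created = {k: v for k, v in current.items() if k not in previous}
        updated = {k: v for k, v in current.items()
                   if k in previous and previous[k] != v}
        deleted = {k: previous[k] for k in previous if k not in current}
        # Unchanged entries require no update to metadata or catalog records.
        return created, updated, deleted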
Makx DEKKERS Wed, 01/05/2013 - 12:29

I am adding a section on "Exchange of data" in section 11 on deployment in draft 3 for review by the Working Group.

stijngoedertier (not verified) Thu, 02/05/2013 - 11:14


@Andrea: Thanks for referring us to DCIP. One solution for us could be to reuse the DCIP change_type property and add it to dcat:CatalogRecord:

change_type (or dct:type): the type of the latest revision of a dataset's entry in the catalog. MUST take one of the values "create", "update" or "delete" depending on whether this latest revision is as a result of a creation, update or deletion.
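For illustration, recording such a change type on a catalogue record could look like the sketch below; the record URI is hypothetical.

    # Sketch: a DCIP-style change type recorded with dct:type on a
    # dcat:CatalogRecord. The record URI is a placeholder.
    from rdflib import Graph, Namespace, URIRef, Literal
    from rdflib.namespace import DCTERMS, RDF

    DCAT = Namespace("http://www.w3.org/ns/dcat#")

    g = Graph()
    record = URIRef("http://example.org/catalog/record/123")  # hypothetical
    g.add((record, RDF.type, DCAT.CatalogRecord))
    g.add((record, DCTERMS.type, Literal("delete")))  # "create" | "update" | "delete"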
Anonymous (not verified) Tue, 07/05/2013 - 00:14

If the AP is intended to be used in Linked Data environments, I suggest a stronger encouragement in section 11 (i.e. "should" instead of the current "may").

Makx DEKKERS Tue, 07/05/2013 - 09:00

I can change "...it may be useful for publishers to consider..." to "...publishers should consider...".

Andrea PEREGO Fri, 10/05/2013 - 15:38

I have some concerns about using dct:type to model the DCIP change_type field.

BTW, a new version of the DCIP spec has been published today, and the relevant part [1] now reads as follows:

change_type MUST take one of the values create, update or delete depending on whether this latest revision is as a result of an update, creation or deletion.

Here, as in the original definition quoted by Stijn, change_type denotes the type of the latest revision, not the type of the catalogue record.

I would suggest we use adms:status [2] instead.

  1. http://spec.datacatalogs.org/#response-format

  2. https://dvcs.w3.org/hg/gld/raw-file/default/adms/index.html#adms-status
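
For comparison with the dct:type sketch earlier in the thread, using adms:status could look as follows. The record URI and the status value are placeholders; the actual controlled vocabulary for statuses would still need to be agreed.

    # Sketch: marking a deleted catalogue record with adms:status instead of
    # dct:type. Both the record URI and the status URI are placeholders.
    from rdflib import Graph, Namespace, URIRef
    from rdflib.namespace import RDF

    DCAT = Namespace("http://www.w3.org/ns/dcat#")
    ADMS = Namespace("http://www.w3.org/ns/adms#")

    g = Graph()
    record = URIRef("http://example.org/catalog/record/123")  # hypothetical
    g.add((record, RDF.type, DCAT.CatalogRecord))
    g.add((record, ADMS.status, URIRef("http://example.org/status/deleted")))  # placeholder value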