In order to support the use case of cross-catalogue search of datasets, there needs to be a mechanism for transferring DCAT data from one catalog to an aggregator. Given that there may be many catalogs interested in aggregating DCAT data, it would be useful to have a mechanism that supports harvesting of that data.
The aggregator needs information about new, updated and deleted datasets (the CUD in CRUD) to make sure it doesn't provide incorrect information.
Proposed solution
A practical method to solve this is to use the Atom Syndication Format (RFC 4287) to carry links to DCAT metadata. Atom is comparable to RSS, is in widespread use, and has multiple implementations on various platforms, which makes it easy to implement and debug. However, Atom only supports information about created and updated objects and needs to be extended with a method to describe deleted items. A proposal for a "tombstones" extension of Atom is available here.
A catalog provides a feed for aggregators to harvest. The aggregator harvests such feeds periodically and builds its own index for searching, etc.
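To make the proposal concrete, here is a minimal harvester sketch in Python (feedparser + rdflib). The feed URL and the assumption that each Atom entry advertises its DCAT description via a link with an RDF media type are illustrative only, and tombstone handling for deleted items is omitted.

```python
# Minimal harvester sketch, assuming a hypothetical feed URL and that each
# Atom entry links to a DCAT description with an RDF media type.
import feedparser
from rdflib import Graph

FEED_URL = "https://catalog.example.org/datasets.atom"  # hypothetical

def harvest(feed_url: str) -> Graph:
    """Fetch an Atom feed and load the linked DCAT metadata into one graph."""
    index = Graph()
    feed = feedparser.parse(feed_url)
    for entry in feed.entries:
        for link in entry.get("links", []):
            if link.get("type") in ("text/turtle", "application/rdf+xml"):
                fmt = "turtle" if link["type"] == "text/turtle" else "xml"
                index.parse(link["href"], format=fmt)
    return index

if __name__ == "__main__":
    print(len(harvest(FEED_URL)), "triples harvested")
```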
Comments
I think there are two issues here:
1. the "what", i.e. defining the set of relevant classes and properties that are exchanged between data portals
2. the "how", i.e. the mechanisms by which data can be exchanged.
Obviously, both these aspects are important. However, the charter of the Working Group as defined under the objectives on the description page currently focuses on the first, the "what".
It seems to me that the two issues are orthogonal: you can define the content independently from the transport mechanism and logistics. I would prefer we concentrate our efforts on the content first. We can maybe look at the logistics later.
Hi,
Indeed, the HOW aspect is beyond the scope of the WG.
For more details on sharing and exchange of metadata between Open Data portals, you may refer to:
the Open Data Support project
Integrating Linked Metadata Repositories into the Web of Data (Gofran Shukair, Nikolaos Loutas and Vassilios Peristeras)
According to the Process and Methodology for Developing Core Vocabularies, although there are a number of well-established underlying technologies for sharing data, and without prejudice towards other technologies, the process and methodology we are following expressly encourage the use of Linked Data, an application of the Semantic Web for which RDF is the technological foundation.
In this specific case we have adopted DCAT, an RDF vocabulary, as the basis for this work. RDF, like every technology, has its "pros and cons" and, from my experience working with several local, regional and national governments in Open Data initiatives, one of the main "cons" is the lack of expertise and understanding of RDF and other Semantic Web or Linked Data related technologies and concepts.
Once we have opted for supporting RDF technologies, certain consequences may arise, especially if we really want this work to result in a graph of connected, interoperable concepts with a common framework for their URI space (again, as per the methodology).
I agree this group shouldn't focus on the HOW, but I also think a minimal orientation connecting the WHAT and the HOW is needed (even just a paragraph about the importance of not just sharing common metadata but also publishing it in the right way, and pointing to some reference material such as the Best Practices for Publishing Linked Data).
I agree that protocols for exchanging description metadata are not in scope for this WG. However, there might be a reason to look a bit deeper into dcat:CatalogRecord.
Bottom line: I guess it is a good thing that dcat:CatalogRecord is optional in the current application profile (Draft2).
OK, the HOW is out of our scope, but then who should take on defining the HOW? I think we should at least not miss the opportunity of issuing a recommendation to ISA aimed at integrating this interoperability paradigm into its interoperability strategy and actions, and probably at updating the EIF and guides. Would it be out of scope for this Working Group to write a single-paragraph recommendation (as Carlos says) to ISA?
I do like Carlos' suggestion to add a paragraph pointing to best practice for publishing Linked Data, maybe in combination with the pointers to URI policies.
Just another note to address Peter's concerns with regard to created, updated, and deleted datasets.
In the context of the Open Data Support project, we are developing a metadata broker platform. The platform will expose metadata harvested from various data portals through its SPARQL endpoint and will harmonise all description metadata according to the DCAT Application Profile. We will try to put in some intelligence so that the dcat:CatalogRecord contains appropriate provenance data, including (exact or approximate) information on when the record was created (dct:issued) and modified (dct:modified), even when the data portal of origin does not provide this data. However, how can we reflect in DCAT that a catalog record was deleted?
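As an illustration of the kind of harmonised record described above, here is a hedged sketch in Turtle, parsed with rdflib; the broker.example.org namespace and the record and dataset identifiers are placeholders, not Open Data Support URIs.

```python
# Sketch of a dcat:CatalogRecord carrying provenance dates (dct:issued,
# dct:modified). The ex: namespace is a placeholder.
from rdflib import Graph

record_ttl = """
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <http://broker.example.org/> .

ex:record-42 a dcat:CatalogRecord ;
    foaf:primaryTopic ex:dataset-42 ;
    dct:issued   "2013-01-15"^^xsd:date ;
    dct:modified "2013-04-02"^^xsd:date .
"""

g = Graph()
g.parse(data=record_ttl, format="turtle")
print(len(g), "triples in the record")
```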
No deployment issues will be included in the specification, other than pointers to best practice in publishing open linked data and URI policies.
I have written out our requirements for metadata exchange as a use case, to see to what extent it could affect the Application Profile. Here it is:
When data portals exchange description metadata, they need a mechanism to keep the exchanged metadata up-to-date. Otherwise, outdated description metadata might pollute the “federation of data portals”. For example, without a proper mechanism, deleted datasets continue to be listed on the websites of aggregators. This mechanism can be based on the exchange of catalog records (A) or on the exchange of an entire snapshot (B).
Mechanism A. Exchange based on catalog records: A set of catalog records that have been created, updated, or deleted within a specific time interval – typically the last update period – is exchanged between a Data Portal and a Metadata Broker (a sketch of such an incremental pull is given after mechanism B below). This happens in the following steps:
Mechanism B. Snapshot-based exchange: A metadata snapshot is exchanged between a Data Portal and a Metadata Broker that contains all metadata exactly as it appears at a specific point in time.
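For mechanism A, a sketch of what an incremental pull against a broker's SPARQL endpoint (such as the one mentioned above) could look like; the endpoint URL and cut-off timestamp are hypothetical, and the query assumes records carry dct:modified.

```python
# Mechanism A sketch: select catalogue records modified since a cut-off.
# Endpoint URL and timestamp are hypothetical.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("https://broker.example.org/sparql")  # hypothetical
endpoint.setQuery("""
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dct:  <http://purl.org/dc/terms/>
PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>

SELECT ?record ?modified WHERE {
  ?record a dcat:CatalogRecord ;
          dct:modified ?modified .
  FILTER (?modified > "2013-04-01T00:00:00Z"^^xsd:dateTime)
}
""")
endpoint.setReturnFormat(JSON)
for row in endpoint.query().convert()["results"]["bindings"]:
    print(row["record"]["value"], row["modified"]["value"])
```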
In response to this post that specifies a use case for CatalogRecord, I propose to make the class CatalogRecord optional instead of excluded. This also takes care of the issue raised at http://joinup.ec.europa.eu/discussion/excluded-versus-optional-classes.
It may be worth investigating whether the upcoming DCIP (Data Catalog Interoperability Protocol) [1] addresses the issues outlined by Stijn.
I am adding a section on "Exchange of data" in section 11 on deployment in draft 3 for review by the Working Group.
If the AP is intended to be used in Linked Data environments, I suggest stronger encouragement in section 11 (i.e. "should" instead of the current "may").
I can change "...it may be useful for publishers to consider..." to "...publishers should consider...".
I have some concerns about using dct:type to model the DCIP change_type field.
BTW, a new version of the DCIP spec has been published today, and the relevant part [1] now reads as follows:
Here, as in the original definition quoted by Stijn, change_type denotes the type of latest revision, not the type of catalogue record.
I would suggest we use adms:status [2] instead.
See http://spec.datacatalogs.org/#response-format
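For illustration only, a sketch of what marking the latest change on a catalogue record with adms:status could look like; the status URI is a placeholder, since the thread does not settle on a controlled vocabulary for change types.

```python
# Sketch: recording the latest revision type with adms:status.
# ex:deleted is a placeholder status value, not a term from ADMS or DCIP.
from rdflib import Graph

deleted_record_ttl = """
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix adms: <http://www.w3.org/ns/adms#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <http://broker.example.org/> .

ex:record-42 a dcat:CatalogRecord ;
    dct:modified "2013-05-01"^^xsd:date ;
    adms:status  ex:deleted .
"""

g = Graph()
g.parse(data=deleted_record_ttl, format="turtle")
```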