Now that we have cardinality for properties in draft 2 v0.03, I would like to raise some potential issues:
- First a general question: should cardinality for non-mandatory properties always be in the 0..x form? I am not sure myself in this case, as I assume the cardinality applies once you decide to use that property, but at the same time it is a little confusing because I am used to that notation for "optional things". Even more confusing is that different criteria seem to be applied to recommended and optional properties.
- Update/modification date -> should it be 0..1? (you don't have any such date until first update/modification)
- Identifier -> should it be 0..1, or even 1..1? (even if multiple identifiers are possible to allow for local identifiers, as Makx said, dct:identifier is defined as "An unambiguous reference to the resource within a given context", so this is about the unique identifier within a given system - in this case a given Catalog - even when other identifiers may apply for other systems, but you shouldn't be using dct:identifier for them)
- download URL -> should it be 0..n? what if we have several mirrors for the same Distribution for example?
- format -> should it be 0..n? what if we have a Dataset of zipped image resources in different formats for example?
- license -> should it be 0..1? (and thus optional) This is without doubt a must have, but unfortunately my experience is that license information is frequently not easy to get, or sometimes even impossible to get, because even the data managers don't know about it.
Some other background references:
https://joinup.ec.europa.eu/asset/dcat_application_profile/issue/nolice…
https://joinup.ec.europa.eu/asset/dcat_application_profile/issue/cardin…
https://joinup.ec.europa.eu/asset/dcat_application_profile/issue/max-ca…
https://joinup.ec.europa.eu/asset/dcat_application_profile/issue/use-ma…
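As a rough illustration of the min..max reading raised in the first question (only a sketch; `Cardinality` and `parse_cardinality` are hypothetical names and not part of DCAT-AP), a minimum of 0 would mark a property as optional, while the maximum only constrains how many values may be given once the property is used:

```python
from typing import NamedTuple, Optional


class Cardinality(NamedTuple):
    minimum: int
    maximum: Optional[int]  # None stands for an unlimited maximum ("n")

    @property
    def is_optional(self) -> bool:
        # A minimum of 0 means the property may be left out entirely.
        return self.minimum == 0

    def allows(self, count: int) -> bool:
        # True if a description with `count` values satisfies the cardinality.
        if count < self.minimum:
            return False
        return self.maximum is None or count <= self.maximum


def parse_cardinality(text: str) -> Cardinality:
    # Accepts strings such as "0..1", "1..1" or "0..n".
    low, high = text.split("..")
    maximum = None if high in ("n", "x", "*") else int(high)
    return Cardinality(int(low), maximum)


print(parse_cardinality("0..1").is_optional)  # optional, at most one value
print(parse_cardinality("1..1").allows(0))    # mandatory, so zero values fails
```

Under this reading, "1..1" on a recommended or optional property is what causes the confusion: it can only be understood as "exactly one, if used at all".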
Comments
Carlos, good questions:
1- cardinality: yes I was struggling with that too. I could just give maximum cardinality (either 1 or unlimited). Would that help?
2- update/modification date: it's not mandatory for any class: recommended for Catalog and optional for Dataset and Distribution. So I don't see the problem.
3- identifier: your assumption that the context is the catalogue may not be correct. There are also more general and domain-specific contexts. A Dataset may have one or more external identifiers (examples: IVOA, ADS, DOI, DCO-ID, ARK, EZID, and there might be many more). I propose: no change
4- downloadURL: indeed, this should allow multiple URLs in case of mirroring. I propose: change cardinality to unlimited
5- format: we currently do not cover cases with files within files. ADMS does this by having a property adms:representationTechnique to tell you what is inside a file. We could do the same here if necessary but I note that DCAT itself does not consider such cases. Maybe this should be brought up with W3C GLD?
6- license: this is indeed a difficult issue. I am told by legal people that if a licence is not specified, the default position is that you cannot use the resource. That's why it's mandatory in the spec. Making it recommended or even optional might mean that data managers (and yes it will be hard to find out in some cases) will not bother to try and find out. Then you'll end up with lots of datasets of which you don't know whether you can re-use them or not.
1 - That might be better, but personally I would prefer to give a min of 0 for anything optional, as that's the way I am used to and it also provides useful information about what's optional and what's not. In any case, let's also see what is clearer for others.
2 - The problem is that current 1..1 means it is mandatory indeed (at least for me). This is closely related to the previous one.
3 - I think it is not my assumption but the official dct:identifier definition. Of course I am not going to discuss dc meanings with you because I have nothing to gain :) Still, I think that an identifier should be unique within a system (by definition); if not, how do you know what the identifier is? Anyway, we may want to broaden the conversation and see what others think.
4 - ok
5 - Yes, a complex use case indeed. Probably not for this group.
6 - I fully agree with the importance of having license information and also understand the legal issue, but we can't ignore real-world evidence either. There are lots of cases where license information is simply unknown. I know that making this optional is like a back door for people who don't care about it at all, but on the other hand, if it is finally required, how are we going to manage the (multiple) cases where you just don't have that information and can't figure it out in any way? Can't you publish the dataset then? We would need to remove several from current online catalogs in that case. Just raising the issue; I don't have a good fix to suggest for this, and I am afraid that nobody may have one.
On 3, maybe to clarify: I intended to say that it may not be true that the only context for a particular dataset is a specific catalogue. The DCMI specification leaves open what the "given context" is. If you allow multiple contexts, you can have multiple identifiers.
Yes, DCMI leaves open what the context is, but I think our specific use case does not, and that's my point. A Catalog is a given single context, not a multiple one. Identifiers should be unique within the Catalog context, even when datasets may use other identification systems in different contexts (that are not identifiers any more in this specific context).
Carlos, I very much disagree that the Catalog is the only context for a given Dataset. A Dataset can be part of many contexts at the same time. An example from another domain is a digital book in a library. That book may have a local identifier in the context of the catalogue of the library that holds it, but also an identifier in the context of all published books (ISBN), an identifier in the context of digital objects (DOI), an identifier in the context of serial publications if it is part of a series (ISSN), an identifier in the context of the national bibliography (NBN) etc. etc.
It seems to me that it would be unhelpful if we did not allow the inclusion of 'external' identifiers in descriptions of Datasets. In my mind, that would be a serious limitation of functionality. What would be the benefit of such a restriction?
on points 1 and 2, about cardinalities, I propose to do the following:
Ok for 1 and 2
With respect to 3, I think I was unable to explain myself correctly.
I don't think that the Catalog is the only context for a given Dataset either, but that a Dataset has only a given context at the same time in any given scenario.
Of course every Dataset may have several IDs for different scenarios, but each of these IDs is unique in a given context. Similarly, I also have different "IDs" for different scenarios (my ID card, my driving license, my passport, my local library pass, my swimming pool pass and so on...) but I can use each of them as an ID only in its specific context (I can't use my local library pass at the swimming pool, for example, nor my driving license or passport). In any other context they are not IDs any more (for that context), even though they remain semantically IDs, each for its corresponding system.
What I mean is that people usually create specific dataset ID systems for their Open Data catalogues, and those are the only valid IDs within that system (the Open Data Catalog). If I model other external references (that are indeed used as IDs in other systems) as dct:identifier in my current system (the Catalog), how could one know - while querying, for example - what the "real" identifier is in this system? I will be getting several IDs for such a query, so what's the Dataset ID in the Catalog then? (And it gets worse if you take into account that, when working with RDF, you will probably be using the Dataset URI as the ID in order to dereference further knowledge.)
My main point on this, again, is that several IDs can't coexist as such in the same context. If they do, then you have a broken ID system. Maybe I am thinking too much in terms of programming here, but I really think we will be creating a big issue if we don't differentiate between the system IDs and other external IDs and use the same property for both.
I can't imagine how I could explain myself better :) so I encourage everybody else to also provide their thoughts on this so that we can advance.
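To make the querying concern concrete, here is a minimal sketch in plain Python (no RDF library; the dataset URI and all identifier values are made up for illustration). When native and external identifiers share the same property, a lookup returns all of them with nothing to tell them apart:

```python
DCT_IDENTIFIER = "http://purl.org/dc/terms/identifier"

# Hypothetical description of one Dataset; every value below is invented.
DATASET = "http://example.org/catalog/dataset/42"
triples = [
    (DATASET, DCT_IDENTIFIER, "ds-42"),                # assigned by the Catalog
    (DATASET, DCT_IDENTIFIER, "doi:10.9999/example"),  # external, illustrative
    (DATASET, DCT_IDENTIFIER, "ark:/99999/example"),   # external, illustrative
]


def identifiers(subject, graph):
    """Return every dct:identifier value recorded for the subject."""
    return [o for s, p, o in graph if s == subject and p == DCT_IDENTIFIER]


found = identifiers(DATASET, triples)
print(found)  # three values, and nothing marks which one is native
```

A consumer would have to guess from the shape of each string which value belongs to the Catalog, which is exactly the machine-readability problem described above.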
Carlos, I think you gave a possible answer how to distinguish between the "native" ID (the ID in the context of the catalogue) and external identifiers: the native (or as you call it, the real) identifier in the context of the catalogue is the Dataset URI (the one you use in the rdf:about); all the others are dct:identifiers.
Makx, I think this interpretation goes far beyond what could be assumed by reading just the AP spec. If that is the expected model to follow, then it should be explicitly stated somewhere. In any case I think it shouldn't be, as you may want to use "native" IDs other than URIs. Examples:
http://opendata.euskadi.net/w79-contdata/es/contenidos/ds_general/calen…
http://opendata.aragon.es/dataset/hogares-familia-caracteristicas-hogar…
http://www.zaragoza.es/datosabiertos/sparql
http://open-data.europa.eu/en/data/dataset/cO7r1ma3AEH2Zkq78aJjmQ
Even if you still decide to use URIs you may want to keep that also as part of your metadata, as in the Spanish Technical Interoperability Standard for the Reuse of Information Resources (NTI).
Finally, it looks like the EC Open Data portal doc has opted also for different id metadata for "native" and "external" ones http://open-data.europa.eu/files/MetadataVocabulary.ods
Carlos, I looked at the file at http://open-data.europa.eu/files/MetadataVocabulary.ods and I think that they do what I wrote in my previous comment:
So they use the Dataset URI (the native one) as the subject in all triples and not as the object of an identifier predicate.
I still do not understand whether you really want to exclude the exchange of these "other identifiers" as part of the metadata for a dataset. What exactly do you propose?
Just a note to clarify: my intention is not to exclude any ID, but to be able to differentiate between native and external IDs in a machine-readable way.
My point about the open-data.europa.eu doc is that, as it can be seen at the metadata table, those are already being differentiated:
- Different properties in the case of CKAN and ADMS (identifier vs. id)
- Linked Data implementation in the EC Catalogue (identifier vs. rdf:about)
The first is a more technology-agnostic solution, and that's also what I would like to see here to make the DCAT-AP compatible also for those catalogs that will not implement a full linked data solution. And even when you have implemented a linked data solution you may have opted for a different native ID other than the linked data rdf:about one (I have already provided some examples above). We may think that those IDs are usually used also by other components of the Catalog system, such as CMSs, that are not linked-data aware and don't use URI IDs.
Even in the case of the EC Catalogue, if you have a look at any Dataset form such as http://open-data.europa.eu/en/data/dataset/cO7r1ma3AEH2Zkq78aJjmQ the identifier (comp_sah_02) is not a URI. I can't see how it is implemented in RDF because no content negotiation is currently in use and no documentation is provided for the SPARQL endpoint, so I have no idea what the representation URIs are for the catalog and its metadata.
So are you proposing that we create a new property, either:
or
Or something else?
Yes, that is. More specifically:
- keep dct:identifier for the native identifier (URI or otherwise)
- if we want to support other identifiers (URI or otherwise) use somenamespace:otherIdentifier or similar
Carlos, there was a similar discussion in the GLD group (see: http://lists.w3.org/Archives/Public/public-gld-wg/2013Mar/0114.html). It's not exactly the same as your issue here, but it involved a proposal to distinguish between identifiers. The GLD could not reach consensus on this, so they decided to do nothing. Of course, as GLD has a strict Linked Data perspective, they would see the native URI as being used in the left-hand side of the triple and not as a dct:identifier.
We'll discuss your proposal this afternoon on the call. In the meantime, what would be your proposal for the additional identifier property, ideally in an existing, well-known and well-maintained namespace?
I will add property adms:identifier to Dataset.
Usage note for dct:identifier to read:
This property contains the main identifier for the dataset, e.g. the URI or other unique identifier in the context of the Catalog.
Usage note for adms:identifier to read:
This property refers to a secondary identifier of the Dataset, e.g. MAST/ADS, DOI, EZID or W3ID.
[with references
Mikulski Archive for Space Telescopes (MAST). Referencing Data Sets in Astronomical Literature. http://archive.stsci.edu/pub_dsn.html
DOI. Digital Object Identifier. http://www.doi.org/
EZID. http://n2t.net/ezid
W3C Permanent Identifier Community. Permanent Identifiers for the Web. https://w3id.org/]
We have discussed the matter of identifiers extensively at our CERIF Task Group meetings (CERIF = Common European Research Information Format, backed by euroCRIS, www.eurocris.org). Our experience can be summarized in the following points, which I hope can be inspiring for DCAT application profile as well:
In the light of this, I'd suggest introducing an Identifier class that would record both the identifier value and a reference to the identifier context. Datasets would be allowed to have 0..x such identifiers, and the dct:identifier would be cast into this form as well.
Jan, we have discussed this in the last call and the WG decided to have two identifier properties:
dct:identifier (range rdfs:Literal) that contains the main identifier for the dataset, e.g. the URI or other unique identifier in the context of the Catalog.
adms:identifier (range adms:Identifier) that refers to a secondary identifier of the Dataset.
The class adms:Identifier is modelled on UN/CEFACT and allows you to specify the context of the identifier as you suggest.
For the identifier in dct:identifier you don't need to specify that context because its context is predefined, namely the Catalog the description of the Dataset is in at that particular moment. When the context changes (e.g. the description of the Dataset is moved to a different Catalog or into a federated Catalog) that identifier will be replaced by the main identifier in the new Catalog.
These properties will be described in the new version that is being prepared for public review.
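As a rough sketch of how the two properties might be held in application code (plain Python; `Identifier`, `DatasetRecord`, `move_to_catalog` and the field names are hypothetical and do not reproduce the actual adms:Identifier definition, and all values are invented):

```python
from dataclasses import dataclass, field, replace
from typing import List


@dataclass(frozen=True)
class Identifier:
    """Stand-in for adms:Identifier: a value plus the context it belongs to."""
    notation: str  # the identifier string itself
    scheme: str    # hypothetical label for the identifier's context


@dataclass(frozen=True)
class DatasetRecord:
    # dct:identifier: a literal whose context is the Catalog holding the record.
    main_identifier: str
    # adms:identifier: secondary identifiers, each carrying its own context.
    other_identifiers: List[Identifier] = field(default_factory=list)


def move_to_catalog(record: DatasetRecord, new_main_identifier: str) -> DatasetRecord:
    # Only the main identifier is replaced; secondary identifiers travel along.
    return replace(record, main_identifier=new_main_identifier)


record = DatasetRecord(
    main_identifier="ds-42",
    other_identifiers=[Identifier("10.9999/example", "DOI")],
)
moved = move_to_catalog(record, "fed-0001")
print(moved.main_identifier, [i.notation for i in moved.other_identifiers])
```

The point of the split is that a consumer no longer has to guess: the main identifier is always the single literal, and everything else states its own context.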
"license -> should it be 0..1? (and thus optional) This is without doubt a must have, but unfortunatly my experience is that license information is frequently not easy to get or even impossible sometimes because even the data managers don't know about it."
+1 Although license information is very important, in practice it is rarely available.
I would suggest having license as a recommended property.
+1 from me.
Consideration
For the practical reason mentioned, it does make sense to relax the obligation for licence.
Proposed resolution
Make dct:license recommended for Distribution
Proposed action
Update specification, moving dct:license from section 7.4.1 to 7.4.2 and changing cardinality to 0..n.
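One way a harvester might treat the relaxed obligation, sketched in plain Python under stated assumptions: `check_distribution` is a hypothetical helper, and listing dcat:accessURL as the mandatory property is an assumption made only for illustration. A missing recommended property produces a warning rather than a rejection, so datasets with unknown licences can still be published while the gap stays visible:

```python
MANDATORY = {"dcat:accessURL"}  # assumption, for illustration only
RECOMMENDED = {"dct:license"}   # following the proposed resolution


def check_distribution(description):
    """Return (errors, warnings) for a dict of property name -> list of values."""
    # A mandatory property with no values blocks publication.
    errors = sorted(p for p in MANDATORY if not description.get(p))
    # A recommended property with no values is only reported.
    warnings = sorted(p for p in RECOMMENDED if not description.get(p))
    return errors, warnings


errors, warnings = check_distribution(
    {"dcat:accessURL": ["http://example.org/data.csv"]}
)
print(errors, warnings)
```

With cardinality 0..n the same check accepts zero, one or several licence values without any further change.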