Section 9 includes a table with several controlled vocabularies that MUST be used for the listed properties, but most of them are not dereferenceable (some don't have public URIs and others always return 404). This issue has also been raised in relation to EUROVOC https://joinup.ec.europa.eu/asset/dcat_application_profile/issue/eurovo…, but IMO it should be extended to every required vocabulary.
We may need to establish some minimal requirements for a controlled vocabulary to be proposed as a MUST use, e.g.:
- Openly licensed.
- Dereferenceable.
- Operated/maintained by official EU bodies or institutions, or otherwise by a recognised standardisation body, etc. (see also https://joinup.ec.europa.eu/asset/dcat_application_profile/issue/regist…).
- ... others?
Once we have the minimal requirements we may need to revisit this list and take the proper actions to replace, downgrade (not required) or try to improve several of the vocabularies.
(See also https://joinup.ec.europa.eu/discussion/controlled-vocabulary-requirement-version)
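The "dereferenceable" requirement could be checked mechanically for each vocabulary URI. Below is a minimal Python sketch, not an agreed test procedure: the function names, the verdict labels and the Accept header are assumptions made for illustration, and it only looks at the final HTTP status, not at whether useful labels or descriptions come back.

```python
import urllib.error
import urllib.request


def classify_status(status):
    """Map an HTTP status code (or None for no response) to a verdict."""
    if status is None:
        return "does not resolve"
    if 200 <= status < 300:
        return "dereferenceable"
    if status == 404:
        return "not found"
    return "other problem"


def fetch_status(uri, timeout=10):
    """Return the final HTTP status for uri, or None if no response arrives."""
    request = urllib.request.Request(
        uri,
        # Ask for machine-readable RDF first, HTML as a fallback (assumed choice).
        headers={"Accept": "text/turtle, application/rdf+xml, text/html"},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        return err.code
    except (OSError, ValueError):
        # DNS failure, refused connection, timeout or malformed URI.
        return None


def check_vocabularies(uris, fetch=fetch_status):
    """Return a {uri: verdict} report; fetch can be swapped out for testing."""
    return {uri: classify_status(fetch(uri)) for uri in uris}
```

Running check_vocabularies over the URIs in the table would give a quick report of which ones resolve, which return 404 and which do not answer at all.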
Comments
I agree with the criteria listed above (see also the 45+ CAMSS assessment criteria). Of course, the controlled vocabulary must in the first place suit our basic use case: to allow for a cross-portal search for governmental data sets. I think this also means that controlled vocabularies must meet the following basic requirements for search taxonomies:
I hope that we will be able to reuse some of the controlled vocabularies that are currently used on open data portals. See also this ticket:
http://joinup.ec.europa.eu/discussion/which-vocabularies-are-used-your-data-portal
Version should be mandatory.
Could we clarify what "openly licensed" means? SKOS-HASSET is, I think, an important one to include, but it does require a licence, which can be obtained upon request (http://www.data-archive.ac.uk/find/hasset-thesaurus/hasset-licence). Here is the rationale behind the decision to license it: http://hassetukda.wordpress.com/2012/11/28/licensing-for-skos-hasset-wp…
See my comment at http://joinup.ec.europa.eu/discussion/controlled-vocabulary-requirement-version#comment-14143
Good question: what does openly licensed mean?
From a practical perspective the following three conditions should at least hold:
- anyone can search the vocabulary to find terms that they want to use in their instance metadata without the need to sign a licence agreement;
--> if it is difficult for people to find out what the terms are, then people will not use the vocabulary
- anyone can include the URIs of the terms in the vocabulary in instance metadata without restrictions and without the need to sign a licence agreement;
--> if there are restrictions for use of the URI, you can't use the terms
- anyone who receives such data should be able to follow the link to the term and get information about what the term means (including at least all labels and descriptions) again without signing licence agreements.
--> if you can't look up what a URI means, such a URI is useless.
The W3C Best Practices for Vocab Selection gives some useful selection criteria: http://www.w3.org/2011/gld/wiki/222_Best_Practices_for_Vocab_Selection. If I look at the controlled vocabularies we maintain in the Publications Office MDR (http://publications.europa.eu/mdr/authority/index.html), I think we meet the majority of the criteria, although there is still a lot of work to do:
For me the most important criteria would be the persistence of the URI, the stability of the organisation behind the vocabulary, and, given our European context, the multilingual aspect.
Here is a first attempt at summarising the requirements.
I think that the most complete, mature and updated reference for this so far is again the Best Practices for Publishing Linked Data Document, which is in fact an evolution of the Best Practices for Vocab Selection document Willem has already pointed to.
If we have a look at the "Is your linked data vocabulary 5 star?" and the "Vocabulary selection criteria" sections, I would try to align as much as possible with them. More specifically, I think that the documentation requirement should also be included as a minimum one.
If we think in terms of the final user of the vocabulary, you can do little or nothing with a vocabulary that is not documented, and you are also prone to misuse it, with the corresponding consequences in terms of interoperability (note also that this is already the only "must" requirement in the BP document and also a 2-star level one).
With regard to the publisher requirement, I don't think it must be required to be an institution of the European Union, but just any trusted group or organisation.
My second attempt:
I agree with the requirements, but the title of the section listing the vocabularies is somewhat misleading: 8.2. Proposed vocabularies => ... that MUST be used ...
So they are actually "Mandatory vocabularies" (MUST) ? (like "Mandatory properties" and "Mandatory classes" in the chapters 7.x)
Or only "Recommended" (= +/- Proposed) ?
I propose to change the section heading to "Controlled vocabularies to be used"
Thanks Carlos for the link to the Best Practices for Publishing Linked Data Document. The selection criteria mentioned are indeed a good reference.
I agree that documenting a vocabulary is essential, but when I look at the properties for which controlled vocabularies are suggested in section 8, I do not think there is much room for misunderstanding as to what a particular concept stands for (e.g. language, country, ...), so dereferencing is IMHO less critical than with other vocabularies.
I have some questions/remarks concerning the suggested use of controlled vocabularies in section 8:
1. dct:format/dcat:mediaType. I'm not sure I have understood the intended difference between file format and media type. An example under 7.3.2 would be useful.
2. dct:publisher. The suggested MDR Corporate bodies table covers only European institutions/bodies and some international organisations. A possible extension, and the scope of this extension, should be discussed.
3. By adding a DCAT profile "context" to the Named Authority Lists in the MDR, we could filter only those concepts that are relevant for the DCAT profile and thus get manageable controlled lists.
My view on your points:
1. In DCAT, there is a difference between the use of dct:format and dcat:mediaType. The latter "should be used when the media type of the distribution is defined in IANA, otherwise dct:format may be used with different values.". As we are not using the IANA media types but the MDR NAL, which "is based on the IANA MIME Media type", maybe we should not be using dcat:mediaType?
2. Chapter 8 has a remark to reflect the fact that the MDR NAL only applies to European institutions. Is it realistic to assume that you can extend the list to include all organisations that are involved in publishing datasets?
3. We could indeed try to profile the controlled vocabularies to the concepts that are relevant for the application profile, but how would we go about identifying the subset?
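To illustrate point 1, here is a minimal Python sketch of the DCAT rule quoted there: use dcat:mediaType when the type is registered with IANA, otherwise fall back to dct:format. The function name, the small hard-coded set of IANA types and the IANA base URI are assumptions for illustration only; a real check would consult the IANA registry itself, and the profile may settle on different URIs.

```python
# Assumed base URI for IANA media types (illustrative, not taken from the profile).
IANA_BASE = "https://www.iana.org/assignments/media-types/"

# Illustrative subset only; the authoritative list is the IANA registry.
KNOWN_IANA_TYPES = {"text/csv", "application/pdf", "application/rdf+xml"}


def format_statement(value, iana_types=KNOWN_IANA_TYPES):
    """Return a (property, object) pair describing a distribution's format."""
    if value in iana_types:
        # Registered with IANA: describe it with dcat:mediaType.
        return ("dcat:mediaType", IANA_BASE + value)
    # Not registered with IANA: fall back to dct:format with the given value.
    return ("dct:format", value)
```

With this sketch, format_statement("text/csv") yields a dcat:mediaType statement, while a value that is not in the IANA set, such as "Formex", yields a dct:format statement.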
Here are some more ideas:
We already have a good handful of requirements, but none of them is mandatory so far (should vs. must). We may also need to focus on defining a minimal set of required ones.
That is IMO especially important for the vocabularies required by the AP (table at 8.2; there is not much difference between "must be used" and the newly proposed "to be used"). Several of them don't fulfil the minimal desirable requirements (URIs that don't dereference, for both human and machine versions, or don't even resolve; lack of documentation; etc.). I don't know whether all that is planned to change in the short term but, honestly, as of today that doesn't look like a portfolio of recommended best practices.
Carlos, the "to be used" was the proposed alternative for the section heading "proposed vocabularies". It was noted by Bart that the section heading did not match the MUST that the section contains.
As to the requirements, I agree that we may not be able to find vocabularies that satisfy all requirements. We will need to determine what in the practical circumstances is the best we can do.
Makx, thanks for the clarifications ...
1. When we created the File types table, we were not aware of the existence of stable URIs for IANA media types, and we have some file types that do not exist in IANA (Formex). So this table started as a table for internal purposes, but as mentioned in https://joinup.ec.europa.eu/discussion/registers-operated-ops-mdr-and-dcat-ap, we can add concepts upon request.
2. I must have worked on a previous version. I had not seen the remark, so forget about my question.
3. We have an attribute "use.context" where we indicate in which context a particular concept is used. We could add a value for the DCAT application profile (e.g. "DCAT_AP"). This would allow selecting only the relevant concepts. To be discussed ...
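As a rough idea of how the "use.context" selection in point 3 could work, here is a minimal Python sketch. The data layout is hypothetical: each concept is a plain dict with an "id" and a "use_context" list, which is not how the MDR actually stores its tables, and the function name is invented for illustration.

```python
def concepts_for_context(concepts, context="DCAT_AP"):
    """Keep only the concepts whose use.context values include `context`."""
    return [c for c in concepts if context in c.get("use_context", ())]


# Hypothetical sample data; identifiers and context values are made up.
sample = [
    {"id": "CSV", "use_context": ["DCAT_AP", "INTERNAL"]},
    {"id": "FMX4", "use_context": ["INTERNAL"]},
    {"id": "PDF", "use_context": ["DCAT_AP"]},
    {"id": "XYZ"},  # no use.context recorded at all
]

selected_ids = [c["id"] for c in concepts_for_context(sample)]
```

Concepts without any use.context value are left out here, which is a design choice that would need to be confirmed.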