Eurovoc has been mentioned as a candidate controlled multilingual vocabulary for classifying datasets. It may be a good idea to research alternatives before deciding on Eurovoc. Some of the drawbacks of Eurovoc are:
1. Complexity: the person who enters the topic for a dataset has a lot of topics to choose from. They may opt for high-level topics, which will reduce precision in the aggregated catalog. Also, a software solution for a smaller catalog needs to import Eurovoc and provide some sort of browsing interface.
2. Mapping between topics: the SKOS version of Eurovoc seems to be designed with one URI per concept per language instead of concepts with multiple labels. This requires multiple operations to cross-reference topics.
Any reference implementations should study how these issues affect usability.
Comments
The Spanish Open Data initiatives use a 22-theme taxonomy to classify datasets. This list of subjects was created to simplify the existing ones, making classification easier for publishers. The terms were selected after analysing the document "A Proposal for a Common Taxonomy of E-Services and Procedures Under Law 11/2007" and comparing the subject lists on websites such as 060 (Spanish public portal), EUGO, INE (Spanish Institute of Statistics), EUROSTAT, WORLD BANK and OECD.
The concept scheme is defined using SKOS. See the taxonomy in this post.
EUROVOC, although perhaps too oriented to the activity of the European Parliament, could be a very good consensus option. There may be others, such as those from Agrovoc, Eurostat, OECD or others, but IMO all of them are too biased or have issues similar to those of Eurovoc. In any case, EUROVOC may first need some further improvements to be the optimal choice.
I don't think complexity will be a big issue, as it provides a really good hierarchy system, so you can decide how deep you want to go (and if you finally decide to stop at the first level, you don't need to work with the full thesaurus). The current licensing model and the URIs not being dereferenceable, as you pointed out before, are bigger issues from my perspective, as well as the not optimized model (one URI per concept and language). Maybe this will be a good opportunity to provide a rationale and encourage its improvement.
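To illustrate the point about choosing how deep to go, here is a minimal Python sketch that rolls a detailed concept up to a chosen level by following broader links. The identifiers and the BROADER table are hypothetical, not real EUROVOC data, and the sketch assumes each concept has at most one broader concept (SKOS itself allows several).

```python
# Hypothetical identifiers; a real thesaurus would supply skos:broader links.
BROADER = {
    "concept:municipal-waste": "concept:waste-management",
    "concept:waste-management": "domain:environment",
    "concept:air-quality": "domain:environment",
}


def ancestors(concept, broader=BROADER):
    """Return the chain from `concept` up to its top-level ancestor."""
    chain = [concept]
    while chain[-1] in broader:
        parent = broader[chain[-1]]
        if parent in chain:  # guard against cyclic data
            raise ValueError(f"cycle detected at {parent}")
        chain.append(parent)
    return chain


def roll_up(concept, levels_below_top=0, broader=BROADER):
    """Return the ancestor that sits `levels_below_top` steps below the top."""
    chain = ancestors(concept, broader)
    # Clamp at 0 so a shallow concept is returned unchanged.
    index = max(len(chain) - 1 - levels_below_top, 0)
    return chain[index]
```

A catalog that only wants first-level topics would call roll_up(concept) and never needs to expose the deeper levels to the person entering metadata.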
It may also be hard to strike a balance between something compatible Europe-wide and something more nationally specific and complete that addresses local peculiarities, but this is something we could manage with proper national, regional or local EUROVOC mappings and extensions.
Of course, Eurovoc is also quite "European-centric", and other issues will arise once we want to be compatible beyond European borders in the future, but I don't think that is something to take into consideration right now.
@Carlos: You say "as the not optimized model (one URI per concept and language)."
What do you mean by this? As you can see from the snippet below, EuroVoc has one URI per eu:MicroThesaurus (skos:ConceptScheme) and one URI per eu:ThesaurusConcept (skos:Concept). Each has multiple prefLabels covering all 22 languages that EuroVoc supports. There seems to be nothing wrong with this.
@Martín: I must admit that the Spanish 22-theme taxonomy has much appeal too, because of its simplicity. (see also: requirements for controlled vocabularies: https://joinup.ec.europa.eu/discussion/requirements-controlled-vocabularies ).
Would there be a way to combine the best of both worlds? That is, use a taxonomy like the Spanish one to enable easy-to-use dataset search (e.g. faceted search), and use the EuroVoc taxonomy because it provides an excellent controlled vocabulary for multilingual search?
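One possible way to combine the two, sketched in Python under stated assumptions: datasets are tagged with fine-grained concept URIs, and a small mapping table derives a coarse theme for faceted search. The concept URIs, theme names and datasets below are all hypothetical placeholders, not entries from EuroVoc or from the Spanish taxonomy.

```python
from collections import Counter

# Hypothetical mapping from fine-grained concept URIs to a coarse theme.
CONCEPT_TO_THEME = {
    "http://example.org/concept/1": "Environment",
    "http://example.org/concept/2": "Environment",
    "http://example.org/concept/3": "Health",
}

# Hypothetical catalog entries, each tagged with concept URIs only.
DATASETS = [
    {"title": "Dataset A", "concepts": ["http://example.org/concept/1"]},
    {"title": "Dataset B", "concepts": ["http://example.org/concept/2",
                                        "http://example.org/concept/3"]},
    {"title": "Dataset C", "concepts": ["http://example.org/concept/9"]},
]


def themes_for(dataset, mapping=CONCEPT_TO_THEME):
    """Derive the set of coarse themes from a dataset's concept URIs."""
    return {mapping[c] for c in dataset["concepts"] if c in mapping}


def facet_counts(datasets, mapping=CONCEPT_TO_THEME):
    """Count datasets per coarse theme, as a faceted search UI would."""
    counts = Counter()
    for dataset in datasets:
        # Datasets whose concepts have no mapping are counted as "Unmapped".
        counts.update(themes_for(dataset, mapping) or {"Unmapped"})
    return counts


facets = facet_counts(DATASETS)
```

With this split, publishers only maintain the fine-grained tags, and the simple theme list is derived from them, so the two vocabularies cannot drift apart for a given dataset.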
@Stijn Unfortunately I can't see any URI in the snippet. What I mean is: what exactly are the URIs used to dereference the different eu:ThesaurusConcept instances? I can't find any documentation about the URI scheme being used for EUROVOC (if any), so I was (probably wrongly) assuming that it may use a scheme similar to the one widely used on the EC portals (whatever.europa.eu/foo/?q=es for example), which is clearly language dependent.
Is there already a working URI scheme in place for the EUROVOC vocabulary? If so, is it documented anywhere?
@Carlos: You are right, I should have added one more triple to the snippet to make it crystal clear. The above-mentioned URI http://eurovoc.europa.eu/2778 is for an eu:ThesaurusConcept (a subclass of skos:Concept). As you can see above, one concept has prefLabels for 22 languages. I hope that this demonstrates that EuroVoc does not have a language-dependent URI schema.
<eu:ThesaurusConcept rdf:about="http://eurovoc.europa.eu/2778"/>
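To make the "one URI, many language-tagged labels" structure concrete, here is a small Python sketch that reads such a concept with the standard library. The RDF/XML is illustrative only: the labels are placeholders rather than real EuroVoc labels, and the eu: namespace URI is an assumption made for the example.

```python
import xml.etree.ElementTree as ET

# Illustrative RDF/XML only: the labels are placeholders, not real EuroVoc
# data, and the eu: namespace URI is an assumption made for this sketch.
RDF_XML = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#"
         xmlns:eu="http://example.org/eurovoc-schema#">
  <eu:ThesaurusConcept rdf:about="http://eurovoc.europa.eu/2778">
    <skos:prefLabel xml:lang="en">placeholder label (en)</skos:prefLabel>
    <skos:prefLabel xml:lang="es">placeholder label (es)</skos:prefLabel>
    <skos:prefLabel xml:lang="sv">placeholder label (sv)</skos:prefLabel>
  </eu:ThesaurusConcept>
</rdf:RDF>
"""

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS_NS = "http://www.w3.org/2004/02/skos/core#"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def labels_by_concept(rdf_xml):
    """Return {concept URI: {language tag: preferred label}}."""
    result = {}
    root = ET.fromstring(rdf_xml)
    for node in root:
        uri = node.get(f"{{{RDF_NS}}}about")
        if uri is None:
            continue  # skip nodes that are not identified by rdf:about
        labels = result.setdefault(uri, {})
        for label in node.findall(f"{{{SKOS_NS}}}prefLabel"):
            labels[label.get(f"{{{XML_NS}}}lang")] = label.text
    return result


concepts = labels_by_concept(RDF_XML)
```

Because all labels hang off the same URI, finding the label in another language is a single dictionary lookup rather than a cross-reference between language-specific URIs.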
@Stijn I'm afraid that we then can't conclude anything, as the given URI doesn't dereference correctly. See the results from the test for details.
It does indeed look like the SKOS concept scheme has been well designed with respect to multilingualism, but (please correct me if I am wrong on this) it also looks like no URI scheme has been implemented on top of it yet, so it is not currently possible to dereference any given RDF concept.
In any case, if the plan is to implement the aforementioned URI scheme on top of it at some point in the future, it will clearly be language independent once implemented.
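For anyone who wants to repeat a dereferencing check, a rough Python sketch follows. It asks for an RDF representation through HTTP content negotiation; the Accept value and the helper names are my own choices for illustration, not something EUROVOC documents, and the outcome depends entirely on how the server is configured at the time.

```python
import urllib.request

# Media types requested; this preference list is an assumption for the sketch.
RDF_ACCEPT = "application/rdf+xml, text/turtle;q=0.9"


def build_dereference_request(uri):
    """Build a GET request that asks for an RDF representation of `uri`."""
    return urllib.request.Request(uri, headers={"Accept": RDF_ACCEPT})


def dereference(uri, timeout=10):
    """Fetch `uri` and return (HTTP status, Content-Type header value).

    urlopen follows redirects and raises urllib.error.HTTPError for
    4xx/5xx responses, so an exception here means the URI did not
    dereference successfully.
    """
    request = build_dereference_request(uri)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.headers.get("Content-Type")


# Building the request does not touch the network; dereference() does.
request = build_dereference_request("http://eurovoc.europa.eu/2778")
```

A URI would count as dereferenceable in the linked data sense if dereference() returns a success status together with an RDF content type.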
Dear all
With regard to the "dereferenceability" of EuroVoc URIs, I can confirm that they will become "dereferenceable". The Publications Office will provide you with an implementation calendar as soon as possible. This should make the choice to use EuroVoc easier. Please let us know your opinion; of course, we remain open to any suggestions you may have.
I agree, Eurovoc can be one of the best choices (the common choice), to be used together with other specialized as well as local controlled vocabularies.
As for opportunities, Eurovoc is a multilingual tool, currently revised and updated by an official organization. As for its limits, it’s true… it is “European centric” and too oriented to the European Parliament activity. And it is complex too but, in my opinion, this is not necessarily a limit. Complexity can sometimes promote accuracy and findability; the problem could be the presence or lack of the skills necessary to use it. Controlled vocabularies (thesauri, classifications and content descriptors) should be used by information and/or domain specialists. Too often it is believed that, once a tool for information findability has been developed or found, the important work is done. Unfortunately, if there is no staff prepared and capable of producing a coherent and consistent system, the result is confusion in the responses of the system. So among the requirements for using Eurovoc or other “complex” vocabularies, I would say that it must be used by specialists.
No further action.
We may consider using COFOG or forge our own multilingual controlled vocabulary...
Further issue raised in:
http://joinup.ec.europa.eu/release/dcat-application-profile-data-portals-europe-final-draft#comment-14632
We are hesitant about using EuroVoc to describe the themes/categories. We tested EuroVoc on a set of Swedish municipality datasets and found that it may be insufficient to enable cross-border discovery of municipality datasets. EuroVoc provides much detail for different political activities, the activities of the European institutions and areas of concern for the EU budget, while much of the work performed by the public sector in member states ends up in the broad domain “28 Social Questions”. Also, law enforcement, a public task that has attracted significant interest in open data work, gets hidden in the far from logical domain “04 Politics”, while statistics regarding specific crimes end up under the domain “12 Law”.
An alternative proposal: COFOG or NACE
A better choice could be NACE or COFOG. NACE’s weakness is that much of the political activity and policy work ends up in the broad domain “84 Public administration and defence; compulsory social security”. On the other hand, NACE makes it possible to distinguish between policy work and administration of services on the one hand and the production of services on the other. NACE also makes it possible to increase the granularity of the theme/category from two to three or four digits in a logical order, while EuroVoc is somewhat more difficult to use. NACE, or national variations of it, is also widely used to classify and describe data and economic activities, and can therefore be assumed to be easier for public administrators to adopt than EuroVoc.
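The two-to-four-digit granularity can be sketched in a few lines of Python. This assumes NACE-style numeric codes written as two digits, optionally followed by a dot and one or two more digits (for example 84.11, used here only to show the format); the function name is hypothetical and section letters are not handled.

```python
def nace_at_level(code, digits):
    """Truncate a NACE-style numeric code to 2, 3 or 4 digits."""
    if digits not in (2, 3, 4):
        raise ValueError("digits must be 2, 3 or 4")
    raw = code.replace(".", "")
    if not raw.isdigit() or not 2 <= len(raw) <= 4:
        raise ValueError(f"not a NACE-style code: {code!r}")
    if len(raw) < digits:
        raise ValueError(f"{code!r} has fewer than {digits} digits")
    cut = raw[:digits]
    # Two-digit codes have no dot; longer ones are written as NN.N or NN.NN.
    return cut if digits == 2 else f"{cut[:2]}.{cut[2:]}"


# The same detailed code viewed at each level of granularity.
levels = [nace_at_level("84.11", d) for d in (2, 3, 4)]
```

A portal could store the most detailed code a publisher provides and aggregate at two digits for cross-border browsing, since the coarser code is always a prefix of the finer one.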
COFOG has a similar weakness to NACE, namely that much of the political activity and policy work ends up in the broad domain “General public services”. On the other hand, datasets that end up in “28 Social Questions” under EuroVoc would be divided into more detailed domains such as “10 Social protection”, “06 Housing and community amenities”, “07 Health” and “08 Recreation, culture and religion”. Also, datasets described as “Science” in EuroVoc can get a more detailed description in COFOG by using three-figure codes, since there is an R&D group in each domain.
Further thought needs to be put into the choice of controlled vocabulary for theme/category. Both NACE and COFOG would be more useful than EuroVoc. The United Nations Statistics Division and Eurostat also provide correlation tables that make it possible to map NACE rev.2 to COFOG and ISIC rev.4. Such mappings would open up the possibility of linking datasets that use any of the three classifications, whereas EuroVoc lacks similar correlation tables.
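As a rough illustration of how a correlation table could link datasets across classifications, here is a Python sketch. The rows are placeholders and are NOT taken from any official correspondence table; the sketch only assumes that such tables are lists of code pairs and that the relation can be many-to-many.

```python
from collections import defaultdict

# Placeholder rows (source code, target code); not official correspondences.
ROWS = [
    ("nace:AA", "cofog:X1"),
    ("nace:AA", "cofog:X2"),
    ("nace:BB", "cofog:X2"),
]


def build_crosswalk(rows):
    """Index correspondence rows in both directions (many-to-many)."""
    forward, backward = defaultdict(set), defaultdict(set)
    for source, target in rows:
        forward[source].add(target)
        backward[target].add(source)
    return forward, backward


def related_codes(code, forward, backward):
    """Return source-side codes that share at least one target with `code`."""
    related = set()
    for target in forward.get(code, ()):
        related |= backward[target]
    related.discard(code)
    return related


forward, backward = build_crosswalk(ROWS)
```

Loading the published tables into a structure like this would let a portal find datasets classified under one scheme when the query uses another.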
Discussion in the meeting of 18 July 2013: