Investigate alternatives to Eurovoc

Anonymous (not verified)

Published on: 04/04/2013 Discussion Archived

Eurovoc has been mentioned as a candidate controlled multilingual vocabulary for classifying datasets. It may be a good idea to research alternatives before deciding on Eurovoc. Some of the drawbacks of Eurovoc are:

1. Complexity: the person who has to enter the topic for a dataset has a lot of topics to choose from. They may opt for high-level topics, which will reduce precision in the aggregated catalog. Also, a software solution for a smaller catalog needs to import Eurovoc and provide some sort of browsing interface.

2. Mapping between topics: the SKOS version of Eurovoc seems to be designed with one URI per concept per language instead of concepts with multiple labels. This requires multiple operations to cross-reference topics.

Any reference implementations should study how these issues affect usability.

Component

Miscellaneous

Comments

Martin ALVAREZ-ESPINAR Fri, 05/04/2013 - 14:30

The Spanish Open Data initiatives use a 22-theme taxonomy to classify datasets. This list of subjects was created to simplify the existing ones, making classification easier for publishers. The list of terms was selected after analysing the document "A Proposal for a Common Taxonomy of E-Services and Procedures Under Law 11/2007" and comparing the subject lists of websites such as 060 (Spanish public portal), EUGO, INE (Spanish Institute of Statistics), EUROSTAT, WORLD BANK and OECD.

The concept scheme is defined using SKOS. See the taxonomy in this post.

Anonymous (not verified) Tue, 16/04/2013 - 12:54

EUROVOC, although maybe too oriented to the European Parliament activity, could be a very good consensus option. There may be others, such as those from Agrovoc, Eurostat, OECD or others, but IMO all of them are too biased or have other issues similar to those of Eurovoc. In any case, EUROVOC may first need some further improvements to be the optimal choice.

I don't think complexity will be a big issue, as it provides a really good hierarchy system, so you can decide how deep you want to go (and if you finally decide to stop at the first level, you don't need to work with the full thesaurus). The current licensing model and not being dereferenceable, as you pointed out before, are bigger issues from my perspective, as well as the not optimized model (one URI per concept and language). Maybe this will be a good opportunity to provide rationale and encourage its improvement.

It may also be hard to balance between something more European-wide compatible and something more national specific and complete to address local peculiarities, but this is something we could manage just with proper national, regional or local EUROVOC mappings and extensions.

Of course, Eurovoc is also quite "european-centric", and other issues will arise once we want to be compatible beyond the European borders in the future, but I don't think that should be something to take into consideration right now.

stijngoedertier (not verified) Thu, 18/04/2013 - 17:00

@Carlos: You say "as the not optimized model (one URI per concept and language)."

What do you mean by this? As you can see from the snippet below, EuroVoc has one URI per eu:MicroThesaurus (skos:ConceptScheme) and one URI per eu:ThesaurusConcept (skos:Concept). Each has multiple prefLabels for all 22 languages that EuroVoc supports. There seems to be nothing wrong with this.

<eu:MicroThesaurus rdf:about="http://eurovoc.europa.eu/100240">
  <s04:prefLabel xml:lang="pt">4821 transporte marítimo e fluvial</s04:prefLabel>
  <s04:prefLabel xml:lang="pl">4821 transport morski i śródlądowy</s04:prefLabel>
  <s04:prefLabel xml:lang="sk">4821 námorná a vnútrozemská riečna doprava</s04:prefLabel>
  <s04:prefLabel xml:lang="sl">4821 pomorska in notranja plovba</s04:prefLabel>
  <s04:prefLabel xml:lang="fi">4821 meri- ja jokiliikenne</s04:prefLabel>
  <s04:prefLabel xml:lang="sv">4821 sjötransport och transport på inre vattenväg</s04:prefLabel>
  <s04:prefLabel xml:lang="hr">4821 pomorski prijevoz i prijevoz unutrašnjim vodama</s04:prefLabel>
  <s04:prefLabel xml:lang="sr">4821 Поморски превоз и превоз унутрашњим пловним путевима</s04:prefLabel>
  <s04:prefLabel xml:lang="bg">4821 морски и речен воден транспорт</s04:prefLabel>
  <s04:prefLabel xml:lang="es">4821 transporte marítimo y fluvial</s04:prefLabel>
  <s04:prefLabel xml:lang="cs">4821 námořní a říční doprava</s04:prefLabel>
  <s04:prefLabel xml:lang="da">4821 sø- og flodtransport</s04:prefLabel>
  <s04:prefLabel xml:lang="de">4821 See- und Binnenschiffsverkehr</s04:prefLabel>
  <s04:prefLabel xml:lang="et">4821 mere- ja siseveetransport</s04:prefLabel>
  <s04:prefLabel xml:lang="el">4821 θαλάσσιες και ποτάμιες μεταφορές</s04:prefLabel>
  <s04:prefLabel xml:lang="en">4821 maritime and inland waterway transport</s04:prefLabel>
  <s04:prefLabel xml:lang="fr">4821 transports maritime et fluvial</s04:prefLabel>
  <s04:prefLabel xml:lang="ga">4821 transports maritime et fluvial</s04:prefLabel>
  <s04:prefLabel xml:lang="it">4821 trasporti marittimi e fluviali</s04:prefLabel>
  <s04:prefLabel xml:lang="lv">4821 jūras un iekšzemes ūdensceļu transports</s04:prefLabel>
  <s04:prefLabel xml:lang="lt">4821 jūrų ir vidaus vandens kelių transportas</s04:prefLabel>
  <s04:prefLabel xml:lang="hu">4821 tengeri és belvízi közlekedés</s04:prefLabel>
  <s04:prefLabel xml:lang="mt">4821 maritime and inland waterway transport</s04:prefLabel>
  <s04:prefLabel xml:lang="nl">4821 vervoer over zee en over binnenwateren</s04:prefLabel>
  <s04:prefLabel xml:lang="ro">4821 transport maritim şi fluvial</s04:prefLabel>
</eu:MicroThesaurus>
<rdf:Description rdf:about="http://eurovoc.europa.eu/2778">
  <xl:altLabel rdf:resource="http://eurovoc.europa.eu/317031"/>
  <s04:prefLabel xml:lang="pt">produto regional bruto</s04:prefLabel>
  <s04:prefLabel xml:lang="lv">reģiona kopprodukts</s04:prefLabel>
  <s04:prefLabel xml:lang="en">gross regional product</s04:prefLabel>
  <s04:prefLabel xml:lang="da">bruttoregionalindkomst</s04:prefLabel>
  <s04:prefLabel xml:lang="fi">alueellinen bruttokansantuote</s04:prefLabel>
  <s04:prefLabel xml:lang="de">Bruttoregionalprodukt</s04:prefLabel>
  <s04:prefLabel xml:lang="ro">produs regional brut</s04:prefLabel>
  <s04:prefLabel xml:lang="sl">bruto regionalni proizvod</s04:prefLabel>
  <s04:prefLabel xml:lang="hr">bruto regionalni proizvod</s04:prefLabel>
  <s04:prefLabel xml:lang="it">prodotto regionale lordo</s04:prefLabel>
  <s04:prefLabel xml:lang="es">producto regional bruto</s04:prefLabel>
  <s04:prefLabel xml:lang="sr">бруто регионални производ</s04:prefLabel>
  <s04:prefLabel xml:lang="sk">hrubý regionálny produkt</s04:prefLabel>
  <s04:prefLabel xml:lang="el">ακαθάριστο περιφερειακό προϊόν</s04:prefLabel>
  <s04:prefLabel xml:lang="fr">produit régional brut</s04:prefLabel>
  <s04:prefLabel xml:lang="lt">bendrasis regioninis produktas</s04:prefLabel>
  <s04:prefLabel xml:lang="cs">hrubý regionální produkt</s04:prefLabel>
  <s04:prefLabel xml:lang="nl">bruto regionaal product</s04:prefLabel>
  <s04:prefLabel xml:lang="hu">bruttó regionális termék</s04:prefLabel>
  <s04:prefLabel xml:lang="bg">брутен регионален продукт</s04:prefLabel>
  <s04:prefLabel xml:lang="sv">bruttoregionalprodukt</s04:prefLabel>
  <s04:prefLabel xml:lang="et">piirkondlik kogutoodang</s04:prefLabel>
  <s04:prefLabel xml:lang="pl">produkt regionalny brutto</s04:prefLabel>
</rdf:Description>
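One way to check the point made with this snippet, that a single URI carries labels for many languages, is to group the prefLabel values by subject URI. The following is only a minimal Python sketch. It assumes that the s04 prefix is bound to the SKOS core namespace, which the snippet does not show, and it wraps a three-label excerpt in an rdf:RDF element purely so that the fragment can be parsed. The function name labels_by_uri is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS = "http://www.w3.org/2004/02/skos/core#"  # assumed binding for the s04 prefix
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Three labels copied from the snippet; the rdf:RDF wrapper is added here
# only to make the fragment well-formed.
SAMPLE = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:s04="{SKOS}">
  <rdf:Description rdf:about="http://eurovoc.europa.eu/2778">
    <s04:prefLabel xml:lang="en">gross regional product</s04:prefLabel>
    <s04:prefLabel xml:lang="de">Bruttoregionalprodukt</s04:prefLabel>
    <s04:prefLabel xml:lang="fr">produit régional brut</s04:prefLabel>
  </rdf:Description>
</rdf:RDF>"""


def labels_by_uri(rdf_xml: str) -> dict[str, dict[str, str]]:
    """Return {subject URI: {language tag: preferred label}}."""
    result = defaultdict(dict)
    root = ET.fromstring(rdf_xml)
    for node in root:
        uri = node.get(f"{{{RDF}}}about")
        if uri is None:
            continue  # skip nodes that are not identified by rdf:about
        for label in node.findall(f"{{{SKOS}}}prefLabel"):
            result[uri][label.get(XML_LANG)] = label.text
    return dict(result)


labels = labels_by_uri(SAMPLE)
```

If the vocabulary used one URI per concept per language, the outer dictionary would have one key per language; with the excerpt above it has a single key holding all three languages.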

stijngoedertier (not verified) Thu, 18/04/2013 - 17:06

@Martín: I must admit that the Spanish 22-theme taxonomy has much appeal too, because of its simplicity (see also the requirements for controlled vocabularies: https://joinup.ec.europa.eu/discussion/requirements-controlled-vocabularies).

Would there be a way to combine the best of both worlds? That is, use a taxonomy like the Spanish one to enable an easy-to-use search for datasets (e.g. faceted search), and use the EuroVoc taxonomy because it provides an excellent controlled vocabulary for multilingual search?

Anonymous (not verified) Tue, 30/04/2013 - 17:53

@Stijn Unfortunately I can't see any URI from a snippet. What I mean is: what exactly are the URIs used to dereference the different eu:ThesaurusConcept resources? I can't find any documentation about the URI scheme being used for EUROVOC (if any), so I was (probably wrongly) assuming that it may be using a language-dependent schema similar to the one broadly used at the EC portals (whatever.europa.eu/foo/?q=es for example), which is clearly language dependent.

Is there already any working URI scheme in place for the EUROVOC vocabulary? If so, is it documented anywhere?

Makx DEKKERS Wed, 01/05/2013 - 12:53

stijngoedertier (not verified) Thu, 02/05/2013 - 10:03

@Carlos: You are right, I should have added one more triple to the snippet to make it crystal clear. The above-mentioned URI http://eurovoc.europa.eu/2778 is for an eu:ThesaurusConcept (a subclass of skos:Concept). As you can see above, one concept has prefLabels for 22 languages. I hope that this demonstrates that EuroVoc does not have a language-dependent URI schema.

<eu:ThesaurusConcept rdf:about="http://eurovoc.europa.eu/2778"/>

Anonymous (not verified) Thu, 02/05/2013 - 21:32

@Stijn I'm afraid that then we can't conclude anything as the given URI doesn't dereference correctly. See results from the test for details.

It does indeed look like the SKOS concept scheme has been well designed with respect to multilingualism, but (please correct me if I am wrong on this) it also looks like no URI scheme has been implemented on top of it yet, so it is not currently possible to dereference any given RDF concept.

In any case, if the plan is to implement the aforementioned URI scheme on top of it at any time in the future, that will also be clearly language independent when implemented.
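For anyone who wants to repeat this kind of dereferencing check, here is a minimal Python sketch using only the standard library. It is not the test referred to above. It assumes that "dereferenceable" means the URI answers an HTTP request for RDF/XML with a success status and an RDF/XML content type; other definitions (for example accepting Turtle) are equally reasonable. The function names and the injectable opener parameter are hypothetical conveniences.

```python
import urllib.error
import urllib.request


def build_request(uri: str) -> urllib.request.Request:
    """Ask for RDF/XML through HTTP content negotiation."""
    return urllib.request.Request(uri, headers={"Accept": "application/rdf+xml"})


def dereferences(uri: str, opener=urllib.request.urlopen, timeout: float = 10.0) -> bool:
    """True if the URI answers with a 2xx status and an RDF/XML content type.

    `opener` defaults to urllib.request.urlopen (which follows redirects);
    it can be replaced by a stub to try the logic without network access.
    """
    try:
        with opener(build_request(uri), timeout=timeout) as response:
            content_type = response.headers.get("Content-Type", "")
            return 200 <= response.status < 300 and "rdf+xml" in content_type
    except (urllib.error.URLError, TimeoutError):
        # Covers HTTP errors (HTTPError is a URLError), DNS failures and timeouts.
        return False
```

Calling dereferences("http://eurovoc.europa.eu/2778") would run the check against the concept URI discussed in this thread; the result depends on what the server returns at the time of the call.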

Anonymous (not verified) Tue, 07/05/2013 - 22:25

Dear all

With regard to the "dereferenceability" of EuroVoc URIs, I can confirm that they will become "dereferenceable". The Publications Office will provide you with a calendar of implementation as soon as possible. This should make the choice for the use of EuroVoc easier. Please let us know what your opinion is and, of course, we remain open to any suggestion you may have.

acornero (not verified) Wed, 08/05/2013 - 08:42

I agree, Eurovoc can be one of the best choices (the common choice), to be used together with other specialized and local controlled vocabularies.

As for opportunities, Eurovoc is a multilingual tool, currently revised and updated by an official organization. As for its limits, it’s true: it is “European centric” and too oriented to the European Parliament activity. It is complex too but, in my opinion, this is not necessarily a limit. Complexity can sometimes promote accuracy and findability; the problem could be the presence or lack of the necessary skills to use it. Controlled vocabularies (thesauri, classifications and content descriptors) should be used by information and/or domain specialists. Too often it is believed that, once a tool for information findability has been developed or found, the important work is done. Unfortunately, if there is no staff prepared and capable of producing a coherent and consistent system, the result is confusion in the responses of the system. So among the requirements for Eurovoc or other “complex” vocabularies, I would say that it must be used by specialists.

Makx DEKKERS Fri, 10/05/2013 - 14:54

No further action.

stijngoedertier (not verified) Tue, 28/05/2013 - 22:53

@Alessandra: please note that the current Final Draft only proposes to use the 21 EuroVoc domains (proposed controlled vocabulary for dcat:theme). They should be understandable to non-specialists.

However, we have created a number of mappings of existing controlled vocabularies used by European data portals to the EuroVoc domains. You can find them via the link to the Google spreadsheet below. These include the one of publicdata.eu, the Spanish, Austrian, and German application profiles, and the categories used by the City of Ghent data portal.

https://docs.google.com/spreadsheet/ccc?key=0AtYBrl3GPikydEppeERJb2FxVDQzMzBZMjBnWS1KN1E#gid=4

The conclusions after analysing 5 data portals are that:

1. the current proliferation of categorisation systems makes harmonisation difficult.

2. the EuroVoc domain '28 SOCIAL QUESTIONS' seems to be too broad to be practically usable. This could be partially mended by drilling down one level (to the level of the eu:MicroThesaurus in EuroVoc), as the mappings indicate.
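The observation that one domain is too broad can be made measurable by counting how many local categories a mapping sends to each EuroVoc domain. The Python sketch below is only an illustration: the category names and mapping rows are hypothetical and are not taken from the spreadsheet, the function names are invented, and it assumes that a four-digit microthesaurus code (such as 4821) starts with the two-digit code of its domain.

```python
from collections import Counter

# Hypothetical portal categories mapped to EuroVoc codes: two digits for a
# domain, four digits for a microthesaurus. Illustrative rows only.
MAPPING = {
    "Health": "28",
    "Housing": "28",
    "Culture": "28",
    "Shipping": "4821",
    "Justice": "12",
}


def domain_of(code: str) -> str:
    """Assumption: a microthesaurus code begins with its two-digit domain code."""
    return code[:2]


def categories_per_domain(mapping: dict[str, str]) -> Counter:
    """Count how many local categories land in each EuroVoc domain."""
    return Counter(domain_of(code) for code in mapping.values())


def too_broad(mapping: dict[str, str], threshold: int = 3) -> list[str]:
    """Domains that absorb at least `threshold` local categories."""
    counts = categories_per_domain(mapping)
    return sorted(domain for domain, n in counts.items() if n >= threshold)


broad = too_broad(MAPPING)
```

With these made-up rows, domain 28 is the only one flagged. Replacing a "28" target with a four-digit microthesaurus code keeps the category in the same domain for the count while recording the finer target, which is the drill-down described above.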

We may consider using COFOG or forge our own multilingual controlled vocabulary...

Makx DEKKERS Thu, 01/08/2013 - 17:17

Further issue raised in:

http://joinup.ec.europa.eu/release/dcat-application-profile-data-portals-europe-final-draft#comment-14632

We are hesitant regarding the use of EuroVoc to describe the themes/categories. We tested EuroVoc on a set of Swedish municipality datasets but discovered that EuroVoc may be insufficient to enable cross-border discovery of municipality datasets. EuroVoc provides much detail for different political activities, the activities of the European institutions and areas of concern for the EU budget, while much of the work being performed by the public sector in member states ends up in the broad domain “28 Social Questions”. Also, law enforcement, a public task that has attracted significant interest in open data work, gets hidden in the far from logical domain “04 Politics”, while statistics regarding specific crimes end up under the domain “12 Law”.

An alternative proposal: COFOG or NACE

A better choice can be NACE or COFOG. NACE’s weakness is that much of the political activities and policy work ends up in the broad domain “84 Public administration and defence; compulsory social security”. On the other hand, NACE makes it possible to differentiate between policy work and administration of services on the one hand and the production of services on the other. NACE also offers the possibility to increase the granularity of the theme/category from two to three or four digits in a logical order, while EuroVoc is somewhat more difficult to use. NACE, or national variations of it, is also widely used to classify and describe data and economic activities and can therefore be assumed to be easier to adopt for public administrators than EuroVoc.
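The two-to-four-digit granularity mentioned above can be illustrated with a small Python sketch. It assumes codes are written in the dotted form commonly used for NACE Rev. 2 (for example "84.11") and only handles the numeric levels, not the lettered sections. The function name nace_at_level is hypothetical.

```python
def nace_at_level(code: str, digits: int) -> str:
    """Truncate a numeric NACE code such as '84.11' to 2, 3 or 4 digits."""
    if digits not in (2, 3, 4):
        raise ValueError("digits must be 2, 3 or 4")
    raw = code.replace(".", "")
    if not raw.isdigit() or len(raw) < digits:
        raise ValueError(f"cannot express {code!r} at {digits} digits")
    cut = raw[:digits]
    # Two-digit codes have no dot; longer codes place it after the second digit.
    return cut if digits == 2 else f"{cut[:2]}.{cut[2:]}"


levels = [nace_at_level("84.11", n) for n in (2, 3, 4)]
# levels == ["84", "84.1", "84.11"]
```

A catalogue could store the most detailed code a publisher is able to supply and derive the coarser levels when aggregating, so that portals working at different granularities remain comparable.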

COFOG has a similar weakness to NACE, namely that much of the political activities and policy work ends up in the broad domain “General public services”. On the other hand, datasets ending up in “28 Social Questions” using EuroVoc would be divided into more detailed domains such as “10 Social protection”, “06 Housing and community amenities”, “07 Health” and “08 Recreation, culture and religion”. Also, datasets described as “Science” in EuroVoc can get a more detailed description in COFOG by using three-digit codes, since there is an R&D group in each domain.

Further thought needs to be put into the choice of controlled vocabulary for theme/category. Both NACE and COFOG would be more useful than EuroVoc. The United Nations Statistics Division and Eurostat also provide correlation tables making it possible to map NACE rev.2 to COFOG and ISIC rev.4. Such mappings would make it possible to link datasets using any of the three classifications, while EuroVoc lacks similar correlation tables.

Makx DEKKERS Thu, 01/08/2013 - 17:20

Discussion in the meeting of 18 July 2013:

SG: a mapping of the theme taxonomies of 6 open data portals to Eurovoc and COFOG is available at: https://docs.google.com/spreadsheet/ccc?key=0AtYBrl3GPikydEppeERJb2FxVDQzMzBZMjBnWS1KN1E#gid=4

EF: Eurovoc should be considered as the pivot vocabulary, even if its granularity is not enough sometimes. I would include a sentence pointing to other vocabularies, as long as they provide interoperability with Eurovoc and other standard vocabularies, such as COFOG and NACE.

AC: The Publications Office is already working on the alignment of Eurovoc with other standard thesauri. We are willing to expand our alignment to new thesauri.

NL: The WG can provide to the Publications Office a list of taxonomies/vocabularies that need to be aligned to Eurovoc.

[Decision] MD will provide a sentence - Eurovoc in the spec remains as is

Makx DEKKERS Thu, 01/08/2013 - 17:21
