Developing semantic data standards with different sources of truth

Abstract

This article delves into the challenges and solutions associated with developing consistent and coherent semantic data specifications, particularly within the context of eGovernment Core Vocabularies and Application Profiles. It examines the importance of a Single Source of Truth (SSoT) in ensuring the alignment and synchronisation of various artefacts, such as Data Shapes, Formal Ontologies, UML diagrams, and specification documents. The SEMIC Style Guide's recommendation of UML as a preferred SSoT is discussed in detail, alongside insights from interviews with stakeholders who explore the practicalities of implementing different conceptual models, including RDF. The article highlights the necessity of a formally defined language for the SSoT, the value of visual representation in data modelling, and the need for accessible tools that can automate the generation of artefacts, ultimately suggesting pathways to improve the adoption and usability of semantic data standards.

1. Introduction

The SEMIC Style Guide for semantic engineers provides guidelines for the development and reuse of semantic data specifications, in particular, eGovernment Core Vocabularies and Application Profiles. Those guidelines revolve around naming conventions, syntax, artefact management, and organisation. The guidelines are intended to be supplemented with technical tools and implementations that facilitate automatic conformance checking and the transformation of conceptual models into formal semantic representations.

These guidelines are integral to advancing semantic interoperability among EU Member States, with the overarching goal of promoting the adoption of standards. They contribute to this objective by providing expert advice and guidelines on semantic interoperability across organisations, thereby fostering harmonisation and efficiency in data exchange across borders.

2. Problem: Developing consistent and coherent data specifications

In this blog article, we will tackle the problem of creating a consistent and coherent semantic data specification.

To describe the problem, we need to frame it within the semantic data specification development process and its disseminated outcomes.

Semantic data specifications encompass a variety of artefacts, each playing a crucial role in capturing the meaning and context of data within a system. Rather than representing data solely in terms of its structure, semantic specifications delve into the semantics, relationships, and constraints associated with the data. This holistic approach to data modelling involves multiple artefacts working in tandem to articulate the intricacies of data semantics effectively.

By integrating these diverse artefacts, semantic data specifications provide a comprehensive framework for modelling and interpreting data semantics. By leveraging diverse artefacts, stakeholders can also articulate and utilise the rich semantics inherent in complex data ecosystems, fostering interoperability, knowledge discovery, and intelligent data processing.

2.1. The requirements on the dissemination artefacts

The following list enumerates the most common artefacts that are (expected to be) included as part of a semantic data specification at the time of its publication:

  • Formal vocabulary definition (e.g. OWL/RDF)
  • Formal data shape specification (e.g. SHACL)
  • Human-readable reference documentation (e.g. HTML, PDF) 
  • Visual representation (e.g. ER or UML diagrams)

Additionally, technical interoperability artefacts can be included in the semantic data specification, such as:

  • JSON context definition
  • XSD element definition
  • Relational Database schema (SQL create statements)
  • API definition schema

The above can be supported with additional human-readable explanatory documentation such as handbooks, rulebooks and style guides.
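To make the relationship between a conceptual model and these artefacts concrete, here is a minimal sketch in Python that renders one conceptual element (a Person class with a fullName attribute, both invented for illustration, along with the example.org namespace) as an OWL vocabulary fragment and a SHACL shape, serialised as Turtle strings. This is not the SEMIC toolchain, only an illustration of the principle that several dissemination artefacts can be derived from one source:

```python
# One hypothetical conceptual element, rendered as two artefacts.
# The class name, property, and namespace are illustrative only.

concept = {
    "class": "Person",
    "namespace": "http://example.org/voc#",
    "properties": [{"name": "fullName", "range": "xsd:string", "min": 1, "max": 1}],
}

def to_owl(c):
    """Render the conceptual element as an OWL/RDF vocabulary fragment (Turtle)."""
    lines = [f"<{c['namespace']}{c['class']}> a owl:Class ."]
    for p in c["properties"]:
        lines.append(f"<{c['namespace']}{p['name']}> a owl:DatatypeProperty ;")
        lines.append(f"    rdfs:domain <{c['namespace']}{c['class']}> .")
    return "\n".join(lines)

def to_shacl(c):
    """Render the same element as a SHACL node shape constraining instance data."""
    lines = [f"<{c['namespace']}{c['class']}Shape> a sh:NodeShape ;",
             f"    sh:targetClass <{c['namespace']}{c['class']}> ;"]
    for p in c["properties"]:
        lines.append(f"    sh:property [ sh:path <{c['namespace']}{p['name']}> ;")
        lines.append(f"        sh:minCount {p['min']} ; sh:maxCount {p['max']} ] .")
    return "\n".join(lines)

print(to_owl(concept))
print(to_shacl(concept))
```

Note how the vocabulary definition and the data shape express different concerns (meaning versus constraints) over the very same conceptual element, which is why keeping them aligned by hand is error-prone.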

3. Solution: Single Source of Truth (SSoT)

The Single Source of Truth (SSoT) paradigm is a foundational concept that improves how conceptual modelling of the data specification is approached.

The SSoT acts as the core representation that encapsulates all the information and agreements required to derive all artefacts of a data specification. These include but are not limited to Data Shapes (SHACL), Formal Ontologies (OWL/RDF), UML diagrams, and specification documents (HTML). The SSoT acts as a centralised source that ensures that all artefacts remain in sync and aligned with the definitive representation of the data semantics, structures, and relationships.

Utilising the SSoT principle, organisations can, for example, derive UML or simple Entity-Relationship diagrams to visualise the structural aspects of the data model, capturing entities, attributes, and their relationships. Data Shapes (SHACL) are then formulated to define the constraints to which the data instances must adhere, ensuring data integrity and consistency.

Ontologies (OWL/RDF) are derived from the SSoT to provide a formal representation of the domain's concepts, properties, and relationships. They serve as the semantic backbone of the data specification, facilitating interoperability and semantic understanding across heterogeneous systems.

Additionally, specification documents (HTML) are generated from the SSoT for human consumption, to document the data specification guidelines, naming conventions, syntax, organisational principles, and examples. These documents serve as comprehensive references for stakeholders, guiding them in the interpretation and implementation of the data specification standards.
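The same generation principle applies to the human-readable documentation. Below is a small Python sketch of how a reference page could be produced from the central model; the model dictionary and the output shape are illustrative assumptions, not the actual format used by any SEMIC tool:

```python
# A sketch of generating a human-readable reference page from the same
# model that drives the formal artefacts. All names are illustrative.
from html import escape

model = {
    "title": "Example Core Vocabulary",
    "classes": [
        {"name": "Person", "definition": "An individual human being.",
         "properties": [{"name": "fullName", "definition": "The complete name."}]},
    ],
}

def to_html(m):
    """Render the model as a minimal HTML reference document."""
    parts = [f"<h1>{escape(m['title'])}</h1>"]
    for cls in m["classes"]:
        parts.append(f"<h2>{escape(cls['name'])}</h2>")
        parts.append(f"<p>{escape(cls['definition'])}</p>")
        parts.append("<ul>")
        for prop in cls["properties"]:
            parts.append(f"<li><b>{escape(prop['name'])}</b>: "
                         f"{escape(prop['definition'])}</li>")
        parts.append("</ul>")
    return "\n".join(parts)

print(to_html(model))
```

Because the definitions live only in the model, editing a definition there updates both the formal artefacts and the documentation on the next generation run.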

By maintaining a centralised SSoT, organisations ensure that updates or modifications to the data specification propagate seamlessly across all artefacts, maintaining consistency and coherence. This approach fosters clarity, reliability, and interoperability in data specification efforts, ultimately enhancing the effectiveness and usability of the data model.

4. How to choose a “good” SSoT?

In the section above, we summarised the reasons why a SSoT is needed and outlined the requirements it should fulfil, i.e. the kind of artefacts it should generate and support. The SEMIC Style Guide recommends the use of UML as a SSoT, which has been demonstrated to be a viable solution in the development of many data specifications when used together with software tools that enable the automated generation of artefacts.

Many members of the SEMIC community have raised issues with the choice of UML, underlining the lack of implicit semantics of the UML language, as well as the absence of good non-commercial UML editors. The Style Guide, however, addresses these concerns. On one hand, it specifies what restricted subset of UML should be used [see the section about the Conceptual Model], what clear semantics should be associated with each recommended UML element [rule CMC-R2], and what kind of UML features should be used to encode non-semantic information [for example UML Tags]. On the other hand, the Style Guide recommendations do not rely on any specific UML editor, but only on the open and standardised XMI exchange format [XMI]. Moreover, the Style Guide provides references to free and open-source tools and toolchains that can handle the generation of artefacts from a UML conceptual model as encoded in the XMI exchange format.
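Relying on XMI rather than on a specific editor means that downstream tooling only needs to read a standard XML format. The sketch below parses a deliberately simplified XMI 2.x fragment with Python's standard library and lists the UML classes it contains; real exports from UML editors vary in namespaces and structure, so this is an illustration of the approach rather than a production-ready reader:

```python
# A sketch of reading UML classes from an XMI export, assuming a
# simplified XMI 2.x fragment. Real tool exports differ in detail.
import xml.etree.ElementTree as ET

XMI = """<xmi:XMI xmlns:xmi="http://www.omg.org/spec/XMI/20131001"
                 xmlns:uml="http://www.omg.org/spec/UML/20131001">
  <uml:Model name="CoreVocabulary">
    <packagedElement xmi:type="uml:Class" name="Person"/>
    <packagedElement xmi:type="uml:Class" name="Organisation"/>
  </uml:Model>
</xmi:XMI>"""

root = ET.fromstring(XMI)
XMI_NS = "{http://www.omg.org/spec/XMI/20131001}"
# Collect every packagedElement whose xmi:type marks it as a UML class.
classes = [el.get("name") for el in root.iter("packagedElement")
           if el.get(f"{XMI_NS}type") == "uml:Class"]
print(classes)  # → ['Person', 'Organisation']
```

A generator pipeline would walk the same tree further (attributes, associations, tags) and feed the extracted elements into the artefact templates.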

There are other advantages of using UML to express the Conceptual Model that is to serve as a SSoT: it is visual, relatively simple (when restricted to the recommended elements), standardised, and widely known, and there are many free editors as alternatives to reliable commercial tools, to name just the most important ones.

The alternative to UML most frequently mentioned by SEMIC community members was RDF. To understand how various stakeholders use different conceptual models as a SSoT, we conducted interviews with multiple stakeholders and knowledge representation experts.

4.1. Overview of the interviews

To learn more about the existing practices, needs, and recommendations of the SEMIC Style Guide target audience, we have conducted interviews with six subject matter experts. More information about the interviewees can be found in the Annex. In these interviews we were interested in learning about the interviewees’:

  • Modelling practices 
  • Model management approaches
  • Publishing practices 
  • The audience they address (the clients of the interviewees)
  • The tooling they use

We used the following leading questions in our interviews, which were also shared with the interviewees in advance:

  1. Do you use a conceptual model to generate artefacts?
  2. In what language is this conceptual model expressed?
  3. What languages do you use for other purposes?
  4. For which purposes do you develop the model?
  5. How do you deal with vocabulary specification?
  6. How do you define data exchange representation?
  7. How do you document the data specification?
  8. What are the main benefits of your way-of-working?
  9. What functionality is not enabled by using this way-of-working?
  10. What would you improve? What is still lacking?
  11. What are you currently missing in the exchange of semantic specifications?
  12. What would it take to make semantic specifications easily usable for you?

These “leading” questions were used both to help the interviewees prepare for the interviews (i.e. to get a sense of what SEMIC is interested in, and to encourage them to think about these aspects of their work) and to provide a thread to follow during the discussions. In practice, the interviews were quite open and free-flowing. The interviewees were not only able to describe their practices, but also had the chance to demonstrate the tools they have built and/or used, and the projects they are working on. There were also opportunities to touch upon many challenges of a practical, technical and philosophical nature and to have in-depth discussions.

Below we present the key takeaways from these interviews, describing each of them in a dedicated section. 

4.2. SSoT vs. Single Management Tool

A general observation we made is that, in principle, all interview participants agreed that having a Single Source of Truth (SSoT), from which all published artefacts can be generated, is desirable. Some of the interviewees were editing the various publishable artefacts manually, while in parallel maintaining an overarching model which could have served as a SSoT, but they were not using it to generate all the artefacts. They nevertheless agreed on the benefits of a SSoT. One striking realisation was the prevalence of ‘Single Management Tools’ that were seen by their users as the SSoT.

These ‘Single Management Tools’ combine multiple sources of information, allowing them to be explored and edited in a unified fashion, thereby approximating the behaviour of a Single Source of Truth. While they provide the same end result that a SSoT would, in contrast to the latter, the necessary information on the data specification is scattered across different representations within the same tool.

Often these tools are also integrated with the artefact generation scripts and algorithms, so that the artefacts can be seamlessly viewed and exported, sometimes even published from within the platform. Most tools that were presented to us were free and open source, but each provided a custom solution, most often combining multiple standard technologies into a non-standard integration. The solutions also addressed certain needs specific to their user community.

4.3. Custom RDF-based representations as a SSoT

One of the reasons for setting up these interviews was that, following the publication of the Style Guide, many stakeholders recommended using pure RDF as the SSoT instead of UML. Some even claimed that they build data specifications from a SSoT expressed entirely in RDF. From these interviews, it became clear that the solutions used by our interviewees either require some complementary information in addition to the main RDF/OWL model, or they cannot generate all the artefacts from the SSoT.

In addition, the importance of visual representations was highlighted by multiple participants (see a separate discussion point below). Plain RDF is not well suited to encoding a precise visual representation of a model. Although there are ways to automatically generate diagrams from an RDF/OWL model, achieving sufficient quality for serving as a visual communication medium with subject matter experts requires additional customisation or intervention. This becomes even more evident as the complexity of the domain increases. Most often, one would like, or need, to rearrange the automatically generated diagrams to better convey the most important concepts. Being able to save these arrangements, and not having to redo them every time the underlying RDF model is modified, is of great importance. Although it is not unimaginable to express the visual, diagrammatic information in RDF, and some stakeholders do exactly that, the language was not designed to encode such visual representations naturally. Therefore, most of our interviewees use other languages to encode the visual representation.
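To illustrate what "saving the arrangement" can look like in practice, the sketch below records node positions in a custom, non-standard vocabulary serialised next to the main model. The ex: namespace and the ex:x / ex:y terms are hypothetical, invented for this example; they stand in for the custom RDF layout representations that some of the interviewed tool builders use:

```python
# A sketch of one way diagram layout could be kept alongside the model:
# a custom vocabulary (ex:x, ex:y are hypothetical, non-standard terms)
# recording where each class node sits, so regeneration of the diagram
# after a model change can restore the manual arrangement.

layout = {"http://example.org/voc#Person": (120, 80),
          "http://example.org/voc#Organisation": (340, 80)}

def layout_to_turtle(positions):
    """Serialise node coordinates as Turtle using the hypothetical ex: terms."""
    lines = ["@prefix ex: <http://example.org/diagram#> ."]
    for node, (x, y) in sorted(positions.items()):
        lines.append(f"<{node}> ex:x {x} ; ex:y {y} .")
    return "\n".join(lines)

print(layout_to_turtle(layout))
```

Because every tool invents its own terms for this purpose, such layout data is rarely portable between tools, which is part of the fragmentation discussed below.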

4.4. Visual editing is very important

The importance of the visual representation, and in most cases even the visual editing, of the conceptual model that serves as the SSoT was underlined in almost every interview. Unsurprisingly, all those who build their own tools have added support for this feature, to varying degrees of complexity. All interviewees recommended the use of a simple and intuitive graphical representation, with most of them preferring a “UML-like” diagram. That is, it need not necessarily be UML, but it should provide similar features, such as simple UML class diagrams. In fact, the recommendations in the Style Guide for building Class Diagrams using a restricted set of UML elements do fulfil this requirement. Although some of the interview participants were not opposed to UML and were using it themselves, most concerns regarding UML stemmed from the fact that users, when confronted with a generic UML editor, might struggle to determine which specific features to use and what semantics will be associated with those elements. This could be mitigated by building a specific UML Profile that supports the creation of conceptual models serving as a SSoT for Semantic Data Specifications. Providing better, dedicated documentation describing the subset of UML that should be used would also be helpful.

4.5. Good support for vocabulary management is essential

Several interviewees highlighted the advantage of having support for vocabulary management. This includes both (a) being able to discover and explore relevant (external or internal) vocabularies that users might consider reusing in the process of building a new data specification, and (b) ensuring that published vocabularies follow certain recommendations regarding their structure and information content that will allow them to be reused in turn. By using a single management system as a proxy for a SSoT, the tool builders can seamlessly add vocabulary management into the editors' workflow. Firstly, since all editors must use the same system, modelling rules can automatically be enforced across an organisation through the tool. This approach allows an organisation to ensure a base level of quality on all the models created by them. Secondly, it can facilitate the reuse of elements originating from other vocabularies managed in the same system. Reuse is facilitated by the increased findability of concepts and easy import features. Finally, by connecting the editing system to the publishing system, an integrated workflow can be set up for the versioning and publishing of vocabularies.
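The reuse-discovery step described above can be sketched as a simple label search over the vocabularies managed in one system. The vocabulary names, term labels, and URIs below are invented for illustration; a real tool would query its own store or a registry rather than an in-memory dictionary:

```python
# A toy illustration of term discovery across vocabularies managed in a
# single system. All names, labels, and URIs here are hypothetical.

managed_vocabularies = {
    "core-person": {"Person": "http://example.org/person#Person",
                    "birthDate": "http://example.org/person#birthDate"},
    "core-org": {"Organisation": "http://example.org/org#Organisation"},
}

def find_reusable_terms(label):
    """Return (vocabulary, URI) pairs whose term label matches, case-insensitively."""
    hits = []
    for vocab, terms in managed_vocabularies.items():
        for term, uri in terms.items():
            if label.lower() in term.lower():
                hits.append((vocab, uri))
    return hits

print(find_reusable_terms("person"))
```

An editor integrated with such an index can suggest existing terms while the user types, nudging modellers towards reuse instead of redefinition.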

4.6. Reinventing the wheel: everyone is creating their own tool

One key takeaway from these interviews was that everybody uses different tools or combinations of tools to create and manage vocabularies and data specifications for their organisations or clients. This highlights the great diversity of the existing requirements, both expressed and perceived, and the range of possible solutions considered to address similar challenges. All participants found the majority of the recommendations in the SEMIC Style Guide adequate and were attempting to integrate them into their own processes and tools. Unfortunately, and as expected, many organisations developed their own tools to meet their specific needs, resulting in a fragmented tooling landscape. To name a few that our interviewees work with and develop: Finland has developed and uses the Tietomallit Data Vocabularies Tool, Sweden and some other countries use EntryScape, while in the Czech Republic the Dataspecer tool is being developed. These tools are designed and created to allow exploration and editing of vocabularies, application profiles and/or various data models. Despite their diverse modelling approaches and conceptualisations (ranging from classes and properties to forms, fields, schemas, constraints, ER diagrams, etc.), these tools share a common element: their underlying models are closely bound to linked data and semantic web principles (they follow an “RDF first” approach and represent their models using RDFS, OWL, SKOS, SHACL, PROF, etc.). In some of these tools (e.g. EntryScape and Tietomallit) the primary way of building the data models is not a visual, diagrammatic one, but rather managing vocabularies and completing information in forms. It is also worth mentioning that in all these tools the graphical notation is stored in a custom RDF representation format, together with the main model of the data specification.
Although these tools can be used as a single management tool for data specifications, it is not clear what the single source of truth is, in the most concrete sense of the term, from which all artefacts that constitute a complete data specification can be generated. These single management tools can be used as a proxy for a SSoT within an organisation. However, when crossing organisational boundaries (and typically tool-usage boundaries), the sharing of the SSoT is very limited. Mostly, the content has to be recreated in the new tool based on generated artefacts, instead of directly interconnecting the SSoTs of different tools.

4.7. We need a dedicated, formally defined, language for SSoT

As highlighted in the previous sections, most of the interviewees agreed that visual representation of a semantic specification is important, and they recommended either UML or a UML-like representation for it. However, it was also pointed out to us that people are generally bad at modelling. One of the interview participants suggested that talking to subject matter experts, especially in the initial phases of the modelling, is best done in front of a whiteboard. People, in general, are very good at expressing their thoughts by drawing circles and arrows, with associated names for each. However, only a few are able to translate those circles and arrows into good models. It was also highlighted that someone looking at a diagram might think that they know what that diagram represents, while in fact it was meant to represent something else. So, ultimately it is the job of a semantic knowledge engineer to convert the vague conceptual model that subject matter experts can express into a conceptual model that can fully and correctly represent a semantic data specification. How such a comprehensive conceptual model should be expressed, what elements it should and may contain, and what precise semantics each of those model elements should carry, is something that ought to be formally defined. This idea was suggested in multiple interviews.

There is already some work that addresses, at least partially, the need for a formal definition of a SSoT. The Style Guide explains how a restricted UML model, with well-assigned semantics, can be used as a conceptual model. The blog article about the modelling of Application Profiles also provides some suggestions to complement that. In a parallel effort, one interviewee is working on the Data Specification Vocabulary (DSV) and its Default Application Profile (DSV-DAP), which could be a very promising basis for such a formalisation.
It was expressed on multiple occasions during the interviews that, if SEMIC decides that creating such a language is a priority and dedicates appropriate resources to this effort, there is great interest from members of the community in joining this undertaking.

Ideally, the conceptual model that constitutes the SSoT of a data specification should be exchangeable and published along with the artefacts that are generated from it. Having a formal specification of the SSoT would make this possible.

4.8. We need free tools for automatic artefact generation from SSoT

While there is consensus that developing a data specification based on a SSoT would be ideal, during these interviews it was also recognised that community acceptance would be significantly enhanced if SEMIC could offer robust, freely available software tools that generate the necessary artefacts from this SSoT. The SEMIC Style Guide provides examples of tools that can generate the artefacts of a data specification from a conceptual model (in SEMIC’s case, UML models encoded as XMI files), for example in the last paragraph of the “Transformation of the conceptual model” section. However, it is beyond the scope of the Style Guide to provide strong recommendations in this regard.

Based on the conducted interviews, it appears that a tool developed according to SEMIC requirements, which would be “officially” supported, and recommended by SEMIC would increase adoption of the SSoT. Such a tool should be accompanied by comprehensive documentation that clearly defines the required input and the resulting output.

5. Conclusion

In this blog article, we delved into the challenge of developing consistent and coherent data specifications. More specifically, we highlighted the importance of this issue, and outlined the requirements for effectively addressing it. We demonstrated that maintaining a Single Source of Truth (SSoT) is one of the more efficient ways to address such a challenge. 

To identify the best representation of a SSoT, we conducted several interviews with subject matter experts engaged with the SEMIC methodology. This blog article summarises the key insights gained from these interviews and offers perspectives on potential future directions for the SEMIC Style Guide.

6. Annex: The interviewees

The interviewees were six subject matter experts, each coming from a different country. They have all worked closely on interoperability matters for their own governments and have been active members of the SEMIC community.

Alkula, Riitta

Riitta Alkula is the chief specialist at the Finnish Digital and Population Data Services Agency. This Agency acts as the nation's digital agency and is therefore responsible for interoperability within Finland, and for Finland’s interoperability with other Member States. Riitta is part of the team maintaining, upgrading, developing and promoting Finland's interoperability platforms: Tietomallit, Sanastot and Koodistot.

Grönewald, Matthias

Matthias Grönewald is a product manager in the German Federal IT-Cooperation’s (FITKO) GovData team. He is responsible for connectivity management and standardisation. He is an active member of the team maintaining DCAT-AP.de, i.e. the German profile of DCAT-AP.

Klimek, Jakub

Jakub Klímek has been working in the area of Linked and Open Data since 2013. Since 2015 he has been working as a Linked Open Data expert at the Ministry of Interior of the Czech Republic. Since then, the Linked Open Data agenda has moved to the Czech Digital and Information Agency, where Jakub is now in charge of the technical aspects of the Czech National Open Data Catalog. In 2023 he joined the SEMIC team as one of the maintainers of DCAT-AP and its sub-profiles.

Palmér, Matthias

Matthias Palmér is the co-founder and CTO of MetaSolutions, a company specialising in linked data technologies and, most prominently, the open-source platform EntryScape. Matthias is also a consultant for the Swedish Agency for Digital Government. For the last ten years Matthias has been the maintainer of DCAT-AP-SE, i.e. the Swedish profile of DCAT-AP. He has also supported other national adaptations, as well as being involved in the community and development of DCAT-AP itself.

Winstanley, Peter

Peter Winstanley is an ontologist at Semantic Arts where he works in enterprises of scale to develop interoperability solutions based on RDF, OWL and other semantic web technologies. He was a contributor to the W3C “Data on the Web Best Practices” recommendation and an editor of the W3C “Data Catalog” vocabulary recommendation. A former interoperability specialist with the UK Government Linked Data and Data Architects’ Working Groups, he is currently co-chair of the W3C Dataset Exchange Working Group.

Yang, Jim

Jim Yang is a senior adviser at the Norwegian Digitalisation Agency. This agency has overall responsibility for Norway’s cross-sector and cross-border interoperability activities. Jim has been responsible for the Information Governance Framework in Norway. For the last ten years Jim has been the creator and maintainer of various Norwegian semantic data specifications, including DCAT-AP-NO, i.e. the Norwegian profile of DCAT-AP. He has also been involved in the community and development of DCAT-AP itself.

We would like to thank all the interviewees for taking part in the interviews and for their active involvement in SEMIC activities over the years.