
PR2 - Add new property to Dataset to indicate whether the Dataset is public, restricted or non-public

Anonymous (not verified)
Published on: 10/03/2015 Discussion Archived

Description

From: http://joinup.ec.europa.eu/mailman/archives/dcat_application_profile/2015-February/000123.html

It is often the case that access to data requires user registration and/or authorisation.

Including this information in metadata would be beneficial for a number of reasons, including the following ones:

  • Grouping data based on access restrictions
  • Informing users about access restrictions
  • Allowing users to filter data based on access restrictions

 

From: http://joinup.ec.europa.eu/mailman/archives/dcat_application_profile/2015-February/000120.html

Based on feedback from public sector organizations we have thrown in these extra properties for the dataset class:

  • Access Level - to distinguish open data from the rest by dividing into public, restricted and non-public datasets.
  • Access Rights - to express why the dataset is restricted/non-public. Applies only for restricted/non-public datasets.

Proposed solution

Add new property to Dataset to indicate whether the Dataset is public, restricted or non-public.

  • Recommending the specification of access restrictions, possibly by using dct:accessRights
  • Identify / develop a code list for access restrictions

A possible list of access restrictions:

  • no limitations
  • registration required (non-discriminatory)
  • authorisation required (“closed data”, that only authorised users can access)
  • unknown

Component

Code

Category

improvement

Comments

Anonymous (not verified) Tue, 24/03/2015 - 09:42

This property is in use in the Project Open Data metadata schema, but with a slightly different twist. From their usage note:

"This field refers to the degree to which this dataset could be made available to the public, regardless of whether it is currently available to the public. For example, if a member of the public can walk into your agency and obtain a dataset, that entry is public even if there are no files online. A restricted public dataset is one only available under certain conditions or to certain audiences (such as researchers who sign a waiver). A non-public dataset is one that could never be made available to the public for privacy, security, or other reasons as determined by your agency"

Anonymous (not verified) Wed, 08/04/2015 - 18:48

I agree with Øystein in principle. data.gov.uk has not just public data, but plenty of 'unpublished' ones, with some text explaining why, or whether there are plans/timeline to review it or release it. So either we add these two fields, or people will just have to understand that these datasets have no distributions or licence. (As he notes, the two fields needed are subtly different to the USA ones.)

Makx DEKKERS Thu, 09/04/2015 - 20:19

It is not entirely clear to me what the requirement is. I see two different requirements from the description of the issue and the comments:

 

1. a need for one property expressing a type of access licence for the dataset and one property to explain why a dataset is not open; this was the original request submitted by Øystein.

This one appears to be related to the discussion about providing licence information on the level of the dataset and to the issue https://joinup.ec.europa.eu/discussion/pr8-move-dctrights-distribution-dataset. We seem to converge on a conclusion that this is not in line with the model of DCAT.

 

2. a need for a property to tell users that the dataset does not have distributions (yet), which seems to be the need that David expresses.

If this is merely used to give a human user information, this could easily be done by adding a dct:description to the Dataset.

 

 

Andrea PEREGO Fri, 10/04/2015 - 00:20

Makx,

I think that the parallel discussion concerning issue PR3 can help clarify the requirements.

Also, I think it would be important to make a distinction between access and use conditions. Licences describe only how a resource can be used, not who can access it, and under which conditions. In theory, you can have "closed" data (i.e., data accessible only to authorised users) released according to an "open" licence (e.g., CC BY). 

As far as I can understand it, the issue under discussion is only about access, and the requirement is about associating datasets with information concerning their access levels / restrictions, that can be used by data consumers (humans and software agents) to filter out, e.g., those that they won't be able to access - I elaborated this in a comment to issue PR3.

The proposed solution can address this requirement.

About expressing why a dataset does not have a public distribution, this can be addressed by using dct:description, as you propose, or a different, and possibly more specific, property (vann:usageNote?).

Anonymous (not verified) Thu, 23/04/2015 - 11:57
It is always easier to raise an issue than to suggest solutions. I'll try my best, despite the risk of exposing my lack of knowledge.
 
I see two different needs here
 
# 1. A need to express that a distribution has access-restrictions as identified by Andrea and Bert over at PR3
# 2. A need to describe datasets that are not open data
 
I guess #1 will require a controlled vocabulary - something like this:
registration required (non-discriminatory)
authorization required (discriminatory)
++
 
Suggestion: use dcat:distribution dct:description  - for now?
 
#2 concerns datasets which don't belong in an "open data catalogue". By adding this property, entities can provide core descriptions for all their datasets using dcat-ap, in a "data catalogue". I think this value has to be set on dcat:dataset to be able to express this without the existence of dcat:distributions
 
There are no known namespaces for this besides POD's metadata schema, as far as I know
 
Suggestion A:
add "pod:accessLevel" on dcat:dataset as optional
Cardinality: 0..1
Range: skos:Concept ?
Usage: This field indicates the extent to which the dataset can be made available to the public, regardless of whether it has distributions or not. Required values: "public"*, "restricted public", "non-public". A restricted public dataset is one only available under certain conditions or to certain audiences (such as researchers who sign a waiver). A non-public dataset is one that could never be made available to the public for privacy, security, or other reasons as determined by your agency.
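
As an illustration only, the constraints in Suggestion A (cardinality 0..1 and a closed list of values) could be checked along the following lines. This is a minimal sketch: the function name and the way values are passed in are assumptions made for the example, not part of POD or DCAT-AP.

```python
# Allowed values as listed in Suggestion A.
ALLOWED_ACCESS_LEVELS = {"public", "restricted public", "non-public"}


def check_access_level(values):
    """Return a list of problems for the accessLevel values of one dataset.

    `values` is the list of values found for the property on a single
    dataset; an empty list means the optional property is absent.
    (Hypothetical helper, written for this example.)
    """
    problems = []
    # Cardinality 0..1: zero or one value is fine, more is an error.
    if len(values) > 1:
        problems.append("cardinality 0..1 exceeded: %d values" % len(values))
    # Every value present must come from the closed list.
    for value in values:
        if value not in ALLOWED_ACCESS_LEVELS:
            problems.append("unknown access level: %r" % value)
    return problems
```

A dataset without the property passes, since the property is optional; a dataset carrying both "public" and "non-public" is reported as exceeding the cardinality.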
 
 
Suggestion B:
Reject as not compliant with DCAT proper

 

Anonymous (not verified) Mon, 11/05/2015 - 11:22

Regarding point 2 of #3, a simple SPARQL query counting the dcat:Distribution of a dcat:Dataset would be sufficient I think.

 

A returned value equal to zero means no dcat:Distributions for this dcat:Dataset (yet).
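
A minimal sketch of the counting idea above. The SPARQL text is one possible formulation and is not executed here; the Python function mirrors it over a few hard-coded triples so that the example is self-contained. The `ex:` identifiers and the tuple representation of triples are assumptions made for the illustration.

```python
# One possible SPARQL formulation (shown for reference, not executed here).
# OPTIONAL keeps datasets without distributions, so their count is 0.
QUERY = """
PREFIX dcat: <http://www.w3.org/ns/dcat#>
SELECT ?dataset (COUNT(?dist) AS ?n)
WHERE {
  ?dataset a dcat:Dataset .
  OPTIONAL { ?dataset dcat:distribution ?dist }
}
GROUP BY ?dataset
"""


def count_distributions(triples):
    """Count dcat:distribution links per dcat:Dataset in (s, p, o) tuples."""
    counts = {}
    # First pass: every typed dataset starts at zero.
    for s, p, o in triples:
        if p == "rdf:type" and o == "dcat:Dataset":
            counts.setdefault(s, 0)
    # Second pass: add one per distribution link of a known dataset.
    for s, p, o in triples:
        if p == "dcat:distribution" and s in counts:
            counts[s] += 1
    return counts


# Hypothetical sample data: ex:ds2 has no distribution (yet).
triples = [
    ("ex:ds1", "rdf:type", "dcat:Dataset"),
    ("ex:ds1", "dcat:distribution", "ex:ds1-csv"),
    ("ex:ds2", "rdf:type", "dcat:Dataset"),
]
```

Here `count_distributions(triples)` gives a count of 1 for `ex:ds1` and 0 for `ex:ds2`. Note that a zero count only says that no distribution is described; it does not say why.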

Anonymous (not verified) Fri, 12/06/2015 - 14:10
I hope I haven't added too much confusion to proposals PR2 and PR3, first through a poor description of the proposal, then by mixing up the properties during the last call.

The initial proposal was to add two properties:

  • Access Level - to distinguish open data from the rest by dividing into public, restricted and non-public datasets.
  • Access Rights - to express WHY the dataset is restricted/non-public. Applies only for restricted/non-public datasets.

These were added to Joinup as PR2 and PR3:

  • PR2 - Add new property to Dataset to indicate whether the Dataset is public, restricted or non-public.
  • PR3 - Add new property to Dataset to indicate WHY the Dataset is restricted or non-public.

Why we think this is a good idea:

1. Potential for sharing within the public sector itself

The initial years of the Government’s open data actions were primarily motivated by creating new business opportunities and jobs in the private sector and improving openness. Potential efficiencies in the public sector were in many ways a secondary goal. However, there has been a shift in recent years towards recognizing the huge potential which the opening up and sharing of public sector information can have for innovation and efficiency within the public sector itself.

When organisations consider data for release, they often only consider what can be released openly for all. However, there is also a great deal of data which may not be suitable for full opening but is still highly valuable for the public sector itself. By concentrating only on identifying data which can be released publicly for everyone, we are potentially missing out on much of the data which other agencies would like access to.

2. The user knows what exists

There is often a "catch 22" situation when identifying data for release. The public sector asks the user community what data they would like and says it will prioritise this for release. However, the user community is often not aware of what exists and therefore cannot respond. This is particularly relevant now as many of the obvious examples and low-hanging fruit have been released.

Including descriptions of restricted and non-public datasets provides (potentially) an overview of all the data an agency holds. The overview, when made public, gives enough information for users (both public and private sector) to prioritise the most interesting data for release. Potential external users can also examine the data categorised as "restricted" and bring forward arguments about why some of it could be re-categorised as ‘public’ data if they see fit.

3. Speed up the data delivery process

Organisations are often tempted to initiate a long and thorough process before the release of any information. Some simple tools are necessary to assist in the early identification and release of some data. A property that allows for inclusion of metadata on all datasets (not only open data) contributes here.

Why not put this property on dcat:distribution?

Adding this property to dcat:dataset enables agents to add descriptions of datasets with no distributions to their inventory list/data catalogue. Identifying technical obstacles on data distributions (log-ins, manual API key requests and more) is another issue and must be solved on dcat:Distributions.

How can providers indicate WHY a dataset is "restricted" or "non public"?

This is where "PR3 - Add new property to Dataset to indicate why the Dataset is restricted or non-public" comes in.

Initially dct:accessRights was proposed here as an "access level comment" property (and "pod:accessLevel" for indicating public, restricted and non public). Since there is no widely used vocabulary for access level (POD has invented their own here), dct:accessRights might be a better choice for expressing access level (public/restricted/non public), leaving access comment (PR3) with no proposed vocabulary.

In comparison, POD (Project Open Data) is using their own "accessLevel" to express IF a dataset is public/restricted/non public and dct:rights to express WHY (https://project-open-data.cio.gov/v1.1/metadata-resources/#field-mappin…).

Their rationale for adding these properties is very much in line with our ideas:

<quote>
We added the accessLevel field to help easily sort datasets into our three existing categories: public, restricted public, and non-public. This field means an agency can run a basic filter against its enterprise data catalog to generate a public-facing list of datasets that are, or could one day be, made publicly available (or, in the case of restricted data, available under certain conditions). This field also makes it easy for anyone to generate a list of datasets that could be made available but have not yet been released by filtering accessLevel to public and accessURL to blank.

We added the rights field (formerly accessLevelComment) for data stewards to explain how to access restricted public datasets, and for agencies to have a place to record (even if only internally) the reason for not releasing a non-public dataset.
</quote>

Definitions (proposal):

  • Public: Open data. Data with no sensitivity or privacy issues.
  • Restricted: Datasets which cannot be published as open data, but can still be shared within the public sector or shared under certain conditions. If the agency is unsure about whether a dataset can be opened, it should categorise it as restricted and perform further examination.
  • Non-public: Highly sensitive data. Can at most be shared with the person or agency the data concerns or under highly restrictive conditions.
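
The filter mentioned in the POD rationale quoted above (accessLevel set to public, accessURL blank) can be sketched roughly as follows. The flat dictionary records and the sample entries are simplifying assumptions for the illustration; they are not the POD schema itself, and the function name is hypothetical.

```python
def releasable_but_unreleased(catalog):
    """Titles of datasets marked public that have no access URL yet."""
    return [
        entry["title"]
        for entry in catalog
        # A missing or empty accessURL both count as "blank".
        if entry.get("accessLevel") == "public" and not entry.get("accessURL")
    ]


# Hypothetical inventory covering all three access levels.
catalog = [
    {"title": "Road network", "accessLevel": "public",
     "accessURL": "http://example.org/roads.csv"},
    {"title": "Bus timetables", "accessLevel": "public", "accessURL": ""},
    {"title": "Patient records", "accessLevel": "non-public"},
]
```

With this sample inventory, only "Bus timetables" is selected: it is marked public but has nothing to download yet. Such a filter only works if datasets without distributions can be described in the catalogue at all, which is the argument made above for putting the property on the dataset.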
Makx DEKKERS Sun, 21/06/2015 - 12:57

The Working Group decided in its meeting of 10 June 2015 to add the optional property dct:accessRights to Dataset with reference to a controlled vocabulary with three members – Public, Restricted, Non-public – to be created and maintained by the Publications Office.

Anonymous (not verified) Thu, 09/06/2016 - 19:22

Has the EU Publications Office already created the vocabulary mentioned above?

Makx DEKKERS Thu, 28/07/2016 - 18:55

There is a thread on the mailing list (start http://joinup.ec.europa.eu/mailman/archives/dcat_application_profile/20…) to discuss the semantics of the vocabulary terms.

 

The proposed resolutions are:

 

There have been a number of comments that I think we should not take into account:

  1. Changes of the names of the terms. There were proposals to rename “restricted” to “restricted public” and “non-public” to “private”. As David pointed out, the DCAT-AP specification refers to :public, :restricted and :non-public, so my proposed resolution is not to change the labels in order not to create confusion – they are just labels; the important thing is to get the definitions right.
  2. Adding terms. There was a proposal to add a term “sensitive”. Again, in order to stay in line with the DCAT-AP spec, my proposed resolution is not to add additional terms.
  3. There was some discussion about the relationship between “public” and “open data”. I propose that we do not try to solve that issue in this context. The way I understand the definition of open data (e.g. http://opendefinition.org/), this implies not just public access but encompasses also use and reuse: use, modify and share. If we keep the focus on the access right vocabulary strictly on access, we don’t have to decide on the precise relationship. So my proposed resolution is not to mention open data in the definition of “public”.

 

The proposed definitions are:

 

Label: Public

Definition: Publicly accessible by everyone.

Usage note/comment: Permissible obstacles include: registration and request for API keys, as long as anyone can request such registration and/or API keys.

 

Label: Restricted

Definition: Only available under certain conditions.

Usage note/comment: This category may include: resources that require payment, resources shared under non-disclosure agreements, resources for which the publisher or owner has not yet decided if they can be publicly released.

 

Label: Non-public

Definition: Not publicly accessible for privacy, security or other reasons.

Usage note/comment: This category may include resources that contain sensitive or personal information.
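
To show how a dataset description might point at one of the three proposed terms through dct:accessRights, here is a minimal sketch. The labels and definitions are copied from the proposal above; the vocabulary namespace is a placeholder, since the thread does not give the final URIs, and the helper function is hypothetical.

```python
# Terms and definitions as proposed above.
ACCESS_RIGHTS = {
    "public": "Publicly accessible by everyone.",
    "restricted": "Only available under certain conditions.",
    "non-public": "Not publicly accessible for privacy, security or other reasons.",
}

# Placeholder namespace (assumption): the real vocabulary URIs are not
# given in this thread.
VOCAB_BASE = "http://example.org/access-right/"

# Dublin Core Terms property chosen by the Working Group.
DCT_ACCESS_RIGHTS = "http://purl.org/dc/terms/accessRights"


def access_rights_triple(dataset_uri, term):
    """Render one dct:accessRights statement in N-Triples syntax."""
    if term not in ACCESS_RIGHTS:
        raise ValueError("unknown access right: %r" % term)
    return "<%s> <%s> <%s%s> ." % (
        dataset_uri, DCT_ACCESS_RIGHTS, VOCAB_BASE, term)
```

Rejecting anything outside the three terms reflects the proposed resolution not to add or rename terms; a catalogue that needs to explain why a dataset is restricted would do so in a separate property.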