R11. Data Quality

From the CTS application:
The repository has appropriate expertise to address technical data and metadata quality and ensures that sufficient information is available for end users to make quality-related evaluations.

See the section “R0.4. Level of Curation Performed,” which details how the Dataverse software can support levels of curation.

The Dataverse software ships with dataset metadata models that are informed by standard metadata schemas such as DDI, DataCite and ISA-Tab. Version 4.9 of the Dataverse software also introduced support for depositing provenance files following W3C’s PROV-O data model.
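As an illustration of how these schema mappings surface to end users, the sketch below builds request URLs for the dataset metadata export endpoint of the Dataverse native API. The host name and DOI are hypothetical placeholders, and the exporter names in `ASSUMED_EXPORTERS` are assumptions that may differ between Dataverse versions; check the API guide of your installation before relying on them.

```python
from urllib.parse import urlencode

# Assumed exporter names; verify against your installation's API guide.
ASSUMED_EXPORTERS = {"ddi", "Datacite", "dataverse_json"}


def build_export_url(base_url: str, persistent_id: str, exporter: str) -> str:
    """Return the URL that requests one metadata export of a dataset."""
    if exporter not in ASSUMED_EXPORTERS:
        raise ValueError(f"exporter {exporter!r} is not in the assumed list")
    query = urlencode({"exporter": exporter, "persistentId": persistent_id})
    return f"{base_url.rstrip('/')}/api/datasets/export?{query}"


if __name__ == "__main__":
    # Hypothetical installation and DOI, for illustration only.
    print(build_export_url("https://dataverse.example.org",
                           "doi:10.5072/FK2/EXAMPLE", "ddi"))
```

The function only constructs the URL; fetching it (for example with `urllib.request.urlopen`) is left out so that the sketch runs without network access.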

The Dataverse software’s support for metadata customization, including controlling what metadata can or must be added and how it’s added, can help collection support staff ensure that data is described in ways that increase its FAIRness.

Answers from successful applicants

Tilburg University Dataverse collection:

When the RDO Data Curator has received the data, a quality check is carried out to ensure that the data and documentation meet the requirements. If the data package does not meet the requirements, the Curator will contact the depositor by email to ask for improvements.

The quality check covers the following aspects:

Has the deposit agreement been confirmed?
Are the files delivered in an accepted file format?
Are the files readable or saved in a portable format?
Do the files fall within the maximum data limit?
Is there adequate documentation about the data and supplementary data? (A Data Report template is provided to depositors.)
In case of several files, is the folder structure clear to you and are all files included?
Are the data files complete?
Is the data free of any privacy sensitive information?

The quality check also covers readability, accessibility, and use of the correct file format name.
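Some of the checks listed above (accepted file format, maximum data limit, confirmed agreement, presence of documentation) lend themselves to a scripted pre-check before manual review. The following is a minimal sketch of that idea, not Tilburg University's actual tooling: the accepted extensions, the size limit, and all names are hypothetical, and checks such as completeness or privacy sensitivity still need a human curator.

```python
from dataclasses import dataclass
from pathlib import PurePath

# Hypothetical policy values; a real repository defines its own.
ACCEPTED_EXTENSIONS = {".csv", ".txt", ".pdf", ".xml"}
MAX_TOTAL_BYTES = 10 * 1024**3


@dataclass
class DepositFile:
    name: str
    size_bytes: int


def check_deposit(files, agreement_confirmed, has_documentation):
    """Return a list of problems found; an empty list means the pre-check passed."""
    problems = []
    if not agreement_confirmed:
        problems.append("deposit agreement not confirmed")
    if not has_documentation:
        problems.append("documentation missing")
    for f in files:
        ext = PurePath(f.name).suffix.lower()
        if ext not in ACCEPTED_EXTENSIONS:
            problems.append(f"{f.name}: format {ext or '(none)'} not accepted")
    total = sum(f.size_bytes for f in files)
    if total > MAX_TOTAL_BYTES:
        problems.append(
            f"total size {total} bytes exceeds limit of {MAX_TOTAL_BYTES} bytes"
        )
    return problems
```

A curator could run such a pre-check on an incoming package and include the returned list in the email asking the depositor for improvements.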

QDR:

The quality of shared data depends upon their understandability and re-usability. These qualities, in turn, depend upon the organization of the data and the clarity and completeness of the documentation that accompanies them (i.e., how well it describes the data, the process through which they were collected/generated, and the context of their creation). QDR encourages depositors to provide all relevant information that allows for well-informed re-use of the data and works closely with them to help them provide the highest possible level of data and documentation quality. This process relies on the subject expertise of QDR’s curation staff. Curation staff also assess the consistency of the data with the provided documentation and request changes, fixes, or updates from depositors as needed. Curation is supervised by senior staff, all of whom hold graduate degrees in social science.

Metadata are generated in consultation with depositors using the Dataverse input mask, which maps (and exports) to Data Documentation Initiative (DDI) Codebook, the de facto metadata standard for social science, as well as to other metadata formats such as DataCite XML. The metadata can also be harvested using OAI-PMH. The actual cataloging of the metadata is performed by QDR curation staff based on depositor input and is subject to review by the depositor.
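For readers unfamiliar with OAI-PMH harvesting, the sketch below shows how a harvester might construct `ListRecords` requests. The verb and the `metadataPrefix`, `set`, and `resumptionToken` arguments come from the OAI-PMH 2.0 protocol; the base URL, the `oai_ddi` prefix, and the function name are assumptions for illustration and are not taken from QDR's configuration.

```python
from urllib.parse import urlencode


def build_list_records_url(base_url, metadata_prefix=None, set_spec=None,
                           resumption_token=None):
    """Build an OAI-PMH ListRecords request URL.

    Per the protocol, resumptionToken is an exclusive argument: when it is
    present, metadataPrefix and set must not be sent.
    """
    if resumption_token is not None:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        if metadata_prefix is None:
            raise ValueError("metadata_prefix is required for the first request")
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical endpoint and prefix, for illustration only.
    print(build_list_records_url("https://repository.example.org/oai",
                                 metadata_prefix="oai_ddi"))
```

A harvester would fetch the first URL, read the `resumptionToken` element from the response if one is present, and repeat the request with that token until the server returns none.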

As part of the curation process, QDR also links to published work that uses or cites the data. The repository is closely following initiatives such as Scholarly Link Exchange (Scholix) and Make Data Count and will use their output (for example, usage statistics) to provide additional links to related works and usage metrics together with datasets.

There is no formalized way for the designated community to comment on or rate data or metadata. Nonetheless, QDR regularly works with scholars who re-use data in teaching and research in order to better understand their requirements and, if needed, adjust cataloging and curation practices.

Links:
Metadata application profile: https://qdr.syr.edu/policies/metadata
Curation policy: https://qdr.syr.edu/policies/curation
Collection development and appraisal policy: https://qdr.syr.edu/policies/collectiondevelopment

DataverseNO:

4 – The guideline has been fully implemented in the repository

Data and metadata quality
In order for the Designated Community to be able to assess the substantive quality of the data published in the repository, DataverseNO provides documentation of the data in two main ways: On deposit, metadata must be entered into metadata schemas in the repository software (Dataverse), and a ReadMe file must be uploaded together with the data file(s). The repository strives to provide enough domain-specific information about the data such that the Designated Community can assess the substantive quality of the data. However, the generic nature of the DataverseNO repository puts some limitations on the granularity of the provided domain-specific metadata schemas. To compensate for such limitations, domain-specific information is provided in the mandatory ReadMe file.

The deposited ReadMe file must give a description of how to interpret, understand and (re)use the dataset, including a statement of the creation and completeness, or the limitations, of the dataset. The remaining content of the ReadMe file varies according to the type of data that are deposited. The DataverseNO Deposit Guidelines [1] give some recommendations for ReadMe files for two common types of data, tabular data and computer scripts. If needed, advice on other types of data is given to depositors on request before data deposit and/or as feedback during the curational review of datasets submitted for publication. In addition, we recommend that depositors insert important parts of the ReadMe file into the Description field in the Citation Metadata of the repository software in order to increase the searchability of the dataset.

The metadata entered into and stored in Dataverse on deposit are standards-compliant, so that they can be mapped easily to standard metadata schemas and exported to JSON format (XML for tabular file metadata) for preservation and interoperability. The metadata schemas in Dataverse employ a number of metadata standards from several academic disciplines [2]. All of these metadata schemas are available in all collections of DataverseNO. Citation metadata fields that are mandatory or recommended by DataCite are mandatory in all DataverseNO collections. As the institutional collections within DataverseNO as well as the top-level of DataverseNO accept data from all academic disciplines, which metadata fields are mandatory and which are recommended varies from subject to subject. Special collections within DataverseNO have their own rules on which domain-specific metadata fields are mandatory and which are recommended. Depositors are advised to add domain-specific metadata in the metadata schemas that are applicable; cf. DataverseNO Deposit Guidelines [1].

To ensure compliance with the DataverseNO Accession Policy [3] and the DataverseNO Deposit Guidelines [1] regarding completeness, organization and documentation of the data, each dataset is curated by Research Data Service staff before publication. The curation process ensures that datasets are furnished with relevant information that allows for well-informed reuse of the data. If a dataset does not comply with the DataverseNO Accession Policy and the DataverseNO Deposit Guidelines, the curator communicates with the depositor to request necessary changes before the dataset can be published. Changes made to data file(s) and/or metadata after initial publication result in a new version of the dataset and are subject to a new round of curational review before the new version can be published. See also R7, R8, and R12.

Through discussions within the Network of Expertise among the curators, as well as in the DataverseNO Advisory Committee, DataverseNO makes a continuous effort to ensure consistency in both generic and domain-specific metadata across the different collections of the repository.

The quality of data curation in DataverseNO relies on the subject expertise and the research data management expertise of Research Data Service staff at the DataverseNO owner institution and the DataverseNO partner institutions. This expertise, as well as the roles and responsibilities, are described in R5 and R6. The Research Data Service staff curating the different collections within DataverseNO are all highly educated and trained within the research disciplines represented by the datasets deposited into DataverseNO. The Research Data Service staff are also trained in research data management support, and they are in continuous dialog with the user groups of DataverseNO. Furthermore, DataverseNO Research Data Service staff have access to top-level expertise in subject-related issues and in research data management, both through their own networks and through training and advice provided by UiT The Arctic University (owner of DataverseNO). The management and Research Data Service staff of DataverseNO are closely following the development of domain-specific metadata standards as well as other international standards for research data management, such as Domain Data Protocols (DDPs) [4]. This framework aims to support research communities in setting up protocols for the collection and management of data within specified disciplinary domains and research communities.

Automated assessment of metadata
Some metadata fields in Dataverse are automatically validated against the relevant schema. This is true, for example, of date formats and language names. Furthermore, the values of some of the metadata fields (where possible) are generated automatically by the system. This includes the name of the depositor, which is retrieved from the LDAP log-in information, and the deposit date. Some other fields are pre-populated in the metadata templates that are applied for the individual collections. Metadata may be provided both at dataset level and at file level. This is also true for provenance information, which may be provided in two forms: as a provenance file in JSON format and following W3C standards, and/or as a free-text provenance description.
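To make the idea of automated format checks concrete, here is a minimal sketch of a date validator of the kind a metadata form might apply. It is not Dataverse's own implementation: the accepted patterns (YYYY, YYYY-MM, YYYY-MM-DD) and the function name are assumptions chosen for illustration.

```python
import re
from datetime import datetime

# Assumed accepted shapes: YYYY, YYYY-MM, or YYYY-MM-DD (zero-padded).
_SHAPE = re.compile(r"\d{4}(-\d{2}(-\d{2})?)?")
_FORMATS = {4: "%Y", 7: "%Y-%m", 10: "%Y-%m-%d"}


def is_valid_date(value: str) -> bool:
    """Return True if value has an accepted shape and is a real calendar date."""
    if not _SHAPE.fullmatch(value):
        return False
    try:
        # strptime rejects impossible dates such as month 13 or 30 February.
        datetime.strptime(value, _FORMATS[len(value)])
    except ValueError:
        return False
    return True
```

The regular expression enforces the zero-padded shape, while `datetime.strptime` catches values that have the right shape but do not exist in the calendar.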

Feedback from Designated Community
The landing page of each dataset published in DataverseNO has feedback options that the user community can use to send comments to the depositor. The default option is to use the contact button to send a question, request or feedback to the contact person for the dataset. DataverseNO does not currently offer end users a way to add annotations or public comments to datasets, other than through general web annotation tools such as hypothes.is [5].

Citation to related work
The citation metadata schema provides metadata fields for related datasets, related publications, and related other materials. DataverseNO will also benefit from the cooperation between DataCite and Crossref in the Framework for Scholarly Link eXchange (Scholix), which will provide interlinking between datasets and publications based on the datasets [6]. Furthermore, a future version of Dataverse, planned for release in 2019, will implement Make Data Count recommendations and report standardized usage metrics [7]. DataverseNO will use the output from these services (e.g., FAIR usage statistics) to provide additional links to related works and usage metrics together with datasets.

References:
[1] DataverseNO Deposit Guidelines: https://site.uit.no/dataverseno/deposit/
[2] Metadata References in the appendix to the Dataverse User Guide: http://guides.dataverse.org/en/latest/user/appendix.html
[3] DataverseNO Accession Policy: https://site.uit.no/dataverseno/about/policy-framework/accession-policy/
[4] Science Europe: Presenting a Framework for Discipline-specific Research Data Management: https://www.scienceeurope.org/wp-content/uploads/2018/01/SE_Guidance_Document_RDMPs.pdf
[5] Web annotation tool hypothes.is: https://web.hypothes.is/
[6] Framework for Scholarly Link eXchange (Scholix): http://www.scholix.org/
[7] Make Data Count (MDC) project: https://makedatacount.org/