R07. Data Integrity and Authenticity

From the CTS application:
The repository guarantees the integrity and authenticity of the data.

The Dataverse software supports the use of multiple storage locations for keeping redundant copies of files and metadata.

The Dataverse software supports version control for published datasets and files. On file upload, it records checksums at the bit level (MD5) and at the variable level (UNF), which allows collection support staff, depositors, and third parties to check file integrity.
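As an illustration of how a depositor or third party might use a recorded MD5 checksum, the following is a minimal sketch in Python. It assumes the file has already been downloaded and that the MD5 value shown for the file has been copied by hand; the function names `md5_of_file` and `matches_recorded_checksum` are hypothetical and are not part of the Dataverse software.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large data files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_recorded_checksum(path, recorded_md5):
    """Compare a local file against a checksum recorded by the repository."""
    # Hex digests are case-insensitive; normalize the recorded value first.
    return md5_of_file(path) == recorded_md5.strip().lower()
```

A mismatch only shows that the local copy differs from the file the checksum was computed on; it does not say where the difference was introduced.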

Permissions and notification features, such as the submit for review workflow, can be used to ensure that changes to datasets are reviewed before they are finalized.

User account authentication features, such as institutional login, can help collection support staff control who can create, edit, and publish data.

Answers from successful applicants

Tilburg University Dataverse collection:

Once deposited, data files are never changed. UNF (Universal Numerical Fingerprint) checksums are applied to ensure the integrity and authenticity of each dataset. Only corrections to the descriptive metadata of a study are allowed for a dataset.

When archived in Tilburg University Dataverse, the data files cannot be modified by the depositors or data users. If changes are needed, the depositor needs to submit a new version with a new version name. The new version of the dataset will obtain a new persistent identifier. The changes compared to the earlier version are documented in the data report.

Only employees at Tilburg University are allowed to deposit data in the Tilburg University Dataverse. When a dataset is deposited, the Data Curator checks that the deposit comes from a person at Tilburg University. For example, the depositor usually holds a university e-mail account.

During the quality check of the data package, the Data Curator also checks that the description of the content of the data file, included in the required data report, corresponds with the related data. If there are doubts about the authenticity of the data, the Curator will contact the depositor or the research policy employee of the research school/department at which the data were produced.

The data report is accessible by the users. A sample data report is available at https://www.tilburguniversity.edu/dataverse-nl/ (click 'Template data report' under 'How to deposit'). An example data report in Tilburg University Dataverse is available via http://hdl.handle.net/10411/KL0X8C.


Qualitative Data Repository (QDR):

QDR follows the OAIS reference model in handling data. On receipt, QDR personnel check data files and metadata for completeness and integrity and, as needed, solicit updated or additional files from depositors. The complete initial deposit (Submission Information Package, SIP) is then committed to archival storage. It is also included in the data packages deposited with QDR’s long-term storage partner (DPN), where file integrity is periodically monitored.

While there is no formalized identity check in place, QDR’s curation team typically communicates directly with depositors by phone or Skype and encourages the use of institutional emails for registration and communication. We are planning to use ORCID for authentication.

On ingest, the Dataverse software automatically creates an MD5 checksum for every ingested file, so that file integrity can be checked manually, including by users and third parties. Files are stored on AWS S3, where redundant copies of each file are stored on distributed servers and integrity checks at rest are performed using Content-MD5 checksums and cyclic redundancy checks (CRCs). AWS also performs integrity checks during data transfer.
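For readers unfamiliar with the two check values named above, here is a small Python sketch of their generic forms. It does not describe how AWS implements its checks internally. It only shows that a Content-MD5 value is the Base64 encoding of the binary MD5 digest, as defined for the HTTP Content-MD5 header, and how a CRC-32 value can be computed with the standard library. The function names are hypothetical.

```python
import base64
import hashlib
import zlib


def content_md5(data: bytes) -> str:
    """Base64-encoded binary MD5 digest, the form carried in a Content-MD5 header."""
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")


def crc32_hex(data: bytes) -> str:
    """CRC-32 of the data as eight lowercase hex digits."""
    # Mask to 32 bits so the result is always an unsigned value.
    return format(zlib.crc32(data) & 0xFFFFFFFF, "08x")
```

Note that a Content-MD5 value and the hex MD5 string displayed for a file encode the same 16 bytes, so either can be converted to the other.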

The Dataverse software automatically enforces a version update for every change to published data, using a two-part version number (e.g., 2.1). Small changes, as well as changes to the metadata, are recorded as minor changes, such as 2.1 to 2.2. Updates of data or other major changes receive a new major version number (e.g., from 2.1 to 3.0).
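The numbering rule described above can be sketched in a few lines of Python. This is an illustration of the rule as stated in the text, not the Dataverse implementation; `DatasetVersion`, `parse_version`, `next_version`, and the `major_change` flag are hypothetical names.

```python
from typing import NamedTuple


class DatasetVersion(NamedTuple):
    major: int
    minor: int

    def __str__(self):
        return f"{self.major}.{self.minor}"


def parse_version(text):
    """Parse a version string such as '2.1' into its two parts."""
    major, minor = text.split(".")
    return DatasetVersion(int(major), int(minor))


def next_version(current, major_change):
    """Return the version that follows `current`.

    major_change=True stands for data updates or other major changes;
    False stands for small changes and metadata-only changes.
    """
    if major_change:
        # A major change increments the first number and resets the second.
        return DatasetVersion(current.major + 1, 0)
    return DatasetVersion(current.major, current.minor + 1)
```

For example, `next_version(parse_version("2.1"), major_change=False)` gives 2.2, while the same call with `major_change=True` gives 3.0.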

Digital Preservation Network: https://www.dpn.org/
AWS S3 storage: https://aws.amazon.com/s3/faqs/
Digital preservation policy: https://qdr.syr.edu/policies/digitalpreservation
Curation policy: https://qdr.syr.edu/policies/curation


DataverseNO:

4 – The guideline has been fully implemented in the repository

When digital objects are uploaded to DataverseNO, the system runs two integrity checks during file ingest. Universal Numerical Fingerprint (UNF) checksums [1] are computed so that it can later be verified that no changes have been made to tabular data in the dataset. MD5 checksums [2] are computed for each file so that it can later be verified that the files have not been altered. The storage systems are renewed every 6-8 years, which minimizes the risk of long-term deterioration of storage media. The transfer of data from old to new storage systems includes checks for bit-correctness of all data. See also R9.
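A bit-correctness check after a storage migration can be pictured as comparing per-file checksums between the old and the new location. The Python sketch below is one possible way to do this and is not DataverseNO's actual procedure; it assumes both storage systems are reachable as local directory trees, and the function names `checksum_manifest` and `compare_storage` are hypothetical.

```python
import hashlib
from pathlib import Path


def checksum_manifest(root):
    """Map each file's path (relative to root) to its MD5 hex digest."""
    root = Path(root)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.md5()
            with open(path, "rb") as handle:
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(chunk)
            manifest[path.relative_to(root).as_posix()] = digest.hexdigest()
    return manifest


def compare_storage(old_root, new_root):
    """Return (missing, altered): files absent from, or different in, the new storage."""
    old = checksum_manifest(old_root)
    new = checksum_manifest(new_root)
    missing = sorted(set(old) - set(new))
    altered = sorted(name for name in old.keys() & new.keys() if old[name] != new[name])
    return missing, altered
```

Two empty lists indicate that every file present in the old tree exists in the new tree with identical content.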

On deposit, DataverseNO Research Data staff check data files and metadata for completeness and integrity and require, as needed, changes in data files and/or metadata from depositors. As an important part of the documentation, a ReadMe file must accompany each dataset, with a description of how to (re)use the dataset, including a statement of the completeness, or the limitations, of the dataset; see the DataverseNO Deposit Guidelines [3]. This ReadMe file is reviewed by Research Data Service staff before the dataset is published.

Changes to data files and metadata of published datasets are logged in the Dataverse version control report. Any change creates a new version of the dataset, including documentation of what has been changed and by whom. Minor additions or revisions of the metadata yield a decimal version number change. Additions of new data or other major alterations of existing data yield a change in the integer version number. Previous versions of datasets always remain openly accessible. Changes between subsequent versions of datasets are openly documented through version control. Any change to published datasets is subject to review by Research Data Service staff.

Data authenticity
Depositors may apply changes to their data published in DataverseNO as described above. The procedures for such data changes are communicated to depositors through the DataverseNO Deposit Guidelines [3], and to Curators through the DataverseNO Curator Guidelines [4]. In order to ensure the long-term preservation and usability of published datasets, changes to data may also be applied by Research Data Service staff; see R10. The rationale and procedures for such changes are regulated in and communicated to depositors through the DataverseNO Preservation Policy [5].

According to the DataverseNO Deposit Guidelines, depositors have to provide documentation about the creation of the data, and how the data can be used. This documentation must be provided in a ReadMe file that is deposited together with the data. In addition, provenance information may be entered into the metadata schemas provided by the repository software. Provenance information of the latter type is provided at file level and accepted in two forms: as a provenance file in JSON format and following W3C standards, and/or as a free-text provenance description.

Links to metadata records, and to other datasets that are related to the dataset in question, are maintained through the Related dataset metadata field. DataverseNO is closely following the integration project between DataCite and CrossRef, which will enable automatic linking between related datasets and publications [6]. Keywords in the metadata record of deposited datasets may refer and link to metadata standards, e.g. controlled vocabularies. These links are reviewed by Research Data Service staff before the dataset is published. To ensure sustainability, the metadata elements from such external sources are always stored in the metadata record of the dataset in question. The links are thus only meant for reference.

The version control system described above provides information about the essential properties of different versions of the same file.

The names of the depositor and the curator are registered automatically during data deposit and curation, and the identity of the depositor is verified by required log-in through the Norwegian national Lightweight Directory Access Protocol (LDAP) system (Feide) [7].

[1] Universal Numerical Fingerprint (UNF): http://guides.dataverse.org/en/latest/developers/unf/index.html
[2] Checksum (MD5): https://en.wikipedia.org/wiki/MD5
[3] DataverseNO Deposit Guidelines: https://site.uit.no/dataverseno/deposit/
[4] DataverseNO Curator Guidelines: https://site.uit.no/dataverseno/admin-en/curatorguide/
[5] DataverseNO Preservation Policy: https://site.uit.no/dataverseno/about/policy-framework/preservation-policy/
[6] Scholix: http://www.scholix.org/
[7] Feide: https://www.feide.no/introducing-feide