Dataverse Software 5.12 Release

This release brings new features, enhancements, and bug fixes to the Dataverse Software. Thank you to all of the community members who contributed code, suggestions, bug reports, and other assistance across the project.

Release Highlights

Support for Globus

Globus can be used to transfer large files. Part of "Harvard Data Commons Additions" below.

Support for Remote File Storage

Dataset files can be stored at remote URLs. Part of "Harvard Data Commons Additions" below.

New Computational Workflow Metadata Block

The new Computational Workflow metadata block allows depositors to tag datasets as computational workflows.

To add the new metadata block, follow the instructions in the Admin Guide:

The location of the new metadata block tsv file is scripts/api/data/metadatablocks/computational_workflow.tsv. Part of "Harvard Data Commons Additions" below.
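As a minimal sketch of loading the block, the request below posts the TSV file to the admin API. The endpoint `/api/admin/datasetfield/load`, the `http://localhost:8080` server address, and the helper name `build_load_block_request` are assumptions that do not appear in these notes; the Admin Guide remains the authoritative source for the exact steps.

```python
from urllib.request import Request


def build_load_block_request(tsv_bytes, server_url="http://localhost:8080"):
    """Build (but do not send) a POST that uploads a metadata block TSV.

    The endpoint path is an assumption based on the admin API; verify it
    against the Admin Guide for your installation.
    """
    return Request(
        url=f"{server_url}/api/admin/datasetfield/load",
        data=tsv_bytes,
        method="POST",
        headers={"Content-type": "text/tab-separated-values"},
    )


# Placeholder bytes stand in for the contents of
# scripts/api/data/metadatablocks/computational_workflow.tsv.
request = build_load_block_request(b"#metadataBlock\tname\n")
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it. The admin API is usually reachable only from the server itself, which is why the sketch defaults to localhost.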

Support for Linked Data Notifications (LDN)

Linked Data Notifications (LDN) is a standard from the W3C. Part of "Harvard Data Commons Additions" below.

Harvard Data Commons Additions

As reported at the 2022 Dataverse Community Meeting, the Harvard Data Commons project has supported a wide range of additions to the Dataverse software that improve support for Big Data, Workflows, Archiving, and interaction with other repositories. In many cases, these additions build upon features developed within the Dataverse community by Borealis, DANS, QDR, TDL, and others. Highlights from this work include:

  • Initial support for Globus file transfer to upload to and download from a Dataverse-managed S3 store. The current implementation disables file restriction and embargo on Globus-enabled stores.
  • Initial support for Remote File Storage. This capability, enabled via a new RemoteOverlay store type, allows a file stored in a remote system to be added to a dataset (currently only via API) with download requests redirected to the remote system. Use cases include referencing public files hosted on external web servers as well as support for controlled access managed by Dataverse (e.g. via restricted and embargoed status) and/or by the remote store.
  • Initial support for computational workflows, including a new metadata block and detected filetypes.
  • Support for archiving to any S3 store using Dataverse's RDA-conformant BagIt file format (a BagPack).
  • Improved error handling and performance in archival bag creation, and new options such as supporting archiving of only one dataset version.
  • Additions/corrections to the OAI-ORE metadata format (which is included in archival bags) such as referencing the name/mimetype/size/checksum/download URL of the original file for ingested files, the inclusion of metadata about the parent collection(s) of an archived dataset version, and use of the URL form of PIDs.
  • Display of archival status in the versions table on the dataset page, with richer status options (success, pending, and failure states) and a complete API for managing archival status.
  • Support for batch archiving via API as an alternative to the current options of configuring archiving upon publication or archiving each dataset version manually.
  • Initial support for sending and receiving Linked Data Notification messages that indicate relationships between a dataset and external resources (e.g. papers or other datasets). These messages can be used to trigger additional actions, such as creating a back-link that provides bi-directional linking between a published paper and a Dataverse dataset.
  • A new capability to provide custom per-field instructions in dataset templates.
  • The following file extensions are now detected:
    • wdl=text/x-workflow-description-language
    • cwl=text/x-computational-workflow-language
    • nf=text/x-nextflow
    • Rmd=text/x-r-notebook
    • rb=text/x-ruby-script
    • dag=text/x-dagman

Improvements to Fields that Appear in the Citation Metadata Block

Grammar, style and consistency improvements have been made to the titles, tooltip description text, and watermarks of metadata fields that appear in the Citation metadata block.

This includes fields that dataset depositors can edit in the Citation Metadata accordion (i.e. fields controlled by the citation.tsv and files) and fields whose values are system-generated, such as the Dataset Persistent ID, Previous Dataset Persistent ID, and Publication Date fields whose titles and tooltips are configured in the file.

The changes should provide clearer information to curators, depositors, and people looking for data about what the fields are for.

A new page in the Style Guides called "Text" has also been added. The new page includes a section called "Metadata Text Guidelines" with a link to a Google Doc where the guidelines are being maintained for now since we expect them to be revised frequently.

New Static Search Facet: Metadata Types

A new static search facet has been added to the search side panel. This new facet is called "Metadata Types" and is driven from metadata blocks. When a metadata field value is inserted into a dataset, an entry for the metadata block it belongs to is added to this new facet.

This new facet must be configured before it appears on the search side panel. The configuration specifies which metadata blocks a dataverse shows in the facet, and it is inherited by child dataverses.

To configure the new facet, use the Metadata Block Facet API:
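A minimal sketch of such a configuration call follows. It assumes the API accepts a JSON array of metadata block names at `/api/dataverses/$ID/metadatablockfacets` and authenticates with an `X-Dataverse-key` header; the endpoint path, the header name, the server URL, and the helper name `build_facet_request` are assumptions not stated in these notes, so check the API Guide before relying on them.

```python
import json
from urllib.request import Request


def build_facet_request(server_url, collection_alias, api_token, block_names):
    """Build (but do not send) a POST listing the blocks to show as facets.

    The endpoint path and header name are assumptions; verify them
    against the API Guide.
    """
    return Request(
        url=f"{server_url}/api/dataverses/{collection_alias}/metadatablockfacets",
        data=json.dumps(block_names).encode("utf-8"),
        method="POST",
        headers={
            "Content-type": "application/json",
            "X-Dataverse-key": api_token,
        },
    )


# Hypothetical server, alias, and token, for illustration only.
facet_request = build_facet_request(
    "https://demo.example.org", "root", "xxxxxxxx", ["socialscience", "geospatial"]
)
print(facet_request.full_url)
```

Because the configuration is inherited, a call like this on a parent dataverse would also affect its children unless they are configured separately.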

Broader MicroProfile Config Support for Developers

As of this release, many JVM options can be set using any MicroProfile Config Source.

Currently, this change is only relevant to developers. However, as settings are migrated to the new "lookup" pattern documented in the Consuming Configuration section of the Developer Guide, anyone installing the Dataverse software will have much greater flexibility when configuring those settings, especially within containers. These changes will be announced in future releases.

Please note that an upgrade to Payara 5.2021.8 or higher is required to make use of this. Payara 5.2021.5 threw exceptions, as explained in PR #8823.
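To illustrate what "any MicroProfile Config Source" means in practice, here is a small Python model of the lookup order that the MicroProfile Config specification defines for its three default sources (system properties at ordinal 400, environment variables at 300, properties file at 100). It is a conceptual sketch only, not Dataverse or Payara code, and the setting name `dataverse.example.option` is hypothetical.

```python
import re

# Default ordinals from the MicroProfile Config specification.
SYSTEM_PROPERTIES, ENVIRONMENT, PROPERTIES_FILE = 400, 300, 100


def env_candidates(key):
    """Environment variable names tried for a property key, in order."""
    underscored = re.sub(r"[^A-Za-z0-9]", "_", key)
    return [key, underscored, underscored.upper()]


def lookup(key, system_props, env, file_props):
    """Return the value from the highest-ordinal source that defines key."""
    env_value = next((env[n] for n in env_candidates(key) if n in env), None)
    sources = [
        (PROPERTIES_FILE, file_props.get(key)),
        (ENVIRONMENT, env_value),
        (SYSTEM_PROPERTIES, system_props.get(key)),
    ]
    # Highest ordinal wins; sort on the ordinal only.
    for _, value in sorted(sources, key=lambda s: s[0], reverse=True):
        if value is not None:
            return value
    return None


# An environment variable overrides the properties file.
value = lookup(
    "dataverse.example.option",
    system_props={},
    env={"DATAVERSE_EXAMPLE_OPTION": "from-env"},
    file_props={"dataverse.example.option": "from-file"},
)
print(value)
```

The environment-variable mapping is what makes this pattern convenient in containers: a dotted setting name can be supplied as an upper-case, underscore-separated variable without editing any file.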

HTTP Range Requests: New HTTP Status Codes and Headers for Datafile Access API

The Basic File Access resource for datafiles (/api/access/datafile/$id) was slightly modified in order to comply better with the HTTP specification for range requests.

If the request contains a "Range" header:

  • The returned HTTP status is now 206 (Partial Content) instead of 200.
  • A "Content-Range" header is returned containing information about the returned bytes.
  • An "Accept-Ranges" header with value "bytes" is returned.

CORS rules/headers were modified accordingly:

  • The "Range" header is added to "Access-Control-Allow-Headers".
  • The "Content-Range" and "Accept-Ranges" headers are added to "Access-Control-Expose-Headers".
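A client-side sketch of a range request against this resource is shown below. It builds the request with a "Range" header and parses a "Content-Range" value in the standard `bytes first-last/total` form. The server URL, the file id, and the helper names are hypothetical; only the `/api/access/datafile/$id` path comes from these notes.

```python
import re
from urllib.request import Request


def build_range_request(server_url, file_id, first_byte, last_byte):
    """Build (but do not send) a GET for an inclusive byte range of a datafile."""
    return Request(
        url=f"{server_url}/api/access/datafile/{file_id}",
        headers={"Range": f"bytes={first_byte}-{last_byte}"},
    )


def parse_content_range(value):
    """Parse 'bytes first-last/total' into (first, last, total).

    total is None when the server reports an unknown length as '*'.
    """
    match = re.fullmatch(r"bytes (\d+)-(\d+)/(\d+|\*)", value)
    if match is None:
        raise ValueError(f"unsupported Content-Range value: {value!r}")
    first, last, total = match.groups()
    return int(first), int(last), None if total == "*" else int(total)


# Hypothetical server and file id, for illustration only.
range_request = build_range_request("https://demo.example.org", 42, 0, 99)
print(range_request.get_header("Range"))
print(parse_content_range("bytes 0-99/1234"))
```

When such a request is sent with `urllib.request.urlopen`, a client can check that the response status is 206 before trusting the "Content-Range" header; a 200 response would indicate the full file was returned instead.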

File Type Detection When File Has No Extension

File types are now detected based on the filename when the file has no extension.

The following filenames are now detected:

  • Makefile=text/x-makefile
  • Snakemake=text/x-snakemake
  • Dockerfile=application/x-docker-file
  • Vagrantfile=application/x-vagrant-file

These are defined in
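The filename-based lookup described in this section can be sketched as a simple table lookup that applies only when a file has no extension. This is an illustration, not the Dataverse implementation: the function name is hypothetical, and exact, case-sensitive matching is an assumption.

```python
import os

# Filename-to-type pairs taken from the list in this section.
FILENAME_TYPES = {
    "Makefile": "text/x-makefile",
    "Snakemake": "text/x-snakemake",
    "Dockerfile": "application/x-docker-file",
    "Vagrantfile": "application/x-vagrant-file",
}


def detect_type_by_filename(path):
    """Return a type for extensionless files with a known name, else None."""
    name = os.path.basename(path)
    _, extension = os.path.splitext(name)
    if extension:
        # Files with an extension are left to extension-based detection.
        return None
    return FILENAME_TYPES.get(name)


print(detect_type_by_filename("project/Dockerfile"))
```

Files such as `Makefile.bak` or `README` fall through and return `None` in this sketch, leaving them to whatever other detection applies.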

Upgrade to Payara 5.2022.3 Highly Recommended

Payara 5.2022.3 includes many bug and security fixes, so we encourage everyone to upgrade to it as soon as possible. See below for details.

Major Use Cases and Infrastructure Enhancements

Changes and fixes in this release include:

  • Administrators can configure an S3 store used in Dataverse to support users uploading/downloading files via Globus File Transfer. (PR #8891)
  • Administrators can configure a RemoteOverlay store to allow files that remain hosted by a remote system to be added to a dataset. (PR #7325)
  • Administrators can configure the Dataverse software to send archival Bag copies of published dataset versions to any S3-compatible service. (PR #8751)
  • Users can see information about a dataset's parent collection(s) in the OAI-ORE metadata export. (PR #8770)
  • Users and administrators can now use the OAI-ORE metadata export to retrieve and assess the fixity of the original file (for ingested tabular files) via the included checksum. (PR #8901)
  • Archiving via RDA-conformant Bags is more robust and more configurable. (PR #8773, #8747, #8699, #8609, #8606, #8610)
  • Users and administrators can see the archival status of the versions of the datasets they manage in the dataset page version table. (PR #8748, #8696)
  • Administrators can configure messaging between their Dataverse installation and other repositories that may hold related resources or services interested in activity within that installation. (PR #8775)
  • Collection managers can create templates that include custom instructions on how to fill out specific metadata fields.
  • Dataset update API users are given more information when the dataset they are updating is out of compliance with Terms of Access requirements. (Issue #8859)
  • Adds a new setting (:ControlledVocabularyCustomJavaScript) that allows a JavaScript file to be loaded into the dataset page for the purpose of showing controlled vocabulary as a list. (Issue #8722)
  • Fixes an issue with the Redetect File Type API. (Issue #7527)
  • Terms of Use is now imported when using DDI format through harvesting or the native API. (Issue #8715, PR #8743)
  • Optimizes some code to improve application memory usage. (Issue #8871)
  • Fixes sample data to reflect custom licenses.
  • Fixes the Archival Status Input API (available to superusers). (Issue #8924)
  • Small bugs have been fixed in the JSON and DDI dataset exports: the JSON export no longer includes "undefined" as a metadata language, and the DDI export no longer includes a duplicate keyword tag. (Issue #8868)