R15. Technical Infrastructure

From the CTS application:
The repository functions on well-supported operating systems and other core infrastructural software and is using hardware and software technologies appropriate to the services it provides to its Designated Community.
 

Technical infrastructure

The Dataverse software is developed and deployed using a suite of well-supported and/or open source technologies:

  • Linux RHEL/CentOS - operating environment
  • Payara - application server (starting with Dataverse software version 5, Payara replaces Glassfish)
  • PostgreSQL - application database
  • Java - application code
  • Solr - indexing
  • Optional tools for data analysis and curation, such as R, TwoRavens, ImageMagick, and Jhove
  • Docker and Kubernetes for installation/deployment

CTS applicants will have to detail how the technology stack used by their Dataverse installation is deployed and maintained. As outlined in the CTS requirements, these details include the tools used for systems and network monitoring, application backup and recovery processes, and the workflows and schedules for application maintenance, testing, and upgrades.
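
As one illustration of what a monitoring or upgrade-planning check might look like, the sketch below parses the JSON body that a Dataverse installation is expected to return from its version endpoint (assumed here to be `/api/info/version`) and compares it with a target release. The function names, the sample response, and the target version are hypothetical; version strings with non-numeric suffixes are not handled.

```python
import json


def parse_version_response(body: str) -> str:
    """Return the version string from a version-endpoint response body."""
    payload = json.loads(body)
    if payload.get("status") != "OK":
        raise RuntimeError(f"unexpected status: {payload.get('status')!r}")
    return payload["data"]["version"]


def needs_upgrade(installed: str, target: str) -> bool:
    """Compare dotted numeric versions, e.g. '4.6.1' is older than '4.8.2'."""
    def as_tuple(version: str) -> tuple:
        return tuple(int(part) for part in version.split("."))
    return as_tuple(installed) < as_tuple(target)


# Hypothetical response body; a real check would fetch it over HTTPS.
sample = '{"status": "OK", "data": {"version": "5.3", "build": "abc123"}}'
installed = parse_version_response(sample)
print(installed, needs_upgrade(installed, "5.4"))  # prints: 5.3 True
```

A check like this could run on a schedule alongside the applicant's existing monitoring tools, with the result recorded as evidence for the maintenance and upgrade workflow.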
 

Development and Oversight

The Dataverse software is supported and developed by the Institute for Quantitative Social Science (IQSS) at Harvard University. A dedicated team supports the continuous development of the application, alongside community support from developers and from experts in data curation and preservation, user interaction and user experience, and quality assurance.

The Dataverse software's code is stored and openly available on GitHub and is open to feedback, comments and community contributions. This has led to many collaborations with external organizations who support the Dataverse Project through contributions of code and new features, testing, bug fixes, and training materials. An active issues repository on GitHub also tracks bugs and identifies improvements to the code.

New versions of the Dataverse software are released regularly, approximately three to four times per year. The software's development is informed by a strategic roadmap that incorporates feedback from community members.

The software's development is also overseen by an advisory team composed of practitioners, and by a broader community of users and contributors who participate in an annual community meeting, a forum, and regular community calls. Additionally, the Global Dataverse Community Consortium is a member-based group that works to coordinate community contributions to the application.
 

Standards

The Dataverse software employs a variety of widely used community standards for metadata export:

  • Dublin Core
  • DDI (Data Documentation Initiative Codebook 2.5)
  • DDI HTML Codebook (A more human-readable, HTML version of the DDI Codebook 2.5 metadata export, added in Dataverse software version 4.16)
  • DataCite 4
  • OAI-ORE (added in Dataverse software version 4.11)
  • OpenAIRE (added in Dataverse software version 4.14)
  • Schema.org JSON-LD (added in Dataverse software version 4.8.4)
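
To show how these export formats are typically retrieved, here is a minimal sketch that builds the URL for the dataset export endpoint of the native API. The exporter identifiers in the table, the host name, and the DOI are assumptions for illustration and should be checked against the API Guide for the installed version.

```python
from urllib.parse import urlencode

# Assumed exporter identifiers for the formats listed above; verify these
# against the API Guide of the installed Dataverse software version.
EXPORTERS = {
    "Dublin Core": "dcterms",
    "DDI": "ddi",
    "DDI HTML Codebook": "html",
    "DataCite": "Datacite",
    "OAI-ORE": "OAI_ORE",
    "OpenAIRE": "oai_datacite",
    "Schema.org JSON-LD": "schema.org",
}


def export_url(base_url: str, persistent_id: str, format_name: str) -> str:
    """Build the metadata export URL for one dataset and one format."""
    query = urlencode({
        "exporter": EXPORTERS[format_name],
        "persistentId": persistent_id,
    })
    return f"{base_url.rstrip('/')}/api/datasets/export?{query}"


# Hypothetical host and persistent identifier.
print(export_url("https://dataverse.example.org", "doi:10.5072/FK2/EXAMPLE", "DDI"))
```

Only published datasets are expected to be exportable this way, so an applicant documenting this for CTS may want to test against a released dataset.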

Additional standards for application functionality and data access/deposit employed:

  • OAI-PMH for harvesting to improve data visibility
  • SWORD API for data deposit from other applications
  • Support for W3C Provenance JSON files (added in Dataverse software version 4.9)
  • A robust and well-documented suite of additional APIs for interacting with and managing the application
  • Ability to export RDA-compliant OAI-ORE Bags (added in Dataverse software version 4.11)
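
As a rough illustration of OAI-PMH harvesting, the sketch below builds a `ListIdentifiers` request and extracts record identifiers and the resumption token from a response. The verb, arguments, and XML namespace come from the OAI-PMH 2.0 protocol; the endpoint path `/oai`, the host name, and the sample response are assumptions.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_identifiers_url(base_url, metadata_prefix="oai_dc", set_spec=None, token=None):
    """Build a ListIdentifiers request; a resumption token excludes other arguments."""
    if token:
        params = {"verb": "ListIdentifiers", "resumptionToken": token}
    else:
        params = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


def parse_identifiers(xml_text):
    """Return (identifiers, resumption_token or None) from a response body."""
    root = ET.fromstring(xml_text)
    ids = [el.text for el in root.findall(".//oai:header/oai:identifier", OAI_NS)]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    token = token_el.text if token_el is not None and token_el.text else None
    return ids, token


# Hypothetical, abbreviated response for illustration only.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header><identifier>doi:10.5072/FK2/AAAAAA</identifier></header>
    <header><identifier>doi:10.5072/FK2/BBBBBB</identifier></header>
    <resumptionToken>page-2</resumptionToken>
  </ListIdentifiers>
</OAI-PMH>"""

ids, token = parse_identifiers(SAMPLE)
print(list_identifiers_url("https://dataverse.example.org/oai"))
print(ids, token)
```

A harvester would repeat the request with the returned token until no token is present, which signals the end of the list.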
     

Links:

Dataverse software GitHub repository: https://github.com/IQSS/dataverse
Dataverse software roadmap: https://www.iq.harvard.edu/roadmap-dataverse-project
Dataverse software advisory team: https://dataverse.org/advisory
Global Dataverse Community Consortium: http://dataversecommunity.global/
Dataverse community meetings: https://dataverse.org/events
Dataverse community calls: https://dataverse.org/community-calls
Dataverse community forum: https://dataverse.org/forum
 

Answers from successful applicants

Tilburg University Dataverse collection:

In August 2012, Tilburg Library and IT Services concluded an agreement with Utrecht University to set up DataverseNL. The goal of this cooperation was to offer scientists facilities for research data storage and publishing on Dutch soil and within the framework of Dutch law. By September 2013, several other universities and research institutes had joined this cooperation: Erasmus University in Rotterdam, Maastricht University, 3TU Data Center, University of Groningen, and the Netherlands Institute for Ecology (NIOO-KNAW, for its initials in Dutch). Data Archiving and Networked Services (DANS) has since taken over the infrastructure of DataverseNL and now coordinates the network.

Dataverse Network follows the guidance given in the OAIS reference model across the whole of the archival process. For example, the infrastructure supports separation between the Submission Information Package, the Archival Information Package, and the Dissemination Information Package.

The DataverseNL Advisory Board determines DataverseNL's policy and strategy. The Advisory Board provides solicited and unsolicited advice to DANS about the development of the service. A work plan with the planned work and developments for the coming year is submitted annually to the Advisory Board. The Advisory Board evaluates the activities of the previous year on the basis of an annual report. The Advisory Board meets at least twice a year. Each institutional repository within DataverseNL sends a delegate to the Advisory Board and has one vote in it.

In addition to the Advisory Board, DataverseNL has an Administrators’ Board, which discusses issues that relate to shared functionality, such as quality of service, migration, acceptance tests, user support, and reports. The Administrators’ Board recommends desired new functionality to the Advisory Board. Each institutional repository designates at least one employee responsible for managing the data within the institute's local Dataverse: the local administrator (Admin). This administrator is the first point of contact for data producers and data consumers of the local Dataverse, and provides information and guidance on using it. The local administrator is also the contact person for communication with DANS about the daily routine. The DANS service manager organizes and supervises the Administrators’ Board. The Administrators’ Board meets every second month via Skype, or face-to-face if necessary.

Tilburg University Dataverse follows the technical development of DataverseNL. The Dataverse software is developed at the Harvard University Institute for Quantitative Social Science (IQSS). The current version in use at DataverseNL is 4.6.1. In April/May 2018, version 4.8.2 will be implemented.
 

QDR:

QDR follows ISO 14721:2012, section 4.1.1.1 (common services), as a reference model for technical infrastructure development. The technical directors for QDR, in consultation with a Technical Advisory Board, monitor the implementation of services and review emerging standards for qualitative data management on a biannual basis.

Infrastructure development activities follow an annual roadmap produced by QDR’s technical directors and approved by QDR’s Technical Advisory Board.

Hardware and software inventories and configuration information are recorded in a QDR-managed wiki and updated quarterly. All software running in the production environment of QDR is open source. This includes operating systems running on EC2 and S3 servers (Linux), a content management system based on Drupal, a repository framework based on Dataverse, as well as a suite of configuration management (Chef), continuous integration (Jenkins), and monitoring (Nagios) tools. To further ensure continuous delivery of deployed code, our team also relies upon open-source tools to perform automated tests (Selenium), as well as infrastructure execution and management tools (Terraform). Each of these tools is well supported by open-source communities. Our technical directors and system administrators regularly monitor security and vulnerabilities related to this suite of software.

The hardware used to run QDR is provisioned at Amazon Web Services and managed by our technical development team. Our infrastructure at AWS is configured with a set of Virtual Private Clouds for security (described in detail in R16), and we ensure that sufficient bandwidth is available by using Elastic Load Balancing, which distributes incoming user traffic across multiple EC2 instances.

Links:
ISO 14721: https://www.iso.org/standard/57284.html
Terraform: https://www.terraform.io/
Nagios: https://www.nagios.org/
Chef: https://www.chef.io/solutions/infrastructure-automation/
Selenium: http://www.seleniumhq.org/
Dataverse: http://dataverse.org
Security and infrastructure: https://qdr.syr.edu/policies/security
Digital preservation policy: https://qdr.syr.edu/policies/digitalpreservation
 

DataverseNO:

Standards
DataverseNO follows the broad guidance given in the OAIS reference model across the archival process, as described in the section “OAIS compliance” in the DataverseNO Preservation Policy [1]. Changes to any operational or preservation principles for DataverseNO are checked for compatibility with the OAIS reference model and adapted according to the framework.

The technical infrastructure employed by DataverseNO follows a number of international standards and best practices. Examples of currently employed standards:

  • The OAI-PMH harvesting protocol is used to increase the visibility and dissemination of content in DataverseNO.
  • The SWORD interoperability standard is used for ingest of structured dataset collections.
  • The Shibboleth/SAML authentication and authorization infrastructure is used as the default for log-in.
  • OAuth 2.0, the industry-standard protocol for authorization, is supported and partly implemented.
  • Schema.org/JSON-LD structured discovery metadata is implemented; it increases the visibility of datasets and supports integration with other services.
  • Docker, for operating-system-level virtualization, is being tested as a possible future infrastructure for DataverseNO.

As noted in Requirement R9, the technical infrastructure for the DataverseNO platform currently runs on enterprise-class storage and virtualization hardware (VMware) on a standard CentOS Linux distribution at UiT The Arctic University of Norway (owner of DataverseNO). The infrastructure resides in two datacentres in different buildings on campus, between which data are replicated to avoid data loss from physical threats such as fires and floods. The VMware nodes have two power supplies, UPS, and at least two network cards connected to redundant switches, and the whole operation is monitored continuously with automatic error alerts. Both datacentres are secured with at least two layers of key-access doors from public areas, and access is restricted to authorised operational staff. The development of DataverseNO is an ongoing process that is also strongly influenced by developments outside DataverseNO, particularly the system development for Dataverse at Harvard (see below). This means that standards and best practices, and how they are supported and implemented, are reviewed on a more or less continuous basis.

Infrastructure Development
DataverseNO is part of the owner’s overall strategy for research data services and is under active development. Currently, the feasibility of cloud services for DataverseNO is being evaluated through national grants applying Docker support for Dataverse. In addition, a possible future DataverseNO infrastructure in which both the application and the data are moved into a national or public cloud is being actively investigated in cooperation with other national research data services.

System Documentation
The DataverseNO system is run by UiT The Arctic University of Norway (owner of DataverseNO), and system documentation about installation, configuration, integrations, and technical operation is kept up to date in a separate SharePoint area within UiT's internal SharePoint domain. Access to this information is restricted to authorized personnel at UiT. In addition, extensive documentation of the Dataverse system is provided by the Dataverse Development Community, including the Installation Guide, Developer Guide, and API Guide, as well as the User and Admin Guides (see below).

Community-Supported Software
The technical repository functions of DataverseNO are provided by the Dataverse software, widely used open source software developed by an international developer community headed by the Institute for Quantitative Social Science (IQSS) at Harvard University [2]. The current version in use at DataverseNO is 4.15.1. The Dataverse roadmap for new versions is continuously updated. The Dataverse software is hosted on GitHub [3]. Minor releases of the Dataverse software are installed as they become available from the development group at Harvard. Major Dataverse release updates are subject to careful planning and testing before being put into production, in accordance with the Quality Handbook (see R9). DataverseNO continuously evaluates new infrastructure functionalities developed for the Dataverse application, and implements those that are considered useful for the service as a whole.

The system setup is thoroughly documented in the UiT IT department’s documentation system (internal), and different system administrators have performed redeployment of the production platform in order to minimize the vulnerability of the system. In addition to the Dataverse software, the DataverseNO platform consists of the PostgreSQL database and a GlassFish application server, as well as standard OS-related software. This is all open source software with strong and active community support.

Real-Time Data Streams
For the time being, DataverseNO does not provide real-time or near-real-time data streams, but it is operated with around-the-clock connectivity to UiT The Arctic University of Norway (owner of DataverseNO). UiT is the research network hub in northern Norway, and the two UiT datacentres have direct redundant connections to the 100 gigabit/s national academic network backbone operated by UNINETT [4], and thus connectivity into the GEANT network [5].

References:
[1] DataverseNO Preservation Policy: https://site.uit.no/dataverseno/about/policy-framework/preservation-policy/
[2] Dataverse: http://dataverse.org
[3] Dataverse on GitHub: https://github.com/IQSS/dataverse
[4] UNINETT: https://www.uninett.no/en
[5] GEANT: https://www.geant.org/