Dataverse 4.8.4 Release Adds Support for Schema.org

Dataverse’s latest update adds more metadata to dataset landing pages, using a community-driven vocabulary supported by major search engines to make it even easier to find open data online.

Search results account for a large portion of traffic to datasets published online. For example, since Dataverse 4 was released in June 2015, at least a fifth of the traffic to dataset pages in the largest Dataverse installation, Harvard Dataverse, has come from search engines, mostly Google. Giving search engines and other systems richer metadata to index datasets will help people find data faster.

What is Schema.org?

Schema.org is a metadata standard used to describe a growing number of types of content on the Internet. Since its introduction in 2011, communities have been developing and extending Schema.org’s vocabularies to describe people, places, actions and concepts in structured ways, which has allowed software, such as search engines and email clients, to recognize and act on pieces of information and the relationships between them.

How Dataverse uses Schema.org

This past summer, Dataverse added citation metadata to the markup of dataset landing pages, using the Dublin Core standard to enhance web discoverability of datasets and help browser plugins export metadata to reference managers like Zotero and EndNote. In November, the team consulted with members of the Dataverse community for its latest push: publishing similar citation metadata using Schema.org.

Dataverse uses Schema.org's "Dataset" type and its set of recommended properties, taking cues from Google’s guidelines and the ways in which several other repositories are using Schema.org to describe datasets.

We’ve mapped the metadata elements most important for making each dataset discoverable - including titles, persistent identifiers, authors, descriptions, keywords, and related publications - to Schema.org properties. With the 4.8.4 update, Dataverse embeds this metadata as JSON-LD in the landing page HTML of every dataset that the software publishes or distributes. Here's what the metadata looks like:

{
    "@context": "http://schema.org",
    "@type": "Dataset",
    "identifier": "http://hdl.handle.net/1902.1/11992",
    "name": "The Standardized World Income Inequality Database",
    "author": [
        {
            "name": "Frederick Solt",
            "affiliation": "University of Iowa"
        }
    ],
    "datePublished": "2009-08-01",
    "dateModified": "2017-10-30",
    "version": "17",
    "description": "Cross-national research on the causes and consequences of income inequality has been hindered by the limitations of existing inequality datasets: greater coverage across countries and over time is available from these sources only at the cost of significantly reduced comparability across observations. The goal of the Standardized World Income Inequality Database (SWIID) is to overcome these limitations. A custom missing-data algorithm was used to standardize the United Nations University's World Income Inequality Database and data from other sources; data collected by the Luxembourg Income Study served as the standard. The SWIID provides comparable Gini indices of gross and net income inequality for 192 countries for as many years as possible from 1960 to the present along with estimates of uncertainty in these statistics. By maximizing comparability for the largest possible sample of countries and years, the SWIID is better suited to broadly cross-national research on income inequality than previously available sources: it offers coverage double that of the next largest income inequality dataset, and its record of comparability is three to eight times better than those of alternate datasets.  In any papers or publications that use the SWIID, authors are asked to cite the article of record for the data set and give the version number as follows: Solt, Frederick. 2016. \"The Standardized World Income Inequality Database.\" Social Science Quarterly 97(5):1267-1281. SWIID Version 6.1, October 2017.",
    "keywords": [
        "Social Sciences",
        "income inequality",
        "income distribution"
    ],
    "citation": [
        "<a href=\"http://onlinelibrary.wiley.com/doi/10.1111/ssqu.12295/abstract\"  target= \"_new\"> Solt, Frederick.  2016.  \"The Standardized World Income Inequality Database.\"  <i>Social Science Quarterly</i> 97(5):1267-1281.</a>"
    ],
    "temporalCoverage": [
        "1960/2016"
    ],
    "schemaVersion": "https://schema.org/version/3.3",
    "license": {
        "@type": "Dataset",
        "text": "CC0",
        "url": "https://creativecommons.org/publicdomain/zero/1.0"
    },
    "includedInDataCatalog": {
        "@type": "DataCatalog",
        "name": "Harvard Dataverse",
        "url": "https://dataverse.harvard.edu"
    },
    "provider": {
        "@type": "Organization",
        "name": "Dataverse"
    }
}
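
To illustrate how a search engine or other harvester might read this metadata back out of a landing page, here is a minimal Python sketch. It assumes the JSON-LD sits in a `<script type="application/ld+json">` element, which is the usual convention for embedding JSON-LD in HTML. The class name `JsonLdExtractor`, the function `extract_datasets`, and the trimmed `SAMPLE_PAGE` fragment are made up for illustration and are not Dataverse's actual markup or code.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the parsed contents of every application/ld+json script tag."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        # Start buffering when a JSON-LD script element opens.
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so accumulate it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        # Parse the accumulated text once the script element closes.
        if tag == "script" and self._in_jsonld:
            self.documents.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False


def extract_datasets(html):
    """Return every embedded JSON-LD object whose @type is Dataset."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return [
        doc for doc in parser.documents
        if isinstance(doc, dict) and doc.get("@type") == "Dataset"
    ]


# Hypothetical, heavily trimmed landing page (an assumption for this sketch).
SAMPLE_PAGE = """<html><head>
<script type="application/ld+json">
{"@context": "http://schema.org", "@type": "Dataset",
 "name": "The Standardized World Income Inequality Database",
 "identifier": "http://hdl.handle.net/1902.1/11992",
 "keywords": ["Social Sciences", "income inequality", "income distribution"]}
</script>
</head><body></body></html>"""

datasets = extract_datasets(SAMPLE_PAGE)
for dataset in datasets:
    print(dataset["name"], dataset["identifier"])
```

Because the metadata is plain JSON inside a single element, a consumer needs only an HTML parser and a JSON parser to recover titles, identifiers, and keywords without scraping the visible page.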

The use of both standards, Dublin Core and Schema.org, is included in the recommendations made by FORCE11’s Repositories Early Adopters Expert Group in the article “A Data Citation Roadmap for Scholarly Data Repositories”. Both improvements are additional ways that Dataverse helps make data findable, accessible, interoperable and reusable (FAIR).

Next steps

In keeping with Dataverse’s agile development methodology, the team kept the initial work narrow and postponed mapping several Dataverse metadata elements to Schema.org. Elements such as author identifiers, the names and IDs of agencies funding the research, contributor names, and geographic coverage of the data were a little trickier to map to Schema.org, and we plan to add more elements like these in future releases.

Additionally, the Dataverse community looks forward to continuing to work with the Schema.org community to solve the technical and social challenges of making data discoverable and reusable.

For full details, check out the release notes on GitHub.