Publications

Working Paper
Crosas M. The Rise of Data Publishing (and how Dataverse 4 can help). Journal of Technology Science [Internet]. Working Paper.

The research community needs reliable, standard ways to make the data produced by scientific research available to the community, while getting credit as data authors. As a result, a new form of scholarly publication is emerging: data publishing. Data publishing - or making data long-term accessible, reusable and citable - is more involved than simply providing a link to a data file or posting the data to the researcher's website. In this paper, we define what is needed for proper data publishing and describe how the open-source Dataverse software helps define, enable and enhance data publishing for all.

In Press
Bar-Sinai M, Sweeney L, Crosas M. DataTags, Data Handling Policy Spaces and the Tags Language. In Proceedings of the International Workshop on Privacy Engineering, IEEE. In Press.
2016
Meyer PA, et al. Data publication with the structural biology data grid supports live analysis. Nature Communications. 2016;(10882).

Access to experimental X-ray diffraction image data is fundamental for validation and reproduction of macromolecular models and indispensable for development of structural biology processing methods. Here, we established a diffraction data publication and dissemination system, Structural Biology Data Grid (SBDG; data.sbgrid.org), to preserve primary experimental data sets that support scientific publications. Data sets are accessible to researchers through a community driven data grid, which facilitates global data access. Our analysis of a pilot collection of crystallographic data sets demonstrates that the information archived by SBDG is sufficient to reprocess data to statistics that meet or exceed the quality of the original published structures. SBDG has extended its services to the entire community and is used to develop support for other types of biomedical data sets. It is anticipated that access to the experimental data sets will enhance the paradigm shift in the community towards a much more dynamic body of continuously improving data analysis.

Wilkinson MD, Dumontier M, Aalbersberg IJJ, Appleton G, Axton M, Baak A, Blomberg N, Boiten J-W, da Santos LBS, Bourne PE, et al. The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data. 2016;160018.

There is an urgent need to improve the infrastructure supporting the reuse of scholarly data. A diverse set of stakeholders, representing academia, industry, funding agencies, and scholarly publishers, have come together to design and jointly endorse a concise and measurable set of principles that we refer to as the FAIR Data Principles. The intent is that these may act as a guideline for those wishing to enhance the reusability of their data holdings. Distinct from peer initiatives that focus on the human scholar, the FAIR Principles put specific emphasis on enhancing the ability of machines to automatically find and use the data, in addition to supporting its reuse by individuals. This Comment is the first formal publication of the FAIR Principles, and includes the rationale behind them, and some exemplar implementations in the community.

2015
Starr J, Castro E, Crosas M, Dumontier M, Downs RR, Duerr R, Haak LL, Haendel M, Herman I, Hodson S, et al. Achieving human and machine accessibility of cited data in scholarly publications. PeerJ Computer Science [Internet]. 2015.

Reproducibility and reusability of research results is an important concern in scientific communication and science policy. A foundational element of reproducibility and reusability is the open and persistently available presentation of research data. However, many common approaches for primary data publication in use today do not achieve sufficient long-term robustness, openness, accessibility or uniformity. Nor do they permit comprehensive exploitation by modern Web technologies. This has led to several authoritative studies recommending uniform direct citation of data archived in persistent repositories. Data are to be considered as first-class scholarly objects, and treated similarly in many ways to cited and archived scientific and scholarly literature. Here we briefly review the most current and widely agreed set of principle-based recommendations for scholarly data citation, the Joint Declaration of Data Citation Principles (JDDCP). We then present a framework for operationalizing the JDDCP; and a set of initial recommendations on identifier schemes, identifier resolution behavior, required metadata elements, and best practices for realizing programmatic machine actionability of cited data. The main target audience for the common implementation guidelines in this article consists of publishers, scholarly organizations, and persistent data repositories, including technical staff members in these organizations. But ordinary researchers can also benefit from these recommendations. The guidance provided here is intended to help achieve widespread, uniform human and machine accessibility of deposited data, in support of significantly improved verification, validation, reproducibility and re-use of scholarly/scientific data.

Crosas M, King G, Honaker J, Sweeney L. Automating Open Science for Big Data. ANNALS of the American Academy of Political and Social Science. 2015;659(1):260-273.

The vast majority of social science research uses small (megabyte- or gigabyte-scale) datasets. These fixed-scale datasets are commonly downloaded to the researcher’s computer where the analysis is performed. The data can be shared, archived, and cited with well-established technologies, such as the Dataverse Project, to support the published results. The trend toward big data, including large-scale streaming data, is starting to transform research and has the potential to impact policymaking as well as our understanding of the social, economic, and political problems that affect human societies. However, big data research poses new challenges to the execution of the analysis, archiving and reuse of the data, and reproduction of the results. Downloading these datasets to a researcher’s computer is impractical, leading to analyses taking place in the cloud, and requiring unusual expertise, collaboration, and tool development. The increased amount of information in these large datasets is an advantage, but at the same time it poses an increased risk of revealing personally identifiable sensitive information. In this article, we discuss solutions to these new challenges so that the social sciences can realize the potential of big data.

Altman M, Borgman C, Crosas M, Martone M. An Introduction to the Joint Principles of Data Citation. Bulletin of the Association for Information Science and Technology. 2015;41(3):43-45.

Data citation is rapidly emerging as a key practice supporting data access, sharing and reuse, as well as sound and reproducible scholarship. Consensus data citation principles, articulated through the Joint Declaration of Data Citation Principles, represent an advance in the state of the practice and a new consensus on citation.

Altman M, Castro E, Crosas M, Durbin P, Garnett A, Whitney J. Open Journal Systems and Dataverse Integration – Helping Journals to Upgrade Data Publication for Reusable Research. Code4Lib Journal. 2015;(30).

This article describes the novel open source tools for open data publication in open access journal workflows. This comprises a plugin for Open Journal Systems that supports a data submission, citation, review, and publication workflow; and an extension to the Dataverse system that provides a standard deposit API. We describe the function and design of these tools, provide examples of their use, and summarize their initial reception. We conclude by discussing future plans and potential impact.

Sweeney L, Crosas M. An Open Science Platform for the Next Generation of Data. arXiv.org Computer Science, Computers and Society [Internet]. 2015.

Imagine an online work environment where researchers have direct and immediate access to myriad data sources, tools, and data management resources, useful throughout the research lifecycle. This is our vision for the next generation of the Dataverse Network: an Open Science Platform (OSP). For the first time, researchers would be able to seamlessly access and create primary and derived data from a variety of sources: prior research results, public data sets, harvested online data, physical instruments, private data collections, and even data from other standalone repositories. Researchers could recruit research participants and conduct research directly on the OSP, if desired, using readily available tools. Researchers could create private or shared workspaces to house data, access tools, and computation and could publish data directly on the platform or publish elsewhere with persistent data citations on the OSP. This manuscript describes the details of an Open Science Platform and its construction. Having an Open Science Platform will especially impact the rate of new scientific discoveries and make scientific findings more credible and accountable. (This manuscript was originally conceived in 2013.)

Sweeney L, Crosas M, Bar-Sinai M. Sharing Sensitive Data with Confidence: the DataTags System. Technology Science. 2015.

Society generates data on a scale previously unimagined. Wide sharing of these data promises to improve personal health, lower healthcare costs, and provide a better quality of life. There is a tendency to want to share data freely. However, these same data often include sensitive information about people that could cause serious harms if shared widely. A multitude of regulations, laws and best practices protect data that contain sensitive personal information. Government agencies, research labs, and corporations that share data, as well as review boards and privacy officers making data sharing decisions, are vigilant but uncertain. This uncertainty creates a tendency not to share data at all. Some data are more harmful than other data; sharing should not be an all-or-nothing choice. How do we share data in ways that ensure access is commensurate with risks of harm?

Bar-Sinai M. Big Data Technology Literature Review. 2015.
Quigley E. Usability Testing Driven Redesign of Dataverse, an Open Source Data Repository. University of Massachusetts and New England Area Librarian e-Science Symposium [Internet]. 2015.
2014
Pepe A, Goodman A, Muench A, Crosas M, Erdmann C. How Do Astronomers Share Data? Reliability and Persistence of Datasets Linked in AAS Publications and a Qualitative Study of Data Practices among US Astronomers. PLoS ONE. 2014;9(8).

We analyze data sharing practices of astronomers over the past fifteen years. An analysis of URL links embedded in papers published by the American Astronomical Society reveals that the total number of links included in the literature rose dramatically from 1997 until 2005, when it leveled off at around 1500 per year. The analysis also shows that the availability of linked material decays with time: in 2011, 44% of links published a decade earlier, in 2001, were broken. A rough analysis of link types reveals that links to data hosted on astronomers' personal websites become unreachable much faster than links to datasets on curated institutional sites. To gauge astronomers' current data sharing practices and preferences further, we performed in-depth interviews with 12 scientists and online surveys with 173 scientists, all at a large astrophysical research institute in the United States: the Harvard-Smithsonian Center for Astrophysics, in Cambridge, MA. Both the in-depth interviews and the online survey indicate that, in principle, there is no philosophical objection to data-sharing among astronomers at this institution. Key reasons that more data are not presently shared more efficiently in astronomy include: the difficulty of sharing large data sets; overreliance on non-robust, non-reproducible mechanisms for sharing data (e.g. emailing it); unfamiliarity with options that make data-sharing easier (faster) and/or more robust; and, lastly, a sense that other researchers would not want the data to be shared. We conclude with a short discussion of a new effort to implement an easy-to-use, robust system for data sharing in astronomy, at theastrodata.org, and we analyze the uptake of that system to date.

Goodman A, Pepe A, Blocker AW, Borgman CL, Cranmer K, Crosas M, Stefano RD, Gil Y, Groth P, Hedstrom M, et al. Ten Simple Rules for the Care and Feeding of Scientific Data. PLoS Computational Biology. 2014;10(4).
Castro E, Garnett A. Building a Bridge Between Journal Articles and Research Data: The PKP-Dataverse Integration Project. International Journal of Digital Curation [Internet]. 2014;9:176-184.
A growing number of funding agencies and international scholarly organizations are requesting that research data be made more openly available to help validate and advance scientific research. Thus, this is an opportune moment for research data repositories to partner with journal editors and publishers in order to simplify and improve data curation and publishing practices. One practical example of this type of cooperation is currently being facilitated by a two-year (2012-2014), one-million-dollar Sloan Foundation grant, integrating two well-established open source systems: the Public Knowledge Project’s (PKP) Open Journal Systems (OJS), developed by Stanford University and Simon Fraser University; and Harvard University’s Dataverse Network web application, developed by the Institute for Quantitative Social Science (IQSS). To help make this interoperability possible, an OJS Dataverse plugin and Data Deposit API are being developed, which together will allow authors to submit their articles and datasets through an existing journal management interface, while the underlying data are seamlessly deposited into a research data repository, such as the Harvard Dataverse. This practice paper will provide an overview of the project, and a brief exploration of some of the specific challenges to and advantages of this integration.
Honaker J, D’Orazio V. Statistical Modeling by Gesture: A graphical, browser-based statistical interface for data repositories. Extended Proceedings of ACM Hypertext 2014 [Internet]. 2014.
We detail our construction of TwoRavens, a graphical user interface for quantitative analysis that allows users at all levels of statistical expertise to explore their data, describe their substantive understanding of the data, and appropriately construct and interpret statistical models. The interface is a browser-based, thin client, with the data remaining in an online repository, and the statistical modeling occurring on a remote server. In our implementation, we integrate with tens of thousands of datasets from the Dataverse repository, and the large library of statistical models available in the Zelig package for the R statistical language. Our interface is entirely gesture-driven, and so easily used on tablets and phones. This, in combination with being browser-based, makes data exploration and quantitative reasoning easily portable to the classroom with minimal infrastructure or technology overhead.
2013
Rajasekar A, Sankaran S, Lander H, Carsey T, Crabtree J, Kum H-C, Crosas M, King G, Zhan J. Sociometric Methods for Relevancy Analysis of Long Tail Science Data. Social Computing (SocialCom), 2013 International Conference on. IEEE. 2013:1-6.

As the push towards electronic storage, publication, curation, and discoverability of research data collected in multiple research domains has grown, so too have the massive numbers of small to medium datasets that are highly distributed and not easily discoverable - a region of data that is sometimes referred to as the long tail of science. The rapidly increasing sheer volume of these long tail data presents one aspect of the Big Data problem: how does one more easily access, discover, use, and reuse long tail data to lead to new multidisciplinary collaborative research and scientific advancement? In this paper, we describe Data Bridge, a new e-science collaboration environment that will realize the potential of long tail data by implementing algorithms and tools to more easily enable data discoverability and reuse. Data Bridge will define different types of semantic bridges that link diverse datasets by applying a set of sociometric network analysis (SNA) and relevance algorithms. We will measure relevancy by examining different ways datasets can be related to each other: data to data, user to data, and method to data connections. Through analysis of metadata and ontology, by pattern analysis and feature extraction, through usage tools and models, and via human connections, Data Bridge will create an environment for long tail data that is greater than the sum of its parts. In the project's initial phase, we will test and validate the new tools with real-world data contained in the Dataverse Network, the largest social science data repository. In this short paper, we discuss the background and vision for the Data Bridge project, and present an introduction to the proposed SNA algorithms and analytical tools that are relevant for discoverability of long tail science data.

Altman M, Crosas M. The Evolution of Data Citation: From Principles to Implementation. IASSIST Quarterly [Internet]. 2013;37:62.
Data citation is rapidly emerging as a key practice in support of data access, sharing, reuse, and of sound and reproducible scholarship. In this article we review the evolution of data citation standards and practices – to which Sue Dodd was an early contributor – and the core principles of data citation that have emerged through a collaborative synthesis. We then discuss an example of the current state of the practice, and identify the remaining implementation challenges.
Gibbs E, Lin L, Quigley E, Tang R. Dataverse Usability Evaluation: Final Report. 2013:1-18.
