%0 Journal Article %J Scientific Data %D 2022 %T A Large-scale Study on Research Code Quality and Execution %A Trisovic, Ana %A Lau, Matthew K. %A Thomas Pasquier %A Mercè Crosas %X This article presents a study on the quality and execution of research code from publicly-available replication datasets at the Harvard Dataverse repository. Research code is typically created by a group of scientists and published together with academic papers to facilitate research transparency and reproducibility. For this study, we define ten questions to address aspects impacting research reproducibility and reuse. First, we retrieve and analyze more than 2000 replication datasets with over 9000 unique R files published from 2010 to 2020. Second, we execute the code in a clean runtime environment to assess its ease of reuse. Common coding errors were identified, and some of them were solved with automatic code cleaning to aid code execution. We find that 74% of R files failed to complete without error in the initial execution, while 56% failed when code cleaning was applied, showing that many errors can be prevented with good coding practices. We also analyze the replication datasets from journals’ collections and discuss the impact of the journal policy strictness on the code re-execution rate. Finally, based on our results, we propose a set of recommendations for code dissemination aimed at researchers, journals, and repositories. %B Scientific Data %V 9 %G eng %U https://www.nature.com/articles/s41597-022-01143-6 %N 60 %0 Journal Article %J Septentrio Reports %D 2022 %T Dataverse Community Survey 2022 – Report %A Conzett, Philipp %X

This report presents some of the results from the Dataverse Community Survey 2022.

The main goal of the survey was to help the Global Dataverse Community Consortium (GDCC; https://dataversecommunity.global/) and the Dataverse Project (https://dataverse.org/) decide on what actions to take to improve the Dataverse software and the larger ecosystem of integrated tools and services as well as better support community members. The results from the survey may also be of interest to other communities working on software and services for managing research data.

The survey was designed to map out the current status as well as the roadmaps and priorities of Dataverse installations around the world.

The main target group for the survey consisted of the people/teams responsible for operating Dataverse installations around the world. A secondary target group consisted of people/teams at organizations that are planning to deploy or considering deploying a Dataverse installation. In total, 34 existing and planned Dataverse installations participated in the survey.

%B Septentrio Reports %V 1 %G eng %U https://septentrio.uit.no/index.php/SapReps/article/view/6872 %0 Journal Article %J Data Quality and Data Access for Research %D 2021 %T Repository Approaches to Improving the Quality of Shared Data and Code %A Trisovic, Ana %A Mika, Katherine %A Boyd, Ceilyn %A Sebastian Feger %A Mercè Crosas %X Sharing data and code for reuse has become increasingly important in scientific work over the past decade. However, in practice, shared data and code may be unusable, or published results obtained from them may be irreproducible. Data repository features and services contribute significantly to the quality, longevity, and reusability of datasets. This paper presents a combination of original and secondary data analysis studies focusing on computational reproducibility, data curation, and gamified design elements that can be employed to indicate and improve the quality of shared data and code. The findings of these studies are sorted into three approaches that can be valuable to data repositories, archives, and other research dissemination platforms. %B Data Quality and Data Access for Research %V 6 %G eng %U https://www.mdpi.com/2306-5729/6/2/15 %N 2 %0 Journal Article %J Nature Sustainability %D 2020 %T Qualitative data sharing and synthesis for sustainability science %A Steven M. Alexander %A Kristal Jones %A Nathan J. Bennett %A Amber Budden %A Michael Cox %A Mercè Crosas %A Edward T. Game %A Janis Geary %A R. Dean Hardy %A Jay T. Johnson %A Sebastian Karcher %A Nicole Motzer %A Jeremy Pittman %A Heather Randell %A Julie A. Silva %A Patricia Pinto da Silva %A Carly Strasser %A Colleen Strawhacker %A Andrew Stuhl %A Nic Weber %X Socio–environmental synthesis as a research approach contributes to broader sustainability policy and practice by reusing data from disparate disciplines in innovative ways. 
Synthesizing diverse data sources and types of evidence can help to better conceptualize, investigate and address increasingly complex socio–environmental problems. However, sharing qualitative data for re-use remains uncommon when compared to sharing quantitative data. We argue that qualitative data present untapped opportunities for sustainability science, and discuss practical pathways to facilitate and realize the benefits from sharing and reusing qualitative data. However, these opportunities and benefits are also hindered by practical, ethical and epistemological challenges. To address these challenges and accelerate qualitative data sharing, we outline enabling conditions and suggest actions for researchers, institutions, funders, data repository managers and publishers. %B Nature Sustainability %V 3 %P 81–88 %G eng %U https://www.nature.com/articles/s41893-019-0434-8 %0 Conference Paper %B P-RECS '20: Proceedings of the 3rd International Workshop on Practical Reproducible Evaluation of Computer Systems %D 2020 %T Advancing Computational Reproducibility in the Dataverse Data Repository Platform %A Trisovic, Ana %A Philip Durbin %A Tania Schlatter %A Gustavo Durand %A Sonia Barbosa %A Danny Brooke %A Mercè Crosas %X Recent reproducibility case studies have raised concerns showing that much of the deposited research has not been reproducible. One of their conclusions was that the way data repositories store research data and code cannot fully facilitate reproducibility due to the absence of a runtime environment needed for the code execution. New specialized reproducibility tools provide cloud-based computational environments for code encapsulation, thus enabling research portability and reproducibility. However, they do not often enable research discoverability, standardized data citation, or long-term archival like data repositories do. 
This paper addresses the shortcomings of data repositories and reproducibility tools and how they could be overcome to improve the current lack of computational reproducibility in published and archived research outputs. %B P-RECS '20: Proceedings of the 3rd International Workshop on Practical Reproducible Evaluation of Computer Systems %P 15–20 %G eng %U https://dl.acm.org/doi/10.1145/3391800.3398173 %0 Journal Article %J Scientific Data %D 2019 %T A data citation roadmap for scholarly data repositories %A Martin Fenner %A Mercè Crosas %A Jeffrey S. Grethe %A Kennedy, David %A Hermjakob, Henning %A Phillippe Rocca-Serra %A Gustavo Durand %A Robin Berjon %A Sebastian Karcher %A Maryann Martone %A Tim Clark %X This article presents a practical roadmap for scholarly data repositories to implement data citation in accordance with the Joint Declaration of Data Citation Principles, a synopsis and harmonization of the recommendations of major science policy bodies. The roadmap was developed by the Repositories Expert Group, as part of the Data Citation Implementation Pilot (DCIP) project, an initiative of FORCE11.org and the NIH-funded BioCADDIE (https://biocaddie.org) project. The roadmap makes 11 specific recommendations, grouped into three phases of implementation: a) required steps needed to support the Joint Declaration of Data Citation Principles, b) recommended steps that facilitate article/data publication workflows, and c) optional steps that further improve data citation support provided by data repositories. We describe the early adoption of these recommendations 18 months after they have first been published, looking specifically at implementations of machine-readable metadata on dataset landing pages. %B Scientific Data %V 6 %G eng %U https://www.nature.com/articles/s41597-019-0031-8 %N 28 %0 Journal Article %J Scientific Data %D 2019 %T Evaluating FAIR maturity through a scalable, automated, community-governed framework %A Mark D. 
Wilkinson %A Michel Dumontier %A Susanna-Assunta Sansone %A Luiz Olavo Bonino da Silva Santos %A Mario Prieto %A Dominique Batista %A McQuilton, Peter %A Tobias Kuhn %A Philippe Rocca-Serra %A Mercè Crosas %A Erik Schultes %X Transparent evaluations of FAIRness are increasingly required by a wide range of stakeholders, from scientists to publishers, funding agencies and policy makers. We propose a scalable, automatable framework to evaluate digital resources that encompasses measurable indicators, open source tools, and participation guidelines, which come together to accommodate domain relevant community-defined FAIR assessments. The components of the framework are: (1) Maturity Indicators – community-authored specifications that delimit a specific automatically-measurable FAIR behavior; (2) Compliance Tests – small Web apps that test digital resources against individual Maturity Indicators; and (3) the Evaluator, a Web application that registers, assembles, and applies community-relevant sets of Compliance Tests against a digital resource, and provides a detailed report about what a machine “sees” when it visits that resource. We discuss the technical and social considerations of FAIR assessments, and how this translates to our community-driven infrastructure. We then illustrate how the output of the Evaluator tool can serve as a roadmap to assist data stewards to incrementally and realistically improve the FAIRness of their resources. %B Scientific Data %V 6 %G eng %U https://www.nature.com/articles/s41597-019-0184-5 %N 174 %0 Journal Article %J SocArXiv %D 2018 %T Data policies of highly-ranked social science journals %A Mercè Crosas %A Julian Gautier %A Sebastian Karcher %A Dessi Kirilova %A Gerard Otalora %A Abigail Schwartz %X

By encouraging and requiring that authors share their data in order to publish articles, scholarly journals have become an important actor in the movement to improve the openness of data and the reproducibility of research. But how many social science journals encourage or mandate that authors share the data supporting their research findings? How does the share of journal data policies vary by discipline? What influences these journals’ decisions to adopt such policies and instructions? And what do those policies and instructions look like?

We discuss the results of our analysis of the instructions and policies of 291 highly-ranked journals publishing social science research, where we studied the contents of journal data policies and instructions across 14 variables, such as when and how authors are asked to share their data, and what role journal ranking and age play in the existence and quality of data policies and instructions. We also compare our results to the results of other studies that have analyzed the policies of social science journals, although differences in the journals chosen and how each study defines what constitutes a data policy limit this comparison.

We conclude that a little more than half of the journals in our study have data policies. A greater share of the economics journals have data policies and mandate sharing, followed by political science/international relations and psychology journals.

Finally, we use our findings to make several recommendations: Policies should include the terms “data,” “dataset” or more specific terms that make it clear what to make available; policies should include the benefits of data sharing; journals, publishers, and associations need to collaborate more to clarify data policies; and policies should explicitly ask for qualitative data.

This paper has won the IASSIST & Carto 2018 Best Paper award.

%B SocArXiv %G eng %0 Web Page %D 2017 %T Cloud Dataverse: A Data Repository Platform for the Cloud %A Mercè Crosas %B CIO Review %G eng %U https://openstack.cioreview.com/cxoinsight/cloud-dataverse-a-data-repository-platform-for-the-cloud-nid-24199-cid-120.html %0 Journal Article %J Scientific Data %D 2017 %T If these data could talk %A Thomas Pasquier %A Lau, Matthew K. %A Trisovic, Ana %A Boose, Emery R. %A Couturier, Ben %A Mercè Crosas %A Ellison, Aaron M. %A Gibson, Valerie %A Jones, Chris R. %A Seltzer, Margo %X In the last few decades, data-driven methods have come to dominate many fields of scientific inquiry. Open data and open-source software have enabled the rapid implementation of novel methods to manage and analyze the growing flood of data. However, it has become apparent that many scientific fields exhibit distressingly low rates of reproducibility. Although there are many dimensions to this issue, we believe that there is a lack of formalism used when describing end-to-end published results, from the data source to the analysis to the final published results. Even when authors do their best to make their research and data accessible, this lack of formalism reduces the clarity and efficiency of reporting, which contributes to issues of reproducibility. Data provenance aids reproducibility through systematic and formal records of the relationships among data sources, processes, datasets, publications and researchers. %B Scientific Data %V 4 %G eng %U https://www.nature.com/articles/sdata2017114 %N 170114 %0 Journal Article %J Ann. N.Y. Acad. Sci. %D 2016 %T Extension of research data repository system to support direct compute access to biomedical datasets: enhancing Dataverse to support large datasets %A Bill McKinney %A Peter A. 
Meyer %A Mercè Crosas %A Sliz, Piotr %X Access to experimental X‐ray diffraction image data is important for validation and reproduction of macromolecular models and indispensable for the development of structural biology processing methods. In response to the evolving needs of the structural biology community, we recently established a diffraction data publication system, the Structural Biology Data Grid (SBDG, data.sbgrid.org), to preserve primary experimental datasets supporting scientific publications. All datasets published through the SBDG are freely available to the research community under a public domain dedication license, with metadata compliant with the DataCite Schema (schema.datacite.org). A proof‐of‐concept study demonstrated community interest and utility. Publication of large datasets is a challenge shared by several fields, and the SBDG has begun collaborating with the Institute for Quantitative Social Science at Harvard University to extend the Dataverse (dataverse.org) open‐source data repository system to structural biology datasets. Several extensions are necessary to support the size and metadata requirements for structural biology datasets. In this paper, we describe one such extension—functionality supporting preservation of file system structure within Dataverse—which is essential for both in‐place computation and supporting non‐HTTP data transfers. %B Ann. N.Y. Acad. Sci. %V 1387 %P 95-104 %G eng %U https://nyaspubs.onlinelibrary.wiley.com/doi/full/10.1111/nyas.13272 %0 Web Page %D 2016 %T Data publication with the structural biology data grid supports live analysis %A Peter A. Meyer %A et al. %X

Access to experimental X-ray diffraction image data is fundamental for validation and reproduction of macromolecular models and indispensable for development of structural biology processing methods. Here, we established a diffraction data publication and dissemination system, Structural Biology Data Grid (SBDG; data.sbgrid.org), to preserve primary experimental data sets that support scientific publications. Data sets are accessible to researchers through a community driven data grid, which facilitates global data access. Our analysis of a pilot collection of crystallographic data sets demonstrates that the information archived by SBDG is sufficient to reprocess data to statistics that meet or exceed the quality of the original published structures. SBDG has extended its services to the entire community and is used to develop support for other types of biomedical data sets. It is anticipated that access to the experimental data sets will enhance the paradigm shift in the community towards a much more dynamic body of continuously improving data analysis.

%B Nature Communications %G eng %N 10882 %0 Conference Paper %B In Proceedings of the International Workshop on Privacy Engineering, IEEE %D 2016 %T DataTags, Data Handling Policy Spaces and the Tags Language %A Bar-Sinai, Michael %A Sweeney, Latanya %A Mercè Crosas %X Widespread sharing of scientific datasets holds great promise for new scientific discoveries and great risks for personal privacy. Dataset handling policies play the critical role of balancing privacy risks and scientific value. We propose an extensible, formal, theoretical model for dataset handling policies. We define binary operators for policy composition and for comparing policy strictness, such that propositions like "this policy is stricter than that policy" can be formally phrased. Using this model, policies are described in a machine-executable and human-readable way. We further present the Tags programming language and toolset, created especially for working with the proposed model. Tags allows composing interactive, friendly questionnaires which, when given a dataset, can suggest a data handling policy that follows legal and technical guidelines. Currently, creating such a policy is a manual process requiring access to legal and technical experts, who are not always available. We present some of Tags' tools, such as interview systems, visualizers, development environment, and questionnaire inspectors. Finally, we discuss methodologies for questionnaire development. Data for this paper include a questionnaire for suggesting a HIPAA compliant data handling policy, and formal description of the set of data tags proposed by the authors in a recent paper. %B In Proceedings of the International Workshop on Privacy Engineering, IEEE %I IEEE %G eng %U https://ieeexplore.ieee.org/document/7527746?arnumber=7527746 %0 Journal Article %J Scientific Data %D 2016 %T The FAIR Guiding Principles for scientific data management and stewardship %A Mark D. 
Wilkinson %A Michel Dumontier %A IJsbrand Jan Aalbersberg %A Gabrielle Appleton %A Myles Axton %A Arie Baak %A Niklas Blomberg %A Jan-Willem Boiten %A Luiz Bonino da Silva Santos %A Philip E. Bourne %A Jildau Bouwman %A Anthony J. Brookes %A Tim Clark %A Mercè Crosas %A Ingrid Dillo %A Olivier Dumon %A Scott Edmunds %A Chris T. Evelo %A Richard Finkers %A Alejandra Gonzalez-Beltran %A Alasdair J.G. Gray %A Paul Groth %A Carole Goble %A Jeffrey S. Grethe %A Jaap Heringa %A Peter A.C. 't Hoen %A Rob Hooft %A Tobias Kuhn %A Ruben Kok %A Joost Kok %A Scott J. Lusher %A Maryann E. Martone %A Albert Mons %A Abel L. Packer %A Bengt Persson %A Philippe Rocca-Serra %A Marco Roos %A Rene van Schaik %A Susanna-Assunta Sansone %A Erik Schultes %A Thierry Sengstag %A Ted Slater %A George Strawn %A Morris A. Swertz %A Mark Thompson %A van der Lei, Johan %A van Mulligen, Erik %A Jan Velterop %A Andra Waagmeester %A Peter Wittenburg %A Katherine Wolstencroft %A Zhao, Jun %A Barend Mons %X

There is an urgent need to improve the infrastructure supporting the reuse of scholarly data. A diverse set of stakeholders—representing academia, industry, funding agencies, and scholarly publishers—have come together to design and jointly endorse a concise and measurable set of principles that we refer to as the FAIR Data Principles. The intent is that these may act as a guideline for those wishing to enhance the reusability of their data holdings. Distinct from peer initiatives that focus on the human scholar, the FAIR Principles put specific emphasis on enhancing the ability of machines to automatically find and use the data, in addition to supporting its reuse by individuals. This Comment is the first formal publication of the FAIR Principles, and includes the rationale behind them, and some exemplar implementations in the community.

%B Scientific Data %V 160018 %8 2016 %G eng %0 Web Page %D 2015 %T Achieving human and machine accessibility of cited data in scholarly publications %A Joan Starr %A Castro, Eleni %A Mercè Crosas %A Michel Dumontier %A Robert R. Downs %A Ruth Duerr %A Laurel L. Haak %A Melissa Haendel %A Ivan Herman %A Simon Hodson %A Joe Hourclé %A John Ernest Kratz %A Jennifer Lin %A Lars Holm Nielsen %A Amy Nurnberger %A Stefan Proell %A Andreas Rauber %A Simone Sacchi %A Arthur Smith %A Mike Taylor %A Tim Clark %X

Reproducibility and reusability of research results is an important concern in scientific communication and science policy. A foundational element of reproducibility and reusability is the open and persistently available presentation of research data. However, many common approaches for primary data publication in use today do not achieve sufficient long-term robustness, openness, accessibility or uniformity. Nor do they permit comprehensive exploitation by modern Web technologies. This has led to several authoritative studies recommending uniform direct citation of data archived in persistent repositories. Data are to be considered as first-class scholarly objects, and treated similarly in many ways to cited and archived scientific and scholarly literature. Here we briefly review the most current and widely agreed set of principle-based recommendations for scholarly data citation, the Joint Declaration of Data Citation Principles (JDDCP). We then present a framework for operationalizing the JDDCP; and a set of initial recommendations on identifier schemes, identifier resolution behavior, required metadata elements, and best practices for realizing programmatic machine actionability of cited data. The main target audience for the common implementation guidelines in this article consists of publishers, scholarly organizations, and persistent data repositories, including technical staff members in these organizations. But ordinary researchers can also benefit from these recommendations. The guidance provided here is intended to help achieve widespread, uniform human and machine accessibility of deposited data, in support of significantly improved verification, validation, reproducibility and re-use of scholarly/scientific data.

%B PeerJ Computer Science %G eng %U https://peerj.com/articles/cs-1/ %0 Journal Article %J ANNALS of the American Academy of Political and Social Science %D 2015 %T Automating Open Science for Big Data %A Mercè Crosas %A Gary King %A James Honaker %A Sweeney, Latanya %X

The vast majority of social science research uses small (megabyte- or gigabyte-scale) datasets. These fixed-scale datasets are commonly downloaded to the researcher’s computer where the analysis is performed. The data can be shared, archived, and cited with well-established technologies, such as the Dataverse Project, to support the published results. The trend toward big data—including large-scale streaming data—is starting to transform research and has the potential to impact policymaking as well as our understanding of the social, economic, and political problems that affect human societies. However, big data research poses new challenges to the execution of the analysis, archiving and reuse of the data, and reproduction of the results. Downloading these datasets to a researcher’s computer is impractical, leading to analyses taking place in the cloud, and requiring unusual expertise, collaboration, and tool development. The increased amount of information in these large datasets is an advantage, but at the same time it poses an increased risk of revealing personally identifiable sensitive information. In this article, we discuss solutions to these new challenges so that the social sciences can realize the potential of big data.

%B ANNALS of the American Academy of Political and Social Science %V 659 %P 260-273 %8 May 2015 %G eng %N 1 %0 Journal Article %J Bulletin of the Association for Information Science and Technology %D 2015 %T An Introduction to the Joint Principles of Data Citation %A Micah Altman %A Christine Borgman %A Mercè Crosas %A Maryann Martone %X

Data citation is rapidly emerging as a key practice supporting data access, sharing and reuse, as well as sound and reproducible scholarship. Consensus data citation principles, articulated through the Joint Declaration of Data Citation Principles, represent an advance in the state of the practice and a new consensus on citation.

%B Bulletin of the Association for Information Science and Technology %V 41 %P 43-45 %8 2015 %G eng %N 3 %0 Web Page %D 2015 %T Open Journal Systems and Dataverse Integration-- Helping Journals to Upgrade Data Publication for Reusable Research %A Micah Altman %A Castro, Eleni %A Mercè Crosas %A Philip Durbin %A Garnett, Alex %A Jen Whitney %X

This article describes the novel open source tools for open data publication in open access journal workflows. This comprises a plugin for Open Journal Systems that supports a data submission, citation, review, and publication workflow; and an extension to the Dataverse system that provides a standard deposit API. We describe the function and design of these tools, provide examples of their use, and summarize their initial reception. We conclude by discussing future plans and potential impact.

%B Code4Lib Journal %G eng %N 30 %0 Web Page %D 2015 %T An Open Science Platform for the Next Generation of Data %A Sweeney, Latanya %A Merce Crosas %X

Imagine an online work environment where researchers have direct and immediate access to myriad data sources, tools, and data management resources, useful throughout the research lifecycle. This is our vision for the next generation of the Dataverse Network: an Open Science Platform (OSP). For the first time, researchers would be able to seamlessly access and create primary and derived data from a variety of sources: prior research results, public data sets, harvested online data, physical instruments, private data collections, and even data from other standalone repositories. Researchers could recruit research participants and conduct research directly on the OSP, if desired, using readily available tools. Researchers could create private or shared workspaces to house data, access tools, and computation and could publish data directly on the platform or publish elsewhere with persistent data citations on the OSP. This manuscript describes the details of an Open Science Platform and its construction. Having an Open Science Platform will especially impact the rate of new scientific discoveries and make scientific findings more credible and accountable. (This manuscript was originally conceived in 2013.)

%B Arxiv.org Computer Science, Computers and Society %G eng %U http://arxiv.org/abs/1506.05632 %0 Web Page %D 2015 %T Sharing Sensitive Data with Confidence: the DataTags System %A Sweeney, Latanya %A Mercè Crosas %A Bar-Sinai, Michael %X

Society generates data on a scale previously unimagined. Wide sharing of these data promises to improve personal health, lower healthcare costs, and provide a better quality of life. There is a tendency to want to share data freely. However, these same data often include sensitive information about people that could cause serious harms if shared widely. A multitude of regulations, laws and best practices protect data that contain sensitive personal information. Government agencies, research labs, and corporations that share data, as well as review boards and privacy officers making data sharing decisions, are vigilant but uncertain. This uncertainty creates a tendency not to share data at all. Some data are more harmful than other data; sharing should not be an all-or-nothing choice. How do we share data in ways that ensure access is commensurate with risks of harm?

%B Technology Science %G eng %0 Journal Article %D 2015 %T Big Data Technology Literature Review %A Bar-Sinai, Michael %G eng %0 Journal Article %J The Code4Lib Journal %D 2015 %T Open Journal Systems and Dataverse Integration– Helping Journals to Upgrade Data Publication for Reusable Research %A Micah Altman %A Castro, Eleni %A Mercè Crosas %A Philip Durbin %A Garnett, Alex %A Jen Whitney %X This article describes the novel open source tools for open data publication in open access journal workflows. This comprises a plugin for Open Journal Systems that supports a data submission, citation, review, and publication workflow; and an extension to the Dataverse system that provides a standard deposit API. We describe the function and design of these tools, provide examples of their use, and summarize their initial reception. We conclude by discussing future plans and potential impact. %B The Code4Lib Journal %V 30 %G eng %U http://journal.code4lib.org/articles/10989 %0 Conference Proceedings %B University of Massachusetts and New England Area Librarian e-Science Symposium %D 2015 %T Usability Testing Driven Redesign of Dataverse, an Open Source Data Repository %A Quigley, Elizabeth %B University of Massachusetts and New England Area Librarian e-Science Symposium %C Worcester, MA %G eng %U http://escholarship.umassmed.edu/escience_symposium/2015/posters/8/ %0 Journal Article %J PLoS ONE %D 2014 %T How Do Astronomers Share Data? Reliability and Persistence of Datasets Linked in AAS Publications and a Qualitative Study of Data Practices among US Astronomers %A Alberto Pepe %A Goodman, Alyssa %A Muench, August %A Merce Crosas %A Christopher Erdmann %X

We analyze data sharing practices of astronomers over the past fifteen years. An analysis of URL links embedded in papers published by the American Astronomical Society reveals that the total number of links included in the literature rose dramatically from 1997 until 2005, when it leveled off at around 1500 per year. The analysis also shows that the availability of linked material decays with time: in 2011, 44% of links published a decade earlier, in 2001, were broken. A rough analysis of link types reveals that links to data hosted on astronomers' personal websites become unreachable much faster than links to datasets on curated institutional sites. To gauge astronomers' current data sharing practices and preferences further, we performed in-depth interviews with 12 scientists and online surveys with 173 scientists, all at a large astrophysical research institute in the United States: the Harvard-Smithsonian Center for Astrophysics, in Cambridge, MA. Both the in-depth interviews and the online survey indicate that, in principle, there is no philosophical objection to data-sharing among astronomers at this institution. Key reasons that more data are not presently shared more efficiently in astronomy include: the difficulty of sharing large data sets; overreliance on non-robust, non-reproducible mechanisms for sharing data (e.g. emailing it); unfamiliarity with options that make data-sharing easier (faster) and/or more robust; and, lastly, a sense that other researchers would not want the data to be shared. We conclude with a short discussion of a new effort to implement an easy-to-use, robust system for data sharing in astronomy, at theastrodata.org, and we analyze the uptake of that system to date.

%B PLoS ONE %V 9 %8 2014 %G eng %N 8 %0 Journal Article %J PLoS computational biology %D 2014 %T Ten Simple Rules for the Care and Feeding of Scientific Data %A Goodman, Alyssa %A Alberto Pepe %A Alexander W. Blocker %A Christine L. Borgman %A Kyle Cranmer %A Merce Crosas %A Rosanne Di Stefano %A Yolanda Gil %A Paul Groth %A Margaret Hedstrom %A David W. Hogg %A Vinay Kashyap %A Ashish Mahabal %A Aneta Siemiginowska %A Aleksandra Slavkovic %B PLoS computational biology %V 10 %8 2014 %G eng %N 4 %0 Journal Article %J International Journal of Digital Curation %D 2014 %T Building a Bridge Between Journal Articles and Research Data: The PKP-Dataverse Integration Project %A Castro, Eleni %A Garnett, Alex %X A growing number of funding agencies and international scholarly organizations are requesting that research data be made more openly available to help validate and advance scientific research. Thus, this is an opportune moment for research data repositories to partner with journal editors and publishers in order to simplify and improve data curation and publishing practices. One practical example of this type of cooperation is currently being facilitated by a two year (2012-2014) one million dollar Sloan Foundation grant, integrating two well-established open source systems: the Public Knowledge Project’s (PKP) Open Journal Systems (OJS), developed by Stanford University and Simon Fraser University; and Harvard University’s Dataverse Network web application, developed by the Institute for Quantitative Social Science (IQSS). To help make this interoperability possible, an OJS Dataverse plugin and Data Deposit API are being developed, which together will allow authors to submit their articles and datasets through an existing journal management interface, while the underlying data are seamlessly deposited into a research data repository, such as the Harvard Dataverse. 
This practice paper will provide an overview of the project, and a brief exploration of some of the specific challenges to and advantages of this integration. %B International Journal of Digital Curation %V 9 %P 176-184 %G eng %U http://www.ijdc.net/index.php/ijdc/article/view/311 %0 Conference Proceedings %B Extended Proceedings of ACM Hypertext 2014 %D 2014 %T Statistical Modeling by Gesture: A graphical, browser-based statistical interface for data repositories %A James Honaker %A Vito D’Orazio %X We detail our construction of TwoRavens, a graphical user interface for quantitative analysis that allows users at all levels of statistical expertise to explore their data, describe their substantive understanding of the data, and appropriately construct and interpret statistical models. The interface is a browser-based, thin client, with the data remaining in an online repository, and the statistical modeling occurring on a remote server.  In our implementation, we integrate with tens of thousands of datasets from the Dataverse repository, and the large library of statistical models available in the Zelig package for the R statistical language.  Our interface is entirely gesture-driven, and so easily used on tablets and phones.  This, in combination with being browser-based, makes data exploration and quantitative reasoning easily portable to the classroom with minimal infrastructure or technology overhead. %B Extended Proceedings of ACM Hypertext 2014 %G eng %U http://ceur-ws.org/Vol-1210/datawiz2014_05.pdf %0 Journal Article %J Social Computing (SocialCom), 2013 International Conference on. IEEE %D 2013 %T Sociometric Methods for Relevancy Analysis of Long Tail Science Data %A Arcot Rajasekar %A Sharlini Sankaran %A Howard Lander %A Tom Carsey %A Jonathan Crabtree %A Hye-Chung Kum %A Merce Crosas %A Gary King %A Justin Zhan %X

As the push towards electronic storage, publication, curation, and discoverability of research data collected in multiple research domains has grown, so too have the massive numbers of small to medium datasets that are highly distributed and not easily discoverable - a region of data that is sometimes referred to as the long tail of science. The rapidly increasing sheer volume of these long tail data presents one aspect of the Big Data problem: how does one more easily access, discover, use, and reuse long tail data to lead to new multidisciplinary collaborative research and scientific advancement? In this paper, we describe Data Bridge, a new e-science collaboration environment that will realize the potential of long tail data by implementing algorithms and tools to more easily enable data discoverability and reuse. Data Bridge will define different types of semantic bridges that link diverse datasets by applying a set of sociometric network analysis (SNA) and relevance algorithms. We will measure relevancy by examining different ways datasets can be related to each other: data to data, user to data, and method to data connections. Through analysis of metadata and ontology, by pattern analysis and feature extraction, through usage tools and models, and via human connections, Data Bridge will create an environment for long tail data that is greater than the sum of its parts. In the project's initial phase, we will test and validate the new tools with real-world data contained in the Dataverse Network, the largest social science data repository. In this short paper, we discuss the background and vision for the Data Bridge project, and present an introduction to the proposed SNA algorithms and analytical tools that are relevant for discoverability of long tail science data.

%B Social Computing (SocialCom), 2013 International Conference on. IEEE %P 1-6 %8 2013 %G eng %0 Journal Article %J IASSIST Quarterly %D 2013 %T The Evolution of Data Citation: From Principles to Implementation %A Micah Altman %A Merce Crosas %X Data citation is rapidly emerging as a key practice in support of data access, sharing, reuse, and of sound and reproducible scholarship. In this article we review the evolution of data citation standards and practices – to which Sue Dodd was an early contributor – and the core principles of data citation that have emerged through a collaborative synthesis. We then discuss an example of the current state of the practice, and identify the remaining implementation challenges. %B IASSIST Quarterly %V 37 %P 62 %G eng %U http://www.iassistdata.org/iq/content/5457 %0 Journal Article %D 2013 %T Dataverse Usability Evaluation: Final Report %A Gibbs, Eric %A Lin, Lin %A Quigley, Elizabeth %A Tang, Rong %I Simmons GSLIS Usability Lab %C Boston %P 1-18 %8 05/29 %G eng %0 Journal Article %J Journal of eScience Librarianship %D 2012 %T A Data Sharing Story %A Merce Crosas %X From the early days of modern science through this century of Big Data, data sharing has enabled some of the greatest advances in science. In the digital age, technology can facilitate more effective and efficient data sharing and preservation practices, and provide incentives for making data easily accessible among researchers. At the Institute for Quantitative Social Science at Harvard University, we have developed open-source software to share, cite, preserve, discover and analyze data, named the Dataverse Network. We share here the project’s motivation, its growth and successes, and likely evolution. 
%B Journal of eScience Librarianship %V 1 %P 173-179 %G eng %U http://escholarship.umassmed.edu/jeslib/vol1/iss3/7/ %0 Journal Article %J D-Lib Magazine %D 2011 %T The Dataverse Network: An Open-source Application for Sharing, Discovering and Preserving Data %A Merce Crosas %X The Dataverse Network is an open-source application for publishing, referencing, extracting and analyzing research data. The main goal of the Dataverse Network is to solve the problems of data sharing through building technologies that enable institutions to reduce the burden for researchers and data publishers, and incentivize them to share their data. By installing Dataverse Network software, an institution is able to host multiple individual virtual archives, called "dataverses" for scholars, research groups, or journals, providing a data publication framework that supports author recognition, persistent citation, data discovery and preservation. Dataverses require no hardware or software costs, nor maintenance or backups by the data owner, but still enable all web visibility and credit to devolve to the data owner. 
%B D-Lib Magazine %V 17 %G eng %U http://www.dlib.org/dlib/january11/crosas/01crosas.html %0 Book Section %B iPres %D 2011 %T Overview of SafeArchive: An Open-Source System for Automatic Policy-Based Collaborative Archival Replication %A Christian, Thu-mai %A Jonathan Crabtree %A Mcgovern, Nancy %A Micah Altman %X n/a %B iPres %V 02 %G eng %U http://www.safearchive.org/ %0 Conference Proceedings %B Proceedings of Archiving 2011 %D 2011 %T Using the SafeArchive System: TRAC-Based Auditing of LOCKSS %A Micah Altman %A Jonathan Crabtree %B Proceedings of Archiving 2011 %I IS & T %C SLC, UT %P 165-170 %G eng %U http://www.box.net/shared/xj8r2nuiugsn8ukxf1lt %0 Journal Article %J The American Archivist %D 2009 %T Digital Preservation Through Archival Collaboration: The Data Preservation Alliance for the Social Sciences %A Micah Altman %A Margaret Adams %A Jonathan Crabtree %A Darrell Donakowski %A Marc Maynard %A Amy Pienta %A Copeland Young %B The American Archivist %V 72 %P 169-182 %G eng %U http://www.linkedin.com/in/micahaltman %0 Journal Article %J Library Trends %D 2009 %T From Preserving the Past to Preserving the Future: The Data-PASS Project and the Challenges of Preserving Digital Social Science Data %A Gutmann, Myron P. %A Mark Abrahamson %A Margaret O. Adams %A Micah Altman %A Caroline Arms %A Kenneth Bollen %A Michael Carlson %A Jonathan Crabtree %A Darrell Donakowski %A Gary King %A Jared Lyle %A Marc Maynard %A Amy Pienta %A Richard Rockwell %A Lois Timms-Ferrara %A Copeland H. Young %X Social science data are an unusual part of the past, present, and future of digital preservation. They are both an unqualified success, due to long-lived and sustainable archival organizations, and in need of further development because not all digital content is being preserved. 
This article is about the Data Preservation Alliance for Social Sciences (Data-PASS), a project supported by the National Digital Information Infrastructure and Preservation Program (NDIIPP), which is a partnership of five major U.S. social science data archives. Broadly speaking, Data-PASS has the goal of ensuring that at-risk social science data are identified, acquired, and preserved, and that we have a future-oriented organization that could collaborate on those preservation tasks for the future. Throughout the life of the Data-PASS project we have worked to identify digital materials that have never been systematically archived, and to appraise and acquire them. As the project has progressed, however, it has increasingly turned its attention from identifying and acquiring legacy and at-risk social science data to identifying ongoing and future research projects that will produce data. This article is about the project’s history, with an emphasis on the issues that underlay the transition from looking backward to looking forward. %B Library Trends %V 57 %P 315–337 %G eng %U http://gking.harvard.edu/files/gking/files/GutAbrAda09.pdf %0 Journal Article %J Library Trends %D 2009 %T Transformative Effects of NDIIPP, the case of the Henry A. 
Murray Archive %A Micah Altman %B Library Trends %V 57 %P 338-351 %G eng %U http://www.linkedin.com/in/micahaltman %0 Book Section %B A Fingerprint Method for Verification of Scientific Data %D 2008 %T A Fingerprint Method for Verification of Scientific Data %A Micah Altman %B A Fingerprint Method for Verification of Scientific Data %I Springer-Verlag %G eng %U http://www.mendeley.com/profiles/micah-altman/ %0 Journal Article %J Journal of Computational and Graphical Statistics %D 2008 %T Toward A Common Framework for Statistical Analysis and Development %A Kosuke Imai %A Gary King %A Olivia Lau %X We describe some progress toward a common framework for statistical analysis and software development built on and within the R language, including R’s numerous existing packages. The framework we have developed offers a simple unified structure and syntax that can encompass a large fraction of statistical procedures already implemented in R, without requiring any changes in existing approaches. We conjecture that it can be used to encompass and present simply a vast majority of existing statistical methods, regardless of the theory of inference on which they are based, notation with which they were developed, and programming syntax with which they have been implemented. This development enabled us, and should enable others, to design statistical software with a single, simple, and unified user interface that helps overcome the conflicting notation, syntax, jargon, and statistical methods existing across the methods subfields of numerous academic disciplines. The approach also enables one to build a graphical user interface that automatically includes any method encompassed within the framework. We hope that the result of this line of research will greatly reduce the time from the creation of a new statistical innovation to its widespread use by applied researchers whether or not they use or program in R. 
%B Journal of Computational and Graphical Statistics %V 17 %P 1–22 %G eng %0 Journal Article %J Sociological Methods and Research %D 2007 %T An Introduction to the Dataverse Network as an Infrastructure for Data Sharing %A Gary King %B Sociological Methods and Research %V 36 %P 173-199 %G eng %U http://gking.harvard.edu/files/gking/files/dvn.pdf %0 Journal Article %J D-Lib Magazine %D 2007 %T A Proposed Standard for the Scholarly Citation of Quantitative Data %A Micah Altman %A Gary King %X An essential aspect of science is a community of scholars cooperating and competing in the pursuit of common goals. A critical component of this community is the common language of and the universal standards for scholarly citation, credit attribution, and the location and retrieval of articles and books. We propose a similar universal standard for citing quantitative data that retains the advantages of print citations, adds other components made possible by, and needed due to, the digital form and systematic nature of quantitative data sets, and is consistent with most existing subfield-specific approaches. Although the digital library field includes numerous creative ideas, we limit ourselves to only those elements that appear ready for easy practical use by scientists, journal editors, publishers, librarians, and archivists. %B D-Lib Magazine %V 13 %8 March/April %G eng %U http://www.dlib.org/dlib/march07/altman/03altman.html %0 Journal Article %J International Studies Perspectives %D 2003 %T The Future of Replication %A Gary King %X Since the replication standard was proposed for political science research, more journals have required or encouraged authors to make data available, and more authors have shared their data. The calls for continuing this trend are more persistent than ever, and the agreement among journal editors in this Symposium continues this trend. In this article, I offer a vision of a possible future of the replication movement. 
The plan is to implement this vision via the Virtual Data Center project (pre-Dataverse), which – by automating the process of finding, sharing, archiving, subsetting, converting, analyzing, and distributing data – may greatly facilitate adherence to the replication standard. %B International Studies Perspectives %V 4 %P 443–499 %8 February %G eng %0 Journal Article %D 2001 %T A Digital Library for the Dissemination and Replication of Quantitative Social Science Research %A Micah Altman %A Leonid Andreev %A Mark Diggory %A Gary King %A Kiskis, Daniel %A Elizabeth Kolster %A Sidney Verba %X The Virtual Data Center (VDC) software is an open-source, digital library system for quantitative data. We discuss what the software does, and how it provides an infrastructure for the management and dissemination of distributed collections of quantitative data, and the replication of results derived from this data. %V Social Science Computer Review, 19 %P 458-470 %G eng %U http://gking.harvard.edu/files/gking/files/vdcwhitepaper.pdf %0 Journal Article %D 2001 %T An Introduction to the Virtual Data Center Project and Software %A Micah Altman %A Leonid Andreev %A Mark Diggory %A Gary King %A Elizabeth Kolster %A Krot, M. %A Sidney Verba %A Kiskis, Daniel %V Proceedings of The First ACM+IEEE Joint Conference on Digital Libraries %P 203-204 %G eng %U http://gking.harvard.edu/files/gking/files/jcdl01.pdf %0 Journal Article %J PS: Political Science and Politics %D 1995 %T Replication, Replication %A Gary King %X Political science is a community enterprise and the community of empirical political scientists need access to the body of data necessary to replicate existing studies to understand, evaluate, and especially build on this work. Unfortunately, the norms we have in place now do not encourage, or in some cases even permit, this aim. 
Following are suggestions that would facilitate replication and are easy to implement – by teachers, students, dissertation writers, graduate programs, authors, reviewers, funding agencies, and journal and book editors. %B PS: Political Science and Politics %V 28 %P 443–499 %8 September %G eng %U http://gking.harvard.edu/files/gking/files/replication.pdf