Sharing biological data: why, when, and how

Samantha L. Wilson, Gregory P. Way, Wout Bittremieux, Jean‐Paul Armache, Melissa A. Haendel, Michael M. Hoffman

Data sharing is an essential element of the scientific method, imperative to ensure transparency and reproducibility. Researchers often reuse shared data for meta‐analyses or to accompany new data. Different areas of research collect fundamentally different types of data, such as tabular data, sequence data, and image data. These types of data differ greatly in size and require different approaches for sharing. Here, we outline good practices to make your biological data publicly accessible and usable, generally and for several specific kinds of data.

FAIR principles

Sharing data proves more useful when others can easily find and access, interpret, and reuse the data. To maximize the benefit of sharing your data, follow the findable, accessible, interoperable, and reusable (FAIR) guiding principles of data sharing [[1]] (Box 1), which optimize reuse of generated data. The FAIR principles outline clear standards for ensuring that others can find and access your data and that once accessed, users can easily understand and reuse the data. The FAIR principles provide a clear collection of important details to include within your data and metadata (see ‘Data metadata and documentation’).

Box 1. FAIR data sharing principles

Findable

The first step in (re)using data is to find them. Metadata and data should be easy to find for both humans and computers. Machine‐readable metadata are essential for automatic discovery of datasets and services, so this is an essential component of the FAIRification process.

  • F1. (Meta)data are assigned a globally unique and persistent identifier
  • F2. Data are described with rich metadata (defined by R1 below)
  • F3. (Meta)data clearly and explicitly include the identifier of the data they describe
  • F4. (Meta)data are registered or indexed in a searchable resource

Accessible

Once the user finds the required data, they need to know how the data can be accessed, possibly including authentication and authorization.

  • A1. (Meta)data are retrievable by their identifier using a standardized communications protocol
    • A1.1 The protocol is open, free, and universally implementable
    • A1.2 The protocol allows for an authentication and authorization procedure, where necessary
  • A2. Metadata are accessible, even when the data are no longer available

Interoperable

The data usually need to be integrated with other data. In addition, the data need to interoperate with applications or workflows for analysis, storage, and processing.

  • I1. (Meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation
  • I2. (Meta)data use vocabularies that follow FAIR principles
  • I3. (Meta)data include qualified references to other (meta)data

Reusable

The ultimate goal of FAIR is to optimize the reuse of data. To achieve this, metadata and data should be well‐described so that they can be replicated and/or combined in different settings.

  • R1. (Meta)data are richly described with a plurality of accurate and relevant attributes
    • R1.1 (Meta)data are released with a clear and accessible data usage license
    • R1.2 (Meta)data are associated with detailed provenance
    • R1.3 (Meta)data meet domain‐relevant community standards

By GO FAIR [[1]] (https://www.go-fair.org/fair-principles/), provided under the Creative Commons Attribution 4.0 International license.

The repositories and practices we recommend below fulfill some of these principles and make it easier for you to follow others. This will not only help others using your data, but can also save you time in the future (see ‘The benefits of sharing data to individual researchers’).

The National Institutes of Health (NIH), Canadian Institutes of Health Research (CIHR), Monarch Initiative [[2, 3]], and the Research Data Alliance (https://www.rd-alliance.org/) all recommend FAIR principles for data sharing. Amendments to these recommendations that add measures for traceability (such as evidence and provenance), licensing, and connectedness (such as identifiers and versioning) further improve data reusability [[4, 5]].
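As a rough illustration of how a few items from Box 1 can translate into practice, the sketch below checks a dataset's metadata record for an identifier (F1), a description (F2), a license (R1.1), and provenance (R1.2) before deposition. The field names, the `missing_fair_fields` helper, and the example record are all hypothetical; real repositories define their own metadata schemas, and passing a check like this does not by itself make a dataset FAIR.

```python
# Hypothetical mapping from metadata field names to the FAIR item each supports.
# These field names are assumptions for illustration, not a repository schema.
REQUIRED_FIELDS = {
    "identifier": "F1: globally unique and persistent identifier",
    "description": "F2: rich descriptive metadata",
    "license": "R1.1: clear and accessible data usage license",
    "provenance": "R1.2: detailed provenance",
}


def missing_fair_fields(metadata):
    """Return the FAIR items whose field is absent, None, or blank."""
    missing = []
    for field, principle in REQUIRED_FIELDS.items():
        value = metadata.get(field)
        if value is None or not str(value).strip():
            missing.append(principle)
    return missing


# Illustrative record; the identifier is a placeholder, not a real DOI.
example_metadata = {
    "identifier": "doi:10.1234/example-dataset",
    "description": "RNA-seq read counts for 12 hypothetical samples",
    "license": "CC-BY-4.0",
    "provenance": "",  # left blank on purpose to show a failed check
}

if __name__ == "__main__":
    for item in missing_fair_fields(example_metadata):
        print("Missing:", item)
```

Running a check like this before submission can catch an empty provenance or license field while the details are still fresh.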

Why share?

The benefits of sharing data to science and society

Sharing data allows for transparency in scientific studies and allows one to fully understand what occurred in an analysis and reproduce the results. Without complete data, metadata (see ‘Data and metadata’), and information about resources used to generate the data, reproducing a study proves impossible [[6, 7]].

Within the biological sciences, we have a problem of data waste—ostensibly shared data that no one ever uses. Many otherwise useful datasets go underused because researchers cannot effectively reuse the data. The inability to reuse arises from lack of discoverability, lack of important information provided, inconsistencies in data and metadata, and licensing issues.

When data are shared effectively, we can multiply the benefits of large datasets that cost large amounts of funding and research time to produce. Combining previously shared biological data accelerates development of the analytical methods used to analyze such data. Reusing rare samples increases their impact. Combining data in meta‐analyses increases study power. Data sharing also leads to fewer duplicate studies. Researchers can build on previous studies to corroborate or falsify their findings rather than repeating the same experiment. Many research projects rely on data from resources such as the Encyclopedia of DNA Elements (ENCODE) Project [[8, 9]]. The existence of a large collection of accessible data also aids in the development of cross‐cutting analyses such as recount2 [[10]].

Published manuscripts with reusable data will garner more citations and have more long‐term impact on scientific knowledge [[11]]. As such, many funders now require that grant proposals include a data management and sharing plan describing biological data and metadata [[12, 13]]. Many journals have also implemented policies making public data sharing a requirement upon publication.

The benefits of sharing data to individual researchers

Sharing data increases the impact of a researcher's work and reputation for sound science [[14]]. Awards for those with an excellent record of data sharing [[15]] (https://researchsymbionts.org/) or data reuse [[16]] (https://researchparasite.com/) can exemplify this reputation.

Demonstrating a track record of excellence in resource sharing benefits you when applying for funding. A commitment to and detailed plan for sharing data publicly increase the perception of a grant proposal's impact [[14]]. A detailed data sharing plan outlines the types of data you will share, available metadata, and in which repositories you will deposit the data.

Preparing to share data publicly reduces unintentional errors within your own research group. When preparing the data for sharing, providing detailed metadata and documentation will eliminate guesswork, prevent lost details, and preserve tacit knowledge that might otherwise remain unrecorded. Posting data on public repositories with links to the publication, and linking to the deposited data within your publication, ensures findability of your data.

Data citation standards now allow directly citing datasets in journal reference lists [[17]]. Citable datasets provide an important incentive for data sharing, since those using your shared data can now properly attribute citations to your dataset.

Addressing common concerns about data sharing

Despite the clear benefits of sharing data, some researchers still have concerns about doing so. Some worry that sharing data may decrease the novelty of their work and their chance to publish in prominent journals. You can address this concern by sharing your data only after publication. You can also choose to preprint your manuscript when you decide to share your data. Furthermore, you only need to share the data and metadata required to reproduce your published study.

Time spent on sharing data

Some have concerns about the time it takes to organize and share data publicly. Many add ‘data available upon request’ to manuscripts instead of depositing the data in a public repository in hopes of getting the work out sooner. It does take time to organize data in preparation for sharing, but sharing data publicly may save you time. Sharing data in a public repository that guarantees archival persistence means that you will not have to worry about storing and backing up the data yourself.

You can consider putting off data sharing tasks as incurring a form of ‘sharing debt’, by analogy with the concept of technical debt used in software engineering. Delaying these tasks may appear to save you time in the short run, but sharing the data later will take at least as much time as doing it now. You may also incur interest, as it can take longer in the long run to handle individual requests for data availability. Taking a few hours now to organize data and submit it to a repository will save you much of this time.

Human subject data

Sharing of data on human subjects requires special ethical, legal, and privacy considerations. Existing recommendations [[18-24]] largely aim to balance the privacy of human participants with the benefits of data sharing by de‐identifying human participants and obtaining consent for sharing. Sharing human data poses a variety of challenges for analysis, transparency, reproducibility, interoperability, and access [[18-24]].

Sometimes you cannot publicly post all human data, even after de‐identification [[25]]. We suggest three strategies for making these data maximally accessible. First, deposit raw data files in a controlled‐access repository, such as the European Genome‐phenome Archive (EGA) [[67]]. Controlled‐access repositories allow only qualified researchers who apply to access the data. Second, even if you cannot make individual‐level raw data available, you can make as much processed data available as possible. This may take the form of summary statistics such as means and standard deviations, rather than individual‐level data. Third, you may want to generate simulated data distinct from the original data but statistically similar to it. Simulated data would allow others to reproduce your analysis without disclosing the original data or requiring the security controls needed for controlled access [[21]].
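The second and third strategies can be sketched in a few lines of Python. This is a minimal illustration only: the `summarize` and `simulate` helpers and the measurement values are made up for the example, and the simulation assumes a single, normally distributed variable. Real simulated datasets usually need to preserve relationships between variables and should be assessed for disclosure risk before release.

```python
import random
import statistics


def summarize(values):
    """Reduce individual-level values to shareable summary statistics."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),  # sample standard deviation
    }


def simulate(summary, seed=0):
    """Draw simulated values from a normal distribution matching the summary.

    Assumes the variable is roughly normal; this is a simplification.
    """
    rng = random.Random(seed)  # fixed seed so others can regenerate the data
    return [rng.gauss(summary["mean"], summary["sd"]) for _ in range(summary["n"])]


if __name__ == "__main__":
    # Made-up measurements standing in for protected individual-level data.
    private_values = [5.1, 4.8, 6.2, 5.5, 5.9]
    summary = summarize(private_values)
    print(summary)            # shareable in place of the raw values
    print(simulate(summary))  # statistically similar, but not the original data
```

Recording the seed alongside the simulated data lets others regenerate exactly the same values.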

First published in FEBS Letters on April 11, 2021.

How to cite this article:

Wilson, S.L., Way, G.P., Bittremieux, W., Armache, J.‐P., Haendel, M.A. and Hoffman, M.M. (2021), Sharing biological data: why, when, and how. FEBS Lett, 595: 847-863. https://doi.org/10.1002/1873-3468.14067
