A revolution of data in the life sciences

Implementing Findable, Accessible, Interoperable, Reusable (FAIR) data solutions can be challenging. Semi-automated decentralised data sharing tools could support researchers in making their data FAIR. Here the authors explore one possible solution.

This post has been jointly written by Simon Rayner (Oslo University Hospital/University of Oslo) and Pável Vázquez Faci (University of Oslo).

Technological advances in recent years have produced a data revolution in the life sciences. Life science research increasingly occurs in multidisciplinary environments, and researchers can generate large quantities of diverse data in a single experiment. Consequently, the field is transitioning towards a 'life cycle' perspective of data handling, and researchers face new challenges such as consolidating heterogeneous data, storing data in a machine-readable format, and ensuring data provenance (Grüning et al., 2018).

The Findable, Accessible, Interoperable, Reusable (FAIR) data initiative has provided a framework for defining the minimum elements required for effective data management (Wilkinson et al., 2016). However, studies suggest that only partial FAIRness is achieved in scientific research (Wilkinson et al., 2019). This partly reflects the lack of clear implementation guidelines in the original FAIR principles (Koers et al., 2020), but it is also due to a lack of methods for measuring how closely the principles are followed.

Another issue is that of centralised solutions. For example, tools such as InterMine (Smith et al., 2012) apply a level of FAIRness that allows users to effectively query and analyse integrated and diverse datasets. However, InterMine and other similar data resources implement a centralised architecture for data sharing. Hosting such a solution for personal research needs can be challenging, and it relies on the continued participation of the host, which in turn may depend on factors such as continued funding (Imker, 2018). Decentralised systems, by contrast, can be more scalable, allowing for more efficient use of resources and better performance, and can be more transparent, giving users more control over, and visibility into, how data is managed and used. The decentralised solutions that do exist are largely private third-party services, such as Dropbox [1] or OneDrive [2], which may compromise privacy and offer limited control over data access.

Moreover, implementing FAIR-ish solutions involves a steep learning curve for many researchers, as they must (i) master concepts related to metadata, schemata, protocols, policies, and community agreements; (ii) select from different and possibly incompatible realisations; and (iii) implement their data plan as a software/hardware solution. Thus, implementing FAIR solutions may be beyond the means of many researchers in terms of both resources and expertise. Software such as COPO (Shaw et al., 2020) has taken a first step towards providing off-the-shelf solutions, allowing users to set up a data server that describes research data using community-sanctioned vocabularies. However, COPO uses a traditional centralised data architecture.

A data sharing solution to the rescue

To investigate the potential of semi-automated decentralised solutions, we developed the Globally Accessible Distributed Data Sharing (GADDS) platform. The GADDS platform uses (i) a tamper-proof blockchain algorithm to enforce metadata quality control; (ii) a cloud storage architecture to split, replicate, and store data across multiple devices; (iii) version-control software to handle metadata submissions and updates; and (iv) a web browser interface to simplify data collection, storage, and retrieval. An entry in the GADDS platform combines the data (for example, a microscope image) with the metadata associated with the measurement (for example, date of collection, microscope model, operator) to form a meta(data) tuple.
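
To make this concrete, a meta(data) tuple can be pictured as a data object paired with a set of descriptive fields. The Python sketch below is purely illustrative; the field names are placeholders of our own, not the actual GADDS schema.

    # Illustrative meta(data) tuple: the data object plus the metadata
    # describing how it was produced (placeholder fields, not the GADDS schema).
    entry = {
        "data": "plate3_fibre_scan.tif",          # the measurement itself
        "metadata": {
            "collected_on": "2021-11-10",         # date of collection
            "instrument": "confocal microscope, model XYZ",  # hypothetical model
            "operator": "J. Doe",
        },
    }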

While blockchain technology is usually associated with cryptocurrency, it is also well suited to general data-tracking tasks. Here the database, or ledger, records all transactions (exchanges) that have been executed among participants. In the GADDS platform, a transaction is the validated metadata, and the consensus is an agreement about whether to add a specific metadata entry to the ledger (by ensuring it adheres to a predefined standard). We use the Hyperledger Fabric Proof-of-Authority (PoA) consensus protocol (Cachin, 2016), as it is far less computationally demanding than the Proof-of-Work protocols behind most cryptocurrencies. Thus, the blockchain plays the dual role of performing metadata quality control and acting as a database for the metadata.
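
As a toy illustration of this quality-control step, the sketch below accepts a metadata entry onto a ledger only if it meets a predefined standard. The real platform relies on Hyperledger Fabric's endorsement and ordering machinery, so this Python fragment only captures the idea; the required fields and the ledger structure are simplifications of our own.

    # Minimal sketch of blockchain-style metadata quality control: an entry
    # joins the ledger only if it adheres to a predefined standard.
    # (Simplified stand-in, not the Hyperledger Fabric implementation.)
    REQUIRED_FIELDS = {"collected_on", "instrument", "operator"}  # assumed standard

    ledger = []  # stand-in for the distributed ledger

    def submit(metadata):
        """Validate a metadata entry; on success, append it as a new block."""
        if not REQUIRED_FIELDS.issubset(metadata):
            return False  # rejected: does not adhere to the agreed standard
        ledger.append({"index": len(ledger), "tx": metadata})
        return True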

The GADDS platform uses the MinIO cloud architecture to split, replicate, and store data across multiple devices. This approach offers a high level of redundancy: up to half (N/2) of the total storage devices can be lost and the data can still be recovered. Our version-control software, DIVECO, handles (meta)data submissions and updates, permitting the modification of submissions that are already present in the ledger. In this case, if validation is successful, a new block of metadata is created. The older version of the metadata remains in the ledger, and both versions point to the same data, so the metadata changes associated with that data can be reviewed.
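
For a feel of the storage layer, the snippet below deposits a data object using the MinIO Python SDK. The endpoint, credentials, bucket, and file path are placeholders rather than the GADDS configuration, and in practice GADDS users would go through the web interface rather than calling the SDK directly.

    # Depositing a data object in a MinIO-backed store (placeholder
    # endpoint and credentials; not the actual GADDS deployment).
    from minio import Minio

    client = Minio("gadds.example.org:9000",   # hypothetical endpoint
                   access_key="ACCESS_KEY",
                   secret_key="SECRET_KEY",
                   secure=True)

    if not client.bucket_exists("experiments"):
        client.make_bucket("experiments")

    # MinIO erasure-codes each object into data and parity shards spread
    # across devices, which is what tolerates losing up to N/2 of them.
    client.fput_object("experiments", "plate3_fibre_scan.tif",
                       "/local/path/plate3_fibre_scan.tif")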

Hyperledger Fabric uses dedicated nodes called peers to manage the metadata. These peers can be grouped into organizations, and groups of organizations can be configured to communicate securely with each other to form a consortium, as sketched below. This allows data to be shared safely across different locations. The GADDS platform is therefore decentralised: responsibility is shared, and there is no dependency on the continued participation of any single collaborator.
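
The resulting topology can be summarised roughly as follows; the organization and peer names are placeholders, not the actual GADDS configuration.

    # Rough sketch of a three-organization consortium sharing one channel
    # (placeholder names, not the actual GADDS configuration).
    consortium = {
        "channel": "shared-experiments",
        "organizations": {
            "OrgA": ["peer0.orga.example.org", "peer1.orga.example.org"],
            "OrgB": ["peer0.orgb.example.org"],
            "OrgC": ["peer0.orgc.example.org"],
        },
    }
    # Each organization runs its own peers, so no single host is a point
    # of failure for the shared data.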

In our proof-of-concept implementation of the GADDS platform, we have configured three Hyperledger Fabric organizations in three different locations: Imperial College London, Oslo University Hospital, and the Norwegian Research and Education Cloud in Bergen. The platform is configured to collect information from tissue fibre experiments conducted by the Tissue Engineering group within the Hybrid Technology Hub [3] at the University of Oslo (UiO) and by the Sensors group in the Department of Physics at UiO.

Ultimately, data sharing tools such as the GADDS platform are of paramount importance in the life sciences. With vast amounts of data being generated through research and experimentation, it is critical that this data is effectively collected, stored, and shared among researchers. The ability to deposit and access data quickly and accurately can help accelerate the pace of scientific discovery. Additionally, data sharing can improve the transparency and reproducibility of research in a FAIR way, leading to better science.

[1] https://www.dropbox.com/, Last access: 22.03.2022.
[2] https://onedrive.live.com/, Last access: 22.03.2022.
[3] https://www.med.uio.no/hth/english/, Last access: 10.11.2021.

References:

  • Grüning B. et al. (2018) Practical Computational Reproducibility in the Life Sciences. Cell Systems, 6: 631–5. https://doi.org/10.1016/j.cels.2018.03.014
  • Wilkinson, M. et al. (2016) The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3, 160018. https://doi.org/10.1038/sdata.2016.18
  • Wilkinson, M.D. et al. (2019) Evaluating FAIR maturity through a scalable, automated, community-governed framework. Sci Data 6, 174. https://doi.org/10.1038/s41597-019-0184-5
  • Koers H. et al. (2020) Recommendations for Services in a FAIR Data Ecosystem. Patterns 1(5):100058. https://doi.org/10.1016/j.patter.2020.100058
  • Smith R.N. et al. (2012) InterMine: a flexible data warehouse system for the integration and analysis of heterogeneous biological data. Bioinformatics 28(23):3163–5. https://doi.org/10.1093/bioinformatics/bts577
  • Imker, HJ (2018) 25 Years of Molecular Biology Databases: A Study of Proliferation, Impact, and Maintenance. Front. Res. Metr. Anal. 3:18. https://doi.org/10.3389/frma.2018.00018
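  • Shaw F. et al. (2020) COPO: a metadata platform for brokering FAIR data deposition. F1000Research 9:495. https://doi.org/10.12688/f1000research.23889.1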
  • Cachin, C. (2016) Architecture of the hyperledger blockchain fabric. In: Workshop on Distributed Cryptocurrencies and Consensus Ledgers, Zurich, Vol. 310, IBM Research, Zurich, pp. 1–4

Photo by GuerrillaBuzz on Unsplash 
