In recent decades, the number of computational and bioinformatics tools and websites for the life sciences has grown exponentially. This growth has gone hand in hand with the open availability of genome, proteome and macromolecular structure databases, and of data from functional experiments, including microarray and RNA-seq expression data, RNA-protein interactions, ChIP-seq, bioactivity assays and bio-images, among others. Improvements in high-throughput sequencing and bioassay technologies, together with the falling costs of these technologies and of data storage and processing, have contributed to this growth.
In this scenario, large projects such as ENCODE (Stanford University), GTEx (Genotype-Tissue Expression), TARA/Magellan, the Human Cell Atlas (HCA Data Portal), 10KP (10,000 Plant Genomes Project), JGI Plant Gene Atlas (gene expression across diverse plant species), VGP (Vertebrate Genomes Project), and ERGA (European Reference Genome Atlas) highlight both the enormous capacity for data generation and the usefulness of these data, as well as the need for infrastructures to organize the data and exploit the opportunities they represent.
Genome and proteome databases for numerous organisms from all kingdoms are available on the websites of different consortia and organizations. Some of them are well known: NCBI (the National Center for Biotechnology Information, USA) and EMBL-EBI (the European Bioinformatics Institute of the European Molecular Biology Laboratory, UK). There are also websites specialized in specific organisms: BioCyc, which mainly integrates a collection of bacterial genomes; and CyanoBase, a genome database for cyanobacteria (model organisms for photosynthesis) that houses species information, complete genome sequences, genome-scale experimental data, gene information, gene annotations and mutant information. For plants, PLAZA (VIB Center for Plant Systems Biology, Ghent University, Belgium) and Phytozome (the Plant Comparative Genomics portal of the Department of Energy’s Joint Genome Institute, USA) are available. PLAZA integrates plant sequence data and comparative genomics methods, and provides an online platform to perform evolutionary analyses within the green plant lineage. It provides tools to explore gene families and genomic homology (orthology and paralogy). Phytozome hosts assembled genomes annotated with KOG, KEGG, ENZYME, Pathway and the InterPro family of protein analysis tools. In addition, pairwise orthology and paralogy groups have been calculated across all Phytozome proteomes. Ensembl, started in 1999, some years before the draft human genome was completed, and Ensembl Plants integrate genome annotation and comparative genomics with other available biological data and taxonomic reference points, giving an evolutionary context in which genes can be understood.
In addition, RSAT Plants (Regulatory Sequence Analysis Tools) offers tools to analyse cis-regulatory elements in genome sequences: a) motif discovery (with support for genome-wide data sets such as ChIP-seq); b) transcription factor binding motif analysis (quality assessment, comparisons and clustering); c) comparative genomics; d) analysis of regulatory variations.
Protein structures are organized in the Protein Data Bank (PDB), established in 1971. PDB is the central archive of all experimentally determined protein structure data. Today it is maintained by an international consortium known as the Worldwide Protein Data Bank (wwPDB), whose mission is to maintain a single archive of macromolecular structural data that is freely and publicly available to the global community. The archive holds 3D structure data for large biological molecules (proteins, DNA, and RNA).
Other databases and tools for structural protein analysis are: SCOP (Medical Research Council (MRC), UK), which provides a detailed and comprehensive description of the structural and evolutionary relationships between proteins whose three-dimensional structure is known and deposited in the Protein Data Bank; InterPro, which provides functional analysis of proteins by classifying them into families and predicting domains and important sites; ModBase (University of California at San Francisco), which provides three-dimensional protein models calculated by comparative modeling with the MODELLER programme; DisProt (University of Padua), a manually curated repository of intrinsically disordered proteins covering both structural and functional aspects (improved and updated in 2022); AlphaFold DB (AlphaFold Protein Structure Database), created by DeepMind and EMBL-EBI and based on highly accurate protein structure prediction with AlphaFold using Artificial Intelligence (AI); and RosettaCommons (University of Washington), which offers algorithms for computational modeling and analysis of protein structures, including de novo protein design, enzyme design, ligand docking, and structure prediction of biological macromolecules and macromolecular complexes.
Given this broad set of available data, different platforms have emerged to integrate the information produced over the years. Among them, UniProt (Universal Protein Resource), established in 2002, is the world's leading high-quality, comprehensive and freely accessible resource of protein sequence and functional information. It is maintained by a consortium of EMBL's European Bioinformatics Institute, the Protein Information Resource and the Swiss Institute of Bioinformatics. For each protein entry, UniProt offers information about taxonomy, subcellular localization, function, Gene Ontology annotation, sequence, isoform variants, tissue specificity, expression, structural characteristics (binding sites, domains), 3D structure, molecular interactions, post-translational modifications (PTM), mass spectrometry information, and references and bibliography. This hub for functional and structural information on proteins provides accurate, consistent and rich annotation in UniProtKB/Swiss-Prot.
More recently, the pan-European infrastructure ELIXIR has emerged to bring together life science resources from across Europe. These open resources include databases, software tools, training materials, cloud storage and supercomputers. ELIXIR's mission is to coordinate database resources and analysis platforms, and to build national nodes and training capacity, forming a single federated infrastructure that makes it easier for researchers to find and share data, develop software and exchange experiences and best practices in the life sciences. It does not offer individual services such as access to biological samples, computational analysis or specific instruments. All software and database resources it offers are openly accessible and usable. The infrastructure currently has 22 national nodes which, together with EMBL-EBI, form the backbone of the federation and coordinate local activities in line with its goal. The national nodes currently contribute more than 300 services covering different scientific fields within the life sciences, including biochemistry, interaction networks, evolution and phylogeny, genomics, proteomics, metagenomics and structural biology, among others. The ELIXIR service bio.tools helps users explore essential scientific and technical information about software tools, databases and services for bioinformatics, making them accessible and usable by the broad community of life scientists.