Biomedical research plays a key role in the big data revolution. The ability to generate large amounts of genomic data cheaply through improvements in sequencing technologies enables us to perform exome and genome sequencing on hundreds of thousands of patients, allowing us to probe the genetic basis of disease. It also lets us better understand the molecular underpinnings of disease through a multitude of functional -omic technologies. In terms of phenotype data, technological advances have enabled us to assimilate patient history from multiple cohorts of patients, through improvements in software, the use of controlled vocabularies such as the Human Phenotype Ontology (HPO), and through advanced techniques such as text mining and natural language processing. Better still, there are initiatives that actively seek to share pathological phenotype data from patients between groups, allowing the identification of patterns and helping connect phenotype to genotype.
For rare disease, these technological improvements are of particular importance. Taken together, rare diseases affect an estimated 36 million people in the EU; however, each individual rare disease affects only a small number of people. As a result, a doctor is unlikely to see more than a handful of patients with a given rare disease. This makes rare diseases difficult to study, as most approaches to uncovering genetic associations between variants and disease, with the aim of understanding the causes and underlying molecular mechanisms, require many patients to obtain the necessary statistical power. To this end, multiple projects such as DECIPHER, the Deciphering Developmental Disorders Project, and the 100,000 Genomes Project have stepped in, allowing data from patients with rare and often undiagnosed disorders to be shared globally. For example, the DECIPHER project allows medical professionals from all over the world to upload phenotypic and genomic data pertaining to patients with rare disorders to a centralized database. This helps individual users, effectively allowing doctors to communicate their findings on a larger scale. It also allows the identification of patterns across multiple patients, which can be key to finding out more about rare disease.
These projects have been fundamental to better understanding and diagnosing diseases. However, they are not the only piece of the puzzle. The ever-increasing datasets require the development of suitable computational tools for their analysis. The days in which researchers could feasibly open their results in a spreadsheet and manually inspect each datapoint are long behind us. This is where bioinformatics and computational systems biology come in. Our group at the University of Malaga, forming part of the Spanish Rare Disease Research Network (CIBERER) and IBIMA-RARE, part of the Biomedical Research Institute of Malaga (IBIMA), has developed a range of tools to analyse large datasets of rare disease patients. These include PhenCo, which looks for patterns of co-occurrence between pathological phenotypes across patients within a cohort. Taking as input a list of HPO phenotypic profiles, each representing a different patient, it identifies which phenotypes tend to occur together in the same individuals. This results in a long list of co-occurring pairs, which are then used to build a phenotype network and find clusters of phenotypes that tend to co-occur. This has applications for better understanding disease patterns and for guiding diagnosis by suggesting potentially missed phenotypes to test for. Similarly, we recently published a study looking at the co-occurrence of pathological phenotypes within neuromuscular disorders. This led to clusters of phenotypes with potential for diagnosis, including clusters that were indicative of different diseases depending on the presence or absence of specific phenotypes.
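The pair-counting step described above can be sketched in a few lines of Python. This is a minimal illustration, not PhenCo's implementation: the function name `cooccurring_pairs`, the `min_patients` threshold and the toy cohort are assumptions made here, and a raw count of shared patients stands in for whatever association statistic the tool actually applies.

```python
from collections import Counter
from itertools import combinations


def cooccurring_pairs(profiles, min_patients=2):
    """Count how many patients carry each unordered pair of HPO terms.

    `profiles` is a list of per-patient HPO term lists. Pairs seen in
    fewer than `min_patients` patients are dropped (hypothetical cutoff).
    """
    pair_counts = Counter()
    for profile in profiles:
        # De-duplicate and sort so each pair is counted once per patient
        # and always appears in the same order.
        terms = sorted(set(profile))
        pair_counts.update(combinations(terms, 2))
    return {pair: n for pair, n in pair_counts.items() if n >= min_patients}


# Toy cohort with illustrative HPO identifiers (not real patient data).
cohort = [
    ["HP:0001249", "HP:0001250", "HP:0000252"],
    ["HP:0001249", "HP:0001250"],
    ["HP:0001249", "HP:0004322"],
]
pairs = cooccurring_pairs(cohort)
```

On this toy cohort only the pair (HP:0001249, HP:0001250) survives the cutoff, as it is the only one shared by two patients. The surviving pairs could then be used as weighted edges when building a phenotype network.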
Other key computational methods for the analysis of phenotypic patient data include Phenomizer. The tool has a range of potential uses, but the underlying premise is that, given a list of phenotype terms representing a patient's phenotypic profile, Phenomizer will return potential diseases for differential diagnosis, including rare and orphan diseases, consistent with these phenotypes. There are several excellent tutorials and articles with more details on this resource.
Such tools are limited by the amount of phenotypic information available within the dataset under study. Even in global initiatives such as the DECIPHER project, many of the patients have only been assigned one or two unspecific phenotypes, such as “intellectual disability” or “abnormality of body height”. Such general phenotypes barely scratch the surface of the richness available in the HPO. Conversely, we have worked with datasets in which patients had an average of 25 HPO terms each, and these terms were much more specific. Such broader and deeper phenotyping leads to a better characterisation of the patient and, for algorithms that take into account the entire patient cohort, to much more precise results. However, it is not always clear whether a cohort contains a large amount of phenotypic information or not. To address this, our group developed the tool Cohort Analyzer, which takes as input a list of phenotypic profiles, each representing a patient within a cohort, and returns reports with multiple summary statistics, metrics and plots, which allow the user to better understand the depth and breadth of phenotyping in their cohort.
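To give a feel for the kind of summary such a report might contain, here is a small Python sketch that computes a few cohort-level figures from a list of phenotypic profiles. It is not Cohort Analyzer's code: the function `summarise_cohort`, the choice of metrics and the toy cohort are all assumptions for illustration, and the real tool reports considerably more than this.

```python
from collections import Counter
from statistics import mean, median


def summarise_cohort(profiles):
    """Return simple phenotyping statistics for a list of HPO profiles."""
    if not profiles:
        raise ValueError("the cohort must contain at least one profile")
    # Number of distinct terms annotated to each patient.
    sizes = [len(set(profile)) for profile in profiles]
    # Number of patients carrying each term.
    term_freq = Counter(term for profile in profiles for term in set(profile))
    return {
        "patients": len(profiles),
        "mean_terms_per_patient": mean(sizes),
        "median_terms_per_patient": median(sizes),
        "distinct_terms": len(term_freq),
        "most_common_terms": term_freq.most_common(3),
    }


# Toy cohort with illustrative HPO identifiers (not real patient data).
cohort = [
    ["HP:0001249", "HP:0001250", "HP:0000252"],
    ["HP:0001249", "HP:0001250"],
    ["HP:0001249", "HP:0004322"],
]
summary = summarise_cohort(cohort)
```

A low mean number of terms per patient, or a cohort dominated by a handful of very frequent terms, would be a warning sign that the phenotyping is too shallow for cohort-wide methods to work well.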
Whilst phenotype-based approaches are useful for better understanding patterns and predicting potential genetic causes, exome and genome sequencing is quickly becoming the go-to method for finding the genes underlying rare diseases. Once these data are generated, it is down to bioinformaticians, together with disease domain experts, to turn the relatively short reads, or DNA sequences, that come off the sequencing machine into useful and potentially actionable information regarding the gene(s) responsible for the disease.
This is a rapidly expanding area of rare-disease research, fuelled to a great extent by technological improvements in genome sequencing. The Spanish Rare Disease Research Network (CIBERER), for example, is generating a large amount of data in this area and has set up a Bioinformatics working group to develop pipelines for its analysis. One of the most important reference tools in this area is Exomiser, part of the Monarch Initiative. This tool looks for likely pathogenic variants within a patient’s genome and prioritizes them based on their predicted impact and the patient’s phenotypes, among other factors. This involves comparing the phenotypes related to the genes affected by the genetic variants, obtained from the Monarch Initiative, to the phenotypic profile of the patient. A fuller explanation of the tool, putting it in the context of how it can be applied to rare disease, can be found here, and documentation on how to run it here.
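The idea of comparing gene-associated phenotypes to a patient's profile can be illustrated with a deliberately simple Python sketch. It is not Exomiser's algorithm: plain Jaccard overlap is swapped in here as an easy-to-follow similarity measure, variant impact is ignored entirely, and the function names, gene labels and annotations are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of HPO terms."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_genes(patient_terms, gene_to_terms):
    """Rank candidate genes by phenotype overlap with the patient."""
    scores = {
        gene: jaccard(patient_terms, terms)
        for gene, terms in gene_to_terms.items()
    }
    # Highest-scoring genes first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical patient profile and gene-to-phenotype annotations.
patient = {"HP:0001249", "HP:0001250"}
annotations = {
    "GENE_A": {"HP:0001249", "HP:0001250", "HP:0000252"},
    "GENE_B": {"HP:0004322"},
}
ranking = rank_genes(patient, annotations)
```

Here GENE_A ranks first because it shares two of its three annotated phenotypes with the patient, whereas GENE_B shares none. A realistic prioritizer would also use the structure of the ontology, so that closely related but non-identical terms still contribute to the score.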
Other methodologies that seek to combine phenotype and genetic data include the PhenFun methodology, developed by our group. It models the relationship between phenotypes and genes using a network approach. It is based on the assumption that if many patients suffering from the same phenotype also share mutations mapping to the same gene, this gene is likely to be associated with the phenotype. In fact, PhenFun goes a step further: as well as associating phenotypes with multiple genes based on overlap, it also performs functional enrichment on the associated genes, to look for potential enriched functional systems. This can be of great use for understanding the molecular underpinnings of pathological phenotypes.
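The overlap assumption described above can be made concrete with a short Python sketch. It is not the PhenFun code: a one-sided hypergeometric test is used here as an assumed stand-in for the method's actual association score, and the function names, gene labels and toy cohort are hypothetical.

```python
from collections import defaultdict
from math import comb


def hypergeom_pvalue(total, with_phenotype, with_gene, overlap):
    """P(at least `overlap` patients share phenotype and gene by chance)."""
    upper = min(with_phenotype, with_gene)
    favourable = sum(
        comb(with_phenotype, k) * comb(total - with_phenotype, with_gene - k)
        for k in range(overlap, upper + 1)
    )
    return favourable / comb(total, with_gene)


def associate(patients):
    """Score every phenotype-gene pair seen together in at least one patient.

    `patients` is a list of (phenotype_set, affected_gene_set) tuples.
    Returns {(phenotype, gene): (shared_patients, p_value)}.
    """
    by_phenotype, by_gene = defaultdict(set), defaultdict(set)
    for idx, (phenotypes, genes) in enumerate(patients):
        for phenotype in phenotypes:
            by_phenotype[phenotype].add(idx)
        for gene in genes:
            by_gene[gene].add(idx)
    results = {}
    for phenotype, p_ids in by_phenotype.items():
        for gene, g_ids in by_gene.items():
            shared = len(p_ids & g_ids)
            if shared:
                p_value = hypergeom_pvalue(
                    len(patients), len(p_ids), len(g_ids), shared
                )
                results[(phenotype, gene)] = (shared, p_value)
    return results


# Toy cohort: (phenotypes, genes hit by variants); all labels illustrative.
toy_patients = [
    ({"HP:0001250"}, {"GENE_A"}),
    ({"HP:0001250"}, {"GENE_A"}),
    ({"HP:0001250"}, {"GENE_A"}),
    ({"HP:0004322"}, {"GENE_B"}),
    ({"HP:0004322"}, {"GENE_B"}),
    ({"HP:0004322"}, {"GENE_A"}),
]
associations = associate(toy_patients)
```

In this toy cohort all three patients with HP:0001250 carry a variant in GENE_A, which gives that pair a lower p-value than the single patient linking HP:0004322 to GENE_A. With only six patients nothing here would be statistically convincing, which is exactly why large shared cohorts matter. The same kind of test could in principle be reused for the functional enrichment step, by asking whether the genes linked to a phenotype are over-represented in a given functional category.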
Of course, there are many other approaches that seek to analyse entire cohorts of data. In fact, the most recent DECIPHER publication devoted an entire section to novel methods that seek to exploit the data in its entirety. We recommend that the reader consult this article for more details. There are also many other roles for computational biology and bioinformatics in rare disease research that we have not been able to touch on here, such as the analysis of “omics” data, where we go a step beyond the DNA sequence and start looking at the key players in cell function, including gene expression, epigenetic modifications, metabolomic effects and more. Analysing and extracting information from the many gigabytes of data generated by these technologies requires considerable computational effort. One thing is for sure: there will be a strong need for bioinformaticians in rare disease research in the coming years.
Our group's research is funded by the Spanish government (grants: SAF2016-78041-C2-1-R, PID2019-108096RB-C21, PID2019-105010RB-I00), the Andalusian regional government/FEDER/Fundación Progreso y Salud (UMA18-FEDERJA-102, UMA18-FEDERJA-220, PY20-00372, PY20_00257, PI-0075-2017), the University of Málaga and private funding bodies such as the Fundación Ramón Areces. The “CIBER de Enfermedades Raras (CIBERER)” is an initiative from the Instituto de Salud Carlos III (ISCIII).