As part of World Cancer Day (4 February), the journal Molecular Oncology invited researchers to take part in a writing competition aimed at highlighting how research in other areas of the life sciences or technology influences the field of cancer biology and promotes cancer research. This entry, by Óscar González-Velasco, received the second prize.
New generative Artificial Intelligence (AI) methods, based on Large Language Models (LLMs) for text generation like ChatGPT, PaLM and LLaMA, and on diffusion models for image creation like DALL-E 2, Midjourney and Stable Diffusion 2, have opened a new framework for creating novel entities, including text, images, and other data. These AI creations share the characteristics and functionalities of real-world or human-made objects, but have no natural analogs. While there are still many challenges and problems to overcome, these AI methods are becoming an integral part of daily life, even for professional activities, and over the next decades they will drive a profound transformation in human society. The advances are especially remarkable in language models, which are capable of human-level text generation. These new generative AI methods are now also starting to understand, and speak, the language of life: that of DNA and protein sequences.
The first highly successful AI model in molecular biology that makes partial use of generative models has been AlphaFold2, a protein folding prediction tool that can read amino acid sequences and predict their 3-D structure. It uses what is known as a transformer, a state-of-the-art deep learning model typically used in natural language processing to understand how words relate to one another within a sentence depending on position and context; here, the same idea is applied to the amino acids of a protein sequence. Transformers are capable of learning patterns that arise in complex systems using a multilayered architecture of neural networks.
Now, a new generation of these algorithms is beginning to design de novo artificial molecules. Traditional protein engineering methods relied on iterative mutagenesis to identify and select structural and functional properties. More recent methods use computational simulations of biophysical properties or evolution-based models for the design and identification of structures with desired stability and functionality. However, each of these methods has its limitations. For example, some are restricted to specific protein families, while others may be computationally expensive or even infeasible to simulate.
Researchers and pharmaceutical companies are starting to use generative models to efficiently produce and test thousands of candidates; new, more sophisticated, data-driven models are now being developed to produce bioactive compounds, monoclonal antibodies, and proteins with specific functions. As an example, a recent paper by Madani et al. demonstrated the potential of LLMs to read protein sequences and create new ones from scratch with a predictable function and across a wide range of protein families. This is not just a toy experiment or a curiosity: the researchers tested artificial proteins from five distinct lysozyme families and showed that they have catalytic efficiencies similar to those of natural lysozymes, despite having only 31.4% sequence identity with their natural counterparts.
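To make the notion of sequence identity more concrete, here is a minimal sketch of how a percent identity can be computed between two sequences that have already been aligned. It is an illustration only: the function name `percent_identity`, the choice to ignore gapped positions in the denominator, and the two short toy sequences are all assumptions made for this example, not the procedure or data used by Madani et al.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity between two pre-aligned sequences of equal length.

    Assumption: columns where either sequence has a gap are skipped, so the
    denominator is the number of gap-free aligned positions. Other
    conventions (e.g. dividing by the full alignment length) also exist.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    # Keep only the alignment columns where both sequences have a residue.
    columns = [
        (a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap
    ]
    if not columns:
        return 0.0

    # Count the columns where the two residues are identical.
    matches = sum(a == b for a, b in columns)
    return 100.0 * matches / len(columns)


# Hypothetical toy alignment, not real lysozyme sequences.
natural = "MKT-LLVA"
designed = "MRTALL-A"
identity = percent_identity(natural, designed)
print(f"{identity:.1f}% identity")
```

With this convention, a value near 30% means that roughly seven out of every ten aligned residues differ between the designed protein and its natural counterpart.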
These new advances could be groundbreaking for some areas of cancer research. This is particularly true for monoclonal antibodies, which can be pre-designed with specific functionalities by AI models, but the same approach could also be applied to build genetically modified T cells with both T cell receptors (TCRs) and chimeric antigen receptors (CARs).
As a postdoc in bioinformatics at the German Cancer Research Center (DKFZ), funded by the AI Health Innovation Cluster program, my focus is to apply AI to biomolecular data. As an example, I am using deep learning convolutional networks to predict the primary site of Cancers of Unknown Primary, as well as their response to treatment, using RNA-Seq data. My aim is not only to help at the diagnosis stage, but also to use explainable machine learning methods to understand the gene markers and networks that drive the AI classification. One of our next goals is to use sequencing data with LLMs, so that we can design sequences that match our needs for specific biological questions, and use these complex models to understand, for example, immune system adaptability.
Despite some challenges, such as the availability of large public collections of biomedical and molecular data or the requirement for large infrastructures and their associated energy consumption, the emergence of new generative AI technologies will revolutionize several scientific fields and industries in the coming years. This includes cancer research, where AI could facilitate the creation of new molecules, including the design of new drugs and antibodies based on desired features, or even genetically engineered immune cells that we hope will one day be able to fight cancer for us.