Protein structure prediction took a big step forward in 2020 with the release of AlphaFold, but computational modeling of RNA remains partially inaccurate for secondary structures over 500 nucleotides and largely inaccurate for tertiary structure, even with neural network methods, as assessed by the worldwide structure prediction challenge CASP (1). Machine learning requires large datasets, and these simply don't exist for RNA structures. What if there were a way to generate a large dataset? Would machine learning be able to improve on current structure prediction algorithms? That is the question Eterna and the Das Lab (Stanford) are trying to answer through the Kaggle Ribonanza competition, which launched in September 2023.
The idea was to generate a large, diverse dataset of RNA sequences of different lengths and chemically probe them in the Das Lab using 1M7, 2A3, and DMS high-throughput methods. The resulting reactivity data was then split into a training dataset (806,578 sequences of length 50-100) and a test dataset (1,343,823 sequences of length 100-240) for the Ribonanza competition. (Note: all lengths exclude leader, barcode, and tail.) The work builds on the success of the OpenVaccine competition held in September 2020, which combined two crowdsourcing platforms to produce a computational model for predicting RNA degradation patterns (2).
Community scientists take up the challenge
You might be wondering where the Eterna citizen science game fits into all this. The Eterna players submitted 4,902 sequences through a puzzle interface that were tested during the piloting of the experimental protocol, and 678,987 sequences for subsequent experimental testing. The Eterna puzzle interface allowed players to submit known natural pseudoknots from the PseudoBase++2.0 database, create new pseudoknots using NUPACK and other heuristic methods, and scan portions of genomes for potential novel pseudoknots. The Das Lab researchers generated the remaining sequences through viral windowing and mutate-and-map procedures. Combining multiple sequence sources ensured a diverse set of sequences both for training and testing the models.
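For readers unfamiliar with the two lab-side procedures named above, the following is a minimal sketch of what windowing and mutate-and-map library generation could look like. It is not the Das Lab pipeline: the function names, the window and step sizes, the toy sequence, and the rule of mutating each nucleotide to its complement are all assumptions made for illustration.

```python
# Hypothetical helpers for illustration only; not the Das Lab's actual code.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def sliding_windows(genome, window, step):
    """Yield (start, subsequence) for fixed-length windows tiled across a genome."""
    for start in range(0, len(genome) - window + 1, step):
        yield start, genome[start:start + window]


def single_mutants(seq):
    """Yield (position, mutant), swapping one nucleotide for its complement.

    Assumption: complement substitution is used as the mutation rule.
    """
    for i, nt in enumerate(seq):
        yield i, seq[:i] + COMPLEMENT[nt] + seq[i + 1:]


genome = "GGGAAACUUCGGUUUCCC"  # toy sequence, not real data
windows = list(sliding_windows(genome, window=10, step=4))
mutants = list(single_mutants(windows[0][1]))
```

Each window or mutant would then be one entry in the library sent for chemical mapping. Comparing the reactivity profile of each mutant with that of the unmutated sequence is what makes a mutate-and-map experiment informative.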
Ribonanza competition results
The top six models of the Ribonanza competition featured advanced transformer architectures and attention biased with pair features, and needed base pair probabilities (bpp) generated by EternaFold to outperform other models. Some of the models came from experienced researchers with prior successes in bioinformatics competitions, but most of the top models came from teams with no prior biological experience or even experience with the transformer architecture. The top models were able to predict chemical mapping patterns for mutants that look just like ‘mutate-and-map’ experiments, where mutations introduced in one position lead to variation in chemical reactivity at other positions that are involved in the same base pair or RNA structure.
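To make "attention biased with pair features" concrete, here is a minimal single-head sketch in NumPy, assuming the base pair probability matrix is simply added to the attention logits with a fixed scale. The function name `pair_biased_attention`, the `bias_scale` parameter, and the toy inputs are hypothetical. The winning models are not reproduced here, and real architectures typically learn how pair features are projected into the bias rather than using a fixed scalar.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def pair_biased_attention(q, k, v, bpp, bias_scale=1.0):
    """Single-head scaled dot-product attention with an additive pair bias.

    q, k, v: arrays of shape (L, d), one row per nucleotide.
    bpp:     array of shape (L, L) of base pair probabilities.
    Assumption: the bias is bias_scale * bpp (a fixed, unlearned scalar).
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d) + bias_scale * bpp
    weights = softmax(logits, axis=-1)
    return weights @ v, weights


rng = np.random.default_rng(0)
L, d = 6, 4
q, k, v = (rng.normal(size=(L, d)) for _ in range(3))

bpp = np.zeros((L, L))
bpp[0, 5] = bpp[5, 0] = 0.9  # toy values: positions 0 and 5 likely paired

out, w = pair_biased_attention(q, k, v, bpp, bias_scale=4.0)
_, w_plain = pair_biased_attention(q, k, v, np.zeros((L, L)))
```

With the bias in place, position 0 attends more strongly to position 5 than it does without it. That is the intuition behind feeding predicted pairing information into the attention mechanism.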
These results indicate that the models ‘understand’ RNA structure, including complex motifs such as pseudoknots and kissing loops. Nevertheless, for other RNAs, different models trained on this same data set produced different predictions, suggesting that there’s more work to be done.
Increasing the throughput of sequence design, chemical mapping, and model training should be technically feasible – and provide another showcase for dual crowdsourcing to the vibrant communities participating on Eterna and Kaggle. In addition, it remains to be seen if these models can be used to predict 3D structure. The researchers hope that they and the broader community will take this additional step in time for another big competition coming up soon – the next round of the CASP 3D structure prediction trials will begin in April 2024.
Share your sequences or host a competition?
The Das Lab is currently accepting sequence sets for Ribonanza 2 chemical mapping. The goal is to construct the world's largest RNA SHAPE mapping library of diverse sequences, from a diverse set of contributors. Researchers who have sequences that they want to see tested are invited to fill out the request form to obtain submission access. The terms and instructions for sequence and other data submission are on the form. The submission deadline is January 15.
And for researchers and institutions interested in hosting their own competition, it is worth noting that Kaggle offers a research grant program. Competitions can help the host team make progress on their machine learning research problem, and they also act as a challenge to the Kaggle Community. Find out more about the eligibility and selection criteria to access these funds on the Kaggle Competitions Research Grants program page.
References
- Das R, Kretsch RC, Simpkin AJ, et al. Assessment of three-dimensional RNA structure prediction in CASP15. Proteins. 2023; 91(12): 1747-1770. doi:10.1002/prot.26602
- Wayment-Steele HK, Kladwang W, Watkins AM, et al. Deep learning models for predicting RNA degradation via dual crowdsourcing. Nat Mach Intell. 2022; 4: 1174-1184. https://doi.org/10.1038/s42256-022-00571-8
Top image by Sharif Ezzat