Nucleotide transformer: building and evaluating robust foundation models for human genomics.
H. Dalla-Torre, L. Gonzalez, J. Mendoza-Revilla, N. Lopez Carranza, A. H. Grzywaczewski, F. Oteri, C. Dallago, E. Trop, B. P. De Almeida, H. Sirelkhatim, G. Richard, M. Skwark, K. Beguir, M. Lopez, and T. Pierrot. Nucleotide transformer: building and evaluating robust foundation models for human genomics. Nature Methods, 22(2):287–297, Feb. 2025. ISSN 1548-7091, 1548-7105. doi: 10.1038/s41592-024-02523-z. URL https://www.nature.com/articles/s41592-024-02523-z.
References
Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. Preprint at https://arxiv.org/abs/1810.04805 (2018).
Brown, T. et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 33, 1877–1901 (2020).
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
Rives, A. et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl Acad. Sci. USA 118, e2016239118 (2021).
Elnaggar, A. et al. ProtTrans: towards cracking the language of life’s code through self-supervised deep learning and high performance computing. IEEE Trans. Pattern Anal. Mach. Intell. 44, 7112–7127 (2022).
Littmann, M., Heinzinger, M., Dallago, C., Olenyi, T. & Rost, B. Embeddings from deep learning transfer GO annotations beyond homology. Sci. Rep. 11, 1–14 (2021).
Marquet, C. et al. Embeddings from protein language models predict conservation and variant effects. Hum. Genet. 141, 1629–1647 (2022).
Littmann, M., Heinzinger, M., Dallago, C., Weissenow, K. & Rost, B. Protein embeddings and deep learning predict binding residues for various ligand classes. Sci. Rep. 11, 23916 (2021).
Avsec, Ž. et al. Base-resolution models of transcription-factor binding reveal soft motif syntax. Nat. Genet. 53, 354–366 (2021).
Zhou, J. & Troyanskaya, O. G. Predicting effects of noncoding variants with deep learning–based sequence model. Nat. Methods 12, 931–934 (2015).
Mateo, L. J., Sinnott-Armstrong, N. & Boettiger, A. N. Tracing DNA paths and RNA profiles in cultured cells and tissues with ORCA. Nat. Protoc. 16, 1647–1713 (2021).
de Almeida, B. P., Reiter, F., Pagani, M. & Stark, A. DeepSTARR predicts enhancer activity from DNA sequence and enables the de novo design of synthetic enhancers. Nat. Genet. 54, 613–624 (2022).
Eraslan, G., Avsec, Ž., Gagneur, J. & Theis, F. J. Deep learning: new computational modelling techniques for genomics. Nat. Rev. Genet. 20, 389–403 (2019).
Zhou, J. et al. Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk. Nat. Genet. 50, 1171–1179 (2018).
Kelley, D. R. Cross-species regulatory sequence activity prediction. PLOS Comput. Biol. 16, e1008050 (2020).
Kelley, D. R. et al. Sequential regulatory activity prediction across chromosomes with convolutional neural networks. Genome Res. 28, 739–750 (2018).
Agarwal, V. & Shendure, J. Predicting mRNA abundance directly from genomic sequence using deep convolutional neural networks. Cell Rep. 31, 107663 (2020).
Chen, K. M., Wong, A. K., Troyanskaya, O. G. & Zhou, J. A sequence-based global map of regulatory activity for deciphering human genetics. Nat. Genet. 54, 940–949 (2022).
Avsec, Ž. et al. Effective gene expression prediction from sequence by integrating long-range interactions. Nat. Methods 18, 1196–1203 (2021).
Ji, Y., Zhou, Z., Liu, H. & Davuluri, R. V. DNABERT: pre-trained bidirectional encoder representations from transformers model for DNA-language in genome. Bioinformatics 37, 2112–2120 (2021).
Zvyagin, M. T. et al. GenSLMs: genome-scale language models reveal SARS-CoV-2 evolutionary dynamics. Preprint at bioRxiv https://doi.org/10.1101/2022.10.10.511571 (2022).
Outeiral, C. & Deane, C. M. Codon language embeddings provide strong signals for protein engineering. Nat. Mach. Intell. 6, 170–179 (2024).
Zhou, Z. et al. DNABERT-2: efficient foundation model and benchmark for multi-species genome. in Proceedings of the Twelfth International Conference on Learning Representations https://openreview.net/pdf?id=oMLQB4EZE1 (ICLR, 2024).
Fishman, V. et al. GENA-LM: a family of open-source foundational models for long DNA sequences. Preprint at bioRxiv https://doi.org/10.1101/2023.06.12.544594 (2023).
Nguyen, E. et al. HyenaDNA: long-range genomic sequence modeling at single nucleotide resolution. in 37th Conference on Neural Information Processing Systems https://openreview.net/pdf?id=ubzNoJjOKj (NeurIPS, 2023).
Mendoza-Revilla, J. et al. A foundational large language model for edible plant genomes. Commun. Biol. 7, 835 (2024).
Rae, J. W. et al. Scaling language models: methods, analysis & insights from training Gopher. Preprint at https://arxiv.org/abs/2112.11446 (2021).
The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 526, 68–74 (2015).
Harrow, J. et al. GENCODE: the reference human genome annotation for the ENCODE project. Genome Res. 22, 1760–1774 (2012).
Meylan, P., Dreos, R., Ambrosini, G., Groux, R. & Bucher, P. EPD in 2020: enhanced data visualization and extension to ncRNA promoters. Nucleic Acids Res. 48, D65–D69 (2020).
The ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
The ENCODE Project Consortium. Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature 583, 699–710 (2020).
Meuleman, W. et al. Index and biological spectrum of human DNase I hypersensitive sites. Nature 584, 244–251 (2020).
Li, F.-Z., Amini, A. P., Yang, K. K. & Lu, A. X. Pretrained protein language model transfer learning: is the final layer representation what we want? in Machine Learning for Structural Biology Workshop (NeurIPS, 2022).
Liu, H. et al. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. in 36th Conference on Neural Information Processing Systems (NeurIPS, 2022).
Jaganathan, K. et al. Predicting splicing from primary sequence with deep learning. Cell 176, 535–548 (2019).
Benegas, G., Batra, S. S. & Song, Y. S. DNA language models are powerful zero-shot predictors of non-coding variant effects. Proc. Natl Acad. Sci. USA 120, e2311219120 (2023).
Vig, J. et al. BERTology meets biology: interpreting attention in protein language models. in Proceedings of the International Conference on Learning Representations 2021 https://openreview.net/pdf?id=YWtLZvLmud7 (ICLR, 2021).
Braun, S. et al. Decoding a cancer-relevant splicing decision in the RON proto-oncogene using high-throughput mutagenesis. Nat. Commun. 9, 3315 (2018).
McLaren, W. et al. The Ensembl variant effect predictor. Genome Biol. 17, 1–14 (2016).
Lappalainen, T. et al. Transcriptome and genome sequencing uncovers functional variation in humans. Nature 501, 506–511 (2013).
The GTEx Consortium. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science 369, 1318–1330 (2020).
Võsa, U. et al. Large-scale cis- and trans-eQTL analyses identify thousands of genetic loci and polygenic scores that regulate blood gene expression. Nat. Genet. 53, 1300–1310 (2021).
Chowdhery, A. et al. PaLM: scaling language modeling with pathways. J. Mach. Learn. Res. 24, 11324–11436 (2023).
Hoffmann, J. et al. Training compute-optimal large language models. in 36th Conference on Neural Information Processing Systems https://proceedings.neurips.cc/paper_files/paper/2022/file/c1e2faff6f588870935f114ebe04a3e5-Paper-Conference.pdf (NeurIPS, 2022).
Rogers, A., Kovaleva, O. & Rumshisky, A. A primer in BERTology: what we know about how BERT works. Trans. Assoc. Comput. Linguist. 8, 842–866 (2020).
Stärk, H., Dallago, C., Heinzinger, M. & Rost, B. Light attention predicts protein location from the language of life. Bioinform. Adv. 1, vbab035 (2021).
Zou, J. et al. A primer on deep learning in genomics. Nat. Genet. 51, 12–18 (2019).
Wang, A. et al. SuperGLUE: a stickier benchmark for general-purpose language understanding systems. in 33rd Conference on Neural Information Processing Systems https://papers.nips.cc/paper_files/paper/2019/file/4496bf24afe7fab6f046bf4923da8de6-Paper.pdf (NeurIPS, 2019).
Hendrycks, D. & Gimpel, K. Gaussian error linear units (GELUs). Preprint at https://arxiv.org/abs/1606.08415 (2016).
Su, J. et al. RoFormer: enhanced transformer with rotary position embedding. Neurocomputing 568, 127063 (2024).
Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. Preprint at https://arxiv.org/abs/1412.6980v5 (2015).
Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Bergstra, J., Bardenet, R., Bengio, Y. & Kégl, B. Algorithms for hyper-parameter optimization. in Advances in Neural Information Processing Systems 24 https://papers.nips.cc/paper_files/paper/2011/file/86e8f7ab32cfd12577bc2619bc635690-Paper.pdf (NeurIPS, 2011).
Byrska-Bishop, M. et al. High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios. Cell 185, 3426–3440 (2022).
Kim, D., Paggi, J. M., Park, C., Bennett, C. & Salzberg, S. L. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat. Biotechnol. 37, 907–915 (2019).
Leslie, R., O’Donnell, C. J. & Johnson, A. D. GRASP: analysis of genotype–phenotype results from 1390 genome-wide association studies and corresponding open access database. Bioinformatics 30, i185–i194 (2014).
Landrum, M. J. et al. ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Res. 46, D1062–D1067 (2018).
Stenson, P. D. et al. The Human Gene Mutation Database (HGMD®): optimizing its use in a clinical diagnostic or research setting. Hum. Genet. 139, 1197–1207 (2020).
Related documents on Qiita
Making a reference list from a bioRxiv PDF file
https://qiita.com/kaizen_nagoya/items/75f6f93ce9872a5d622d
Genome modeling and design across all domains of life with Evo 2
https://qiita.com/kaizen_nagoya/items/eecda74f758008633ee2
BioReason: Incentivizing multimodal biological reasoning within a DNA-LLM model
https://qiita.com/kaizen_nagoya/items/0718b214043a614deee0
McKusick’s Online Mendelian Inheritance in Man (OMIM®)
https://qiita.com/kaizen_nagoya/items/c599d867201d1ffb1f4d
Anthropic. Claude 3.7 Sonnet
https://qiita.com/kaizen_nagoya/items/4364d9c475114353cf2a
Genomic language models: Opportunities and challenges
https://qiita.com/kaizen_nagoya/items/f797330e64e0c7d05f39
A DNA language model based on multispecies alignment predicts the effects of genome-wide variants
https://qiita.com/kaizen_nagoya/items/6e8858c2395dcc98804a
A genomic mutational constraint map using variation in 76,156 human genomes
https://qiita.com/kaizen_nagoya/items/e799ad85ee98bb2a8cf6
Nucleotide transformer: building and evaluating robust foundation models for human genomics
https://qiita.com/kaizen_nagoya/items/1c147c2b095364f04ef7
DeepSeek-AI
https://qiita.com/kaizen_nagoya/items/bb5ee9f17c03e07659d8
CodonTransformer: A multispecies codon optimizer using context-aware neural networks.
https://qiita.com/kaizen_nagoya/items/d4be1d4dd9eb307f09cc
MedRAX: Medical reasoning agent for chest X-ray
https://qiita.com/kaizen_nagoya/items/94c7835b2f461452b2e7
Benchmarking DNA foundation models for genomic sequence classification.
https://qiita.com/kaizen_nagoya/items/01e3dde0d8274fee0fd8
LoRA: Low-rank adaptation of large language models
https://qiita.com/kaizen_nagoya/items/877058f681d77808b44c
KEGG Pull: a software package for the RESTful access and pulling from the Kyoto Encyclopedia of Gene and Genomes.
https://qiita.com/kaizen_nagoya/items/05be40565793f2b4f7f3
GeneGPT: augmenting large language models with domain tools for improved access to biomedical information.
https://qiita.com/kaizen_nagoya/items/8897792ff52fb5e68a46
KEGG: biological systems database as a model of the real world.
https://qiita.com/kaizen_nagoya/items/f63573043eaf8f9c6a2c
Entrez Direct: E-utilities on the Unix command line
https://qiita.com/kaizen_nagoya/items/cc4bbde566e67abc93d9
ClinVar: Public archive of relationships among sequence variation and human phenotype.
https://qiita.com/kaizen_nagoya/items/8149b7a5a4f930490fad
BioBERT: a pre-trained biomedical language representation model for biomedical text mining.
https://qiita.com/kaizen_nagoya/items/63781eb6db1fc2ded80a
Progress and opportunities of foundation models in bioinformatics. Briefings in Bioinformatics
https://qiita.com/kaizen_nagoya/items/6ef20eaf796532fed6f8
BEND: Benchmarking DNA language models on biologically meaningful tasks.
https://qiita.com/kaizen_nagoya/items/8417e72454d2107a9d06
HyenaDNA: Long-range genomic sequence modeling at single nucleotide resolution.
https://qiita.com/kaizen_nagoya/items/07e1ba1138b0825c8a73
GPT-4o system card.
https://qiita.com/kaizen_nagoya/items/06e4c54af663456b49f9
Qwen2.5 technical report.
https://qiita.com/kaizen_nagoya/items/f87fac7f9a83f54328fe
DeepSeekMath: Pushing the limits of mathematical reasoning in open language models
https://qiita.com/kaizen_nagoya/items/2add7a80056850b9ce87
dbSNP: the NCBI database of genetic variation.
https://qiita.com/kaizen_nagoya/items/756da32e4c0868d84da0
COSMIC: a curated database of somatic variants and clinical data for cancer.
https://qiita.com/kaizen_nagoya/items/2b0960d4e1ff26a9b01f
RoFormer: Enhanced transformer with rotary position embedding
https://qiita.com/kaizen_nagoya/items/a12a45518f28a5133af2
TxGemma: Efficient and agentic LLMs for therapeutics.
https://qiita.com/kaizen_nagoya/items/e4eff5d51f926e943b9e
Qwen2 technical report.
https://qiita.com/kaizen_nagoya/items/29a77b25282c8822011e
Qwen2.5 technical report
https://qiita.com/kaizen_nagoya/items/c275c2e1f24bbc5019c1
Scientific large language models: A survey on biological and chemical domains.
https://qiita.com/kaizen_nagoya/items/6505717d7c4769a4ff31
DNABERT-2: Efficient foundation model and benchmark for multi-species genome
https://qiita.com/kaizen_nagoya/items/d711266990ec2bed35f2
Striped-Hyena: Moving Beyond Transformers with Hybrid Signal
https://qiita.com/kaizen_nagoya/items/423495ed798c1eaf89f1
Nucleotide transformer: building and evaluating robust foundation models for human genomics.
https://qiita.com/kaizen_nagoya/items/fe607a3aaf7ffb309d33