Q. Zhang, K. Ding, T. Lyv, X. Wang, Q. Yin, Y. Zhang, J. Yu, Y. Wang, X. Li, Z. Xiang, K. Feng, X. Zhuang, Z. Wang, M. Qin, M. Zhang, J. Zhang, J. Cui, T. Huang, P. Yan, R. Xu, H. Chen, X. Li, X. Fan, H. Xing, and H. Chen. 2024. Scientific Large Language Models: A Survey on Biological and Chemical Domains. arXiv:2401.14656v2. URL https://arxiv.org/pdf/2401.14656v2.
REFERENCES
[1] Hisham Abdel-Aty and Ian R Gould. 2022. Large-scale distributed training of transformers for chemical fingerprinting. Journal of Chemical Information and Modeling 62, 20 (2022), 4852–4862.
[2] Hadi Abdine, Michail Chatzianastasis, Costas Bouyioukos, and Michalis Vazirgiannis. 2023. Prot2Text: Multimodal Protein’s Function Generation with GNNs and Transformers. arXiv:2307.14367 [q-bio.QM]
[3] Sanjar Adilov. 2021. Generative pre-training from molecules. (2021).
[4] Walid Ahmad, Elana Simon, Seyone Chithrananda, Gabriel Grand, and Bharath Ramsundar. 2022. Chemberta-2: Towards chemical foundation models. arXiv preprint arXiv:2209.01712 (2022).
[5] Microsoft Research AI4Science and Microsoft Azure Quantum. 2023. The Impact of Large Language Models on Scientific Discovery: A Preliminary Study using GPT-4. arXiv preprint arXiv:2311.07361 (2023).
[6] Sultan Alrowili and Vijay Shanker. 2021. BioM-Transformers: Building Large Biomedical Language Models with BERT, ALBERT and ELECTRA. In Proceedings of the 20th Workshop on Biomedical Language Processing. 221–227.
[7] Weizhi An, Yuzhi Guo, Yatao Bian, Hehuan Ma, Jinyu Yang, Chunyuan Li, and Junzhou Huang. 2022. MoDNA: motif-oriented pre-training for DNA language model. In Proceedings of the 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics. 1–5.
[8] Michael Ashburner, Catherine A Ball, Judith A Blake, David Botstein, Heather Butler, J Michael Cherry, Allan P Davis, Kara Dolinski, Selina S Dwight, Janan T Eppig, et al. 2000. Gene ontology: tool for the unification of biology. Nature genetics 25, 1 (2000), 25–29.
[9] Timothy F. Truong Jr and Tristan Bepler. 2023. PoET: A generative model of protein families as sequences-of-sequences. arXiv:2306.06156 [q-bio.QM]
[10] Žiga Avsec, Vikram Agarwal, Daniel Visentin, Joseph R Ledsam, Agnieszka Grabska-Barwinska, Kyle R Taylor, Yannis Assael, John Jumper, Pushmeet Kohli, and David R Kelley. 2021. Effective gene expression prediction from sequence by integrating long-range interactions. Nature methods 18, 10 (2021), 1196–1203.
[11] Simon Axelrod and Rafael Gomez-Bombarelli. 2022. GEOM, energy-annotated molecular conformations for property prediction and molecular generation. Scientific Data 9, 1 (2022), 185.
[12] Sarp Aykent and Tian Xia. 2022. Gbpnet: Universal geometric representation learning on protein structures. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 4–14.
[13] Viraj Bagal, Rishal Aggarwal, PK Vinod, and U Deva Priyakumar. 2021. MolGPT: molecular generation using a transformer-decoder model. Journal of Chemical Information and Modeling 62, 9 (2021), 2064–2076.
[14] Baichuan. 2023. Baichuan 2: Open Large-scale Language Models. arXiv preprint arXiv:2309.10305 (2023).
[15] Amos Bairoch and Rolf Apweiler. 2000. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic acids research 28, 1 (2000), 45–48.
[16] Suryanarayanan Balaji, Rishikesh Magar, Yayati Jadhav, and Amir Barati Farimani. 2023. GPT-MolBERTa: GPT Molecular Features Language Model for molecular property prediction. arXiv:2310.03030 [physics.chem-ph]
[17] Dominique Beaini, Shenyang Huang, Joao Alex Cunha, Gabriela Moisescu-Pareja, Oleksandr Dymov, Samuel Maddrell-Mander, Callum McLean, Frederik Wenkel, Luis Müller, Jama Hussein Mohamud, et al. 2023. Towards Foundational Models for Molecular Learning on Large-Scale Multi-Task Datasets. arXiv preprint arXiv:2310.04292 (2023).
[18] Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A Pretrained Language Model for Scientific Text. arXiv:1903.10676 [cs.CL]
[19] Asma Ben Abacha and Dina Demner-Fushman. 2019. A question-entailment approach to question answering. BMC bioinformatics 20 (2019), 1–23.
[20] Gonzalo Benegas, Carlos Albors, Alan J Aw, Chengzhong Ye, and Yun S Song. 2023. GPN-MSA: an alignment-based DNA language model for genome-wide variant effect prediction. bioRxiv (2023), 2023–10.
[21] Gonzalo Benegas, Sanjit Singh Batra, and Yun S Song. 2022. DNA language models are powerful zero-shot predictors of genome-wide variant effects. bioRxiv (2022), 2022–08.
[22] Gonzalo Benegas, Sanjit Singh Batra, and Yun S Song. 2023. DNA language models are powerful predictors of genome-wide variant effects. Proceedings of the National Academy of Sciences 120, 44 (2023), e2311219120.
[23] Mostapha Benhenda. 2017. ChemGAN challenge for drug discovery: can AI reproduce natural chemical diversity? (2017).
[24] Tristan Bepler and Bonnie Berger. 2021. Learning the protein language: Evolution, structure, and function. Cell systems 12, 6 (2021), 654–669.
[25] Bishwaranjan Bhattacharjee, Aashka Trivedi, Masayasu Muraoka, Muthukumaran Ramasubramanian, Takuma Udagawa, Iksha Gurung, Rong Zhang, Bharath Dandala, Rahul Ramachandran, Manil Maskey, et al. 2024. INDUS: Effective and Efficient Language Models for Scientific Applications. arXiv preprint arXiv:2405.10725 (2024).
[26] Lorenz C Blum and Jean-Louis Reymond. 2009. 970 million druglike small molecules for virtual screening in the chemical universe database GDB-13. Journal of the American Chemical Society 131, 25 (2009), 8732–8733.
[27] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. Advances in neural information processing systems 26 (2013).
[28] Emmanuel Boutet, Damien Lieberherr, Michael Tognolli, Michel Schneider, and Amos Bairoch. 2007. UniProtKB/Swiss-Prot: the manually annotated section of the UniProt KnowledgeBase. In Plant bioinformatics: methods and protocols. Springer, 89–112.
[29] Emmanuel Boutet, Damien Lieberherr, Michael Tognolli, Michel Schneider, Parit Bansal, Alan J Bridge, Sylvain Poux, Lydie Bougueleret, and Ioannis Xenarios. 2016. UniProtKB/Swiss-Prot, the manually annotated section of the UniProt KnowledgeBase: how to use the entry view. Plant bioinformatics: methods and protocols (2016), 23–54.
[30] Andres M Bran, Sam Cox, Oliver Schilter, Carlo Baldassari, Andrew D White, and Philippe Schwaller. 2023. ChemCrow: Augmenting large-language models with chemistry tools. arXiv:2304.05376 [physics.chem-ph]
[31] Andres M Bran and Philippe Schwaller. 2023. Transformers and Large Language Models for Chemistry and Drug Discovery. arXiv:2310.06083 [cs.LG]
[32] Nadav Brandes, Dan Ofer, Yam Peleg, Nadav Rappoport, and Michal Linial. 2022. ProteinBERT: a universal deep-learning model of protein sequence and function. Bioinformatics 38, 8 (2022), 2102–2110.
[33] Nathan Brown, Marco Fiscato, Marwin HS Segler, and Alain C Vaucher. 2019. GuacaMol: benchmarking models for de novo molecular design. Journal of chemical information and modeling 59, 3 (2019), 1096–1108.
[34] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020), 1877–1901.
[35] Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. 2023. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712 (2023).
[36] Hengxing Cai, Xiaochen Cai, Junhan Chang, Sihang Li, Lin Yao, Changxin Wang, Zhifeng Gao, Yongge Li, Mujie Lin, Shuwen Yang, et al. 2024. Sciassess: Benchmarking llm proficiency in scientific literature analysis. arXiv preprint arXiv:2403.01976 (2024).
[37] Hengxing Cai, Xiaochen Cai, Shuwen Yang, Jiankun Wang, Lin Yao, Zhifeng Gao, Junhan Chang, Sihang Li, Mingjun Xu, Changxin Wang, et al. 2024. Uni-SMART: Universal Science Multimodal Analysis and Research Transformer. arXiv preprint arXiv:2403.10301 (2024).
[38] Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, et al. 2024. Internlm2 technical report. arXiv preprint arXiv:2403.17297 (2024).
[39] Kathi Canese and Sarah Weis. 2013. PubMed: the bibliographic database. The NCBI handbook 2, 1 (2013).
[40] Yue Cao, Payel Das, Vijil Chenthamarakshan, Pin-Yu Chen, Igor Melnyk, and Yang Shen. 2021. Fold2seq: A joint sequence (1d)-fold (3d) embedding-based generative model for protein design. In International Conference on Machine Learning. PMLR, 1261–1271.
[41] Antje Chang, Lisa Jeske, Sandra Ulbrich, Julia Hofmann, Julia Koblitz, Ida Schomburg, Meina Neumann-Schaal, Dieter Jahn, and Dietmar Schomburg. 2021. BRENDA, the ELIXIR core data resource in 2021: new developments and updates. Nucleic acids research 49, D1 (2021), D498–D508.
[42] Jinho Chang and Jong Chul Ye. 2024. Bidirectional generation of structure and properties through a single molecular foundation model. Nature Communications 15, 1 (2024), 2323.
[43] Bo Chen, Xingyi Cheng, Yangli-ao Geng, Shen Li, Xin Zeng, Boyan Wang, Jing Gong, Chiming Liu, Aohan Zeng, Yuxiao Dong, et al. 2023. xTrimoPGLM: Unified 100b-scale pre-trained transformer for deciphering the language of protein. bioRxiv (2023), 2023–07.
[44] Dong Chen, Kaifu Gao, Duc Duy Nguyen, Xin Chen, Yi Jiang, Guo-Wei Wei, and Feng Pan. 2021. Algebraic graph-assisted bidirectional transformers for molecular property prediction. Nature communications 12, 1 (2021), 3521.
[45] Hanjie Chen, Zhouxiang Fang, Yash Singla, and Mark Dredze. 2024. Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions. arXiv:2402.18060 [cs.CL]
[46] Ken Chen, Yue Zhou, Maolin Ding, Yu Wang, Zhixiang Ren, and Yuedong Yang. 2023. Self-supervised learning on millions of pre-mRNA sequences improves sequence-based RNA splicing prediction. bioRxiv (2023), 2023–01.
[47] Linqing Chen, Weilei Wang, Zilong Bai, Peng Xu, Yan Fang, Jie Fang, Wentao Wu, Lizhi Zhou, Ruiji Zhang, Yubin Xia, et al. 2024. PharmGPT: Domain-Specific Large Language Models for Bio-Pharmaceutical and Chemistry. arXiv preprint arXiv:2406.18045 (2024).
[48] Qiyuan Chen and Cheng Deng. 2023. Bioinfo-Bench: A Simple Benchmark Framework for LLM Bioinformatics Skills Evaluation. bioRxiv (2023), 2023–10.
[49] Yangyang Chen, Zixu Wang, Lei Wang, Jianmin Wang, Pengyong Li, Dongsheng Cao, Xiangxiang Zeng, Xiucai Ye, and Tetsuya Sakurai. 2023. Deep generative model for drug design from protein target sequence. Journal of Cheminformatics 15, 1 (2023), 38.
[50] Yirong Chen, Zhenyu Wang, Xiaofen Xing, Huimin Zheng, Zhipei Xu, Kai Fang, Junhong Wang, Sihang Li, Jieling Wu, Qi Liu, and Xiangmin Xu. 2023. BianQue: Balancing the Questioning and Suggestion Ability of Health LLMs with Multi-turn Health Conversations Polished by ChatGPT. arXiv:2310.15896 [cs.CL]
[51] Yiqun Chen and James Zou. 2023. GenePT: A Simple But Effective Foundation Model for Genes and Cells Built From ChatGPT. bioRxiv (2023).
[52] Zeming Chen, Alejandro Hernández Cano, Angelika Romanou, Antoine Bonnet, Kyle Matoba, Francesco Salvi, Matteo Pagliardini, Simin Fan, Andreas Köpf, Amirkeivan Mohtashami, Alexandre Sallinen, Alireza Sakhaeirad, Vinitra Swamy, Igor Krawczuk, Deniz Bayazit, Axel Marmet, Syrielle Montariol, Mary-Anne Hartley, Martin Jaggi, and Antoine Bosselut. 2023. MEDITRON-70B: Scaling Medical Pretraining for Large Language Models. arXiv:2311.16079 [cs.CL]
[53] Jun Cheng, Muhammed Hasan Çelik, Thi Yen Duong Nguyen, Žiga Avsec, and Julien Gagneur. 2019. CAGI 5 splicing challenge: improved exon skipping and intron retention predictions with MMSplice. Human mutation 40, 9 (2019), 1243–1251.
[54] Jun Cheng, Guido Novati, Joshua Pan, Clare Bycroft, Akvilė Žemgulytė, Taylor Applebaum, Alexander Pritzel, Lai Hong Wong, Michal Zielinski, Tobias Sargeant, et al. 2023. Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science 381, 6664 (2023), eadg7492.
[55] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. 2023. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality. https://vicuna.lmsys.org
[56] Gayane Chilingaryan, Hovhannes Tamoyan, Ani Tevosyan, Nelly Babayan, Lusine Khondkaryan, Karen Hambardzumyan, Zaven Navoyan, Hrant Khachatrian, and Armen Aghajanyan. 2022. Bartsmiles: Generative masked language models for molecular representations. arXiv preprint arXiv:2211.16349 (2022).
[57] Seyone Chithrananda, Gabriel Grand, and Bharath Ramsundar. 2020. ChemBERTa: large-scale self-supervised pretraining for molecular property prediction. arXiv preprint arXiv:2010.09885 (2020).
[58] Kwang-Hwi Cho, Kyoung Tai No, et al. 2023. iupacGPT: IUPAC-based large-scale molecular pre-trained model for property prediction and molecule generation. (2023).
[59] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2023. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research 24, 240 (2023), 1–113.
[60] Dimitrios Christofidellis, Giorgio Giannone, Jannis Born, Ole Winther, Teodoro Laino, and Matteo Manica. 2023. Unifying Molecular and Textual Representations via Multi-task Language Modelling. arXiv:2301.12586 [cs.LG]
[61] Simon Chu and Kathy Wei. 2023. Generative Antibody Design for Complementary Chain Pairing Sequences through Encoder-Decoder Language Model. In NeurIPS 2023 Generative AI and Biology (GenBio) Workshop.
[62] Kevin Clark, Minh-Thang Luong, Quoc V Le, and Christopher D Manning. 2020. Electra: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555 (2020).
[63] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge. arXiv:1803.05457 [cs.AI]
[64] Emily Clough and Tanya Barrett. 2016. The gene expression omnibus database. Statistical Genomics: Methods and Protocols (2016), 93–110.
[65] Micaela E Consens, Cameron Dufault, Michael Wainberg, Duncan Forster, Mehran Karimzadeh, Hani Goodarzi, Fabian J Theis, Alan Moses, and Bo Wang. 2023. To Transformers and Beyond: Large Language Models for the Genome. arXiv preprint arXiv:2311.07621 (2023).
[66] 1000 Genomes Project Consortium et al. 2015. A global reference for human genetic variation. Nature 526, 7571 (2015), 68.
[67] The UniProt Consortium. 2021. UniProt: the universal protein knowledgebase in 2021. Nucleic acids research 49, D1 (2021), D480–D489.
[68] The UniProt Consortium. 2023. UniProt: the universal protein knowledgebase in 2023. Nucleic Acids Research 51, D1 (2023), D523–D531.
[69] UniProt Consortium. 2019. UniProt: a worldwide hub of protein knowledge. Nucleic acids research 47, D1 (2019), D506–D515.
[70] ENCODE Project Consortium et al. 2012. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 7414 (2012), 57–74.
[71] Jiyu Cui, Fang Wu, Wen Zhang, Lifeng Yang, Jianbo Hu, Yin Fang, Peng Ye, Qiang Zhang, Xian Suo, Yiming Mo, et al. 2023. Direct prediction of gas adsorption via spatial atom interaction learning. Nature Communications 14, 1 (2023), 7043.
[72] Zhanbei Cui, Yu Liao, Tongda Xu, and Yan Wang. 2022. Geneformer: Learned gene compression using transformer-based context modeling. arXiv preprint arXiv:2212.08379 (2022).
[73] Hugo Dalla-Torre, Liam Gonzalez, Javier Mendoza-Revilla, Nicolas Lopez Carranza, Adam Henryk Grzywaczewski, Francesco Oteri, Christian Dallago, Evan Trop, Bernardo P de Almeida, Hassan Sirelkhatim, et al. 2023. The nucleotide transformer: Building and evaluating robust foundation models for human genomics. bioRxiv (2023), 2023–01.
[74] Christian Dallago, Jody Mou, Kadina E Johnston, Bruce J Wittmann, Nicholas Bhattacharya, Samuel Goldman, Ali Madani, and Kevin K Yang. 2021. FLIP: Benchmark tasks in fitness landscape inference for proteins. bioRxiv (2021), 2021–11.
[75] Dina Demner-Fushman, Marc D Kohli, Marc B Rosenman, Sonya E Shooshan, Laritza Rodriguez, Sameer Antani, George R Thoma, and Clement J McDonald. 2016. Preparing a collection of radiology examinations for distribution and retrieval. Journal of the American Medical Informatics Association 23, 2 (2016), 304–310.
[76] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL, Jill Burstein, Christy Doran, and Thamar Solorio (Eds.). Association for Computational Linguistics, 4171–4186.
[77] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[78] C Domínguez Conde, C Xu, LB Jarvis, DB Rainbow, SB Wells, T Gomes, SK Howlett, O Suchanek, K Polanski, HW King, et al. 2022. Cross-tissue immune cell analysis reveals tissue-specific features in humans. Science 376, 6594 (2022), eabl5197.
[79] René Dreos, Giovanna Ambrosini, Rouayda Cavin Périer, and Philipp Bucher. 2013. EPD and EPDnew, high-quality promoter resources in the next-generation sequencing era. Nucleic acids research 41, D1 (2013), D157–D164.
[80] Yuanqi Du, Tianfan Fu, Jimeng Sun, and Shengchao Liu. 2022. Molgensurvey: A systematic survey in machine learning models for molecule design. arXiv preprint arXiv:2203.14500 (2022).
[81] Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. 2021. Glm: General language model pretraining with autoregressive blank infilling. arXiv preprint arXiv:2103.10360 (2021).
[82] ChenRui Duan, Zelin Zang, Yongjie Xu, Hang He, Zihan Liu, Zijia Song, Ju-Sheng Zheng, and Stan Z. Li. 2024. FGBERT: Function-Driven Pre-trained Gene Language Model for Metagenomics. arXiv:2402.16901 [q-bio.GN]
[83] Carl Edwards, Tuan Lai, Kevin Ros, Garrett Honke, Kyunghyun Cho, and Heng Ji. 2022. Translation between Molecules and Natural Language. arXiv:2204.11817 [cs.CL]
[84] Carl Edwards, ChengXiang Zhai, and Heng Ji. 2021. Text2mol: Cross-modal molecule retrieval with natural language queries. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 595–607.
[85] Yasha Ektefaie, Andrew Shen, Daria Bykova, Maximillian Marin, Marinka Zitnik, and Maha R Farhat. 2024. Evaluating generalizability of artificial intelligence models for molecular datasets. bioRxiv (2024).
[86] Ahmed Elnaggar, Michael Heinzinger, Christian Dallago, Ghalia Rehawi, Yu Wang, Llion Jones, Tom Gibbs, Tamas Feher, Christoph Angerer, Martin Steinegger, Debsindhu Bhowmik, and Burkhard Rost. 2022. ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 44, 10 (2022), 7112–7127.
[87] Janan T Eppig, Cynthia L Smith, Judith A Blake, Martin Ringwald, James A Kadin, Joel E Richardson, and Carol J Bult. 2017. Mouse Genome Informatics (MGI): resources for mining mouse genetic, genomic, and biological data in support of primary and translational research. Systems Genetics: Methods and Protocols (2017), 47–73.
[88] Nicholas Evans and Stephen C Levinson. 2009. The myth of language universals: Language diversity and its importance for cognitive science. Behavioral and brain sciences 32, 5 (2009), 429–448.
[89] Benedek Fabian, Thomas Edlich, Héléna Gaspar, Marwin Segler, Joshua Meyers, Marco Fiscato, and Mohamed Ahmed. 2020. Molecular representation learning with language models and domain-relevant auxiliary tasks. arXiv preprint arXiv:2011.13230 (2020).
[90] Xiaomin Fang, Fan Wang, Lihang Liu, Jingzhou He, Dayong Lin, Yingfei Xiang, Kunrui Zhu, Xiaonan Zhang, Hua Wu, Hui Li, et al. 2023. A method for multiple-sequence-alignment-free protein structure prediction using a protein language model. Nature Machine Intelligence (2023), 1–10.
[91] Yin Fang, Xiaozhuan Liang, Ningyu Zhang, Kangwei Liu, Rui Huang, Zhuo Chen, Xiaohui Fan, and Huajun Chen. 2023. Mol-Instructions: A Large-Scale Biomolecular Instruction Dataset for Large Language Models. arXiv:2306.08018 [q-bio.QM]
[92] Yin Fang, Ningyu Zhang, Zhuo Chen, Xiaohui Fan, and Huajun Chen. 2023. Domain-agnostic molecular generation with self-feedback. CoRR, abs/2301.11259 (2023).
[93] Yin Fang, Qiang Zhang, Haihong Yang, Xiang Zhuang, Shumin Deng, Wen Zhang, Ming Qin, Zhuo Chen, Xiaohui Fan, and Huajun Chen. 2022. Molecular contrastive learning with chemical element knowledge graph. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36. 3968–3976.
[94] Yin Fang, Qiang Zhang, Ningyu Zhang, Zhuo Chen, Xiang Zhuang, Xin Shao, Xiaohui Fan, and Huajun Chen. 2023. Knowledge graph-enhanced molecular contrastive learning with functional prompt. Nature Machine Intelligence (2023), 1–12.
[95] Henri A Favre and Warren H Powell. 2013. Nomenclature of organic chemistry: IUPAC recommendations and preferred names 2013. Royal Society of Chemistry.
[96] EA Feingold, PJ Good, MS Guyer, S Kamholz, L Liefer, K Wetterstrand, FS Collins, TR Gingeras, D Kampa, EA Sekinger, et al. 2004. The ENCODE (ENCyclopedia of DNA elements) project. Science 306, 5696 (2004), 636–640.
[97] Kehua Feng, Keyan Ding, Weijie Wang, Xiang Zhuang, Zeyuan Wang, Ming Qin, Yu Zhao, Jianhua Yao, Qiang Zhang, and Huajun Chen. 2024. SciKnowEval: Evaluating Multi-level Scientific Knowledge of Large Language Models. arXiv preprint arXiv:2406.09098 (2024).
[98] Noelia Ferruz, Steffen Schmidt, and Birte Höcker. 2022. A deep unsupervised language model for protein design. bioRxiv (2022).
[99] Robert D Finn, Jaina Mistry, Benjamin Schuster-Böckler, Sam Griffiths-Jones, Volker Hollich, Timo Lassmann, Simon Moxon, Mhairi Marshall, Ajay Khanna, Richard Durbin, et al. 2006. Pfam: clans, web tools and services. Nucleic acids research 34, suppl_1 (2006), D247–D251.
[100] Veniamin Fishman, Yuri Kuratov, Maxim Petrov, Aleksei Shmelev, Denis Shepelin, Nikolay Chekanov, Olga Kardymon, and Mikhail Burtsev. 2023. GENA-LM: A Family of Open-Source Foundational Models for Long DNA Sequences. bioRxiv (2023), 2023–06.
[101] Mary Forehand. 2010. Bloom’s taxonomy. Emerging perspectives on learning, teaching, and technology 41, 4 (2010), 47–56.
[102] Oscar Franzén, Li-Ming Gan, and Johan LM Björkegren. 2019. PanglaoDB: a web server for exploration of mouse and human single-cell RNA sequencing data. Database 2019 (2019), baz046.
[103] Nathan C Frey, Daniel Berenberg, Karina Zadorozhny, Joseph Kleinhenz, Julien Lafrance-Vanasse, Isidro Hotzel, Yan Wu, Stephen Ra, Richard Bonneau, Kyunghyun Cho, et al. 2023. Protein discovery with discrete walk-jump sampling. arXiv preprint arXiv:2306.12360 (2023).
[104] Dennis Gankin, Alexander Karollus, Martin Grosshauser, Kristian Klemon, Johannes Hingerl, and Julien Gagneur. 2023. Species-aware DNA language modeling. bioRxiv (2023), 2023–01.
[105] Bowen Gao, Bo Qiang, Haichuan Tan, Minsi Ren, Yinjun Jia, Minsi Lu, Jingjing Liu, Weiying Ma, and Yanyan Lan. 2023. DrugCLIP: Contrastive Protein-Molecule Representation Learning for Virtual Screening. arXiv:2310.06367 [cs.LG]
[106] Zijing Gao, Qiao Liu, Wanwen Zeng, Wing H Wong, and Rui Jiang. 2023. EpiGePT: a Pretrained Transformer model for epigenomics. bioRxiv (2023), 2023–07.
[107] Iker García-Ferrero, Rodrigo Agerri, Aitziber Atutxa Salazar, Elena Cabrio, Iker de la Iglesia, Alberto Lavelli, Bernardo Magnini, Benjamin Molinet, Johana Ramirez-Romero, German Rigau, Jose Maria Villa-Gonzalez, Serena Villata, and Andrea Zaninello. 2024. Medical mT5: An Open-Source Multilingual Text-to-Text LLM for The Medical Domain. arXiv:2404.07613
[108] Anna Gaulton, Louisa J Bellis, A Patricia Bento, Jon Chambers, Mark Davies, Anne Hersey, Yvonne Light, Shaun McGlinchey, David Michalovich, Bissan Al-Lazikani, et al. 2012. ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic acids research 40, D1 (2012), D1100–D1107.
[109] IUPAC-IUB Joint Commission on Biochemical Nomenclature (JCBN). 1984. Nomenclature and symbolism for amino acids and peptides. Pure and Applied Chemistry 56 (1984), 595–624.
[110] Michael K Gilson, Tiqing Liu, Michael Baitaluk, George Nicola, Linda Hwang, and Jenny Chong. 2016. BindingDB in 2015: a public database for medicinal chemistry, computational chemistry and systems pharmacology. Nucleic acids research 44, D1 (2016), D1045–D1053.
[111] Yu Gu, Robert Tinn, Hao Cheng, Michael Lucas, Naoto Usuyama, Xiaodong Liu, Tristan Naumann, Jianfeng Gao, and Hoifung Poon. 2021. Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for Healthcare (HEALTH) 3, 1 (2021), 1–23.
[112] Zhouhong Gu, Xiaoxuan Zhu, Haoning Ye, Lin Zhang, Jianchen Wang, Sihang Jiang, Zhuozhi Xiong, Zihan Li, Qianyu He, Rui Xu, Wenhao Huang, Zili Wang, Shusen Wang, Weiguo Zheng, Hongwei Feng, and Yanghua Xiao. 2023. Xiezhi: An Ever-Updating Benchmark for Holistic Domain Knowledge Evaluation. arXiv:2306.05783 [cs.CL]
[113] Jiang Guo, A Santiago Ibanez-Lopez, Hanyu Gao, Victor Quach, Connor W Coley, Klavs F Jensen, and Regina Barzilay. 2021. Automated chemical reaction extraction from scientific literature. Journal of chemical information and modeling 62, 9 (2021), 2035–2045.
[114] Yan Guo, Yulin Dai, Hui Yu, Shilin Zhao, David C Samuels, and Yu Shyr. 2017. Improvements and impacts of GRCh38 human reference on high throughput sequencing data analysis. Genomics 109, 2 (2017), 83–90.
[115] Zhichun Guo, Kehan Guo, Bozhao Nan, Yijun Tian, Roshni G Iyer, Yihong Ma, Olaf Wiest, Xiangliang Zhang, Wei Wang, Chuxu Zhang, et al. 2023. Graph-based molecular representation learning. IJCAI (2023).
[116] Tanishq Gupta, Mohd Zaki, N. M. Anoop Krishnan, and Mausam. 2022. MatSciBERT: A Materials Domain Language Model for Text Mining and Information Extraction. NPJ Computational Materials 8, 1 (May 2022), 102.
[117] Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. 2020. Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks. In Proceedings of ACL.
[118] Janna Hastings, Gareth Owen, Adriano Dekker, Marcus Ennis, Namrata Kale, Venkatesh Muthukrishnan, Steve Turner, Neil Swainston, Pedro Mendes, and Christoph Steinbeck. 2016. ChEBI in 2016: Improved services and an expanding collection of metabolites. Nucleic acids research 44, D1 (2016), D1214–D1219.
[119] Thomas Hayes, Roshan Rao, Halil Akin, Nicholas J. Sofroniew, Deniz Oktay, Zeming Lin, Robert Verkuil, Vincent Q. Tran, Jonathan Deaton, Marius Wiggert, Rohil Badkundri, Irhum Shafkat, Jun Gong, Alexander Derry, Raul S. Molina, Neil Thomas, Yousuf Khan, Chetan Mishra, Carolyn Kim, Liam J. Bartie, Matthew Nemeth, Patrick D. Hsu, Tom Sercu, Salvatore Candido, and Alexander Rives. 2024. Simulating 500 million years of evolution with a language model. bioRxiv: 2024.07.01.600583 (2024).
[120] Liang He, Shizhuo Zhang, Lijun Wu, Huanhuan Xia, Fusong Ju, He Zhang, Siyuan Liu, Yingce Xia, Jianwei Zhu, Pan Deng, Bin Shao, Tao Qin, and Tie-Yan Liu. 2021. Pre-training Co-evolutionary Protein Representation via A Pairwise Masked Language Model. CoRR abs/2110.15527 (2021).
[121] Shanshan He, Ruchir Bhatt, Brian Birditt, Carl Brown, Emily Brown, Kan Chantranuvatana, Patrick Danaher, Dwayne Dunaway, Brian Filanoski, Ryan G Garrison, et al. 2021. High-plex multiomic analysis in FFPE tissue at single-cellular and subcellular resolution by spatial molecular imaging. bioRxiv (2021), 2021–11.
[122] Xuehai He, Shu Chen, Zeqian Ju, Xiangyu Dong, Hongchao Fang, Sicheng Wang, Yue Yang, Jiaqi Zeng, Ruisi Zhang, Ruoyu Zhang, Meng Zhou, Penghui Zhu, and Pengtao Xie. 2020. MedDialog: Two Large-scale Medical Dialogue Datasets. arXiv:2004.03329 [cs.LG]
[123] Michael Heinzinger, Konstantin Weissenow, Joaquin Gomez Sanchez, Adrian Henkel, Martin Steinegger, and Burkhard Rost. 2023. ProstT5: Bilingual Language Model for Protein Sequence and Structure. bioRxiv (2023).
[124] Stephen Heller, Alan McNaught, Stephen Stein, Dmitrii Tchekhovskoi, and Igor Pletnev. 2013. InChI- the worldwide chemical structure identifier standard. Journal of Cheminformatics 5, 1 (2013), 1–9.
[125] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring Massive Multitask Language Understanding. arXiv:2009.03300 [cs.CY]
[126] Daniel Hesslow, Niccoló Zanichelli, Pascal Notin, Iacopo Poli, and Debora Marks. 2022. RITA: a Study on Scaling Up Generative Protein Sequence Models. arXiv preprint arXiv:2205.05789 (2022).
[127] David Hiscock and Chris Upton. 2000. Viral Genome DataBase: storing and analyzing genes and proteins from complete viral genomes. Bioinformatics 16, 5 (2000), 484–485.
[128] Shion Honda, Shoi Shi, and Hiroki R Ueda. 2019. Smiles transformer: Pre-trained molecular fingerprint for low data drug discovery. arXiv preprint arXiv:1911.04738 (2019).
[129] Zhi Hong, Aswathy Ajith, Gregory Pauloski, Eamon Duede, Kyle Chard, and Ian Foster. 2023. The Diminishing Returns of Masked Language Models to Science. arXiv:2205.11342 [cs.CL]
[130] Marc Horlacher, Giulia Cantini, Julian Hesse, Patrick Schinke, Nicolas Goedert, Shubhankar Londhe, Lambert Moyon, and Annalisa Marsico. 2023. A systematic benchmark of machine learning methods for protein–RNA interaction prediction. Briefings in Bioinformatics 24, 5 (2023), bbad307.
[131] Wenpin Hou and Zhicheng Ji. 2024. Assessing GPT-4 for cell type annotation in single-cell RNA-seq analysis. Nature Methods (2024), 1–4.
[132] Bozhen Hu, Jun Xia, Jiangbin Zheng, Cheng Tan, Yufei Huang, Yongjie Xu, and Stan Z Li. 2022. Protein language models and structure prediction: Connection and progression. arXiv preprint arXiv:2211.16742 (2022).
[133] Weihua Hu, Matthias Fey, Hongyu Ren, Maho Nakata, Yuxiao Dong, and Jure Leskovec. 2021. OGB-LSC: A large-scale challenge for machine learning on graphs. arXiv preprint arXiv:2103.09430 (2021).
[134] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. 2020. Open graph benchmark: Datasets for machine learning on graphs. Advances in neural information processing systems 33 (2020), 22118–22133.
[135] Kexin Huang, Jaan Altosaar, and Rajesh Ranganath. 2020. ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission. arXiv:1904.05342 [cs.CL]
[136] Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu, Chuancheng Lv, Yikai Zhang, Jiayi Lei, Yao Fu, Maosong Sun, and Junxian He. 2023. C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models. In Advances in Neural Information Processing Systems.
[137] Yunha Hwang, Andre L Cornman, Elizabeth H Kellogg, Sergey Ovchinnikov, and Peter R Girguis. 2024. Genomic language model predicts protein co-regulation and function. Nature communications 15, 1 (2024), 2880.
[138] John J Irwin, Teague Sterling, Michael M Mysinger, Erin S Bolstad, and Ryan G Coleman. 2012. ZINC: a free tool to discover chemistry for biology. Journal of chemical information and modeling 52, 7 (2012), 1757–1768.
[139] John J Irwin, Khanh G Tang, Jennifer Young, Chinzorig Dandarchuluun, Benjamin R Wong, Munkhzul Khurelbaatar, Yurii S Moroz, John Mayfield, and Roger A Sayle. 2020. ZINC20—a free ultralarge-scale chemical database for ligand discovery. Journal of chemical information and modeling 60, 12 (2020), 6065–6073.
[140] Ross Irwin, Spyridon Dimitriadis, Jiazhen He, and Esben Jannik Bjerrum. 2022. Chemformer: a pre-trained transformer for computational chemistry. Machine Learning: Science and Technology 3, 1 (2022), 015022.
[141] Nikita Janakarajan, Tim Erdmann, Sarath Swaminathan, Teodoro Laino, and Jannis Born. 2023. Language models in molecular discovery. arXiv preprint arXiv:2309.16235 (2023).
[142] Bijay Jassal, Lisa Matthews, Guilherme Viteri, Chuqiao Gong, Pascual Lorente, Antonio Fabregat, Konstantinos Sidiropoulos, Justin Cook, Marc Gillespie, Robin Haw, et al. 2020. The reactome pathway knowledgebase. Nucleic acids research 48, D1 (2020), D498–D503.
[143] Yanrong Ji, Zhihan Zhou, Han Liu, and Ramana V Davuluri. 2021. DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics 37, 15 (2021), 2112–2120.
[144] Minghui Jiang, Ying Xu, and Binhai Zhu. 2008. Protein structure–structure alignment with discrete Fréchet distance. Journal of bioinformatics and computational biology 6, 01 (2008), 51–64.
[145] Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. 2020. What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams. arXiv:2009.13081
[146] Junru Jin, Yingying Yu, Ruheng Wang, Xin Zeng, Chao Pang, Yi Jiang, Zhongshen Li, Yutong Dai, Ran Su, Quan Zou, et al. 2022. iDNA-ABF: multi-scale deep biological language learning model for the interpretable prediction of DNA methylations. Genome biology 23, 1 (2022), 1–23.
[147] Qiao Jin, Bhuwan Dhingra, William W. Cohen, and Xinghua Lu. 2019. Probing Biomedical Embeddings from Language Models. arXiv:1904.02181 [cs.CL]
[148] Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William W. Cohen, and Xinghua Lu. 2019. PubMedQA: A Dataset for Biomedical Research Question Answering. arXiv:1909.06146 [cs.CL]
[149] Johannes Welbl, Nelson F. Liu, and Matt Gardner. 2017. Crowdsourcing Multiple Choice Science Questions. arXiv:1707.06209v1.
[150] Alistair Johnson, Tom Pollard, Steven Horng, Leo Anthony Celi, and Roger Mark. 2023. MIMIC-IV-Note: Deidentified free-text clinical notes (version 2.2). PhysioNet. https://doi.org/10.13026/1n74-ne17
[151] Alistair E. W. Johnson, Lucas Bulgarelli, Lu Shen, Alvin Gayles, Ayad Shammout, Steven Horng, Tom J. Pollard, Benjamin Moody, Brian Gow, Li-wei H. Lehman, Leo Anthony Celi, and Roger G. Mark. 2023. MIMIC-IV, a freely accessible electronic health record dataset. Scientific Data 10 (2023). https://api.semanticscholar.org/CorpusID:255439889
[152] Alistair E. W. Johnson, Tom J. Pollard, Seth J. Berkowitz, Nathaniel R. Greenbaum, Matthew P. Lungren, Chih-ying Deng, Roger G. Mark, and Steven Horng. 2019. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Scientific Data 6 (2019). https://api.semanticscholar.org/CorpusID:209342303
[153] Alistair E. W. Johnson, Tom J. Pollard, Lu Shen, Li-wei H. Lehman, Mengling Feng, Mohammad Mahdi Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G. Mark. 2016. MIMIC-III, a freely accessible critical care database. Scientific Data 3 (2016). https://api.semanticscholar.org/CorpusID:33285731
[154] John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. 2021. Highly accurate protein structure prediction with AlphaFold. Nature 596, 7873 (2021), 583–589.
[155] Donna Karolchik, Robert Baertsch, Mark Diekhans, Terrence S Furey, Angie Hinrichs, YT Lu, Krishna M Roskin, Matt Schwartz, Charles W Sugnet, Daryl J Thomas, et al. 2003. The UCSC genome browser database. Nucleic acids research 31, 1 (2003), 51–54.
[156] Pavel Karpov, Guillaume Godin, and Igor V Tetko. 2019. A transformer model for retrosynthesis. In International Conference on Artificial Neural Networks. Springer, 817–830.
[157] Panagiotis Katsonis and Olivier Lichtarge. 2019. CAGI5: Objective performance assessments of predictions based on the Evolutionary Action equation. Human mutation 40, 9 (2019), 1436–1454.
[158] Eunji Kim, Dongseon Lee, Youngchun Kwon, Min Sik Park, and Youn-Suk Choi. 2021. Valid, plausible, and diverse retrosynthesis using tied two-way transformers with latent variables. Journal of Chemical Information and Modeling 61, 1 (2021), 123–133.
[159] Hyunjae Kim, Hyeon Hwang, Jiwoo Lee, Sihyeon Park, Dain Kim, Taewhoo Lee, Chanwoong Yoon, Jiwoong Sohn, Donghee Choi, and Jaewoo Kang. 2024. Small Language Models Learn Enhanced Reasoning Skills from Medical Textbooks. arXiv:2404.00376 [cs.CL]
[160] Hyunseung Kim, Jonggeol Na, and Won Bo Lee. 2021. Generative chemical transformer: neural machine learning of molecular geometric structures from chemical language via attention. Journal of chemical information and modeling 61, 12 (2021), 5804–5814.
[161] Sunghwan Kim, Jie Chen, Tiejun Cheng, Asta Gindulyte, Jia He, Siqian He, Qingliang Li, Benjamin A Shoemaker, Paul A Thiessen, Bo Yu, et al. 2019. PubChem 2019 update: improved access to chemical data. Nucleic acids research 47, D1 (2019), D1102–D1109.
[162] Sunghwan Kim, Jie Chen, Tiejun Cheng, Asta Gindulyte, Jia He, Siqian He, Qingliang Li, Benjamin A Shoemaker, Paul A Thiessen, Bo Yu, et al. 2021. PubChem in 2021: new data content and improved web interfaces. Nucleic acids research 49, D1 (2021), D1388–D1395.
[163] Sunghwan Kim, Jie Chen, Tiejun Cheng, Asta Gindulyte, Jia He, Siqian He, Qingliang Li, Benjamin A Shoemaker, Paul A Thiessen, Bo Yu, et al. 2023. PubChem 2023 update. Nucleic acids research 51, D1 (2023), D1373–D1380.
[164] Sunghwan Kim, Paul A Thiessen, Evan E Bolton, Jie Chen, Gang Fu, Asta Gindulyte, Lianyi Han, Jane He, Siqian He, Benjamin A Shoemaker, et al. 2016. PubChem substance and compound databases. Nucleic acids research 44, D1 (2016), D1202–D1213.
[165] David R Krathwohl. 2002. A revision of Bloom’s taxonomy: An overview. Theory into practice 41, 4 (2002), 212–218.
[166] Mario Krenn, Florian Häse, AkshatKumar Nigam, Pascal Friederich, and Alan Aspuru-Guzik. 2020. Self-referencing embedded strings (SELFIES): A 100% robust molecular string representation. Machine Learning: Science and Technology 1, 4 (2020), 045024.
[167] Jens Kringelum, Sonny Kim Kjaerulff, Søren Brunak, Ole Lund, Tudor I Oprea, and Olivier Taboureau. 2016. ChemProt-3.0: a global chemical biology diseases mapping. Database 2016 (2016), bav123.
[168] Andriy Kryshtafovych, Torsten Schwede, Maya Topf, Krzysztof Fidelis, and John Moult. 2021. Critical assessment of methods of protein structure prediction (CASP)—Round XIV. Proteins: Structure, Function, and Bioinformatics 89, 12 (2021), 1607–1617.
[169] Yanis Labrak, Adrien Bazoge, Emmanuel Morin, Pierre-Antoine Gourraud, Mickael Rouvier, and Richard Dufour. 2024. Biomistral: A collection of open-source pretrained large language models for medical domains. arXiv preprint arXiv:2402.10373 (2024).
[170] Alexander Lachmann, Denis Torre, Alexandra B Keenan, Kathleen M Jagodnik, Hoyjin J Lee, Lily Wang, Moshe C Silverstein, and Avi Ma’ayan. 2018. Massive mining of publicly available RNA-seq data from human and mouse. Nature communications 9, 1 (2018), 1366.
[171] Philippe Lamesch, Tanya Z Berardini, Donghui Li, David Swarbreck, Christopher Wilks, Rajkumar Sasidharan, Robert Muller, Kate Dreher, Debbie L Alexander, Margarita Garcia-Hernandez, et al. 2012. The Arabidopsis Information Resource (TAIR): improved gene annotation and new tools. Nucleic acids research 40, D1 (2012), D1202–D1210.
[172] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942 (2019).
[173] Ursula K Le Guin. 2004. The Wave in the Mind: Talks and Essays on the Writer, the Reader, and the Imagination. Shambhala Publications.
[174] Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2020. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36, 4 (2020), 1234–1240.
[175] Minji Lee, Luiz Felipe Vecchietti, Hyunkyu Jung, Hyunjoo Ro, Meeyoung Cha, and Ho Min Kim. 2022. Protein sequence design in a latent space via model-based reinforcement learning. (2022).
[176] Youhan Lee, Hasun Yu, Jaemyung Lee, and Jaehoon Kim. 2023. Pre-training Sequence, Structure, and Surface Features for Comprehensive Protein Representation Learning. In The Twelfth International Conference on Learning Representations.
[177] Daniel Levine, Sacha Lévy, Syed Asad Rizvi, Nazreen Pallikkavaliyaveetil, Xingyu Chen, David Zhang, Sina Ghadermarzi, Ruiming Wu, Zihe Zheng, Ivan Vrkic, et al. 2023. Cell2sentence: Teaching large language models the language of biology. bioRxiv (2023), 2023–09.
[178] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020. 7871–7880.
[179] Han Li, Dan Zhao, and Jianyang Zeng. 2022. KPGT: knowledge-guided pre-training of graph transformer for molecular property prediction. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 857–867.
[180] Hong-Liang Li, Yi-He Pang, and Bin Liu. 2021. BioSeq-BLM: a platform for analyzing DNA, RNA and protein sequences based on biological language models. Nucleic acids research 49, 22 (2021), e129–e129.
[181] Juncai Li and Xiaofei Jiang. 2021. Mol-BERT: an effective molecular representation with BERT for molecular property prediction. Wireless Communications and Mobile Computing 2021 (2021), 1–7.
[182] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597 (2023).
[183] Jiatong Li, Yunqing Liu, Wenqi Fan, Xiao-Yong Wei, Hui Liu, Jiliang Tang, and Qing Li. 2023. Empowering Molecule Discovery for Molecule-Caption Translation with Large Language Models: A ChatGPT Perspective. arXiv:2306.06615 [cs.CL]
[184] Jiao Li, Yueping Sun, Robin J Johnson, Daniela Sciaky, Chih-Hsuan Wei, Robert Leaman, Allan Peter Davis, Carolyn J Mattingly, Thomas C Wiegers, and Zhiyong Lu. 2016. BioCreative V CDR task corpus: a resource for chemical disease relation extraction. Database 2016 (2016).
[185] Jianquan Li, Xidong Wang, Xiangbo Wu, Zhiyi Zhang, Xiaolong Xu, Jie Fu, Prayag Tiwari, Xiang Wan, and Benyou Wang. 2023. Huatuo-26M, a Large-scale Chinese Medical QA Dataset. arXiv:2305.01526 [cs.CL]
[186] Jiahao Li, Zhourun Wu, Wenhao Lin, Jiawei Luo, Jun Zhang, Qingcai Chen, and Junjie Chen. 2023. iEnhancer-ELM: improve enhancer identification by extracting position-related multiscale contextual information based on enhancer language models. Bioinformatics Advances 3, 1 (2023), vbad043.
[187] Yuesen Li, Chengyi Gao, Xin Song, Xiangyu Wang, Yungang Xu, and Suxia Han. 2023. Druggpt: A gpt-based strategy for designing potential ligands targeting specific proteins. bioRxiv (2023), 2023–06.
[188] Yunxiang Li, Zihan Li, Kai Zhang, Ruilong Dan, Steve Jiang, and You Zhang. 2023. ChatDoctor: A Medical Chat Model Fine-Tuned on a Large Language Model Meta-AI (LLaMA) Using Medical Domain Knowledge. arXiv:2303.14070 [cs.CL]
[189] Zhongshen Li, Junru Jin, Wentao Long, and Leyi Wei. 2023. PLPMpro: Enhancing promoter sequence prediction with prompt-learning based pre-trained language model. Computers in Biology and Medicine 164 (2023), 107260.
[190] Youwei Liang, Ruiyi Zhang, Li Zhang, and Pengtao Xie. 2023. DrugChat: towards enabling ChatGPT-like capabilities on drug molecule graphs. arXiv preprint arXiv:2309.03907 (2023).
[191] Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out. 74–81.
[192] Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Allan dos Santos Costa, Maryam Fazel-Zarandi, Tom Sercu, Sal Candido, et al. 2022. Language models of protein sequences at the scale of evolution enable accurate structure prediction. bioRxiv (2022), 2022.07.20.500902.
[193] Huaqing Liu, Shuxian Zhou, Peiyi Chen, Jiahui Liu, Ku-Geng Huo, and Lanqing Han. 2024. Exploring Genomic Large Language Models: Bridging the Gap between Natural Language and Gene Sequences. bioRxiv (2024), 2024–02.
[194] June M. Liu, Donghao Li, He Cao, Tianhe Ren, Zeyi Liao, and Jiamin Wu. 2023. ChatCounselor: A Large Language Models for Mental Health Support. arXiv:2309.15461 [cs.CL]
[195] Pengfei Liu, Yiming Ren, and Zhixiang Ren. 2023. GIT-Mol: A Multi-modal Large Language Model for Molecular Science with Graph, Image, and Text. arXiv:2308.06911 [cs.LG]
[196] Shengchao Liu, Weili Nie, Chengpeng Wang, Jiarui Lu, Zhuoran Qiao, Ling Liu, Jian Tang, Chaowei Xiao, and Anima Anandkumar. 2023. Multi-modal Molecule Structure-text Model for Text-based Retrieval and Editing. arXiv:2212.10789 [cs.LG]
[197] Shengchao Liu, Hanchen Wang, Weiyang Liu, Joan Lasenby, Hongyu Guo, and Jian Tang. 2021. Pre-training molecular graph representation with 3d geometry. arXiv preprint arXiv:2110.07728 (2021).
[198] Shengchao Liu, Jiongxiao Wang, Yijin Yang, Chengpeng Wang, Ling Liu, Hongyu Guo, and Chaowei Xiao. 2023. ChatGPT-powered Conversational Drug Editing Using Retrieval and Domain Feedback. arXiv:2305.18090 [q-bio.BM]
[199] Shengchao Liu, Yutao Zhu, Jiarui Lu, Zhao Xu, Weili Nie, Anthony Gitter, Chaowei Xiao, Jian Tang, Hongyu Guo, and Anima Anandkumar. 2023. A Text-guided Protein Design Framework. arXiv:2302.04611 [cs.LG]
[200] Xianggen Liu, Yan Guo, Haoran Li, Jin Liu, Shudong Huang, Bowen Ke, and Jiancheng Lv. 2024. DrugLLM: Open Large Language Model for Few-shot Molecule Generation. arXiv preprint arXiv:2405.06690 (2024).
[201] Yuyan Liu, Sirui Ding, Sheng Zhou, Wenqi Fan, and Qiaoyu Tan. 2024. MolecularGPT: Open Large Language Model (LLM) for Few-Shot Molecular Property Prediction. arXiv preprint arXiv:2406.12950 (2024).
[202] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
[203] Yunwu Liu, Ruisheng Zhang, Tongfeng Li, Jing Jiang, Jun Ma, and Ping Wang. 2023. MolRoPE-BERT: An enhanced molecular representation with Rotary Position Embedding for molecular property prediction. Journal of Molecular Graphics and Modelling 118 (2023), 108344.
[204] Zicheng Liu, Jiahui Li, Siyuan Li, Zelin Zang, Cheng Tan, Yufei Huang, Yajing Bai, and Stan Z Li. 2024. GenBench: A Benchmarking Suite for Systematic Evaluation of Genomic Foundation Models. arXiv preprint arXiv:2406.01627 (2024).
[205] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision. 10012–10022.
[206] Zhiyuan Liu, An Zhang, Hao Fei, Enzhi Zhang, Xiang Wang, Kenji Kawaguchi, and Tat-Seng Chua. 2024. ProtT3: Protein-to-Text Generation for Text-based Protein Understanding. arXiv preprint arXiv:2405.12564 (2024).
[207] Zequn Liu, Wei Zhang, Yingce Xia, Lijun Wu, Shufang Xie, Tao Qin, Ming Zhang, and Tie-Yan Liu. 2023. MolXPT: Wrapping Molecules with Text for Generative Pre-training. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics. 1606–1616.
[208] Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kinney, and Dan S. Weld. 2020. S2ORC: The Semantic Scholar Open Research Corpus. arXiv:1911.02782 [cs.CL]
[209] Loredana Lo Conte, Bart Ailey, Tim JP Hubbard, Steven E Brenner, Alexey G Murzin, and Cyrus Chothia. 2000. SCOP: a structural classification of proteins database. Nucleic acids research 28, 1 (2000), 257–259.
[210] Daniel Mark Lowe. 2012. Extraction of chemical structures and reactions from the literature. Ph. D. Dissertation. University of Cambridge.
[211] Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. 2022. Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering. arXiv:2209.09513 [cs.CL]
[212] Xingyu Lu, He Cao, Zijing Liu, Shengyuan Bai, Leqing Chen, Yuan Yao, Hai-Tao Zheng, and Yu Li. 2024. MoleculeQA: A Dataset to Evaluate Factual Accuracy in Molecular Comprehension. arXiv preprint arXiv:2403.08192 (2024).
[213] Hanyu Luo, Cheng Chen, Wenyu Shan, Pingjian Ding, and Lingyun Luo. 2022. iEnhancer-BERT: A novel transfer learning architecture based on DNA-Language model for identifying enhancers and their strength. In International Conference on Intelligent Computing. Springer, 153–165.
[214] Hanyu Luo, Wenyu Shan, Cheng Chen, Pingjian Ding, and Lingyun Luo. 2023. Improving language model of human genome for DNA–protein binding prediction based on task-specific pre-training. Interdisciplinary Sciences: Computational Life Sciences 15, 1 (2023), 32–43.
[215] Renqian Luo, Liai Sun, Yingce Xia, Tao Qin, Sheng Zhang, Hoifung Poon, and Tie-Yan Liu. 2022. BioGPT: generative pre-trained transformer for biomedical text generation and mining. Briefings in Bioinformatics 23, 6 (Sept. 2022).
[216] Yizhen Luo, Kai Yang, Massimo Hong, Xing Yi Liu, and Zaiqing Nie. 2023. MolFM: A Multimodal Molecular Foundation Model. arXiv:2307.09484 [q-bio.BM]
[217] Yizhen Luo, Jiahuan Zhang, Siqi Fan, Kai Yang, Yushuai Wu, Mu Qiao, and Zaiqing Nie. 2023. Biomedgpt: Open multimodal generative pre-trained transformer for biomedicine. arXiv preprint arXiv:2308.09442 (2023).
[218] Rachel K. Luu and Markus J. Buehler. 2023. BioinspiredLLM: Conversational Large Language Model for the Mechanics of Biological and Bio-inspired Materials. arXiv:2309.08788 [cond-mat.mtrl-sci]
[219] Liuzhenghao Lv, Zongying Lin, Hao Li, Yuyang Liu, Jiaxi Cui, Calvin Yu-Chian Chen, Li Yuan, and Yonghong Tian. 2024. ProLLaMA: A Protein Large Language Model for Multi-Task Protein Language Processing. arXiv preprint arXiv:2402.16445 (2024).
[220] Chang Ma, Haiteng Zhao, Lin Zheng, Jiayi Xin, Qintong Li, Lijun Wu, Zhihong Deng, Yang Lu, Qi Liu, and Lingpeng Kong. 2023. Retrieved Sequence Augmentation for Protein Representation Learning. bioRxiv (2023), 2023–02.
[221] Ali Madani, Bryan McCann, Nikhil Naik, Nitish Shirish Keskar, Namrata Anand, Raphael R. Eguchi, Po-Ssu Huang, and Richard Socher. 2020. ProGen: Language Modeling for Protein Generation. arXiv preprint arXiv:2004.03497 (2020).
[222] Aditya Malusare, Harish Kothandaraman, Dipesh Tamboli, Nadia A. Lanman, and Vaneet Aggarwal. 2023. Understanding the Natural Language of DNA using Encoder-Decoder Foundation Models with Byte-level Precision. arXiv:2311.02333 [cs.LG]
[223] Vipul Mann and Venkat Venkatasubramanian. 2021. Predicting chemical reaction outcomes: A grammar ontology-based transformer framework. AIChE Journal 67, 3 (2021), e17190.
[224] Sanaa Mansoor, Minkyung Baek, Umesh Madan, and Eric Horvitz. 2021. Toward more general embeddings for protein design: Harnessing joint representations of sequence and structure. bioRxiv (2021), 2021–09.
[225] Kelong Mao, Xi Xiao, Tingyang Xu, Yu Rong, Junzhou Huang, and Peilin Zhao. 2021. Molecular graph enhanced transformer for retrosynthesis prediction. Neurocomputing 457 (2021), 193–202.
[226] Valerio Mariani, Marco Biasini, Alessandro Barbato, and Torsten Schwede. 2013. lDDT: a local superposition-free score for comparing protein structures and models using distance difference tests. Bioinformatics 29, 21 (2013), 2722–2728.
[227] Fergal J Martin, M Ridwan Amode, Alisha Aneja, Olanrewaju Austine-Orimoloye, Andrey G Azov, If Barnes, Arne Becker, Ruth Bennett, Andrew Berry, Jyothish Bhai, et al. 2023. Ensembl 2023. Nucleic acids research 51, D1 (2023), D933–D941.
[228] Łukasz Maziarka, Tomasz Danel, Sławomir Mucha, Krzysztof Rataj, Jacek Tabor, and Stanisław Jastrzębski. 2020. Molecule attention transformer. arXiv preprint arXiv:2002.08264 (2020).
[229] Łukasz Maziarka, Dawid Majchrowski, Tomasz Danel, Piotr Gaiński, Jacek Tabor, Igor Podolak, Paweł Morkisz, and Stanisław Jastrzębski. 2024. Relative molecule self-attention transformer. Journal of Cheminformatics 16, 1 (2024), 3.
[230] Eyal Mazuz, Guy Shtar, Bracha Shapira, and Lior Rokach. 2023. Molecule generation using transformers and policy gradient reinforcement learning. Scientific Reports 13, 1 (2023), 8799.
[231] Andrew G. McDonald, Sinéad Boyce, and Keith F. Tipton. 2008. ExplorEnz: the primary source of the IUBMB enzyme list. Nucleic Acids Research 37, suppl_1 (2008), 593–597.
[232] Colin Megill, Bruce Martin, Charlotte Weaver, Sidney Bell, Lia Prins, Seve Badajoz, Brian McCandless, Angela Oliveira Pisco, Marcus Kinsella, Fiona Griffin, et al. 2021. Cellxgene: a performant, scalable exploration platform for high dimensional sparse matrices. bioRxiv (2021), 2021–04.
[233] Joshua Meier, Roshan Rao, Robert Verkuil, Jason Liu, Tom Sercu, and Alexander Rives. 2021. Language models enable zero-shot prediction of the effects of mutations on protein function. bioRxiv (2021).
[234] Arindam Mitra, Luciano Del Corro, Shweti Mahajan, Andres Codas, Clarisse Simoes, Sahaj Agrawal, Xuxi Chen, Anastasia Razdaibiedina, Erik Jones, Kriti Aggarwal, Hamid Palangi, Guoqing Zheng, Corby Rosset, Hamed Khanpour, and Ahmed Awadallah. 2023. Orca 2: Teaching Small Language Models How to Reason. arXiv:2311.11045 [cs.AI]
[235] S Moller, Ulf Leser, Wolfgang Fleischmann, and Rolf Apweiler. 1999. EDITtoTrEMBL: a distributed approach to high-quality automated protein sequence annotation. Bioinformatics (Oxford, England) 15, 3 (1999), 219–227.
[236] Geraldene Munsamy, Sebastian Lindner, Philipp Lorenz, and Noelia Ferruz. 2022. ZymCTRL: a conditional language model for the controllable generation of artificial enzymes. In Machine Learning for Structural Biology Workshop. NeurIPS 2022.
[237] Michael M Mysinger, Michael Carchia, John J Irwin, and Brian K Shoichet. 2012. Directory of useful decoys, enhanced (DUD-E): better ligands and decoys for better benchmarking. Journal of medicinal chemistry 55, 14 (2012), 6582–6594.
[238] Eric Nguyen, Michael Poli, Matthew G Durrant, Armin W Thomas, Brian Kang, Jeremy Sullivan, Madelena Y Ng, Ashley Lewis, Aman Patel, Aaron Lou, et al. 2024. Sequence modeling and design from molecular to genome scale with Evo. bioRxiv (2024), 2024–02.
[239] Eric Nguyen, Michael Poli, Marjan Faizi, Armin Thomas, Callum Birch-Sykes, Michael Wornow, Aman Patel, Clayton Rabideau, Stefano Massaroli, Yoshua Bengio, et al. 2023. Hyenadna: Long-range genomic sequence modeling at single nucleotide resolution. arXiv preprint arXiv:2306.15794 (2023).
[240] Erik Nijkamp, Jeffrey A Ruffolo, Eli N Weinstein, Nikhil Naik, and Ali Madani. 2023. ProGen2: exploring the boundaries of protein language models. Cell Systems 14, 11 (2023), 968–978.
[241] Pascal Notin, Mafalda Dias, Jonathan Frazer, Javier Marchena Hurtado, Aidan N Gomez, Debora Marks, and Yarin Gal. 2022. Tranception: Protein Fitness Prediction with Autoregressive Transformers and Inference-time Retrieval. In Proceedings of the 39th International Conference on Machine Learning, Vol. 162. 16990–17017.
[242] Pascal Notin, Aaron W. Kollasch, Daniel Ritter, Lood van Niekerk, Steffanie Paul, Hansen Spinner, Nathan Rollins, Ada Shaw, Ruben Weitzman, Jonathan Frazer, Mafalda Dias, Dinko Franceschi, Rose Orenbuch, Yarin Gal, and Debora S. Marks. 2023. ProteinGym: Large-Scale Benchmarks for Protein Design and Fitness Prediction. bioRxiv (2023).
[243] Pascal Notin, Ruben Weitzman, Debora S. Marks, and Yarin Gal. 2023. ProteinNPT: Improving Protein Property Prediction and Design with Non-Parametric Transformers. bioRxiv (2023).
[244] Robert D Olson, Rida Assaf, Thomas Brettin, Neal Conrad, Clark Cucinell, James J Davis, Donald M Dempsey, Allan Dickerman, Emily M Dietrich, Ronald W Kenyon, et al. 2023. Introducing the bacterial and viral bioinformatics resource center (BV-BRC): a resource combining PATRIC, IRD and ViPR. Nucleic acids research 51, D1 (2023), D678–D689.
[245] OpenAI. 2022. Introducing ChatGPT. OpenAI Blog (November 2022).
[246] OpenAI. 2023. GPT-4 Technical Report. OpenAI (2023).
[247] Christine A Orengo, Alex D Michie, Susan Jones, David T Jones, Mark B Swindells, and Janet M Thornton. 1997. CATH–a hierarchic classification of protein domain structures. Structure 5, 8 (1997), 1093–1109.
[248] Rose Oughtred, Chris Stark, Bobby-Joe Breitkreutz, Jennifer Rust, Lorrie Boucher, Christie Chang, Nadine Kolas, Lara O’Donnell, Genie Leung, Rochelle McAdam, et al. 2019. The BioGRID interaction database: 2019 update. Nucleic acids research 47, D1 (2019), D529–D541.
[249] Carlos Outeiral and Charlotte M Deane. 2024. Codon language embeddings provide strong signals for use in protein engineering. Nature Machine Intelligence 6, 2 (2024), 170–179.
[250] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155 (2022).
[251] Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. 2022. MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering. arXiv:2203.14371 [cs.CL]
[252] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics. 311–318.
[253] Typhaine Paysan-Lafosse, Matthias Blum, Sara Chuguransky, Tiago Grego, Beatriz Lázaro Pinto, Gustavo A Salazar, Maxwell L Bileschi, Peer Bork, Alan Bridge, Lucy Colwell, et al. 2023. InterPro in 2022. Nucleic acids research 51, D1 (2023), D418–D427.
[254] Qizhi Pei, Lijun Wu, Kaiyuan Gao, Xiaozhuan Liang, Yin Fang, Jinhua Zhu, Shufang Xie, Tao Qin, and Rui Yan. 2024. Biot5+: Towards generalized biological understanding with iupac integration and multi-task tuning. arXiv preprint arXiv:2402.17810 (2024).
[255] Qizhi Pei, Lijun Wu, Kaiyuan Gao, Jinhua Zhu, Yue Wang, Zun Wang, Tao Qin, and Rui Yan. 2024. Leveraging Biomolecule and Natural Language through Multi-Modal Learning: A Survey. arXiv preprint arXiv:2403.01528 (2024).
[256] Qizhi Pei, Lijun Wu, Kaiyuan Gao, Jinhua Zhu, and Rui Yan. 2024. 3D-MolT5: Towards Unified 3D Molecule-Text Modeling with 3D Molecular Tokenization. arXiv preprint arXiv:2406.05797 (2024).
[257] Qizhi Pei, Wei Zhang, Jinhua Zhu, Kehan Wu, Kaiyuan Gao, Lijun Wu, Yingce Xia, and Rui Yan. 2023. BioT5: Enriching Cross-modal Integration in Biology with Chemical Knowledge and Natural Language Associations. arXiv:2310.07276 [cs.CL]
[258] Cheng Peng, Xi Yang, Aokun Chen, Kaleb E. Smith, Nima PourNejatian, Anthony B. Costa, Cheryl Martin, Mona G. Flores, Ying Zhang, Tanja Magoc, Gloria Lipori, Duane A. Mitchell, Naykky S. Ospina, Mustafa M. Ahmed, William R. Hogan, Elizabeth A. Shenkman, Yi Guo, Jiang Bian, and Yonghui Wu. 2023. A study of generative large language model for medical research and healthcare. npj Digital Medicine 6, 1 (2023). https://doi.org/10.1038/s41746-023-00958-w
[259] Yifan Peng, Qingyu Chen, and Zhiyong Lu. 2020. An Empirical Study of Multi-Task Learning on BERT for Biomedical Text Mining. arXiv:2005.02799 [cs.CL]
[260] Rafael Josip Penić, Tin Vlašić, Roland G Huber, Yue Wan, and Mile Šikić. 2024. RiNALMo: General-purpose RNA language models can generalize well on structure prediction tasks. arXiv preprint arXiv:2403.00043 (2024).
[261] Sara Pieri, Sahal Shaji Mullappilly, Fahad Shahbaz Khan, Rao Muhammad Anwer, Salman Khan, Timothy Baldwin, and Hisham Cholakkal. 2024. BiMediX: Bilingual Medical Mixture of Experts LLM. arXiv:2402.13253
[262] Tom J Pollard, Alistair EW Johnson, Jesse D Raffa, Leo A Celi, Roger G Mark, and Omar Badawi. 2018. The eICU Collaborative Research Database, a freely available multi-center database for critical care research. Scientific data 5, 1 (2018), 1–13.
[263] Daniil Polykovskiy, Alexander Zhebrak, Benjamin Sanchez-Lengeling, Sergey Golovanov, Oktai Tatanov, Stanislav Belyaev, Rauf Kurbanov, Aleksey Artamonov, Vladimir Aladinskiy, Mark Veselov, et al. 2020. Molecular sets (MOSES): a benchmarking platform for molecular generation models. Frontiers in pharmacology 11 (2020), 565644.
[264] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. 2018. Improving language understanding by generative pre-training. (2018).
[265] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog (2019), 9.
[266] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog 1, 8 (2019), 9.
[267] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research 21, 1 (2020), 5485–5551.
[268] Soumya Ram and Tristan Bepler. 2022. Few Shot Protein Generation. arXiv preprint arXiv:2204.01168 (2022).
[269] Mayk Caldas Ramos, Christopher J. Collison, and Andrew D. White. 2024. A Review of Large Language Models and Autonomous Agents in Chemistry. arXiv:2407.01603 [cs.LG]
[270] Frank P Ramsey. 1923. Tractatus Logico-Philosophicus.
[271] Roshan Rao, Nicholas Bhattacharya, Neil Thomas, Yan Duan, Xi Chen, John Canny, Pieter Abbeel, and Yun S Song. 2019. Evaluating Protein Transfer Learning with TAPE. In Advances in Neural Information Processing Systems.
[272] Roshan M Rao, Jason Liu, Robert Verkuil, Joshua Meier, John Canny, Pieter Abbeel, Tom Sercu, and Alexander Rives. 2021. MSA transformer. In International Conference on Machine Learning. PMLR, 8844–8856.
[273] Aviv Regev, Sarah A Teichmann, Eric S Lander, Ido Amit, Christophe Benoist, Ewan Birney, Bernd Bodenmiller, Peter Campbell, Piero Carninci, Menna Clatworthy, et al. 2017. The human cell atlas. eLife 6 (2017), e27041.
[274] Yuchen Ren, Zhiyuan Chen, Lifeng Qiao, Hongtai Jing, Yuchen Cai, Sheng Xu, Peng Ye, Xinzhu Ma, Siqi Sun, Hongliang Yan, et al. 2024. BEACON: Benchmark for Comprehensive RNA Tasks and Language Models. arXiv preprint arXiv:2406.10391 (2024).
[275] Alexander Rives, Joshua Meier, Tom Sercu, Siddharth Goyal, Zeming Lin, Jason Liu, Demi Guo, Myle Ott, C. Lawrence Zitnick, Jerry Ma, and Rob Fergus. 2019. Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences. PNAS (2019).
[276] Yu Rong, Yatao Bian, Tingyang Xu, Weiyang Xie, Ying Wei, Wenbing Huang, and Junzhou Huang. 2020. Self-supervised graph transformer on large-scale molecular data. Advances in Neural Information Processing Systems 33 (2020), 12559–12571.
[277] Peter W Rose, Bojan Beran, Chunxiao Bi, Wolfgang F Bluhm, Dimitris Dimitropoulos, David S Goodsell, Andreas Prlić, Martha Quesada, Gregory B Quinn, John D Westbrook, et al. 2010. The RCSB Protein Data Bank: redesigned web site and web services. Nucleic acids research 39, suppl_1 (2010), D392–D401.
[278] Jerret Ross, Brian Belgodere, Vijil Chenthamarakshan, Inkit Padhi, Youssef Mroueh, and Payel Das. 2022. Large-scale chemical language representations capture molecular structure and properties. Nature Machine Intelligence 4, 12 (2022), 1256–1264.
[279] Lars Ruddigkeit, Ruud Van Deursen, Lorenz C Blum, and Jean-Louis Reymond. 2012. Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17. Journal of chemical information and modeling 52, 11 (2012), 2864–2875.
[280] Nicole Rusk. 2018. Sequence-based prediction of variants’ effects. Nature Methods 15, 8 (2018), 571–571.
[281] Moritz Schaefer, Peter Peneder, Daniel Malzl, Anna Hakobyan, Varun S Sharma, Thomas Krausgruber, Jörg Menche, Eleni Tomazou, and Christoph Bock. [n. d.]. Joint Embedding of Transcriptomes and Text Enables Interactive Single-Cell RNA-seq Data Exploration via Natural Language. In ICLR 2024 Workshop on Machine Learning for Genomics Explorations.
[282] Martin H Schaefer, Jean-Fred Fontaine, Arunachalam Vinayagam, Pablo Porras, Erich E Wanker, and Miguel A Andrade-Navarro. 2012. HIPPIE: Integrating protein interaction networks with experiment based quality scores. PloS one 7, 2 (2012), e31826.
[283] Yair Schiff, Chia-Hsiang Kao, Aaron Gokaslan, Tri Dao, Albert Gu, and Volodymyr Kuleshov. 2024. Caduceus: Bi-directional equivariant long-range DNA sequence modeling. arXiv preprint arXiv:2403.03234 (2024).
[284] Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne Van Den Berg, Ivan Titov, and Max Welling. 2018. Modeling relational data with graph convolutional networks. In The Semantic Web: 15th International Conference, ESWC 2018, Heraklion, Crete, Greece, June 3–7, 2018, Proceedings 15. Springer, 593–607.
[285] Nadine Schneider, Nikolaus Stiefl, and Gregory A Landrum. 2016. What’s what: The (nearly) definitive guide to reaction role assignment. Journal of chemical information and modeling 56, 12 (2016), 2336–2346.
[286] Philippe Schwaller, Teodoro Laino, Théophile Gaudin, Peter Bolgar, Christopher A Hunter, Costas Bekas, and Alpha A Lee. 2019. Molecular transformer: a model for uncertainty-calibrated chemical reaction prediction. ACS central science 5, 9 (2019), 1572–1583.
[287] Philippe Schwaller, Daniel Probst, Alain C Vaucher, Vishnu H Nair, David Kreutter, Teodoro Laino, and Jean-Louis Reymond. 2021. Mapping the space of chemical reactions using attention-based neural networks. Nature machine intelligence 3, 2 (2021), 144–152.
[288] Yaiza Serrano, Sergi Roda, Victor Guallar, and Alexis Molina. 2023. Efficient and accurate sequence generation with small-scale protein language models. bioRxiv (2023), 2023–08.
[289] Damiano Sgarbossa, Umberto Lupo, and Anne-Florence Bitbol. 2023. Generative power of a protein language model trained on multiple sequence alignments. eLife 12 (2023), e79854.
[290] Murray Shanahan. 2022. Talking About Large Language Models. CoRR abs/2212.03551 (2022).
[291] Soumya Sharma, Bishal Santra, Abhik Jana, TYSS Santosh, Niloy Ganguly, and Pawan Goyal. 2019. Incorporating domain knowledge into medical NLI using knowledge graphs. arXiv preprint arXiv:1909.00160 (2019).
[292] Yiqing Shen, Zan Chen, Michail Mamalakis, Luhan He, Haiyang Xia, Tianbin Li, Yanzhou Su, Junjun He, and Yu Guang Wang. 2024. A Fine-tuning Dataset and Benchmark for Large Language Models for Protein Understanding. arXiv preprint arXiv:2406.05540 (2024).
[293] Pranav Shetty, Arunkumar Chitteth Rajan, Chris Kuenneth, Sonakshi Gupta, Lakshmi Prerana Panchumarti, Lauren Holm, Chao Zhang, and Rampi Ramprasad. 2023. A general-purpose material property data extraction pipeline from large polymer corpora using natural language processing. npj Computational Materials 9, 1 (2023), 52.
[294] Hoo-Chang Shin, Yang Zhang, Evelina Bakhturina, Raul Puri, Mostofa Patwary, Mohammad Shoeybi, and Raghav Mani. 2020. BioMegatron: Larger Biomedical Domain Language Model. arXiv:2010.06060
[295] Ofir Ben Shoham and Nadav Rappoport. 2023. CPLLM: Clinical Prediction with Large Language Models. arXiv:2309.11295 [cs.CL]
[296] Richard W Shuai, Jeffrey A Ruffolo, and Jeffrey J Gray. 2021. Generative language modeling for antibody design. bioRxiv (2021), 2021–12.
[297] Christian JA Sigrist, Edouard De Castro, Lorenzo Cerutti, Béatrice A Cuche, Nicolas Hulo, Alan Bridge, Lydie Bougueleret, and Ioannis Xenarios. 2012. New and continuing developments at PROSITE. Nucleic acids research 41, D1 (2012), D344–D347.
[298] Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. 2023. Large language models encode clinical knowledge. Nature 620, 7972 (2023), 172–180.
[299] Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Le Hou, Kevin Clark, Stephen Pfohl, Heather Cole-Lewis, Darlene Neal, Mike Schaekermann, Amy Wang, Mohamed Amin, Sami Lachgar, Philip Mansfield, Sushant Prakash, Bradley Green, Ewa Dominowska, Blaise Aguera y Arcas, Nenad Tomasev, Yun Liu, Renee Wong, Christopher Semturs, S. Sara Mahdavi, Joelle Barral, Dale Webster, Greg S. Corrado, Yossi Matias, Shekoofeh Azizi, Alan Karthikesalingam, and Vivek Natarajan. 2023. Towards Expert-Level Medical Question Answering with Large Language Models. arXiv:2305.09617 [cs.CL]
[300] Garrett A Soukup. 2001. Nucleic acids: General properties. eLS (2001).
[301] Martin Steinegger, Milot Mirdita, and Johannes Söding. 2019. Protein-level assembly increases protein sequence recovery from metagenomic samples manyfold. Nature methods 16, 7 (2019), 603–606.
[302] Martin Steinegger and Johannes Söding. 2018. Clustering huge protein sequence sets in linear time. Nature communications 9, 1 (2018), 2542.
[303] Teague Sterling and John J Irwin. 2015. ZINC 15–ligand discovery for everyone. Journal of chemical information and modeling 55, 11 (2015), 2324–2337.
[304] Matt Sternke and Joel Karpiak. 2023. ProteinRL: Reinforcement learning with generative protein language models for property-directed sequence design. In NeurIPS 2023 Generative AI and Biology (GenBio) Workshop.
[305] Robert L Strausberg, Elise A Feingold, Richard D Klausner, and Francis S Collins. 1999. The mammalian gene collection. Science 286, 5439 (1999), 455–457.
[306] Bing Su, Dazhao Du, Zhao Yang, Yujie Zhou, Jiangmeng Li, Anyi Rao, Hao Sun, Zhiwu Lu, and Ji-Rong Wen. 2022. A molecular multimodal foundation model associating molecule graphs with natural language. arXiv preprint arXiv:2209.05481 (2022).
[307] Jin Su, Chenchen Han, Yuyang Zhou, Junjie Shan, Xibin Zhou, and Fajie Yuan. 2023. SaProt: Protein language modeling with structure-aware vocabulary. bioRxiv (2023), 2023–10.
[308] Jiangming Sun, Nina Jeliazkova, Vladimir Chupakhin, Jose-Felipe Golib-Dzib, Ola Engkvist, Lars Carlsson, Jörg Wegner, Hugo Ceulemans, Ivan Georgiev, Vedrin Jeliazkov, et al. 2017. ExCAPE-DB: an integrated large scale dataset facilitating Big Data analysis in chemogenomics. Journal of cheminformatics 9 (2017), 1–9.
[309] Liangtai Sun, Yang Han, Zihan Zhao, Da Ma, Zhennan Shen, Baocai Chen, Lu Chen, and Kai Yu. 2023. SciEval: A Multi-Level Large Language Model Evaluation Benchmark for Scientific Research. arXiv:2308.13149 [cs.CL]
[310] Baris E Suzek, Hongzhan Huang, Peter McGarvey, Raja Mazumder, and Cathy H Wu. 2007. UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics 23, 10 (2007), 1282–1288.
[311] Baris E Suzek, Yuqi Wang, Hongzhan Huang, Peter B McGarvey, Cathy H Wu, and UniProt Consortium. 2015. UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics 31, 6 (2015), 926–932.
[312] Damian Szklarczyk, Rebecca Kirsch, Mikaela Koutrouli, Katerina Nastou, Farrokh Mehryary, Radja Hachilif, Annika L Gable, Tao Fang, Nadezhda T Doncheva, Sampo Pyysalo, et al. 2023. The STRING database in 2023: protein–protein association networks and functional enrichment analyses for any sequenced genome of interest. Nucleic acids research 51, D1 (2023), D638–D646.
[313] Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, and Robert Stojnic. 2022. Galactica: A Large Language Model for Science. CoRR abs/2211.09085 (2022).
[314] Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, and Robert Stojnic. 2022. Galactica: A Large Language Model for Science. arXiv:2211.09085 [cs.CL]
[315] Gemini Team. 2023. Gemini: A Family of Highly Capable Multimodal Models. arXiv:2312.11805 [cs.CL]
[316] Igor V Tetko, Pavel Karpov, Ruud Van Deursen, and Guillaume Godin. 2020. State-of-the-art augmented NLP transformer models for direct and single-step retrosynthesis. Nature communications 11, 1 (2020), 5575.
[317] Amol Thakkar, Alain C Vaucher, Andrea Byekwaso, Philippe Schwaller, Alessandra Toniato, and Teodoro Laino. 2023. Unbiasing retrosynthesis language models with disconnection prompts. ACS Central Science 9, 7 (2023), 1488–1498.
[318] Augustin Toma, Patrick R. Lawler, Jimmy Ba, Rahul G. Krishnan, Barry B. Rubin, and Bo Wang. 2023. Clinical Camel: An Open Expert-Level Medical Language Model with Dialogue-Based Knowledge Encoding. arXiv:2305.12031 [cs.CL]
[319] Alessandra Toniato, Alain C Vaucher, Philippe Schwaller, and Teodoro Laino. 2023. Enhancing diversity in language based models for single-step retrosynthesis. Digital Discovery 2, 2 (2023), 489–501.
[320] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and Efficient Foundation Language Models. CoRR (2023).
[321] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023).
[322] Chau Tran, Siddharth Khadkikar, and Aleksey Porollo. 2023. Survey of Protein Sequence Embedding Models. International Journal of Molecular Sciences 24, 4 (2023), 3775.
[323] Tuan Tran and Chinwe Ekenna. 2023. Molecular Descriptors Property Prediction Using Transformer-Based Approach. International Journal of Molecular Sciences 24, 15 (2023), 11948.
[324] Zhengkai Tu and Connor W Coley. 2022. Permutation invariant graph-to-sequence model for template-free retrosynthesis and reaction prediction. Journal of chemical information and modeling 62, 15 (2022), 3503–3513.
[325] Umit V Ucak, Islambek Ashyrmamatov, Junsu Ko, and Juyong Lee. 2022. Retrosynthetic reaction pathway prediction through neural machine translation of atomic environments. Nature communications 13, 1 (2022), 1186.
[326] Gökçe Uludoğan, Elif Ozkirimli, Kutlu O. Ulgen, Nilgün Karalı, and Arzucan Özgür. 2022. Exploiting Pretrained Biochemical Language Models for Targeted Drug Design. arXiv:2209.00981 [cs.LG]
[327] Serbulent Unsal, Heval Atas, Muammer Albayrak, Kemal Turhan, Aybar C Acar, and Tunca Doğan. 2022. Learning functional properties of proteins with language models. Nature Machine Intelligence 4, 3 (2022), 227–245.
[328] Michel van Kempen, Stephanie S Kim, Charlotte Tumescheit, Milot Mirdita, Cameron LM Gilchrist, Johannes Söding, and Martin Steinegger. 2022. Foldseek: fast and accurate protein structure search. bioRxiv (2022), 2022–02.
[329] Mihaly Varadi, Stephen Anyango, Mandar Deshpande, Sreenath Nair, Cindy Natassia, Galabina Yordanova, David Yuan, Oana Stroe, Gemma Wood, Agata Laydon, Augustin Žídek, Tim Green, Kathryn Tunyasuvunakool, Stig Petersen, John Jumper, Ellen Clancy, Richard Green, Ankur Vora, Mira Lutfi, Michael Figurnov, Andrew Cowie, Nicole Hobbs, Pushmeet Kohli, Gerard Kleywegt, Ewan Birney, Demis Hassabis, and Sameer Velankar. 2021. AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Research 50, D1 (11 2021), D439–D444.
[330] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA. 5998–6008.
[331] A Venigalla, Jonathan Frankle, and M Carbin. 2022. BioMedLM: a domain-specific large language model for biomedical text. MosaicML Blog (2022). Accessed: Dec. 23, 2022.
[332] Guangyu Wang, Guoxing Yang, Zongxin Du, Longjun Fan, and Xiaohu Li. 2023. ClinicalGPT: Large Language Models Finetuned with Diverse Medical Data and Comprehensive Evaluation. arXiv:2306.09968 [cs.CL]
[333] Haochun Wang, Chi Liu, Nuwa Xi, Zewen Qiang, Sendong Zhao, Bing Qin, and Ting Liu. 2023. HuaTuo: Tuning LLaMA Model with Chinese Medical Knowledge. arXiv:2304.06975 [cs.CL]
[334] Lei Wang, Hui Zhang, Wei Xu, Zhidong Xue, and Yan Wang. 2023. Deciphering the protein landscape with ProtFlash: a lightweight language model. Cell Reports Physical Science 4, 10 (2023), 101600.
[335] Ning Wang, Jiang Bian, Yuchen Li, Xuhong Li, Shahid Mumtaz, Linghe Kong, and Haoyi Xiong. 2024. Multi-purpose RNA language modelling with motif-aware pretraining and type-guided fine-tuning. Nature Machine Intelligence (2024), 1–10.
[336] Renxiao Wang, Xueliang Fang, Yipin Lu, Chao-Yie Yang, and Shaomeng Wang. 2005. The PDBbind database: methodologies and updates. Journal of medicinal chemistry 48, 12 (2005), 4111–4119.
[337] Sheng Wang, Yuzhi Guo, Yuhong Wang, Hongmao Sun, and Junzhou Huang. 2019. Smiles-bert: large scale unsupervised pre-training for molecular property prediction. In Proceedings of the 10th ACM international conference on bioinformatics, computational biology and health informatics. 429–436.
[338] Wenlu Wang, Ye Wang, Honggang Zhao, and Simone Sciabola. 2022. A pre-trained conditional transformer for Target-specific De Novo Molecular Generation. (2022).
[339] Xin Wang, Xin Gao, Guohua Wang, and Dan Li. 2023. miProBERT: identification of microRNA promoters based on the pre-trained model BERT. Briefings in bioinformatics 24, 3 (2023), bbad093.
[340] Xi Wang, Ruichu Gu, Zhiyuan Chen, Yongge Li, Xiaohong Ji, Guolin Ke, and Han Wen. 2023. UNI-RNA: universal pre-trained models revolutionize RNA research. bioRxiv (2023), 2023–07.
[341] Xiaoxuan Wang, Ziniu Hu, Pan Lu, Yanqiao Zhu, Jieyu Zhang, Satyen Subramaniam, Arjun R. Loomba, Shichang Zhang, Yizhou Sun, and Wei Wang. 2023. SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models. arXiv:2307.10635 [cs.CL]
[342] Yanli Wang, Jewen Xiao, Tugba O Suzek, Jian Zhang, Jiyao Wang, and Stephen H Bryant. 2009. PubChem: a public information system for analyzing bioactivities of small molecules. Nucleic acids research 37, suppl_2 (2009), W623–W633.
[343] Yanli Wang, Jewen Xiao, Tugba O Suzek, Jian Zhang, Jiyao Wang, Zhigang Zhou, Lianyi Han, Karen Karapetyan, Svetlana Dracheva, Benjamin A Shoemaker, et al. 2012. PubChem’s BioAssay database. Nucleic acids research 40, D1 (2012), D400–D412.
[344] Yuhao Wang, Qiang Zhang, Ming Qin, Xiang Zhuang, Xiaotong Li, Zhichen Gong, Zeyuan Wang, Yu Zhao, Jianhua Yao, Keyan Ding, et al. [n. d.]. Knowledge-aware Reinforced Language Models for Protein Directed Evolution. In Forty-first International Conference on Machine Learning.
[345] Ye Wang, Honggang Zhao, Simone Sciabola, and Wenlu Wang. 2023. cMolGPT: A Conditional Generative Pre-Trained Transformer for Target-Specific De Novo Molecular Generation. Molecules 28, 11 (2023), 4430.
[346] Zichen Wang, Steven A Combs, Ryan Brand, Miguel Romero Calvo, Panpan Xu, George Price, Nataliya Golovach, Emmanuel O Salawu, Colby J Wise, Sri Priya Ponnapalli, et al. 2022. Lm-gvp: an extensible sequence and structure informed deep learning framework for protein property prediction. Scientific reports 12, 1 (2022), 6832.
[347] Zifeng Wang, Zichen Wang, Balasubramaniam Srinivasan, Vassilis N Ioannidis, Huzefa Rangwala, and Rishita Anubhai. 2023. BioBridge: Bridging Biomedical Foundation Models via Knowledge Graph. arXiv preprint arXiv:2310.03320 (2023).
[348] Zeyuan Wang, Qiang Zhang, Keyan Ding, Ming Qin, Xiang Zhuang, Xiaotong Li, and Huajun Chen. 2023. InstructProtein: Aligning Human and Protein Language via Knowledge Instruction. arXiv preprint arXiv:2310.03269 (2023).
[349] Zeyuan Wang, Qiang Zhang, Shuang-Wei Hu, Haoran Yu, Xurui Jin, Zhichen Gong, and Huajun Chen. 2023. Multi-level Protein Structure Pre-training via Prompt Learning. In The Eleventh International Conference on Learning Representations.
[350] David Weininger. 1988. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of chemical information and computer sciences 28, 1 (1988), 31–36.
[351] Johannes Welbl, Nelson F Liu, and Matt Gardner. 2017. Crowdsourcing multiple choice science questions. arXiv preprint arXiv:1707.06209 (2017).
[352] David L Wheeler, Tanya Barrett, Dennis A Benson, Stephen H Bryant, Kathi Canese, Vyacheslav Chetvernin, Deanna M Church, Michael DiCuccio, Ron Edgar, Scott Federhen, et al. 2007. Database resources of the national center for biotechnology information. Nucleic acids research 36, suppl_1 (2007), D13–D21.
[353] Jacob White. 2020. PubMed 2.0. Medical reference services quarterly 39, 4 (2020), 382–387.
[354] Mark D Wilkinson, Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton, Myles Axton, Arie Baak, Niklas Blomberg, Jan-Willem Boiten, Luiz Bonino da Silva Santos, Philip E Bourne, et al. 2016. The FAIR Guiding Principles for scientific data management and stewardship. Scientific data 3, 1 (2016), 1–9.
[355] David S Wishart, Yannick D Feunang, An C Guo, Elvis J Lo, Ana Marcu, Jason R Grant, Tanvir Sajed, Daniel Johnson, Carin Li, Zinat Sayeeda, et al. 2018. DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic acids research 46, D1 (2018), D1074–D1082.
[356] Chaoyi Wu, Weixiong Lin, Xiaoman Zhang, Ya Zhang, Yanfeng Wang, and Weidi Xie. 2023. PMC-LLaMA: Towards Building Open-source Language Models for Medicine. arXiv:2304.14454 [cs.CL]
[357] Cathy H Wu, Anastasia Nikolskaya, Hongzhan Huang, Lai-Su L Yeh, Darren A Natale, Cholanayakanahalli R Vinayaka, Zhang-Zhi Hu, Raja Mazumder, Sandeep Kumar, Panagiotis Kourtesis, et al. 2004. PIRSF: family classification system at the Protein Information Resource. Nucleic acids research 32, suppl_1 (2004), D112–D114.
[358] Fang Wu, Dragomir Radev, and Stan Z Li. 2023. Molformer: Motif-based transformer on 3d heterogeneous molecular graphs. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 37. 5312–5320.
[359] Zhenqin Wu, Bharath Ramsundar, Evan N Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S Pappu, Karl Leswing, and Vijay Pande. 2018. MoleculeNet: a benchmark for molecular machine learning. Chemical science 9, 2 (2018), 513–530.
[360] wwPDB consortium. 2018. Protein Data Bank: the single global archive for 3D macromolecular structure data. Nucleic Acids Research 47, D1 (10 2018), D520–D528.
[361] Ioannis Xenarios, Lukasz Salwinski, Xiaoqun Joyce Duan, Patrick Higney, Sul-Min Kim, and David Eisenberg. 2002. DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucleic acids research 30, 1 (2002), 303–305.
[362] Jun Xia, Yanqiao Zhu, Yuanqi Du, Y Liu, and SZ Li. 2023. A Systematic Survey of Chemical Pre-trained Models. IJCAI.
[363] Qianqian Xie, Qingyu Chen, Aokun Chen, Cheng Peng, Yan Hu, Fongci Lin, Xueqing Peng, Jimin Huang, Jeffrey Zhang, Vipina Keloth, Xinyu Zhou, Huan He, Lucila Ohno-Machado, Yonghui Wu, Hua Xu, and Jiang Bian. 2024. Me LLaMA: Foundation Large Language Models for Medical Applications. arXiv:2402.12749
[364] Renchunzi Xie, Hongxin Wei, Lei Feng, and Bo An. 2022. Gearnet: Stepwise dual learning for weakly supervised domain adaptation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36. 8717–8725.
[365] Tong Xie, Yuwei Wan, Wei Huang, Zhenyu Yin, Yixuan Liu, Shaozhou Wang, Qingyuan Linghu, Chunyu Kit, Clara Grazian, Wenjie Zhang, Imran Razzak, and Bram Hoex. 2023. DARWIN Series: Domain Specific Large Language Models for Natural Science. arXiv:2308.13565 [cs.CL]
[366] Guoli Xiong, Zhenxing Wu, Jiacai Yi, Li Fu, Zhijiang Yang, Changyu Hsieh, Mingzhu Yin, Xiangxiang Zeng, Chengkun Wu, Aiping Lu, et al. 2021. ADMETlab 2.0: an integrated online platform for accurate and comprehensive predictions of ADMET properties. Nucleic Acids Research 49, W1 (2021), W5–W14.
[367] Honglin Xiong, Sheng Wang, Yitao Zhu, Zihao Zhao, Yuxiao Liu, Linlin Huang, Qian Wang, and Dinggang Shen. 2023. DoctorGLM: Fine-tuning your Chinese Doctor is not a Herculean Task. arXiv:2304.01097 [cs.CL]
[368] Canwen Xu, Daya Guo, Nan Duan, and Julian McAuley. 2023. Baize: An open-source chat model with parameter-efficient tuning on self-chat data. arXiv preprint arXiv:2304.01196 (2023).
[369] Guikun Xu, Yongquan Jiang, PengChuan Lei, Yan Yang, and Jim Chen. [n. d.]. GTMGC: Using Graph Transformer to Predict Molecule’s Ground-State Conformation. In The Twelfth International Conference on Learning Representations.
[370] Hanwen Xu and Sheng Wang. 2022. ProTranslator: zero-shot protein function prediction using textual description. In International Conference on Research in Computational Molecular Biology. Springer, 279–294.
[371] Hanwen Xu, Addie Woicik, Hoifung Poon, Russ B Altman, and Sheng Wang. 2023. Multilingual translation for zero-shot biomedical classification using BioTranslator. Nature Communications 14, 1 (2023), 738.
[372] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. 2018. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826 (2018).
[373] Minghao Xu, Xinyu Yuan, Santiago Miret, and Jian Tang. 2023. ProtST: Multi-Modality Learning of Protein Sequences and Biomedical Texts. arXiv:2301.12040 [q-bio.BM]
[374] Minghao Xu, Zuobai Zhang, Jiarui Lu, Zhaocheng Zhu, Yangtian Zhang, Chang Ma, Runcheng Liu, and Jian Tang. 2022. PEER: A Comprehensive and Multi-Task Benchmark for Protein Sequence Understanding. arXiv:2206.02096 [cs.LG]
[375] Zhao Xu, Youzhi Luo, Xuan Zhang, Xinyi Xu, Yaochen Xie, Meng Liu, Kaleb Dickerson, Cheng Deng, Maho Nakata, and Shuiwang Ji. 2021. Molecule3d: A benchmark for predicting 3d geometries from molecular graphs. arXiv preprint arXiv:2110.01717 (2021).
[376] Dongyu Xue, Han Zhang, Dongling Xiao, Yukang Gong, Guohui Chuai, Yu Sun, Hao Tian, Hua Wu, Yukun Li, and Qi Liu. 2020. X-MOL: large-scale pre-training for molecular understanding and diverse molecular analysis. bioRxiv (2020), 2020–12.
[377] Hideki Yamaguchi and Yutaka Saito. 2022. EvoOpt: an MSA-guided, fully unsupervised sequence optimization pipeline for protein design. In Machine Learning for Structural Biology Workshop, NeurIPS.
[378] Fan Yang, Wenchuan Wang, Fang Wang, Yuan Fang, Duyu Tang, Junzhou Huang, Hui Lu, and Jianhua Yao. 2022. scBERT as a large-scale pretrained deep language model for cell type annotation of single-cell RNA-seq data. Nature Machine Intelligence 4, 10 (2022), 852–866.
[379] Jingfeng Yang, Hongye Jin, Ruixiang Tang, Xiaotian Han, Qizhang Feng, Haoming Jiang, Bing Yin, and Xia Hu. 2023. Harnessing the power of llms in practice: A survey on chatgpt and beyond. arXiv preprint arXiv:2304.13712 (2023).
[380] Jianyi Yang, Ambrish Roy, and Yang Zhang. 2012. BioLiP: a semi-manually curated database for biologically relevant ligand–protein interactions. Nucleic acids research 41, D1 (2012), D1096–D1103.
[381] Meng Yang, Haiping Huang, Lichao Huang, Nan Zhang, Jihong Wu, Huanming Yang, and Feng Mu. 2021. LOGO, a contextualized pre-trained language model of human genome flexibly adapts to various downstream tasks by fine-tuning. (2021).
[382] Songhua Yang, Hanjie Zhao, Senbin Zhu, Guangyu Zhou, Hongfei Xu, Yuxiang Jia, and Hongying Zan. 2023. Zhongjing: Enhancing the Chinese Medical Capabilities of Large Language Model through Expert Feedback and Real-world Multi-turn Dialogue. arXiv:2308.03549 [cs.CL]
[383] Xi Yang, Aokun Chen, Nima PourNejatian, Hoo Chang Shin, Kaleb E Smith, Christopher Parisien, Colin Compas, Cheryl Martin, Mona G Flores, Ying Zhang, Tanja Magoc, Christopher A Harle, Gloria Lipori, Duane A Mitchell, William R Hogan, Elizabeth A Shenkman, Jiang Bian, and Yonghui Wu. 2022. GatorTron: A Large Clinical Language Model to Unlock Patient Information from Unstructured Electronic Health Records. arXiv:2203.03540 [cs.CL]
[384] Michihiro Yasunaga, Jure Leskovec, and Percy Liang. 2022. LinkBERT: Pretraining Language Models with Document Links. arXiv:2203.15827 [cs.CL]
[385] Qichen Ye, Junling Liu, Dading Chong, Peilin Zhou, Yining Hua, and Andrew Liu. 2023. Qilin-Med: Multi-stage Knowledge Injection Advanced Medical Large Language Model. arXiv:2310.09089 [cs.CL]
[386] Chengxuan Ying, Mingqi Yang, Shuxin Zheng, Guolin Ke, Shengjie Luo, Tianle Cai, Chenglin Wu, Yuxin Wang, Yanming Shen, and Di He. 2021. First place solution of KDD Cup 2021 & OGB large-scale challenge graph prediction track. arXiv preprint arXiv:2106.08279 (2021).
[387] Botao Yu, Frazier N Baker, Ziqi Chen, Xia Ning, and Huan Sun. 2024. LlaSMol: Advancing large language models for chemistry with a large-scale, comprehensive, high-quality instruction tuning dataset. arXiv preprint arXiv:2402.09391 (2024).
[388] Chaohao Yuan, Songyou Li, Geyan Ye, Yikun Zhang, Long-Kai Huang, Wenbing Huang, Wei Liu, Jianhua Yao, and Yu Rong. 2024. Functional Protein Design with Local Domain Alignment. arXiv preprint arXiv:2404.16866 (2024).
[389] Atakan Yüksel, Erva Ulusoy, Atabey Ünlü, and Tunca Doğan. 2023. SELFormer: Molecular representation learning via SELFIES language models. Machine Learning: Science and Technology (2023).
[390] Barbara Zdrazil, Eloy Felix, Fiona Hunter, Emma J Manners, James Blackshaw, Sybilla Corbett, Marleen de Veij, Harris Ioannidis, David Mendez Lopez, Juan F Mosquera, et al. 2023. The ChEMBL Database in 2023: a drug discovery platform spanning multiple bioactivity data types and time periods. Nucleic Acids Research (2023), gkad1004.
[391] Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, Weng Lam Tam, Zixuan Ma, Yufei Xue, Jidong Zhai, Wenguang Chen, Zhiyuan Liu, Peng Zhang, Yuxiao Dong, and Jie Tang. 2023. GLM-130B: An Open Bilingual Pre-trained Model. In The Eleventh International Conference on Learning Representations (ICLR).
[392] Haoyang Zeng, Matthew D Edwards, Ge Liu, and David K Gifford. 2016. Convolutional neural network architectures for predicting DNA–protein binding. Bioinformatics 32, 12 (2016), i121–i127.
[393] Zheni Zeng, Yuan Yao, Zhiyuan Liu, and Maosong Sun. 2022. A deep-learning system bridging molecule structure and biomedical text with comprehension comparable to human professionals. Nature communications 13, 1 (2022), 862.
[394] Zheni Zeng, Bangchen Yin, Shipeng Wang, Jiarui Liu, Cheng Yang, Haishen Yao, Xingzhi Sun, Maosong Sun, Guotong Xie, and Zhiyuan Liu. 2023. Interactive Molecular Discovery with Natural Language. arXiv:2306.11976 [cs.CL]
[395] Dan Zhang, Ziniu Hu, Sining Zhoubian, Zhengxiao Du, Kaiyu Yang, Zihan Wang, Yisong Yue, Yuxiao Dong, and Jie Tang. 2024. SciGLM: Training scientific language models with self-reflective instruction annotation and tuning. arXiv preprint arXiv:2401.07950 (2024).
[396] Di Zhang, Wei Liu, Qian Tan, Jingdan Chen, Hang Yan, Yuliang Yan, Jiatong Li, Weiran Huang, Xiangyu Yue, Dongzhan Zhou, et al. 2024. ChemLLM: A chemical large language model. arXiv preprint arXiv:2402.06852 (2024).
[397] Daoan Zhang, Weitong Zhang, Bing He, Jianguo Zhang, Chenchen Qin, and Jianhua Yao. 2023. DNAGPT: A Generalized Pretrained Tool for Multiple DNA Sequence Analysis Tasks. bioRxiv (2023), 2023–07.
[398] Hongbo Zhang, Junying Chen, Feng Jiang, Fei Yu, Zhihong Chen, Jianquan Li, Guiming Chen, Xiangbo Wu, Zhiyi Zhang, Qingying Xiao, Xiang Wan, Benyou Wang, and Haizhou Li. 2023. HuatuoGPT, towards Taming Language Model to Be a Doctor. arXiv:2305.15075 [cs.CL]
[399] Le Zhang, Jiayang Chen, Tao Shen, Yu Li, and Siqi Sun. 2023. Enhancing the Protein Tertiary Structure Prediction by Multiple Sequence Alignment Generation. arXiv preprint arXiv:2306.01824 (2023).
[400] Ningyu Zhang, Zhen Bi, Xiaozhuan Liang, Siyuan Cheng, Haosen Hong, Shumin Deng, Jiazhang Lian, Qiang Zhang, and Huajun Chen. 2022. OntoProtein: Protein Pretraining With Gene Ontology Embedding. arXiv preprint arXiv:2201.11147 (2022).
[401] Ningyu Zhang, Zhen Bi, Xiaozhuan Liang, Siyuan Cheng, Haosen Hong, Shumin Deng, Jiazhang Lian, Qiang Zhang, and Huajun Chen. 2022. OntoProtein: Protein pretraining with gene ontology embedding. arXiv preprint arXiv:2201.11147 (2022).
[402] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. 2022. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068 (2022).
[403] Sheng Zhang, Xin Zhang, Hui Wang, Lixiang Guo, and Shanshan Liu. 2018. Multi-scale attentive interaction networks for Chinese medical question answer selection. IEEE Access 6 (2018), 74061–74071.
[404] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. 2019. BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675 (2019).
[405] Xiaotian Zhang, Chunyang Li, Yi Zong, Zhengyu Ying, Liang He, and Xipeng Qiu. 2023. Evaluating the Performance of Large Language Models on GAOKAO Benchmark.
[406] Xiao-Chen Zhang, Cheng-Kun Wu, Zhi-Jiang Yang, Zhen-Xing Wu, Jia-Cai Yi, Chang-Yu Hsieh, Ting-Jun Hou, and Dong-Sheng Cao. 2021. MG-BERT: leveraging unsupervised atomic representation learning for molecular property prediction. Briefings in bioinformatics 22, 6 (2021), bbab152.
[407] Xiao-Chen Zhang, Cheng-Kun Wu, Jia-Cai Yi, Xiang-Xiang Zeng, Can-Qun Yang, Ai-Ping Lu, Ting-Jun Hou, and Dong-Sheng Cao. 2022. Pushing the Boundaries of Molecular Property Prediction for Drug Discovery with Multitask Learning BERT Enhanced by SMILES Enumeration. Research 2022 (2022), 0004.
[408] Ying Zhang, Fang Ge, Fuyi Li, Xibei Yang, Jiangning Song, and Dong-Jun Yu. 2023. Prediction of multiple types of RNA modifications via biological language model. IEEE/ACM Transactions on Computational Biology and Bioinformatics (2023).
[409] Yikun Zhang, Mei Lang, Jiuhong Jiang, Zhiqiang Gao, Fan Xu, Thomas Litfin, Ke Chen, Jaswinder Singh, Xiansong Huang, Guoli Song, et al. 2023. Multiple sequence-alignment-based RNA language model and its application to structural inference. bioRxiv (2023), 2023–03.
[410] Yang Zhang and Jeffrey Skolnick. 2007. Scoring function for automated assessment of protein structure template quality. Proteins 68, 4 (2007), 1020.
[411] Yikun Zhang, Geyan Ye, Chaohao Yuan, Bo Han, Long-Kai Huang, Jianhua Yao, Wei Liu, and Yu Rong. 2024. Atomas: Hierarchical Alignment on Molecule-Text for Unified Molecule Understanding and Generation. arXiv preprint arXiv:2404.16880 (2024).
[412] Zuobai Zhang, Chuanrui Wang, Minghao Xu, Vijil Chenthamarakshan, Aurélie Lozano, Payel Das, and Jian Tang. 2023. A Systematic Study of Joint Representation Learning on Protein Sequences and Structures. arXiv:2303.06275 [q-bio.QM]
[413] Zuobai Zhang, Minghao Xu, Vijil Chenthamarakshan, Aurélie Lozano, Payel Das, and Jian Tang. 2023. Enhancing protein language models with structure-based encoder and pre-training. arXiv preprint arXiv:2303.06275 (2023).
[414] Haiteng Zhao, Shengchao Liu, Chang Ma, Hannan Xu, Jie Fu, Zhi-Hong Deng, Lingpeng Kong, and Qi Liu. 2023. GIMLET: A Unified Graph-Text Model for Instruction-Based Molecule Zero-Shot Learning. bioRxiv (2023), 2023–05.
[415] Suyuan Zhao, Jiahuan Zhang, Yizhen Luo, Yushuai Wu, and Zaiqing Nie. 2024. LangCell: Language-Cell Pre-training for Cell Identity Understanding. arXiv preprint arXiv:2405.06708 (2024).
[416] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. 2023. A survey of large language models. arXiv preprint arXiv:2303.18223 (2023).
[417] Zihan Zhao, Da Ma, Lu Chen, Liangtai Sun, Zihao Li, Hongshen Xu, Zichen Zhu, Su Zhu, Shuai Fan, Guodong Shen, et al. 2024. ChemDFM: Dialogue foundation model for chemistry. arXiv preprint arXiv:2401.14818 (2024).
[418] Kangjie Zheng, Siyu Long, Tianyu Lu, Junwei Yang, Xinyu Dai, Ming Zhang, Zaiqing Nie, Wei-Ying Ma, and Hao Zhou. 2024. Multi-Scale Protein Language Model for Unified Molecular Modeling. bioRxiv (2024), 2024–03.
[419] Shuangjia Zheng, Jiahua Rao, Zhongyue Zhang, Jun Xu, and Yuedong Yang. 2019. Predicting retrosynthetic reactions using self-corrected transformer neural networks. Journal of chemical information and modeling 60, 1 (2019), 47–55.
[420] Zaixiang Zheng, Yifan Deng, Dongyu Xue, Yi Zhou, Fei Ye, and Quanquan Gu. 2023. Structure-informed Language Models Are Protein Designers. bioRxiv (2023).
[421] Zhiling Zheng, Oufan Zhang, Christian Borgs, Jennifer T. Chayes, and Omar M. Yaghi. 2023. ChatGPT Chemistry Assistant for Text Mining and the Prediction of MOF Synthesis. Journal of the American Chemical Society 145, 32 (Aug. 2023), 18048–18062.
[422] Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. 2023. AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models. arXiv:2304.06364 [cs.CL]
[423] Gengmo Zhou, Zhifeng Gao, Qiankun Ding, Hang Zheng, Hongteng Xu, Zhewei Wei, Linfeng Zhang, and Guolin Ke. 2023. Uni-Mol: a universal 3D molecular representation learning framework. (2023).
[424] Hong-Yu Zhou, Yunxiang Fu, Zhicheng Zhang, Cheng Bian, and Yizhou Yu. 2023. Protein Representation Learning via Knowledge Enhanced Primary Structure Modeling. arXiv:2301.13154 [cs.LG]
[425] Jian Zhou and Olga G Troyanskaya. 2015. Predicting effects of noncoding variants with deep learning–based sequence model. Nature methods 12, 10 (2015), 931–934.
[426] Zhihan Zhou, Yanrong Ji, Weijian Li, Pratik Dutta, Ramana Davuluri, and Han Liu. 2023. DNABERT-2: Efficient foundation model and benchmark for multi-species genome. arXiv preprint arXiv:2306.15006 (2023).
[427] Yanqiao Zhu, Jeehyun Hwang, Keir Adams, Zhen Liu, Bozhao Nan, Brock Stenfors, Yuanqi Du, Jatin Chauhan, Olaf Wiest, Olexandr Isayev, et al. 2023. Learning Over Molecular Conformer Ensembles: Datasets and Benchmarks. arXiv preprint arXiv:2310.00115 (2023).
[428] Xiang Zhuang, Qiang Zhang, Keyan Ding, Yatao Bian, Xiao Wang, Jingsong Lv, Hongyang Chen, and Huajun Chen. 2023. Learning Invariant Molecular Representation in Latent Discrete Space.
[429] Xiang Zhuang, Qiang Zhang, Bin Wu, Keyan Ding, Yin Fang, and Huajun Chen. 2023. Graph Sampling-based Meta-Learning for Molecular Property Prediction.
[430] Rustam Zhumagambetov, Ferdinand Molnár, Vsevolod A Peshkov, and Siamac Fazli. 2021. Transmol: repurposing a language model for molecular generation. RSC advances 11, 42 (2021), 25921–25932.
[431] Le Zhuo, Zewen Chi, Minghao Xu, Heyan Huang, Heqi Zheng, Conghui He, Xian-Ling Mao, and Wentao Zhang. 2024. ProtLLM: An interleaved protein-language LLM with protein-as-word pre-training. arXiv preprint arXiv:2403.07920 (2024).
[432] Maxim Zvyagin, Alexander Brace, Kyle Hippe, Yuntian Deng, Bin Zhang, Cindy Orozco Bohorquez, Austin Clyde, Bharat Kale, Danilo Perez-Rivera, Heng Ma, et al. 2022. GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics. bioRxiv (2022).
Related documents on Qiita
Making a reference list from a bioRxiv PDF file
https://qiita.com/kaizen_nagoya/items/75f6f93ce9872a5d622d
Genome modeling and design across all domains of life with Evo 2
https://qiita.com/kaizen_nagoya/items/eecda74f758008633ee2
BIOREASON: Incentivizing multimodal biological reasoning with a DNA-LLM model
https://qiita.com/kaizen_nagoya/items/0718b214043a614deee0
McKusick's Online Mendelian Inheritance in Man (OMIM®)
https://qiita.com/kaizen_nagoya/items/c599d867201d1ffb1f4d
Anthropic. Claude 3.7 Sonnet
https://qiita.com/kaizen_nagoya/items/4364d9c475114353cf2a
Genomic language models: Opportunities and challenges
https://qiita.com/kaizen_nagoya/items/f797330e64e0c7d05f39
A DNA language model based on multispecies alignment predicts the effects of genome-wide variants
https://qiita.com/kaizen_nagoya/items/6e8858c2395dcc98804a
A genomic mutational constraint map using variation in 76,156 human genomes
https://qiita.com/kaizen_nagoya/items/e799ad85ee98bb2a8cf6
Nucleotide transformer: building and evaluating robust foundation models for human genomics
https://qiita.com/kaizen_nagoya/items/1c147c2b095364f04ef7
DeepSeek-AI
https://qiita.com/kaizen_nagoya/items/bb5ee9f17c03e07659d8
CodonTransformer: A multispecies codon optimizer using context-aware neural networks.
https://qiita.com/kaizen_nagoya/items/d4be1d4dd9eb307f09cc
MedRAX: Medical reasoning agent for chest X-ray
https://qiita.com/kaizen_nagoya/items/94c7835b2f461452b2e7
Benchmarking DNA foundation models for genomic sequence classification
https://qiita.com/kaizen_nagoya/items/01e3dde0d8274fee0fd8
LoRA: Low-rank adaptation of large language models
https://qiita.com/kaizen_nagoya/items/877058f681d77808b44c
kegg_pull: a software package for the RESTful access and pulling from the Kyoto Encyclopedia of Gene and Genomes.
https://qiita.com/kaizen_nagoya/items/05be40565793f2b4f7f3
GeneGPT: augmenting large language models with domain tools for improved access to biomedical information.
https://qiita.com/kaizen_nagoya/items/8897792ff52fb5e68a46
KEGG: biological systems database as a model of the real world.
https://qiita.com/kaizen_nagoya/items/f63573043eaf8f9c6a2c
Entrez Direct: E-utilities on the Unix command line
https://qiita.com/kaizen_nagoya/items/cc4bbde566e67abc93d9
ClinVar: public archive of relationships among sequence variation and human phenotype.
https://qiita.com/kaizen_nagoya/items/8149b7a5a4f930490fad
BioBERT: a pre-trained biomedical language representation model for biomedical text mining.
https://qiita.com/kaizen_nagoya/items/63781eb6db1fc2ded80a
Progress and opportunities of foundation models in bioinformatics. Briefings in Bioinformatics
https://qiita.com/kaizen_nagoya/items/6ef20eaf796532fed6f8
BEND: Benchmarking DNA language models on biologically meaningful tasks.
https://qiita.com/kaizen_nagoya/items/8417e72454d2107a9d06
HyenaDNA: Long-range genomic sequence modeling at single nucleotide resolution.
https://qiita.com/kaizen_nagoya/items/07e1ba1138b0825c8a73
GPT-4o system card.
https://qiita.com/kaizen_nagoya/items/06e4c54af663456b49f9
Qwen2.5 technical report.
https://qiita.com/kaizen_nagoya/items/f87fac7f9a83f54328fe
DeepSeekMath: Pushing the limits of mathematical reasoning in open language models
https://qiita.com/kaizen_nagoya/items/2add7a80056850b9ce87
dbSNP: the NCBI database of genetic variation.
https://qiita.com/kaizen_nagoya/items/756da32e4c0868d84da0
COSMIC: a curated database of somatic variants and clinical data for cancer.
https://qiita.com/kaizen_nagoya/items/2b0960d4e1ff26a9b01f
RoFormer: Enhanced transformer with rotary position embedding
https://qiita.com/kaizen_nagoya/items/a12a45518f28a5133af2
TxGemma: Efficient and agentic LLMs for therapeutics.
https://qiita.com/kaizen_nagoya/items/e4eff5d51f926e943b9e
Qwen2 technical report.
https://qiita.com/kaizen_nagoya/items/29a77b25282c8822011e
Qwen2.5 technical report.
https://qiita.com/kaizen_nagoya/items/c275c2e1f24bbc5019c1
Scientific large language models: A survey on biological and chemical domains.
https://qiita.com/kaizen_nagoya/items/6505717d7c4769a4ff31