
Overview of Large Language Models, Reference


Overview of Large Language Models

Terms

LLM (Large Language Model)
AI
Gemini
Claude
Autoregressive Language Models
RNN
Transformer
Self-Attention (see the code sketch after this list)
softmax
Supervised learning
Generative Pre-trained Transformer (GPT)
Pre-training
Domain-specific LLMs:
Finance: FinGPT,
Medicine: Med-PaLM M,
Law: Harvey,
Coding: StarCoder; Retrieval: Command R+
Scaling Law
Emergent Ability
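
To make the Self-Attention and softmax entries above concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention. The function names, matrix shapes, and toy inputs are illustrative assumptions, not code from any of the papers referenced below.

```python
# Minimal sketch (assumptions: NumPy only, single head, no masking) of the
# scaled dot-product self-attention described in "Attention Is All You Need".
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: each row of the result sums to 1.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # X: (seq_len, d_model) token representations; Wq/Wk/Wv project them
    # to queries, keys, and values.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # pairwise token similarities
    weights = softmax(scores, axis=-1)   # attention distribution per token
    return weights @ V                   # each output is a weighted mix of values

# Toy usage with illustrative shapes (4 tokens, width 8).
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # -> (4, 8)
```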

“Large Language Models are Zero-Shot Reasoners”, NeurIPS 2022
https://arxiv.org/pdf/2205.11916

References

Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Daniel Ho, Jasmine Hsu, Julian
Ibarz, Brian Ichter, Alex Irpan, Eric Jang, Rosario Jauregui Ruano, Kyle Jeffrey, Sally Jesmonth, Nikhil J Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Kuang-Huei Lee, Sergey Levine, Yao Lu, Linda Luu, Carolina Parada, Peter Pastor, Jornell Quiambao, Kanishka Rao, Jarek Rettinghouse, Diego Reyes, Pierre Sermanet, Nicolas Sievers, Clayton Tan, Alexander Toshev, Vincent Vanhoucke, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, and Mengyuan Yan. Do as i can, not as i say: Grounding language in robotic affordances, 2022. URL https://arxiv.org/abs/2204.01691.
Sid Black, Leo Gao, Phil Wang, Connor Leahy, and Stella Biderman. GPT-Neo: Large Scale
Autoregressive Language Modeling with Mesh-Tensorflow, March 2021. URL https://doi.org/10.5281/zenodo.5297715.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in NeurIPS, volume 33, pages 1877–1901.Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
François Chollet. On the measure of intelligence. arXiv preprint arXiv:1911.01547, 2019. URL https://arxiv.org/abs/1911.01547.
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam
Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. Palm: Scaling language modeling with pathways, 2022. URL https://arxiv.org/abs/2204.02311.
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021. URL https://arxiv.org/abs/2110.14168.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL, pages 4171–4186, 2019. URL https://aclanthology.org/N19-1423.
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.
Tianyu Gao, Adam Fisch, and Danqi Chen. Making pre-trained language models better few-shot learners. In Proceedings of ACL-IJCNLP, pages 3816–3830, 2021. URL https://aclanthology.org/2021.acl-long.295.
Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies. TACL, 9:346–361, 2021. URL https://aclanthology.org/2021.tacl-1.21/.
Mohammad Javad Hosseini, Hannaneh Hajishirzi, Oren Etzioni, and Nate Kushman. Learning to solve arithmetic word problems with verb categorization. In EMNLP, volume 523533. Citeseer, 2014. URL https://aclanthology.org/D14-1058/.
Wendy Johnson and Thomas J Bouchard Jr. The structure of human intelligence: It is verbal, perceptual, and image rotation (vpr), not fluid and crystallized. Intelligence, 33(4):393–416, 2005.
Rik Koncel-Kedziorski, Hannaneh Hajishirzi, Ashish Sabharwal, Oren Etzioni, and Siena Dumas Ang. Parsing algebraic word problems into equations. TACL, 3:585–597, 2015. URL https://aclanthology.org/Q15-1042.
Rik Koncel-Kedziorski, Subhro Roy, Aida Amini, Nate Kushman, and Hannaneh Hajishirzi. MAWPS: A math word problem repository. In Proceedings of NAACL, pages 1152–1157, 2016. URL https://aclanthology.org/N16-1136.
Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. Program induction by rationale generation: Learning to solve and explain algebraic word problems. In Proceedings of ACL, pages 158–167, 2017. URL https://aclanthology.org/P17-1015.
Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, and Weizhu Chen. What makes good in-context examples for gpt-3? arXiv preprint arXiv:2101.06804, 2021a. URL https://arxiv.org/abs/2101.06804.
Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. arXiv preprint arXiv:2107.13586, 2021b. URL https://arxiv.org/abs/2107.13586.
Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. In Proceedings of ACL, pages 8086–8098, 2022. URL https://aclanthology.org/2022.acl-long.556.
Kevin S McGrew. The cattell-horn-carroll theory of cognitive abilities: Past, present, and future. 2005.
Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016. URL https://arxiv.org/abs/1609.07843.
Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. Rethinking the role of demonstrations: What makes in-context learning work? arXiv preprint arXiv:2202.12837, 2022. URL https://arxiv.org/pdf/2202.12837.pdf.
Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, Charles Sutton, and Augustus Odena. Show your work: Scratchpads for intermediate computation with language models. In Deep Learning for Code Workshop, 2022. URL https://openreview.net/forum?id=HBlx2idbkbq.
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback, 2022. URL https://arxiv.org/abs/2203.02155.
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in NeurIPS, 32:8026–8037, 2019. URL https://papers.nips.cc/paper/2019/hash/bdbca288fee7f92f2bfa9f7012727740-Abstract.html.
Arkil Patel, Satwik Bhattamishra, and Navin Goyal. Are NLP models really able to solve simple math word problems? In Proceedings of NAACL, pages 2080–2094, 2021. URL https://aclanthology.org/2021.naacl-main.168.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, page 9, 2019. URL http://www.persagen.com/files/misc/radford2019language.pdf.
Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard Powell, George van den Driessche, Lisa Anne Hendricks, Maribeth Rauh, Po-Sen Huang, Amelia Glaese, Johannes Welbl, Sumanth Dathathri, Saffron Huang, Jonathan Uesato, John Mellor, Irina Higgins, Antonia Creswell, Nat McAleese, Amy Wu, Erich Elsen, Siddhant Jayakumar, Elena Buchatskaya, David Budden, Esme Sutherland, Karen Simonyan, Michela Paganini, Laurent Sifre, Lena Martens, Xiang Lorraine Li, Adhiguna Kuncoro, Aida Nematzadeh, Elena Gribovskaya, Domenic Donato, Angeliki Lazaridou, Arthur Mensch, Jean-Baptiste Lespiau, Maria Tsimpoukelli, Nikolai Grigorev, Doug Fritz, Thibault Sottiaux, Mantas Pajarskas, Toby Pohlen, Zhitao Gong, Daniel Toyama, Cyprien de Masson d’Autume, Yujia Li, Tayfun Terzi, Vladimir Mikulik, Igor Babuschkin, Aidan Clark, Diego de Las Casas, Aurelia Guy, Chris Jones, James Bradbury, Matthew Johnson, Blake Hechtman, Laura Weidinger, Iason Gabriel, William Isaac, Ed Lockhart, Simon Osindero, Laura Rimell, Chris Dyer, Oriol Vinyals, Kareem Ayoub, Jeff Stanway, Lorrayne Bennett, Demis Hassabis, Koray Kavukcuoglu, and Geoffrey Irving. Scaling language models: Methods, analysis & insights from training gopher, 2021. URL https://arxiv.org/abs/2112.11446.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 21(140):1–67, 2020. URL http://jmlr.org/papers/v21/20-074.html.
Nazneen Fatema Rajani, Bryan McCann, Caiming Xiong, and Richard Socher. Explain yourself! leveraging language models for commonsense reasoning. In Proceedings of ACL, pages 4932–4942, 2019. URL https://aclanthology.org/P19-1487.
Laria Reynolds and Kyle McDonell. Prompt programming for large language models: Beyond the few-shot paradigm. In Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems, pages 1–7, 2021. URL https://arxiv.org/pdf/2102.07350.pdf.
Subhro Roy and Dan Roth. Solving general arithmetic word problems. In Proceedings of EMNLP, pages 1743–1752, 2015. URL https://aclanthology.org/D15-1202.
Victor Sanh, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica, Sheng Shen, Zheng Xin Yong, Harshit Pandey, Rachel Bawden, Thomas Wang, Trishala Neeraj, Jos Rozen, Abheesht Sharma, Andrea Santilli, Thibault Fevry, Jason Alan Fries, Ryan Teehan, Teven Le Scao, Stella Biderman, Leo Gao, Thomas Wolf, and Alexander M Rush.
Multitask prompted training enables zero-shot task generalization. In ICLR, 2022. URL https://openreview.net/forum?id=9Vrb9D0WI4.
Timo Schick and Hinrich Schütze. It’s not just size that matters: Small language models are also few-shot learners. In Proceedings of NAACL, pages 2339–2352, 2021. URL https://aclanthology.org/2021.naacl-main.185.
Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts. In Proceedings of EMNLP, pages 4222–4235, 2020. URL https://aclanthology.org/2020.emnlp-main.346.
Vered Shwartz, Peter West, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Unsupervised commonsense question answering with self-talk. In Proceedings of EMNLP, pages 4615–4629, 2020. URL https://aclanthology.org/2020.emnlp-main.373.
Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, Elton Zhang, Rewon Child, Reza Yazdani Aminabadi, Julie Bernauer, Xia Song, Mohammad Shoeybi, Yuxiong He, Michael Houston, Saurabh Tiwary, and Bryan Catanzaro. Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model, 2022. URL https://arxiv.org/abs/2201.11990.
Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022. URL https://arxiv.org/abs/2206.04615.
Keith E Stanovich and Richard F West. Individual differences in reasoning: Implications for the rationality debate? Behavioral and brain sciences, 23(5):645–665, 2000.
Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. Commonsenseqa: A question answering challenge targeting commonsense knowledge. In Proceedings of NAACL-HLT, pages 4149–4158, 2019. URL https://aclanthology.org/N19-1421/.
Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, YaGuang Li, Hongrae Lee, Huaixiu Steven Zheng, Amin Ghafouri, Marcelo Menegali, Yanping Huang, Maxim Krikun, Dmitry Lepikhin, James Qin, Dehao Chen, Yuanzhong Xu, Zhifeng Chen, Adam Roberts, Maarten Bosma, Vincent Zhao, Yanqi Zhou, Chung-Ching Chang, Igor Krivokon, Will Rusch, Marc Pickett, Pranesh Srinivasan, Laichee Man, Kathleen Meier-Hellstern, Meredith Ringel Morris, Tulsee Doshi, Renelito Delos Santos, Toju Duke, Johnny Soraker, Ben Zevenbergen, Vinodkumar Prabhakaran, Mark Diaz, Ben Hutchinson, Kristen Olson, Alejandra Molina, Erin Hoffman-John, Josh Lee, Lora Aroyo, Ravi Rajakumar, Alena Butryna, Matthew Lamm, Viktoriya Kuzmina, Joe Fenton, Aaron Cohen, Rachel Bernstein, Ray Kurzweil, Blaise Aguera-Arcas, Claire Cui, Marian Croak, Ed Chi, and Quoc Le. Lamda: Language models for dialog applications, 2022. URL https://arxiv.org/abs/2201.08239.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in NeurIPS, 2017. URL https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
Ben Wang and Aran Komatsuzaki. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax, May 2021.
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022. URL https://arxiv.org/abs/2203.11171.
Albert Webson and Ellie Pavlick. Do prompt-based models really understand the meaning of their prompts? In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2300–2344. Association for Computational Linguistics, July 2022. URL https://aclanthology.org/2022.naacl-main.167.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models, 2022. URL https://arxiv.org/abs/2201.11903.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. Transformers: State-of-the-art natural language processing. In Proceedings of EMNLP, 2020. URL https://aclanthology.org/2020.emnlp-demos.6.
Eric Zelikman, Yuhuai Wu, and Noah D. Goodman. Star: Bootstrapping reasoning with reasoning, 2022. URL https://arxiv.org/abs/2203.14465.
Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022. URL https://arxiv.org/abs/2205.01068.

Tutorial "Technology and Prospects of Foundation Models" (基盤モデルの技術と展望) from JSAI 2023 and CSS 2023
https://speakerdeck.com/yusuke0519/jsai2023-tutorial-ji-pan-moderunoji-shu-tozhan-wang

Attention Is All You Need

References

[1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014.
[3] Denny Britz, Anna Goldie, Minh-Thang Luong, and Quoc V. Le. Massive exploration of neural machine translation architectures. CoRR, abs/1703.03906, 2017.
[4] Jianpeng Cheng, Li Dong, and Mirella Lapata. Long short-term memory-networks for machine reading. arXiv preprint arXiv:1601.06733, 2016.
[5] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. CoRR, abs/1406.1078, 2014.
[6] Francois Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv preprint arXiv:1610.02357, 2016.
[7] Junyoung Chung, Çaglar Gülçehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555, 2014.
[8] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122v2, 2017.
[9] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[11] Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, and Jürgen Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, 2001.
[12] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
[13] Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.
[14] Łukasz Kaiser and Ilya Sutskever. Neural GPUs learn algorithms. In International Conference on Learning Representations (ICLR), 2016.
[15] Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaron van den Oord, Alex Graves, and Koray Kavukcuoglu. Neural machine translation in linear time. arXiv preprint arXiv:1610.10099v2, 2017.
[16] Yoon Kim, Carl Denton, Luong Hoang, and Alexander M. Rush. Structured attention networks. In International Conference on Learning Representations, 2017.
[17] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[18] Oleksii Kuchaiev and Boris Ginsburg. Factorization tricks for LSTM networks. arXiv preprint arXiv:1703.10722, 2017.
[19] Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130, 2017.
[20] Łukasz Kaiser and Samy Bengio. Can active memory replace attention? In Advances in Neural Information Processing Systems, (NIPS), 2016.
[21] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025, 2015.
[22] Ankur Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. A decomposable attention model. In Empirical Methods in Natural Language Processing, 2016.
[23] Romain Paulus, Caiming Xiong, and Richard Socher. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304, 2017.
[24] Ofir Press and Lior Wolf. Using the output embedding to improve language models. arXiv preprint arXiv:1608.05859, 2016.
[25] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015.
[26] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
[27] Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[28] Sainbayar Sukhbaatar, arthur szlam, Jason Weston, and Rob Fergus. End-to-end memory networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2440–2448. Curran Associates, Inc., 2015.
[29] Ilya Sutskever, Oriol Vinyals, and Quoc VV Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.
[30] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. CoRR, abs/1512.00567, 2015.
[31] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
[32] Jie Zhou, Ying Cao, Xuguang Wang, Peng Li, and Wei Xu. Deep recurrent models with fast-forward connections for neural machine translation. CoRR, abs/1606.04199, 2016.

Improving Language Understanding by Generative Pre-Training

Alec Radford OpenAI alec@openai.com
Karthik Narasimhan OpenAI karthikn@openai.com
Tim Salimans OpenAI tim@openai.com
Ilya Sutskever OpenAI ilyasu@openai.com
https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
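
For context, the pre-training objective this paper introduces can be restated as follows (a paraphrase of its standard form, not a quotation): given an unlabeled token corpus U = {u_1, ..., u_n} and a context window of size k, the model parameters Θ are trained to maximize the autoregressive log-likelihood

```latex
L_1(\mathcal{U}) = \sum_{i} \log P\left(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta\right)
```

Supervised fine-tuning then optimizes a task-specific likelihood on labeled data, optionally keeping L_1 as an auxiliary term.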

References

[1] S. Arora, Y. Liang, and T. Ma. A simple but tough-to-beat baseline for sentence embeddings. 2016.
[2] J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[3] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In Advances in neural information processing systems, pages 153–160, 2007.
[4] L. Bentivogli, P. Clark, I. Dagan, and D. Giampiccolo. The fifth pascal recognizing textual entailment challenge. In TAC, 2009.
[5] S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning. A large annotated corpus for learning natural language inference. EMNLP, 2015.
[6] D. Cer, M. Diab, E. Agirre, I. Lopez-Gazpio, and L. Specia. Semeval-2017 task 1: Semantic textual similarity-multilingual and cross-lingual focused evaluation. arXiv preprint arXiv:1708.00055, 2017.
[7] S. Chaturvedi, H. Peng, and D. Roth. Story comprehension for predicting what happens next. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1603–1614, 2017.
[8] D. Chen and C. Manning. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 740–750, 2014.
[9] Z. Chen, H. Zhang, X. Zhang, and L. Zhao. Quora question pairs. https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs, 2018.
[10] R. Collobert and J. Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th international conference on Machine learning, pages 160–167. ACM, 2008.
[11] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537, 2011.
[12] A. Conneau, D. Kiela, H. Schwenk, L. Barrault, and A. Bordes. Supervised learning of universal sentence representations from natural language inference data. EMNLP, 2017.
[13] A. M. Dai and Q. V. Le. Semi-supervised sequence learning. In Advances in Neural Information Processing Systems, pages 3079–3087, 2015.
[14] W. B. Dolan and C. Brockett. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005), 2005.
[15] D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, and S. Bengio. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11(Feb):625–660, 2010.
[16] S. Gray, A. Radford, and K. P. Diederik. Gpu kernels for block-sparse weights. 2017.
[17] Z. He, S. Liu, M. Li, M. Zhou, L. Zhang, and H. Wang. Learning entity representation for entity disambiguation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pages 30–34, 2013.
[18] D. Hendrycks and K. Gimpel. Bridging nonlinearities and stochastic regularizers with gaussian error linear units. arXiv preprint arXiv:1606.08415, 2016.
[19] K. M. Hermann, T. Kocisky, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pages 1693–1701, 2015.
[20] G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural computation, 18(7):1527–1554, 2006.
[21] J. Howard and S. Ruder. Universal language model fine-tuning for text classification. Association for Computational Linguistics (ACL), 2018.
[22] Y. Jernite, S. R. Bowman, and D. Sontag. Discourse-based objectives for fast unsupervised sentence representation learning. arXiv preprint arXiv:1705.00557, 2017.
[23] Y. Ji and J. Eisenstein. Discriminative improvements to distributional sentence similarity. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 891–896, 2013.
[24] F. Jiao, S. Wang, C.-H. Lee, R. Greiner, and D. Schuurmans. Semi-supervised conditional random fields for improved sequence segmentation and labeling. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, pages 209–216. Association for Computational Linguistics, 2006.
[25] T. Khot, A. Sabharwal, and P. Clark. Scitail: A textual entailment dataset from science question answering. In Proceedings of AAAI, 2018.
[26] Y. Kim. Convolutional neural networks for sentence classification. EMNLP, 2014.
[27] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[28] R. Kiros, Y. Zhu, R. R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, and S. Fidler. Skip-thought vectors. In Advances in neural information processing systems, pages 3294–3302, 2015.
[29] N. Kitaev and D. Klein. Constituency parsing with a self-attentive encoder. ACL, 2018.
[30] G. Lai, Q. Xie, H. Liu, Y. Yang, and E. Hovy. Race: Large-scale reading comprehension dataset from examinations. EMNLP, 2017.
[31] G. Lample, L. Denoyer, and M. Ranzato. Unsupervised machine translation using monolingual corpora only. ICLR, 2018.
[32] Q. Le and T. Mikolov. Distributed representations of sentences and documents. In International Conference on Machine Learning, pages 1188–1196, 2014.
[33] P. Liang. Semi-supervised learning for natural language. PhD thesis, Massachusetts Institute of Technology, 2005.
[34] P. J. Liu, M. Saleh, E. Pot, B. Goodrich, R. Sepassi, L. Kaiser, and N. Shazeer. Generating wikipedia by summarizing long sequences. ICLR, 2018.
[35] X. Liu, K. Duh, and J. Gao. Stochastic answer networks for natural language inference. arXiv preprint arXiv:1804.07888, 2018.
[36] L. Logeswaran and H. Lee. An efficient framework for learning sentence representations. ICLR, 2018.
[37] I. Loshchilov and F. Hutter. Fixing weight decay regularization in adam. arXiv preprint arXiv:1711.05101, 2017.
[38] B. McCann, J. Bradbury, C. Xiong, and R. Socher. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems, pages 6297–6308, 2017.
[39] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.
[40] N. Mostafazadeh, M. Roth, A. Louis, N. Chambers, and J. Allen. Lsdsem 2017 shared task: The story cloze test. In Proceedings of the 2nd Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics, pages 46–51, 2017.
[41] K. Nigam, A. McCallum, and T. Mitchell. Semi-supervised text classification using em. Semi-Supervised Learning, pages 33–56, 2006.
[42] J. Pennington, R. Socher, and C. Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543, 2014.
[43] M. E. Peters, W. Ammar, C. Bhagavatula, and R. Power. Semi-supervised sequence tagging with bidirectional language models. ACL, 2017.
[44] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer. Deep contextualized word representations. NAACL, 2018.
[45] Y. Qi, D. S. Sachan, M. Felix, S. J. Padmanabhan, and G. Neubig. When and why are pre-trained word embeddings useful for neural machine translation? NAACL, 2018.
[46] A. Rahman and V. Ng. Resolving complex cases of definite pronouns: the winograd schema challenge. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 777–789. Association for Computational Linguistics, 2012.
[47] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. Squad: 100,000+ questions for machine comprehension of text. EMNLP, 2016.
[48] P. Ramachandran, P. J. Liu, and Q. V. Le. Unsupervised pretraining for sequence to sequence learning. arXiv preprint arXiv:1611.02683, 2016.
[49] M. Ranzato, C. Poultney, S. Chopra, and Y. LeCun. Efficient learning of sparse representations with an energy-based model. In Advances in neural information processing systems, pages 1137–1144, 2007.
[50] M. Rei. Semi-supervised multitask learning for sequence labeling. ACL, 2017.
[51] H. Robbins and S. Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400–407, 1951.
[52] T. Rocktäschel, E. Grefenstette, K. M. Hermann, T. Kočiský, and P. Blunsom. Reasoning about entailment with neural attention. arXiv preprint arXiv:1509.06664, 2015.
[53] R. Sennrich, B. Haddow, and A. Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015.
[54] R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng, and C. Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages 1631–1642, 2013.
[55] S. Srinivasan, R. Arora, and M. Riedl. A simple and effective approach to the story cloze test. arXiv preprint arXiv:1803.05547, 2018.
[56] S. Subramanian, A. Trischler, Y. Bengio, and C. J. Pal. Learning general purpose distributed sentence representations via large scale multi-task learning. arXiv preprint arXiv:1804.00079, 2018.
[57] J. Suzuki and H. Isozaki. Semi-supervised sequential labeling and segmentation using giga-word scale unlabeled data. Proceedings of ACL-08: HLT, pages 665–673, 2008.
[58] Y. Tay, L. A. Tuan, and S. C. Hui. A compare-propagate architecture with alignment factorization for natural language inference. arXiv preprint arXiv:1801.00102, 2017.
[59] Y. Tay, L. A. Tuan, and S. C. Hui. Multi-range reasoning for machine comprehension. arXiv preprint arXiv:1803.09074, 2018.
[60] J. Tian, Z. Zhou, M. Lan, and Y. Wu. Ecnu at semeval-2017 task 1: Leverage kernel-based traditional nlp features and neural networks to build a universal model for multilingual and cross-lingual semantic textual similarity. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 191–197, 2017.
[61] Y. Tsvetkov. Opportunities and challenges in working with low-resource languages. CMU, 2017.
[62] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010, 2017.
[63] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning, pages 1096–1103. ACM, 2008.
[64] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018.
[65] A. Warstadt, A. Singh, and S. R. Bowman. Corpus of linguistic acceptability. http://nyu-mll.github.io/cola, 2018.
[66] A. Williams, N. Nangia, and S. R. Bowman. A broad-coverage challenge corpus for sentence understanding through inference. NAACL, 2018.
[67] Y. Xu, J. Liu, J. Gao, Y. Shen, and X. Liu. Towards human-level machine reading comprehension: Reasoning and inference with multiple strategies. arXiv preprint arXiv:1711.04964, 2017.
[68] D. Yu, L. Deng, and G. Dahl. Roles of pre-training and fine-tuning in context-dependent dbn-hmms for real-world speech recognition. In Proc. NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2010.
[69] R. Zhang, P. Isola, and A. A. Efros. Split-brain autoencoders: Unsupervised learning by cross-channel prediction. In CVPR, volume 1, page 6, 2017.
[70] X. Zhu. Semi-supervised learning literature survey. 2005.
[71] Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE international conference on computer vision, pages 19–27, 2015.

[4] A Survey of Large Language Models, Reference

[5] Scaling Laws for Neural Language Models

Jared Kaplan∗ Johns Hopkins University, OpenAI jaredk@jhu.edu
Sam McCandlish∗ OpenAI sam@openai.com
Tom Henighan OpenAI henighan@openai.com
Scott Gray OpenAI scott@openai.com
Tom B. Brown OpenAI tom@openai.com
Alec Radford OpenAI alec@openai.com
Benjamin Chess OpenAI bchess@openai.com
Rewon Child OpenAI rewon@openai.com
Jeffrey Wu OpenAI jeffwu@openai.com
Dario Amodei OpenAI damodei@openai.com
https://arxiv.org/pdf/2001.08361
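
As a brief reminder of what the title refers to (a paraphrase of the paper's summary; the exponents below are the approximate values reported there), the cross-entropy loss of an autoregressive Transformer is well fit by power laws in parameter count N, dataset size D, and training compute C when the other two factors are not the bottleneck:

```latex
L(N) \approx \left(\tfrac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) \approx \left(\tfrac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C_{\min}) \approx \left(\tfrac{C_c^{\min}}{C_{\min}}\right)^{\alpha_C^{\min}}
```

with α_N ≈ 0.076, α_D ≈ 0.095, and α_C^min ≈ 0.050.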

References
[ACDE12] Eduardo G Altmann, Giampaolo Cristadoro, and Mirko Degli Esposti. On the origin of long-
range correlations in texts. Proceedings of the National Academy of Sciences, 109(29):11582–
11587, 2012. 25
[AS17] Madhu S. Advani and Andrew M. Saxe. High-dimensional dynamics of generalization error in
neural networks. arXiv, 2017, 1710.03667. 11, 18, 22
[BB01] Michele Banko and Eric Brill. Scaling to very very large corpora for natural language disam-
biguation. In Proceedings of the 39th annual meeting on association for computational linguis-
tics, pages 26–33. Association for Computational Linguistics, 2001. 18
[BHMM18] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine
learning and the bias-variance trade-off. arXiv, 2018, 1812.11118. 18
[Bia12] Gérard Biau. Analysis of a random forests model. Journal of Machine Learning Research,
13(Apr):1063–1095, 2012. 18
[CGRS19] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with
sparse transformers. CoRR, abs/1904.10509, 2019, 1904.10509. URL http://arxiv.org/
abs/1904.10509. 19
[DCLT18] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep
bidirectional transformers for language understanding, 2018, arXiv:1810.04805. 2
[DGV+18] Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. Uni-
versal transformers. CoRR, abs/1807.03819, 2018, 1807.03819. URL http://arxiv.org/
abs/1807.03819. 6, 9, 23, 24
[EP94] Werner Ebeling and Thorsten Pöschel. Entropy and long-range correlations in literary english.
EPL (Europhysics Letters), 26(4):241, 1994. 25
[Fou] The Common Crawl Foundation. Common crawl. URL http://commoncrawl.org. 7
[GARD18] Guy Gur-Ari, Daniel A. Roberts, and Ethan Dyer. Gradient descent happens in a tiny subspace. 2018, arXiv:1812.04754. 18
[GJS+19] Mario Geiger, Arthur Jacot, Stefano Spigler, Franck Gabriel, Levent Sagun, Stéphane d’Ascoli,
Giulio Biroli, Clément Hongler, and Matthieu Wyart. Scaling description of generalization with
number of parameters in deep learning. arXiv, 2019, 1901.01608. 18
[GKX19] Behrooz Ghorbani, Shankar Krishnan, and Ying Xiao. An investigation into neural net op-
timization via hessian eigenvalue density. CoRR, abs/1901.10159, 2019, 1901.10159. URL
http://arxiv.org/abs/1901.10159. 18
[Goo01] Joshua Goodman. A bit of progress in language modeling. CoRR, cs.CL/0108005, 2001. URL
http://arxiv.org/abs/cs.CL/0108005. 18
[GRK17] Scott Gray, Alec Radford, and Diederik P Kingma. Gpu kernels for block-sparse weights. ope-
nai.com, 2017. 19
[HAD19] Joel Hestness, Newsha Ardalani, and Gregory Diamos. Beyond human-level accuracy: Compu-
tational challenges in deep learning. In Proceedings of the 24th Symposium on Principles and
Practice of Parallel Programming, PPoPP ’19, pages 1–14, New York, NY, USA, 2019. ACM.
doi:10.1145/3293883.3295710. 18
[HCC+18] Yanping Huang, Yonglong Cheng, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le,
and Zhifeng Chen. Gpipe: Efficient training of giant neural networks using pipeline parallelism.
CoRR, abs/1811.06965, 2018, 1811.06965. URL http://arxiv.org/abs/1811.06965. 19
[HNA+17] Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kia-
ninejad, Md. Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep learning scaling is pre-
dictable, empirically, 2017, 1712.00409. 18
[JGH18] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and
generalization in neural networks. In Advances in neural information processing systems, pages
8571–8580, 2018. 18
[KB14] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2014,
1412.6980. 7
[Kom19] Aran Komatsuzaki. One epoch is all you need, 2019, arXiv:1906.06669. 18
[KSH12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS’12, pages 1097–1105, USA, 2012. Curran Associates Inc. URL http://dl.acm.org/citation.cfm?id=2999134.2999257. 19
[LCG+19] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu
Soricut. Albert: A lite bert for self-supervised learning of language representations, 2019,
1909.11942. 9
[LOG+19] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike
Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized BERT pretrain-
ing approach. CoRR, abs/1907.11692, 2019, 1907.11692. URL http://arxiv.org/abs/
1907.11692. 2
[LSP+18] Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and
Noam Shazeer. Generating wikipedia by summarizing long sequences. arXiv:1801.10198 [cs],
2018, 1801.10198. URL http://arxiv.org/abs/1801.10198. 2, 6
[LT16] Henry W Lin and Max Tegmark. Criticality in formal languages and statistical physics. arXiv
preprint arXiv:1606.06737, 2016. 25
[LXS+19] Jaehoon Lee, Lechao Xiao, Samuel S. Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-
Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models
under gradient descent, 2019, arXiv:1902.06720. 18
[MKAT18] Sam McCandlish, Jared Kaplan, Dario Amodei, and OpenAI Dota Team. An empirical model
of large-batch training, 2018, arXiv:1812.06162. 3, 5, 6, 12, 13, 21
[Pap18] Vardan Papyan. The full spectrum of deep net hessians at scale: Dynamics with sample size.
CoRR, abs/1811.07062, 2018, 1811.07062. URL http://arxiv.org/abs/1811.07062. 18
[RNSS18] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language
understanding by generative pre-training. URL https://s3-us-west-2. amazonaws. com/openai-
assets/research-covers/languageunsupervised/language understanding paper. pdf, 2018. 2, 6
[RRBS19a] Jonathan S. Rosenfeld, Amir Rosenfeld, Yonatan Belinkov, and Nir Shavit. A constructive
prediction of the generalization error across scales, 2019, 1909.12673. 18
[RRBS19b] Jonathan S. Rosenfeld, Amir Rosenfeld, Yonatan Belinkov, and Nir Shavit. A constructive
prediction of the generalization error across scales, 2019, arXiv:1909.12673. 18
[RSR+19] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena,
Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified
text-to-text transformer, 2019, arXiv:1910.10683. 2
[RWC+19] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language
models are unsupervised multitask learners. openai.com, 2019. 2, 5, 6, 7, 8
[SCP+18] Noam Shazeer, Youlong Cheng, Niki Parmar, Dustin Tran, Ashish Vaswani, Penporn Koanan-
takool, Peter Hawkins, HyoukJoong Lee, Mingsheng Hong, Cliff Young, Ryan Sepassi, and
Blake Hechtman. Mesh-tensorflow: Deep learning for supercomputers, 2018, 1811.02084. 19
[SHB15] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words
with subword units. CoRR, 2015, 1508.07909. 6
[SLA+18] Christopher J. Shallue, Jaehoon Lee, Joe Antognini, Jascha Sohl-Dickstein, Roy Frostig, and
George E. Dahl. Measuring the effects of data parallelism on neural network training, 2018,
arXiv:1811.03600. 12
[SS18] Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory
cost. CoRR, abs/1804.04235, 2018, 1804.04235. URL http://arxiv.org/abs/1804.04235.
7
[THK18] Stefan Thurner, Rudolf Hanel, and Peter Klimek. Introduction to the theory of complex systems.
Oxford University Press, 2018. 18
[TL19] Mingxing Tan and Quoc V. Le. Efficientnet: Rethinking model scaling for convolutional neural
networks. CoRR, abs/1905.11946, 2019, 1905.11946. URL http://arxiv.org/abs/1905.
11946. 18
[VSP+17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg,
S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural
Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc., 2017. URL
http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf. 2, 6
[VWB16] Andreas Veit, Michael Wilber, and Serge Belongie. Residual networks behave like ensembles
of relatively shallow networks, 2016, arXiv:1605.06431. 8, 18
[Was06] Larry Wasserman. All of nonparametric statistics. Springer Science & Business Media, 2006.
18
[WPN+19] Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill,
Omer Levy, and Samuel R. Bowman. Superglue: A stickier benchmark for general-purpose
language understanding systems, 2019, 1905.00537. 2
[WRH17] Yu-Xiong Wang, Deva Ramanan, and Martial Hebert. Growing a brain: Fine-tuning by in-
creasing model capacity. 2017 IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), Jul 2017. doi:10.1109/cvpr.2017.323. 19
[WYL19] Wei Wen, Feng Yan, and Hai Li. Autogrow: Automatic layer growing in deep convolutional
networks, 2019, 1906.02909. 19
[YDY+19] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V.
Le. Xlnet: Generalized autoregressive pretraining for language understanding, 2019,
arXiv:1906.08237. 2
[ZK16] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. Procedings of the British
Machine Vision Conference 2016, 2016. doi:10.5244/c.30.87. 18
[ZKZ+15] Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Tor-
ralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by
watching movies and reading books. 2015 IEEE International Conference on Computer Vision
(ICCV), Dec 2015. doi:10.1109/iccv.2015.11. 7
[ZLN+19] Guodong Zhang, Lala Li, Zachary Nado, James Martens, Sushant Sachdeva, George E. Dahl,
Christopher J. Shallue, and Roger B. Grosse. Which algorithmic choices matter at which batch
sizes? insights from a noisy quadratic model. CoRR, abs/1907.04164, 2019, 1907.04164. URL
http://arxiv.org/abs/1907.04164. 12, 18

[6] Emergent Abilities of Large Language Models

Jason Wei1 jasonwei@google.com
Yi Tay1 yitay@google.com
Rishi Bommasani2 nlprishi@stanford.edu
Colin Raffel3 craffel@gmail.com
Barret Zoph1 barretzoph@google.com
Sebastian Borgeaud4 sborgeaud@deepmind.com
Dani Yogatama4 dyogatama@deepmind.com
Maarten Bosma1 bosma@google.com
Denny Zhou1 dennyzhou@google.com
Donald Metzler1 metzler@google.com
Ed H. Chi1 edchi@google.com
Tatsunori Hashimoto2 thashim@stanford.edu
Oriol Vinyals4 vinyals@deepmind.com
Percy Liang2 pliang@stanford.edu
Jeff Dean1 jeff@google.com
William Fedus1 liamfedus@google.com

References

Omri Abend, Tom Kwiatkowski, Nathaniel J Smith, Sharon Goldwater, and Mark Steedman. Bootstrapping
language acquisition. Cognition, 164:116–143, 2017. URL https://homepages.inf.ed.ac.uk/
sgwater/papers/cognition17-bootstrapping.pdf.
Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn,
Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, et al. Do as I can, not as I say: Grounding
language in robotic affordances. arXiv preprint arXiv:2204.01691, 2022. URL https://arxiv.org/
abs/2204.01691.
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc,
Arthur Mensch, Katie Millican, Malcolm Reynolds, et al. Flamingo: A visual language model for few-shot
learning. NeurIPS, 2022. URL https://arxiv.org/abs/2204.14198.
Philip W. Anderson. More is different: Broken symmetry and the nature of the hierarchical structure of
science. Science, 177(4047):393–396, 1972. URL http://www.lanais.famaf.unc.edu.ar/cursos/
em/Anderson-MoreDifferent-1972.pdf.
Simran Arora, Avanika Narayan, Mayee F Chen, Laurel J Orr, Neel Guha, Kush Bhatia, Ines Chami, Frederic
Sala, and Christopher Ré. Ask me anything: A simple strategy for prompting language models. arXiv
preprint arXiv:2210.02441, 2022. URL https://arxiv.org/abs/2210.02441.
Mikel Artetxe, Shruti Bhosale, Naman Goyal, Todor Mihaylov, Myle Ott, Sam Shleifer, Xi Victoria Lin, Jingfei
Du, Srinivasan Iyer, Ramakanth Pasunuru, et al. Efficient large scale language modeling with mixtures of
experts. arXiv preprint arXiv:2112.10684, 2021. URL https://arxiv.org/abs/2112.10684.
Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas
Joseph, Ben Mann, Nova DasSarma, et al. A general language assistant as a laboratory for alignment.
arXiv preprint arXiv:2112.00861, 2021. URL https://arxiv.org/abs/2112.00861.
Fabrice Bellard. gpt2tc: Text completion and compression using GPT-2, 2021. URL https://bellard.
org/libnc/gpt2tc.html. Accessed Apr. 26, 2022.
Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the dangers of
stochastic parrots: Can language models be too big? Proceedings of the 2021 ACM Conference on Fairness,
Accountability, and Transparency, 2021. URL https://dl.acm.org/doi/pdf/10.1145/3442188.
3445922.
BIG-Bench. Beyond the imitation game: Measuring and extrapolating the capabilities of language models.
arXiv preprint arXiv:2206.04615, 2022. URL https://arxiv.org/abs/2206.04615.
Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S.
Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks
of foundation models. arXiv preprint arXiv:2108.07258, 2021. URL https://arxiv.org/abs/2108.
07258.
Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George
van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. Improving language
models by retrieving from trillions of tokens. arXiv preprint arXiv:2112.04426, 2021. URL https:
//arxiv.org/abs/2112.04426.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhari-
wal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language mod-
els are few-shot learners. NeurIPS, 2020. URL https://papers.nips.cc/paper/2020/hash/
1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html.
Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam
Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. Extracting training data from large language mod-
els. USENIX Security, 2021. URL https://www.usenix.org/conference/usenixsecurity21/
presentation/carlini-extracting.
Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, and Chiyuan Zhang.
Quantifying memorization across neural language models. arXiv preprint arXiv:2202.07646, 2022. URL
https://arxiv.org/abs/2202.07646.
Stephanie C.Y. Chan, Adam Santoro, Andrew K. Lampinen, Jane X. Wang, Aaditya Singh, Pierre H.
Richemond, Jay McClelland, and Felix Hill. Data distributional properties drive emergent few-shot learning
in transformers. arXiv preprint arXiv:2205.05055, 2022. URL https://arxiv.org/abs/2205.05055.
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Hyung Won Chung,
Charles Sutton, Sebastian Gehrmann, Parker Schuh, et al. PaLM: Scaling language modeling with Pathways.
arXiv preprint arXiv:2204.02311, 2022. URL https://arxiv.org/abs/2204.02311.
Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang,
Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv
preprint arXiv:2210.11416, 2022. URL https://arxiv.org/abs/2210.11416.
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and
John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
URL https://arxiv.org/abs/2110.14168.
Andy Coenen, Luke Davis, Daphne Ippolito, Emily Reif, and Ann Yuan. Wordcraft: A human-AI collaborative
editor for story writing. arXiv preprint arXiv:2107.07430, 2021. URL https://arxiv.org/abs/2107.
07430.
Antonella Corradini and Timothy O’Connor. Emergence in science and philosophy, volume 6. Rout-
ledge, 2010. URL https://books.google.com/books?hl=en&lr=&id=55RaBwAAQBAJ&oi=
fnd&pg=PP1&dq=Emergence+in+science+and+philosophy&ots=2_8VNDXLfv&sig=1aisq_
WouF95Cx58WWMZ0Gq3RNk.
Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Universal transformers.
arXiv preprint arXiv:1807.03819, 2018. URL https://arxiv.org/abs/1807.03819.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirec-
tional transformers for language understanding. NAACL, 2019. URL https://aclanthology.org/
N19-1423.
Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun,
Yanqi Zhou, Adams Wei Yu, Orhan Firat, et al. GLaM: Efficient scaling of language models with
mixture-of-experts. ICML, 2021. URL https://arxiv.org/abs/2112.06905.
William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models
with simple and efficient sparsity. arXiv preprint arXiv:2101.03961, 2021. URL https://arxiv.org/
abs/2101.03961.
Stephanie Forrest. Emergent computation: Self-organizing, collective, and cooperative phenomena in
natural and artificial computing networks. Physica D: Nonlinear Phenomena, 42(1-3):1–11, 1990. URL
https://www.sciencedirect.com/science/article/abs/pii/016727899090063U.
Deep Ganguli, Danny Hernandez, Liane Lovitt, Nova DasSarma, Tom Henighan, Andy Jones, Nicholas
Joseph, Jackson Kernion, Ben Mann, Amanda Askell, et al. Predictability and surprise in large generative
models. arXiv preprint arXiv:2202.07785, 2022. URL https://arxiv.org/abs/2202.07785.
Tianyu Gao, Adam Fisch, and Danqi Chen. Making pre-trained language models better few-shot learners. ACL,
2021. doi: 10.18653/v1/2021.acl-long.295. URL https://aclanthology.org/2021.acl-long.295.
Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. RealToxicityPrompts:
Evaluating neural toxic degeneration in language models. In Findings of EMNLP, 2020. doi: 10.18653/v1/
2020.findings-emnlp.301. URL https://aclanthology.org/2020.findings-emnlp.301.
Alex Graves. Adaptive computation time for recurrent neural networks. arXiv preprint arXiv:1603.08983,
2016. URL https://arxiv.org/abs/1603.08983.
Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. REALM: Retrieval-augmented
language model pre-training. ICML, 2020. URL https://arxiv.org/abs/2002.08909.
David A. Harper and Paul A. Lewis. New perspectives on emergence in economics. New Per-
spectives on Emergence in Economics, pp. 2–3, 2012. URL https://www.sciencedirect.
com/science/article/pii/S0167268112000200?casa_token=fLs2nCYo_64AAAAA:
H2sSpSygJmEqXgmpM4jLyeppph3C4TgEsaSXm5RkOpT0r4q2A1x9Su3u4uycK4sIC6a8NdLiSw.
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt.
Measuring massive multitask language understanding. ICLR, 2021a. URL https://openreview.net/
forum?id=d7KBjmI3GmQ.
Dan Hendrycks, Nicholas Carlini, John Schulman, and Jacob Steinhardt. Unsolved problems in ML safety.
arXiv preprint arXiv:2109.13916, 2021b. URL https://arxiv.org/abs/2109.13916.
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford,
Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal
large language models. NeurIPS, 2022. URL https://arxiv.org/abs/2203.15556.
Ari Holtzman, Peter West, Vered Shwartz, Yejin Choi, and Luke Zettlemoyer. Surface form competition:
Why the highest probability answer isn’t always right. EMNLP, 2021. URL https://aclanthology.
org/2021.emnlp-main.564.
Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language models as zero-shot planners:
Extracting actionable knowledge for embodied agents. arXiv preprint arXiv:2201.07207, 2022. URL
https://arxiv.org/pdf/2201.07207.
Bernardo A. Huberman and Tad Hogg. Phase transitions in artificial intelligence systems. Artificial
Intelligence, 33(2):155–171, 1987. URL https://www.sciencedirect.com/science/article/
abs/pii/0004370287900336.
Harold Y. Hwang, Yoh Iwasa, Masashi Kawasaki, Bernhard Keimer, Naoto Nagaosa, and Yoshinori Tokura.
Emergent phenomena at oxide interfaces. Nature Materials, 11(2):103–113, 2012. URL https://www.
nature.com/articles/nmat3223.
Max Jaderberg, Wojciech Marian Czarnecki, Simon Osindero, Oriol Vinyals, Alex Graves, David Silver,
and Koray Kavukcuoglu. Decoupled neural interfaces using synthetic gradients. ICML, 2017. URL
https://arxiv.org/abs/1608.05343.
Dan Jurafsky and James H. Martin. Speech and Language Processing: An Introduction to Natural Language
Processing, Computational Linguistics, and Speech Recognition. Prentice Hall series in Artificial Intelligence.
Pearson Prentice Hall, 2009. ISBN 9780131873216. URL https://books.google.com/books?id=
fZmj5UNK8AQC.
Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer,
Zac Hatfield Dodds, Nova DasSarma, Eli Tran-Johnson, et al. Language models (mostly) know what they
know. arXiv preprint arXiv:2207.05221, 2022. URL https://arxiv.org/abs/2207.05221.
Nikhil Kandpal, Eric Wallace, and Colin Raffel. Deduplicating training data mitigates privacy risks in
language models. ICML, 2022. URL https://arxiv.org/abs/2202.06539.
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray,
Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint
arXiv:2001.08361, 2020. URL https://arxiv.org/abs/2001.08361.
Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language
models are zero-shot reasoners. NeurIPS, 2022. URL https://arxiv.org/abs/2205.11916.
Andrew K. Lampinen, Ishita Dasgupta, Stephanie C.Y. Chan, Kory Matthewson, Michael Henry Tessler,
Antonia Creswell, James L. McClelland, Jane X. Wang, and Felix Hill. Can language models learn from
explanations in context? Findings of EMNLP, 2022. URL https://arxiv.org/abs/2204.02329.
Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch,
and Nicholas Carlini. Deduplicating training data makes language models better. ACL, 2022a. URL
https://arxiv.org/abs/2107.06499.
Mina Lee, Percy Liang, and Qian Yang. Coauthor: Designing a human-AI collaborative writing dataset for
exploring language model capabilities. CHI, 2022b. URL https://arxiv.org/abs/2201.06796.
Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim
Krikun, Noam Shazeer, and Zhifeng Chen. Gshard: Scaling giant models with conditional computation
and automatic sharding. ICLR, 2021. URL https://openreview.net/forum?id=qrwe7XHTmYb.
Percy Liang. Semi-supervised learning for natural language. PhD thesis, Massachusetts Institute of Technology,
2005. URL https://www-cs.stanford.edu/~pliang/papers/meng-thesis.pdf.
Percy Liang, Rishi Bommasani, Kathleen A. Creel, and Rob Reich. The time is now to develop community norms for the release of foundation models, 2022. URL https://crfm.stanford.edu/2022/05/17/community-norms.html.
Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods.
arXiv preprint arXiv:2109.07958, 2021. URL https://arxiv.org/abs/2109.07958.
Christopher D. Manning. Human language understanding & reasoning. Daedalus, 151(2):127–138, 2022. URL
https://www.amacad.org/publication/human-language-understanding-reasoning.
Christopher D. Manning, Kevin Clark, John Hewitt, Urvashi Khandelwal, and Omer Levy. Emergent linguistic structure in artificial neural networks trained by self-supervision. Proceedings of the National Academy of Sciences, 117(48):30046–30054, 2020. URL https://www.pnas.org/doi/10.1073/pnas.1907367117.
Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730, 2018. URL https://arxiv.org/abs/1806.08730.
Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models, 2016.
URL https://huggingface.co/datasets/wikitext.
Scott Miller, Jethran Guinness, and Alex Zamanian. Name tagging with word clusters and discriminative
training. In NAACL, 2004. URL https://aclanthology.org/N04-1043.
Sewon Min, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. Noisy channel language model prompting
for few-shot text classification. ACL, 2022a. URL https://arxiv.org/abs/2108.04106.
Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. Rethinking the role of demonstrations: What makes in-context learning work? arXiv preprint arXiv:2202.12837, 2022b. URL https://arxiv.org/abs/2202.12837.
Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber,
David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, et al. Show your work: Scratchpads
for intermediate computation with language models. arXiv preprint arXiv:2112.00114, 2021. URL
https://openreview.net/forum?id=iedYJm92o0a.
Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, et al. In-context learning and induction heads. Transformer Circuits, 2022. URL https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html.
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang,
Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with
human feedback. arXiv preprint arXiv:2203.02155, 2022. URL https://arxiv.org/abs/2203.02155.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation
of machine translation. In ACL, 2002. URL https://aclanthology.org/P02-1040.pdf.
Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Padmakumar, Jason Phang, Jana Thompson, Phu Mon
Htut, and Samuel Bowman. BBQ: A hand-built bias benchmark for question answering. In Findings of
ACL, 2022. URL https://arxiv.org/abs/2110.08193.
Roma Patel and Ellie Pavlick. Mapping language models to grounded conceptual spaces. ICLR, 2022. URL
https://openreview.net/forum?id=gJcEM8sxHK.
Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat
McAleese, and Geoffrey Irving. Red teaming language models with language models. arXiv preprint
arXiv:2202.03286, 2022. URL https://arxiv.org/abs/2202.03286.
Mohammad Taher Pilehvar and Jose Camacho-Collados. WiC: the word-in-context dataset for evaluating context-sensitive meaning representations. NAACL, 2019. URL https://aclanthology.org/N19-1128.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8), 2019. URL https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf.
Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. Scaling language models: Methods, analysis & insights from training Gopher. arXiv preprint arXiv:2112.11446, 2021. URL https://arxiv.org/abs/2112.11446.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou,
Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.
Journal of Machine Learning Research, 2020. URL https://jmlr.org/papers/v21/20-074.html.
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022. URL https://arxiv.org/abs/2204.06125.
Yasaman Razeghi, Robert L Logan IV, Matt Gardner, and Sameer Singh. Impact of pretraining term frequencies on few-shot reasoning. arXiv preprint arXiv:2202.07206, 2022. URL https://arxiv.org/abs/2202.07206.
Laria Reynolds and Kyle McDonell. Prompt programming for large language models: Beyond the few-shot
paradigm. Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems, 2021.
URL https://arxiv.org/abs/2102.07350.
Rachel Rudinger, Chandler May, and Benjamin Van Durme. Social bias in elicited natural language
inferences. In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, 2017.
URL https://aclanthology.org/W17-1609.
Victor Sanh, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin,
Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. Multitask prompted training enables zero-shot task
generalization. ICLR, 2022. URL https://openreview.net/forum?id=9Vrb9D0WI4.
Nikunj Saunshi, Sadhika Malladi, and Sanjeev Arora. A mathematical exploration of why language models
help solve downstream tasks. ICLR, 2021. URL https://arxiv.org/abs/2010.03648.
Timo Schick and Hinrich Schütze. It’s not just size that matters: Small language models are also few-shot
learners. NAACL, June 2021. URL https://aclanthology.org/2021.naacl-main.185.
Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, Dipanjan Das, and Jason Wei. Language models are multilingual chain-of-thought reasoners. arXiv preprint arXiv:2210.03057, 2022. URL https://arxiv.org/abs/2210.03057.
Jacob Steinhardt. On the risks of emergent behavior in foundation models, October 2021. URL https://bounded-regret.ghost.io/on-the-risks-of-emergent-behavior-in-foundation-models/. Accessed Apr 13, 2022.
Jacob Steinhardt. Future ML systems will be qualitatively different, 2022. URL https://bounded-regret.ghost.io/future-ml-systems-will-be-qualitatively-different/. Accessed May 20, 2022.
Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V. Le, Ed H. Chi, Denny Zhou, and Jason Wei. Challenging BIG-Bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022. URL https://arxiv.org/abs/2210.09261.
Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng,
Neil Houlsby, and Donald Metzler. Unifying language learning paradigms. arXiv preprint arXiv:2205.05131,
2022a. URL https://arxiv.org/abs/2205.05131.
Yi Tay, Vinh Q. Tran, Mostafa Dehghani, Jianmo Ni, Dara Bahri, Harsh Mehta, Zhen Qin, Kai Hui, Zhe Zhao,
Jai Gupta, et al. Transformer memory as a differentiable search index. arXiv preprint arXiv:2202.06991,
2022b. URL https://arxiv.org/abs/2202.06991.
Yi Tay, Jason Wei, Hyung Won Chung, Vinh Q Tran, David R So, Siamak Shakeri, Xavier Garcia,
Huaixiu Steven Zheng, Jinfeng Rao, Aakanksha Chowdhery, et al. Transcending scaling laws with 0.1%
extra compute. arXiv preprint arXiv:2210.11399, 2022c. URL https://arxiv.org/abs/2210.11399.
Ryan Teehan, Miruna Clinciu, Oleg Serikov, Eliza Szczechla, Natasha Seelam, Shachar Mirkin, and Aaron
Gokaslan. Emergent structures and training dynamics in large language models. In ACL Big Science
Workshop, 2022. URL https://aclanthology.org/2022.bigscience-1.11/.
Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng,
Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. LaMDA: Language models for dialog applications. arXiv
preprint arXiv:2201.08239, 2022. URL https://arxiv.org/abs/2201.08239.
Trieu H. Trinh and Quoc V. Le. A simple method for commonsense reasoning. arXiv preprint arXiv:1806.02847, 2018. URL https://arxiv.org/abs/1806.02847.
Thomas Wang, Adam Roberts, Daniel Hesslow, Teven Le Scao, Hyung Won Chung, Iz Beltagy, Julien Launay,
and Colin Raffel. What language model architecture and pretraining objective work best for zero-shot
generalization? ICML, 2022a. URL https://arxiv.org/abs/2204.05832.
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022b. URL https://arxiv.org/abs/2203.11171.
Colin Wei, Sang Michael Xie, and Tengyu Ma. Why do pretrained language models help in downstream tasks? An analysis of head and prompt tuning. NeurIPS, 2021a. URL https://openreview.net/forum?id=MDMV2SxCboX.
Jason Wei, Dan Garrette, Tal Linzen, and Ellie Pavlick. Frequency effects on syntactic rule learning in transformers. EMNLP, 2021b. doi: 10.18653/v1/2021.emnlp-main.72. URL https://aclanthology.org/2021.emnlp-main.72.
Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. Finetuned language models are zero-shot learners. ICLR, 2022a. URL https://openreview.net/forum?id=gEZrGCozdqR.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. NeurIPS, 2022b. URL https://arxiv.org/abs/2201.11903.
Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra Cheng,
Mia Glaese, Borja Balle, Atoosa Kasirzadeh, et al. Ethical and social risks of harm from language models.
arXiv preprint arXiv:2112.04359, 2021. URL https://arxiv.org/abs/2112.04359.
Tongshuang Wu, Michael Terry, and Carrie J. Cai. AI chains: Transparent and controllable human-AI
interaction by chaining large language model prompts. arXiv preprint arXiv:2110.01691, 2021. URL
https://arxiv.org/abs/2110.01691.
Tongshuang Wu, Ellen Jiang, Aaron Donsbach, Jeff Gray, Alejandra Molina, Michael Terry, and Carrie J.
Cai. PromptChainer: Chaining large language model prompts through visual programming. arXiv preprint
arXiv:2203.06566, 2022a. URL https://arxiv.org/abs/2203.06566.
Yuhuai Wu, Markus N Rabe, DeLesley Hutchins, and Christian Szegedy. Memorizing transformers. arXiv preprint arXiv:2203.08913, 2022b. URL https://arxiv.org/abs/2203.08913.
Sang Michael Xie, Aditi Raghunathan, Percy Liang, and Tengyu Ma. An explanation of in-context learning
as implicit bayesian inference. ICLR, 2022. URL https://arxiv.org/abs/2111.02080.
Andy Zeng, Adrian Wong, Stefan Welker, Krzysztof Choromanski, Federico Tombari, Aveek Purohit, Michael Ryoo, Vikas Sindhwani, Johnny Lee, Vincent Vanhoucke, et al. Socratic models: Composing zero-shot multimodal reasoning with language. arXiv preprint arXiv:2204.00598, 2022. URL https://arxiv.org/abs/2204.00598.
Yian Zhang, Alex Warstadt, Xiaocheng Li, and Samuel R. Bowman. When do you need billions of words of pretraining data? In ACL, 2021. doi: 10.18653/v1/2021.acl-long.90. URL https://aclanthology.org/2021.acl-long.90.
Zihao Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. Calibrate before use: Improving few-shot
performance of language models. ICML, 2021. URL https://arxiv.org/abs/2102.09690.
Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Olivier
Bousquet, Quoc Le, and Ed Chi. Least-to-most prompting enables complex reasoning in large language
models. arXiv preprint arXiv:2205.10625, 2022. URL https://arxiv.org/abs/2205.10625.
Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, and William Fedus. Designing effective sparse expert models. arXiv preprint arXiv:2202.08906, 2022. URL https://arxiv.org/abs/2202.08906.

[7] Language Models are Few-Shot Learners
