A Survey of Large Language Models, Reference

Posted at 2025-10-15

A Survey of Large Language Models

Wayne Xin Zhao, Kun Zhou*, Junyi Li*, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie and Ji-Rong Wen
https://arxiv.org/pdf/2303.18223

REFERENCES

[1] Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin, “A neural probabilistic language model,” J. Mach. Learn. Res., vol. 3, pp. 1137–1155, 2003.
[2] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. P. Kuksa, “Natural language processing (almost) from scratch,” J. Mach. Learn. Res., vol. 12, pp. 2493–2537, 2011.
[3] S. Pinker, The Language Instinct: How the Mind Creates Language. Brilliance Audio; Unabridged edition, 2014.
[4] M. D. Hauser, N. Chomsky, and W. T. Fitch, “The faculty of language: what is it, who has it, and how did it evolve?” Science, vol. 298, no. 5598, pp. 1569–1579, 2002.
[5] A. M. Turing, “Computing machinery and intelligence,” Mind, vol. LIX, no. 236, pp. 433–460, 1950.
[6] F. Jelinek, Statistical Methods for Speech Recognition. MIT Press, 1998.
[7] J. Gao and C. Lin, “Introduction to the special issue on statistical language modeling,” ACM Trans. Asian Lang. Inf. Process., vol. 3, no. 2, pp. 87–93, 2004.
[8] R. Rosenfeld, “Two decades of statistical language modeling: Where do we go from here?” Proceedings of the IEEE, vol. 88, no. 8, pp. 1270–1278, 2000.
[9] A. Stolcke, “SRILM - an extensible language modeling toolkit,” in Seventh international conference on spoken language processing, 2002.
[10] X. Liu and W. B. Croft, “Statistical language modeling for information retrieval,” Annu. Rev. Inf. Sci. Technol., vol. 39, no. 1, pp. 1–31, 2005.
[11] C. Zhai, Statistical Language Models for Information Retrieval, ser. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers, 2008.
[12] S. M. Thede and M. P. Harper, “A second-order hidden markov model for part-of-speech tagging,” in 27th Annual Meeting of the Association for Computational Linguistics, University of Maryland, College Park, Maryland, USA, 20-26 June 1999, R. Dale and K. W. Church, Eds. ACL, 1999, pp. 175–182.
[13] L. R. Bahl, P. F. Brown, P. V. de Souza, and R. L. Mercer, “A tree-based statistical language model for natural language speech recognition,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 7, pp. 1001–1008, 1989.
[14] T. Brants, A. C. Popat, P. Xu, F. J. Och, and J. Dean, “Large language models in machine translation,” in EMNLP-CoNLL 2007, Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, June 28-30, 2007, Prague, Czech Republic, J. Eisner, Ed. ACL, 2007, pp. 858–867.
[15] S. M. Katz, “Estimation of probabilities from sparse data for the language model component of a speech recognizer,” IEEE Trans. Acoust. Speech Signal Process., vol. 35, no. 3, pp. 400–401, 1987.
[16] W. A. Gale and G. Sampson, “Good-turing frequency estimation without tears,” J. Quant. Linguistics, vol. 2, no. 3, pp. 217–237, 1995.
[17] T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur, “Recurrent neural network based language model,” in INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, September 26-30, 2010, T. Kobayashi, K. Hirose, and S. Nakamura, Eds. ISCA, 2010, pp. 1045–1048.
[18] S. Kombrink, T. Mikolov, M. Karafiát, and L. Burget, “Recurrent neural network based language modeling in meeting recognition,” in INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association, Florence, Italy, August 27-31, 2011. ISCA, 2011, pp. 2877–2880.
[19] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, C. J. C. Burges, L. Bottou, Z. Ghahramani, and K. Q. Weinberger, Eds., 2013, pp. 3111–3119.
[20] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” in 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track Proceedings, Y. Bengio and Y. LeCun, Eds., 2013.
[21] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, “Deep contextualized word representations,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), M. A. Walker, H. Ji, and A. Stent, Eds. Association for Computational Linguistics, 2018, pp. 2227–2237.
[22] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, 2017, pp. 5998–6008.
[23] J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio, Eds. Association for Computational Linguistics, 2019, pp. 4171–4186.
[24] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer, “BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, 2020, pp. 7871–7880.
[25] W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity,” J. Mach. Learn. Res., pp. 1–40, 2021.
[26] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., “Language models are unsupervised multitask learners,” OpenAI blog, p. 9, 2019.
[27] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized BERT pretraining approach,” CoRR, vol. abs/1907.11692, 2019.
[28] V. Sanh, A. Webson, C. Raffel, S. H. Bach,
L. Sutawika, Z. Alyafeai, A. Chaffin, A. Stiegler,
A. Raja, M. Dey, M. S. Bari, C. Xu, U. Thakker, S. S.
Sharma, E. Szczechla, T. Kim, G. Chhablani, N. V.
Nayak, D. Datta, J. Chang, M. T. Jiang, H. Wang,
M. Manica, S. Shen, Z. X. Yong, H. Pandey, R. Baw-
den, T. Wang, T. Neeraj, J. Rozen, A. Sharma, A. San-
tilli, T. Févry, J. A. Fries, R. Teehan, T. L. Scao, S. Bi-
derman, L. Gao, T. Wolf, and A. M. Rush, “Multitask
prompted training enables zero-shot task generaliza-
tion,” in The Tenth International Conference on Learning
Representations, ICLR 2022, Virtual Event, April 25-29,
2022. OpenReview.net, 2022.
[29] T. Wang, A. Roberts, D. Hesslow, T. L. Scao, H. W.
Chung, I. Beltagy, J. Launay, and C. Raffel, “What
language model architecture and pretraining objec-
tive works best for zero-shot generalization?” in
International Conference on Machine Learning, ICML
2022, 17-23 July 2022, Baltimore, Maryland, USA, ser.
Proceedings of Machine Learning Research, vol. 162,
2022, pp. 22 964–22 984.
[30] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown,
B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and
D. Amodei, “Scaling laws for neural language mod-
els,” CoRR, vol. abs/2001.08361, 2020.
[31] J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph,
S. Borgeaud, D. Yogatama, M. Bosma, D. Zhou,
D. Metzler, E. H. Chi, T. Hashimoto, O. Vinyals,
P. Liang, J. Dean, and W. Fedus, “Emergent
abilities of large language models,” CoRR, vol.
abs/2206.07682, 2022.
[32] M. Shanahan, “Talking about large language mod-
els,” CoRR, vol. abs/2212.03551, 2022.
[33] J. Wei, X. Wang, D. Schuurmans, M. Bosma, E. H. Chi,
Q. Le, and D. Zhou, “Chain of thought prompting
elicits reasoning in large language models,” CoRR,
vol. abs/2201.11903, 2022.
[34] J. Hoffmann, S. Borgeaud, A. Mensch,
E. Buchatskaya, T. Cai, E. Rutherford,
D. de Las Casas, L. A. Hendricks, J. Welbl,
A. Clark, T. Hennigan, E. Noland, K. Millican,
G. van den Driessche, B. Damoc, A. Guy, S. Osindero,
K. Simonyan, E. Elsen, J. W. Rae, O. Vinyals, and
L. Sifre, “Training compute-optimal large language
models,” vol. abs/2203.15556, 2022.
[35] R. Taylor, M. Kardas, G. Cucurull, T. Scialom,
A. Hartshorn, E. Saravia, A. Poulton, V. Kerkez, and
R. Stojnic, “Galactica: A large language model for
science,” CoRR, vol. abs/2211.09085, 2022.
[36] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and
G. Neubig, “Pre-train, prompt, and predict: A sys-
tematic survey of prompting methods in natural
language processing,” ACM Comput. Surv., pp. 195:1–
195:35, 2023.
[37] C. Zhou, Q. Li, C. Li, J. Yu, Y. Liu, G. Wang,
K. Zhang, C. Ji, Q. Yan, L. He, H. Peng, J. Li, J. Wu,
Z. Liu, P. Xie, C. Xiong, J. Pei, P. S. Yu, and L. Sun,
“A comprehensive survey on pretrained foundation
models: A history from BERT to chatgpt,” CoRR, vol.
abs/2302.09419, 2023.
[38] X. Han, Z. Zhang, N. Ding, Y. Gu, X. Liu, Y. Huo,
J. Qiu, Y. Yao, A. Zhang, L. Zhang, W. Han,
M. Huang, Q. Jin, Y. Lan, Y. Liu, Z. Liu, Z. Lu,
X. Qiu, R. Song, J. Tang, J. Wen, J. Yuan, W. X. Zhao,
and J. Zhu, “Pre-trained models: Past, present and
future,” AI Open, vol. 2, pp. 225–250, 2021.
[39] X. Qiu, T. Sun, Y. Xu, Y. Shao, N. Dai, and X. Huang,
“Pre-trained models for natural language processing:
A survey,” CoRR, vol. abs/2003.08271, 2020.
[40] S. Altman, “Planning for agi and beyond,” OpenAI
Blog, February 2023.
[41] S. Bubeck, V. Chandrasekaran, R. Eldan, J. Gehrke,
E. Horvitz, E. Kamar, P. Lee, Y. T. Lee, Y. Li,
S. Lundberg, H. Nori, H. Palangi, M. T. Ribeiro, and
Y. Zhang, “Sparks of artificial general intelligence:
Early experiments with gpt-4,” vol. abs/2303.12712,
2023.
[42] S. Huang, L. Dong, W. Wang, Y. Hao, S. Singhal,
S. Ma, T. Lv, L. Cui, O. K. Mohammed, B. Patra,
Q. Liu, K. Aggarwal, Z. Chi, J. Bjorck, V. Chaudhary,
S. Som, X. Song, and F. Wei, “Language is not all you
need: Aligning perception with language models,”
CoRR, vol. abs/2302.14045, 2023.
[43] Y. Cao, S. Li, Y. Liu, Z. Yan, Y. Dai, P. S. Yu, and
L. Sun, “A comprehensive survey of ai-generated
content (aigc): A history of generative ai from gan
to chatgpt,” arXiv preprint arXiv:2303.04226, 2023.
[44] D. Driess, F. Xia, M. S. Sajjadi, C. Lynch, A. Chowdh-
ery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu
et al., “Palm-e: An embodied multimodal language
model,” arXiv preprint arXiv:2303.03378, 2023.
[45] C. Wu, S. Yin, W. Qi, X. Wang, Z. Tang, and
N. Duan, “Visual chatgpt: Talking, drawing and edit-
ing with visual foundation models,” arXiv preprint
arXiv:2303.04671, 2023.
[46] OpenAI, “Gpt-4 technical report,” OpenAI, 2023.
[47] Y. Fu, H. Peng, and T. Khot, “How does gpt obtain
its ability? tracing emergent abilities of language
models to their sources,” Yao Fu’s Notion, Dec 2022.
[48] J. Li, T. Tang, W. X. Zhao, and J. Wen, “Pretrained
language model for text generation: A survey,” in
Proceedings of the Thirtieth International Joint Confer-
ence on Artificial Intelligence, IJCAI 2021, Virtual Event
/ Montreal, Canada, 19-27 August 2021, Z. Zhou, Ed.
ijcai.org, 2021, pp. 4492–4499.
[49] P. Lu, L. Qiu, W. Yu, S. Welleck, and K. Chang, “A
survey of deep learning for mathematical reason-
ing,” CoRR, vol. abs/2212.10535, 2022.
[50] Q. Dong, L. Li, D. Dai, C. Zheng, Z. Wu, B. Chang,
X. Sun, J. Xu, L. Li, and Z. Sui, “A survey for in-
context learning,” CoRR, vol. abs/2301.00234, 2023.
[51] J. Huang and K. C. Chang, “Towards reasoning
in large language models: A survey,” CoRR, vol.
abs/2212.10403, 2022.
[52] S. Qiao, Y. Ou, N. Zhang, X. Chen, Y. Yao, S. Deng,
C. Tan, F. Huang, and H. Chen, “Reasoning with
language model prompting: A survey,” CoRR, vol.
abs/2212.09597, 2022.
[53] J. Zhou, P. Ke, X. Qiu, M. Huang, and J. Zhang,
“Chatgpt: potential, prospects, and limitations,” in
Frontiers of Information Technology & Electronic Engi-
neering, 2023, pp. 1–6.
[54] W. X. Zhao, J. Liu, R. Ren, and J.-R. Wen, “Dense
text retrieval based on pretrained language models:
A survey,” ACM Transactions on Information Systems,
vol. 42, no. 4, pp. 1–60, 2024.
[55] T. B. Brown, B. Mann, N. Ryder, M. Subbiah,
J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam,
G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss,
G. Krueger, T. Henighan, R. Child, A. Ramesh,
D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen,
E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark,
C. Berner, S. McCandlish, A. Radford, I. Sutskever,
and D. Amodei, “Language models are few-shot
learners,” in Advances in Neural Information Processing
Systems 33: Annual Conference on Neural Information
Processing Systems 2020, NeurIPS 2020, December 6-12,
2020, virtual, H. Larochelle, M. Ranzato, R. Hadsell,
M. Balcan, and H. Lin, Eds., 2020.
[56] A. Chowdhery, S. Narang, J. Devlin, M. Bosma,
G. Mishra, A. Roberts, P. Barham, H. W. Chung,
C. Sutton, S. Gehrmann, P. Schuh, K. Shi,
S. Tsvyashchenko, J. Maynez, A. Rao, P. Barnes,
Y. Tay, N. Shazeer, V. Prabhakaran, E. Reif, N. Du,
B. Hutchinson, R. Pope, J. Bradbury, J. Austin, M. Is-
ard, G. Gur-Ari, P. Yin, T. Duke, A. Levskaya, S. Ghe-
mawat, S. Dev, H. Michalewski, X. Garcia, V. Misra,
K. Robinson, L. Fedus, D. Zhou, D. Ippolito, D. Luan,
H. Lim, B. Zoph, A. Spiridonov, R. Sepassi, D. Do-
han, S. Agrawal, M. Omernick, A. M. Dai, T. S.
Pillai, M. Pellat, A. Lewkowycz, E. Moreira, R. Child,
O. Polozov, K. Lee, Z. Zhou, X. Wang, B. Saeta,
M. Diaz, O. Firat, M. Catasta, J. Wei, K. Meier-
Hellstern, D. Eck, J. Dean, S. Petrov, and N. Fiedel,
“Palm: Scaling language modeling with pathways,”
CoRR, vol. abs/2204.02311, 2022.
[57] H. Touvron, T. Lavril, G. Izacard, X. Martinet,
M. Lachaux, T. Lacroix, B. Rozi ere, N. Goyal, E. Ham- bro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample, “Llama: Open and efficient foundation language models,” CoRR, 2023. [58] T. Henighan, J. Kaplan, M. Katz, M. Chen, C. Hesse, J. Jackson, H. Jun, T. B. Brown, P. Dhariwal, S. Gray et al., “Scaling laws for autoregressive generative modeling,” arXiv preprint arXiv:2010.14701, 2020. [59] S. M. Xie, H. Pham, X. Dong, N. Du, H. Liu, Y. Lu, P. Liang, Q. V. Le, T. Ma, and A. W. Yu, “Doremi: Optimizing data mixtures speeds up language model pretraining,” arXiv preprint arXiv:2305.10429, 2023. [60] P. Villalobos, J. Sevilla, L. Heim, T. Besiroglu, M. Hobbhahn, and A. Ho, “Will we run out of data? an analysis of the limits of scaling datasets in machine learning,” CoRR, vol. abs/2211.04325, 2022. [Online]. Available: https://doi.org/10.48550/arXiv. 2211.04325 [61] N. Muennighoff, A. M. Rush, B. Barak, T. L. Scao, A. Piktus, N. Tazi, S. Pyysalo, T. Wolf, and C. Raffel, “Scaling data-constrained language models,” arXiv preprint arXiv:2305.16264, 2023. [62] I. McKenzie, A. Lyzhov, A. Parrish, A. Prabhu, [63] [64] [65] [66] [67] [68] [69] 102 A. Mueller, N. Kim, S. Bowman, and E. Perez, “The inverse scaling prize,” 2022. [Online]. Available: https://github.com/inverse-scaling/prize B. A. Huberman and T. Hogg, “Phase transitions in artificial intelligence systems,” Artificial Intelligence, vol. 33, no. 2, pp. 155–171, 1987. J. W. Rae, S. Borgeaud, T. Cai, K. Millican, J. Hoff- mann, H. F. Song, J. Aslanides, S. Henderson, R. Ring, S. Young, E. Rutherford, T. Hennigan, J. Menick, A. Cassirer, R. Powell, G. van den Driessche, L. A. Hendricks, M. Rauh, P. Huang, A. Glaese, J. Welbl, S. Dathathri, S. Huang, J. Ue- sato, J. Mellor, I. Higgins, A. Creswell, N. McAleese, A. Wu, E. Elsen, S. M. Jayakumar, E. Buchatskaya, D. Budden, E. Sutherland, K. Simonyan, M. Paganini, L. Sifre, L. Martens, X. L. Li, A. Kuncoro, A. Ne- matzadeh, E. Gribovskaya, D. Donato, A. Lazaridou, A. Mensch, J. Lespiau, M. Tsimpoukelli, N. Grig- orev, D. Fritz, T. Sottiaux, M. Pajarskas, T. Pohlen, Z. Gong, D. Toyama, C. de Masson d’Autume, Y. Li, T. Terzi, V. Mikulik, I. Babuschkin, A. Clark, D. de Las Casas, A. Guy, C. Jones, J. Bradbury, M. J. Johnson, B. A. Hechtman, L. Weidinger, I. Gabriel, W. S. Isaac, E. Lockhart, S. Osindero, L. Rimell, C. Dyer, O. Vinyals, K. Ayoub, J. Stanway, L. Bennett, D. Hassabis, K. Kavukcuoglu, and G. Irving, “Scaling language models: Methods, analysis & insights from training gopher,” CoRR, vol. abs/2112.11446, 2021. D. Dai, Y. Sun, L. Dong, Y. Hao, Z. Sui, and F. Wei, “Why can GPT learn in-context? language models se- cretly perform gradient descent as meta-optimizers,” CoRR, vol. abs/2212.10559, 2022. L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wain- wright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. F. Christiano, J. Leike, and R. Lowe, “Training language models to follow instructions with human feedback,” CoRR, vol. abs/2203.02155, 2022. J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le, “Finetuned language models are zero-shot learners,” in The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. Open- Review.net, 2022. R. Thoppilan, D. D. Freitas, J. Hall, N. Shazeer, A. Kulshreshtha, H. Cheng, A. Jin, T. Bos, L. Baker, Y. Du, Y. Li, H. Lee, H. S. Zheng, A. Ghafouri, M. Menegali, Y. Huang, M. 
Krikun, D. Lepikhin, J. Qin, D. Chen, Y. Xu, Z. Chen, A. Roberts, M. Bosma, Y. Zhou, C. Chang, I. Krivokon, W. Rusch, M. Pick- ett, K. S. Meier-Hellstern, M. R. Morris, T. Doshi, R. D. Santos, T. Duke, J. Soraker, B. Zevenbergen, V. Prabhakaran, M. Diaz, B. Hutchinson, K. Olson, A. Molina, E. Hoffman-John, J. Lee, L. Aroyo, R. Ra- jakumar, A. Butryna, M. Lamm, V. Kuzmina, J. Fen- ton, A. Cohen, R. Bernstein, R. Kurzweil, B. Aguera- Arcas, C. Cui, M. Croak, E. H. Chi, and Q. Le, “Lamda: Language models for dialog applications,” CoRR, vol. abs/2201.08239, 2022. H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, [70] [71] [72] [73] [74] [75] [76] [77] [78] W. Fedus, E. Li, X. Wang, M. Dehghani, S. Brahma, A. Webson, S. S. Gu, Z. Dai, M. Suzgun, X. Chen, A. Chowdhery, S. Narang, G. Mishra, A. Yu, V. Y. Zhao, Y. Huang, A. M. Dai, H. Yu, S. Petrov, E. H. Chi, J. Dean, J. Devlin, A. Roberts, D. Zhou, Q. V. Le, and J. Wei, “Scaling instruction-finetuned language models,” CoRR, vol. abs/2210.11416, 2022. A. Srivastava, A. Rastogi, A. Rao, A. A. M. Shoeb, A. Abid, A. Fisch, A. R. Brown, A. Santoro, A. Gupta, A. Garriga-Alonso, A. Kluska, A. Lewkowycz, A. Agarwal, A. Power, A. Ray, A. Warstadt, A. W. Kocurek, A. Safaya, A. Tazarv, A. Xiang, A. Parrish, A. Nie, A. Hussain, A. Askell, A. Dsouza, A. Rahane, A. S. Iyer, A. Andreassen, A. Santilli, A. Stuhlm ¨ uller, A. M. Dai, A. La, A. K. Lampinen, A. Zou, A. Jiang, A. Chen, A. Vuong, A. Gupta, A. Gottardi, A. Norelli, A. Venkatesh, A. Gholamidavoodi, A. Tabassum, A. Menezes, A. Kirubarajan, A. Mullokandov, A. Sab- harwal, A. Herrick, A. Efrat, A. Erdem, A. Karakas, and et al., “Beyond the imitation game: Quantifying and extrapolating the capabilities of language mod- els,” CoRR, vol. abs/2206.04615, 2022. R. Schaeffer, B. Miranda, and S. Koyejo, “Are emer- gent abilities of large language models a mirage?” arXiv preprint arXiv:2304.15004, 2023. S. Hu, X. Liu, X. Han, X. Zhang, C. He, W. Zhao, Y. Lin, N. Ding, Z. Ou, G. Zeng, Z. Liu, and M. Sun, “Unlock predictable scaling from emergent abilities,” 2023. A. Power, Y. Burda, H. Edwards, I. Babuschkin, and V. Misra, “Grokking: Generalization beyond overfit- ting on small algorithmic datasets,” arXiv preprint arXiv:2201.02177, 2022. J. Rasley, S. Rajbhandari, O. Ruwase, and Y. He, “Deepspeed: System optimizations enable training deep learning models with over 100 billion param- eters,” in KDD, 2020, pp. 3505–3506. M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro, “Megatron-lm: Train- ing multi-billion parameter language models using model parallelism,” CoRR, vol. abs/1909.08053, 2019. D. Narayanan, M. Shoeybi, J. Casper, P. LeGres- ley, M. Patwary, V. Korthikanti, D. Vainbrand, P. Kashinkunti, J. Bernauer, B. Catanzaro, A. Phan- ishayee, and M. Zaharia, “Efficient large-scale lan- guage model training on GPU clusters using megatron-lm,” in International Conference for High Per- formance Computing, Networking, Storage and Analysis, SC 2021, St. Louis, Missouri, USA, November 14-19, 2021. ACM, 2021, p. 58. V. Korthikanti, J. Casper, S. Lym, L. McAfee, M. An- dersch, M. Shoeybi, and B. Catanzaro, “Reducing ac- tivation recomputation in large transformer models,” CoRR, vol. abs/2205.05198, 2022. T. L. Scao, A. Fan, C. Akiki, E. Pavlick, S. Ilic, D. Hesslow, R. Castagn´ e, A. S. Luccioni, F. Yvon, M. Gall´ e, J. Tow, A. M. Rush, S. Biderman, A. Web- son, P. S. Ammanamanchi, T. Wang, B. Sagot, N. Muennighoff, A. V. del Moral, O. Ruwase, R. Baw- den, S. Bekman, A. McMillan-Major, I. 
Beltagy, H. Nguyen, L. Saulnier, S. Tan, P. O. Suarez, V. Sanh, 103 H. Laurenc¸on, Y. Jernite, J. Launay, M. Mitchell, C. Raffel, A. Gokaslan, A. Simhi, A. Soroa, A. F. Aji, A. Alfassy, A. Rogers, A. K. Nitzav, C. Xu, C. Mou, C. Emezue, C. Klamm, C. Leong, D. van Strien, D. I. Adelani, and et al., “BLOOM: A 176b-parameter open-access multilingual language model,” CoRR, vol. abs/2211.05100, 2022. [79] P. F. Christiano, J. Leike, T. B. Brown, M. Martic, S. Legg, and D. Amodei, “Deep reinforcement learn- ing from human preferences,” in Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Decem- ber 4-9, 2017, Long Beach, CA, USA, I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett, Eds., 2017, pp. 4299– 4307. [80] T. Schick, J. Dwivedi-Yu, R. Dess ı, R. Raileanu,
M. Lomeli, L. Zettlemoyer, N. Cancedda, and
T. Scialom, “Toolformer: Language models can teach
themselves to use tools,” CoRR, vol. abs/2302.04761,
2023.
[81] R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang,
C. Kim, C. Hesse, S. Jain, V. Kosaraju, W. Saun-
ders, X. Jiang, K. Cobbe, T. Eloundou, G. Krueger,
K. Button, M. Knight, B. Chess, and J. Schulman,
“Webgpt: Browser-assisted question-answering with
human feedback,” CoRR, vol. abs/2112.09332, 2021.
[82] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang,
M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring
the limits of transfer learning with a unified text-
to-text transformer,” J. Mach. Learn. Res., pp. 140:1–
140:67, 2020.
[83] L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-
Rfou, A. Siddhant, A. Barua, and C. Raffel, “mt5: A
massively multilingual pre-trained text-to-text trans-
former,” in Proceedings of the 2021 Conference of the
North American Chapter of the Association for Com-
putational Linguistics: Human Language Technologies,
NAACL-HLT 2021, Online, June 6-11, 2021, 2021, pp.
483–498.
[84] W. Zeng, X. Ren, T. Su, H. Wang, Y. Liao, Z. Wang,
X. Jiang, Z. Yang, K. Wang, X. Zhang, C. Li,
Z. Gong, Y. Yao, X. Huang, J. Wang, J. Yu, Q. Guo,
Y. Yu, Y. Zhang, J. Wang, H. Tao, D. Yan, Z. Yi,
F. Peng, F. Jiang, H. Zhang, L. Deng, Y. Zhang,
Z. Lin, C. Zhang, S. Zhang, M. Guo, S. Gu, G. Fan,
Y. Wang, X. Jin, Q. Liu, and Y. Tian, “Pangu-α: Large-
scale autoregressive pretrained chinese language
models with auto-parallel computation,” CoRR, vol.
abs/2104.12369, 2021.
[85] Z. Zhang, Y. Gu, X. Han, S. Chen, C. Xiao,
Z. Sun, Y. Yao, F. Qi, J. Guan, P. Ke, Y. Cai,
G. Zeng, Z. Tan, Z. Liu, M. Huang, W. Han, Y. Liu,
X. Zhu, and M. Sun, “CPM-2: large-scale cost-
effective pre-trained language models,” CoRR, vol.
abs/2106.10715, 2021.
[86] E. Nijkamp, B. Pang, H. Hayashi, L. Tu, H. Wang,
Y. Zhou, S. Savarese, and C. Xiong, “Codegen: An
open large language model for code with multi-turn
program synthesis,” arXiv preprint arXiv:2203.13474,
2022.
[87] [88] [89] [90] [91] [92] [93] [94] [95] S. Black, S. Biderman, E. Hallahan, Q. Anthony,
L. Gao, L. Golding, H. He, C. Leahy, K. McDonell,
J. Phang, M. Pieler, U. S. Prashanth, S. Purohit,
L. Reynolds, J. Tow, B. Wang, and S. Weinbach, “Gpt-
neox-20b: An open-source autoregressive language
model,” CoRR, vol. abs/2204.06745, 2022.
Y. Wang, S. Mishra, P. Alipoormolabashi, Y. Kordi,
A. Mirzaei, A. Naik, A. Ashok, A. S. Dhanasekaran,
A. Arunkumar, D. Stap, E. Pathak, G. Karamanolakis,
H. G. Lai, I. Purohit, I. Mondal, J. Anderson, K. Kuz-
nia, K. Doshi, K. K. Pal, M. Patel, M. Moradshahi,
M. Parmar, M. Purohit, N. Varshney, P. R. Kaza,
P. Verma, R. S. Puri, R. Karia, S. Doshi, S. K.
Sampat, S. Mishra, S. R. A, S. Patro, T. Dixit, and
X. Shen, “Super-naturalinstructions: Generalization
via declarative instructions on 1600+ NLP tasks,” in
Proceedings of the 2022 Conference on Empirical Methods
in Natural Language Processing, EMNLP 2022, Abu
Dhabi, United Arab Emirates, December 7-11, 2022,
2022, pp. 5085–5109.
Y. Tay, M. Dehghani, V. Q. Tran, X. Garc´ ıa, J. Wei,
X. Wang, H. W. Chung, D. Bahri, T. Schuster,
H. Zheng, D. Zhou, N. Houlsby, and D. Metzler,
“Ul2: Unifying language learning paradigms,” 2022.
S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen,
S. Chen, C. Dewan, M. T. Diab, X. Li, X. V. Lin,
T. Mihaylov, M. Ott, S. Shleifer, K. Shuster, D. Simig,
P. S. Koura, A. Sridhar, T. Wang, and L. Zettlemoyer,
“OPT: open pre-trained transformer language mod-
els,” CoRR, vol. abs/2205.01068, 2022.
M. R. Costa-juss a, J. Cross, O. C¸ elebi, M. Elbayad, K. Heafield, K. Heffernan, E. Kalbassi, J. Lam, D. Licht, J. Maillard, A. Sun, S. Wang, G. Wen- zek, A. Youngblood, B. Akula, L. Barrault, G. M. Gonzalez, P. Hansanti, J. Hoffman, S. Jarrett, K. R. Sadagopan, D. Rowe, S. Spruit, C. Tran, P. Andrews, N. F. Ayan, S. Bhosale, S. Edunov, A. Fan, C. Gao, V. Goswami, F. Guzm´ an, P. Koehn, A. Mourachko, C. Ropers, S. Saleem, H. Schwenk, and J. Wang, “No language left behind: Scaling human-centered ma- chine translation,” CoRR, vol. abs/2207.04672, 2022. Q. Zheng, X. Xia, X. Zou, Y. Dong, S. Wang, Y. Xue, Z. Wang, L. Shen, A. Wang, Y. Li et al., “Codegeex: A pre-trained model for code generation with multi- lingual evaluations on humaneval-x,” arXiv preprint arXiv:2303.17568, 2023. A. Zeng, X. Liu, Z. Du, Z. Wang, H. Lai, M. Ding, Z. Yang, Y. Xu, W. Zheng, X. Xia, W. L. Tam, Z. Ma, Y. Xue, J. Zhai, W. Chen, P. Zhang, Y. Dong, and J. Tang, “GLM-130B: an open bilingual pre-trained model,” vol. abs/2210.02414, 2022. N. Muennighoff, T. Wang, L. Sutawika, A. Roberts, S. Biderman, T. L. Scao, M. S. Bari, S. Shen, Z. X. Yong, H. Schoelkopf, X. Tang, D. Radev, A. F. Aji, K. Almubarak, S. Albanie, Z. Alyafeai, A. Web- son, E. Raff, and C. Raffel, “Crosslingual general- ization through multitask finetuning,” CoRR, vol. abs/2211.01786, 2022. S. Iyer, X. V. Lin, R. Pasunuru, T. Mihaylov, D. Simig, P. Yu, K. Shuster, T. Wang, Q. Liu, P. S. Koura, X. Li, B. O’Horo, G. Pereyra, J. Wang, C. Dewan, 104 A. Celikyilmaz, L. Zettlemoyer, and V. Stoyanov, “OPT-IML: scaling language model instruction meta learning through the lens of generalization,” CoRR, vol. abs/2212.12017, 2022. [96] S. Biderman, H. Schoelkopf, Q. Anthony, H. Bradley, K. O’Brien, E. Hallahan, M. A. Khan, S. Purohit, U. S. Prashanth, E. Raff et al., “Pythia: A suite for analyzing large language models across training and scaling,” arXiv preprint arXiv:2304.01373, 2023. [97] E. Nijkamp, H. Hayashi, C. Xiong, S. Savarese, and Y. Zhou, “Codegen2: Lessons for training llms on programming and natural languages,” CoRR, vol. abs/2305.02309, 2023. [98] R. Li, L. B. Allal, Y. Zi, N. Muennighoff, D. Kocetkov, C. Mou, M. Marone, C. Akiki, J. Li, J. Chim, Q. Liu, E. Zheltonozhskii, T. Y. Zhuo, T. Wang, O. Dehaene, M. Davaadorj, J. Lamy-Poirier, J. Monteiro, O. Shliazhko, N. Gontier, N. Meade, A. Zebaze, M. Yee, L. K. Umapathi, J. Zhu, B. Lipkin, M. Oblokulov, Z. Wang, R. M. V, J. Stillerman, S. S. Patel, D. Abulkhanov, M. Zocca, M. Dey, Z. Zhang, N. Fahmy, U. Bhattacharyya, W. Yu, S. Singh, S. Luccioni, P. Villegas, M. Kunakov, F. Zhdanov, M. Romero, T. Lee, N. Timor, J. Ding, C. Schlesinger, H. Schoelkopf, J. Ebert, T. Dao, M. Mishra, A. Gu, J. Robinson, C. J. Anderson, B. Dolan-Gavitt, D. Contractor, S. Reddy, D. Fried, D. Bahdanau, Y. Jernite, C. M. Ferrandis, S. Hughes, T. Wolf, A. Guha, L. von Werra, and H. de Vries, “Starcoder: may the source be with you!” CoRR, vol. abs/2305.06161, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2305.06161 [99] H. Touvron, L. Martin, K. Stone, P. Albert, A. Alma- hairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., “Llama 2: Open foundation and fine- tuned chat models,” arXiv preprint arXiv:2307.09288, 2023. [100] A. Yang, B. Xiao, B. Wang, B. Zhang, C. Yin, C. Lv, D. Pan, D. Wang, D. Yan, F. Yang et al., “Baichuan 2: Open large-scale language models,” arXiv preprint arXiv:2309.10305, 2023. [101] J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. 
Huang et al., “Qwen technical report,” arXiv preprint arXiv:2309.16609, 2023. [102] X. Li, Y. Yao, X. Jiang, X. Fang, X. Meng, S. Fan, P. Han, J. Li, L. Du, B. Qin et al., “Flm-101b: An open llm and how to train it with $100 k budget,” arXiv preprint arXiv:2309.03852, 2023. [103] T. Wei, L. Zhao, L. Zhang, B. Zhu, L. Wang, H. Yang, B. Li, C. Cheng, W. L ¨ u, R. Hu et al., “Skywork: A more open bilingual foundation model,” arXiv preprint arXiv:2310.19341, 2023. [104] D. Lepikhin, H. Lee, Y. Xu, D. Chen, O. Firat, Y. Huang, M. Krikun, N. Shazeer, and Z. Chen, “Gshard: Scaling giant models with conditional com- putation and automatic sharding,” in 9th International Conference on Learning Representations, ICLR 2021, Vir- tual Event, Austria, May 3-7, 2021, 2021. [105] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Voss, W. H. Guss, A. Nichol, A. Paino, N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse, A. N. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. Knight, M. Brundage, M. Murati, K. Mayer, P. Welinder, B. McGrew, D. Amodei, S. McCandlish, I. Sutskever, and W. Zaremba, “Evaluating large language models trained on code,” CoRR, vol. abs/2107.03374, 2021. [106] Y. Sun, S. Wang, S. Feng, S. Ding, C. Pang, J. Shang, J. Liu, X. Chen, Y. Zhao, Y. Lu, W. Liu, Z. Wu, W. Gong, J. Liang, Z. Shang, P. Sun, W. Liu, X. Ouyang, D. Yu, H. Tian, H. Wu, and H. Wang, “ERNIE 3.0: Large-scale knowledge enhanced pre- training for language understanding and genera- tion,” CoRR, vol. abs/2107.02137, 2021. [107] O. Lieber, O. Sharir, B. Lenz, and Y. Shoham, “Jurassic-1: Technical details and evaluation,” White Paper. AI21 Labs, vol. 1, 2021. [108] B. Kim, H. Kim, S. Lee, G. Lee, D. Kwak, D. H. Jeon, S. Park, S. Kim, S. Kim, D. Seo, H. Lee, M. Jeong, S. Lee, M. Kim, S. Ko, S. Kim, T. Park, J. Kim, S. Kang, N. Ryu, K. M. Yoo, M. Chang, S. Suh, S. In, J. Park, K. Kim, H. Kim, J. Jeong, Y. G. Yeo, D. Ham, D. Park, M. Y. Lee, J. Kang, I. Kang, J. Ha, W. Park, and N. Sung, “What changes can large- scale language models bring? intensive study on hy- perclova: Billions-scale korean generative pretrained transformers,” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021. Association for Com- putational Linguistics, 2021. [109] S. Wu, X. Zhao, T. Yu, R. Zhang, C. Shen, H. Liu, F. Li, H. Zhu, J. Luo, L. Xu et al., “Yuan 1.0: Large- scale pre-trained language model in zero-shot and few-shot learning,” arXiv preprint arXiv:2110.04725, 2021. [110] A. Askell, Y. Bai, A. Chen, D. Drain, D. Ganguli, T. Henighan, A. Jones, N. Joseph, B. Mann, N. Das- Sarma, N. Elhage, Z. Hatfield-Dodds, D. Hernandez, J. Kernion, K. Ndousse, C. Olsson, D. Amodei, T. B. Brown, J. Clark, S. McCandlish, C. Olah, and J. Ka- plan, “A general language assistant as a laboratory for alignment,” CoRR, vol. abs/2112.00861, 2021. [111] S. Wang, Y. Sun, Y. Xiang, Z. Wu, S. Ding, W. Gong, S. Feng, J. Shang, Y. Zhao, C. Pang, J. Liu, X. Chen, Y. Lu, W. Liu, X. Wang, Y. Bai, Q. Chen, L. Zhao, S. Li, P. Sun, D. Yu, Y. Ma, H. Tian, H. Wu, T. Wu, W. Zeng, G. Li, W. Gao, and H. 
Wang, “ERNIE 3.0 titan: Exploring larger-scale knowledge enhanced pre-training for language understanding and gener- ation,” CoRR, vol. abs/2112.12731, 2021. [112] N. Du, Y. Huang, A. M. Dai, S. Tong, D. Lepikhin, Y. Xu, M. Krikun, Y. Zhou, A. W. Yu, O. Firat, B. Zoph, L. Fedus, M. P. Bosma, Z. Zhou, T. Wang, Y. E. Wang, K. Webster, M. Pellat, K. Robinson, K. S. Meier-Hellstern, T. Duke, L. Dixon, K. Zhang, Q. V. 105 Le, Y. Wu, Z. Chen, and C. Cui, “Glam: Efficient scaling of language models with mixture-of-experts,” in International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, 2022, pp. 5547–5569. [113] S. Smith, M. Patwary, B. Norick, P. LeGresley, S. Rajbhandari, J. Casper, Z. Liu, S. Prabhumoye, G. Zerveas, V. Korthikanti, E. Zheng, R. Child, R. Y. Aminabadi, J. Bernauer, X. Song, M. Shoeybi, Y. He, M. Houston, S. Tiwary, and B. Catanzaro, “Using deepspeed and megatron to train megatron- turing NLG 530b, A large-scale generative language model,” CoRR, vol. abs/2201.11990, 2022. [114] Y. Li, D. H. Choi, J. Chung, N. Kushman, J. Schrit- twieser, R. Leblond, T. Eccles, J. Keeling, F. Gi- meno, A. D. Lago, T. Hubert, P. Choy, C. de Mas- son d’Autume, I. Babuschkin, X. Chen, P. Huang, J. Welbl, S. Gowal, A. Cherepanov, J. Molloy, D. J. Mankowitz, E. S. Robson, P. Kohli, N. de Freitas, K. Kavukcuoglu, and O. Vinyals, “Competition-level code generation with alphacode,” Science, 2022. [115] S. Soltan, S. Ananthakrishnan, J. FitzGerald, R. Gupta, W. Hamza, H. Khan, C. Peris, S. Rawls, A. Rosenbaum, A. Rumshisky, C. S. Prakash, M. Srid- har, F. Triefenbach, A. Verma, G. T ¨ ur, and P. Natara- jan, “Alexatm 20b: Few-shot learning using a large-scale multilingual seq2seq model,” CoRR, vol. abs/2208.01448, 2022. [116] A. Glaese, N. McAleese, M. Trebacz, J. Aslanides, V. Firoiu, T. Ewalds, M. Rauh, L. Weidinger, M. Chad- wick, P. Thacker, L. Campbell-Gillingham, J. Ue- sato, P. Huang, R. Comanescu, F. Yang, A. See, S. Dathathri, R. Greig, C. Chen, D. Fritz, J. S. Elias, R. Green, S. Mokr´ a, N. Fernando, B. Wu, R. Foley, S. Young, I. Gabriel, W. Isaac, J. Mel- lor, D. Hassabis, K. Kavukcuoglu, L. A. Hendricks, and G. Irving, “Improving alignment of dialogue agents via targeted human judgements,” CoRR, vol. abs/2209.14375, 2022. [117] H. Su, X. Zhou, H. Yu, Y. Chen, Z. Zhu, Y. Yu, and J. Zhou, “Welm: A well-read pre-trained language model for chinese,” CoRR, vol. abs/2209.10372, 2022. [118] Y. Tay, J. Wei, H. W. Chung, V. Q. Tran, D. R. So, S. Shakeri, X. Garcia, H. S. Zheng, J. Rao, A. Chowd- hery, D. Zhou, D. Metzler, S. Petrov, N. Houlsby, Q. V. Le, and M. Dehghani, “Transcending scal- ing laws with 0.1% extra compute,” CoRR, vol. abs/2210.11399, 2022. [119] X. Ren, P. Zhou, X. Meng, X. Huang, Y. Wang, W. Wang, P. Li, X. Zhang, A. Podolskiy, G. Arshinov, A. Bout, I. Piontkovskaya, J. Wei, X. Jiang, T. Su, Q. Liu, and J. Yao, “Pangu-Σ: Towards trillion pa- rameter language model with sparse heterogeneous computing,” CoRR, vol. abs/2303.10845, 2023. [120] R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lep- ikhin, A. Passos, S. Shakeri, E. Taropa, P. Bailey, Z. Chen et al., “Palm 2 technical report,” arXiv preprint arXiv:2305.10403, 2023. [121] A. Radford, R. J´ ozefowicz, and I. Sutskever, “Learn- ing to generate reviews and discovering sentiment,” CoRR, vol. abs/1704.01444, 2017. [122] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever et al., “Improving language understanding by gener- ative pre-training,” 2018. [123] B. McCann, N. S. Keskar, C. 
Xiong, and R. Socher, “The natural language decathlon: Multitask learning as question answering,” CoRR, vol. abs/1806.08730, 2018. [124] Y. Zhang, S. Sun, M. Galley, Y. Chen, C. Brockett, X. Gao, J. Gao, J. Liu, and B. Dolan, “DIALOGPT : Large-scale generative pre-training for conversa- tional response generation,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, ACL 2020, Online, July 5-10, 2020, A. Celikyilmaz and T. Wen, Eds. Association for Computational Linguistics, 2020, pp. 270–278. [125] D. Ham, J. Lee, Y. Jang, and K. Kim, “End-to-end neural pipeline for goal-oriented dialogue systems using GPT-2,” in Proceedings of the 58th Annual Meet- ing of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020. Association for Computational Linguistics, 2020, pp. 583–592. [126] I. Drori, S. Tran, R. Wang, N. Cheng, K. Liu, L. Tang, E. Ke, N. Singh, T. L. Patti, J. Lynch, A. Shporer, N. Verma, E. Wu, and G. Strang, “A neural network solves and generates mathematics problems by pro- gram synthesis: Calculus, differential equations, lin- ear algebra, and more,” CoRR, vol. abs/2112.15594, 2021. [127] A. Neelakantan, T. Xu, R. Puri, A. Radford, J. M. Han, J. Tworek, Q. Yuan, N. Tezak, J. W. Kim, C. Hallacy, J. Heidecke, P. Shyam, B. Power, T. E. Nekoul, G. Sastry, G. Krueger, D. Schnurr, F. P. Such, K. Hsu, M. Thompson, T. Khan, T. Sherbakov, J. Jang, P. Welinder, and L. Weng, “Text and code embeddings by contrastive pre-training,” CoRR, vol. abs/2201.10005, 2022. [128] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algo- rithms,” arXiv preprint arXiv:1707.06347, 2017. [129] N. Stiennon, L. Ouyang, J. Wu, D. M. Ziegler, R. Lowe, C. Voss, A. Radford, D. Amodei, and P. F. Christiano, “Learning to summarize from human feedback,” CoRR, vol. abs/2009.01325, 2020. [130] OpenAI, “Our approach to alignment research,” Ope- nAI Blog, August 2022. [131] ——, “Introducing chatgpt,” OpenAI Blog, November 2022. [132] D. Ganguli, L. Lovitt, J. Kernion, A. Askell, Y. Bai, S. Kadavath, B. Mann, E. Perez, N. Schiefer, K. Ndousse, A. Jones, S. Bowman, A. Chen, T. Con- erly, N. DasSarma, D. Drain, N. Elhage, S. E. Showk, S. Fort, Z. Hatfield-Dodds, T. Henighan, D. Hernan- dez, T. Hume, J. Jacobson, S. Johnston, S. Kravec, C. Olsson, S. Ringer, E. Tran-Johnson, D. Amodei, T. Brown, N. Joseph, S. McCandlish, C. Olah, J. Ka- plan, and J. Clark, “Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned,” CoRR, vol. abs/2209.07858, 2022. [133] OpenAI, “Gpt-4v(ision) system card,” OpenAI, 2023. [134] ——, “Lessons learned on language model safety 106 and misuse,” OpenAI blog, 2022. [135] Meta, “Introducing meta llama 3: The most capable openly available llm to date,” https://ai.meta.com/ blog/meta-llama-3/, 2024. [136] “Introducing Llama 3.1: Our most capable models to date ,” https://ai.meta.com/blog/meta-llama-3-1/, 2023. [137] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bam- ford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.- A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed, “Mistral 7b,” 2023. [138] A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bamford, D. S. Chaplot, D. de Las Casas, E. B. Hanna, F. Bressand, G. Lengyel, G. Bour, G. Lample, L. R. Lavaud, L. Saulnier, M. Lachaux, P. Stock, S. Subramanian, S. Yang, S. Antoniak, T. L. Scao, T. 
Gervet, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed, “Mixtral of experts,” CoRR, vol. abs/2401.04088, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2401.04088 [139] T. Mesnard, C. Hardin, R. Dadashi, S. Bhupatiraju, S. Pathak, L. Sifre, M. Rivi ere, M. S. Kale, J. Love,
P. Tafti, L. Hussenot, A. Chowdhery, A. Roberts,
A. Barua, A. Botev, A. Castro-Ros, A. Slone,
A. H´ eliou, A. Tacchetti, A. Bulanova, A. Paterson,
B. Tsai, B. Shahriari, C. L. Lan, C. A. Choquette-Choo,
C. Crepy, D. Cer, D. Ippolito, D. Reid, E. Buchatskaya,
E. Ni, E. Noland, G. Yan, G. Tucker, G. Muraru,
G. Rozhdestvenskiy, H. Michalewski, I. Tenney, I. Gr-
ishchenko, J. Austin, J. Keeling, J. Labanowski,
J. Lespiau, J. Stanway, J. Brennan, J. Chen, J. Ferret,
J. Chiu, and et al., “Gemma: Open models based
on gemini research and technology,” CoRR, vol.
abs/2403.08295, 2024.
[140] M. Rivière, S. Pathak, P. G. Sessa, C. Hardin, S. Bhu-
patiraju, L. Hussenot, T. Mesnard, B. Shahriari,
A. Ramé, J. Ferret, P. Liu, P. Tafti, A. Friesen, M. Cas-
bon, S. Ramos, R. Kumar, C. L. Lan, S. Jerome, A. Tsit-
sulin, N. Vieillard, P. Stanczyk, S. Girgin, N. Mom-
chev, M. Hoffman, S. Thakoor, J. Grill, B. Neyshabur,
O. Bachem, A. Walton, A. Severyn, A. Parrish, A. Ah-
mad, A. Hutchison, A. Abdagic, A. Carl, A. Shen,
A. Brock, A. Coenen, A. Laforge, A. Paterson, B. Bas-
tian, B. Piot, B. Wu, B. Royal, C. Chen, C. Kumar,
C. Perry, C. Welty, C. A. Choquette-Choo, D. Sinopal-
nikov, D. Weinberger, D. Vijaykumar, D. Rogozin-
ska, D. Herbison, E. Bandy, E. Wang, E. Noland,
E. Moreira, E. Senter, E. Eltyshev, F. Visin, G. Rasskin,
G. Wei, G. Cameron, G. Martins, H. Hashemi,
H. Klimczak-Plucinska, H. Batra, H. Dhand, I. Nar-
dini, J. Mein, J. Zhou, J. Svensson, J. Stanway, J. Chan,
J. P. Zhou, J. Carrasqueira, J. Iljazi, J. Becker, J. Fer-
nandez, J. van Amersfoort, J. Gordon, J. Lipschultz,
J. Newlan, J. Ji, K. Mohamed, K. Badola, K. Black,
K. Millican, K. McDonell, K. Nguyen, K. Sodhia,
K. Greene, L. L. Sjösund, L. Usui, L. Sifre, L. Heuer-
mann, L. Lago, and L. McNealus, “Gemma 2: Im-
proving open language models at a practical size,”
CoRR, vol. abs/2408.00118, 2024.
[141] A. Yang, B. Yang, B. Hui, B. Zheng, B. Yu, C. Zhou,
C. Li, C. Li, D. Liu, F. Huang, G. Dong, H. Wei, H. Lin,
J. Tang, J. Wang, J. Yang, J. Tu, J. Zhang, J. Ma, J. Xu,
J. Zhou, J. Bai, J. He, J. Lin, K. Dang, K. Lu, K. Chen,
K. Yang, M. Li, M. Xue, N. Ni, P. Zhang, P. Wang,
R. Peng, R. Men, R. Gao, R. Lin, S. Wang, S. Bai,
S. Tan, T. Zhu, T. Li, T. Liu, W. Ge, X. Deng, X. Zhou,
X. Ren, X. Zhang, X. Wei, X. Ren, Y. Fan, Y. Yao,
Y. Zhang, Y. Wan, Y. Chu, Y. Liu, Z. Cui, Z. Zhang,
and Z. Fan, “Qwen2 technical report,” arXiv preprint
arXiv:2407.10671, 2024.
[142] Q. Team, “Qwen2.5: A party of foundation
models,” September 2024. [Online]. Available:
https://qwenlm.github.io/blog/qwen2.5/
[143] T. GLM, A. Zeng, B. Xu, B. Wang, C. Zhang, D. Yin,
D. Rojas, G. Feng, H. Zhao, H. Lai, H. Yu, H. Wang,
J. Sun, J. Zhang, J. Cheng, J. Gui, J. Tang, J. Zhang,
J. Li, L. Zhao, L. Wu, L. Zhong, M. Liu, M. Huang,
P. Zhang, Q. Zheng, R. Lu, S. Duan, S. Zhang, S. Cao,
S. Yang, W. L. Tam, W. Zhao, X. Liu, X. Xia, X. Zhang,
X. Gu, X. Lv, X. Liu, X. Liu, X. Yang, X. Song,
X. Zhang, Y. An, Y. Xu, Y. Niu, Y. Yang, Y. Li, Y. Bai,
Y. Dong, Z. Qi, Z. Wang, Z. Yang, Z. Du, Z. Hou,
and Z. Wang, “Chatglm: A family of large language
models from glm-130b to glm-4 all tools,” 2024.
[144] H. Zhong, C. Xiao, C. Tu, T. Zhang, Z. Liu, and
M. Sun, “JEC-QA: A legal-domain question answer-
ing dataset,” in The Thirty-Fourth AAAI Conference
on Artificial Intelligence, AAAI 2020, The Thirty-Second
Innovative Applications of Artificial Intelligence Confer-
ence, IAAI 2020, The Tenth AAAI Symposium on Edu-
cational Advances in Artificial Intelligence, EAAI 2020,
New York, NY, USA, February 7-12, 2020. AAAI Press,
2020, pp. 9701–9708.
[145] D. Jin, E. Pan, N. Oufattole, W.-H. Weng, H. Fang,
and P. Szolovits, “What disease does this patient
have? a large-scale open domain question answer-
ing dataset from medical exams,” Applied Sciences,
vol. 11, no. 14, p. 6421, 2021.
[146] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li,
C. Guestrin, P. Liang, and T. B. Hashimoto, “Stanford
alpaca: An instruction-following llama model,”
https://github.com/tatsu-lab/stanford_alpaca,
2023.
[147] Y. Wang, Y. Kordi, S. Mishra, A. Liu, N. A. Smith,
D. Khashabi, and H. Hajishirzi, “Self-instruct: Align-
ing language model with self generated instruc-
tions,” CoRR, vol. abs/2212.10560, 2022.
[148] Alpaca-LoRA, “Instruct-tune llama on consumer
hardware,” https://github.com/tloen/alpaca-lora,
2023.
[149] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li,
S. Wang, L. Wang, and W. Chen, “Lora: Low-rank
adaptation of large language models,” in The Tenth
International Conference on Learning Representations,
ICLR 2022, Virtual Event, April 25-29, 2022. Open-
Review.net, 2022.
[150] X. Geng, A. Gudibande, H. Liu, E. Wallace, P. Abbeel,
S. Levine, and D. Song, “Koala: A dialogue model for
academic research,” Blog post, April 2023.
[151] Y. Ji, Y. Deng, Y. Gong, Y. Peng, Q. Niu, B. Ma,
and X. Li, “Belle: Be everyone’s large language
model engine,” https://github.com/LianjiaTech/
BELLE, 2023.
[152] W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu,
H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E.
Gonzalez, I. Stoica, and E. P. Xing, “Vicuna:
An open-source chatbot impressing gpt-4 with
90%* chatgpt quality,” 2023. [Online]. Available:
https://vicuna.lmsys.org
[153] D. Eccleston, “Sharegpt,” https://sharegpt.com/,
2023.
[154] H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction
tuning,” CoRR, vol. abs/2304.08485, 2023.
[155] D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny,
“Minigpt-4: Enhancing vision-language understand-
ing with advanced large language models,” CoRR,
vol. abs/2304.10592, 2023.
[156] W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang,
B. Li, P. Fung, and S. C. H. Hoi, “Instructblip: To-
wards general-purpose vision-language models with
instruction tuning,” CoRR, vol. abs/2305.06500, 2023.
[157] Y. Su, T. Lan, H. Li, J. Xu, Y. Wang, and D. Cai,
“Pandagpt: One model to instruction-follow them
all,” 2023.
[158] Y. Zhu, R. Kiros, R. S. Zemel, R. Salakhutdinov,
R. Urtasun, A. Torralba, and S. Fidler, “Aligning
books and movies: Towards story-like visual expla-
nations by watching movies and reading books,” in
2015 IEEE International Conference on Computer Vision,
ICCV 2015, Santiago, Chile, December 7-13, 2015. IEEE
Computer Society, 2015, pp. 19–27.
[159] “Project gutenberg.” [Online]. Available: https:
//www.gutenberg.org/
[160] T. H. Trinh and Q. V. Le, “A simple method for com-
monsense reasoning,” CoRR, vol. abs/1806.02847,
2018.
[161] R. Zellers, A. Holtzman, H. Rashkin, Y. Bisk,
A. Farhadi, F. Roesner, and Y. Choi, “Defending
against neural fake news,” in Advances in Neu-
ral Information Processing Systems 32: Annual Confer-
ence on Neural Information Processing Systems 2019,
NeurIPS 2019, December 8-14, 2019, Vancouver, BC,
Canada, H. M. Wallach, H. Larochelle, A. Beygelz-
imer, F. d’Alché-Buc, E. B. Fox, and R. Garnett, Eds.,
2019, pp. 9051–9062.
[162] A. Gokaslan, V. C. E. Pavlick, and S. Tellex,
“Openwebtext corpus,” http://Skylion007.github.
io/OpenWebTextCorpus, 2019.
[163] J. Baumgartner, S. Zannettou, B. Keegan, M. Squire,
and J. Blackburn, “The pushshift reddit dataset,” in
Proceedings of the Fourteenth International AAAI Con-
ference on Web and Social Media, ICWSM 2020, Held
Virtually, Original Venue: Atlanta, Georgia, USA, June
8-11, 2020. AAAI Press, 2020, pp. 830–839.
[164] “Wikipedia.” [Online]. Available: https://en.
wikipedia.org/wiki/Main_Page
[165] “Bigquery dataset.” [Online]. Available: https:
//cloud.google.com/bigquery?hl=zh-cn
[166] L. Gao, S. Biderman, S. Black, L. Golding, T. Hoppe,
C. Foster, J. Phang, H. He, A. Thite, N. Nabeshima,
S. Presser, and C. Leahy, “The pile: An 800gb dataset
of diverse text for language modeling,” CoRR, vol.
abs/2101.00027, 2021.
[167] H. Laurençon, L. Saulnier, T. Wang, C. Akiki, A. V.
del Moral, T. Le Scao, L. Von Werra, C. Mou, E. G.
Ponferrada, H. Nguyen et al., “The bigscience roots
corpus: A 1.6 tb composite multilingual dataset,” in
Thirty-sixth Conference on Neural Information Process-
ing Systems Datasets and Benchmarks Track, 2022.
[168] “Common crawl.” [Online]. Available: https://
commoncrawl.org/
[169] G. Wenzek, M.-A. Lachaux, A. Conneau, V. Chaudhary, F. Guzmán, A. Joulin, and E. Grave, “Ccnet:
Extracting high quality monolingual datasets from
web crawl data,” in Proceedings of The 12th Language
Resources and Evaluation Conference, 2020, pp. 4003–
4012.
[170] T. Computer, “Redpajama: an open dataset for train-
ing large language models,” https://github.com/
togethercomputer/RedPajama-Data, 2023.
[171] G. Penedo, Q. Malartic, D. Hesslow, R. Cojocaru,
A. Cappelli, H. Alobeidli, B. Pannier, E. Almazrouei,
and J. Launay, “The RefinedWeb dataset for Falcon
LLM: outperforming curated corpora with web data,
and web data only,” arXiv preprint arXiv:2306.01116,
2023.
[172] C. B. Clement, M. Bierbaum, K. P. O’Keeffe, and
A. A. Alemi, “On the use of arxiv as a dataset,” arXiv
preprint arXiv:1905.00075, 2019.
[173] K. Lo, L. L. Wang, M. Neumann, R. Kinney, and
D. Weld, “S2ORC: The semantic scholar open re-
search corpus,” in ACL, 2020.
[174] L. Soldaini and K. Lo, “peS2o (Pretraining Efficiently
on S2ORC) Dataset,” ODC-By, https://github.com/
allenai/pes2o, 2023.
[175] D. Kocetkov, R. Li, L. B. Allal, J. Li, C. Mou, C. M.
Ferrandis, Y. Jernite, M. Mitchell, S. Hughes, T. Wolf
et al., “The stack: 3 tb of permissively licensed source
code,” arXiv preprint arXiv:2211.15533, 2022.
[176] B. Wang and A. Komatsuzaki, “GPT-J-6B: A 6 Billion
Parameter Autoregressive Language Model,” https:
//github.com/kingoflolz/mesh-transformer-jax,
2021.
[177] L. Soldaini, R. Kinney, A. Bhagia, D. Schwenk,
D. Atkinson, R. Authur, B. Bogin, K. Chandu, J. Du-
mas, Y. Elazar, V. Hofmann, A. H. Jha, S. Kumar,
L. Lucy, X. Lyu, N. Lambert, I. Magnusson, J. Morri-
son, N. Muennighoff, A. Naik, C. Nam, M. E. Peters,
A. Ravichander, K. Richardson, Z. Shen, E. Strubell,
N. Subramani, O. Tafjord, P. Walsh, L. Zettlemoyer,
N. A. Smith, H. Hajishirzi, I. Beltagy, D. Groeneveld,
J. Dodge, and K. Lo, “Dolma: an open corpus of
three trillion tokens for language model pretraining
research,” arXiv preprint arXiv:2402.00159, 2024.
[178] D. Groeneveld, I. Beltagy, P. Walsh, A. Bhagia, R. Kin-
ney, O. Tafjord, A. H. Jha, H. Ivison, I. Magnusson,
Y. Wang et al., “Olmo: Accelerating the science of
language models,” arXiv preprint arXiv:2402.00838,
2024.
[179] S. Mishra, D. Khashabi, C. Baral, and H. Ha-
jishirzi, “Cross-task generalization via natural lan-
guage crowdsourcing instructions,” in Proceedings of
the 60th Annual Meeting of the Association for Com-
putational Linguistics (Volume 1: Long Papers), ACL
2022, Dublin, Ireland, May 22-27, 2022, S. Muresan,
P. Nakov, and A. Villavicencio, Eds., 2022, pp. 3470–
3487.
[180] S. H. Bach, V. Sanh, Z. X. Yong, A. Webson, C. Raffel,
N. V. Nayak, A. Sharma, T. Kim, M. S. Bari, T. Févry,
Z. Alyafeai, M. Dey, A. Santilli, Z. Sun, S. Ben-David,
C. Xu, G. Chhablani, H. Wang, J. A. Fries, M. S.
AlShaibani, S. Sharma, U. Thakker, K. Almubarak,
X. Tang, D. R. Radev, M. T. Jiang, and A. M. Rush,
“Promptsource: An integrated development environ-
ment and repository for natural language prompts,”
in ACL (demo). Association for Computational Lin-
guistics, 2022, pp. 93–104.
[181] T. Tang, J. Li, W. X. Zhao, and J. Wen, “MVP: multi-
task supervised pre-training for natural language
generation,” CoRR, vol. abs/2206.12131, 2022.
[182] H. Nguyen, S. Suri, K. Tsui, Shahules786, T. team,
and C. Schuhmann, “The oig dataset,” https://laion.
ai/blog/oig-dataset/, 2023.
[183] Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen,
N. DasSarma, D. Drain, S. Fort, D. Ganguli,
T. Henighan, N. Joseph, S. Kadavath, J. Kernion,
T. Conerly, S. E. Showk, N. Elhage, Z. Hatfield-
Dodds, D. Hernandez, T. Hume, S. Johnston,
S. Kravec, L. Lovitt, N. Nanda, C. Olsson,
D. Amodei, T. B. Brown, J. Clark, S. McCandlish,
C. Olah, B. Mann, and J. Kaplan, “Training a helpful
and harmless assistant with reinforcement learning
from human feedback,” CoRR, vol. abs/2204.05862,
2022. [Online]. Available: https://doi.org/10.48550/
arXiv.2204.05862
[184] B. Guo, X. Zhang, Z. Wang, M. Jiang, J. Nie, Y. Ding,
J. Yue, and Y. Wu, “How close is chatgpt to human
experts? comparison corpus, evaluation, and detec-
tion,” arXiv preprint arXiv:2301.07597, 2023.
[185] M. Conover, M. Hayes, A. Mathur, J. Xie, J. Wan,
S. Shah, A. Ghodsi, P. Wendell, M. Zaharia, and
R. Xin. (2023) Free dolly: Introducing the world’s first
truly open instruction-tuned llm.
[186] A. Köpf, Y. Kilcher, D. von Rütte, S. Anagnostidis, Z.-
R. Tam, K. Stevens, A. Barhoum, N. M. Duc, O. Stan-
ley, R. Nagyfi et al., “Openassistant conversations–
democratizing large language model alignment,”
arXiv preprint arXiv:2304.07327, 2023.
[187] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li,
C. Guestrin, P. Liang, and T. B. Hashimoto, “Stanford
alpaca: An instruction-following llama model,”
https://github.com/tatsu-lab/stanford_alpaca,
2023.
[188] J. Cheung, “Guanaco - generative universal assistant
for natural-language adaptive context-aware om-
nilingual outputs,” https://guanaco-model.github.
io/, 2023.
[189] C. Xu, D. Guo, N. Duan, and J. McAuley,
“Baize: An open-source chat model with parameter-
efficient tuning on self-chat data,” arXiv preprint
arXiv:2304.01196, 2023.
[190] Y. Ji, Y. Gong, Y. Deng, Y. Peng, Q. Niu, B. Ma,
and X. Li, “Towards better instruction following
language models for chinese: Investigating the im-
pact of training data and evaluation,” arXiv preprint
arXiv:2304.07854, 2023.
[191] K. Ethayarajh, Y. Choi, and S. Swayamdipta, “Under-
standing dataset difficulty with V-usable informa-
tion,” in Proceedings of the 39th International Conference
on Machine Learning, 2022, pp. 5988–6008.
[192] N. Lambert, L. Tunstall, N. Rajani,
and T. Thrush. (2023) Huggingface h4
stack exchange preference dataset. [On-
line]. Available: https://huggingface.co/datasets/
HuggingFaceH4/stack-exchange-preferences
[193] R. Liu, R. Yang, C. Jia, G. Zhang, D. Zhou, A. M.
Dai, D. Yang, and S. Vosoughi, “Training socially
aligned language models in simulated human soci-
ety,” CoRR, vol. abs/2305.16960, 2023.
[194] G. Xu, J. Liu, M. Yan, H. Xu, J. Si, Z. Zhou, P. Yi,
X. Gao, J. Sang, R. Zhang, J. Zhang, C. Peng,
F. Huang, and J. Zhou, “Cvalues: Measuring the
values of chinese large language models from safety
to responsibility,” 2023.
[195] J. Dai, X. Pan, R. Sun, J. Ji, X. Xu, M. Liu, Y. Wang, and
Y. Yang, “Safe rlhf: Safe reinforcement learning from
human feedback,” arXiv preprint arXiv:2310.12773,
2023.
[196] V. Sanh, A. Webson, C. Raffel, S. H. Bach,
L. Sutawika, Z. Alyafeai, A. Chaffin, A. Stiegler,
A. Raja, M. Dey, M. S. Bari, C. Xu, U. Thakker, S. S.
Sharma, E. Szczechla, T. Kim, G. Chhablani, N. V.
Nayak, D. Datta, J. Chang, M. T. Jiang, H. Wang,
M. Manica, S. Shen, Z. X. Yong, H. Pandey, R. Baw-
den, T. Wang, T. Neeraj, J. Rozen, A. Sharma, A. San-
tilli, T. Févry, J. A. Fries, R. Teehan, T. L. Scao, S. Bi-
derman, L. Gao, T. Wolf, and A. M. Rush, “Multitask
prompted training enables zero-shot task generaliza-
tion,” in The Tenth International Conference on Learning
Representations, ICLR 2022, Virtual Event, April 25-29,
2022. OpenReview.net, 2022.
[197] S. Longpre, L. Hou, T. Vu, A. Webson, H. W.
Chung, Y. Tay, D. Zhou, Q. V. Le, B. Zoph, J. Wei
et al., “The flan collection: Designing data and meth-
ods for effective instruction tuning,” arXiv preprint
arXiv:2301.13688, 2023.
[198] K. Cobbe, V. Kosaraju, M. Bavarian, J. Hilton,
R. Nakano, C. Hesse, and J. Schulman, “Training
verifiers to solve math word problems,” CoRR, vol.
abs/2110.14168, 2021.
[199] M. Geva, D. Khashabi, E. Segal, T. Khot, D. Roth,
and J. Berant, “Did aristotle use a laptop? A ques-
tion answering benchmark with implicit reasoning
strategies,” Trans. Assoc. Comput. Linguistics, vol. 9,
pp. 346–361, 2021.
[200] O. Camburu, B. Shillingford, P. Minervini,
T. Lukasiewicz, and P. Blunsom, “Make up your
mind! adversarial generation of inconsistent natural
language explanations,” in Proceedings of the 58th
Annual Meeting of the Association for Computational
Linguistics, ACL 2020, Online, July 5-10, 2020,
D. Jurafsky, J. Chai, N. Schluter, and J. R. Tetreault,
Eds. Association for Computational Linguistics,
2020, pp. 4157–4165.
[201] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. De-
langue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Fun-
towicz, J. Davison, S. Shleifer, P. von Platen, C. Ma,
Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger,
M. Drame, Q. Lhoest, and A. M. Rush, “Transform-
ers: State-of-the-art natural language processing,” in
Proceedings of the 2020 Conference on Empirical Methods
in Natural Language Processing: System Demonstrations,
EMNLP 2020 - Demos, Online, November 16-20, 2020.
Association for Computational Linguistics, 2020, pp.
38–45.
[202] J. Bradbury, R. Frostig, P. Hawkins, M. J.
Johnson, C. Leary, D. Maclaurin, G. Necula,
A. Paszke, J. VanderPlas, S. Wanderman-Milne,
and Q. Zhang, “JAX: composable transformations
of Python+NumPy programs,” 2018. [Online].
Available: http://github.com/google/jax
[203] Z. Bian, H. Liu, B. Wang, H. Huang, Y. Li, C. Wang,
F. Cui, and Y. You, “Colossal-ai: A unified deep learn-
ing system for large-scale parallel training,” CoRR,
vol. abs/2110.14883, 2021.
[204] J. Fang, Y. Yu, S. Li, Y. You, and J. Zhou, “Patrick-
star: Parallel training of pre-trained models via
a chunk-based memory management,” CoRR, vol.
abs/2108.05818, 2021.
[205] Y. You, “Colossalchat: An open-source solution
for cloning chatgpt with a complete
rlhf pipeline,” 2023. [Online]. Available:
https://medium.com/@yangyou_berkeley/
colossalchat-an-open-source-solution-for-cloning-
chatgpt-with-a-complete-rlhf-pipeline-5edf08fb538b
[206] “Bmtrain: Efficient training for big models.” [Online].
Available: https://github.com/OpenBMB/BMTrain
[207] J. He, J. Qiu, A. Zeng, Z. Yang, J. Zhai, and J. Tang,
“Fastmoe: A fast mixture-of-expert training system,”
CoRR, vol. abs/2103.13262, 2021.
[208] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng,
C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica,
“Efficient memory management for large language
model serving with pagedattention,” in Proceedings
of the ACM SIGOPS 29th Symposium on Operating
Systems Principles, 2023.
[209] (2023) Deepspeed-mii. [Online]. Available: https:
//github.com/microsoft/DeepSpeed-MII
[210] Z. Yao, R. Y. Aminabadi, O. Ruwase, S. Rajbhan-
dari, X. Wu, A. A. Awan, J. Rasley, M. Zhang,
C. Li, C. Holmes, Z. Zhou, M. Wyatt, M. Smith,
L. Kurilenko, H. Qin, M. Tanaka, S. Che, S. L. Song,
and Y. He, “DeepSpeed-Chat: Easy, Fast and Afford-
able RLHF Training of ChatGPT-like Models at All
Scales,” arXiv preprint arXiv:2308.01320, 2023.
[211] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Brad-
bury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein,
L. Antiga, A. Desmaison, A. Köpf, E. Z. Yang,
Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy,
B. Steiner, L. Fang, J. Bai, and S. Chintala, “Py-
torch: An imperative style, high-performance deep
learning library,” in Advances in Neural Information
Processing Systems 32: Annual Conference on Neural
Information Processing Systems 2019, NeurIPS 2019,
December 8-14, 2019, Vancouver, BC, Canada, H. M.
Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett, Eds., 2019, pp. 8024–
8035.
[212] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis,
J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Is-
ard, M. Kudlur, J. Levenberg, R. Monga, S. Moore,
D. G. Murray, B. Steiner, P. A. Tucker, V. Vasudevan,
P. Warden, M. Wicke, Y. Yu, and X. Zheng, “Tensor-
flow: A system for large-scale machine learning,” in
12th USENIX Symposium on Operating Systems Design
and Implementation, OSDI 2016, Savannah, GA, USA,
November 2-4, 2016, K. Keeton and T. Roscoe, Eds.
USENIX Association, 2016, pp. 265–283.
[213] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang,
T. Xiao, B. Xu, C. Zhang, and Z. Zhang, “Mxnet:
A flexible and efficient machine learning library
for heterogeneous distributed systems,” CoRR, vol.
abs/1512.01274, 2015.
[214] Y. Ma, D. Yu, T. Wu, and H. Wang, “Paddlepaddle:
An open-source deep learning platform from indus-
trial practice,” Frontiers of Data and Computing, vol. 1,
no. 1, p. 105, 2019.
[215] Huawei Technologies Co., Ltd., “Huawei mindspore
ai development framework,” in Artificial Intelligence
Technology. Springer, 2022, pp. 137–162.
[216] J. Yuan, X. Li, C. Cheng, J. Liu, R. Guo, S. Cai, C. Yao,
F. Yang, X. Yi, C. Wu, H. Zhang, and J. Zhao, “One-
flow: Redesign the distributed deep learning frame-
work from scratch,” CoRR, vol. abs/2110.15032, 2021.
[217] S. Roller, E. Dinan, N. Goyal, D. Ju, M. Williamson,
Y. Liu, J. Xu, M. Ott, E. M. Smith, Y. Boureau, and
J. Weston, “Recipes for building an open-domain
chatbot,” in Proceedings of the 16th Conference of the
European Chapter of the Association for Computational
Linguistics: Main Volume, EACL 2021, Online, April 19 - 23, 2021, 2021, pp. 300–325.
    [218] A. Lewkowycz, A. Andreassen, D. Dohan, E. Dyer,
    H. Michalewski, V. V. Ramasesh, A. Slone, C. Anil,
    I. Schlag, T. Gutman-Solo, Y. Wu, B. Neyshabur,
    G. Gur-Ari, and V. Misra, “Solving quantitative rea-
    soning problems with language models,” CoRR, vol.
    abs/2206.14858, 2022.
[219] T. Saier, J. Krause, and M. Färber, “unarxive 2022:
    All arxiv publications pre-processed for nlp, includ-
    ing structured full-text and citation network,” arXiv
    preprint arXiv:2303.14957, 2023.
    [220] H. A. Simon, “Experiments with a heuristic com-
    piler,” J. ACM, vol. 10, no. 4, pp. 493–506, 1963.
    [221] Z. Manna and R. J. Waldinger, “Toward automatic
    program synthesis,” Commun. ACM, vol. 14, no. 3,
    pp. 151–165, 1971.
    [222] Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong,
    L. Shou, B. Qin, T. Liu, D. Jiang, and M. Zhou,
    “Codebert: A pre-trained model for programming
    and natural languages,” in Findings of EMNLP, 2020.
    [223] J. Austin, A. Odena, M. I. Nye, M. Bosma,
    H. Michalewski, D. Dohan, E. Jiang, C. J. Cai,
    M. Terry, Q. V. Le, and C. Sutton, “Program syn-
    thesis with large language models,” CoRR, vol.
    abs/2108.07732, 2021.
    [224] S. Black, L. Gao, P. Wang, C. Leahy, and S. Bider-
    man, “GPT-Neo: Large Scale Autoregressive Lan-
    guage Modeling with Mesh-Tensorflow,” 2021.
    [225] F. F. Xu, U. Alon, G. Neubig, and V. J. Hellendoorn,
    “A systematic evaluation of large language models
    of code,” in MAPS@PLDI, 2022.
    [226] A. Madaan, S. Zhou, U. Alon, Y. Yang, and G. Neu-
    big, “Language models of code are few-shot com-
    monsense learners,” in Proceedings of the 2022 Confer-
    ence on Empirical Methods in Natural Language Process-
    ing, EMNLP 2022, Abu Dhabi, United Arab Emirates,
    December 7-11, 2022, Y. Goldberg, Z. Kozareva, and
    Y. Zhang, Eds. Association for Computational Lin-
    guistics, 2022, pp. 1384–1403.
    [227] S. Longpre, G. Yauney, E. Reif, K. Lee, A. Roberts,
    B. Zoph, D. Zhou, J. Wei, K. Robinson, D. Mimno
    et al., “A pretrainer’s guide to training data: Measur-
    ing the effects of data age, domain coverage, quality,
    & toxicity,” arXiv preprint arXiv:2305.13169, 2023.
    [228] D. Chen, Y. Huang, Z. Ma, H. Chen, X. Pan, C. Ge,
    D. Gao, Y. Xie, Z. Liu, J. Gao, Y. Li, B. Ding, and
    J. Zhou, “Data-juicer: A one-stop data processing
    system for large language models,” 2023.
    [229] M. Abdin, S. A. Jacobs, A. A. Awan, J. Aneja,
    A. Awadallah, H. Awadalla, N. Bach, A. Bahree,
    A. Bakhtiari, H. Behl et al., “Phi-3 technical report:
    A highly capable language model locally on your
    phone,” arXiv preprint arXiv:2404.14219, 2024.
[230] G. Penedo, H. Kydlíček, A. Lozhkov, M. Mitchell,
    C. Raffel, L. Von Werra, T. Wolf et al., “The fineweb
    datasets: Decanting the web for the finest text data at
    scale,” arXiv preprint arXiv:2406.17557, 2024.
    [231] P. Maini, S. Seto, H. Bai, D. Grangier, Y. Zhang, and
    N. Jaitly, “Rephrasing the web: A recipe for compute
    and data-efficient language modeling,” in ICLR 2024
    Workshop on Navigating and Addressing Data Problems
    for Foundation Models, 2024.
[232] M. Marion, A. Üstün, L. Pozzobon, A. Wang,
    M. Fadaee, and S. Hooker, “When less is more: Inves-
    tigating data pruning for pretraining llms at scale,”
    arXiv preprint arXiv:2309.04564, 2023.
    [233] N. Sachdeva, B. Coleman, W.-C. Kang, J. Ni, L. Hong,
    E. H. Chi, J. Caverlee, J. McAuley, and D. Z. Cheng,
    “How to train data-efficient llms,” arXiv preprint
    arXiv:2402.09668, 2024.
    [234] D. Hernandez, T. B. Brown, T. Conerly, N. Das-
    Sarma, D. Drain, S. E. Showk, N. Elhage, Z. Hatfield-
    Dodds, T. Henighan, T. Hume, S. Johnston, B. Mann,
    C. Olah, C. Olsson, D. Amodei, N. Joseph, J. Ka-
    plan, and S. McCandlish, “Scaling laws and inter-
    pretability of learning from repeated data,” CoRR,
    vol. abs/2205.10487, 2022.
    [235] A. Holtzman, J. Buys, L. Du, M. Forbes, and Y. Choi,
    “The curious case of neural text degeneration,” in 8th
    International Conference on Learning Representations,
    ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020.
    OpenReview.net, 2020.
    [236] K. Lee, D. Ippolito, A. Nystrom, C. Zhang, D. Eck,
    C. Callison-Burch, and N. Carlini, “Deduplicating
    training data makes language models better,” in Pro-
    ceedings of the 60th Annual Meeting of the Association
    for Computational Linguistics (Volume 1: Long Papers),
    ACL 2022, Dublin, Ireland, May 22-27, 2022, 2022, pp.
    8424–8445.
[237] N. Carlini, D. Ippolito, M. Jagielski, K. Lee, F. Tramèr,
    and C. Zhang, “Quantifying memorization across
    neural language models,” CoRR, 2022.
    [238] N. Kandpal, E. Wallace, and C. Raffel, “Deduplicat-
    ing training data mitigates privacy risks in language
    models,” in International Conference on Machine Learn-
    ing, ICML 2022, 17-23 July 2022, Baltimore, Maryland,
    USA. PMLR, 2022, pp. 10 697–10 707.
    [239] J. D. Lafferty, A. McCallum, and F. C. N. Pereira,
    “Conditional random fields: Probabilistic models
    for segmenting and labeling sequence data,” in
    Proceedings of the Eighteenth International Conference
    on Machine Learning (ICML 2001), Williams College,
    Williamstown, MA, USA, June 28 - July 1, 2001, C. E.
    Brodley and A. P. Danyluk, Eds. Morgan Kaufmann,
    2001, pp. 282–289.
    [240] P. Gage, “A new algorithm for data compression,” C
    Users Journal, vol. 12, no. 2, pp. 23–38, 1994.
    [241] R. Sennrich, B. Haddow, and A. Birch, “Neural ma-
    chine translation of rare words with subword units,”
    in Proceedings of the 54th Annual Meeting of the Associa-
    tion for Computational Linguistics, ACL 2016, August 7-
    12, 2016, Berlin, Germany, Volume 1: Long Papers. The
    Association for Computer Linguistics, 2016.
    [242] M. Schuster and K. Nakajima, “Japanese and korean
    voice search,” in 2012 IEEE international conference on
    acoustics, speech and signal processing (ICASSP). IEEE,
    2012, pp. 5149–5152.
    [243] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi,
    W. Macherey, M. Krikun, Y. Cao, Q. Gao,
    K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu,
    L. Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa,
    K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young,
    J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. Cor-
    rado, M. Hughes, and J. Dean, “Google’s neural
    machine translation system: Bridging the gap be-
    tween human and machine translation,” CoRR, vol.
    abs/1609.08144, 2016.
    [244] T. Kudo, “Subword regularization: Improving neural
    network translation models with multiple subword
    candidates,” in Proceedings of the 56th Annual Meeting
    of the Association for Computational Linguistics, ACL
    2018, Melbourne, Australia, July 15-20, 2018, Volume 1:
    Long Papers, I. Gurevych and Y. Miyao, Eds. Associ-
    ation for Computational Linguistics, 2018, pp. 66–75.
    [245] T. Kudo and J. Richardson, “Sentencepiece: A simple
    and language independent subword tokenizer and
    detokenizer for neural text processing,” in Proceed-
    ings of the 2018 Conference on Empirical Methods in
    Natural Language Processing, EMNLP 2018: System
    Demonstrations, Brussels, Belgium, October 31 - Novem-
    ber 4, 2018, E. Blanco and W. Lu, Eds. Association
    for Computational Linguistics, 2018.
[246] M. Davis and M. Dürst, “Unicode normalization
    forms,” 2001.
    [247] P. Nakkiran, G. Kaplun, Y. Bansal, T. Yang, B. Barak,
    and I. Sutskever, “Deep double descent: Where big-
    ger models and more data hurt,” in 8th International
    Conference on Learning Representations, ICLR 2020,
    Addis Ababa, Ethiopia, April 26-30, 2020. OpenRe-
    view.net, 2020.
    [248] K. Tirumala, D. Simig, A. Aghajanyan, and A. S. Mor-
    cos, “D4: Improving llm pretraining via document
    de-duplication and diversification,” arXiv preprint
    arXiv:2308.12284, 2023.
    [249] Z. Shen, T. Tao, L. Ma, W. Neiswanger, J. Hes-
    tness, N. Vassilieva, D. Soboleva, and E. Xing,
    “Slimpajama-dc: Understanding data combinations
for llm training,” arXiv preprint arXiv:2309.10818, 2023.
[250] S. M. Xie, S. Santurkar, T. Ma, and P. Liang, “Data
selection for language models via importance resam-
pling,” arXiv preprint arXiv:2302.03169, 2023.
[251] X. Wang, W. Zhou, Q. Zhang, J. Zhou, S. Gao,
J. Wang, M. Zhang, X. Gao, Y. Chen, and T. Gui,
“Farewell to aimless large-scale pretraining: Influ-
ential subset selection for language model,” arXiv
preprint arXiv:2305.12816, 2023.
[252] D. Paperno, G. Kruszewski, A. Lazaridou, Q. N.
Pham, R. Bernardi, S. Pezzelle, M. Baroni, G. Boleda,
and R. Fernández, “The LAMBADA dataset: Word
prediction requiring a broad discourse context,” in
ACL (1). The Association for Computer Linguistics,
2016.
[253] M. F. Chen, N. Roberts, K. Bhatia, J. Wang, C. Zhang,
F. Sala, and C. Ré, “Skill-it! a data-driven skills
framework for understanding and training language
models,” arXiv preprint arXiv:2307.14430, 2023.
[254] B. Rozière, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y. Adi, J. Liu, T. Remez, J. Rapin, A. Kozhevnikov, I. Evtimov, J. Bitton, M. Bhatt, C. Canton-Ferrer, A. Grattafiori, W. Xiong, A. Défossez, J. Copet, F. Azhar, H. Touvron, L. Martin, N. Usunier, T. Scialom, and G. Synnaeve, “Code llama: Open foundation models for code,” CoRR, vol. abs/2308.12950, 2023.
[255] Y. Bengio, J. Louradour, R. Collobert, and J. Weston, “Curriculum learning,” in ICML, 2009, pp. 41–48.
[256] C. Xu, C. Rosset, L. Del Corro, S. Mahajan, J. McAuley, J. Neville, A. H. Awadallah, and N. Rao, “Contrastive post-training large language models on data curriculum,” arXiv preprint arXiv:2310.02263, 2023.
[257] S. Tworkowski, K. Staniszewski, M. Pacek, Y. Wu, H. Michalewski, and P. Milos, “Focused transformer: Contrastive training for context scaling,” CoRR, vol. abs/2307.03170, 2023.
[258] Z. Azerbayev, H. Schoelkopf, K. Paster, M. D. Santos, S. McAleer, A. Q. Jiang, J. Deng, S. Biderman, and S. Welleck, “Llemma: An open language model for mathematics,” arXiv preprint arXiv:2310.10631, 2023.
[259] S. Chen, S. Wong, L. Chen, and Y. Tian, “Extending context window of large language models via positional interpolation,” CoRR, vol. abs/2306.15595, 2023.
[260] G. Wenzek, M.-A. Lachaux, A. Conneau, V. Chaudhary, F. Guzmán, A. Joulin, and E. Grave, “Ccnet: Extracting high quality monolingual datasets from web crawl data,” in Proceedings of the Twelfth Language Resources and Evaluation Conference, 2020, pp. 4003–4012.
[261] A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov, “Bag of tricks for efficient text classification,” in EACL, 2017, pp. 427–431.
[262] D. Chen, Y. Huang, Z. Ma, H. Chen, X. Pan, C. Ge, D. Gao, Y. Xie, Z. Liu, J. Gao et al., “Data-juicer: A one-stop data processing system for large language models,” arXiv preprint arXiv:2309.02033, 2023.
[263] B. Zhang, B. Ghorbani, A. Bapna, Y. Cheng, X. Garcia, J. Shen, and O. Firat, “Examining scaling and transfer of language model architectures for machine translation,” in International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, 2022, pp. 26 176–26 192.
[264] L. Dong, N. Yang, W. Wang, F. Wei, X. Liu, Y. Wang, J. Gao, M. Zhou, and H. Hon, “Unified language model pre-training for natural language understanding and generation,” in Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, 2019, pp. 13 042–13 054.
[265] A. Clark, D. de Las Casas, A. Guy, A. Mensch, M. Paganini, J. Hoffmann, B. Damoc, B. A. Hechtman, T. Cai, S. Borgeaud, G. van den Driessche, E. Rutherford, T. Hennigan, M. J. Johnson, A. Cassirer, C. Jones, E. Buchatskaya, D. Budden, L. Sifre, S. Osindero, O. Vinyals, M. Ranzato, J. W. Rae, E. Elsen, K. Kavukcuoglu, and K. Simonyan, “Unified scaling laws for routed language models,” in International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, 2022, pp. 4057–4086.
[266] A. Gu, K. Goel, and C. Ré, “Efficiently modeling long sequences with structured state spaces,” in The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022. [Online]. Available: https://openreview.net/forum?id=uYLFoz1vlAC
[267] J. T. Smith, A. Warrington, and S. Linderman, “Simplified state space layers for sequence modeling,” in ICLR, 2023.
[268] A. Orvieto, S. L. Smith, A. Gu, A. Fernando, C. Gulcehre, R. Pascanu, and S. De, “Resurrecting recurrent neural networks for long sequences,” in ICML, 2023.
[269] M. Poli, S. Massaroli, E. Nguyen, D. Y. Fu, T. Dao, S. Baccus, Y. Bengio, S. Ermon, and C. Ré, “Hyena hierarchy: Towards larger convolutional language models,” in ICML, 2023.
[270] B. Peng, E. Alcaide, Q. Anthony, A. Albalak, S. Arcadinho, H. Cao, X. Cheng, M. Chung, M. Grella, K. K. G. V., X. He, H. Hou, P. Kazienko, J. Kocon, J. Kong, B. Koptyra, H. Lau, K. S. I. Mantri, F. Mom, A. Saito, X. Tang, B. Wang, J. S. Wind, S. Wozniak, R. Zhang, Z. Zhang, Q. Zhao, P. Zhou, J. Zhu, and R. Zhu, “RWKV: reinventing rnns for the transformer era,” CoRR, vol. abs/2305.13048, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2305.13048
[271] Y. Sun, L. Dong, S. Huang, S. Ma, Y. Xia, J. Xue, J. Wang, and F. Wei, “Retentive network: A successor to transformer for large language models,” arXiv preprint arXiv:2307.08621, 2023.
[272] A. Gu and T. Dao, “Mamba: Linear-time sequence modeling with selective state spaces,” CoRR, vol. abs/2312.00752, 2023.
[273] B. Peng, E. Alcaide, Q. Anthony, A. Albalak, S. Arcadinho, H. Cao, X. Cheng, M. Chung, M. Grella, K. K. GV et al., “Rwkv: Reinventing rnns for the transformer era,” arXiv preprint arXiv:2305.13048, 2023.
[274] M. Ding, Z. Yang, W. Hong, W. Zheng, C. Zhou, D. Yin, J. Lin, X. Zou, Z. Shao, H. Yang, and J. Tang, “Cogview: Mastering text-to-image generation via transformers,” in Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, 2021, pp. 19 822–19 835.
[275] L. J. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” vol. abs/1607.06450, 2016.
[276] B. Zhang and R. Sennrich, “Root mean square layer normalization,” in Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, 2019, pp. 12 360–12 371.
[277] H. Wang, S. Ma, L. Dong, S. Huang, D. Zhang, and F. Wei, “Deepnet: Scaling transformers to 1,000 layers,” vol. abs/2203.00555, 2022.
[278] V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,” in Proceedings of the 27th international conference on machine learning (ICML-10), 2010, pp. 807–814.
[279] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, “GLUE: A multi-task benchmark and analysis platform for natural language understanding,” in Proceedings of the Workshop: Analyzing and Interpreting Neural Networks for NLP, BlackboxNLP@EMNLP 2018, Brussels, Belgium, November 1, 2018, T. Linzen, G. Chrupala, and A. Alishahi, Eds. Association for Computational Linguistics, 2018, pp. 353–355.
[280] P. Ramachandran, B. Zoph, and Q. V. Le, “Searching for activation functions,” arXiv preprint arXiv:1710.05941, 2017.
[281] N. Shazeer, “GLU variants improve transformer,” vol. abs/2002.05202, 2020.
[282] J. Su, Y. Lu, S. Pan, B. Wen, and Y. Liu, “Roformer: Enhanced transformer with rotary position embedding,” vol. abs/2104.09864, 2021.
[283] O. Press, N. A. Smith, and M. Lewis, “Train short, test long: Attention with linear biases enables input length extrapolation,” in The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022, 2022.
[284] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, ser. JMLR Workshop and Conference Proceedings, F. R. Bach and D. M. Blei, Eds., vol. 37. JMLR.org, 2015, pp. 448–456. [Online]. Available: http://proceedings.mlr.press/v37/ioffe15.html
[285] S. Narang, H. W. Chung, Y. Tay, L. Fedus, T. Févry, M. Matena, K. Malkan, N. Fiedel, N. Shazeer, Z. Lan, Y. Zhou, W. Li, N. Ding, J. Marcus, A. Roberts, and C. Raffel, “Do transformer modifications transfer across implementations and applications?” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, 2021, pp. 5758–5773.
[286] R. Xiong, Y. Yang, D. He, K. Zheng, S. Zheng, C. Xing, H. Zhang, Y. Lan, L. Wang, and T. Liu, “On layer normalization in the transformer architecture,” in ICML, 2020.
[287] A. Baevski and M. Auli, “Adaptive input representations for neural language modeling,” in 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019.
[288] L. Liu, X. Liu, J. Gao, W. Chen, and J. Han, “Understanding the difficulty of training transformers,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020. Association for Computational Linguistics, 2020, pp. 5747–5763.
[289] D. Hendrycks and K. Gimpel, “Gaussian error linear units (gelus),” arXiv preprint arXiv:1606.08415, 2016.
[290] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, “Language modeling with gated convolutional networks,” in Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, 2017, pp. 933–941.
[291] T. L. Scao, T. Wang, D. Hesslow, S. Bekman, M. S. Bari, S. Biderman, H. Elsahar, N. Muennighoff, J. Phang, O. Press, C. Raffel, V. Sanh, S. Shen, L. Sutawika, J. Tae, Z. X. Yong, J. Launay, and I. Beltagy, “What language model to train if you have one million GPU hours?” in Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, 2022, pp. 765–782.
[292] P. Shaw, J. Uszkoreit, and A. Vaswani, “Self-attention with relative position representations,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 2 (Short Papers), M. A. Walker, H. Ji, and A. Stent, Eds. Association for Computational Linguistics, 2018, pp. 464–468. [Online]. Available: https://doi.org/10.18653/v1/n18-2074
[293] Z. Dai, Z. Yang, Y. Yang, J. G. Carbonell, Q. V. Le, and R. Salakhutdinov, “Transformer-xl: Attentive language models beyond a fixed-length context,” in Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, A. Korhonen, D. R. Traum, and L. Màrquez, Eds. Association for Computational Linguistics, 2019, pp. 2978–2988. [Online]. Available: https://doi.org/10.18653/v1/p19-1285
[294] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhut-
dinov, and Q. V. Le, “Xlnet: Generalized autoregres-
sive pretraining for language understanding,” Ad-
vances in neural information processing systems, vol. 32,
2019.
[295] B. Peng, J. Quesnelle, H. Fan, and E. Shippole, “Yarn:
Efficient context window extension of large language
models,” CoRR, vol. abs/2309.00071, 2023.
[296] Y. Sun, L. Dong, B. Patra, S. Ma, S. Huang,
A. Benhaim, V. Chaudhary, X. Song, and F. Wei,
“A length-extrapolatable transformer,” CoRR, vol.
abs/2212.10554, 2022. [Online]. Available: https:
//doi.org/10.48550/arXiv.2212.10554
[297] H. Peng, N. Pappas, D. Yogatama, R. Schwartz, N. A.
Smith, and L. Kong, “Random feature attention,”
in 9th International Conference on Learning Representa-
tions, ICLR 2021, Virtual Event, Austria, May 3-7, 2021.
[298] M. Zaheer, G. Guruganesh, K. A. Dubey, J. Ainslie,
C. Alberti, S. Ontañón, P. Pham, A. Ravula, Q. Wang,
L. Yang, and A. Ahmed, “Big bird: Transformers for
longer sequences,” in Advances in Neural Information
Processing Systems 33: Annual Conference on Neural
Information Processing Systems 2020, NeurIPS 2020,
December 6-12, 2020, virtual, 2020.
[299] R. Child, S. Gray, A. Radford, and I. Sutskever, “Gen-
erating long sequences with sparse transformers,”
CoRR, vol. abs/1904.10509, 2019.
[300] N. Shazeer, “Fast transformer decoding: One write-
head is all you need,” CoRR, vol. abs/1911.02150,
2019. [Online]. Available: http://arxiv.org/abs/1911.
02150
[301] J. Ainslie, J. Lee-Thorp, M. de Jong, Y. Zemlyanskiy,
F. Lebrón, and S. Sanghai, “Gqa: Training gener-
alized multi-query transformer models from multi-
head checkpoints,” arXiv preprint arXiv:2305.13245,
2023.
[302] T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Re,
“Flashattention: Fast and memory-efficient exact at-
tention with IO-awareness,” in NeurIPS, 2022.
[303] T. Dao, “Flashattention-2: Faster attention with better
parallelism and work partitioning,” arXiv preprint
arXiv:2307.08691, 2023.
[304] “vllm: Easy, fast, and cheap llm serving with
pagedattention.” [Online]. Available: https://vllm.
ai/
[305] K. Murray and D. Chiang, “Correcting length bias in
neural machine translation,” in WMT. Association
for Computational Linguistics, 2018, pp. 212–223.
[306] A. Holtzman, J. Buys, L. Du, M. Forbes, and Y. Choi,
“The curious case of neural text degeneration,” in
ICLR, 2020.
[307] Carnegie-Mellon University, Dept. of Computer Science, Speech Under-
standing Systems. Summary of Results of the Five-Year
Research Effort at Carnegie-Mellon University, 1977.
[308] P. Koehn and R. Knowles, “Six challenges for neural
machine translation,” in NMT@ACL. Association
for Computational Linguistics, 2017, pp. 28–39.
[309] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi,
W. Macherey, M. Krikun, Y. Cao, Q. Gao,
K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu,
L. Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa,
K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young,
J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. Cor-
rado, M. Hughes, and J. Dean, “Google’s neural
machine translation system: Bridging the gap be-
tween human and machine translation,” CoRR, vol.
abs/1609.08144, 2016.
[310] R. Paulus, C. Xiong, and R. Socher, “A deep re-
inforced model for abstractive summarization,” in
ICLR (Poster). OpenReview.net, 2018.
[311] A. K. Vijayakumar, M. Cogswell, R. R. Selvaraju,
Q. Sun, S. Lee, D. J. Crandall, and D. Batra, “Diverse
beam search: Decoding diverse solutions from neural
sequence models,” CoRR, vol. abs/1610.02424, 2016.
[312] A. Fan, M. Lewis, and Y. N. Dauphin, “Hierarchical
neural story generation,” in ACL (1). Association for
Computational Linguistics, 2018, pp. 889–898.
[313] J. Hewitt, C. D. Manning, and P. Liang, “Trunca-
tion sampling as language model desmoothing,” in
EMNLP (Findings). Association for Computational
Linguistics, 2022, pp. 3414–3427.
[314] Y. Su, T. Lan, Y. Wang, D. Yogatama, L. Kong, and
N. Collier, “A contrastive framework for neural text
generation,” in NeurIPS, 2022.
[315] C. Meister, T. Pimentel, G. Wiher, and R. Cotterell,
“Locally typical sampling,” Trans. Assoc. Comput. Lin-
guistics, 2023.
[316] X. L. Li, A. Holtzman, D. Fried, P. Liang, J. Eis-
ner, T. Hashimoto, L. Zettlemoyer, and M. Lewis,
“Contrastive decoding: Open-ended text generation
as optimization,” in ACL (1). Association for Com-
putational Linguistics, 2023, pp. 12 286–12 312.
[317] Y. Chuang, Y. Xie, H. Luo, Y. Kim, J. R. Glass, and
P. He, “Dola: Decoding by contrasting layers im-
proves factuality in large language models,” CoRR,
vol. abs/2309.03883, 2023.
[318] D. P. Kingma and J. Ba, “Adam: A method for
stochastic optimization,” in 3rd International Confer-
ence on Learning Representations, ICLR 2015, San Diego,
CA, USA, May 7-9, 2015, Conference Track Proceedings,
Y. Bengio and Y. LeCun, Eds., 2015.
[319] I. Loshchilov and F. Hutter, “Fixing weight decay
regularization in adam,” CoRR, vol. abs/1711.05101,
2017.
[320] N. Shazeer and M. Stern, “Adafactor: Adaptive
learning rates with sublinear memory cost,” in Pro-
ceedings of the 35th International Conference on Machine
Learning, ICML 2018, Stockholmsmässan, Stockholm,
Sweden, July 10-15, 2018, ser. Proceedings of Machine
Learning Research, J. G. Dy and A. Krause, Eds.,
vol. 80. PMLR, 2018, pp. 4603–4611.
[321] Y. Huang, Y. Cheng, A. Bapna, O. Firat, D. Chen,
M. X. Chen, H. Lee, J. Ngiam, Q. V. Le, Y. Wu, and
Z. Chen, “Gpipe: Efficient training of giant neural
networks using pipeline parallelism,” in Advances
in Neural Information Processing Systems 32: Annual
Conference on Neural Information Processing Systems
2019, NeurIPS 2019, December 8-14, 2019, Vancouver,
BC, Canada, H. M. Wallach, H. Larochelle, A. Beygelz-
imer, F. d’Alché-Buc, E. B. Fox, and R. Garnett, Eds.,
2019, pp. 103–112.
[322] A. Harlap, D. Narayanan, A. Phanishayee, V. Se-
shadri, N. R. Devanur, G. R. Ganger, and P. B. Gib-
bons, “Pipedream: Fast and efficient pipeline parallel
DNN training,” CoRR, vol. abs/1806.03377, 2018.
[323] P. Micikevicius, S. Narang, J. Alben, G. F. Di-
amos, E. Elsen, D. García, B. Ginsburg, M. Houston,
O. Kuchaiev, G. Venkatesh, and H. Wu, “Mixed pre-
cision training,” CoRR, vol. abs/1710.03740, 2017.
[324] Q. Xu, S. Li, C. Gong, and Y. You, “An efficient
2d method for training super-large deep learning
models,” CoRR, vol. abs/2104.05343, 2021.
[325] B. Wang, Q. Xu, Z. Bian, and Y. You, “Tesseract:
Parallelize the tensor parallelism efficiently,” in Pro-
ceedings of the 51st International Conference on Parallel
Processing, ICPP 2022, Bordeaux, France, 29 August
2022 - 1 September 2022. ACM, 2022.
[326] Z. Bian, Q. Xu, B. Wang, and Y. You, “Maximizing
parallelism in distributed training for huge neural
networks,” CoRR, vol. abs/2105.14450, 2021.
[327] S. Li, F. Xue, C. Baranwal, Y. Li, and Y. You, “Se-
quence parallelism: Long sequence training from
system perspective,” arXiv e-prints, pp. arXiv–2105,
2021.
[328] L. Zheng, Z. Li, H. Zhang, Y. Zhuang, Z. Chen,
Y. Huang, Y. Wang, Y. Xu, D. Zhuo, E. P. Xing
et al., “Alpa: Automating inter- and intra-operator
parallelism for distributed deep learning,” in OSDI,
2022, pp. 559–578.
[329] T. Chen, B. Xu, C. Zhang, and C. Guestrin, “Training
deep nets with sublinear memory cost,” CoRR, vol.
abs/1604.06174, 2016.
[330] FairScale authors, “Fairscale: A general purpose
modular pytorch library for high performance
and large scale training,” https://github.com/
facebookresearch/fairscale, 2021.
[331] R. Lou, K. Zhang, and W. Yin, “Is prompt all you
need? no. A comprehensive and broader view of in-
struction learning,” CoRR, vol. abs/2303.10475, 2023.
[332] X. Liu, P. He, W. Chen, and J. Gao, “Multi-task deep
neural networks for natural language understand-
ing,” in ACL (1). Association for Computational
Linguistics, 2019, pp. 4487–4496.
[333] A. Aghajanyan, A. Gupta, A. Shrivastava, X. Chen,
L. Zettlemoyer, and S. Gupta, “Muppet: Massive
multi-task representations with pre-finetuning,” in
EMNLP (1). Association for Computational Linguis-
tics, 2021, pp. 5799–5811.
[334] S. Longpre, L. Hou, T. Vu, A. Webson, H. W. Chung,
Y. Tay, D. Zhou, Q. V. Le, B. Zoph, J. Wei, and
A. Roberts, “The flan collection: Designing data and
methods for effective instruction tuning,” CoRR, vol.
abs/2301.13688, 2023.
[335] C. Xu, Q. Sun, K. Zheng, X. Geng, P. Zhao, J. Feng,
C. Tao, and D. Jiang, “Wizardlm: Empowering large
language models to follow complex instructions,”
CoRR, vol. abs/2304.12244, 2023. [Online]. Available:
https://doi.org/10.48550/arXiv.2304.12244
[336] Z. Sun, Y. Shen, Q. Zhou, H. Zhang, Z. Chen, D. Cox,
Y. Yang, and C. Gan, “Principle-driven self-alignment
of language models from scratch with minimal hu-
man supervision,” arXiv preprint arXiv:2305.03047,
2023.
[337] X. Li, P. Yu, C. Zhou, T. Schick, L. Zettle-
moyer, O. Levy, J. Weston, and M. Lewis, “Self-
alignment with instruction backtranslation,” CoRR,
vol. abs/2308.06259, 2023.
[338] C. Zhou, P. Liu, P. Xu, S. Iyer, J. Sun, Y. Mao, X. Ma,
A. Efrat, P. Yu, L. Yu et al., “Lima: Less is more for
alignment,” arXiv preprint arXiv:2305.11206, 2023.
[339] L. Chen, S. Li, J. Yan, H. Wang, K. Gunaratna, V. Ya-
dav, Z. Tang, V. Srinivasan, T. Zhou, H. Huang, and
H. Jin, “Alpagasus: Training A better alpaca with
fewer data,” CoRR, vol. abs/2307.08701, 2023.
[340] S. Mukherjee, A. Mitra, G. Jawahar, S. Agarwal,
H. Palangi, and A. H. Awadallah, “Orca: Progressive
learning from complex explanation traces of GPT-4,”
CoRR, vol. abs/2306.02707, 2023.
[341] YuLan-Chat-Team, “Yulan-chat: An open-source
bilingual chatbot,” https://github.com/RUC-GSAI/
YuLan-Chat, 2023.
[342] Y. Huang, X. Liu, Y. Gong, Z. Gou, Y. Shen, N. Duan,
and W. Chen, “Key-point-driven data synthesis with
its enhancement on mathematical reasoning,” CoRR,
vol. abs/2403.02333, 2024.
[343] N. Ding, Y. Chen, B. Xu, Y. Qin, S. Hu, Z. Liu, M. Sun,
and B. Zhou, “Enhancing chat language models by
scaling high-quality instructional conversations,” in
Proceedings of the 2023 Conference on Empirical Methods
in Natural Language Processing, EMNLP 2023, Sin-
gapore, December 6-10, 2023, H. Bouamor, J. Pino,
and K. Bali, Eds. Association for Computational
Linguistics, 2023, pp. 3029–3051.
[344] K. Zhou, B. Zhang, J. Wang, Z. Chen, W. X. Zhao,
J. Sha, Z. Sheng, S. Wang, and J. Wen, “Jiuzhang3.0:
Efficiently improving mathematical reasoning by
training small data synthesis models,” CoRR, vol.
abs/2405.14365, 2024.
[345] Y. Cao, Y. Kang, and L. Sun, “Instruction mining:
High-quality instruction data selection for large lan-
guage models,” CoRR, vol. abs/2307.06290, 2023.
[346] M. Li, Y. Zhang, Z. Li, J. Chen, L. Chen,
N. Cheng, J. Wang, T. Zhou, and J. Xiao, “From
quantity to quality: Boosting LLM performance with
self-guided data selection for instruction tuning,”
CoRR, vol. abs/2308.12032, 2023. [Online]. Available:
https://doi.org/10.48550/arXiv.2308.12032
[347] O. Sener and S. Savarese, “Active learning
for convolutional neural networks: A core-set
approach,” in 6th International Conference on Learning
Representations, ICLR 2018, Vancouver, BC, Canada,
April 30 - May 3, 2018, Conference Track Proceedings.
OpenReview.net, 2018. [Online]. Available: https:
//openreview.net/forum?id=H1aIuk-RW
[348] M. Xia, S. Malladi, S. Gururangan, S. Arora,
and D. Chen, “LESS: selecting influential
data for targeted instruction tuning,” CoRR,
vol. abs/2402.04333, 2024. [Online]. Available:
https://doi.org/10.48550/arXiv.2402.04333
[349] P. W. Koh and P. Liang, “Understanding black-box
predictions via influence functions,” in International
conference on machine learning. PMLR, 2017, pp. 1885–
1894.
[350] Y. Wang, H. Ivison, P. Dasigi, J. Hessel, T. Khot, K. R.
Chandu, D. Wadden, K. MacMillan, N. A. Smith,
I. Beltagy, and H. Hajishirzi, “How far can camels
go? exploring the state of instruction tuning on open
resources,” CoRR, vol. abs/2306.04751, 2023.
[351] X. Liu, H. Yan, S. Zhang, C. An, X. Qiu, and D. Lin,
“Scaling laws of rope-based extrapolation,” CoRR,
vol. abs/2310.05209, 2023.
[352] B. Peng, C. Li, P. He, M. Galley, and J. Gao, “Instruc-
tion tuning with GPT-4,” CoRR, vol. abs/2304.03277,
2023.
[353] M. M. Krell, M. Kosec, S. P. Perez, and A. Fitzgib-
bon, “Efficient sequence packing without cross-
contamination: Accelerating large language mod-
els without impacting performance,” arXiv preprint
arXiv:2107.02027, 2021.
[354] K. Singhal, S. Azizi, T. Tu, S. S. Mahdavi, J. Wei,
H. W. Chung, N. Scales, A. Tanwani, H. Cole-Lewis,
S. Pfohl et al., “Large language models encode clinical
knowledge,” arXiv preprint arXiv:2212.13138, 2022.
[355] J. Zhang, R. Xie, Y. Hou, W. X. Zhao, L. Lin, and
J. Wen, “Recommendation as instruction following:
A large language model empowered recommenda-
tion approach,” CoRR, vol. abs/2305.07001, 2023.
[356] H. Wang, C. Liu, N. Xi, Z. Qiang, S. Zhao, B. Qin, and
T. Liu, “Huatuo: Tuning llama model with chinese
medical knowledge,” arXiv preprint arXiv:2304.06975,
2023.
[357] Q. Huang, M. Tao, Z. An, C. Zhang, C. Jiang, Z. Chen,
Z. Wu, and Y. Feng, “Lawyer llama technical report,”
arXiv preprint arXiv:2305.15062, 2023.
[358] S. Wu, O. Irsoy, S. Lu, V. Dabravolski, M. Dredze,
S. Gehrmann, P. Kambadur, D. Rosenberg, and
G. Mann, “Bloomberggpt: A large language model
for finance,” arXiv preprint arXiv:2303.17564, 2023.
[359] T. Liu and B. K. H. Low, “Goat: Fine-tuned llama out-
performs gpt-4 on arithmetic tasks,” arXiv preprint
arXiv:2305.14201, 2023.
[360] T. Sun, X. Zhang, Z. He, P. Li, Q. Cheng, H. Yan,
X. Liu, Y. Shao, Q. Tang, X. Zhao, K. Chen, Y. Zheng,
Z. Zhou, R. Li, J. Zhan, Y. Zhou, L. Li, X. Yang, L. Wu,
Z. Yin, X. Huang, and X. Qiu, “Moss: Training con-
versational language models from synthetic data,”
2023.
[361] Y. Dubois, X. Li, R. Taori, T. Zhang, I. Gulrajani,
J. Ba, C. Guestrin, P. Liang, and T. B. Hashimoto,
“Alpacafarm: A simulation framework for methods
that learn from human feedback,” CoRR, vol.
abs/2305.14387, 2023. [Online]. Available: https:
//doi.org/10.48550/arXiv.2305.14387
[362] D. Hendrycks, C. Burns, S. Basart, A. Zou,
M. Mazeika, D. Song, and J. Steinhardt, “Measur-
ing massive multitask language understanding,” in
ICLR. OpenReview.net, 2021.
[363] M. Suzgun, N. Scales, N. Schärli, S. Gehrmann,
Y. Tay, H. W. Chung, A. Chowdhery, Q. V. Le, E. H.
Chi, D. Zhou, and J. Wei, “Challenging big-bench
tasks and whether chain-of-thought can solve them,”
CoRR, vol. abs/2210.09261, 2022.
[364] Z. Kenton, T. Everitt, L. Weidinger, I. Gabriel,
V. Mikulik, and G. Irving, “Alignment of language
agents,” CoRR, vol. abs/2103.14659, 2021.
[365] D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown,
A. Radford, D. Amodei, P. F. Christiano, and G. Irv-
ing, “Fine-tuning language models from human pref-
erences,” CoRR, vol. abs/1909.08593, 2019.
[366] A. Askell, Y. Bai, A. Chen, D. Drain, D. Ganguli,
T. Henighan, A. Jones, N. Joseph, B. Mann, N. Das-
Sarma, N. Elhage, Z. Hatfield-Dodds, D. Hernandez,
J. Kernion, K. Ndousse, C. Olsson, D. Amodei, T. B.
Brown, J. Clark, S. McCandlish, C. Olah, and J. Ka-
plan, “A general language assistant as a laboratory
for alignment,” CoRR, vol. abs/2112.00861, 2021.
[367] E. Perez, S. Huang, H. F. Song, T. Cai, R. Ring,
J. Aslanides, A. Glaese, N. McAleese, and G. Irving,
“Red teaming language models with language mod-
els,” in Proceedings of the 2022 Conference on Empir-
ical Methods in Natural Language Processing, EMNLP
2022, Abu Dhabi, United Arab Emirates, December 7-11,
2022, Y. Goldberg, Z. Kozareva, and Y. Zhang, Eds.
Association for Computational Linguistics, 2022, pp.
3419–3448.
[368] J. Menick, M. Trebacz, V. Mikulik, J. Aslanides,
H. F. Song, M. Chadwick, M. Glaese, S. Young,
L. Campbell-Gillingham, G. Irving, and
N. McAleese, “Teaching language models to
support answers with verified quotes,” CoRR, vol.
abs/2203.11147, 2022.
[369] Y. Bai, S. Kadavath, S. Kundu, A. Askell,
J. Kernion, A. Jones, A. Chen, A. Goldie,
A. Mirhoseini, C. McKinnon, C. Chen, C. Olsson,
C. Olah, D. Hernandez, D. Drain, D. Ganguli,
D. Li, E. Tran-Johnson, E. Perez, J. Kerr,
J. Mueller, J. Ladish, J. Landau, K. Ndousse,
K. Lukosiute, L. Lovitt, M. Sellitto, N. Elhage,
N. Schiefer, N. Mercado, N. DasSarma, R. Lasenby,
R. Larson, S. Ringer, S. Johnston, S. Kravec,
S. E. Showk, S. Fort, T. Lanham, T. Telleen-
Lawton, T. Conerly, T. Henighan, T. Hume, S. R.
Bowman, Z. Hatfield-Dodds, B. Mann, D. Amodei,
N. Joseph, S. McCandlish, T. Brown, and J. Kaplan,
“Constitutional AI: harmlessness from AI feedback,”
CoRR, vol. abs/2212.08073, 2022. [Online]. Available:
https://doi.org/10.48550/arXiv.2212.08073
[370] H. Lee, S. Phatale, H. Mansoor, K. Lu, T. Mesnard,
C. Bishop, V. Carbune, and A. Rastogi, “RLAIF:
scaling reinforcement learning from human feedback
with AI feedback,” CoRR, vol. abs/2309.00267, 2023.
[371] H. Dong, W. Xiong, D. Goyal, R. Pan, S. Diao,
J. Zhang, K. Shum, and T. Zhang, “RAFT:
reward ranked finetuning for generative foundation
model alignment,” CoRR, vol. abs/2304.06767, 2023.
[Online]. Available: https://doi.org/10.48550/arXiv.
2304.06767
[372] A. Askell, Y. Bai, A. Chen, D. Drain, D. Gan-
guli, T. Henighan, A. Jones, N. Joseph, B. Mann,
N. DasSarma et al., “A general language assis-
tant as a laboratory for alignment,” arXiv preprint
arXiv:2112.00861, 2021.
[373] R. Zheng, S. Dou, S. Gao, W. Shen, B. Wang, Y. Liu,
S. Jin, Q. Liu, L. Xiong, L. Chen et al., “Secrets of rlhf
in large language models part i: Ppo,” arXiv preprint
arXiv:2307.04964, 2023.
[374] J. Uesato, N. Kushman, R. Kumar, H. F. Song,
N. Y. Siegel, L. Wang, A. Creswell, G. Irving, and
I. Higgins, “Solving math word problems with
process- and outcome-based feedback,” CoRR, vol.
abs/2211.14275, 2022.
[375] H. Lightman, V. Kosaraju, Y. Burda, H. Edwards,
B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever,
and K. Cobbe, “Let’s verify step by step,” CoRR, vol.
abs/2305.20050, 2023.
[376] D. Hendrycks, S. Basart, S. Kadavath, M. Mazeika,
A. Arora, E. Guo, C. Burns, S. Puranik, H. He,
D. Song, and J. Steinhardt, “Measuring coding chal-
lenge competence with APPS,” in NeurIPS Datasets
and Benchmarks, 2021.
[377] T. Wang, P. Yu, X. E. Tan, S. O’Brien, R. Pa-
sunuru, J. Dwivedi-Yu, O. Golovneva, L. Zettle-
moyer, M. Fazel-Zarandi, and A. Celikyilmaz, “Shep-
herd: A critic for language model generation,” CoRR,
vol. abs/2308.04592, 2023.
[378] G. Chen, M. Liao, C. Li, and K. Fan, “Alphamath
almost zero: process supervision without process,”
CoRR, vol. abs/2405.03553, 2024.
[379] Q. Ma, H. Zhou, T. Liu, J. Yuan, P. Liu, Y. You, and
H. Yang, “Let’s reward step by step: Step-level re-
ward model as the navigators for reasoning,” CoRR,
vol. abs/2310.10080, 2023.
[380] Z. Chen, K. Zhou, W. X. Zhao, J. Wan, F. Zhang,
D. Zhang, and J. Wen, “Improving large language
models via fine-grained reinforcement learning
with minimum editing constraint,” CoRR, vol.
abs/2401.06081, 2024. [Online]. Available: https:
//doi.org/10.48550/arXiv.2401.06081
[381] Z. Xi, W. Chen, B. Hong, S. Jin, R. Zheng, W. He,
Y. Ding, S. Liu, X. Guo, J. Wang, H. Guo, W. Shen,
X. Fan, Y. Zhou, S. Dou, X. Wang, X. Zhang,
P. Sun, T. Gui, Q. Zhang, and X. Huang, “Train-
ing large language models for reasoning through
reverse curriculum reinforcement learning,” CoRR,
vol. abs/2402.05808, 2024.
[382] D. Silver, J. Schrittwieser, K. Simonyan,
I. Antonoglou, A. Huang, A. Guez, T. Hubert,
L. Baker, M. Lai, A. Bolton, Y. Chen, T. P. Lillicrap,
F. Hui, L. Sifre, G. van den Driessche, T. Graepel,
and D. Hassabis, “Mastering the game of go without
human knowledge,” Nat., pp. 354–359, 2017.
[383] T. Anthony, Z. Tian, and D. Barber, “Thinking fast
and slow with deep learning and tree search,” in
Advances in Neural Information Processing Systems 30:
Annual Conference on Neural Information Processing
Systems 2017, December 4-9, 2017, Long Beach, CA,
USA, 2017, pp. 5360–5370.
[384] H. Luo, Q. Sun, C. Xu, P. Zhao, J. Lou, C. Tao,
X. Geng, Q. Lin, S. Chen, and D. Zhang, “Wizard-
math: Empowering mathematical reasoning for large
language models via reinforced evol-instruct,” CoRR,
vol. abs/2308.09583, 2023.
[385] R. Liu, C. Jia, G. Zhang, Z. Zhuang, T. X. Liu, and
S. Vosoughi, “Second thoughts are best: Learning
to re-align with human values from text edits,” in
NeurIPS, 2022.
[386] X. Lu, S. Welleck, J. Hessel, L. Jiang, L. Qin, P. West,
P. Ammanabrolu, and Y. Choi, “QUARK: control-
lable text generation with reinforced unlearning,” in
NeurIPS, 2022.
[387] J. Scheurer, J. A. Campos, T. Korbak, J. S. Chan,
A. Chen, K. Cho, and E. Perez, “Training language
models with language feedback at scale,” CoRR, vol.
abs/2303.16755, 2023.
[388] G. Guo, R. Zhao, T. Tang, W. X. Zhao, and
J.-R. Wen, “Beyond imitation: Leveraging fine-
grained quality signals for alignment,” arXiv preprint
arXiv:2311.04072, 2023.
[389] R. Krishna, D. Lee, L. Fei-Fei, and M. S. Bernstein,
“Socially situated artificial intelligence enables
learning from human interaction,” Proceedings of the
National Academy of Sciences of the United States of
America, vol. 119, 2022. [Online]. Available: https:
//api.semanticscholar.org/CorpusID:252381954
[390] H. Liu, C. Sferrazza, and P. Abbeel, “Chain of hind-
sight aligns language models with feedback,” CoRR,
vol. abs/2302.02676, 2023.
[391] R. Rafailov, A. Sharma, E. Mitchell, S. Ermon,
C. D. Manning, and C. Finn, “Direct preference
optimization: Your language model is secretly a
reward model,” CoRR, vol. abs/2305.18290, 2023.
[Online]. Available: https://doi.org/10.48550/arXiv.
2305.18290
[392] K. Ethayarajh, W. Xu, N. Muennighoff, D. Jurafsky,
and D. Kiela, “KTO: model alignment as prospect
theoretic optimization,” CoRR, vol. abs/2402.01306,
2024.
[393] Y. Meng, M. Xia, and D. Chen, “Simpo: Simple pref-
erence optimization with a reference-free reward,”
CoRR, vol. abs/2405.14734, 2024.
[394] D. Feng, B. Qin, C. Huang, Z. Zhang, and W. Lei,
“Towards analyzing and understanding the limita-
tions of DPO: A theoretical perspective,” CoRR, vol.
abs/2404.04626, 2024.
[395] A. Gorbatovski, B. Shaposhnikov, A. Malakhov,
N. Surnachev, Y. Aksenov, I. Maksimov, N. Balagan-
sky, and D. Gavrilov, “Learn your reference model
for real good alignment,” CoRR, vol. abs/2404.09656,
2024.
[396] D. Kim, Y. Kim, W. Song, H. Kim, Y. Kim, S. Kim,
and C. Park, “sdpo: Don’t use your data all at once,”
CoRR, vol. abs/2403.19270, 2024.
[397] Z. Yuan, H. Yuan, C. Tan, W. Wang, S. Huang, and
F. Huang, “RRHF: rank responses to align language
models with human feedback without tears,”
CoRR, vol. abs/2304.05302, 2023. [Online]. Available:
https://doi.org/10.48550/arXiv.2304.05302
[398] Y. Zhao, R. Joshi, T. Liu, M. Khalman, M. Saleh,
and P. J. Liu, “Slic-hf: Sequence likelihood calibration
with human feedback,” CoRR, vol. abs/2305.10425,
2023.
[399] A. Fisch, J. Eisenstein, V. Zayats, A. Agarwal,
A. Beirami, C. Nagpal, P. Shaw, and J. Berant, “Ro-
bust preference optimization through reward model
distillation,” CoRR, vol. abs/2405.19316, 2024.
[400] T. Zhang, F. Liu, J. Wong, P. Abbeel, and
J. E. Gonzalez, “The wisdom of hindsight makes
language models better instruction followers,”
CoRR, vol. abs/2302.05206, 2023. [Online]. Available:
https://doi.org/10.48550/arXiv.2302.05206
[401] A. Hussein, M. M. Gaber, E. Elyan, and C. Jayne,
“Imitation learning: A survey of learning methods,”
ACM Comput. Surv., vol. 50, no. 2, apr 2017. [Online].
Available: https://doi.org/10.1145/3054912
[402] S. Levine, “Should i imitate or reinforce,”
2022. [Online]. Available: https://www.youtube.
com/watch?v=sVPm7zOrBxM
[403] J. Schulman, “Reinforcement learning from
human feedback: Progress and challenges,” 2023.
[Online]. Available: https://www.youtube.com/
watch?v=hhiLw5Q_UFg
[404] X. L. Li and P. Liang, “Prefix-tuning: Optimizing
continuous prompts for generation,” in Proceedings of
the 59th Annual Meeting of the Association for Compu-
tational Linguistics and the 11th International Joint Con-
ference on Natural Language Processing, ACL/IJCNLP
2021, (Volume 1: Long Papers), Virtual Event, August
1-6, 2021, C. Zong, F. Xia, W. Li, and R. Navigli, Eds.
Association for Computational Linguistics, 2021, pp.
4582–4597.
[405] B. Lester, R. Al-Rfou, and N. Constant, “The power
of scale for parameter-efficient prompt tuning,” in
Proceedings of the 2021 Conference on Empirical Methods
in Natural Language Processing, EMNLP 2021, Virtual
Event / Punta Cana, Dominican Republic, 7-11 Novem-
ber, 2021, M. Moens, X. Huang, L. Specia, and S. W.
Yih, Eds. Association for Computational Linguistics,
2021, pp. 3045–3059.
[406] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone,
Q. de Laroussilhe, A. Gesmundo, M. Attariyan, and
S. Gelly, “Parameter-efficient transfer learning for
NLP,” in Proceedings of the 36th International Confer-
ence on Machine Learning, ICML 2019, 9-15 June 2019,
Long Beach, California, USA, 2019, pp. 2790–2799.
[407] Z. Hu, Y. Lan, L. Wang, W. Xu, E. Lim, R. K. Lee,
L. Bing, and S. Poria, “Llm-adapters: An adapter
family for parameter-efficient fine-tuning of large
language models,” CoRR, vol. abs/2304.01933, 2023.
[408] J. He, C. Zhou, X. Ma, T. Berg-Kirkpatrick, and
G. Neubig, “Towards a unified view of parameter-
efficient transfer learning,” in The Tenth International
Conference on Learning Representations, ICLR 2022, Vir-
tual Event, April 25-29, 2022. OpenReview.net, 2022.
[409] X. Liu, K. Ji, Y. Fu, Z. Du, Z. Yang, and J. Tang, “P-
tuning v2: Prompt tuning can be comparable to fine-
tuning universally across scales and tasks,” CoRR,
vol. abs/2110.07602, 2021.
[410] X. Liu, Y. Zheng, Z. Du, M. Ding, Y. Qian, Z. Yang,
and J. Tang, “GPT understands, too,” CoRR, vol.
abs/2103.10385, 2021.
[411] Y. Gu, X. Han, Z. Liu, and M. Huang, “Ppt: Pre-
trained prompt tuning for few-shot learning,” in Pro-
ceedings of the 60th Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long Papers),
2022, pp. 8410–8423.
[412] Z. Jiang, F. F. Xu, J. Araki, and G. Neubig, “How can
we know what language models know?” Transactions
of the Association for Computational Linguistics, vol. 8,
pp. 423–438, 2020.
[413] T. Shin, Y. Razeghi, R. L. Logan IV, E. Wallace,
and S. Singh, “Autoprompt: Eliciting knowledge
from language models with automatically gener-
ated prompts,” in Proceedings of the 2020 Conference
on Empirical Methods in Natural Language Processing
(EMNLP), 2020, pp. 4222–4235.
[414] Q. Zhang, M. Chen, A. Bukharin, P. He, Y. Cheng,
W. Chen, and T. Zhao, “Adaptive budget allocation
for parameter-efficient fine-tuning,” CoRR, vol.
abs/2303.10512, 2023. [Online]. Available: https:
//doi.org/10.48550/arXiv.2303.10512
[415] M. Valipour, M. Rezagholizadeh, I. Kobyzev, and
A. Ghodsi, “Dylora: Parameter efficient tuning
of pre-trained models using dynamic search-free
low-rank adaptation,” CoRR, vol. abs/2210.07558,
2022. [Online]. Available: https://doi.org/10.48550/
arXiv.2210.07558
[416] N. Ding, Y. Qin, G. Yang, F. Wei, Y. Zonghan, Y. Su,
S. Hu, Y. Chen, C.-M. Chan, W. Chen, J. Yi, W. Zhao,
X. Wang, Z. Liu, H.-T. Zheng, J. Chen, Y. Liu, J. Tang,
J. Li, and M. Sun, “Parameter-efficient fine-tuning
of large-scale pre-trained language models,” Nature
Machine Intelligence, vol. 5, pp. 1–16, 03 2023.
[417] R. Zhang, J. Han, A. Zhou, X. Hu, S. Yan, P. Lu, H. Li,
P. Gao, and Y. Qiao, “Llama-adapter: Efficient fine-
tuning of language models with zero-init attention,”
CoRR, vol. abs/2303.16199, 2023.
[418] J. Pfeiffer, I. Vulic, I. Gurevych, and S. Ruder, “MAD-
X: an adapter-based framework for multi-task cross-
lingual transfer,” in Proceedings of the 2020 Conference
on Empirical Methods in Natural Language Processing,
EMNLP 2020, Online, November 16-20, 2020, B. Web-
ber, T. Cohn, Y. He, and Y. Liu, Eds. Association for
Computational Linguistics, 2020, pp. 7654–7673.
[419] S. Mangrulkar, S. Gugger, L. Debut, Y. Belkada,
and S. Paul, “Peft: State-of-the-art parameter-
efficient fine-tuning methods,” https://github.com/
huggingface/peft, 2022.
[420] J. Liu, D. Shen, Y. Zhang, B. Dolan, L. Carin, and
W. Chen, “What makes good in-context examples for
gpt-3?” in Proceedings of Deep Learning Inside Out: The
3rd Workshop on Knowledge Extraction and Integration
for Deep Learning Architectures, DeeLIO@ACL 2022,
Dublin, Ireland and Online, May 27, 2022, 2022, pp.
100–114.
[421] O. Rubin, J. Herzig, and J. Berant, “Learning to
retrieve prompts for in-context learning,” in Pro-
ceedings of the 2022 Conference of the North American
Chapter of the Association for Computational Linguistics:
Human Language Technologies, NAACL 2022, Seattle,
WA, United States, July 10-15, 2022, 2022, pp. 2655–
2671.
[422] H. J. Kim, H. Cho, J. Kim, T. Kim, K. M. Yoo, and
S. Lee, “Self-generated in-context learning: Leverag-
ing auto-regressive language models as a demonstra-
tion generator,” CoRR, vol. abs/2206.08082, 2022.
[423] Y. Zhou, A. I. Muresanu, Z. Han, K. Paster, S. Pitis,
H. Chan, and J. Ba, “Large language models are
human-level prompt engineers,” in Proc. of ICLR,
2023.
[424] Y. Hao, Y. Sun, L. Dong, Z. Han, Y. Gu, and F. Wei,
“Structured prompting: Scaling in-context learning
to 1,000 examples,” CoRR, 2022.
[425] Y. Lu, M. Bartolo, A. Moore, S. Riedel, and P. Stene-
torp, “Fantastically ordered prompts and where to
find them: Overcoming few-shot prompt order sen-
sitivity,” in Proceedings of the 60th Annual Meeting of
the Association for Computational Linguistics (Volume
1: Long Papers), ACL 2022, Dublin, Ireland, May 22-
27, 2022, S. Muresan, P. Nakov, and A. Villavicencio,
Eds., 2022, pp. 8086–8098.
[426] Y. Fu, H. Peng, A. Sabharwal, P. Clark, and T. Khot,
“Complexity-based prompting for multi-step reason-
ing,” CoRR, vol. abs/2210.00720, 2022.
[427] Z. Zhang, A. Zhang, M. Li, and A. Smola, “Auto-
matic chain of thought prompting in large language
models,” CoRR, vol. abs/2210.03493, 2022.
[428] A. Creswell, M. Shanahan, and I. Higgins,
“Selection-inference: Exploiting large language mod-
els for interpretable logical reasoning,” CoRR, vol.
abs/2205.09712, 2022.
[429] X. Wang, J. Wei, D. Schuurmans, Q. V. Le, E. H. Chi,
and D. Zhou, “Self-consistency improves chain of
thought reasoning in language models,” CoRR, vol.
abs/2203.11171, 2022.
[430] Y. Li, Z. Lin, S. Zhang, Q. Fu, B. Chen, J. Lou,
and W. Chen, “On the advance of making language
models better reasoners,” CoRR, vol. abs/2206.02336,
2022.
[431] X. Wang, J. Wei, D. Schuurmans, Q. V. Le, E. H. Chi,
and D. Zhou, “Rationale-augmented ensembles in
language models,” CoRR, 2022.
[432] D. Zhou, N. Schärli, L. Hou, J. Wei, N. Scales,
X. Wang, D. Schuurmans, O. Bousquet, Q. Le, and
E. H. Chi, “Least-to-most prompting enables com-
plex reasoning in large language models,” CoRR, vol.
abs/2205.10625, 2022.
[433] T. Khot, H. Trivedi, M. Finlayson, Y. Fu,
K. Richardson, P. Clark, and A. Sabhar-
wal, “Decomposed prompting: A modular
approach for solving complex tasks,” CoRR,
vol. abs/2210.02406, 2022. [Online]. Available:
https://doi.org/10.48550/arXiv.2210.02406
[434] L. Wang, W. Xu, Y. Lan, Z. Hu, Y. Lan, R. K.
Lee, and E. Lim, “Plan-and-solve prompting:
Improving zero-shot chain-of-thought reasoning by
large language models,” CoRR, vol. abs/2305.04091,
2023. [Online]. Available: https://doi.org/10.48550/
arXiv.2305.04091
[435] Q. Lyu, S. Havaldar, A. Stein, L. Zhang, D. Rao,
E. Wong, M. Apidianaki, and C. Callison-Burch,
“Faithful chain-of-thought reasoning,” CoRR, vol.
abs/2301.13379, 2023.
[436] L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y. Yang,
J. Callan, and G. Neubig, “PAL: program-aided lan-
guage models,” CoRR, vol. abs/2211.10435, 2022.
[437] Y. Shen, K. Song, X. Tan, D. Li, W. Lu, and
Y. Zhuang, “Hugginggpt: Solving ai tasks with chat-
gpt and its friends in huggingface,” arXiv preprint
arXiv:2303.17580, 2023.
[438] H. Sun, Y. Zhuang, L. Kong, B. Dai, and
C. Zhang, “Adaplanner: Adaptive planning from
feedback with language models,” arXiv preprint
arXiv:2305.16653, 2023.
[439] Y. Lu, P. Lu, Z. Chen, W. Zhu, X. E. Wang, and W. Y.
Wang, “Multimodal procedural planning via dual
text-image prompting,” CoRR, vol. abs/2305.01795,
2023.
[440] S. Hao, Y. Gu, H. Ma, J. J. Hong, Z. Wang, D. Z. Wang,
and Z. Hu, “Reasoning with language model is plan-
ning with world model,” CoRR, vol. abs/2305.14992,
2023.
[441] Z. Chen, K. Zhou, B. Zhang, Z. Gong, W. X.
Zhao, and J. Wen, “Chatcot: Tool-augmented chain-
of-thought reasoning on chat-based large language
models,” CoRR, vol. abs/2305.14323, 2023.
[442] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran,
K. Narasimhan, and Y. Cao, “React: Synergizing rea-
soning and acting in language models,” CoRR, vol.
abs/2210.03629, 2022.
[443] N. Shinn, F. Cassano, B. Labash, A. Gopinath,
K. Narasimhan, and S. Yao, “Reflexion: Language
agents with verbal reinforcement learning,” 2023.
[444] S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths,
Y. Cao, and K. Narasimhan, “Tree of thoughts: Delib-
erate problem solving with large language models,”
CoRR, vol. abs/2305.10601, 2023.
[445] V. Liu and L. B. Chilton, “Design guidelines for
prompt engineering text-to-image generative mod-
els,” in Proceedings of the 2022 CHI Conference on
Human Factors in Computing Systems, 2022, pp. 1–23.
[446] J. White, Q. Fu, S. Hays, M. Sandborn, C. Olea,
H. Gilbert, A. Elnashar, J. Spencer-Smith, and D. C.
Schmidt, “A prompt pattern catalog to enhance
prompt engineering with chatgpt,” arXiv preprint
arXiv:2302.11382, 2023.
[447] S. K. K. Santu and D. Feng, “Teler: A general
taxonomy of LLM prompts for benchmarking
complex tasks,” CoRR, vol. abs/2305.11430, 2023.
[Online]. Available: https://doi.org/10.48550/arXiv.
2305.11430
[448] OpenAI, “Gpt best practices,” OpenAI, 2023.
[Online]. Available: https://platform.openai.com/
docs/guides/gpt-best-practices
[449] Contributors, “Ai short,” 2023. [Online]. Available:
https://www.aishort.top/
[450] ——, “Awesome chatgpt prompts,” Github,
2023. [Online]. Available: https://github.com/f/
awesome-chatgpt-prompts/
[451] J. Jiang, K. Zhou, Z. Dong, K. Ye, W. X. Zhao, and
J. Wen, “Structgpt: A general framework for large
language model to reason over structured data,”
CoRR, vol. abs/2305.09645, 2023.
[452] L. Beurer-Kellner, M. Fischer, and M. Vechev,
“Prompting is programming: A query language for
large language models,” Proceedings of the ACM on
Programming Languages, vol. 7, no. PLDI, pp. 1946–
1969, 2023.
[453] P. Lu, B. Peng, H. Cheng, M. Galley, K.-W. Chang,
Y. N. Wu, S.-C. Zhu, and J. Gao, “Chameleon: Plug-
and-play compositional reasoning with large lan-
guage models,” arXiv preprint arXiv:2304.09842, 2023.
[454] R. Ren, Y. Wang, Y. Qu, W. X. Zhao, J. Liu, H. Tian,
H. Wu, J.-R. Wen, and H. Wang, “Investigating
the factual knowledge boundary of large language
models with retrieval augmentation,” arXiv preprint
arXiv:2307.11019, 2023.
[455] X. Amatriain, “Prompt design and engineering:
Introduction and advanced methods,” CoRR, vol.
abs/2401.14423, 2024.
[456] Y. Hou, J. Zhang, Z. Lin, H. Lu, R. Xie, J. J. McAuley,
and W. X. Zhao, “Large language models are zero-
shot rankers for recommender systems,” CoRR, vol.
abs/2305.08845, 2023.
[457] S. Chang and E. Fosler-Lussier, “How to prompt
llms for text-to-sql: A study in zero-shot, single-
domain, and cross-domain settings,” CoRR, vol.
abs/2305.11853, 2023. [Online]. Available: https:
//doi.org/10.48550/arXiv.2305.11853
[458] Y. Wen, N. Jain, J. Kirchenbauer, M. Goldblum,
J. Geiping, and T. Goldstein, “Hard prompts
made easy: Gradient-based discrete optimization
for prompt tuning and discovery,” CoRR, vol.
abs/2302.03668, 2023. [Online]. Available: https:
//doi.org/10.48550/arXiv.2302.03668
[459] T. Gao, A. Fisch, and D. Chen, “Making pre-trained
language models better few-shot learners,” in Pro-
ceedings of the 59th Annual Meeting of the Association
for Computational Linguistics and the 11th International
Joint Conference on Natural Language Processing, ACL/I-
JCNLP 2021, (Volume 1: Long Papers), Virtual Event,
August 1-6, 2021, C. Zong, F. Xia, W. Li, and R. Nav-
igli, Eds. Association for Computational Linguistics,
2021, pp. 3816–3830.
[460] L. Chen, J. Chen, T. Goldstein, H. Huang,
and T. Zhou, “Instructzero: Efficient instruction
optimization for black-box large language models,”
CoRR, vol. abs/2306.03082, 2023. [Online]. Available:
https://doi.org/10.48550/arXiv.2306.03082
[461] X. Lin, Z. Wu, Z. Dai, W. Hu, Y. Shu, S. Ng, P. Jaillet,
and B. K. H. Low, “Use your INSTINCT: instruc-
tion optimization using neural bandits coupled with
transformers,” CoRR, vol. abs/2310.02905, 2023.
[462] M. Deng, J. Wang, C. Hsieh, Y. Wang, H. Guo, T. Shu,
M. Song, E. P. Xing, and Z. Hu, “Rlprompt: Optimiz-
ing discrete text prompts with reinforcement learn-
ing,” in Proceedings of the 2022 Conference on Empir-
ical Methods in Natural Language Processing, EMNLP
2022, Abu Dhabi, United Arab Emirates, December 7-11,
2022, Y. Goldberg, Z. Kozareva, and Y. Zhang, Eds.
Association for Computational Linguistics, 2022, pp.
3369–3391.
[463] T. Zhang, X. Wang, D. Zhou, D. Schuurmans, and
J. E. Gonzalez, “TEMPERA: test-time prompt editing
via reinforcement learning,” in The Eleventh Inter-
national Conference on Learning Representations, ICLR
2023, Kigali, Rwanda, May 1-5, 2023. OpenRe-
view.net, 2023.
[464] Y. Jafari, D. Mekala, R. Yu, and T. Berg-Kirkpatrick,
“Morl-prompt: An empirical analysis of multi-
objective reinforcement learning for discrete prompt
optimization,” CoRR, vol. abs/2402.11711, 2024.
[465] W. Kong, S. A. Hombaiah, M. Zhang, Q. Mei, and
M. Bendersky, “Prewrite: Prompt rewriting with re-
inforcement learning,” CoRR, vol. abs/2401.08189,
2024.
[466] H. Xu, Y. Chen, Y. Du, N. Shao, Y. Wang, H. Li,
and Z. Yang, “GPS: genetic prompt search for effi-
cient few-shot learning,” in Proceedings of the 2022
Conference on Empirical Methods in Natural Language
Processing, EMNLP 2022, Abu Dhabi, United Arab Emi-
rates, December 7-11, 2022, Y. Goldberg, Z. Kozareva,
and Y. Zhang, Eds. Association for Computational
Linguistics, 2022, pp. 8162–8171.
[467] A. Prasad, P. Hase, X. Zhou, and M. Bansal,
“Grips: Gradient-free, edit-based instruction search
for prompting large language models,” in Proceedings
of the 17th Conference of the European Chapter of the
Association for Computational Linguistics, EACL 2023,
Dubrovnik, Croatia, May 2-6, 2023, A. Vlachos and
I. Augenstein, Eds. Association for Computational
Linguistics, 2023, pp. 3827–3846.
[468] Y. Zhou, A. I. Muresanu, Z. Han, K. Paster, S. Pitis,
H. Chan, and J. Ba, “Large language models are
human-level prompt engineers,” in The Eleventh
International Conference on Learning Representations,
ICLR 2023, Kigali, Rwanda, May 1-5, 2023. Open-
Review.net, 2023.
[469] R. Pryzant, D. Iter, J. Li, Y. T. Lee, C. Zhu,
and M. Zeng, “Automatic prompt optimization
with ”gradient descent” and beam search,” CoRR,
vol. abs/2305.03495, 2023. [Online]. Available:
https://doi.org/10.48550/arXiv.2305.03495
[470] C. Yang, X. Wang, Y. Lu, H. Liu, Q. V. Le,
D. Zhou, and X. Chen, “Large language models
as optimizers,” CoRR, vol. abs/2309.03409, 2023.
[Online]. Available: https://doi.org/10.48550/arXiv.
2309.03409
[471] Q. Ye, M. Axmed, R. Pryzant, and F. Khani,
“Prompt engineering a prompt engineer,” CoRR, vol.
abs/2311.05661, 2023.
[472] X. Tang, X. Wang, W. X. Zhao, S. Lu, Y. Li, and
J. Wen, “Unleashing the potential of large language
models as prompt optimizers: An analogical analysis
with gradient-based model optimizers,” CoRR, vol.
abs/2402.17564, 2024.
[473] H. Yang and K. Li, “Instoptima: Evolutionary
multi-objective instruction optimization via large
language model-based instruction operators,” in
EMNLP (Findings). Association for Computational
Linguistics, 2023, pp. 13 593–13 602.
[474] Q. Guo, R. Wang, J. Guo, B. Li, K. Song, X. Tan,
G. Liu, J. Bian, and Y. Yang, “Connecting large
language models with evolutionary algorithms
yields powerful prompt optimizers,” CoRR, vol.
abs/2309.08532, 2023.
[475] X. L. Do, Y. Zhao, H. Brown, Y. Xie, J. X. Zhao, N. F.
Chen, K. Kawaguchi, M. Q. Xie, and J. He, “Prompt
optimization via adversarial in-context learning,”
CoRR, vol. abs/2312.02614, 2023.
[476] X. Wang, C. Li, Z. Wang, F. Bai, H. Luo,
J. Zhang, N. Jojic, E. P. Xing, and Z. Hu,
“Promptagent: Strategic planning with language
models enables expert-level prompt optimization,”
CoRR, vol. abs/2310.16427, 2023. [Online]. Available:
https://doi.org/10.48550/arXiv.2310.16427
[477] T. Tang, J. Li, W. X. Zhao, and J. Wen, “Context-
tuning: Learning contextualized prompts for natu-
ral language generation,” in Proceedings of the 29th
International Conference on Computational Linguistics,
COLING 2022, Gyeongju, Republic of Korea, October 12-
17, 2022, N. Calzolari, C. Huang, H. Kim, J. Puste-
jovsky, L. Wanner, K. Choi, P. Ryu, H. Chen, L. Do-
natelli, H. Ji, S. Kurohashi, P. Paggio, N. Xue, S. Kim,
Y. Hahm, Z. He, T. K. Lee, E. Santus, F. Bond, and
S. Na, Eds. International Committee on Computa-
tional Linguistics, 2022, pp. 6340–6354.
[478] T. Vu, B. Lester, N. Constant, R. Al-Rfou’, and D. Cer,
“Spot: Better frozen model adaptation through soft
prompt transfer,” in Proceedings of the 60th Annual
Meeting of the Association for Computational Linguistics
(Volume 1: Long Papers), ACL 2022, Dublin, Ireland,
May 22-27, 2022, S. Muresan, P. Nakov, and A. Villavi-
cencio, Eds. Association for Computational Linguis-
tics, 2022, pp. 5039–5059.
[479] J. Li, T. Tang, J. Nie, J. Wen, and X. Zhao, “Learn-
ing to transfer prompts for text generation,” in Pro-
ceedings of the 2022 Conference of the North American
Chapter of the Association for Computational Linguistics:
Human Language Technologies, NAACL 2022, Seattle,
WA, United States, July 10-15, 2022, M. Carpuat,
M. de Marneffe, and I. V. M. Ruíz, Eds. Association
for Computational Linguistics, 2022, pp. 3506–3518.
[480] S. Min, X. Lyu, A. Holtzman, M. Artetxe, M. Lewis,
H. Hajishirzi, and L. Zettlemoyer, “Rethinking the
role of demonstrations: What makes in-context learn-
ing work?” in Proceedings of the 2022 Conference
on Empirical Methods in Natural Language Processing,
EMNLP 2022, Abu Dhabi, United Arab Emirates, De-
cember 7-11, 2022. Association for Computational
Linguistics, 2022, pp. 11 048–11 064.
[481] Z. Zhao, E. Wallace, S. Feng, D. Klein, and S. Singh,
“Calibrate before use: Improving few-shot perfor-
mance of language models,” in Proceedings of the 38th
International Conference on Machine Learning, ICML
2021, 18-24 July 2021, Virtual Event, M. Meila and
T. Zhang, Eds., 2021, pp. 12 697–12 706.
[482] Y. Lee, C. Lim, and H. Choi, “Does GPT-3 generate
empathetic dialogues? A novel in-context example
selection method and automatic evaluation metric
for empathetic dialogue generation,” in Proceedings
of the 29th International Conference on Computational
Linguistics, COLING 2022, Gyeongju, Republic of Korea,
October 12-17, 2022, N. Calzolari, C. Huang, H. Kim,
J. Pustejovsky, L. Wanner, K. Choi, P. Ryu, H. Chen,
L. Donatelli, H. Ji, S. Kurohashi, P. Paggio, N. Xue,
S. Kim, Y. Hahm, Z. He, T. K. Lee, E. Santus, F. Bond,
and S. Na, Eds. International Committee on Com-
putational Linguistics, 2022, pp. 669–683.
[483] I. Levy, B. Bogin, and J. Berant, “Diverse demonstra-
tions improve in-context compositional generaliza-
tion,” CoRR, vol. abs/2212.06800, 2022.
[484] H. Su, J. Kasai, C. H. Wu, W. Shi, T. Wang, J. Xin,
R. Zhang, M. Ostendorf, L. Zettlemoyer, N. A. Smith,
and T. Yu, “Selective annotation makes language
models better few-shot learners,” CoRR, 2022.
[485] X. Ye, S. Iyer, A. Celikyilmaz, V. Stoyanov, G. Durrett,
and R. Pasunuru, “Complementary explanations for
effective in-context learning,” CoRR, 2022.
[486] X. Li and X. Qiu, “Finding supporting examples for
in-context learning,” CoRR, 2023.
[487] Y. Zhang, S. Feng, and C. Tan, “Active example
selection for in-context learning,” in Proceedings of
the 2022 Conference on Empirical Methods in Natural
Language Processing, EMNLP 2022, Abu Dhabi, United
Arab Emirates, December 7-11, 2022, 2022, pp. 9134–
9148.
[488] F. Gilardi, M. Alizadeh, and M. Kubli, “Chatgpt out-
performs crowd-workers for text-annotation tasks,”
2023.
[489] H. J. Kim, H. Cho, J. Kim, T. Kim, K. M. Yoo, and
S. Lee, “Self-generated in-context learning: Leverag-
ing auto-regressive language models as a demonstra-
tion generator,” CoRR, vol. abs/2206.08082, 2022.
[490] S. M. Xie, A. Raghunathan, P. Liang, and T. Ma,
“An explanation of in-context learning as implicit
bayesian inference,” in The Tenth International Con-
ference on Learning Representations, ICLR 2022, Virtual
Event, April 25-29, 2022, 2022.
[491] Z. Wu, Y. Wang, J. Ye, and L. Kong, “Self-adaptive in-
context learning,” CoRR, vol. abs/2212.10375, 2022.
[492] Y. Gu, L. Dong, F. Wei, and M. Huang, “Pre-training
to learn in context,” CoRR, vol. abs/2305.09137, 2023.
[493] S. Min, M. Lewis, L. Zettlemoyer, and H. Hajishirzi,
“Metaicl: Learning to learn in context,” in Proceed-
ings of the 2022 Conference of the North American
Chapter of the Association for Computational Linguistics:
Human Language Technologies, NAACL 2022, Seattle,
WA, United States, July 10-15, 2022, M. Carpuat,
M. de Marneffe, and I. V. M. Ruíz, Eds., 2022, pp.
2791–2809.
[494] M. Hahn and N. Goyal, “A theory of emergent
in-context learning as implicit structure induction,”
CoRR, vol. abs/2303.07971, 2023.
[495] J. Pan, T. Gao, H. Chen, and D. Chen, “What in-
context learning ”learns” in-context: Disentangling
task recognition and task learning,” CoRR, vol.
abs/2305.09731, 2023.
[496] N. Wies, Y. Levine, and A. Shashua, “The learnability
of in-context learning,” CoRR, vol. abs/2303.07895,
2023.
[497] A. Webson and E. Pavlick, “Do prompt-based models
really understand the meaning of their prompts?” in
Proceedings of the 2022 Conference of the North American
Chapter of the Association for Computational Linguistics:
Human Language Technologies, NAACL 2022, Seattle,
WA, United States, July 10-15, 2022, 2022, pp. 2300–
2344.
[498] J. von Oswald, E. Niklasson, E. Randazzo, J. Sacra-
mento, A. Mordvintsev, A. Zhmoginov, and M. Vla-
dymyrov, “Transformers learn in-context by gradient
descent,” CoRR, vol. abs/2212.07677, 2022.
[499] C. Olsson, N. Elhage, N. Nanda, N. Joseph,
N. DasSarma, T. Henighan, B. Mann, A. Askell,
Y. Bai, A. Chen, T. Conerly, D. Drain, D. Gan-
guli, Z. Hatfield-Dodds, D. Hernandez, S. John-
ston, A. Jones, J. Kernion, L. Lovitt, K. Ndousse,
D. Amodei, T. Brown, J. Clark, J. Kaplan, S. McCan-
dlish, and C. Olah, “In-context learning and induc-
tion heads,” CoRR, vol. abs/2209.11895, 2022.
[500] E. Akyürek, D. Schuurmans, J. Andreas, T. Ma, and
D. Zhou, “What learning algorithm is in-context
learning? investigations with linear models,” CoRR,
vol. abs/2211.15661, 2022.
[501] J. Wei, J. Wei, Y. Tay, D. Tran, A. Webson, Y. Lu,
X. Chen, H. Liu, D. Huang, D. Zhou et al., “Larger
language models do in-context learning differently,”
arXiv preprint arXiv:2303.03846, 2023.
[502] J. Coda-Forno, M. Binz, Z. Akata, M. M. Botvinick,
J. X. Wang, and E. Schulz, “Meta-in-context
learning in large language models,” CoRR, vol.
abs/2305.12907, 2023.
[503] J. W. Wei, L. Hou, A. K. Lampinen, X. Chen,
D. Huang, Y. Tay, X. Chen, Y. Lu, D. Zhou, T. Ma, and
Q. V. Le, “Symbol tuning improves in-context learn-
ing in language models,” CoRR, vol. abs/2305.08298,
2023.
[504] Z. Chu, J. Chen, Q. Chen, W. Yu, T. He, H. Wang,
W. Peng, M. Liu, B. Qin, and T. Liu, “A survey of
chain of thought reasoning: Advances, frontiers and
future,” CoRR, vol. abs/2309.15402, 2023.
[505] S. Miao, C. Liang, and K. Su, “A diverse corpus
for evaluating and developing english math word
problem solvers,” in Proceedings of the 58th Annual
Meeting of the Association for Computational Linguistics,
ACL 2020, Online, July 5-10, 2020, D. Jurafsky, J. Chai,
N. Schluter, and J. R. Tetreault, Eds. Association for
Computational Linguistics, 2020, pp. 975–984.
[506] A. Talmor, J. Herzig, N. Lourie, and J. Berant, “Com-
monsenseqa: A question answering challenge tar-
geting commonsense knowledge,” in Proceedings of
the 2019 Conference of the North American Chapter of
the Association for Computational Linguistics: Human
Language Technologies, NAACL-HLT 2019, Minneapolis,
MN, USA, June 2-7, 2019, Volume 1 (Long and Short
Papers), J. Burstein, C. Doran, and T. Solorio, Eds.
Association for Computational Linguistics, 2019, pp.
4149–4158.
[507] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwa-
sawa, “Large language models are zero-shot reason-
ers,” CoRR, vol. abs/2205.11916, 2022.
[508] W. Chen, X. Ma, X. Wang, and W. W. Cohen, “Pro-
gram of thoughts prompting: Disentangling com-
putation from reasoning for numerical reasoning
tasks,” CoRR, vol. abs/2211.12588, 2022.
[509] L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y. Yang,
J. Callan, and G. Neubig, “PAL: program-aided lan-
guage models,” in International Conference on Ma-
chine Learning, ICML 2023, 23-29 July 2023, Honolulu,
Hawaii, USA, A. Krause, E. Brunskill, K. Cho, B. En-
gelhardt, S. Sabato, and J. Scarlett, Eds., 2023.
[510] X. Zhao, Y. Xie, K. Kawaguchi, J. He, and Q. Xie, “Au-
tomatic model selection with large language models
for reasoning,” CoRR, vol. abs/2305.14333, 2023.
[511] Y. Li, Z. Lin, S. Zhang, Q. Fu, B. Chen, J.-G. Lou,
and W. Chen, “Making large language models better
reasoners with step-aware verifier,” 2023.
[512] O. Yoran, T. Wolfson, B. Bogin, U. Katz, D. Deutch,
and J. Berant, “Answering questions by meta-
reasoning over multiple chains of thought,” CoRR,
vol. abs/2304.13007, 2023.
[513] Z. Ling, Y. Fang, X. Li, Z. Huang, M. Lee, R. Memi-
sevic, and H. Su, “Deductive verification of chain-of-
thought reasoning,” CoRR, vol. abs/2306.03872, 2023.
[514] T. Xue, Z. Wang, Z. Wang, C. Han, P. Yu, and H. Ji,
“RCOT: detecting and rectifying factual inconsis-
tency in reasoning by reversing chain-of-thought,”
CoRR, vol. abs/2305.11499, 2023.
[515] Y. Weng, M. Zhu, F. Xia, B. Li, S. He, K. Liu, and
J. Zhao, “Large language models are better reasoners
with self-verification,” CoRR, vol. abs/2212.09561, 2023.
[516] W. Jiang, H. Shi, L. Yu, Z. Liu, Y. Zhang, Z. Li,
and J. T. Kwok, “Forward-backward reasoning in
large language models for mathematical verifica-
tion,” 2023.
[517] J. Long, “Large language model guided tree-of-
thought,” CoRR, vol. abs/2305.08291, 2023.
[518] S. Mo and M. Xin, “Tree of uncertain thoughts
reasoning for large language models,” CoRR, vol.
abs/2309.07694, 2023.
[519] M. Besta, N. Blach, A. Kubicek, R. Gerstenberger,
L. Gianinazzi, J. Gajda, T. Lehmann, M. Podstawski,
H. Niewiadomski, P. Nyczyk, and T. Hoefler, “Graph
of thoughts: Solving elaborate problems with large
language models,” CoRR, vol. abs/2308.09687, 2023.
[520] B. Lei, P. Lin, C. Liao, and C. Ding, “Boosting log-
ical reasoning in large language models through a
new framework: The graph of thought,” CoRR, vol.
abs/2308.08614, 2023.
[521] R. Ding, C. Zhang, L. Wang, Y. Xu, M. Ma, W. Zhang,
S. Qin, S. Rajmohan, Q. Lin, and D. Zhang, “Ev-
erything of thoughts: Defying the law of pen-
rose triangle for thought generation,” arXiv preprint
arXiv:2311.04254, 2023.
[522] P. Liang, R. Bommasani, T. Lee, D. Tsipras, D. Soylu,
M. Yasunaga, Y. Zhang, D. Narayanan, Y. Wu, A. Ku-
mar, B. Newman, B. Yuan, B. Yan, C. Zhang, C. Cos-
grove, C. D. Manning, C. Ré, D. Acosta-Navas, D. A.
Hudson, E. Zelikman, E. Durmus, F. Ladhak, F. Rong,
H. Ren, H. Yao, J. Wang, K. Santhanam, L. J. Orr,
L. Zheng, M. Yüksekgönül, M. Suzgun, N. Kim,
N. Guha, N. S. Chatterji, O. Khattab, P. Henderson,
Q. Huang, R. Chi, S. M. Xie, S. Santurkar, S. Ganguli,
T. Hashimoto, T. Icard, T. Zhang, V. Chaudhary,
W. Wang, X. Li, Y. Mai, Y. Zhang, and Y. Koreeda,
“Holistic evaluation of language models,” CoRR, vol.
abs/2211.09110, 2022.
[523] Z. Bi, N. Zhang, Y. Jiang, S. Deng, G. Zheng, and
H. Chen, “When do program-of-thoughts work for
reasoning?” CoRR, vol. abs/2308.15452, 2023.
[524] A. Madaan and A. Yazdanbakhsh, “Text and pat-
terns: For effective chain of thought, it takes two to
tango,” CoRR, vol. abs/2209.07686, 2022.
[525] Z. Zhang, A. Zhang, M. Li, H. Zhao, G. Karypis, and
A. Smola, “Multimodal chain-of-thought reasoning
in language models,” CoRR, vol. abs/2302.00923,
2023.
[526] F. Shi, M. Suzgun, M. Freitag, X. Wang, S. Sri-
vats, S. Vosoughi, H. W. Chung, Y. Tay, S. Ruder,
D. Zhou, D. Das, and J. Wei, “Language models are
multilingual chain-of-thought reasoners,” CoRR, vol.
abs/2210.03057, 2022.
[527] J. Qian, H. Wang, Z. Li, S. Li, and X. Yan, “Limita-
tions of language models in arithmetic and symbolic
induction,” CoRR, vol. abs/2208.05051, 2022.
[528] N. Bian, X. Han, L. Sun, H. Lin, Y. Lu, and B. He,
“ChatGPT is a Knowledgeable but Inexperienced
Solver: An Investigation of Commonsense Problem
in Large Language Models,” CoRR, 2023.
[529] S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths,
Y. Cao, and K. Narasimhan, “Tree of thoughts: Delib-
erate problem solving with large language models,”
CoRR, vol. abs/2305.10601, 2023.
[530] G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao,
Y. Zhu, L. Fan, and A. Anandkumar, “Voyager: An
open-ended embodied agent with large language
models,” arXiv preprint arXiv:2305.16291, 2023.
[531] X. Jiang, Y. Dong, L. Wang, Q. Shang, and
G. Li, “Self-planning code generation with large
language model,” CoRR, vol. abs/2303.06689, 2023.
[Online]. Available: https://doi.org/10.48550/arXiv.
2303.06689
[532] I. Singh, V. Blukis, A. Mousavian, A. Goyal, D. Xu,
J. Tremblay, D. Fox, J. Thomason, and A. Garg, “Prog-
prompt: Generating situated robot task plans using
large language models,” CoRR, vol. abs/2209.11302,
2022.
[533] B. Liu, Y. Jiang, X. Zhang, Q. Liu, S. Zhang, J. Biswas,
and P. Stone, “LLM+P: empowering large language
models with optimal planning proficiency,” CoRR,
vol. abs/2304.11477, 2023. [Online]. Available:
https://doi.org/10.48550/arXiv.2304.11477
[534] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and
B. Ommer, “High-resolution image synthesis with
latent diffusion models,” in IEEE/CVF Conference on
Computer Vision and Pattern Recognition, CVPR 2022,
New Orleans, LA, USA, June 18-24, 2022, 2022, pp.
10 674–10 685.
[535] J. S. Park, J. C. O’Brien, C. J. Cai, M. R. Morris,
P. Liang, and M. S. Bernstein, “Generative agents:
Interactive simulacra of human behavior,” CoRR, vol.
abs/2304.03442, 2023.
[536] “Auto-GPT,” 2023. [Online]. Available: https://github.com/
Significant-Gravitas/Auto-GPT
[537] Z. Wang, S. Cai, A. Liu, X. Ma, and Y. Liang,
“Describe, explain, plan and select: Interactive plan-
ning with large language models enables open-world
multi-task agents,” CoRR, vol. abs/2302.01560, 2023.
[538] J. Wang, X. Yi, R. Guo, H. Jin, P. Xu, S. Li, X. Wang,
X. Guo, C. Li, X. Xu et al., “Milvus: A purpose-
built vector data management system,” in Proceedings
of the 2021 International Conference on Management of
Data, 2021, pp. 2614–2627.
[539] W. Zhong, L. Guo, Q. Gao, H. Ye, and Y. Wang,
“Memorybank: Enhancing large language models
with long-term memory,” CoRR, vol. abs/2305.10250,
2023.
[540] M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz,
“Building a large annotated corpus of english: The
penn treebank,” Comput. Linguistics, vol. 19, no. 2,
pp. 313–330, 1993.
[541] S. Merity, C. Xiong, J. Bradbury, and R. Socher,
“Pointer sentinel mixture models,” in ICLR (Poster).
OpenReview.net, 2017.
[542] O. Bojar, C. Buck, C. Federmann, B. Haddow,
P. Koehn, J. Leveling, C. Monz, P. Pecina, M. Post,
H. Saint-Amand, R. Soricut, L. Specia, and A. Tam-
chyna, “Findings of the 2014 workshop on statistical
machine translation,” in WMT@ACL. The Associa-
tion for Computer Linguistics, 2014, pp. 12–58.
[543] O. Bojar, R. Chatterjee, C. Federmann, Y. Graham,
B. Haddow, M. Huck, A. Jimeno-Yepes, P. Koehn,
V. Logacheva, C. Monz, M. Negri, A. Névéol, M. L.
Neves, M. Popel, M. Post, R. Rubino, C. Scarton,
L. Specia, M. Turchi, K. Verspoor, and M. Zampieri,
“Findings of the 2016 conference on machine trans-
lation,” in WMT. The Association for Computer
Linguistics, 2016, pp. 131–198.
[544] L. Barrault, O. Bojar, M. R. Costa-jussà, C. Federmann, M. Fishel, Y. Graham, B. Haddow, M. Huck, P. Koehn, S. Malmasi, C. Monz, M. Müller, S. Pal, M. Post, and M. Zampieri, “Findings of the 2019 conference on machine translation (WMT19),” in Proceedings of the Fourth Conference on Machine Translation, WMT 2019, Florence, Italy, August 1-2, 2019 - Volume 2: Shared Task Papers, Day 1, O. Bojar, R. Chatterjee, C. Federmann, M. Fishel, Y. Graham, B. Haddow, M. Huck, A. Jimeno-Yepes, P. Koehn, A. Martins, C. Monz, M. Negri, A. Névéol, M. L. Neves, M. Post, M. Turchi, and K. Verspoor, Eds. Association for Computational Linguistics, 2019, pp. 1–61.
[545] L. Barrault, M. Biesialska, O. Bojar, M. R. Costa-jussà, C. Federmann, Y. Graham, R. Grundkiewicz,
B. Haddow, M. Huck, E. Joanis, T. Kocmi, P. Koehn,
C. Lo, N. Ljubesic, C. Monz, M. Morishita, M. Na-
gata, T. Nakazawa, S. Pal, M. Post, and M. Zampieri,
“Findings of the 2020 conference on machine trans-
lation (WMT20),” in Proceedings of the Fifth Con-
ference on Machine Translation, WMT@EMNLP 2020,
Online, November 19-20, 2020, L. Barrault, O. Bojar,
F. Bougares, R. Chatterjee, M. R. Costa-jussà, C. Federmann, M. Fishel, A. Fraser, Y. Graham, P. Guzman, B. Haddow, M. Huck, A. Jimeno-Yepes, P. Koehn, A. Martins, M. Morishita, C. Monz, M. Nagata, T. Nakazawa, and M. Negri, Eds. Association for Computational Linguistics, 2020, pp. 1–55.
[546] F. Akhbardeh, A. Arkhangorodsky, M. Biesialska, O. Bojar, R. Chatterjee, V. Chaudhary, M. R. Costa-jussà, C. España-Bonet, A. Fan, C. Federmann,
M. Freitag, Y. Graham, R. Grundkiewicz, B. Had-
dow, L. Harter, K. Heafield, C. Homan, M. Huck,
K. Amponsah-Kaakyire, J. Kasai, D. Khashabi,
K. Knight, T. Kocmi, P. Koehn, N. Lourie, C. Monz,
M. Morishita, M. Nagata, A. Nagesh, T. Nakazawa,
M. Negri, S. Pal, A. A. Tapo, M. Turchi, V. Vydrin,
and M. Zampieri, “Findings of the 2021 confer-
ence on machine translation (WMT21),” in Proceed-
ings of the Sixth Conference on Machine Translation,
WMT@EMNLP 2021, Online Event, November 10-11,
2021, L. Barrault, O. Bojar, F. Bougares, R. Chat-
terjee, M. R. Costa-jussà, C. Federmann, M. Fishel, A. Fraser, M. Freitag, Y. Graham, R. Grundkiewicz, P. Guzman, B. Haddow, M. Huck, A. Jimeno-Yepes, P. Koehn, T. Kocmi, A. Martins, M. Morishita, and C. Monz, Eds. Association for Computational Linguistics, 2021, pp. 1–88.
[547] T. Kocmi, R. Bawden, O. Bojar, A. Dvorkovich, C. Federmann, M. Fishel, T. Gowda, Y. Graham, R. Grundkiewicz, B. Haddow, R. Knowles, P. Koehn, C. Monz, M. Morishita, M. Nagata, T. Nakazawa, M. Novák, M. Popel, and M. Popovic, “Findings of the 2022 conference on machine translation (WMT22),” in Proceedings of the Seventh Conference on Machine Translation, WMT 2022, Abu Dhabi, United Arab Emirates (Hybrid), December 7-8, 2022, P. Koehn, L. Barrault, O. Bojar, F. Bougares, R. Chatterjee, M. R. Costa-jussà, C. Federmann, M. Fishel, A. Fraser,
M. Freitag, Y. Graham, R. Grundkiewicz, P. Guzman,
B. Haddow, M. Huck, A. Jimeno-Yepes, T. Kocmi,
A. Martins, M. Morishita, C. Monz, M. Nagata,
T. Nakazawa, M. Negri, A. Névéol, M. Neves,
M. Popel, M. Turchi, and M. Zampieri, Eds. Associ-
ation for Computational Linguistics, 2022, pp. 1–45.
[548] N. Goyal, C. Gao, V. Chaudhary, P. Chen, G. Wenzek,
D. Ju, S. Krishnan, M. Ranzato, F. Guzmán, and
A. Fan, “The flores-101 evaluation benchmark for
low-resource and multilingual machine translation,”
Trans. Assoc. Comput. Linguistics, vol. 10, pp. 522–538,
2022.
[549] R. Bawden, E. Bilinski, T. Lavergne, and S. Rosset,
“Diabla: a corpus of bilingual spontaneous writ-
ten dialogues for machine translation,” Lang. Resour.
Evaluation, vol. 55, no. 3, pp. 635–660, 2021.
[550] R. Nallapati, B. Zhou, C. N. dos Santos, Ç. Gülçehre,
and B. Xiang, “Abstractive text summarization using
sequence-to-sequence rnns and beyond,” in Proceed-
ings of the 20th SIGNLL Conference on Computational
Natural Language Learning, CoNLL 2016, Berlin, Ger-
many, August 11-12, 2016, Y. Goldberg and S. Riezler,
Eds. ACL, 2016, pp. 280–290.
[551] S. Narayan, S. B. Cohen, and M. Lapata, “Don’t give
me the details, just the summary! topic-aware con-
volutional neural networks for extreme summariza-
tion,” in EMNLP. Association for Computational
Linguistics, 2018, pp. 1797–1807.
[552] F. Ladhak, E. Durmus, C. Cardie, and K. Mckeown,
“Wikilingua: A new benchmark dataset for cross-
lingual abstractive summarization,” in Findings of
the Association for Computational Linguistics: EMNLP
2020, 2020, pp. 4034–4048.
[553] S. Moon, P. Shah, A. Kumar, and R. Subba, “Open-
dialkg: Explainable conversational reasoning with
attention-based walks over knowledge graphs,” in
ACL (1). Association for Computational Linguistics,
2019, pp. 845–854.
[554] Y. Lai, C. Li, Y. Wang, T. Zhang, R. Zhong, L. Zettle-
moyer, S. W. Yih, D. Fried, S. I. Wang, and T. Yu,
“DS-1000: A natural and reliable benchmark for data
science code generation,” CoRR, vol. abs/2211.11501,
2022.
[555] Z. Wang, S. Zhou, D. Fried, and G. Neubig,
“Execution-based evaluation for open-domain code
generation,” CoRR, vol. abs/2212.10481, 2022.
[556] T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins,
A. P. Parikh, C. Alberti, D. Epstein, I. Polosukhin,
J. Devlin, K. Lee, K. Toutanova, L. Jones, M. Kel-
cey, M. Chang, A. M. Dai, J. Uszkoreit, Q. Le, and
S. Petrov, “Natural questions: a benchmark for ques-
tion answering research,” Trans. Assoc. Comput. Lin-
guistics, pp. 452–466, 2019.
[557] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal,
C. Schoenick, and O. Tafjord, “Think you have solved
question answering? try arc, the AI2 reasoning chal-
lenge,” CoRR, vol. abs/1803.05457, 2018.
[558] S. Lin, J. Hilton, and O. Evans, “Truthfulqa: Measur-
ing how models mimic human falsehoods,” in Pro-
ceedings of the 60th Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long Papers),
ACL 2022, Dublin, Ireland, May 22-27, 2022, 2022, pp.
3214–3252.
[559] J. Berant, A. Chou, R. Frostig, and P. Liang, “Semantic
parsing on freebase from question-answer pairs,” in
Proceedings of the 2013 Conference on Empirical Methods
in Natural Language Processing, EMNLP 2013, 18-21
October 2013, Grand Hyatt Seattle, Seattle, Washington,
USA, A meeting of SIGDAT, a Special Interest Group of
the ACL, 2013, pp. 1533–1544.
[560] M. Joshi, E. Choi, D. S. Weld, and L. Zettlemoyer,
“Triviaqa: A large scale distantly supervised chal-
lenge dataset for reading comprehension,” in Pro-
ceedings of the 55th Annual Meeting of the Association
for Computational Linguistics, ACL 2017, Vancouver,
Canada, July 30 - August 4, Volume 1: Long Papers, 2017,
pp. 1601–1611.
[561] Y. Bisk, R. Zellers, R. L. Bras, J. Gao, and Y. Choi,
“PIQA: reasoning about physical commonsense in
natural language,” in The Thirty-Fourth AAAI Con-
ference on Artificial Intelligence, AAAI 2020, The Thirty-
Second Innovative Applications of Artificial Intelligence
Conference, IAAI 2020, The Tenth AAAI Symposium
on Educational Advances in Artificial Intelligence, EAAI
2020, New York, NY, USA, February 7-12, 2020, 2020,
pp. 7432–7439.
[562] M. Dubey, D. Banerjee, A. Abdelkawi, and
J. Lehmann, “Lc-quad 2.0: A large dataset for com-
plex question answering over wikidata and dbpe-
dia,” in The Semantic Web - ISWC 2019 - 18th In-
ternational Semantic Web Conference, Auckland, New
Zealand, October 26-30, 2019, Proceedings, Part II, 2019,
pp. 69–78.
[563] Y. Gu, S. Kase, M. Vanni, B. M. Sadler, P. Liang,
X. Yan, and Y. Su, “Beyond I.I.D.: three levels of
generalization for question answering on knowledge
bases,” in WWW ’21: The Web Conference 2021, Virtual
Event / Ljubljana, Slovenia, April 19-23, 2021, 2021, pp.
3477–3488.
[564] S. Cao, J. Shi, L. Pan, L. Nie, Y. Xiang, L. Hou,
J. Li, B. He, and H. Zhang, “KQA pro: A dataset
with explicit compositional programs for complex
question answering over knowledge base,” in Pro-
ceedings of the 60th Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long Papers),
ACL 2022, Dublin, Ireland, May 22-27, 2022, 2022, pp.
6101–6119.
[565] X. Hu, X. Wu, Y. Shu, and Y. Qu, “Logical form gen-
eration via multi-task learning for complex question
answering over knowledge bases,” in Proceedings
of the 29th International Conference on Computational
Linguistics, COLING 2022, Gyeongju, Republic of Korea,
October 12-17, 2022, 2022, pp. 1687–1696.
[566] S. Longpre, Y. Lu, and J. Daiber, “MKQA: A lin-
guistically diverse benchmark for multilingual open
domain question answering,” Trans. Assoc. Comput.
Linguistics, vol. 9, pp. 1389–1406, 2021.
[567] T. Saikh, T. Ghosal, A. Mittal, A. Ekbal, and P. Bhat-
tacharyya, “Scienceqa: a novel resource for question
answering on scholarly articles,” Int. J. Digit. Libr.,
vol. 23, no. 3, pp. 289–301, 2022.
[568] T. Mihaylov, P. Clark, T. Khot, and A. Sabharwal,
“Can a suit of armor conduct electricity? A new
dataset for open book question answering,” in Pro-
ceedings of the 2018 Conference on Empirical Methods in
Natural Language Processing, Brussels, Belgium, October
31 - November 4, 2018, 2018, pp. 2381–2391.
[569] T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary,
R. Majumder, and L. Deng, “MS MARCO: A human
generated machine reading comprehension dataset,”
in Proceedings of the Workshop on Cognitive Computa-
tion: Integrating neural and symbolic approaches 2016
co-located with the 30th Annual Conference on Neural
Information Processing Systems (NIPS 2016), Barcelona,
Spain, December 9, 2016, 2016.
[570] T. Khot, P. Clark, M. Guerquin, P. Jansen, and A. Sab-
harwal, “QASC: A dataset for question answering
via sentence composition,” in The Thirty-Fourth AAAI
Conference on Artificial Intelligence, AAAI 2020, The
Thirty-Second Innovative Applications of Artificial Intel-
ligence Conference, IAAI 2020, The Tenth AAAI Sympo-
sium on Educational Advances in Artificial Intelligence,
EAAI 2020, New York, NY, USA, February 7-12, 2020,
2020, pp. 8082–8090.
[571] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang,
“Squad: 100, 000+ questions for machine compre-
hension of text,” in Proceedings of the 2016 Conference
on Empirical Methods in Natural Language Processing,
EMNLP 2016, Austin, Texas, USA, November 1-4, 2016,
2016, pp. 2383–2392.
[572] A. H. Miller, A. Fisch, J. Dodge, A. Karimi, A. Bordes,
and J. Weston, “Key-value memory networks for
directly reading documents,” in Proceedings of the
2016 Conference on Empirical Methods in Natural Lan-
guage Processing, EMNLP 2016, Austin, Texas, USA,
November 1-4, 2016, 2016, pp. 1400–1409.
[573] B. Goodrich, V. Rao, P. J. Liu, and M. Saleh, “As-
sessing the factual accuracy of generated text,” in
Proceedings of the 25th ACM SIGKDD International
Conference on Knowledge Discovery & Data Mining,
KDD 2019, Anchorage, AK, USA, August 4-8, 2019,
2019, pp. 166–175.
[574] K. Toutanova and D. Chen, “Observed versus latent
features for knowledge base and text inference,” in
Proceedings of the 3rd Workshop on Continuous Vector
Space Models and their Compositionality, CVSC 2015,
Beijing, China, July 26-31, 2015, 2015, pp. 57–66.
[575] K. D. Bollacker, C. Evans, P. K. Paritosh, T. Sturge,
and J. Taylor, “Freebase: a collaboratively created
graph database for structuring human knowledge,”
in Proceedings of the ACM SIGMOD International Con-
ference on Management of Data, SIGMOD 2008, Vancou-
ver, BC, Canada, June 10-12, 2008, 2008, pp. 1247–1250.
[576] T. Dettmers, P. Minervini, P. Stenetorp, and S. Riedel,
“Convolutional 2d knowledge graph embeddings,”
in Proceedings of the Thirty-Second AAAI Conference
on Artificial Intelligence, (AAAI-18), the 30th innovative
Applications of Artificial Intelligence (IAAI-18), and the
8th AAAI Symposium on Educational Advances in Ar-
tificial Intelligence (EAAI-18), New Orleans, Louisiana,
USA, February 2-7, 2018, 2018, pp. 1811–1818.
[577] G. A. Miller, “Wordnet: A lexical database for en-
glish,” Commun. ACM, pp. 39–41, 1995.
[578] F. Petroni, T. Rocktäschel, S. Riedel, P. S. H. Lewis,
A. Bakhtin, Y. Wu, and A. H. Miller, “Language mod-
els as knowledge bases?” in Proceedings of the 2019
Conference on Empirical Methods in Natural Language
Processing and the 9th International Joint Conference
on Natural Language Processing, EMNLP-IJCNLP 2019,
Hong Kong, China, November 3-7, 2019, 2019, pp. 2463–
2473.
[579] F. Mahdisoltani, J. Biega, and F. M. Suchanek,
“YAGO3: A knowledge base from multilingual
wikipedias,” in Seventh Biennial Conference on Innova-
tive Data Systems Research, CIDR 2015, Asilomar, CA,
USA, January 4-7, 2015, Online Proceedings, 2015.
[580] F. M. Suchanek, G. Kasneci, and G. Weikum, “Yago:
a core of semantic knowledge,” in Proceedings of
the 16th International Conference on World Wide Web,
WWW 2007, Banff, Alberta, Canada, May 8-12, 2007,
2007, pp. 697–706.
[581] Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. W. Cohen,
R. Salakhutdinov, and C. D. Manning, “Hotpotqa:
A dataset for diverse, explainable multi-hop ques-
tion answering,” in Proceedings of the 2018 Conference
on Empirical Methods in Natural Language Processing,
Brussels, Belgium, October 31 - November 4, 2018. As-
sociation for Computational Linguistics, 2018, pp.
2369–2380.
[582] C. Clark, K. Lee, M. Chang, T. Kwiatkowski,
M. Collins, and K. Toutanova, “Boolq: Exploring the
surprising difficulty of natural yes/no questions,” in
Proceedings of the 2019 Conference of the North American
Chapter of the Association for Computational Linguis-
tics: Human Language Technologies, NAACL-HLT 2019,
Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long
and Short Papers), J. Burstein, C. Doran, and T. Solorio,
Eds. Association for Computational Linguistics,
2019, pp. 2924–2936.
[583] M. Sap, H. Rashkin, D. Chen, R. L. Bras, and Y. Choi,
“Socialiqa: Commonsense reasoning about social in-
teractions,” CoRR, vol. abs/1904.09728, 2019.
[584] R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and
Y. Choi, “Hellaswag: Can a machine really finish
your sentence?” in Proceedings of the 57th Conference of
the Association for Computational Linguistics, ACL 2019,
Florence, Italy, July 28- August 2, 2019, Volume 1: Long
Papers, A. Korhonen, D. R. Traum, and L. Màrquez, Eds. Association for Computational Linguistics, 2019, pp. 4791–4800.
[585] K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y. Choi, “Winogrande: An adversarial winograd schema challenge at scale,” in AAAI. AAAI Press, 2020, pp. 8732–8740.
[586] M. Roemmele, C. A. Bejan, and A. S. Gordon, “Choice of plausible alternatives: An evaluation of commonsense causal reasoning,” in Logical Formalizations of Commonsense Reasoning, Papers from the 2011 AAAI Spring Symposium, Technical Report SS-11-06, Stanford, California, USA, March 21-23, 2011. AAAI, 2011.
[587] K. Sakaguchi, C. Bhagavatula, R. L. Bras, N. Tandon, P. Clark, and Y. Choi, “proscript: Partially ordered scripts generation,” in Findings of the Association for Computational Linguistics: EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 16-20 November, 2021, M. Moens, X. Huang, L. Specia, and S. W. Yih, Eds. Association for Computational Linguistics, 2021, pp. 2138–2149.
[588] B. Dalvi, L. Huang, N. Tandon, W. Yih, and P. Clark, “Tracking state changes in procedural text: a challenge dataset and models for process paragraph comprehension,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), M. A. Walker, H. Ji, and A. Stent, Eds. Association for Computational Linguistics, 2018, pp. 1595–1604.
[589] S. Saha, P. Yadav, L. Bauer, and M. Bansal, “Explagraphs: An explanation graph generation task for structured commonsense reasoning,” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, M. Moens, X. Huang, L. Specia, and S. W. Yih, Eds. Association for Computational Linguistics, 2021, pp. 7716–7740.
[590] O. Tafjord, B. Dalvi, and P. Clark, “Proofwriter: Generating implications, proofs, and abductive statements over natural language,” in Findings of the Association for Computational Linguistics: ACL/IJCNLP 2021, Online Event, August 1-6, 2021, ser. Findings of ACL, C. Zong, F. Xia, W. Li, and R. Navigli, Eds., vol. ACL/IJCNLP 2021. Association for Computational Linguistics, 2021, pp. 3621–3634.
[591] B. Dalvi, P. Jansen, O. Tafjord, Z. Xie, H. Smith, L. Pipatanangkura, and P. Clark, “Explaining answers with entailment trees,” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, M. Moens, X. Huang, L. Specia, and S. W. Yih, Eds. Association for Computational Linguistics, 2021, pp. 7358–7370.
[592] A. Saparov and H. He, “Language models are greedy reasoners: A systematic formal analysis of chain-of-thought,” CoRR, vol. abs/2210.01240, 2022.
[593] C. Anil, Y. Wu, A. Andreassen, A. Lewkowycz, V. Misra, V. V. Ramasesh, A. Slone, G. Gur-Ari, E. Dyer, and B. Neyshabur, “Exploring length generalization in large language models,” CoRR, vol. abs/2207.04901, 2022.
[594] A. Patel, S. Bhattamishra, and N. Goyal, “Are NLP models really able to solve simple math word problems?” in NAACL-HLT. Association for Computational Linguistics, 2021, pp. 2080–2094.
[595] S. Roy and D. Roth, “Solving general arithmetic word problems,” in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, L. Màrquez, C. Callison-Burch, J. Su, D. Pighin, and
Y. Marton, Eds. The Association for Computational
Linguistics, 2015, pp. 1743–1752.
[596] A. Amini, S. Gabriel, S. Lin, R. Koncel-Kedziorski,
Y. Choi, and H. Hajishirzi, “Mathqa: Towards inter-
pretable math word problem solving with operation-
based formalisms,” in Proceedings of the 2019 Confer-
ence of the North American Chapter of the Association for
Computational Linguistics: Human Language Technolo-
gies, NAACL-HLT 2019, Minneapolis, MN, USA, June
2-7, 2019, Volume 1 (Long and Short Papers), J. Burstein,
C. Doran, and T. Solorio, Eds. Association for
Computational Linguistics, 2019, pp. 2357–2367.
[597] W. Ling, D. Yogatama, C. Dyer, and P. Blunsom, “Pro-
gram induction by rationale generation: Learning to
solve and explain algebraic word problems,” in Pro-
ceedings of the 55th Annual Meeting of the Association
for Computational Linguistics, ACL 2017, Vancouver,
Canada, July 30 - August 4, Volume 1: Long Papers,
R. Barzilay and M. Kan, Eds. Association for Com-
putational Linguistics, 2017, pp. 158–167.
[598] R. Koncel-Kedziorski, S. Roy, A. Amini, N. Kushman,
and H. Hajishirzi, “Mawps: A math word problem
repository,” in Proceedings of the 2016 conference of
the north american chapter of the association for compu-
tational linguistics: human language technologies, 2016,
pp. 1152–1157.
[599] D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh,
and M. Gardner, “DROP: A reading comprehension
benchmark requiring discrete reasoning over para-
graphs,” in Proceedings of the 2019 Conference of the
North American Chapter of the Association for Com-
putational Linguistics: Human Language Technologies,
NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7,
2019, Volume 1 (Long and Short Papers), 2019, pp. 2368–
2378.
[600] S. Welleck, J. Liu, R. L. Bras, H. Hajishirzi, Y. Choi,
and K. Cho, “Naturalproofs: Mathematical theorem
proving in natural language,” in Proceedings of the
Neural Information Processing Systems Track on Datasets
and Benchmarks 1, NeurIPS Datasets and Benchmarks
2021, December 2021, virtual, J. Vanschoren and S. Ye-
ung, Eds., 2021.
[601] A. Q. Jiang, W. Li, J. M. Han, and Y. Wu, “Lisa: Lan-
guage models of isabelle proofs,” in 6th Conference
on Artificial Intelligence and Theorem Proving, 2021, pp.
378–392.
[602] K. Zheng, J. M. Han, and S. Polu, “minif2f: a cross-
system benchmark for formal olympiad-level mathe-
matics,” in The Tenth International Conference on Learn-
ing Representations, ICLR 2022, Virtual Event, April 25-
29, 2022. OpenReview.net, 2022.
[603] Z. Azerbayev, B. Piotrowski, H. Schoelkopf, E. W.
Ayers, D. Radev, and J. Avigad, “Proofnet: Autofor-
malizing and formally proving undergraduate-level
mathematics,” CoRR, vol. abs/2302.12433, 2023.
[604] J. Li, X. Cheng, W. X. Zhao, J. Nie, and J. Wen,
“Halueval: A large-scale hallucination evaluation
benchmark for large language models,” CoRR, vol.
abs/2305.11747, 2023.
[605] N. Nangia, C. Vania, R. Bhalerao, and S. R. Bowman,
“Crows-pairs: A challenge dataset for measuring
social biases in masked language models,” in Pro-
ceedings of the 2020 Conference on Empirical Methods
in Natural Language Processing, EMNLP 2020, Online,
November 16-20, 2020, 2020, pp. 1953–1967.
[606] R. Rudinger, J. Naradowsky, B. Leonard, and B. V.
Durme, “Gender bias in coreference resolution,” in
Proceedings of the 2018 Conference of the North American
Chapter of the Association for Computational Linguistics:
Human Language Technologies, NAACL-HLT, New Or-
leans, Louisiana, USA, June 1-6, 2018, Volume 2 (Short
Papers), 2018, pp. 8–14.
[607] S. Gehman, S. Gururangan, M. Sap, Y. Choi, and
N. A. Smith, “Realtoxicityprompts: Evaluating neu-
ral toxic degeneration in language models,” in Find-
ings of the Association for Computational Linguistics:
EMNLP 2020, Online Event, 16-20 November 2020, ser.
Findings of ACL, T. Cohn, Y. He, and Y. Liu, Eds.,
vol. EMNLP 2020. Association for Computational
Linguistics, 2020, pp. 3356–3369.
[608] X. Puig, K. Ra, M. Boben, J. Li, T. Wang, S. Fidler,
and A. Torralba, “Virtualhome: Simulating house-
hold activities via programs,” in CVPR. Computer
Vision Foundation / IEEE Computer Society, 2018,
pp. 8494–8502.
[609] S. Srivastava, C. Li, M. Lingelbach, R. Martín-Martín,
F. Xia, K. E. Vainio, Z. Lian, C. Gokmen, S. Buch,
C. K. Liu, S. Savarese, H. Gweon, J. Wu, and L. Fei-
Fei, “BEHAVIOR: benchmark for everyday house-
hold activities in virtual, interactive, and ecological
environments,” in CoRL, ser. Proceedings of Machine
Learning Research, vol. 164. PMLR, 2021, pp. 477–
490.
[610] M. Shridhar, J. Thomason, D. Gordon, Y. Bisk,
W. Han, R. Mottaghi, L. Zettlemoyer, and D. Fox,
“ALFRED: A benchmark for interpreting grounded
instructions for everyday tasks,” in CVPR. Com-
puter Vision Foundation / IEEE, 2020, pp. 10 737–
10 746.
[611] M. Shridhar, X. Yuan, M. Côté, Y. Bisk, A. Trischler,
and M. J. Hausknecht, “Alfworld: Aligning text and
embodied environments for interactive learning,” in
9th International Conference on Learning Representa-
tions, ICLR 2021, Virtual Event, Austria, May 3-7, 2021.
OpenReview.net, 2021.
[612] S. Yao, H. Chen, J. Yang, and K. Narasimhan, “Web-
shop: Towards scalable real-world web interaction
with grounded language agents,” in NeurIPS, 2022.
[613] X. Deng, Y. Gu, B. Zheng, S. Chen, S. Stevens,
B. Wang, H. Sun, and Y. Su, “Mind2web: To-
wards a generalist agent for the web,” CoRR, vol.
abs/2306.06070, 2023.
[614] W. H. Guss, B. Houghton, N. Topin, P. Wang,
C. Codel, M. Veloso, and R. Salakhutdinov, “Minerl:
A large-scale dataset of minecraft demonstrations,”
in Proceedings of the Twenty-Eighth International Joint
Conference on Artificial Intelligence, IJCAI 2019, Macao,
China, August 10-16, 2019, S. Kraus, Ed. ijcai.org,
2019, pp. 2442–2448.
[615] L. Fan, G. Wang, Y. Jiang, A. Mandlekar, Y. Yang,
H. Zhu, A. Tang, D. Huang, Y. Zhu, and A. Anand-
kumar, “Minedojo: Building open-ended embodied
agents with internet-scale knowledge,” in NeurIPS,
2022.
[616] P. Lu, L. Qiu, K. Chang, Y. N. Wu, S. Zhu, T. Rajpuro-
hit, P. Clark, and A. Kalyan, “Dynamic prompt learn-
ing via policy gradient for semi-structured mathe-
matical reasoning,” CoRR, vol. abs/2209.14610, 2022.
[617] B. Zhang, K. Zhou, X. Wei, W. X. Zhao, J. Sha,
S. Wang, and J.-R. Wen, “Evaluating and improv-
ing tool-augmented computation-intensive math rea-
soning,” CoRR, vol. abs/2306.02408, 2023.
[618] R. Yang, L. Song, Y. Li, S. Zhao, Y. Ge, X. Li,
and Y. Shan, “Gpt4tools: Teaching large language
model to use tools via self-instruction,” CoRR, vol.
abs/2305.18752, 2023.
[619] S. G. Patil, T. Zhang, X. Wang, and J. E. Gonzalez,
“Gorilla: Large language model connected with mas-
sive apis,” CoRR, vol. abs/2305.15334, 2023.
[620] W. Yih, M. Richardson, C. Meek, M. Chang, and
J. Suh, “The value of semantic parse labeling for
knowledge base question answering,” in Proceedings
of the 54th Annual Meeting of the Association for Com-
putational Linguistics, ACL 2016, August 7-12, 2016,
Berlin, Germany, Volume 2: Short Papers. The Associ-
ation for Computer Linguistics, 2016.
[621] H. Puerto, G. G. Sahin, and I. Gurevych, “Metaqa:
Combining expert agents for multi-skill question an-
swering,” in Proceedings of the 17th Conference of the
European Chapter of the Association for Computational
Linguistics, EACL 2023, Dubrovnik, Croatia, May 2-6,
2023, A. Vlachos and I. Augenstein, Eds. Association
for Computational Linguistics, 2023, pp. 3548–3562.
[622] P. Pasupat and P. Liang, “Compositional semantic
parsing on semi-structured tables,” in Proceedings of
the 53rd Annual Meeting of the Association for Com-
putational Linguistics and the 7th International Joint
Conference on Natural Language Processing of the Asian
Federation of Natural Language Processing, ACL 2015,
July 26-31, 2015, Beijing, China, Volume 1: Long Papers.
The Association for Computer Linguistics, 2015, pp.
1470–1480.
[623] V. Zhong, C. Xiong, and R. Socher, “Seq2sql: Gener-
ating structured queries from natural language using
reinforcement learning,” CoRR, vol. abs/1709.00103,
2017.
[624] W. Chen, H. Wang, J. Chen, Y. Zhang, H. Wang,
S. Li, X. Zhou, and W. Y. Wang, “Tabfact: A large-
scale dataset for table-based fact verification,” in 8th
International Conference on Learning Representations,
ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020.
OpenReview.net, 2020.
[625] T. Yu, R. Zhang, K. Yang, M. Yasunaga, D. Wang,
Z. Li, J. Ma, I. Li, Q. Yao, S. Roman, Z. Zhang, and
D. R. Radev, “Spider: A large-scale human-labeled
dataset for complex and cross-domain semantic pars-
ing and text-to-sql task,” in Proceedings of the 2018
Conference on Empirical Methods in Natural Language
Processing, Brussels, Belgium, October 31 - November 4,
2018, E. Riloff, D. Chiang, J. Hockenmaier, and J. Tsu-
jii, Eds. Association for Computational Linguistics,
2018, pp. 3911–3921.
[626] D. Bahdanau, K. Cho, and Y. Bengio, “Neural ma-
chine translation by jointly learning to align and
translate,” in ICLR, 2015.
[627] K. Papineni, S. Roukos, T. Ward, and W. Zhu, “Bleu:
a method for automatic evaluation of machine trans-
lation,” in Proceedings of the 40th Annual Meeting of
the Association for Computational Linguistics, July 6-12,
2002, Philadelphia, PA, USA. ACL, 2002, pp. 311–318.
[628] C.-Y. Lin, “ROUGE: A package for automatic evalu-
ation of summaries,” in Text Summarization Branches
Out. Association for Computational Linguistics, Jul.
2004, pp. 74–81.
[629] W. Jiao, W. Wang, J.-t. Huang, X. Wang, and Z. Tu,
“Is chatgpt a good translator? a preliminary study,”
arXiv preprint arXiv:2301.08745, 2023.
[630] T. Zhang, F. Ladhak, E. Durmus, P. Liang, K. R. McK-
eown, and T. B. Hashimoto, “Benchmarking large
language models for news summarization,” CoRR,
vol. abs/2301.13848, 2023.
[631] T. Goyal, J. J. Li, and G. Durrett, “News summariza-
tion and evaluation in the era of GPT-3,” CoRR, vol.
abs/2209.12356, 2022.
[632] S. Gehrmann, E. Clark, and T. Sellam, “Repairing
the cracked foundation: A survey of obstacles in
evaluation practices for generated text,” CoRR, vol.
abs/2202.06935, 2022.
[633] J. Wang, Y. Liang, F. Meng, H. Shi, Z. Li, J. Xu, J. Qu,
and J. Zhou, “Is chatgpt a good NLG evaluator? A
preliminary study,” CoRR, vol. abs/2303.04048, 2023.
[634] Y. Liu, D. Iter, Y. Xu, S. Wang, R. Xu, and C. Zhu,
“G-eval: NLG evaluation using GPT-4 with better
human alignment,” CoRR, vol. abs/2303.16634, 2023.
[635] K. Yang, Y. Tian, N. Peng, and D. Klein, “Re3: Gen-
erating longer stories with recursive reprompting
and revision,” in Proceedings of the 2022 Conference
on Empirical Methods in Natural Language Processing,
EMNLP 2022, Abu Dhabi, United Arab Emirates, De-
cember 7-11, 2022, Y. Goldberg, Z. Kozareva, and
Y. Zhang, Eds. Association for Computational Lin-
guistics, 2022, pp. 4393–4479.
[636] W. Zhou, Y. E. Jiang, P. Cui, T. Wang, Z. Xiao, Y. Hou,
R. Cotterell, and M. Sachan, “Recurrentgpt: Interac-
tive generation of (arbitrarily) long text,” CoRR, vol.
abs/2305.13304, 2023.
[637] S. Gulwani, O. Polozov, and R. Singh, “Program
synthesis,” Found. Trends Program. Lang., vol. 4, no.
1-2, pp. 1–119, 2017.
[638] S. Zhang, Z. Chen, Y. Shen, M. Ding, J. B. Tenenbaum,
and C. Gan, “Planning with large language models
for code generation,” 2023.
[639] M. Welsh, “The end of programming,” Commun.
ACM, vol. 66, no. 1, pp. 34–35, 2023.
[640] Y. Bang, S. Cahyawijaya, N. Lee, W. Dai, D. Su,
B. Wilie, H. Lovenia, Z. Ji, T. Yu, W. Chung, Q. V. Do,
Y. Xu, and P. Fung, “A multitask, multilingual, mul-
timodal evaluation of chatgpt on reasoning, halluci-
nation, and interactivity,” CoRR, vol. abs/2302.04023,
2023.
[641] Y. Liu, A. R. Fabbri, P. Liu, Y. Zhao, L. Nan, R. Han,
S. Han, S. R. Joty, C. Wu, C. Xiong, and D. Radev,
“Revisiting the gold standard: Grounding summa-
rization evaluation with robust human evaluation,”
CoRR, vol. abs/2212.07981, 2022.
[642] A. R. Fabbri, W. Kryscinski, B. McCann, C. Xiong,
R. Socher, and D. R. Radev, “Summeval: Re-
evaluating summarization evaluation,” Trans. Assoc.
Comput. Linguistics, vol. 9, pp. 391–409, 2021.
[643] T. Tang, H. Lu, Y. E. Jiang, H. Huang, D. Zhang, W. X.
Zhao, and F. Wei, “Not all metrics are guilty: Improv-
ing NLG evaluation with LLM paraphrasing,” CoRR,
vol. abs/2305.15067, 2023.
[644] X. Wang, X. Tang, W. X. Zhao, J. Wang, and J. Wen,
“Rethinking the evaluation for conversational rec-
ommendation in the era of large language models,”
CoRR, vol. abs/2305.13112, 2023.
[645] M. Gao, J. Ruan, R. Sun, X. Yin, S. Yang, and X. Wan,
“Human-like summarization evaluation with chat-
gpt,” CoRR, vol. abs/2304.02554, 2023.
[646] Y. Ji, Y. Gong, Y. Peng, C. Ni, P. Sun, D. Pan, B. Ma,
and X. Li, “Exploring chatgpt’s ability to rank con-
tent: A preliminary study on consistency with hu-
man preferences,” CoRR, vol. abs/2303.07610, 2023.
[647] Y. Bai, J. Ying, Y. Cao, X. Lv, Y. He, X. Wang, J. Yu,
K. Zeng, Y. Xiao, H. Lyu, J. Zhang, J. Li, and L. Hou,
“Benchmarking foundation models with language-
model-as-an-examiner,” CoRR, vol. abs/2306.04181,
2023.
[648] Y. Liu, S. Feng, D. Wang, Y. Zhang, and H. Schütze,
“Evaluate what you can’t evaluate: Unassess-
able generated responses quality,” CoRR, vol.
abs/2305.14658, 2023.
[649] P. Wang, L. Li, L. Chen, D. Zhu, B. Lin, Y. Cao, Q. Liu,
T. Liu, and Z. Sui, “Large language models are not
fair evaluators,” CoRR, vol. abs/2305.17926, 2023.
[650] J. Ye, X. Chen, N. Xu, C. Zu, Z. Shao, S. Liu, Y. Cui,
Z. Zhou, C. Gong, Y. Shen, J. Zhou, S. Chen, T. Gui,
Q. Zhang, and X. Huang, “A comprehensive capabil-
ity analysis of gpt-3 and gpt-3.5 series models,” arXiv
preprint arXiv:2303.10420, 2023.
[651] M. McCloskey and N. J. Cohen, “Catastrophic in-
terference in connectionist networks: The sequential
learning problem,” in Psychology of learning and moti-
vation, 1989, pp. 109–165.
[652] R. Kemker, M. McClure, A. Abitino, T. L. Hayes,
and C. Kanan, “Measuring catastrophic forgetting in
neural networks,” in Proceedings of the Thirty-Second
AAAI Conference on Artificial Intelligence, (AAAI-18),
the 30th innovative Applications of Artificial Intelligence
(IAAI-18), and the 8th AAAI Symposium on Educational
Advances in Artificial Intelligence (EAAI-18), New Or-
leans, Louisiana, USA, February 2-7, 2018, 2018, pp.
3390–3398.
[653] T. Xie, C. H. Wu, P. Shi, R. Zhong, T. Scholak,
M. Yasunaga, C. Wu, M. Zhong, P. Yin, S. I. Wang,
V. Zhong, B. Wang, C. Li, C. Boyle, A. Ni, Z. Yao,
D. Radev, C. Xiong, L. Kong, R. Zhang, N. A. Smith,
L. Zettlemoyer, and T. Yu, “Unifiedskg: Unifying and
multi-tasking structured knowledge grounding with
text-to-text language models,” in EMNLP. Associ-
ation for Computational Linguistics, 2022, pp. 602–
631.
[654] A. Roberts, C. Raffel, and N. Shazeer, “How much
knowledge can you pack into the parameters of a lan-
guage model?” in Proceedings of the 2020 Conference
on Empirical Methods in Natural Language Processing,
EMNLP 2020, Online, November 16-20, 2020, 2020, pp.
5418–5426.
[655] G. Izacard, P. S. H. Lewis, M. Lomeli, L. Hos-
seini, F. Petroni, T. Schick, J. Dwivedi-Yu, A. Joulin,
S. Riedel, and E. Grave, “Few-shot learning with
retrieval augmented language models,” CoRR, vol.
abs/2208.03299, 2022.
[656] K. Guu, K. Lee, Z. Tung, P. Pasupat, and M. Chang,
“Retrieval augmented language model pre-training,”
in Proceedings of the 37th International Conference on
Machine Learning, ICML 2020, 13-18 July 2020, Virtual
Event, 2020, pp. 3929–3938.
[657] P. S. H. Lewis, E. Perez, A. Piktus, F. Petroni,
V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih,
T. Rocktäschel, S. Riedel, and D. Kiela, “Retrieval-
augmented generation for knowledge-intensive NLP
tasks,” in Advances in Neural Information Processing
Systems 33: Annual Conference on Neural Information
Processing Systems 2020, NeurIPS 2020, December 6-12,
2020, virtual, 2020.
[658] Y. Lan, G. He, J. Jiang, J. Jiang, W. X. Zhao, and J. Wen,
“Complex knowledge base question answering: A
survey,” CoRR, vol. abs/2108.06688, 2021.
[659] S. Borgeaud, A. Mensch, J. Hoffmann, T. Cai,
E. Rutherford, K. Millican, G. van den Driessche,
J. Lespiau, B. Damoc, A. Clark, D. de Las Casas,
A. Guy, J. Menick, R. Ring, T. Hennigan, S. Huang,
L. Maggiore, C. Jones, A. Cassirer, A. Brock, M. Pa-
ganini, G. Irving, O. Vinyals, S. Osindero, K. Si-
monyan, J. W. Rae, E. Elsen, and L. Sifre, “Improv-
ing language models by retrieving from trillions of
tokens,” in International Conference on Machine Learn-
ing, ICML 2022, 17-23 July 2022, Baltimore, Maryland,
USA, ser. Proceedings of Machine Learning Research,
K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvári,
G. Niu, and S. Sabato, Eds., vol. 162. PMLR, 2022,
pp. 2206–2240.
[660] S. Xu, L. Pang, H. Shen, X. Cheng, and T.-S. Chua,
“Search-in-the-chain: Towards accurate, credible and
traceable large language models for knowledge-
intensive tasks,” CoRR, vol. abs/2304.14732, 2023.
[661] B. Peng, M. Galley, P. He, H. Cheng, Y. Xie, Y. Hu,
Q. Huang, L. Liden, Z. Yu, W. Chen, and J. Gao,
“Check your facts and try again: Improving large
language models with external knowledge and au-
tomated feedback,” CoRR, vol. abs/2302.12813, 2023.
[662] Z. Jiang, F. F. Xu, L. Gao, Z. Sun, Q. Liu, J. Dwivedi-
Yu, Y. Yang, J. Callan, and G. Neubig, “Ac-
tive retrieval augmented generation,” CoRR, vol.
abs/2305.06983, 2023.
[663] L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng,
H. Wang, Q. Chen, W. Peng, X. Feng, B. Qin, and
T. Liu, “A survey on hallucination in large language
models: Principles, taxonomy, challenges, and open
questions,” CoRR, vol. abs/2311.05232, 2023.
[664] Y. Li, Y. Du, K. Zhou, J. Wang, W. X. Zhao, and
J. Wen, “Evaluating object hallucination in large
vision-language models,” CoRR, vol. abs/2305.10355,
2023.
[665] S. Kadavath, T. Conerly, A. Askell, T. J. Henighan,
D. Drain, E. Perez, N. Schiefer, Z. Dodds, N. Das-
Sarma, E. Tran-Johnson, S. Johnston, S. El-Showk,
A. Jones, N. Elhage, T. Hume, A. Chen, Y. Bai,
S. Bowman, S. Fort, D. Ganguli, D. Hernandez, J. Ja-
cobson, J. Kernion, S. Kravec, L. Lovitt, K. Ndousse,
C. Olsson, S. Ringer, D. Amodei, T. B. Brown, J. Clark,
N. Joseph, B. Mann, S. McCandlish, C. Olah, and
J. Kaplan, “Language models (mostly) know what
they know,” CoRR, vol. abs/2207.05221, 2022.
[666] P. Manakul, A. Liusie, and M. J. F. Gales, “Selfcheck-
gpt: Zero-resource black-box hallucination detection
for generative large language models,” CoRR, vol.
abs/2303.08896, 2023.
[667] S. Agarwal, I. Akkaya, V. Balcom, M. Bavarian,
G. Bernadett-Shapiro, G. Brockman, M. Brundage,
J. Chan, F. Chantzis, N. Deutsch, B. Eastman, A. Eleti,
N. Felix, S. P. Fishman, I. Fulford, C. Gibson, J. Gross,
M. Heaton, J. Hilton, X. Hu, S. Jain, H. Jin, L. Kil-
patrick, C. Kim, M. Kolhede, A. Mayne, P. McMil-
lan, D. Medina, J. Menick, A. Mishchenko, A. Nair,
R. Nayak, A. Neelakantan, R. Nuttall, J. Parish,
A. T. Passos, A. Perelman, F. de Avila Belbute Peres,
V. Pong, J. Schulman, E. Sigler, N. Staudacher, N. Tur-
ley, J. Tworek, R. Greene, A. Vijayvergiya, C. Voss,
J. Weng, M. Wiethoff, S. Yoo, K. Yu, W. Zaremba,
S. Zhao, W. Zhuk, and B. Zoph, “Chatgpt plugins,”
OpenAI Blog, March 2023.
[668] A. Lazaridou, E. Gribovskaya, W. Stokowiec, and
N. Grigorev, “Internet-augmented language models
through few-shot prompting for open-domain ques-
tion answering,” CoRR, vol. abs/2203.05115, 2022.
[669] H. Qian, Y. Zhu, Z. Dou, H. Gu, X. Zhang, Z. Liu,
R. Lai, Z. Cao, J. Nie, and J. Wen, “Webbrain: Learn-
ing to generate factually correct articles for queries
by grounding on large web corpus,” CoRR, vol.
abs/2304.04358, 2023.
[670] J. Liu, J. Jin, Z. Wang, J. Cheng, Z. Dou, and J. Wen,
“RETA-LLM: A retrieval-augmented large language
model toolkit,” CoRR, vol. abs/2306.05212, 2023.
[671] D. Dai, L. Dong, Y. Hao, Z. Sui, B. Chang, and F. Wei,
“Knowledge neurons in pretrained transformers,” in
Proceedings of the 60th Annual Meeting of the Asso-
ciation for Computational Linguistics (Volume 1: Long
Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022,
S. Muresan, P. Nakov, and A. Villavicencio, Eds.
Association for Computational Linguistics, 2022, pp.
8493–8502.
[672] K. Meng, D. Bau, A. J. Andonian, and Y. Belinkov,
“Locating and editing factual associations in gpt,”
in Advances in Neural Information Processing Systems,
2022.
[673] M. Geva, R. Schuster, J. Berant, and O. Levy, “Trans-
former feed-forward layers are key-value memo-
ries,” in Proceedings of the 2021 Conference on Empirical
Methods in Natural Language Processing, EMNLP 2021,
Virtual Event / Punta Cana, Dominican Republic, 7-
11 November, 2021, M. Moens, X. Huang, L. Specia,
and S. W. Yih, Eds. Association for Computational
Linguistics, 2021, pp. 5484–5495.
[674] Y. Yao, P. Wang, B. Tian, S. Cheng, Z. Li, S. Deng,
H. Chen, and N. Zhang, “Editing large language
models: Problems, methods, and opportunities,”
CoRR, vol. abs/2305.13172, 2023.
[675] P. Wang, N. Zhang, X. Xie, Y. Yao, B. Tian,
M. Wang, Z. Xi, S. Cheng, K. Liu, G. Zheng, and
H. Chen, “Easyedit: An easy-to-use knowledge edit-
ing framework for large language models,” CoRR,
vol. abs/2308.07269, 2023.
[676] Z. Shao, Y. Gong, Y. Shen, M. Huang, N. Duan, and
W. Chen, “Synthetic prompting: Generating chain-of-
thought demonstrations for large language models,”
CoRR, vol. abs/2302.00618, 2023.
[677] Sifatkaur, M. Singh, V. S. B, and N. Malviya, “Mind
meets machine: Unravelling gpt-4’s cognitive psy-
chology,” CoRR, vol. abs/2303.11436, 2023.
[678] M. I. Nye, A. J. Andreassen, G. Gur-Ari,
H. Michalewski, J. Austin, D. Bieber, D. Dohan,
A. Lewkowycz, M. Bosma, D. Luan, C. Sutton, and
A. Odena, “Show your work: Scratchpads for inter-
mediate computation with language models,” CoRR,
vol. abs/2112.00114, 2021.
[679] J. Qian, H. Wang, Z. Li, S. Li, and X. Yan, “Limita-
tions of language models in arithmetic and symbolic
induction,” CoRR, vol. abs/2208.05051, 2022.
[680] W. X. Zhao, K. Zhou, Z. Gong, B. Zhang, Y. Zhou,
J. Sha, Z. Chen, S. Wang, C. Liu, and J. Wen, “Ji-
uzhang: A chinese pre-trained language model for
mathematical problem understanding,” in KDD ’22:
The 28th ACM SIGKDD Conference on Knowledge Dis-
covery and Data Mining, Washington, DC, USA, August
14 - 18, 2022, A. Zhang and H. Rangwala, Eds. ACM,
2022, pp. 4571–4581.
[681] Q. Wang, C. Kaliszyk, and J. Urban, “First experi-
ments with neural translation of informal to formal
mathematics,” in Intelligent Computer Mathematics -
11th International Conference, CICM 2018, Hagenberg,
Austria, August 13-17, 2018, Proceedings, ser. Lecture
Notes in Computer Science, F. Rabe, W. M. Farmer,
G. O. Passmore, and A. Youssef, Eds., vol. 11006.
Springer, 2018, pp. 255–270.
[682] S. Polu and I. Sutskever, “Generative language mod-
eling for automated theorem proving,” CoRR, vol.
abs/2009.03393, 2020.
[683] A. Q. Jiang, W. Li, S. Tworkowski, K. Czechowski,
T. Odrzygóźdź, P. Milos, Y. Wu, and M. Jamnik,
“Thor: Wielding hammers to integrate language
models and automated theorem provers,” CoRR, vol.
abs/2205.10893, 2022.
[684] S. Polu, J. M. Han, K. Zheng, M. Baksys,
I. Babuschkin, and I. Sutskever, “Formal mathe-
matics statement curriculum learning,” CoRR, vol.
abs/2202.01344, 2022.
[685] Y. Wu, A. Q. Jiang, W. Li, M. N. Rabe, C. Staats,
M. Jamnik, and C. Szegedy, “Autoformalization with
large language models,” CoRR, vol. abs/2205.12615,
2022.
[686] A. Q. Jiang, S. Welleck, J. P. Zhou, W. Li, J. Liu,
M. Jamnik, T. Lacroix, Y. Wu, and G. Lample, “Draft,
sketch, and prove: Guiding formal theorem provers
with informal proofs,” CoRR, vol. abs/2210.12283,
2022.
[687] A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao,
S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye,
Y. Yang, S. Welleck, B. P. Majumder, S. Gupta, A. Yaz-
danbakhsh, and P. Clark, “Self-refine: Iterative refine-
ment with self-feedback,” CoRR, vol. abs/2303.17651,
2023.
[688] N. Shinn, B. Labash, and A. Gopinath, “Reflexion: an
autonomous agent with dynamic memory and self-
reflection,” CoRR, vol. abs/2303.11366, 2023.
[689] Z. Gou, Z. Shao, Y. Gong, Y. Shen, Y. Yang, N. Duan,
and W. Chen, “CRITIC: large language models can
self-correct with tool-interactive critiquing,” CoRR,
vol. abs/2305.11738, 2023.
[690] J. Uesato, N. Kushman, R. Kumar, H. F. Song,
N. Y. Siegel, L. Wang, A. Creswell, G. Irving, and
I. Higgins, “Solving math word problems with
process- and outcome-based feedback,” CoRR, vol.
abs/2211.14275, 2022.
[691] H. Lightman, V. Kosaraju, Y. Burda, H. Edwards,
B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever,
and K. Cobbe, “Let’s verify step by step,” CoRR, vol.
abs/2305.20050, 2023.
[692] Z. Yuan, H. Yuan, C. Tan, W. Wang, and S. Huang,
“How well do large language models perform in
arithmetic tasks?” CoRR, vol. abs/2304.02015, 2023.
[693] X. Pi, Q. Liu, B. Chen, M. Ziyadi, Z. Lin, Q. Fu,
Y. Gao, J. Lou, and W. Chen, “Reasoning like pro-
gram executors,” in Proceedings of the 2022 Conference
on Empirical Methods in Natural Language Processing,
EMNLP 2022, Abu Dhabi, United Arab Emirates, De-
cember 7-11, 2022, 2022, pp. 761–779.
[694] H. Zhou, A. Nova, H. Larochelle, A. C. Courville,
B. Neyshabur, and H. Sedghi, “Teaching algorith-
mic reasoning via in-context learning,” CoRR, vol.
abs/2211.09066, 2022.
[695] A. Parisi, Y. Zhao, and N. Fiedel, “TALM:
tool augmented language models,” CoRR, vol.
abs/2205.12255, 2022.
[696] W. Huang, P. Abbeel, D. Pathak, and I. Mordatch,
“Language models as zero-shot planners: Extract-
ing actionable knowledge for embodied agents,” in
ICML, ser. Proceedings of Machine Learning Re-
search, vol. 162. PMLR, 2022, pp. 9118–9147.
[697] T. Carta, C. Romac, T. Wolf, S. Lamprier, O. Sigaud,
and P. Oudeyer, “Grounding large language models
in interactive environments with online reinforce-
ment learning,” CoRR, vol. abs/2302.02662, 2023.
[698] X. Zhu, Y. Chen, H. Tian, C. Tao, W. Su, C. Yang,
G. Huang, B. Li, L. Lu, X. Wang, Y. Qiao, Z. Zhang,
and J. Dai, “Ghost in the minecraft: Generally capa-
ble agents for open-world environments via large
language models with text-based knowledge and
memory,” CoRR, vol. abs/2305.17144, 2023.
[699] G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao,
Y. Zhu, L. Fan, and A. Anandkumar, “Voyager: An
open-ended embodied agent with large language
models,” CoRR, vol. abs/2305.16291, 2023.
[700] M. Ahn, A. Brohan, N. Brown, Y. Chebotar, O. Cortes,
B. David, C. Finn, K. Gopalakrishnan, K. Hausman,
A. Herzog, D. Ho, J. Hsu, J. Ibarz, B. Ichter, A. Ir-
pan, E. Jang, R. J. Ruano, K. Jeffrey, S. Jesmonth,
N. J. Joshi, R. Julian, D. Kalashnikov, Y. Kuang,
K. Lee, S. Levine, Y. Lu, L. Luu, C. Parada, P. Pastor,
J. Quiambao, K. Rao, J. Rettinghouse, D. Reyes, P. Ser-
manet, N. Sievers, C. Tan, A. Toshev, V. Vanhoucke,
F. Xia, T. Xiao, P. Xu, S. Xu, and M. Yan, “Do as
I can, not as I say: Grounding language in robotic
affordances,” CoRR, vol. abs/2204.01691, 2022.
[701] J. Liang, W. Huang, F. Xia, P. Xu, K. Hausman,
B. Ichter, P. Florence, and A. Zeng, “Code as policies:
Language model programs for embodied control,”
CoRR, vol. abs/2209.07753, 2022.
[702] Y. Fu, H. Peng, T. Khot, and M. Lapata, “Improv-
ing language model negotiation with self-play and
in-context learning from AI feedback,” CoRR, vol.
abs/2305.10142, 2023.
[703] N. Mehta, M. Teruel, P. F. Sanz, X. Deng, A. H.
Awadallah, and J. Kiseleva, “Improving grounded
language understanding in a collaborative environ-
ment by interacting with agents through help feed-
back,” CoRR, vol. abs/2304.10750, 2023.
[704] S. G. Patil, T. Zhang, X. Wang, and J. E. Gonzalez,
“Gorilla: Large language model connected with mas-
sive apis,” CoRR, vol. abs/2305.15334, 2023.
[705] S. Hao, T. Liu, Z. Wang, and Z. Hu, “Toolkengpt:
Augmenting frozen language models with mas-
sive tools via tool embeddings,” CoRR, vol.
abs/2305.11554, 2023.
[706] Y. Liang, C. Wu, T. Song, W. Wu, Y. Xia, Y. Liu, Y. Ou,
S. Lu, L. Ji, S. Mao, Y. Wang, L. Shou, M. Gong,
and N. Duan, “Taskmatrix.ai: Completing tasks by
connecting foundation models with millions of apis,”
CoRR, vol. abs/2303.16434, 2023.
[707] T. Cai, X. Wang, T. Ma, X. Chen, and D. Zhou,
“Large language models as tool makers,” CoRR, vol.
abs/2305.17126, 2023.
[708] J. Huang, S. S. Gu, L. Hou, Y. Wu, X. Wang,
H. Yu, and J. Han, “Large language models can self-
improve,” CoRR, vol. abs/2210.11610, 2022.
[709] E. Beeching, C. Fourrier, N. Habib, S. Han,
N. Lambert, N. Rajani, O. Sanseviero,
L. Tunstall, and T. Wolf, “Open llm leaderboard,”
https://huggingface.co/spaces/HuggingFaceH4/
open_llm_leaderboard, 2023.
[710] W. Zhong, R. Cui, Y. Guo, Y. Liang, S. Lu, Y. Wang,
A. Saied, W. Chen, and N. Duan, “Agieval: A human-
centric benchmark for evaluating foundation mod-
els,” CoRR, vol. abs/2304.06364, 2023.
[711] H. Zeng, “Measuring massive multitask chinese un-
derstanding,” CoRR, vol. abs/2304.12986, 2023.
[712] C. Liu, R. Jin, Y. Ren, L. Yu, T. Dong, X. Peng,
S. Zhang, J. Peng, P. Zhang, Q. Lyu, X. Su, Q. Liu,
and D. Xiong, “M3KE: A massive multi-level multi-
subject knowledge evaluation benchmark for chinese
large language models,” CoRR, vol. abs/2305.10263,
2023.
[713] Y. Huang, Y. Bai, Z. Zhu, J. Zhang, J. Zhang, T. Su,
J. Liu, C. Lv, Y. Zhang, J. Lei, Y. Fu, M. Sun, and
J. He, “C-eval: A multi-level multi-discipline chinese
evaluation suite for foundation models,” CoRR, vol.
abs/2305.08322, 2023.
[714] Z. Gu, X. Zhu, H. Ye, L. Zhang, J. Wang, S. Jiang,
Z. Xiong, Z. Li, Q. He, R. Xu, W. Huang, W. Zheng,
H. Feng, and Y. Xiao, “Xiezhi: An ever-updating
benchmark for holistic domain knowledge evalua-
tion,” CoRR, vol. abs/2306.05783, 2023.
[715] OpenCompass Contributors, “Opencompass: A universal eval-
uation platform for foundation models,” https://
github.com/InternLM/OpenCompass, 2023.
[716] Y. Fu, L. Ou, M. Chen, Y. Wan, H. Peng, and
T. Khot, “Chain-of-thought hub: A continuous effort
to measure large language models’ reasoning perfor-
mance,” CoRR, vol. abs/2305.17306, 2023.
[717] J. Yu, X. Wang, S. Tu, S. Cao, D. Zhang-li, X. Lv,
H. Peng, Z. Yao, X. Zhang, H. Li, C. Li, Z. Zhang,
Y. Bai, Y. Liu, A. Xin, N. Lin, K. Yun, L. Gong, J. Chen,
Z. Wu, Y. Qi, W. Li, Y. Guan, K. Zeng, J. Qi, H. Jin,
J. Liu, Y. Gu, Y. Yao, N. Ding, L. Hou, Z. Liu, B. Xu,
J. Tang, and J. Li, “Kola: Carefully benchmarking
world knowledge of large language models,” CoRR,
vol. abs/2306.09296, 2023.
[718] T. Sawada, D. Paleka, A. Havrilla, P. Tadepalli, P. Vi-
das, A. Kranias, J. J. Nay, K. Gupta, and A. Komat-
suzaki, “ARB: advanced reasoning benchmark for
large language models,” CoRR, vol. abs/2307.13692,
2023.
[719] Y. Peng, S. Li, W. Gu, Y. Li, W. Wang, C. Gao, and
M. R. Lyu, “Revisiting, benchmarking and exploring
API recommendation: How far are we?” IEEE Trans.
Software Eng., vol. 49, no. 4, pp. 1876–1897, 2023.
[720] M. Li, F. Song, B. Yu, H. Yu, Z. Li, F. Huang, and Y. Li,
“Api-bank: A benchmark for tool-augmented llms,”
CoRR, vol. abs/2304.08244, 2023.
[721] Q. Tang, Z. Deng, H. Lin, X. Han, Q. Liang, and
L. Sun, “Toolalpaca: Generalized tool learning for
language models with 3000 simulated cases,” CoRR,
vol. abs/2306.05301, 2023.
[722] Q. Xu, F. Hong, B. Li, C. Hu, Z. Chen, and J. Zhang,
“On the tool manipulation capability of open-source
large language models,” CoRR, vol. abs/2305.16504,
2023.
[723] Y. Qin, S. Liang, Y. Ye, K. Zhu, L. Yan, Y. Lu, Y. Lin,
X. Cong, X. Tang, B. Qian, S. Zhao, R. Tian, R. Xie,
J. Zhou, M. Gerstein, D. Li, Z. Liu, and M. Sun, “Tool-
llm: Facilitating large language models to master
16000+ real-world apis,” CoRR, vol. abs/2307.16789,
2023.
[724] Z. Liu, W. Yao, J. Zhang, L. Xue, S. Heinecke,
R. Murthy, Y. Feng, Z. Chen, J. C. Niebles,
D. Arpit, R. Xu, P. Mui, H. Wang, C. Xiong, and
S. Savarese, “BOLAA: benchmarking and orchestrat-
ing llm-augmented autonomous agents,” CoRR, vol.
abs/2308.05960, 2023.
[725] X. Liu, H. Yu, H. Zhang, Y. Xu, X. Lei, H. Lai,
Y. Gu, H. Ding, K. Men, K. Yang, S. Zhang, X. Deng,
A. Zeng, Z. Du, C. Zhang, S. Shen, T. Zhang,
Y. Su, H. Sun, M. Huang, Y. Dong, and J. Tang,
“Agentbench: Evaluating llms as agents,” CoRR, vol.
abs/2308.03688, 2023.
[726] K. Zhu, J. Wang, J. Zhou, Z. Wang, H. Chen, Y. Wang,
L. Yang, W. Ye, N. Z. Gong, Y. Zhang, and X. Xie,
“Promptbench: Towards evaluating the robustness
of large language models on adversarial prompts,”
CoRR, vol. abs/2306.04528, 2023.
[727] R. S. Shah, K. Chawla, D. Eidnani, A. Shah, W. Du,
S. Chava, N. Raman, C. Smiley, J. Chen, and D. Yang,
“WHEN FLUE MEETS FLANG: benchmarks and
large pre-trained language model for financial do-
main,” CoRR, vol. abs/2211.00083, 2022.
[728] N. Guha, D. E. Ho, J. Nyarko, and C. Ré, “Legal-
bench: Prototyping a collaborative benchmark for
legal reasoning,” CoRR, vol. abs/2209.06120, 2022.
[729] L. Zheng, W. Chiang, Y. Sheng, S. Zhuang, Z. Wu,
Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang,
J. E. Gonzalez, and I. Stoica, “Judging llm-as-a-
judge with mt-bench and chatbot arena,” CoRR, vol.
abs/2306.05685, 2023.
[730] X. Wang, Z. Hu, P. Lu, Y. Zhu, J. Zhang, S. Sub-
ramaniam, A. R. Loomba, S. Zhang, Y. Sun, and
W. Wang, “Scibench: Evaluating college-level sci-
entific problem-solving abilities of large language
models,” CoRR, vol. abs/2307.10635, 2023.
[731] X. Li, T. Zhang, Y. Dubois, R. Taori, I. Gulrajani,
C. Guestrin, P. Liang, and T. B. Hashimoto, “Al-
pacaeval: An automatic evaluator of instruction-
following models,” https://github.com/tatsu-lab/
alpaca_eval, 2023.
[732] Y. Huang, Q. Zhang, P. S. Yu, and L. Sun, “Trustgpt:
A benchmark for trustworthy and responsible large
language models,” CoRR, vol. abs/2306.11507, 2023.
[733] Y. Bai, J. Ying, Y. Cao, X. Lv, Y. He, X. Wang, J. Yu,
K. Zeng, Y. Xiao, H. Lyu, J. Zhang, J. Li, and L. Hou,
“Benchmarking foundation models with language-
model-as-an-examiner,” CoRR, vol. abs/2306.04181,
2023.
[734] C. Chan, W. Chen, Y. Su, J. Yu, W. Xue, S. Zhang, J. Fu,
and Z. Liu, “Chateval: Towards better llm-based
evaluators through multi-agent debate,” CoRR, vol.
abs/2308.07201, 2023.
[735] Y. Chang, X. Wang, J. Wang, Y. Wu, K. Zhu, H. Chen,
L. Yang, X. Yi, C. Wang, Y. Wang, W. Ye, Y. Zhang,
Y. Chang, P. S. Yu, Q. Yang, and X. Xie, “A survey
on evaluation of large language models,” CoRR, vol.
abs/2307.03109, 2023.
[736] Z. Zhuang, Q. Chen, L. Ma, M. Li, Y. Han, Y. Qian,
H. Bai, Z. Feng, W. Zhang, and T. Liu, “Through
the lens of core competency: Survey on evaluation of
large language models,” CoRR, vol. abs/2308.07902,
2023.
[737] J. H. Clark, J. Palomaki, V. Nikolaev, E. Choi, D. Gar-
rette, M. Collins, and T. Kwiatkowski, “Tydi QA:
A benchmark for information-seeking question an-
swering in typologically diverse languages,” Trans.
Assoc. Comput. Linguistics, vol. 8, pp. 454–470, 2020.
[738] L. Gao, J. Tow, S. Biderman, S. Black, A. DiPofi,
C. Foster, L. Golding, J. Hsu, K. McDonell, N. Muen-
nighoff, J. Phang, L. Reynolds, E. Tang, A. Thite,
B. Wang, K. Wang, and A. Zou, “A framework for
few-shot language model evaluation,” Sep. 2021.
[739] R. Shah, K. Chawla, D. Eidnani, A. Shah, W. Du,
S. Chava, N. Raman, C. Smiley, J. Chen, and D. Yang,
“When flue meets flang: Benchmarks and large pre-
trained language model for financial domain,” in
Proceedings of the 2022 Conference on Empirical Methods
in Natural Language Processing, 2022, pp. 2322–2335.
[740] K. Zhou, Y. Zhu, Z. Chen, W. Chen, W. X. Zhao,
X. Chen, Y. Lin, J.-R. Wen, and J. Han, “Don’t make
your llm an evaluation benchmark cheater,” arXiv
preprint arXiv:2311.01964, 2023.
[741] C. Zan, K. Peng, L. Ding, B. Qiu, B. Liu, S. He, Q. Lu,
Z. Zhang, C. Liu, W. Liu, Y. Zhan, and D. Tao, “Vega-
mt: The JD explore academy machine translation
system for WMT22,” in Proceedings of the Seventh Con-
ference on Machine Translation, WMT 2022, Abu Dhabi,
United Arab Emirates (Hybrid), December 7-8, 2022,
P. Koehn, L. Barrault, O. Bojar, F. Bougares, R. Chat-
terjee, M. R. Costa-jussà, C. Federmann, M. Fishel,
A. Fraser, M. Freitag, Y. Graham, R. Grundkiewicz,
P. Guzman, B. Haddow, M. Huck, A. Jimeno-Yepes,
T. Kocmi, A. Martins, M. Morishita, C. Monz, M. Na-
gata, T. Nakazawa, M. Negri, A. N´
ev
´ eol, M. Neves,
M. Popel, M. Turchi, and M. Zampieri, Eds. Asso-
ciation for Computational Linguistics, 2022, pp. 411–
422.
[742] Y. Zhao, M. Khalman, R. Joshi, S. Narayan, M. Saleh,
and P. J. Liu, “Calibrating sequence likelihood
improves conditional language generation,” CoRR,
vol. abs/2210.00045, 2022. [Online]. Available:
https://doi.org/10.48550/arXiv.2210.00045
[743] D. Khashabi, S. Min, T. Khot, A. Sabharwal,
O. Tafjord, P. Clark, and H. Hajishirzi, “Unifiedqa:
Crossing format boundaries with a single QA sys-
tem,” in EMNLP (Findings), ser. Findings of ACL,
vol. EMNLP 2020. Association for Computational
Linguistics, 2020, pp. 1896–1907.
[744] X. Zhu, J. Wang, L. Zhang, Y. Zhang, R. Gan,
J. Zhang, and Y. Yang, “Solving math word problem
via cooperative reasoning induced language mod-
els,” arXiv preprint arXiv:2210.16257, 2022.
[745] A. Nguyen, N. Karampatziakis, and W. Chen, “Meet
in the middle: A new pre-training paradigm,”
CoRR, vol. abs/2303.07295, 2023. [Online]. Available:
https://doi.org/10.48550/arXiv.2303.07295
[746] H. Li, J. Zhang, C. Li, and H. Chen, “RESDSQL:
decoupling schema linking and skeleton parsing
for text-to-sql,” CoRR, vol. abs/2302.05965, 2023.
[Online]. Available: https://doi.org/10.48550/arXiv.
2302.05965
[747] W. Kang and J. J. McAuley, “Self-attentive sequential
recommendation,” in IEEE International Conference on
Data Mining, ICDM 2018, Singapore, November 17-20,
2018. IEEE Computer Society, 2018, pp. 197–206.
[748] B. Yang, C. Han, Y. Li, L. Zuo, and Z. Yu, “Improv-
ing conversational recommendation systems’ quality
with context-aware item meta-information,” in Find-
ings of the Association for Computational Linguistics:
NAACL 2022, Seattle, WA, United States, July 10-15,
2022, M. Carpuat, M. de Marneffe, and I. V. M. Ruíz,
Eds. Association for Computational Linguistics,
2022, pp. 38–48.
[749] E. Almazrouei, H. Alobeidli, A. Alshamsi, A. Cap-
pelli, R. Cojocaru, M. Debbah, E. Goffinet, D. Hes-
low, J. Launay, Q. Malartic, B. Noune, B. Pannier,
and G. Penedo, “Falcon-40B: an open large language
model with state-of-the-art performance,” 2023.
[750] S. Martin, J. Liermann, and H. Ney, “Algorithms for
bigram and trigram word clustering,” Speech commu-
nication, vol. 24, no. 1, pp. 19–37, 1998.
[751] R. Navigli, “Word sense disambiguation: A survey,”
ACM computing surveys (CSUR), vol. 41, no. 2, pp.
1–69, 2009.
[752] W. H. Gomaa, A. A. Fahmy et al., “A survey of
text similarity approaches,” international journal of
Computer Applications, vol. 68, no. 13, pp. 13–18, 2013.
[753] S. Minaee, N. Kalchbrenner, E. Cambria, N. Nikzad,
M. Chenaghlu, and J. Gao, “Deep learning–based
text classification: a comprehensive review,” ACM
computing surveys (CSUR), vol. 54, no. 3, pp. 1–40,
2021.
[754] N. Alex, E. Lifland, L. Tunstall, A. Thakur, P. Maham,
C. J. Riedel, E. Hine, C. Ashurst, P. Sedille, A. Carlier,
M. Noetel, and A. Stuhlmüller, “RAFT: A real-world
few-shot text classification benchmark,” in NeurIPS
Datasets and Benchmarks, 2021.
[755] C. Qin, A. Zhang, Z. Zhang, J. Chen, M. Yasunaga,
and D. Yang, “Is chatgpt a general-purpose nat-
ural language processing task solver?” CoRR, vol.
abs/2302.06476, 2023.
[756] X. Chen, J. Ye, C. Zu, N. Xu, R. Zheng, M. Peng,
J. Zhou, T. Gui, Q. Zhang, and X. Huang, “How
robust is gpt-3.5 to predecessors? a comprehensive
study on language understanding tasks,” 2023.
[757] D. Nadeau and S. Sekine, “A survey of named entity
recognition and classification,” Lingvisticae Investiga-
tiones, vol. 30, no. 1, pp. 3–26, 2007.
[758] A. Ratnaparkhi, “A maximum entropy model for
part-of-speech tagging,” in Conference on empirical
methods in natural language processing, 1996.
[759] V. Yadav and S. Bethard, “A survey on recent
advances in named entity recognition from deep
learning models,” in Proceedings of the 27th Interna-
tional Conference on Computational Linguistics, 2018,
pp. 2145–2158.
[760] F. Souza, R. Nogueira, and R. Lotufo, “Portuguese
named entity recognition using bert-crf,” arXiv
preprint arXiv:1909.10649, 2019.
[761] S. Pawar, G. K. Palshikar, and P. Bhattacharyya,
“Relation extraction: A survey,” arXiv preprint
arXiv:1712.05191, 2017.
[762] C. Walker and et al., “Ace 2005 multilingual training
corpus ldc2006t06,” Philadelphia, 2006.
[763] J. Gao, H. Zhao, C. Yu, and R. Xu, “Exploring the
feasibility of chatgpt for event extraction,” CoRR, vol.
abs/2303.03836, 2023.
[764] Y. Ma, Y. Cao, Y. Hong, and A. Sun, “Large language
model is not a good few-shot information extractor,
but a good reranker for hard samples!” CoRR, vol.
abs/2303.08559, 2023.
[765] R. Tang, X. Han, X. Jiang, and X. Hu, “Does synthetic
data generation of llms help clinical text mining?”
arXiv preprint arXiv:2303.04360, 2023.
[766] X. Wei, X. Cui, N. Cheng, X. Wang, X. Zhang,
S. Huang, P. Xie, J. Xu, Y. Chen, M. Zhang et al.,
“Zero-shot information extraction via chatting with
chatgpt,” arXiv preprint arXiv:2302.10205, 2023.
[767] A. Vaswani, S. Bengio, E. Brevdo, F. Chollet,
A. Gomez, S. Gouws, L. Jones, Ł. Kaiser, N. Kalch-
brenner, N. Parmar et al., “Tensor2tensor for neural
machine translation,” in Proceedings of the 13th Con-
ference of the Association for Machine Translation in the
Americas (Volume 1: Research Track), 2018, pp. 193–199.
[768] B. Zhang, B. Haddow, and A. Birch, “Prompting
large language model for machine translation: A case
study,” arXiv preprint arXiv:2301.07069, 2023.
[769] M. Ghazvininejad, H. Gonen, and L. Zettlemoyer,
“Dictionary-based phrase-level prompting of large
language models for machine translation,” arXiv
preprint arXiv:2302.07856, 2023.
[770] L. Wang, C. Lyu, T. Ji, Z. Zhang, D. Yu, S. Shi,
and Z. Tu, “Document-level machine transla-
tion with large language models,” arXiv preprint
arXiv:2304.02210, 2023.
[771] W. Jiao, J.-t. Huang, W. Wang, X. Wang, S. Shi, and
Z. Tu, “Parrot: Translating during chat using large
language models,” arXiv preprint arXiv:2304.02426,
2023.
[772] W. Yang, C. Li, J. Zhang, and C. Zong, “Bigtrans:
Augmenting large language models with multi-
lingual translation capability over 100 languages,”
arXiv preprint arXiv:2305.18098, 2023.
[773] J. Kocon, I. Cichecki, O. Kaszyca, M. Kochanek,
D. Szydlo, J. Baran, J. Bielaniewicz, M. Gruza,
A. Janz, K. Kanclerz, A. Kocon, B. Koptyra,
W. Mieleszczenko-Kowszewicz, P. Milkowski,
M. Oleksy, M. Piasecki, L. Radlinski, K. Wojtasik,
S. Wozniak, and P. Kazienko, “Chatgpt: Jack of all
trades, master of none,” CoRR, vol. abs/2302.10724,
2023.
[774] Q. Zhong, L. Ding, J. Liu, B. Du, and D. Tao,
“Can chatgpt understand too? A comparative study
on chatgpt and fine-tuned BERT,” CoRR, vol.
abs/2302.10198, 2023.
[775] D. Cheng, S. Huang, J. Bi, Y. Zhan, J. Liu, Y. Wang,
H. Sun, F. Wei, D. Deng, and Q. Zhang, “Uprise:
Universal prompt retrieval for improving zero-shot
evaluation,” arXiv preprint arXiv:2303.08518, 2023.
[776] R. Ren, Y. Qu, J. Liu, W. X. Zhao, Q. She, H. Wu,
H. Wang, and J.-R. Wen, “Rocketqav2: A joint train-
ing method for dense passage retrieval and pas-
sage re-ranking,” in Proceedings of the 2021 Conference
on Empirical Methods in Natural Language Processing,
2021, pp. 2825–2835.
[777] W. Sun, L. Yan, X. Ma, P. Ren, D. Yin, and Z. Ren,
“Is chatgpt good at search? investigating large lan-
guage models as re-ranking agent,” arXiv preprint
arXiv:2304.09542, 2023.
[778] Z. Qin, R. Jagerman, K. Hui, H. Zhuang, J. Wu,
J. Shen, T. Liu, J. Liu, D. Metzler, X. Wang et al.,
“Large language models are effective text rankers
with pairwise ranking prompting,” arXiv preprint
arXiv:2306.17563, 2023.
[779] S. Cho, S. Jeong, J. Seo, and J. C. Park, “Discrete
prompt optimization via constrained generation for
zero-shot re-ranker,” arXiv preprint arXiv:2305.13729,
2023.
[780] R. Tang, X. Zhang, X. Ma, J. Lin, and F. Ture,
“Found in the middle: Permutation self-consistency
improves listwise ranking in large language mod-
els,” arXiv preprint arXiv:2310.07712, 2023.
[781] X. Ma, X. Zhang, R. Pradeep, and J. Lin, “Zero-shot
listwise document reranking with a large language
model,” arXiv preprint arXiv:2305.02156, 2023.
[782] S. Zhuang, H. Zhuang, B. Koopman, and G. Zuccon,
“A setwise approach for effective and highly effi-
cient zero-shot ranking with large language models,”
arXiv preprint arXiv:2310.09497, 2023.
[783] H. Zhuang, Z. Qin, K. Hui, J. Wu, L. Yan, X. Wang,
and M. Berdersky, “Beyond yes and no: Improving
zero-shot llm rankers via scoring fine-grained rele-
vance labels,” arXiv preprint arXiv:2310.14122, 2023.
[784] N. Ziems, W. Yu, Z. Zhang, and M. Jiang, “Large
language models are built-in autoregressive search
engines,” arXiv preprint arXiv:2305.09612, 2023.
[785] X. Ma, L. Wang, N. Yang, F. Wei, and J. Lin, “Fine-
tuning llama for multi-stage text retrieval,” arXiv
preprint arXiv:2310.08319, 2023.
[786] R. Pradeep, S. Sharifymoghaddam, and J. Lin,
“Rankvicuna: Zero-shot listwise document rerank-
ing with open-source large language models,” arXiv
preprint arXiv:2309.15088, 2023.
[787] Y. Tay, V. Q. Tran, M. Dehghani, J. Ni, D. Bahri,
H. Mehta, Z. Qin, K. Hui, Z. Zhao, J. Gupta et al.,
“Transformer memory as a differentiable search in-
dex,” in Advances in Neural Information Processing
Systems, 2022.
[788] R. Ren, W. X. Zhao, J. Liu, H. Wu, J.-R. Wen,
and H. Wang, “TOME: A two-stage approach for
model-based retrieval,” in Proceedings of the 61st
Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers). Association
for Computational Linguistics, 2023, pp. 6102–6114.
[Online]. Available: https://aclanthology.org/2023.
acl-long.336
[789] Y. Qu, Y. Ding, J. Liu, K. Liu, R. Ren, W. X. Zhao,
D. Dong, H. Wu, and H. Wang, “Rocketqa: An op-
timized training approach to dense passage retrieval
for open-domain question answering,” in Proceedings
of the 2021 Conference of the North American Chapter
of the Association for Computational Linguistics: Human
Language Technologies, 2021, pp. 5835–5847.
[790] R. Ren, S. Lv, Y. Qu, J. Liu, W. X. Zhao, Q. She,
H. Wu, H. Wang, and J.-R. Wen, “Pair: Leverag-
ing passage-centric similarity relation for improving
dense passage retrieval,” in Findings of the Association
for Computational Linguistics: ACL-IJCNLP 2021, 2021,
pp. 2173–2183.
[791] Z. Peng, X. Wu, and Y. Fang, “Soft prompt tuning
for augmenting dense retrieval with large language
models,” arXiv preprint arXiv:2307.08303, 2023.
[792] Z. Dai, V. Y. Zhao, J. Ma, Y. Luan, J. Ni, J. Lu,
A. Bakalov, K. Guu, K. Hall, and M.-W. Chang,
“Promptagator: Few-shot dense retrieval from 8 ex-
amples,” in The Eleventh International Conference on
Learning Representations, 2023.
[793] A. Askari, M. Aliannejadi, E. Kanoulas, and S. Ver-
berne, “Generating synthetic documents for cross-
encoder re-rankers: A comparative study of chatgpt
and human experts,” arXiv preprint arXiv:2305.02320,
2023.
[794] K. Mao, Z. Dou, H. Chen, F. Mo, and H. Qian, “Large
language models know your contextual search in-
tent: A prompting framework for conversational
search,” arXiv preprint arXiv:2303.06573, 2023.
[795] L. Gao, X. Ma, J. Lin, and J. Callan, “Precise zero-
shot dense retrieval without relevance labels,” in
Proceedings of the 61st Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long Papers).
Association for Computational Linguistics, 2023, pp.
1762–1777.
[796] L. Wang, N. Yang, and F. Wei, “Query2doc: Query ex-
pansion with large language models,” arXiv preprint
arXiv:2303.07678, 2023.
[797] G. Ma, X. Wu, P. Wang, Z. Lin, and S. Hu, “Pre-
training with large language model-based document
expansion for dense passage retrieval,” arXiv preprint
arXiv:2308.08285, 2023.
[798] W. Sun, Z. Chen, X. Ma, L. Yan, S. Wang, P. Ren,
Z. Chen, D. Yin, and Z. Ren, “Instruction distilla-
tion makes large language models efficient zero-shot
rankers,” arXiv preprint arXiv:2311.01555, 2023.
[799] L. Wang, N. Yang, X. Huang, L. Yang, R. Ma-
jumder, and F. Wei, “Large search model: Redefin-
ing search stack in the era of llms,” arXiv preprint
arXiv:2310.14587, 2023.
[800] C. Li, Z. Gan, Z. Yang, J. Yang, L. Li, L. Wang,
and J. Gao, “Multimodal foundation models: From
specialists to general-purpose assistants,” CoRR, vol.
abs/2309.10020, 2023.
[801] W. X. Zhao, S. Mu, Y. Hou, Z. Lin, Y. Chen, X. Pan,
K. Li, Y. Lu, H. Wang, C. Tian, Y. Min, Z. Feng, X. Fan,
X. Chen, P. Wang, W. Ji, Y. Li, X. Wang, and J. Wen,
“Recbole: Towards a unified, comprehensive and ef-
ficient framework for recommendation algorithms,”
in CIKM, G. Demartini, G. Zuccon, J. S. Culpepper,
Z. Huang, and H. Tong, Eds. ACM, 2021, pp. 4653–
4664.
[802] K. Zhou, H. Wang, W. X. Zhao, Y. Zhu, S. Wang,
F. Zhang, Z. Wang, and J. Wen, “S3-rec: Self-
supervised learning for sequential recommendation
with mutual information maximization,” in CIKM,
M. d’Aquin, S. Dietze, C. Hauff, E. Curry, and
P. Cudré-Mauroux, Eds. ACM, 2020, pp. 1893–1902.
[803] W. X. Zhao, Y. Hou, X. Pan, C. Yang, Z. Zhang, Z. Lin,
J. Zhang, S. Bian, J. Tang, W. Sun, Y. Chen, L. Xu,
G. Zhang, Z. Tian, C. Tian, S. Mu, X. Fan, X. Chen,
and J. Wen, “Recbole 2.0: Towards a more up-to-date
recommendation library,” in CIKM, M. A. Hasan and
L. Xiong, Eds. ACM, 2022, pp. 4722–4726.
[804] L. Xu, Z. Tian, G. Zhang, J. Zhang, L. Wang, B. Zheng,
Y. Li, J. Tang, Z. Zhang, Y. Hou, X. Pan, W. X. Zhao,
X. Chen, and J. Wen, “Towards a more user-friendly
and easy-to-use benchmark library for recommender
systems,” in SIGIR, H. Chen, W. E. Duh, H. Huang,
M. P. Kato, J. Mothe, and B. Poblete, Eds. ACM,
2023, pp. 2837–2847.
[805] S. Rendle, C. Freudenthaler, Z. Gantner, and
L. Schmidt-Thieme, “BPR: bayesian personalized
ranking from implicit feedback,” CoRR, vol.
abs/1205.2618, 2012.
[806] W. Fan, Z. Zhao, J. Li, Y. Liu, X. Mei, Y. Wang, J. Tang,
and Q. Li, “Recommender systems in the era of large
language models (llms),” CoRR, 2023.
[807] L. Wu, Z. Zheng, Z. Qiu, H. Wang, H. Gu, T. Shen,
C. Qin, C. Zhu, H. Zhu, Q. Liu, H. Xiong, and
E. Chen, “A survey on large language models for
recommendation,” CoRR, 2023.
[808] Y. Gao, T. Sheng, Y. Xiang, Y. Xiong, H. Wang, and
J. Zhang, “Chat-rec: Towards interactive and explain-
able llms-augmented recommender system,” CoRR,
vol. abs/2303.14524, 2023.
[809] S. Dai, N. Shao, H. Zhao, W. Yu, Z. Si, C. Xu, Z. Sun,
X. Zhang, and J. Xu, “Uncovering chatgpt’s capabil-
ities in recommender systems,” in RecSys, J. Zhang,
L. Chen, S. Berkovsky, M. Zhang, T. D. Noia, J. Basil-
ico, L. Pizzato, and Y. Song, Eds. ACM, 2023, pp.
1126–1132.
[810] Y. Hou, J. Zhang, Z. Lin, H. Lu, R. Xie, J. J. McAuley,
and W. X. Zhao, “Large language models are zero-
shot rankers for recommender systems,” CoRR, 2023.
[811] J. Liu, C. Liu, R. Lv, K. Zhou, and Y. Zhang, “Is
chatgpt a good recommender? A preliminary study,”
CoRR, vol. abs/2304.10149, 2023.
[812] K. Bao, J. Zhang, Y. Zhang, W. Wang, F. Feng,
and X. He, “Tallrec: An effective and efficient tun-
ing framework to align large language model with
recommendation,” in RecSys, J. Zhang, L. Chen,
S. Berkovsky, M. Zhang, T. D. Noia, J. Basilico, L. Piz-
zato, and Y. Song, Eds. ACM, 2023, pp. 1007–1014.
[813] Y. Zhu, L. Wu, Q. Guo, L. Hong, and J. Li, “Col-
laborative large language model for recommender
systems,” arXiv preprint arXiv:2311.01343, 2023.
[814] B. Zheng, Y. Hou, H. Lu, Y. Chen, W. X.
Zhao, and J.-R. Wen, “Adapting large language
models by integrating collaborative semantics for
recommendation,” 2023. [Online]. Available: https:
//api.semanticscholar.org/CorpusID:265213194
[815] Y. Xi, W. Liu, J. Lin, J. Zhu, B. Chen, R. Tang,
W. Zhang, R. Zhang, and Y. Yu, “Towards open-
world recommendation with knowledge augmen-
tation from large language models,” CoRR, vol.
abs/2306.10933, 2023.
[816] Q. Liu, N. Chen, T. Sakai, and X. Wu, “A first look
at llm-powered generative news recommendation,”
CoRR, vol. abs/2305.06566, 2023.
[817] R. Li, W. Deng, Y. Cheng, Z. Yuan, J. Zhang,
and F. Yuan, “Exploring the upper limits of
text-based collaborative filtering using large lan-
guage models: Discoveries and insights,” CoRR, vol.
abs/2305.11700, 2023.
[818] W. Wei, X. Ren, J. Tang, Q. Wang, L. Su, S. Cheng,
J. Wang, D. Yin, and C. Huang, “Llmrec: Large lan-
guage models with graph augmentation for recom-
mendation,” CoRR, vol. abs/2311.00423, 2023.
[819] X. Li, B. Chen, L. Hou, and R. Tang, “Ctrl: Connect
tabular and language model for ctr prediction,” arXiv
preprint arXiv:2306.02841, 2023.
[820] A. Muhamed, I. Keivanloo, S. Perera, J. Mracek,
Y. Xu, Q. Cui, S. Rajagopalan, B. Zeng, and
T. Chilimbi, “Ctr-bert: Cost-effective knowledge dis-
tillation for billion-parameter teacher models,” in
NeurIPS Efficient Natural Language and Speech Process-
ing Workshop, 2021.
[821] L. Wang, C. Ma, X. Feng, Z. Zhang, H. Yang,
J. Zhang, Z. Chen, J. Tang, X. Chen, Y. Lin, W. X.
Zhao, Z. Wei, and J. Wen, “A survey on large lan-
guage model based autonomous agents,” CoRR, vol.
abs/2308.11432, 2023.
[822] L. Wang, J. Zhang, X. Chen, Y. Lin, R. Song, W. X.
Zhao, and J. Wen, “Recagent: A novel simulation
paradigm for recommender systems,” CoRR, vol.
abs/2306.02552, 2023.
[823] E. Ie, C. Hsu, M. Mladenov, V. Jain, S. Narvekar,
J. Wang, R. Wu, and C. Boutilier, “Recsim: A con-
figurable simulation platform for recommender sys-
tems,” CoRR, vol. abs/1909.04847, 2019.
[824] J. Zhang, Y. Hou, R. Xie, W. Sun, J. J. McAuley,
W. X. Zhao, L. Lin, and J. Wen, “Agentcf: Collabora-
tive learning with autonomous language agents for
recommender systems,” CoRR, vol. abs/2310.09233,
2023.
[825] A. Zhang, L. Sheng, Y. Chen, H. Li, Y. Deng, X. Wang,
and T. Chua, “On generative agents in recommenda-
tion,” CoRR, vol. abs/2310.10108, 2023.
[826] Y. Du, Z. Liu, J. Li, and W. X. Zhao, “A survey of
vision-language pre-trained models,” in Proceedings
of the Thirty-First International Joint Conference on Ar-
tificial Intelligence, IJCAI 2022, Vienna, Austria, 23-29
July 2022, L. D. Raedt, Ed. ijcai.org, 2022, pp. 5436–
5443.
[827] Z. Gan, L. Li, C. Li, L. Wang, Z. Liu, and J. Gao,
“Vision-language pre-training: Basics, recent ad-
vances, and future trends,” Found. Trends Comput.
Graph. Vis., vol. 14, no. 3-4, pp. 163–352, 2022.
[828] P. K. Rubenstein, C. Asawaroengchai, D. D. Nguyen,
A. Bapna, Z. Borsos, F. de Chaumont Quitry, P. Chen,
D. E. Badawy, W. Han, E. Kharitonov et al., “Au-
diopalm: A large language model that can speak and
listen,” CoRR, 2023.
[829] J. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr,
Y. Hasson, K. Lenc, A. Mensch, K. Millican,
M. Reynolds, R. Ring, E. Rutherford, S. Cabi, T. Han,
Z. Gong, S. Samangooei, M. Monteiro, J. L. Menick,
S. Borgeaud, A. Brock, A. Nematzadeh, S. Shar-
ifzadeh, M. Binkowski, R. Barreira, O. Vinyals,
A. Zisserman, and K. Simonyan, “Flamingo: a visual
language model for few-shot learning,” in NeurIPS,
2022.
[830] C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon,
R. Wightman, M. Cherti, T. Coombes, A. Katta,
C. Mullis, M. Wortsman, P. Schramowski, S. Kun-
durthy, K. Crowson, L. Schmidt, R. Kaczmarczyk,
and J. Jitsev, “LAION-5B: an open large-scale dataset
for training next generation image-text models,” in
NeurIPS, 2022.
[831] S. Changpinyo, P. Sharma, N. Ding, and R. Soricut,
“Conceptual 12m: Pushing web-scale image-text pre-
training to recognize long-tail visual concepts,” in
IEEE Conference on Computer Vision and Pattern Recog-
nition, CVPR 2021, virtual, June 19-25, 2021. Com-
puter Vision Foundation / IEEE, 2021, pp. 3558–3568.
[832] Q. Ye, H. Xu, G. Xu, J. Ye, M. Yan, Y. Zhou, J. Wang,
A. Hu, P. Shi, Y. Shi, C. Li, Y. Xu, H. Chen, J. Tian,
Q. Qi, J. Zhang, and F. Huang, “mplug-owl: Mod-
ularization empowers large language models with
multimodality,” CoRR, vol. abs/2304.14178, 2023.
[833] J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang,
J. Lin, C. Zhou, and J. Zhou, “Qwen-vl: A frontier
large vision-language model with versatile abilities,”
CoRR, vol. abs/2308.12966, 2023.
[834] H. Liu, C. Li, Y. Li, and Y. J. Lee, “Improved base-
lines with visual instruction tuning,” CoRR, vol.
abs/2310.03744, 2023.
[835] P. Zhang, X. Dong, B. Wang, Y. Cao, C. Xu,
L. Ouyang, Z. Zhao, S. Ding, S. Zhang, H. Duan,
W. Zhang, H. Yan, X. Zhang, W. Li, J. Li,
K. Chen, C. He, X. Zhang, Y. Qiao, D. Lin, and
J. Wang, “Internlm-xcomposer: A vision-language
large model for advanced text-image comprehension
and composition,” CoRR, vol. abs/2309.15112, 2023.
[836] K. Chen, Z. Zhang, W. Zeng, R. Zhang, F. Zhu, and
R. Zhao, “Shikra: Unleashing multimodal llm’s ref-
erential dialogue magic,” CoRR, vol. abs/2306.15195,
2023.
[837] F. Liu, K. Lin, L. Li, J. Wang, Y. Yacoob, and L. Wang,
“Aligning large multi-modal model with robust in-
struction tuning,” CoRR, vol. abs/2306.14565, 2023.
[838] Y. Du, H. Guo, K. Zhou, W. X. Zhao, J. Wang,
C. Wang, M. Cai, R. Song, and J.-R. Wen, “What
makes for good visual instructions? synthesizing
complex visual reasoning instructions for visual in-
struction tuning,” 2023.
[839] D. Gurari, Q. Li, A. J. Stangl, A. Guo, C. Lin,
K. Grauman, J. Luo, and J. P. Bigham, “Vizwiz grand
challenge: Answering visual questions from blind
people,” in CVPR. Computer Vision Foundation
/ IEEE Computer Society, 2018, pp. 3608–3617.
[840] A. Mishra, K. Alahari, and C. V. Jawahar, “Top-down
and bottom-up cues for scene text recognition,” in
CVPR. IEEE Computer Society, 2012, pp. 2687–2694.
[841] Y. Liu, H. Duan, Y. Zhang, B. Li, S. Zhang, W. Zhao,
Y. Yuan, J. Wang, C. He, Z. Liu, K. Chen, and
D. Lin, “Mmbench: Is your multi-modal model an
all-around player?” CoRR, vol. abs/2307.06281, 2023.
[842] C. Fu, P. Chen, Y. Shen, Y. Qin, M. Zhang, X. Lin,
Z. Qiu, W. Lin, J. Yang, X. Zheng, K. Li, X. Sun, and
R. Ji, “MME: A comprehensive evaluation bench-
mark for multimodal large language models,” CoRR,
vol. abs/2306.13394, 2023.
[843] Y. Zhang, Y. Li, L. Cui, D. Cai, L. Liu, T. Fu, X. Huang,
E. Zhao, Y. Zhang, Y. Chen, L. Wang, A. T. Luu, W. Bi,
F. Shi, and S. Shi, “Siren’s song in the AI ocean: A
survey on hallucination in large language models,”
CoRR, vol. abs/2309.01219, 2023.
[844] A. Gunjal, J. Yin, and E. Bas, “Detecting and prevent-
ing hallucinations in large vision language models,”
CoRR, vol. abs/2308.06394, 2023.
[845] J. Lu, J. Rao, K. Chen, X. Guo, Y. Zhang, B. Sun,
C. Yang, and J. Yang, “Evaluation and mitigation
of agnosia in multimodal large language models,”
CoRR, vol. abs/2309.04041, 2023.
[846] A. Rohrbach, L. A. Hendricks, K. Burns, T. Darrell,
and K. Saenko, “Object hallucination in image cap-
tioning,” in EMNLP. Association for Computational
Linguistics, 2018, pp. 4035–4045.
[847] Y. Li, Y. Du, K. Zhou, J. Wang, W. X. Zhao, and
J.-R. Wen, “Evaluating object hallucination in large
vision-language models,” in The 2023 Conference on
Empirical Methods in Natural Language Processing,
2023. [Online]. Available: https://openreview.net/
forum?id=xozJw0kZXF
[848] D. A. Hudson and C. D. Manning, “GQA: A new
dataset for real-world visual reasoning and compo-
sitional question answering,” in CVPR. Computer
Vision Foundation / IEEE, 2019, pp. 6700–6709.
[849] P. Lu, S. Mishra, T. Xia, L. Qiu, K. Chang, S. Zhu,
O. Tafjord, P. Clark, and A. Kalyan, “Learn to explain:
Multimodal reasoning via thought chains for science
question answering,” in NeurIPS, 2022.
[850] A. Singh, V. Natarjan, M. Shah, Y. Jiang, X. Chen,
D. Parikh, and M. Rohrbach, “Towards vqa models
that can read,” in Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, 2019, pp.
8317–8326.
[851] F. Liu, T. Guan, Z. Li, L. Chen, Y. Yacoob,
D. Manocha, and T. Zhou, “Hallusionbench: You
see what you think? or you think what you see?
an image-context reasoning benchmark challenging
for gpt-4v(ision), llava-1.5, and other multi-modality
models,” CoRR, vol. abs/2310.14566, 2023.
[852] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra,
C. L. Zitnick, and D. Parikh, “VQA: visual question
answering,” in ICCV. IEEE Computer Society, 2015,
pp. 2425–2433.
[853] R. Vedantam, C. L. Zitnick, and D. Parikh, “Cider:
Consensus-based image description evaluation,” in
CVPR. IEEE Computer Society, 2015, pp. 4566–4575.
[854] H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction
tuning,” CoRR, vol. abs/2304.08485, 2023.
[855] P. Xu, W. Shao, K. Zhang, P. Gao, S. Liu, M. Lei,
F. Meng, S. Huang, Y. Qiao, and P. Luo, “Lvlm-ehub:
A comprehensive evaluation benchmark for large
vision-language models,” CoRR, vol. abs/2306.09265,
2023.
[856] Z. Li, Y. Wang, M. Du, Q. Liu, B. Wu, J. Zhang,
C. Zhou, Z. Fan, J. Fu, J. Chen, X. Huang, and
Z. Wei, “Reform-eval: Evaluating large vision lan-
guage models via unified re-formulation of task-
oriented benchmarks,” CoRR, vol. abs/2310.02569,
2023.
[857] B. Li, R. Wang, G. Wang, Y. Ge, Y. Ge, and
Y. Shan, “Seed-bench: Benchmarking multimodal
llms with generative comprehension,” CoRR, vol.
abs/2307.16125, 2023.
[858] W. Yu, Z. Yang, L. Li, J. Wang, K. Lin, Z. Liu, X. Wang,
and L. Wang, “Mm-vet: Evaluating large multi-
modal models for integrated capabilities,” CoRR, vol.
abs/2308.02490, 2023.
[859] J. Wang, L. Meng, Z. Weng, B. He, Z. Wu, and
Y. Jiang, “To see is to believe: Prompting GPT-
4V for better visual instruction tuning,” CoRR, vol.
abs/2311.07574, 2023.
[860] Y. Zhang, R. Zhang, J. Gu, Y. Zhou, N. Lipka, D. Yang,
and T. Sun, “Llavar: Enhanced visual instruction tun-
ing for text-rich image understanding,” arXiv preprint
arXiv:2306.17107, 2023.
[861] X. Qi, K. Huang, A. Panda, M. Wang, and P. Mittal,
“Visual adversarial examples jailbreak aligned large
language models,” in The Second Workshop on New
Frontiers in Adversarial Machine Learning, 2023.
[862] Y. Zhou, C. Cui, J. Yoon, L. Zhang, Z. Deng, C. Finn,
M. Bansal, and H. Yao, “Analyzing and mitigating
object hallucination in large vision-language mod-
els,” arXiv preprint arXiv:2310.00754, 2023.
[863] Z. Sun, S. Shen, S. Cao, H. Liu, C. Li, Y. Shen, C. Gan,
L.-Y. Gui, Y.-X. Wang, Y. Yang et al., “Aligning large
multimodal models with factually augmented rlhf,”
arXiv preprint arXiv:2309.14525, 2023.
[864] E. Jiménez-Ruiz, O. Hassanzadeh, V. Efthymiou,
J. Chen, and K. Srinivas, “Semtab 2019: Resources to
benchmark tabular data to knowledge graph match-
ing systems,” in The Semantic Web - 17th International
Conference, ESWC 2020, Heraklion, Crete, Greece, May
31-June 4, 2020, Proceedings, ser. Lecture Notes in
Computer Science, vol. 12123. Springer, 2020, pp.
514–530.
[865] S. Pan, L. Luo, Y. Wang, C. Chen, J. Wang, and X. Wu,
“Unifying large language models and knowledge
graphs: A roadmap,” CoRR, vol. abs/2306.08302,
2023.
[866] Y. Sun, S. Wang, S. Feng, S. Ding, C. Pang,
J. Shang, J. Liu, X. Chen, Y. Zhao, Y. Lu,
W. Liu, Z. Wu, W. Gong, J. Liang, Z. Shang,
P. Sun, W. Liu, X. Ouyang, D. Yu, H. Tian,
H. Wu, and H. Wang, “ERNIE 3.0: Large-
scale knowledge enhanced pre-training for
language understanding and generation,” CoRR,
vol. abs/2107.02137, 2021. [Online]. Available:
https://arxiv.org/abs/2107.02137
[867] Z. Zhang, X. Han, Z. Liu, X. Jiang, M. Sun, and
Q. Liu, “ERNIE: enhanced language representation
with informative entities,” in Proceedings of the 57th
Conference of the Association for Computational Linguis-
tics, ACL 2019, Florence, Italy, July 28- August 2, 2019,
Volume 1: Long Papers. Association for Computa-
tional Linguistics, 2019, pp. 1441–1451.
[868] X. Wang, T. Gao, Z. Zhu, Z. Zhang, Z. Liu, J. Li,
and J. Tang, “KEPLER: A unified model for knowl-
edge embedding and pre-trained language represen-
tation,” Trans. Assoc. Comput. Linguistics, vol. 9, pp.
176–194, 2021.
[869] J. Zhang, X. Zhang, J. Yu, J. Tang, J. Tang, C. Li,
and H. Chen, “Subgraph retrieval enhanced model
for multi-hop knowledge base question answering,”
in Proceedings of the 60th Annual Meeting of the As-
sociation for Computational Linguistics (Volume 1: Long
Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022.
Association for Computational Linguistics, 2022, pp.
5773–5784.
[870] P. Ke, H. Ji, Y. Ran, X. Cui, L. Wang, L. Song, X. Zhu,
and M. Huang, “Jointgt: Graph-text joint represen-
tation learning for text generation from knowledge
graphs,” in Findings of the Association for Compu-
tational Linguistics: ACL/IJCNLP 2021, Online Event,
August 1-6, 2021, ser. Findings of ACL, vol. ACL/I-
JCNLP 2021. Association for Computational Lin-
guistics, 2021, pp. 2526–2538.
[871] O. Agarwal, H. Ge, S. Shakeri, and R. Al-Rfou,
“Large scale knowledge graph based synthetic cor-
pus generation for knowledge-enhanced language
model pre-training,” CoRR, vol. abs/2010.12688,
2020.
[872] W. Chen, Y. Su, X. Yan, and W. Y. Wang, “KGPT:
knowledge-grounded pre-training for data-to-text
generation,” in Proceedings of the 2020 Conference
on Empirical Methods in Natural Language Processing,
EMNLP 2020, Online, November 16-20, 2020. Associ-
ation for Computational Linguistics, 2020, pp. 8635–
8648.
[873] Y. Gu, X. Deng, and Y. Su, “Don’t generate, discrim-
inate: A proposal for grounding language models to
real-world environments,” in Proceedings of the 61st
Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), ACL 2023, Toronto,
Canada, July 9-14, 2023. Association for Computa-
tional Linguistics, 2023, pp. 4928–4949.
[874] L. Luo, Y. Li, G. Haffari, and S. Pan, “Reasoning
on graphs: Faithful and interpretable large language
model reasoning,” CoRR, vol. abs/2310.01061, 2023.
[875] Y. Lan and J. Jiang, “Query graph generation for an-
swering multi-hop complex questions from knowl-
edge bases,” in Proceedings of the 58th Annual Meeting
of the Association for Computational Linguistics, ACL
2020, Online, July 5-10, 2020. Association for Compu-
tational Linguistics, 2020, pp. 969–974.
[876] P. Wang, N. Zhang, X. Xie, Y. Yao, B. Tian,
M. Wang, Z. Xi, S. Cheng, K. Liu, G. Zheng, and
H. Chen, “Easyedit: An easy-to-use knowledge edit-
ing framework for large language models,” CoRR,
vol. abs/2308.07269, 2023.
[877] Y. Yao, P. Wang, B. Tian, S. Cheng, Z. Li, S. Deng,
H. Chen, and N. Zhang, “Editing large language
models: Problems, methods, and opportunities,”
CoRR, vol. abs/2305.13172, 2023.
[878] S. Choi, T. Fang, Z. Wang, and Y. Song, “KCTS:
knowledge-constrained tree search decoding with
token-level hallucination detection,” CoRR, vol.
abs/2310.09044, 2023.
[879] S. Zhang, L. Pan, J. Zhao, and W. Y. Wang, “Mit-
igating language model hallucination with inter-
active question-knowledge alignment,” CoRR, vol.
abs/2305.13669, 2023.
[880] Y. Zhu, X. Wang, J. Chen, S. Qiao, Y. Ou,
Y. Yao, S. Deng, H. Chen, and N. Zhang, “Llms
for knowledge graph construction and reasoning:
Recent capabilities and future opportunities,” CoRR,
vol. abs/2305.13168, 2023. [Online]. Available:
https://doi.org/10.48550/arXiv.2305.13168
[881] M. Karpinska, N. Akoury, and M. Iyyer, “The perils
of using mechanical turk to evaluate open-ended
text generation,” in Proceedings of the 2021 Conference
on Empirical Methods in Natural Language Processing,
EMNLP 2021, Virtual Event / Punta Cana, Dominican
Republic, 7-11 November, 2021, M. Moens, X. Huang,
L. Specia, and S. W. Yih, Eds. Association for
Computational Linguistics, 2021, pp. 1265–1285.
[882] H. Lee, S. Phatale, H. Mansoor, K. Lu, T. Mesnard,
C. Bishop, V. Carbune, and A. Rastogi, “RLAIF:
scaling reinforcement learning from human feedback
with AI feedback,” CoRR, vol. abs/2309.00267, 2023.
[883] G. Cui, L. Yuan, N. Ding, G. Yao, W. Zhu, Y. Ni,
G. Xie, Z. Liu, and M. Sun, “Ultrafeedback: Boosting
language models with high-quality feedback,” CoRR,
vol. abs/2310.01377, 2023.
[884] X. Wang, Z. Wang, J. Liu, Y. Chen, L. Yuan, H. Peng,
and H. Ji, “MINT: evaluating llms in multi-turn in-
teraction with tools and language feedback,” CoRR,
vol. abs/2309.10691, 2023.
[885] S. Saha, O. Levy, A. Celikyilmaz, M. Bansal, J. We-
ston, and X. Li, “Branch-solve-merge improves large
language model evaluation and generation,” CoRR,
vol. abs/2310.15123, 2023.
[886] X. Zhang, B. Yu, H. Yu, Y. Lv, T. Liu, F. Huang, H. Xu,
and Y. Li, “Wider and deeper LLM networks are
fairer LLM evaluators,” CoRR, vol. abs/2308.01862,
2023.
[887] C. Chan, W. Chen, Y. Su, J. Yu, W. Xue, S. Zhang, J. Fu,
and Z. Liu, “Chateval: Towards better llm-based
evaluators through multi-agent debate,” CoRR, vol.
abs/2308.07201, 2023.
[888] R. Li, T. Patel, and X. Du, “PRD: peer rank and dis-
cussion improve large language model based evalu-
ations,” CoRR, vol. abs/2307.02762, 2023.
[889] L. Zhu, X. Wang, and X. Wang, “Judgelm: Fine-tuned
large language models are scalable judges,” CoRR,
vol. abs/2310.17631, 2023.
[890] Z. Zeng, J. Yu, T. Gao, Y. Meng, T. Goyal,
and D. Chen, “Evaluating large language mod-
els at evaluating instruction following,” CoRR, vol.
abs/2310.07641, 2023.
[891] R. Koo, M. Lee, V. Raheja, J. I. Park, Z. M. Kim,
and D. Kang, “Benchmarking cognitive biases in
large language models as evaluators,” CoRR, vol.
abs/2309.17012, 2023.
[892] P. West, X. Lu, N. Dziri, F. Brahman, L. Li,
J. D. Hwang, L. Jiang, J. Fisher, A. Ravichander,
K. Chandu, B. Newman, P. W. Koh, A. Ettinger,
and Y. Choi, “The generative AI paradox: ”what
it can create, it may not understand”,” CoRR, vol.
abs/2311.00059, 2023.
[893] J. Huang, X. Chen, S. Mishra, H. S. Zheng, A. W.
Yu, X. Song, and D. Zhou, “Large language mod-
els cannot self-correct reasoning yet,” CoRR, vol.
abs/2310.01798, 2023.
[894] K. Stechly, M. Marquez, and S. Kambhampati, “GPT-
4 doesn’t know it’s wrong: An analysis of itera-
tive prompting for reasoning problems,” CoRR, vol.
abs/2310.12397, 2023.
[895] O. Nov, N. Singh, and D. M. Mann, “Putting chat-
gpt’s medical advice to the (turing) test,” CoRR, vol.
abs/2301.10035, 2023.
[896] K. Yang, S. Ji, T. Zhang, Q. Xie, and S. Anani-
adou, “On the evaluations of chatgpt and emotion-
enhanced prompting for mental health analysis,”
CoRR, vol. abs/2304.03347, 2023.
[897] K. Jeblick, B. Schachtner, J. Dexl, A. Mittermeier,
A. T. Stüber, J. Topalis, T. Weber, P. Wesp, B. O.
Sabel, J. Ricke, and M. Ingrisch, “Chatgpt makes
medicine easy to swallow: An exploratory case
study on simplified radiology reports,” CoRR, vol.
abs/2212.14882, 2022.
[898] K. Singhal, T. Tu, J. Gottweis, R. Sayres, E. Wul-
czyn, L. Hou, K. Clark, S. Pfohl, H. Cole-Lewis,
D. Neal, M. Schaekermann, A. Wang, M. Amin,
S. Lachgar, P. A. Mansfield, S. Prakash, B. Green,
E. Dominowska, B. A. y Arcas, N. Tomasev, Y. Liu,
R. Wong, C. Semturs, S. S. Mahdavi, J. K. Barral,
D. R. Webster, G. S. Corrado, Y. Matias, S. Azizi,
A. Karthikesalingam, and V. Natarajan, “Towards
expert-level medical question answering with large
language models,” CoRR, vol. abs/2305.09617, 2023.
[899] S. Yang, H. Zhao, S. Zhu, G. Zhou, H. Xu, Y. Jia, and
H. Zan, “Zhongjing: Enhancing the chinese medical
capabilities of large language model through expert
feedback and real-world multi-turn dialogue,” CoRR,
vol. abs/2308.03549, 2023.
[900] S. Chen, B. H. Kann, M. B. Foote, H. J. Aerts,
G. K. Savova, R. H. Mak, and D. S. Bitterman, “The
utility of chatgpt for cancer treatment information,”
medRxiv, 2023.
[901] K. Malinka, M. Perešíni, A. Firc, O. Hujnak, and
F. Janus, “On the educational impact of chatgpt:
Is artificial intelligence ready to obtain a university
degree?” CoRR, vol. abs/2303.11146, 2023.
[902] T. Susnjak, “Chatgpt: The end of online exam in-
tegrity?” CoRR, vol. abs/2212.09292, 2022.
[903] K. Tan, T. Pang, and C. Fan, “Towards applying
powerful large ai models in classroom teaching: Op-
portunities, challenges and prospects,” 2023.
[904] F. Kamalov and I. Gurrib, “A new era of artificial
intelligence in education: A multifaceted revolution,”
CoRR, vol. abs/2305.18303, 2023.
[905] E. Kasneci, K. Seßler, S. Küchemann, M. Bannert,
D. Dementieva, F. Fischer, U. Gasser, G. Groh,
S. Günnemann, E. Hüllermeier et al., “Chatgpt for
good? on opportunities and challenges of large lan-
guage models for education,” Learning and Individual
Differences, vol. 103, p. 102274, 2023.
[906] A. Blair-Stanek, N. Holzenberger, and B. V. Durme,
“Can GPT-3 perform statutory reasoning?” CoRR,
vol. abs/2302.06100, 2023.
[907] D. Trautmann, A. Petrova, and F. Schilder, “Legal
prompt engineering for multilingual legal judgement
prediction,” CoRR, vol. abs/2212.02199, 2022.
[908] J. H. Choi, K. E. Hickman, A. Monahan, and
D. Schwarcz, “Chatgpt goes to law school,” Available
at SSRN, 2023.
[909] J. J. Nay, “Law informs code: A legal informatics
approach to aligning artificial intelligence with hu-
mans,” CoRR, vol. abs/2209.13020, 2022.
[910] F. Yu, L. Quartey, and F. Schilder, “Legal prompting:
Teaching a language model to think like a lawyer,”
CoRR, vol. abs/2212.01326, 2022.
[911] D. Trautmann, A. Petrova, and F. Schilder, “Legal
prompt engineering for multilingual legal judgement
prediction,” CoRR, vol. abs/2212.02199, 2022.
[912] A. Tamkin, M. Brundage, J. Clark, and D. Ganguli,
“Understanding the capabilities, limitations, and so-
cietal impact of large language models,” CoRR, vol.
abs/2102.02503, 2021.
[913] Z. Sun, “A short survey of viewing large language
models in legal aspect,” CoRR, vol. abs/2303.09136,
2023.
[914] A. Abid, M. Farooqi, and J. Zou, “Persistent anti-
muslim bias in large language models,” in AIES
’21: AAAI/ACM Conference on AI, Ethics, and Society,
Virtual Event, USA, May 19-21, 2021, M. Fourcade,
B. Kuipers, S. Lazar, and D. K. Mulligan, Eds. ACM,
2021, pp. 298–306.
[915] A. Shah and S. Chava, “Zero is not hero yet: Bench-
marking zero-shot performance of llms for financial
tasks,” CoRR, vol. abs/2305.16633, 2023.
[916] D. Araci, “Finbert: Financial sentiment analysis
with pre-trained language models,” CoRR, vol.
abs/1908.10063, 2019.
[917] J. C. S. Alvarado, K. Verspoor, and T. Baldwin,
“Domain adaption of named entity recognition to
support credit risk assessment,” in Proceedings of
the Australasian Language Technology Association Work-
shop, ALTA 2015, Parramatta, Australia, December 8 - 9,
2015, B. Hachey and K. Webster, Eds. ACL, 2015,
pp. 84–90.
[918] G. Son, H. Jung, M. Hahm, K. Na, and S. Jin, “Beyond
classification: Financial reasoning in state-of-the-art
language models,” CoRR, vol. abs/2305.01505, 2023.
[919] X. Zhang, Q. Yang, and D. Xu, “Xuanyuan 2.0: A
large chinese financial chat model with hundreds of
billions parameters,” arXiv preprint arXiv:2305.12002,
2023.
[920] H. Yang, X.-Y. Liu, and C. D. Wang, “Fingpt: Open-
source financial large language models,” CoRR, vol.
abs/2306.06031, 2023.
[921] Q. Jin, B. Dhingra, Z. Liu, W. W. Cohen, and X. Lu,
“Pubmedqa: A dataset for biomedical research ques-
tion answering,” in Proceedings of the 2019 Conference
on Empirical Methods in Natural Language Processing
and the 9th International Joint Conference on Natu-
ral Language Processing, EMNLP-IJCNLP 2019, Hong
Kong, China, November 3-7, 2019, 2019, pp. 2567–2577.
[922] A. Krithara, A. Nentidis, K. Bougiatiotis, and
G. Paliouras, “Bioasq-qa: A manually curated corpus
for biomedical question answering,” 2022.
[923] Z. Bi, N. Zhang, Y. Xue, Y. Ou, D. Ji, G. Zheng,
and H. Chen, “Oceangpt: A large language model
for ocean science tasks,” CoRR, vol. abs/2310.02031,
2023.
[924] C. Zhang, C. Zhang, C. Li, Y. Qiao, S. Zheng, S. K.
Dam, M. Zhang, J. U. Kim, S. T. Kim, J. Choi, G. Park,
S. Bae, L. Lee, P. Hui, I. S. Kweon, and C. S. Hong,
“One small step for generative ai, one giant leap for
AGI: A complete survey on chatgpt in AIGC era,”
CoRR, vol. abs/2304.06488, 2023.
[925] M. Haman and M. Skolnik, “Using chatgpt to con-
duct a literature review.” Accountability in research,
2023.
[926] Ö. Aydın and E. Karaarslan, “Openai chatgpt gen-
erated literature review: Digital twin in healthcare,”
SSRN Electronic Journal, 2022.
[927] Y. J. Park, D. Kaplan, Z. Ren, C. Hsu, C. Li, H. Xu,
S. Li, and J. Li, “Can chatgpt be used to generate
scientific hypotheses?” CoRR, vol. abs/2304.12208,
2023.
[928] M. M. Hassan, R. A. Knipper, and S. K. K. Santu,
“Chatgpt as your personal data scientist,” CoRR, vol.
abs/2305.13657, 2023.
[929] L. Cheng, X. Li, and L. Bing, “Is GPT-4 a good data
analyst?” CoRR, vol. abs/2305.15038, 2023.
[930] H. Alkaissi and S. I. McFarlane, “Artificial hallucinations in
chatgpt: Implications in scientific writing,” PubMed,
2023.
[931] A. Azaria, R. Azoulay, and S. Reches, “Chatgpt
is a remarkable tool – for experts,” CoRR, vol.
abs/2306.03102, 2023.
[932] O. O. Buruk, “Academic writing with GPT-3.5: reflec-
tions on practices, efficacy and transparency,” CoRR,
vol. abs/2304.11079, 2023.
[933] R. Liu and N. B. Shah, “Reviewergpt? an exploratory
study on using large language models for paper
reviewing,” CoRR, vol. abs/2306.00622, 2023.
[934] M. Kosinski, “Theory of mind may have sponta-
neously emerged in large language models,” CoRR,
vol. abs/2302.02083, 2023.
[935] M. M. Amin, E. Cambria, and B. W. Schuller, “Will
affective computing emerge from foundation models
and general ai? A first evaluation on chatgpt,” CoRR,
vol. abs/2303.03186, 2023.
[936] G. Sridhara, R. H. G., and S. Mazumdar, “Chatgpt: A
study on its utility for ubiquitous software engineer-
ing tasks,” CoRR, vol. abs/2305.16837, 2023.
[937] W. Sun, C. Fang, Y. You, Y. Miao, Y. Liu, Y. Li,
G. Deng, S. Huang, Y. Chen, Q. Zhang, H. Qian,
Y. Liu, and Z. Chen, “Automatic code summariza-
tion via chatgpt: How far are we?” CoRR, vol.
abs/2305.12865, 2023.
[938] C. S. Xia and L. Zhang, “Conversational automated
program repair,” CoRR, vol. abs/2301.13246, 2023.
[939] A. Kazemnejad, I. Padhi, K. N. Ramamurthy, P. Das,
and S. Reddy, “The impact of positional encoding
on length generalization in transformers,” CoRR, vol.
abs/2305.19466, 2023.
[940] W. Xiong, J. Liu, I. Molybog, H. Zhang, P. Bhargava,
R. Hou, L. Martin, R. Rungta, K. A. Sankararaman,
B. Oguz, M. Khabsa, H. Fang, Y. Mehdad, S. Narang,
K. Malik, A. Fan, S. Bhosale, S. Edunov, M. Lewis,
S. Wang, and H. Ma, “Effective long-context scaling
of foundation models,” CoRR, vol. abs/2309.16039,
2023.
[941] kaiokendev, “Things I’m learning while training su-
perhot.” 2023.
[942] Z. Dong, T. Tang, J. Li, W. X. Zhao, and J. Wen,
“BAMBOO: A comprehensive benchmark for evalu-
ating long text modeling capacities of large language
models,” CoRR, vol. abs/2309.13345, 2023.
[943] J. Su. (2023) Transformer upgrade path: 12, infinite
extrapolation of rerope?
[944] A. Pal, D. Karkhanis, M. Roberts, S. Dooley, A. Sun-
dararajan, and S. Naidu, “Giraffe: Adventures in
expanding context lengths in llms,” CoRR, vol.
abs/2308.10882, 2023.
[945] G. Izacard and E. Grave, “Leveraging passage re-
trieval with generative models for open domain
question answering,” in Proceedings of the 16th Con-
ference of the European Chapter of the Association for
Computational Linguistics: Main Volume, EACL 2021,
Online, April 19 - 23, 2021. Association for Compu-
tational Linguistics, 2021, pp. 874–880.
[946] N. Ratner, Y. Levine, Y. Belinkov, O. Ram, I. Magar,
O. Abend, E. Karpas, A. Shashua, K. Leyton-Brown,
and Y. Shoham, “Parallel context windows for large
language models,” in Proceedings of the 61st Annual
Meeting of the Association for Computational Linguistics
(Volume 1: Long Papers), ACL 2023, Toronto, Canada,
July 9-14, 2023. Association for Computational
Linguistics, 2023, pp. 6383–6402.
[947] I. Beltagy, M. E. Peters, and A. Cohan, “Long-
former: The long-document transformer,” CoRR, vol.
abs/2004.05150, 2020.
[948] G. Xiao, Y. Tian, B. Chen, S. Han, and M. Lewis,
“Efficient streaming language models with attention
sinks,” CoRR, vol. abs/2309.17453, 2023.
[949] N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilac-
qua, F. Petroni, and P. Liang, “Lost in the middle:
How language models use long contexts,” Transac-
tions of the Association for Computational Linguistics,
vol. 12, pp. 157–173, 2024.
[950] C. Han, Q. Wang, W. Xiong, Y. Chen, H. Ji, and
S. Wang, “Lm-infinite: Simple on-the-fly length gen-
eralization for large language models,” CoRR, vol.
abs/2308.16137, 2023.
[951] A. Bertsch, U. Alon, G. Neubig, and M. R. Gorm-
ley, “Unlimiformer: Long-range transformers with
unlimited length input,” CoRR, vol. abs/2305.01625,
2023.
[952] Y. Wu, M. N. Rabe, D. Hutchins, and C. Szegedy,
“Memorizing transformers,” in The Tenth Interna-
tional Conference on Learning Representations, ICLR
2022, Virtual Event, April 25-29, 2022. OpenRe-
view.net, 2022.
[953] Y. Lu, X. Zhou, W. He, J. Zhao, T. Ji, T. Gui, Q. Zhang,
and X. Huang, “Longheads: Multi-head attention
is secretly a long context processor,” CoRR, vol.
abs/2402.10685, 2024.
[954] C. Xiao, P. Zhang, X. Han, G. Xiao, Y. Lin, Z. Zhang,
Z. Liu, S. Han, and M. Sun, “Infllm: Unveiling the in-
trinsic capacity of llms for understanding extremely
long sequences with training-free memory,” CoRR,
vol. abs/2402.04617, 2024.
[955] Y. Fu, R. Panda, X. Niu, X. Yue, H. Hajishirzi, Y. Kim,
and H. Peng, “Data engineering for scaling language
models to 128k context,” CoRR, vol. abs/2402.10171,
2024.
[956] K. Lv, X. Liu, Q. Guo, H. Yan, C. He, X. Qiu,
and D. Lin, “Longwanjuan: Towards systematic
measurement for long text quality,” CoRR, vol.
abs/2402.13583, 2024.
[957] H. Chen, R. Pasunuru, J. Weston, and A. Celiky-
ilmaz, “Walking down the memory maze: Beyond
context limit through interactive reading,” CoRR, vol.
abs/2310.05029, 2023.
[958] W. Zhou, Y. E. Jiang, P. Cui, T. Wang, Z. Xiao, Y. Hou,
R. Cotterell, and M. Sachan, “Recurrentgpt: Interac-
tive generation of (arbitrarily) long text,” CoRR, vol.
abs/2305.13304, 2023.
[959] C. Packer, V. Fang, S. G. Patil, K. Lin, S. Wooders, and
J. E. Gonzalez, “Memgpt: Towards llms as operating
systems,” CoRR, vol. abs/2310.08560, 2023.
[960] P. Xu, W. Ping, X. Wu, L. McAfee, C. Zhu, Z. Liu,
S. Subramanian, E. Bakhturina, M. Shoeybi, and
B. Catanzaro, “Retrieval meets long context large
language models,” CoRR, vol. abs/2310.03025, 2023.
[961] S. Russell and P. Norvig, Artificial Intelligence:
A Modern Approach (4th Edition). Pearson, 2020.
[Online]. Available: http://aima.cs.berkeley.edu/
[962] B. M. Lake, T. D. Ullman, J. B. Tenenbaum, and S. J.
Gershman, “Building machines that learn and think
like people,” CoRR, vol. abs/1604.00289, 2016.
[963] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran,
K. Narasimhan, and Y. Cao, “React: Synergizing rea-
soning and acting in language models,” CoRR, vol.
abs/2210.03629, 2022.
[964] 2023. [Online]. Available: https://github.com/
AntonOsika/gpt-engineer
[965] XAgent Team, “Xagent: An autonomous agent for complex
task solving,” 2023.
[966] G. Li, H. A. A. K. Hammoud, H. Itani, D. Khizbullin,
and B. Ghanem, “CAMEL: communicative agents for
”mind” exploration of large scale language model
society,” CoRR, vol. abs/2303.17760, 2023.
[967] S. Hong, X. Zheng, J. Chen, Y. Cheng, J. Wang,
C. Zhang, Z. Wang, S. K. S. Yau, Z. Lin, L. Zhou,
C. Ran, L. Xiao, and C. Wu, “Metagpt: Meta pro-
gramming for multi-agent collaborative framework,”
CoRR, vol. abs/2308.00352, 2023.
[968] C. Pham, B. Liu, Y. Yang, Z. Chen, T. Liu, J. Yuan,
B. A. Plummer, Z. Wang, and H. Yang, “Let models
speak ciphers: Multiagent debate through embed-
dings,” CoRR, vol. abs/2310.06272, 2023.
[969] W. Chen, Y. Su, J. Zuo, C. Yang, C. Yuan, C. Qian,
C.-M. Chan, Y. Qin, Y. Lu, R. Xie et al., “Agent-
verse: Facilitating multi-agent collaboration and ex-
ploring emergent behaviors in agents,” arXiv preprint
arXiv:2308.10848, 2023.
[970] Q. Wu, G. Bansal, J. Zhang, Y. Wu, B. Li, E. Zhu,
L. Jiang, X. Zhang, S. Zhang, J. Liu, A. H. Awadallah,
R. W. White, D. Burger, and C. Wang, “Autogen:
Enabling next-gen llm applications via multi-agent
conversation framework,” 2023.
[971] Y. Du, S. Li, A. Torralba, J. B. Tenenbaum, and
I. Mordatch, “Improving factuality and reasoning in
language models through multiagent debate,” CoRR,
vol. abs/2305.14325, 2023.
[972] Y. Shao, L. Li, J. Dai, and X. Qiu, “Character-llm:
A trainable agent for role-playing,” in Proceedings of
the 2023 Conference on Empirical Methods in Natural
Language Processing, EMNLP 2023, Singapore, Decem-
ber 6-10, 2023, H. Bouamor, J. Pino, and K. Bali, Eds.
Association for Computational Linguistics, 2023, pp.
13 153–13 187.
[973] W. Hua, X. Yang, Z. Li, W. Cheng, and Y. Zhang,
“Trustagent: Towards safe and trustworthy llm-
based agents through agent constitution,” CoRR, vol.
abs/2402.01586, 2024.
[974] L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng,
H. Wang, Q. Chen, W. Peng, X. Feng, B. Qin, and
T. Liu, “A survey on hallucination in large language
models: Principles, taxonomy, challenges, and open
questions,” CoRR, vol. abs/2311.05232, 2023.
[975] I. Loshchilov and F. Hutter, “Decoupled weight de-
cay regularization,” in ICLR (Poster). OpenRe-
view.net, 2019.
[976] V. A. Korthikanti, J. Casper, S. Lym, L. McAfee,
M. Andersch, M. Shoeybi, and B. Catanzaro, “Re-
ducing activation recomputation in large transformer
models,” in MLSys. mlsys.org, 2023.
[977] S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He,
“Zero: memory optimizations toward training tril-
lion parameter models,” in Proceedings of the Interna-
tional Conference for High Performance Computing, Net-
working, Storage and Analysis, SC 2020, Virtual Event /
Atlanta, Georgia, USA, November 9-19, 2020, C. Cuic-
chi, I. Qualters, and W. T. Kramer, Eds. IEEE/ACM,
2020, p. 20.
[978] J. Ren, S. Rajbhandari, R. Y. Aminabadi, O. Ruwase,
S. Yang, M. Zhang, D. Li, and Y. He, “Zero-offload:
Democratizing billion-scale model training,” in 2021
USENIX Annual Technical Conference, USENIX ATC
2021, July 14-16, 2021, I. Calciu and G. Kuenning, Eds.
USENIX Association, 2021, pp. 551–564.
[979] S. Rajbhandari, O. Ruwase, J. Rasley, S. Smith, and
Y. He, “Zero-infinity: breaking the GPU memory wall
for extreme scale deep learning,” in SC. ACM, 2021,
p. 59.
[980] T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Ré,
“Flashattention: Fast and memory-efficient exact at-
tention with io-awareness,” in NeurIPS, 2022.
[981] S. A. Jacobs, M. Tanaka, C. Zhang, M. Zhang, S. L.
Song, S. Rajbhandari, and Y. He, “Deepspeed ulysses:
System optimizations for enabling training of ex-
treme long sequence transformer models,” CoRR,
vol. abs/2309.14509, 2023.
[982] H. Liu, M. Zaharia, and P. Abbeel, “Ring attention
with blockwise transformers for near-infinite con-
text,” CoRR, vol. abs/2310.01889, 2023.
[983] Y. Chen, T. Tang, E. Xiang, L. Li, W. X. Zhao,
J. Wang, Y. Chai, and J. Wen, “Towards coarse-to-fine
evaluation of inference efficiency for large language
models,” CoRR, vol. abs/2404.11502, 2024.
[984] Y. Sheng, L. Zheng, B. Yuan, Z. Li, M. Ryabinin,
B. Chen, P. Liang, C. Ré, I. Stoica, and C. Zhang,
“Flexgen: High-throughput generative inference of
large language models with a single GPU,” in ICML,
ser. Proceedings of Machine Learning Research, vol.
202. PMLR, 2023, pp. 31 094–31 116.
[985] T. Dao, D. Haziza, F. Massa, and G. Sizov, “Flash-
decoding for long-context inference,” 2023. [Online].
Available: https://crfm.stanford.edu/2023/10/12/
flashdecoding.html
[986] C. Holmes, M. Tanaka, M. Wyatt, A. A. Awan,
J. Rasley, S. Rajbhandari, R. Y. Aminabadi, H. Qin,
A. Bakhtiari, L. Kurilenko, and Y. He, “Deepspeed-
fastgen: High-throughput text generation for llms
via MII and deepspeed-inference,” CoRR, vol.
abs/2401.08671, 2024.
[987] Y. Leviathan, M. Kalman, and Y. Matias, “Fast infer-
ence from transformers via speculative decoding,” in
International Conference on Machine Learning, 2023.
[988] C. Chen, S. Borgeaud, G. Irving, J. Lespiau, L. Sifre,
and J. Jumper, “Accelerating large language model
decoding with speculative sampling,” CoRR, vol.
abs/2302.01318, 2023.
[989] X. Miao, G. Oliaro, Z. Zhang, X. Cheng, Z. Wang,
R. Y. Y. Wong, Z. Chen, D. Arfeen, R. Abhyankar,
and Z. Jia, “Specinfer: Accelerating generative LLM
serving with speculative inference and token tree
verification,” CoRR, vol. abs/2305.09781, 2023.
[990] B. Spector and C. Ré, “Accelerating LLM infer-
ence with staged speculative decoding,” CoRR, vol.
abs/2308.04623, 2023.
[991] L. Chen, M. Zaharia, and J. Zou, “Frugalgpt: How to
use large language models while reducing cost and
improving performance,” CoRR, vol. abs/2305.05176,
2023.
[992] M. Yue, J. Zhao, M. Zhang, L. Du, and Z. Yao, “Large
language model cascades with mixture of thoughts
representations for cost-efficient reasoning,” CoRR,
vol. abs/2310.03094, 2023.
[993] J. Gu, J. Bradbury, C. Xiong, V. O. K. Li, and R. Socher,
“Non-autoregressive neural machine translation,” in
ICLR (Poster). OpenReview.net, 2018.
[994] C. Wang, J. Zhang, and H. Chen, “Semi-
autoregressive neural machine translation,” in
EMNLP. Association for Computational Linguistics,
2018, pp. 479–488.
[995] T. Cai, Y. Li, Z. Geng, H. Peng, J. D. Lee, D. Chen, and
T. Dao, “Medusa: Simple LLM inference acceleration
framework with multiple decoding heads,” CoRR,
vol. abs/2401.10774, 2024.
[996] S. Teerapittayanon, B. McDanel, and H. T. Kung,
“Branchynet: Fast inference via early exiting from
deep neural networks,” in ICPR. IEEE, 2016, pp.
2464–2469.
[997] G. Huang, D. Chen, T. Li, F. Wu, L. van der Maaten,
and K. Q. Weinberger, “Multi-scale dense networks
for resource efficient image classification,” in ICLR.
OpenReview.net, 2018.
[998] D. Raposo, S. Ritter, B. A. Richards, T. P. Lilli-
crap, P. C. Humphreys, and A. Santoro, “Mixture-
of-depths: Dynamically allocating compute in
transformer-based language models,” CoRR, vol.
abs/2404.02258, 2024.
[999] Z. Wan, X. Wang, C. Liu, S. Alam, Y. Zheng,
J. Liu, Z. Qu, S. Yan, Y. Zhu, Q. Zhang,
M. Chowdhury, and M. Zhang, “Efficient large
language models: A survey,” 2024. [Online].
Available: https://arxiv.org/abs/2312.03863
[1000] A. Gholami, S. Kim, Z. Dong, Z. Yao, M. W.
Mahoney, and K. Keutzer, “A survey of quantization
methods for efficient neural network inference,”
CoRR, vol. abs/2103.13630, 2021. [Online]. Available:
https://arxiv.org/abs/2103.13630
[1001] T. Dettmers, M. Lewis, Y. Belkada, and L. Zettle-
moyer, “Llm.int8(): 8-bit matrix multiplication for
transformers at scale,” CoRR, vol. abs/2208.07339,
2022.
[1002] J. Lin, J. Tang, H. Tang, S. Yang, X. Dang, and S. Han,
“Awq: Activation-aware weight quantization for llm
compression and acceleration,” 2023.
[1003] Y. Shang, Z. Yuan, Q. Wu, and Z. Dong, “PB-
LLM: partially binarized large language models,”
CoRR, vol. abs/2310.00034, 2023. [Online]. Available:
https://doi.org/10.48550/arXiv.2310.00034
[1004] T. Dettmers, R. Svirschevski, V. Egiazarian,
D. Kuznedelev, E. Frantar, S. Ashkboos, A. Borzunov,
T. Hoefler, and D. Alistarh, “Spqr: A sparse-
quantized representation for near-lossless LLM
weight compression,” CoRR, vol. abs/2306.03078,
2023.
[1005] Z. Guan, H. Huang, Y. Su, H. Huang, N. Wong, and
H. Yu, “APTQ: attention-aware post-training mixed-
precision quantization for large language models,”
CoRR, vol. abs/2402.14866, 2024. [Online]. Available:
https://doi.org/10.48550/arXiv.2402.14866
[1006] C. Lee, J. Jin, T. Kim, H. Kim, and E. Park, “OWQ:
outlier-aware weight quantization for efficient fine-
tuning and inference of large language models,” in
Thirty-Eighth AAAI Conference on Artificial Intelligence,
AAAI 2024, Thirty-Sixth Conference on Innovative
Applications of Artificial Intelligence, IAAI 2024,
Fourteenth Symposium on Educational Advances in
Artificial Intelligence, EAAI 2024, February 20-
27, 2024, Vancouver, Canada, M. J. Wooldridge,
J. G. Dy, and S. Natarajan, Eds. AAAI Press,
2024, pp. 13 355–13 364. [Online]. Available: https:
//doi.org/10.1609/aaai.v38i12.29237
[1007] G. Xiao, J. Lin, M. Seznec, J. Demouth, and
S. Han, “Smoothquant: Accurate and efficient post-
training quantization for large language models,”
CoRR, vol. abs/2211.10438, 2022. [Online]. Available:
https://doi.org/10.48550/arXiv.2211.10438
[1008] Z. Yao, R. Y. Aminabadi, M. Zhang, X. Wu, C. Li,
and Y. He, “Zeroquant: Efficient and affordable post-
training quantization for large-scale transformers,”
in NeurIPS, 2022.
[1009] E. Frantar, S. Ashkboos, T. Hoefler, and D. Alis-
tarh, “Gptq: Accurate post-training quantization for
generative pre-trained transformers,” arXiv preprint
arXiv:2210.17323, 2022.
[1010] E. Frantar and D. Alistarh, “Optimal brain compres-
sion: A framework for accurate post-training quanti-
zation and pruning,” in NeurIPS, 2022.
[1011] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettle-
moyer, “Qlora: Efficient finetuning of quantized
llms,” arXiv preprint arXiv:2305.14314, 2023.
[1012] Z. Liu, B. Oguz, C. Zhao, E. Chang, P. Stock,
Y. Mehdad, Y. Shi, R. Krishnamoorthi, and V. Chan-
dra, “Llm-qat: Data-free quantization aware training
for large language models,” 2023.
[1013] Z. Yao, X. Wu, C. Li, S. Youn, and Y. He, “Zeroquant-
v2: Exploring post-training quantization in llms from
comprehensive study to low rank compensation,”
2023.
[1014] T. Dettmers and L. Zettlemoyer, “The case for 4-bit
precision: k-bit inference scaling laws,” CoRR, vol.
abs/2212.09720, 2022.
[1015] P. Liu, Z. Liu, Z.-F. Gao, D. Gao, W. X. Zhao,
Y. Li, B. Ding, and J.-R. Wen, “Do emergent
abilities exist in quantized large language models:
An empirical study,” arXiv preprint arXiv:2307.08072,
2023.
[1016] Y. Xu, L. Xie, X. Gu, X. Chen, H. Chang,
H. Zhang, Z. Chen, X. Zhang, and Q. Tian, “Qa-lora:
Quantization-aware low-rank adaptation of large
language models,” CoRR, vol. abs/2309.14717, 2023.
[Online]. Available: https://doi.org/10.48550/arXiv.
2309.14717
[1017] Y. Li, Y. Yu, C. Liang, P. He, N. Karampatziakis,
W. Chen, and T. Zhao, “Loftq: Lora-fine-tuning-
aware quantization for large language models,”
CoRR, vol. abs/2310.08659, 2023. [Online]. Available:
https://doi.org/10.48550/arXiv.2310.08659
[1018] Y. Gu, L. Dong, F. Wei, and M. Huang, “Knowledge
distillation of large language models,” CoRR,
vol. abs/2306.08543, 2023. [Online]. Available:
https://doi.org/10.48550/arXiv.2306.08543
[1019] C. Hsieh, C. Li, C. Yeh, H. Nakhost, Y. Fujii,
A. Ratner, R. Krishna, C. Lee, and T. Pfister,
“Distilling step-by-step! outperforming larger
language models with less training data and
smaller model sizes,” in Findings of the Association for
Computational Linguistics: ACL 2023, Toronto, Canada,
July 9-14, 2023, A. Rogers, J. L. Boyd-Graber, and
N. Okazaki, Eds. Association for Computational
Linguistics, 2023, pp. 8003–8017. [Online]. Available:
https://doi.org/10.18653/v1/2023.findings-acl.507
[1020] E. Frantar and D. Alistarh, “Sparsegpt: Massive lan-
guage models can be accurately pruned in one-
shot,” in International Conference on Machine Learning.
PMLR, 2023, pp. 10 323–10 337.
[1021] X. Ma, G. Fang, and X. Wang, “Llm-pruner: On the
structural pruning of large language models,” Ad-
vances in neural information processing systems, vol. 36,
pp. 21 702–21 720, 2023.
[1022] M. Xia, T. Gao, Z. Zeng, and D. Chen, “Sheared
llama: Accelerating language model pre-training via
structured pruning,” arXiv preprint arXiv:2310.06694,
2023.
[1023] T. Dettmers, M. Lewis, S. Shleifer, and L. Zettle-
moyer, “8-bit optimizers via block-wise quantiza-
tion,” in The Tenth International Conference on Learning
Representations, ICLR 2022. OpenReview.net, 2022.
[1024] Y. Ding, W. Fan, L. Ning, S. Wang, H. Li, D. Yin, T.-S.
Chua, and Q. Li, “A survey on rag meets llms: To-
wards retrieval-augmented large language models,”
arXiv preprint arXiv:2405.06211, 2024.
[1025] Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai,
J. Sun, and H. Wang, “Retrieval-augmented gener-
ation for large language models: A survey,” arXiv
preprint arXiv:2312.10997, 2023.
[1026] S. Robertson and H. Zaragoza, The probabilistic rele-
vance framework: BM25 and beyond, 2009.
[1027] Y. Wang, R. Ren, J. Li, W. X. Zhao, J. Liu, and J.-R.
Wen, “Rear: A relevance-aware retrieval-augmented
framework for open-domain question answering,”
arXiv preprint arXiv:2402.17497, 2024.
[1028] D. Rau, S. Wang, H. Déjean, and S. Clinchant, “Con-
text embeddings for efficient answer generation in
rag,” arXiv preprint arXiv:2407.09252, 2024.
[1029] F. Xu, W. Shi, and E. Choi, “Recomp: Improving
retrieval-augmented lms with context compression
and selective augmentation,” in The Twelfth Interna-
tional Conference on Learning Representations, 2024.
[1030] Z. Shao, Y. Gong, Y. Shen, M. Huang, N. Duan,
and W. Chen, “Enhancing retrieval-augmented large
language models with iterative retrieval-generation
synergy,” in Findings of the Association for Computa-
tional Linguistics: EMNLP 2023, 2023, pp. 9248–9274.
[1031] T. Chen, H. Wang, S. Chen, W. Yu, K. Ma, X. Zhao,
D. Yu, and H. Zhang, “Dense x retrieval: What re-
trieval granularity should we use?” arXiv preprint
arXiv:2312.06648, 2023.
[1032] X. Huang, S. Cheng, Y. Shu, Y. Bao, and Y. Qu,
“Question decomposition tree for answering com-
plex questions over knowledge bases,” in Proceedings
of the AAAI Conference on Artificial Intelligence, vol. 37,
no. 11, 2023, pp. 12 924–12 932.
[1033] Y. He, J. Tang, H. Ouyang, C. Kang, D. Yin, and
Y. Chang, “Learning to rewrite queries,” in Pro-
ceedings of the 25th ACM International on Conference
on Information and Knowledge Management, 2016, pp.
1443–1452.
[1034] J. Liu and B. Mozafari, “Query rewriting via large
language models,” arXiv preprint arXiv:2403.09060,
2024.
[1035] F. Ye, M. Fang, S. Li, and E. Yilmaz, “Enhancing
conversational search: Large language model-aided
informative query rewriting,” in Findings of the As-
sociation for Computational Linguistics: EMNLP 2023,
2023, pp. 5985–6006.
[1036] S. Jeong, J. Baek, S. Cho, S. J. Hwang, and J. C.
Park, “Adaptive-rag: Learning to adapt retrieval-
augmented large language models through question
complexity,” arXiv preprint arXiv:2403.14403, 2024.
[1037] H. Jiang, Q. Wu, C.-Y. Lin, Y. Yang, and L. Qiu,
“Llmlingua: Compressing prompts for accelerated
inference of large language models,” in Proceedings
of the 2023 Conference on Empirical Methods in Natural
Language Processing, 2023, pp. 13 358–13 376.
[1038] T. Xu, S. Wu, S. Diao, X. Liu, X. Wang, Y. Chen,
and J. Gao, “Sayself: Teaching llms to express con-
fidence with self-reflective rationales,” arXiv preprint
arXiv:2405.20974, 2024.
[1039] A. Asai, Z. Wu, Y. Wang, A. Sil, and H. Ha-
jishirzi, “Self-rag: Learning to retrieve, generate,
and critique through self-reflection,” arXiv preprint
arXiv:2310.11511, 2023.
[1040] H. Luo, Y.-S. Chuang, Y. Gong, T. Zhang, Y. Kim,
X. Wu, D. Fox, H. Meng, and J. Glass, “Sail: Search-
augmented instruction learning,” arXiv preprint
arXiv:2305.15225, 2023.
[1041] X. V. Lin, X. Chen, M. Chen, W. Shi, M. Lomeli,
R. James, P. Rodriguez, J. Kahn, G. Szilvasy, M. Lewis
et al., “Ra-dit: Retrieval-augmented dual instruction
tuning,” arXiv preprint arXiv:2310.01352, 2023.
[1042] K. Guu, K. Lee, Z. Tung, P. Pasupat, and M. Chang,
“Retrieval augmented language model pre-training,”
in International conference on machine learning. PMLR,
2020, pp. 3929–3938.
[1043] K. Lee, M.-W. Chang, and K. Toutanova, “Latent re-
trieval for weakly supervised open domain question
answering,” in Proceedings of the 57th Annual Meeting
of the Association for Computational Linguistics, 2019,
pp. 6086–6096.
[1044] J. Li, J. Chen, R. Ren, X. Cheng, W. X. Zhao, J.-Y.
Nie, and J.-R. Wen, “The dawn after the dark: An
empirical study on factuality hallucination in large
language models,” arXiv preprint arXiv:2401.03205,
2024.
[1045] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii,
Y. J. Bang, A. Madotto, and P. Fung, “Survey of
hallucination in natural language generation,” ACM
Comput. Surv., 2023.
[1046] Y. Zhang, Y. Li, L. Cui, D. Cai, L. Liu, T. Fu, X. Huang,
E. Zhao, Y. Zhang, Y. Chen, L. Wang, A. T. Luu, W. Bi,
F. Shi, and S. Shi, “Siren’s song in the AI ocean: A
survey on hallucination in large language models,”
arXiv preprint arXiv:2309.01219, 2023.
[1047] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer,
“Scheduled sampling for sequence prediction with
recurrent neural networks,” in NIPS, 2015, pp. 1171–
1179.
[1048] M. Sharma, M. Tong, T. Korbak, D. Duvenaud,
A. Askell, S. R. Bowman, N. Cheng, E. Dur-
mus, Z. Hatfield-Dodds, S. R. Johnston, S. Kravec,
T. Maxwell, S. McCandlish, K. Ndousse, O. Rausch,
N. Schiefer, D. Yan, M. Zhang, and E. Perez, “To-
wards understanding sycophancy in language mod-
els,” CoRR, vol. abs/2310.13548, 2023.
[1049] V. Rawte, P. Priya, S. M. T. I. Tonmoy, S. M. M.
Zaman, A. P. Sheth, and A. Das, “Exploring the re-
lationship between LLM hallucinations and prompt
linguistic nuances: Readability, formality, and con-
creteness,” CoRR, vol. abs/2309.11064, 2023.
[1050] S. Dhuliawala, M. Komeili, J. Xu, R. Raileanu, X. Li,
A. Celikyilmaz, and J. Weston, “Chain-of-verification
reduces hallucination in large language models,”
CoRR, vol. abs/2309.11495, 2023.
[1051] P. Manakul, A. Liusie, and M. J. F. Gales, “Selfcheck-
gpt: Zero-resource black-box hallucination detection
for generative large language models,” in EMNLP.
Association for Computational Linguistics, 2023, pp.
9004–9017.
[1052] N. Varshney, W. Yao, H. Zhang, J. Chen, and D. Yu,
“A stitch in time saves nine: Detecting and mitigating
hallucinations of llms by validating low-confidence
generation,” CoRR, vol. abs/2307.03987, 2023.
[1053] Y. Yehuda, I. Malkiel, O. Barkan, J. Weill, R. Ronen,
and N. Koenigstein, “In search of truth: An interro-
gation approach to hallucination detection,” CoRR,
vol. abs/2403.02889, 2024.
[1054] S. Min, K. Krishna, X. Lyu, M. Lewis, W.-t. Yih, P. W.
Koh, M. Iyyer, L. Zettlemoyer, and H. Hajishirzi,
“Factscore: Fine-grained atomic evaluation of factual
precision in long form text generation,” 2023.
[1055] I. Chern, S. Chern, S. Chen, W. Yuan, K. Feng,
C. Zhou, J. He, G. Neubig, and P. Liu, “Factool:
Factuality detection in generative AI - A tool aug-
mented framework for multi-task and multi-domain
scenarios,” CoRR, vol. abs/2307.13528, 2023.
[1056] X. Cheng, J. Li, W. X. Zhao, H. Zhang, F. Zhang,
D. Zhang, K. Gai, and J.-R. Wen, “Small agent can
also rock! empowering small language models as
hallucination detector,” CoRR, vol. abs/2406.11277,
2024.
[1057] M. Sharma, M. Tong, T. Korbak, D. Duvenaud,
A. Askell, S. R. Bowman, E. Durmus, Z. Hatfield-
Dodds, S. R. Johnston, S. Kravec, T. Maxwell, S. Mc-
Candlish, K. Ndousse, O. Rausch, N. Schiefer, D. Yan,
M. Zhang, and E. Perez, “Towards understanding
sycophancy in language models,” in ICLR. Open-
Review.net, 2024.
[1058] J. W. Wei, D. Huang, Y. Lu, D. Zhou, and Q. V. Le,
“Simple synthetic data reduces sycophancy in large
language models,” CoRR, vol. abs/2308.03958, 2023.
[1059] L. Gao, Z. Dai, P. Pasupat, A. Chen, A. T. Chaganty,
Y. Fan, V. Y. Zhao, N. Lao, H. Lee, D. Juan, and
K. Guu, “RARR: researching and revising what lan-
guage models say, using language models,” in ACL
(1). Association for Computational Linguistics, 2023,
pp. 16 477–16 508.
[1060] R. Zhao, X. Li, S. Joty, C. Qin, and L. Bing, “Verify-
and-edit: A knowledge-enhanced chain-of-thought
framework,” in ACL (1). Association for Compu-
tational Linguistics, 2023, pp. 5823–5840.
[1061] H. Trivedi, N. Balasubramanian, T. Khot, and A. Sab-
harwal, “Interleaving retrieval with chain-of-thought
reasoning for knowledge-intensive multi-step ques-
tions,” CoRR, vol. abs/2212.10509, 2022.
[1062] K. Li, O. Patel, F. B. Viégas, H. Pfister, and M. Watten-
berg, “Inference-time intervention: Eliciting truthful
answers from a language model,” in NeurIPS, 2023.
[1063] W. Shi, X. Han, M. Lewis, Y. Tsvetkov, L. Zettlemoyer,
and S. W. Yih, “Trusting your evidence: Halluci-
nate less with context-aware decoding,” CoRR, vol.
abs/2305.14739, 2023.
[1064] D. Kahneman, Thinking, Fast and Slow. Farrar,
Straus and Giroux, 2011.
[1065] S. Wu, Z. Peng, X. Du, T. Zheng, M. Liu, J. Wu, J. Ma,
Y. Li, J. Yang, W. Zhou et al., “A comparative study
on reasoning patterns of openai’s o1 model,” arXiv
preprint arXiv:2410.13639, 2024.
[1066] T. Zhong, Z. Liu, Y. Pan, Y. Zhang, Y. Zhou, S. Liang,
Z. Wu, Y. Lyu, P. Shu, X. Yu et al., “Evaluation
of openai o1: Opportunities and challenges of agi,”
arXiv preprint arXiv:2409.18486, 2024.
[1067] Y. Min, Z. Chen, J. Jiang, J. Chen, J. Deng, Y. Hu,
Y. Tang, J. Wang, X. Cheng, H. Song et al., “Imitate,
explore, and self-improve: A reproduction report
on slow-thinking reasoning systems,” arXiv preprint
arXiv:2412.09413, 2024.
[1068] DeepSeek Team, “Deepseek-r1-lite-preview is now live: un-
leashing supercharged reasoning power,” 2024.
[1069] Qwen Team, “Qwq: Reflect deeply on the boundaries of
the unknown,” November 2024. [Online]. Available:
https://qwenlm.github.io/blog/qwq-32b-preview
[1070] DeepSeek-AI, “Deepseek-r1: Incentivizing reasoning
capability in llms via reinforcement learning,” 2025.
[1071] J. Jiang, Z. Chen, Y. Min, J. Chen, X. Cheng, J. Wang,
Y. Tang, H. Sun, J. Deng, W. X. Zhao, Z. Liu, D. Yan,
J. Xie, Z. Wang, and J.-R. Wen, “Enhancing llm rea-
soning with reward-guided tree search,” 2024.
[1072] T. Liang, Z. He, W. Jiao, X. Wang, Y. Wang, R. Wang,
Y. Yang, S. Shi, and Z. Tu, “Encouraging divergent
thinking in large language models through multi-
agent debate,” arXiv preprint arXiv:2305.19118, 2023.
[1073] Y. Du, Z. Liu, Y. Li, W. X. Zhao, Y. Huo, B. Wang,
W. Chen, Z. Liu, Z. Wang, and J.-R. Wen, “Virgo:
A preliminary exploration on reproducing o1-like
mllm,” arXiv preprint arXiv:2501.01904, 2025.
[1074] Kimi Team, “Kimi k1.5: Scaling reinforcement learning
with llms,” 2025. [Online]. Available: https://arxiv.
org/abs/2501.12599
[1075] OpenAI, “Openai’s reinforcement fine-tuning re-
search program,” OpenAI Blog, 2024.
[1076] Z. Zeng, Q. Cheng, Z. Yin, B. Wang, S. Li, Y. Zhou,
Q. Guo, X. Huang, and X. Qiu, “Scaling of search
and learning: A roadmap to reproduce o1 from
reinforcement learning perspective,” arXiv preprint
arXiv:2412.14135, 2024.
[1077] Z. Chen, Y. Min, B. Zhang, J. Chen, J. Jiang, D. Cheng,
W. X. Zhao, Z. Liu, X. Miao, Y. Lu, L. Fang, Z. Wang,
and J.-R. Wen, “An empirical study on eliciting and
improving r1-like reasoning models,” 2025.
[1078] Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi,
H. Zhang, M. Zhang, Y. Li, Y. Wu et al., “Deepseek-
math: Pushing the limits of mathematical rea-
soning in open language models,” arXiv preprint
arXiv:2402.03300, 2024.
[1079] W. Kool, H. van Hoof, and M. Welling, “Buy 4 REIN-
FORCE samples, get a baseline for free!” in Deep Re-
inforcement Learning Meets Structured Prediction, ICLR
2019 Workshop, New Orleans, Louisiana, United States,
May 6, 2019. OpenReview.net, 2019.
[1080] C. Snell, J. Lee, K. Xu, and A. Kumar, “Scaling llm
test-time compute optimally can be more effective
than scaling model parameters,” 2024. [Online].
Available: https://arxiv.org/abs/2408.03314
[1081] W. Kuang, B. Qian, Z. Li, D. Chen, D. Gao, X. Pan,
Y. Xie, Y. Li, B. Ding, and J. Zhou, “Federatedscope-
llm: A comprehensive package for fine-tuning large
language models in federated learning,” 2023.
