Large Language Diffusion Models
Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, Chongxuan Li
https://arxiv.org/abs/2502.09992
References
Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y., Lebron, F., and Sanghai, S. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 4895–4901, 2023.
Allen-Zhu, Z. and Li, Y. Physics of Language Models: Part 3.2, Knowledge Manipulation. arXiv preprint arXiv:2309.14402, 2023.
Anonymous. Interpolating autoregressive and discrete denoising diffusion language models. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=tyEyYT267x.
Anthropic. Claude 3.5 sonnet, 2024. URL https://www.anthropic.com/news/claude-3-5-sonnet.
Austin, J., Johnson, D. D., Ho, J., Tarlow, D., and Van Den Berg, R. Structured denoising diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems, 34:17981–17993, 2021a.
Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C., Terry, M., Le, Q., et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021b.
Bao, F., Nie, S., Xue, K., Cao, Y., Li, C., Su, H., and Zhu, J. All are worth words: A vit backbone for diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22669–22679, 2023.
Bavarian, M., Jun, H., Tezak, N., Schulman, J., McLeavey, C., Tworek, J., and Chen, M. Efficient training of language models to fill in the middle. arXiv preprint arXiv:2207.14255, 2022.
Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A. C., Korbak, T., and Evans, O. The reversal curse: Llms trained on "a is b" fail to learn "b is a". arXiv preprint arXiv:2309.12288, 2023.
Bi, X., Chen, D., Chen, G., Chen, S., Dai, D., Deng, C., Ding, H., Dong, K., Du, Q., Fu, Z., et al. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954, 2024.
Bisk, Y., Zellers, R., Gao, J., Choi, Y., et al. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, 2020.
Brooks, T., Peebles, B., Holmes, C., DePue, W., Guo, Y., Jing, L., Schnurr, D., Taylor, J., Luhman, T., Luhman, E., Ng, C., Wang, R., and Ramesh, A. Video generation models as world simulators. 2024. URL https://openai.com/research/video-generation-models-as-world-simulators.
Brown, T. B. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
Campbell, A., Benton, J., De Bortoli, V., Rainforth, T., Deligiannidis, G., and Doucet, A. A continuous time framework for discrete denoising models. Advances in Neural Information Processing Systems, 35:28266–28279, 2022.
Chang, H., Zhang, H., Jiang, L., Liu, C., and Freeman, W. T. Maskgit: Masked generative image transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11315–11325, 2022.
Chang, H., Zhang, H., Barber, J., Maschinot, A., Lezama, J., Jiang, L., Yang, M.-H., Murphy, K., Freeman, W. T., Rubinstein, M., et al. Muse: Text-to-image generation via masked generative transformers. arXiv preprint arXiv:2301.00704, 2023.
Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. D. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
Chen, T., Zhang, R., and Hinton, G. Analog bits: Generating discrete data using diffusion models with self-conditioning. arXiv preprint arXiv:2208.04202, 2022.
Chen, Z., Yuan, H., Li, Y., Kou, Y., Zhang, J., and Gu, Q. Fast sampling via de-randomization for discrete diffusion models. arXiv preprint arXiv:2312.09193, 2023.
Chu, Y., Xu, J., Yang, Q., Wei, H., Wei, X., Guo, Z., Leng, Y., Lv, Y., He, J., Lin, J., et al. Qwen2-audio technical report. arXiv preprint arXiv:2407.10759, 2024.
Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.
Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
Deletang, G., Ruoss, A., Duquenne, P.-A., Catt, E., Genewein, T., Mattern, C., Grau-Moya, J., Wenliang, L. K., Aitchison, M., Orseau, L., et al. Language modeling is compression. In The Twelfth International Conference on Learning Representations, 2024.
Devlin, J. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
Dhariwal, P. and Nichol, A. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021.
Dieleman, S., Sartran, L., Roshannai, A., Savinov, N., Ganin, Y., Richemond, P. H., Doucet, A., Strudel, R., Dyer, C., Durkan, C., et al. Continuous diffusion for categorical data. arXiv preprint arXiv:2211.15089, 2022.
Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
Fisher, R. A. On the mathematical foundations of theoretical statistics. Philosophical transactions of the Royal Society of London. Series A, containing papers of a mathematical or physical character, 222(594-604):309–368, 1922.
Gao, L., Tow, J., Abbasi, B., Biderman, S., Black, S., DiPofi, A., Foster, C., Golding, L., Hsu, J., Le Noac’h, A., Li, H., McDonell, K., Muennighoff, N., Ociepa, C., Phang, J., Reynolds, L., Schoelkopf, H., Skowron, A., Sutawika, L., Tang, E., Thite, A., Wang, B., Wang, K., and Zou, A. A framework for few-shot language model evaluation, July 2024. URL https://zenodo.org/records/12608602.
Gat, I., Remez, T., Shaul, N., Kreuk, F., Chen, R. T., Synnaeve, G., Adi, Y., and Lipman, Y. Discrete flow matching. arXiv preprint arXiv:2407.15595, 2024.
Gong, S., Li, M., Feng, J., Wu, Z., and Kong, L. Diffuseq: Sequence to sequence text generation with diffusion models. arXiv preprint arXiv:2210.08933, 2022.
Gong, S., Agarwal, S., Zhang, Y., Ye, J., Zheng, L., Li, M., An, C., Zhao, P., Bi, W., Han, J., et al. Scaling diffusion language models via adaptation from autoregressive models. arXiv preprint arXiv:2410.17891, 2024.
Google. Our next-generation model: Gemini 1.5, 2024. URL https://blog.google/technology/ai/google-gemini-next-generation-model-february-2024.
Graves, A., Srivastava, R. K., Atkinson, T., and Gomez, F. Bayesian flow networks. arXiv preprint arXiv:2308.07037, 2023.
Gulrajani, I. and Hashimoto, T. B. Likelihood-based diffusion language models. Advances in Neural Information Processing Systems, 36, 2024.
Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
Han, X., Kumar, S., and Tsvetkov, Y. Ssd-lm: Semi-autoregressive simplex-based diffusion language model for text generation and modular control. arXiv preprint arXiv:2210.17432, 2022.
He, Z., Sun, T., Wang, K., Huang, X., and Qiu, X. Diffusionbert: Improving generative masked language models with diffusion models. arXiv preprint arXiv:2211.15029, 2022.
Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021.
Ho, J. and Salimans, T. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D. d. L., Hendricks, L. A., Welbl, J., Clark, A., et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
Holtzman, A., Buys, J., Du, L., Forbes, M., and Choi, Y. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751, 2019.
Hoogeboom, E., Gritsenko, A. A., Bastings, J., Poole, B., Berg, R. v. d., and Salimans, T. Autoregressive diffusion models. arXiv preprint arXiv:2110.02037, 2021a.
Hoogeboom, E., Nielsen, D., Jaini, P., Forré, P., and Welling, M. Argmax flows and multinomial diffusion: Learning categorical distributions. Advances in Neural Information Processing Systems, 34:12454–12465, 2021b.
Hu, S., Tu, Y., Han, X., He, C., Cui, G., Long, X., Zheng, Z., Fang, Y., Huang, Y., Zhao, W., et al. Minicpm: Unveiling the potential of small language models with scalable training strategies. arXiv preprint arXiv:2404.06395, 2024.
Huang, Y., Bai, Y., Zhu, Z., Zhang, J., Zhang, J., Su, T., Liu, J., Lv, C., Zhang, Y., Fu, Y., et al. C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models. Advances in Neural Information Processing Systems, 36, 2024a.
Huang, Y., Zhang, J., Shan, Z., and He, J. Compression represents intelligence linearly. arXiv preprint arXiv:2404.09937, 2024b.
Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., Casas, D. d. l., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
Kitouni, O., Nolte, N., Hensman, J., and Mitra, B. Disk: A diffusion model for structured knowledge. arXiv preprint arXiv:2312.05253, 2023.
Kitouni, O., Nolte, N., Bouchacourt, D., Williams, A., Rabbat, M., and Ibrahim, M. The factorization curse: Which tokens you predict underlie the reversal curse and more. arXiv preprint arXiv:2406.05183, 2024.
Kou, S., Hu, L., He, Z., Deng, Z., and Zhang, H. Cllms: Consistency large language models. arXiv preprint arXiv:2403.00835, 2024.
Li, H., Zhang, Y., Koto, F., Yang, Y., Zhao, H., Gong, Y., Duan, N., and Baldwin, T. Cmmlu: Measuring massive multitask language understanding in chinese. arXiv preprint arXiv:2306.09212, 2023.
Li, X., Thickstun, J., Gulrajani, I., Liang, P. S., and Hashimoto, T. B. Diffusion-lm improves controllable text generation. Advances in Neural Information Processing Systems, 35:4328–4343, 2022.
Lin, S., Hilton, J., and Evans, O. Truthfulqa: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958, 2021.
Lin, Z., Gong, Y., Shen, Y., Wu, T., Fan, Z., Lin, C., Duan, N., and Chen, W. Text generation with diffusion language models: A pre-training approach with continuous paragraph denoise. In International Conference on Machine Learning, pp. 21051–21064. PMLR, 2023.
Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024.
Loshchilov, I. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
Lou, A. and Ermon, S. Reflected diffusion models, 2023.
Lou, A., Meng, C., and Ermon, S. Discrete diffusion language modeling by estimating the ratios of the data distribution. arXiv preprint arXiv:2310.16834, 2023.
Mahabadi, R. K., Ivison, H., Tae, J., Henderson, J., Beltagy, I., Peters, M. E., and Cohan, A. Tess: Text-to-text self-conditioned simplex diffusion, 2024.
Meng, C., Choi, K., Song, J., and Ermon, S. Concrete score matching: Generalized score matching for discrete data. Advances in Neural Information Processing Systems, 35:34532–34545, 2022.
Nie, S., Zhu, F., Du, C., Pang, T., Liu, Q., Zeng, G., Lin, M., and Li, C. Scaling up masked diffusion models on text. arXiv preprint arXiv:2410.18514, 2024.
OpenAI. ChatGPT: Optimizing Language Models for Dialogue. OpenAI blog, November 2022. URL https://openai.com/blog/chatgpt/.
OpenAI. Learning to reason with llms, 2024. URL https://openai.com/index/learning-to-reason-with-llms/.
Ou, J., Nie, S., Xue, K., Zhu, F., Sun, J., Li, Z., and Li, C. Your absorbing discrete diffusion secretly models the conditional distributions of clean data. arXiv preprint arXiv:2406.03736, 2024.
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022.
Park, J. S., O’Brien, J., Cai, C. J., Morris, M. R., Liang, P., and Bernstein, M. S. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th annual acm symposium on user interface software and technology, pp. 1–22, 2023.
Peebles, W. and Xie, S. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4195–4205, 2023.
Radford, A. Improving language understanding by generative pre-training, 2018.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., and Finn, C. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2024.
Reid, M., Hellendoorn, V. J., and Neubig, G. Diffuser: Discrete diffusion via edit-based reconstruction, 2022.
Rein, D., Hou, B. L., Stickland, A. C., Petty, J., Pang, R. Y., Dirani, J., Michael, J., and Bowman, S. R. Gpqa: A graduate-level google-proof q&a benchmark. arXiv preprint arXiv:2311.12022, 2023.
Richemond, P. H., Dieleman, S., and Doucet, A. Categorical sdes with simplex diffusion, 2022.
Sahoo, S. S., Arriola, M., Schiff, Y., Gokaslan, A., Marroquin, E., Chiu, J. T., Rush, A., and Kuleshov, V. Simple and effective masked diffusion language models. arXiv preprint arXiv:2406.07524, 2024.
Sakaguchi, K., Bras, R. L., Bhagavatula, C., and Choi, Y. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021.
Shannon, C. E. A mathematical theory of communication. The Bell system technical journal, 27(3):379–423, 1948.
Shazeer, N. Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020.
Shi, J., Han, K., Wang, Z., Doucet, A., and Titsias, M. K. Simplified and generalized masked diffusion for discrete data. arXiv preprint arXiv:2406.04329, 2024.
Shih, A., Sadigh, D., and Ermon, S. Training and inference on any-order autoregressive models the right way. In Advances in Neural Information Processing Systems, 2022.
Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pp. 2256–2265. PMLR, 2015.
Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.
Strudel, R., Tallec, C., Altché, F., Du, Y., Ganin, Y., Mensch, A., Grathwohl, W., Savinov, N., Dieleman, S., Sifre, L., et al. Self-conditioned embedding diffusion for text generation. arXiv preprint arXiv:2211.04236, 2022.
Su, J., Ahmed, M., Lu, Y., Pan, S., Bo, W., and Liu, Y. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
Sun, H., Yu, L., Dai, B., Schuurmans, D., and Dai, H. Score-based continuous-time discrete diffusion models. arXiv preprint arXiv:2211.16750, 2022.
Suzgun, M., Scales, N., Schärli, N., Gehrmann, S., Tay, Y., Chung, H. W., Chowdhery, A., Le, Q. V., Chi, E. H., Zhou, D., et al. Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022.
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
Uria, B., Murray, I., and Larochelle, H. A deep and tractable density estimator. In Proceedings of the 31st International Conference on Machine Learning, 2014.
Vaswani, A. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017.
Wang, L., Ma, C., Feng, X., Zhang, Z., Yang, H., Zhang, J., Chen, Z., Tang, J., Chen, X., Lin, Y., et al. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6):186345, 2024a.
Wang, X., Zheng, Z., Ye, F., Xue, D., Huang, S., and Gu, Q. Diffusion language models are versatile protein learners. arXiv preprint arXiv:2402.18567, 2024b.
Wang, X., Zheng, Z., Ye, F., Xue, D., Huang, S., and Gu, Q. Dplm-2: A multimodal diffusion protein language model. arXiv preprint arXiv:2410.13782, 2024c.
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022.
Wu, T., Fan, Z., Liu, X., Gong, Y., Shen, Y., Jiao, J., Zheng, H.-T., Li, J., Wei, Z., Guo, J., Duan, N., and Chen, W. Ar-diffusion: Auto-regressive diffusion model for text generation, 2023.
Xu, C., Wang, X., Liao, Z., Li, Y., Hou, T., and Deng, Z. Show-o turbo: Towards accelerated unified multimodal understanding and generation. arXiv preprint arXiv:2502.05415, 2025.
Xue, K., Zhou, Y., Nie, S., Min, X., Zhang, X., Zhou, J., and Li, C. Unifying bayesian flow networks and diffusion models through stochastic differential equations. arXiv preprint arXiv:2404.15766, 2024.
Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., Lin, H., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Lin, J., Dang, K., Lu, K., Bao, K., Yang, K., Yu, L., Li, M., Xue, M., Zhang, P., Zhu, Q., Men, R., Lin, R., Li, T., Xia, T., Ren, X., Ren, X., Fan, Y., Su, Y., Zhang, Y., Wan, Y., Liu, Y., Cui, Z., Zhang, Z., and Qiu, Z. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024.
Ye, J., Zheng, Z., Bao, Y., Qian, L., and Gu, Q. Diffusion language models can perform many tasks with scaling and instruction-finetuning. arXiv preprint arXiv:2308.12219, 2023a.
Ye, J., Zheng, Z., Bao, Y., Qian, L., and Wang, M. Dinoiser: Diffused conditional sequence learning by manipulating noises. arXiv preprint arXiv:2302.10025, 2023b.
Ye, T., Xu, Z., Li, Y., and Allen-Zhu, Z. Physics of Language Models: Part 2.1, Grade-School Math and the Hidden Reasoning Process. arXiv preprint arXiv:2407.20311, 2024.
Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., and Choi, Y. Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019.
Zhang, B. and Sennrich, R. Root mean square layer normalization. Advances in Neural Information Processing Systems, 32, 2019.
Zhao, W. X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min, Y., Zhang, B., Zhang, J., Dong, Z., et al. A survey of large language models. arXiv preprint arXiv:2303.18223, 2023.
Zheng, K., Chen, Y., Mao, H., Liu, M.-Y., Zhu, J., and Zhang, Q. Masked diffusion models are secretly time-agnostic masked models and exploit inaccurate categorical sampling, 2024. URL https://arxiv.org/abs/2409.02908.
Zheng, L., Yuan, J., Yu, L., and Kong, L. A reparameterized discrete diffusion model for text generation. arXiv preprint arXiv:2302.05737, 2023.