
LLM (Large Language Model) Advent Calendar 2024

Day 5

Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

Posted at 2024-12-03

Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

Peize Sun¹, Yi Jiang²†, Shoufa Chen¹, Shilong Zhang¹, Bingyue Peng², Ping Luo¹*, Zehuan Yuan²* (¹The University of Hong Kong, ²ByteDance)
https://arxiv.org/pdf/2406.06525

References

Ibrahim M Alabdulmohsin, Behnam Neyshabur, and Xiaohua Zhai. Revisiting neural scaling laws in language and vision. Advances in Neural Information Processing Systems, 35:22300–22312, 2022.
Alpha-VLLM. Large-DiT. https://github.com/Alpha-VLLM/LLaMA2-Accessory/tree/main/Large-DiT-ImageNet, 2024.
Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. Palm 2 technical report. arXiv preprint https://arxiv.org/pdf/2305.10403, 2023.
Jason Ansel, Edward Yang, Horace He, Natalia Gimelshein, Animesh Jain, Michael Voznesensky, Bin Bao, Peter Bell, David Berard, Evgeni Burovski, et al. Pytorch 2: Faster machine learning through dynamic python bytecode transformation and graph compilation. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pp. 929–947, 2024.
Anthropic. Claude. https://www.anthropic.com/index/introducing-claude, 2023.
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint https://arxiv.org/pdf/2309.16609, 2023a.
Yutong Bai, Xinyang Geng, Karttikeya Mangalam, Amir Bar, Alan Yuille, Trevor Darrell, Jitendra Malik, and Alexei A Efros. Sequential modeling enables scalable learning for large vision models. arXiv preprint https://arxiv.org/pdf/2312.00785, 2023b.
Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint https://arxiv.org/pdf/1308.3432, 2013.
James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf, 2(3):8, 2023.
Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, et al. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint https://arxiv.org/pdf/2401.02954, 2024.
Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis. arXiv preprint https://arxiv.org/pdf/1809.11096, 2018.
Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. OpenAI, 2024. URL https://openai.com/research/video-generation-models-as-world-simulators.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. Maskgit: Masked generative image transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11315–11325, 2022.
Huiwen Chang, Han Zhang, Jarred Barber, AJ Maschinot, Jose Lezama, Lu Jiang, Ming-Hsuan Yang, Kevin Murphy, William T Freeman, Michael Rubinstein, et al. Muse: Text-to-image generation via masked generative transformers. arXiv preprint https://arxiv.org/pdf/2301.00704, 2023.
Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating large language model decoding with speculative sampling. arXiv preprint https://arxiv.org/pdf/2302.01318, 2023a.
Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, et al. Pixart: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint https://arxiv.org/pdf/2310.00426, 2023b.
Shoufa Chen, Mengmeng Xu, Jiawei Ren, Yuren Cong, Sen He, Yanping Xie, Animesh Sinha, Ping Luo, Tao Xiang, and Juan-Manuel Perez-Rua. Gentron: Delving deep into diffusion transformers for image and video generation. arXiv preprint https://arxiv.org/pdf/2312.04557, 2023c.
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113, 2023.
Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70):1–53, 2024.
Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. Advances in Neural Information Processing Systems, 36, 2024.
Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022.
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Ieee, 2009.
Tim Dettmers. bitsandbytes. https://github.com/TimDettmers/bitsandbytes, 2022.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint https://arxiv.org/pdf/1810.04805, 2018.
Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021.
Runpei Dong, Chunrui Han, Yuang Peng, Zekun Qi, Zheng Ge, Jinrong Yang, Liang Zhao, Jianjian Sun, Hongyu Zhou, Haoran Wei, Xiangwen Kong, Xiangyu Zhang, Kaisheng Ma, and Li Yi. DreamLLM: Synergistic multimodal comprehension and creation. In The Twelfth International Conference on Learning Representations, 2024.
Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 12873–12883, 2021.
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis, 2024.
Yuxin Fang, Quan Sun, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. Eva-02: A visual representation for neon genesis. arXiv preprint https://arxiv.org/pdf/2303.11331, 2023.
Yuying Ge, Sijie Zhao, Ziyun Zeng, Yixiao Ge, Chen Li, Xintao Wang, and Ying Shan. Making llama see and draw with seed tokenizer. arXiv preprint https://arxiv.org/pdf/2310.01218, 2023.
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014.
Google. Bard. https://bard.google.com/, 2023.
Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B Brown, Prafulla Dhariwal, Scott Gray, et al. Scaling laws for autoregressive generative modeling. arXiv preprint https://arxiv.org/pdf/2010.14701, 2020.
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint https://arxiv.org/pdf/2207.12598, 2022.
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. The Journal of Machine Learning Research, 23(1):2249–2281, 2022a.
Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. The Journal of Machine Learning Research, 23(1):2249–2281, 2022b.
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint https://arxiv.org/pdf/2203.15556, 2022.
Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1125–1134, 2017.
Minguk Kang, Jun-Yan Zhu, Richard Zhang, Jaesik Park, Eli Shechtman, Sylvain Paris, and Taesung Park. Scaling up gans for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10124–10134, 2023.
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint https://arxiv.org/pdf/2001.08361, 2020.
Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 4401–4410, 2019.
Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint https://arxiv.org/pdf/1312.6114, 2013.
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.
Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. Advances in neural information processing systems, 32, 2019.
LAION. Laion-coco 600m. https://laion.ai/blog/laion-coco, 2022.
Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, and Wook-Shin Han. Autoregressive image generation using residual quantization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11523–11532, 2022.
Daiqing Li, Aleks Kamko, Ehsan Akhgari, Ali Sabet, Linmiao Xu, and Suhail Doshi. Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation. arXiv preprint https://arxiv.org/pdf/2402.17245, 2024.
Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre- training for unified vision-language understanding and generation. In International conference on machine learning, pp. 12888–12900. PMLR, 2022.
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pp. 740–755. Springer, 2014.
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024.
Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems, 35:5775–5787, 2022a.
Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi, and Aniruddha Kembhavi. Unified- io: A unified model for vision, language, and multi-modal tasks. arXiv preprint https://arxiv.org/pdf/2206.08916, 2022b.
Jiasen Lu, Christopher Clark, Sangho Lee, Zichen Zhang, Savya Khosla, Ryan Marten, Derek Hoiem, and Aniruddha Kembhavi. Unified-io 2: Scaling autoregressive multimodal models with vision, language, audio, and action. arXiv preprint https://arxiv.org/pdf/2312.17172, 2023.
Chuofan Ma, Yi Jiang, Jiannan Wu, Zehuan Yuan, and Xiaojuan Qi. Groma: Localized visual tokenization for grounding multimodal large language models. arXiv preprint https://arxiv.org/pdf/2404.13013, 2024.
Charlie Nash, Jacob Menick, Sander Dieleman, and Peter W Battaglia. Generating images with sparse representations. arXiv preprint https://arxiv.org/pdf/2103.03841, 2021.
Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint https://arxiv.org/pdf/2112.10741, 2021.
OpenAI. Chatgpt. https://openai.com/blog/chatgpt, 2022.
OpenAI. Consistency decoder. https://github.com/openai/consistencydecoder, 2023a.
OpenAI. Gpt-4 technical report. arXiv preprint https://arxiv.org/pdf/2303.08774, 2023b.
OpenLM-Research. Openllama 3b. https://huggingface.co/openlm-research/open_llama_3b, 2023.
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35: 27730–27744, 2022.
William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4195–4205, 2023.
Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint https://arxiv.org/pdf/2306.14824, 2023.
Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint https://arxiv.org/pdf/2307.01952, 2023.
Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. article, 2018.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning, pp. 8821–8831. PMLR, 2021.
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text- conditional image generation with clip latents. arXiv preprint https://arxiv.org/pdf/2204.06125, 1(2):3, 2022.
Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 3505–3506, 2020.
Ali Razavi, Aaron Van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with vq-vae-2. Advances in neural information processing systems, 32, 2019.
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10684–10695, 2022.
Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. Advances in neural information processing systems, 29, 2016.
Axel Sauer, Katja Schwarz, and Andreas Geiger. Stylegan-xl: Scaling stylegan to large diverse datasets. In ACM SIGGRAPH 2022 conference proceedings, pp. 1–10, 2022.
Noam Shazeer. Glu variants improve transformer. arXiv preprint https://arxiv.org/pdf/2002.05202, 2020.
Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint https://arxiv.org/pdf/1909.08053, 2019.
Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint https://arxiv.org/pdf/2010.02502, 2020.
Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in neural information processing systems, 32, 2019.
Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative pretraining in multimodality. arXiv preprint https://arxiv.org/pdf/2307.05222, 2023a.
Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative pretraining in multimodality. arXiv preprint https://arxiv.org/pdf/2307.05222, 2023b.
Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint https://arxiv.org/pdf/2405.09818, 2024.
Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint https://arxiv.org/pdf/2312.11805, 2023.
InternLM Team. Internlm: A multilingual language model with progressively enhanced capabilities, 2023.
Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction. arXiv preprint https://arxiv.org/pdf/2404.02905, 2024.
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint https://arxiv.org/pdf/2302.13971, 2023a.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint https://arxiv.org/pdf/2307.09288, 2023b.
Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. Advances in neural information processing systems, 30, 2017.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. arXiv preprint https://arxiv.org/pdf/2206.07682, 2022.
BigScience Workshop, Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, et al. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint https://arxiv.org/pdf/2211.05100, 2022.
Zeyue Xue, Guanglu Song, Qiushan Guo, Boxiao Liu, Zhuofan Zong, Yu Liu, and Ping Luo. Raphael: Text-to-image generation via large mixture of diffusion paths. arXiv preprint https://arxiv.org/pdf/2305.18295, 2023.
Aiyuan Yang, Bin Xiao, Bingning Wang, Borong Zhang, Ce Bian, Chao Yin, Chenxu Lv, Da Pan, Dian Wang, Dong Yan, et al. Baichuan 2: Open large-scale language models. arXiv preprint https://arxiv.org/pdf/2309.10305, 2023.
Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, and Yonghui Wu. Vector-quantized image modeling with improved vqgan. arXiv preprint https://arxiv.org/pdf/2110.04627, 2021.
Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint https://arxiv.org/pdf/2206.10789, 2(3):5, 2022.
Lijun Yu, Yong Cheng, Kihyuk Sohn, José Lezama, Han Zhang, Huiwen Chang, Alexander G Hauptmann, Ming-Hsuan Yang, Yuan Hao, Irfan Essa, et al. Magvit: Masked generative video transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10459–10469, 2023a.
Lijun Yu, José Lezama, Nitesh B Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Agrim Gupta, Xiuye Gu, Alexander G Hauptmann, et al. Language model beats diffusion – tokenizer is key to visual generation. arXiv preprint https://arxiv.org/pdf/2310.05737, 2023b.
Biao Zhang and Rico Sennrich. Root mean square layer normalization. Advances in Neural Information Processing Systems, 32, 2019.
Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 586–595, 2018.
Shilong Zhang, Peize Sun, Shoufa Chen, Min Xiao, Wenqi Shao, Wenwei Zhang, Kai Chen, and Ping Luo. Gpt4roi: Instruction tuning large language model on region-of-interest. arXiv preprint https://arxiv.org/pdf/2307.03601, 2023.
Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint https://arxiv.org/pdf/2205.01068, 2022.
Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. Pytorch fsdp: experiences on scaling fully sharded data parallel. arXiv preprint https://arxiv.org/pdf/2304.11277, 2023.
Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models, 2023.

Word list

Coming back to this after a few days, I made a mistake and converted the text to uppercase instead of lowercase. Sorry.
The processing is not clean: "arxiv" ends up counted as two separate entries, 53 and 44.
Some short tokens were also tallied as four separate variants.
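
As a rough sketch, a count like this can be produced with a short bash pipeline; the input file name paper.txt is an assumption. Lowercasing every token before counting would also avoid the uppercase mix-up and the split "arxiv" entries mentioned above.

```bash
# Minimal word-count sketch (assumed input file paper.txt).
# Runs of non-letters become newlines, everything is lowercased,
# then identical tokens are counted and sorted by frequency.
tr -cs 'A-Za-z' '\n' < paper.txt | tr 'A-Z' 'a-z' | sort | uniq -c | sort -rn | head -50
```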

term count
THE 256
AL 202
A 194
ET 194
AND 190
OF 169
IMAGE 165
B 131
M 128
MODELS 125
IN 115
TO 109
IS 102
ARXIV 97
WITH 73
GENERATION 71
ON 62
FOR 55
MODEL 49
TRAINING 49
PREPRINT 48
WE 48
ARE 44
AS 43
TEXT 42
LANGUAGE 41
B 40
OUR 32
DIFFUSION 30
FROM 29
L 29
CODEBOOK 28
LARGE 28
YU 28
CONDITIONAL 27
LI 27
AUTOREGRESSIVE 26
BY 26
CHEN 26
SIZE 26
THE 26
IN 25
QUALITY 25
VISUAL 25
ALL 24
IMAGENET 24
ZHANG 24
AND 23
NO 22
OF 22
DATA 21
SYSTEMS 21
THIS 21
TOKENIZER 21
AN 20
FID 20
FREE 20
GUIDANCE 20
HIGH 20
TOKENS 20
INFORMATION 19
NEURAL 19
ON 19
PROCESSING 19
STAGE 19
THAT 19
XL 19
ADVANCES 18
CONFERENCE 18
IMAGES 18
K 18
PERFORMANCE 18
TOP 18
WANG 18
IMAGE 17
SCALING 17
ESSER 16
IS 16
SCALE 16
TOKEN 16
CFG 15
CLASSIFIER 15
I 15
PP 15
RECONSTRUCTION 15
XXL 15
CLASS 14
PARAMETERS 14
BENCHMARK 13
INFERENCE 13
ITS 13
OPEN 13
PROCEEDINGS 13
TABLE 13
CHANG 12
DHARIWAL 12
GENERATIVE 12
LAION 12
LLAMAGEN 12
RESOLUTION 12
ROMBACH 12
USE 12
USING 12
VISION 12
XIE 12
COCO 11
COMPUTER 11
EMBEDDING 11
HO 11
IEEE 11
LEARNING 11
LIU 11
LLAMAGEN 11
LU 11
NICHOL 11
PEEBLES 11
SHOWN 11
TOKENIZERS 11
TRAINED 11
VISION 11
ARCHITECTURE 10
BETTER 10
FIGURE 10
HTTPS 10
IT 10
LEE 10
LLM 10
NOT 10
PATTERN 10
RADFORD 10
RECOGNITION 10
SAME 10
SAMPLING 10
VECTOR 10
VLLM 10
WHEN 10
AT 9
BE 9
BLUE 9
DOWNSAMPLE 9
J 9
LEARNING 9
MORE 9
P 9
SALIMANS 9
SET 9
TABLE 9
THESE 9
TWO 9
UP 9
WHICH 9
BASED 8
CAN 8
COM 8
COMMUNITY 8
CVF 8
DESIGNS 8
FIGURE 8
GENERATED 8
II 8
LLAMA 8
LOSS 8
METHODS 8
MODELS 8
NEXT 8
OPENAI 8
OR 8
POPULAR 8
RESEARCH 8
T 8
TIM 8
UNIFIED 8
USED 8
WU 8
XL 8
YANG 8
ADVERSARIAL 7
AESTHETIC 7
ALWAYS 7
C 7
CODES 7
COMPETITIVE 7
DAVID 7
DECODER 7
EPOCHS 7
ET 7
FOUNDATION 7
HAN 7
LABEL 7
MACHINE 7
MASKED 7
METRICS 7
MODELING 7
QUALITY 7
RATIO 7
SAMPLES 7
SETTING 7
SONG 7
SPEED 7
TEAM 7
TOUVRON 7
TRAINING 7
VQGAN 7
WE 7
WORK 7
ACHIEVE 6
ADOPT 6
BROWN 6
CAPTION 6
CHRISTOPHER 6
DEEP 6
DEN 6
DONG 6
EFFECT 6
ENCODER 6
EXAMPLES 6
EXPERIMENTS 6
FEATURE 6
FIDELITY 6
FURTHER 6
GE 6
HAS 6
HUANG 6
IMPROVE 6
JIANG 6
KWON 6
MULTIMODAL 6
NO 6
OORD 6
PODELL 6
PRE 6
PREVIOUS 6
Q 6
RAMESH 6
RESEARCH 6
SCALABLE 6
SCORE 6
SERVING 6
SHOWS 6
SOURCE 6
SUN 6
SYNTHESIS 6
TEMPERATURE 6
THAN 6
THEIR 6
UNDERSTANDING 6
VAN 6
VANILLA 6
WEI 6
WEIGHT 6
WHERE 6
WILLIAM 6
XU 6
YI 6
ZHAO 6
ZHU 6
ADVANCED 5
AFTER 5
ALEXANDER 5
ALIGNMENT 5
ALSO 5
AR 5
ARE 5
BAI 5
BROCK 5
DAI 5
DENG 5
DESIGN 5
DETAILED 5
DEVELOPED 5
DIMENSION 5
DIT 5
E 5
EFFECT 5
ERMON 5
EVALUATING 5
FAN 5
FID↓ 5
FLOWERS 5
FOR 5
FUTURE 5
GENERATION 5
GRADIENT 5
HAVE 5
INCEPTION 5
INCREASING 5
INSTRUCTION 5
INTERNAL 5
INTO 5
IS↑ 5
JUN 5
L 5
LUO 5
ORIGINAL 5
OTHER 5
OVER 5
PENG 5
PHOTO 5
PIXELS 5
PROCESS 5
PSNR↑ 5
RECALL 5
REPORT 5
RESIZED 5
RESULTS 5
ROOM 5
SCALABILITY 5
SEBASTIAN 5
SHAZEER 5
SITTING 5
SPACE 5
SSIM↑ 5
THEY 5
TRANSFORMERS 5
USAGE 5
VALIDATION 5
VQGAN 5
WORKS 5
Z 5

docker

```bash
$ docker commit c6430640806f kaizenjapan/wc
sha256:10a0f931326a53dc72485350888d90403714c059c878c7280660f5e5a9ee3f36
$ docker push kaizenjapan/wc
Using default tag: latest
The push refers to repository [docker.io/kaizenjapan/wc]
93bb6265160c: Pushed 
afa85b6e2808: Mounted from kaizenjapan/llm 
8da92be08e15: Mounted from kaizenjapan/llm 
latest: digest: sha256:f02a9142be5b988c7aafe65bdb08eeebf5b487c34f342bab15cc60be3cef990c size: 954
```
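
To reproduce the count, the image pushed above can be pulled and run locally; a minimal sketch follows. The shell command inside the container is an assumption, since only the commit and push steps appear here.

```bash
# Pull the image pushed above and open an interactive shell in it
# (assumed usage; the entrypoint and working directory inside the image are not shown in this article).
docker pull kaizenjapan/wc
docker run -it kaizenjapan/wc /bin/bash
```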