
LLM (Large Language Model) Advent Calendar 2024

Day 5

Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

Posted at 2024-12-03

Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

Peize Sun¹, Yi Jiang²†, Shoufa Chen¹, Shilong Zhang¹, Bingyue Peng², Ping Luo¹*, Zehuan Yuan²* (¹The University of Hong Kong, ²ByteDance)
https://arxiv.org/pdf/2406.06525

References

Ibrahim M Alabdulmohsin, Behnam Neyshabur, and Xiaohua Zhai. Revisiting neural scaling laws in language and vision. Advances in Neural Information Processing Systems, 35:22300–22312, 2022.
Alpha-VLLM. Large-DiT. https://github.com/Alpha-VLLM/LLaMA2-Accessory/tree/main/Large-DiT-ImageNet, 2024.
Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. Palm 2 technical report. arXiv preprint https://arxiv.org/pdf/2305.10403, 2023.
Jason Ansel, Edward Yang, Horace He, Natalia Gimelshein, Animesh Jain, Michael Voznesensky, Bin Bao, Peter Bell, David Berard, Evgeni Burovski, et al. Pytorch 2: Faster machine learning through dynamic python bytecode transformation and graph compilation. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pp. 929–947, 2024.
Anthropic. Claude. https://www.anthropic.com/index/introducing-claude, 2023.
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint https://arxiv.org/pdf/2309.16609, 2023a.
Yutong Bai, Xinyang Geng, Karttikeya Mangalam, Amir Bar, Alan Yuille, Trevor Darrell, Jitendra Malik, and Alexei A Efros. Sequential modeling enables scalable learning for large vision models. arXiv preprint https://arxiv.org/pdf/2312.00785, 2023b.
Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint https://arxiv.org/pdf/1308.3432, 2013.
James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf, 2(3):8, 2023.
Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, et al. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint https://arxiv.org/pdf/2401.02954, 2024.
Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis. arXiv preprint https://arxiv.org/pdf/1809.11096, 2018.
Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. OpenAI, 2024. URL https://openai.com/research/video-generation-models-as-world-simulators.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. Maskgit: Masked generative image transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11315–11325, 2022.
Huiwen Chang, Han Zhang, Jarred Barber, AJ Maschinot, Jose Lezama, Lu Jiang, Ming-Hsuan Yang, Kevin Murphy, William T Freeman, Michael Rubinstein, et al. Muse: Text-to-image generation via masked generative transformers. arXiv preprint https://arxiv.org/pdf/2301.00704, 2023.
Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating large language model decoding with speculative sampling. arXiv preprint https://arxiv.org/pdf/2302.01318, 2023a.
Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, et al. Pixart: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint https://arxiv.org/pdf/2310.00426, 2023b.
Shoufa Chen, Mengmeng Xu, Jiawei Ren, Yuren Cong, Sen He, Yanping Xie, Animesh Sinha, Ping Luo, Tao Xiang, and Juan-Manuel Perez-Rua. Gentron: Delving deep into diffusion transformers for image and video generation. arXiv preprint https://arxiv.org/pdf/2312.04557, 2023c.
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113, 2023.
Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70):1–53, 2024.
Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. Advances in Neural Information Processing Systems, 36, 2024.
Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022.
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Ieee, 2009.
Tim Dettmers. bitsandbytes. https://github.com/TimDettmers/bitsandbytes, 2022.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint https://arxiv.org/pdf/1810.04805, 2018.
Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021.
Runpei Dong, Chunrui Han, Yuang Peng, Zekun Qi, Zheng Ge, Jinrong Yang, Liang Zhao, Jianjian Sun, Hongyu Zhou, Haoran Wei, Xiangwen Kong, Xiangyu Zhang, Kaisheng Ma, and Li Yi. DreamLLM: Synergistic multimodal comprehension and creation. In The Twelfth International Conference on Learning Representations, 2024.
Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 12873–12883, 2021.
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis, 2024.
Yuxin Fang, Quan Sun, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. Eva-02: A visual representation for neon genesis. arXiv preprint https://arxiv.org/pdf/2303.11331, 2023.
Yuying Ge, Sijie Zhao, Ziyun Zeng, Yixiao Ge, Chen Li, Xintao Wang, and Ying Shan. Making llama see and draw with seed tokenizer. arXiv preprint https://arxiv.org/pdf/2310.01218, 2023.
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014.
Google. Bard. https://bard.google.com/, 2023.
Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B Brown, Prafulla Dhariwal, Scott Gray, et al. Scaling laws for autoregressive generative modeling. arXiv preprint https://arxiv.org/pdf/2010.14701, 2020.
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint https://arxiv.org/pdf/2207.12598, 2022.
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. The Journal of Machine Learning Research, 23(1):2249–2281, 2022a.
Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. The Journal of Machine Learning Research, 23(1):2249–2281, 2022b.
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint https://arxiv.org/pdf/2203.15556, 2022.
Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1125–1134, 2017.
Minguk Kang, Jun-Yan Zhu, Richard Zhang, Jaesik Park, Eli Shechtman, Sylvain Paris, and Taesung Park. Scaling up gans for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10124–10134, 2023.
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint https://arxiv.org/pdf/2001.08361, 2020.
Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 4401–4410, 2019.
Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint https://arxiv.org/pdf/1312.6114, 2013.
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.
Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. Advances in neural information processing systems, 32, 2019.
LAION. Laion-coco 600m. https://laion.ai/blog/laion-coco, 2022.
Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, and Wook-Shin Han. Autoregressive image generation using residual quantization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11523–11532, 2022.
Daiqing Li, Aleks Kamko, Ehsan Akhgari, Ali Sabet, Linmiao Xu, and Suhail Doshi. Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation. arXiv preprint https://arxiv.org/pdf/2402.17245, 2024.
Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre- training for unified vision-language understanding and generation. In International conference on machine learning, pp. 12888–12900. PMLR, 2022.
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pp. 740–755. Springer, 2014.
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024.
Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems, 35:5775–5787, 2022a.
Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi, and Aniruddha Kembhavi. Unified- io: A unified model for vision, language, and multi-modal tasks. arXiv preprint https://arxiv.org/pdf/2206.08916, 2022b.
Jiasen Lu, Christopher Clark, Sangho Lee, Zichen Zhang, Savya Khosla, Ryan Marten, Derek Hoiem, and Aniruddha Kembhavi. Unified-io 2: Scaling autoregressive multimodal models with vision, language, audio, and action. arXiv preprint https://arxiv.org/pdf/2312.17172, 2023.
Chuofan Ma, Yi Jiang, Jiannan Wu, Zehuan Yuan, and Xiaojuan Qi. Groma: Localized visual tokenization for grounding multimodal large language models. arXiv preprint https://arxiv.org/pdf/2404.13013, 2024.
Charlie Nash, Jacob Menick, Sander Dieleman, and Peter W Battaglia. Generating images with sparse representations. arXiv preprint https://arxiv.org/pdf/2103.03841, 2021.
Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint https://arxiv.org/pdf/2112.10741, 2021.
OpenAI. Chatgpt. https://openai.com/blog/chatgpt, 2022.
OpenAI. Consistency decoder. https://github.com/openai/consistencydecoder, 2023a.
OpenAI. Gpt-4 technical report. arXiv preprint https://arxiv.org/pdf/2303.08774, 2023b.
OpenLM-Research. Openllama 3b. https://huggingface.co/openlm-research/open_llama_3b, 2023.
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35: 27730–27744, 2022.
William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4195–4205, 2023.
Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint https://arxiv.org/pdf/2306.14824, 2023.
Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint https://arxiv.org/pdf/2307.01952, 2023.
Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. article, 2018.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning, pp. 8821–8831. PMLR, 2021.
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text- conditional image generation with clip latents. arXiv preprint https://arxiv.org/pdf/2204.06125, 1(2):3, 2022.
Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 3505–3506, 2020.
Ali Razavi, Aaron Van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with vq-vae-2. Advances in neural information processing systems, 32, 2019.
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10684–10695, 2022.
Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. Advances in neural information processing systems, 29, 2016.
Axel Sauer, Katja Schwarz, and Andreas Geiger. Stylegan-xl: Scaling stylegan to large diverse datasets. In ACM SIGGRAPH 2022 conference proceedings, pp. 1–10, 2022.
Noam Shazeer. Glu variants improve transformer. arXiv preprint https://arxiv.org/pdf/2002.05202, 2020.
Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint https://arxiv.org/pdf/1909.08053, 2019.
Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint https://arxiv.org/pdf/2010.02502, 2020.
Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in neural information processing systems, 32, 2019.
Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative pretraining in multimodality. arXiv preprint https://arxiv.org/pdf/2307.05222, 2023a.
Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative pretraining in multimodality. arXiv preprint https://arxiv.org/pdf/2307.05222, 2023b.
Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint https://arxiv.org/pdf/2405.09818, 2024.
Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint https://arxiv.org/pdf/2312.11805, 2023.
InternLM Team. Internlm: A multilingual language model with progressively enhanced capabilities, 2023.
Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction. arXiv preprint https://arxiv.org/pdf/2404.02905, 2024.
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint https://arxiv.org/pdf/2302.13971, 2023a.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint https://arxiv.org/pdf/2307.09288, 2023b.
Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. Advances in neural information processing systems, 30, 2017.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. arXiv preprint https://arxiv.org/pdf/2206.07682, 2022.
BigScience Workshop, Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, et al. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint https://arxiv.org/pdf/2211.05100, 2022.
Zeyue Xue, Guanglu Song, Qiushan Guo, Boxiao Liu, Zhuofan Zong, Yu Liu, and Ping Luo. Raphael: Text-to-image generation via large mixture of diffusion paths. arXiv preprint https://arxiv.org/pdf/2305.18295, 2023.
Aiyuan Yang, Bin Xiao, Bingning Wang, Borong Zhang, Ce Bian, Chao Yin, Chenxu Lv, Da Pan, Dian Wang, Dong Yan, et al. Baichuan 2: Open large-scale language models. arXiv preprint https://arxiv.org/pdf/2309.10305, 2023.
Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, and Yonghui Wu. Vector-quantized image modeling with improved vqgan. arXiv preprint https://arxiv.org/pdf/2110.04627, 2021.
Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint https://arxiv.org/pdf/2206.10789, 2(3):5, 2022.
Lijun Yu, Yong Cheng, Kihyuk Sohn, José Lezama, Han Zhang, Huiwen Chang, Alexander G Hauptmann, Ming-Hsuan Yang, Yuan Hao, Irfan Essa, et al. Magvit: Masked generative video transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10459–10469, 2023a.
Lijun Yu, José Lezama, Nitesh B Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Agrim Gupta, Xiuye Gu, Alexander G Hauptmann, et al. Language model beats diffusion – tokenizer is key to visual generation. arXiv preprint https://arxiv.org/pdf/2310.05737, 2023b.
Biao Zhang and Rico Sennrich. Root mean square layer normalization. Advances in Neural Information Processing Systems, 32, 2019.
Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 586–595, 2018.
Shilong Zhang, Peize Sun, Shoufa Chen, Min Xiao, Wenqi Shao, Wenwei Zhang, Kai Chen, and Ping Luo. Gpt4roi: Instruction tuning large language model on region-of-interest. arXiv preprint https://arxiv.org/pdf/2307.03601, 2023.
Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint https://arxiv.org/pdf/2205.01068, 2022.
Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. Pytorch fsdp: experiences on scaling fully sharded data parallel. arXiv preprint https://arxiv.org/pdf/2304.11277, 2023.
Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models, 2023.

Word list

Coming back to this after a few days, I made a mistake and converted the text to uppercase instead of lowercase. Sorry.
The processing is not clean: "arxiv" ends up counted as two separate entries, 53 and 44.
Some short tokens were also tallied as four separate variants.
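
As a rough sketch, a count like this can be produced with a short bash pipeline; the input file name paper.txt is an assumption. Lowercasing every token before counting would also avoid the uppercase mix-up and the split "arxiv" entries mentioned above.

```bash
# Minimal word-count sketch (assumed input file paper.txt).
# Runs of non-letters become newlines, everything is lowercased,
# then identical tokens are counted and sorted by frequency.
tr -cs 'A-Za-z' '\n' < paper.txt | tr 'A-Z' 'a-z' | sort | uniq -c | sort -rn | head -50
```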

term count
THE 256
AL 202
A 194
ET 194
AND 190
OF 169
IMAGE 165
B 131
M 128
MODELS 125
IN 115
TO 109
IS 102
ARXIV 97
WITH 73
GENERATION 71
ON 62
FOR 55
MODEL 49
TRAINING 49
PREPRINT 48
WE 48
ARE 44
AS 43
TEXT 42
LANGUAGE 41
B 40
OUR 32
DIFFUSION 30
FROM 29
L 29
CODEBOOK 28
LARGE 28
YU 28
CONDITIONAL 27
LI 27
AUTOREGRESSIVE 26
BY 26
CHEN 26
SIZE 26
THE 26
IN 25
QUALITY 25
VISUAL 25
ALL 24
IMAGENET 24
ZHANG 24
AND 23
NO 22
OF 22
DATA 21
SYSTEMS 21
THIS 21
TOKENIZER 21
AN 20
FID 20
FREE 20
GUIDANCE 20
HIGH 20
TOKENS 20
INFORMATION 19
NEURAL 19
ON 19
PROCESSING 19
STAGE 19
THAT 19
XL 19
ADVANCES 18
CONFERENCE 18
IMAGES 18
K 18
PERFORMANCE 18
TOP 18
WANG 18
IMAGE 17
SCALING 17
ESSER 16
IS 16
SCALE 16
TOKEN 16
CFG 15
CLASSIFIER 15
I 15
PP 15
RECONSTRUCTION 15
XXL 15
CLASS 14
PARAMETERS 14
BENCHMARK 13
INFERENCE 13
ITS 13
OPEN 13
PROCEEDINGS 13
TABLE 13
CHANG 12
DHARIWAL 12
GENERATIVE 12
LAION 12
LLAMAGEN 12
RESOLUTION 12
ROMBACH 12
USE 12
USING 12
VISION 12
XIE 12
COCO 11
COMPUTER 11
EMBEDDING 11
HO 11
IEEE 11
LEARNING 11
LIU 11
LLAMAGEN 11
LU 11
NICHOL 11
PEEBLES 11
SHOWN 11
TOKENIZERS 11
TRAINED 11
VISION 11
ARCHITECTURE 10
BETTER 10
FIGURE 10
HTTPS 10
IT 10
LEE 10
LLM 10
NOT 10
PATTERN 10
RADFORD 10
RECOGNITION 10
SAME 10
SAMPLING 10
VECTOR 10
VLLM 10
WHEN 10
AT 9
BE 9
BLUE 9
DOWNSAMPLE 9
J 9
LEARNING 9
MORE 9
P 9
SALIMANS 9
SET 9
TABLE 9
THESE 9
TWO 9
UP 9
WHICH 9
BASED 8
CAN 8
COM 8
COMMUNITY 8
CVF 8
DESIGNS 8
FIGURE 8
GENERATED 8
II 8
LLAMA 8
LOSS 8
METHODS 8
MODELS 8
NEXT 8
OPENAI 8
OR 8
POPULAR 8
RESEARCH 8
T 8
TIM 8
UNIFIED 8
USED 8
WU 8
XL 8
YANG 8
ADVERSARIAL 7
AESTHETIC 7
ALWAYS 7
C 7
CODES 7
COMPETITIVE 7
DAVID 7
DECODER 7
EPOCHS 7
ET 7
FOUNDATION 7
HAN 7
LABEL 7
MACHINE 7
MASKED 7
METRICS 7
MODELING 7
QUALITY 7
RATIO 7
SAMPLES 7
SETTING 7
SONG 7
SPEED 7
TEAM 7
TOUVRON 7
TRAINING 7
VQGAN 7
WE 7
WORK 7
ACHIEVE 6
ADOPT 6
BROWN 6
CAPTION 6
CHRISTOPHER 6
DEEP 6
DEN 6
DONG 6
EFFECT 6
ENCODER 6
EXAMPLES 6
EXPERIMENTS 6
FEATURE 6
FIDELITY 6
FURTHER 6
GE 6
HAS 6
HUANG 6
IMPROVE 6
JIANG 6
KWON 6
MULTIMODAL 6
NO 6
OORD 6
PODELL 6
PRE 6
PREVIOUS 6
Q 6
RAMESH 6
RESEARCH 6
SCALABLE 6
SCORE 6
SERVING 6
SHOWS 6
SOURCE 6
SUN 6
SYNTHESIS 6
TEMPERATURE 6
THAN 6
THEIR 6
UNDERSTANDING 6
VAN 6
VANILLA 6
WEI 6
WEIGHT 6
WHERE 6
WILLIAM 6
XU 6
YI 6
ZHAO 6
ZHU 6
ADVANCED 5
AFTER 5
ALEXANDER 5
ALIGNMENT 5
ALSO 5
AR 5
ARE 5
BAI 5
BROCK 5
DAI 5
DENG 5
DESIGN 5
DETAILED 5
DEVELOPED 5
DIMENSION 5
DIT 5
E 5
EFFECT 5
ERMON 5
EVALUATING 5
FAN 5
FID↓ 5
FLOWERS 5
FOR 5
FUTURE 5
GENERATION 5
GRADIENT 5
HAVE 5
INCEPTION 5
INCREASING 5
INSTRUCTION 5
INTERNAL 5
INTO 5
IS↑ 5
JUN 5
L 5
LUO 5
ORIGINAL 5
OTHER 5
OVER 5
PENG 5
PHOTO 5
PIXELS 5
PROCESS 5
PSNR↑ 5
RECALL 5
REPORT 5
RESIZED 5
RESULTS 5
ROOM 5
SCALABILITY 5
SEBASTIAN 5
SHAZEER 5
SITTING 5
SPACE 5
SSIM↑ 5
THEY 5
TRANSFORMERS 5
USAGE 5
VALIDATION 5
VQGAN 5
WORKS 5
Z 5

docker

```bash
$ docker commit c6430640806f kaizenjapan/wc
sha256:10a0f931326a53dc72485350888d90403714c059c878c7280660f5e5a9ee3f36
$ docker push kaizenjapan/wc
Using default tag: latest
The push refers to repository [docker.io/kaizenjapan/wc]
93bb6265160c: Pushed 
afa85b6e2808: Mounted from kaizenjapan/llm 
8da92be08e15: Mounted from kaizenjapan/llm 
latest: digest: sha256:f02a9142be5b988c7aafe65bdb08eeebf5b487c34f342bab15cc60be3cef990c size: 954
```
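
To reproduce the count, the image pushed above can be pulled and run locally; a minimal sketch follows. The shell command inside the container is an assumption, since only the commit and push steps appear here.

```bash
# Pull the image pushed above and open an interactive shell in it
# (assumed usage; the entrypoint and working directory inside the image are not shown in this article).
docker pull kaizenjapan/wc
docker run -it kaizenjapan/wc /bin/bash
```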