LLM (Large Language Model) Advent Calendar 2024
Small-scale proxies for large-scale Transformer training instabilities
Mitchell Wortsman, Peter J. Liu, Lechao Xiao, Katie Everett, Alex Alemi, Ben Adlam, John D. Co-Reyes, Izzeddin Gur, Abhishek Kumar, Roman Novak, Jeffrey Pennington, Jascha Sohl-Dickstein, Kelvin Xu, Jaehoon Lee, Justin Gilmer, Simon Kornblith
This article is not yet complete. I will add more words and sentences over time.
[1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[2] Igor Babuschkin, Kate Baumli, Alison Bell, Surya Bhupatiraju, Jake Bruce, Peter Buchlovsky, David Budden, Trevor Cai, Aidan Clark, Ivo Danihelka, Antoine Dedieu, Claudio Fantacci, Jonathan Godwin, Chris Jones, Ross Hemsley, Tom Hennigan, Matteo Hessel, Shaobo Hou, Steven Kapturowski, Thomas Keck, Iurii Kemaev, Michael King, Markus Kunesch, Lena Martens, Hamza Merzic, Vladimir Mikulik, Tamara Norman, George Papamakarios, John Quan, Roman Ring, Francisco Ruiz, Alvaro Sanchez, Laurent Sartran, Rosalia Schneider, Eren Sezener, Stephen Spencer, Srivatsan Srinivasan, Miloš Stanojević, Wojciech Stokowiec, Luyu Wang, Guangyao Zhou, and Fabio Viola. The DeepMind JAX Ecosystem, 2020. URL http://github.com/deepmind.
[3] Blake Bordelon and Cengiz Pehlevan. Dynamics of finite width kernel and prediction fluctuations in mean field neural networks. arXiv preprint arXiv:2304.03408, 2023.
[4] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.
[5] X. Chen, S. Xie, and K. He. An empirical study of training self-supervised vision transformers. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9620–9629, Los Alamitos, CA, USA, Oct 2021. IEEE Computer Society. doi: 10.1109/ICCV48922.2021.00950. URL https://doi.ieeecomputersociety.org/10.1109/ICCV48922.2021.00950.
[6] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
[7] Jeremy M Cohen, Simran Kaur, Yuanzhi Li, J Zico Kolter, and Ameet Talwalkar. Gradient descent on neural networks typically occurs at the edge of stability. arXiv preprint arXiv:2103.00065, 2021.
[8] Jeremy M Cohen, Behrooz Ghorbani, Shankar Krishnan, Naman Agarwal, Sourabh Medapati, Michal Badura, Daniel Suo, David Cardoze, Zachary Nado, George E Dahl, et al. Adaptive gradient methods at the edge of stability. arXiv preprint arXiv:2207.14484, 2022.
[9] Alex Damian, Eshaan Nichani, and Jason D Lee. Self-stabilization: The implicit bias of gradient descent at the edge of stability. arXiv preprint arXiv:2209.15594, 2022.
[10] Aaron Defazio and Konstantin Mishchenko. Learning-rate-free learning by d-adaptation. arXiv preprint arXiv:2301.07733, 2023.
[11] Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al. Scaling vision transformers to 22 billion parameters. arXiv preprint arXiv:2302.05442, 2023.
[12] Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. LLM.int8(): 8-bit matrix multiplication for transformers at scale. arXiv preprint arXiv:2208.07339, 2022.
[13] Emily Dinan, Sho Yaida, and Susan Zhang. Effective theory of transformers at initialization. arXiv preprint arXiv:2304.02034, 2023.
[14] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), 2021. https://arxiv.org/abs/2010.11929.
[15] Colin Gaffney, Dinghua Li, Ruoxin Sang, Ayush Jain, and Haitang Hu. Orbax, 2023. URL http://github.com/google/orbax.
[16] Justin Gilmer, Andrea Schioppa, and Jeremy Cohen. Intriguing properties of transformer training instabilities. To appear.
[17] Justin Gilmer, Behrooz Ghorbani, Ankush Garg, Sneha Kudugunta, Behnam Neyshabur, David Cardoze, George Dahl, Zachary Nado, and Orhan Firat. A loss curvature perspective on training instability in deep learning. arXiv preprint arXiv:2110.04369, 2021.
[18] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256. JMLR Workshop and Conference Proceedings, 2010.
[19] Google. Grain - feeding jax models, 2023. URL http://github.com/google/grain.
[20] Jonathan Heek, Anselm Levskaya, Avital Oliver, Marvin Ritter, Bertrand Rondepierre, Andreas Steiner, and Marc van Zee. Flax: A neural network library and ecosystem for JAX, 2023. URL http://github.com/google/flax.
[21] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016.
[22] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
[23] Weizhe Hua, Zihang Dai, Hanxiao Liu, and Quoc Le. Transformer quality in linear time. In International Conference on Machine Learning, pages 9099–9117. PMLR, 2022.
[24] Maor Ivgi, Oliver Hinder, and Yair Carmon. Dog is sgd’s best friend: A parameter-free dynamic step size schedule. arXiv preprint arXiv:2302.12022, 2023.
[25] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2018. https://arxiv.org/abs/1806.07572.
[26] Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th annual international symposium on computer architecture, pages 1–12, 2017.
[27] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
[28] Taku Kudo and John Richardson. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226, 2018.
[29] Jaehoon Lee. A random walk model of transformer parameter growth, 2023.
[30] Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. On the variance of the adaptive learning rate and beyond. arXiv preprint arXiv:1908.03265, 2019.
[31] Peter J. Liu*, Mohammad Saleh*, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. Generating wikipedia by summarizing long sequences. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=Hyg0vbWC-.
[32] Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations (ICLR), 2016. https://arxiv.org/abs/1608.03983.
[33] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR), 2019. https://openreview.net/forum?id=Bkg6RiCqY7.
[34] William Merrill, Vivek Ramanujan, Yoav Goldberg, Roy Schwartz, and Noah Smith. Effects of parameter norm growth during transformer training: Inductive bias from gradient descent. arXiv preprint arXiv:2010.09697, 2020.
[35] Igor Molybog, Peter Albert, Moya Chen, Zachary DeVito, David Esiobu, Naman Goyal, Punit Singh Koura, Sharan Narang, Andrew Poulton, Ruan Silva, et al. A theory on adam instability in large-scale machine learning. arXiv preprint arXiv:2304.09871, 2023.
[36] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems (NeurIPS), 2019. https://arxiv.org/abs/1912.01703.
[37] Ofir Press and Lior Wolf. Using the output embedding to improve language models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 157–163, Valencia, Spain, April 2017. Association for Computational Linguistics. URL https://aclanthology.org/E17-2025.
[38] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language Models are Unsupervised Multitask Learners, 2019. https://openai.com/blog/better-language-models/.
[39] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 2020. http://jmlr.org/papers/v21/20-074.html.
[40] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020. URL http://jmlr.org/papers/v21/20-074.html.
[41] Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, and Yuxiong He. ZeRO-Offload: Democratizing Billion-Scale model training. In 2021 USENIX Annual Technical Conference (USENIX ATC 21), pages 551–564, 2021.
[42] Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost. In International Conference on Machine Learning, pages 4596–4604. PMLR, 2018.
[43] Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864, 2021.
[44] Vimal Thilak, Etai Littwin, Shuangfei Zhai, Omid Saremi, Roni Paiss, and Joshua Susskind. The slingshot mechanism: An empirical study of adaptive optimizers and the grokking phenomenon. arXiv preprint arXiv:2206.04817, 2022.
[45] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
[46] Mitchell Wortsman, Jaehoon Lee, Justin Gilmer, and Simon Kornblith. Replacing softmax with relu in vision transformers.
[47] Mitchell Wortsman, Tim Dettmers, Luke Zettlemoyer, Ari Morcos, Ali Farhadi, and Ludwig Schmidt. Stable and low-precision training for large-scale vision-language models. arXiv preprint arXiv:2304.13013, 2023.
[48] Sho Yaida. Meta-principled family of hyperparameter scaling strategies. arXiv preprint arXiv:2210.04909, 2022.
[49] Greg Yang and Edward J Hu. Tensor programs iv: Feature learning in infinite-width neural networks. In International Conference on Machine Learning, pages 11727–11737. PMLR, 2021.
[50] Greg Yang, Edward J Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao. Tensor programs v: Tuning large neural networks via zero-shot hyperparameter transfer. arXiv preprint arXiv:2203.03466, 2022.
[51] Shuangfei Zhai, Tatiana Likhomanenko, Etai Littwin, Dan Busbridge, Jason Ramapuram, Yizhe Zhang, Jiatao Gu, and Josh Susskind. Stabilizing transformer training by preventing attention entropy collapse. arXiv preprint arXiv:2303.06296, 2023.
[52] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. arXiv preprint arXiv:2303.15343, 2023.
[53] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
term list
no. | term | count |
1 | the | 471 |
2 | e | 231 |
3 | and | 225 |
4 | of | 204 |
5 | in | 180 |
6 | to | 151 |
7 | we | 142 |
8 | lr | 129 |
9 | learning | 117 |
10 | for | 115 |
11 | a | 110 |
12 | loss | 106 |
13 | is | 101 |
14 | n | 99 |
15 | rate | 88 |
16 | this | 88 |
17 | with | 83 |
18 | figure | 75 |
19 | that | 68 |
20 | at | 67 |
21 | sensitivity | 67 |
22 | as | 64 |
23 | scaling | 64 |
24 | model | 61 |
25 | arxiv | 59 |
26 | on | 59 |
27 | when | 57 |
28 | attention | 55 |
29 | by | 54 |
30 | layernorm | 53 |
31 | training | 48 |
32 | rms | 47 |
33 | scale | 45 |
34 | instability | 43 |
35 | models | 43 |
36 | al | 42 |
37 | et | 41 |
38 | qk | 40 |
39 | are | 39 |
40 | which | 39 |
41 | not | 38 |
42 | logit | 37 |
43 | section | 37 |
44 | decay | 36 |
45 | use | 35 |
46 | final | 34 |
47 | parameters | 32 |
48 | growth | 31 |
49 | param | 31 |
50 | from | 30 |
51 | layers | 30 |
52 | transformer | 30 |
53 | large | 29 |
54 | gradient | 28 |
55 | number | 28 |
56 | an | 27 |
57 | preprint | 27 |
58 | weight | 25 |
59 | eval | 24 |
60 | max | 24 |
61 | step | 24 |
62 | without | 24 |
63 | instabilities | 23 |
64 | z | 23 |
65 | be | 22 |
66 | default | 22 |
67 | small | 22 |
68 | output | 21 |
69 | effect | 20 |
70 | it | 20 |
71 | layer | 20 |
72 | or | 20 |
73 | our | 20 |
74 | have | 19 |
75 | logits | 19 |
76 | test | 19 |
77 | up | 19 |
78 | vs | 19 |
79 | appendix | 18 |
80 | dim | 18 |
81 | rates | 18 |
82 | where | 18 |
83 | across | 17 |
84 | also | 17 |
85 | head | 17 |
86 | num | 17 |
87 | size | 17 |
88 | b | 16 |
89 | can | 16 |
90 | grad | 16 |
91 | neural | 16 |
92 | parameter | 16 |
93 | width | 16 |
94 | adamw | 15 |
95 | depth | 15 |
96 | divergence | 15 |
97 | first | 15 |
98 | has | 15 |
99 | more | 15 |
100 | params | 15 |
101 | scales | 15 |
102 | steps | 15 |
103 | update | 15 |
104 | x | 15 |
105 | curves | 14 |
106 | does | 14 |
107 | experiment | 14 |
108 | no | 14 |
109 | other | 14 |
110 | stability | 14 |
111 | transformers | 14 |
112 | block | 13 |
113 | dh | 13 |
114 | optimal | 13 |
115 | p | 13 |
116 | two | 13 |
117 | useful | 13 |
118 | warm | 13 |
119 | batch | 12 |
120 | change | 12 |
121 | conference | 12 |
122 | high | 12 |
123 | i | 12 |
124 | interventions | 12 |
125 | j | 12 |
126 | largest | 12 |
127 | m | 12 |
128 | optimizer | 12 |
129 | refer | 12 |
130 | characteristics | 11 |
131 | fan | 11 |
132 | h | 11 |
133 | hyperparameter | 11 |
134 | illustrated | 11 |
135 | independent | 11 |
136 | liu | 11 |
137 | magnitude | 11 |
138 | mean | 11 |
139 | root | 11 |
140 | self | 11 |
141 | simple | 11 |
142 | will | 11 |
143 | work | 11 |
144 | adaptive | 10 |
145 | additional | 10 |
146 | before | 10 |
147 | changes | 10 |
148 | d | 10 |
149 | different | 10 |
150 | increases | 10 |
151 | international | 10 |
152 | language | 10 |
153 | occurs | 10 |
154 | results | 10 |
155 | spikes | 10 |
156 | study | 10 |
157 | their | 10 |
158 | there | 10 |
159 | these | 10 |
160 | trends | 10 |
161 | using | 10 |
162 | behavior | 9 |
163 | dimension | 9 |
164 | do | 9 |
165 | https | 9 |
166 | if | 9 |
167 | increasing | 9 |
168 | information | 9 |
169 | larger | 9 |
170 | mlp | 9 |
171 | norm | 9 |
172 | predict | 9 |
173 | so | 9 |
174 | square | 9 |
175 | url | 9 |
176 | value | 9 |
177 | was | 9 |
178 | while | 9 |
179 | yang | 9 |
180 | adam | 8 |
181 | between | 8 |
182 | but | 8 |
183 | data | 8 |
184 | dehghani | 8 |
185 | examine | 8 |
186 | experiments | 8 |
187 | gilmer | 8 |
188 | however | 8 |
189 | known | 8 |
190 | lee | 8 |
191 | networks | 8 |
192 | observed | 8 |
193 | only | 8 |
194 | org | 8 |
195 | pages | 8 |
196 | peter | 8 |
197 | range | 8 |
198 | such | 8 |
199 | then | 8 |
200 | via | 8 |
201 | warmup | 8 |
202 | because | 7 |
203 | both | 7 |
204 | c | 7 |
205 | cases | 7 |
206 | cohen | 7 |
207 | david | 7 |
208 | deviation | 7 |
209 | edge | 7 |
210 | find | 7 |
211 | flax | 7 |
212 | g | 7 |
213 | | 7 |
214 | how | 7 |
215 | http | 7 |
216 | increase | 7 |
217 | infrastructure | 7 |
218 | instead | 7 |
219 | jax | 7 |
220 | justin | 7 |
221 | let | 7 |
222 | linear | 7 |
223 | measuring | 7 |
224 | mitchell | 7 |
225 | mitigation | 7 |
226 | need | 7 |
227 | network | 7 |
228 | orders | 7 |
229 | paper | 7 |
230 | per | 7 |
231 | previously | 7 |
232 | reduces | 7 |
233 | related | 7 |
234 | reported | 7 |
235 | than | 7 |
236 | three | 7 |
237 | trained | 7 |
238 | v | 7 |
239 | values | 7 |
240 | weights | 7 |
241 | whether | 7 |
242 | zhai | 7 |
243 | base | 6 |
244 | believe | 6 |
245 | changing | 6 |
246 | chen | 6 |
247 | com | 6 |
248 | contributed | 6 |
249 | denote | 6 |
250 | descent | 6 |
251 | described | 6 |
252 | diverged | 6 |
253 | during | 6 |
254 | embedding | 6 |
255 | hutter | 6 |
256 | ii | 6 |
257 | instance | 6 |
258 | issue | 6 |
259 | li | 6 |
260 | log | 6 |
261 | loshchilov | 6 |
262 | machine | 6 |
263 | meaning | 6 |
264 | new | 6 |
265 | norms | 6 |
266 | now | 6 |
267 | over | 6 |
268 | predicted | 6 |
269 | prediction | 6 |
270 | previous | 6 |
271 | process | 6 |
272 | progressive | 6 |
273 | question | 6 |
274 | relationship | 6 |
275 | required | 6 |
276 | research | 6 |
277 | result | 6 |
278 | set | 6 |
279 | sharpening | 6 |
280 | shazeer | 6 |
281 | shown | 6 |
282 | similar | 6 |
283 | softmax | 6 |
284 | stabilization | 6 |
285 | standard | 6 |
286 | text | 6 |
287 | they | 6 |
288 | too | 6 |
289 | train | 6 |
290 | zero | 6 |
291 | become | 5 |
292 | becomes | 5 |
293 | been | 5 |
294 | best | 5 |
295 | chowdhery | 5 |
296 | coefficient | 5 |
297 | collapse | 5 |
298 | compute | 5 |
299 | curvature | 5 |
300 | details | 5 |
301 | ei | 5 |
302 | empirical | 5 |
303 | experimental | 5 |
304 | factor | 5 |
305 | figures | 5 |
306 | full | 5 |
307 | george | 5 |
308 | github | 5 |
309 | independently | 5 |
310 | initialization | 5 |
311 | intervention | 5 |
312 | jaehoon | 5 |
313 | key | 5 |
314 | keys | 5 |
315 | longer | 5 |
316 | lower | 5 |
317 | may | 5 |
318 | measure | 5 |
319 | metric | 5 |
320 | noam | 5 |
321 | performance | 5 |
322 | plot | 5 |
323 | probabilities | 5 |
324 | processing | 5 |
325 | proposed | 5 |
326 | query | 5 |
327 | recommended | 5 |
328 | s | 5 |
329 | schedule | 5 |
330 | sequence | 5 |
331 | shrink | 5 |
332 | smaller | 5 |
333 | stable | 5 |
334 | studied | 5 |
335 | technical | 5 |
336 | unscaled | 5 |
337 | used | 5 |
338 | variant | 5 |
339 | vision | 5 |
340 | widthscaling | 5 |
341 | writing | 5 |
342 | zhang | 5 |
343 | abs | 4 |
344 | alex | 4 |
345 | all | 4 |
346 | analysis | 4 |
347 | appear | 4 |
348 | based | 4 |
349 | better | 4 |
350 | case | 4 |
351 | constant | 4 |
352 | deepmind | 4 |
353 | discussed | 4 |
354 | due | 4 |
355 | each | 4 |
356 | effective | 4 |
357 | employed | 4 |
358 | examining | 4 |
359 | example | 4 |
360 | fast | 4 |
361 | finally | 4 |
362 | fit | 4 |
363 | focus | 4 |
364 | function | 4 |
365 | hu | 4 |
366 | hyperparameters | 4 |
367 | iii | 4 |
368 | image | 4 |
369 | important | 4 |
370 | initialize | 4 |
371 | input | 4 |
372 | james | 4 |
373 | jeffrey | 4 |
374 | john | 4 |
375 | led | 4 |
376 | length | 4 |
377 | methods | 4 |
378 | minimum | 4 |
379 | moreover | 4 |
380 | narang | 4 |
381 | one | 4 |
382 | practice | 4 |
383 | pre | 4 |
384 | predicting | 4 |
385 | primarily | 4 |
386 | proceedings | 4 |
387 | project | 4 |
388 | queries | 4 |
389 | regime | 4 |
390 | representations | 4 |
391 | reproduce | 4 |
392 | same | 4 |
393 | scaled | 4 |
394 | see | 4 |
395 | shape | 4 |
396 | sharan | 4 |
397 | shift | 4 |
398 | show | 4 |
399 | simon | 4 |
400 | sizes | 4 |
401 | stabilizing | 4 |
402 | techniques | 4 |
403 | throughout | 4 |
404 | tokens | 4 |
405 | trueqk | 4 |
406 | unstable | 4 |
407 | validation | 4 |
408 | wortsman | 4 |
409 | would | 4 |
410 | xj | 4 |
411 | zk | 4 |
412 | above | 3 |
413 | access | 3 |
414 | activation | 3 |
415 | add | 3 |
416 | advances | 3 |
417 | affect | 3 |
418 | aidan | 3 |
419 | aims | 3 |
420 | alleviate | 3 |
421 | although | 3 |
422 | appears | 3 |
423 | applying | 3 |
424 | around | 3 |
425 | averaged | 3 |
426 | ben | 3 |
427 | bias | 3 |
428 | billion | 3 |
429 | causes | 3 |
430 | colin | 3 |
431 | common | 3 |
432 | computer | 3 |
433 | computing | 3 |
434 | conducted | 3 |
435 | connections | 3 |
436 | consistent | 3 |
437 | contains | 3 |
438 | contrast | 3 |
439 | cosine | 3 |
440 | could | 3 |
441 | dahl | 3 |
442 | decrease | 3 |
443 | decreasing | 3 |
444 | deep | 3 |
445 | depthdim | 3 |
446 | dettmers | 3 |
447 | direction | 3 |
448 | discussion | 3 |
449 | dynamics | 3 |
450 | element | 3 |
451 | end | 3 |
452 | exceeds | 3 |
453 | experimentation | 3 |
454 | far | 3 |
455 | faster | 3 |
456 | feature | 3 |
457 | features | 3 |
458 | following | 3 |
459 | framing | 3 |
460 | free | 3 |
461 | gradually | 3 |
462 | grain | 3 |
463 | hand | 3 |
464 | he | 3 |
465 | here | 3 |
466 | iccv | 3 |
467 | iclr | 3 |
468 | igor | 3 |
469 | ilya | 3 |
470 | improve | 3 |
471 | improvement | 3 |
472 | indicate | 3 |
473 | intermediate | 3 |
474 | into | 3 |
475 | inverse | 3 |
476 | investigate | 3 |
477 | investigation | 3 |
478 | issues | 3 |
479 | its | 3 |
480 | jeremy | 3 |
481 | jmlr | 3 |
482 | jonathan | 3 |
483 | k | 3 |
484 | kernel | 3 |
485 | kornblith | 3 |
486 | last | 3 |
487 | lead | 3 |
488 | left | 3 |
489 | long | 3 |
490 | lrs | 3 |
491 | maximum | 3 |
492 | meaningfully | 3 |
493 | measures | 3 |
494 | michael | 3 |
495 | most | 3 |
496 | naman | 3 |
497 | next | 3 |
498 | nm | 3 |
499 | noah | 3 |
500 | normalization | 3 |
501 | observation | 3 |
502 | observe | 3 |
503 | optimizers | 3 |
504 | orbax | 3 |
505 | papers | 3 |
506 | particular | 3 |
507 | phenomenon | 3 |
508 | plots | 3 |
509 | pmlr | 3 |
510 | points | 3 |
511 | pools | 3 |
512 | presents | 3 |
513 | programs | 3 |
514 | projection | 3 |
515 | provided | 3 |
516 | pytorch | 3 |
517 | quadratic | 3 |
518 | raffel | 3 |
519 | recommend | 3 |
520 | reduce | 3 |
521 | reducing | 3 |
522 | regularization | 3 |
523 | remainder | 3 |
524 | remains | 3 |
525 | reproduced | 3 |
526 | researchers | 3 |
527 | resource | 3 |
528 | respectively | 3 |
529 | right | 3 |
530 | roberts | 3 |
531 | roman | 3 |
532 | run | 3 |
533 | sequences | 3 |
534 | shifting | 3 |
535 | slightly | 3 |
536 | slow | 3 |
537 | st | 3 |
538 | studying | 3 |
539 | substantially | 3 |
540 | successful | 3 |
541 | summarize | 3 |
542 | systems | 3 |
543 | tends | 3 |
544 | tensor | 3 |
545 | therefore | 3 |
546 | though | 3 |
547 | threshold | 3 |
548 | tom | 3 |
549 | top | 3 |
550 | total | 3 |
551 | transfer | 3 |
552 | trevor | 3 |
553 | typically | 3 |
554 | u | 3 |
555 | understanding | 3 |
556 | variation | 3 |
557 | vaswani | 3 |
558 | w | 3 |
559 | xw | 3 |
560 | y | 3 |
561 | zachary | 3 |
562 | zhou | 3 |
563 | zj | 3 |
564 | aaron | 2 |
565 | abhishek | 2 |
566 | achieve | 2 |
567 | adafactor | 2 |
568 | adlam | 2 |
569 | advice | 2 |
570 | aeos | 2 |
571 | alec | 2 |
572 | alemi | 2 |
573 | alexander | 2 |
574 | allows | 2 |
575 | amodei | 2 |
576 | amount | 2 |
577 | andreas | 2 |
578 | annual | 2 |
579 | another | 2 |
580 | answer | 2 |
581 | any | 2 |
582 | applies | 2 |
583 | architecture | 2 |
584 | areas | 2 |
585 | arthur | 2 |
586 | association | 2 |
587 | auxiliary | 2 |
588 | babuschkin | 2 |
589 | baseline | 2 |
590 | basil | 2 |
591 | behrooz | 2 |
592 | below | 2 |
593 | beyer | 2 |
594 | biases | 2 |
595 | big | 2 |
596 | blocks | 2 |
597 | bottom | 2 |
598 | bradbury | 2 |
599 | cai | 2 |
600 | cause | 2 |
601 | characterized | 2 |
602 | child | 2 |
603 | chris | 2 |
604 | clark | 2 |
605 | clear | 2 |
606 | closely | 2 |
607 | co | 2 |
608 | combine | 2 |
609 | comparing | 2 |
610 | computational | 2 |
611 | conclude | 2 |
612 | confirm | 2 |
613 | consider | 2 |
614 | consistently | 2 |
615 | context | 2 |
616 | corresponding | 2 |
617 | cossim | 2 |
618 | currently | 2 |
619 | curve | 2 |
620 | damian | 2 |
621 | dan | 2 |
622 | daniel | 2 |
623 | dario | 2 |
624 | decoupled | 2 |
625 | decreases | 2 |
626 | defined | 2 |
627 | demonstrating | 2 |
628 | depends | 2 |
629 | detailed | 2 |
630 | dickstein | 2 |
631 | displays | 2 |
632 | diverge | 2 |
633 | documented | 2 |
634 | doi | 2 |
635 | ecosystem | 2 |
636 | edward | 2 |
637 | emerge | 2 |
638 | emerges | 2 |
639 | enable | 2 |
640 | enables | 2 |
641 | entropy | 2 |
642 | eos | 2 |
643 | eps | 2 |
644 | epsilon | 2 |
645 | equally | 2 |
646 | etai | 2 |
647 | everett | 2 |
648 | evidence | 2 |
649 | examines | 2 |
650 | exhibit | 2 |
651 | explain | 2 |
652 | explanation | 2 |
653 | exploring | 2 |
654 | extrapolating | 2 |
655 | fails | 2 |
656 | falsen | 2 |
657 | finding | 2 |
658 | fixed | 2 |
659 | focuses | 2 |
660 | followed | 2 |
661 | forum | 2 |
662 | found | 2 |
663 | francisco | 2 |
664 | frank | 2 |
665 | further | 2 |
666 | gao | 2 |
667 | gaurav | 2 |
668 | ghorbani | 2 |
669 | goyal | 2 |
670 | gradients | 2 |
671 | greg | 2 |
672 | grow | 2 |
673 | gur | 2 |
674 | had | 2 |
675 | heads | 2 |
676 | heek | 2 |
677 | helpful | 2 |
678 | highlight | 2 |
679 | html | 2 |
680 | id | 2 |
681 | identify | 2 |
682 | ieee | 2 |
683 | impact | 2 |
684 | implementations | 2 |
685 | improves | 2 |
686 | index | 2 |
687 | individually | 2 |
688 | initial | 2 |
689 | insight | 2 |
690 | insights | 2 |
691 | interaction | 2 |
692 | interesting | 2 |
693 | invariant | 2 |
694 | investigations | 2 |
695 | isolation | 2 |
696 | iv | 2 |
697 | ivgi | 2 |
698 | izzeddin | 2 |
699 | jake | 2 |
700 | jakob | 2 |
701 | jascha | 2 |
702 | jason | 2 |
703 | jianfeng | 2 |
704 | joint | 2 |
705 | jointly | 2 |
706 | jones | 2 |
707 | journal | 2 |
708 | just | 2 |
709 | kaiser | 2 |
710 | kaplan | 2 |
711 | katherine | 2 |
712 | katie | 2 |
713 | kelvin | 2 |
714 | kevin | 2 |
715 | kolesnikov | 2 |
716 | kumar | 2 |
717 | kxk | 2 |
718 | latter | 2 |
719 | leads | 2 |
720 | lechao | 2 |
721 | less | 2 |
722 | library | 2 |
723 | limits | 2 |
724 | lin | 2 |
725 | linguistics | 2 |
726 | littwin | 2 |
727 | loading | 2 |
728 | losseps | 2 |
729 | lossweight | 2 |
730 | low | 2 |
731 | lucas | 2 |
732 | lukasz | 2 |
733 | luke | 2 |
734 | main | 2 |
735 | mainly | 2 |
736 | many | 2 |
737 | matena | 2 |
738 | matrix | 2 |
739 | meaningful | 2 |
740 | measurable | 2 |
741 | mechanism | 2 |
742 | merrill | 2 |
743 | method | 2 |
744 | middle | 2 |
745 | min | 2 |
746 | mitigates | 2 |
747 | mitigations | 2 |
748 | modification | 2 |
749 | mostafa | 2 |
750 | moya | 2 |
751 | much | 2 |
752 | multiple | 2 |
753 | muparam | 2 |
754 | mustafa | 2 |
755 | nado | 2 |
756 | nanodo | 2 |
757 | net | 2 |
758 | neurips | 2 |
759 | norman | 2 |
760 | note | 2 |
761 | novak | 2 |
762 | obtained | 2 |
763 | occur | 2 |
764 | oct | 2 |
765 | offers | 2 |
766 | often | 2 |
767 | oliver | 2 |
768 | openreview | 2 |
769 | opportunities | 2 |
770 | optax | 2 |
771 | order | 2 |
772 | otherwise | 2 |
773 | overall | 2 |
774 | packed | 2 |
775 | padding | 2 |
776 | parameterizations | 2 |
777 | parameterizing | 2 |
778 | paszke | 2 |
779 | pennington | 2 |
780 | performs | 2 |
781 | periodic | 2 |
782 | perspective | 2 |
783 | pi | 2 |
784 | play | 2 |
785 | pointwise | 2 |
786 | possible | 2 |
787 | precision | 2 |
788 | preconditioned | 2 |
789 | predicts | 2 |
790 | produces | 2 |
791 | provides | 2 |
792 | qi | 2 |
793 | qklayernorm | 2 |
794 | quadratically | 2 |
795 | radford | 2 |
796 | raises | 2 |
797 | random | 2 |
798 | recall | 2 |
799 | received | 2 |
800 | regardless | 2 |
801 | reliable | 2 |
802 | relu | 2 |
803 | repeats | 2 |
804 | report | 2 |
805 | reporting | 2 |
806 | residual | 2 |
807 | resolve | 2 |
808 | resolves | 2 |
809 | resources | 2 |
810 | rewon | 2 |
811 | reyes | 2 |
812 | rmsn | 2 |
813 | role | 2 |
814 | rotary | 2 |
815 | row | 2 |
816 | roy | 2 |
817 | runs | 2 |
818 | ryan | 2 |
819 | sam | 2 |
820 | scientific | 2 |
821 | sebastian | 2 |
822 | sections | 2 |
823 | sensitivityscaling | 2 |
824 | sensitivitystandardmuparam | 2 |
825 | sentencepiece | 2 |
826 | setting | 2 |
827 | sgd | 2 |
828 | sharpness | 2 |
829 | sho | 2 |
830 | shows | 2 |
831 | shuangfei | 2 |
832 | singh | 2 |
833 | smallest | 2 |
834 | sohl | 2 |
835 | some | 2 |
836 | sometimes | 2 |
837 | sources | 2 |
838 | specify | 2 |
839 | standardmuparam | 2 |
840 | state | 2 |
841 | steiner | 2 |
842 | stephen | 2 |
843 | stern | 2 |
844 | still | 2 |
845 | succeeds | 2 |
846 | summary | 2 |
847 | support | 2 |
848 | susan | 2 |
849 | susskind | 2 |
850 | sweep | 2 |
851 | th | 2 |
852 | thank | 2 |
853 | theory | 2 |
854 | thilak | 2 |
855 | thomas | 2 |
856 | through | 2 |
857 | tim | 2 |
858 | tokenizer | 2 |
859 | tool | 2 |
860 | towards | 2 |
861 | tpus | 2 |
862 | unified | 2 |
863 | unit | 2 |
864 | unless | 2 |
865 | usenix | 2 |
866 | uszkoreit | 2 |
867 | varying | 2 |
868 | very | 2 |
869 | ways | 2 |
870 | wei | 2 |
871 | weizhu | 2 |
872 | were | 2 |
873 | works | 2 |
874 | wu | 2 |
875 | xi | 2 |
876 | xiao | 2 |
877 | xiaodong | 2 |
878 | xiaohua | 2 |
879 | xu | 2 |
880 | yaida | 2 |
881 | yanqi | 2 |
882 | zeros | 2 |
883 | zettlemoyer | 2 |
884 | TRUE | 2 |
Total | 1,882 | 9,477 |
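
A term list like the one above can be approximated with a short script. The following is only a minimal sketch and not necessarily how this table was produced: it assumes a term is a lowercased run of ASCII letters, and that rows are ordered by descending count with ties broken alphabetically, which is what the table appears to do. The function name `term_counts` and the sample sentence are hypothetical.

```python
import re
from collections import Counter


def term_counts(text):
    """Return (term, count) pairs, most frequent first, ties alphabetical."""
    # Assumption: a term is a maximal run of ASCII letters, lowercased.
    # Digits, punctuation, and hyphens act as separators.
    terms = re.findall(r"[a-z]+", text.lower())
    counts = Counter(terms)
    # Sort by descending count, then by term, to mimic the table's apparent order.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical sample text, used only to illustrate the output format.
sample = "The learning rate and the loss: learning-rate sensitivity."
rows = term_counts(sample)

print("no. | term | count |")
for no, (term, count) in enumerate(rows, start=1):
    print(f"{no} | {term} | {count} |")
```

Splitting on every non-letter character explains why single letters such as "e" and "n" and fragments such as "et" and "al" rank high in a list of this kind: they come from math symbols, scientific notation, and "et al." in the references.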