Small-scale proxies for large-scale Transformer training instabilities

Tags: LLM, AI

Posted at 2024-11-10

LLM (Large Language Model) Advent Calendar 2024
https://qiita.com/advent-calendar/2024/llm
This is the article planned for Day 2 of the calendar.

Small-scale proxies for large-scale Transformer training instabilities
Mitchell Wortsman, Peter J. Liu, Lechao Xiao, Katie Everett, Alex Alemi, Ben Adlam, John D. Co-Reyes, Izzeddin Gur, Abhishek Kumar, Roman Novak, Jeffrey Pennington, Jascha Sohl-dickstein, Kelvin Xu, Jaehoon Lee, Justin Gilmer, Simon Kornblith
https://arxiv.org/abs/2309.14322v2

<This article is a work in progress. I will add more content over time.>

References

[1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[2] Igor Babuschkin, Kate Baumli, Alison Bell, Surya Bhupatiraju, Jake Bruce, Peter Buchlovsky, David Budden, Trevor Cai, Aidan Clark, Ivo Danihelka, Antoine Dedieu, Claudio Fantacci, Jonathan Godwin, Chris Jones, Ross Hemsley, Tom Hennigan, Matteo Hessel, Shaobo Hou, Steven Kapturowski, Thomas Keck, Iurii Kemaev, Michael King, Markus Kunesch, Lena Martens, Hamza Merzic, Vladimir Mikulik, Tamara Norman, George Papamakarios, John Quan, Roman Ring, Francisco Ruiz, Alvaro Sanchez, Laurent Sartran, Rosalia Schneider, Eren Sezener, Stephen Spencer, Srivatsan Srinivasan, Miloš Stanojević, Wojciech Stokowiec, Luyu Wang, Guangyao Zhou, and Fabio Viola. The DeepMind JAX Ecosystem, 2020. URL http://github.com/deepmind.
[3] Blake Bordelon and Cengiz Pehlevan. Dynamics of finite width kernel and prediction fluctuations in mean field neural networks. arXiv preprint arXiv:2304.03408, 2023.
[4] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.
[5] X. Chen, S. Xie, and K. He. An empirical study of training self-supervised vision transformers. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9620–9629, Los Alamitos, CA, USA, oct 2021. IEEE Computer Society. doi: 10.1109/ICCV48922.2021.00950. URL https://doi.ieeecomputersociety.org/10.1109/ICCV48922.2021.00950.
[6] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
[7] Jeremy M Cohen, Simran Kaur, Yuanzhi Li, J Zico Kolter, and Ameet Talwalkar. Gradient descent on neural networks typically occurs at the edge of stability. arXiv preprint arXiv:2103.00065, 2021.
[8] Jeremy M Cohen, Behrooz Ghorbani, Shankar Krishnan, Naman Agarwal, Sourabh Medapati, Michal Badura, Daniel Suo, David Cardoze, Zachary Nado, George E Dahl, et al. Adaptive gradient methods at the edge of stability. arXiv preprint arXiv:2207.14484, 2022.
[9] Alex Damian, Eshaan Nichani, and Jason D Lee. Self-stabilization: The implicit bias of gradient descent at the edge of stability. arXiv preprint arXiv:2209.15594, 2022.
[10] Aaron Defazio and Konstantin Mishchenko. Learning-rate-free learning by d-adaptation. arXiv preprint arXiv:2301.07733, 2023.
[11] Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al. Scaling vision transformers to 22 billion parameters. arXiv preprint arXiv:2302.05442, 2023.
[12] Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. LLM.int8(): 8-bit matrix multiplication for transformers at scale. arXiv preprint arXiv:2208.07339, 2022.
[13] Emily Dinan, Sho Yaida, and Susan Zhang. Effective theory of transformers at initialization. arXiv preprint arXiv:2304.02034, 2023.
[14] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), 2021. https://arxiv.org/abs/2010.11929.
[15] Colin Gaffney, Dinghua Li, Ruoxin Sang, Ayush Jain, and Haitang Hu. Orbax, 2023. URL http://github.com/google/orbax.
[16] Justin Gilmer, Andrea Schioppa, and Jeremy Cohen. Intriguing properties of transformer training instabilities. To appear.
[17] Justin Gilmer, Behrooz Ghorbani, Ankush Garg, Sneha Kudugunta, Behnam Neyshabur, David Cardoze, George Dahl, Zachary Nado, and Orhan Firat. A loss curvature perspective on training instability in deep learning. arXiv preprint arXiv:2110.04369, 2021.
[18] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256. JMLR Workshop and Conference Proceedings, 2010.
[19] Google. Grain - feeding jax models, 2023. URL http://github.com/google/grain.

[20] Jonathan Heek, Anselm Levskaya, Avital Oliver, Marvin Ritter, Bertrand Rondepierre, Andreas Steiner, and Marc van Zee. Flax: A neural network library and ecosystem for JAX, 2023. URL http://github.com/google/flax.
[21] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016.
[22] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
[23] Weizhe Hua, Zihang Dai, Hanxiao Liu, and Quoc Le. Transformer quality in linear time. In International Conference on Machine Learning, pages 9099–9117. PMLR, 2022.
[24] Maor Ivgi, Oliver Hinder, and Yair Carmon. DoG is SGD's best friend: A parameter-free dynamic step size schedule. arXiv preprint arXiv:2302.12022, 2023.
[25] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2018. https://arxiv.org/abs/1806.07572.
[26] Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th annual international symposium on computer architecture, pages 1–12, 2017.
[27] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
[28] Taku Kudo and John Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226, 2018.
[29] Jaehoon Lee. A random walk model of transformer parameter growth, 2023.
[30] Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. On the variance of the adaptive learning rate and beyond. arXiv preprint arXiv:1908.03265, 2019.
[31] Peter J. Liu*, Mohammad Saleh*, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. Generating wikipedia by summarizing long sequences. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=Hyg0vbWC-.
[32] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations (ICLR), 2016. https://arxiv.org/abs/1608.03983.
[33] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR), 2019. https://openreview.net/forum?id=Bkg6RiCqY7.
[34] William Merrill, Vivek Ramanujan, Yoav Goldberg, Roy Schwartz, and Noah Smith. Effects of parameter norm growth during transformer training: Inductive bias from gradient descent. arXiv preprint arXiv:2010.09697, 2020.
[35] Igor Molybog, Peter Albert, Moya Chen, Zachary DeVito, David Esiobu, Naman Goyal, Punit Singh Koura, Sharan Narang, Andrew Poulton, Ruan Silva, et al. A theory on adam instability in large-scale machine learning. arXiv preprint arXiv:2304.09871, 2023.
[36] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems (NeurIPS), 2019. https://arxiv.org/abs/1912.01703.
[37] Ofir Press and Lior Wolf. Using the output embedding to improve language models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 157–163, Valencia, Spain, April 2017. Association for Computational Linguistics. URL https://aclanthology.org/E17-2025.
[38] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language Models are Unsupervised Multitask Learners, 2019. https://openai.com/blog/better-language-models/.
[39] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 2020. http://jmlr.org/papers/v21/20-074.html.
[40] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020. URL http://jmlr.org/papers/v21/20-074.html.
[41] Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, and Yuxiong He. ZeRO-Offload: Democratizing billion-scale model training. In 2021 USENIX Annual Technical Conference (USENIX ATC 21), pages 551–564, 2021.
[42] Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost. In International Conference on Machine Learning, pages 4596–4604. PMLR, 2018.
[43] Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864, 2021.
[44] Vimal Thilak, Etai Littwin, Shuangfei Zhai, Omid Saremi, Roni Paiss, and Joshua Susskind. The slingshot mechanism: An empirical study of adaptive optimizers and the grokking phenomenon. arXiv preprint arXiv:2206.04817, 2022.
[45] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
[46] Mitchell Wortsman, Jaehoon Lee, Justin Gilmer, and Simon Kornblith. Replacing softmax with relu in vision transformers.
[47] Mitchell Wortsman, Tim Dettmers, Luke Zettlemoyer, Ari Morcos, Ali Farhadi, and Ludwig Schmidt. Stable and low-precision training for large-scale vision-language models. arXiv preprint arXiv:2304.13013, 2023.
[48] Sho Yaida. Meta-principled family of hyperparameter scaling strategies. arXiv preprint arXiv:2210.04909, 2022.
[49] Greg Yang and Edward J Hu. Tensor programs iv: Feature learning in infinite-width neural networks. In International Conference on Machine Learning, pages 11727–11737. PMLR, 2021.
[50] Greg Yang, Edward J Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao. Tensor programs v: Tuning large neural networks via zero-shot hyperparameter transfer. arXiv preprint arXiv:2203.03466, 2022.
[51] Shuangfei Zhai, Tatiana Likhomanenko, Etai Littwin, Dan Busbridge, Jason Ramapuram, Yizhe Zhang, Jiatao Gu, and Josh Susskind. Stabilizing transformer training by preventing attention entropy collapse. arXiv preprint arXiv:2303.06296, 2023.
[52] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. arXiv preprint arXiv:2303.15343, 2023.
[53] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.

Term list (a sketch of how this count could be produced follows the table)

No. Term Count
1 the 471
2 e 231
3 and 225
4 of 204
5 in 180
6 to 151
7 we 142
8 lr 129
9 learning 117
10 for 115
11 a 110
12 loss 106
13 is 101
14 n 99
15 rate 88
16 this 88
17 with 83
18 figure 75
19 that 68
20 at 67
21 sensitivity 67
22 as 64
23 scaling 64
24 model 61
25 arxiv 59
26 on 59
27 when 57
28 attention 55
29 by 54
30 layernorm 53
31 training 48
32 rms 47
33 scale 45
34 instability 43
35 models 43
36 al 42
37 et 41
38 qk 40
39 are 39
40 which 39
41 not 38
42 logit 37
43 section 37
44 decay 36
45 use 35
46 final 34
47 parameters 32
48 growth 31
49 param 31
50 from 30
51 layers 30
52 transformer 30
53 large 29
54 gradient 28
55 number 28
56 an 27
57 preprint 27
58 weight 25
59 eval 24
60 max 24
61 step 24
62 without 24
63 instabilities 23
64 z 23
65 be 22
66 default 22
67 small 22
68 output 21
69 effect 20
70 it 20
71 layer 20
72 or 20
73 our 20
74 have 19
75 logits 19
76 test 19
77 up 19
78 vs 19
79 appendix 18
80 dim 18
81 rates 18
82 where 18
83 across 17
84 also 17
85 head 17
86 num 17
87 size 17
88 b 16
89 can 16
90 grad 16
91 neural 16
92 parameter 16
93 width 16
94 adamw 15
95 depth 15
96 divergence 15
97 first 15
98 has 15
99 more 15
100 params 15
101 scales 15
102 steps 15
103 update 15
104 x 15
105 curves 14
106 does 14
107 experiment 14
108 no 14
109 other 14
110 stability 14
111 transformers 14
112 block 13
113 dh 13
114 optimal 13
115 p 13
116 two 13
117 useful 13
118 warm 13
119 batch 12
120 change 12
121 conference 12
122 high 12
123 i 12
124 interventions 12
125 j 12
126 largest 12
127 m 12
128 optimizer 12
129 refer 12
130 characteristics 11
131 fan 11
132 h 11
133 hyperparameter 11
134 illustrated 11
135 independent 11
136 liu 11
137 magnitude 11
138 mean 11
139 root 11
140 self 11
141 simple 11
142 will 11
143 work 11
144 adaptive 10
145 additional 10
146 before 10
147 changes 10
148 d 10
149 different 10
150 increases 10
151 international 10
152 language 10
153 occurs 10
154 results 10
155 spikes 10
156 study 10
157 their 10
158 there 10
159 these 10
160 trends 10
161 using 10
162 behavior 9
163 dimension 9
164 do 9
165 https 9
166 if 9
167 increasing 9
168 information 9
169 larger 9
170 mlp 9
171 norm 9
172 predict 9
173 so 9
174 square 9
175 url 9
176 value 9
177 was 9
178 while 9
179 yang 9
180 adam 8
181 between 8
182 but 8
183 data 8
184 dehghani 8
185 examine 8
186 experiments 8
187 gilmer 8
188 however 8
189 known 8
190 lee 8
191 networks 8
192 observed 8
193 only 8
194 org 8
195 pages 8
196 peter 8
197 range 8
198 such 8
199 then 8
200 via 8
201 warmup 8
202 because 7
203 both 7
204 c 7
205 cases 7
206 cohen 7
207 david 7
208 deviation 7
209 edge 7
210 find 7
211 flax 7
212 g 7
213 google 7
214 how 7
215 http 7
216 increase 7
217 infrastructure 7
218 instead 7
219 jax 7
220 justin 7
221 let 7
222 linear 7
223 measuring 7
224 mitchell 7
225 mitigation 7
226 need 7
227 network 7
228 orders 7
229 paper 7
230 per 7
231 previously 7
232 reduces 7
233 related 7
234 reported 7
235 than 7
236 three 7
237 trained 7
238 v 7
239 values 7
240 weights 7
241 whether 7
242 zhai 7
243 base 6
244 believe 6
245 changing 6
246 chen 6
247 com 6
248 contributed 6
249 denote 6
250 descent 6
251 described 6
252 diverged 6
253 during 6
254 embedding 6
255 hutter 6
256 ii 6
257 instance 6
258 issue 6
259 li 6
260 log 6
261 loshchilov 6
262 machine 6
263 meaning 6
264 new 6
265 norms 6
266 now 6
267 over 6
268 predicted 6
269 prediction 6
270 previous 6
271 process 6
272 progressive 6
273 question 6
274 relationship 6
275 required 6
276 research 6
277 result 6
278 set 6
279 sharpening 6
280 shazeer 6
281 shown 6
282 similar 6
283 softmax 6
284 stabilization 6
285 standard 6
286 text 6
287 they 6
288 too 6
289 train 6
290 zero 6
291 become 5
292 becomes 5
293 been 5
294 best 5
295 chowdhery 5
296 coefficient 5
297 collapse 5
298 compute 5
299 curvature 5
300 details 5
301 ei 5
302 empirical 5
303 experimental 5
304 factor 5
305 figures 5
306 full 5
307 george 5
308 github 5
309 independently 5
310 initialization 5
311 intervention 5
312 jaehoon 5
313 key 5
314 keys 5
315 longer 5
316 lower 5
317 may 5
318 measure 5
319 metric 5
320 noam 5
321 performance 5
322 plot 5
323 probabilities 5
324 processing 5
325 proposed 5
326 query 5
327 recommended 5
328 s 5
329 schedule 5
330 sequence 5
331 shrink 5
332 smaller 5
333 stable 5
334 studied 5
335 technical 5
336 unscaled 5
337 used 5
338 variant 5
339 vision 5
340 widthscaling 5
341 writing 5
342 zhang 5
343 abs 4
344 alex 4
345 all 4
346 analysis 4
347 appear 4
348 based 4
349 better 4
350 case 4
351 constant 4
352 deepmind 4
353 discussed 4
354 due 4
355 each 4
356 effective 4
357 employed 4
358 examining 4
359 example 4
360 fast 4
361 finally 4
362 fit 4
363 focus 4
364 function 4
365 hu 4
366 hyperparameters 4
367 iii 4
368 image 4
369 important 4
370 initialize 4
371 input 4
372 james 4
373 jeffrey 4
374 john 4
375 led 4
376 length 4
377 methods 4
378 minimum 4
379 moreover 4
380 narang 4
381 one 4
382 practice 4
383 pre 4
384 predicting 4
385 primarily 4
386 proceedings 4
387 project 4
388 queries 4
389 regime 4
390 representations 4
391 reproduce 4
392 same 4
393 scaled 4
394 see 4
395 shape 4
396 sharan 4
397 shift 4
398 show 4
399 simon 4
400 sizes 4
401 stabilizing 4
402 techniques 4
403 throughout 4
404 tokens 4
405 trueqk 4
406 unstable 4
407 validation 4
408 wortsman 4
409 would 4
410 xj 4
411 zk 4
412 above 3
413 access 3
414 activation 3
415 add 3
416 advances 3
417 affect 3
418 aidan 3
419 aims 3
420 alleviate 3
421 although 3
422 appears 3
423 applying 3
424 around 3
425 averaged 3
426 ben 3
427 bias 3
428 billion 3
429 causes 3
430 colin 3
431 common 3
432 computer 3
433 computing 3
434 conducted 3
435 connections 3
436 consistent 3
437 contains 3
438 contrast 3
439 cosine 3
440 could 3
441 dahl 3
442 decrease 3
443 decreasing 3
444 deep 3
445 depthdim 3
446 dettmers 3
447 direction 3
448 discussion 3
449 dynamics 3
450 element 3
451 end 3
452 exceeds 3
453 experimentation 3
454 far 3
455 faster 3
456 feature 3
457 features 3
458 following 3
459 framing 3
460 free 3
461 gradually 3
462 grain 3
463 hand 3
464 he 3
465 here 3
466 iccv 3
467 iclr 3
468 igor 3
469 ilya 3
470 improve 3
471 improvement 3
472 indicate 3
473 intermediate 3
474 into 3
475 inverse 3
476 investigate 3
477 investigation 3
478 issues 3
479 its 3
480 jeremy 3
481 jmlr 3
482 jonathan 3
483 k 3
484 kernel 3
485 kornblith 3
486 last 3
487 lead 3
488 left 3
489 long 3
490 lrs 3
491 maximum 3
492 meaningfully 3
493 measures 3
494 michael 3
495 most 3
496 naman 3
497 next 3
498 nm 3
499 noah 3
500 normalization 3
501 observation 3
502 observe 3
503 optimizers 3
504 orbax 3
505 papers 3
506 particular 3
507 phenomenon 3
508 plots 3
509 pmlr 3
510 points 3
511 pools 3
512 presents 3
513 programs 3
514 projection 3
515 provided 3
516 pytorch 3
517 quadratic 3
518 raffel 3
519 recommend 3
520 reduce 3
521 reducing 3
522 regularization 3
523 remainder 3
524 remains 3
525 reproduced 3
526 researchers 3
527 resource 3
528 respectively 3
529 right 3
530 roberts 3
531 roman 3
532 run 3
533 sequences 3
534 shifting 3
535 slightly 3
536 slow 3
537 st 3
538 studying 3
539 substantially 3
540 successful 3
541 summarize 3
542 systems 3
543 tends 3
544 tensor 3
545 therefore 3
546 though 3
547 threshold 3
548 tom 3
549 top 3
550 total 3
551 transfer 3
552 trevor 3
553 typically 3
554 u 3
555 understanding 3
556 variation 3
557 vaswani 3
558 w 3
559 xw 3
560 y 3
561 zachary 3
562 zhou 3
563 zj 3
564 aaron 2
565 abhishek 2
566 achieve 2
567 adafactor 2
568 adlam 2
569 advice 2
570 aeos 2
571 alec 2
572 alemi 2
573 alexander 2
574 allows 2
575 amodei 2
576 amount 2
577 andreas 2
578 annual 2
579 another 2
580 answer 2
581 any 2
582 applies 2
583 architecture 2
584 areas 2
585 arthur 2
586 association 2
587 auxiliary 2
588 babuschkin 2
589 baseline 2
590 basil 2
591 behrooz 2
592 below 2
593 beyer 2
594 biases 2
595 big 2
596 blocks 2
597 bottom 2
598 bradbury 2
599 cai 2
600 cause 2
601 characterized 2
602 child 2
603 chris 2
604 clark 2
605 clear 2
606 closely 2
607 co 2
608 combine 2
609 comparing 2
610 computational 2
611 conclude 2
612 confirm 2
613 consider 2
614 consistently 2
615 context 2
616 corresponding 2
617 cossim 2
618 currently 2
619 curve 2
620 damian 2
621 dan 2
622 daniel 2
623 dario 2
624 decoupled 2
625 decreases 2
626 defined 2
627 demonstrating 2
628 depends 2
629 detailed 2
630 dickstein 2
631 displays 2
632 diverge 2
633 documented 2
634 doi 2
635 ecosystem 2
636 edward 2
637 emerge 2
638 emerges 2
639 enable 2
640 enables 2
641 entropy 2
642 eos 2
643 eps 2
644 epsilon 2
645 equally 2
646 etai 2
647 everett 2
648 evidence 2
649 examines 2
650 exhibit 2
651 explain 2
652 explanation 2
653 exploring 2
654 extrapolating 2
655 fails 2
656 falsen 2
657 finding 2
658 fixed 2
659 focuses 2
660 followed 2
661 forum 2
662 found 2
663 francisco 2
664 frank 2
665 further 2
666 gao 2
667 gaurav 2
668 ghorbani 2
669 goyal 2
670 gradients 2
671 greg 2
672 grow 2
673 gur 2
674 had 2
675 heads 2
676 heek 2
677 helpful 2
678 highlight 2
679 html 2
680 id 2
681 identify 2
682 ieee 2
683 impact 2
684 implementations 2
685 improves 2
686 index 2
687 individually 2
688 initial 2
689 insight 2
690 insights 2
691 interaction 2
692 interesting 2
693 invariant 2
694 investigations 2
695 isolation 2
696 iv 2
697 ivgi 2
698 izzeddin 2
699 jake 2
700 jakob 2
701 jascha 2
702 jason 2
703 jianfeng 2
704 joint 2
705 jointly 2
706 jones 2
707 journal 2
708 just 2
709 kaiser 2
710 kaplan 2
711 katherine 2
712 katie 2
713 kelvin 2
714 kevin 2
715 kolesnikov 2
716 kumar 2
717 kxk 2
718 latter 2
719 leads 2
720 lechao 2
721 less 2
722 library 2
723 limits 2
724 lin 2
725 linguistics 2
726 littwin 2
727 loading 2
728 losseps 2
729 lossweight 2
730 low 2
731 lucas 2
732 lukasz 2
733 luke 2
734 main 2
735 mainly 2
736 many 2
737 matena 2
738 matrix 2
739 meaningful 2
740 measurable 2
741 mechanism 2
742 merrill 2
743 method 2
744 middle 2
745 min 2
746 mitigates 2
747 mitigations 2
748 modification 2
749 mostafa 2
750 moya 2
751 much 2
752 multiple 2
753 muparam 2
754 mustafa 2
755 nado 2
756 nanodo 2
757 net 2
758 neurips 2
759 norman 2
760 note 2
761 novak 2
762 obtained 2
763 occur 2
764 oct 2
765 offers 2
766 often 2
767 oliver 2
768 openreview 2
769 opportunities 2
770 optax 2
771 order 2
772 otherwise 2
773 overall 2
774 packed 2
775 padding 2
776 parameterizations 2
777 parameterizing 2
778 paszke 2
779 pennington 2
780 performs 2
781 periodic 2
782 perspective 2
783 pi 2
784 play 2
785 pointwise 2
786 possible 2
787 precision 2
788 preconditioned 2
789 predicts 2
790 produces 2
791 provides 2
792 qi 2
793 qklayernorm 2
794 quadratically 2
795 radford 2
796 raises 2
797 random 2
798 recall 2
799 received 2
800 regardless 2
801 reliable 2
802 relu 2
803 repeats 2
804 report 2
805 reporting 2
806 residual 2
807 resolve 2
808 resolves 2
809 resources 2
810 rewon 2
811 reyes 2
812 rmsn 2
813 role 2
814 rotary 2
815 row 2
816 roy 2
817 runs 2
818 ryan 2
819 sam 2
820 scientific 2
821 sebastian 2
822 sections 2
823 sensitivityscaling 2
824 sensitivitystandardmuparam 2
825 sentencepiece 2
826 setting 2
827 sgd 2
828 sharpness 2
829 sho 2
830 shows 2
831 shuangfei 2
832 singh 2
833 smallest 2
834 sohl 2
835 some 2
836 sometimes 2
837 sources 2
838 specify 2
839 standardmuparam 2
840 state 2
841 steiner 2
842 stephen 2
843 stern 2
844 still 2
845 succeeds 2
846 summary 2
847 support 2
848 susan 2
849 susskind 2
850 sweep 2
851 th 2
852 thank 2
853 theory 2
854 thilak 2
855 thomas 2
856 through 2
857 tim 2
858 tokenizer 2
859 tool 2
860 towards 2
861 tpus 2
862 unified 2
863 unit 2
864 unless 2
865 usenix 2
866 uszkoreit 2
867 varying 2
868 very 2
869 ways 2
870 wei 2
871 weizhu 2
872 were 2
873 works 2
874 wu 2
875 xi 2
876 xiao 2
877 xiaodong 2
878 xiaohua 2
879 xu 2
880 yaida 2
881 yanqi 2
882 zeros 2
883 zettlemoyer 2
884 TRUE 2
Total 1,882 9,477

The total includes words that appear only once.
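The article does not say how the table above was generated; the following is a minimal sketch, under the assumption that the paper was first dumped to plain text (the file name paper.txt is hypothetical), of how such a rank / term / count table and the grand total could be reproduced.

```python
# Minimal sketch of a term-frequency count (assumed workflow, not the author's actual script).
import re
from collections import Counter

# Assumed input: a plain-text dump of the paper.
with open("paper.txt", encoding="utf-8") as f:
    text = f.read().lower()

# Keep only alphabetic runs, so tokens such as "lr", "qk", and "layernorm" survive.
tokens = re.findall(r"[a-z]+", text)
counts = Counter(tokens)

# Print rank, term, and count, most frequent first, matching the layout of the table above.
for rank, (term, count) in enumerate(counts.most_common(), start=1):
    print(rank, term, count)

# Grand total: number of distinct terms and total occurrences,
# including words that appear only once (as in the note above).
print("Total", len(counts), sum(counts.values()))
```

Because Counter.most_common() returns the full vocabulary sorted by frequency, words that appear only once are still included in the totals, which is consistent with the note above.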
