Small-scale proxies for large-scale Transformer training instabilities

Tags: LLM, AI

Posted at 2024-11-10

LLM (Large Language Model) Advent Calendar 2024
https://qiita.com/advent-calendar/2024/llm
This is the article planned for Day 2 of the calendar.

Small-scale proxies for large-scale Transformer training instabilities
Mitchell Wortsman, Peter J. Liu, Lechao Xiao, Katie Everett, Alex Alemi, Ben Adlam, John D. Co-Reyes, Izzeddin Gur, Abhishek Kumar, Roman Novak, Jeffrey Pennington, Jascha Sohl-dickstein, Kelvin Xu, Jaehoon Lee, Justin Gilmer, Simon Kornblith
https://arxiv.org/abs/2309.14322v2

<This article is a work in progress. I will add more content over time.>

References

[1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[2] Igor Babuschkin, Kate Baumli, Alison Bell, Surya Bhupatiraju, Jake Bruce, Peter Buchlovsky, David Budden, Trevor Cai, Aidan Clark, Ivo Danihelka, Antoine Dedieu, Claudio Fantacci, Jonathan Godwin, Chris Jones, Ross Hemsley, Tom Hennigan, Matteo Hessel, Shaobo Hou, Steven Kapturowski, Thomas Keck, Iurii Kemaev, Michael King, Markus Kunesch, Lena Martens, Hamza Merzic, Vladimir Mikulik, Tamara Norman, George Papamakarios, John Quan, Roman Ring, Francisco Ruiz, Alvaro Sanchez, Laurent Sartran, Rosalia Schneider, Eren Sezener, Stephen Spencer, Srivatsan Srinivasan, Miloš Stanojević, Wojciech Stokowiec, Luyu Wang, Guangyao Zhou, and Fabio Viola. The DeepMind JAX Ecosystem, 2020. URL http://github.com/deepmind.
[3] Blake Bordelon and Cengiz Pehlevan. Dynamics of finite width kernel and prediction fluctuations in mean field neural networks. arXiv preprint arXiv:2304.03408, 2023.
[4] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.
[5] X. Chen, S. Xie, and K. He. An empirical study of training self-supervised vision transformers. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9620–9629, Los Alamitos, CA, USA, oct 2021. IEEE Computer Society. doi: 10.1109/ICCV48922.2021.00950. URL https://doi.ieeecomputersociety.org/10.1109/ICCV48922.2021.00950.
[6] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
[7] Jeremy M Cohen, Simran Kaur, Yuanzhi Li, J Zico Kolter, and Ameet Talwalkar. Gradient descent on neural networks typically occurs at the edge of stability. arXiv preprint arXiv:2103.00065, 2021.
[8] Jeremy M Cohen, Behrooz Ghorbani, Shankar Krishnan, Naman Agarwal, Sourabh Medapati, Michal Badura, Daniel Suo, David Cardoze, Zachary Nado, George E Dahl, et al. Adaptive gradient methods at the edge of stability. arXiv preprint arXiv:2207.14484, 2022.
[9] Alex Damian, Eshaan Nichani, and Jason D Lee. Self-stabilization: The implicit bias of gradient descent at the edge of stability. arXiv preprint arXiv:2209.15594, 2022.
[10] Aaron Defazio and Konstantin Mishchenko. Learning-rate-free learning by d-adaptation. arXiv preprint arXiv:2301.07733, 2023.
[11] Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al. Scaling vision transformers to 22 billion parameters. arXiv preprint arXiv:2302.05442, 2023.
[12] Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. LLM.int8(): 8-bit matrix multiplication for transformers at scale. arXiv preprint arXiv:2208.07339, 2022.
[13] Emily Dinan, Sho Yaida, and Susan Zhang. Effective theory of transformers at initialization. arXiv preprint arXiv:2304.02034, 2023.
[14] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), 2021. https://arxiv.org/abs/2010.11929.
[15] Colin Gaffney, Dinghua Li, Ruoxin Sang, Ayush Jain, and Haitang Hu. Orbax, 2023. URL http://github.com/google/orbax.
[16] Justin Gilmer, Andrea Schioppa, and Jeremy Cohen. Intriguing properties of transformer training instabilities. To appear.
[17] Justin Gilmer, Behrooz Ghorbani, Ankush Garg, Sneha Kudugunta, Behnam Neyshabur, David Cardoze, George Dahl, Zachary Nado, and Orhan Firat. A loss curvature perspective on training instability in deep learning. arXiv preprint arXiv:2110.04369, 2021.
[18] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256. JMLR Workshop and Conference Proceedings, 2010.
[19] Google. Grain - feeding jax models, 2023. URL http://github.com/google/grain.

[20] Jonathan Heek, Anselm Levskaya, Avital Oliver, Marvin Ritter, Bertrand Rondepierre, Andreas Steiner, and Marc van Zee. Flax: A neural network library and ecosystem for JAX, 2023. URL http://github.com/google/flax.
[21] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016.
[22] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
[23] Weizhe Hua, Zihang Dai, Hanxiao Liu, and Quoc Le. Transformer quality in linear time. In International Conference on Machine Learning, pages 9099–9117. PMLR, 2022.
[24] Maor Ivgi, Oliver Hinder, and Yair Carmon. DoG is SGD's best friend: A parameter-free dynamic step size schedule. arXiv preprint arXiv:2302.12022, 2023.
[25] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2018. https://arxiv.org/abs/1806.07572.
[26] Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th annual international symposium on computer architecture, pages 1–12, 2017.
[27] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
[28] Taku Kudo and John Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226, 2018.
[29] Jaehoon Lee. A random walk model of transformer parameter growth, 2023.
[30] Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. On the variance of the adaptive learning rate and beyond. arXiv preprint arXiv:1908.03265, 2019.
[31] Peter J. Liu*, Mohammad Saleh*, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. Generating wikipedia by summarizing long sequences. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=Hyg0vbWC-.
[32] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations (ICLR), 2016. https://arxiv.org/abs/1608.03983.
[33] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR), 2019. https://openreview.net/forum?id=Bkg6RiCqY7.
[34] William Merrill, Vivek Ramanujan, Yoav Goldberg, Roy Schwartz, and Noah Smith. Effects of parameter norm growth during transformer training: Inductive bias from gradient descent. arXiv preprint arXiv:2010.09697, 2020.
[35] Igor Molybog, Peter Albert, Moya Chen, Zachary DeVito, David Esiobu, Naman Goyal, Punit Singh Koura, Sharan Narang, Andrew Poulton, Ruan Silva, et al. A theory on adam instability in large-scale machine learning. arXiv preprint arXiv:2304.09871, 2023.
[36] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems (NeurIPS), 2019. https://arxiv.org/abs/1912.01703.
[37] Ofir Press and Lior Wolf. Using the output embedding to improve language models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 157–163, Valencia, Spain, April 2017. Association for Computational Linguistics. URL https://aclanthology.org/E17-2025.
[38] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language Models are Unsupervised Multitask Learners, 2019. https://openai.com/blog/better-language-models/.
[39] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 2020. http://jmlr.org/papers/v21/20-074.html.
[40] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020. URL http://jmlr.org/papers/v21/20-074.html.
[41] Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, and Yuxiong He. ZeRO-Offload: Democratizing billion-scale model training. In 2021 USENIX Annual Technical Conference (USENIX ATC 21), pages 551–564, 2021.
[42] Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost. In International Conference on Machine Learning, pages 4596–4604. PMLR, 2018.
[43] Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864, 2021.
[44] Vimal Thilak, Etai Littwin, Shuangfei Zhai, Omid Saremi, Roni Paiss, and Joshua Susskind. The slingshot mechanism: An empirical study of adaptive optimizers and the grokking phenomenon. arXiv preprint arXiv:2206.04817, 2022.
[45] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
[46] Mitchell Wortsman, Jaehoon Lee, Justin Gilmer, and Simon Kornblith. Replacing softmax with relu in vision transformers.
[47] Mitchell Wortsman, Tim Dettmers, Luke Zettlemoyer, Ari Morcos, Ali Farhadi, and Ludwig Schmidt. Stable and low-precision training for large-scale vision-language models. arXiv preprint arXiv:2304.13013, 2023.
[48] Sho Yaida. Meta-principled family of hyperparameter scaling strategies. arXiv preprint arXiv:2210.04909, 2022.
[49] Greg Yang and Edward J Hu. Tensor programs iv: Feature learning in infinite-width neural networks. In International Conference on Machine Learning, pages 11727–11737. PMLR, 2021.
[50] Greg Yang, Edward J Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao. Tensor programs v: Tuning large neural networks via zero-shot hyperparameter transfer. arXiv preprint arXiv:2203.03466, 2022.
[51] Shuangfei Zhai, Tatiana Likhomanenko, Etai Littwin, Dan Busbridge, Jason Ramapuram, Yizhe Zhang, Jiatao Gu, and Josh Susskind. Stabilizing transformer training by preventing attention entropy collapse. arXiv preprint arXiv:2303.06296, 2023.
[52] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. arXiv preprint arXiv:2303.15343, 2023.
[53] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.

Term list (a sketch of how this count could be produced follows the table)

No. Term Count
1 the 471
2 e 231
3 and 225
4 of 204
5 in 180
6 to 151
7 we 142
8 lr 129
9 learning 117
10 for 115
11 a 110
12 loss 106
13 is 101
14 n 99
15 rate 88
16 this 88
17 with 83
18 figure 75
19 that 68
20 at 67
21 sensitivity 67
22 as 64
23 scaling 64
24 model 61
25 arxiv 59
26 on 59
27 when 57
28 attention 55
29 by 54
30 layernorm 53
31 training 48
32 rms 47
33 scale 45
34 instability 43
35 models 43
36 al 42
37 et 41
38 qk 40
39 are 39
40 which 39
41 not 38
42 logit 37
43 section 37
44 decay 36
45 use 35
46 final 34
47 parameters 32
48 growth 31
49 param 31
50 from 30
51 layers 30
52 transformer 30
53 large 29
54 gradient 28
55 number 28
56 an 27
57 preprint 27
58 weight 25
59 eval 24
60 max 24
61 step 24
62 without 24
63 instabilities 23
64 z 23
65 be 22
66 default 22
67 small 22
68 output 21
69 effect 20
70 it 20
71 layer 20
72 or 20
73 our 20
74 have 19
75 logits 19
76 test 19
77 up 19
78 vs 19
79 appendix 18
80 dim 18
81 rates 18
82 where 18
83 across 17
84 also 17
85 head 17
86 num 17
87 size 17
88 b 16
89 can 16
90 grad 16
91 neural 16
92 parameter 16
93 width 16
94 adamw 15
95 depth 15
96 divergence 15
97 first 15
98 has 15
99 more 15
100 params 15
101 scales 15
102 steps 15
103 update 15
104 x 15
105 curves 14
106 does 14
107 experiment 14
108 no 14
109 other 14
110 stability 14
111 transformers 14
112 block 13
113 dh 13
114 optimal 13
115 p 13
116 two 13
117 useful 13
118 warm 13
119 batch 12
120 change 12
121 conference 12
122 high 12
123 i 12
124 interventions 12
125 j 12
126 largest 12
127 m 12
128 optimizer 12
129 refer 12
130 characteristics 11
131 fan 11
132 h 11
133 hyperparameter 11
134 illustrated 11
135 independent 11
136 liu 11
137 magnitude 11
138 mean 11
139 root 11
140 self 11
141 simple 11
142 will 11
143 work 11
144 adaptive 10
145 additional 10
146 before 10
147 changes 10
148 d 10
149 different 10
150 increases 10
151 international 10
152 language 10
153 occurs 10
154 results 10
155 spikes 10
156 study 10
157 their 10
158 there 10
159 these 10
160 trends 10
161 using 10
162 behavior 9
163 dimension 9
164 do 9
165 https 9
166 if 9
167 increasing 9
168 information 9
169 larger 9
170 mlp 9
171 norm 9
172 predict 9
173 so 9
174 square 9
175 url 9
176 value 9
177 was 9
178 while 9
179 yang 9
180 adam 8
181 between 8
182 but 8
183 data 8
184 dehghani 8
185 examine 8
186 experiments 8
187 gilmer 8
188 however 8
189 known 8
190 lee 8
191 networks 8
192 observed 8
193 only 8
194 org 8
195 pages 8
196 peter 8
197 range 8
198 such 8
199 then 8
200 via 8
201 warmup 8
202 because 7
203 both 7
204 c 7
205 cases 7
206 cohen 7
207 david 7
208 deviation 7
209 edge 7
210 find 7
211 flax 7
212 g 7
213 google 7
214 how 7
215 http 7
216 increase 7
217 infrastructure 7
218 instead 7
219 jax 7
220 justin 7
221 let 7
222 linear 7
223 measuring 7
224 mitchell 7
225 mitigation 7
226 need 7
227 network 7
228 orders 7
229 paper 7
230 per 7
231 previously 7
232 reduces 7
233 related 7
234 reported 7
235 than 7
236 three 7
237 trained 7
238 v 7
239 values 7
240 weights 7
241 whether 7
242 zhai 7
243 base 6
244 believe 6
245 changing 6
246 chen 6
247 com 6
248 contributed 6
249 denote 6
250 descent 6
251 described 6
252 diverged 6
253 during 6
254 embedding 6
255 hutter 6
256 ii 6
257 instance 6
258 issue 6
259 li 6
260 log 6
261 loshchilov 6
262 machine 6
263 meaning 6
264 new 6
265 norms 6
266 now 6
267 over 6
268 predicted 6
269 prediction 6
270 previous 6
271 process 6
272 progressive 6
273 question 6
274 relationship 6
275 required 6
276 research 6
277 result 6
278 set 6
279 sharpening 6
280 shazeer 6
281 shown 6
282 similar 6
283 softmax 6
284 stabilization 6
285 standard 6
286 text 6
287 they 6
288 too 6
289 train 6
290 zero 6
291 become 5
292 becomes 5
293 been 5
294 best 5
295 chowdhery 5
296 coefficient 5
297 collapse 5
298 compute 5
299 curvature 5
300 details 5
301 ei 5
302 empirical 5
303 experimental 5
304 factor 5
305 figures 5
306 full 5
307 george 5
308 github 5
309 independently 5
310 initialization 5
311 intervention 5
312 jaehoon 5
313 key 5
314 keys 5
315 longer 5
316 lower 5
317 may 5
318 measure 5
319 metric 5
320 noam 5
321 performance 5
322 plot 5
323 probabilities 5
324 processing 5
325 proposed 5
326 query 5
327 recommended 5
328 s 5
329 schedule 5
330 sequence 5
331 shrink 5
332 smaller 5
333 stable 5
334 studied 5
335 technical 5
336 unscaled 5
337 used 5
338 variant 5
339 vision 5
340 widthscaling 5
341 writing 5
342 zhang 5
343 abs 4
344 alex 4
345 all 4
346 analysis 4
347 appear 4
348 based 4
349 better 4
350 case 4
351 constant 4
352 deepmind 4
353 discussed 4
354 due 4
355 each 4
356 effective 4
357 employed 4
358 examining 4
359 example 4
360 fast 4
361 finally 4
362 fit 4
363 focus 4
364 function 4
365 hu 4
366 hyperparameters 4
367 iii 4
368 image 4
369 important 4
370 initialize 4
371 input 4
372 james 4
373 jeffrey 4
374 john 4
375 led 4
376 length 4
377 methods 4
378 minimum 4
379 moreover 4
380 narang 4
381 one 4
382 practice 4
383 pre 4
384 predicting 4
385 primarily 4
386 proceedings 4
387 project 4
388 queries 4
389 regime 4
390 representations 4
391 reproduce 4
392 same 4
393 scaled 4
394 see 4
395 shape 4
396 sharan 4
397 shift 4
398 show 4
399 simon 4
400 sizes 4
401 stabilizing 4
402 techniques 4
403 throughout 4
404 tokens 4
405 trueqk 4
406 unstable 4
407 validation 4
408 wortsman 4
409 would 4
410 xj 4
411 zk 4
412 above 3
413 access 3
414 activation 3
415 add 3
416 advances 3
417 affect 3
418 aidan 3
419 aims 3
420 alleviate 3
421 although 3
422 appears 3
423 applying 3
424 around 3
425 averaged 3
426 ben 3
427 bias 3
428 billion 3
429 causes 3
430 colin 3
431 common 3
432 computer 3
433 computing 3
434 conducted 3
435 connections 3
436 consistent 3
437 contains 3
438 contrast 3
439 cosine 3
440 could 3
441 dahl 3
442 decrease 3
443 decreasing 3
444 deep 3
445 depthdim 3
446 dettmers 3
447 direction 3
448 discussion 3
449 dynamics 3
450 element 3
451 end 3
452 exceeds 3
453 experimentation 3
454 far 3
455 faster 3
456 feature 3
457 features 3
458 following 3
459 framing 3
460 free 3
461 gradually 3
462 grain 3
463 hand 3
464 he 3
465 here 3
466 iccv 3
467 iclr 3
468 igor 3
469 ilya 3
470 improve 3
471 improvement 3
472 indicate 3
473 intermediate 3
474 into 3
475 inverse 3
476 investigate 3
477 investigation 3
478 issues 3
479 its 3
480 jeremy 3
481 jmlr 3
482 jonathan 3
483 k 3
484 kernel 3
485 kornblith 3
486 last 3
487 lead 3
488 left 3
489 long 3
490 lrs 3
491 maximum 3
492 meaningfully 3
493 measures 3
494 michael 3
495 most 3
496 naman 3
497 next 3
498 nm 3
499 noah 3
500 normalization 3
501 observation 3
502 observe 3
503 optimizers 3
504 orbax 3
505 papers 3
506 particular 3
507 phenomenon 3
508 plots 3
509 pmlr 3
510 points 3
511 pools 3
512 presents 3
513 programs 3
514 projection 3
515 provided 3
516 pytorch 3
517 quadratic 3
518 raffel 3
519 recommend 3
520 reduce 3
521 reducing 3
522 regularization 3
523 remainder 3
524 remains 3
525 reproduced 3
526 researchers 3
527 resource 3
528 respectively 3
529 right 3
530 roberts 3
531 roman 3
532 run 3
533 sequences 3
534 shifting 3
535 slightly 3
536 slow 3
537 st 3
538 studying 3
539 substantially 3
540 successful 3
541 summarize 3
542 systems 3
543 tends 3
544 tensor 3
545 therefore 3
546 though 3
547 threshold 3
548 tom 3
549 top 3
550 total 3
551 transfer 3
552 trevor 3
553 typically 3
554 u 3
555 understanding 3
556 variation 3
557 vaswani 3
558 w 3
559 xw 3
560 y 3
561 zachary 3
562 zhou 3
563 zj 3
564 aaron 2
565 abhishek 2
566 achieve 2
567 adafactor 2
568 adlam 2
569 advice 2
570 aeos 2
571 alec 2
572 alemi 2
573 alexander 2
574 allows 2
575 amodei 2
576 amount 2
577 andreas 2
578 annual 2
579 another 2
580 answer 2
581 any 2
582 applies 2
583 architecture 2
584 areas 2
585 arthur 2
586 association 2
587 auxiliary 2
588 babuschkin 2
589 baseline 2
590 basil 2
591 behrooz 2
592 below 2
593 beyer 2
594 biases 2
595 big 2
596 blocks 2
597 bottom 2
598 bradbury 2
599 cai 2
600 cause 2
601 characterized 2
602 child 2
603 chris 2
604 clark 2
605 clear 2
606 closely 2
607 co 2
608 combine 2
609 comparing 2
610 computational 2
611 conclude 2
612 confirm 2
613 consider 2
614 consistently 2
615 context 2
616 corresponding 2
617 cossim 2
618 currently 2
619 curve 2
620 damian 2
621 dan 2
622 daniel 2
623 dario 2
624 decoupled 2
625 decreases 2
626 defined 2
627 demonstrating 2
628 depends 2
629 detailed 2
630 dickstein 2
631 displays 2
632 diverge 2
633 documented 2
634 doi 2
635 ecosystem 2
636 edward 2
637 emerge 2
638 emerges 2
639 enable 2
640 enables 2
641 entropy 2
642 eos 2
643 eps 2
644 epsilon 2
645 equally 2
646 etai 2
647 everett 2
648 evidence 2
649 examines 2
650 exhibit 2
651 explain 2
652 explanation 2
653 exploring 2
654 extrapolating 2
655 fails 2
656 falsen 2
657 finding 2
658 fixed 2
659 focuses 2
660 followed 2
661 forum 2
662 found 2
663 francisco 2
664 frank 2
665 further 2
666 gao 2
667 gaurav 2
668 ghorbani 2
669 goyal 2
670 gradients 2
671 greg 2
672 grow 2
673 gur 2
674 had 2
675 heads 2
676 heek 2
677 helpful 2
678 highlight 2
679 html 2
680 id 2
681 identify 2
682 ieee 2
683 impact 2
684 implementations 2
685 improves 2
686 index 2
687 individually 2
688 initial 2
689 insight 2
690 insights 2
691 interaction 2
692 interesting 2
693 invariant 2
694 investigations 2
695 isolation 2
696 iv 2
697 ivgi 2
698 izzeddin 2
699 jake 2
700 jakob 2
701 jascha 2
702 jason 2
703 jianfeng 2
704 joint 2
705 jointly 2
706 jones 2
707 journal 2
708 just 2
709 kaiser 2
710 kaplan 2
711 katherine 2
712 katie 2
713 kelvin 2
714 kevin 2
715 kolesnikov 2
716 kumar 2
717 kxk 2
718 latter 2
719 leads 2
720 lechao 2
721 less 2
722 library 2
723 limits 2
724 lin 2
725 linguistics 2
726 littwin 2
727 loading 2
728 losseps 2
729 lossweight 2
730 low 2
731 lucas 2
732 lukasz 2
733 luke 2
734 main 2
735 mainly 2
736 many 2
737 matena 2
738 matrix 2
739 meaningful 2
740 measurable 2
741 mechanism 2
742 merrill 2
743 method 2
744 middle 2
745 min 2
746 mitigates 2
747 mitigations 2
748 modification 2
749 mostafa 2
750 moya 2
751 much 2
752 multiple 2
753 muparam 2
754 mustafa 2
755 nado 2
756 nanodo 2
757 net 2
758 neurips 2
759 norman 2
760 note 2
761 novak 2
762 obtained 2
763 occur 2
764 oct 2
765 offers 2
766 often 2
767 oliver 2
768 openreview 2
769 opportunities 2
770 optax 2
771 order 2
772 otherwise 2
773 overall 2
774 packed 2
775 padding 2
776 parameterizations 2
777 parameterizing 2
778 paszke 2
779 pennington 2
780 performs 2
781 periodic 2
782 perspective 2
783 pi 2
784 play 2
785 pointwise 2
786 possible 2
787 precision 2
788 preconditioned 2
789 predicts 2
790 produces 2
791 provides 2
792 qi 2
793 qklayernorm 2
794 quadratically 2
795 radford 2
796 raises 2
797 random 2
798 recall 2
799 received 2
800 regardless 2
801 reliable 2
802 relu 2
803 repeats 2
804 report 2
805 reporting 2
806 residual 2
807 resolve 2
808 resolves 2
809 resources 2
810 rewon 2
811 reyes 2
812 rmsn 2
813 role 2
814 rotary 2
815 row 2
816 roy 2
817 runs 2
818 ryan 2
819 sam 2
820 scientific 2
821 sebastian 2
822 sections 2
823 sensitivityscaling 2
824 sensitivitystandardmuparam 2
825 sentencepiece 2
826 setting 2
827 sgd 2
828 sharpness 2
829 sho 2
830 shows 2
831 shuangfei 2
832 singh 2
833 smallest 2
834 sohl 2
835 some 2
836 sometimes 2
837 sources 2
838 specify 2
839 standardmuparam 2
840 state 2
841 steiner 2
842 stephen 2
843 stern 2
844 still 2
845 succeeds 2
846 summary 2
847 support 2
848 susan 2
849 susskind 2
850 sweep 2
851 th 2
852 thank 2
853 theory 2
854 thilak 2
855 thomas 2
856 through 2
857 tim 2
858 tokenizer 2
859 tool 2
860 towards 2
861 tpus 2
862 unified 2
863 unit 2
864 unless 2
865 usenix 2
866 uszkoreit 2
867 varying 2
868 very 2
869 ways 2
870 wei 2
871 weizhu 2
872 were 2
873 works 2
874 wu 2
875 xi 2
876 xiao 2
877 xiaodong 2
878 xiaohua 2
879 xu 2
880 yaida 2
881 yanqi 2
882 zeros 2
883 zettlemoyer 2
884 TRUE 2
Total 1,882 9,477

The total includes words that appear only once.
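The article does not say how the table above was generated; the following is a minimal sketch, under the assumption that the paper was first dumped to plain text (the file name paper.txt is hypothetical), of how such a rank / term / count table and the grand total could be reproduced.

```python
# Minimal sketch of a term-frequency count (assumed workflow, not the author's actual script).
import re
from collections import Counter

# Assumed input: a plain-text dump of the paper.
with open("paper.txt", encoding="utf-8") as f:
    text = f.read().lower()

# Keep only alphabetic runs, so tokens such as "lr", "qk", and "layernorm" survive.
tokens = re.findall(r"[a-z]+", text)
counts = Counter(tokens)

# Print rank, term, and count, most frequent first, matching the layout of the table above.
for rank, (term, count) in enumerate(counts.most_common(), start=1):
    print(rank, term, count)

# Grand total: number of distinct terms and total occurrences,
# including words that appear only once (as in the note above).
print("Total", len(counts), sum(counts.values()))
```

Because Counter.most_common() returns the full vocabulary sorted by frequency, words that appear only once are still included in the totals, which is consistent with the note above.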
