Overview
This is the third installment of the tokenization series. This time I summarize the Unigram Language Model in my own words. I organize tokenization along two axes: the algorithm (method) and the initial splitting (pre-tokenization).
- Algorithm (method)
  - BPE (Byte Pair Encoding): repeatedly merges the adjacent token pair x, y with the highest frequency freq(x,y) (part 1)
  - WordPiece: repeatedly merges the adjacent token pair x, y with the highest PMI(x,y) (part 2)
  - Unigram: fixes a vocabulary first, keeps the tokens that maximize the likelihood of the text, and removes the tokens with little influence (part 3)
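- Initial splitting (pre-tokenization)
  - Metaspace: splits at the Unicode character level and replaces whitespace with the special symbol "_"
  - ByteLevel: splits text into UTF-8 bytes, so everything can be represented with 256 values
  - Whitespace: splits on whitespace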
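The three pre-tokenization ideas can be sketched with the standard library alone. This is only an illustration of the underlying idea, not the `tokenizers` implementation; the function names are made up for this sketch, and the marker U+2581 ("▁") is an assumption taken from the Metaspace default.

```python
# Standard-library illustration of the three pre-tokenization ideas.
# Hypothetical helper names; NOT the `tokenizers` implementation.

def whitespace_split(text: str) -> list[str]:
    """Whitespace: split on runs of whitespace."""
    return text.split()


def metaspace(text: str, replacement: str = "\u2581") -> str:
    """Metaspace: make spaces visible by replacing them with a marker."""
    return text.replace(" ", replacement)


def byte_level(text: str) -> list[int]:
    """ByteLevel: each character becomes 1 to 4 UTF-8 bytes (values 0-255)."""
    return list(text.encode("utf-8"))


sample = "Hello 世界"
print(whitespace_split(sample))
print(metaspace(sample))
print(byte_level("あ"))  # one Japanese character, three bytes
```

The byte-level view is why 256 symbols are enough to cover any input: every character, including ones never seen in training, decomposes into known bytes.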
The algorithm and the initial splitting can be combined in various ways, though some combinations seem to work better than others. Part 3 covers the Unigram method. First, a note on terminology: in this article, "token" means any vocabulary item, whether a word or a single character (a kana, "0", "1", "A", "B", and so on). The set of all tokens is the vocabulary $V$.
I mostly follow Kudo's paper, and leaned on the book listed in the references to deepen my understanding.
Taku Kudo (2018)
Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates
Apologies for changing the notation as I saw fit.
Background
In the text classification posts so far (parts 20 to 24), I split Japanese strings into morphemes and mapped them to IDs. With that approach, any unseen string is assigned <unk>, so it is of little use when the input is not confined to the dataset. In short, it ends up being "totally unusable".
I looked into how IDs are assigned in the style used by large language models. Part 3 is the Unigram Language Model (Unigram LM).
Files for the exercise
- Data files
  - Japanese: tiny_cc100_ja.csv
  - Japanese, pre-segmented (wakati-gaki): tiny_cc100_ja_wakati.csv
- Code: sample_27.ipynb
1. The idea behind Unigram
The Unigram approach can be thought of as starting from a large vocabulary and repeatedly removing the tokens whose influence or contribution, measured as a difference in likelihood, is small. BPE and WordPiece build the vocabulary by merging tokens into new ones; Unigram takes the opposite view and builds it by discarding tokens that the text can be represented without.
- Compute the difference between the likelihood of the text under the vocabulary $V_1$ and under $V_1\setminus \{w\}$, the vocabulary with token $w$ removed.
- Keep the tokens for which this likelihood difference (the loss of token $w$, which expresses its influence or contribution) is large, remove the tokens for which it is small, and form the next vocabulary $V_2$.
Then repeat from the first step with $V_2$.
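1.1 A little more detail
- $D=\{X_1,\dots,X_j,\dots,X_n\}$: the corpus
- $X_j\in D$: the $j$-th sentence
- $V$: the initial vocabulary. Picture it as every sentence broken into single characters, plus the combinations that can be built from those tokens.
- $\theta = \{p(v)\}_{v\in V}$: a probability distribution over the vocabulary, with $\sum_{v\in V}p(v)=1$.
- $S(X_j)$: the set of segmentations of sentence $X_j$ using $V$.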
Example
Consider the corpus $D=\{X\}$ consisting of the single sentence $X=(a, b, c)$. The initial vocabulary $V$ could be
$$V = \{ a, b, c, ab, bc, abc \}$$
or, limiting each token to at most two characters,
$$V = \{ a, b, c, ab, bc\}$$
The vocabulary contains only substrings that appear contiguously in the corpus, so something like $ac$, made of non-adjacent characters, is not in $V$. 🐵
When $V = \{ a, b, c, ab, bc\}$, the set of segmentations of $X=(a, b, c)$ using $V$ is
$$S(X) = \{ [a][b][c], [a][bc], [ab][c]\}$$
which has three elements. Here [ab] denotes a and b grouped into a single token.
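The set $S(X)$ from this example can be enumerated with a short recursion. This is a minimal sketch for tiny inputs (the number of segmentations grows quickly with length); the function name `segmentations` is made up for illustration.

```python
def segmentations(text, vocab):
    """Return every way to split `text` into tokens drawn from `vocab`."""
    if not text:
        return [[]]  # exactly one way to split the empty string
    result = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in vocab:
            for tail in segmentations(text[end:], vocab):
                result.append([head] + tail)
    return result


V = {"a", "b", "c", "ab", "bc"}
for Z in segmentations("abc", V):
    print("".join(f"[{t}]" for t in Z))
```

Adding `abc` to the vocabulary yields a fourth segmentation, `[abc]`.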
Given the token probabilities $\theta=\{p(v)\}_{v\in V}$, the probability $P_{\theta}(Z)$ that the sentence is generated by the segmentation $Z=(v_1,\dots,v_{|Z|})$ is taken to be
$$P_{\theta}(Z)=\prod_{t=1}^{|Z|}p(v_t)$$
Independence and identical distribution are presumably assumed implicitly here.
The basic idea of Unigram LM is to find the $\theta = \{p(v)\}_{v\in V}$ that maximizes the log marginal likelihood of the whole corpus, and then to remove the tokens that barely affect the likelihood.
Formally, using the probability $P_{\theta}(Z)$ of each segmentation $Z$ of a sentence $X_j$, we look for the probabilities $p(v)$ that maximize the log marginal likelihood:
$$
\begin{aligned}
&\arg\max_{\theta}\sum_{j=1}^{n}\log \sum_{Z\in S(X_j)}P_{\theta}(Z) \\
&= \arg\max_{\{p(v)\}}\sum_{j=1}^{n}\log \sum_{Z\in S(X_j)}\prod_{t=1}^{|Z|}p(v_t)
\end{aligned}
$$
Tokenization with Unigram LM means finding the $p(v)$ that maximize this expression (the log marginal likelihood) and then removing the tokens that have little effect on it. In theory, the maximization is done with the EM algorithm. This is hard to see in the abstract, so I work it out on a concrete example later 🐸
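The objective itself can be evaluated without listing every segmentation. The sketch below uses a forward-style dynamic program; the function names and the uniform probabilities are assumptions made for illustration, not values from the paper.

```python
import math


def marginal_likelihood(text, p):
    """Sum of P(Z) over all segmentations Z of `text`.

    alpha[i] is the total probability of all segmentations of text[:i].
    """
    alpha = [0.0] * (len(text) + 1)
    alpha[0] = 1.0
    for end in range(1, len(text) + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in p:
                alpha[end] += alpha[start] * p[piece]
    return alpha[len(text)]


def corpus_log_likelihood(corpus, p):
    """Log marginal likelihood: sum over sentences of log sum_Z P(Z)."""
    return sum(math.log(marginal_likelihood(x, p)) for x in corpus)


# Assumed uniform probabilities over V = {a, b, c, ab, bc}.
p = {"a": 0.2, "b": 0.2, "c": 0.2, "ab": 0.2, "bc": 0.2}
print(marginal_likelihood("abc", p))
print(corpus_log_likelihood(["abc"], p))
```

With these numbers the three segmentations contribute 1/125, 1/25 and 1/25, so the marginal is 11/125.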
1.2 A bit more detail
Normally the corpus is a collection of many sentences, $D=\{X_1,\dots,X_j,\dots,X_n\}$, but here I take a single sentence, $D=\{X\}$, and follow the theory of Unigram LM up to the removal of a token. The $\sum_{j=1}^{n}$ in the likelihood disappears, which simplifies things a little.
1. Notation
- $D=\{X\}$: a corpus consisting of one sentence
- $V=\{v_1,\dots,v_k\}$: the initial vocabulary
- $V^m=\{(z_1,\dots,z_m)\mid z_t\in V\}$: token sequences of length $m$ (something like sentences)
- $\Omega=\cup_{m=1}^{\infty}V^m$: all token sequences that can be built from $V$
- $\theta=\{p(v)\}_{v\in V}$: a probability distribution over $V$, so $\sum_{v\in V}p(v)=1$
- $S(X)$: the set of segmentations of sentence $X$ using $V$
- $P_{\theta}(Z)$: the probability of generating the sequence $Z\in\Omega$ when the token probabilities are $\{p(v)\}$, computed as
$$P_{\theta}(Z)= \prod_{t=1}^{|Z|}p(v_t)$$
  Statistics-flavored texts often seem to write this as $P(Z|\theta)$.
- Lowercase $p$ denotes a token probability and uppercase $P_{\theta}$ a sentence probability.
- $P_{\theta}(X) = \sum_{Z\in S(X)} P_{\theta}(Z)$ by definition. Strictly speaking $X\not\in\Omega$, but its probability is built from its segmentations.
- The posterior probability that the segmentation is $Z\in S(X)$, given the sentence $X$:
$$P_{\theta}(Z|X)=\frac{P_{\theta}(Z)}{P_{\theta}(X)}=\frac{P_{\theta}(Z)}{\sum_{Z'\in S(X)}P_{\theta}(Z')}$$
2. Log marginal likelihood
Let us rewrite the log marginal likelihood of the corpus, $L(D, V,\theta)$ (see note 1).
\begin{align*}
L(D,V,\theta)
& = \log P_{\theta}(X) \\
& = \log\sum_{Z\in S(X)}P_{\theta}(Z)\\
& = \log\sum_{Z\in S(X)}\prod_{t=1}^{|Z|}p(v_t)\\
\end{align*}
We want the $\theta = \{p(v)\}$ that maximizes this likelihood! But the $\sum$ inside the log makes it awkward.
3. Using the EM algorithm
To estimate $\{p(v)\}$ we use the EM algorithm. EM computes $\{p(v)\}$ with the help of an auxiliary distribution $q(Z)$ and Jensen's inequality.
- E step: fix $\theta$ and find the $q(Z)$ that maximizes the lower bound of $L(D, V, \theta)$.
- M step: fix $q(Z)$ and find the $\{p(v)\}$ that maximizes the lower bound of $L(D, V, \theta)$.
These two steps are repeated. The names stand for Expectation step and Maximization step.
The road to the E step
- We rewrite the log marginal likelihood step by step.
- $q(Z)$ is a probability distribution with $\sum_{Z\in S(X)}q(Z)=1$; we force it into the expression so that Jensen's inequality applies.
- Finally, we force one more rewrite using the KL divergence, which behaves like a distance between probability distributions.
\begin{align*}
L(D, V, \theta)
& = \log P_{\theta}(X)
& \text{log-likelihood of } X \\
& = \log \sum_{Z\in S(X)} P_{\theta}(Z)\\
& = \log \sum_{Z\in S(X)} q(Z) \cdot \frac{P_{\theta}(Z)}{q(Z)}\\
&\ge \sum_{Z\in S(X)} q(Z) \cdot \log \frac{P_{\theta}(Z)}{q(Z)}
& \text{Jensen's inequality}\\
& = \sum_{Z\in S(X)} q(Z) \Bigl[ \log P_{\theta}(Z) - \log q(Z) \Bigr] \\
& = - \Bigl[ \sum_{Z\in S(X)} q(Z) \log q(Z) - \sum_{Z\in S(X)} q(Z)\log P_{\theta}(Z) \Bigr]
& (\text{A})\\
& = - D_{KL}(q || P_{\theta}( \cdot |X)) + \log P_{\theta}(X)
& \text{KL divergence step}\\
\end{align*}
Let us check the last equality above (the KL divergence step). The sums look tedious, but all we do is substitute the definition of the conditional probability and expand the log of a quotient.
\begin{align*}
D_{KL}(q || P_{\theta}( \cdot |X))
& = \sum_{Z\in S(X)} q(Z) \log \frac{q(Z)}{P_{\theta}(Z|X)}
& \text{definition of KL divergence}\\
& = \sum_{Z\in S(X)} q(Z) \log q(Z) - \sum_{Z\in S(X)} q(Z)\log P_{\theta}(Z|X) \\
& = \sum_{Z\in S(X)} q(Z) \log q(Z) - \sum_{Z\in S(X)} q(Z)\log \frac{P_{\theta}(Z)}{P_{\theta}(X)} \\
& = \sum_{Z\in S(X)} q(Z) \log q(Z) - \sum_{Z\in S(X)} q(Z)\Bigl[\log P_{\theta}(Z) - \log P_{\theta}(X) \Bigr]\\
& = \sum_{Z\in S(X)} q(Z) \log q(Z) - \sum_{Z\in S(X)} q(Z)\log P_{\theta}(Z) + \log P_{\theta}(X)
& (\star)\\
\end{align*}
The last line uses $\sum_{Z\in S(X)} q(Z)=1$. Rearranging terms then gives the expression in the KL divergence step.
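The identity "lower bound = $\log P_{\theta}(X) - D_{KL}$" can be checked numerically. The sketch below uses made-up numbers: the values of $P_{\theta}(Z)$ correspond to uniform token probabilities of 1/5 on the $(a,b,c)$ example, and `q` is an arbitrary distribution chosen only for illustration.

```python
import math

# P_theta(Z) for the three segmentations of one sentence (assumed numbers).
P = {"[a][b][c]": 1 / 125, "[a][bc]": 1 / 25, "[ab][c]": 1 / 25}
PX = sum(P.values())                      # P_theta(X)
posterior = {z: P[z] / PX for z in P}      # P_theta(Z | X)


def lower_bound(q):
    """sum_Z q(Z) log(P(Z) / q(Z)), the bound from Jensen's inequality."""
    return sum(q[z] * math.log(P[z] / q[z]) for z in q)


def kl(q):
    """KL divergence between q and the posterior P_theta(. | X)."""
    return sum(q[z] * math.log(q[z] / posterior[z]) for z in q)


q = {"[a][b][c]": 0.5, "[a][bc]": 0.3, "[ab][c]": 0.2}   # any distribution
print(lower_bound(q), math.log(PX) - kl(q))   # the two values agree
print(lower_bound(posterior), math.log(PX))   # the bound is tight here
```

For any `q` the bound stays below `log(PX)`, and the gap is exactly the KL divergence.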
E step
We look for the $q(Z)$ that makes the lower bound of $L(D,V,\theta)$, namely $- D_{KL}(q || P_{\theta}(\cdot|X))+\log P_{\theta}(X)$, as large as possible. The KL divergence is non-negative and acts like a distance between $q(Z)$ and $P_{\theta}(Z|X)$, so the lower bound is maximal when $q(Z)=P_{\theta}(Z|X)$.
Written out:
\begin{align*}
q(Z)
& = P_{\theta}(Z|X) \\
& = \frac{P_{\theta}(Z)}{P_{\theta}(X)} \\
& = \frac{P_{\theta}(Z)}{\sum_{Z'\in S(X)}P_{\theta}(Z')}
\end{align*}
In other words, $q(Z)$ is the generation probability of the segmentation $Z$ computed from $p(v)$, rescaled (normalized) so that it is a proper probability. I compute it on a concrete example later.
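A minimal sketch of the E step, assuming a tiny input so that all segmentations can be listed explicitly. The helper names and the uniform starting probabilities on the $(a,b,c)$ example are assumptions for illustration; exact fractions are used so the normalization is easy to read.

```python
from fractions import Fraction


def segmentations(text, vocab):
    """All segmentations of `text` as tuples of tokens from `vocab`."""
    if not text:
        return [()]
    return [
        (text[:i],) + rest
        for i in range(1, len(text) + 1)
        if text[:i] in vocab
        for rest in segmentations(text[i:], vocab)
    ]


def e_step(text, p):
    """Return q(Z) = P(Z) / sum_Z' P(Z') for every segmentation Z."""
    weights = {}
    for Z in segmentations(text, p):
        w = Fraction(1)
        for token in Z:
            w *= p[token]
        weights[Z] = w
    total = sum(weights.values())
    return {Z: w / total for Z, w in weights.items()}


p = {t: Fraction(1, 5) for t in ["a", "b", "c", "ab", "bc"]}
for Z, qz in e_step("abc", p).items():
    print(Z, qz)
```

Segmentations with fewer tokens get more posterior weight here, simply because fewer factors of 1/5 are multiplied.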
確èªãã€ã³ã
äžçãæå€§ã«ãªã£ãŠããã®ã ãã©ã(â
)ã®åŒã®æåãšæåŸã ããçºãããšã
D_{KL}(q || P_{\theta}( \cdot |X))
= \sum_{Z\in S(X)} q(Z) \log q(Z) - \sum_{Z\in S(X)} q(Z)\log P_{\theta}(Z) + \log P_{\theta}(X) \\
ãšãªããŸããç§»é äœæ¥ããŠã
\begin{align*}
\log P_{\theta}(X)
& = D_{KL}(q || P_{\theta}( \cdot |X)) - \left[\sum_{Z\in S(X)} q(Z) \log q(Z) - \sum_{Z\in S(X)} q(Z)\log P_{\theta}(Z)\right] \\
& = - \left[\sum_{Z\in S(X)} q(Z) \log q(Z) - \sum_{Z\in S(X)} q(Z)\log P_{\theta}(Z)\right] \\
\end{align*}
ãšãã圢ã«ãªããŸãã
- $\log P_{\theta}(X)$ã¯å¯Ÿæ°å°€åºŠã§ããããæå€§ã«ãããã
- []ã§å²ãŸããéšåãããã¯ðåŒã®éšåãšåãïŒãâããã¡ãããšå«ããŠãïŒ
- E stepã§æ±ããå€ã䜿ããšãKLã®å€ã¯0
E stepã§æ±ãã$q(Z) = P_{\theta}(Z|X)$ã䜿ããšãäžç(ðã®åŒ)ãåœåæå€§åãããã£ã尀床$\log P_{\theta}(X)$ã«äžèŽããŠããŸãã(ã¡ãã£ãšèª¬æãäžæãã
)
E step ãšã次ã®M step ãç¹°ãè¿ããŠããã倧ããª$L(D,V,\theta)$ã®å€ãèŠã€ãããã
M step
Taking the $\{q(Z)\}_{Z\in S(X)}$ built in the E step as given, we find the $\theta = \{p(v)\}$ that maximizes the likelihood.
Again we start from the likelihood and apply the same kind of rewriting.
\begin{align*}
L(D, V, \theta)
& = \log \sum_{Z\in S(X)} P_{\theta}(Z)\\
& = \log \sum_{Z\in S(X)} q(Z) \cdot \frac{P_{\theta}(Z)}{q(Z)}\\
&\ge \sum_{Z\in S(X)} q(Z) \cdot \log \frac{P_{\theta}(Z)}{q(Z)}
& \text{Jensen's inequality}\\
& = \sum_{Z\in S(X)} q(Z) \Bigl[ \log P_{\theta}(Z) - \log q(Z) \Bigr] \\
& = \sum_{Z\in S(X)} q(Z)\log P_{\theta}(Z) - \sum_{Z\in S(X)} q(Z) \log q(Z) \\
\end{align*}
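Same pattern as before. We want to make the lower bound larger, but the second term of the last expression does not depend on $\theta$ and is a constant. So the only term that matters for the maximization is the first one, $\sum_{Z\in S(X)} q(Z)\log P_{\theta}(Z)$.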
\begin{align*}
\sum_{Z\in S(X)} q(Z)\log P_{\theta}(Z)
& = \sum_{Z\in S(X)} q(Z) \log \prod_{t=1}^{|Z|} p(v_t) \\
& = \sum_{Z\in S(X)} q(Z) \sum_{t=1}^{|Z|}\log p(v_t) & (\text{B})\\
& = \sum_{v\in V}\left[ \sum_{Z\in S(X)} q(Z) n(v,Z) \right] \log p(v) \\
& = \sum_{v\in V} c(v) \log p(v)
\end{align*}
- $n(v, Z)$ is the number of times token $v$ occurs in the segmentation $Z$.
- If a token $v_t$ occurs twice in $Z$, the inner sum $\sum_{t}\log p(v_t)$ adds $\log p(v_t)$ twice, and the result is multiplied by $q(Z)$. In other words, we multiply $n(v_t, Z)=2$, $\log p(v_t)$ and $q(Z)$ together. Sorry if this is hard to follow.
- The bracketed part is written $c(v)$:
$$
c(v) = \sum_{Z\in S(X)} q(Z) n(v,Z)
$$
It can be interpreted as an expected count.
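Back to the original maximization. Under the constraint $\sum_{v\in V} p(v) = 1$, we look for
$$\arg\max_{\{p(v)\}} \sum_{v\in V} c(v) \log p(v)$$
This is a constrained maximization, so let us use a Lagrange multiplier. Define
$$\mathcal{L} = \sum_{v\in V} c(v) \log p(v) + \lambda \left(1-\sum_{v\in V}p(v)\right)$$
(written $\mathcal{L}$ to keep it distinct from the log-likelihood $L(D,V,\theta)$), differentiate with respect to $p(v)$ and $\lambda$, and set the derivatives to zero.
\begin{align}
\frac{\partial \mathcal{L}}{\partial p(v)} = \frac{c(v)}{p(v)} - \lambda = 0 \\
\frac{\partial \mathcal{L}}{\partial \lambda} = 1 - \sum_{v\in V} p(v) = 0
\end{align}
From (1), $p(v) = c(v)/\lambda$; substituting into (2) gives $\lambda = \sum_{v'\in V} c(v')$, hence
$$p(v) = \frac{c(v)}{\sum_{v'\in V}c(v')}$$
A tidy result: the normalized expected counts are the $p(v)$ that increase the likelihood.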
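The M step is then just counting and normalizing. A minimal sketch, assuming the posterior `q` is already available as a dictionary from segmentations (tuples of tokens) to weights; the function name and the numbers in `q` (the posterior of the $(a,b,c)$ example under uniform probabilities) are assumptions for illustration.

```python
from collections import Counter
from fractions import Fraction


def m_step(q):
    """Return (new probabilities p(v), expected counts c(v))."""
    c = Counter()
    for Z, weight in q.items():
        for token in Z:          # a token occurring twice is added twice: n(v, Z)
            c[token] += weight
    total = sum(c.values())
    return {v: count / total for v, count in c.items()}, dict(c)


q = {
    ("a", "b", "c"): Fraction(1, 11),
    ("a", "bc"): Fraction(5, 11),
    ("ab", "c"): Fraction(5, 11),
}
p_new, counts = m_step(q)
for v in sorted(p_new):
    print(v, counts[v], p_new[v])
```

Tokens that never occur in any segmentation get no entry in this sketch, which corresponds to an expected count of zero.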
4. Influence and contribution of a token
The influence or contribution of a token $t$ can presumably be measured by
$$\text{loss}(t) = L(D, V, \theta) - L(D, V\setminus\{t\}, \theta)$$
Since it represents what is lost when token $t$ is removed, I call it loss.
5. Removal candidates
Tokens with a small loss(t) are removed from the vocabulary $V$; conversely, tokens with a large loss(t) are kept. Removing one token at a time would be inefficient, so in practice a fixed share seems to be dropped per round, for example removing 20% and keeping 80%.
A removal candidate $t^{*}$ can be written as
$$t^{*}\in \arg\min_{t} \Bigl[ L(D, V, \theta) - L(D, V\setminus\{t\}, \theta) \Bigr]$$
Conceptually this means computing expectations over all segmentations and evaluating $L(D, V\setminus\{t\}, \theta)$ with the EM algorithm for every token, which looks extremely tedious. Real implementations apparently use various tricks to obtain loss(t) cheaply.
6. Segmentation
Once the vocabulary $V$ is fixed, how is a sentence actually tokenized? We take the segmentation $Z^*\in S(X)$ under which the sentence $X$ has the highest probability. As a formula,
$$Z^* \in \arg\max_{Z\in S(X)}\prod_{t=1}^{|Z|}p(v_t)$$
The paper obtains it with the Viterbi algorithm.
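A minimal sketch of that Viterbi search over character positions, assuming the token log-probabilities are given as a dictionary. This is an illustration of the idea, not the implementation used by any library; the function name and the example probabilities are made up.

```python
import math


def viterbi_segment(text, logp):
    """Most probable segmentation of `text` under unigram log-probabilities."""
    n = len(text)
    best = [-math.inf] * (n + 1)   # best[i]: best log-prob of text[:i]
    back = [0] * (n + 1)           # back[i]: start index of the last token
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in logp and best[start] + logp[piece] > best[end]:
                best[end] = best[start] + logp[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    tokens, end = [], n
    while end > 0:                 # follow the back-pointers from the end
        tokens.append(text[back[end]:end])
        end = back[end]
    return tokens[::-1], best[n]


probs = {"a": 0.5, "b": 0.3, "ab": 0.2}   # assumed toy probabilities
logp = {t: math.log(v) for t, v in probs.items()}
print(viterbi_segment("aba", logp))
```

Here `[ab][a]` (probability 0.1) beats `[a][b][a]` (probability 0.075), so the longer token is chosen.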
1.3 Working through a concrete example by brute force
Consider the corpus $D$ consisting of the sentence $X=(a, b, a)$, with the initial vocabulary $V_1=\{a, b, ab, ba, aba\}$.
- $(a, b, a)$: the sentence
- $V_1=\{a, b, ab, ba, aba\}$: the initial vocabulary
- $p(v)$: the probability of token $v$
- $S(X) = \{[a][b][a], [a][ba], [ab][a], [aba]\}$
As initial probabilities we take $p(v) = 1/5$, that is, all tokens equally likely.
1. Compute the sentence probabilities $P_{\theta}(Z)$
We use $P_{\theta}(Z) = \prod_{t=1}^{|Z|}p(v_t)$.
- $P_{\theta}$([a][b][a]) = p(a)p(b)p(a) = (1/5) * (1/5) * (1/5) = 1/125
- $P_{\theta}$([a][ba]) = p(a)p(ba) = (1/5) * (1/5) = 1/25
- $P_{\theta}$([ab][a]) = p(ab)p(a) = (1/5) * (1/5) = 1/25
- $P_{\theta}$([aba]) = p(aba) = 1/5
Their sum is
$$\sum_{Z\in S(X)}P_{\theta}(Z) = 36/125$$
2. E step: compute $q(Z)$
We use $q(Z)= \frac{P_{\theta}(Z)}{\sum_{Z'\in S(X)}P_{\theta}(Z')}$.
- q([a][b][a]) = (1/125)/(36/125) = 1/36
- q([a][ba]) = (5/125)/(36/125) = 5/36
- q([ab][a]) = (5/125)/(36/125) = 5/36
- q([aba]) = (25/125)/(36/125) = 25/36
3. Expected counts: compute $c(v)$
We compute the expected number of times each token occurs in a segmentation.
- c(a) = 2×q([a][b][a]) + 1×q([a][ba]) + 1×q([ab][a]) = 12/36
- c(b) = 1×q([a][b][a]) = 1/36
- c(ab) = 1×q([ab][a]) = 5/36
- c(ba) = 1×q([a][ba]) = 5/36
- c(aba) = 1×q([aba]) = 25/36
Their sum is
$$\sum_{v\in V_1} c(v) = 48/36$$
4. M step: normalize the expected counts
Using $p(v) = \frac{c(v)}{\sum_{v'\in V}c(v')}$, we normalize the expected counts.
- p(a) = (12/36)/(48/36) = 12/48
- p(b) = (1/36)/(48/36) = 1/48
- p(ab) = (5/36)/(48/36) = 5/48
- p(ba) = (5/36)/(48/36) = 5/48
- p(aba) = (25/36)/(48/36) = 25/48
This gives the probabilities after one EM update. Strictly speaking, EM should be repeated until convergence; a single iteration is used here for illustration.
p(a)=1/5 → p(a) = 1/4 =: p*(a)
That is the update. I write the updated p as p*.
5. Compute $L(D,V_1,\{p^*(v)\})$ after EM
\begin{align*}
L(D,V_1,\theta^{*})
& = \log ( P_{\theta^{*}}([a][b][a]) + P_{\theta^{*}}([a][ba])+ P_{\theta^{*}}([ab][a]) + P_{\theta^{*}}([aba])) \\
& = \log \left( \frac{1}{4}\cdot\frac{1}{48}\cdot\frac{1}{4}
+\frac{1}{4}\cdot\frac{5}{48}
+\frac{5}{48}\cdot\frac{1}{4} + \frac{25}{48}\right) \\
& = \log \frac{147}{256} \approx -0.555
\end{align*}
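The hand calculation can be double-checked with exact fractions. The sketch below hard-codes the updated probabilities $p^*$ and the four segmentations from the example; the variable names are made up for illustration.

```python
from fractions import Fraction
from math import log, prod

# Updated probabilities p* after one EM iteration (from the example).
theta = {
    "a": Fraction(12, 48),
    "b": Fraction(1, 48),
    "ab": Fraction(5, 48),
    "ba": Fraction(5, 48),
    "aba": Fraction(25, 48),
}
# All segmentations of X = (a, b, a) under V_1.
S = [("a", "b", "a"), ("a", "ba"), ("ab", "a"), ("aba",)]

P = {Z: prod(theta[t] for t in Z) for Z in S}   # P(Z) for each segmentation
P_X = sum(P.values())                            # marginal probability of X

for Z, value in P.items():
    print(Z, value)
print("P(X) =", P_X, " log P(X) =", log(P_X))
```

Note that `[ab][a]` uses $p^*(ab)\,p^*(a) = \frac{5}{48}\cdot\frac{1}{4}$, the same value as `[a][ba]`.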
6. Finding the token to remove
Remove a token $v$ from $V_1$ and compute $L(D, V_1\setminus\{v\}, \theta)$. That requires EM again for every candidate, which is very tedious. This is presumably why real code relies on approximations.
6.1 Taking {b} as the removal candidate
- $(a, b, a)$: the sentence
- $V_2=\{a, ab, ba, aba\} = V_1 \setminus \{b\}$
- $p(v)$: the probability of token $v$
- $S(X) = \{[a][ba], [ab][a], [aba]\}$
We need the token probabilities. The updated $p^*$ still assigns probability to token $b$. It is enough to renormalize without $b$, so using
p*(a) + p*(ab) + p*(ba) + p*(aba) = 47/48
we rescale $p^*$.
The notation gets messy, but I again write the token probabilities as $p$.
- p(a) = (12/48)/(47/48) = 12/47
- p(ab) = 5/47
- p(ba) = 5/47
- p(aba) = 25/47
6.1 Compute $P_{\theta}(Z)$
Since $b$ is no longer in the vocabulary, the set of segmentations has changed.
- $P_{\theta}$([a][ba]) = p(a)p(ba) = (12/47) * (5/47) = 60/2209
- $P_{\theta}$([ab][a]) = p(ab)p(a) = (5/47) * (12/47) = 60/2209
- $P_{\theta}$([aba]) = p(aba) = 25/47 = 1175/2209
Their sum is
$$\sum_{Z\in S(X)}P_{\theta}(Z) = 1295/(47*47) = 1295/2209$$
6.2 E step: compute $q(Z)$
We use $q(Z)= \frac{P_{\theta}(Z)}{\sum_{Z'\in S(X)}P_{\theta}(Z')}$.
- q([a][ba]) = (60/2209)/(1295/2209) = 12/259
- q([ab][a]) = (60/2209)/(1295/2209) = 12/259
- q([aba]) = (1175/2209)/(1295/2209) = 235/259
6.3 Expected counts: compute $c(v)$
We compute the expected number of times each token occurs in a segmentation.
- c(a) = 1×q([a][ba]) + 1×q([ab][a]) = 24/259
- c(ab) = 1×q([ab][a]) = 12/259
- c(ba) = 1×q([a][ba]) = 12/259
- c(aba) = 1×q([aba]) = 235/259
Their sum is
$$\sum_{v\in V_2} c(v) = 283/259$$
6.4. M step: normalize the expected counts
Using $p(v) = \frac{c(v)}{\sum_{v'\in V}c(v')}$, we normalize the expected counts.
- p(a) = (24/259)/(283/259) = 24/283
- p(ab) = (12/259)/(283/259) = 12/283
- p(ba) = (12/259)/(283/259) = 12/283
- p(aba) = (235/259)/(283/259) = 235/283
This gives the probabilities after one EM update on the reduced vocabulary.
p(a)=12/47 → p(a) = 24/283 =: p*(a)
That is the update. I write the updated $p$ as $p^*$.
6.5 Compute $L(D,V_2,\{p^*(v)\})$ after EM
\begin{align*}
L(D,V_1\setminus\{b\},\theta^{*})
& = \log ( P_{\theta^{*}}([a][ba])+ P_{\theta^{*}}([ab][a]) + P_{\theta^{*}}([aba])) \\
& = \log \left( \frac{24}{283}\cdot\frac{12}{283}
+\frac{12}{283}\cdot\frac{24}{283}
+\frac{235}{283}\right) \\
& = \log \frac{67081}{80089} \approx -0.177
\end{align*}
6.6. The loss of token $b$
$$\text{loss}(b) = L(D, V_{1}, \theta^*) - L(D,V_{1}\setminus\{b\}, \theta^*)$$
This gives the loss of token $b$. I think the calculation is right, but I am not fully confident. Let me compute one more.
6.7. Taking {aba} as the removal candidate
- $(a, b, a)$: the sentence
- $V_2=\{a, b, ab, ba\} = V_1 \setminus \{aba\}$
- $p(v)$: the probability of token $v$
- $S(X) = \{[a][b][a], [a][ba], [ab][a]\}$
Computing in the same way,
\begin{align*}
L(D,V_1\setminus\{aba\},\theta^{*})
& = \log ( P_{\theta^{*}}([a][b][a]) + P_{\theta^{*}}([a][ba])+ P_{\theta^{*}}([ab][a])) \\
& = \log \left( \frac{127\cdot6\cdot127}{248^3}+\frac{127\cdot124\cdot115}{248^3}+\frac{127\cdot124\cdot115}{248^3}\right)
\end{align*}
$$\text{loss}(aba) = L(D,V_1, \theta^*) - L(D,V_{1}\setminus\{aba\}, \theta^*)$$
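7. Comparing the losses
The remaining step is to do the same for $ab$, $ba$ and $a$: treat each as the removal candidate, compute the log-likelihood, and then the loss. I have not carried out these calculations here; my guess is that token $b$ has the smallest loss and would therefore be removed. If that guess holds,
$$\text{loss}(b) = \min \{ \text{loss}(aba), \text{loss}(ab), \text{loss}(ba), \text{loss}(a), \text{loss}(b)\}$$
so token $b$, whose removal costs the least, is the one removed.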
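The whole comparison is small enough to brute-force. The sketch below is an illustration under stated assumptions, not how real implementations work. It computes two readings of loss(t): `loss_fixed` keeps $\theta^*$ as it is and only drops the segmentations that use $t$ (the literal reading of the formula, with the same $\theta$ on both sides), while `loss_em` renormalizes and runs one more EM iteration, as in the hand calculation. All helper names are made up.

```python
from fractions import Fraction
from math import log, prod


def segmentations(text, vocab):
    if not text:
        return [()]
    return [(text[:i],) + rest
            for i in range(1, len(text) + 1) if text[:i] in vocab
            for rest in segmentations(text[i:], vocab)]


def marginal(text, p):
    return sum(prod(p[t] for t in Z) for Z in segmentations(text, p))


def em_step(text, p):
    """One E step plus one M step on a single sentence."""
    P = {Z: prod(p[t] for t in Z) for Z in segmentations(text, p)}
    total = sum(P.values())
    counts = {v: Fraction(0) for v in p}
    for Z, w in P.items():
        for t in Z:
            counts[t] += w / total
    norm = sum(counts.values())
    return {v: c / norm for v, c in counts.items()}


def renormalized_without(p, token):
    rest = {v: pv for v, pv in p.items() if v != token}
    z = sum(rest.values())
    return {v: pv / z for v, pv in rest.items()}


X = "aba"
uniform = {v: Fraction(1, 5) for v in ["a", "b", "ab", "ba", "aba"]}
theta = em_step(X, uniform)            # p* from the worked example
base = log(marginal(X, theta))

loss_fixed, loss_em = {}, {}
for t in theta:
    kept = {v: pv for v, pv in theta.items() if v != t}
    loss_fixed[t] = base - log(marginal(X, kept))
    reestimated = em_step(X, renormalized_without(theta, t))
    loss_em[t] = base - log(marginal(X, reestimated))

for t in theta:
    print(t, round(loss_fixed[t], 4), round(loss_em[t], 4))
```

In this toy run the two readings rank the tokens differently: with $\theta^*$ held fixed, $b$ has the smallest loss, whereas after renormalizing and re-running one EM step the smallest value belongs to $a$, because the single remaining segmentation $[aba]$ then gets probability 1. Treat the numbers as an illustration of how much the result depends on the approximation chosen.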
Extremely tedious, to say the least.
This is why real code relies on approximations.
Still, let me regroup and move on to the implementation. With approximations it should be very fast.
2. Implementation
Let us build a tokenizer with the unigram language model. In the BPE post I tried multiple languages, and in the WordPiece post I also tried pre-segmented text. This time I use a small Japanese corpus. Since this post focused on theory, the implementation part is kept short.
The training data is a small sample (on the order of 10,000 lines) extracted from the Japanese (ja) portion of the cc100 dataset, saved as a text file.
```python
import os
import random

import pandas as pd
from tokenizers import Tokenizer, models, trainers, pre_tokenizers, normalizers

# (1) Use Unigram (note: unk_token is specified on the trainer side)
tokenizer = Tokenizer(models.Unigram())
# tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
# tokenizer = Tokenizer(models.WordPiece(unk_token="<unk>"))

# (2) Add normalization
tokenizer.normalizer = normalizers.NFKC()

# (3) Turn this on for pre-segmented (wakati-gaki) text to see the effect
# tokenizer.pre_tokenizer = pre_tokenizers.Metaspace(replacement="▁")

# (4) Specify <unk>
trainer = trainers.UnigramTrainer(
    vocab_size=10_000,
    special_tokens=["<pad>", "<bos>", "<eos>", "<unk>", "<mask>"],
    unk_token="<unk>",
    shrinking_factor=0.75,
    max_piece_length=16,
    n_sub_iterations=2,
)

# (5) Train directly from the csv file
paths = ["./data/tiny_cc100_ja.csv"]


# (6) Shuffling is not really needed here; copied as-is from earlier posts
def mixed_iterator(paths):
    texts = []
    for p in paths:
        # Read only the "text" column
        df = pd.read_csv(p)
        texts.extend(df["text"].tolist())
    # Shuffle everything at once (should be fine up to a few million rows)
    random.shuffle(texts)
    for t in texts:
        yield t


# (7) Train
tokenizer.train_from_iterator(mixed_iterator(paths), trainer=trainer)

# (8) Save (create the output directory first so save() does not fail)
os.makedirs("./tokenizer", exist_ok=True)
tokenizer.save("./tokenizer/unigram_10k.json")
```
説æã¡ã¢
-
NFKCã¯ã¡ãã£ãšããæ£èŠååŠçããããŸã§ã€ããŠãªãã£ããã©

- å šè§è±æ°åãåè§è±æ°å ïŒïŒïŒ â ABC123
- åè§ã«ããå šè§ã«ã ïŸïœœïŸã§ã â ãã¹ãã§ã
- ã¿ã¿ãããªã®ïŒãªããŠèšãã®ããªïŒ ã¿ââ³â¡ â æ ªåŒäŒç€Ÿââ³â¡
-
(3) Metaspaceã®éšåãOFFã«ããŠãããŸããONã«ãããšå¹æãã¯ã£ãããããããšæããŸãããµã³ãã«åºåã§ã_ãã ããã«ãªãã®ã§ä»åã¯çç¥OFFã«ããã
-
(8)åºåãããJSONãã¡ã€ã«ãä»ãŸã§ãšç°ãªããããŒã¯ã³ã®log(確ç)ãã€ããŠããŸããç¹æ®ããŒã¯ã³ã®log(確ç)ã¯0ãšãªã£ãŠããŸãã
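The NFKC examples in the notes can be reproduced with the standard library. This sketch uses `unicodedata.normalize` and only illustrates the normalization form itself; it does not call the `tokenizers` normalizer.

```python
import unicodedata

examples = [
    "ＡＢＣ１２３",   # full-width alphanumerics
    "ﾃｽﾄです",        # half-width katakana
    "㍿○△□",        # a square compatibility character
]
for s in examples:
    print(s, "->", unicodedata.normalize("NFKC", s))
```

Characters without a compatibility mapping, such as ○△□, pass through unchanged.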
Same pattern as every time.
```python
text_list = [
    "これは日本語のテストです",
    "Awesome blog! Do you have any suggestions",
    "🐷　정교하면서 완벽하게 아날로그 카메라를 재현한 것 같다.",
    "你好",
    "😀😀",
]
for text in text_list:
    encoded = tokenizer.encode(text)
    print(f"Text: {text}")
    print("Tokens:", encoded.tokens)
    print("IDs:", encoded.ids)
    print(f"Decoded: {tokenizer.decode(encoded.ids)}\n")
```
Characters that do not occur in the training data become <unk>, and decoding does not restore the original characters.
```text
Text: これは日本語のテストです
Tokens: ['これは', '日本語', 'の', 'テスト', 'です']
IDs: [391, 1324, 6, 4120, 207]
Decoded: これは 日本語 の テスト です

Text: Awesome blog! Do you have any suggestions
Tokens: ['A', 'w', 'e', 's', 'o', 'm', 'e', ' ', 'b', 'l', 'o', 'g', '! ', 'D', 'o', ' ', 'y', 'o', 'u', ' ', 'h', 'a', 'v', 'e', ' ', 'an', 'y', ' ', 's', 'u', 'g', 'g', 'est', 'i', 'on', 's']
IDs: [121, 1133, 226, 522, 171, 318, 226, 14, 951, 547, 171, 600, 1005, 361, 171, 14, 1309, 171, 415, 14, 647, 276, 1257, 226, 14, 3229, 1309, 14, 522, 415, 600, 600, 6908, 255, 1871, 522]
Decoded: A w e s o m e b l o g ! D o y o u h a v e an y s u g g est i on s

Text: 🐷　정교하면서 완벽하게 아날로그 카메라를 재현한 것 같다.
Tokens: ['🐷', ' ', '정교하면서', ' ', '완벽하게', ' ', '아날로그', ' ', '카메라를', ' ', '재현한', ' ', '것', ' ', '같다', '.']
IDs: [3, 14, 3, 14, 3, 14, 3, 14, 3, 14, 3, 14, 3, 14, 3, 254]
Decoded: .

Text: 你好
Tokens: ['你', '好']
IDs: [3, 2149]
Decoded: 好

Text: 😀😀
Tokens: ['😀😀']
IDs: [3]
Decoded:
```
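- Japanese is tokenized properly, which is fine. That was the goal for now.
- For some reason, alphabetic text is also tokenized and restored cleanly.
- The ultra-miniature Japanese corpus used this time, tiny_cc100_ja.csv, seems to contain a lot of alphabetic text. Things like "MacbookPro" and "Submitボタン" (Submit button) blend in quite naturally. The reach of the alphabet into Japanese is remarkable.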
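References
The page on tokenization (BPE/WordPiece/SentencePiece) on the site 機械学習と情報技術 (Machine Learning and Information Technology) gives a compact explanation. The other content on that site is also very instructive.
The book I relied on this time is the following. It comes with a detailed guide to the literature and was very instructive. It made me appreciate having specialist books in Japanese.
- 持橋大地 (Daichi Mochihashi) (2025) 『統計的テキストモデル 言語へのベイズ的アプローチ』 岩波書店 (Iwanami Shoten)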
次å
BERTã®ãããªTransformer Encoderã¿ã€ãã®ã¢ãã«ã§ç»å ŽããMLM (Masked Language Modeling)ã«ã€ããŠæ±ãäºå®ã§ãã
泚
-
察æ°å°€åºŠ$L(D,V,\theta)$ã®è¡šçŸã§ãããä»åã¯èªåœéå$V$ãå€ãã£ãŠããã®ã§ã$V$ã颿°ã®äžã«å ¥ããŠè¡šçŸããŸããã â©