Introduction
These are my notes on RNN, Week 1 (RNN W1) of the Deep Learning Specialization.
(RNN W1L01) RNN Model - Introduction, Motivation, Why Sequence Model
Contents
- Description of the RNN model
- Outputs from previous step are fed as input to the current step
- Has hidden state which remembers information about a sequence
- Have memory
- $h_t = f(h_{t-1}, X_t)$
- Examples of sequential data
- Machine translation ; words are not independent, they depend on the words before and after them
- Named entity recognition
- Sentiment classification
- Word prediction
(RNN W1L02) Sentence / Word Representation
Contents
- Notation
- $X^{(i)}$ ; $i$-th input sentence
- $X^{(i)<t>}$ ; $t$-th word of $i$-th sentence
- $T_X$ ; number of words in the input sentence
- $Y^{(i)}$ ; $i$-th output sentence
- Word representation
- Need to define Dictionary (Vocabulary)
- Represent each word by One-Hot Encoding (denotes position of a word in the Vocabulary)
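A minimal numpy sketch of the one-hot representation, assuming a toy vocabulary (a real vocabulary would be far larger):

```python
import numpy as np

# Toy vocabulary; a real one would contain thousands of words.
vocab = ["a", "and", "harry", "potter", "<UNK>", "<EOS>"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Column vector with a 1 at the word's position in the vocabulary."""
    idx = word_to_index.get(word, word_to_index["<UNK>"])  # unknown words map to <UNK>
    v = np.zeros((len(vocab), 1))
    v[idx] = 1.0
    return v

print(one_hot("harry").ravel())     # [0. 0. 1. 0. 0. 0.]
print(one_hot("hermione").ravel())  # falls into the <UNK> slot
```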
(RNN W1L03) RNN Model
Contents
- Problems with using a standard (non-recurrent) network
- Inputs, outputs can be different lengths in different examples.
- Doesn't share features learned across different positions of the text
- Weaknesses of RNNs
- Information from earlier in the sequence can be used, but information from later cannot (Bidirectional RNNs (BRNNs) address this)
Forward Propagation
a^{<0>} = \vec{0} \\
a^{<t>} = g_1 \left( W_{aa} a^{<t-1>} + W_{ax} X^{<t>} + b_a \right) \\
\hat{y}^{<t>} = g_2 \left( W_{ya} a^{<t>} + b_y \right)
- activation function
- $g_1$ ; $\tanh$ or ReLU
- $g_2$ ; sigmoid
simplified RNN notation
a^{<t>} = g\left( W_a \left[a^{<t-1>} , X^{<t>} \right] + b_a \right) \\
\hat{y}^{<t>} = g\left( W_y a^{<t>} + b_y \right)
- $W_a$ is $W_{aa}$ and $W_{ax}$ concatenated horizontally
- $\left[a^{<t-1>} , X^{<t>} \right]$ is the two vectors stacked vertically
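A minimal numpy sketch (toy sizes, random weights) of one forward step, checking that the simplified notation gives the same result as the original one:

```python
import numpy as np

np.random.seed(0)
n_a, n_x, n_y = 4, 3, 2            # hidden, input, output sizes (toy values)

W_aa = np.random.randn(n_a, n_a)
W_ax = np.random.randn(n_a, n_x)
W_ya = np.random.randn(n_y, n_a)
b_a = np.zeros((n_a, 1))
b_y = np.zeros((n_y, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_step(a_prev, x_t):
    """Forward step in the original notation (g1 = tanh, g2 = sigmoid)."""
    a_t = np.tanh(W_aa @ a_prev + W_ax @ x_t + b_a)
    y_hat_t = sigmoid(W_ya @ a_t + b_y)
    return a_t, y_hat_t

def rnn_step_simplified(a_prev, x_t):
    """Same step with W_a = [W_aa | W_ax] and the two vectors stacked vertically."""
    W_a = np.hstack([W_aa, W_ax])        # horizontal concatenation
    stacked = np.vstack([a_prev, x_t])   # vertical concatenation
    a_t = np.tanh(W_a @ stacked + b_a)
    return a_t, sigmoid(W_ya @ a_t + b_y)

a0, x1 = np.zeros((n_a, 1)), np.random.randn(n_x, 1)
a1, y1 = rnn_step(a0, x1)
a1s, y1s = rnn_step_simplified(a0, x1)
print(np.allclose(a1, a1s), np.allclose(y1, y1s))  # True True
```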
(RNN W1L04) Backpropagation through time
Contents
L^{<t>}\left( \hat{y}^{<t>}, y^{<t>} \right) = -y^{<t>} \log \hat{y}^{<t>} - \left( 1-y^{<t>}\right) \log \left( 1-\hat{y}^{<t>} \right) \\
L\left( \hat{y}, y \right) = \sum_{t=1}^{T_y} L^{<t>} \left( \hat{y}^{<t>}, y^{<t>} \right)
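A small sketch of this loss, assuming a binary label per time step (as in named entity recognition):

```python
import numpy as np

def loss_t(y_hat_t, y_t):
    """Per-time-step binary cross-entropy, matching the formula above."""
    return -y_t * np.log(y_hat_t) - (1 - y_t) * np.log(1 - y_hat_t)

def total_loss(y_hat, y):
    """Overall loss: sum of the per-step losses over t = 1 .. T_y."""
    return sum(loss_t(yh, yt) for yh, yt in zip(y_hat, y))

y_hat = [0.9, 0.2, 0.7]   # predicted probabilities, e.g. "is this word a person's name?"
y     = [1,   0,   1]     # true labels
print(total_loss(y_hat, y))
```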
(RNN W1L05) Different types of RNNs
Contents
- Examples of sequence data
- Speech recognition
- Music generation
- Sentiment classification
- DNA sequence analysis
- Machine translation
- Video activity recognition
- Named entity recognition
- Examples of RNN architectures
- one-to-one
- many-to-one ; sentiment classification
- one-to-many ; music generation
- many-to-many ; machine translation
(RNN W1L06) Language model and sequence generation
Contents
What is language modelling?
- Speech recognition
- P(sentence) = the probability of that sequence of words, used e.g. to pick the most likely transcription in speech recognition
Language modelling with an RNN
- Training set ; large corpus of English text
- tokenize
- <EOS> ; end of sentence
- <UNK> ; unknown word
- RNN model
- $a^{<0>} = \vec{0}$, $X^{<0>} = \vec{0}$
- Then $X^{<2>} = y^{<1>}$, $X^{<3>} = y^{<2>}$, $\cdots$ are fed as inputs (each step receives the true previous word)
- $L(\hat{y}^{<t>}, y^{<t>}) = - \sum_i y_i^{<t>} \log\hat{y}_i^{<t>}$
- $L = \sum_t L(\hat{y}^{<t>}, y^{<t>})$
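A toy sketch of this training setup and loss, assuming made-up word indices and random logits standing in for the network output:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

vocab_size = 10
sentence = [3, 7, 1, 9]                  # toy word indices; 9 plays the role of <EOS>

# x^<1> is the zero vector, then x^<t> = y^<t-1> (the true previous word).
inputs  = [None] + sentence[:-1]         # None stands for the zero vector
targets = sentence

# L(y_hat^<t>, y^<t>) = -sum_i y_i log y_hat_i reduces to -log y_hat[target]
# because the target y^<t> is one-hot.
np.random.seed(0)
total = 0.0
for target in targets:
    logits = np.random.randn(vocab_size)  # stand-in for W_ya a^<t> + b_y
    y_hat = softmax(logits)
    total += -np.log(y_hat[target])
print(inputs, total)
```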
(RNN W1L07) Sampling novel sequences
Contents
- Character-level language model
- A word-level model normally has Vocabulary = [a, aane, ... , zulu, <UNK>]
- A character-level language model has Vocabulary = [a, b, c, ..., X, Y, Z]
- Advantage ; can also handle unknown words
- Disadvantage ; training is more expensive
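A minimal sketch of the sampling loop from this lecture, with random (untrained) weights, so the output is gibberish; with a trained model the same loop generates novel sentences:

```python
import numpy as np

np.random.seed(1)
vocab = ["a", "cat", "sat", ".", "<EOS>"]       # toy vocabulary
n_a, n_x = 8, len(vocab)

# Random weights stand in for a trained language model.
W_aa = np.random.randn(n_a, n_a) * 0.1
W_ax = np.random.randn(n_a, n_x) * 0.1
W_ya = np.random.randn(n_x, n_a) * 0.1
b_a, b_y = np.zeros((n_a, 1)), np.zeros((n_x, 1))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sample(max_len=10):
    a = np.zeros((n_a, 1))
    x = np.zeros((n_x, 1))                      # x^<1> = 0
    words = []
    for _ in range(max_len):
        a = np.tanh(W_aa @ a + W_ax @ x + b_a)
        y_hat = softmax(W_ya @ a + b_y).ravel()
        idx = np.random.choice(len(vocab), p=y_hat)  # sample from the predicted distribution
        if vocab[idx] == "<EOS>":
            break
        words.append(vocab[idx])
        x = np.zeros((n_x, 1)); x[idx] = 1.0    # feed the sampled word back as the next input
    return " ".join(words)

print(sample())
```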
(RNN W1L08) Vanishing gradients with RNNs
Contents
- Gradient clipping ; clip the gradient when it exceeds a threshold (this handles exploding gradients; vanishing gradients are addressed by the GRU/LSTM in the following lectures)
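A minimal sketch of element-wise gradient clipping by value (norm-based clipping is another common variant):

```python
import numpy as np

def clip_gradients(grads, max_value=5.0):
    """Clip every gradient element into [-max_value, max_value], in place."""
    for g in grads:
        np.clip(g, -max_value, max_value, out=g)
    return grads

grads = [np.array([1.0, 12.0, -30.0])]
print(clip_gradients(grads))  # [array([ 1.,  5., -5.])]
```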
(RNN W1L09) Gated Recurrent Unit (GRU)
Contents
GRU (simplified)
c = \textrm{memory cell}\\
c^{<t>} = a^{<t>} \\
\tilde{c}^{<t>} = \tanh\left( W_c \left[ c^{<t-1>}, X^{<t>}\right] + b_c\right) \\
\Gamma_u = \sigma\left(W_u \left[ c^{<t-1>}, X^{<t>}\right] + b_u\right) \\
c^{<t>} = \Gamma_u \ast \tilde{c}^{<t>} + \left( 1-\Gamma_u \right) \ast c^{<t-1>}
- Even as the gradient approaches 0, the memory cell is maintained: when $\Gamma_u \approx 0$, $c^{<t>} \approx c^{<t-1>}$ (see the sketch below)
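A minimal numpy sketch of one simplified GRU step (toy sizes, random weights), plus a direct check of the retention property:

```python
import numpy as np

np.random.seed(0)
n_c, n_x = 4, 3   # memory-cell size and input size (toy values)

W_c = np.random.randn(n_c, n_c + n_x)
W_u = np.random.randn(n_c, n_c + n_x)
b_c, b_u = np.zeros((n_c, 1)), np.zeros((n_c, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step_simplified(c_prev, x_t):
    """One simplified GRU step; here a^<t> = c^<t>."""
    stacked = np.vstack([c_prev, x_t])
    c_tilde = np.tanh(W_c @ stacked + b_c)   # candidate value for the memory cell
    gamma_u = sigmoid(W_u @ stacked + b_u)   # update gate
    return gamma_u * c_tilde + (1 - gamma_u) * c_prev

c_prev, x_t = np.ones((n_c, 1)), np.random.randn(n_x, 1)
print(gru_step_simplified(c_prev, x_t).ravel())

# Retention: with Gamma_u forced close to 0, c^<t> stays close to c^<t-1>.
gamma_u = np.full((n_c, 1), 1e-3)
c_tilde = np.random.randn(n_c, 1)
print(np.allclose(gamma_u * c_tilde + (1 - gamma_u) * c_prev, c_prev, atol=1e-2))  # True
```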
Full GRU
c^{<t>} = a^{<t>} \\
\tilde{c}^{<t>} = \tanh\left( W_c \left[ \Gamma_r \ast c^{<t-1>}, X^{<t>}\right] + b_c\right) \\
\Gamma_u = \sigma\left(W_u \left[ c^{<t-1>}, X^{<t>}\right] + b_u\right) \\
\Gamma_r = \sigma\left(W_r \left[ c^{<t-1>}, X^{<t>}\right] + b_r\right) \\
c^{<t>} = \Gamma_u \ast \tilde{c}^{<t>} + \left( 1-\Gamma_u \right) \ast c^{<t-1>}
Impressions
- Honestly, I don't fully understand this yet
(RNN W1L10) LSTM (long short term memory) unit
Contents
\tilde{c}^{<t>} =\tanh \left( W_c \left[ a^{<t-1>}, X^{<t>} \right] + b_c \right) \\
\Gamma_u = \sigma\left( W_u \left[ a^{<t-1>}, X^{<t>} \right] + b_u \right) \\
\Gamma_f = \sigma\left( W_f \left[ a^{<t-1>}, X^{<t>} \right] + b_f \right) \\
\Gamma_o = \sigma\left( W_o \left[ a^{<t-1>}, X^{<t>} \right] + b_o \right) \\
c^{<t>} = \Gamma_u \ast \tilde{c}^{<t>} + \Gamma_f \ast c^{<t-1>} \\
a^{<t>} = \Gamma_o \ast \tanh c^{<t>}
- Subscripts
- u ; update
- f ; forget
- o ; output
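A minimal numpy sketch of one LSTM step following the equations above (toy sizes, random weights):

```python
import numpy as np

np.random.seed(0)
n_a, n_x = 4, 3   # hidden/memory size and input size (toy values)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight matrix per equation, each acting on [a^<t-1>; x^<t>].
W_c, W_u, W_f, W_o = (np.random.randn(n_a, n_a + n_x) for _ in range(4))
b_c, b_u, b_f, b_o = (np.zeros((n_a, 1)) for _ in range(4))

def lstm_step(a_prev, c_prev, x_t):
    stacked = np.vstack([a_prev, x_t])
    c_tilde = np.tanh(W_c @ stacked + b_c)      # candidate memory
    gamma_u = sigmoid(W_u @ stacked + b_u)      # update gate
    gamma_f = sigmoid(W_f @ stacked + b_f)      # forget gate
    gamma_o = sigmoid(W_o @ stacked + b_o)      # output gate
    c_t = gamma_u * c_tilde + gamma_f * c_prev  # unlike the GRU, update and forget are separate gates
    a_t = gamma_o * np.tanh(c_t)
    return a_t, c_t

a1, c1 = lstm_step(np.zeros((n_a, 1)), np.zeros((n_a, 1)), np.random.randn(n_x, 1))
print(a1.ravel())
```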
(RNN W1L11) Bidirectional RNN
Contents
- Getting information from the future
\hat{y}^{<t>} = g\left( W_y \left[ \overrightarrow{a}^{<t>} , \overleftarrow{a}^{<t>} \right] + b_y \right)
- $\overleftarrow{a}^{<t>}$ carries the information from the future
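A minimal sketch of how the two directions are combined at the output (toy sizes, random activations; the forward and backward passes themselves are omitted):

```python
import numpy as np

np.random.seed(0)
n_a, n_y = 4, 2

# a_fwd^<t> comes from the left-to-right pass, a_bwd^<t> from the right-to-left pass.
a_fwd = np.random.randn(n_a, 1)
a_bwd = np.random.randn(n_a, 1)

# W_y acts on both activations stacked vertically, so it has 2 * n_a columns.
W_y = np.random.randn(n_y, 2 * n_a)
b_y = np.zeros((n_y, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y_hat_t = sigmoid(W_y @ np.vstack([a_fwd, a_bwd]) + b_y)
print(y_hat_t.ravel())
```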
Thoughts
- I'm happy that I managed to write the reversed vector arrow ($\overleftarrow{a}$)
(RNN W1L12) Deep RNNs
Contents
- $a^{[l]<t>}$ ; activation of layer $l$ at time $t$
- For example ...
a^{[2]<3>} = g\left( W_a^{[2]} \left[ a^{[2]<2>}, a^{[1]<3>} \right] + b_a^{[2]} \right)
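A minimal sketch of that example activation, assuming tanh as $g$ and equal layer sizes (toy values):

```python
import numpy as np

np.random.seed(0)
n_a = 4   # activation size of every layer (toy value)

def deep_rnn_cell(W_a, b_a, a_prev_t_same_layer, a_same_t_lower_layer):
    """a^[l]<t> = g(W_a^[l] [a^[l]<t-1>, a^[l-1]<t>] + b_a^[l]), with g = tanh."""
    stacked = np.vstack([a_prev_t_same_layer, a_same_t_lower_layer])
    return np.tanh(W_a @ stacked + b_a)

# Computing a^[2]<3> from a^[2]<2> and a^[1]<3>:
W_a2 = np.random.randn(n_a, 2 * n_a)   # layer-2 weights act on two stacked activations
b_a2 = np.zeros((n_a, 1))
a_2_2 = np.random.randn(n_a, 1)        # a^[2]<2>
a_1_3 = np.random.randn(n_a, 1)        # a^[1]<3>
print(deep_rnn_cell(W_a2, b_a2, a_2_2, a_1_3).ravel())
```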