Deep Learning Specialization (Coursera) Self-Study Notes (RNN W1)

These are my notes on Week 1 of the RNN course (RNN W1) in the Deep Learning Specialization.

(RNN W1L01) RNN Model - Introduction, Motivation, Why Sequence Model


  • Description of the RNN model
    • The output of the previous step is fed as input to the current step
    • Has a hidden state that remembers information about the sequence (i.e. the network has memory)
    • $h_t = f(h_{t-1}, X_t)$
  • Examples of sequential data
    • Machine translation ; words are not independent, each one depends on the words before and after it
    • Named entity recognition
    • Sentiment classification
    • Word prediction

(RNN W1L02) Sentence / Word Representation


  • notation

    • $X^{(i)}$ ; $i$-th input sentence
    • $X^{(i)<t>}$ ; $t$-th word of $i$-th sentence
    • $T_X$ ; number of words in the input sentence
    • $Y^{(i)}$ ; $i$-th output sentence
  • Word representation

    • Need to define Dictionary (Vocabulary)
    • Represent each word by One-Hot Encoding (denotes position of a word in the Vocabulary)
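The one-hot representation above can be sketched in a few lines. The tiny vocabulary and the helper name `one_hot` are my own illustrative assumptions; a real vocabulary would hold tens of thousands of words.

```python
# Hypothetical toy vocabulary (assumption for illustration only).
vocabulary = ["a", "and", "cat", "harry", "potter", "<UNK>"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    """Return a list with a 1 at the word's position in the vocabulary."""
    # Words outside the vocabulary fall back to the <UNK> entry.
    index = word_to_index.get(word, word_to_index["<UNK>"])
    vector = [0] * len(vocabulary)
    vector[index] = 1
    return vector

sentence = ["harry", "potter", "and", "a", "cat"]
X = [one_hot(word) for word in sentence]  # X[t-1] plays the role of X^{<t>}
T_x = len(X)                              # number of words in the sentence

print(one_hot("cat"))  # [0, 0, 1, 0, 0, 0]
```

Each vector has exactly one non-zero entry, so its length always equals the vocabulary size.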

(RNN W1L03) RNN Model


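  • Problems with using a standard network
    • Inputs and outputs can have different lengths in different examples.
    • It does not share features learned across different positions in the text.
  • Weakness of the RNN
    • It can use information from earlier in the sequence but not from later in the sequence (the Bidirectional RNN (BRNN) addresses this)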

Forward Propagation

a^{<0>} = \vec{0} \\
a^{<t>} = g_1 \left( W_{aa} a^{<t-1>} + W_{ax} X^{<t>} + b_a  \right) \\
\hat{y}^{<t>} = g_2 \left( W_{ya} a^{<t>} + b_y \right)

  • activation function
    • $g_1$ ; $\tanh$ or ReLU
    • $g_2$ ; sigmoid (binary output) or softmax (multi-class output)
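The forward propagation equations can be sketched with NumPy as follows. This is a minimal sketch, assuming $g_1 = \tanh$ and $g_2$ = sigmoid; the function name `rnn_forward`, the layer sizes and the random weights are all assumptions of mine.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_forward(X, W_aa, W_ax, W_ya, b_a, b_y):
    """Forward pass over a list of input vectors X[0..T_x-1]."""
    a = np.zeros(W_aa.shape[0])                    # a^{<0>} = 0
    activations, predictions = [], []
    for x_t in X:
        a = np.tanh(W_aa @ a + W_ax @ x_t + b_a)   # g_1 = tanh
        y_hat = sigmoid(W_ya @ a + b_y)            # g_2 = sigmoid
        activations.append(a)
        predictions.append(y_hat)
    return activations, predictions

# Hypothetical sizes and random parameters, for illustration only.
rng = np.random.default_rng(0)
n_a, n_x, n_y, T_x = 4, 3, 1, 5
X = [rng.standard_normal(n_x) for _ in range(T_x)]
W_aa = rng.standard_normal((n_a, n_a)) * 0.1
W_ax = rng.standard_normal((n_a, n_x)) * 0.1
W_ya = rng.standard_normal((n_y, n_a)) * 0.1
b_a, b_y = np.zeros(n_a), np.zeros(n_y)

activations, predictions = rnn_forward(X, W_aa, W_ax, W_ya, b_a, b_y)
```

The same weights are reused at every time step, which is how the RNN shares what it learns across positions.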

simplified RNN notation

a^{<t>} = g\left( W_a \left[a^{<t-1>} , X^{<t>} \right] + b_a \right) \\
\hat{y}^{<t>} = g\left( W_y a^{<t>} + b_y  \right)

  • $W_a$ is $W_{aa}$ and $W_{ax}$ placed side by side (stacked horizontally)
  • $\left[a^{<t-1>} , X^{<t>} \right]$ is the two vectors stacked vertically
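A quick numerical check that the simplified notation computes the same thing as the two separate products. The sizes and random values are assumptions chosen only for the check.

```python
import numpy as np

rng = np.random.default_rng(0)
n_a, n_x = 4, 3                              # hypothetical sizes
W_aa = rng.standard_normal((n_a, n_a))
W_ax = rng.standard_normal((n_a, n_x))
a_prev = rng.standard_normal(n_a)            # a^{<t-1>}
x_t = rng.standard_normal(n_x)               # X^{<t>}

W_a = np.hstack([W_aa, W_ax])                # side by side: shape (n_a, n_a + n_x)
stacked = np.concatenate([a_prev, x_t])      # stacked vertically: shape (n_a + n_x,)

separate = W_aa @ a_prev + W_ax @ x_t
combined = W_a @ stacked
same = np.allclose(separate, combined)
print(same)  # True
```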

(RNN W1L04) Backpropagation through time


L^{<t>}\left( \hat{y}^{<t>}, y^{<t>} \right) = -y^{<t>} \log \hat{y}^{<t>} - \left( 1-y^{<t>}\right) \log \left( 1-\hat{y}^{<t>} \right) \\
L\left( \hat{y}, y \right) = \sum_{t=1}^{T_y} L^{<t>} \left( \hat{y}^{<t>}, y^{<t>} \right)
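As a small worked example, the per-step loss and the summed sequence loss can be computed like this. The labels and predictions are made-up numbers, and the function names are mine.

```python
import math

def step_loss(y_hat, y):
    """Binary cross-entropy L^{<t>} for one time step."""
    return -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)

def sequence_loss(y_hats, ys):
    """Total loss L: the sum of the per-step losses over t = 1..T_y."""
    return sum(step_loss(y_hat, y) for y_hat, y in zip(y_hats, ys))

# Made-up labels and predictions, for illustration only.
y_true = [1, 0, 0, 1]
y_pred = [0.9, 0.2, 0.1, 0.6]
total = sequence_loss(y_pred, y_true)
print(round(total, 4))  # 0.9447
```

Backpropagation through time then propagates the gradient of this total loss backwards through every time step.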

(RNN W1L05) Different types of RNNs


  • Examples of sequence data
    • Speech recognition
    • Music generation
    • Sentiment classification
    • DNA sequence analysis
    • Machine translation
    • Video activity recognition
    • Named entity recognition
  • Examples of RNN architectures
    • one-to-one
    • many-to-one ; sentiment classification
    • one-to-many ; music generation
    • many-to-many ; machine translation

(RNN W1L06) Language model and sequence generation


What is language modelling?

  • Speech recognition
    • P(sentence) = probability that the sentence occurs ; used to choose between similar-sounding candidate sentences

Language modelling with an RNN

  • Training set ; large corpus of English text
  • tokenize
    • <EOS> ; end of sentence
    • <UNK> ; unknown word
  • RNN model
    • $a^{<0>} = \vec{0}$, $X^{<1>} = \vec{0}$
    • Feed $X^{<2>} = y^{<1>}$, $X^{<3>} = y^{<2>}$, $\cdots$ as the inputs
    • $L(\hat{y}^{<t>}, y^{<t>}) = - \sum_i y_i^{<t>} \log\hat{y}_i^{<t>}$
    • $L = \sum_t L(\hat{y}^{<t>}, y^{<t>})$
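The data preparation and loss above can be sketched as follows. This is a minimal sketch: the vocabulary, the example sentence and the helper names are assumptions of mine, and the "model" is replaced by a uniform prediction so that only the bookkeeping is shown.

```python
import math

# Hypothetical toy vocabulary (assumption for illustration only).
vocabulary = ["cats", "average", "15", "hours", "of", "sleep", "a", "day",
              "<EOS>", "<UNK>"]
word_to_index = {w: i for i, w in enumerate(vocabulary)}

def tokenize(sentence):
    """Split into words, append <EOS>, map unknown words to <UNK>."""
    tokens = sentence.lower().rstrip(".").split() + ["<EOS>"]
    return [t if t in word_to_index else "<UNK>" for t in tokens]

def make_inputs_and_targets(tokens):
    """Targets are the tokens; inputs are the targets shifted by one step."""
    inputs = [None] + tokens[:-1]   # None stands for the zero vector at the first step
    return inputs, tokens

def cross_entropy(y_hat, target_index):
    """With a one-hot y, only the target term of -sum_i y_i log(y_hat_i) survives."""
    return -math.log(y_hat[target_index])

tokens = tokenize("Cats average 15 hours of sleep a day.")
inputs, targets = make_inputs_and_targets(tokens)

# Stand-in for the RNN output: a uniform distribution at every step.
uniform = [1.0 / len(vocabulary)] * len(vocabulary)
loss = sum(cross_entropy(uniform, word_to_index[t]) for t in targets)
```

With a uniform prediction each step costs $\log 10$, so a trained model should end up well below that.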

(RNN W1L07) Sampling novel sequence


  • Character-level language model
    • Usually (word level) Vocabulary = [a, aaron, ... , zulu, <UNK>]
    • A character-level language model uses Vocabulary = [a, b, c, ..., X, Y, Z]
    • Advantage ; it can handle unknown words
    • Disadvantage ; training is more expensive
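The trade-off in the list above shows up even on a toy sentence. Both vocabularies here are hypothetical, and I added the space character to the character vocabulary as an assumption so that word boundaries survive.

```python
import string

# Hypothetical vocabularies (assumptions for illustration only).
word_vocabulary = ["a", "cat", "sat", "<UNK>"]
char_vocabulary = list(string.ascii_lowercase + string.ascii_uppercase + " ")

def word_tokens(text):
    """Word-level tokens; anything outside the vocabulary becomes <UNK>."""
    return [w if w in word_vocabulary else "<UNK>" for w in text.split()]

def char_tokens(text):
    """Character-level tokens; every character of the text is one token."""
    return list(text)

text = "a cat sat on Mau"
words = word_tokens(text)
chars = char_tokens(text)

print(words)       # ['a', 'cat', 'sat', '<UNK>', '<UNK>']
print(len(chars))  # 16
```

The character-level version never produces `<UNK>`, but the sequence is about three times longer, which is where the extra training cost comes from.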

(RNN W1L08) Vanishing gradients with RNNs


  • gradient clipping ; when a gradient exceeds a threshold, clip it (this is the remedy for exploding gradients, not for vanishing gradients)
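Gradient clipping can be sketched as an element-wise clamp. This is one common variant and the one assumed here; another variant rescales the whole gradient vector when its norm exceeds the threshold. The function name and the numbers are illustrative.

```python
def clip_gradients(gradients, threshold):
    """Clamp every gradient value into [-threshold, threshold]."""
    return [max(-threshold, min(threshold, g)) for g in gradients]

# Made-up gradient values; two of them exceed the threshold in magnitude.
grads = [0.5, -12.0, 3.0, 40.0]
clipped = clip_gradients(grads, threshold=5.0)
print(clipped)  # [0.5, -5.0, 3.0, 5.0]
```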

(RNN W1L09) Gated Recurrent Unit (GRU)


GRU (simplified)

c = \textrm{memory cell}\\
c^{<t>} = a^{<t>} \\
\tilde{c}^{<t>} = \tanh\left( W_c \left[ c^{<t-1>}, X^{<t>}\right]  + b_c\right) \\
\Gamma_u = \sigma\left(W_u \left[ c^{<t-1>}, X^{<t>}\right]  + b_u\right) \\
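c^{<t>} = \Gamma_u \ast \tilde{c}^{<t>} + \left( 1-\Gamma_u \right) \ast c^{<t-1>}
  • When $\Gamma_u$ is close to 0, $c^{<t>} \approx c^{<t-1>}$, so the memory cell is kept across many time steps; this is what mitigates vanishing gradients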
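A minimal NumPy sketch of one simplified GRU step, in which the last term of the update uses the previous memory cell $c^{<t-1>}$. The sizes, random weights and function names are assumptions of mine. The bias $b_u$ is deliberately set very negative so that $\Gamma_u \approx 0$ and the memory cell is carried over almost unchanged.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_simplified_step(c_prev, x_t, W_c, b_c, W_u, b_u):
    """One step of the simplified GRU; returns (c_t, gamma_u)."""
    stacked = np.concatenate([c_prev, x_t])        # [c^{<t-1>}, X^{<t>}]
    c_tilde = np.tanh(W_c @ stacked + b_c)         # candidate memory value
    gamma_u = sigmoid(W_u @ stacked + b_u)         # update gate, in (0, 1)
    c_t = gamma_u * c_tilde + (1.0 - gamma_u) * c_prev
    return c_t, gamma_u

# Hypothetical sizes and parameters, for illustration only.
rng = np.random.default_rng(0)
n_c, n_x = 3, 2
W_c = rng.standard_normal((n_c, n_c + n_x))
W_u = np.zeros((n_c, n_c + n_x))
b_c = np.zeros(n_c)
b_u = np.full(n_c, -20.0)                          # forces gamma_u close to 0

c_prev = rng.standard_normal(n_c)
x_t = rng.standard_normal(n_x)
c_t, gamma_u = gru_simplified_step(c_prev, x_t, W_c, b_c, W_u, b_u)
kept = np.allclose(c_t, c_prev, atol=1e-6)         # memory cell is preserved
```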

Full GRU

c^{<t>} = a^{<t>} \\
\tilde{c}^{<t>} = \tanh\left( W_c \left[ \Gamma_r \ast c^{<t-1>}, X^{<t>}\right]  + b_c\right) \\
\Gamma_u = \sigma\left(W_u \left[ c^{<t-1>}, X^{<t>}\right]  + b_u\right) \\
\Gamma_r = \sigma\left(W_r \left[ c^{<t-1>}, X^{<t>}\right]  + b_r\right) \\
c^{<t>} = \Gamma_u \ast \tilde{c}^{<t>} + \left( 1-\Gamma_u \right) \ast c^{<t-1>}
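The full GRU step can be sketched the same way. I assume the relevance gate $\Gamma_r$ has its own bias, called `b_r` here, and that the update blends the candidate with the previous memory cell $c^{<t-1>}$. The parameter dictionary, sizes and random values are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(c_prev, x_t, p):
    """One full GRU step; p is a dict of weights W_* and biases b_*."""
    stacked = np.concatenate([c_prev, x_t])
    gamma_r = sigmoid(p["W_r"] @ stacked + p["b_r"])   # relevance gate
    gamma_u = sigmoid(p["W_u"] @ stacked + p["b_u"])   # update gate
    # The candidate uses the previous cell scaled by the relevance gate.
    gated = np.concatenate([gamma_r * c_prev, x_t])
    c_tilde = np.tanh(p["W_c"] @ gated + p["b_c"])
    return gamma_u * c_tilde + (1.0 - gamma_u) * c_prev

# Hypothetical sizes and random parameters, for illustration only.
rng = np.random.default_rng(0)
n_c, n_x = 3, 2
params = {}
for name in ("c", "u", "r"):
    params["W_" + name] = rng.standard_normal((n_c, n_c + n_x)) * 0.5
    params["b_" + name] = np.zeros(n_c)

c = np.zeros(n_c)                                      # c^{<0>} = 0
for x_t in [rng.standard_normal(n_x) for _ in range(4)]:
    c = gru_step(c, x_t, params)
```

Because each new cell value is a gate-weighted average of a $\tanh$ output and the previous cell, the values stay inside $(-1, 1)$ when starting from zero.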


  • To be honest, I do not fully understand this yet

(RNN W1L10) LSTM (long short term memory) unit


\tilde{c}^{<t>} =\tanh \left( W_c \left[ a^{<t-1>}, X^{<t>} \right] + b_c \right) \\
\Gamma_u = \sigma\left( W_u \left[ a^{<t-1>}, X^{<t>} \right] + b_u \right) \\
\Gamma_f = \sigma\left( W_f \left[ a^{<t-1>}, X^{<t>} \right] + b_f \right) \\
\Gamma_o = \sigma\left( W_o \left[ a^{<t-1>}, X^{<t>} \right] + b_o \right) \\
c^{<t>} = \Gamma_u \ast \tilde{c}^{<t>} + \Gamma_f \ast c^{<t-1>} \\
a^{<t>} = \Gamma_o \ast \tanh c^{<t>}
  • subscripts
    • u ; update
    • f ; forget
    • o ; output
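The LSTM equations above map directly onto a NumPy step function. This is a minimal sketch; the parameter dictionary, sizes and random values are assumptions of mine.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(a_prev, c_prev, x_t, p):
    """One LSTM step; returns the new activation a_t and memory cell c_t."""
    stacked = np.concatenate([a_prev, x_t])            # [a^{<t-1>}, X^{<t>}]
    c_tilde = np.tanh(p["W_c"] @ stacked + p["b_c"])   # candidate memory
    gamma_u = sigmoid(p["W_u"] @ stacked + p["b_u"])   # update gate
    gamma_f = sigmoid(p["W_f"] @ stacked + p["b_f"])   # forget gate
    gamma_o = sigmoid(p["W_o"] @ stacked + p["b_o"])   # output gate
    c_t = gamma_u * c_tilde + gamma_f * c_prev
    a_t = gamma_o * np.tanh(c_t)
    return a_t, c_t

# Hypothetical sizes and random parameters, for illustration only.
rng = np.random.default_rng(0)
n_a, n_x = 3, 2
params = {}
for name in ("c", "u", "f", "o"):
    params["W_" + name] = rng.standard_normal((n_a, n_a + n_x)) * 0.5
    params["b_" + name] = np.zeros(n_a)

a, c = np.zeros(n_a), np.zeros(n_a)
for x_t in [rng.standard_normal(n_x) for _ in range(4)]:
    a, c = lstm_step(a, c, x_t, params)
```

Unlike the GRU, the update and forget gates are independent, so the cell can both keep the old value and add the new candidate.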

(RNN W1L11) Bidirectional RNN


  • Getting information from the future
\hat{y}^{<t>} = g\left( W_y \left[ \overrightarrow{a}^{<t>} , \overleftarrow{a}^{<t>} \right]  + b_y \right)

  • $\overleftarrow{a}^{<t>}$ carries the information from the future


  • I am happy that I managed to typeset the leftward vector arrow ($\overleftarrow{a}$)
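The bidirectional output equation above can be sketched as two ordinary RNN passes, one over the reversed sequence. I assume $\tanh$ for the recurrent activation and sigmoid for $g$; all names, sizes and random values are hypothetical. The last lines change only the final input and observe that the prediction at the first step changes, which is the "information from the future".

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def run_rnn(X, W_a, b_a):
    """Plain RNN pass; returns the activation at every step."""
    a = np.zeros(W_a.shape[0])
    out = []
    for x_t in X:
        a = np.tanh(W_a @ np.concatenate([a, x_t]) + b_a)
        out.append(a)
    return out

def brnn_forward(X, fwd, bwd, W_y, b_y):
    a_fwd = run_rnn(X, *fwd)
    # Run over the reversed inputs, then reverse back so index t lines up.
    a_bwd = run_rnn(X[::-1], *bwd)[::-1]
    y_hat = [sigmoid(W_y @ np.concatenate([f, b]) + b_y)
             for f, b in zip(a_fwd, a_bwd)]
    return y_hat, a_fwd, a_bwd

# Hypothetical sizes and random parameters, for illustration only.
rng = np.random.default_rng(0)
n_a, n_x, T_x = 3, 2, 3
fwd = (rng.standard_normal((n_a, n_a + n_x)) * 0.5, np.zeros(n_a))
bwd = (rng.standard_normal((n_a, n_a + n_x)) * 0.5, np.zeros(n_a))
W_y = rng.standard_normal((1, 2 * n_a)) * 0.5
b_y = np.zeros(1)
X = [rng.standard_normal(n_x) for _ in range(T_x)]

y_hat, a_fwd, a_bwd = brnn_forward(X, fwd, bwd, W_y, b_y)

# Change only the last input and recompute.
X_changed = X[:-1] + [X[-1] + 1.0]
y_hat2, a_fwd2, a_bwd2 = brnn_forward(X_changed, fwd, bwd, W_y, b_y)
```

The forward activation at the first step is identical in both runs, while the backward activation (and therefore the first prediction) differs.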

(RNN W1L12) Deep RNNs


  • $a^{[l]<t>}$ ; activation of layer $l$ at time $t$
  • For example ...
a^{[2]<3>} = g\left( W_a^{[2]} \left[  a^{[2]<2>}, a^{[1]<3>} \right] + b_a^{[2]} \right)
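A minimal sketch of a stacked RNN step, checked against the example equation for $a^{[2]<3>}$. I assume $g = \tanh$; the layer count, sizes, names and random values are hypothetical.

```python
import numpy as np

def deep_rnn_step(a_prev_layers, x_t, weights, biases):
    """Advance every layer by one time step.

    a_prev_layers[l] holds the previous-time activation of layer l+1.
    Each layer sees its own previous activation and the layer below it.
    """
    below = x_t                       # the first layer reads the input
    new_layers = []
    for a_prev, W, b in zip(a_prev_layers, weights, biases):
        a = np.tanh(W @ np.concatenate([a_prev, below]) + b)
        new_layers.append(a)
        below = a                     # the next layer reads this activation
    return new_layers

# Hypothetical sizes and random parameters, for illustration only.
rng = np.random.default_rng(0)
n_x, n_a, n_layers, T_x = 2, 3, 2, 3
weights, biases = [], []
for l in range(n_layers):
    n_below = n_x if l == 0 else n_a
    weights.append(rng.standard_normal((n_a, n_a + n_below)) * 0.5)
    biases.append(np.zeros(n_a))

layers = [np.zeros(n_a) for _ in range(n_layers)]
history = []                          # history[t-1][l-1] is a^{[l]<t>}
for x_t in [rng.standard_normal(n_x) for _ in range(T_x)]:
    layers = deep_rnn_step(layers, x_t, weights, biases)
    history.append(layers)

# a^{[2]<3>} recomputed directly from a^{[2]<2>} and a^{[1]<3>}.
expected = np.tanh(
    weights[1] @ np.concatenate([history[1][1], history[2][0]]) + biases[1]
)
matches = np.allclose(history[2][1], expected)
```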



