
言語処理100本ノック (NLP 100 Exercise) 2020 - 86: Convolutional Neural Network (CNN) (Keras)

Posted at 2021-12-15

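This is my record of exercise 86, "Convolutional Neural Network (CNN)", from "Chapter 9: RNN, CNN" of 言語処理100本ノック 2020 (Rev2) (NLP 100 Exercise 2020). Using a CNN for NLP was an eye-opener. Honestly, when I first heard "CNN for NLP", I wondered whether that could really work. Still half in doubt, I worked through it while reading the article 「自然言語処理における畳み込みニューラルネットワークを理解する」 (Understanding Convolutional Neural Networks for NLP).
The article 「まとめ: 言語処理100本ノックで学べることと成果」 summarizes NLP 100 Exercise 2015; I am also updating it with the differences introduced in NLP 100 Exercise 2020 (Rev2).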

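Reference links

| Link | Notes |
|---|---|
| 86_畳み込みニューラルネットワーク(CNN).ipynb | GitHub link to the answer program |
| 言語処理100本ノック 2020 第9章: RNN, CNN | Reference for how to solve it (although it uses PyTorch) |
| 【言語処理100本ノック 2020】第9章: RNN, CNN | Reference for how to solve it (although it uses PyTorch) |
| まとめ: 言語処理100本ノックで学べることと成果 | Summary article on NLP 100 Exercise |
| 自然言語処理における畳み込みニューラルネットワークを理解する | An excellent article to read first (translated version) |
| CNNで文からアニメの主人公を予測する | Used as a reference for writing a CNN in Keras |
| 【Tensorflow2】1次元CNNとTensorflow2での実装 Conv1D | Used to check how Conv1D works |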

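Environment

Because the later exercises are hard to run without a GPU, I used Google Colaboratory. Newer versions of Python and its packages exist, but since I do not use any new features, I use the preinstalled versions as they are.

| Type | Version | Purpose |
|---|---|---|
| Python | 3.7.12 | Version provided by Google Colaboratory |
| google | 2.0.3 | Used to mount Google Drive |
| tensorflow | 2.7.0 | Main deep learning processing |
| nltk | 3.2.5 | Used to build the token dictionary |
| pandas | 1.1.5 | Used for tabular data processing |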

Chapter 9: RNN, CNN

What you learn

Using a deep learning framework, implement a recurrent neural network (RNN) and a convolutional neural network (CNN).

86. Convolutional Neural Network (CNN)

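There is a word sequence $\boldsymbol x = (x_1, x_2, \dots, x_T)$ represented by ID numbers, where $T$ is the length of the sequence and $x_t \in \mathbb{R}^{V}$ is the one-hot representation of a word ID ($V$ is the total number of words). Using a convolutional neural network (CNN), implement a model that predicts the category $y$ from the word sequence $\boldsymbol x$.

The convolutional neural network is configured as follows.

Word embedding dimension: $d_w$
Convolution filter size: 3 tokens
Convolution stride: 1 token
Convolution padding: yes
Dimension of the vector at each time step after the convolution: $d_h$
After the convolution, apply max pooling to represent the input sentence as a $d_h$-dimensional hidden vector

That is, the feature vector $p_t \in \mathbb{R}^{d_h}$ at time $t$ is given by the following equation.

$$
p_t = g(W^{(px)} [\mathrm{emb}(x_{t-1}); \mathrm{emb}(x_t); \mathrm{emb}(x_{t+1})] + b^{(p)})
$$

Here $W^{(px)} \in \mathbb{R}^{d_h \times 3d_w}$ and $b^{(p)} \in \mathbb{R}^{d_h}$ are the parameters of the CNN, $g$ is an activation function (for example $\tanh$ or ReLU), and $[a; b; c]$ is the concatenation of the vectors $a, b, c$. The matrix $W^{(px)}$ has $3d_w$ columns because the linear transformation is applied to the concatenated word embeddings of three tokens.

Max pooling takes, for each dimension of the feature vector, the maximum over all time steps, which gives the feature vector $c \in \mathbb{R}^{d_h}$ of the input document. Writing $c[i]$ for the value of the $i$-th dimension of $c$, max pooling is given by the following equation.

$$
c[i] = \max_{1 \leq t \leq T} p_t[i]
$$

Finally, a linear transformation with the matrix $W^{(yc)} \in \mathbb{R}^{L \times d_h}$ and the bias term $b^{(y)} \in \mathbb{R}^{L}$, followed by the softmax function, is applied to the feature vector $c$ of the input document to predict the category $y$.

$$
y = \mathrm{softmax}(W^{(yc)} c + b^{(y)})
$$

Note that in this exercise the model does not need to be trained; it is enough to compute $y$ with randomly initialized weight matrices.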
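The three equations of the problem statement can be traced with a small NumPy sketch. This is not the answer program of this article; the toy sizes (`V`, `d_w`, `d_h`, `L`), the word IDs in `x`, the choice of $g = \tanh$, and zero vectors as padding are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes and input (assumptions for illustration only)
V, d_w, d_h, L = 10, 4, 3, 4
x = np.array([1, 5, 2, 7, 3])            # word IDs, T = 5
T = len(x)

# Randomly initialized parameters, as the exercise allows
emb = rng.normal(size=(V, d_w))          # embedding table
W_px = rng.normal(size=(d_h, 3 * d_w))   # convolution weights
b_p = rng.normal(size=d_h)
W_yc = rng.normal(size=(L, d_h))
b_y = rng.normal(size=L)

# Assumed padding: one zero vector on each side, so p_t exists for t = 1..T
e = np.vstack([np.zeros(d_w), emb[x], np.zeros(d_w)])   # (T + 2, d_w)

# p_t = g(W [emb(x_{t-1}); emb(x_t); emb(x_{t+1})] + b) with g = tanh
p = np.stack([
    np.tanh(W_px @ np.concatenate([e[t], e[t + 1], e[t + 2]]) + b_p)
    for t in range(T)
])                                                       # (T, d_h)

# c[i] = max_t p_t[i]
c = p.max(axis=0)                                        # (d_h,)

# y = softmax(W c + b), computed in a numerically stable way
logits = W_yc @ c + b_y
y = np.exp(logits - logits.max())
y /= y.sum()

print(p.shape, c.shape, y.shape)
```

Whatever the sentence length $T$, `c` always has $d_h$ entries, which is why the final linear layer can have a fixed shape.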

# Answer
## Result

```python:Result
> model.predict([['this is a pen']])
array([[0.09699879, 0.05598527, 0.82543594, 0.02157998]], dtype=float32)
```

## Answer program: 86_畳み込みニューラルネットワーク(CNN).ipynb

The GitHub notebook also contains code for checking intermediate results, but only the necessary parts are shown here.

import numpy as np
import nltk
from gensim.models import KeyedVectors
import pandas as pd
import tensorflow as tf
from google.colab import drive

drive.mount('/content/drive')

BASE_PATH = '/content/drive/MyDrive/ColabNotebooks/ML/NLP100_2020/'
max_len = 0
vocabulary = []
w2v_model = KeyedVectors.load_word2vec_format(BASE_PATH+'07.WordVector/input/GoogleNews-vectors-negative300.bin.gz', binary=True)

def read_dataset(type_):
    global max_len
    global vocabulary
    df = pd.read_table(BASE_PATH+'06.MachineLearning/'+type_+'.feature.txt')
    df.info()
    sr_title = df['title'].str.split().explode()
    max_len_ = df['title'].map(lambda x: len(x.split())).max()
    if max_len < max_len_:
        max_len = max_len_
    if len(vocabulary) == 0:
        vocabulary = [k for k, v in nltk.FreqDist(sr_title).items() if v > 1]
    else:
        vocabulary.extend([k for k, v in nltk.FreqDist(sr_title).items() if v > 1])
    y = df['category'].replace({'b':0, 't':1, 'e':2, 'm':3})
    return df['title'], tf.keras.utils.to_categorical(y, dtype='int32')  # 4-class classification, so one-hot encode train/valid/test the same way


X_train, y_train = read_dataset('train')
X_valid, y_valid = read_dataset('valid')
X_test, y_test = read_dataset('test')  # also include the test set, without being too strict about it

# Remove duplicates with set and store the result as a tuple
tup_voc = tuple(set(vocabulary))

print(f'vocabulary size before removing duplicates: {len(vocabulary)}')
print(f'vocabulary size after removing duplicates: {len(tup_voc)}')
print(f'sample vocabulary: {tup_voc[:10]}')
print(f'max length is {max_len}')

vectorize_layer = tf.keras.layers.TextVectorization(
 output_mode='int',
 vocabulary=tup_voc,
 output_sequence_length=max_len)

print(f'vocabulary size is {vectorize_layer.vocabulary_size()}')

embedding_dim = 300
hits = 0
misses = 0

embedding_matrix = np.zeros((vectorize_layer.vocabulary_size(), embedding_dim))
for i, word in enumerate(vectorize_layer.get_vocabulary()):
    try:
        embedding_matrix[i] = w2v_model.get_vector(word)
        hits += 1
    except KeyError:
        # Words not found in the embedding index stay all-zeros.
        # This includes the representation for "padding" and "OOV".
        misses += 1
        if misses < 7:  # Show 6 words as example
            print(word)

embedding_layer = tf.keras.layers.Embedding(
    vectorize_layer.vocabulary_size(),
    embedding_dim,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=False
)


model = tf.keras.models.Sequential()
model.add(tf.keras.Input(shape=(1,), dtype=tf.string))
model.add(vectorize_layer)
model.add(embedding_layer)
model.add(tf.keras.layers.Reshape((max_len * embedding_dim, 1)))
model.add(tf.keras.layers.Conv1D(filters=8, 
                                 kernel_size=embedding_dim*3, 
                                 padding='same',
                                 strides=embedding_dim,
                                 activation='relu'))
model.add(tf.keras.layers.MaxPooling1D(pool_size=2))
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(4, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['acc'])
model.summary()

tf.keras.utils.plot_model(model, show_shapes=True)

model.predict([['this is a pen']])
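The model above applies `MaxPooling1D(pool_size=2)` followed by `Flatten`, whereas the problem statement describes max pooling over all time steps. The NumPy sketch below compares the two on a random stand-in for the convolution output; the array `p` and its values are assumptions for illustration, and only the shape (18, 8) is taken from the model. In Keras, the second variant would correspond to `tf.keras.layers.GlobalMaxPooling1D`, which is a possible swap and not what the answer program uses.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_h = 18, 8                     # same shape as the Conv1D output (18, 8)
p = rng.normal(size=(T, d_h))      # random stand-in for the convolution output

# MaxPooling1D(pool_size=2) + Flatten: max over each pair of adjacent steps
pooled_pairs = p.reshape(T // 2, 2, d_h).max(axis=1)   # (9, 8)
flat = pooled_pairs.reshape(-1)                         # (72,)

# Max pooling over all time steps: c[i] = max_t p_t[i]
c = p.max(axis=0)                                       # (8,)

print(flat.shape, c.shape)
```

With pairwise pooling the classifier input grows with the sentence length (9 × 8 = 72 values here), while pooling over all time steps always gives $d_h$ values.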

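Explanation of the answer

CNN

NLP and CNN

I wrote about CNN convolution and pooling in an earlier article.

The best-known work on CNNs for NLP seems to be the paper "A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification". I reproduce its model diagram here. Partly because it uses several filter sizes, it is hard (for an ordinary person like me) to understand at first glance.

Suppose the sentence "I like this movie" becomes the following matrix (4 tokens × 2 dimensions) after Embedding. In practice the entries would be the values corresponding to each token.

| Token | 1 | 2 |
|---|---|---|
| I | 0 (I-1) | 1 (I-2) |
| like | 2 (like-1) | 3 (like-2) |
| this | 4 (this-1) | 5 (this-2) |
| movie | 6 (movie-1) | 7 (movie-2) |

In code, the output looks like this. It is not really "I like this movie": the Embedding weights are initialized with the eight consecutive numbers 0 to 7, the token IDs 0 to 3 are fed in, and the output of the Embedding layer, which sits just before the CNN, is printed.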

> import numpy as np
> import tensorflow as tf

> INPUT = 4
> EMBED_DIM = 2
> initial = np.arange(INPUT * EMBED_DIM).reshape(INPUT, EMBED_DIM)

> model = tf.keras.models.Sequential()
> model.add(tf.keras.Input(shape=(INPUT,)))
> model.add(tf.keras.layers.Embedding(INPUT, EMBED_DIM, 
>                                     embeddings_initializer=tf.keras.initializers.Constant(initial)))
> model.predict(np.arange(INPUT).reshape(1, INPUT))
array([[[0., 1.],
        [2., 3.],
        [4., 5.],
        [6., 7.]]], dtype=float32)

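Then Reshape turns it into the following shape, which Conv1D can process.

| Token | 1 |
|---|---|
| I-1 | 0 |
| I-2 | 1 |
| like-1 | 2 |
| like-2 | 3 |
| this-1 | 4 |
| this-2 | 5 |
| movie-1 | 6 |
| movie-2 | 7 |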

model = tf.keras.models.Sequential()
model.add(tf.keras.Input(shape=(INPUT,)))
model.add(tf.keras.layers.Embedding(INPUT, EMBED_DIM, 
                                    embeddings_initializer=tf.keras.initializers.Constant(initial)))
model.add(tf.keras.layers.Reshape((INPUT * EMBED_DIM, 1))) 
model.predict(np.arange(INPUT).reshape(1, INPUT))

array([[[0.],
        [1.],
        [2.],
        [3.],
        [4.],
        [5.],
        [6.],
        [7.]]], dtype=float32)

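Then convolve with Conv1D, as described in the article 「【Tensorflow2】1次元CNNとTensorflow2での実装 Conv1D」. With a filter size of 3 tokens and a stride of 1 token, the filter covers three tokens' worth of values on the flattened sequence and moves forward by one token's worth of values at each step (I am not fully sure my understanding of the padding is correct).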
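As a check on the padding question, here is a NumPy sketch on the 4 × 2 toy example. It assumes the `'same'` padding rule documented for TensorFlow (output length `ceil(n / stride)`, with the total padding split between left and right); the filter `w` is an arbitrary made-up filter without bias, and the convolution is written by hand rather than calling Keras.

```python
import math
import numpy as np

T, d = 4, 2                                   # 4 tokens, 2-dim embeddings
emb = np.arange(T * d, dtype=float).reshape(T, d)
flat = emb.reshape(-1)                        # length 8, as after Reshape

k, s = 3 * d, d                               # kernel_size and strides on the flat axis
w = np.arange(1, k + 1, dtype=float)          # one arbitrary filter (assumption), no bias

# 'same' padding as documented for TensorFlow (assumed rule)
n = flat.size
out_len = math.ceil(n / s)
pad_total = max((out_len - 1) * s + k - n, 0)
pad_left = pad_total // 2
pad_right = pad_total - pad_left
padded = np.pad(flat, (pad_left, pad_right))

# Strided convolution over the flattened sequence
conv_flat = np.array([padded[i * s:i * s + k] @ w for i in range(out_len)])

# Token-level view: one zero token on each side, window of 3 tokens, stride 1
tok = np.vstack([np.zeros(d), emb, np.zeros(d)])
conv_tok = np.array([tok[t:t + 3].reshape(-1) @ w for t in range(T)])

print(pad_left, pad_right)
print(np.allclose(conv_flat, conv_tok))
```

Under this rule both sides are padded with exactly one token's worth of zeros (`d` values), so the flattened convolution with `kernel_size=3*d` and `strides=d` gives the same result as a 3-token window with stride 1 over the token sequence. The same arithmetic with 5400 values, `kernel_size=900`, and `strides=300` gives 300 zeros on each side and 18 output steps.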

Model coding

This is the part of the program that builds the model and prints it with the summary function.

model = tf.keras.models.Sequential()
model.add(tf.keras.Input(shape=(1,), dtype=tf.string))
model.add(vectorize_layer)
model.add(embedding_layer)
model.add(tf.keras.layers.Reshape((max_len * embedding_dim, 1)))
model.add(tf.keras.layers.Conv1D(filters=8, 
                                 kernel_size=embedding_dim*3, 
                                 padding='same',
                                 strides=embedding_dim,
                                 activation='relu'))
model.add(tf.keras.layers.MaxPooling1D(pool_size=2))
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(4, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['acc'])
model.summary()

Result

 Layer (type)                Output Shape              Param #   
=================================================================
 text_vectorization (TextVec  (None, 18)               0         
 torization)                                                     
                                                                 
 embedding (Embedding)       (None, 18, 300)           2341200   
                                                                 
 reshape_11 (Reshape)        (None, 5400, 1)           0         
                                                                 
 conv1d_8 (Conv1D)           (None, 18, 8)             7208      
                                                                 
 max_pooling1d_7 (MaxPooling  (None, 9, 8)             0         
 1D)                                                             
                                                                 
 flatten_8 (Flatten)         (None, 72)                0         
                                                                 
 dense_8 (Dense)             (None, 4)                 292       
                                                                 
=================================================================
Total params: 2,348,700
Trainable params: 7,500
Non-trainable params: 2,341,200
_________________________________________________________________
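The parameter counts in the summary can be reproduced by hand. The sketch below is plain arithmetic on the values shown above; the variable names are only for illustration.

```python
embedding_dim = 300
max_len = 18
filters = 8
n_classes = 4

embedding_params = 2_341_200                        # taken from the summary
vocab_size = embedding_params // embedding_dim      # rows of the embedding matrix

kernel_size = embedding_dim * 3
conv_params = kernel_size * 1 * filters + filters   # weights (1 input channel) + biases

pooled_len = max_len // 2                           # after MaxPooling1D(pool_size=2)
dense_params = pooled_len * filters * n_classes + n_classes

trainable = conv_params + dense_params
total = embedding_params + trainable
print(vocab_size, conv_params, dense_params, trainable, total)
# 7804 7208 292 7500 2348700
```

The embedding weights are not trainable because the layer is created with `trainable=False`, which is why only the 7,500 convolution and dense parameters are listed as trainable.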

Output of the plot_model function.

tf.keras.utils.plot_model(model, show_shapes=True)
