
NLP 100 Exercise (2020) - 84: Introducing Word Vectors (Keras)

Posted at 2021-12-01

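This is my record of exercise 84, "Introducing word vectors", from "Chapter 9: RNN, CNN" of NLP 100 Exercise 2020 (Rev2). It is something I wanted to try at work but never had time for. I had expected that using pretrained word vectors would improve accuracy considerably, and accuracy on the validation and test sets did improve. I would like to use this at work when the opportunity arises.

The article "Summary: What you can learn from NLP 100 Exercise and the results" covers NLP 100 Exercise 2015; I am also updating it with the differences in NLP 100 Exercise 2020 (Rev2).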

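Reference links

Link	Notes
84_単語ベクトルの導入.ipynb	GitHub link to the answer program
NLP 100 Exercise 2020, Chapter 9: RNN, CNN	Reference for the solution approach (written in PyTorch, though)
【NLP 100 Exercise 2020】Chapter 9: RNN, CNN	Reference for the solution approach (written in PyTorch, though)
Summary: What you can learn from NLP 100 Exercise and the results	Summary article for NLP 100 Exercise
[Keras official] Using pre-trained word embeddings	I based my code on the code here
Keras notes (Embedding layer), part 2	Used as a minor reference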

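Environment

A GPU becomes necessary in later exercises, so I used Google Colaboratory. Newer versions of Python and its packages exist, but I am not using any new features, so I used the preinstalled versions as they are.

Type	Version	Purpose
Python	3.7.12	Version on Google Colaboratory
google	2.0.3	Used to mount Google Drive
tensorflow	2.7.0	Main deep learning processing
nltk	3.2.5	Used to build the token dictionary
pandas	1.1.5	Used for tabular data processing
gensim	3.6.0	Used to load the Google News word vectors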

Chapter 9: RNN, CNN

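Learning content

Using a deep learning framework, implement a recurrent neural network (RNN) and a convolutional neural network (CNN).

84. Introducing word vectors

Initialize the word embedding $\mathrm{emb}(x)$ with pretrained word vectors (for example, the word vectors trained on the Google News dataset of about 100 billion words), and train the model.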

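Answer

Answer results

Here is the accuracy comparison with the previous exercise. In each cell the previous result comes first, then the result from this exercise, with the difference in parentheses. Accuracy on the training set dropped, while accuracy on the validation and test sets improved (everything other than the Embedding is identical, and both runs trained for 100 epochs). The model was apparently overfitting before, and that seems to have been mitigated.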

Dataset	Loss	Accuracy
Training	0.162 → 0.384 (+0.222)	94.7% → 86.7% (-8.0%)
Validation	0.650 → 0.470 (-0.180)	82.1% → 83.8% (+1.7%)
Test	0.556 → 0.446 (-0.110)	83.8% → 85.1% (+1.3%)
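
The overfitting reading can be checked with simple arithmetic on the accuracies in the table: the gap between training and test accuracy shrinks sharply. A minimal sketch follows; the dictionary keys and variable names are mine, chosen only for illustration.

```python
# Accuracies (%) copied from the comparison table in this article.
results = {
    "previous": {"train": 94.7, "valid": 82.1, "test": 83.8},
    "pretrained": {"train": 86.7, "valid": 83.8, "test": 85.1},
}

gaps = {}
for name, acc in results.items():
    # A large train-minus-test gap is a common sign of overfitting.
    gaps[name] = round(acc["train"] - acc["test"], 1)
    print(f"{name}: train-test gap = {gaps[name]:.1f} points")
# previous: train-test gap = 10.9 points
# pretrained: train-test gap = 1.6 points
```

The gap falls from about 11 points to under 2, which is consistent with the interpretation that the pretrained embedding reduced overfitting.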

For reference, here is the result on the test set.

Result

42/42 [==============================] - 0s 6ms/step - loss: 0.4455 - acc: 0.8510
[0.4454902708530426, 0.851047933101654]

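Training time also dropped from 7 min 25 s to 5 min 25 s, a reduction of about 27%, presumably because there are fewer parameters to train. I measured it with the Jupyter magic command %%time.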

Answer program 84_単語ベクトルの導入.ipynb

The GitHub version also includes verification code, but only the essential parts are shown here.

import numpy as np
import nltk
from gensim.models import KeyedVectors
import pandas as pd
import tensorflow as tf
from google.colab import drive

drive.mount('/content/drive')

BASE_PATH = '/content/drive/MyDrive/ColabNotebooks/ML/NLP100_2020/'
max_len = 0
vocabulary = []
w2v_model = KeyedVectors.load_word2vec_format(BASE_PATH+'07.WordVector/input/GoogleNews-vectors-negative300.bin.gz', binary=True)

def read_dataset(type_):
    global max_len
    global vocabulary
    df = pd.read_table(BASE_PATH+'06.MachineLearning/'+type_+'.feature.txt')
    df.info()
    sr_title = df['title'].str.split().explode()
    max_len_ = df['title'].map(lambda x: len(x.split())).max()
    if max_len < max_len_:
        max_len = max_len_
    if len(vocabulary) == 0:
        vocabulary = [k for k, v in nltk.FreqDist(sr_title).items() if v > 1]
    else:
        vocabulary.extend([k for k, v in nltk.FreqDist(sr_title).items() if v > 1])
    y = df['category'].replace({'b':0, 't':1, 'e':2, 'm':3})
    return df['title'], tf.keras.utils.to_categorical(y, dtype='int32')  # 4-class classification, so one-hot encode labels the same way for train, valid, and test


X_train, y_train = read_dataset('train')
X_valid, y_valid = read_dataset('valid')
X_test, y_test = read_dataset('test')

# Remove duplicates with set and convert to a tuple
tup_voc = tuple(set(vocabulary))

print(f'vocabulary size before removing duplicates: {len(vocabulary)}')
print(f'vocabulary size after removing duplicates: {len(tup_voc)}')
print(f'sample vocabulary: {tup_voc[:10]}')
print(f'max length is {max_len}')

vectorize_layer = tf.keras.layers.TextVectorization(
 output_mode='int',
 vocabulary=tup_voc,
 output_sequence_length=max_len)

print(f'vocabulary size is {vectorize_layer.vocabulary_size()}')

embedding_dim = 300
hits = 0
misses = 0

embedding_matrix = np.zeros((vectorize_layer.vocabulary_size(), embedding_dim))
for i, word in enumerate(vectorize_layer.get_vocabulary()):
    try:
        # get_vector raises KeyError when the word is not in the pretrained vectors
        embedding_matrix[i] = w2v_model.get_vector(word)
        hits += 1
    except KeyError:
        # Words not found in embedding index will be all-zeros.
        # This includes the representation for "padding" and "OOV"
        misses += 1
        if misses < 7:  # Show 6 words as example
            print(word)

print("Converted %d words (%d misses)" % (hits, misses))

embedding_layer = tf.keras.layers.Embedding(
    vectorize_layer.vocabulary_size(),
    embedding_dim,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=False
)

model = tf.keras.models.Sequential()
model.add(tf.keras.Input(shape=(1,), dtype=tf.string))
model.add(vectorize_layer)
model.add(embedding_layer)
model.add(tf.keras.layers.GRU(50))
model.add(tf.keras.layers.Dense(4, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['acc'])
model.summary()

model.fit(X_train, y_train, epochs=100, validation_data=(X_valid, y_valid))

model.evaluate(X_test, y_test)

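Answer explanation

Loading the pretrained word vectors

This is the same as in exercise 60, "NLP 100 Exercise (2020) - 60: Loading and displaying word vectors". Please see the explanation there.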

w2v_model = KeyedVectors.load_word2vec_format(BASE_PATH+'07.WordVector/input/GoogleNews-vectors-negative300.bin.gz', binary=True)

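Building the Embedding matrix

I wrote this with reference to [Keras official] Using pre-trained word embeddings (considerably simplified). Using the TextVectorization layer created earlier, the code fills embedding_matrix with the vector for each vocabulary word. The Keras example adds 2 to the Embedding layer's input_dim parameter, but that is unnecessary here because vectorize_layer.vocabulary_size() already includes the mask (padding) token and the unknown-word token.
The vectors are retrieved with Gensim's get_vector function.
Gensim's get_keras_embedding function looks like a much simpler route, but I could not use it this time because of the data size: it would build an Embedding from the full pretrained vocabulary (about 3 million words and phrases, trained on roughly 100 billion words) rather than only the words in this task's vocabulary.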
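
To give a rough sense of the size difference, the sketch below estimates the memory needed for an embedding matrix over the full pretrained vocabulary versus one restricted to this task's vocabulary. The row count of 3,000,000 is the published size of the GoogleNews-vectors-negative300 model; the task vocabulary size of 7,804 is the sum of hits and misses reported in this article. The function and variable names are illustrative assumptions.

```python
def matrix_size_bytes(rows: int, dim: int, bytes_per_value: int) -> int:
    """Memory needed for a dense rows x dim matrix."""
    return rows * dim * bytes_per_value

dim = 300
full_vocab_rows = 3_000_000  # entries in the pretrained Google News vectors
task_vocab_rows = 7_664 + 140  # hits + misses reported in this article

full_bytes = matrix_size_bytes(full_vocab_rows, dim, 4)  # stored as float32
task_bytes = matrix_size_bytes(task_vocab_rows, dim, 8)  # np.zeros defaults to float64

print(f"full pretrained vocabulary: {full_bytes / 1e9:.1f} GB")  # 3.6 GB
print(f"task vocabulary only:       {task_bytes / 1e6:.1f} MB")  # 18.7 MB
```

Copying only the rows that the TextVectorization vocabulary needs keeps the Embedding layer small enough to handle comfortably on Colaboratory.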

%%time
embedding_dim = 300
hits = 0
misses = 0

embedding_matrix = np.zeros((vectorize_layer.vocabulary_size(), embedding_dim))
for i, word in enumerate(vectorize_layer.get_vocabulary()):
    try:
        # get_vector raises KeyError when the word is not in the pretrained vectors
        embedding_matrix[i] = w2v_model.get_vector(word)
        hits += 1
    except KeyError:
        # Words not found in embedding index will be all-zeros.
        # This includes the representation for "padding" and "OOV"
        misses += 1
        if misses < 7:  # Show 6 words as example
            print(word)

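As a sample, the code prints six words that were not in the pretrained word vectors (including the mask token, which prints as an empty line, and the unknown-word token [UNK]). The rest are mostly proper nouns. About 98% of the vocabulary is covered.
With just under 8,000 words, the loop finishes in under one second.

Result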


[UNK]
celgene
grey
heartbleed
H&M
Converted 7664 words (140 misses)
CPU times: user 60 ms, sys: 15.1 ms, total: 75 ms
Wall time: 74.1 ms
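
The coverage figure can be reproduced from the hit and miss counts in the output above. This is a small worked calculation; the variable names are mine.

```python
hits = 7664
misses = 140  # includes the padding token and [UNK]

vocab_size = hits + misses
coverage = hits / vocab_size
print(f"vocabulary size: {vocab_size}")  # 7804
print(f"coverage: {coverage:.1%}")       # 98.2%

# Excluding the two special tokens, which can never match a pretrained word
real_word_coverage = hits / (vocab_size - 2)
print(f"coverage over real words: {real_word_coverage:.1%}")  # 98.2%
```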

Creating the Embedding layer

Create the Embedding layer, setting trainable to False so that the embeddings are not trained.

embedding_layer = tf.keras.layers.Embedding(
    vectorize_layer.vocabulary_size(),
    embedding_dim,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=False
)
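
To see how much freezing the embedding shrinks the set of trained parameters, the sketch below counts parameters by hand for the model in this article (Embedding, GRU(50), Dense(4)). It assumes the default Keras GRU configuration in TensorFlow 2 (reset_after=True, which gives each gate two bias vectors); the helper functions are hypothetical, and model.summary() remains the authoritative source for the actual counts.

```python
def gru_param_count(input_dim: int, units: int) -> int:
    """Parameters of a GRU layer, assuming reset_after=True (TF2 default)."""
    # 3 gates, each with an input kernel, a recurrent kernel, and two bias vectors
    return 3 * (input_dim * units + units * units + 2 * units)

def dense_param_count(input_dim: int, units: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return input_dim * units + units

vocab_size = 7664 + 140  # hits + misses reported in this article
embedding_dim = 300

embedding_params = vocab_size * embedding_dim  # frozen when trainable=False
trainable_params = gru_param_count(embedding_dim, 50) + dense_param_count(50, 4)

print(f"frozen embedding parameters: {embedding_params:,}")  # 2,341,200
print(f"trainable parameters:        {trainable_params:,}")  # 53,004
```

Under these assumptions the frozen embedding holds the large majority of the model's parameters, so only a small fraction is updated during training.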

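As an experiment, I set trainable to True. Only the training results improved; the validation and test results got worse, which indicates overfitting. Ideally only the words missing from the pretrained word vectors would be trained, although I doubt that is possible.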

Dataset	Loss (embedding frozen)	Loss (embedding trained)	Accuracy (embedding frozen)	Accuracy (embedding trained)
Training	0.384	0.240 (-0.144)	86.7%	92.3% (+5.6%)
Validation	0.470	0.582 (+0.112)	83.8%	80.8% (-3.0%)
Test	0.446	0.525 (+0.079)	85.1%	82.3% (-2.8%)
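
On the idea of training only the words that were missing from the pretrained vectors: conceptually, one option is to mask the update so that only those rows of the embedding matrix change. The sketch below illustrates the idea with plain Python lists standing in for tensors. It was not tried in this article, masked_sgd_step is a hypothetical helper, and how best to wire such a mask into a Keras training loop is left open.

```python
def masked_sgd_step(embeddings, grads, trainable_rows, lr=0.1):
    """Apply one SGD update, but only to the rows listed in trainable_rows."""
    updated = []
    for i, (row, grad) in enumerate(zip(embeddings, grads)):
        if i in trainable_rows:
            updated.append([w - lr * g for w, g in zip(row, grad)])
        else:
            updated.append(list(row))  # frozen row: copied unchanged
    return updated

# Toy matrix: row 0 = padding, row 1 = word with a pretrained vector,
# row 2 = word missing from the pretrained vectors (initialized to zeros).
embeddings = [[0.0, 0.0], [0.5, -0.5], [0.0, 0.0]]
grads = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
missing_rows = {2}  # only the missing word is allowed to learn

updated = masked_sgd_step(embeddings, grads, missing_rows)
print(updated)  # [[0.0, 0.0], [0.5, -0.5], [-0.1, -0.1]]
```

Only the row for the missing word moves; the padding row and the pretrained row keep their original values.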
