Keras text preprocessing: the Tokenizer

Posted at 2019-08-17

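A memo on vectorizing text with the Keras Tokenizer.

Fitting a Tokenizer on the texts with fit_on_texts builds the vocabulary, and texts_to_sequences then converts each text into a vector of word indices (starting from 1). By default the texts are lowercased, punctuation is stripped, and more frequent words get smaller indices.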

from keras.preprocessing import text

texts = ["I have a pen.", "I have an apple", "You have pen and apple."]
tokenizer = text.Tokenizer()
# Build the vocabulary (word -> integer index) from the texts
tokenizer.fit_on_texts(texts)
# Convert each text into a list of word indices
list_tokenized = tokenizer.texts_to_sequences(texts)
print(list_tokenized)

# Output
[[2, 1, 5, 3], [2, 1, 6, 4], [7, 1, 3, 8, 4]]

Reference: https://keras.io/ja/preprocessing/text/
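To make the numbering above easier to follow, here is a minimal pure-Python sketch of how such indices can be assigned. It is not the Keras implementation: `tokenize`, `build_word_index`, and `to_sequences` are hypothetical helper names, and the sketch assumes lowercasing, stripping all ASCII punctuation (a simplification of the Tokenizer's default filter), and ranking words by frequency with ties kept in first-seen order.

```python
import string
from collections import Counter

# Assumption: treat every ASCII punctuation character as a separator.
_PUNCT_TO_SPACE = str.maketrans({c: " " for c in string.punctuation})


def tokenize(sentence):
    """Lowercase, replace punctuation with spaces, split on whitespace."""
    return sentence.lower().translate(_PUNCT_TO_SPACE).split()


def build_word_index(texts):
    """Rank words by frequency (most frequent first), indices start at 1."""
    counts = Counter()
    for sentence in texts:
        counts.update(tokenize(sentence))
    # sorted() is stable, so words with equal counts keep first-seen order.
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {word: i for i, (word, _) in enumerate(ranked, start=1)}


def to_sequences(texts, word_index):
    """Map each text to a list of indices, skipping unknown words."""
    return [
        [word_index[w] for w in tokenize(sentence) if w in word_index]
        for sentence in texts
    ]


texts = ["I have a pen.", "I have an apple", "You have pen and apple."]
word_index = build_word_index(texts)
sequences = to_sequences(texts, word_index)
print(word_index)
print(sequences)
```

"have" appears three times and therefore gets index 1; "i", "pen", and "apple" appear twice each and get 2 to 4 in first-seen order; the remaining words follow. Index 0 is never assigned to a word, which leaves it free for padding.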

However, the sentences still differ in length, which makes them awkward to handle, so they are usually converted to a fixed length.

from keras.preprocessing import sequence

# Pad every sequence to length 10; by default zeros are added at the front
list_sequence = sequence.pad_sequences(list_tokenized, maxlen=10)
print(list_sequence)

# Output
[[0 0 0 0 0 0 2 1 5 3]
 [0 0 0 0 0 0 2 1 6 4]
 [0 0 0 0 0 7 1 3 8 4]]
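The padding step can be sketched in plain Python as well. This is only an illustration of the default behavior (zeros added at the front, and sequences longer than maxlen cut from the front), not the Keras code itself; `pad_pre` is a hypothetical name, it returns plain lists rather than a NumPy array, and it assumes maxlen is at least 1.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence to maxlen; keep only the last maxlen items."""
    padded = []
    for seq in sequences:
        tail = list(seq)[-maxlen:]  # drop items from the front if too long
        padded.append([value] * (maxlen - len(tail)) + tail)
    return padded


list_tokenized = [[2, 1, 5, 3], [2, 1, 6, 4], [7, 1, 3, 8, 4]]
padded_10 = pad_pre(list_tokenized, maxlen=10)
padded_3 = pad_pre(list_tokenized, maxlen=3)
print(padded_10)
print(padded_3)
```

With maxlen=3 every sentence is longer than the target, so only the last three indices of each survive. In Keras itself, the padding and truncating arguments of pad_sequences switch either side from 'pre' to 'post'.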

That's all.
