Keras text preprocessing: the Tokenizer

A quick memo on vectorizing text with Keras's Tokenizer.

Fitting a Tokenizer on a corpus with fit_on_texts builds a word index (numbered from 1), and texts_to_sequences then converts each text into a sequence of those word indices.

from keras.preprocessing import text

texts = ["I have a pen.", "I have an apple", "You have pen and apple."]
tokenizer = text.Tokenizer()
tokenizer.fit_on_texts(texts)                          # build the word index from the corpus
list_tokenized = tokenizer.texts_to_sequences(texts)   # convert each text to a list of word indices
print(list_tokenized)

# Output
[[2, 1, 5, 3], [2, 1, 6, 4], [7, 1, 3, 8, 4]]

Reference: https://keras.io/ja/preprocessing/text/
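As a small check (not in the original post), the word-to-index mapping learned by fit_on_texts can be inspected through the Tokenizer's word_index attribute. Continuing from the snippet above, it should look consistent with the sequences printed earlier:

# Inspect the learned mapping (indices start at 1; 0 is reserved for padding)
print(tokenizer.word_index)

# Output (consistent with the sequences above)
# {'have': 1, 'i': 2, 'pen': 3, 'apple': 4, 'a': 5, 'an': 6, 'you': 7, 'and': 8}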

However, the resulting sequences have different lengths, which makes them awkward to work with, so they are usually converted to a fixed length.

from keras.preprocessing import sequence
# Pad (or truncate) every sequence to length 10; by default zeros are prepended
list_sequence = sequence.pad_sequences(list_tokenized, maxlen=10)
print(list_sequence)

# Output
[[0 0 0 0 0 0 2 1 5 3]
 [0 0 0 0 0 0 2 1 6 4]
 [0 0 0 0 0 7 1 3 8 4]]
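By default pad_sequences pads (and truncates) at the front of each sequence; if trailing zeros are preferred, the padding argument can be set to 'post'. A minimal sketch, continuing from the snippet above:

# Pad at the end of each sequence instead of the beginning
list_sequence_post = sequence.pad_sequences(list_tokenized, maxlen=10, padding='post')
print(list_sequence_post)

# Expected output
# [[2 1 5 3 0 0 0 0 0 0]
#  [2 1 6 4 0 0 0 0 0 0]
#  [7 1 3 8 4 0 0 0 0 0]]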

That's all.
