Creating Japanese sentence vectors with BertModel from huggingface/transformers

Posted at 2020-03-07. Last updated at 2020-03-07.

This article shows how to create Japanese sentence vectors from a pre-trained BERT model.

Environment

Python (3.6.9)
PyTorch (1.3.0)
transformers (2.5.1)

Steps

1. Load the model


import torch
from transformers.tokenization_bert_japanese import BertJapaneseTokenizer
from transformers import BertModel 

# Japanese tokenizer
tokenizer = BertJapaneseTokenizer.from_pretrained('bert-base-japanese')
# Pre-trained BERT
model = BertModel.from_pretrained('bert-base-japanese')

2. Prepare the input data

For this example, we prepare a list containing three sentences.

input_batch = \
    ["すもももももももものうち", 
    "隣の客はよく柿食う客だ",
    "東京特許許可局許可局長"]

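3. Preprocessing (token ID conversion, padding, adding special tokens)

batch_encode_plus takes a list of sentences and preprocesses it into a mini-batch ready for model input.
pad_to_max_length is the padding option; here every sequence ends up padded to the length of the longest one in the batch.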

encoded_data = tokenizer.batch_encode_plus(
    input_batch, pad_to_max_length=True, add_special_tokens=True)

Result
Note that the return value is a dictionary, not a tensor.
input_ids holds the token IDs of each sentence.

{'input_ids': [[2, 340, 28480, 28480, 28, 18534, 28, 18534, 5, 859, 3, 0],
  [2, 2107, 5, 1466, 9, 1755, 14983, 761, 28489, 1466, 75, 3],
  [2, 391, 6192, 3591, 600, 3591, 5232, 3, 0, 0, 0, 0]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
  [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
  [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]]}
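
To make the padding and attention mask in the result above more concrete, here is a minimal sketch that rebuilds them by hand in plain Python. The helper encode_batch is hypothetical and is not part of transformers; the special-token IDs (2 for [CLS], 3 for [SEP], 0 for [PAD]) are assumed from the output shown above, and the raw token IDs are copied from that same output.

```python
# Special-token IDs as they appear in the output above (assumed from that output)
CLS_ID, SEP_ID, PAD_ID = 2, 3, 0


def encode_batch(token_id_lists):
    """Hypothetical helper: add [CLS]/[SEP], pad to the longest sequence, build masks."""
    with_special = [[CLS_ID] + ids + [SEP_ID] for ids in token_id_lists]
    max_len = max(len(ids) for ids in with_special)
    input_ids, attention_mask = [], []
    for ids in with_special:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [PAD_ID] * n_pad)
        # 1 marks a real token, 0 marks a padding position
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Token IDs of the three sentences, copied from the output above without special tokens
raw_ids = [
    [340, 28480, 28480, 28, 18534, 28, 18534, 5, 859],
    [2107, 5, 1466, 9, 1755, 14983, 761, 28489, 1466, 75],
    [391, 6192, 3591, 600, 3591, 5232],
]
manual = encode_batch(raw_ids)
print(manual["input_ids"][2])
# [2, 391, 6192, 3591, 600, 3591, 5232, 3, 0, 0, 0, 0]
print(manual["attention_mask"][2])
# [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
```

The shortest sentence receives four [PAD] tokens, and its attention mask is 0 at exactly those positions.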

For reference, you can check how the first sentence was tokenized as follows.

input_ids = torch.tensor(encoded_data["input_ids"])
tokenizer.convert_ids_to_tokens(input_ids[0].tolist())

Result

The special tokens have been added correctly.

['[CLS]', 'す', '##も', '##も', 'も', 'もも', 'も', 'もも', 'の', 'うち', '[SEP]', '[PAD]']
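
In the token list above, a leading ## marks a WordPiece piece that continues the preceding token. The following sketch merges such pieces back into words and drops the special tokens. The helper merge_wordpieces is hypothetical, written only for illustration, and the token list is copied from the result above.

```python
def merge_wordpieces(tokens):
    """Hypothetical helper: join '##' continuation pieces onto the preceding token."""
    special = {"[CLS]", "[SEP]", "[PAD]"}
    words = []
    for tok in tokens:
        if tok in special:
            continue  # special tokens carry no surface text
        if tok.startswith("##") and words:
            words[-1] += tok[2:]  # strip the '##' marker and append to the previous word
        else:
            words.append(tok)
    return words


tokens = ['[CLS]', 'す', '##も', '##も', 'も', 'もも', 'も', 'もも', 'の', 'うち', '[SEP]', '[PAD]']
words = merge_wordpieces(tokens)
print(words)
# ['すもも', 'も', 'もも', 'も', 'もも', 'の', 'うち']
```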

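4. Sentence vectorization with BERT

We feed the input_ids tensor created above into BERT.

According to the official documentation, the model returns a tuple.
Its first element is the hidden states of the last layer, so we extract it with outputs[0].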

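# Pass the attention mask as well, so that [PAD] positions are ignored by self-attention
attention_mask = torch.tensor(encoded_data["attention_mask"])
with torch.no_grad():  # inference only, so gradients are not needed
    outputs = model(input_ids, attention_mask=attention_mask)
last_hidden_states = outputs[0]
print(last_hidden_states.size())
# torch.Size([3, 12, 768])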

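The output tensor has the shape (mini-batch size, sequence length, hidden dimension).
We want to build each sentence vector from the [CLS] token prepended to the input text, so we extract position 0 of every sequence as follows.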

sentencevec = last_hidden_states[:,0,:]
print(sentencevec.size())
# torch.Size([3, 768])
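
As a dependency-free illustration of what the slice [:,0,:] does, the sketch below uses a small nested list in place of the real tensor. The name toy_hidden and its values are made up for this example; only the indexing pattern corresponds to the code above.

```python
# Toy stand-in for last_hidden_states with shape (batch=2, seq_len=3, hidden=4)
toy_hidden = [
    [[0.1, 0.2, 0.3, 0.4], [1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0]],
    [[0.5, 0.6, 0.7, 0.8], [3.0, 3.0, 3.0, 3.0], [4.0, 4.0, 4.0, 4.0]],
]

# Equivalent of last_hidden_states[:, 0, :]:
# keep only position 0 (the [CLS] slot) of every sequence in the batch
toy_cls = [sequence[0] for sequence in toy_hidden]

print(len(toy_cls), len(toy_cls[0]))
# 2 4
print(toy_cls[1])
# [0.5, 0.6, 0.7, 0.8]
```

The sequence-length axis disappears, leaving one vector per sentence.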

That completes the sentence vectors.
