
Getting a Japanese pretrained BERT model running in PyTorch

Last updated at 2019-05-02, posted at 2019-04-23

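As the title says, this is a record of what I did to get BERT running in PyTorch.

This is a "tried it out" style post.
You can work all of this out from the documentation on each site, but I could not find an article that pulled it together, so here it is.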

Environment

macOS High Sierra 10.13.3

Downloads

Download the Kyoto University pretrained Japanese BERT model from:
http://nlp.ist.i.kyoto-u.ac.jp/index.php?BERT日本語Pretrainedモデル
Install Juman:
https://qiita.com/riverwell/items/7a85ebf95647eaf18a6c
Download the PyTorch implementation of BERT:
https://github.com/huggingface/pytorch-pretrained-BERT
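
Once the files are downloaded, a quick sanity check can catch a mismatched vocabulary early. The sketch below assumes the vocabulary file is named vocab.txt with one token per line, and that the configured vocabulary size is 32006 (the value used in the code later in this article). The helper names are made up for illustration. It only checks that every line index would be a valid embedding row, not that the file is the right one.

```python
import os

def count_vocab_entries(path):
    """Count the lines in a vocabulary file (one token per line)."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)

def vocab_fits(path, vocab_size):
    """Return True if the file has no more entries than the model's vocabulary size."""
    return count_vocab_entries(path) <= vocab_size

# Only runs when the file is present in the working directory.
if os.path.exists("vocab.txt"):
    print(vocab_fits("vocab.txt", 32006))
```

If this prints False, token ids produced by the tokenizer could fall outside the embedding table.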

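Notes

The details section of
http://nlp.ist.i.kyoto-u.ac.jp/index.php?BERT日本語Pretrainedモデル
already covers this, so it hardly needs repeating, but I overlooked it, so I am noting it here.

Note: Pass the --do_lower_case False option. Without it, dakuten (voiced sound marks) are dropped. Also comment out the following line in tokenization.py. Otherwise every kanji is split off as a single character.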

 # text = self._tokenize_chinese_chars(text)

For where to find tokenization.py, see:
https://qiita.com/t-fuku/items/83c721ed7107ffe5d8ff
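
To see why both settings matter, here is a minimal standalone sketch of the two preprocessing steps involved. It is written from my understanding of BERT's basic tokenizer, not copied from the library. The function names are illustrative, and the kanji check is simplified to the main CJK Unified Ideographs block, which is an assumption made for brevity.

```python
import unicodedata

def strip_accents(text):
    """Mimic the accent stripping that runs when lower-casing is enabled:
    decompose to NFD, then drop combining marks (Unicode category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def space_out_kanji(text):
    """Mimic the step that the note says to comment out: surround each kanji
    with spaces. Simplified: only U+4E00..U+9FFF counts as kanji here."""
    out = []
    for ch in text:
        if 0x4E00 <= ord(ch) <= 0x9FFF:
            out.append(" " + ch + " ")
        else:
            out.append(ch)
    return "".join(out)

# The dakuten on "が" decomposes into a combining mark and is removed.
print(strip_accents("サッカーがしたい"))
# Each kanji becomes its own whitespace-separated piece.
print(space_out_kanji("今日は良い天気").split())
```

In NFD, a voiced kana such as "が" becomes "か" plus a combining voiced sound mark, and that mark is exactly what the accent stripping removes. Spacing out the kanji undoes the word boundaries that Juman produced.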

Code

Let's try the masked language model.

prepare

import torch
from pytorch_pretrained_bert import BertTokenizer, BertForPreTraining, modeling
from pyknp import Jumanpp

# BERT-Base sized configuration; this model uses a vocabulary size of 32006
config = modeling.BertConfig(vocab_size_or_config_json_file=32006,
                             hidden_size=768, num_hidden_layers=12,
                             num_attention_heads=12, intermediate_size=3072)
model = BertForPreTraining(config=config)
# Load the downloaded pretrained weights
model.load_state_dict(torch.load("pytorch_model.bin"))
# do_lower_case=False keeps dakuten intact (see the notes above)
tokenizer = BertTokenizer("vocab.txt", do_lower_case=False)

# Split the sentence into morphemes with Juman++ through pyknp
jm = Jumanpp()
res = jm.analysis("今日は良い天気でサッカーがしたくなりますね。")
hoge = ["[CLS]"] + [i.midasi for i in res.mrph_list()]
# Position 6 is "サッカー" (soccer); replace it with the mask token
hoge[6] = "[MASK]"
tokens = tokenizer.tokenize(" ".join(hoge))
ids = tokenizer.convert_tokens_to_ids(tokens)
print([(i, j) for i, j in enumerate(tokens)])

[(0, '[CLS]'), (1, '今日'), (2, 'は'), (3, '良い'), (4, '天気'), (5, 'で'), (6, '[MASK]'), (7, 'が'), (8, 'し'), (9, 'たく'), (10, 'なり'), (11, 'ます'), (12, 'ね'), (13, '。')]
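
The mask position above is hard-coded as index 6. As a small sketch of a less brittle approach, the word can be looked up by value, and the mask position can be read back from the final token list. The helper name mask_word is made up for this example, and the word list is the Juman output shown above with the original word restored.

```python
def mask_word(words, target, mask_token="[MASK]"):
    """Return a copy of words with the first occurrence of target replaced by the mask token."""
    masked = list(words)
    masked[masked.index(target)] = mask_token
    return masked

words = ["[CLS]", "今日", "は", "良い", "天気", "で", "サッカー", "が",
         "し", "たく", "なり", "ます", "ね", "。"]
masked = mask_word(words, "サッカー")
# Look the position up instead of assuming it.
mask_pos = masked.index("[MASK]")
print(mask_pos)
```

Applying the same lookup to the list returned by tokenizer.tokenize keeps the index correct even if another word gets split into several subword tokens.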

predict

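# Add a batch dimension: shape (1, sequence_length)
ids = torch.tensor(ids).reshape(1,-1)
model.eval()
with torch.no_grad():
    # First output: masked-LM scores; second: next-sentence scores (unused)
    output, _ = model(ids)
# Ten highest-scoring tokens at position 6, in ascending order (best is last)
print([tokenizer.ids_to_tokens[i.item()] for i in output[0][6].argsort()[-10:]])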

['サーフィン', '買い物', '歌', '旅', '試合', '勉強', '食事', 'ゲーム', '話', '仕事']
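
Because argsort sorts from the lowest score to the highest, the list above is in ascending order and its last entry is the top prediction. The following is a minimal sketch of producing a best-first ranking with probabilities, using only the standard library. The toy vocabulary and scores are made up for illustration and are not model output.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(scores, k):
    """Return (index, probability) pairs for the k highest scores, best first."""
    probs = softmax(scores)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, probs[i]) for i in order[:k]]

# Hypothetical scores for a four-word vocabulary.
toy_vocab = ["仕事", "話", "ゲーム", "食事"]
toy_scores = [3.2, 2.9, 2.5, 1.0]
for idx, p in top_k(toy_scores, 3):
    print(toy_vocab[idx], round(p, 3))
```

With the real model, the same idea applies to the score vector at the masked position, where the number of scores equals the vocabulary size.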

It looks like it is working well.
