
ABEJA, Inc.

Personal memo: reviewing the BERT setup procedure

Posted at 2019-07-14

Overview

This memo reviews the steps needed to install BERT, set up a pretrained model, and get it running for the first time.
It is meant as a reference for the next time the same work is needed.

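Goals

Install BERT
- Clone Google's BERT repository and customize it to run on Japanese text.
Download a pretrained model and make it usable
- Use the pretrained BERT model published by the Kurohashi-Kawahara Laboratory at Kyoto University.
Make it work with Japanese
- Run Japanese sentences through morphological analysis with JUMAN++.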

Environment used to verify the procedure

The procedure was verified on Ubuntu 16.04.3 LTS.

Reference article

The following article was used as a reference when writing up these steps.
Many thanks to its author.
https://dev.classmethod.jp/machine-learning/bert-text-embedding/

Downloading the pretrained Japanese model

The file is large, so start the download first.
BERT Japanese Pretrained Model @ Kurohashi-Kawahara Laboratory

$ wget "http://nlp.ist.i.kyoto-u.ac.jp/DLcounter/lime.cgi?down=http://nlp.ist.i.kyoto-u.ac.jp/nl-resource/JapaneseBertPretrainedModel/Japanese_L-12_H-768_A-12_E-30_BPE.zip&name=Japanese_L-12_H-768_A-12_E-30_BPE.zip"

Unpacking the downloaded ZIP file produces the following files.

-rw-r--r-- 1 *******       4429  3月 30 19:53 README.txt
-rw-r--r-- 1 *******        313  3月 29 14:48 bert_config.json
-rw-r--r-- 1 ******* 1334971496  3月 29 14:50 bert_model.ckpt.data-00000-of-00001
-rw-r--r-- 1 *******      23350  3月 29 14:50 bert_model.ckpt.index
-rw-r--r-- 1 *******    3916436  3月 29 14:50 bert_model.ckpt.meta
-rw-r--r-- 1 *******  445037530  3月 29 14:48 pytorch_model.bin
-rw-r--r-- 1 *******     283356  3月 29 14:48 vocab.txt
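
Before moving on, it may be worth confirming that the archive was unpacked completely. The following is a minimal sketch, assuming the file names shown in the listing above; the function name `missing_model_files` is hypothetical and not part of BERT.

```python
import os

# File names taken from the listing above; the TensorFlow scripts need these.
EXPECTED_FILES = [
    "vocab.txt",
    "bert_config.json",
    "bert_model.ckpt.index",
    "bert_model.ckpt.meta",
    "bert_model.ckpt.data-00000-of-00001",
]


def missing_model_files(model_dir):
    """Return the expected file names that are not present in model_dir."""
    return [
        name
        for name in EXPECTED_FILES
        if not os.path.isfile(os.path.join(model_dir, name))
    ]
```

An empty list means every expected file was found in the directory.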

vocab.txt defines the vocabulary the model uses.
The file begins with the reserved special tokens.

    [PAD]
    [UNK]
    [CLS]
    [SEP]
    [MASK]
    の
    、
    。
    に
    は
    を
    が
(... omitted ...)
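
Token IDs correspond to zero-based line positions in vocab.txt, so `[PAD]` is 0 and `[CLS]` is 2. The following minimal sketch builds that mapping, assuming one token per line; `load_vocab_sketch` is a hypothetical name used only for illustration.

```python
import collections


def load_vocab_sketch(vocab_path):
    """Map each token in a vocab file to its zero-based line index."""
    vocab = collections.OrderedDict()
    with open(vocab_path, encoding="utf-8") as f:
        for index, line in enumerate(f):
            token = line.strip()
            vocab[token] = index
    return vocab
```

With the excerpt above, looking up `[SEP]` in the returned mapping would give 3.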

Installing JUMAN++

The pretrained BERT model used here was built on the assumption that JUMAN++ is used.
Note that this is not JUMAN. (What happens if JUMAN is installed by mistake is described later.)

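Preparation before installing JUMAN++

According to the official manual, the following packages should be installed.

Required tools and libraries
- gcc (4.9 or later)
- Boost C++ Libraries (1.57 or later)
Recommended libraries (installing them speeds up execution)
- gperftool 2
- libunwind 3 (required when running gperftool in a 64-bit environment)

So, install the required packages with the following command.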

$ sudo  apt-get install libboost-all-dev google-perftools libgoogle-perftools-dev

Note: libunwind is skipped in the command above because apt search showed that it was already installed.

# Excerpt from the apt search results

libboost-all-dev/xenial 1.58.0.1ubuntu1 amd64
  Boost C++ Libraries development files (ALL) (default version)


google-perftools/xenial-updates,xenial-updates 2.4-0ubuntu5.16.04.1 all
  command line utilities to analyze the performance of C++ programs

libgoogle-perftools-dev/xenial-updates 2.4-0ubuntu5.16.04.1 amd64
  libraries for CPU and heap analysis, plus an efficient thread-caching malloc

libgoogle-perftools4/xenial-updates 2.4-0ubuntu5.16.04.1 amd64
  libraries for CPU and heap analysis, plus an efficient thread-caching malloc

libtcmalloc-minimal4/xenial-updates 2.4-0ubuntu5.16.04.1 amd64
  efficient thread-caching malloc
  

libunwind8/xenial,now 1.1-4.1 amd64 [installed,automatic]
  library to determine the call-chain of a program - runtime

libunwind8-dbg/xenial 1.1-4.1 amd64
  library to determine the call-chain of a program - runtime

libunwind8-dev/xenial 1.1-4.1 amd64
  library to determine the call-chain of a program - development

Downloading and building the JUMAN++ source code

With the required packages installed, download the JUMAN++ source code and build it.

$ wget "http://lotus.kuee.kyoto-u.ac.jp/nl-resource/jumanpp/jumanpp-1.02.tar.xz"
$ tar xJvf jumanpp-1.02.tar.xz
$ cd jumanpp-1.02
$ ./configure
  (... omitted ...)
    checking for strstr... yes
    checking for strtol... yes
    checking for strtoul... yes
    checking that generated files are newer than configure... done
    configure: creating ./config.status
    config.status: creating Makefile
    config.status: creating src/Makefile
    config.status: creating src/cdb/Makefile
    config.status: creating src/config.h
    config.status: executing depfiles commands
  
$ make
$ sudo make install

The commands above installed JUMAN++ without problems.
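
A quick way to confirm the installation from Python is to check that the executable is on the PATH. This is only a sketch: it assumes the installed command is named `jumanpp`, and the helper name `find_command` is hypothetical.

```python
import shutil


def find_command(name="jumanpp"):
    """Return the full path of the command if it is on PATH, else None."""
    return shutil.which(name)
```

If `find_command()` returns None, the install location is probably not on the PATH of the current shell.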

Installing the Python binding module (`pyknp`)

pyknp is the Python binding module for JUMAN++ published by the Kurohashi-Kawahara Laboratory.

It was installed with pip as follows.

$ sudo pip install pyknp
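
To check that pyknp can reach JUMAN++, something like the following sketch could be used. It relies on the same pyknp calls as the tokenizer in this memo (`Juman()`, `analysis()`, `mrph_list()`, `midasi`); the function name `surface_forms` and the optional `analyzer` parameter are hypothetical additions made so the logic can be exercised without JUMAN++ installed.

```python
def surface_forms(text, analyzer=None):
    """Return the surface form (midasi) of each morpheme in text."""
    if analyzer is None:
        # Imported lazily so this module loads even where pyknp is absent.
        from pyknp import Juman
        analyzer = Juman()
    result = analyzer.analysis(text)
    return [mrph.midasi for mrph in result.mrph_list()]


# Example call (requires JUMAN++ and pyknp to be installed):
#   surface_forms("本日の天気は曇りである。")
```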

Cloning and customizing the BERT code

Here the BERT code is cloned from Google's repository and then customized.

Cloning the BERT code

First, clone the BERT code as follows.

$ git clone https://github.com/google-research/bert

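Customization (JUMAN++ support)

Out of the box the code does not support JUMAN++, so part of it is modified.

(1) Modifying the FullTokenizer class

This class splits input text into tokens (= morphemes) end to end.
It is modified to call JumanPPTokenizer, defined later, to do the tokenization.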

tokenization.py


class FullTokenizer(object):
  """Runs end-to-end tokenization."""

  def __init__(self, vocab_file, do_lower_case=True):
    self.vocab = load_vocab(vocab_file)
    self.inv_vocab = {v: k for k, v in self.vocab.items()}

    # For Japanese support with JUMAN++, BasicTokenizer is not used.
    # self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case)
    self.jumanpp_tokenizer = JumanPPTokenizer()
    self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)

  def tokenize(self, text):
    split_tokens = []
    # For Japanese support with JUMAN++.
    # Ideally this would be switchable through a setting.
    #for token in self.basic_tokenizer.tokenize(text):
    for token in self.jumanpp_tokenizer.tokenize(text):
      for sub_token in self.wordpiece_tokenizer.tokenize(token):
        split_tokens.append(sub_token)

    return split_tokens

  def convert_tokens_to_ids(self, tokens):
    return convert_by_vocab(self.vocab, tokens)

  def convert_ids_to_tokens(self, ids):
    return convert_by_vocab(self.inv_vocab, ids)
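
In `tokenize`, each token from the morphological analyzer is split further by WordpieceTokenizer, which matches the longest possible pieces against the vocabulary and marks non-initial pieces with `##`. The following is a simplified re-implementation sketch of that greedy longest-match idea, not the repository code; the function name `wordpiece_sketch` and the toy vocabulary in the usage note are hypothetical.

```python
def wordpiece_sketch(token, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one token into word pieces."""
    pieces = []
    start = 0
    while start < len(token):
        end = len(token)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            piece = token[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece fits, so the whole token becomes unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces
```

With a toy vocabulary containing `本`, `##日`, and `天気`, the token `本日` is split into `本` and `##日`, while `天気` stays whole.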

(2) Adding JumanPPTokenizer

Add a class that tokenizes text with JUMAN++.
Here the following code was appended to the end of tokenization.py.

tokenization.py


class JumanPPTokenizer(BasicTokenizer):
  def __init__(self):
    """
        Builds a Japanese-only tokenizer.
        Uses JUMAN++.
    """
    # Imported here so pyknp is only required when this tokenizer is used.
    from pyknp import Juman

    self.do_lower_case = False
    self._jumanpp = Juman()

  def tokenize(self, text):
    """Tokenizes a piece of text."""
    # Half-width spaces are removed before the text is passed to JUMAN++.
    text = convert_to_unicode(text.replace(' ', ''))
    text = self._clean_text(text)

    juman_result = self._jumanpp.analysis(text)
    split_tokens = []
    for mrph in juman_result.mrph_list():
      # mrph.midasi is the surface form of each morpheme.
      split_tokens.extend(self._run_split_on_punc(mrph.midasi))

    output_tokens = whitespace_tokenize(" ".join(split_tokens))
    # Debug output: prints the morphemes before WordPiece splitting.
    print(split_tokens)
    return output_tokens

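Trial run

The path of the directory holding the pretrained model was set in the environment variable BERT_BASE_DIR, and the feature extraction script was run.
The results are written in JSON format to the path given by the --output_file argument.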

export BERT_BASE_DIR="/some_path/bert/Japanese_L-12_H-768_A-12_E-30_BPE"
python ./extract_features.py \
  --input_file=/tmp/input.txt \
  --output_file=/tmp/output.jsonl \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --do_lower_case=False \
  --layers -2  
  
['本日', 'の', '天気', 'は', '曇り', 'である', '。', '気温', 'も', '高', 'すぎ', 'ず', '、', '過ごし', 'やすい', '。']
INFO:tensorflow:*** Example ***
INFO:tensorflow:unique_id: 0
INFO:tensorflow:tokens: [CLS] 本 ##日 の 天気 は 曇 ##り である 。 気温 も 高 すぎ ず 、 過ごし やすい 。 [SEP]
INFO:tensorflow:input_ids: 2 97 2581 5 9292 9 27195 445 32 7 4835 23 235 6273 109 6 12675 2273 7 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:input_type_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
['昨日', 'は', '突然', '大雨', 'が', '降って', 'きて', '大変だった', '。', '気温', 'が', '低く', '、', '体調', 'に', '気', 'を', 'つけ', 'なければ', 'なら', 'なかった', '。']
INFO:tensorflow:*** Example ***  
  
  (... omitted ...)
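
The output file can then be read back one line at a time. The sketch below assumes each line is a JSON object with a `features` list whose entries have a `token` and a `layers` list of `index` and `values`; that layout is my understanding of what extract_features.py writes and should be checked against the actual file. Averaging the token vectors is just one simple way to get a sentence vector, not something the script does, and `sentence_vector` is a hypothetical name.

```python
import json


def sentence_vector(json_line, layer_index=-2):
    """Average the per-token vectors of one output line for a given layer."""
    record = json.loads(json_line)
    vectors = []
    for feature in record["features"]:
        # Skip the special tokens that frame every sequence.
        if feature["token"] in ("[CLS]", "[SEP]"):
            continue
        for layer in feature["layers"]:
            if layer["index"] == layer_index:
                vectors.append(layer["values"])
    if not vectors:
        return []
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
```

Calling it on each line of the file written to --output_file yields one vector per input sentence.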

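Caution: put `EOS` at the end of the input file!

If the input file (given by the --input_file argument) does not end with a line containing only the string EOS, pyknp hangs, so take care.

As shown below, write a final line containing only the string EOS.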

$ cat /tmp/input.txt
本日の天気は曇りである。気温も高すぎず、過ごしやすい。
昨日は突然大雨が降ってきて大変だった。気温が低く、体調に気をつけなければならなかった。
小腹がすいたとき、アーモンドや胡桃をつまんで空腹を満たしている。
(... omitted ...)
来週の土日は、晴天に恵まれると良いな。農作業日和だとうれしい。
20時を回ってきた。そろそろ夕食の時間だな。良い感じにお腹も空いてきた。
EOS
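
Appending the marker can be automated so it is not forgotten. The following is a minimal sketch, assuming a UTF-8 text file with one sentence per line; the helper name `ensure_eos_line` is hypothetical.

```python
from pathlib import Path


def ensure_eos_line(path, marker="EOS"):
    """Make sure the file's last line is the marker; return the final lines."""
    file_path = Path(path)
    lines = file_path.read_text(encoding="utf-8").splitlines()
    # Drop trailing blank lines so the marker really is the last line.
    while lines and not lines[-1].strip():
        lines.pop()
    if not lines or lines[-1].strip() != marker:
        lines.append(marker)
    file_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines
```

Running it on a file that already ends with EOS leaves the content unchanged apart from trimming trailing blank lines.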
