Text embeddings with a Japanese pretrained BERT model #Python

I tried out the procedure introduced in the following article:
https://dev.classmethod.jp/machine-learning/bert-text-embedding/

Environment:
* Python: 3.7.2
* macOS: Mojave

Getting the pretrained BERT model

Download the publicly available Japanese pretrained BERT model:
http://nlp.ist.i.kyoto-u.ac.jp/index.php?BERT日本語Pretrainedモデル#k1aa6ee3

Getting the BERT code

git clone https://github.com/google-research/bert

To keep things simple, I ran everything from inside this repository.

cd bert
pyenv local 3.7.2
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Getting JUMAN++ and pyknp

JUMAN++ and pyknp are required for tokenization, so install both.

ref. https://dev.classmethod.jp/server-side/python/pyknpjumann-tutorial/

Installing JUMAN++

cd ~/Downloads
wget http://lotus.kuee.kyoto-u.ac.jp/nl-resource/jumanpp/jumanpp-1.02.tar.xz
tar xJvf jumanpp-1.02.tar.xz
cd jumanpp-1.02
./configure

Where I got stuck

checking host system type... x86_64-apple-darwin18.7.0
checking for boostlib >= 1.57... configure: We could not detect the boost libraries (version 1.57 or higher). If you have a staged boost library (still not installed) please specify $BOOST_ROOT in your environment and do not give a PATH to --with-boost option.  If you are sure you have boost installed, then check your version number looking in <boost/version.hpp>. See http://randspringer.de/boost for more documentation.
configure: error: "Error: cannot find available Boost library."

Boost is missing.

So the next step is to install Boost.

brew install boost

That is all it takes. Easy.

Resuming the installation

./configure
make -j 2 
sudo make install

This time it goes through.
Check that it works.

$ jumanpp -v
JUMAN++ 1.02 
$ echo "すもももももももものうち" | jumanpp
すもも すもも すもも 名詞 6 普通名詞 1 * 0 * 0 "代表表記:酸桃/すもも 自動獲得:EN_Wiktionary"
@ すもも すもも すもも 名詞 6 普通名詞 1 * 0 * 0 "自動獲得:テキスト"
も も も 助詞 9 副助詞 2 * 0 * 0 NIL
もも もも もも 名詞 6 普通名詞 1 * 0 * 0 "代表表記:股/もも カテゴリ:動物-部位"
@ もも もも もも 名詞 6 普通名詞 1 * 0 * 0 "代表表記:桃/もも 漢字読み:訓 カテゴリ:植物;人工物-食べ物 ドメイン:料理・食事"
も も も 助詞 9 副助詞 2 * 0 * 0 NIL
もも もも もも 名詞 6 普通名詞 1 * 0 * 0 "代表表記:股/もも カテゴリ:動物-部位"
@ もも もも もも 名詞 6 普通名詞 1 * 0 * 0 "代表表記:桃/もも 漢字読み:訓 カテゴリ:植物;人工物-食べ物 ドメイン:料理・食事"
の の の 助詞 9 接続助詞 3 * 0 * 0 NIL
うち うち うち 名詞 6 副詞的名詞 9 * 0 * 0 "代表表記:うち/うち"
EOS

It works.
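To see which tokens the analysis above actually settles on, here is a minimal sketch that pulls the surface form (the first column) out of that output. The helper name `surface_forms` is my own, and I am assuming that lines starting with `@ ` are alternative analyses of the preceding morpheme and that `EOS` closes the sentence, as the output above suggests. A morpheme whose surface form is itself `@` or contains a space would need extra handling.

```python
def surface_forms(jumanpp_output):
    """Return the surface form of each top-ranked morpheme in JUMAN++ output."""
    tokens = []
    for line in jumanpp_output.splitlines():
        line = line.strip()
        # Skip blank lines, the end-of-sentence marker, and "@ " lines
        # (assumed to be alternative candidates for the previous morpheme).
        if not line or line == "EOS" or line.startswith("@ "):
            continue
        # The first space-separated column is the surface form.
        tokens.append(line.split(" ", 1)[0])
    return tokens


# Output copied from the jumanpp run above.
sample = '''すもも すもも すもも 名詞 6 普通名詞 1 * 0 * 0 "代表表記:酸桃/すもも 自動獲得:EN_Wiktionary"
@ すもも すもも すもも 名詞 6 普通名詞 1 * 0 * 0 "自動獲得:テキスト"
も も も 助詞 9 副助詞 2 * 0 * 0 NIL
もも もも もも 名詞 6 普通名詞 1 * 0 * 0 "代表表記:股/もも カテゴリ:動物-部位"
@ もも もも もも 名詞 6 普通名詞 1 * 0 * 0 "代表表記:桃/もも 漢字読み:訓 カテゴリ:植物;人工物-食べ物 ドメイン:料理・食事"
も も も 助詞 9 副助詞 2 * 0 * 0 NIL
もも もも もも 名詞 6 普通名詞 1 * 0 * 0 "代表表記:股/もも カテゴリ:動物-部位"
@ もも もも もも 名詞 6 普通名詞 1 * 0 * 0 "代表表記:桃/もも 漢字読み:訓 カテゴリ:植物;人工物-食べ物 ドメイン:料理・食事"
の の の 助詞 9 接続助詞 3 * 0 * 0 NIL
うち うち うち 名詞 6 副詞的名詞 9 * 0 * 0 "代表表記:うち/うち"
EOS
'''

print(surface_forms(sample))
```

These surface forms correspond to what `mrph.midasi` returns in pyknp, which is what the tokenizer patch later in this post relies on.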

Installing pyknp

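Install pyknp so that JUMAN++ can be used from Python.
Run this inside the working environment for this post (the venv created in the bert repository).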

pip install pyknp

Computing sentence embedding vectors

I followed the steps as written in the article referenced above.

The resulting changes look like this.

diff --git a/requirements.txt b/requirements.txt
index 357b5ea..28dea11 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,2 +1,2 @@
-tensorflow >= 1.11.0   # CPU Version of TensorFlow.
+tensorflow == 1.15.0rc1   # CPU Version of TensorFlow.
 # tensorflow-gpu  >= 1.11.0  # GPU version of TensorFlow.
diff --git a/tokenization.py b/tokenization.py
index 0ee1359..b5d6ab1 100644
--- a/tokenization.py
+++ b/tokenization.py
@@ -164,12 +164,14 @@ class FullTokenizer(object):
   def __init__(self, vocab_file, do_lower_case=True):
     self.vocab = load_vocab(vocab_file)
     self.inv_vocab = {v: k for k, v in self.vocab.items()}
-    self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case)
+    # self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case)
+    self.jumanpp_tokenizer = JumanPPTokenizer()
     self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)

   def tokenize(self, text):
     split_tokens = []
-    for token in self.basic_tokenizer.tokenize(text):
+    # for token in self.basic_tokenizer.tokenize(text):
+    for token in self.jumanpp_tokenizer.tokenize(text):
       for sub_token in self.wordpiece_tokenizer.tokenize(token):
         split_tokens.append(sub_token)

@@ -397,3 +399,26 @@ def _is_punctuation(char):
   if cat.startswith("P"):
     return True
   return False
+
+class JumanPPTokenizer(BasicTokenizer):
+  def __init__(self):
+    """Constructs a BasicTokenizer.
+    """
+    from pyknp import Juman
+
+    self.do_lower_case = False
+    self._jumanpp = Juman()
+
+  def tokenize(self, text):
+    """Tokenizes a piece of text."""
+    text = convert_to_unicode(text.replace(' ', ''))
+    text = self._clean_text(text)
+
+    juman_result = self._jumanpp.analysis(text)
+    split_tokens = []
+    for mrph in juman_result.mrph_list():
+      split_tokens.extend(self._run_split_on_punc(mrph.midasi))
+
+    output_tokens = whitespace_tokenize(" ".join(split_tokens))
+    print(split_tokens)
+    return output_tokens
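After this patch, `FullTokenizer.tokenize` passes each JUMAN++ morpheme to `WordpieceTokenizer`. As a rough picture of that second stage, here is a simplified sketch of greedy longest-match-first subword splitting. It is not the code from `tokenization.py` (for example, it leaves out the maximum-word-length check), and `toy_vocab` is a made-up vocabulary for illustration; the real one comes from the model's `vocab.txt`.

```python
def wordpiece(token, vocab, unk="[UNK]"):
    """Split one token into subwords by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(token):
        end = len(token)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            piece = token[start:end]
            if start > 0:
                # Pieces that continue a word carry the "##" prefix.
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # If any part cannot be matched, the whole token becomes unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


def tokenize(morphemes, vocab):
    """Apply subword splitting to every morpheme, in order."""
    out = []
    for morpheme in morphemes:
        out.extend(wordpiece(morpheme, vocab))
    return out


# Hypothetical toy vocabulary, only for this example.
toy_vocab = {"すもも", "も", "もも", "の", "う", "##ち"}
morphemes = ["すもも", "も", "もも", "の", "うち", "桃"]
print(tokenize(morphemes, toy_vocab))
```

With this toy vocabulary, "うち" is split into "う" and "##ち", while "桃" has no match and falls back to `[UNK]`.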

Where I got stuck

With TensorFlow 2.0.0 the script failed with the error below, so requirements.txt pins a recent 1.x release (1.15.0rc1) instead.

Traceback (most recent call last):
  File "./extract_features.py", line 30, in <module>
    flags = tf.flags
AttributeError: module 'tensorflow' has no attribute 'flags'
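If you would rather fail early with a clearer message than this AttributeError, a small guard like the one below is one option. `require_tf1` is a hypothetical helper of my own, not part of BERT or TensorFlow; it only inspects the major number of a version string.

```python
def require_tf1(version):
    """Raise RuntimeError unless the version string is a 1.x release."""
    # "1.15.0rc1" -> "1", "2.0.0" -> "2"
    major = int(version.split(".", 1)[0])
    if major != 1:
        raise RuntimeError(
            "This BERT code expects TensorFlow 1.x, but found " + version
        )
    return version


# Possible use at the top of a script (assumes TensorFlow is installed):
#   import tensorflow as tf
#   require_tf1(tf.__version__)
print(require_tf1("1.15.0rc1"))
```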

Running it

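Prepare /tmp/input.txt and the following script. BERT_BASE_DIR must point to the directory that holds the downloaded model files (vocab.txt, bert_config.json, bert_model.ckpt).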

run.sh

python ./extract_features.py \
  --input_file=/tmp/input.txt \
  --output_file=/tmp/output.jsonl \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --do_lower_case=False \
  --layers -2
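The input file read by extract_features.py is plain text with one sentence per line. A minimal sketch for generating such a file from Python follows. `write_input_file` is a hypothetical helper, the second sentence is just sample text, and the example writes to a file in the system temp directory rather than overwriting /tmp/input.txt.

```python
import os
import tempfile


def write_input_file(path, sentences):
    """Write one sentence per line, UTF-8 encoded."""
    with open(path, "w", encoding="utf-8") as f:
        for sentence in sentences:
            # A newline inside a sentence would split it into two inputs.
            f.write(sentence.replace("\n", " ").strip() + "\n")


# Example path only; point this at /tmp/input.txt to use it with run.sh.
example_path = os.path.join(tempfile.gettempdir(), "input_example.txt")
write_input_file(example_path, ["すもももももももものうち", "今日は良い天気です"])
```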

Run it.
$ bash run.sh

Convert the output to TSV. The script is unchanged from the one introduced in the article linked above.

jsonl2tsv.py

import json
import numpy as np

# Layer to read the embedding from
TARGET_LAYER = -2

# Token whose vector is used as the sentence embedding
SENTENCE_EMBEDDING_TOKEN = '[CLS]'

with open('/tmp/output.jsonl', 'r') as f:
    output_jsons = f.readlines()

embedding_list = []
for output_json in output_jsons:
    output = json.loads(output_json)
    for feature in output['features']:
        if feature['token'] != SENTENCE_EMBEDDING_TOKEN: continue
        for layer in feature['layers']:
            if layer['index'] != TARGET_LAYER: continue
            embedding_list.append(layer['values'])

np.savetxt('/tmp/output.tsv', embedding_list, delimiter='\t')
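The script above depends on the shape of each line in output.jsonl: a `features` list with one entry per token, each holding a `layers` list of `index` and `values`. The sketch below shows that shape with a tiny made-up record and does the same lookup for a single line. The three-element vector is illustrative only (real vectors are as long as the model's hidden size), and `cls_vector` is a hypothetical helper.

```python
import json


def cls_vector(jsonl_line, layer_index=-2, token="[CLS]"):
    """Return the vector of the given token and layer, or None if absent."""
    record = json.loads(jsonl_line)
    for feature in record["features"]:
        if feature["token"] != token:
            continue
        for layer in feature["layers"]:
            if layer["index"] == layer_index:
                return layer["values"]
    return None


# Made-up record with the fields the conversion script reads.
sample_line = json.dumps({
    "features": [
        {"token": "[CLS]", "layers": [{"index": -2, "values": [0.1, -0.2, 0.3]}]},
        {"token": "すもも", "layers": [{"index": -2, "values": [0.4, 0.5, -0.6]}]},
    ]
})

print(cls_vector(sample_line))
```

Only layers that were requested with `--layers` appear in the file, so asking for a different index returns None here.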

With that, everything worked.

That's all!