
I wanted to build a Japanese QA model with ALBERT

Posted at 2021-06-20 (last updated at 2021-06-20)

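Introduction

What I set out to do

The article below (hereafter, the reference article) explains how to build a Japanese QA model with BERT, so I used it as a guide and tried to get the same thing working with ALBERT.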

Steps

・Use an existing Japanese pretrained model
・Use a public dataset in SQuAD 2.0 format
・Fine-tune it as a QA model

Environment

Google Colaboratory

What is ALBERT

A Lite BERT.
An improved version of BERT.
Lightweight, and (reportedly) it also scores better than BERT.
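
As a rough illustration of where part of the "lite" comes from: the ALBERT paper describes factorized embedding parameterization (alongside cross-layer parameter sharing). The sketch below only counts token-embedding parameters under that factorization. The vocabulary size and dimensions are illustrative assumptions, not values read from the model used in this article.

```python
def embedding_params(vocab_size, hidden_size, embedding_size=None):
    """Count token-embedding parameters.

    embedding_size=None: one vocab_size x hidden_size table (BERT-style).
    Otherwise: a vocab_size x embedding_size table plus an
    embedding_size x hidden_size projection (ALBERT-style factorization).
    """
    if embedding_size is None:
        return vocab_size * hidden_size
    return vocab_size * embedding_size + embedding_size * hidden_size


# Illustrative sizes (assumptions, not taken from any checkpoint).
VOCAB, HIDDEN, EMBED = 30000, 768, 128

bert_style = embedding_params(VOCAB, HIDDEN)
albert_style = embedding_params(VOCAB, HIDDEN, EMBED)
print("BERT-style embedding params:  ", bert_style)
print("ALBERT-style embedding params:", albert_style)
```

With these assumed sizes the factorized table is several times smaller. The encoder layers are not counted here.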

Note: there isn't enough margin here to explain BERT itself.

Conclusion

Failure.

At my level, I can't even tell why it didn't work...
If you are reading this, please uncover the truth. (And then quietly let me know.)
That is all I ask.

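What I did

Preparing a pretrained model

I got stuck right here.
For BERT, as the reference article also mentions, there are the well-known Japanese pretrained models from Tohoku University, but there does not appear to be an ALBERT counterpart of them.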
・bert-base-japanese
・bert-base-japanese-whole-word-masking
・bert-base-japanese-char
・bert-base-japanese-char-whole-word-masking

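After some searching, I found an ALBERT Japanese pretrained model that can be downloaded from Hugging Face in the same way as in the reference article, so I used that one. (Thanks to its author.)

Note: the article below also publishes an ALBERT Japanese pretrained model, but loading it with run_squad.py raised an error that I could not resolve, so I gave up on it.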

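Using a public dataset in SQuAD 2.0 format

When it comes to Japanese datasets in SQuAD 2.0 format, the obvious choice is the driving domain QA dataset (DDQA).
As the name says, it covers the driving domain, so a model trained on it is strongest there. Still, perhaps because the dataset is large, I was able to build a model that handled QA in other domains reasonably well (with BERT, that is).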
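
Before fine-tuning, it can help to sanity-check a file in SQuAD 2.0 layout. The sketch below counts answerable and unanswerable questions and flags answers whose `answer_start` does not point at `text` inside the context. The helper names and the sample record are hypothetical, written only for illustration. The field names follow the public SQuAD 2.0 JSON layout, and I have not checked them against the DDQA files themselves.

```python
import json


def load_squad(path):
    """Load a SQuAD-style JSON file (hypothetical helper)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def summarize_squad(dataset):
    """Count answerable/unanswerable questions and misaligned answer spans."""
    stats = {"has_answer": 0, "no_answer": 0, "misaligned": 0}
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                if qa.get("is_impossible", False):
                    stats["no_answer"] += 1
                    continue
                stats["has_answer"] += 1
                for answer in qa["answers"]:
                    start = answer["answer_start"]
                    text = answer["text"]
                    # The character offset must point exactly at the answer text.
                    if context[start:start + len(text)] != text:
                        stats["misaligned"] += 1
    return stats


# Tiny hand-made sample in SQuAD 2.0 layout (not taken from DDQA).
sample = {
    "version": "v2.0",
    "data": [{
        "title": "example",
        "paragraphs": [{
            "context": "りんごは赤くて美味しい。",
            "qas": [
                {"id": "q1", "question": "りんごは何色",
                 "answers": [{"text": "赤", "answer_start": 4}],
                 "is_impossible": False},
                {"id": "q2", "question": "みかんは何色",
                 "answers": [], "is_impossible": True},
            ],
        }],
    }],
}

print(summarize_squad(sample))
```

For a real file, something like `summarize_squad(load_squad("DDQA-1.0/RC-QA/DDQA-1.0_RC-QA_dev.json"))` should work if the file follows this layout.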

Fine-tuning as a QA model

Setup. (I don't fully understand this part.)
The run_squad.py script that fine-tunes on SQuAD 2.0 is a copy with ALBERT settings added, downloaded from my own GitHub repository.

!pip install transformers==2.9.1
!git clone https://github.com/huggingface/transformers
!git clone https://github.com/NVIDIA/apex
!pip install -v --no-cache-dir apex/
!apt-get -y install mecab libmecab-dev mecab-ipadic-utf8 > /dev/null
!pip install mecab-python3 > /dev/null
!git clone https://github.com/yydevelop/albert_qa.git
!ln -s /etc/mecabrc /usr/local/etc/mecabrc

Extract the driving domain dataset

from google.colab import drive
drive.mount('/content/gdrive')
!tar zxvf "/content/gdrive/MyDrive/Colab Notebooks/BERT_QA/DDQA

Fine-tune with run_squad.py, using the pretrained model mentioned above (ALINEAR/albert-japanese-v2)

!python albert_qa/run_squad.py \
  --model_type  "albert-japanese-v2" \
  --model_name_or_path  "ALINEAR/albert-japanese-v2" \
  --train_file DDQA-1.0/RC-QA/DDQA-1.0_RC-QA_train.json \
  --do_train \
  --per_gpu_train_batch_size 12 \
  --predict_file DDQA-1.0/RC-QA/DDQA-1.0_RC-QA_dev.json \
  --learning_rate 3e-5 \
  --num_train_epochs 2.0 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --fp16 \
  --do_eval \
  --save_steps 3000 \
  --version_2_with_negative \
  --output_dir albert_model_output/

Output

No errors, and the fine-tuning itself appears to be running.
Wait patiently for the 2 epochs to finish.

Iteration:   0% 2/1493 [00:00<08:44,  2.84it/s]
Iteration:   0% 3/1493 [00:00<08:19,  2.98it/s]
Iteration:   0% 4/1493 [00:01<08:03,  3.08it/s]
Iteration:   0% 5/1493 [00:01<07:48,  3.18it/s]
Iteration:   0% 6/1493 [00:01<07:39,  3.24it/s]
Iteration:   0% 7/1493 [00:02<07:31,  3.29it/s]
Iteration:   1% 8/1493 [00:02<07:26,  3.33it/s]
Iteration:   1% 9/1493 [00:02<07:21,  3.36it/s]

It ran to the end, and the process itself appears to have completed.

Iteration:  99% 1485/1493 [07:30<00:02,  3.29it/s]
Iteration: 100% 1486/1493 [07:31<00:02,  3.28it/s]
Iteration: 100% 1487/1493 [07:31<00:01,  3.28it/s]
Iteration: 100% 1488/1493 [07:31<00:01,  3.28it/s]
Iteration: 100% 1489/1493 [07:32<00:01,  3.29it/s]
Iteration: 100% 1490/1493 [07:32<00:00,  3.29it/s]
Iteration: 100% 1491/1493 [07:32<00:00,  3.28it/s]
Iteration: 100% 1493/1493 [07:33<00:00,  3.30it/s]
Epoch: 100% 2/2

INFO - __main__ -    global_step = 1494, average loss = 0.24636068274296627

The Results show abnormally low numbers.
Note: with BERT, exact was above 20 at this point. (Whether that counts as high is another question.)

Results: {'exact': 1.253616200578592, 'f1': 1.253616200578592, 'total': 1037, 'HasAns_exact': 1.253616200578592, 'HasAns_f1': 1.253616200578592, 'HasAns_total': 1037, 'best_exact': 1.253616200578592, 'best_exact_thresh': 0.0, 'best_f1': 1.253616200578592, 'best_f1_thresh': 0.0}
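
One detail in the Results line is that exact and f1 are identical. A possible explanation is that SQuAD-style scoring computes F1 over whitespace-separated tokens, so a Japanese answer with no spaces counts as a single token and F1 collapses to exact match. The sketch below is a simplified re-implementation for illustration only. The official evaluation also lowercases and strips punctuation and English articles, and I have not verified how the modified run_squad.py used here scores answers.

```python
import collections


def get_tokens(text):
    # Simplified: split on whitespace only.
    return text.split()


def compute_exact(gold, pred):
    return int(get_tokens(gold) == get_tokens(pred))


def compute_f1(gold, pred):
    gold_toks, pred_toks = get_tokens(gold), get_tokens(pred)
    if not gold_toks or not pred_toks:
        # If either side is empty, F1 is 1 only when both are empty.
        return float(gold_toks == pred_toks)
    common = collections.Counter(gold_toks) & collections.Counter(pred_toks)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_toks)
    recall = num_same / len(gold_toks)
    return 2 * precision * recall / (precision + recall)


# English: a partially correct answer earns partial F1 credit.
print(compute_exact("red apple", "red"), compute_f1("red apple", "red"))

# Japanese without spaces: each string is one token, so a partial
# overlap earns nothing and F1 equals exact match.
print(compute_exact("赤くて美味しい", "赤くて"), compute_f1("赤くて美味しい", "赤くて"))
```

This would only account for exact and f1 being equal. It says nothing about why both are so low.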

Using the fine-tuned model

I only changed the bert parts of the reference article's procedure to albert.

from transformers import AlbertTokenizer, AlbertForQuestionAnswering, AutoTokenizer, AutoConfig,AlbertConfig
import torch
model_directory = "/content/gdrive/MyDrive/Colab Notebooks/output/albert_model_output"
pretrained_model = "ALINEAR/albert-japanese-v2"
 
config = AlbertConfig.from_pretrained(model_directory + "/config.json")
tokenizer_config = AlbertConfig.from_pretrained(model_directory + "/tokenizer_config.json")

model = AlbertForQuestionAnswering.from_pretrained(pretrained_model, config=config)
tokenizer = AutoTokenizer.from_pretrained(pretrained_model, config=tokenizer_config)
model.load_state_dict(torch.load(model_directory + "/pytorch_model.bin", map_location=torch.device('cpu')))

The result is a sad machine that just returns the context with an underscore-like "▁" in front.

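def predict(question, context):
  print(question)
  print(context)
  input_ids = tokenizer.encode(question, context)
  # Segment 0 runs up to and including the first [SEP]; the rest is segment 1.
  # tokenizer.sep_token_id avoids hard-coding the [SEP] id.
  first_sep = input_ids.index(tokenizer.sep_token_id)
  token_type_ids = [0 if i <= first_sep else 1 for i in range(len(input_ids))]
  start_scores, end_scores = model(torch.tensor([input_ids]), token_type_ids=torch.tensor([token_type_ids]))
  all_tokens = tokenizer.convert_ids_to_tokens(input_ids)
  # Join the tokens from the highest-scoring start to the highest-scoring end.
  prediction = ''.join(all_tokens[torch.argmax(start_scores) : torch.argmax(end_scores)+1])
  # "#" removal is a leftover from BERT's WordPiece "##" markers.
  prediction = prediction.replace("#", "")
  prediction = prediction.replace(" ","")
  return prediction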

print("例")
question = "りんごは何色"
context = "りんごは赤くて美味しい。"
print("\ncontext:", context ,"question:", question)
print(predict(question, context))

Output
context: りんごは赤くて美味しい。
question: りんごは何色
▁りんごは赤くて美味しい
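
The prediction step above takes the argmax of the start and end scores independently, which can select a span that ends before it starts or that falls inside the question. A common alternative is to search only context positions with start <= end. The sketch below shows that idea on plain Python lists, plus removal of the SentencePiece "▁" marker. The function names, tokens and scores are all hypothetical values made up for illustration, not output from the model in this article.

```python
def best_span(start_scores, end_scores, context_start, max_answer_len=30):
    """Return (start, end) maximizing start+end score with
    context_start <= start <= end and a bounded span length."""
    best, best_score = None, float("-inf")
    n = len(start_scores)
    for s in range(context_start, n):
        for e in range(s, min(s + max_answer_len, n)):
            score = start_scores[s] + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best


def pieces_to_text(pieces):
    # SentencePiece marks word starts with "▁" (U+2581); drop it for Japanese.
    return "".join(pieces).replace("\u2581", "")


# Hypothetical tokenization and scores.
tokens = ["[CLS]", "▁りんごは", "何", "色", "[SEP]",
          "▁りんごは", "赤", "くて", "美味しい", "。", "[SEP]"]
start_scores = [0.0, 0.1, 0.0, 0.0, 0.0, 0.2, 2.0, 0.1, 0.0, 0.0, 0.0]
end_scores = [0.0, 0.0, 3.0, 0.0, 0.0, 0.0, 1.5, 0.3, 0.2, 0.0, 0.0]

# The independent end argmax would be index 2, inside the question.
context_start = tokens.index("[SEP]") + 1
start, end = best_span(start_scores, end_scores, context_start)
answer = pieces_to_text(tokens[start:end + 1])
print(start, end, answer)
```

Since the model above returned the entire context, its scores are probably uninformative, so this post-processing alone is unlikely to fix the underlying problem.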
