
[Natural Language Processing] [BERT] Predicting Part of a Sentence

Last updated at 2021-05-03 / Posted at 2021-05-03

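Introduction

The goal is to use BERT, a natural language processing technique, to predict what will appear in the English section of the Kanagawa Prefecture high school entrance exam.

In this article I use BERT to predict the answers to Question 2 (fill in the missing word) from past English exams.
Later I also want to classify the long reading passages by topic and predict which categories will appear in future exams, so I am preparing for morphological analysis as well.
For that, I will compare the BERT tokenizer with Stanza, the natural language processing toolkit from Stanford University, to see which one splits text into tokens more accurately.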

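Predicting the answers to Question 2 (fill in the missing word)

Following advice I received in the comments, I changed the following two parts of the code.

(1) Changed pytorch-transformers to transformers

I referred to the following pages when making the change.
・Transformers
・class transformers.BertForMaskedLM

Before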

!pip install folium==0.2.1
!pip install urllib3==1.25.11
!pip install pytorch-transformers==1.2.0

After

!pip install transformers

(2) Library import section

Before

import torch
from pytorch_transformers import BertForMaskedLM
from pytorch_transformers import BertTokenizer

After

import torch
from transformers import BertTokenizer, BertForMaskedLM

(3) Full code after the changes

!pip install transformers

import torch
from transformers import BertTokenizer, BertForMaskedLM

text = "[CLS] Many people climb this mountain during the summer every year. [SEP]"

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
words = tokenizer.tokenize(text)
print(words)

# Replace the word to predict with [MASK]
# Here we predict "during", so replace the token at index 6 (0-based, counting [CLS])
msk_idx = 6
words[msk_idx] = "[MASK]"
print(words)

# Convert the tokens to IDs
word_ids = tokenizer.convert_tokens_to_ids(words)
word_tensor = torch.tensor([word_ids])
print(word_tensor)

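msk_model = BertForMaskedLM.from_pretrained("bert-base-uncased")
# Use the GPU if one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
msk_model.to(device)
msk_model.eval()

x = word_tensor.to(device)
# Gradients are not needed for inference
with torch.no_grad():
    y = msk_model(x)
# The first element of the output holds the prediction scores (logits)
result = y[0]
print(result.size())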

_, msk_ids = torch.topk(result[0][msk_idx], k=5)
result_words = tokenizer.convert_ids_to_tokens(msk_ids.tolist())

print(result_words)
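The mask position above is counted by hand. As a minimal sketch, the index could instead be looked up from the token list. The helper name `mask_word` is hypothetical, and the hard-coded token list is what I would expect `bert-base-uncased` to produce for the example sentence, so treat it as an assumption and confirm it with `print(words)`.

```python
# Token list assumed for the example sentence (check against tokenizer.tokenize(text)).
tokens = ["[CLS]", "many", "people", "climb", "this", "mountain",
          "during", "the", "summer", "every", "year", ".", "[SEP]"]


def mask_word(tokens, target):
    """Return a masked copy of tokens and the index of the first match of target."""
    # list.index raises ValueError when target is not present as a single token
    idx = tokens.index(target)
    masked = list(tokens)  # copy so the original list is left untouched
    masked[idx] = "[MASK]"
    return masked, idx


masked_tokens, msk_idx = mask_word(tokens, "during")
print(msk_idx)
print(masked_tokens)
```

This only works when the target survives tokenization as one token. A word that the tokenizer splits into several subword pieces would not be found this way.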

Results

2011 exam, Question 2 (2)

July 31
Dear my family,
Hi! How are you doing ? I am writing this postcard at the top of Mt. Fuji.
Mt. Fuji is the （１）(highest) mountain in Japan. Many people climb this mountain （２）(during) the summer every year. Don't you think it's beautiful ?
My Japanese friend Aika and her father brought me here. We left Aika's house （３）(yesterday)
and stayed at the eighth station of Mt. Fuji last night. This morning we got up early, got
to the top and saw a wonderful sunrise. I'm not （４）(tired). I feel very happy.
With love, Chris

['in', 'during', 'throughout', 'over', 'for']

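The correct answer is during, which came out as the second most probable candidate.
In the actual exam, the first letter of the word is given, as in (d_____).
So once that condition is applied, I think it is fair to say the answer is predicted correctly.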
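The first-letter condition can be sketched as a simple filter over the predicted candidates. The function name `filter_by_first_letter` is hypothetical. The candidate list is the output shown above.

```python
def filter_by_first_letter(candidates, first_letter):
    """Keep only candidates starting with first_letter, preserving rank order."""
    letter = first_letter.lower()
    return [word for word in candidates if word.lower().startswith(letter)]


# Top-5 predictions for the 2011 question, highest score first
candidates = ["in", "during", "throughout", "over", "for"]
filtered = filter_by_first_letter(candidates, "d")
print(filtered)  # → ['during']
```

Because the order is preserved, the first element of the filtered list is the best-ranked candidate that satisfies the hint.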

2018 exam, Question 2 (1)

Eri: Hi, Alex. What are you doing now ?
Alex: Hi, Eri. I'm doing my homework. I'm learning about （１）(traditional) Japanese events like Setsubun and Hinamatsuri for my speech next week.
Eri: Oh, you'll talk about events that have a long history.
Alex: I will. Well, I'd like to know what some Japanese words mean. Do you have a （２）(dictionary)?
Eri: Yes. Here you are.
Alex: Thank you. This will be very （３）(useful) to finish my homework. Can I use it at home today?
Eri: Sure. I hope it will help you. Good luck, Alex.

['special', 'important', 'traditional', 'the', 'some']

The correct answer is traditional, which is the third most probable candidate.

2018 exam, Question 2 (3)

['difficult', 'important', 'hard', 'nice', 'good', 'easy', 'helpful', 'interesting', 'useful', 'fun']

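The correct answer, useful, is the ninth most probable candidate.
Compared with the previous results, its probability is lower. Still, nothing in the list of predicted words looks out of place. The surrounding context makes it clear that words with a negative meaning (difficult, hard) cannot be correct, so they can be ruled out.
To get from the remaining positive words to the correct answer, there seems to be no way other than using the first letter of the word.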
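To put the three results side by side, the sketch below computes the rank of the correct answer before and after applying the first-letter hint. The candidate lists and answers are the ones shown above. The helper `rank_of` is hypothetical, and the hint letters are an assumption: I take them to be the first letter of each correct answer, as the exam format described earlier suggests.

```python
def rank_of(candidates, answer):
    """Return the 1-based rank of answer in candidates, or None if it is absent."""
    return candidates.index(answer) + 1 if answer in candidates else None


# (label, correct answer, predicted candidates in descending score order)
results = [
    ("2011 Q2 (2)", "during",
     ["in", "during", "throughout", "over", "for"]),
    ("2018 Q2 (1)", "traditional",
     ["special", "important", "traditional", "the", "some"]),
    ("2018 Q2 (3)", "useful",
     ["difficult", "important", "hard", "nice", "good",
      "easy", "helpful", "interesting", "useful", "fun"]),
]

summary = []
for label, answer, candidates in results:
    # Assumed hint: the first letter of the correct answer
    hinted = [w for w in candidates if w.startswith(answer[0])]
    summary.append((label, rank_of(candidates, answer), rank_of(hinted, answer)))

for label, before, after in summary:
    print(f"{label}: rank {before} without hint, rank {after} with hint")
```

In all three cases the correct answer moves to the top once the hint is applied, although for 2018 Q2 (1) the candidate the still survives the filter and only loses on rank.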

Looking at the results so far, this check suggests that BERT is able to "read the context."
