More than 3 years have passed since last update.

事前学習済みモデルBERTを使用して、TOEIC Part5の解答導出過程で着目した単語をattentionで可視化し、分析してみた

Last updated at 2021-10-23Posted at 2021-10-23

1.背景

IT業界、特にAI領域への転職を見据え、2021年5月より、Aidemyのデータ分析講座の受講を開始しました。
主な学習内容は下記の通りです。
※下記は必須の学習内容ですが、その他にも多数のコースが受講可能です。

＜必須コース＞

Pythonの基礎知識
- NumPy（数値計算）
- Pandas（データベース）
- Matplotlib（データの可視化）
機械学習の手法
- 教師あり学習（回帰・分類）
- 教師なし学習
時系列データの解析
データクレンジング・データハンドリング
ディーブラーニングの基礎知識
自然言語処理の基礎知識
- Twitterの感情分析と株価予測
タイタニックのKaggkeコンペ
テーマを決めてブログ作成

＜Aidemy Premium Plan＞

https://aidemy.net/grit/premium/

2.読んでほしい対象

機械学習（特に自然言語処理）の基礎が理解できている人
Pythonの基礎が理解できている人

3.今回取り組むテーマ

今回はTOEIC Part5の問題をテーマに設定しました。
人間の思考と今回使用する事前学習済みモデルBERTの解答の導出過程が同じであれば、人間の学習の支援になるかもしれないと考え、このテーマにしました（自分自身のTOEICスコアアップにも役立つかなと思い...笑）。
TOEICの各Partの中でもReadigSectionのPart5に絞った理由は、今回使用するBERTの事前学習と同じ単語の穴埋め問題だからです。
解答を導くためにBERTから得られるattention確率を可視化することによって、モデルがどの単語に着目したかを調査しようと思います。

まず、今回のテーマで大きな問題になったのが、検証に使用するデータの確保です。
TOEICは試験問題を公開していないため、紙ベースの公式問題集などからデータを集めるしか方法はないのかな？と思っていたのですが、KaggleにTOEIC Part5の問題のデータセットとBERTの解答を出力するアルゴリズムのコードが公開されていたので、今回はこちらを参考に事前学習済みモデルBERTに未知の問題の解答を出力させる、そして解答を導くために注目した単語について探っていこうと思います。

また、TOEICのPart5は単語の穴埋め問題であり、これはBERTの事前学習と同じような問題であるため、特にfine-tuneは実施せずにどれくらい正答できるかについても検証します。

今回使用するモデルの入出力のイメージとして、下記のようなテストデータ（問題・解答・選択肢のセット）を入力した場合、期待する解答（出力）は'different'になります。

▼テストデータ

question = {'1': 'different',
 '2': 'differently',
 '3': 'difference',
 '4': 'differences',
 'answer': 'different',
 'question': 'Matos Realty has developed two ___ methods of identifying undervalued properties.'}

4.開発環境

＜使用した機種＞

OS：Mac（Intel）
モデル：MacBook Air (13-inch, 2017)

＜使用した開発プラットフォームと設定＞

kaggle（Accelerator：GPU、Internet：ON）

5.参考にした記事

本記事の執筆にあたり、参考にした記事は下記の通りです。

TOEICのデータおよびBERTを使用した解答モデルの参考：BERT model for answering TOEIC reading test
attention確率の可視化の参考：PyTorchで日本語BERTによる文章分類＆Attentionの可視化を実装してみた

6.手法

今回は自然言語処理のSoTA（最も高精度なアルゴリズム）であるBERTを使用しました。BERTとは、「Bidirectional Encoder Representations from Transformers」の略で、2018年にGoogleのJacob Devlinらによって発表された機械学習の手法です。

7.実装

※データセット内のキーである"answer"のスペルが"anwser"になっています（間違っています）。
※変数名内の"answer"は正しいスペルで記載しています。

（1）データの読み込み&リスト化

まずはデータセットを読み込み、読み込んだ1つ目のデータを確認してみます。

import json
with open('../input/toeic-test/toeic_test.json') as input_json:
    data = json.load(input_json)
data['1']

▼実行結果

※出力結果内の"answer"のスペルは間違ったスペル"anwser"出力されています。

{'1': 'suffer',
 '2': 'suffers',
 '3': 'suffering',
 '4': 'suffered',
 'anwser': 'suffered',
 'question': 'The assets of Marble Faun Publishing Company ___ last quarter when one of their main local distributors went out of business.'}

読み込んだデータをリスト形式の配列に格納し、1つ目のデータを取り出します。

question_infors = []
for key, value in data.items():
    question_infors.append(value)
question_infors[0]

▼実行結果

{'1': 'suffer',
 '2': 'suffers',
 '3': 'suffering',
 '4': 'suffered',
 'anwser': 'suffered',
 'question': 'The assets of Marble Faun Publishing Company ___ last quarter when one of their main local distributors went out of business.'}

（2）事前学習済みのBERTを読み込み

transformersをインストールします。

!pip install -U transformers==4.10.1

pytorchとbertのBertTokenizer, BertModel, BertForMaskedLMをインポートします。

import torch
from transformers import BertTokenizer, BertModel, BertForMaskedLM

（3）クラスを作成

class TOEICBert():
    # クラス設定をします
    def __init__(self, bertmodel):
        # GPUが使える：True、使えない：False
        self.use_cuda = torch.cuda.is_available()
        # GPUを使えるかどうかに応じてtorch.deviceオブジェクトを作成
        self.device = torch.device("cuda" if self.use_cuda else "cpu")
        # 引数bertmodelをself.bertmodelに代入
        self.bertmodel = bertmodel
        # BertTokenizerのインスタンスを作成
        self.tokenizer = BertTokenizer.from_pretrained(self.bertmodel)
        # BertForMaskedLMのインスタンスを作成
        self.model = BertForMaskedLM.from_pretrained(self.bertmodel).to(self.device)       
        # BERTの順伝搬結果からattentionを取得できるように設定
        self.model.config.output_attentions = True
        # 推論モードで、dropoutなどの学習時の挙動を無効化
        self.model.eval()

    # スコアをゲットする関数を定義    
    def get_score(self, question_tensors, masked_index, candidate):
        # 回答候補の単語をサブワードに分割
        candidate_tokens = self.tokenizer.tokenize(candidate)
        # サブワードをIDに変換
        candidate_ids = self.tokenizer.convert_tokens_to_ids(candidate_tokens)
        # 予測
        predictions = self.model(question_tensors)

        # predictions.attentionsにはBERTの各層のattention確率のタプルが格納されています
        # ここでは最終層のattentionを取得し利用します
        # attentionのshapeは(bsz, num_heads, seq_len, seq_len)です
        # BERTはMulti-head attentionという、attentionを計算するモジュールが複数存在しています
        # アテンションの最終層を取得
        attention_in_last_layer = predictions.attentions[-1]
        # attention_in_last_layerを変数attentionに代入
        attention = attention_in_last_layer
        # マスクされた単語のattention確率を取得
        attention_of_mask = attention[:, :, masked_index, :]

        # サブワードごとにスコアを取得し、その平均を回答候補のスコアとします
        predictions_candidates = predictions.logits[0,masked_index, candidate_ids].mean()
        # 戻り値のpredictions_candidatesはtorch.tensorから、float型のデータに変換します
        return predictions_candidates.item(), attention_of_mask

    # 結果（予測値）を取得する関数を定義します
    def predict(self,row):
        # 問題文の単語をサブワードに分割します
        # マスクされた単語が___という文字列になっているので、_に変化してから分割します
        question_tokens = self.tokenizer.tokenize(row['question'].replace('___', '_'))
        # マスクされた単語の位置を取得します
        masked_index = question_tokens.index('_')
        # masked_indexの位置の単語'_'を'[MASK]'に置換
        question_tokens[masked_index] = '[MASK]'
        # サブワードをIDに変換
        question_ids = self.tokenizer.convert_tokens_to_ids(question_tokens)
        # torch.tensorに変換し、GPUに載せます
        question_tensors = torch.tensor([question_ids]).to(self.device)
        # 変数candidatesに4つの選択肢をリスト形式で格納
        candidates = [row['1'], row['2'], row['3'], row['4']]

        results = [self.get_score(question_tensors, masked_index, candidate) for candidate in candidates]
        # 各選択肢の予測値を変数predict_tensorに格納
        predict_tensor = torch.tensor([result[0] for result in results])
        # 一番スコアの高い選択肢の値をtorch.tensorから、float型のデータに変換してpredict_idxに格納します
        predict_idx = torch.argmax(predict_tensor).item()
        # 各選択肢の中で、一番スコアが高い単語のattentionを取得
        attention = results[predict_idx][1]
        return candidates[predict_idx], attention[0], question_tokens

（4）Bertmodelのインスタンスを生成

# 変数Bertmodelに'bert-large-uncased'を格納します
Bertmodel = 'bert-large-uncased'
# Bertmodelのインスタンスを生成
model = TOEICBert(Bertmodel)

（5）モデルの性能を評価

count = 0
for question in question_infors:
    answer_predict, _, _ = model.predict(question)
    #右辺のquestion[]内の"answer"はデータセット内のスペルが間違えているため、あえて間違えている方のスペル"anwser"で記載しています。
    if answer_predict == question['anwser']:
        count+=1

num_questions = len(question_infors)
print(f'The model predict {round(count/num_questions,2) * 100} % of total {len(question_infors)} questions')

▼実行結果

The model predict 76.0 % of total 3625 questions

（6）attentionの可視化と解答を出力する関数の定義

モデルに入力するテストデータの「'1': '○○○' 〜 '4': '○○○'」部分が選択肢、「'question': '○○○'」部分が問題文、「'answer':'○○○'」部分が正解となります。
例えば、下記コード内のテストデータ（question）を入力したときに、期待する解答（answer）は選択肢1の'different'になります。

# attention可視化のため、displayとHTMLをインポートします
from IPython.display import display, HTML

# 可視化のための関数を定義します
# URL：https://qiita.com/m__k/items/e312ddcf9a3d0ea64d72を参考にしました
def highlight(attn, word):
  html_color = '#%02X%02X%02X' % (255, int(255*(1 - attn)), int(255*(1 - attn)))
  return '<span style="background-color: {}">{}</span>'.format(html_color, word)

# 解答を出力する関数を定義します
def Answer_toeic(question, show_attention = False):    
    predict_answer, attention, question_tokens = model.predict(question)
    answer = question['answer']
    if show_attention:
        html_text = ""
        for attn, token in zip(attention.mean(0), question_tokens):
            if token.startswith('##'):
                token = token[2:]
            else:
                token = ' ' + token

            html_text += highlight(attn, token)

        #HTML形式で可視化します
        print('mean of attention')
        display(HTML(html_text))
    
        for m in range(attention.size(0)):
            html_text = ""
        
            for attn, token in zip(attention[m], question_tokens):
                if token.startswith('##'):
                    token = token[2:]
                else:
                    token = ' ' + token

                html_text += highlight(attn, token)

            #HTML形式で可視化します
            print(f'head {m}')
            display(HTML(html_text))

    if predict_answer == answer:
        print(f'The BertModel answer: {predict_answer}')
        print('This is right answer')
    else:
        print(f'The BertModel answer: {predict_answer}')
        print('This is wrong answer')

# テストデータ（学習には使用していない未知のデータ）
question = {'1': 'different',
 '2': 'differently',
 '3': 'difference',
 '4': 'differences',
 'answer': 'different',
 'question': 'Matos Realty has developed two ___ methods of identifying undervalued properties.'}

解答を出力する関数（Answer_toeic）にquestion（問題・解答・選択肢のセット）とパラメーター引数にTrueを設定し、最終層の全てのattention headを出力します。

# テストデータを解答を出力する関数に投入し出力結果を取得します。
Answer_toeic(question, True)

▼実行結果

8.分析

BERTが解答を導き出すときの最終層のMulti-head attentionを可視化することにより、BERTの解答導出過程を分析しました。

（1）事例の分析 -正解編-

Case1：全てのattention headが人間の思考と同じ単語に着目している

▼question（問題・解答・選択肢のセット）と実行結果

question = {'1': 'different',
 '2': 'differently',
 '3': 'difference',
 '4': 'differences',
 'answer': 'different',
 'question': 'Matos Realty has developed two ___ methods of identifying undervalued properties.'}

▼考察
人間と同じ思考であれば、"methods"に着目し、名詞"methods"を修飾する形容詞"different"が入ると判断すると思いますが、BERTも最終層の全てのheadにおいて名詞”methods"に比較的高いattention確率が付いていて、人間の解答導出過程と同じだと判断できる結果でした。

Case2：一部のattention headのみ、人間の思考と同じ単語に着目している

▼question（問題・解答・選択肢のセット）と実行結果

question = {'1': 'him',
 '2': 'his', 
 '3': 'himself', 
 '4': 'he', 
 'answer': 'his', 
 'question': 'lndie film director Luke Steele will be in London for the premiere of ___ new movie.'}

▼考察
人間の思考だと、名詞"new movie"に着目し、"new movie"を修飾する所有格が入ると判断すると思います。しかし、BERT最終層の16個のheadの内、"new movie"に注目しているheadは3個（head2、head6、head9）、かつ注目しているheadのattention確率はほとんど無視できるレベルで低かったのですが、正解できていました。

また、冠詞や前置詞、ピリオドなど、解答の導出にあまり重要でなさそうな単語に高いattention確率が付いていました。

（2）事例の分析 -不正解編-

Case1：時制の問題

▼question（問題・解答・選択肢のセット）と実行結果

question = {'1': 'having', 
 '2': 'will have', 
 '3': 'was having', 
 '4': 'has', 
 'answer': 'was having', 
 'question': "Ms. Cho relayed her concerns about the company's financial situation while she ___ a meeting with the manager."}

▼考察
人間の思考だと、動詞"relayed"に注目し、過去時制と判断し解答を導出すると思うのですが、"relayed"に注目しているheadはなく、解答も不正解でした。

Case2：単語の意味問題

▼question（問題・解答・選択肢のセット）と実行結果

question = {'1': 'enormously', 
 '2': 'financially', 
 '3': 'exceptionally', 
 '4': 'generously', 
 'answer': 'generously', 
 'question': 'Open Society Institute has ___ offered to sponsor a number of participants from developing countries for attendance at the OA Workshop.'}

question = {'1': 'picturesque', 
 '2': 'monetary', 
 '3': 'horrible', 
 '4': 'singular', 
 'answer': 'picturesque', 
 'question': 'The movie is quite popular because of its ___ scenes and solid plot, both of which have been praised by movie critics.'}

▼考察
こちらのケースは品詞や時制などの文法的問題ではなく、単語の意味で解答する必要がある問題です。このような問題でBERTが間違った解答をしていました。

9.まとめ

今回は事前学習済みのBERTを使用してTOEICのPart5の問題を解答させ、解答を導くために着目した単語を可視化し、分析することにより、人間の学習支援につながる結果が得られるかどうかについて検証しました。

分析に関しましては、統計的な分析はできなかったのですが、Kaggle上のデータセットから事例をいくつか集め、BERTの着眼点と人間の思考による着眼点に共通点や相違点はあるのか、について探ってみました。

正解となる事例では、人間の着目する単語と同じ単語に着目していて、attention確率も比較的高めの付与されたheadが少なくとも一つ以上は存在しました。

不正解となる事例では、人間の着目する単語とは違う単語に着目していたり、文法的ではなく、単語の意味で解答を導出するタイプの問題で間違った解答を出力していました。

今回の検証では、検証できた事例が少ないので一概に言えないのですが、以下のことが分かりました。

最終層の各attention head内で、人間と同じ着眼点に少なくとも1つ以上、低めでもattention確率が付与されていると、正解を導ける場合がある、ということが分かりました。
最終層の各attention head内で、どのheadも人間と同じ着眼点に着目していないと正解を出力できない場合がある、ということが分かりました。

今後はさらに事例を集め、問題のパターンごとにセグメント化して統計的に分析することにより、BERTが得意な問題、不得意な問題などの傾向をつかむことができるのではないかと考えています。

そして、このBERTを応用することができれば、問題のパターンごとにどの単語に着目したら良いかなどをBERTが自動的に出力し、人間の学習を支援できる可能性も十分にあるのではないか、と思っています。

また、BERTが正解した事例のcase2では、あまり重要ではなさそうな文末のピリオドに高いattention確率が付与されていました。最先端の研究では、attention確率のみによる分析ではなく、ベクトルのノルムに基づく分析が有効であるということが言われているので、今後試してみたいと思います。

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up