
Adding tokenized text and attention weights to pipeline output in huggingface/transformers

Last updated at 2021-11-26, posted at 2021-11-18

Introduction

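The pipeline in huggingface/transformers lets you run inference in a few lines of code, but its output is somewhat limited.
In this article I modify part of a pipeline so that its output also includes the tokenized text and the attention weights.
Specifically, I subclass the pipeline I want to use (TextClassificationPipeline in this article) and add a few lines to its forward and postprocess methods. This yields the desired output while keeping custom code to a minimum.
Using BERT text classification as the example, the article is organized as follows.
  - How to obtain the tokenized text
  - How to obtain the attention weights
  - How to output the tokenized text and attention weights from a pipeline (the main topic)
If you only want the code, skip ahead to text_classification_add_outputs.py.
I do not explain the basic usage of huggingface/transformers here; please refer to the official documentation.
Also, as I am still inexperienced, this article may contain mistakes, or there may be better approaches than the ones described. If you find any, I would appreciate it if you pointed them out in the comments.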

How to obtain the tokenized text

First, split the text with the tokenizer as shown below.

code1

from transformers import AutoTokenizer

model_name = 'path to the pretrained model'
tokenizer = AutoTokenizer.from_pretrained(model_name)

# The opening of "I Am a Cat" (吾輩は猫である) is used as the example text
text = '吾輩は猫である。名前はまだ無い。'
model_inputs = tokenizer(text, return_tensors='pt')

result1

{
'input_ids': tensor([[    2,  1583,  5159,   897,  3574,   889, 20656,   829,  1564,  1402,   897,  6466,  6626,  3464,   854,   829,     3]]),
'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]),
'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
}

This shows that the tokenized text can be obtained by converting input_ids back into tokens.
huggingface/transformers already provides a method for this (convert_ids_to_tokens).

code2

tokenized_text = tokenizer.convert_ids_to_tokens(model_inputs['input_ids'][0])

result2

['[CLS]', '吾', '輩', 'は', '猫', 'で', '##ある', '。', '名', '前', 'は', '##ま', '##だ', '無', 'い', '。', '[SEP]']

That is how to obtain the tokenized text.
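
In result2, tokens that start with `##` are WordPiece continuation pieces that belong to the preceding token. For display it can be handy to merge them back together. The following is a minimal sketch under that assumption; `merge_wordpieces` is a hypothetical helper introduced here for illustration (it is not part of transformers), and the special tokens it drops are the ones that appear in result2 plus `[PAD]`.

```python
def merge_wordpieces(tokens, special_tokens=('[CLS]', '[SEP]', '[PAD]')):
    """Merge '##'-prefixed continuation pieces into the preceding token.

    Hypothetical display helper; assumes BERT-style WordPiece tokens.
    """
    merged = []
    for token in tokens:
        if token in special_tokens:
            continue  # drop special tokens
        if token.startswith('##') and merged:
            merged[-1] += token[2:]  # glue the piece onto the previous token
        else:
            merged.append(token)
    return merged


# Token list copied from result2
tokenized_text = ['[CLS]', '吾', '輩', 'は', '猫', 'で', '##ある', '。',
                  '名', '前', 'は', '##ま', '##だ', '無', 'い', '。', '[SEP]']
words = merge_wordpieces(tokenized_text)
print(words)
```

Merging changes the token positions, so the merged list no longer lines up index by index with the attention matrices discussed later; it is meant for display only.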

How to obtain the attention weights

Passing output_attentions=True in the model configuration (here, through from_pretrained) makes the model output include the attention weights.

code3

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(model_name, output_attentions=True)
model_outputs = model(model_inputs['input_ids'])

result3

# attentions is abbreviated here, so this differs from the actual output
SequenceClassifierOutput(
loss=None, 
logits=tensor([[0.0077, 0.1880]], grad_fn=<AddmmBackward0>), 
hidden_states=None, 
attentions=(tensor([[[[0.1091, 0.0460, 0.0619,  ..., 0.0487, 0.0344, 0.0468],[0.0431, 0.0089, 0.1076,  ..., 0.1150, 0.0315, 0.0183],[0.1115, 0.0604, 0.0373,  ..., 0.0246, 0.0797, 0.1302],  ...,[4.1084e-03, 9.1141e-04, 1.6878e-04,  ..., 7.1091e-03, 1.2396e-02, 9.3627e-01], [1.8521e-02, 3.7775e-03, 5.7404e-03,  ..., 6.6644e-03, 1.0438e-02, 9.2752e-01], [3.5021e-02, 1.7682e-02, 3.3329e-02,  ..., 1.2351e-02, 1.4987e-02, 7.6510e-01]]]], grad_fn=<SoftmaxBackward0>))
)

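In other words, the attention weights are available as model_outputs['attentions']. This is a tuple with one tensor per layer, each of shape (batch_size, num_heads, sequence_length, sequence_length).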
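
To make the indexing concrete, the sketch below reads one row of attention weights: the weights that a single query token assigns to every token, in one head of one layer. It uses small nested lists as a stand-in for the real tensors (with real output, something like `[a.tolist() for a in model_outputs['attentions']]` should give the same nesting). The numbers are invented and `attention_row` is a hypothetical helper, not a transformers function.

```python
def attention_row(attentions, layer, head, query_pos, batch=0):
    """Return the weights one query token assigns to all key tokens.

    attentions is indexed as [layer][batch][head][query][key].
    """
    return attentions[layer][batch][head][query_pos]


# Toy stand-in: 1 layer, batch of 1, 2 heads, 3 tokens (values are invented)
toy_attentions = (
    [  # layer 0
        [  # batch item 0
            [[0.2, 0.5, 0.3], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]],    # head 0
            [[0.4, 0.1, 0.5], [0.2, 0.6, 0.2], [0.5, 0.25, 0.25]],  # head 1
        ]
    ],
)

row = attention_row(toy_attentions, layer=0, head=1, query_pos=0)
print(row)       # weights from token 0 to each token, in head 1
print(sum(row))  # each row is a softmax output, so it should be close to 1
```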

How to output the tokenized text and attention weights from a pipeline

Now for the main topic.

Output of the pipeline

First, let's check what the pipeline outputs.

code4

from transformers import TextClassificationPipeline

classifier = TextClassificationPipeline(model=model, tokenizer=tokenizer)
outputs = classifier(text)

result4

[{'label': 'LABEL_1', 'score': 0.5449662804603577}]

As you can see, even though output_attentions=True was passed in code3, the pipeline output contains no attention weights. This is due to the postprocess method.

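What postprocess does, and how to output the attention weights

postprocess is the method that takes the raw output (model_outputs) and returns a readable result (outputs). Its return value does not include the attention weights, which is why we get result4.
In other words, adding the attention weights to the return value of postprocess is enough to output them.

How to output the tokenized text

We now know that changing the return value of postprocess lets us shape the output freely. However, obtaining the tokenized text as in code2 requires model_inputs, and the tokenized text obtained from model_inputs must then be passed on to postprocess. For this reason we also modify forward, the method that receives model_inputs and returns model_outputs.
In forward, we obtain the tokenized text from the received model_inputs as in code2, add it to model_outputs, and hand it over to postprocess.
After that, modifying postprocess in the same way as for the attention weights lets us output the tokenized text as well.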
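
The data flow described above can be illustrated with a toy stand-in. `ToyPipeline` below is not the transformers implementation; it is a hypothetical, dependency-free class that only mimics the preprocess, forward, postprocess order, to show why a value added in forward is still available in postprocess.

```python
class ToyPipeline:
    """Hypothetical stand-in that mimics preprocess -> forward -> postprocess."""

    def preprocess(self, text):
        # Pretend tokenization: one "id" per character
        return {'input_ids': [ord(ch) for ch in text]}

    def forward(self, model_inputs):
        # Pretend model: returns a raw score plus extra data
        return {'logits': [len(model_inputs['input_ids'])],
                'attentions': 'raw attention data'}

    def postprocess(self, model_outputs):
        # Only the tidy part is returned; everything else is dropped here
        return {'score': model_outputs['logits'][0]}

    def __call__(self, text):
        model_inputs = self.preprocess(text)
        model_outputs = self.forward(model_inputs)
        return self.postprocess(model_outputs)


class ToyPipelineAddOutputs(ToyPipeline):
    def forward(self, model_inputs):
        model_outputs = super().forward(model_inputs)
        # forward is the last step that still sees model_inputs
        model_outputs['tokenized_text'] = [chr(i) for i in model_inputs['input_ids']]
        return model_outputs

    def postprocess(self, model_outputs):
        processed = super().postprocess(model_outputs)
        # Re-attach what the parent postprocess discarded
        return {**processed,
                'tokenized_text': model_outputs['tokenized_text'],
                'attentions': model_outputs['attentions']}


plain = ToyPipeline()('猫だ')
extended = ToyPipelineAddOutputs()('猫だ')
print(plain)
print(extended)
```

The subclass calls `super()` in both methods, so the parent's behaviour is reused and only the extra keys are added on top.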

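Implementing the modified class

Based on the above, we modify forward and postprocess.
We could patch huggingface/transformers directly, but since I would rather not modify the original library,
I subclass TextClassificationPipeline and implement TextClassificationPipelineAddOutputs.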

text_classification_add_outputs.py

from transformers import TextClassificationPipeline


class TextClassificationPipelineAddOutputs(TextClassificationPipeline):
    def forward(self, model_inputs, **forward_params):
        # Run the original processing
        model_outputs = super().forward(model_inputs, **forward_params)

        # Same as code2 (only the first sequence in the batch is converted)
        tokenized_text = self.tokenizer.convert_ids_to_tokens(model_inputs['input_ids'][0])

        # Add the tokenized text to model_outputs (dict-like)
        model_outputs['tokenized_text'] = tokenized_text
        return model_outputs

    # The signature mirrors the parent class in the transformers version used here;
    # it may differ in other versions
    def postprocess(self, model_outputs, function_to_apply=None, return_all_scores=False):
        # Run the original processing
        processed = super().postprocess(model_outputs, function_to_apply, return_all_scores)

        # Local variables for readability
        tokenized_text = model_outputs['tokenized_text']
        attentions = model_outputs['attentions']

        # Add the tokenized text and attention weights to the outputs
        extra = {'tokenized_text': tokenized_text, 'attentions': attentions}
        if return_all_scores:
            return [{**proc, **extra} for proc in processed]
        else:
            return {**processed, **extra}

Let's check the output of the implemented class.

code5

# As in code3, pass 'output_attentions=True' to 'model'
classifier = TextClassificationPipelineAddOutputs(model=model, tokenizer=tokenizer)
outputs = classifier(text)

result5

# attentions is abbreviated here, so this differs from the actual output
[{
'label': 'LABEL_1', 
'score': 0.5449662804603577, 
'tokenized_text': ['[CLS]', '吾', '輩', 'は', '猫', 'で', '##ある', '。', '名', '前', 'は', '##ま', '##だ', '無', 'い', '。', '[SEP]'], 
'attentions': (tensor([[[[0.1091, 0.0460, 0.0619,  ..., 0.0487, 0.0344, 0.0468],[0.0431, 0.0089, 0.1076,  ..., 0.1150, 0.0315, 0.0183],[0.1115, 0.0604, 0.0373,  ..., 0.0246, 0.0797, 0.1302], ..., [0.0074, 0.1923, 0.2223,  ..., 0.0083, 0.0186, 0.0239], [0.0107, 0.0105, 0.0127,  ..., 0.0151, 0.1036, 0.6139], [0.0220, 0.0112, 0.0168,  ..., 0.0362, 0.3030, 0.1718]]]]))}]

The tokenized text and the attention weights are now included in the output.
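
As one possible use of this output, the sketch below pairs each token with the attention it receives from `[CLS]`, averaged over the heads of one layer, and sorts the tokens by that value. It is only an illustration: `cls_attention_per_token` is a hypothetical helper, the toy numbers are invented, and the attention for one layer is assumed to have been converted to nested lists first (for example with `outputs[0]['attentions'][-1][0].tolist()` for the last layer).

```python
def cls_attention_per_token(tokens, layer_attention):
    """Average the [CLS] row over heads and pair it with the tokens.

    layer_attention: nested list shaped [num_heads][seq_len][seq_len]
    """
    num_heads = len(layer_attention)
    scores = []
    for key_pos in range(len(tokens)):
        # Query position 0 is [CLS] in BERT-style inputs
        total = sum(head[0][key_pos] for head in layer_attention)
        scores.append(total / num_heads)
    return list(zip(tokens, scores))


# Toy stand-in for one layer: 2 heads, 3 tokens (values are invented)
tokens = ['[CLS]', '猫', '[SEP]']
toy_layer = [
    [[0.2, 0.5, 0.3], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]],    # head 0
    [[0.4, 0.1, 0.5], [0.2, 0.6, 0.2], [0.5, 0.25, 0.25]],  # head 1
]

pairs = cls_attention_per_token(tokens, toy_layer)
ranked = sorted(pairs, key=lambda pair: pair[1], reverse=True)
for token, score in ranked:
    print(f'{token}\t{score:.3f}')
```

Averaging over heads and looking only at the `[CLS]` row is just one of many ways to summarize the weights; other layers, heads, or query tokens can be inspected the same way.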

Conclusion

I hope this article is useful.
Once again, if you find mistakes in this article or know of a better approach, I would be grateful if you let me know in the comments.
