[Translation] Question answering in Hugging Face transformers

Posted at 2023-05-23

This document is an abridged translation and its accuracy is not guaranteed. Please refer to the original text for the authoritative content.

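Question answering tasks return an answer to a given question. If you have ever asked a virtual assistant such as Alexa, Siri, or Google what the weather is, you have already used a question answering model. There are two common types of question answering tasks:

Extractive: extract the answer from the given context.
Abstractive: generate an answer from the context that correctly answers the question.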

This guide shows you how to:

Fine-tune DistilBERT on the SQuAD dataset for extractive question answering.
Use your fine-tuned model for inference.

The task illustrated in this tutorial is supported by the following model architectures:

ALBERT, BART, BERT, BigBird, BigBird-Pegasus, BLOOM, CamemBERT, CANINE, ConvBERT, Data2VecText, DeBERTa, DeBERTa-v2, DistilBERT, ELECTRA, ERNIE, ErnieM, FlauBERT, FNet, Funnel Transformer, OpenAI GPT-2, GPT Neo, GPT NeoX, GPT-J, I-BERT, LayoutLMv2, LayoutLMv3, LED, LiLT, Longformer, LUKE, LXMERT, MarkupLM, mBART, MEGA, Megatron-BERT, MobileBERT, MPNet, MVP, Nezha, Nyströmformer, OPT, QDQBert, Reformer, RemBERT, RoBERTa, RoBERTa-PreLayerNorm, RoCBert, RoFormer, Splinter, SqueezeBERT, XLM, XLM-RoBERTa, XLM-RoBERTa-XL, XLNet, X-MOD, YOSO

Before you begin, make sure you have all the necessary libraries installed:

pip install transformers datasets evaluate

We recommend logging in to your Hugging Face account so that you can upload your model and share it with the community.

Python

from huggingface_hub import notebook_login

notebook_login()

Load the SQuAD dataset

Start by loading a small subset of the SQuAD dataset from the 🤗 Datasets library. This lets you experiment and confirm that everything works before spending more time training on the full dataset.

Python

from datasets import load_dataset

squad = load_dataset("squad", split="train[:5000]")

Split the dataset's train split into a training set and a test set with the train_test_split method:

Python

squad = squad.train_test_split(test_size=0.2)

Then take a look at an example:

Python

squad["train"][0]

JSON

{'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']},
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'id': '5733be284776f41900661182',
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'title': 'University_of_Notre_Dame'
}

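There are several important fields here:

answers: the character offset at which the answer starts in the context (answer_start) and the answer text.
context: the background information from which the model needs to extract the answer.
question: the question the model should answer.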
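As a small illustration of how these fields fit together, the sketch below builds a SQuAD-style record by hand and checks that answer_start is a character offset into context. The record is a shortened, hypothetical example (one sentence taken from the sample above, with the offset recomputed for it), not an actual dataset row.

```python
# A hand-built, SQuAD-style record (hypothetical; shortened from the sample above).
context = (
    "It is a replica of the grotto at Lourdes, France where the Virgin Mary "
    "reputedly appeared to Saint Bernadette Soubirous in 1858."
)
answer_text = "Saint Bernadette Soubirous"

record = {
    "question": "To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?",
    "context": context,
    # answer_start is a character offset into `context`, not a token index.
    "answers": {"answer_start": [context.index(answer_text)], "text": [answer_text]},
}

# Recover the answer by slicing the context with the character offsets.
start_char = record["answers"]["answer_start"][0]
end_char = start_char + len(record["answers"]["text"][0])
recovered = record["context"][start_char:end_char]
print(recovered)
```

The same start_char and end_char values are what the preprocessing step later converts into token positions.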

Preprocessing

The next step is to load a DistilBERT tokenizer to process the question and context fields:

Python

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

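There are a few preprocessing steps specific to question answering tasks that you should be aware of:

Some examples in a dataset may have a very long context that exceeds the maximum input length of the model. To handle longer sequences, truncate only the context by setting truncation="only_second".
Next, map the start and end positions of the answer to the original context by setting return_offsets_mapping=True.
With the mapping in hand, you can find the start and end tokens of the answer. Use sequence_ids to determine which part of the offsets corresponds to the question and which part corresponds to the context.

Here is how you can create a function that truncates the context and maps the start and end tokens of the answer to the context: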
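To make the offset logic easier to follow without downloading a tokenizer, here is a minimal sketch of the same idea in plain Python. The toy_encode helper is a hypothetical whitespace tokenizer that only imitates the offsets and sequence ids a fast tokenizer returns; char_span_to_token_span is likewise an illustrative name, not part of any library.

```python
import re

def toy_encode(question, context):
    """Hypothetical stand-in for a fast tokenizer: splits on whitespace and
    records (start, end) character offsets plus a sequence id per token."""
    tokens, offsets, seq_ids = ["[CLS]"], [(0, 0)], [None]
    for seq_id, text in enumerate((question, context)):
        for m in re.finditer(r"\S+", text):
            tokens.append(m.group())
            offsets.append((m.start(), m.end()))
            seq_ids.append(seq_id)
        # Special tokens get sequence id None, like real tokenizers report.
        tokens.append("[SEP]")
        offsets.append((0, 0))
        seq_ids.append(None)
    return tokens, offsets, seq_ids

def char_span_to_token_span(offsets, seq_ids, start_char, end_char):
    """Map a character span in the context to (start_token, end_token)."""
    context_start = seq_ids.index(1)
    context_end = len(seq_ids) - 1 - seq_ids[::-1].index(1)
    # Answer lies outside the (possibly truncated) context: label it (0, 0).
    if offsets[context_start][0] > end_char or offsets[context_end][1] < start_char:
        return 0, 0
    idx = context_start
    while idx <= context_end and offsets[idx][0] <= start_char:
        idx += 1
    start_token = idx - 1
    idx = context_end
    while idx >= context_start and offsets[idx][1] >= end_char:
        idx -= 1
    return start_token, idx + 1

question = "How many programming languages does BLOOM support?"
context = "BLOOM can generate text in 46 natural languages and 13 programming languages."
answer = "13"
start_char = context.index(answer)
end_char = start_char + len(answer)

tokens, offsets, seq_ids = toy_encode(question, context)
start_token, end_token = char_span_to_token_span(offsets, seq_ids, start_char, end_char)
print(tokens[start_token:end_token + 1])
```

A real subword tokenizer produces different tokens and offsets, but the search over the context tokens works the same way.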

Python

def preprocess_function(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=384,
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        answer = answers[i]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label it (0, 0)
        if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs

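To apply the preprocessing function to the entire dataset, use the 🤗 Datasets map function. You can speed up map by setting batched=True, which processes multiple elements of the dataset at once. Remove any columns you don't need: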

Python

tokenized_squad = squad.map(preprocess_function, batched=True, remove_columns=squad["train"].column_names)
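As a rough illustration of what batched=True changes, the function receives a dictionary that maps each column name to a list of values rather than a single row. The sketch below imitates that layout in plain Python; rows_to_batch and strip_questions are hypothetical helpers for illustration and are not part of 🤗 Datasets.

```python
def rows_to_batch(rows):
    """Turn a list of row dicts into the column-oriented dict that a
    batched function receives (column name -> list of values)."""
    return {key: [row[key] for row in rows] for key in rows[0]}

def strip_questions(examples):
    # `examples["question"]` is a list with one entry per row in the batch.
    return {"question": [q.strip() for q in examples["question"]]}

rows = [
    {"question": "  Who wrote it? ", "context": "Alice wrote it."},
    {"question": "When?  ", "context": "In 2020."},
]

batch = rows_to_batch(rows)
cleaned = strip_questions(batch)
print(cleaned["question"])
```

This is why preprocess_function iterates over examples["question"] and indexes answers[i]: every column arrives as a list.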

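Now create a batch of examples using DefaultDataCollator. Unlike other data collators in 🤗 Transformers, DefaultDataCollator does not apply any additional preprocessing such as padding (the preprocessing function above already pads every example with padding="max_length").

PyTorch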

Python

from transformers import DefaultDataCollator

data_collator = DefaultDataCollator()

TensorFlow

Python

from transformers import DefaultDataCollator

data_collator = DefaultDataCollator(return_tensors="tf")

Training

PyTorch

If you aren't familiar with fine-tuning a model with the Trainer, take a look at the basic tutorial!

You're now ready to start training your model! Load DistilBERT with AutoModelForQuestionAnswering:

Python

from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer

model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")

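At this point, only three steps remain:

Define your training hyperparameters in TrainingArguments. The only required parameter is output_dir, which specifies where to save your model. Set push_to_hub=True to push the model to the Hub (you need to be signed in to Hugging Face to upload your model).
Pass the training arguments to Trainer along with the model, datasets, tokenizer, and data collator.
Call train() to fine-tune your model.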

Python

training_args = TrainingArguments(
    output_dir="my_awesome_qa_model",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_squad["train"],
    eval_dataset=tokenized_squad["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()

Once training is complete, share your model on the Hub with the push_to_hub() method so that everyone can use it:

Python

trainer.push_to_hub()

TensorFlow

If you aren't familiar with fine-tuning a model with Keras, take a look at the basic tutorial!

To fine-tune a model in TensorFlow, start by setting up an optimizer function, a learning rate schedule, and some training hyperparameters:

Python

from transformers import create_optimizer

batch_size = 16
num_epochs = 3  # keep in sync with the epochs value passed to model.fit
total_train_steps = (len(tokenized_squad["train"]) // batch_size) * num_epochs
optimizer, schedule = create_optimizer(
    init_lr=2e-5,
    num_warmup_steps=0,
    num_train_steps=total_train_steps,
)
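As a worked example of the step count, the sketch below repeats the arithmetic with the sizes used in this guide. It assumes the 5,000-example subset with test_size=0.2 leaves 4,000 training examples; total_train_steps_for is a hypothetical helper name used only for illustration.

```python
def total_train_steps_for(num_examples, batch_size, num_epochs):
    """Number of optimizer steps when the last partial batch is dropped."""
    steps_per_epoch = num_examples // batch_size
    return steps_per_epoch * num_epochs

# Assumed sizes: 5000 loaded examples, 20% held out for the test split.
num_loaded = 5000
num_test = num_loaded * 20 // 100
num_train = num_loaded - num_test

steps_two_epochs = total_train_steps_for(num_train, batch_size=16, num_epochs=2)
steps_three_epochs = total_train_steps_for(num_train, batch_size=16, num_epochs=3)
print(num_train, steps_two_epochs, steps_three_epochs)
```

Because the learning rate schedule decays over num_train_steps, the number of epochs used here should match the number of epochs you actually train for.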

Then load DistilBERT with TFAutoModelForQuestionAnswering:

Python

from transformers import TFAutoModelForQuestionAnswering

model = TFAutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")

Convert your datasets to the tf.data.Dataset format with prepare_tf_dataset():

Python

tf_train_set = model.prepare_tf_dataset(
    tokenized_squad["train"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)

tf_validation_set = model.prepare_tf_dataset(
    tokenized_squad["test"],
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)

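Configure the model for training with compile. No loss argument is needed here because Transformers models fall back to a built-in, task-appropriate loss when none is specified: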

Python

import tensorflow as tf

model.compile(optimizer=optimizer)

The last thing to set up before you start training is a way to push your model to the Hub. You can do this by specifying where to push your model and tokenizer in the PushToHubCallback:

Python

from transformers.keras_callbacks import PushToHubCallback

callback = PushToHubCallback(
    output_dir="my_awesome_qa_model",
    tokenizer=tokenizer,
)

Finally, you're ready to start training your model! Call fit with your training dataset, validation dataset, number of epochs, and callback to fine-tune the model:

Python

model.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=3, callbacks=[callback])

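Once training is complete, your model is automatically uploaded to the Hub so that everyone can use it!

For a more in-depth example of how to fine-tune a model for question answering, take a look at the corresponding PyTorch notebook or TensorFlow notebook.

Evaluation

Evaluation for question answering requires a significant amount of postprocessing. To avoid taking up too much of your time, this guide skips the evaluation step. The Trainer still calculates the evaluation loss during training, so you are not completely in the dark about your model's performance.

If you have more time and are interested in how to evaluate a question answering model, take a look at the Question answering chapter of the 🤗 Hugging Face Course!

Inference

Great, now that you've fine-tuned a model, you can use it for inference!

Come up with a question and some context you'd like the model to predict from: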

Python

question = "How many programming languages does BLOOM support?"
context = "BLOOM has 176 billion parameters and can generate text in 46 languages natural languages and 13 programming languages."

The simplest way to try out your fine-tuned model for inference is to use it in a pipeline(). Instantiate a pipeline for question answering with your model, and pass your text to it:

Python

from transformers import pipeline

question_answerer = pipeline("question-answering", model="my_awesome_qa_model")
question_answerer(question=question, context=context)

{'score': 0.2058267742395401,
 'start': 10,
 'end': 95,
 'answer': '176 billion parameters and can generate text in 46 languages natural languages and 13'}
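The start and end values in this result can be read as character offsets into the context string. As a quick check, the sketch below copies the context and the result dictionary shown above as literals (it does not run the model) and slices the context with those offsets.

```python
# Context and pipeline result copied from the example above as plain literals.
context = (
    "BLOOM has 176 billion parameters and can generate text in 46 languages "
    "natural languages and 13 programming languages."
)
result = {
    "score": 0.2058267742395401,
    "start": 10,
    "end": 95,
    "answer": "176 billion parameters and can generate text in 46 languages natural languages and 13",
}

# `start` is inclusive and `end` is exclusive, like a Python slice.
span = context[result["start"]:result["end"]]
print(span == result["answer"])
```

Keeping the offsets lets you highlight the answer in the original text instead of relying on the decoded string.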

You can also manually replicate the results of the pipeline if you'd like:

PyTorch

Tokenize the text and return PyTorch tensors:

Python

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("my_awesome_qa_model")
inputs = tokenizer(question, context, return_tensors="pt")

Pass your inputs to the model and return the logits:

Python

import torch
from transformers import AutoModelForQuestionAnswering

model = AutoModelForQuestionAnswering.from_pretrained("my_awesome_qa_model")
with torch.no_grad():
    outputs = model(**inputs)

Take the position with the highest score in the model output for the start and for the end of the answer:

Python

answer_start_index = outputs.start_logits.argmax()
answer_end_index = outputs.end_logits.argmax()

Decode the predicted tokens to get the answer:

Python

predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
tokenizer.decode(predict_answer_tokens)

'176 billion parameters and can generate text in 46 languages natural languages and 13'

TensorFlow

Tokenize the text and return TensorFlow tensors:

Python

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("my_awesome_qa_model")
inputs = tokenizer(question, context, return_tensors="tf")

Pass your inputs to the model and return the logits:

Python

from transformers import TFAutoModelForQuestionAnswering

model = TFAutoModelForQuestionAnswering.from_pretrained("my_awesome_qa_model")
outputs = model(**inputs)

Take the position with the highest score in the model output for the start and for the end of the answer:

Python

import tensorflow as tf

answer_start_index = int(tf.math.argmax(outputs.start_logits, axis=-1)[0])
answer_end_index = int(tf.math.argmax(outputs.end_logits, axis=-1)[0])

Decode the predicted tokens to get the answer:

Python

predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
tokenizer.decode(predict_answer_tokens)

'176 billion parameters and can generate text in 46 languages natural languages and 13'
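Both manual versions above take the argmax of the start and end logits independently, which can in principle yield an end position that comes before the start. One common post-processing approach, sketched here in plain Python, is to score every valid (start, end) pair and keep the best one. The best_span helper, the max_answer_len limit, and the logit values are all hypothetical, and the sketch ignores details such as excluding question and special tokens.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return the (start, end) pair with the highest summed logit,
    subject to start <= end and a maximum answer length in tokens."""
    best_score, best = float("-inf"), (0, 0)
    for start, start_logit in enumerate(start_logits):
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = start_logit + end_logits[end]
            if score > best_score:
                best_score, best = score, (start, end)
    return best

# Made-up logits for a five-token input.
start_logits = [0.1, 0.2, 4.0, 0.3, 4.5]
end_logits = [0.0, 5.0, 0.5, 4.0, 0.2]

# Independent argmax picks start=4 and end=1, which is not a valid span.
naive_start = max(range(len(start_logits)), key=start_logits.__getitem__)
naive_end = max(range(len(end_logits)), key=end_logits.__getitem__)

span = best_span(start_logits, end_logits)
print((naive_start, naive_end), span)
```

With real model outputs you would convert the logits tensor to a list first, and then slice input_ids with the returned indices as in the decoding step above.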
