A Summary of Methods for Evaluating LLM Output

Last updated at 2023-12-18. Posted at 2023-12-18.

This article is the Day 19 entry of the BrainPad Advent Calendar 2023.

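Introduction

In recent years, a steady stream of LLMs has been released, starting with OpenAI's GPT-4 Turbo, and their performance improves day by day. For personal use, I find these LLMs extremely convenient. In a business setting, however, when an LLM has to be built into a system or application as a component, evaluating its output becomes necessary.
There are two broad approaches to evaluating LLM output: using conventional metrics such as F1-score, ROUGE, and BLEU, and using an LLM itself as the evaluation metric, a technique called llm-as-a-judge (which is expected to deliver evaluations comparable to human judgment in an automated way).
Here, with mlflow as a reference, I summarize conventional metrics and LLM-based metrics under three headings: similarity, relevance, and safety.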

Note: This article organizes my own understanding based on the mlflow repository and reference materials.
Please let me know if you find any mistakes.

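Evaluation methods

The evaluation methods for the following three items are introduced in turn.

#	Evaluation item	Overview
1	Similarity	Evaluates how similar the LLM output is to the ground-truth data
2	Relevance	Evaluates how relevant the LLM output (answer) is to the input (question) and the context (additional information)
3	Safety	Evaluates the toxicity of the LLM output (for example, hate speech)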

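1. Similarity

This item is useful when you want to evaluate how similar the LLM output is to the ground-truth data.

Using embeddings

Generate embedding vectors for the LLM output and for the ground-truth data with an LLM, then compute their cosine similarity to measure how similar the two are.
A sentence transformer can also be used instead of an LLM to produce the embeddings.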
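
As a rough illustration of the cosine-similarity step, the following minimal sketch compares two vectors in plain Python. The three-dimensional vectors are toy values standing in for real embeddings (an assumption; actual embeddings from an LLM or a sentence transformer have hundreds of dimensions), and returning 0.0 for a zero vector is a convention chosen here, not something taken from mlflow.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings (assumption).
output_embedding = [0.2, 0.7, 0.1]
reference_embedding = [0.25, 0.65, 0.05]
print(cosine_similarity(output_embedding, reference_embedding))
```

A value close to 1 means the two vectors point in nearly the same direction, which is read as the output being semantically close to the ground truth.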

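Using conventional metrics

To measure the similarity between the LLM output and the ground-truth data, you can use ROUGE, BLEU, or F1-score (calculated from the word counts of the predicted answer and the ground-truth answer).
However, these metrics were designed for specific tasks such as summarization and translation, so they will not necessarily work well for every task.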
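
To make the word-based F1-score concrete, here is a minimal sketch of the token-overlap F1 commonly used for question answering. It is one common interpretation, not necessarily how mlflow computes the metric, and the function name `token_f1` is hypothetical. It splits on whitespace, so Japanese text would need a proper tokenizer first.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-overlap F1 between a predicted answer and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection: each shared token counts at most as often
    # as it appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("MLflow is open source", "MLflow is a product"))
```

Because the score depends only on shared surface tokens, a correct paraphrase with different wording scores low, which is one reason these metrics do not suit every task.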

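2. Relevance

This item uses an LLM to evaluate how relevant the LLM output (answer) is to the input (question) and to the context.
The following kinds of evaluation are available.

Relevance of the LLM output to the input
Relevance of the LLM output to the context
Relevance of the LLM output to both the input and the context

Relevance of the LLM output to the input

This evaluates how relevant the model-generated output is to the input.
The judge rates relevance on a 5-point scale, with the lowest and highest scores defined as follows.

- Score 1: The output doesn't mention anything about the question or is completely irrelevant to the input.
- Score 5: The output addresses all aspects of the question and all parts of the output are meaningful and relevant to the question.

The prompt is as follows.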

Task:
You must return the following fields in your response in two lines, one below the other:
score: Your numerical score for the model's answer_relevance based on the rubric
justification: Your reasoning about the model's answer_relevance score

You are an impartial judge. You will be given an input that was sent to a machine
learning model, and you will be given an output that the model produced. You
may also be given additional information that was used by the model to generate the output.

Your task is to determine a numerical score called answer_relevance based on the input and output.
A definition of answer_relevance and a grading rubric are provided below.
You must use the grading rubric to determine your score. You must also justify your score.

Examples could be included below for reference. Make sure to use them as references and to
understand them before completing the task.

Input:
{input}

Output:
{output}

{grading_context_columns}

Metric definition:
Answer relevance measures the appropriateness and applicability of the output with respect to the input. Scores should reflect the extent to which the output directly addresses the question provided in the input, and give lower scores for incomplete or redundant output.

Grading rubric:
Answer relevance: Please give a score from 1-5 based on the degree of relevance to the input, where the lowest and highest scores are defined as follows:
- Score 1: The output doesn't mention anything about the question or is completely irrelevant to the input.
- Score 5: The output addresses all aspects of the question and all parts of the output are meaningful and relevant to the question.

Examples:

Example Input:
How is MLflow related to Databricks?

Example Output:
Databricks is a company that specializes in big data and machine learning solutions.



Example score: 2
Example justification: The output provided by the model does give some information about Databricks, which is part of the input question. However, it does not address the main point of the question, which is the relationship between MLflow and Databricks. Therefore, while the output is not completely irrelevant, it does not fully answer the question, leading to a lower score.
        

Example Input:
How is MLflow related to Databricks?

Example Output:
MLflow is a product created by Databricks to enhance the efficiency of machine learning processes.



Example score: 5
Example justification: The output directly addresses the input question by explaining the relationship between MLflow and Databricks. It provides a clear and concise answer that MLflow is a product created by Databricks, and also adds relevant information about the purpose of MLflow, which is to enhance the efficiency of machine learning processes. Therefore, the output is highly relevant to the input and deserves a full score.
        

You must return the following fields in your response in two lines, one below the other:
score: Your numerical score for the model's answer_relevance based on the rubric
justification: Your reasoning about the model's answer_relevance score

Do not add additional new lines. Do not add any other fields.
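
The prompt above asks the judge to reply with exactly two fields, `score` and `justification`. The following minimal sketch shows one way such a reply could be parsed; the function name `parse_judge_response` and the choice to return `None` for a missing or out-of-range score are assumptions made for illustration, not mlflow's actual parsing logic.

```python
import re


def parse_judge_response(text):
    """Extract (score, justification) from a 'score: / justification:' reply."""
    score_match = re.search(
        r"^\s*score:\s*(\d+)", text, flags=re.MULTILINE | re.IGNORECASE
    )
    justification_match = re.search(
        r"^\s*justification:\s*(.*)",
        text,
        flags=re.MULTILINE | re.IGNORECASE | re.DOTALL,
    )
    score = int(score_match.group(1)) if score_match else None
    if score is not None and not 1 <= score <= 5:
        score = None  # outside the 1-5 rubric, treat as unparseable
    justification = (
        justification_match.group(1).strip() if justification_match else None
    )
    return score, justification


reply = "score: 2\njustification: The output only partly answers the question."
print(parse_judge_response(reply))
```

Returning `None` instead of raising lets a caller count and inspect malformed judge replies rather than failing the whole evaluation run.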

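Relevance of the LLM output to the context

This evaluates how faithful the model-generated output is to the provided context.
The judge rates faithfulness on a 5-point scale, with the scores defined as follows.

- Score 1: None of the claims in the output can be inferred from the provided context.
- Score 2: Some of the claims in the output can be inferred from the provided context, but the majority of the output is missing from, inconsistent with, or contradictory to the provided context.
- Score 3: Half or more of the claims in the output can be inferred from the provided context.
- Score 4: Most of the claims in the output can be inferred from the provided context, with very little information that is not directly supported by the provided context.
- Score 5: All of the claims in the output are directly supported by the provided context, demonstrating high faithfulness to the provided context.

The prompt is as follows.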

Task:
You must return the following fields in your response in two lines, one below the other:
score: Your numerical score for the model's faithfulness based on the rubric
justification: Your reasoning about the model's faithfulness score

You are an impartial judge. You will be given an input that was sent to a machine
learning model, and you will be given an output that the model produced. You
may also be given additional information that was used by the model to generate the output.

Your task is to determine a numerical score called faithfulness based on the input and output.
A definition of faithfulness and a grading rubric are provided below.
You must use the grading rubric to determine your score. You must also justify your score.

Examples could be included below for reference. Make sure to use them as references and to
understand them before completing the task.

Input:
{input}

Output:
{output}

{grading_context_columns}

Metric definition:
Faithfulness is only evaluated with the provided output and provided context, please ignore the provided input entirely when scoring faithfulness. Faithfulness assesses how much of the provided output is factually consistent with the provided context. A higher score indicates that a higher proportion of claims present in the output can be derived from the provided context. Faithfulness does not consider how much extra information from the context is not present in the output.

Grading rubric:
Faithfulness: Below are the details for different scores:
- Score 1: None of the claims in the output can be inferred from the provided context.
- Score 2: Some of the claims in the output can be inferred from the provided context, but the majority of the output is missing from, inconsistent with, or contradictory to the provided context.
- Score 3: Half or more of the claims in the output can be inferred from the provided context.
- Score 4: Most of the claims in the output can be inferred from the provided context, with very little information that is not directly supported by the provided context.
- Score 5: All of the claims in the output are directly supported by the provided context, demonstrating high faithfulness to the provided context.

Examples:

Example Input:
How is MLflow related to Databricks?

Example Output:
Databricks is a company that specializes in big data and machine learning solutions. MLflow has nothing to do with Databricks. MLflow is an open-source platform for managing the end-to-end machine learning (ML) lifecycle.

Additional information used by the model:
key: context
value:
MLflow is an open-source platform for managing the end-to-end machine learning (ML) lifecycle. It was developed by Databricks, a company that specializes in big data and machine learning solutions. MLflow is designed to address the challenges that data scientists and machine learning engineers face when developing, training, and deploying machine learning models.

Example score: 2
Example justification: The output claims that "MLflow has nothing to do with Databricks" which is contradictory to the provided context that states "It was developed by Databricks". This is a major inconsistency. However, the output correctly identifies that "MLflow is an open-source platform for managing the end-to-end machine learning (ML) lifecycle" and "Databricks is a company that specializes in big data and machine learning solutions", which are both supported by the context. Therefore, some of the claims in the output can be inferred from the provided context, but the majority of the output is inconsistent with the provided context, leading to a faithfulness score of 2.
        

Example Input:
How is MLflow related to Databricks?

Example Output:
Databricks is a company that specializes in big data and machine learning solutions.

Additional information used by the model:
key: context
value:
MLflow is an open-source platform for managing the end-to-end machine learning (ML) lifecycle. It was developed by Databricks, a company that specializes in big data and machine learning solutions. MLflow is designed to address the challenges that data scientists and machine learning engineers face when developing, training, and deploying machine learning models.

Example score: 5
Example justification: The output states that "Databricks is a company that specializes in big data and machine learning solutions." This claim is directly supported by the context, which states "It was developed by Databricks, a company that specializes in big data and machine learning solutions." Therefore, the faithfulness score is 5 as all the claims in the output are directly supported by the provided context."
        

You must return the following fields in your response in two lines, one below the other:
score: Your numerical score for the model's faithfulness based on the rubric
justification: Your reasoning about the model's faithfulness score

Do not add additional new lines. Do not add any other fields.
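
In the examples above, the context appears under "Additional information used by the model:" as a `key:` / `value:` pair, which is presumably what fills the `{grading_context_columns}` placeholder. The following minimal sketch reproduces that layout as inferred from the examples; the function names and the shortened template are hypothetical, and mlflow's own rendering may differ.

```python
def format_grading_context(columns):
    """Render context columns in the layout seen in the prompt examples."""
    if not columns:
        return ""
    parts = ["Additional information used by the model:"]
    for key, value in columns.items():
        parts.append(f"key: {key}\nvalue:\n{value}")
    return "\n".join(parts)


# Shortened stand-in for the middle part of the judge prompt (assumption).
TEMPLATE = "Input:\n{input}\n\nOutput:\n{output}\n\n{grading_context_columns}"


def fill_prompt(question, answer, columns):
    """Fill the placeholders with one evaluation record."""
    return TEMPLATE.format(
        input=question,
        output=answer,
        grading_context_columns=format_grading_context(columns),
    )


print(
    fill_prompt(
        "How is MLflow related to Databricks?",
        "MLflow was developed by Databricks.",
        {"context": "MLflow is an open-source platform. It was developed by Databricks."},
    )
)
```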

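Relevance of the LLM output to both the input and the context

This evaluates how relevant the model-generated output is to both the input and the context.
The judge rates relevance on a 5-point scale, with the scores defined as follows.

- Score 1: The output doesn't mention anything about the question or is completely irrelevant to the provided context.
- Score 2: The output provides some relevance to the question and is somehow related to the provided context.
- Score 3: The output mostly answers the question and is largely consistent with the provided context.
- Score 4: The output answers the question and is consistent with the provided context.
- Score 5: The output answers the question comprehensively using the provided context.

The prompt is as follows.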


Task:
You must return the following fields in your response in two lines, one below the other:
score: Your numerical score for the model's relevance based on the rubric
justification: Your reasoning about the model's relevance score

You are an impartial judge. You will be given an input that was sent to a machine
learning model, and you will be given an output that the model produced. You
may also be given additional information that was used by the model to generate the output.

Your task is to determine a numerical score called relevance based on the input and output.
A definition of relevance and a grading rubric are provided below.
You must use the grading rubric to determine your score. You must also justify your score.

Examples could be included below for reference. Make sure to use them as references and to
understand them before completing the task.

Input:
{input}

Output:
{output}

{grading_context_columns}

Metric definition:
Relevance encompasses the appropriateness, significance, and applicability of the output with respect to both the input and context. Scores should reflect the extent to which the output directly addresses the question provided in the input, given the provided context.

Grading rubric:
Relevance: Below are the details for different scores:
- Score 1: The output doesn't mention anything about the question or is completely irrelevant to the provided context.
- Score 2: The output provides some relevance to the question and is somehow related to the provided context.
- Score 3: The output mostly answers the question and is largely consistent with the provided context.
- Score 4: The output answers the question and is consistent with the provided context.
- Score 5: The output answers the question comprehensively using the provided context.

Examples:

Example Input:
How is MLflow related to Databricks?

Example Output:
Databricks is a data engineering and analytics platform designed to help organizations process and analyze large amounts of data. Databricks is a company specializing in big data and machine learning solutions.

Additional information used by the model:
key: context
value:
MLflow is an open-source platform for managing the end-to-end machine learning (ML) lifecycle. It was developed by Databricks, a company that specializes in big data and machine learning solutions. MLflow is designed to address the challenges that data scientists and machine learning engineers face when developing, training, and deploying machine learning models.

Example score: 2
Example justification: The output provides relevant information about Databricks, mentioning it as a company specializing in big data and machine learning solutions. However, it doesn't directly address how MLflow is related to Databricks, which is the specific question asked in the input. Therefore, the output is only somewhat related to the provided context.
        

Example Input:
How is MLflow related to Databricks?

Example Output:
MLflow is a product created by Databricks to enhance the efficiency of machine learning processes.

Additional information used by the model:
key: context
value:
MLflow is an open-source platform for managing the end-to-end machine learning (ML) lifecycle. It was developed by Databricks, a company that specializes in big data and machine learning solutions. MLflow is designed to address the challenges that data scientists and machine learning engineers face when developing, training, and deploying machine learning models.

Example score: 4
Example justification: The output provides a relevant and accurate statement about the relationship between MLflow and Databricks. While it doesn't provide extensive detail, it still offers a substantial and meaningful response. To achieve a score of 5, the response could be further improved by providing additional context or details about how MLflow specifically functions within the Databricks ecosystem.
        

You must return the following fields in your response in two lines, one below the other:
score: Your numerical score for the model's relevance based on the rubric
justification: Your reasoning about the model's relevance score

Do not add additional new lines. Do not add any other fields.

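3. Safety

mlflow determines whether the LLM output contains "abusive speech targeting characteristics of a specific group, such as ethnic origin, religion, gender, or sexual orientation" through binary classification with a model hosted on Hugging Face.
The default model is "roberta-hate-speech-dynabench-r4".
Note: to use this on Japanese text, you need to switch to a model that supports Japanese.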
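
The binary decision can be pictured as thresholding a per-text score from the classifier. The following minimal sketch assumes the classifier has already produced a toxicity score between 0 and 1 for each output; the scores, the 0.5 threshold, and the function names are all hypothetical and are not taken from mlflow.

```python
def flag_toxic(scores, threshold=0.5):
    """Turn per-text toxicity scores (0.0 to 1.0) into binary flags."""
    return [score >= threshold for score in scores]


def toxic_ratio(scores, threshold=0.5):
    """Fraction of outputs flagged as toxic; 0.0 for an empty list."""
    flags = flag_toxic(scores, threshold)
    return sum(flags) / len(flags) if flags else 0.0


# Hypothetical classifier scores for three LLM outputs (assumption).
hypothetical_scores = [0.02, 0.91, 0.40]
print(flag_toxic(hypothetical_scores))
print(toxic_ratio(hypothetical_scores))
```

Lowering the threshold flags more outputs at the cost of more false positives, so the value is worth tuning on examples from the target application.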

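Summary

llm-as-a-judge looks set to be a necessary technique when building an LLM into a system or application.
However, because llm-as-a-judge is still maturing, it is likely better not to rely on it alone: combining it with conventional metrics such as F1-score and ROUGE should give a more appropriate evaluation.
In addition, the prompts used inside mlflow were labeled v1, so further updates can be expected.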

References
