

Posted at 2023-12-18

This article is the Day 19 entry in the BrainPad Advent Calendar 2023.


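There are two broad ways to evaluate LLM output. One uses conventional metrics such as F1-score, ROUGE, and BLEU. The other, called LLM-as-a-judge, uses an LLM itself as the evaluation metric and is expected to deliver, mechanically, evaluations comparable to those made by humans.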
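As a point of comparison for the conventional metrics, here is a minimal sketch of a token-overlap F1 score. The function name `token_f1`, the whitespace tokenization, and the lowercasing are all assumptions made for illustration; the article does not specify which F1 variant it has in mind.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a prediction and a reference.

    Assumption: tokens are lowercase, whitespace-separated words.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a perfect match; only one empty counts as no match.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection: each shared token counts min(count_pred, count_ref) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("MLflow was developed by Databricks",
               "MLflow is a product created by Databricks"))
```

A metric like this only rewards surface overlap, so a correct paraphrase can score low. That limitation is one reason to consider LLM-as-a-judge.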




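| # | Evaluation item | Overview |
|---|---|---|
| 1 | Similarity | Evaluates how similar the LLM output is to the reference (ground-truth) data |
| 2 | Relevance | Evaluates how relevant the LLM output (answer) is to the input (question) and the context (additional information) |
| 3 | Safety | Evaluates the toxicity of the LLM output (for example, hate speech) |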

1. Similarity



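To measure similarity, generate an embedding vector with an LLM for both the LLM output and the reference data, then compute the cosine similarity between the two vectors.
Instead of an LLM, you can also use sentence transformer to produce the embeddings.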
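The cosine-similarity step can be sketched as follows. This is only an illustration: the three-dimensional vectors are toy values standing in for real embeddings, and the variable names are hypothetical. In practice the vectors would come from an embedding model and have hundreds or thousands of dimensions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors, in the range [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors (assumed values, not real embeddings).
output_vec = [0.2, 0.7, 0.1]       # embedding of the LLM output
reference_vec = [0.25, 0.65, 0.05]  # embedding of the reference answer
unrelated_vec = [0.9, -0.1, 0.4]    # embedding of an unrelated text

print(cosine_similarity(output_vec, reference_vec))  # close to 1
print(cosine_similarity(output_vec, unrelated_vec))  # noticeably lower
```

Values near 1 indicate that the two texts point in nearly the same direction in embedding space. Where to set a pass/fail threshold depends on the embedding model and is best calibrated on your own data.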



2. Relevance


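  • Relevance of the LLM output to the input (answer_relevance)
  • Relevance of the LLM output to the context (faithfulness)
  • Relevance of the LLM output to both the input and the context (relevance)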



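- Score 1: The output says nothing about the question, or is completely unrelated to the input.
- Score 5: The output addresses every aspect of the question, and every part of the output is meaningful and relevant to the question.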


You must return the following fields in your response in two lines, one below the other:
score: Your numerical score for the model's answer_relevance based on the rubric
justification: Your reasoning about the model's answer_relevance score

You are an impartial judge. You will be given an input that was sent to a machine
learning model, and you will be given an output that the model produced. You
may also be given additional information that was used by the model to generate the output.

Your task is to determine a numerical score called answer_relevance based on the input and output.
A definition of answer_relevance and a grading rubric are provided below.
You must use the grading rubric to determine your score. You must also justify your score.

Examples could be included below for reference. Make sure to use them as references and to
understand them before completing the task.




Metric definition:
Answer relevance measures the appropriateness and applicability of the output with respect to the input. Scores should reflect the extent to which the output directly addresses the question provided in the input, and give lower scores for incomplete or redundant output.

Grading rubric:
Answer relevance: Please give a score from 1-5 based on the degree of relevance to the input, where the lowest and highest scores are defined as follows:
- Score 1: The output doesn't mention anything about the question or is completely irrelevant to the input.
- Score 5: The output addresses all aspects of the question and all parts of the output are meaningful and relevant to the question.


Example Input:
How is MLflow related to Databricks?

Example Output:
Databricks is a company that specializes in big data and machine learning solutions.

Example score: 2
Example justification: The output provided by the model does give some information about Databricks, which is part of the input question. However, it does not address the main point of the question, which is the relationship between MLflow and Databricks. Therefore, while the output is not completely irrelevant, it does not fully answer the question, leading to a lower score.

Example Input:
How is MLflow related to Databricks?

Example Output:
MLflow is a product created by Databricks to enhance the efficiency of machine learning processes.

Example score: 5
Example justification: The output directly addresses the input question by explaining the relationship between MLflow and Databricks. It provides a clear and concise answer that MLflow is a product created by Databricks, and also adds relevant information about the purpose of MLflow, which is to enhance the efficiency of machine learning processes. Therefore, the output is highly relevant to the input and deserves a full score.

You must return the following fields in your response in two lines, one below the other:
score: Your numerical score for the model's answer_relevance based on the rubric
justification: Your reasoning about the model's answer_relevance score

Do not add additional new lines. Do not add any other fields.
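Because the prompt asks the judge to reply with exactly two fields, `score` and `justification`, the reply can be parsed mechanically. The following is a minimal sketch of such a parser. `JudgeResult` and `parse_judge_response` are hypothetical names, and the leniency choices (case-insensitive field names, rejecting scores outside 1 to 5) are assumptions rather than the behavior of any particular library.

```python
import re
from typing import NamedTuple, Optional


class JudgeResult(NamedTuple):
    score: Optional[int]
    justification: Optional[str]


def parse_judge_response(text: str) -> JudgeResult:
    """Extract the `score:` and `justification:` fields from a judge reply."""
    score_match = re.search(
        r"^\s*score:\s*(\d+)", text, flags=re.IGNORECASE | re.MULTILINE
    )
    # DOTALL lets the justification run to the end of the reply,
    # in case the judge wraps it over several lines.
    just_match = re.search(
        r"^\s*justification:\s*(.+)",
        text,
        flags=re.IGNORECASE | re.MULTILINE | re.DOTALL,
    )
    score = int(score_match.group(1)) if score_match else None
    if score is not None and not 1 <= score <= 5:
        score = None  # outside the 1-5 rubric, so treat it as unparseable
    justification = just_match.group(1).strip() if just_match else None
    return JudgeResult(score, justification)


reply = "score: 2\njustification: The output describes Databricks but not its relation to MLflow."
print(parse_judge_response(reply))
```

Returning `None` instead of raising keeps a batch evaluation running when a single reply is malformed; the caller can count and inspect the failures afterwards.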



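- Score 1: None of the claims in the output can be inferred from the provided context.
- Score 2: Some of the claims in the output can be inferred from the provided context, but most of the output is missing from, inconsistent with, or contradictory to the provided context.
- Score 3: Half or more of the claims in the output can be inferred from the provided context.
- Score 4: Most of the claims in the output can be inferred from the provided context.
- Score 5: All of the claims in the output are directly supported by the provided context, and the output is faithful to that context.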


You must return the following fields in your response in two lines, one below the other:
score: Your numerical score for the model's faithfulness based on the rubric
justification: Your reasoning about the model's faithfulness score

You are an impartial judge. You will be given an input that was sent to a machine
learning model, and you will be given an output that the model produced. You
may also be given additional information that was used by the model to generate the output.

Your task is to determine a numerical score called faithfulness based on the input and output.
A definition of faithfulness and a grading rubric are provided below.
You must use the grading rubric to determine your score. You must also justify your score.

Examples could be included below for reference. Make sure to use them as references and to
understand them before completing the task.




Metric definition:
Faithfulness is only evaluated with the provided output and provided context, please ignore the provided input entirely when scoring faithfulness. Faithfulness assesses how much of the provided output is factually consistent with the provided context. A higher score indicates that a higher proportion of claims present in the output can be derived from the provided context. Faithfulness does not consider how much extra information from the context is not present in the output.

Grading rubric:
Faithfulness: Below are the details for different scores:
- Score 1: None of the claims in the output can be inferred from the provided context.
- Score 2: Some of the claims in the output can be inferred from the provided context, but the majority of the output is missing from, inconsistent with, or contradictory to the provided context.
- Score 3: Half or more of the claims in the output can be inferred from the provided context.
- Score 4: Most of the claims in the output can be inferred from the provided context, with very little information that is not directly supported by the provided context.
- Score 5: All of the claims in the output are directly supported by the provided context, demonstrating high faithfulness to the provided context.


Example Input:
How is MLflow related to Databricks?

Example Output:
Databricks is a company that specializes in big data and machine learning solutions. MLflow has nothing to do with Databricks. MLflow is an open-source platform for managing the end-to-end machine learning (ML) lifecycle.

Additional information used by the model:
key: context
MLflow is an open-source platform for managing the end-to-end machine learning (ML) lifecycle. It was developed by Databricks, a company that specializes in big data and machine learning solutions. MLflow is designed to address the challenges that data scientists and machine learning engineers face when developing, training, and deploying machine learning models.

Example score: 2
Example justification: The output claims that "MLflow has nothing to do with Databricks" which is contradictory to the provided context that states "It was developed by Databricks". This is a major inconsistency. However, the output correctly identifies that "MLflow is an open-source platform for managing the end-to-end machine learning (ML) lifecycle" and "Databricks is a company that specializes in big data and machine learning solutions", which are both supported by the context. Therefore, some of the claims in the output can be inferred from the provided context, but the majority of the output is inconsistent with the provided context, leading to a faithfulness score of 2.

Example Input:
How is MLflow related to Databricks?

Example Output:
Databricks is a company that specializes in big data and machine learning solutions.

Additional information used by the model:
key: context
MLflow is an open-source platform for managing the end-to-end machine learning (ML) lifecycle. It was developed by Databricks, a company that specializes in big data and machine learning solutions. MLflow is designed to address the challenges that data scientists and machine learning engineers face when developing, training, and deploying machine learning models.

Example score: 5
Example justification: The output states that "Databricks is a company that specializes in big data and machine learning solutions." This claim is directly supported by the context, which states "It was developed by Databricks, a company that specializes in big data and machine learning solutions." Therefore, the faithfulness score is 5 as all the claims in the output are directly supported by the provided context."

You must return the following fields in your response in two lines, one below the other:
score: Your numerical score for the model's faithfulness based on the rubric
justification: Your reasoning about the model's faithfulness score

Do not add additional new lines. Do not add any other fields.



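- Score 1: The output does not mention the question at all, or is completely irrelevant to the provided context.
- Score 2: The output has some relevance to the question and is somewhat related to the provided context.
- Score 3: The output mostly answers the question and is largely consistent with the provided context.
- Score 4: The output answers the question and is consistent with the provided context.
- Score 5: The output answers the question comprehensively using the provided context.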


You must return the following fields in your response in two lines, one below the other:
score: Your numerical score for the model's relevance based on the rubric
justification: Your reasoning about the model's relevance score

You are an impartial judge. You will be given an input that was sent to a machine
learning model, and you will be given an output that the model produced. You
may also be given additional information that was used by the model to generate the output.

Your task is to determine a numerical score called relevance based on the input and output.
A definition of relevance and a grading rubric are provided below.
You must use the grading rubric to determine your score. You must also justify your score.

Examples could be included below for reference. Make sure to use them as references and to
understand them before completing the task.




Metric definition:
Relevance encompasses the appropriateness, significance, and applicability of the output with respect to both the input and context. Scores should reflect the extent to which the output directly addresses the question provided in the input, given the provided context.

Grading rubric:
Relevance: Below are the details for different scores:
- Score 1: The output doesn't mention anything about the question or is completely irrelevant to the provided context.
- Score 2: The output provides some relevance to the question and is somehow related to the provided context.
- Score 3: The output mostly answers the question and is largely consistent with the provided context.
- Score 4: The output answers the question and is consistent with the provided context.
- Score 5: The output answers the question comprehensively using the provided context.


Example Input:
How is MLflow related to Databricks?

Example Output:
Databricks is a data engineering and analytics platform designed to help organizations process and analyze large amounts of data. Databricks is a company specializing in big data and machine learning solutions.

Additional information used by the model:
key: context
MLflow is an open-source platform for managing the end-to-end machine learning (ML) lifecycle. It was developed by Databricks, a company that specializes in big data and machine learning solutions. MLflow is designed to address the challenges that data scientists and machine learning engineers face when developing, training, and deploying machine learning models.

Example score: 2
Example justification: The output provides relevant information about Databricks, mentioning it as a company specializing in big data and machine learning solutions. However, it doesn't directly address how MLflow is related to Databricks, which is the specific question asked in the input. Therefore, the output is only somewhat related to the provided context.

Example Input:
How is MLflow related to Databricks?

Example Output:
MLflow is a product created by Databricks to enhance the efficiency of machine learning processes.

Additional information used by the model:
key: context
MLflow is an open-source platform for managing the end-to-end machine learning (ML) lifecycle. It was developed by Databricks, a company that specializes in big data and machine learning solutions. MLflow is designed to address the challenges that data scientists and machine learning engineers face when developing, training, and deploying machine learning models.

Example score: 4
Example justification: The output provides a relevant and accurate statement about the relationship between MLflow and Databricks. While it doesn't provide extensive detail, it still offers a substantial and meaningful response. To achieve a score of 5, the response could be further improved by providing additional context or details about how MLflow specifically functions within the Databricks ecosystem.

You must return the following fields in your response in two lines, one below the other:
score: Your numerical score for the model's relevance based on the rubric
justification: Your reasoning about the model's relevance score

Do not add additional new lines. Do not add any other fields.
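The examples in the prompts above all share one layout: an input, an output, and optionally an "Additional information used by the model" section with a `key: context` entry. A minimal sketch of a helper that renders a record in that layout is shown below. `format_example` is a hypothetical name, and the exact spacing is an assumption modeled on the examples quoted in this article, not a reproduction of any library's internal template.

```python
from typing import Mapping, Optional


def format_example(
    input_text: str,
    output_text: str,
    extra: Optional[Mapping[str, str]] = None,
) -> str:
    """Render one record in the layout used by the prompt examples."""
    parts = [f"Input:\n{input_text}", f"Output:\n{output_text}"]
    if extra:
        lines = ["Additional information used by the model:"]
        for key, value in extra.items():
            # Each extra field is written as a `key:` line followed by its text.
            lines.append(f"key: {key}")
            lines.append(value)
        parts.append("\n".join(lines))
    # Sections are separated by one blank line.
    return "\n\n".join(parts)


print(format_example(
    "How is MLflow related to Databricks?",
    "MLflow is a product created by Databricks.",
    {"context": "MLflow is an open-source platform. It was developed by Databricks."},
))
```

For an input-only metric such as answer_relevance, `extra` can be omitted. For faithfulness and relevance, the context is passed in so that the judge can compare the output against it.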

3. Safety





