More than 1 year has passed since last update.

ビジネスエンジニアリング株式会社（B-EN-G）Advent Calendar 2023

Ragasのソースコードを読んで評価指標を理解する

Last updated at 2024-06-03Posted at 2023-12-03

はじめに

LLM技術の代表的なユースケースの1つに RAG (Retrieval Augmented Generation) があります。
LangChainやLlamaIndexなどを利用して手軽にRAGを構築できるようになった一方で、
その性能を定量的に計測する手段はまだ確立されていないように感じます。

今回は、RAGの性能を計測するフレームワーク「Ragas」で使われている評価指標を学んで、RAGのパフォーマンス計測のヒントを探してみます。

対象読者

LLMの出力結果を定量的に評価する方法に興味がある人
LlamaIndexやLangChainなどでRAGパイプラインを構築したことのある人

本記事のゴール

Ragasの概要について理解する。
Ragasで利用されている評価指標について理解する。
RAGの評価方法についてのヒントを得る。

引用元のソースコード

Ragasとは

RAGパイプラインの性能を定量的に評価するためのフレームワークです (RAG Assessment)。
RAGのメトリクス駆動開発のスタンダードを確立することを目的として創設されたそうです。
今回は、評価指標によるパフォーマンス計測に焦点を当てて解説します。

できること

評価指標によるパフォーマンス計測
テストデータの生成
検証環境のモニタリング

Ragasが提供する評価指標

Ragasでは、以下の4種類の情報のいくつかを入力として、評価値を計算します。

question: RAGに入力された質問
contexts: 外部知識ソースから取得したコンテキスト
answer: questionとcontextsを元にしてLLMが生成した回答
ground_truths: 教師データ (質問に対する理想的な回答)

評価指標	評価基準	入力	評価対象
Faithfulness	コンテキストに基づいて回答しているか	question, contexts, answer	生成モデル
Answer Relevancy	質問に対して簡潔かつ適切に回答しているか	question, answer	生成モデル
Context Precision	コンテキストを正確に取得できているか	question, contexts	検索モデル
Context Relevancy	コンテキストと質問に関連性があるか	question, contexts	検索モデル
Context Recall	教師データからコンテキストをどの程度再現できるか	contexts, ground_truths	検索モデル
Answer Semantic Similarity	回答が教師データとどの程度類似しているか	answer, ground_truths	End to End
Answer Correctness	回答がどの程度正確か	answer, ground_truths	End to End
Aspect Critique	回答が特定の品質基準を満たしているか	question, context, answer	生成モデル

ほとんどの評価指標で、評価値を算出する前にLLMへの問い合わせを行っています。
続いて、各指標がどのような方法で評価値を算出しているかについて、Ragasのソースコードに記載されているプロンプトを引用しながら説明します。

Faithfulness (信頼性)

生成された回答が、どの程度コンテキストに基づいているかを計測します。

1. 「LONG_FORM_ANSWER_PROMPT」を利用して、質問と回答を元にトピックとなる文章を複数個生成します。

ragas/metric/_faithfulness.py

LONG_FORM_ANSWER_PROMPT = HumanMessagePromptTemplate.from_template(
    """\
Given a question and answer, create one or more statements from each sentence in the given answer.
question: Who was  Albert Einstein and what is he best known for?
answer: He was a German-born theoretical physicist, widely acknowledged to be one of the greatest and most influential physicists of all time. He was best known for developing the theory of relativity, he also made important contributions to the development of the theory of quantum mechanics.
statements:\nAlbert Einstein was born in Germany.\nAlbert Einstein was best known for his theory of relativity.
question: Cadmium Chloride is slightly soluble in this chemical, it is also called what?
answer: alcohol
statements:\nCadmium Chloride is slightly soluble in alcohol.
question: Were Shahul and Jithin of the same nationality?
answer: They were from different countries.
statements:\nShahul and Jithin were from different countries.
question:{question}
answer: {answer}
statements:\n"""  # noqa: E501
)

2. 「NLI_STATEMENTS_MESSAGE」を利用して、1で取得した各トピックがコンテキストに含まれているか判定します。

ragas/metric/_faithfulness.py

NLI_STATEMENTS_MESSAGE = HumanMessagePromptTemplate.from_template(
    """
Prompt: Natural language inference
Consider the given context and following statements, then determine whether they are supported by the information present in the context.Provide a brief explanation for each statement before arriving at the verdict (Yes/No). Provide a final verdict for each statement in order at the end in the given format. Do not deviate from the specified format.

Context:\nJohn is a student at XYZ University. He is pursuing a degree in Computer Science. He is enrolled in several courses this semester, including Data Structures, Algorithms, and Database Management. John is a diligent student and spends a significant amount of time studying and completing assignments. He often stays late in the library to work on his projects.
statements:\n1. John is majoring in Biology.\n2. John is taking a course on Artificial Intelligence.\n3. John is a dedicated student.\n4. John has a part-time job.\n5. John is interested in computer programming.\n
Answer:
1. John is majoring in Biology.
Explanation: John's major is explicitly mentioned as Computer Science. There is no information suggesting he is majoring in Biology.  Verdict: No.
2. John is taking a course on Artificial Intelligence.
Explanation: The context mentions the courses John is currently enrolled in, and Artificial Intelligence is not mentioned. Therefore, it cannot be deduced that John is taking a course on AI. Verdict: No.
3. John is a dedicated student.
Explanation: The prompt states that he spends a significant amount of time studying and completing assignments. Additionally, it mentions that he often stays late in the library to work on his projects, which implies dedication. Verdict: Yes.
4. John has a part-time job.
Explanation: There is no information given in the context about John having a part-time job. Therefore, it cannot be deduced that John has a part-time job.  Verdict: No.
5. John is interested in computer programming.
Explanation: The context states that John is pursuing a degree in Computer Science, which implies an interest in computer programming. Verdict: Yes.
Final verdict for each statement in order: No. No. Yes. No. Yes.
context:\n{context}
statements:\n{statements}
Answer:
"""  # noqa: E501
)

3. 2で出力された判定結果(Final verdict) を抽出して、評価値を算出します。

Final verdict for each statement in order: No. No. Yes. No. Yes.

上記の例では、5つのトピックのうち3つがコンテキストに含まれているので、評価値は0.6になります。

Answer Relevancy (回答の関連性)

質問に対して簡潔かつ適切に回答しているかを計測します。

1. 「QUESTION_GEN」を利用して、回答からいくつかの質問を生成します。

ragas/metric/_answer_relevance.py

QUESTION_GEN = HumanMessagePromptTemplate.from_template(
    """
Generate question for the given answer.
Answer:\nThe PSLV-C56 mission is scheduled to be launched on Sunday, 30 July 2023 at 06:30 IST / 01:00 UTC. It will be launched from the Satish Dhawan Space Centre, Sriharikota, Andhra Pradesh, India 
Question: When is the scheduled launch date and time for the PSLV-C56 mission, and where will it be launched from?

Answer:{answer}
Question:
"""  # noqa: E501
)

2. 1で生成した質問と元の質問をベクトル化して、コサイン類似度を算出します。
　このコサイン類似度の平均値が、Answer Relevancyの評価値となります。
　このため、不完全な回答や冗長な情報が含まれているとスコアが下がる傾向にあります。

Context Precision (コンテキストの精度)

コンテキストを正確に取得できているかを計測します。

1. 「CONTEXT_PRECISION」を利用して、取得したコンテキストが回答作成の役に立つか判定します。

ragas/metric/_comtext_precition.py

CONTEXT_PRECISION = HumanMessagePromptTemplate.from_template(
    """\
Given a question and a context, verify if the information in the given context is useful in answering the question. Return a Yes/No answer.
question:{question}
context:\n{context}
answer:
"""  # noqa: E501
)

2. 1でYesと判定されたコンテキストの数を計測して、評価値を算出します。
　有用なコンテキストの割合が多ければ多いほど、スコアが高くなります。

Context Relevancy (コンテキストの関連性)

コンテキストと質問の関連性を計測します。

1. 「CONTEXT_RELEVANCE」を利用して、コンテキストから回答作成に必要な文章を抽出します。

ragas/metric/_context_relevancy.py

CONTEXT_RELEVANCE = HumanMessagePromptTemplate.from_template(
    """\
Please extract relevant sentences from the provided context that is absolutely required answer the following question. If no relevant sentences are found, or if you believe the question cannot be answered from the given context, return the phrase "Insufficient Information".  While extracting candidate sentences you're not allowed to make any changes to sentences from given context.

question:{question}
context:\n{context}
candidate sentences:\n"""  # noqa: E501
)

2. 1で抽出した文章の数を、コンテキスト全体の文章の数で割った値を評価値とします。

Context Recall (コンテキストの再現率)

教師データからコンテキストをどの程度再現できるかを計測します。

1. 「CONTEXT_RECALL_RA」を利用して、教師データからコンテキストを想起できるか判定します。
　教師データ内の文章毎に判定し、その文章がコンテキストに関連している場合は[Attributed]、関連していない場合は[Not Attributed]とします。

ragas/metric/_context_recall.py

CONTEXT_RECALL_RA = HumanMessagePromptTemplate.from_template(
    """
Given a context, and an answer, analyze each sentence in the answer and classify if the sentence can be attributed to the given context or not.
Think in steps and reason before coming to conclusion. 

context: Albert Einstein (14 March 1879 – 18 April 1955) was a German-born theoretical physicist,widely held to be one of the greatest and most influential scientists of all time. Best known for developing the theory of relativity, he also made important contributions to quantum mechanics, and was thus a central figure in the revolutionary reshaping of the scientific understanding of nature that modern physics accomplished in the first decades of the twentieth century. His mass–energy equivalence formula E = mc2, which arises from relativity theory, has been called "the world's most famous equation". He received the 1921 Nobel Prize in Physics "for his services to theoretical physics, and especially for his discovery of the law of the photoelectric effect", a pivotal step in the development of quantum theory. His work is also known for its influence on the philosophy of science. In a 1999 poll of 130 leading physicists worldwide by the British journal Physics World, Einstein was ranked the greatest physicist of all time. His intellectual achievements and originality have made Einstein synonymous with genius.
answer: Albert Einstein born in 14 March 1879 was  German-born theoretical physicist, widely held to be one of the greatest and most influential scientists of all time. He received the 1921 Nobel Prize in Physics "for his services to theoretical physics. He published 4 papers in 1905.  Einstein moved to Switzerland in 1895 
classification
1. Albert Einstein born in 14 March 1879 was  German-born theoretical physicist, widely held to be one of the greatest and most influential scientists of all time. The date of birth of Einstein is mentioned clearly in the context. So [Attributed]
2. He received the 1921 Nobel Prize in Physics "for his services to theoretical physics. The exact sentence is present in the given context. So [Attributed]
3. He published 4 papers in 1905. There is no mention about papers he wrote in given the context. So [Not Attributed]
4. Einstein moved to Switzerland in 1895. There is not supporting evidence for this in the given the context. So [Not Attributed]

context:{context}
answer:{ground_truth}
classification:
"""  # noqa: E501
)

2. 1で[Attributed]と判定された文章の数を、教師データ全体の文章の数で割った値を評価値とします。

Answer Semantic Similarity (回答の意味的類似性)

教師データと回答がどの程度類似しているかを計測します。
ここではプロンプトを用いた問い合わせは行わず、クロスエンコーダモデルを利用してベクトル同士の比較を行います。
クロスエンコーダモデルを指定しない場合は、各ベクトルの内積を算出します。

Answer Correctness (回答の正確性)

回答の正確性を計測します。
プロンプトを用いた問い合わせは行わず、既出の評価指標を組み合わせて利用します。

1. 教師データをコンテキストとして、質問と元の回答とのFaithfulnessを計算します。
2. 教師データと元の回答のAnswer Semantic Similarityを計算します。
3. 1と2の加重平均を評価値とします。

Aspect Critique (アスペクト批評)

回答が特定の品質基準を満たしているかを判定します。
デフォルトで用意されている品質基準は以下の通りです(独自の基準を提起することもできます)。

　有害性（Harmfulness）:
　　基準: 「提出物が個人、グループ、または社会全体に害を及ぼす可能性があるか」
　　回答が潜在的に有害または危険な情報を含むかどうかを評価します。

　悪意（Maliciousness）:
　　基準: 「提出物がユーザーに害を及ぼす、欺く、または利用する意図があるか」
　　回答が故意に悪意のある内容を含むかどうかを判断します。

　一貫性（Coherence）:
　　基準: 「提出物が論理的かつ組織的にアイデア、情報、または議論を提示しているか」
　　回答に一貫性があり、論理的に整理されているかを評価します。

　正確性（Correctness）:
　　基準: 「提出物が事実に基づいており、誤りがないか」。
　　回答が事実的に正確で、誤りがないかどうかを判断します。

　簡潔性（Conciseness）:
　　基準: 「提出物が情報やアイデアを明確かつ効率的に伝えており、不要または冗長な詳細がないか」
　　回答が簡潔で、必要以上の情報を含まないかを評価します。

1. 「CRITIQUE_PROMPT」を利用して、選択した基準で判定を行います。
　{input}には質問、{submission}には回答、{criteria}には指定した基準が設定されます。

ragas/metric/critique.py

CRITIQUE_PROMPT = HumanMessagePromptTemplate.from_template(
    """Given a input and submission. Evaluate the submission only using the given criteria. 
Think step by step providing reasoning and arrive at a conclusion at the end by generating a Yes or No verdict at the end.

input: Who was the director of Los Alamos Laboratory?
submission: Einstein was the director of  Los Alamos Laboratory.
criteria: Is the output written in perfect grammar
Here's are my thoughts: the criteria for evaluation is whether the output is written in perfect grammar. In this case, the output is grammatically correct. Therefore, the answer is:\n\nYes

input:{input}
submission:{submission}
criteria:{criteria}
Here's are my thoughts:
"""  # noqa: E501
)

2. 1の出力結果がYesの場合は1、Noの場合は0を出力します。

まとめ

Ragasとは？

RAGパイプラインの性能を定量的に評価するためのフレームワークです。

どんな評価指標がある？

質問、コンテキスト、回答、教師データを入力とする8種類の評価指標があります。
- コンポーネント(生成モデル、検索モデル)を評価する指標と、パイプライン全体(E2E)を評価する指標が用意されています。
LLMを利用して入力を計測可能な数値データに変換して評価する指標と、Embeddingを利用して評価する指標が用意されています。

調べてみて分かったこと

LLMを活用して、文章を数値で計測可能な単位に変換すれば定量的に比較できる。
- これまで人間が手を動かしてやっていた仕事を、LLMに任せられるようになった。
- プロンプトでは、Few-shot promptingやchain-of-thought prompting を活用している。
  　
まずは、何を評価すべきかを切り分けて考えると良いかも？
- プロンプトやLLMを変更する場合は、生成モデルを評価できる指標で影響確認する。
- ドキュメントやEmbeddingを変更する場合は、検索モデルを評価できる指標で影響確認する。
- 教師データを準備できる場合は、End-to-Endで評価して理想とのギャップを確認する。

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up