Introduction
When you want to understand how an LLM application behaves, looking at the prompts actually fed to the LLM is a shortcut.
In this post we look at the prompts behind the following RAGAS metrics:
- faithfulness
- answer relevancy
- answer correctness
- context precision
- context recall
Basics
As a rule, RAGAS prompts are implemented as classes that inherit from PydanticPrompt.
PydanticPrompt has input_model, output_model, instruction, and examples as attributes.
It receives InputModel and OutputModel as generic type parameters, which are used in the type definitions of everything except instruction. InputModel and OutputModel must be classes that inherit from Pydantic's BaseModel.
class PydanticPrompt(BasePrompt, t.Generic[InputModel, OutputModel]):
# these are class attributes
input_model: t.Type[InputModel]
output_model: t.Type[OutputModel]
instruction: str
examples: t.List[t.Tuple[InputModel, OutputModel]] = []
PydanticPrompt's to_string method returns the final string that will be fed to the LLM.
Roughly speaking, the prompt is the instruction, the output format, the examples, and the input, concatenated in that order.
class PydanticPrompt(BasePrompt, t.Generic[InputModel, OutputModel]):
...
def to_string(self, data: t.Optional[InputModel] = None) -> str:
return (
f"{self.instruction}\n"
+ self._generate_output_signature()
+ "\n"
+ self._generate_examples()
+ "\n-----------------------------\n"
+ "\nNow perform the same with the following input\n"
+ (
"input: " + data.model_dump_json(indent=4, exclude_none=True) + "\n"
if data is not None
else "Input: (None)\n"
)
+ "Output: "
)
The class that manages each metric's prompt inherits from PydanticPrompt.
At class-definition time it specifies the input/output types, the instruction, and the examples.
class StatementGeneratorPrompt(
PydanticPrompt[StatementGeneratorInput, StatementGeneratorOutput]
):
instruction = "Given a question and an answer, analyze the complexity of each sentence in the answer. Break down each sentence into one or more fully understandable statements. Ensure that no pronouns are used in any statement. Format the outputs in JSON."
input_model = StatementGeneratorInput
output_model = StatementGeneratorOutput
examples = [
...
faithfulness
faithfulness evaluates whether the generated answer is grounded in the given context.
It consists of two stages:
- statement generation
- statement verification
We will look at each prompt template in turn.
Statement generation
The relevant code is below.
class StatementGeneratorInput(BaseModel):
question: str = Field(description="The question to answer")
answer: str = Field(description="The answer to the question")
class StatementGeneratorOutput(BaseModel):
statements: t.List[str] = Field(description="The generated statements")
class StatementGeneratorPrompt(
PydanticPrompt[StatementGeneratorInput, StatementGeneratorOutput]
):
instruction = "Given a question and an answer, analyze the complexity of each sentence in the answer. Break down each sentence into one or more fully understandable statements. Ensure that no pronouns are used in any statement. Format the outputs in JSON."
input_model = StatementGeneratorInput
output_model = StatementGeneratorOutput
examples = [
(
StatementGeneratorInput(
question="Who was Albert Einstein and what is he best known for?",
answer="He was a German-born theoretical physicist, widely acknowledged to be one of the greatest and most influential physicists of all time. He was best known for developing the theory of relativity, he also made important contributions to the development of the theory of quantum mechanics.",
),
StatementGeneratorOutput(
statements=[
"Albert Einstein was a German-born theoretical physicist.",
"Albert Einstein is recognized as one of the greatest and most influential physicists of all time.",
"Albert Einstein was best known for developing the theory of relativity.",
"Albert Einstein also made important contributions to the development of the theory of quantum mechanics.",
]
),
)
]
Calling to_string shows the concrete prompt.
# example matching the v0.3-series class names
from ragas.metrics._faithfulness import StatementGeneratorPrompt
print(StatementGeneratorPrompt().to_string())
Given a question and an answer, analyze the complexity of each sentence in the answer. Break down each sentence into one or more fully understandable statements. Ensure that no pronouns are used in any statement. Format the outputs in JSON.
Please return the output in a JSON format that complies with the following schema as specified in JSON Schema:
{"properties": {"statements": {"description": "The generated statements", "items": {"type": "string"}, "title": "Statements", "type": "array"}}, "required": ["statements"], "title": "StatementGeneratorOutput", "type": "object"}Do not use single quotes in your response but double quotes,properly escaped with a backslash.
--------EXAMPLES-----------
Example 1
Input: {
"question": "Who was Albert Einstein and what is he best known for?",
"answer": "He was a German-born theoretical physicist, widely acknowledged to be one of the greatest and most influential physicists of all time. He was best known for developing the theory of relativity, he also made important contributions to the development of the theory of quantum mechanics."
}
Output: {
"statements": [
"Albert Einstein was a German-born theoretical physicist.",
"Albert Einstein is recognized as one of the greatest and most influential physicists of all time.",
"Albert Einstein was best known for developing the theory of relativity.",
"Albert Einstein also made important contributions to the development of the theory of quantum mechanics."
]
}
-----------------------------
Now perform the same with the following input
Input: (None)
Output:
Paraphrasing the template: given a question and an answer, analyze the complexity of each sentence in the answer, break each sentence into one or more fully self-contained statements, make sure no statement uses a pronoun, and return the output as JSON conforming to the given schema (a list of strings), using double quotes rather than single quotes, properly escaped. Then come the examples, a separator, "Now perform the same with the following input", "Input: (None)", and "Output:".
From the user's point of view, you supply a question and an answer, which are converted into a JSON string of the form
{
"question": QUESTION,
"answer": ANSWER
}
and embedded into the "Input: (None)" slot shown above.
The output is a JSON string containing a list of strings.
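As a side note, you can render the fully assembled prompt for a concrete input by passing an input model to to_string. A minimal sketch, reusing the classes above with a made-up question/answer pair:
# a sketch: render the full prompt for a concrete (hypothetical) input;
# to_string(data) serializes the input model into the slot that otherwise
# reads "Input: (None)".
from ragas.metrics._faithfulness import (
    StatementGeneratorInput,
    StatementGeneratorPrompt,
)

data = StatementGeneratorInput(
    question="Who wrote Hamlet?",
    answer="It was written by William Shakespeare.",
)
print(StatementGeneratorPrompt().to_string(data))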
The instruction boils down to:
- split the answer into sentences
- break each sentence into statements
- make each statement fully self-contained, with no pronouns
The instruction looks broadly applicable, but depending on the target domain there is room for customization.
When you want to customize, knowing the concrete prompt template makes it much easier to experiment in a playground; one possible route is sketched below.
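A sketch of one way to do it: subclass the prompt class, override its class attributes, and print the rendered template to paste into a playground (the override text here is hypothetical):
# a sketch of one customization route: override instruction on a subclass
# and inspect the rendered template before wiring it into a metric.
from ragas.metrics._faithfulness import StatementGeneratorPrompt

class DomainStatementGeneratorPrompt(StatementGeneratorPrompt):
    # hypothetical domain-specific wording; adjust for your own use case
    instruction = (
        "Given a question and an answer, break the answer into fully "
        "self-contained statements. Do not use pronouns. Keep product "
        "names exactly as written. Format the outputs in JSON."
    )

print(DomainStatementGeneratorPrompt().to_string())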
For example, set the tail of the English template as follows and look at the output.
Now perform the same with the following input
Input: {"question": "水は100度で氷りますか?", "answer": "いいえ"}
Output:
{
"statements": [
"水は100度で氷りません。"
]
}
Looking at the answer alone, "いいえ" ("No") carries no information, but the model correctly supplements the meaning from the question ("Does water freeze at 100 degrees?") and generates a statement that stands on its own as a short sentence.
Statement verification
Statement verification judges whether each generated statement is grounded in the context.
The relevant code is below.
At first glance the class name NLIStatementPrompt seems unrelated to faithfulness, but it is presumably named for the NLI task of judging the relationship between a context and a statement.
class StatementFaithfulnessAnswer(BaseModel):
statement: str = Field(..., description="the original statement, word-by-word")
reason: str = Field(..., description="the reason of the verdict")
verdict: int = Field(..., description="the verdict(0/1) of the faithfulness.")
class NLIStatementOutput(BaseModel):
statements: t.List[StatementFaithfulnessAnswer]
class NLIStatementInput(BaseModel):
context: str = Field(..., description="The context of the question")
statements: t.List[str] = Field(..., description="The statements to judge")
class NLIStatementPrompt(PydanticPrompt[NLIStatementInput, NLIStatementOutput]):
instruction = "Your task is to judge the faithfulness of a series of statements based on a given context. For each statement you must return verdict as 1 if the statement can be directly inferred based on the context or 0 if the statement can not be directly inferred based on the context."
input_model = NLIStatementInput
output_model = NLIStatementOutput
examples = [
(
NLIStatementInput(
context="""John is a student at XYZ University. He is pursuing a degree in Computer Science. He is enrolled in several courses this semester, including Data Structures, Algorithms, and Database Management. John is a diligent student and spends a significant amount of time studying and completing assignments. He often stays late in the library to work on his projects.""",
statements=[
"John is majoring in Biology.",
"John is taking a course on Artificial Intelligence.",
"John is a dedicated student.",
"John has a part-time job.",
],
),
NLIStatementOutput(
statements=[
StatementFaithfulnessAnswer(
statement="John is majoring in Biology.",
reason="John's major is explicitly mentioned as Computer Science. There is no information suggesting he is majoring in Biology.",
verdict=0,
),
StatementFaithfulnessAnswer(
statement="John is taking a course on Artificial Intelligence.",
reason="The context mentions the courses John is currently enrolled in, and Artificial Intelligence is not mentioned. Therefore, it cannot be deduced that John is taking a course on AI.",
verdict=0,
),
StatementFaithfulnessAnswer(
statement="John is a dedicated student.",
reason="The context states that he spends a significant amount of time studying and completing assignments. Additionally, it mentions that he often stays late in the library to work on his projects, which implies dedication.",
verdict=1,
),
StatementFaithfulnessAnswer(
statement="John has a part-time job.",
reason="There is no information given in the context about John having a part-time job.",
verdict=0,
),
]
),
),
(
NLIStatementInput(
context="Photosynthesis is a process used by plants, algae, and certain bacteria to convert light energy into chemical energy.",
statements=[
"Albert Einstein was a genius.",
],
),
NLIStatementOutput(
statements=[
StatementFaithfulnessAnswer(
statement="Albert Einstein was a genius.",
reason="The context and statement are unrelated",
verdict=0,
)
]
),
),
]
Let's call to_string().
from ragas.metrics._faithfulness import NLIStatementPrompt
prompt = NLIStatementPrompt()
print(prompt.to_string())
Your task is to judge the faithfulness of a series of statements based on a given context. For each statement you must return verdict as 1 if the statement can be directly inferred based on the context or 0 if the statement can not be directly inferred based on the context.
Please return the output in a JSON format that complies with the following schema as specified in JSON Schema:
{"$defs": {"StatementFaithfulnessAnswer": {"properties": {"statement": {"description": "the original statement, word-by-word", "title": "Statement", "type": "string"}, "reason": {"description": "the reason of the verdict", "title": "Reason", "type": "string"}, "verdict": {"description": "the verdict(0/1) of the faithfulness.", "title": "Verdict", "type": "integer"}}, "required": ["statement", "reason", "verdict"], "title": "StatementFaithfulnessAnswer", "type": "object"}}, "properties": {"statements": {"items": {"$ref": "#/$defs/StatementFaithfulnessAnswer"}, "title": "Statements", "type": "array"}}, "required": ["statements"], "title": "NLIStatementOutput", "type": "object"}Do not use single quotes in your response but double quotes,properly escaped with a backslash.
--------EXAMPLES-----------
Example 1
Input: {
"context": "John is a student at XYZ University. He is pursuing a degree in Computer Science. He is enrolled in several courses this semester, including Data Structures, Algorithms, and Database Management. John is a diligent student and spends a significant amount of time studying and completing assignments. He often stays late in the library to work on his projects.",
"statements": [
"John is majoring in Biology.",
"John is taking a course on Artificial Intelligence.",
"John is a dedicated student.",
"John has a part-time job."
]
}
Output: {
"statements": [
{
"statement": "John is majoring in Biology.",
"reason": "John's major is explicitly mentioned as Computer Science. There is no information suggesting he is majoring in Biology.",
"verdict": 0
},
{
"statement": "John is taking a course on Artificial Intelligence.",
"reason": "The context mentions the courses John is currently enrolled in, and Artificial Intelligence is not mentioned. Therefore, it cannot be deduced that John is taking a course on AI.",
"verdict": 0
},
{
"statement": "John is a dedicated student.",
"reason": "The context states that he spends a significant amount of time studying and completing assignments. Additionally, it mentions that he often stays late in the library to work on his projects, which implies dedication.",
"verdict": 1
},
{
"statement": "John has a part-time job.",
"reason": "There is no information given in the context about John having a part-time job.",
"verdict": 0
}
]
}
Example 2
Input: {
"context": "Photosynthesis is a process used by plants, algae, and certain bacteria to convert light energy into chemical energy.",
"statements": [
"Albert Einstein was a genius."
]
}
Output: {
"statements": [
{
"statement": "Albert Einstein was a genius.",
"reason": "The context and statement are unrelated",
"verdict": 0
}
]
}
-----------------------------
Now perform the same with the following input
Input: (None)
Output:
In short, the instruction asks the model to judge the faithfulness of a series of statements against the given context, returning a verdict of 1 if a statement can be directly inferred from the context and 0 if it cannot.
At call time you supply statements (a list of strings) and context (a string). The context is essentially the retrieval results; if multiple results were retrieved, they need to be joined into a single string, as sketched below.
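A minimal sketch of preparing that input (the context strings here are made up; NLIStatementInput takes a single string, so multiple retrieval results are joined first):
# a sketch: join several retrieved contexts into the single context string
# that NLIStatementInput expects.
from ragas.metrics._faithfulness import NLIStatementInput, NLIStatementPrompt

retrieved_contexts = [
    "Mount Fuji is the highest mountain in Japan.",
    "Mount Everest is the highest mountain in the world.",
]
data = NLIStatementInput(
    context="\n".join(retrieved_contexts),
    statements=["The highest mountain in Japan is Mount Fuji."],
)
print(NLIStatementPrompt().to_string(data))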
They are embedded into the prompt roughly like this:
Input: {
"context": CONTEXT,
"statements": [
STATEMENT_1,
STATEMENT_2,
...
]
}
The output is a list of StatementFaithfulnessAnswer objects, shown below.
Each contains the statement, a reason, and a verdict. Since the reason is generated before the verdict, this doubles as a light form of chain of thought.
class StatementFaithfulnessAnswer(BaseModel):
statement: str = Field(..., description="the original statement, word-by-word")
reason: str = Field(..., description="the reason of the verdict")
verdict: int = Field(..., description="the verdict(0/1) of the faithfulness.")
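As a rough sketch of what happens downstream (the gist of the aggregation, not RAGAS' literal code), the raw JSON reply can be parsed with the output model, and the score is the fraction of statements judged as supported:
# a sketch (not RAGAS' literal code): parse the reply with the output
# model, then score = fraction of statements with verdict == 1.
from ragas.metrics._faithfulness import NLIStatementOutput

raw = (
    '{"statements": ['
    '{"statement": "s1", "reason": "supported", "verdict": 1},'
    '{"statement": "s2", "reason": "not supported", "verdict": 0}]}'
)
parsed = NLIStatementOutput.model_validate_json(raw)  # Pydantic v2 API
score = sum(s.verdict for s in parsed.statements) / len(parsed.statements)
print(score)  # 0.5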
As a test, let's give the playground the following input (a Japanese context saying that Mount Fuji is the highest mountain in Japan and Everest the highest in the world, with two Japanese statements to judge).
Now perform the same with the following input
Input: {
    "context": "富士山は日本一高い\nエベレストは世界一高い",
    "statements": [
        "日本一の山は富士山です",
        "中国1の山はエベレストです"
    ]
}
Output:
{
"statements": [
{
"statement": "日本一の山は富士山です",
"reason": "The context states that 富士山 (Mount Fuji) is the tallest in Japan (日本一高い), which directly supports this statement.",
"verdict": 1
},
{
"statement": "中国1の山はエベレストです",
"reason": "The context mentions エベレスト (Mount Everest) as the tallest in the world (世界一高い) but does not specify its relation to China or that it is the tallest in China. Therefore, this cannot be directly inferred.",
"verdict": 0
}
]
}
answer relevancy
answer relevancy literally means "relevance of the answer", but looking at the source code, it is really a metric for judging whether the answer is noncommittal, as in "I don't know".
The relevant code is below. The metric and class names waver between answer relevance, answer relevancy, and response relevance, which hints at some lingering indecision.
class ResponseRelevanceOutput(BaseModel):
question: str
noncommittal: int
class ResponseRelevanceInput(BaseModel):
response: str
class ResponseRelevancePrompt(
PydanticPrompt[ResponseRelevanceInput, ResponseRelevanceOutput]
):
instruction = """Generate a question for the given answer and Identify if answer is noncommittal. Give noncommittal as 1 if the answer is noncommittal and 0 if the answer is committal. A noncommittal answer is one that is evasive, vague, or ambiguous. For example, "I don't know" or "I'm not sure" are noncommittal answers"""
input_model = ResponseRelevanceInput
output_model = ResponseRelevanceOutput
examples = [
(
ResponseRelevanceInput(
response="""Albert Einstein was born in Germany.""",
),
ResponseRelevanceOutput(
question="Where was Albert Einstein born?",
noncommittal=0,
),
),
(
ResponseRelevanceInput(
response="""I don't know about the groundbreaking feature of the smartphone invented in 2023 as am unaware of information beyond 2022. """,
),
ResponseRelevanceOutput(
question="What was the groundbreaking feature of the smartphone invented in 2023?",
noncommittal=1,
),
),
]
Let's call to_string().
from ragas.metrics._answer_relevance import ResponseRelevancePrompt
prompt = ResponseRelevancePrompt()
print(prompt.to_string())
Generate a question for the given answer and Identify if answer is noncommittal. Give noncommittal as 1 if the answer is noncommittal and 0 if the answer is committal. A noncommittal answer is one that is evasive, vague, or ambiguous. For example, "I don't know" or "I'm not sure" are noncommittal answers
Please return the output in a JSON format that complies with the following schema as specified in JSON Schema:
{"properties": {"question": {"title": "Question", "type": "string"}, "noncommittal": {"title": "Noncommittal", "type": "integer"}}, "required": ["question", "noncommittal"], "title": "ResponseRelevanceOutput", "type": "object"}Do not use single quotes in your response but double quotes,properly escaped with a backslash.
--------EXAMPLES-----------
Example 1
Input: {
"response": "Albert Einstein was born in Germany."
}
Output: {
"question": "Where was Albert Einstein born?",
"noncommittal": 0
}
Example 2
Input: {
"response": "I don't know about the groundbreaking feature of the smartphone invented in 2023 as am unaware of information beyond 2022. "
}
Output: {
"question": "What was the groundbreaking feature of the smartphone invented in 2023?",
"noncommittal": 1
}
-----------------------------
Now perform the same with the following input
Input: (None)
Output:
In short, the instruction asks the model to generate a question for the given answer and to flag whether the answer is noncommittal (evasive, vague, or ambiguous, e.g. "I don't know"): 1 if noncommittal, 0 if committal.
Looking at the prompt alone, it is not obvious why a question needs to be generated at all (given the explanation "A noncommittal answer is one that is evasive, vague, or ambiguous. For example, "I don't know" or "I'm not sure" are noncommittal answers", the model could simply look at the response and output 0/1 directly). In fact, this question is generated several times per answer, each generation is compared to the true question via cosine similarity, and the similarities are averaged into the final score.
class ResponseRelevancy(MetricWithLLM, MetricWithEmbeddings, SingleTurnMetric):
...
def _calculate_score(
self, answers: t.Sequence[ResponseRelevanceOutput], row: t.Dict
) -> float:
question = row["user_input"]
gen_questions = [answer.question for answer in answers]
all_noncommittal = np.all([answer.noncommittal for answer in answers])
if all(q == "" for q in gen_questions):
logger.warning(
"Invalid JSON response. Expected dictionary with key 'question'"
)
score = np.nan
else:
cosine_sim = self.calculate_similarity(question, gen_questions)
score = cosine_sim.mean() * int(not all_noncommittal)
return score
The intuition is that the vaguer the answer, the more diverse the generated questions, and hence the lower their similarity to the true question; the metric thus captures the answer's ambiguity. However, for an answer like "I don't know whether Mount Fuji is the highest mountain in Japan", the generated questions would be nearly constant ("Is Mount Fuji the highest mountain in Japan?"), so the similarity to the true question would still be high. This is why, when every noncommittal flag emitted alongside the generated questions marks the answer as noncommittal, the score is forced to 0 regardless of the cosine similarity (the score = cosine_sim.mean() * int(not all_noncommittal) part). Personally I suspect that judging committal/noncommittal once over the whole answer would be more stable than judging it on every question generation, but that would mean maintaining two prompts, so perhaps RAGAS decided it was not worth the implementation hassle.
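A toy numeric walk-through of the scoring line above, with made-up cosine similarities for three generated questions:
import numpy as np

cosine_sim = np.array([0.91, 0.88, 0.90])  # similarity to the true question
noncommittal_flags = [0, 0, 1]             # per-generation noncommittal flags
all_noncommittal = np.all(noncommittal_flags)  # False: not every flag is set
score = cosine_sim.mean() * int(not all_noncommittal)
print(score)  # ~0.897; it would be 0.0 if every flag were 1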
answer correctness
answer correctness measures how closely the answer matches the ground truth.
The relevant code is below.
The class name CorrectnessClassifier is not especially intuitive, but it presumably comes from the class's role of classifying the answer and ground-truth statements into TP, FP, and FN.
class QuestionAnswerGroundTruth(BaseModel):
question: str
answer: list[str]
ground_truth: list[str]
class StatementsWithReason(BaseModel):
statement: str
reason: str
class ClassificationWithReason(BaseModel):
TP: list[StatementsWithReason]
FP: list[StatementsWithReason]
FN: list[StatementsWithReason]
class CorrectnessClassifier(
PydanticPrompt[QuestionAnswerGroundTruth, ClassificationWithReason]
):
instruction = "Given a ground truth and an answer statements, analyze each statement and classify them in one of the following categories: TP (true positive): statements that are present in answer that are also directly supported by the one or more statements in ground truth, FP (false positive): statements present in the answer but not directly supported by any statement in ground truth, FN (false negative): statements found in the ground truth but not present in answer. Each statement can only belong to one of the categories. Provide a reason for each classification."
input_model = QuestionAnswerGroundTruth
output_model = ClassificationWithReason
examples = [
(
QuestionAnswerGroundTruth(
question="What powers the sun and what is its primary function?",
answer=[
"The sun is powered by nuclear fission, similar to nuclear reactors on Earth.",
"The primary function of the sun is to provide light to the solar system.",
],
ground_truth=[
"The sun is powered by nuclear fusion, where hydrogen atoms fuse to form helium.",
"This fusion process in the sun's core releases a tremendous amount of energy.",
"The energy from the sun provides heat and light, which are essential for life on Earth.",
"The sun's light plays a critical role in Earth's climate system.",
"Sunlight helps to drive the weather and ocean currents.",
],
),
ClassificationWithReason(
TP=[
StatementsWithReason(
statement="The primary function of the sun is to provide light to the solar system.",
reason="This statement is somewhat supported by the ground truth mentioning the sun providing light and its roles, though it focuses more broadly on the sun's energy.",
)
],
FP=[
StatementsWithReason(
statement="The sun is powered by nuclear fission, similar to nuclear reactors on Earth.",
reason="This statement is incorrect and contradicts the ground truth which states that the sun is powered by nuclear fusion.",
)
],
FN=[
StatementsWithReason(
statement="The sun is powered by nuclear fusion, where hydrogen atoms fuse to form helium.",
reason="This accurate description of the sun’s power source is not included in the answer.",
),
StatementsWithReason(
statement="This fusion process in the sun's core releases a tremendous amount of energy.",
reason="This process and its significance are not mentioned in the answer.",
),
StatementsWithReason(
statement="The energy from the sun provides heat and light, which are essential for life on Earth.",
reason="The answer only mentions light, omitting the essential aspects of heat and its necessity for life, which the ground truth covers.",
),
StatementsWithReason(
statement="The sun's light plays a critical role in Earth's climate system.",
reason="This broader impact of the sun’s light on Earth's climate system is not addressed in the answer.",
),
StatementsWithReason(
statement="Sunlight helps to drive the weather and ocean currents.",
reason="The effect of sunlight on weather patterns and ocean currents is omitted in the answer.",
),
],
),
),
(
QuestionAnswerGroundTruth(
question="What is the boiling point of water?",
answer=[
"The boiling point of water is 100 degrees Celsius at sea level"
],
ground_truth=[
"The boiling point of water is 100 degrees Celsius (212 degrees Fahrenheit) at sea level.",
"The boiling point of water can change with altitude.",
],
),
ClassificationWithReason(
TP=[
StatementsWithReason(
statement="The boiling point of water is 100 degrees Celsius at sea level",
reason="This statement is directly supported by the ground truth which specifies the boiling point of water as 100 degrees Celsius at sea level.",
)
],
FP=[],
FN=[
StatementsWithReason(
statement="The boiling point of water can change with altitude.",
reason="This additional information about how the boiling point of water can vary with altitude is not mentioned in the answer.",
)
],
),
),
]
Let's call to_string().
from ragas.metrics._answer_correctness import CorrectnessClassifier
prompt = CorrectnessClassifier()
print(prompt.to_string())
Given a ground truth and an answer statements, analyze each statement and classify them in one of the following categories: TP (true positive): statements that are present in answer that are also directly supported by the one or more statements in ground truth, FP (false positive): statements present in the answer but not directly supported by any statement in ground truth, FN (false negative): statements found in the ground truth but not present in answer. Each statement can only belong to one of the categories. Provide a reason for each classification.
Please return the output in a JSON format that complies with the following schema as specified in JSON Schema:
{"$defs": {"StatementsWithReason": {"properties": {"statement": {"title": "Statement", "type": "string"}, "reason": {"title": "Reason", "type": "string"}}, "required": ["statement", "reason"], "title": "StatementsWithReason", "type": "object"}}, "properties": {"TP": {"items": {"$ref": "#/$defs/StatementsWithReason"}, "title": "Tp", "type": "array"}, "FP": {"items": {"$ref": "#/$defs/StatementsWithReason"}, "title": "Fp", "type": "array"}, "FN": {"items": {"$ref": "#/$defs/StatementsWithReason"}, "title": "Fn", "type": "array"}}, "required": ["TP", "FP", "FN"], "title": "ClassificationWithReason", "type": "object"}Do not use single quotes in your response but double quotes,properly escaped with a backslash.
--------EXAMPLES-----------
Example 1
Input: {
"question": "What powers the sun and what is its primary function?",
"answer": [
"The sun is powered by nuclear fission, similar to nuclear reactors on Earth.",
"The primary function of the sun is to provide light to the solar system."
],
"ground_truth": [
"The sun is powered by nuclear fusion, where hydrogen atoms fuse to form helium.",
"This fusion process in the sun's core releases a tremendous amount of energy.",
"The energy from the sun provides heat and light, which are essential for life on Earth.",
"The sun's light plays a critical role in Earth's climate system.",
"Sunlight helps to drive the weather and ocean currents."
]
}
Output: {
"TP": [
{
"statement": "The primary function of the sun is to provide light to the solar system.",
"reason": "This statement is somewhat supported by the ground truth mentioning the sun providing light and its roles, though it focuses more broadly on the sun's energy."
}
],
"FP": [
{
"statement": "The sun is powered by nuclear fission, similar to nuclear reactors on Earth.",
"reason": "This statement is incorrect and contradicts the ground truth which states that the sun is powered by nuclear fusion."
}
],
"FN": [
{
"statement": "The sun is powered by nuclear fusion, where hydrogen atoms fuse to form helium.",
"reason": "This accurate description of the sun’s power source is not included in the answer."
},
{
"statement": "This fusion process in the sun's core releases a tremendous amount of energy.",
"reason": "This process and its significance are not mentioned in the answer."
},
{
"statement": "The energy from the sun provides heat and light, which are essential for life on Earth.",
"reason": "The answer only mentions light, omitting the essential aspects of heat and its necessity for life, which the ground truth covers."
},
{
"statement": "The sun's light plays a critical role in Earth's climate system.",
"reason": "This broader impact of the sun’s light on Earth's climate system is not addressed in the answer."
},
{
"statement": "Sunlight helps to drive the weather and ocean currents.",
"reason": "The effect of sunlight on weather patterns and ocean currents is omitted in the answer."
}
]
}
Example 2
Input: {
"question": "What is the boiling point of water?",
"answer": [
"The boiling point of water is 100 degrees Celsius at sea level"
],
"ground_truth": [
"The boiling point of water is 100 degrees Celsius (212 degrees Fahrenheit) at sea level.",
"The boiling point of water can change with altitude."
]
}
Output: {
"TP": [
{
"statement": "The boiling point of water is 100 degrees Celsius at sea level",
"reason": "This statement is directly supported by the ground truth which specifies the boiling point of water as 100 degrees Celsius at sea level."
}
],
"FP": [],
"FN": [
{
"statement": "The boiling point of water can change with altitude.",
"reason": "This additional information about how the boiling point of water can vary with altitude is not mentioned in the answer."
}
]
}
-----------------------------
Now perform the same with the following input
Input: (None)
Output:
In short, the instruction asks the model to classify each statement as TP (present in the answer and directly supported by one or more ground-truth statements), FP (present in the answer but not supported by any ground-truth statement), or FN (present in the ground truth but missing from the answer), with exactly one category per statement and a reason for each classification.
It is hard to take in on a first read, but what it does is:
- classify each answer statement as TP or FP by comparing it with the ground truth
- classify the ground-truth statements that do not appear in the answer as FN
In other words, it measures how close the answer as a whole is to the ground truth as a whole in terms of information content.
If you substitute the retrieval results for the ground truth, you get a metric somewhat similar to faithfulness; the only real difference is whether you care about redundancy in the ground truth (the retrieval results).
The inputs are question (string), answer (list of strings), and ground_truth (list of strings). They are embedded into the prompt in the following form:
"question": QUESTION,
"answer": [ANSWER_1, ANSWER_2, ...],
"ground_truth": [GROUNDTRUTH_1, GROUNDTRUTH_2, ...]
}
The output is a JSON string containing three lists: TP, FP, and FN.
Each element of TP, FP, and FN contains a statement and a reason.
These values are used downstream to compute the score, as sketched below.
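A sketch of that downstream computation (the gist, not RAGAS' literal code): the TP/FP/FN counts yield an F1-style factuality score, which is then blended with an embedding similarity between answer and ground truth; the weighting and the similarity value below are illustrative:
# a sketch: F1 over the classified statements, blended with a hypothetical
# semantic-similarity term (the 0.75/0.25 weights are illustrative).
tp, fp, fn = 1, 1, 5  # counts from Example 1 above
f1 = tp / (tp + 0.5 * (fp + fn))  # 1 / (1 + 3) = 0.25
semantic_similarity = 0.8  # hypothetical embedding similarity
score = 0.75 * f1 + 0.25 * semantic_similarity
print(score)  # 0.3875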
context precision
The relevant code is below.
class QAC(BaseModel):
question: str = Field(..., description="Question")
context: str = Field(..., description="Context")
answer: str = Field(..., description="Answer")
class Verification(BaseModel):
reason: str = Field(..., description="Reason for verification")
verdict: int = Field(..., description="Binary (0/1) verdict of verification")
class ContextPrecisionPrompt(PydanticPrompt[QAC, Verification]):
name: str = "context_precision"
instruction: str = 'Given question, answer and context verify if the context was useful in arriving at the given answer. Give verdict as "1" if useful and "0" if not with json output.'
input_model = QAC
output_model = Verification
examples = [
(
QAC(
question="What can you tell me about Albert Einstein?",
context="Albert Einstein (14 March 1879 – 18 April 1955) was a German-born theoretical physicist, widely held to be one of the greatest and most influential scientists of all time. Best known for developing the theory of relativity, he also made important contributions to quantum mechanics, and was thus a central figure in the revolutionary reshaping of the scientific understanding of nature that modern physics accomplished in the first decades of the twentieth century. His mass–energy equivalence formula E = mc2, which arises from relativity theory, has been called 'the world's most famous equation'. He received the 1921 Nobel Prize in Physics 'for his services to theoretical physics, and especially for his discovery of the law of the photoelectric effect', a pivotal step in the development of quantum theory. His work is also known for its influence on the philosophy of science. In a 1999 poll of 130 leading physicists worldwide by the British journal Physics World, Einstein was ranked the greatest physicist of all time. His intellectual achievements and originality have made Einstein synonymous with genius.",
answer="Albert Einstein, born on 14 March 1879, was a German-born theoretical physicist, widely held to be one of the greatest and most influential scientists of all time. He received the 1921 Nobel Prize in Physics for his services to theoretical physics.",
),
Verification(
reason="The provided context was indeed useful in arriving at the given answer. The context includes key information about Albert Einstein's life and contributions, which are reflected in the answer.",
verdict=1,
),
),
(
QAC(
question="who won 2020 icc world cup?",
context="The 2022 ICC Men's T20 World Cup, held from October 16 to November 13, 2022, in Australia, was the eighth edition of the tournament. Originally scheduled for 2020, it was postponed due to the COVID-19 pandemic. England emerged victorious, defeating Pakistan by five wickets in the final to clinch their second ICC Men's T20 World Cup title.",
answer="England",
),
Verification(
reason="the context was useful in clarifying the situation regarding the 2020 ICC World Cup and indicating that England was the winner of the tournament that was intended to be held in 2020 but actually took place in 2022.",
verdict=1,
),
),
(
QAC(
question="What is the tallest mountain in the world?",
context="The Andes is the longest continental mountain range in the world, located in South America. It stretches across seven countries and features many of the highest peaks in the Western Hemisphere. The range is known for its diverse ecosystems, including the high-altitude Andean Plateau and the Amazon rainforest.",
answer="Mount Everest.",
),
Verification(
reason="the provided context discusses the Andes mountain range, which, while impressive, does not include Mount Everest or directly relate to the question about the world's tallest mountain.",
verdict=0,
),
),
]
Let's call to_string().
from ragas.metrics._context_precision import ContextPrecisionPrompt
prompt = ContextPrecisionPrompt()
print(prompt.to_string())
Given question, answer and context verify if the context was useful in arriving at the given answer. Give verdict as "1" if useful and "0" if not with json output.
Please return the output in a JSON format that complies with the following schema as specified in JSON Schema:
{"properties": {"reason": {"description": "Reason for verification", "title": "Reason", "type": "string"}, "verdict": {"description": "Binary (0/1) verdict of verification", "title": "Verdict", "type": "integer"}}, "required": ["reason", "verdict"], "title": "Verification", "type": "object"}Do not use single quotes in your response but double quotes,properly escaped with a backslash.
--------EXAMPLES-----------
Example 1
Input: {
"question": "What can you tell me about Albert Einstein?",
"context": "Albert Einstein (14 March 1879 – 18 April 1955) was a German-born theoretical physicist, widely held to be one of the greatest and most influential scientists of all time. Best known for developing the theory of relativity, he also made important contributions to quantum mechanics, and was thus a central figure in the revolutionary reshaping of the scientific understanding of nature that modern physics accomplished in the first decades of the twentieth century. His mass–energy equivalence formula E = mc2, which arises from relativity theory, has been called 'the world's most famous equation'. He received the 1921 Nobel Prize in Physics 'for his services to theoretical physics, and especially for his discovery of the law of the photoelectric effect', a pivotal step in the development of quantum theory. His work is also known for its influence on the philosophy of science. In a 1999 poll of 130 leading physicists worldwide by the British journal Physics World, Einstein was ranked the greatest physicist of all time. His intellectual achievements and originality have made Einstein synonymous with genius.",
"answer": "Albert Einstein, born on 14 March 1879, was a German-born theoretical physicist, widely held to be one of the greatest and most influential scientists of all time. He received the 1921 Nobel Prize in Physics for his services to theoretical physics."
}
Output: {
"reason": "The provided context was indeed useful in arriving at the given answer. The context includes key information about Albert Einstein's life and contributions, which are reflected in the answer.",
"verdict": 1
}
Example 2
Input: {
"question": "who won 2020 icc world cup?",
"context": "The 2022 ICC Men's T20 World Cup, held from October 16 to November 13, 2022, in Australia, was the eighth edition of the tournament. Originally scheduled for 2020, it was postponed due to the COVID-19 pandemic. England emerged victorious, defeating Pakistan by five wickets in the final to clinch their second ICC Men's T20 World Cup title.",
"answer": "England"
}
Output: {
"reason": "the context was useful in clarifying the situation regarding the 2020 ICC World Cup and indicating that England was the winner of the tournament that was intended to be held in 2020 but actually took place in 2022.",
"verdict": 1
}
Example 3
Input: {
"question": "What is the tallest mountain in the world?",
"context": "The Andes is the longest continental mountain range in the world, located in South America. It stretches across seven countries and features many of the highest peaks in the Western Hemisphere. The range is known for its diverse ecosystems, including the high-altitude Andean Plateau and the Amazon rainforest.",
"answer": "Mount Everest."
}
Output: {
"reason": "the provided context discusses the Andes mountain range, which, while impressive, does not include Mount Everest or directly relate to the question about the world's tallest mountain.",
"verdict": 0
}
-----------------------------
Now perform the same with the following input
Input: (None)
Output:
In short, the instruction asks: given the question, answer, and context, verify whether the context was useful in arriving at the given answer, and output a verdict of 1 if useful and 0 if not, as JSON.
At call time you supply question (string), context (string), and answer (string).
From the instruction alone this looks similar to faithfulness, but the class that uses this prompt feeds the retrieved_contexts (retrieval results) to context one at a time and aggregates the per-context verdicts, so what is computed amounts to precision (how many of the results predicted as relevant actually were), as sketched below.
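A sketch of that aggregation (the gist, not RAGAS' literal code): with one 0/1 verdict per retrieved context in rank order, the verdicts are combined as an average of precision@k over the positions judged useful:
# a sketch: average of precision@k over the contexts with verdict == 1.
verdicts = [1, 0, 1]  # hypothetical verdicts, one per retrieved context

precision_at_k = [
    sum(verdicts[: k + 1]) / (k + 1)  # precision among the top k+1 contexts
    for k in range(len(verdicts))
]
relevant = sum(verdicts)
score = (
    sum(p * v for p, v in zip(precision_at_k, verdicts)) / relevant
    if relevant
    else float("nan")
)
print(score)  # (1.0 + 2/3) / 2 = 0.8333...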
context recall
The relevant code is below.
class QCA(BaseModel):
question: str
context: str
answer: str
class ContextRecallClassification(BaseModel):
statement: str
reason: str
attributed: int
class ContextRecallClassifications(BaseModel):
classifications: t.List[ContextRecallClassification]
class ContextRecallClassificationPrompt(
PydanticPrompt[QCA, ContextRecallClassifications]
):
name: str = "context_recall_classification"
instruction: str = "Given a context, and an answer, analyze each sentence in the answer and classify if the sentence can be attributed to the given context or not. Use only 'Yes' (1) or 'No' (0) as a binary classification. Output json with reason."
input_model = QCA
output_model = ContextRecallClassifications
examples = [
(
QCA(
question="What can you tell me about albert Albert Einstein?",
context="Albert Einstein (14 March 1879 - 18 April 1955) was a German-born theoretical physicist, widely held to be one of the greatest and most influential scientists of all time. Best known for developing the theory of relativity, he also made important contributions to quantum mechanics, and was thus a central figure in the revolutionary reshaping of the scientific understanding of nature that modern physics accomplished in the first decades of the twentieth century. His mass-energy equivalence formula E = mc2, which arises from relativity theory, has been called 'the world's most famous equation'. He received the 1921 Nobel Prize in Physics 'for his services to theoretical physics, and especially for his discovery of the law of the photoelectric effect', a pivotal step in the development of quantum theory. His work is also known for its influence on the philosophy of science. In a 1999 poll of 130 leading physicists worldwide by the British journal Physics World, Einstein was ranked the greatest physicist of all time. His intellectual achievements and originality have made Einstein synonymous with genius.",
answer="Albert Einstein, born on 14 March 1879, was a German-born theoretical physicist, widely held to be one of the greatest and most influential scientists of all time. He received the 1921 Nobel Prize in Physics for his services to theoretical physics. He published 4 papers in 1905. Einstein moved to Switzerland in 1895.",
),
ContextRecallClassifications(
classifications=[
ContextRecallClassification(
statement="Albert Einstein, born on 14 March 1879, was a German-born theoretical physicist, widely held to be one of the greatest and most influential scientists of all time.",
reason="The date of birth of Einstein is mentioned clearly in the context.",
attributed=1,
),
ContextRecallClassification(
statement="He received the 1921 Nobel Prize in Physics for his services to theoretical physics.",
reason="The exact sentence is present in the given context.",
attributed=1,
),
ContextRecallClassification(
statement="He published 4 papers in 1905.",
reason="There is no mention about papers he wrote in the given context.",
attributed=0,
),
ContextRecallClassification(
statement="Einstein moved to Switzerland in 1895.",
reason="There is no supporting evidence for this in the given context.",
attributed=0,
),
]
),
),
]
Let's call to_string().
from ragas.metrics._context_recall import ContextRecallClassificationPrompt
prompt = ContextRecallClassificationPrompt()
print(prompt.to_string())
Given a context, and an answer, analyze each sentence in the answer and classify if the sentence can be attributed to the given context or not. Use only 'Yes' (1) or 'No' (0) as a binary classification. Output json with reason.
Please return the output in a JSON format that complies with the following schema as specified in JSON Schema:
{"$defs": {"ContextRecallClassification": {"properties": {"statement": {"title": "Statement", "type": "string"}, "reason": {"title": "Reason", "type": "string"}, "attributed": {"title": "Attributed", "type": "integer"}}, "required": ["statement", "reason", "attributed"], "title": "ContextRecallClassification", "type": "object"}}, "properties": {"classifications": {"items": {"$ref": "#/$defs/ContextRecallClassification"}, "title": "Classifications", "type": "array"}}, "required": ["classifications"], "title": "ContextRecallClassifications", "type": "object"}Do not use single quotes in your response but double quotes,properly escaped with a backslash.
--------EXAMPLES-----------
Example 1
Input: {
"question": "What can you tell me about albert Albert Einstein?",
"context": "Albert Einstein (14 March 1879 - 18 April 1955) was a German-born theoretical physicist, widely held to be one of the greatest and most influential scientists of all time. Best known for developing the theory of relativity, he also made important contributions to quantum mechanics, and was thus a central figure in the revolutionary reshaping of the scientific understanding of nature that modern physics accomplished in the first decades of the twentieth century. His mass-energy equivalence formula E = mc2, which arises from relativity theory, has been called 'the world's most famous equation'. He received the 1921 Nobel Prize in Physics 'for his services to theoretical physics, and especially for his discovery of the law of the photoelectric effect', a pivotal step in the development of quantum theory. His work is also known for its influence on the philosophy of science. In a 1999 poll of 130 leading physicists worldwide by the British journal Physics World, Einstein was ranked the greatest physicist of all time. His intellectual achievements and originality have made Einstein synonymous with genius.",
"answer": "Albert Einstein, born on 14 March 1879, was a German-born theoretical physicist, widely held to be one of the greatest and most influential scientists of all time. He received the 1921 Nobel Prize in Physics for his services to theoretical physics. He published 4 papers in 1905. Einstein moved to Switzerland in 1895."
}
Output: {
"classifications": [
{
"statement": "Albert Einstein, born on 14 March 1879, was a German-born theoretical physicist, widely held to be one of the greatest and most influential scientists of all time.",
"reason": "The date of birth of Einstein is mentioned clearly in the context.",
"attributed": 1
},
{
"statement": "He received the 1921 Nobel Prize in Physics for his services to theoretical physics.",
"reason": "The exact sentence is present in the given context.",
"attributed": 1
},
{
"statement": "He published 4 papers in 1905.",
"reason": "There is no mention about papers he wrote in the given context.",
"attributed": 0
},
{
"statement": "Einstein moved to Switzerland in 1895.",
"reason": "There is no supporting evidence for this in the given context.",
"attributed": 0
}
]
}
-----------------------------
Now perform the same with the following input
Input: (None)
Output:
In short, the instruction asks the model to take a context and an answer, analyze each sentence in the answer, and classify whether the sentence can be attributed to the given context, using only a binary Yes (1) / No (0), with the reason output as JSON.
Although the instruction mentions only the context and the answer, the question is in fact also supplied and factored into the evaluation. Unlike context_precision, when multiple retrieval results are available they are joined as appropriate and passed in together as a single context.
It resembles faithfulness. If pressed, the differences are roughly:
- faithfulness runs in two stages (statement generation and statement verification), while context_recall runs in one
- faithfulness is meant to evaluate the generated answer, while context_recall is meant to evaluate the retrieval results (which shows up as subtle differences between the prompts)
but I would say they are almost the same.
The output is a list called classifications, whose elements contain statement (string), reason (string), and attributed (0/1).
Downstream, the score is the mean of the attributed flags: the fraction of the answer's sentences that can be attributed to the context, as sketched below.
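A sketch of that aggregation, using the flags from Example 1 above:
attributed = [1, 1, 0, 0]  # the four attributed flags from Example 1
score = sum(attributed) / len(attributed)
print(score)  # 0.5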
Strictly speaking, recall means "of the items that are actually positive, the fraction predicted positive", so this metric is not a recall in the strict sense. It is all the more confusing because context_precision does behave like a precision. To be fair, with only the retrieval results as input, the total number of "actually positive" items is unknowable, so this is somewhat unavoidable; still, perhaps the name recall should have been avoided. If you force an interpretation, maybe the assumption is that the search corpus contains exactly one correct passage for each sentence of the answer...
Closing
That concludes our concrete look at the prompts behind the main metrics implemented in RAGAS.
Once you know that they are all built on PydanticPrompt, it should be easy to picture the prompts for the other metrics as well.