MLOps (LLMOps, Generative AIOps) Advent Calendar 2024

Not Just RAG! A Reintroduction to Ragas

Posted at 2024-12-25

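1 What is Ragas?

Ragas is an open-source library that streamlines the evaluation of LLM applications and foundation models.
Beyond evaluation itself, it also includes features that streamline generating the test datasets used for evaluation.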

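What makes it handy

Integrates easily with LangChain and LlamaIndex
Predefined metrics make evaluation straightforward
Evaluation targets are no longer limited to RAG
Major update as of v0.2.0
Can read documents and generate test datasets from them
Can generate a Knowledge Graph with Ragas

This article is a belated overview of Ragas v0.2, which received a major update in the fall.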

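2 LLM evaluation in Ragas

There are now five target use cases, plus a few extras
Evaluation metrics (evaluation prompts) are predefined for each use case
Despite the name Ragas, the LLM applications it supports are no longer limited to RAG

Ragas computes evaluation metrics in two ways
1. LLM-based metrics
2. Non-LLM-based metrics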

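LLM-based metrics
- Use an LLM to compute the score
- Because an LLM is involved, the score can vary from run to run
- Papers and other reports indicate that they can produce judgments close to human evaluation
Non-LLM-based metrics
- Deterministic metrics that do not use an LLM to compute the score
- Include traditional metrics such as the BLEU score
- Reported to correlate poorly with human evaluation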
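
To make "deterministic" concrete, here is a minimal pure-Python sketch of three non-LLM metrics: exact match, string presence, and a normalized Levenshtein similarity. The function names are hypothetical and this is not Ragas's implementation; it only illustrates that the same inputs always produce the same score.

```python
def exact_match(response: str, reference: str) -> float:
    """Return 1.0 when the response equals the reference exactly, else 0.0."""
    return float(response == reference)


def string_presence(response: str, reference: str) -> float:
    """Return 1.0 when the reference appears somewhere in the response, else 0.0."""
    return float(reference in response)


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) via row-by-row DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Map the edit distance to a similarity in [0, 1]; 1.0 means identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


if __name__ == "__main__":
    response = "The capital of France is Paris."
    reference = "Paris"
    print(exact_match(response, reference))
    print(string_presence(response, reference))
    print(levenshtein_similarity("kitten", "sitting"))
```

Running the script repeatedly gives identical numbers, which is exactly what separates these metrics from LLM-based ones.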

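Ragas defines two categories of metrics according to the type of data being evaluated
1. Single Turn metrics
2. Multi Turn metrics

Single Turn metrics
- Evaluate applications in which the interaction between the user and the AI is a single turn
Multi Turn metrics
- Evaluate applications in which the interaction between the user and the AI spans multiple turns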

Note (for reference)

Metric design principles
- The basic principles Ragas follows in order to design better metrics
- Also a useful reference when defining your own evaluation metrics

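Design principle	Overview
Single-Aspect Focus	Each metric focuses on exactly one aspect of performance
Intuitive and Interpretable	Metrics should be easy to understand and easy to interpret
Effective Prompt Flows	For LLM-based evaluation, build prompt flows that align with human evaluation
Robustness	For LLM-based evaluation, use few-shot prompts to steer the LLM toward the desired results
Consistent Scoring Ranges	Normalize scores or keep them within a fixed range such as 0 to 1 so they are easy to compare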
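
As a small illustration of the "Consistent Scoring Ranges" principle, the sketch below rescales a raw score from an arbitrary range (for example a 1 to 5 rubric) onto 0 to 1 with min-max normalization. The helper name `normalize_score` is hypothetical and not part of Ragas.

```python
def normalize_score(raw: float, low: float, high: float) -> float:
    """Rescale a raw score from [low, high] to [0, 1] (min-max normalization)."""
    if high <= low:
        raise ValueError("high must be greater than low")
    if not low <= raw <= high:
        raise ValueError("raw score is outside the declared range")
    return (raw - low) / (high - low)


if __name__ == "__main__":
    # A rubric score of 4 on a 1-5 scale becomes 0.75 on the 0-1 scale.
    print(normalize_score(4, low=1, high=5))
```

Once every metric reports on the same scale, scores from different metrics can be compared or averaged without further conversion.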

3 Evaluation metrics

An overview of the metrics defined in Ragas, organized by use case

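Use case	Metric	Overview
Retrieval Augmented Generation (RAG)	Context Precision	Relevance of the retrieved contexts to the original question
	Context Recall	Proportion of the documents that should have been retrieved that were actually retrieved
	Context Entities Recall	Context Recall computed at the entity level
	Noise Sensitivity	Proportion of unwanted claims among the claims in a response generated from relevant or irrelevant contexts
	Response Relevancy	How relevant the generated response is to the original query
	Faithfulness	Whether the generated response can be inferred from the given contexts
Agents or Tool use cases	Topic adherence	Whether the AI system's responses stay within the predefined domains. An LLM may answer from general knowledge instead of the domain the user cares about, so this measures whether the AI's responses relate to the specified domains
	Tool call Accuracy	Whether the agent actually used the tools needed to complete the task. Compares the ideal list of tool calls with the actual messages
	Agent Goal Accuracy	Binary metric indicating whether the agent achieved the user's goal
Natural Language Comparison	Factual Correctness	Computes precision, recall, or the F1 score between the generated response and the ground truth
	Semantic Similarity	Measures the semantic similarity between the generated response and the ground truth. Uses a cross-encoder model for the measurement
	Non LLM String Similarity	Evaluates with traditional string distance measures. Choose from Levenshtein, Hamming, and Jaro-Winkler distance
	BLEU Score	BLEU score
	ROUGE Score	ROUGE score
	String Presence	Binary metric indicating whether the generated response contains the ground truth
	Exact Match	Binary metric indicating whether the generated response exactly matches the ground truth
SQL	Execution based Datacompy Score	Compares the retrieved data with the ground truth using the Datacompy library. Datacompy is a library for comparing pandas DataFrames
	SQL query Equivalence	Semantically compares the generated SQL query with the reference SQL query. The schema must also be passed for the evaluation. Because the SQL does not need to be executed, evaluation is possible whenever the schema is available, even if execution would be slow or the environment cannot run the query
General purpose	Aspect critic	Uses an LLM to evaluate against predefined criteria such as harmfulness and coherence
	Simple Criteria Scoring	Scores the response against a single predefined free-form criterion. The score is an integer within a specified range
	Rubrics based scoring	Defines a description for each score level, typically in the range 1 to 5, and has an LLM do the scoring
	Instance specific rubrics scoring	A rubrics-based metric in which the evaluation criteria are set per individual case (instance). The LLM scores based on the written descriptions
Other tasks	Summarization	How well the generated response captures the important information in the given contexts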
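
The precision, recall, and F1 calculation behind a metric like Factual Correctness can be sketched as follows. This is a simplified illustration, not Ragas's implementation: it assumes the claims have already been extracted and treats two claims as matching only when their strings are identical, whereas an LLM-based metric would decide the match semantically. The function name `claim_scores` is hypothetical.

```python
from __future__ import annotations


def claim_scores(response_claims: set[str], reference_claims: set[str]) -> dict[str, float]:
    """Compute precision, recall, and F1 over two sets of claims."""
    tp = len(response_claims & reference_claims)  # claims supported by the reference
    fp = len(response_claims - reference_claims)  # claims only in the response
    fn = len(reference_claims - response_claims)  # reference claims the response missed

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    response = {
        "Paris is the capital of France",
        "Paris has 10 million inhabitants",
    }
    reference = {
        "Paris is the capital of France",
        "Paris is the most populous city of France",
    }
    print(claim_scores(response, reference))
```

With one shared claim, one extra claim in the response, and one missed reference claim, all three scores come out at 0.5.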

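4 Dataset definition in Ragas

Ragas has moved from HuggingFace Datasets to its own EvaluationDataset class
An EvaluationDataset holds a list of evaluation samples
Each evaluation sample represents a single exchange or use case between the user and the LLM
Ragas v0.2 defines SingleTurnSample for single-turn exchanges and MultiTurnSample for multi-turn exchanges
MultiTurnSample is suited to AI agent systems, where tool use and similar behavior are expected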

SingleTurnSample code sample

from ragas import SingleTurnSample

# User's question
user_input = "What is the capital of France?"

# Retrieved contexts (e.g., from a knowledge base or search engine)
retrieved_contexts = ["Paris is the capital and most populous city of France."]

# AI's response
response = "The capital of France is Paris."

# Reference answer (ground truth)
reference = "Paris"

# Evaluation rubric
rubric = {
    "accuracy": "Correct",
    "completeness": "High",
    "fluency": "Excellent"
}

# Create the SingleTurnSample instance
sample = SingleTurnSample(
    user_input=user_input,
    retrieved_contexts=retrieved_contexts,
    response = response,
    reference=reference,
    rubric=rubric
)

MultiTurnSample code sample

from ragas.dataset_schema import MultiTurnSample
from ragas.messages import HumanMessage, AIMessage, ToolMessage, ToolCall

# User asks about the weather in New York City
user_message = HumanMessage(content="What's the weather like in New York City today?")

# AI decides to use a weather API tool to fetch the information
ai_initial_response = AIMessage(
    content="Let me check the current weather in New York City for you.",
    tool_calls=[ToolCall(name="WeatherAPI", args={"location": "New York City"})]
)

# Tool provides the weather information
tool_response = ToolMessage(content="It's sunny with a temperature of 75°F in New York City.")

# AI delivers the final response to the user
ai_final_response = AIMessage(content="It's sunny and 75 degrees Fahrenheit in New York City today.")

# Combine all messages into a list to represent the conversation
conversation = [
    user_message,
    ai_initial_response,
    tool_response,
    ai_final_response
]

# Wrap the conversation in a MultiTurnSample instance
sample = MultiTurnSample(user_input=conversation)

Source: https://docs.ragas.io/en/latest/concepts/components/eval_sample/

There are two ways to prepare a Ragas EvaluationDataset
1. Prepare xxxTurnSample instances and initialize an EvaluationDataset from them
2. Load it from a HuggingFace Datasets object

1. Building an EvaluationDataset from xxxTurnSample instances

from ragas import SingleTurnSample, EvaluationDataset

# Sample 1
sample1 = SingleTurnSample(
    ...
)
# Sample 2
sample2 = SingleTurnSample(
    ...
)
# Sample 3
sample3 = SingleTurnSample(
    ...
)

dataset = EvaluationDataset(samples=[sample1, sample2, sample3])

2. Building an EvaluationDataset from HuggingFace Datasets

from datasets import load_dataset
dataset = load_dataset("explodinggradients/amnesty_qa","english_v3")

from ragas import EvaluationDataset

eval_dataset = EvaluationDataset.from_hf_dataset(dataset["eval"])

5 Sample run

A sample run of summarization evaluation
- Uses Claude 3 Haiku on Bedrock

import asyncio
import os

from boto3 import Session
from dotenv import load_dotenv
from langchain_aws import ChatBedrock
from ragas.dataset_schema import SingleTurnSample
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import SummarizationScore

# Load AWS credentials and the model ID from a dotenv file
dotenv_path = "<path_to_dotfile>"
load_dotenv(verbose=True, dotenv_path=dotenv_path)

# Bedrock runtime client used by the LangChain chat model
boto3_session = Session()
bedrock_runtime = boto3_session.client("bedrock-runtime")

llm = ChatBedrock(
    client=bedrock_runtime,
    model_id=os.getenv("AWS_MODEL_ID_CALUDE30_HAIKU"),
)

async def main():
    sample = SingleTurnSample(
        # Summary to be scored
        response=(
            "A company is launching a fitness tracking app "
            "that helps users set exercise goals, log meals, and track water "
            "intake, with personalized workout suggestions and motivational "
            "reminders."
        ),
        # Source text the summary is judged against
        reference_contexts=[
            "A company is launching a new product, a smartphone app "
            "designed to help users track their fitness goals. The app allows "
            "users to set daily exercise targets, log their meals, and track "
            "their water intake. It also provides personalized workout "
            "recommendations and sends motivational reminders throughout the "
            "day."
        ]
    )
    # Use the Bedrock model as the evaluator LLM
    scorer = SummarizationScore()
    scorer.llm = LangchainLLMWrapper(llm)
    response = await scorer.single_turn_ascore(sample)
    print(response)

if __name__ == "__main__":
    asyncio.run(main())

Result:
- 0.7048387096775146
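
To see where a value like this can come from, the sketch below combines a QA score with a conciseness score the way the Ragas documentation describes the Summarization Score, as far as I understand it: a weighted sum with a coefficient of 0.5, where conciseness penalizes summaries that are long relative to the context. The function names are hypothetical, the QA score of 1.0 is an assumption (the article does not report it), and 183 and 310 are the character counts of the sample's response and context.

```python
def conciseness_score(summary_len: int, context_len: int) -> float:
    """Higher when the summary is short relative to the context."""
    return 1.0 - min(summary_len, context_len) / (context_len + 1e-10)


def summarization_score(qa_score: float, summary_len: int, context_len: int,
                        coeff: float = 0.5) -> float:
    """Weighted sum of the QA score and the conciseness score."""
    return (qa_score * coeff
            + conciseness_score(summary_len, context_len) * (1.0 - coeff))


if __name__ == "__main__":
    # Assumption: the LLM judged every generated question answerable (QA score 1.0).
    score = summarization_score(qa_score=1.0, summary_len=183, context_len=310)
    print(round(score, 4))  # 0.7048
```

Under these assumptions the sketch lands on roughly the same value as the run above, which suggests the result was driven mostly by the length ratio of summary to context.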

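6 Summary

Surveyed the heavily updated Ragas v0.2
The range of target use cases has grown substantially, and Ragas now applies beyond RAG
Test datasets have also moved to a Ragas-defined format
Other LLM-related tools now offer more integration samples with Ragas, enabling use across a variety of use cases and tool integrations
On the other hand, some documentation and sample code is outdated or contains errors, so using Ragas well sometimes means reading the source code on GitHub directly
- If something bothers you, contribute a fix!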
