
Business Engineering Corporation (B-EN-G) Advent Calendar 2023

@nakano-min

Business Engineering Corporation (B-EN-G)

[Quantitative Evaluation of RAG] Trying Out Ragas and SelfCheckGPT

Posted at 2023-12-19, last updated at 2023-12-19

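Introduction

Because applications built on LLMs deal with natural language, their performance is generally considered hard to evaluate quantitatively, and no evaluation method has been established yet.

This article targets RAG (Retrieval-Augmented Generation), a representative type of LLM application, and tries evaluating it with Ragas, an evaluation framework for RAG pipelines, and SelfCheckGPT, a hallucination detection technique.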

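Background

RAG

RAG (Retrieval-Augmented Generation) is a technique that retrieves the information needed to answer a user's question and includes it in the prompt, so that the model can generate high-quality answers even about data it was never trained on.
Compared with fine-tuning, which trains the model further, RAG is more flexible and general-purpose and can adapt quickly to new data, which is why it is attracting attention.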
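To make the retrieve-then-generate flow concrete, here is a minimal sketch. It assumes a toy keyword-overlap retriever in place of a real vector search; `retrieve` and `build_prompt` are hypothetical helper names, not part of any library, and the actual LLM call is omitted.

```python
def retrieve(question: str, documents: list[str], top_k: int = 1) -> list[str]:
    """Rank documents by how many question words they share (toy retriever)."""
    q_words = set(question.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question: str, contexts: list[str]) -> str:
    """Place the retrieved text in the prompt so the model can answer from it."""
    joined = "\n".join(contexts)
    return f"Context: {joined}\nQuestion: {question}\nAnswer:"


# Made-up documents, used only to exercise the flow.
docs = [
    "The cafeteria opens at 11:30 on weekdays.",
    "The library closes at 20:00.",
]
question = "When does the cafeteria open?"
prompt = build_prompt(question, retrieve(question, docs))
print(prompt)
```

The prompt built this way is what would be sent to the model; the quality of the final answer then depends on both the retrieval step and the generation step, which is why they are evaluated separately.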

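Ragas

Ragas (Retrieval-Augmented Generation Assessment) is a framework that uses LLMs and other components to quantitatively evaluate RAG pipelines.

The framework defines several metrics, including the following.

Faithfulness
- Evaluates how well the answer is grounded in the context.
- Points are deducted when the answer contains information that cannot be derived from the context.
Answer Relevancy
- Evaluates how relevant the answer is to the question.
- Redundant or incomplete answers are penalized.
Context Precision
- Evaluates how relevant the context is to the question.
- Points are deducted when the context contains unnecessary information.

[Source: Component-Wise Evaluation - Ragas Docs]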
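As a rough illustration of the Faithfulness idea, the score can be read as the share of statements in the answer that the context supports. The sketch below assumes the supported/unsupported verdicts have already been produced (in Ragas an LLM makes them); `faithfulness_score` is a hypothetical helper, not the Ragas API.

```python
def faithfulness_score(verdicts: list[bool]) -> float:
    """Fraction of answer statements judged as supported by the context."""
    if not verdicts:
        raise ValueError("at least one statement is required")
    return sum(verdicts) / len(verdicts)


# Three statements, one of which is not backed by the context.
print(faithfulness_score([True, True, False]))
```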

SelfCheckGPT

SelfCheckGPT is a hallucination detection technique.
It is built on a simple idea: answers that contain hallucinations lack consistency and harbor contradictions.
Note that SelfCheckGPT was not developed for evaluating RAG.

Scores can be computed with a variety of methods, such as the following.

SelfCheck-BERTScore
SelfCheck-QA (MQAG: Multiple-choice Question Answering and Generation)
SelfCheck-Unigram (uses an $n$-gram model)
SelfCheck-NLI (uses an NLI model)
SelfCheck-Prompt (uses an LLM)

According to the paper, SelfCheck-Prompt is the best-performing method.

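Approach

For Ragas, I try the following metrics, which are the ones that evaluate the answer generated by the LLM.

Faithfulness
Answer Relevancy

For SelfCheckGPT, I try SelfCheck-Prompt, which is reported to perform best.
I run each evaluation on WikiEval, the dataset that was also used to validate Ragas, and look over the results.
Where possible, I also consider how the RAG could be improved, as an exercise in putting the evaluation results to use.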

Environment

Python
- rye 0.15.2
- python 3.11.6
- ragas 0.0.21
- en_core_web_sm 3.6.0
Language model / chat model
- Model: gpt-3.5-turbo-16k
- Version: 0613
Embeddings model
- Model: text-embedding-ada-002
- Version:

I used models deployed on Azure OpenAI Service.

Other notes

Installing en_core_web_sm
en_core_web_sm is not published as a package on PyPI, so it has to be installed directly from GitHub.
I added it with the following command.
```
rye add en_core_web_sm --url=https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.6.0/en_core_web_sm-3.6.0.tar.gz
```
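Content filter configuration
The dataset contains items that get caught by the Azure OpenAI Service content filter.
For this investigation, I changed all filters to the "High" level (the default is "Medium"), so that only high-severity content is blocked.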

Implementation

The directory layout is as follows.

<pj-root>
├─.venv
├─src
│ ├─__init__.py
│ ├─_dataset.py
│ ├─_config.py
│ ├─metrics.py
│ ├─selfcheck.py
│ └─main.py
│
├─pyproject.toml
├─requirements.lock
└─requirements-dev.lock

Dataset

The WikiEval dataset is reshaped into the Ragas input format as follows.

_dataset.py

from datasets import load_dataset, Dataset

_dataset = load_dataset("explodinggradients/WikiEval")
# DatasetDict({
#     train: Dataset({
#         features: ['answer', 'question', 'context_v1', 'context_v2', 'ungrounded_answer', 'source', 'poor_answer'],
#         num_rows: 50
#     })
# })

_train = _dataset["train"]
_mapping = {
    "answer": _train["answer"],
    "question": _train["question"],
    "contexts": [c for c in _train["context_v1"]],
}

WIKI_EVAL_DATASET = Dataset.from_dict(_mapping)
# Dataset({
#     features: ['answer', 'question', 'contexts'],
#     num_rows: 50
# })
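The reshaped rows can be sanity-checked before they are handed to Ragas. The following sketch works on plain dictionaries and assumes each row needs a string `question`, a string `answer`, and a list of strings in `contexts`; `check_row` is a hypothetical helper, not part of Ragas or `datasets`.

```python
def check_row(row: dict) -> list[str]:
    """Return a list of problems found in one row (empty if the row looks fine)."""
    problems = []
    for key in ("question", "answer"):
        if not isinstance(row.get(key), str):
            problems.append(f"{key} must be a string")
    contexts = row.get("contexts")
    if not isinstance(contexts, list) or not all(isinstance(c, str) for c in contexts):
        problems.append("contexts must be a list of strings")
    return problems


good = {"question": "Q?", "answer": "A.", "contexts": ["some passage"]}
bad = {"question": "Q?", "answer": "A.", "contexts": "some passage"}
print(check_row(good))
print(check_row(bad))
```

A bare string in `contexts` is an easy mistake to make when a dataset stores a single passage per row, so checking for a list explicitly can save a confusing failure later.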

API connection settings

The models used by Ragas and SelfCheckGPT are the ones deployed on Azure OpenAI Service.
When such models are used through LangChain, the settings look like this.

_config.py

COMMON_CONFIG = {
    "azure_endpoint": "<Your API Endpoint>",
    "api_key": "<Your API Key>",
    "api_version": "2023-07-01-preview",
}

CHATMODEL_CONFIG = COMMON_CONFIG | {
    "azure_deployment": "<Your Deployment>",
    "model": "gpt-3.5-turbo-16k-0613",
}

EMBEDDINGS_CONFIG = COMMON_CONFIG | {
    "azure_deployment": "<Your Deployment>",
    "model": "text-embedding-ada-002",
}
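Rather than hard-coding the endpoint and key, the shared settings can be read from environment variables. This is a sketch under assumptions: the variable names `AZURE_OPENAI_ENDPOINT` and `AZURE_OPENAI_API_KEY` and the helper `load_common_config` are choices made for this example, not something the original setup prescribes.

```python
import os
from collections.abc import Mapping
from typing import Optional


def load_common_config(env: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Build the shared connection settings from environment variables."""
    env = os.environ if env is None else env
    required = ("AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_API_KEY")
    missing = [name for name in required if name not in env]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return {
        "azure_endpoint": env["AZURE_OPENAI_ENDPOINT"],
        "api_key": env["AZURE_OPENAI_API_KEY"],
        "api_version": "2023-07-01-preview",
    }


# Example with an explicit mapping instead of the real environment.
example_env = {
    "AZURE_OPENAI_ENDPOINT": "https://example.invalid",
    "AZURE_OPENAI_API_KEY": "dummy-key",
}
common = load_common_config(example_env)
chat_config = common | {"azure_deployment": "<Your Deployment>"}
print(sorted(chat_config))
```

Failing early with the list of missing variables keeps credentials out of the source tree while still giving a clear error when the environment is not set up.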

Ragas metrics

Because the models are deployed on Azure OpenAI Service, I implemented the metrics as follows, with reference to the documentation.

metrics.py

import os
from langchain.chat_models import AzureChatOpenAI
from langchain.embeddings import AzureOpenAIEmbeddings
from ragas.llms import LangchainLLM
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
)
from ragas.metrics.base import MetricWithLLM

from src._config import CHATMODEL_CONFIG, EMBEDDINGS_CONFIG

# Opt out of usage tracking
os.environ["RAGAS_DO_NOT_TRACK"] = "true"

chat_model = AzureChatOpenAI(**CHATMODEL_CONFIG)
embeddings = AzureOpenAIEmbeddings(**EMBEDDINGS_CONFIG)

llm = LangchainLLM(chat_model)

METRICS: list[MetricWithLLM] = [
    faithfulness,
    answer_relevancy,
]

# Point every metric at the Azure-hosted chat model and embeddings.
for m in METRICS:
    setattr(m, "llm", llm)
    setattr(m, "embeddings", embeddings)

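Ragas tracks usage.
The tracked data is very basic information and does not identify individuals or companies.
To disable usage tracking, set RAGAS_DO_NOT_TRACK to "true".

For details, refer to the Ragas documentation.

SelfCheckGPT / SelfCheck-Prompt

Some of the SelfCheckGPT methods are available as a Python package, but SelfCheck-Prompt, the one examined here, was not, so I implemented it myself with reference to the paper.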

selfcheck.py

import spacy
import pandas as pd
from dataclasses import dataclass
from datasets import Dataset
from langchain.schema import HumanMessage
from langchain.chat_models import AzureChatOpenAI

_SMAPLE_GEN = """\
Context: {}
{}
Answer:\
"""

_PROMPT = """\
Context: {}
Sentence: {}
Is the sentence supported by the context above?
Answer Yes or No:\
"""

@dataclass
class Result:
    question: str
    answer: str
    contexts: list[str]
    avg_score: float | None = None
    min_score: float | None = None

class SelfCheckGPT:

    def __init__(self, config: dict[str, str]) -> None:
        self.llm = AzureChatOpenAI(**config)
        self.nlp = spacy.load("en_core_web_sm")

    def _generate(self, content: str) -> str:
        message = HumanMessage(content=content)
        answer = self.llm.invoke([message])
        return answer.content

    def _create_samples(
        self,
        question: str,
        contexts: list[str],
        n: int
    ) -> list[str]:
        content = _SMAPLE_GEN.format("".join(contexts), question)
        return [self._generate(content) for _ in range(n)]

    def _split_passage(self, passage: str) -> list[str]:
        return [sent.text.strip() for sent in self.nlp(passage).sents]

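    def _answer2score(self, answer: str) -> float:
        # Normalize the reply so that variants such as "yes" or "Yes." still match.
        normalized = answer.strip().strip(".!").lower()
        if normalized == "yes":
            return 1.0
        elif normalized == "no":
            return 0.0
        else:
            # Anything else is treated as undecided.
            return 0.5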

    def _predict(
        self,
        sentences: list[str],
        samples: list[str],
    ) -> tuple[float, float]:
        sent_level_score: list[float] = []
        for sentence in sentences:
            score = 0
            for sample in samples:
                content = _PROMPT.format(sample, sentence)
                answer = self._generate(content)
                score += self._answer2score(answer)
            score /= len(samples)
            sent_level_score.append(score)
        avg_score = sum(sent_level_score) / len(sent_level_score)
        min_score = min(sent_level_score)
        return avg_score, min_score

    def evaluate(self, dataset: Dataset, num_samples: int) -> pd.DataFrame:
        results: list[Result] = []
        for i, d in enumerate(dataset):
            result = Result(d["question"], d["answer"], [c for c in d["contexts"]])
            print(f"{i:03}: {result.question}")
            sentences = self._split_passage(result.answer)
            samples = self._create_samples(result.question, result.contexts, num_samples)

            avg_score, min_score = self._predict(sentences, samples)

            result.avg_score = avg_score
            result.min_score = min_score
            results.append(result)

        return pd.DataFrame(results, columns=Result.__annotations__)

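As originally proposed, SelfCheckGPT assigns higher scores to "bad" outputs (those containing hallucinations).
Ragas, on the other hand, assigns higher scores to "good" outputs.
For this investigation, I implemented SelfCheckGPT to match Ragas, so that "good" outputs receive higher scores.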
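The scoring direction can be illustrated without calling an LLM. The sketch below assumes a stub judge that simply checks whether a sentence appears in a sample; `stub_judge` and `sentence_scores` are hypothetical names, and a real run would use the LLM prompt instead. It computes the Ragas-aligned score (higher is better) together with its complement, which corresponds to the hallucination-oriented direction.

```python
def stub_judge(sample: str, sentence: str) -> str:
    """Stand-in for the LLM: answers "Yes" when the sample contains the sentence."""
    return "Yes" if sentence in sample else "No"


def sentence_scores(sentences: list[str], samples: list[str]) -> list[float]:
    """Per-sentence support rate across samples (1.0 = supported by every sample)."""
    mapping = {"Yes": 1.0, "No": 0.0}
    scores = []
    for sentence in sentences:
        votes = [mapping.get(stub_judge(sample, sentence), 0.5) for sample in samples]
        scores.append(sum(votes) / len(votes))
    return scores


sentences = ["A is B.", "C is D."]
samples = ["A is B. C is D.", "A is B. C is E."]
good = sentence_scores(sentences, samples)  # higher means better supported
bad = [1.0 - s for s in good]               # higher means more likely hallucinated
print(good, bad)
```

Both directions carry the same information; aligning them only makes the SelfCheck-Prompt numbers directly comparable with the Ragas metrics.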

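Running the evaluation and saving the output

Each evaluation is run and the results are saved to a CSV file.
The number of samples generated for SelfCheck-Prompt was set to 10.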

main.py

import pandas as pd
import ragas
from src._config import CHATMODEL_CONFIG
from src._dataset import WIKI_EVAL_DATASET
from src.metrics import METRICS
from src.selfcheck import SelfCheckGPT

# Ragas
rslts_ragas = ragas.evaluate(WIKI_EVAL_DATASET, METRICS).to_pandas()

# SelfCheckGPT / SelfCheck-Prompt
selfcheck = SelfCheckGPT(CHATMODEL_CONFIG)
rslts_selfcheck = selfcheck.evaluate(WIKI_EVAL_DATASET, 10)

rslts = pd.merge(rslts_ragas, rslts_selfcheck, on=["question", "answer"])
rslts.to_csv("./results.csv")

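Looking over the results

The results are as follows.

Note: To make "bad" items easier to spot, the vertical axis shows each score subtracted from 1.
Note: To keep the labels short, each row of the WikiEval dataset is numbered.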
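The two notes above amount to a small transformation of the results. As a sketch, assuming the scores are available as plain floats (the values below are made up for illustration, and `to_plot_points` is a hypothetical helper), the plotted value and the row label could be derived like this.

```python
def to_plot_points(scores: list[float]) -> list[tuple[str, float]]:
    """Pair a zero-padded row label with 1 - score, so worse rows stand out."""
    return [(f"{i:03}", round(1.0 - s, 2)) for i, s in enumerate(scores)]


# Made-up scores for three rows.
print(to_plot_points([1.0, 0.5, 0.76]))
```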

Dataset numbering

The numbers correspond to the question values as follows.

000: Question: When is the scheduled launch date and time for the PSLV-C56 mission, and where will it be launched from?
001: Question: What is the objective of the Uzbekistan-Afghanistan-Pakistan Railway Project and how is it expected to enhance trade and logistics efficiency?
002: Question: When was PharmaCann founded and what is its headquarters location?
003: Question: Who directed the film Oppenheimer and who stars as J. Robert Oppenheimer in the film?
004: Question: What is theranostics and how does it combine diagnostic and therapeutic approaches in precision medicine?
005: Question: What is the human climate niche and how is it estimated?
006: Question: What is the taxonomy of Dasypoda radchenkoi, and what is its relationship to Dasypoda morotei?
007: Question: What is the main product of Fremantle Octopus and where is it based?
008: Question: Who is the Managing Director of FoodFutureCo and what are some of her accomplishments in the food industry?
009: Question: What was the purpose of designing and building the Fiat Ecobasic concept car?
010: Question: What is the purpose of the Rainbow Plaque programme in the UK, and how does it compare to other plaque programmes?
011: Question: What is the Zubaydah Trail and when was it constructed?
012: Question: When was the Chimnabai Clock Tower completed, and who was it named after?
013: Question: When did Trolleybus Route 20 in Shanghai start its operations, and what landmarks does it pass by?
014: Question: When did the Inter Expo Center in Sofia, Bulgaria open, and how has it been expanded over the years?
015: Question: When did Pope Benedict XVI become the head of the Catholic Church and sovereign of the Vatican City State, and when did he resign?
016: Question: What caused the crash of Yeti Airlines Flight 691 in Pokhara, Nepal?
017: Question: How does the height and thrust of the Starship rocket compare to other rockets?
018: Question: What is the Kyzylkum Desert known for in terms of its natural resources, and what are some of the major industrial enterprises in the region?
019: Question: When will the 80th annual Venice International Film Festival take place, and who will serve as the festival's opening film?
020: Question: Where is Myosotis angustata endemic to, and what is its conservation status?
021: Question: Where is Mount Brown located and what is its elevation?
022: Question: Where is the type locality of the Laoshan tree frog and what is its size?
023: Question: When was the Roanoke and Tar River Railroad fully merged into the Seaboard Air Line Railway network, and what happened to the line after that?
024: Question: What organizations has Moud Goba been involved with and what is her current role at UK Black Pride?
025: Question: What is the purpose of the Modernizing Opioid Treatment Access Act, and how does it aim to expand access to methadone for patients with opioid use disorder?
026: Question: When and where did Gaucho Americano have its world premiere, and when was it commercially released in Chilean theaters?
027: Question: Where is the type locality of the Blakistonia plata spider and what is the significance of its specific epithet?
028: Question: How many teams participate in the Turkish Women's Football Super League, and what is the format of the league?
029: Question: Who is buried in the Tomb of Alexander Stewart, and what is the condition of the tomb?
030: Question: When was the 5th Separate Guards Tatsin Red Banner Order of Suvorov Tank Brigade formed, and what is its military unit number?
031: Question: How long did the Siege of Mariupol last, and what was the outcome?
032: Question: Which countries and international organizations have imposed sanctions against Russia and Crimea, and what were the reasons for these sanctions?
033: Question: What factors contributed to the Sri Lankan economic crisis?
034: Question: How many people were killed and injured in the 2022 Hormozgan earthquakes, and what was the maximum intensity of the earthquakes?
035: Question: How many people were killed and injured in the mass shooting during the Independence Day parade in Highland Park, Illinois?
036: Question: What are some of the controversies surrounding Uber?
037: Question: What was the estimated timeline for fully restoring power in Moore County after the shooting attack on the electrical distribution substations?
038: Question: What caused the ethnic violence in Manipur in 2023, and what were the consequences of the violence?
039: Question: What was the size and payload of the Chinese balloon that was spotted in North American airspace?
040: Question: What were the temperatures and snowfall amounts during the cold snap in Afghanistan in January 2023, and how many people and livestock were affected?
041: Question: When was GPT-4 released and what are some of its capabilities?
042: Question: What is the current status of the Myanmar civil war, and how many people have been internally displaced since the coup?
043: Question: When and where will the Miss Grand Dominican Republic 2023 pageant be held, and what is the purpose of the pageant?
044: Question: What was the cause of the rebellion staged by the Wagner Group in 2023, and how did it end?
045: Question: What caused the gas supply outage in Sheffield, England in December 2022, and how long did the outage last?
046: Question: What sparked the civil unrest and protests in Iran in September 2022, and what were the main demands of the protesters?
047: Question: What types of volcanoes are found on Venus, and how do they differ from those on Earth?
048: Question: What are some measures for pandemic prevention?
049: Question: What are the main science objectives of the JUICE orbiter and what moons will it study?

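Let's pick out a few items where the evaluation results diverge and look at them.
Note: Because each method takes a different approach, differences between them are to be expected. The goal here is simply to understand the characteristics of each evaluation through concrete examples.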
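One way to find such items mechanically is to look, per row, for the metric that falls furthest below the others. The sketch below uses the scores of the three rows discussed in this article; `largest_gap` is a hypothetical helper and the metric keys are names chosen for this example.

```python
def largest_gap(row: dict[str, float]) -> tuple[str, float]:
    """Return the lowest-scoring metric and its distance from the mean of the others."""
    lowest = min(row, key=row.get)
    others = [value for key, value in row.items() if key != lowest]
    gap = sum(others) / len(others) - row[lowest]
    return lowest, round(gap, 3)


rows = {
    "034": {"answer_relevancy": 0.92, "faithfulness": 1.00, "selfcheck_avg": 1.00},
    "026": {"answer_relevancy": 0.97, "faithfulness": 0.50, "selfcheck_avg": 1.00},
    "016": {"answer_relevancy": 0.98, "faithfulness": 1.00, "selfcheck_avg": 0.76},
}
for number, row in rows.items():
    print(number, largest_gap(row))
```

Sorting all rows by the gap would surface the candidates worth reading in detail, although the size of a meaningful gap is a judgment call.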

Low Answer Relevancy only

034

Question

Question: How many people were killed and injured in the 2022 Hormozgan earthquakes, and what was the maximum intensity of the earthquakes?


Context

The 2022 Hormozgan earthquakes were a pair of doublet earthquakes that struck southern Iran on 1 July, 2022. The earthquakes, which occurred around two hours apart, killed seven people and injured dozens more.\n\nTectonic setting\nHormozgan province lies at the southern margin of the collision zone between the Eurasian Plate and the Arabian Plate. This collision lead to the creation of the Zagros Mountains and the Iranian Plateau. The main fault system that runs through the Zagros Range is the Zagros fold and thrust belt, which has been responsible for causing many earthquakes in Iran over the years.\n\nEarthquake\nThis earthquake is part of a sequence of earthquakes on 1 July, 2022, in southern Iran that began with a magnitude 6.0, followed by a magnitude 5.7 aftershock two hours later, and a magnitude 6.0 earthquake one minute after the magnitude 5.7 earthquake. The first two events were initially reported as 6.1 magnitude, while the third was an initial 6.2. They were revised down several hours later. Earthquake sequences similar to this one have previously occurred in the Zagros Mountains, with a similar sequence occurring in November 2021.\n\nAftershocks\nBy 3 July, there were twelve aftershocks, the strongest of which was 5.7 Mw. On July 23, two more aftershocks occurred, measuring 5.4 and 5.6 Mw\u202f. The two quakes caused further damage to houses, and caused one indirect injury.\n\nIntensity\nThe earthquakes had a maximum intensity of VII (Very strong). The strongest shaking was reported in the provinces of Hormozgan and Fars. The earthquake was felt throughout the Middle East in countries such as the United Arab Emirates, Oman, Saudi Arabia, Bahrain and Qatar, as well as parts of Pakistan and Afghanistan, which were severely affected by a more deadly earthquake 10 days earlier.\n\nOther events\nTwo offshore earthquakes occurred in the same province near the town of Kish a month before the July events. 
The earthquakes measured 5.5 and 5.6 on the moment magnitude scale, and occurred at a depth of 10.0 km. The first earthquake injured four and damaged 20 buildings. The latter caused one death and 37 injuries.Another earthquake, measuring magnitude 5.9 struck the same area on March 16 of that same year. It caused two injuries and minor damage in several villages.\n\nImpact\nTwelve towns and over 300 villages, with a combined population of around 900,000, were impacted by the earthquakes. The village of Sayeh Khvosh, home to around 1,100 people, was completely destroyed. The governor of Hormozgan, Mahdi Dousti, said that it would take several months to rebuild the village. In Bandar Khamir, at least 45 houses were affected, and 35 others were damaged in the town of Kong. In total, at least 392 houses were damaged or destroyed. There were also reports of power outages. A road between Bandar Khamir and Bandar Lengeh was blocked by a landslide. Seven people were killed and 111 others were injured. At least 22 of the injuries were serious enough to require hospitalisation.\n\nSee also\nList of earthquakes in 2022\nList of earthquakes in Iran\n\n\n== References ==


Answer

Answer: The 2022 Hormozgan earthquakes were a pair of doublet earthquakes that struck southern Iran on 1 July, 2022. The earthquakes occurred around two hours apart and resulted in the death of seven people. Dozens more were injured. The earthquakes had a maximum intensity of VII (Very strong), with the strongest shaking reported in the provinces of Hormozgan and Fars.


Metric	Value
Answer Relevancy	0.92
Faithfulness	1.00
SelfCheck-Prompt-AVG.	1.00

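Looking at the answer, nothing in it contradicts the facts.
However, the following statement in the answer is vague, so the answer cannot be called complete.

Dozens more were injured.

Given that Answer Relevancy penalizes incomplete answers, the score appears to be appropriate.

Thinking about improvements
The context mixes the vague statement "injured dozens more" with the specific statement "111 others were injured". Neither is false, but answering the question with the specific figure of 111 injured would be more appropriate.
In this case, removing the vague "injured dozens more" from the context, or rewriting it as the specific "111 others were injured", would likely improve the answer.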

Low Faithfulness only

026

Question

Question: When and where did Gaucho Americano have its world premiere, and when was it commercially released in Chilean theaters?


Context

Gaucho Americano (lit.\u2009'American Gaucho') is a 2021 Chilean documentary film directed by Nicolás Molina and written by Molina, Valentina Arango and Paula López. It presents the life of Joaquín and Víctor, 2 gauchos from Chilean Patagonia who find themselves alone in an American ranch to do a job.The film was named on the shortlist for Chilean's entry for the Academy Award for Best International Feature Film at the 95th Academy Awards, but it was not selected.\n\nSynopsis\nJoaquín and Victor, both gauchos from Chilean Patagonia, are hired as muleteers on a ranch in the United States. Accompanied only by their dogs and horses, they will have to protect their flocks of thousands of sheep from predators, in a foreign land that they believe they can dominate.\n\nCast\nJoaquín Agüil\nVictor Jara\n\nRelease\nGaucho Americano had its world premiere on April 29, 2021 at the Hot Docs Canadian International Documentary Festival. It was screened in mid-August 2021 for the first time in Chilean territory at the Santiago International Film Festival. It was commercially released on September 8, 2022 in Chilean theaters.\n\nReception\nCritical reception\nNikki Baughan from ScreenDaily describes Guacho Americano as a passive and hypnotic documentary that has a lot to say about immigration, rural life, and generational differences. In addition, to highlight the photography work of the director.\n\nAccolades\nReferences\nExternal links\nGaucho Americano at IMDb\nOfficial Page


Answer

Answer: Gaucho Americano had its world premiere on April 29, 2021, at the Hot Docs Canadian International Documentary Festival. It was commercially released on September 8, 2022, in Chilean theaters.


Metric	Value
Answer Relevancy	0.97
Faithfulness	0.50
SelfCheck-Prompt-AVG.	1.00

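As far as I can tell, nothing in the answer contradicts the facts.
Faithfulness normally comes out low when the answer contains information that cannot be obtained from the context.
That is not the case here, so I could not identify the cause.
After puzzling over it, I ran the evaluation again, and this time Faithfulness came out as 1.00...

???

Let's consider the cause.
Faithfulness extracts a number of statements from the question and answer, then computes the score from how many of those statements are supported by the context.
When there are only a few statements, the non-deterministic behavior of the LLM can presumably push the score far too low.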

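In fact, when I queried the LLM directly using the prompt template from the Ragas implementation, it returned only two statements, as shown below.

// Prompt that asks the model to generate statements from the question and answer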
Given a question and answer, create one or more statements from each sentence in the given answer.
question: Who was  Albert Einstein and what is he best known for?
answer: He was a German-born theoretical physicist, widely acknowledged to be one of the greatest and most influential physicists of all time. He was best known for developing the theory of relativity, he also made important contributions to the development of the theory of quantum mechanics.
statements:\nAlbert Einstein was born in Germany.\nAlbert Einstein was best known for his theory of relativity.
question: Cadmium Chloride is slightly soluble in this chemical, it is also called what?
answer: alcohol
statements:\nCadmium Chloride is slightly soluble in alcohol.
question: Were Shahul and Jithin of the same nationality?
answer: They were from different countries.
statements:\nShahul and Jithin were from different countries.
question: Question: When and where did Gaucho Americano have its world premiere, and when was it commercially released in Chilean theaters?
answer: Answer: Gaucho Americano had its world premiere on April 29, 2021, at the Hot Docs Canadian International Documentary Festival. It was commercially released on September 8, 2022, in Chilean theaters.
statements:

// Response from the model
Gaucho Americano had its world premiere on April 29, 2021, at the Hot Docs Canadian International Documentary Festival.
Gaucho Americano was commercially released on September 8, 2022, in Chilean theaters.
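The coarse granularity behind this can be made explicit. Assuming the score is the number of supported statements divided by the total number of statements (`possible_scores` is a hypothetical helper for illustration), an answer that yields only two statements can only score 0, 0.5, or 1, so a single misjudged statement drops the score from 1.00 to 0.50.

```python
def possible_scores(num_statements: int) -> list[float]:
    """All values a supported/total ratio can take for a given statement count."""
    return [k / num_statements for k in range(num_statements + 1)]


print(possible_scores(2))
print(possible_scores(5))
```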

It is not very practical, but obtaining a reliable Faithfulness evaluation seems to require running the evaluation several times.
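Repeating the evaluation and aggregating the scores might look like the sketch below. `run_once` stands in for a real evaluation call and is replaced here by a stub that replays fixed scores; both `aggregate_runs` and the stub are hypothetical.

```python
import statistics
from typing import Callable


def aggregate_runs(run_once: Callable[[], float], repeats: int) -> dict[str, float]:
    """Run an evaluation several times and summarize the resulting scores."""
    scores = [run_once() for _ in range(repeats)]
    return {
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        "min": min(scores),
    }


# Stub replaying the two scores observed for item 026 (0.50, then 1.00),
# plus a third value added purely for illustration.
replay = iter([0.5, 1.0, 1.0])
summary = aggregate_runs(lambda: next(replay), repeats=3)
print(summary)
```

The median is less sensitive to a single outlier run than the mean, at the price of the extra LLM calls each repetition requires.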

Low SelfCheck-Prompt only

016

Question

Question: What caused the crash of Yeti Airlines Flight 691 in Pokhara, Nepal?


Context

Yeti Airlines Flight 691 was a scheduled domestic passenger flight from Kathmandu to Pokhara in Nepal. On 15 January 2023, the aircraft being operated on the route, an ATR 72 flown by Yeti Airlines, crashed while landing at Pokhara, killing all 72 occupants on board. It is the deadliest accident involving an ATR 72.\n\nAccident\nThe flight took off from Kathmandu's Tribhuvan International Airport at 10:33 am NST. It crashed on the bank of the Seti Gandaki River while on final approach to landing at Pokhara International Airport. A video filmed from the ground showed the aircraft banking steeply to the left before crashing 65 metres (213 ft) away. Another video was streamed live on Facebook by Sonu Jaiswal, a passenger on the plane, before and during the crash. The video shows passengers unaware of the situation until seconds before impact.The crash occurred in Gandaki Province between the old Pokhara Airport and the new Pokhara International Airport, which was opened two weeks earlier and also where the aircraft was intending to land. The accident resulted in the deaths of all 72 people on board, and was Nepal's worst aviation accident since the crash of Pakistan International Airlines Flight 268 in 1992, the deadliest aviation accident in Nepalese domestic aviation, and the deadliest accident involving an ATR 72.According to an official at the Pokhara International Airport, air traffic control cleared the flight to land on runway 30 heading from east to west, but the captain requested the opposing runway 12 heading from west to east, minutes before the crash. A Civil Aviation Authority of Nepal spokesperson said: "The weather was clear; according to preliminary information the cause of the crash is the technical issue of the plane."Flight-tracking organisation Flightradar24 noted that during the flight the aircraft had been transmitting inaccurate speed and altitude data.\n\nAftermath\nThe airport was closed as authorities launched a rescue operation. 
The Government of Nepal summoned an emergency cabinet meeting following the crash. Prime Minister Pushpa Kamal Dahal said he was deeply saddened by the tragic accident. The Office of the Prime Minister declared 16 January to be a national day of mourning, and the flag of Nepal was flown at half-staff. Yeti Airlines cancelled all regular flights scheduled for the day.\n\nInvestigation\nExperts noted that the video from the ground taken moments before the crash showed the aircraft's nose noticeably high before the left wing suddenly dropped, probably indicating a stall. Hours after the crash, a five-member committee headed by Nagendra Ghimire was set up to investigate the accident in conjunction with the French Bureau of Enquiry and Analysis for Civil Aviation Safety.\nOn 16 January, the flight data and cockpit voice recorders were found; the recorders were examined in Singapore and with assistance from Transportation Safety Board of Canada, Bureau of Enquiry and Analysis for Civil Aviation Safety, and Transport Safety Investigation Bureau of Singapore. About a month later, on 13 February, a preliminary report was released, which largely reproduced all relevant logs:At 10:56:27, the PF disengaged the Autopilot System (AP) at an altitude of 721 feet Above Ground Level (AGL). The PF then called for "FLAPS 30" at 10:56:32, and the PM replied, "Flaps 30 and descending". The flight data recorder (FDR) data did not record any flap surface movement at that time. Instead, the propeller rotation speed (Np) of both engines decreased simultaneously to less than 25% and the torque (Tq) started decreasing to 0%, which is consistent with both propellers going into the feathered condition...\nThe flight crew then carried out the "Before Landing Checklist" before starting the left turn onto the base leg. During that time, the power lever angle increased from 41% to 44%. 
At the point, Np of both propellers were recorded as Non-Computed Data (NCD) in the FDR and the torque (Tq) of both engines were at 0%. When propellers are in feather, they are not producing thrust...\nAt 10:56:54, another click was heard, followed by the flaps surface movement to the 30 degrees position.\n\nWhen ATC gave the clearance for landing at 10:57:07, the PF mentioned twice that there was no power coming from the engines. At 10:57:11, the power levers were advanced first to 62 degrees then to the maximum power position. At 10:57:20, the PM (who was previously the PF) repeated again that there was no power from the engines...\nThe aircraft's propellers had been feathered for a minute prior to the crash, causing the engine to produce no thrust and lead the vehicle into a stall; the condition levers, which control the propellers, were discovered from the wreckage set to the feathered position. Seconds preceding the crash, the pilots discussed a total lack of power and even moved the power lever to the extreme but failed to recognize that the condition levers were wrongly set. That the condition levers and the flap lever are next to each other in an ATR 72, experts speculate that the monitoring pilot (Kamal KC) had inadvertently moved the former in place of the latter, when asked of by the flying pilot; while, about twenty seconds later, he would set the flap lever properly on his own, he failed to account for his previous mistake, implying that the landing checklist was not properly followed.


Answer

Answer: The crash of Yeti Airlines Flight 691 in Pokhara, Nepal was caused by a technical issue with the plane. According to a spokesperson from the Civil Aviation Authority of Nepal, the weather was clear at the time of the crash, and preliminary information suggests that the crash was due to a technical problem. The flight data and cockpit voice recorders were recovered and examined, and a preliminary report indicated that the propellers of the aircraft had been feathered, causing the engines to produce no thrust and leading the plane into a stall. It was also noted that the power levers and condition levers were incorrectly set, possibly due to a mistake during the landing checklist. The investigation into the crash was conducted by a committee in Nepal, with assistance from the French Bureau of Enquiry and Analysis for Civil Aviation Safety, the Transportation Safety Board of Canada, and the Transport Safety Investigation Bureau of Singapore.


Metric	Value
Answer Relevancy	0.98
Faithfulness	1.00
SelfCheck-Prompt-AVG.	0.76

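Looking at the answer, the following sentence contains information that differs from the facts.

It was also noted that the power levers and condition levers were incorrectly set, possibly due to a mistake during the landing checklist.
⇒ The checklist itself was not wrong; rather, the checklist was not properly followed.
⇒ Only the condition levers were incorrectly set.

Ragas does provide metrics that evaluate against a ground-truth answer, but I was surprised that SelfCheck-Prompt can catch such a subtle falsehood without needing one.

Thinking about improvements
A first step would be to introduce prompt engineering techniques such as Chain of Thought and verify whether they suppress hallucinations.
This requires comparing several techniques and designing a suitable prompt.
If that still does not help, switching to a more capable LLM may be worth trying as a last resort.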
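As one possible starting point, a step-by-step instruction can be added to the answer-generation prompt. The template below is only an illustrative assumption (its wording and the names `COT_TEMPLATE` and `build_cot_prompt` are not from the original implementation), and whether it actually reduces hallucinations would need to be verified by re-running the evaluations.

```python
COT_TEMPLATE = """\
Context: {context}
Question: {question}
First, list the facts from the context that are relevant to the question.
Then, answer using only those facts. If the context does not contain the answer, say so.
Let's think step by step.
Answer:\
"""


def build_cot_prompt(context: str, question: str) -> str:
    """Fill the template with one context and one question."""
    return COT_TEMPLATE.format(context=context, question=question)


print(build_cot_prompt("The levers are adjacent.", "What is next to the flap lever?"))
```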

Summary

I tried evaluating RAG (Retrieval-Augmented Generation), a representative type of LLM application, with the following evaluation methods.

Ragas
- Faithfulness
- Answer Relevancy
SelfCheckGPT
- SelfCheck-Prompt

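In addition, I looked at several of the evaluation results and considered improvements based on them.

(Personal impressions follow)
Because the evaluation itself relies on an LLM, non-determinism cannot be fully eliminated, and I felt that obtaining a reliable evaluation requires running it repeatedly.
That takes time and money (usage fees), which reminded me that evaluating LLM-based applications is still a developing field.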
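To get a feel for the cost, the number of LLM calls made by the SelfCheck-Prompt implementation in this article can be counted: one call per generated sample, plus one judgment call for every sentence-sample pair. The sketch assumes that structure; `selfcheck_calls` is a hypothetical helper and the sentence count is a made-up example.

```python
def selfcheck_calls(num_sentences: int, num_samples: int) -> int:
    """LLM calls for one row: sample generation plus sentence-by-sample judgments."""
    return num_samples + num_sentences * num_samples


# A 5-sentence answer checked against 10 samples.
per_row = selfcheck_calls(num_sentences=5, num_samples=10)
# 50 rows, assuming (for illustration) that every answer has 5 sentences.
print(per_row, per_row * 50)
```

Since the count grows with the product of sentences and samples, repeating the whole evaluation several times multiplies an already large number of calls.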
