A Summary of LLM as a Judge Implementations

Posted at 2024-12-22

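Introduction

I implemented LLM as a Judge, an approach in which an LLM evaluates the output of large language models (LLMs), and this post summarizes the result.

I hope it serves as a starting point for running LLM as a Judge yourself.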

Environment

- GPU: RTX 3090
- Windows Subsystem for Linux 2 (WSL2)

Articles on LLM as a Judge

Several articles already summarize LLM as a Judge and LLM evaluation in general, so I point to them here.

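2. Implementing LLM-as-a-Judge

2.1 Score-based evaluation

Features to implement
- Take a question-answer pair as input and have an LLM evaluate it
- Include the evaluation criterion (e.g., accuracy or usefulness of the answer) in the prompt
- Output the evaluation result as a number (e.g., 1 to 5)

Python code example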

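import openai

def evaluate_with_llm(question, answer, evaluation_prompt, model="gpt-3.5-turbo"):
    """Evaluate a question-answer pair with an LLM; return an int score or None."""
    openai.api_key = "YOUR_API_KEY"  # Set your API key

    prompt = f"""
    Question: {question}
    Answer: {answer}
    Evaluation criterion: {evaluation_prompt}
    Output the evaluation result as a number from 1 to 5.
    """

    response = openai.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    try:
        # Succeeds only when the model replies with a bare integer
        score = int(response.choices[0].message.content.strip())
    except ValueError:
        # Anything else (e.g., "Score: 4") is treated as unparseable
        score = None
    return score

# Example: a correct answer
question = "What is the capital of Japan?"
answer = "It is Tokyo."
evaluation_prompt = "Evaluate the accuracy of the answer"
score = evaluate_with_llm(question, answer, evaluation_prompt)
print(f"Score: {score}")

# Example: an answer that does not address the question
question = "What is today's date?"
answer = "Nice weather, isn't it?"
evaluation_prompt = "Does the answer correctly address the question?"
score = evaluate_with_llm(question, answer, evaluation_prompt)
print(f"Score: {score}")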

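- This is an implementation example using the openai library
- YOUR_API_KEY must be replaced with your own API key
- Adjusting the evaluation criterion lets the same function cover many kinds of evaluation
- Handle the API key with care (for example, do not commit it to a public repository)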
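
Models often wrap the score in extra text ("Score: 4", "I would rate this 4 out of 5"), in which case a plain `int()` conversion fails and the score is lost. The following is a minimal sketch of a more tolerant parser. The helper name `parse_score` and its `low`/`high` parameters are my own assumptions and not part of the openai library. It simply takes the first integer that falls inside the allowed range, so a reply such as "on a 1 to 5 scale: 3" would be misread as 1; tightening the prompt remains the more reliable fix.

```python
import re

def parse_score(text, low=1, high=5):
    """Return the first integer within [low, high] found in text, else None."""
    if not text:
        return None
    for match in re.finditer(r"\d+", text):
        value = int(match.group())
        if low <= value <= high:
            return value
    return None

print(parse_score("Score: 4"))
print(parse_score("I would rate this 5 out of 5."))
print(parse_score("No numeric rating."))
```

Inside `evaluate_with_llm`, the `try`/`except` block could then be replaced by a single call to `parse_score` on the response text.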

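2.2 Pairwise evaluation

Features to implement
- Compare two different answers to the same question and have the LLM choose the better one
- Output the chosen answer together with the reason

Python code example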

import openai

def pairwise_evaluate_with_llm(question, answer1, answer2, model="gpt-3.5-turbo"):
    """Compare two answers with an LLM and return its free-text verdict."""
    openai.api_key = "YOUR_API_KEY"  # Set your API key

    prompt = f"""
    Question: {question}
    Answer 1: {answer1}
    Answer 2: {answer2}
    Compare the two answers above and choose the better one. Also explain your reasoning.
    """
    response = openai.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    # The verdict and the reason come back as one unstructured string
    return response.choices[0].message.content.strip()

# Example
question = "What sightseeing spot do you recommend?"
answer1 = "Tokyo Tower."
answer2 = "Ueno Park."
result = pairwise_evaluate_with_llm(question, answer1, answer2)
print(f"Result: {result}")

- The prompt is written so that the LLM compares the two answers
- It is a good idea to add a step that parses the LLM output and extracts the result
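
One way to make the pairwise output machine-readable is to ask the model to finish with a fixed marker and then extract it. The sketch below assumes the prompt has been extended with an instruction such as "End your reply with a line of the form `Verdict: 1` or `Verdict: 2`". That marker format and the helper name `parse_pairwise_result` are hypothetical choices for illustration, not something the original prompt produces.

```python
import re

# Hypothetical marker the judge is asked to emit, e.g. "Verdict: 2"
VERDICT_PATTERN = re.compile(r"Verdict:\s*([12])\b")

def parse_pairwise_result(text):
    """Split a judge reply into (winner, reason); winner is 1, 2, or None."""
    text = text or ""
    match = VERDICT_PATTERN.search(text)
    if match is None:
        # No recognizable verdict: keep the raw text as the reason
        return None, text.strip()
    winner = int(match.group(1))
    # Everything except the marker itself is treated as the reason
    reason = (text[:match.start()] + text[match.end():]).strip()
    return winner, reason

reply = "Answer 2 names a place with more to do.\nVerdict: 2"
print(parse_pairwise_result(reply))
```

Returning `None` for the winner when no marker is found keeps unparseable replies visible instead of silently counting them for either answer.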

2.3 LLM juries (PoLL)

Features to implement
- Have multiple LLMs perform the evaluation and combine their results
- Compute a majority vote or an average

Python code example

# Assumes evaluate_with_llm from section 2.1 is defined in the same module.

def evaluate_with_multiple_llms(question, answer, evaluation_prompt, models=("gpt-3.5-turbo", "gpt-4")):
    """Evaluate a question-answer pair with several LLMs; return one score per model."""
    scores = []
    for model in models:
        score = evaluate_with_llm(question, answer, evaluation_prompt, model)
        scores.append(score)
    return scores

def aggregate_scores(scores):
    """Combine multiple evaluation results (here: the mean of the valid scores)."""
    # Drop models whose output could not be parsed into a score
    valid_scores = [score for score in scores if score is not None]
    if not valid_scores:
        return None
    return sum(valid_scores) / len(valid_scores)

# Example: a factual question
question = "What is the value of pi?"
answer = "3.14159"
evaluation_prompt = "Evaluate the accuracy of the answer"
scores = evaluate_with_multiple_llms(question, answer, evaluation_prompt)
average_score = aggregate_scores(scores)
print(f"Score per model: {scores}")
print(f"Average score: {average_score}")

# Example: an evasive answer
question = "What food do you recommend?"
answer = "That is a difficult question."
evaluation_prompt = "Does the answer attempt to address the question?"
scores = evaluate_with_multiple_llms(question, answer, evaluation_prompt)
average_score = aggregate_scores(scores)
print(f"Score per model: {scores}")
print(f"Average score: {average_score}")

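- This implementation calls multiple LLMs
- The aggregate_scores function combines the results
- Other aggregation methods, such as a majority vote, can be implemented as well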
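
As an illustration of the majority-vote alternative mentioned above, here is a minimal sketch. The function name `majority_vote` is my own, and the tie-breaking rule (prefer the lower score, as a conservative choice) is an assumption for this example; other rules are equally defensible.

```python
from collections import Counter

def majority_vote(scores):
    """Return the most common non-None score; ties go to the lower score."""
    valid = [s for s in scores if s is not None]
    if not valid:
        return None
    counts = Counter(valid)
    top = max(counts.values())
    # Among the scores sharing the highest count, pick the smallest one
    return min(score for score, count in counts.items() if count == top)

print(majority_vote([4, 4, 5]))
print(majority_vote([5, None, 5, 2]))
```

Compared with the mean, a majority vote is less sensitive to a single judge giving an extreme score, but it discards information about how far apart the judges were.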

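3. Evaluating LLM as a Judge

3.1 Computing inter-rater agreement

Features to implement
- Compare the evaluation results of multiple raters (e.g., an LLM and a human) and compute their agreement
- Compute the simple agreement rate and Cohen's Kappa

Python code example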

from sklearn.metrics import cohen_kappa_score

def _valid_pairs(evaluations1, evaluations2):
    """Return aligned (x, y) pairs where both raters gave a score, or None."""
    if len(evaluations1) != len(evaluations2):
        return None
    # Filter pairwise so that item i of rater 1 stays matched with item i of rater 2
    pairs = [(x, y) for x, y in zip(evaluations1, evaluations2)
             if x is not None and y is not None]
    return pairs or None

def calculate_agreement(evaluations1, evaluations2):
    """Compute the simple agreement rate between two raters."""
    pairs = _valid_pairs(evaluations1, evaluations2)
    if pairs is None:
        return None
    return sum(1 for x, y in pairs if x == y) / len(pairs)

def calculate_cohens_kappa(evaluations1, evaluations2):
    """Compute Cohen's Kappa between two raters."""
    pairs = _valid_pairs(evaluations1, evaluations2)
    if pairs is None:
        return None
    ratings1, ratings2 = zip(*pairs)
    return cohen_kappa_score(ratings1, ratings2)

# Example
human_evaluations = [5, 4, 3, 5, 2]
llm_evaluations = [4, 4, 2, 5, 3]
agreement = calculate_agreement(human_evaluations, llm_evaluations)
kappa = calculate_cohens_kappa(human_evaluations, llm_evaluations)
print(f"Agreement: {agreement}")
print(f"Cohen's Kappa: {kappa}")

- Cohen's Kappa is computed with the sklearn library
- The simple agreement rate is implemented as well
- None values in the evaluations are handled
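
To show what Cohen's Kappa measures beyond the raw agreement rate, the sketch below computes the unweighted statistic by hand, without sklearn: observed agreement `p_o`, agreement expected by chance `p_e` from each rater's label frequencies, and `kappa = (p_o - p_e) / (1 - p_e)`. The function name `cohens_kappa` is my own, and the code is meant as an explanatory aid rather than a replacement for `cohen_kappa_score`.

```python
from collections import Counter

def cohens_kappa(ratings1, ratings2):
    """Unweighted Cohen's kappa for two equal-length lists of labels."""
    if len(ratings1) != len(ratings2) or not ratings1:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings1)
    # p_o: fraction of items on which the two raters agree
    observed = sum(a == b for a, b in zip(ratings1, ratings2)) / n
    # p_e: chance agreement implied by each rater's label frequencies
    counts1, counts2 = Counter(ratings1), Counter(ratings2)
    expected = sum(counts1[c] * counts2[c] for c in counts1) / (n * n)
    if expected == 1.0:
        return None  # both raters used a single identical label; undefined
    return (observed - expected) / (1 - expected)

human = [5, 4, 3, 5, 2]
llm = [4, 4, 2, 5, 3]
print(cohens_kappa(human, llm))
```

For the example data the raters agree on 2 of 5 items (`p_o = 0.4`) and chance agreement is `p_e = 0.24`, so kappa comes out at roughly 0.21, noticeably lower than the raw agreement rate suggests.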

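3.2 Computing correlation coefficients

Features to implement
- Compute the correlation between the evaluation results of multiple raters
- Compute the Pearson and Spearman correlation coefficients, among others

Python code example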

from scipy.stats import pearsonr, spearmanr

def calculate_correlations(evaluations1, evaluations2):
    """Compute the Pearson and Spearman correlations between two raters."""
    if len(evaluations1) != len(evaluations2):
        return None, None
    # Filter pairwise so that the two raters' scores stay aligned per item
    pairs = [(x, y) for x, y in zip(evaluations1, evaluations2)
             if x is not None and y is not None]
    if len(pairs) < 2:
        # A correlation needs at least two paired observations
        return None, None
    valid_evals1, valid_evals2 = zip(*pairs)
    pearson, _ = pearsonr(valid_evals1, valid_evals2)
    spearman, _ = spearmanr(valid_evals1, valid_evals2)
    return pearson, spearman

# Example
human_evaluations = [5, 4, 3, 5, 2]
llm_evaluations = [4, 4, 2, 5, 3]
pearson, spearman = calculate_correlations(human_evaluations, llm_evaluations)
print(f"Pearson correlation: {pearson}")
print(f"Spearman correlation: {spearman}")

- The Pearson and Spearman correlation coefficients are computed with the scipy library
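
To make the difference between the two coefficients concrete, here is a dependency-free sketch: Pearson measures linear association on the raw scores, while Spearman is the same formula applied to ranks (with tied values sharing their average rank). The helper names `pearson`, `average_ranks`, and `spearman` are my own and are intended only as an explanatory aid alongside the scipy functions.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences, or None if undefined."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # a constant sequence has no defined correlation
    return cov / math.sqrt(var_x * var_y)

def average_ranks(values):
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i..j, shifted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman correlation: Pearson applied to the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

human = [5, 4, 3, 5, 2]
llm = [4, 4, 2, 5, 3]
print(pearson(human, llm), spearman(human, llm))
```

Because 1-to-5 judge scores are ordinal and often tied, the rank-based Spearman coefficient is usually the safer of the two to report.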

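4. Improving the evaluation criteria with ideas from EvalGen

Features to implement
- Have the LLM propose candidate evaluation criteria
- Let the user revise the evaluation criteria
- Re-run the evaluation with the revised criteria
- Repeat this process to arrive at better evaluation criteria

Python code example (pseudocode)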

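import openai  # The API key must be set beforehand, as in section 2.1

def refine_evaluation_criteria(initial_criteria, questions, answers, model="gpt-3.5-turbo"):
    """Improve the evaluation criteria interactively with an LLM and the user."""
    criteria = initial_criteria
    while True:
        # Ask the LLM to propose improvements to the current criteria
        prompt = f"""
        Current evaluation criteria: {criteria}
        Example questions and answers: {list(zip(questions, answers))}
        Propose improvements to these evaluation criteria.
        """
        response = openai.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}]
        )
        llm_criteria_proposal = response.choices[0].message.content.strip()
        print(f"Criteria proposed by the LLM: {llm_criteria_proposal}")

        # Ask the user for revised criteria; an empty input ends the loop
        user_criteria = input("Enter the revised evaluation criteria (press Enter to keep the current ones): ")
        if user_criteria:
            criteria = user_criteria
        else:
            break

        # Re-run the evaluation with the new criteria (omitted here for simplicity)
        print("Evaluation criteria updated")
    return criteria

# Example
initial_criteria = "Accuracy of the answer"
questions = ["What is the capital of Japan?", "What is the weather today?"]
answers = ["Tokyo", "Sunny"]
final_criteria = refine_evaluation_criteria(initial_criteria, questions, answers)
print(f"Final evaluation criteria: {final_criteria}")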

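- A prompt is defined that has the LLM propose evaluation criteria
- Input from the user is accepted
- The evaluation criteria are improved by repeating this process
- The part that actually re-runs the evaluation can be implemented by combining the evaluation functions shown earlier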
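
The re-evaluation step that the pseudocode omits could look roughly like the sketch below. It takes the judge as a parameter so that any callable with the shape `(question, answer, criteria) -> score or None` can be plugged in, for instance `evaluate_with_llm` from section 2.1. The function `reevaluate` and the offline `stub_judge` are hypothetical names introduced here so the example runs without an API key; the stub's scoring rule is meaningless and exists only for demonstration.

```python
def reevaluate(criteria, questions, answers, judge):
    """Score every question-answer pair under the given criteria.

    Returns (scores, mean), where mean ignores None scores.
    """
    scores = [judge(q, a, criteria) for q, a in zip(questions, answers)]
    valid = [s for s in scores if s is not None]
    mean = sum(valid) / len(valid) if valid else None
    return scores, mean

def stub_judge(question, answer, criteria):
    """Hypothetical offline judge: score = answer length clamped to 1..5."""
    return min(5, max(1, len(answer)))

questions = ["What is the capital of Japan?", "What is the weather today?"]
answers = ["Tokyo", "ok"]
scores, mean = reevaluate("Accuracy of the answer", questions, answers, stub_judge)
print(scores, mean)
```

Calling `reevaluate` inside the loop after each criteria update, and printing the scores next to the previous round's, would let the user see whether a revision actually changed the judgments.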

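5. Closing

- I plan to keep updating this post with techniques used in LLM as a Judge
- Adding a step that collects the evaluation results with Weave's Evaluation feature would make them easier to compare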
