[Translation] MLflow LLM Evaluation

Posted at 2025-02-18

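This document was translated by hand by the author, and the accuracy of its content is not guaranteed. Please refer to the original text for the authoritative content.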

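With the emergence of ChatGPT, LLMs have shown the power of text generation in a variety of areas such as question answering, translation, and text summarization. Evaluating an LLM's performance differs somewhat from evaluating a traditional ML model, because there is often no single ground truth to compare against. MLflow provides an API, mlflow.evaluate(), to help you evaluate your LLMs.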

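MLflow's LLM evaluation functionality consists of three main components:

A model to evaluate: This can be an MLflow pyfunc model, a URI pointing to a registered MLflow model, or any Python callable that represents your model, such as a HuggingFace text summarization pipeline.
Metrics: The metrics to compute. LLM evaluation uses LLM metrics.
Evaluation data: The data your model is evaluated on. It can be a pandas DataFrame, a Python list, a numpy array, or an mlflow.data.dataset.Dataset() instance.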

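Full Notebook Guides and Examples

If you are interested in thorough, use-case oriented guides that showcase the simplicity and power of MLflow's evaluation functionality for LLMs, see the notebook collection below: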

View the Notebook Guides

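Quickstart

Below is a simple example that gives a quick overview of how MLflow LLM evaluation works. The example builds a simple question-answering model by wrapping "openai/gpt-4" with a custom prompt. Paste it into IPython or your local editor and run it, installing any missing dependencies when prompted. Running the code requires an OpenAI API key; if you do not have one, you can set it up by following the OpenAI guide.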

export OPENAI_API_KEY='your-api-key-here'
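
The key can also be supplied from within Python instead of the shell. The following is a minimal sketch rather than part of the original guide: `ensure_openai_key` is a hypothetical helper name, and it assumes that prompting with `getpass` is acceptable when the variable is not set.

```python
import os
from getpass import getpass


def ensure_openai_key(prompt_fn=getpass, env=os.environ):
    """Return the OpenAI API key, prompting for it only if it is not already set.

    `ensure_openai_key` is a hypothetical helper, not an MLflow or OpenAI API.
    """
    key = env.get("OPENAI_API_KEY")
    if not key:
        # Ask interactively and keep the key for the rest of the session.
        key = prompt_fn("OpenAI API key: ")
        env["OPENAI_API_KEY"] = key
    return key
```

Calling `ensure_openai_key()` before the code below leaves an existing key untouched and only prompts when none is found.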

import mlflow
import openai
import os
import pandas as pd
from getpass import getpass

eval_data = pd.DataFrame(
    {
        "inputs": [
            "What is MLflow?",
            "What is Spark?",
        ],
        "ground_truth": [
            "MLflow is an open-source platform for managing the end-to-end machine learning (ML) "
            "lifecycle. It was developed by Databricks, a company that specializes in big data and "
            "machine learning solutions. MLflow is designed to address the challenges that data "
            "scientists and machine learning engineers face when developing, training, and deploying "
            "machine learning models.",
            "Apache Spark is an open-source, distributed computing system designed for big data "
            "processing and analytics. It was developed in response to limitations of the Hadoop "
            "MapReduce computing model, offering improvements in speed and ease of use. Spark "
            "provides libraries for various tasks such as data ingestion, processing, and analysis "
            "through its components like Spark SQL for structured data, Spark Streaming for "
            "real-time data processing, and MLlib for machine learning tasks",
        ],
    }
)

with mlflow.start_run() as run:
    system_prompt = "Answer the following question in two sentences"
    # Wrap "gpt-4" as an MLflow model.
    logged_model_info = mlflow.openai.log_model(
        model="gpt-4",
        task=openai.chat.completions,
        artifact_path="model",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "{question}"},
        ],
    )

    # Use predefined question-answering metrics to evaluate our model.
    results = mlflow.evaluate(
        logged_model_info.model_uri,
        eval_data,
        targets="ground_truth",
        model_type="question-answering",
    )
    print(f"See aggregated evaluation results below: \n{results.metrics}")

    # Evaluation result for each data record is available in `results.tables`.
    eval_table = results.tables["eval_results_table"]
    print(f"See evaluation table below: \n{eval_table}")

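LLM Evaluation Metrics

There are two types of LLM evaluation metrics in MLflow:

Heuristic-based metrics: These metrics compute a score for each data record using specific functions such as Rouge (rougeL()), Flesch Kincaid (flesch_kincaid_grade_level()), or Bilingual Evaluation Understudy (BLEU) (bleu()). They are similar to traditional continuous-value metrics. For the list of built-in heuristic metrics and how to define a custom metric with your own function, see the Heuristic-based Metrics section.
LLM-as-a-Judge metrics: LLM-as-a-Judge is a newer type of metric that uses an LLM to score the quality of model outputs. It overcomes a limitation of heuristic-based metrics, which often miss nuances such as context and semantic accuracy. LLM-as-a-Judge metrics are more scalable and cost-effective than human evaluation while providing a more human-like assessment for complex language tasks. MLflow provides various built-in LLM-as-a-Judge metrics and supports creating custom metrics with your own prompt, grading criteria, and reference examples. See the LLM-as-a-Judge Metrics section for details.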

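Heuristic-based Metrics

Built-in Heuristic Metrics

See this page for the full list of built-in heuristic metrics.

Default Metrics with Pre-defined Model Types

MLflow LLM evaluation includes default collections of metrics for pre-selected tasks, such as "question-answering". Depending on the LLM use case you are evaluating, these pre-defined collections can greatly simplify running an evaluation. To use the default metrics for a pre-selected task, specify the model_type argument of mlflow.evaluate(), as shown in the example below.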

results = mlflow.evaluate(
    model,
    eval_data,
    targets="ground_truth",
    model_type="question-answering",
)

The supported LLM model types and their associated metrics are listed below:

question-answering: model_type="question-answering":
- exact-match
- toxicity ¹
- ari_grade_level ²
- flesch_kincaid_grade_level ²
text-summarization: model_type="text-summarization":
- ROUGE ³
- toxicity ¹
- ari_grade_level ²
- flesch_kincaid_grade_level ²
text models: model_type="text":
- toxicity ¹
- ari_grade_level ²
- flesch_kincaid_grade_level ²
retrievers: model_type="retriever":
- precision_at_k ⁴
- recall_at_k ⁴
- ndcg_at_k ⁴
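
To make the retriever metrics more concrete, here is a minimal sketch of precision and recall at k under one common definition. It is an illustration, not MLflow's implementation, which may treat edge cases differently; the function names simply mirror the metric names, and the default k=3 follows footnote 4.

```python
def precision_at_k(retrieved, relevant, k=3):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    relevant = set(relevant)
    return sum(doc in relevant for doc in top_k) / len(top_k)


def recall_at_k(retrieved, relevant, k=3):
    """Fraction of the relevant documents that appear in the top-k retrieved."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(relevant.intersection(retrieved[:k])) / len(relevant)
```

With retrieved documents `["d1", "d2", "d3", "d4"]` and relevant documents `["d1", "d4"]`, only `d1` is in the top 3, so precision at 3 is 1/3 and recall at 3 is 1/2.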

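Using a Custom List of Metrics

Using the pre-defined metrics associated with a given model type is not the only way to generate scores for LLM evaluation in MLflow. You can specify a custom list of metrics in the extra_metrics argument of mlflow.evaluate:

To add metrics to the default metrics list of a pre-defined model type, keep model_type and add your metrics to extra_metrics: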
```
results = mlflow.evaluate(
    model,
    eval_data,
    targets="ground_truth",
    model_type="question-answering",
    extra_metrics=[mlflow.metrics.latency()],
)
```
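The code above evaluates your model using all the metrics for the "question-answering" model type plus mlflow.metrics.latency().
To disable the default metrics and compute only the metrics you select, remove the model_type argument and define the metrics you want: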
```
results = mlflow.evaluate(
    model,
    eval_data,
    targets="ground_truth",
    extra_metrics=[mlflow.metrics.toxicity(), mlflow.metrics.latency()],
)
```

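The full reference of supported evaluation metrics can be found here.

Creating Custom Heuristic-based LLM Evaluation Metrics

This is very similar to creating a traditional custom metric, except that it returns an instance of mlflow.metrics.MetricValue().

Implement an eval_fn that defines your scoring logic. The function must take two arguments, predictions and targets, and must return a mlflow.metrics.MetricValue() instance.
Pass eval_fn and the other arguments to the mlflow.metrics.make_metric API to create the metric.

The following code creates a dummy per-row metric called "over_10_chars": the score is yes if the model output is longer than 10 characters, and no otherwise.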

from mlflow.metrics import MetricValue, make_metric


def eval_fn(predictions, targets):
    # One "yes"/"no" label per prediction.
    scores = ["yes" if len(pred) > 10 else "no" for pred in predictions]
    # The labels are strings, so no numeric aggregate (mean, variance, p90) is computed.
    return MetricValue(scores=scores)


# Create an EvaluationMetric object.
passing_code_metric = make_metric(
    eval_fn=eval_fn, greater_is_better=False, name="over_10_chars"
)
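
Because the scores of this metric are the strings yes and no, numeric aggregates such as a mean do not apply to them directly. The following is a minimal, MLflow-independent sketch of the same per-row logic together with one possible summary; `yes_fraction` is a hypothetical helper name, not an MLflow API.

```python
def over_10_chars(predictions):
    """Label each prediction 'yes' if it has more than 10 characters, else 'no'."""
    return ["yes" if len(pred) > 10 else "no" for pred in predictions]


def yes_fraction(labels):
    """Share of 'yes' labels; a hypothetical aggregate for categorical scores."""
    if not labels:
        return 0.0
    return sum(label == "yes" for label in labels) / len(labels)
```

For the two outputs `"short"` and `"a much longer answer"`, the labels are `["no", "yes"]` and the share of yes labels is 0.5.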

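To create a custom metric that depends on other metrics, include the names of those metrics as arguments after predictions and targets. These can be the names of built-in metrics or of other custom metrics. Be careful not to introduce circular dependencies between metrics by accident, or the evaluation will fail.

The following code creates a dummy per-row metric called "toxic_or_over_10_chars": the score is yes if the model output is longer than 10 characters or its toxicity score is greater than 0.5, and no otherwise.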

from mlflow.metrics import MetricValue, make_metric


def eval_fn(predictions, targets, toxicity, over_10_chars):
    # over_10_chars.scores holds "yes"/"no" strings, so compare explicitly.
    scores = [
        "yes"
        if toxicity.scores[i] > 0.5 or over_10_chars.scores[i] == "yes"
        else "no"
        for i in range(len(toxicity.scores))
    ]
    return MetricValue(scores=scores)


# Create an EvaluationMetric object.
toxic_or_over_10_chars_metric = make_metric(
    eval_fn=eval_fn, greater_is_better=False, name="toxic_or_over_10_chars"
)

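LLM-as-a-Judge Metrics

LLM-as-a-Judge is a newer type of metric that uses an LLM to score the quality of model outputs. It is more scalable and cost-effective than human evaluation while providing a more human-like assessment for complex language tasks.

MLflow supports built-in LLM-as-a-Judge metrics and also lets you create your own with custom configurations and prompts.

Built-in LLM-as-a-Judge Metrics

To use a built-in LLM-as-a-Judge metric in MLflow, pass a list of metric definitions to the extra_metrics argument of the mlflow.evaluate() function.

The following example evaluates with the built-in answer correctness metric together with the latency metric (a heuristic):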

from mlflow.metrics import latency
from mlflow.metrics.genai import answer_correctness

results = mlflow.evaluate(
    model,  # the model to evaluate, as in the earlier examples
    eval_data,
    targets="ground_truth",
    extra_metrics=[
        answer_correctness(),
        latency(),
    ],
)

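Here is the list of built-in LLM-as-a-Judge metrics. Click a link to see the full documentation for each metric:

answer_similarity(): Evaluates how similar the model's generated output is to the information in the ground truth data.
answer_correctness(): Evaluates how factually correct the model's generated output is with respect to the information in the ground truth data.
answer_relevance(): Evaluates how relevant the model's generated output is to the input (context is ignored).
relevance(): Evaluates how relevant the model's generated output is with respect to both the input and the context.
faithfulness(): Evaluates how faithful the model's generated output is to the provided context.

Selecting the Judge Model

By default, MLflow uses OpenAI's GPT-4 model as the judge model that scores metrics. You can change the judge model by passing it to the model argument of the metric definition.

1. SaaS LLM Providers

To use a SaaS LLM provider such as OpenAI or Anthropic, set the model parameter of the metric definition in the format <provider>:/<model-name>. At present, MLflow supports ["openai", "anthropic", "bedrock", "mistral", "togetherai"] as LLM providers for judge models.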
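
The `<provider>:/<model-name>` format can be illustrated with a small parser. This is a sketch for building intuition only: `split_judge_uri` and `is_saas_provider` are hypothetical helpers, and MLflow performs its own parsing internally.

```python
# Providers named in the text above.
SAAS_PROVIDERS = {"openai", "anthropic", "bedrock", "mistral", "togetherai"}


def split_judge_uri(uri):
    """Split '<provider>:/<model-name>' into (provider, model_name).

    Hypothetical helper for illustration, not an MLflow API.
    """
    # Split on the first ':/' only, so model names containing ':' stay intact.
    provider, separator, model_name = uri.partition(":/")
    if not separator or not provider or not model_name:
        raise ValueError(f"Expected '<provider>:/<model-name>', got {uri!r}")
    return provider, model_name


def is_saas_provider(uri):
    """True if the URI names one of the SaaS providers listed in the text."""
    return split_judge_uri(uri)[0] in SAAS_PROVIDERS
```

Under this sketch, `"openai:/gpt-4o"` splits into `("openai", "gpt-4o")`, while an `endpoints:/...` URI is parsed the same way but is not classified as a SaaS provider.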

OpenAI / Azure OpenAI

OpenAI models can be accessed through the openai:/<model-name> URI.

import mlflow
import os

os.environ["OPENAI_API_KEY"] = "<your-openai-api-key>"

answer_correctness = mlflow.metrics.genai.answer_correctness(model="openai:/gpt-4o")

# Test the metric definition
answer_correctness(
    inputs="What is MLflow?",
    predictions="MLflow is an innovative full self-driving airship.",
    targets="MLflow is an open-source platform for managing the end-to-end ML lifecycle.",
)

Azure OpenAI endpoints can be accessed through the same openai:/<model-name> URI by setting environment variables such as OPENAI_API_BASE and OPENAI_API_TYPE.

os.environ["OPENAI_API_TYPE"] = "azure"
os.environ["OPENAI_API_BASE"] = "https://my-azure-openai-endpoint.azure.com/"
os.environ["OPENAI_DEPLOYMENT_NAME"] = "gpt-4o-mini"
os.environ["OPENAI_API_VERSION"] = "2024-08-01-preview"
os.environ["OPENAI_API_KEY"] = "<your-api-key-for-azure-openai-endpoint>"

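For other providers, see the original documentation.

Note
Using a third-party LLM service (such as OpenAI) for evaluation may be subject to, and governed by, that LLM service's terms of use.

2. Self-hosted Proxy Endpoints

If you access a SaaS LLM provider through a proxy endpoint (for example, for security compliance), you can set the proxy_url parameter of the metric definition. In addition, use extra_headers to pass extra headers for authenticating with the endpoint.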

answer_similarity = mlflow.metrics.genai.answer_similarity(
    model="openai:/gpt-4o",
    proxy_url="https://my-proxy-endpoint/chat",
    extra_headers={"Group-ID": "my-group-id"},
)

3. MLflow AI Gateway Endpoints

MLflow AI Gateway is a self-hosted solution that lets you query various LLM providers through a unified interface. To use an endpoint hosted by MLflow AI Gateway:

Start the MLflow AI Gateway server with your LLM settings by following these steps.
Point the MLflow deployments client at the server address using set_deployments_target().
Set endpoints:/<endpoint-name> as the model parameter of the metric definition.

from mlflow.deployments import set_deployments_target

# When the MLflow AI Gateway server is running at http://localhost:5000
set_deployments_target("http://localhost:5000")
my_answer_similarity = mlflow.metrics.genai.answer_similarity(
    model="endpoints:/my-endpoint"
)

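4. Databricks Model Serving

If you host a model on Databricks, you can use it as a judge model by setting endpoints:/<endpoint-name> as the model parameter of the metric definition. The following code uses the Llama 3.1 405B model available through the Foundation Model API.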

from mlflow.deployments import set_deployments_target

set_deployments_target("databricks")
llama3_answer_similarity = mlflow.metrics.genai.answer_similarity(
    model="endpoints:/databricks-llama-3-1-405b-instruct"
)

Overriding the Default Judge Parameters

By default, MLflow queries the judge LLM with the following parameters:

temperature: 0.0
max_tokens: 200
top_p: 1.0

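However, these parameters may not suit every LLM provider. For example, accessing Anthropic's Claude models on Amazon Bedrock requires anthropic_version to be specified in the request payload. You can override the default parameters by passing the parameters argument to the metric definition.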

my_answer_similarity = mlflow.metrics.genai.answer_similarity(
    model="bedrock:/anthropic.claude-3-5-sonnet-20241022-v2:0",
    parameters={
        "temperature": 0,
        "max_tokens": 256,
        "anthropic_version": "bedrock-2023-05-31",
    },
)

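Note that the parameters dictionary you pass in the parameters argument replaces all of the default parameters rather than being merged with them. For example, in the code above, top_p is not sent to the model.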
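
The difference between replacing and merging can be sketched in plain Python. This only mimics the behaviour described above and is not MLflow code; `effective_params` and `merged_params` are hypothetical helper names.

```python
# Default judge parameters listed in the text above.
DEFAULT_JUDGE_PARAMS = {"temperature": 0.0, "max_tokens": 200, "top_p": 1.0}


def effective_params(parameters=None):
    """Replace semantics: a user dictionary is used as-is, defaults are dropped."""
    if parameters is None:
        return dict(DEFAULT_JUDGE_PARAMS)
    return dict(parameters)


def merged_params(parameters):
    """Merge semantics, for contrast: defaults are kept unless overridden."""
    return {**DEFAULT_JUDGE_PARAMS, **parameters}
```

If you want a default such as top_p to remain in effect, one option is to build the merged dictionary yourself and pass the result as the parameters argument.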

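Creating Custom LLM-as-a-Judge Metrics

You can also create your own LLM-as-a-Judge evaluation metric with mlflow.metrics.genai.make_genai_metric(), which needs the following information:

name: The name of your custom metric.
definition: A description of what the metric does.
grading_prompt: A description of the scoring criteria.
examples (optional): A few sample inputs/outputs with scores, used as references by the LLM judge.

See the API documentation for the full list of configurations.

Under the hood, definition, grading_prompt, and examples are combined with the evaluation data and the model output into one long prompt, which is sent to the LLM. If you are familiar with prompt engineering, a SaaS LLM evaluation metric is essentially an attempt to compose the "right" prompt, containing instructions, data, and model output, so that an LLM such as GPT-4 can output the information you want.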
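
To give a feel for how such a prompt might be put together, here is a rough sketch. The layout, the section labels, and the `build_judge_prompt` helper are all assumptions made for illustration; MLflow's actual prompt template is different and is not reproduced here.

```python
def build_judge_prompt(definition, grading_prompt, examples, input_text, output_text):
    """Combine the metric pieces and one data record into a single prompt string.

    Hypothetical helper: the layout is illustrative, not MLflow's template.
    Each example is a dict with input, output, score, and justification keys.
    """
    sections = [
        "Metric definition:\n" + definition,
        "Grading rubric:\n" + grading_prompt,
    ]
    example_blocks = [
        f"Input: {ex['input']}\n"
        f"Output: {ex['output']}\n"
        f"Score: {ex['score']}\n"
        f"Justification: {ex['justification']}"
        for ex in examples
    ]
    if example_blocks:
        # Reference examples are optional, so the section is added only when present.
        sections.append("Examples:\n" + "\n\n".join(example_blocks))
    # The record being judged goes last.
    sections.append("Input:\n" + input_text)
    sections.append("Output:\n" + output_text)
    return "\n\n".join(sections)
```

The point of the sketch is only that the judge sees the instructions, the reference examples, and the record being scored in one request.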

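Now let's create a custom GenAI metric called "professionalism", which measures how professional our model outputs are.

First, let's create a few examples with scores. These serve as the reference examples used by the LLM judge. To create them, we use the mlflow.metrics.genai.EvaluationExample() class, which has four fields:

input: The input text.
output: The output text.
score: The score of the output in the context of the input.
justification: Why that score is given to the data.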

professionalism_example_score_2 = mlflow.metrics.genai.EvaluationExample(
    input="What is MLflow?",
    output=(
        "MLflow is like your friendly neighborhood toolkit for managing your machine learning projects. It helps "
        "you track experiments, package your code and models, and collaborate with your team, making the whole ML "
        "workflow smoother. It's like your Swiss Army knife for machine learning!"
    ),
    score=2,
    justification=(
        "The response is written in a casual tone. It uses contractions, filler words such as 'like', and "
        "exclamation points, which make it sound less professional. "
    ),
)
professionalism_example_score_4 = mlflow.metrics.genai.EvaluationExample(
    input="What is MLflow?",
    output=(
        "MLflow is an open-source platform for managing the end-to-end machine learning (ML) lifecycle. It was "
        "developed by Databricks, a company that specializes in big data and machine learning solutions. MLflow is "
        "designed to address the challenges that data scientists and machine learning engineers face when "
        "developing, training, and deploying machine learning models."
    ),
    score=4,
    justification=("The response is written in a formal language and a neutral tone. "),
)

Now let's define the professionalism metric. You can see how each field is set up.

professionalism = mlflow.metrics.genai.make_genai_metric(
    name="professionalism",
    definition=(
        "Professionalism refers to the use of a formal, respectful, and appropriate style of communication that is "
        "tailored to the context and audience. It often involves avoiding overly casual language, slang, or "
        "colloquialisms, and instead using clear, concise, and respectful language."
    ),
    grading_prompt=(
        "Professionalism: If the answer is written using a professional tone, below are the details for different scores: "
        "- Score 0: Language is extremely casual, informal, and may include slang or colloquialisms. Not suitable for "
        "professional contexts. "
        "- Score 1: Language is casual but generally respectful and avoids strong informality or slang. Acceptable in "
        "some informal professional settings. "
        "- Score 2: Language is overall formal but still has casual words/phrases. Borderline for professional contexts. "
        "- Score 3: Language is balanced and avoids extreme informality or formality. Suitable for most professional contexts. "
        "- Score 4: Language is noticeably formal, respectful, and avoids casual elements. Appropriate for formal "
        "business or academic settings. "
    ),
    examples=[professionalism_example_score_2, professionalism_example_score_4],
    model="openai:/gpt-4o-mini",
    parameters={"temperature": 0.0},
    aggregations=["mean", "variance"],
    greater_is_better=True,
)

Preparing the Target Model

To evaluate your model with mlflow.evaluate(), it must be one of the following types:

An instance of mlflow.pyfunc.PyFuncModel(), or a URI pointing to a logged mlflow.pyfunc.PyFuncModel model. We generally call this an MLflow model.
A Python function that takes string inputs and outputs a single string. The callable must match the signature of mlflow.pyfunc.PyFuncModel.predict() (without the params argument). In short, it must:
- Take data as its only argument, which can be a pandas.DataFrame, numpy.ndarray, Python list, dictionary, or scipy matrix.
- Return one of pandas.DataFrame, pandas.Series, numpy.ndarray, or list.
An MLflow Deployments endpoint URI pointing to a local MLflow AI Gateway, the Databricks Foundation Models API, or External Models in Databricks Model Serving.
Set model=None and put the model outputs in the data argument. This only applies when the data is a pandas DataFrame.

Evaluating with an MLflow Model

For detailed instructions on converting your model into a mlflow.pyfunc.PyFuncModel instance, see this documentation. In short, to evaluate your model as an MLflow model, we recommend the following steps:

Log your model to the MLflow server with log_model. Each flavor has its own log_model API, such as mlflow.openai.log_model():

with mlflow.start_run():
    system_prompt = "Answer the following question in two sentences"
    # Wrap "gpt-4o-mini" as an MLflow model.
    logged_model_info = mlflow.openai.log_model(
        model="gpt-4o-mini",
        task=openai.chat.completions,
        artifact_path="model",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "{question}"},
        ],
    )

Use the URI of the logged model as the model instance in mlflow.evaluate():

results = mlflow.evaluate(
    logged_model_info.model_uri,
    eval_data,
    targets="ground_truth",
    model_type="question-answering",
)

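Evaluating with a Custom Function

As of MLflow 2.8.0, mlflow.evaluate() supports evaluating a Python function without logging the model to MLflow. This is useful when you only want to evaluate a model and do not want to log it. The following example uses mlflow.evaluate() to evaluate a function. You also need to set up OpenAI authentication to run the code below.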

import mlflow
import openai
import pandas as pd
from typing import List

eval_data = pd.DataFrame(
    {
        "inputs": [
            "What is MLflow?",
            "What is Spark?",
        ],
        "ground_truth": [
            "MLflow is an open-source platform for managing the end-to-end machine learning (ML) lifecycle. It was developed by Databricks, a company that specializes in big data and machine learning solutions. MLflow is designed to address the challenges that data scientists and machine learning engineers face when developing, training, and deploying machine learning models.",
            "Apache Spark is an open-source, distributed computing system designed for big data processing and analytics. It was developed in response to limitations of the Hadoop MapReduce computing model, offering improvements in speed and ease of use. Spark provides libraries for various tasks such as data ingestion, processing, and analysis through its components like Spark SQL for structured data, Spark Streaming for real-time data processing, and MLlib for machine learning tasks",
        ],
    }
)


def openai_qa(inputs: pd.DataFrame) -> List[str]:
    predictions = []
    system_prompt = "Please answer the following question in formal language."

    for _, row in inputs.iterrows():
        completion = openai.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": row["inputs"]},
            ],
        )
        predictions.append(completion.choices[0].message.content)

    return predictions


with mlflow.start_run():
    results = mlflow.evaluate(
        model=openai_qa,
        data=eval_data,
        targets="ground_truth",
        model_type="question-answering",
    )

print(results.metrics)

Output

{
    "flesch_kincaid_grade_level/v1/mean": 14.75,
    "flesch_kincaid_grade_level/v1/variance": 0.5625,
    "flesch_kincaid_grade_level/v1/p90": 15.35,
    "ari_grade_level/v1/mean": 18.15,
    "ari_grade_level/v1/variance": 0.5625,
    "ari_grade_level/v1/p90": 18.75,
    "exact_match/v1": 0.0,
}
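
The keys in this output appear to follow a `<metric>/<version>/<aggregation>` layout, with the aggregation part missing for exact_match. Assuming that layout, which is inferred from the sample output above rather than taken from a specification, the flat dictionary can be regrouped per metric. `group_metrics` is a hypothetical helper name.

```python
# The aggregated metrics printed above.
metrics = {
    "flesch_kincaid_grade_level/v1/mean": 14.75,
    "flesch_kincaid_grade_level/v1/variance": 0.5625,
    "flesch_kincaid_grade_level/v1/p90": 15.35,
    "ari_grade_level/v1/mean": 18.15,
    "ari_grade_level/v1/variance": 0.5625,
    "ari_grade_level/v1/p90": 18.75,
    "exact_match/v1": 0.0,
}


def group_metrics(flat_metrics):
    """Regroup '<metric>/<version>/<aggregation>' keys into nested dictionaries."""
    grouped = {}
    for key, value in flat_metrics.items():
        name, _, rest = key.partition("/")
        _version, _, aggregation = rest.partition("/")
        # Keys without an aggregation part (e.g. exact_match/v1) go under "value".
        grouped.setdefault(name, {})[aggregation or "value"] = value
    return grouped
```

After regrouping, `group_metrics(metrics)["ari_grade_level"]["mean"]` gives 18.15 and the exact_match entry holds a single value of 0.0.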

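Evaluating with an MLflow Deployments Endpoint

In MLflow >= 2.11.0, mlflow.evaluate() can evaluate a model endpoint when you pass the MLflow Deployments endpoint URI directly as the model argument. This is useful when you want to evaluate a model hosted by a local MLflow AI Gateway, the Databricks Foundation Models API, or External Models in Databricks Model Serving without writing code to wrap it as an MLflow model or a Python function.

As shown in the examples below, remember to set the target deployments client with mlflow.deployments.set_deployments_target() before calling mlflow.evaluate() with an endpoint URI. Otherwise, you will see an error message such as MlflowException: No deployments target has been set....

Hint
If you want to use an endpoint that is not hosted by MLflow AI Gateway or Databricks, you can create a custom Python function by following the Evaluating with a Custom Function guide and use it as the model argument.

Supported Input Data Formats

When you use an MLflow Deployments endpoint URI as the model, the input data must be in one of the supported formats; see the original documentation for the list.

Specifying Inference Parameters

You can pass additional inference parameters such as max_tokens, temperature, and n to the model endpoint by setting the inference_params argument of mlflow.evaluate(). The inference_params argument is a dictionary of the parameters to pass to the model endpoint. The specified parameters are used for every input record in the evaluation dataset.

Note
When the input is a dictionary representing a request payload, it can also include parameters such as max_tokens. If a parameter appears in both inference_params and the input data, the value in inference_params takes precedence.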
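
The precedence rule in the note can be sketched as a dictionary merge. This is an illustration of the described behaviour only, not MLflow's implementation; `merge_request` is a hypothetical helper name.

```python
def merge_request(row_payload, inference_params=None):
    """Combine one input payload with inference_params.

    On a key clash, the value from inference_params wins, as the note describes.
    """
    merged = dict(row_payload)  # copy, so the input record is left unchanged
    merged.update(inference_params or {})
    return merged
```

For example, a record that carries max_tokens of 50 evaluated with inference_params of `{"max_tokens": 100}` would be sent with max_tokens of 100 under this rule.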

Examples

Chat endpoint hosted by a local MLflow AI Gateway

import mlflow
from mlflow.deployments import set_deployments_target
import pandas as pd

# Point the client to the local MLflow AI Gateway
set_deployments_target("http://localhost:5000")

eval_data = pd.DataFrame(
    {
        # Input data must be a string column and named "inputs".
        "inputs": [
            "What is MLflow?",
            "What is Spark?",
        ],
        # Additional ground truth data for evaluating the answer
        "ground_truth": [
            "MLflow is an open-source platform ....",
            "Apache Spark is an open-source, ...",
        ],
    }
)


with mlflow.start_run() as run:
    results = mlflow.evaluate(
        model="endpoints:/my-chat-endpoint",
        data=eval_data,
        targets="ground_truth",
        inference_params={"max_tokens": 100, "temperature": 0.0},
        model_type="question-answering",
    )

Completion endpoint hosted on the Databricks Foundation Models API

import mlflow
from mlflow.deployments import set_deployments_target
import pandas as pd

# Point the client to Databricks Foundation Models API
set_deployments_target("databricks")

eval_data = pd.DataFrame(
    {
        # Input data must be a string column and named "inputs".
        "inputs": [
            "Write 3 reasons why you should use MLflow?",
            "Can you explain the difference between classification and regression?",
        ],
    }
)


with mlflow.start_run() as run:
    results = mlflow.evaluate(
        model="endpoints:/databricks-mpt-7b-instruct",
        data=eval_data,
        inference_params={"max_tokens": 100, "temperature": 0.0},
        model_type="text",
    )

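External Models in Databricks Model Serving can be evaluated in the same way; you only need to specify a different URI, such as "endpoints:/your-chat-endpoint".

Evaluating with a Static Dataset

In MLflow >= 2.8.0, mlflow.evaluate() can evaluate a static dataset without specifying a model. This is useful when you have saved the model outputs in a pandas DataFrame or an MLflow PandasDataset and want to evaluate the static dataset without re-running the model.

If you are using a pandas DataFrame, you must specify the name of the column that contains the model outputs with the top-level predictions parameter of mlflow.evaluate():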

import mlflow
import pandas as pd

eval_data = pd.DataFrame(
    {
        "inputs": [
            "What is MLflow?",
            "What is Spark?",
        ],
        "ground_truth": [
            "MLflow is an open-source platform for managing the end-to-end machine learning (ML) lifecycle. "
            "It was developed by Databricks, a company that specializes in big data and machine learning solutions. "
            "MLflow is designed to address the challenges that data scientists and machine learning engineers "
            "face when developing, training, and deploying machine learning models.",
            "Apache Spark is an open-source, distributed computing system designed for big data processing and "
            "analytics. It was developed in response to limitations of the Hadoop MapReduce computing model, "
            "offering improvements in speed and ease of use. Spark provides libraries for various tasks such as "
            "data ingestion, processing, and analysis through its components like Spark SQL for structured data, "
            "Spark Streaming for real-time data processing, and MLlib for machine learning tasks",
        ],
        "predictions": [
            "MLflow is an open-source platform that provides handy tools to manage Machine Learning workflow "
            "lifecycle in a simple way",
            "Spark is a popular open-source distributed computing system designed for big data processing and analytics.",
        ],
    }
)

with mlflow.start_run() as run:
    results = mlflow.evaluate(
        data=eval_data,
        targets="ground_truth",
        predictions="predictions",
        extra_metrics=[mlflow.metrics.genai.answer_similarity()],
        evaluators="default",
    )
    print(f"See aggregated evaluation results below: \n{results.metrics}")

    eval_table = results.tables["eval_results_table"]
    print(f"See evaluation table below: \n{eval_table}")

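Viewing Evaluation Results

Viewing Evaluation Results via Code

mlflow.evaluate() returns the evaluation results as an mlflow.models.EvaluationResult() instance. To see the scores for the selected metrics, check the following:

metrics: Stores the aggregated results, such as the average and variance, across the evaluation dataset. Let's take a second pass at the code example above and focus on printing the aggregated results.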

with mlflow.start_run() as run:
    results = mlflow.evaluate(
        data=eval_data,
        targets="ground_truth",
        predictions="predictions",
        extra_metrics=[mlflow.metrics.genai.answer_similarity()],
        evaluators="default",
    )
    print(f"See aggregated evaluation results below: \n{results.metrics}")

tables["eval_results_table"]: Stores the per-row evaluation results.

with mlflow.start_run() as run:
    results = mlflow.evaluate(
        data=eval_data,
        targets="ground_truth",
        predictions="predictions",
        extra_metrics=[mlflow.metrics.genai.answer_similarity()],
        evaluators="default",
    )
    print(
        f"See per-data evaluation results below: \n{results.tables['eval_results_table']}"
    )

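Viewing Evaluation Results via the MLflow UI

Your evaluation results are automatically logged to the MLflow server, so you can view them directly in the MLflow UI. To view evaluation results in the MLflow UI, follow these steps:

Go to the experiment view of your MLflow experiment.
Select the "Evaluation" tab.
Select the runs whose evaluation results you want to check.
Select the metrics from the dropdown menu on the right.

For clarity, the original article shows a screenshot of this view here.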

¹ Requires the evaluate, torch, and transformers packages.
² Requires the textstat package.
³ Requires the evaluate, nltk, and rouge-score packages.
⁴ All retriever metrics use a default retriever_k of 3, which can be overridden by specifying retriever_k in evaluator_config.
