Trying GEPA Prompt Optimization with MLflow on Databricks Free Edition

Posted at 2025-10-25

Introduction

I previously wrote an article about the prompt optimization feature in MLflow & Databricks.

Back then, MIPROv2 was the only optimization algorithm available. MLflow 3.5 was released recently, however, and it adds support for GEPA.

GEPA is a new prompt optimization technique that came out of research by Databricks and UC Berkeley. It appears to improve AI systems by combining language-based reflection with evolutionary search.
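To make "reflection plus evolutionary search" more concrete, here is a heavily simplified toy sketch of such a loop. It is not how MLflow or DSPy implement GEPA. Several assumptions apply: a "prompt" is a tuple of hint strings, a fake model answers correctly only when the matching hint is present, and a rule-based stub takes the place of the reflection LLM. Parent selection loosely mimics GEPA's Pareto-style idea by picking among candidates that are best on at least one example. All names (`reflect`, `frontier`, `gepa_like`) are hypothetical.

```python
import random

# Toy assumption: a prompt is a tuple of hint strings; the fake model answers an
# example correctly only if the prompt contains that example's hint.
Prompt = tuple[str, ...]
EXAMPLES = [
    {"question": "q1", "hint": "be concise", "answer": "a1"},
    {"question": "q2", "hint": "answer with a number", "answer": "a2"},
    {"question": "q3", "hint": "name a landmark", "answer": "a3"},
]


def fake_model(prompt: Prompt, example: dict) -> str:
    return example["answer"] if example["hint"] in prompt else "?"


def score(prompt: Prompt, example: dict) -> tuple[float, str]:
    """Return (score, textual feedback), mimicking a judge that explains failures."""
    if fake_model(prompt, example) == example["answer"]:
        return 1.0, "correct"
    return 0.0, f"wrong on {example['question']}; instruction should say: {example['hint']}"


def reflect(prompt: Prompt, feedback: list[str]) -> Prompt:
    """Stub for the reflection LLM: read failure feedback and extend the prompt."""
    new_hints = [f.split("should say: ")[1] for f in feedback if "should say: " in f]
    return prompt + tuple(h for h in new_hints if h not in prompt)


def frontier(scores: list[list[float]]) -> list[int]:
    """Simplified Pareto-style pool: candidates tying for best on at least one example."""
    best = [max(col) for col in zip(*scores)]
    return [i for i, row in enumerate(scores) if any(s == b for s, b in zip(row, best))]


def gepa_like(seed: Prompt, budget: int = 10, minibatch: int = 1, rng_seed: int = 0) -> Prompt:
    rng = random.Random(rng_seed)
    pool = [seed]
    scores = [[score(seed, ex)[0] for ex in EXAMPLES]]
    for _ in range(budget):
        parent = pool[rng.choice(frontier(scores))]
        batch = rng.sample(EXAMPLES, minibatch)
        results = [score(parent, ex) for ex in batch]
        child = reflect(parent, [fb for s, fb in results if s < 1.0])
        # Keep the mutated prompt only if it beats its parent on the same minibatch.
        if sum(score(child, ex)[0] for ex in batch) > sum(s for s, _ in results):
            pool.append(child)
            scores.append([score(child, ex)[0] for ex in EXAMPLES])
    # Return the candidate with the highest total score over all examples.
    best_idx = max(range(len(pool)), key=lambda i: sum(scores[i]))
    return pool[best_idx]


tuned = gepa_like(("Answer this question",))
print(tuned)
```

In the real method, both the feedback and the rewritten instruction are free-form natural language produced by LLMs. The stub only stands in for that step so the search structure is visible.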

See the article below for details. (The improvements it reports are large, and I personally found it very impressive.)

DSPy has supported GEPA-based prompt optimization for a while. Now MLflow's prompt optimization supports GEPA natively as well (although I suspect it uses DSPy internally).

It still appears to be an experimental feature, but I'll try it out on Databricks Free Edition, following the documentation below.

Hands-on

First, create a notebook and install the required packages.
(No versions are pinned here, but make sure you use mlflow 3.5 or later.)

%pip install mlflow[databricks] openai dspy

%restart_python
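Because the `%pip` line does not pin a version, it may be worth confirming after the restart that the installed mlflow meets the 3.5 requirement. Here is a minimal stdlib sketch. `parse_release`, `meets_minimum`, and `check_mlflow` are hypothetical helpers, and the parsing is deliberately simplified: it ignores pre-release suffixes instead of implementing full PEP 440 ordering.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def parse_release(ver: str) -> tuple[int, ...]:
    """Leading numeric release segment, e.g. "3.5.1" -> (3, 5, 1).

    Simplification: suffixes are dropped, so "3.5.0rc0" is treated as (3, 5, 0).
    """
    parts: list[int] = []
    for piece in ver.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
        if match.end() != len(piece):
            break  # stop at a pre-release/dev suffix such as "0rc0"
    return tuple(parts)


def meets_minimum(ver: str, minimum: tuple[int, ...] = (3, 5)) -> bool:
    """Compare release tuples after zero-padding them to the same length."""
    release = parse_release(ver)
    width = max(len(release), len(minimum))
    padded = release + (0,) * (width - len(release))
    target = minimum + (0,) * (width - len(minimum))
    return padded >= target


def check_mlflow(minimum: tuple[int, ...] = (3, 5)) -> bool:
    """Print and return whether the installed mlflow is new enough for GEPA."""
    try:
        installed = version("mlflow")
    except PackageNotFoundError:
        print("mlflow is not installed")
        return False
    ok = meets_minimum(installed, minimum)
    print(f"mlflow {installed}: {'OK' if ok else 'too old, GEPA needs 3.5+'}")
    return ok


check_mlflow()
```

If the `packaging` library is available, `packaging.version.Version` gives PEP 440-correct comparisons and is the more robust choice.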

Next, register the prompt in the Prompt Registry.
The prompt is simply Answer this question: {{question}}.

import mlflow
import openai

# Register the initial prompt
prompt_name = "workspace.default.qa_for_opt"
prompt = mlflow.genai.register_prompt(
    name=prompt_name,
    template="Answer this question: {{question}}",
)

# Attach the alias "target" so the prompt can be loaded by alias later
mlflow.genai.set_prompt_alias(name=prompt_name, alias="target", version=prompt.version)
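The registered template uses double-brace placeholders such as `{{question}}`, and `predict_fn` later fills them in with `prompt.format(question=...)`. Below is a minimal stdlib sketch of that substitution. It assumes simple `{{name}}` variables and is not MLflow's implementation; `template_variables` and `render_template` are hypothetical helpers.

```python
import re

# Matches {{ name }} with optional inner whitespace and captures the variable name.
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def template_variables(template: str) -> set[str]:
    """Return the set of placeholder names used in a template."""
    return set(_PLACEHOLDER.findall(template))


def render_template(template: str, **values: str) -> str:
    """Replace every {{name}} with values[name]; raise if any variable is missing."""
    missing = template_variables(template) - set(values)
    if missing:
        raise KeyError(f"missing template variables: {sorted(missing)}")
    # A function replacement avoids re treating backslashes in values as escapes.
    return _PLACEHOLDER.sub(lambda m: str(values[m.group(1)]), template)


template = "Answer this question: {{question}}"
print(template_variables(template))
print(render_template(template, question="日本で2番目に高い山は？"))
```

Failing loudly on a missing variable is a design choice for the sketch. It makes a mismatch between template variables and dataset inputs obvious rather than silently sending a half-filled prompt.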

Checking the Prompt Registry UI, I could see that the prompt had been registered.
(There is also an unfamiliar Optimize button! Clicking it shows sample code for prompt optimization.)
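The code above attaches the alias `target`, and the prediction function loads the prompt through a URI of the form `prompts:/<name>@<alias>`. Registry URIs can also point at a fixed version as `prompts:/<name>/<version>`. As a rough mental model, here is a toy in-memory registry showing how an alias resolves to a version. `ToyPromptRegistry` is hypothetical and is not the MLflow API.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPromptRegistry:
    """Hypothetical in-memory stand-in for a prompt registry (not MLflow's API)."""

    versions: dict[str, list[str]] = field(default_factory=dict)  # name -> templates
    aliases: dict[tuple[str, str], int] = field(default_factory=dict)  # (name, alias) -> version

    def register(self, name: str, template: str) -> int:
        """Append a new version and return its 1-based version number."""
        self.versions.setdefault(name, []).append(template)
        return len(self.versions[name])

    def set_alias(self, name: str, alias: str, version: int) -> None:
        """Point an alias at an existing version."""
        if not 1 <= version <= len(self.versions.get(name, [])):
            raise ValueError(f"{name} has no version {version}")
        self.aliases[(name, alias)] = version

    def load(self, uri: str) -> str:
        """Resolve 'prompts:/<name>/<version>' or 'prompts:/<name>@<alias>'."""
        prefix = "prompts:/"
        if not uri.startswith(prefix):
            raise ValueError(f"unsupported URI: {uri}")
        ref = uri[len(prefix):]
        if "@" in ref:
            name, alias = ref.split("@", 1)
            version = self.aliases[(name, alias)]
        else:
            name, _, ver = ref.rpartition("/")
            version = int(ver)
        return self.versions[name][version - 1]


reg = ToyPromptRegistry()
name = "workspace.default.qa_for_opt"
v1 = reg.register(name, "Answer this question: {{question}}")
reg.set_alias(name, "target", v1)
v2 = reg.register(name, "Answer concisely: {{question}}")
print(reg.load(f"prompts:/{name}@target"))  # alias still points at version 1
reg.set_alias(name, "target", v2)
print(reg.load(f"prompts:/{name}@target"))  # now resolves to version 2
```

In this toy model the alias is resolved at load time, so moving `target` to a new version changes what a caller gets without touching the caller's code.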

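Next, prepare the prediction function and the training dataset used for optimization.
This time, the goal is to optimize the prompt so that llama-3-1-8b-instruct answers better.
For the training dataset, I prepared Japanese-language questions about Japan.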

import openai
import mlflow

creds = mlflow.utils.databricks_utils.get_databricks_host_creds()
client = openai.OpenAI(
    base_url=creds.host + "/serving-endpoints",
    api_key=creds.token,
)

# Define the prediction function
def predict_fn(question: str) -> str:
    prompt = mlflow.genai.load_prompt(f"prompts:/{prompt_name}@target")
    completion = client.chat.completions.create(
        model="databricks-meta-llama-3-1-8b-instruct",
        messages=[{"role": "user", "content": prompt.format(question=question)}],
    )
    return completion.choices[0].message.content

# Training data containing inputs and expected outputs
dataset = [
    {
        # The input schema must match the prediction function's arguments
        "inputs": {"question": "日本で2番目に高い山は？"},
        "expectations": {"expected_response": "北岳"},
    },
    {
        "inputs": {"question": "京都で最も有名な観光名所は？"},
        "expectations": {"expected_response": "清水寺"},
    },
    {
        "inputs": {"question": "日本の政令指定都市はいくつある？"},
        "expectations": {"expected_response": "20"},
    },
]
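The comment in the dataset notes that the keys of `inputs` must match the arguments of `predict_fn`. Because an optimization run consumes many LLM calls, a cheap pre-flight check can catch mismatches early. Here is a minimal stdlib sketch. It assumes each record is a dict with `inputs` and a non-empty `expectations` dict, as above. `validate_dataset` and `predict_stub` are hypothetical and not part of MLflow.

```python
import inspect
from typing import Any, Callable

_NAMED = (inspect.Parameter.POSITIONAL_OR_KEYWORD, inspect.Parameter.KEYWORD_ONLY)


def validate_dataset(predict_fn: Callable[..., Any], dataset: list[dict]) -> list[str]:
    """Return a list of problems; an empty list means every record looks usable."""
    # Only named parameters count; *args/**kwargs are ignored in this sketch.
    params = {n: p for n, p in inspect.signature(predict_fn).parameters.items()
              if p.kind in _NAMED}
    required = {n for n, p in params.items() if p.default is inspect.Parameter.empty}
    problems = []
    for i, record in enumerate(dataset):
        inputs = record.get("inputs")
        if not isinstance(inputs, dict):
            problems.append(f"record {i}: 'inputs' must be a dict")
            continue
        keys = set(inputs)
        if required - keys:
            problems.append(f"record {i}: missing arguments {sorted(required - keys)}")
        if keys - set(params):
            problems.append(f"record {i}: unexpected arguments {sorted(keys - set(params))}")
        if not record.get("expectations"):
            problems.append(f"record {i}: empty or missing 'expectations'")
    return problems


def predict_stub(question: str) -> str:
    """Stand-in with the same signature as the article's predict_fn."""
    return ""


sample = [
    {"inputs": {"question": "日本で2番目に高い山は？"}, "expectations": {"expected_response": "北岳"}},
    {"inputs": {"query": "typo in key"}, "expectations": {}},
]
for problem in validate_dataset(predict_stub, sample):
    print(problem)
```

In the notebook itself, the equivalent call would be `validate_dataset(predict_fn, dataset)`.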

Now let's run the optimization with GEPA.

from mlflow.genai.optimize import GepaPromptOptimizer
from mlflow.genai.scorers import Correctness

# Optimize the prompt with GEPA
# Use llama-4-maverick as the reflection LLM for optimization and meta-llama-3-3-70b as the scorer's judge model (both available on Free Edition)
result = mlflow.genai.optimize_prompts(
    predict_fn=predict_fn,
    train_data=dataset,
    prompt_uris=[prompt.uri],
    optimizer=GepaPromptOptimizer(
        reflection_model="databricks:/databricks-llama-4-maverick"
    ),
    scorers=[Correctness(model="databricks:/databricks-meta-llama-3-3-70b-instruct")],
)

# Inspect the optimized prompt
optimized_prompt = result.optimized_prompts[0]
print(f"Optimized template: {optimized_prompt.template}")

print(f"The new prompt URI: {optimized_prompt.uri}")
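`optimize_prompts` searches for a template that scores well under the given scorers on `train_data`. To compare the original and optimized templates yourself, for example to see how the answers change, a small offline harness can help. This is a minimal sketch under explicit assumptions. `exact_match_judge` is a hypothetical, much cruder stand-in for the LLM-based `Correctness` scorer, and `stub_answer` returns canned strings instead of calling the Databricks endpoint. The canned answers are illustrative, not real model output.

```python
import re
import unicodedata
from typing import Callable


def normalize(text: str) -> str:
    """NFKC-normalize, drop whitespace and common (incl. Japanese) punctuation, lowercase."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"[\s。、.,!?？！「」]", "", text).lower()


def exact_match_judge(response: str, expected: str) -> bool:
    """Hypothetical judge: the normalized expected answer appears in the response."""
    return normalize(expected) in normalize(response)


def evaluate(template: str, answer: Callable[[str], str], dataset: list[dict]) -> float:
    """Fraction of records whose answer to the filled-in template passes the judge."""
    hits = 0
    for record in dataset:
        prompt = template.replace("{{question}}", record["inputs"]["question"])
        hits += exact_match_judge(answer(prompt), record["expectations"]["expected_response"])
    return hits / len(dataset)


def stub_answer(prompt: str) -> str:
    """Canned responses keyed by question (illustrative only, not model output)."""
    canned = {
        "日本で2番目に高い山は？": "北岳です。",
        "京都で最も有名な観光名所は？": "金閣寺",
        "日本の政令指定都市はいくつある？": "20",
    }
    return next((a for q, a in canned.items() if q in prompt), "")


sample_dataset = [
    {"inputs": {"question": "日本で2番目に高い山は？"}, "expectations": {"expected_response": "北岳"}},
    {"inputs": {"question": "京都で最も有名な観光名所は？"}, "expectations": {"expected_response": "清水寺"}},
    {"inputs": {"question": "日本の政令指定都市はいくつある？"}, "expectations": {"expected_response": "20"}},
]
print(evaluate("Answer this question: {{question}}", stub_answer, sample_dataset))
```

A substring check is far weaker than an LLM judge (for instance, "20" would also match "120"). It is only meant to make the optimization objective tangible, not to replace `Correctness`.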

When run, the optimization iterates through several rounds; in my case it finished in about 4 minutes.

The resulting prompt is shown below.

Answer the given question directly and concisely. When answering, consider the context of Japan and its geography, administrative divisions, and famous landmarks. For questions that may be subjective or open to interpretation, provide an answer based on generally accepted facts or common knowledge about Japan.

Input: {{question}}
Output: A direct and concise answer to the question.

Examples:
- For "日本の政令指定都市はいくつある？", the answer should be "20".
- For "日本で2番目に高い山は？", the answer should be "北岳".
- For "京都で最も有名な観光名所は？", while subjective, a common answer could be "清水寺" or "金閣寺", but ensure to provide a response that aligns with common knowledge or facts.

When in doubt about the specificity or subjectivity of a question, rely on verifiable information or widely recognized facts about Japan.
```

### New Instruction within ``` blocks

```
Answer the given question directly and concisely about facts related to Japan. Provide the answer in a straightforward manner without elaboration unless necessary for clarity. For counts or specific names, ensure the response is accurate and directly addresses the question.

Input format: Answer this question: {{question}}

Strategy:
1. Identify the type of question (e.g., count, name of a place, etc.).
2. Provide a direct answer based on known facts or common knowledge about Japan.
3. For subjective questions, rely on generally accepted information.

Example inputs and outputs:
- Input: 日本の政令指定都市はいくつある？
  Output: 20
- Input: 日本で2番目に高い山は？
  Output: 北岳
- Input: 京都で最も有名な観光名所は？
  Output: 清水寺 (or another widely recognized landmark)

Ensure responses are concise and directly answer the question.

Given that the original prompt was just Answer this question: {{question}}, this is a substantial change.

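When I actually used the optimized prompt, the tendency of the answers changed compared with the original prompt (although with this small training set, accuracy did not change dramatically before and after optimization).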

Summary

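I tried GEPA-based prompt optimization using MLflow's prompt optimization feature.
Dataset-driven prompt optimization looks fairly easy to set up, and if the improvements are as large as reported, this seems very promising (though I am somewhat concerned about LLM token consumption).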

The MLflow documentation also covers automatic prompt optimization when switching to a different LLM, as summarized below. I'd like to try that approach as well.
