DatabricksでMLflowとLLMを用いたRAGシステムの評価

Last updated at 2023-12-14Posted at 2023-12-14

こちらのノートブックをウォークスルーします。Retrieval Augumented Generation(RAG)システムの評価にLLMを使うような時代になったとは。

このノートブックでは、MLflowを用いてどのように様々なRAGシステムを評価するのかをデモンストレーションします。LLM-as-a-judgeの手法を用いています。

chromadbをインストールします。あとでエラーに遭遇するので明示的にバージョンを指定します。

%pip install chromadb==0.4.15
dbutils.library.restartPython()

OpenAI (あるいは Azure OpenAI) の環境変数を設定します。

import os
os.environ["OPENAI_API_KEY"] = dbutils.secrets.get("demo-token-takaaki.yayoi", "openai")

# If using Azure OpenAI
# os.environ["OPENAI_API_TYPE"] = "azure"
# os.environ["OPENAI_API_VERSION"] = "2023-05-15"
# os.environ["OPENAI_API_KEY"] = "https://<>.<>.<>.com"
# os.environ["OPENAI_DEPLOYMENT_NAME"] = "deployment-name"

import pandas as pd

import mlflow

RAGシステムの作成

MLflowのドキュメントをベースとして質疑応答を行うRAGシステムを作成するためにLangchainとChromaを使います。

from langchain.chains import RetrievalQA
from langchain.document_loaders import WebBaseLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma

loader = WebBaseLoader("https://mlflow.org/docs/latest/index.html")

documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()
docsearch = Chroma.from_documents(texts, embeddings)

qa = RetrievalQA.from_chain_type(
    llm=OpenAI(temperature=0),
    chain_type="stuff",
    retriever=docsearch.as_retriever(),
    return_source_documents=True,
)

`mlflow.evaluate()`を用いたRAGシステムの評価

RAGチェーンを通じてそれぞれの入力を処理するシンプルな関数を作成します。

def model(input_df):
    answer = []
    for index, row in input_df.iterrows():
        answer.append(qa(row["questions"]))

    return answer

評価データセットを作成します。

eval_df = pd.DataFrame(
    {
        "questions": [
            "What is MLflow?",
            "How to run mlflow.evaluate()?",
            "How to log_table()?",
            "How to load_table()?",
        ],
    }
)

faithfulnessメトリックを作成します。インプットinputに対するアウトプットoutputの例、その際の評点scoreと理由づけjustificationを定義します。

from mlflow.metrics.genai import faithfulness, EvaluationExample

# この問題の文脈における faithfulness の良い例と悪い例を作成します
faithfulness_examples = [
    EvaluationExample(
        input="How do I disable MLflow autologging?",
        output="mlflow.autolog(disable=True) will disable autologging for all functions. In Databricks, autologging is enabled by default. ",
        score=2,
        justification="The output provides a working solution, using the mlflow.autolog() function that is provided in the context.",
        grading_context={
            "context": "mlflow.autolog(log_input_examples: bool = False, log_model_signatures: bool = True, log_models: bool = True, log_datasets: bool = True, disable: bool = False, exclusive: bool = False, disable_for_unsupported_versions: bool = False, silent: bool = False, extra_tags: Optional[Dict[str, str]] = None) → None[source] Enables (or disables) and configures autologging for all supported integrations. The parameters are passed to any autologging integrations that support them. See the tracking docs for a list of supported autologging integrations. Note that framework-specific configurations set at any point will take precedence over any configurations set by this function."
        },
    ),
    EvaluationExample(
        input="How do I disable MLflow autologging?",
        output="mlflow.autolog(disable=True) will disable autologging for all functions.",
        score=5,
        justification="The output provides a solution that is using the mlflow.autolog() function that is provided in the context.",
        grading_context={
            "context": "mlflow.autolog(log_input_examples: bool = False, log_model_signatures: bool = True, log_models: bool = True, log_datasets: bool = True, disable: bool = False, exclusive: bool = False, disable_for_unsupported_versions: bool = False, silent: bool = False, extra_tags: Optional[Dict[str, str]] = None) → None[source] Enables (or disables) and configures autologging for all supported integrations. The parameters are passed to any autologging integrations that support them. See the tracking docs for a list of supported autologging integrations. Note that framework-specific configurations set at any point will take precedence over any configurations set by this function."
        },
    ),
]

faithfulness_metric = faithfulness(model="openai:/gpt-4", examples=faithfulness_examples)
print(faithfulness_metric)

LLMがJudge(審判)として動作するようにタスクが定義されます。

EvaluationMetric(name=faithfulness, greater_is_better=True, long_name=faithfulness, version=v1, metric_details=
Task:
You are an impartial judge. You will be given an input that was sent to a machine
learning model, and you will be given an output that the model produced. You
may also be given additional information that was used by the model to generate the output.

Your task is to determine a numerical score called faithfulness based on the input and output.
A definition of faithfulness and a grading rubric are provided below.
You must use the grading rubric to determine your score. You must also justify your score.

Examples could be included below for reference. Make sure to use them as references and to
understand them before completing the task.

Input:
{input}

Output:
{output}

{grading_context_columns}

relevanceメトリックを作成します。メトリックの metric_details にアクセスする、あるいはメトリックを print することで完全な評点のプロンプトを確認することができます。

from mlflow.metrics.genai import relevance, EvaluationExample


relevance_metric = relevance(model="openai:/gpt-4")
print(relevance_metric)

EvaluationMetric(name=relevance, greater_is_better=True, long_name=relevance, version=v1, metric_details=
Task:
You are an impartial judge. You will be given an input that was sent to a machine
learning model, and you will be given an output that the model produced. You
may also be given additional information that was used by the model to generate the output.

Your task is to determine a numerical score called relevance based on the input and output.
A definition of relevance and a grading rubric are provided below.
You must use the grading rubric to determine your score. You must also justify your score.

Examples could be included below for reference. Make sure to use them as references and to
understand them before completing the task.

Input:
{input}

Output:
{output}

{grading_context_columns}

Metric definition:
Relevance encompasses the appropriateness, significance, and applicability of the output with respect to both the input and context. Scores should reflect the extent to which the output directly addresses the question provided in the input, given the provided context.

Grading rubric:
Relevance: Below are the details for different scores:- Score 1: The output doesn't mention anything about the question or is completely irrelevant to the provided context.
- Score 2: The output provides some relevance to the question and is somehow related to the provided context.
- Score 3: The output mostly answers the question and is largely consistent with the provided context.
- Score 4: The output answers the question and is consistent with the provided context.
- Score 5: The output answers the question comprehensively using the provided context.

Examples:

Input:
How is MLflow related to Databricks?

Output:
Databricks is a data engineering and analytics platform designed to help organizations process and analyze large amounts of data. Databricks is a company specializing in big data and machine learning solutions.

Additional information used by the model:
key: context
value:
MLflow is an open-source platform for managing the end-to-end machine learning (ML) lifecycle. It was developed by Databricks, a company that specializes in big data and machine learning solutions. MLflow is designed to address the challenges that data scientists and machine learning engineers face when developing, training, and deploying machine learning models.

score: 2
justification: The output provides relevant information about Databricks, mentioning it as a company specializing in big data and machine learning solutions. However, it doesn't directly address how MLflow is related to Databricks, which is the specific question asked in the input. Therefore, the output is only somewhat related to the provided context.
        

Input:
How is MLflow related to Databricks?

Output:
MLflow is a product created by Databricks to enhance the efficiency of machine learning processes.

Additional information used by the model:
key: context
value:
MLflow is an open-source platform for managing the end-to-end machine learning (ML) lifecycle. It was developed by Databricks, a company that specializes in big data and machine learning solutions. MLflow is designed to address the challenges that data scientists and machine learning engineers face when developing, training, and deploying machine learning models.

score: 4
justification: The output provides a relevant and accurate statement about the relationship between MLflow and Databricks. While it doesn't provide extensive detail, it still offers a substantial and meaningful response. To achieve a score of 5, the response could be further improved by providing additional context or details about how MLflow specifically functions within the Databricks ecosystem.
        

You must return the following fields in your response one below the other:
score: Your numerical score for the model's relevance based on the rubric
justification: Your step-by-step reasoning about the model's relevance score
    )

results = mlflow.evaluate(
    model,
    eval_df,
    model_type="question-answering",
    evaluators="default",
    predictions="result",
    extra_metrics=[faithfulness_metric, relevance_metric, mlflow.metrics.latency()],
    evaluator_config={
        "col_mapping": {
            "inputs": "questions",
            "context": "source_documents",
        }
    },
)
print(results.metrics)

{'toxicity/v1/mean': 0.00020886990023427643, 'toxicity/v1/variance': 4.508897658386363e-09, 'toxicity/v1/p90': 0.0002846180694177747, 'toxicity/v1/ratio': 0.0, 'faithfulness/v1/mean': 3.0, 'faithfulness/v1/variance': 4.0, 'faithfulness/v1/p90': 5.0, 'relevance/v1/mean': 3.5, 'relevance/v1/variance': 2.25, 'relevance/v1/p90': 4.7}

MLflowにも記録されます。

display(results.tables["eval_results_table"])

評価の根拠もjustificationとして表示されます。

これも評価ビューで比較します。

今度日本語でもトライしてみます。
↓
トライしてみました。

Databricksクイックスタートガイド

Databricks無料トライアル

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up

DatabricksでMLflowとLLMを用いたRAGシステムの評価

RAGシステムの作成

mlflow.evaluate()を用いたRAGシステムの評価

Databricksクイックスタートガイド

Databricks無料トライアル

`mlflow.evaluate()`を用いたRAGシステムの評価