Building GenAI Agents on Databricks: Essay Grading Agent

Posted at 2024-10-28

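This post continues from the previous one in this series.

Introduction

Next, I'll walk through the Essay Grading Agent tutorial from GenAI Agents, heavily modified to use MLflow.

Essay Grading Agent Overview

The following is an excerpt from the opening section of that notebook.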

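Essay Grading System using LangGraph

Overview

This notebook presents an automated essay grading system implemented with LangGraph and an LLM. The system evaluates essays on four key criteria: relevance, grammar, structure, and depth of analysis.

Motivation

Automated essay grading systems can significantly streamline the assessment process in educational settings while providing consistent, objective evaluations. This implementation aims to show how a large language model and a graph-based workflow can be combined to build a sophisticated grading system.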

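Key Components

State graph: defines the workflow of the grading process

LLM: provides the underlying language understanding and analysis

Grading functions: a separate function for each evaluation criterion

Conditional logic: decides the flow of the grading process based on intermediate scores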

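Method

The system grades an essay in the following steps:

Content relevance: evaluates how well the essay addresses the given topic

Grammar check: evaluates the essay's language use and grammatical accuracy

Structure analysis: examines the essay's organization and flow of ideas

Depth of analysis: evaluates the level of critical thinking and insight presented

Each step runs conditionally based on the score from the previous step, so grading of a low-quality essay can terminate early. The final score is a weighted average of all the individual component scores.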
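The early-exit flow and the weighted average can be sketched in plain Python, without LangGraph or an LLM. The thresholds (0.5, 0.6, 0.7) and weights (0.3, 0.2, 0.2, 0.3) are the ones used in the implementation later in this post. The `grade` function and the `raw_scores` input are hypothetical stand-ins for the graph and the LLM's answers.

```python
# Each step and the score it must exceed for grading to continue.
# None means the step has no gate (it is the last scoring step).
GATES = [
    ("relevance", 0.5),
    ("grammar", 0.6),
    ("structure", 0.7),
    ("depth", None),
]
WEIGHTS = {"relevance": 0.3, "grammar": 0.2, "structure": 0.2, "depth": 0.3}


def grade(raw_scores):
    """Simulate the grading flow.

    raw_scores holds the score each step *would* return (hypothetical input).
    Returns the list of steps that actually ran and the final weighted score.
    """
    scores = {name: 0.0 for name in WEIGHTS}  # steps that never run stay at 0.0
    visited = []
    for name, threshold in GATES:
        scores[name] = raw_scores[name]
        visited.append(name)
        # Stop early when the score does not exceed the step's threshold.
        if threshold is not None and not scores[name] > threshold:
            break
    final = sum(scores[name] * WEIGHTS[name] for name in WEIGHTS)
    return visited, final


# Grammar (0.5) does not exceed 0.6, so structure and depth are never evaluated.
steps, final = grade(
    {"relevance": 0.9, "grammar": 0.5, "structure": 0.9, "depth": 0.9}
)
print(steps, round(final, 2))
```

Because skipped steps keep their initial value of 0.0 in this sketch, an early exit also lowers the final score.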

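Conclusion

This notebook demonstrates a flexible and extensible approach to automated essay grading. By combining a large language model with a graph-based workflow, it produces a nuanced evaluation that takes several aspects of writing quality into account. The system could be further refined and adapted to different educational settings to improve the efficiency and consistency of essay assessment.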

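The workflow looks like this: check_relevance → check_grammar → analyze_structure → evaluate_depth → calculate_final_score, where each of the first three steps jumps straight to calculate_final_score if its score is too low.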

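Simply put, it is an agent that grades essays.
With some adaptation, it could probably cover practical tasks as well, such as grading technical documents.

Now let's implement it in a form that can be logged to MLflow and used on Databricks.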

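Implementation and Execution

Create a notebook on Databricks and install the LangChain/LangGraph packages along with the latest version of MLflow at the time of writing. Note that I am using serverless compute.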

%pip install -q -U langchain-core==0.3.13 langchain-databricks==0.1.1 langchain_community==0.3.3 langgraph==0.2.39
%pip install -q -U typing-extensions
%pip install -q -U "mlflow-skinny[databricks]==2.17.1"

dbutils.library.restartPython()

As in the previous post, the agent is implemented as an MLflow custom chat model.

%%writefile "./essay_grading_agent.py"

import uuid
import re
from typing import List, Optional, Dict, TypedDict

import mlflow
from mlflow.pyfunc import ChatModel
from mlflow.models import set_model
from mlflow.entities import SpanType
from mlflow.types.llm import (
    ChatResponse,
    ChatMessage,
    ChatParams,
    ChatChoice,
)

from langgraph.graph import StateGraph, END

from langchain_databricks import ChatDatabricks
from langchain_core.prompts import ChatPromptTemplate


class State(TypedDict):
    """Represents the state of the essay grading process."""

    endpoint: str
    params: dict
    essay: str
    relevance_score: float
    grammar_score: float
    structure_score: float
    depth_score: float
    final_score: float


class EssayGradingAgent(ChatModel):
    def __init__(self):
        """
        Initialize the EssayGradingAgent with default values.
        """
        self.models = {}
        self.models_config = {}
        self.app = None

    def load_context(self, context):
        """
        Load the agent's context and set up the model configuration.

        Args:
            context: Context containing the model configuration.
        """
        self.models = context.model_config.get("models", {})
        self.models_config = context.model_config
        self.app = self.build_graph()

    def predict(
        self, context, messages: List[ChatMessage], params: Optional[ChatParams] = None
    ) -> ChatResponse:
        """
        Predict a response for the given chat messages and parameters.

        Args:
            context: Context for the prediction.
            messages (List[ChatMessage]): List of chat messages.
            params (Optional[ChatParams]): Optional parameters for the chat.

        Returns:
            ChatResponse: Response from the chat model.
        """
        with mlflow.start_span(name="Agent", span_type=SpanType.AGENT) as root_span:
            root_span.set_inputs(messages)

            attributes = {**params.to_dict(), **self.models_config, **self.models}
            root_span.set_attributes(attributes)

            endpoint = self._get_model_endpoint("agent")
            query = messages[-1].content
            response = self.grade_essay(query, endpoint, params.to_dict())
            output = ChatResponse(
                choices=[
                    ChatChoice(
                        index=0,
                        message=ChatMessage(
                            role="assistant",
                            content="Please check scores in metadata.",
                        ),
                    )
                ],
                usage={},
                metadata={
                    "final_score": response.get("final_score"),
                    "relevance_score": response.get("relevance_score"),
                    "grammar_score": response.get("grammar_score"),
                    "structure_score": response.get("structure_score"),
                    "depth_score": response.get("depth_score"),
                },
                model=endpoint,
            )
            root_span.set_outputs(output)
        return output

    def build_graph(self) -> StateGraph:
        """
        Build the state graph for the essay grading process.

        Returns:
            StateGraph: The constructed (compiled) state graph.
        """
        workflow = StateGraph(State)

        workflow.add_node("check_relevance", self.check_relevance)
        workflow.add_node("check_grammar", self.check_grammar)
        workflow.add_node("analyze_structure", self.analyze_structure)
        workflow.add_node("evaluate_depth", self.evaluate_depth)
        workflow.add_node("calculate_final_score", self.calculate_final_score)

        workflow.add_conditional_edges(
            "check_relevance",
            lambda x: "check_grammar"
            if x["relevance_score"] > 0.5
            else "calculate_final_score",
        )
        workflow.add_conditional_edges(
            "check_grammar",
            lambda x: "analyze_structure"
            if x["grammar_score"] > 0.6
            else "calculate_final_score",
        )
        workflow.add_conditional_edges(
            "analyze_structure",
            lambda x: "evaluate_depth"
            if x["structure_score"] > 0.7
            else "calculate_final_score",
        )
        workflow.add_conditional_edges(
            "evaluate_depth", lambda x: "calculate_final_score"
        )

        workflow.set_entry_point("check_relevance")
        workflow.add_edge("calculate_final_score", END)

        return workflow.compile()

    def extract_score(self, content: str) -> float:
        """
        Extract a score from the given content string.

        Args:
            content (str): Content string containing the score.

        Returns:
            float: The extracted score.

        Raises:
            ValueError: If no score can be extracted.
        """
        match = re.search(r"Score:\s*(\d+(\.\d+)?)", content)
        if match:
            return float(match.group(1))
        raise ValueError(f"Could not extract score from: {content}")

    @mlflow.trace(span_type=SpanType.CHAIN)
    def check_relevance(self, state: State) -> State:
        """
        Check the essay's relevance to the topic.

        Args:
            state (State): Current state of the essay grading process.

        Returns:
            State: Updated state including the relevance score.
        """
        prompt = ChatPromptTemplate.from_template(
            "Analyze the relevance of the following essay to the given topic. "
            "Provide a relevance score between 0 and 1. "
            "Your response should start with 'Score: ' followed by the numeric score, "
            "then provide your explanation.\n\nEssay: {essay}"
        )
        llm = ChatDatabricks(endpoint=state.get("endpoint"), **state.get("params"))
        result = llm.invoke(prompt.format(essay=state["essay"]))
        try:
            state["relevance_score"] = self.extract_score(result.content)
        except ValueError as e:
            print(f"Error in check_relevance: {e}")
            state["relevance_score"] = 0.0
        return state

    @mlflow.trace(span_type=SpanType.CHAIN)
    def check_grammar(self, state: State) -> State:
        """
        Check the essay's grammar and language usage.

        Args:
            state (State): Current state of the essay grading process.

        Returns:
            State: Updated state including the grammar score.
        """
        prompt = ChatPromptTemplate.from_template(
            "Analyze the grammar and language usage in the following essay. "
            "Provide a grammar score between 0 and 1. "
            "Your response should start with 'Score: ' followed by the numeric score, "
            "then provide your explanation.\n\nEssay: {essay}"
        )
        llm = ChatDatabricks(endpoint=state.get("endpoint"), **state.get("params"))
        result = llm.invoke(prompt.format(essay=state["essay"]))
        try:
            state["grammar_score"] = self.extract_score(result.content)
        except ValueError as e:
            print(f"Error in check_grammar: {e}")
            state["grammar_score"] = 0.0
        return state

    @mlflow.trace(span_type=SpanType.CHAIN)
    def analyze_structure(self, state: State) -> State:
        """
        Analyze the essay's structure.

        Args:
            state (State): Current state of the essay grading process.

        Returns:
            State: Updated state including the structure score.
        """
        prompt = ChatPromptTemplate.from_template(
            "Analyze the structure of the following essay. "
            "Provide a structure score between 0 and 1. "
            "Your response should start with 'Score: ' followed by the numeric score, "
            "then provide your explanation.\n\nEssay: {essay}"
        )
        llm = ChatDatabricks(endpoint=state.get("endpoint"), **state.get("params"))
        result = llm.invoke(prompt.format(essay=state["essay"]))
        try:
            state["structure_score"] = self.extract_score(result.content)
        except ValueError as e:
            print(f"Error in analyze_structure: {e}")
            state["structure_score"] = 0.0
        return state

    @mlflow.trace(span_type=SpanType.CHAIN)
    def evaluate_depth(self, state: State) -> State:
        """
        Evaluate the essay's depth of analysis.

        Args:
            state (State): Current state of the essay grading process.

        Returns:
            State: Updated state including the depth score.
        """
        prompt = ChatPromptTemplate.from_template(
            "Evaluate the depth of analysis in the following essay. "
            "Provide a depth score between 0 and 1. "
            "Your response should start with 'Score: ' followed by the numeric score, "
            "then provide your explanation.\n\nEssay: {essay}"
        )
        llm = ChatDatabricks(endpoint=state.get("endpoint"), **state.get("params"))
        result = llm.invoke(prompt.format(essay=state["essay"]))
        try:
            state["depth_score"] = self.extract_score(result.content)
        except ValueError as e:
            print(f"Error in evaluate_depth: {e}")
            state["depth_score"] = 0.0
        return state

    @mlflow.trace(span_type=SpanType.TOOL)
    def calculate_final_score(self, state: State) -> State:
        """
        Calculate the essay's final score from the individual scores.

        Args:
            state (State): Current state of the essay grading process.

        Returns:
            State: Updated state including the final score.
        """
        state["final_score"] = (
            state["relevance_score"] * 0.3
            + state["grammar_score"] * 0.2
            + state["structure_score"] * 0.2
            + state["depth_score"] * 0.3
        )
        return state

    @mlflow.trace(span_type=SpanType.AGENT)
    def grade_essay(self, essay: str, endpoint: str, params: dict) -> dict:
        """
        Grade the given essay by invoking the state graph.

        Args:
            essay (str): The essay to grade.
            endpoint (str): The model endpoint.
            params (dict): Parameters for the model.

        Returns:
            dict: The final state containing all scores.
        """
        initial_state = State(
            endpoint=endpoint,
            params=params,
            essay=essay,
            relevance_score=0.0,
            grammar_score=0.0,
            structure_score=0.0,
            depth_score=0.0,
            final_score=0.0,
        )
        result = self.app.invoke(initial_state)
        return result

    def _get_model_endpoint(self, role: str) -> str:
        """
        Get the model endpoint for the given role.

        Args:
            role (str): The role whose model endpoint should be returned.

        Returns:
            str: The model endpoint.
        """
        role_config = self.models.get(role, {})
        return role_config.get("endpoint")


set_model(EssayGradingAgent())

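Most of the code consists of the method definitions that serve as LangGraph nodes.
The build_graph method wires these methods together into a graph.
The graph passes through several scoring nodes in sequence and finally computes the overall score.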
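Every scoring node depends on parsing a "Score: ..." prefix out of the LLM's reply, so it helps to see what that regular expression accepts. The sketch below copies the pattern from extract_score into a standalone function. `score_or_zero` is a hypothetical helper that mirrors the nodes' fall-back to 0.0, and the sample replies are made up for illustration.

```python
import re

# Same pattern as in extract_score above.
SCORE_PATTERN = re.compile(r"Score:\s*(\d+(\.\d+)?)")


def extract_score(content: str) -> float:
    """Return the first number that follows 'Score:' or raise ValueError."""
    match = SCORE_PATTERN.search(content)
    if match:
        return float(match.group(1))
    raise ValueError(f"Could not extract score from: {content}")


def score_or_zero(content: str) -> float:
    """Mirror the nodes' error handling: a parse failure becomes 0.0."""
    try:
        return extract_score(content)
    except ValueError:
        return 0.0


print(score_or_zero("Score: 0.8\nThe essay stays on topic."))  # 0.8
print(score_or_zero("**Score:** 0.8"))  # 0.0, markdown emphasis breaks the match
print(score_or_zero("Score: .8"))       # 0.0, a leading digit is required
print(score_or_zero("Score: 8/10"))     # 8.0, the range 0 to 1 is not enforced
```

A reply that does not follow the requested format is therefore scored 0.0. That is below every threshold in build_graph, so the graph jumps straight to calculate_final_score.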

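Next, log the custom chat model to MLflow and register it.
The model config specifies the name of the LLM endpoint. As in the previous post, I pointed it at a Mosaic AI Model Serving endpoint running the Llama 3.2 3B model.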

import mlflow
# Manage models with Databricks Unity Catalog
mlflow.set_registry_uri("databricks-uc")

model_config = {
    "models": {
        "agent": {
            "endpoint": "llama_v3_2_3b_instruct_endpoint",
        },
    },
}

input_example = {
    "messages": [
        {
            "role": "user",
            "content": "The Impact of Artificial Intelligence on Modern Society",
        }
    ]
}

registered_model_name = "training.llm.essay_grading_agent"

with mlflow.start_run():
    model_info = mlflow.pyfunc.log_model(
        "model",
        python_model="essay_grading_agent.py",
        model_config=model_config,
        input_example=input_example,
        registered_model_name=registered_model_name,
    )

Once the model has been logged successfully, load it and try it out.

import mlflow
from mlflow import MlflowClient
from langchain_core.runnables.graph import MermaidDrawMethod
from IPython.display import display, Image

client = MlflowClient()
# Pick the highest version number explicitly rather than relying on result order.
latest_version = max(
    int(mv.version)
    for mv in client.search_model_versions(f"name='{registered_model_name}'")
)
agent = mlflow.pyfunc.load_model(f"models:/{registered_model_name}/{latest_version}")

def grade_essay(essay:str):
    result = agent.predict(
        {
            "messages": [{"role": "user", "content": essay}],
            "temperature": 0.0,
            "max_tokens": 1000,
        }
    )
    # Display the results
    print(f"Final Essay Score: {result['metadata']['final_score']:.2f}\n")
    print(f"Relevance Score: {result['metadata']['relevance_score']:.2f}")
    print(f"Grammar Score: {result['metadata']['grammar_score']:.2f}")
    print(f"Structure Score: {result['metadata']['structure_score']:.2f}")
    print(f"Depth Score: {result['metadata']['depth_score']:.2f}")    
    print("\n")

sample_essay = """
    The Impact of Artificial Intelligence on Modern Society

    Artificial Intelligence (AI) has become an integral part of our daily lives, 
    revolutionizing various sectors including healthcare, finance, and transportation. 
    This essay explores the profound effects of AI on modern society, discussing both 
    its benefits and potential challenges.

    One of the most significant impacts of AI is in the healthcare industry. 
    AI-powered diagnostic tools can analyze medical images with high accuracy, 
    often surpassing human capabilities. This leads to earlier detection of diseases 
    and more effective treatment plans. Moreover, AI algorithms can process vast 
    amounts of medical data to identify patterns and insights that might escape 
    human observation, potentially leading to breakthroughs in drug discovery and 
    personalized medicine.

    In the financial sector, AI has transformed the way transactions are processed 
    and monitored. Machine learning algorithms can detect fraudulent activities in 
    real-time, enhancing security for consumers and institutions alike. Robo-advisors 
    use AI to provide personalized investment advice, democratizing access to 
    financial planning services.

    The transportation industry is another area where AI is making significant strides. 
    Self-driving cars, powered by complex AI systems, promise to reduce accidents 
    caused by human error and provide mobility solutions for those unable to drive. 
    In logistics, AI optimizes routing and inventory management, leading to more 
    efficient supply chains and reduced environmental impact.

    However, the rapid advancement of AI also presents challenges. There are concerns 
    about job displacement as AI systems become capable of performing tasks 
    traditionally done by humans. This raises questions about the need for retraining 
    and reskilling the workforce to adapt to an AI-driven economy.

    Privacy and ethical concerns also arise with the increasing use of AI. The vast 
    amount of data required to train AI systems raises questions about data privacy 
    and consent. Additionally, there are ongoing debates about the potential biases 
    in AI algorithms and the need for transparent and accountable AI systems.

    In conclusion, while AI offers tremendous benefits and has the potential to solve 
    some of humanity's most pressing challenges, it also requires careful consideration 
    of its societal implications. As we continue to integrate AI into various aspects 
    of our lives, it is crucial to strike a balance between technological advancement 
    and ethical considerations, ensuring that the benefits of AI are distributed 
    equitably across society.
    """

grade_essay(sample_essay)

Output

Final Essay Score: 0.76

Relevance Score: 0.80
Grammar Score: 0.80
Structure Score: 0.80
Depth Score: 0.65

The final score was computed from the individual scores.
(Though I can't tell what criteria the model actually used to assign each score...)
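While the per-criterion scores are the LLM's own judgment, the final score follows mechanically from the weights in calculate_final_score. As a rough check, plugging in the displayed scores (already rounded to two decimals, so this only approximates the internal values) gives:

```latex
\begin{aligned}
\text{final} &= 0.3 \cdot 0.80 + 0.2 \cdot 0.80 + 0.2 \cdot 0.80 + 0.3 \cdot 0.65 \\
             &= 0.24 + 0.16 + 0.16 + 0.195 \\
             &= 0.755 \approx 0.76
\end{aligned}
```

This matches the reported final score of 0.76.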

Let's also try it on Japanese text.
I used the following passage from Aozora Bunko.

sample_essay2 = """はじめに生れたのは歓びの霊である、この新しい年をよろこべ！
一月　　霊はまだ目がさめぬ
二月　　虹を織る
三月　　雨のなかに微笑する
四月　　白と緑の衣を着る
五月　　世界の青春
六月　　壮厳
七月　　二つの世界にゐる
八月　　色彩
九月　　美を夢みる
十月　　溜息する
十一月　おとろへる
十二月　眠る
　ケルトの古い言ひつたへかもしれない、或るふるぼけた本の最後の頁に何のつながりもなくこの暦が載つてゐるのを読んだのである。この暦によると世界は無限にふくざつな色に包まれてゐる。一月二月三月四月の意味はよくわかる。五月が青春であるのは、わが国に比べるとひと月遅いやうに思はれる、もつと北に寄つた国であるからだらう。したがつて、六月のすばらしさも一月おくれかもしれぬ。七月、霊が二つの世界にゐるといふのは、生長するものと衰へ初めるものとの二つの世界のことであらうか？　八月、色彩といふのは空の雲、飛ぶ鳥の羽根や、山々のみどり、木草の花の色、それが一時にまぶしいほど強烈で、ことに北の国は春から夏に一時にめざましい色を現はす。九月、美を夢みるといふのは八月の美しさがまだ続いて、やや静かになつてゆく季節。十月は溜息をする、さびしい風が吹く。十一月、すべての草木が疲れおとろへ、十二月、眠りに入る。この霊といふ字がすこし気どつた言葉のやうで、これを自然といふ字におき代へて読みなほしてみた。その方がはつきりする。"""
 
grade_essay(sample_essay2)

Output

Final Essay Score: 0.48

Relevance Score: 0.80
Grammar Score: 0.80
Structure Score: 0.40
Depth Score: 0.00

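Structure and depth both came out low.
Note, however, that the depth score of 0.00 is not an actual assessment. The structure score (0.40) did not exceed the 0.7 threshold, so the graph skipped evaluate_depth and the initial value of 0.0 was used.
The low structure score itself may be due to Llama's limited Japanese-language ability.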
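To tell a genuinely low score apart from a step that never ran, the reported scores can be replayed against the thresholds in build_graph. This is a minimal sketch rather than part of the agent. `annotate_skipped` is a hypothetical helper that marks scores for unreachable steps as None.

```python
from typing import Dict, Optional

# Thresholds from build_graph: a step's score must exceed its value
# for the next step to run. The depth step has no gate.
THRESHOLDS = {"relevance": 0.5, "grammar": 0.6, "structure": 0.7}
ORDER = ["relevance", "grammar", "structure", "depth"]


def annotate_skipped(reported: Dict[str, float]) -> Dict[str, Optional[float]]:
    """Replace the scores of steps that could not have run with None."""
    annotated: Dict[str, Optional[float]] = {}
    reachable = True
    for name in ORDER:
        annotated[name] = reported[name] if reachable else None
        threshold = THRESHOLDS.get(name)
        # Once a gate fails, every later step is unreachable.
        if reachable and threshold is not None and not reported[name] > threshold:
            reachable = False
    return annotated


# Scores reported for the Japanese sample above.
japanese_sample = {"relevance": 0.80, "grammar": 0.80, "structure": 0.40, "depth": 0.00}
print(annotate_skipped(japanese_sample))
# {'relevance': 0.8, 'grammar': 0.8, 'structure': 0.4, 'depth': None}
```

For the English sample, all three gates pass (structure 0.80 exceeds 0.7), so its depth score of 0.65 is a real evaluation.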

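Summary

I took the GenAI Agents tutorial "6. Essay Grading Agent", heavily modified it to use an MLflow custom model, and ran it.
With this approach, the model (agent) can be managed with MLflow and deployed on Databricks through Mosaic AI Model Serving.

A scoring agent like this could be applied in many different ways.
It would be interesting to explore more use cases.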
