N番煎じでQwQ-32BをMLflowも使って試す

Last updated at 2025-03-09Posted at 2025-03-07

導入

Aribaba社がQwenのReasoningモデルであるQwQ-32Bを公開しました。

Today, we release QwQ-32B, our new reasoning model with only 32 billion parameters that rivals cutting-edge reasoning model, e.g., DeepSeek-R1.

Blog: https://t.co/zCgACNdodj
HF: https://t.co/pfjZygOiyQ
ModelScope: https://t.co/hcfOD8wSLa
Demo: https://t.co/DxWPzAg6g8
Qwen Chat:… pic.twitter.com/kfvbNgNucW
— Qwen (@Alibaba_Qwen) March 5, 2025

パラメータ数がわずか(?)32BでDeepSeek R1に匹敵する性能という主張です。
しかもApache-2.0ライセンス。

実力がどんなものかとても気になったので試してみます。

試験環境はDatabricks on AWSを用いて、さらにMosaic AI Model Serving上にデプロイして使ってみます。

モデルはオリジナルのものではなく、以下の量子化(IQ4_XS)&GGUFフォーマット化したものを事前ダウンロードして利用しました。
そのため、性能はオリジナルと大きく異なっている可能性があります。

実行結果のみ気になる方はStep2から確認ください。

Step1. 準備

まずは、QwQ-32BをMLflowに登録して利用できるようにします。

ノートブックを作成し、必要なパッケージをインストール。
今回は(も)推論エンジンとしてSGLangを利用します。

%pip install --no-deps -i https://flashinfer.ai/whl/cu124/torch2.5 --trusted-host flashinfer.ai
%pip install transformers==4.48.3
%pip install sgl-kernel --force-reinstall --no-deps
%pip install "sglang[srt]==0.4.3.post4" "openai==1.65.4"
%pip install "mlflow-skinny[databricks]>=2.20.3" loguru

%restart_python

後ほど利用する各種パラメータを準備しておきます。
変数model_pathにダウンロードしておいたQwQ-32Bモデルのパスを指定しています。
変数registered_model_nameはUnity Catalog上のモデル登録場所です。

model_path = (
    "/Volumes/training/llm/model_snapshots/models--bartowski--Qwen_QwQ-32B-GGUF/Qwen_QwQ-32B-IQ4_XS.gguf"
)

model_config = {
    "server": {
        "model_name": "Qwen-QwQ-32B-IQ4_XS",        
        "port": 30000,
        "additional_args": (
            "--disable-cuda-graph --mem-fraction-static 0.9 "
            "--kv-cache-dtype fp8_e5m2 --context-length 81920 "            
            "--grammar-backend llguidance " # XGrammerだとModel Servingでそのまま動作しなかったため
            "--tool-call-parser qwen25 "
            "--max-running-requests 1 --chunked-prefill-size 4096 "
            "--reasoning-parser deepseek-r1 "
        ),
    }
}

registered_model_name = "training.llm.sglang_qwen_qwq_32b"

次にMLflowに登録するためのカスタムChatModelを定義します。

ChatModelとはなんぞや、という方は以下を参照ください。

コードは少し長いので折りたたんでいます。
また、MLflowのModels as Codeで利用するために、このセルの中身は%%writefileを使ってsglang_online_chat_model.pyという外部ファイルに出力しています。

補足として、このカスタムChatModelクラスはOpenAIのAPIで利用可能な入力パラメータresponse_formatや、Reasoningモデルのreasoning_contentの出力をcustom_inputs/custom_outputsに含めることでカバーしています。（現時点のChatModelインターフェースではカバーされていないため）

sglang_online_chat_model

%%writefile "./sglang_online_chat_model.py"

import os
from typing import Generator

import sglang as sgl
from sglang.utils import (
    launch_server_cmd,
    wait_for_server,
    print_highlight,
    terminate_process,
)

from mlflow.pyfunc import ChatModel
from mlflow.types.llm import (
    ChatMessage,
    ChatCompletionResponse,
    ChatChoice,
    ChatParams,
    ChatCompletionChunk,
)
from mlflow.models import set_model

from loguru import logger
import openai
import requests

SERVER_WAIT_TIMEOUT = 60 * 10
ENV_SGLANG_CLI_ARGS = "SGLANG_CLI_ARGS"


class SGLangOnlineChatModel(ChatModel):
    def __init__(self, server_process=None, port=None):
        self.server_process = server_process
        self.port = port
        self.is_managed = False
        self.model_path = None
        self.model_config = None
        self.client = None

    def load_context(self, context):
        """Load the model from the context."""

        self.model_path = context.artifacts["llm-model"]
        self.model_config = (
            context.model_config.get("server", {}) if context.model_config else {}
        )

        if not self.server_process:
            self.server_process, self.port = self._launch_sglang_server(
                self.model_path, self.model_config
            )
            self.is_managed = True

        self.client = openai.Client(
            base_url=f"http://localhost:{self.port}/v1", api_key="None"
        )

        logger.info(f"Server started at http://localhost:{self.port}")

    def predict(self, messages: list[ChatMessage], params: ChatParams = None):

        # SGLangのサーバプロセスが存在しない場合、ダミーメッセージを返す。
        # if not self.server_process:
        if not self._health_check(self.port):
            return ChatCompletionResponse(
                choices=[
                    {
                        "index": 0,
                        "message": {
                            "role": "asssitant",
                            "content": "no response from server.",
                        },
                    }
                ]
            )

        # contentのNone埋め(sglang利用における制約)
        for msg in messages:
            if msg.content is None:
                msg.content = ""

        # list[ChatAgentMessage]のメッセージ入力を辞書型に変換
        llm_messages = [msg.to_dict() for msg in messages]

        # 推論パラメータの構築
        dict_params = self._build_completion_parameter(params)

        # Chat Completionの実行
        response = self.client.chat.completions.create(
            model=self.model_config.get("model_name", "Unknown"),
            messages=llm_messages,
            **dict_params,
        )

        # Reasoning ContentをCustom Outputとして保持
        dict_resp = response.to_dict()
        message = dict_resp["choices"][0]["message"]
        if "reasoning_content" in message:
            dict_resp["custom_outputs"] = {
                "reasoning_content": message["reasoning_content"]
            }

        # 結果の返却
        return ChatCompletionResponse.from_dict(dict_resp)

    def predict_stream(
        self, messages: list[ChatMessage], params: ChatParams = None
    ) -> Generator[ChatCompletionChunk, None, None]:

        # SGLangのサーバプロセスが存在しない場合、ダミーメッセージを返す。
        # if not self.server_process:
        if not self._health_check(self.port):
            yield ChatCompletionChunk(
                choices=[
                    {
                        "index": 0,
                        "message": {
                            "role": "asssitant",
                            "content": "no response from server.",
                        },
                    }
                ]
            )

        # contentのNone埋め(sglang利用における制約)
        for msg in messages:
            if msg.content is None:
                msg.content = ""

        # list[ChatAgentMessage]のメッセージ入力を辞書型に変換
        llm_messages = [msg.to_dict() for msg in messages]

        # 推論パラメータの構築
        dict_params = self._build_completion_parameter(params, stream=True)

        # Chat Completionの実行
        response = self.client.chat.completions.create(
            model=self.model_config.get("model_name", "Unknown"),
            messages=llm_messages,
            **dict_params,
        )

        # ChunkのStream返却
        for chunk in response:
            dict_chunk = chunk.to_dict()
            delta = dict_chunk["choices"][0]["delta"]
            if "reasoning_content" in delta:
                dict_chunk["custom_outputs"] = {
                    "reasoning_content": delta["reasoning_content"]
                }

            yield ChatCompletionChunk.from_dict(dict_chunk)

    def _shutdown(self):
        """ Shutdown the server process. """
        if self.server_process and self.is_managed:
            logger.info("shutdown sglang server.")
            terminate_process(self.server_process)
            self.server_process = None
            self.is_managed = False

    def _launch_sglang_server(self, model_path, model_config):
        """ Start the server process. """

        if not model_path:
            raise ValueError("model_path is required")

        port = model_config.get("port")
        additional_args_from_env = os.environ.get(ENV_SGLANG_CLI_ARGS, "") # 環境変数から追加引数を取得
        additional_args = model_config.get("additional_args", "") + " " + additional_args_from_env

        cli_args = (
            f"python -m sglang.launch_server "
            f"--model-path {model_path} "
            f"--host 127.0.0.1 "
            f"{additional_args}"
        )

        logger.info(f"Launching server with args: {cli_args}")

        server_process, port = launch_server_cmd(cli_args, port=port)
        wait_for_server(f"http://localhost:{port}", timeout=SERVER_WAIT_TIMEOUT)

        return server_process, port

    def _health_check(self, port) -> bool:
        """ Check the health of the server process. """
        response = requests.get(f"http://localhost:{port}/health")
        return response.ok

    def _build_completion_parameter(
        self, params: ChatParams, stream: bool = False
    ) -> dict:
        """Build the completion parameter from the given params."""

        dict_params = params.to_dict() if params else {}
        dict_params["stream"] = stream
        if "custom_inputs" in dict_params:
            custom_inputs = dict_params.pop("custom_inputs")
            if "response_format" in custom_inputs:
                if isinstance(custom_inputs["response_format"], dict):
                    dict_params["response_format"] = custom_inputs["response_format"]
                else:
                    raise ValueError(
                        "custom_inputs['response_format'] must be a dictionary"
                    )
            if "extra_body" in custom_inputs:
                if isinstance(custom_inputs["extra_body"], dict):
                    dict_params["extra_body"] = custom_inputs["extra_body"]
                else:
                    raise ValueError("custom_inputs['extra_body'] must be a dictionary")

        return dict_params

    def __del__(self):
        self._shutdown()


model = SGLangOnlineChatModel()
set_model(model)

では定義したカスタムChatModelを用いて、MLflowにロギングします。

import mlflow
import os

mlflow.set_registry_uri("databricks-uc")
extra_pip_requirements = [
    "torch==2.5.1 --index-url https://download.pytorch.org/whl/cu124",
    "https://github.com/flashinfer-ai/flashinfer/releases/download/v0.2.2.post1/flashinfer_python-0.2.2.post1+cu124torch2.5-cp38-abi3-linux_x86_64.whl",
    "threadpoolctl==3.5.0",
    "transformers==4.48.3",
    "sgl-kernel",
    "sglang[srt]==0.4.3.post4",
    "openai==1.65.4",
    "loguru",
]
pip_requirements = mlflow.pyfunc.get_default_pip_requirements() + extra_pip_requirements

artifacts = {
    "llm-model": model_path,
}

input_example = {
    "messages": [
        {
            "role": "user",
            "content": "What is a good recipe for baking scones that doesn't require a lot of skill?",
        }
    ],
    "max_tokens":10,
    "temperature":0.6,
}

with mlflow.start_run() as run:
    _ = mlflow.pyfunc.log_model(
        artifact_path="model",
        python_model="sglang_online_chat_model.py",
        artifacts=artifacts,
        model_config=model_config,
        input_example=input_example,        
        pip_requirements=pip_requirements,
        metadata={"task": "llm/v1/chat"},
        registered_model_name=registered_model_name,  # 登録モデル名 in Unity Catalog
    )

モデルのロギング終わったら以下のコードを実行し、Databricks Mosaic AI Model Serving上にデプロイします。

import requests
import json
import uuid
import mlflow
from mlflow import MlflowClient

# 現在のノートブックコンテキストのAPIエンドポイントとトークンを取得
API_ROOT = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiUrl().get() 
API_TOKEN = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().get()
endpoint_name = registered_model_name.split(".")[2] + "_endpoint"

mlflow.set_registry_uri("databricks-uc")
client = MlflowClient()
versions = [
    mv.version for mv in client.search_model_versions(f"name='{registered_model_name}'")
]

data = {
    "name": endpoint_name,
    "config": {
        "served_entities": [
            {
                "entity_name": registered_model_name,
                "entity_version": versions[0],
                "workload_type": "GPU_MEDIUM",
                "workload_size": "Small",
                "scale_to_zero_enabled": True,
            }
        ]
    },
}

headers = {"Context-Type": "text/json", "Authorization": f"Bearer {API_TOKEN}"}

response = requests.post(
    url=f"{API_ROOT}/api/2.0/serving-endpoints", json=data, headers=headers
)

print(json.dumps(response.json(), indent=4))

これで準備完了です。
デプロイ完了までは結構時間がかかるので、気長に待ちましょう。

Step2. 実行してみる

別のノートブックを作成し、LangChainを使って推論を実行してみます。
（クラスタはサーバレスを利用しました）

まずは必要なパッケージをインストール。

%pip install databricks-langchain mlflow-skinny[databricks]

%restart_python

MLflow Tracingを有効化します。

import mlflow

mlflow.langchain.autolog()

また、databricks-langchainのChatDatabricksクラスそのままだとcustom_inputsやcustom_outputs等のパラメータ入力/出力結果が扱いづらいため、そのあたりを取り扱えるようにしたクラスChatDatabricksExtを用意します。

from typing import List, Optional, Dict, Any, Mapping
from langchain_core.messages import BaseMessage
from langchain_core.outputs import ChatGeneration, ChatGenerationChunk, ChatResult
from databricks_langchain import ChatDatabricks

class ChatDatabricksExt(ChatDatabricks):
    def _prepare_inputs(
        self,
        messages: List[BaseMessage],
        stop: Optional[List[str]] = None,
        **kwargs: Any,
    ) -> Dict[str, Any]:
        """Prepare the inputs for the model."""

        data = super()._prepare_inputs(messages, stop, **kwargs)

        if "response_format" in data:
            response_format = data.pop("response_format")
            if response_format.get("type") == "json_schema":
                if "name" not in response_format["json_schema"]:
                    response_format["json_schema"]["name"] = "foo"

        return data

    def _convert_response_to_chat_result(
        self, response: Mapping[str, Any]
    ) -> ChatResult:
        """ Convert the response to a ChatResult. """
        resp = super()._convert_response_to_chat_result(response)

        if "custom_outputs" in response:
            resp.llm_output.update({"custom_outputs": response["custom_outputs"]})

        return resp

準備が終わりました。
では、いくつか試してみましょう。

単純推論

シンプルな質問をしてみます。
出力結果の前半がThinking内容、後半部が最終回答です。

endpoint = "sglang_qwen_qwq_32b_endpoint"
llm = ChatDatabricksExt(model=endpoint)

messages = [
    {"role": "system", "content": "あなたは日本語を話すAIアシスタントです。"},
    {"role": "user", "content": "Databricksとは何？"},
]

# 推論実行!
result = llm.invoke(messages, max_tokens=2000, temperature=0.6, top_p=0.95)

# 結果出力
print(" ==== Thinking ====")
print(result.response_metadata["custom_outputs"]["reasoning_content"])
print(" ==== Answer ====")
print(result.content)

出力結果

 ==== Thinking ====
Okay, the user is asking "What is Databricks?" Let me start by recalling what I know about Databricks. 

First, I remember that Databricks is a company founded by the creators of Apache Spark. That's important because Spark is a big data processing framework. So I should mention the founders and their relation to Spark.

Next, the company provides a cloud-based analytics platform. I need to explain what that platform does. It integrates various tools like Spark, Delta Lake, MLflow, and maybe others. The platform is used for data engineering, data science, and machine learning. 

I should also highlight the main features. Delta Lake is a storage layer for big data, which is part of the Databricks platform. MLflow is for managing machine learning workflows. The unified workspace is another key point—allowing teams to collaborate in one place.

Then, the cloud providers they support: AWS, Azure, Google Cloud. That's part of their infrastructure. 

Use cases are important too. Data lakes, data warehousing, real-time analytics, and MLOps. Maybe give examples like transforming raw data into insights or deploying ML models.

I should also mention the benefits: scalability, ease of use, integration with other tools, and collaboration features. 

Wait, I need to make sure all this info is accurate. Let me double-check. Yes, Databricks was founded by the Spark team. The platform is indeed cloud-based and includes Delta Lake and MLflow. The unified workspace is a core feature. 

Maybe the user is a data professional looking to understand the platform, or a student. Either way, the answer should be clear and cover the essentials without too much jargon. 

I should structure the answer in a way that first defines Databricks, then explains its components, features, use cases, and benefits. Keep it concise but informative. Avoid technical terms where possible, but since it's about a technical tool, some terms are necessary. 

Wait, should I mention the company's history briefly? Like founded in 2013 by Ion Stoica and others? Maybe that adds context. Also, their partnerships with major cloud providers. 

Okay, putting it all together now. Start with a clear definition, then break down the key points. Make sure to explain each component like Delta Lake and MLflow in simple terms. Highlight collaboration and integration aspects as they are unique selling points. 

Double-check for any inaccuracies. For example, confirming that Databricks does offer data warehousing capabilities through Delta Lake. Yes, they position Delta Lake as a data lakehouse, combining features of data lakes and warehouses. 

Alright, the response should flow logically from introduction to features to use cases and benefits. Keep paragraphs short for readability.

 ==== Answer ====
Databricksは、Apache Sparkの共同開発者らによって設立されたデータ分析プラットフォームの企業です。主にクラウド上で動作し、データエンジニアリング、データサイエンス、機械学習（ML）を統合した作業環境を提供します。

### 主な特徴：
1. **Apache Sparkベースの処理**：  
   Apache Sparkをコアに、大規模データの処理（バッチ処理・リアルタイム処理）を効率化します。

2. **Delta Lake**：  
   高速で信頼性の高いデータレイクハウスを実現するオープンソースのデータストレージ技術。トランザクションやスキーマ管理、データ品質機能を提供します。

3. **MLflow**：  
   機械学習パイプラインの管理ツールで、モデルのトレーニング、バージョン管理、デプロイを一元化します。

4. **統合ワークスペース**：  
   データエンジニア、データサイエンティスト、 Analystが同一の環境で協業できる「統合ワークスペース」を提供。ノートブックやダッシュボード、コードを共有できます。

5. **クラウドネイティブ**：  
   AWS、Azure、Google Cloudなど主要なクラウドプラットフォーム上で動作し、スケーラビリティを活用できます。

### 主な用途：
- **データレイクの構築**：多様なデータ形式を統合し分析
- **リアルタイム分析**：ストリーム処理による即時インサイトの生成
- **機械学習パイプライン**：データ前処理からモデルデプロイまでの一貫したワークフロー
- **データウェアハウス**：Delta Lakeを活用した高速クエリ処理

### 特徴的なメリット：
- **協業の促進**：チームが同一プラットフォームで作業し、プロジェクトを効率化
- **コスト効率**：クラウドリソースの自動スケーリングでリソースを最適化
- **ツール統合**：Python/R/SQLなど複数言語対応、外部ツールとの連携も可能

Databricksは、企業がデータ駆動型意思決定を加速するために広く採用されています。特に、複雑なデータパイプラインや機械学習プロジェクトを抱える組織でその価値が発揮されます。

MLflow Tracingの出力は以下のような感じ。

英語でThinkingしてから、システムプロンプトで指定したように日本語で回答してくれています。

別の内容も質問してみます。

endpoint = "sglang_qwen_qwq_32b_endpoint"
llm = ChatDatabricksExt(model=endpoint)

messages = [
    {"role": "system", "content": "あなたは日本語を話すAIアシスタントです。"},
    {"role": "user", "content": "日本の製造業における生成AI活用の重要性について論じてください。"},
]

# 推論実行!
result = llm.invoke(messages, max_tokens=2000, temperature=0.6, top_p=0.95)

# 結果出力
print(" ==== Thinking ====")
print(result.response_metadata["custom_outputs"]["reasoning_content"])
print(" ==== Answer ====")
print(result.content)

出力結果

 ==== Thinking ====
まず、日本の製造業が生成AIを活用する重要性について考えます。製造業は日本経済の基盤なので、その効率化やイノベーションが重要です。生成AIの活用例として、例えば、設計の自動化や品質管理の最適化、生産プロセスの最適化などが挙げられます。

次に、具体的なメリットを整理します。設計段階では、AIが複雑な形状の最適化やシミュレーションを自動化することで、開発期間の短縮やコスト削減が可能になります。また、生産工程では、AIによる予知保全や異常検知が製造ラインのダウンタイムを減らし、生産性を向上させます。さらに、カスタマイズされた製品の大量生産を可能にするオンデマンド製造も挙げられます。

また、人材不足の問題への対応として、生成AIが経験豊富な技術者のノウハウを学習し、新人教育や技術継承を支援する点も重要です。さらに、グローバル競争力を維持するために、AIを活用した製品の高度なカスタマイズや迅速な市場対応が求められます。

課題としては、データの質と量、既存システムとの統合、人材の育成、セキュリティや倫理的な問題が挙げられます。これらの課題を克服するための具体的な取り組みや、政府や企業の役割も論じる必要があります。

最後に、締めくくりとして、生成AIが日本の製造業の競争力維持と持続可能な成長に不可欠な技術であることを強調します。

 ==== Answer ====
生成AIの活用が日本の製造業に与える影響は多岐にわたる。まず、製造業が日本経済の基盤を支える産業であることを踏まえ、生産性向上やイノベーション創出が重要である点を指摘する。次に、生成AIが設計・生産・品質管理などの各工程でもたらす具体的なメリットを挙げる。例えば、設計ではAIによる最適化やシミュレーションで開発期間の短縮、生産では予知保全でメンテナンスコストの削減、品質管理では異常検知による製品品質の向上が挙げられる。さらに、人材不足の解消や技術継承への貢献、グローバル競争力の強化といった点にも触れ、総合的な重要性を強調する。最後に、導入に伴う課題（データの質、人材育成、セキュリティ）を提示しつつ、これらの克服が成長の鍵であることを結論付ける。

こちらは日本語で考えて日本語で回答されました。
回答は何か独特な口調ですね。。。

Tool Calling

QwQ-32Bの設定関連を見るに対応してそうでしたので、Tool Callingを実行してみます。
内容はLangChainのサンプルそのまま。

from langchain_core.tools import tool

@tool
def add(a: int, b: int) -> int:
    """Adds a and b."""
    return a + b

@tool
def multiply(a: int, b: int) -> int:
    """Multiplies a and b."""
    return a * b

tools = [add, multiply]

llm_with_tools = llm.bind_tools(tools)

query = "What is 3 * 12? Also, what is 11 + 49?"
result = llm_with_tools.invoke(query)

print("=== Tool Calls ===")
print(result.tool_calls)
print("=== Thinking ===")
print(result.response_metadata["custom_outputs"]["reasoning_content"])

出力結果

=== Tool Calls ===
[{'name': 'multiply', 'args': {'a': 3, 'b': 12}, 'id': '1', 'type': 'tool_call'}, {'name': 'add', 'args': {'a': 11, 'b': 49}, 'id': '0', 'type': 'tool_call'}]
=== Thinking ===
Okay, the user is asking two separate questions here. First, they want to know what 3 multiplied by 12 is. Then, they're asking for the sum of 11 and 49. Let me check the tools available. There's the 'multiply' function for the first part and the 'add' function for the second. 

For the first question, 3 * 12, I should use the multiply tool with a=3 and b=12. That should give 36. 

The second part is 11 + 49. Using the add function with a=11 and b=49 would be correct, which adds up to 60. 

I need to make sure I call each function separately. Let me structure the tool calls properly. First the multiply, then the add. Each in their own tool_call tags.

MLflow Tracingの出力は以下のようになります。

ちゃんとTool Callingしてくれます。
その上、推論過程がわかるのはいいですね。
Toolの実行をユーザ承認するような場合に役に立ちそう。

Structured Output

構造化出力も試してみます。
まずは、単純なwith_structured_outputを使ったやり方。

from pydantic import BaseModel, Field

class ResponseFormatter(BaseModel):
    """Always use this tool to structure your response to the user."""

    answer: str = Field(description="The answer to the user's question")
    followup_question: str = Field(description="A followup question the user could ask")

model_with_structure = llm.with_structured_output(ResponseFormatter)
structured_output = model_with_structure.invoke("What is the powerhouse of the cell?")

structured_output

出力結果

ResponseFormatter(answer="The powerhouse of the cell is the mitochondria, which generates most of the cell's supply of adenosine triphosphate (ATP), used as a source of chemical energy.", followup_question='How do mitochondria produce energy for the cell?')

MLflow Tracingの出力は以下のようになります。

デフォルトだとwith_structured_outputの構造化出力はFunction Callingを使った方法になります。
前のセクションでFunction(Tool) Callingができていたので納得の結果。

違う方法で、JSONスキーマを指定した構造化出力もさせてみます。

import json

model_with_structure = llm.with_structured_output(
    ResponseFormatter, method="json_schema"
)
structured_output = model_with_structure.invoke(
    f"What is the powerhouse of the cell? Please answer with the following JSON format:{json.dumps(ResponseFormatter.model_json_schema())}",
    temperature=0.6,
    top_p=0.95,
    max_tokens=1000,
)

structured_output

出力結果

ResponseFormatter(answer='The powerhouse of the cell is the mitochondria.', followup_question='How do mitochondria generate energy for the cell?')

こちらでも問題なく構造化出力できますね。

どちらのやり方も、Thinkingするため通常の出力に比べると時間がかかるのですが、構造化出力の確実性は高そうな気がしています。

まとめ

QwQ-32BをDatabricks Mosaic AI Model Servingにデプロイして試してみました。
Mosaic AI Model Servingを必ずしも使う必要はないのですが、LangChainと組み合わせて使う際は便利ですね。

DeepSeek-R1に匹敵する性能かどうかはわかりませんが（情報を調べている限りではそこまででもなさそう）、32BというパラメータサイズのReasoningモデルとしてはかなり良いものだと思います。
特にTool Callingで利用できるのは（レイテンシは気になりますが）使うケースが多そうです。以前作成したDeep Researchとか。
また、日本語で事前学習したものの登場も期待したいですね。

このモデル/エンドポイントを活用したエージェントなどもいろいろ作ったりしてみたいと思います。

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up