Yet Another Take: Trying Qwen2.5 on Databricks Mosaic AI Model Serving

Last updated at 2024-09-22 / Posted at 2024-09-22

My backlog of things I want to try keeps growing, but it looks like I won't have much spare time for a while...

Introduction

Alibaba has released version 2.5 of Qwen, the LLM it develops and publishes.

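Quoting the release announcement: In addition to the Qwen2.5 LLM, specialized models are provided for coding (Qwen2.5-Coder) and mathematics (Qwen2.5-Math). All open-weight models are dense, decoder-only language models, available in various sizes, including: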

Qwen2.5: 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B
Qwen2.5-Coder: 1.5B, 7B, and 32B on the way
Qwen2.5-Math: 1.5B, 7B, and 72B

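All our open-source models, except for the 3B and 72B variants, are licensed under Apache 2.0. The license files are in the respective Hugging Face repositories. In addition to these models, we offer APIs for our flagship language models, Qwen-Plus and Qwen-Turbo, through Model Studio, and we encourage you to explore them. Furthermore, we have also open-sourced Qwen2-VL-72B, which features performance improvements compared to last month's release.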

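It is a multilingual model and scores well on benchmarks.
The general-purpose LLMs also perform well in Japanese. Judging from evaluations shared on social media, even the mid-sized 32B model appears to be on par with GPT-4o mini. On top of that, all but a few sizes are licensed under Apache 2.0.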

For use cases that don't demand top-tier performance, maybe this is already good enough...

So, let's give it a quick try.

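This time I deploy the model to Databricks Mosaic AI Model Serving and then run a few prompts against it.
The environment is Databricks on AWS with DBR 15.4 ML, and the instance type is g5.xlarge.
The inference engine is ExLlamaV2, using a model quantized in the EXL2 format.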

AWQ and GGUF builds are also published officially, so an inference engine such as vLLM or llama.cpp would be more convenient, but I wanted to keep VRAM usage down, so I went with ExLlamaV2 this time.

I plan to publish code that uses vLLM in future posts.
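To give a feel for why the KV-cache setting matters for VRAM, here is a back-of-the-envelope estimate. The layer count, KV-head count, and head dimension are assumptions for a 32B-class model with grouped-query attention (check the model's config.json for the real values), and the 4-bit figure ignores the small overhead of quantization scales.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bits_per_value):
    """Approximate KV-cache size: keys and values for every layer and position."""
    values_per_token = 2 * num_layers * num_kv_heads * head_dim  # 2 = key + value
    return seq_len * values_per_token * bits_per_value // 8


# Assumed (hypothetical) architecture values; read the real ones from config.json.
ASSUMED = dict(num_layers=64, num_kv_heads=8, head_dim=128)
GIB = 1024 ** 3

fp16_gib = kv_cache_bytes(32768, bits_per_value=16, **ASSUMED) / GIB
q4_gib = kv_cache_bytes(32768, bits_per_value=4, **ASSUMED) / GIB

print(f"FP16 cache: {fp16_gib:.1f} GiB, 4-bit cache: {q4_gib:.1f} GiB")
```

Under these assumptions a 32K-token cache shrinks from about 8 GiB to about 2 GiB, which is a meaningful saving on a single 24 GB GPU.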

Step1. Package installation

Create a notebook and install the required packages.

# Flash Attention
# Pin a specific prebuilt wheel (CUDA 12.3, torch 2.3, Python 3.11)
%pip install -U https://github.com/Dao-AILab/flash-attention/releases/download/v2.6.3/flash_attn-2.6.3+cu123torch2.3cxx11abiFALSE-cp311-cp311-linux_x86_64.whl
# ExllamaV2
%pip install https://github.com/turboderp/exllamav2/releases/download/v0.2.2/exllamav2-0.2.2+cu121.torch2.3.1-cp311-cp311-linux_x86_64.whl

# lm-format-enforcer
%pip install "lm-format-enforcer==0.10.7"

%pip install "mlflow-skinny[databricks]>=2.16.1"

dbutils.library.restartPython()
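Because the two wheels above are pinned to a specific Python tag (cp311), it can be handy to check that tag against the interpreter before installing. The following is a minimal sketch using only the standard library; wheel_python_tag and matches_interpreter are hypothetical helper names, not part of pip or any package used here.

```python
import sys


def wheel_python_tag(wheel_url: str) -> str:
    """Return the Python tag (e.g. 'cp311') from a wheel file name or URL."""
    filename = wheel_url.rsplit("/", 1)[-1]
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel: {filename}")
    # Wheel names end with: ...-{python tag}-{abi tag}-{platform tag}.whl
    return filename[: -len(".whl")].split("-")[-3]


def matches_interpreter(wheel_url: str) -> bool:
    """True if the wheel's Python tag matches the running CPython version."""
    current = f"cp{sys.version_info.major}{sys.version_info.minor}"
    return wheel_python_tag(wheel_url) == current


url = (
    "https://github.com/turboderp/exllamav2/releases/download/v0.2.2/"
    "exllamav2-0.2.2+cu121.torch2.3.1-cp311-cp311-linux_x86_64.whl"
)
print(wheel_python_tag(url), matches_interpreter(url))
```

If matches_interpreter returns False, a wheel built for the cluster's Python version is needed instead.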

Step2. Creating the model-loading configuration file

Save the settings used when loading the model as a JSON file like the one below, in the same folder as the notebook.

mlflow_exllamav2_model_config.json

{
  "model_name": "training.llm.Qwen2_5-32B-Instruct-exl2",
  "use_draft": false,
  "use_tp": false,
  "prompt_template": {
    "system": "<|im_start|>system\n{}<|im_end|>",
    "user": "<|im_start|>user\n{}<|im_end|>",
    "assistant": "<|im_start|>assistant\n{}<|im_end|>",
    "last": "<|im_start|>assistant\n"
  },
  "max_seq_len": 32768,
  "cache_type": "Q4",
  "no_graphs": false,
  "additional_stop_tokens": [],
  "seed": 123
}
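To make the role of prompt_template concrete, here is a minimal sketch of how per-role templates like these can be joined into a single ChatML-style prompt. render_prompt is a hypothetical helper, and plain (role, content) tuples stand in for the ChatMessage objects used by the model class in Step3.

```python
# Hypothetical stand-in for the "prompt_template" entry of the config file.
PROMPT_TEMPLATE = {
    "system": "<|im_start|>system\n{}<|im_end|>",
    "user": "<|im_start|>user\n{}<|im_end|>",
    "assistant": "<|im_start|>assistant\n{}<|im_end|>",
    "last": "<|im_start|>assistant\n",
}


def render_prompt(messages, template):
    """Join (role, content) pairs using per-role templates, then open the assistant turn."""
    parts = [
        template[role].format(content)
        for role, content in messages
        if role in template
    ]
    parts.append(template["last"])  # generation prompt: the model continues from here
    return "".join(parts)


prompt = render_prompt(
    [("system", "You are a helpful assistant."), ("user", "Hello!")],
    PROMPT_TEMPLATE,
)
print(prompt)
```

The "last" entry has no placeholder: it only opens the assistant turn so that generation starts in the right place.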

Step3. Creating the custom model class

Define a custom MLflow Chat model that runs inference with ExLlamaV2.
The code is long, so it is collapsed.

Pyfunc custom Chat model cell

%%writefile "./exllamav2_chat_model.py"

from typing import List, Union, Iterator
import uuid
import logging

import mlflow
from mlflow.types.llm import ChatResponse, ChatMessage, ChatParams
from mlflow.models import set_model
from exllamav2 import (
    ExLlamaV2,
    ExLlamaV2Config,
    ExLlamaV2Cache,
    ExLlamaV2Cache_Q4,
    ExLlamaV2Cache_Q6,
    ExLlamaV2Cache_Q8,
    ExLlamaV2Cache_TP,
    ExLlamaV2Tokenizer,
)
from exllamav2.generator import ExLlamaV2DynamicGenerator, ExLlamaV2Sampler
from exllamav2.generator.filters import ExLlamaV2PrefixFilter, ExLlamaV2Filter

# from lmformatenforcer.integrations.exllamav2 import ExLlamaV2TokenEnforcerFilter
from lmformatenforcer import JsonSchemaParser
from functools import lru_cache
from lmformatenforcer.integrations.exllamav2 import build_token_enforcer_tokenizer_data
from lmformatenforcer import TokenEnforcer, CharacterLevelParser
import json

logger = logging.getLogger(__name__)

@lru_cache(10)
def _get_lmfe_tokenizer_data(tokenizer: ExLlamaV2Tokenizer):
    return build_token_enforcer_tokenizer_data(tokenizer)

class ExLlamaV2TokenEnforcerFilter(ExLlamaV2Filter):
    token_sequence: List[int]

    def __init__(
        self,
        model: ExLlamaV2,
        tokenizer: ExLlamaV2Tokenizer,
        character_level_parser: CharacterLevelParser,
    ):
        super().__init__(model, tokenizer)
        tokenizer_data = _get_lmfe_tokenizer_data(tokenizer)
        self.token_enforcer = TokenEnforcer(tokenizer_data, character_level_parser)
        self.token_sequence = []

    def begin(self, prefix_str: str) -> None:
        self.token_sequence = []

    def feed(self, token) -> None:
        self.token_sequence.append(int(token[0][0]))

    def next(self):
        allowed_tokens = self.token_enforcer.get_allowed_tokens(self.token_sequence)
        return sorted(allowed_tokens), []

    def use_background_worker(self):
        return True


# Define a custom PythonModel
class ExLlamaV2CustomChatModel(mlflow.pyfunc.ChatModel):
    """ A custom PythonModel for the ExLlamaV2 model. """

    def load_context(self, context):
        """Load the model from the contexts."""

        model_path = context.artifacts["llm-model"]
        config_file = context.artifacts["config_file"]

        logger.info(f"Loading model from {model_path}")
        logger.info(f"Loading config from {config_file}")

        # Load settings from the config file
        with open(config_file) as f:
            config = json.load(f)
        model_name = config.get("model_name", "No Name")
        use_draft = config.get("use_draft", False)
        prompt_template = config.get("prompt_template", {})
        max_seq_len = config.get("max_seq_len", -1)
        cache_class = self._select_cache_class(config.get("cache_type", ""))
        no_graphs = config.get("no_graphs", False)
        use_tp = config.get("use_tp", None)
        additional_stop_tokens = config.get("additional_stop_tokens", [])
        seed = config.get("seed", 123)

        # Speculative decoding setup (optional draft model)
        draft_model_path = None
        draft_model = None
        draft_cache = None
        if use_draft:
            draft_model_path = context.artifacts["llm-model-draft"]

            draft_config = ExLlamaV2Config(draft_model_path)
            draft_config.arch_compat_overrides()
            draft_model = ExLlamaV2(draft_config)
            draft_cache = cache_class(
                draft_model, max_seq_len=max_seq_len, lazy=True
            )
            draft_model.load_autosplit(draft_cache, progress=True)

        # Main LLM setup
        config = ExLlamaV2Config(model_path)
        config.arch_compat_overrides()
        config.no_graphs = no_graphs

        model = ExLlamaV2(config)
        if use_tp:
            model.load_tp(progress=True, expect_cache_tokens=max_seq_len)
            cache = ExLlamaV2Cache_TP(model, max_seq_len=max_seq_len, base=cache_class)
        else:
            cache = cache_class(
                model,
                max_seq_len=max_seq_len,
                lazy=True,
            )
            model.load_autosplit(cache, progress=True)

        tokenizer = ExLlamaV2Tokenizer(config)

        generator = ExLlamaV2DynamicGenerator(
            model=model,
            cache=cache,
            draft_model=draft_model,
            draft_cache=draft_cache,
            tokenizer=tokenizer,
            paged=True,
        )
        generator.warmup()

        self._model = model
        self._cache = cache
        self._tokenizer = tokenizer
        self._generator = generator
        self._draft_model = draft_model
        self._draft_cache = draft_cache
        self.model_name = model_name
        self.prompt_template = prompt_template
        self.additional_stop_tokens = additional_stop_tokens
        self.seed = seed

    def reset_generator(self):
        """ Reset the generator to the initial state. """
        self._generator = ExLlamaV2DynamicGenerator(
            model=self._model,
            cache=self._cache,
            tokenizer=self._tokenizer,
            paged=True,
        )
        self._generator.warmup()

    def predict(
        self,
        context,
        messages: List[mlflow.types.llm.ChatMessage],
        params: mlflow.types.llm.ChatParams,
    ):
        """ Predict the response to the given messages. """

        # Build the prompt
        prompt = self._format_messages(messages, self.prompt_template)

        # Build the output filters
        filters = self._build_filters(self._model, self._tokenizer, messages)

        # Build the generation settings
        settings = self._build_settings(
            params, self._tokenizer, self.additional_stop_tokens
        )

        output, last_result = self._generator.generate(
            prompt=prompt,
            filters=filters,
            max_new_tokens=settings["max_tokens"],
            gen_settings=settings["gen_settings"],
            completion_only=True,
            seed=self.seed,
            stop_conditions=settings["stop_conditions"],
            encode_special_tokens=True,
            add_bos=True,
            add_eos=False,
            return_last_results=True,
        )

        id = str(uuid.uuid4())
        return self._build_response(
            id, self.model_name, prompt, output, settings, last_result
        )

    @staticmethod
    def _select_cache_class(cache_type: str = "Q4"):
        """ Select the cache class based on the given cache type. """
        selector = {
            "Q4": ExLlamaV2Cache_Q4,
            "Q6": ExLlamaV2Cache_Q6,
            "Q8": ExLlamaV2Cache_Q8,
        }
        return selector.get(cache_type, ExLlamaV2Cache)

    @staticmethod
    def _format_messages(messages, prompt_template):
        """ Format the given messages into a prompt. """

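        # Add pseudo-messages at the start and end so that the optional
        # "first" and "last" (generation prompt) templates are applied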
        _messages = (
            [ChatMessage(role="first", content="")]
            + messages
            + [ChatMessage(role="last", content="")]
        )

        prompts = []
        for mes in _messages:
            template = prompt_template.get(mes.role)
            if template:
                prompts.append(template.format(mes.content))

        return "".join(prompts)

    @staticmethod
    def _build_filters(model, tokenizer, messages):
        """ Build the filters for the given messages. """

        json_schema_messages = [m for m in messages if m.role == "json_schema"]
        if json_schema_messages:
            schema_parser = JsonSchemaParser(
                json.loads(json_schema_messages[-1].content)
            )
            return [
                # ExLlamaV2TokenEnforcerFilter(model, tokenizer, schema_parser),
                ExLlamaV2PrefixFilter(model, tokenizer, ["{", " {"]),
            ]

        return []

    @staticmethod
    def _build_settings(params, tokenizer, additional_stop_tokens):
        """ Build the settings for the given parameters. """

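        # Sampling settings.
        # Use explicit None checks: `value or default` would discard a valid
        # 0 / 0.0 (e.g. temperature=0.0 would silently become 1.0).
        def _default(value, default):
            return default if value is None else value

        settings = ExLlamaV2Sampler.Settings()
        settings.temperature = _default(params.temperature, 1.0)
        settings.top_k = _default(params.top_k, 50)
        settings.top_p = _default(params.top_p, 0.8)
        settings.token_frequency_penalty = _default(params.frequency_penalty, 0.0)
        settings.token_presence_penalty = _default(params.presence_penalty, 0.0)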

        # Maximum number of tokens to generate
        max_tokens = params.max_tokens or 100

        # Stop conditions
        stop_conditions = [tokenizer.eos_token_id] + [
            tokenizer.single_id(s) for s in additional_stop_tokens
        ]

        return {
            "gen_settings": settings,
            "max_tokens": max_tokens,
            "stop_conditions": stop_conditions,
        }

    @staticmethod
    def _build_response(id, model_name, prompt, output, settings, last_result):
        """ Build the response for the given parameters. """

        usage = {
            "prompt_tokens": last_result["prompt_tokens"],
            "completion_tokens": last_result["new_tokens"],
            "total_tokens": last_result["prompt_tokens"] + last_result["new_tokens"],
        }
        finish_reason = last_result["eos_reason"]

        response = {
            "id": id,
            "model": model_name,
            "choices": [
                {
                    "index": 0,
                    "message": {"role": "assistant", "content": output},
                    "finish_reason": finish_reason,
                }
            ],
            "usage": usage,
        }

        return ChatResponse(**response)

set_model(ExLlamaV2CustomChatModel())
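One detail worth keeping in mind when mapping request parameters to sampler settings, as _build_settings does: Python's `or` treats 0.0 as "missing", so a fallback written as `value or default` cannot represent an explicit temperature of 0.0. The sketch below contrasts the two fallback styles; both helper names are hypothetical.

```python
def or_default(value, default):
    """Fallback using `or`: treats every falsy value (0, 0.0, "") as missing."""
    return value or default


def none_default(value, default):
    """Fallback that only replaces None, so an explicit 0.0 is kept."""
    return default if value is None else value


requested_temperature = 0.0  # caller asks for deterministic decoding

print(or_default(requested_temperature, 1.0))    # 1.0: the request is silently overridden
print(none_default(requested_temperature, 1.0))  # 0.0: the request is honored
print(none_default(None, 1.0))                   # 1.0: default applies when unset
```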

Step4. Logging the model to MLflow

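Register the model in MLflow using the custom model from Step3.
This time I used an existing quantized build of Qwen2.5 32B that someone else had published.

The variable model_path holds the path to which that model was downloaded.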

import mlflow
import os

mlflow.set_registry_uri("databricks-uc")
extra_pip_requirements = [
    "astunparse==1.6.3",
    "cython==0.29.32",
    "dill==0.3.6",
    "opt-einsum==3.3.0",
    "tokenizers==0.19.0",
    "ninja==1.11.1.1",
    "rich==13.7.1",
    "torch==2.3.1 --index-url https://download.pytorch.org/whl/cu121",
    "https://github.com/Dao-AILab/flash-attention/releases/download/v2.6.3/flash_attn-2.6.3+cu123torch2.3cxx11abiFALSE-cp311-cp311-linux_x86_64.whl", # flash-attn
    "https://github.com/turboderp/exllamav2/releases/download/v0.2.2/exllamav2-0.2.2+cu121.torch2.3.1-cp311-cp311-linux_x86_64.whl", #ExLlamaV2
    "lm-format-enforcer>=0.10.7",
]
pip_requirements = mlflow.pyfunc.get_default_pip_requirements() + extra_pip_requirements

artifacts = {
    "llm-model": model_path,
    "config_file": "mlflow_exllamav2_model_config.json",
}

with mlflow.start_run() as run:
    _ = mlflow.pyfunc.log_model(
        artifact_path="model",
        python_model="exllamav2_chat_model.py",
        artifacts=artifacts,
        pip_requirements=pip_requirements,
        example_no_conversion=True,
        await_registration_for=3600,  # Wait longer than usual because the model is large
        registered_model_name=registered_model_name,  # Registered model name in Unity Catalog
    )

Step5. Creating an endpoint on Mosaic AI Model Serving

Deploy the model registered in MLflow to Mosaic AI Model Serving.

import requests
import json
from mlflow import MlflowClient

endpoint_workload_type='GPU_MEDIUM'
endpoint_workload_size='Small'
endpoint_scale_to_zero_enabled='true'

# Get the API endpoint and token for the current notebook context
API_ROOT = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiUrl().get() 
API_TOKEN = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().get()

client=MlflowClient()
versions = [mv.version for mv in client.search_model_versions(f"name='{registered_model_name}'")]

data = {
    "name": endpoint_name,
    "config":{
        "served_entities": [
        {
            "entity_name": registered_model_name,
            "entity_version": versions[0],
            "workload_type": endpoint_workload_type,
            "workload_size": endpoint_workload_size,
            "scale_to_zero_enabled": endpoint_scale_to_zero_enabled,
        }]
    },
}

headers = {"Content-Type": "application/json", "Authorization": f"Bearer {API_TOKEN}"}

response = requests.post(url=f"{API_ROOT}/api/2.0/serving-endpoints", json=data, headers=headers)

print(json.dumps(response.json(), indent=4))
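The code above serves versions[0], which relies on the search results being returned newest-first. If you would rather not depend on ordering, the highest version can be picked explicitly. This is a small sketch; latest_version is a hypothetical helper, not an MLflow API.

```python
def latest_version(versions):
    """Return the highest version from an iterable of version numbers (str or int)."""
    numeric = [int(v) for v in versions]
    if not numeric:
        raise ValueError("no model versions found")
    return str(max(numeric))


print(latest_version(["2", "10", "1"]))  # 10 (numeric, not lexicographic, comparison)
```

In the notebook this could be used as latest_version(versions) in place of versions[0].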

That was all preparation. If nothing goes wrong, building the container and deploying it finishes in roughly 30 to 40 minutes.
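Since the deployment takes a while, a simple polling loop can tell you when the endpoint is usable. The sketch below assumes the endpoint's "state" object exposes "ready" and "config_update" fields, as returned by the serving-endpoints GET API; wait_until_ready is a hypothetical helper, and the state-fetching function is injected so the loop can be tried without a real endpoint.

```python
import time


def wait_until_ready(fetch_state, timeout_s=3600, interval_s=60, sleep=time.sleep):
    """Poll until the endpoint reports READY; raise on failure or timeout.

    fetch_state: callable returning the endpoint's "state" dict, e.g.
    {"ready": "NOT_READY", "config_update": "IN_PROGRESS"}.
    """
    waited = 0
    while True:
        state = fetch_state()
        if state.get("ready") == "READY":
            return state
        if state.get("config_update") == "UPDATE_FAILED":
            raise RuntimeError(f"endpoint update failed: {state}")
        if waited >= timeout_s:
            raise TimeoutError(f"endpoint not ready after {timeout_s}s: {state}")
        sleep(interval_s)
        waited += interval_s


# In the notebook, fetch_state could wrap the REST call, for example:
#   lambda: requests.get(
#       f"{API_ROOT}/api/2.0/serving-endpoints/{endpoint_name}", headers=headers
#   ).json()["state"]

# Self-contained demo with canned responses instead of a real endpoint.
responses = iter([
    {"ready": "NOT_READY", "config_update": "IN_PROGRESS"},
    {"ready": "READY", "config_update": "NOT_UPDATING"},
])
final = wait_until_ready(lambda: next(responses), sleep=lambda s: None)
print(final["ready"])  # READY
```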

Step6. Running inference

Create a separate notebook and run inference against the endpoint created above.
I used a serverless cluster.

First, install the packages. This time LangChain is used to help with inference.

%pip install -U -qq langchain_core langchain-databricks
%pip install -U "mlflow-skinny[databricks]>=2.16.2"

dbutils.library.restartPython()

Prepare the prompts for inference.

prompts = [
    "Hello, what is your name?",
    "Databricksとは何ですか？詳細に教えてください。",
    "まどか☆マギカでは誰が一番かわいい?",
    "ランダムな10個の要素からなるリストを作成してソートするコードをPythonで書いてください。",
    "現在の日本の首相は誰？",
    "あなたはマラソンをしています。今3位の人を抜きました。あなたの今の順位は何位ですか?ステップバイステップで考えてください。",
]

system_prompt = """あなたは親切なAIアシスタントです。日本語で回答してください。"""

messages = [[("system", system_prompt), ("user", p)] for p in prompts]

Create a LangChain Chat model instance.
For endpoint, specify the name of the endpoint created in Step5.

from langchain_databricks import ChatDatabricks

chat_model = ChatDatabricks(endpoint=endpoint_name)

Now let's run inference.

import time

start_time = time.time()
outputs = chat_model.batch(messages, temperature=0.0, max_tokens=512)
end_time = time.time()

processing_time = end_time - start_time
total_tokens = 0
for prompt, output in zip(prompts, outputs):
    total_tokens += output.response_metadata["total_tokens"]

    print("Q:", prompt)
    print("A:", output.content)
    print("-"*100)

print("processing_time:", processing_time)
print("total_tokens:", total_tokens)
print("avg. tokens per sec.:", total_tokens / processing_time)
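Note that the average above divides total_tokens (prompt plus completion) by the wall-clock time of the whole batch. As a sketch of separating generation speed from prompt size, the helper below also reports completion tokens per second. It assumes each response's metadata carries prompt_tokens and completion_tokens alongside total_tokens (only total_tokens is used in the original code), and the usage numbers are made up for illustration.

```python
def throughput(usages, elapsed_s):
    """Return (total tokens/s, completion tokens/s) for a batch of usage dicts."""
    total = sum(u["total_tokens"] for u in usages)
    completion = sum(u["completion_tokens"] for u in usages)
    return total / elapsed_s, completion / elapsed_s


# Made-up usage numbers, purely for illustration.
usages = [
    {"prompt_tokens": 30, "completion_tokens": 70, "total_tokens": 100},
    {"prompt_tokens": 40, "completion_tokens": 160, "total_tokens": 200},
]
total_tps, completion_tps = throughput(usages, elapsed_s=10.0)
print(total_tps, completion_tps)  # 30.0 23.0
```

In the notebook, usages could be built as [o.response_metadata for o in outputs], provided those keys are present.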

Output

Q: Hello, what is your name?
A: こんにちは！私はAIアシスタントですので、名前はありませんが、あなたの質問にお答えします。どのようにお手伝いできるか教えてください。
----------------------------------------------------------------------------------------------------
Q: Databricksとは何ですか？詳細に教えてください。
A: Databricksは、Apache Sparkを活用したクラウドベースのデータ処理・分析プラットフォームです。このプラットフォームは、2013年にアマノ・カレル・チアンらにより設立されました。彼らは、Sparkの開発者でもあります。

Databricks は、データエンジニア、データサイエンティスト、データアナリストがデータを収集、保存、プロセス、分析するための一つの環境を提供します。以下にDatabricksの特徴を詳しく説明します：

1. **Collaborative Notebooks**：データ分析と共有のためのJupyter Notebookのような環境を提供します。これによりデータエンジニアやデータサイエンティストは共同作業を行うことができます。

2. **Unified Data Analytics Platform**：データエンジニアリングからデータサイエンスまでのワークフローを包括的にサポートします。これにより、大量のデータに対して高速な分析が可能になります。

3. **Auto-scaling**：ユーザーのワークフローに応じて自動的にクラスターのサイズをスケーリングします。これにより、リソースの最適化とコストの削減が可能です。

4. **Integration with MLflow for Machine Learning**：機械学習のライフサイクル管理（モデルのトレーニング、評価、デプロイ）を容易にするためのMLflowと統合されています。

5. **Support for Multiple Languages**：Python、Scala、Java、R、SQLなどのプログラミング言語をサポートしています。

6. **Cloud Agnostic**：AWS、Azure、Google Cloud Platformなどの主要なクラウドプラットフォーム上で動作します。

以上がDatabricksの主な特徴です。企業はDatabricksを使用して、データ分析と機械学習のワークフローを強化し、データドリブンな意思決定を可能にしています。
----------------------------------------------------------------------------------------------------
Q: まどか☆マギカでは誰が一番かわいい?
A: まどか☆マギカのキャラクターは皆魅力的で個性的なので、「一番かわいい」とは一概に言えないかもしれません。キャラクターの好みは人それぞれ違いますね。

ただし、一般的には以下のキャラクターが特に人気があります：

1. 妙子：物語のキickerキャラクターとして重要な役割を果たしています。

2. 魔術師の制服を着た各キャラクター（まどか、ひまり、ほむら、みくる、さやか）：それぞれ異なる魅力と背景を持っています。

最終的にはあなたの好みによって変わると思います。どのキャラクターにも深い意味と魅力があるので、全員を楽しむのが一番かもしれませんね。
----------------------------------------------------------------------------------------------------
Q: ランダムな10個の要素からなるリストを作成してソートするコードをPythonで書いてください。
A: もちろん、以下にPythonを使用してランダムな10個の要素からなるリストを作成し、それをソートするコードを示します。

```python
import random

# ランダムな10個の要素からなるリストを作成
random_list = [random.randint(1, 100) for _ in range(10)]
print("元のリスト:", random_list)

# リストをソート
sorted_list = sorted(random_list)
print("ソート後のリスト:", sorted_list)
```

このコードでは、`random.randint(1, 100)` を使用して1から100までの範囲でランダムな整数を生成しています。生成された10個の要素からなるリストを `sorted()` 関数でソートしています。ソートの結果を確認するために、元のリストとソート後のリストをそれぞれ表示しています。
----------------------------------------------------------------------------------------------------
Q: 現在の日本の首相は誰？
A: 現在の日本の首相は岸田文雄（きしだ ふみお）さんです。2021年10月に第100代の内閣総理大臣に就任されました。
----------------------------------------------------------------------------------------------------
Q: あなたはマラソンをしています。今3位の人を抜きました。あなたの今の順位は何位ですか?ステップバイステップで考えてください。
A: あなたがマラソンで3位の人を抜いた場合、あなたの現在の順位は2位になります。順位の変化を考えていきましょう。

1. 初め、あなたは4位で、3位の人がいました。
2. あなたが3位の人を抜いたということは、あなたがその人と位置取りを交換したということです。
3. つまり、あなたが3位になり、前の3位の人は4位になりました。

したがって、あなたはマラソンの2位にいます。
----------------------------------------------------------------------------------------------------
processing_time: 40.83073449134827
total_tokens: 1251
avg. tokens per sec.: 30.63868469633035

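Even though the model is quantized, the output is solid, coherent Japanese.
The marathon question is answered incorrectly (overtaking the runner in 3rd place puts you in 3rd, not 2nd), although the model did get it right in some runs when I changed the temperature and retried a few times.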

Summary

I deployed Qwen2.5 with Mosaic AI Model Serving and gave the model a quick test.

Mosaic AI Model Serving is also gradually becoming easier to use, and recently it became possible to stop an endpoint. (Why wasn't that possible until now...?)

Even at 32B, Qwen2.5 seems to me to match the Japanese-language performance of earlier 70B-class models.
And this is available under Apache 2.0...

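Lately there has been plenty of interesting news around proprietary LLM APIs, with prices coming down and OpenAI o1 being released, but I think local (open) LLMs will also play a very important role depending on the use case.
I'm looking forward to seeing them develop further.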
