
Running inference with AWQ-format models on Databricks, part 2

Posted 2023-10-06. Last updated 2023-10-06.

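This article continues from the previous one in this series.

Introduction

Last time I ran inference on AWQ-format models with AutoAWQ.
This time I try the same AWQ-format models with vLLM.
The test environment is the same as before.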

What is vLLM

vLLM is a library for high-efficiency inference that also provides serving functionality. It was previously covered in a GIGAZINE article.

Version 0.2.0 added support for the AWQ format.

Initial support for AWQ (performance not optimized)

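It has not been optimized yet, but I will focus mainly on performance here.

Running inference

I try inference with the models downloaded last time, plus a few more.

Install the required modules. (Transformers is installed from git here, but I expect the regular release to work as well.)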

%pip install -U vllm git+https://github.com/huggingface/transformers.git accelerate

dbutils.library.restartPython()

Load the model and run it.
Set model_path to the path of the model you want to load before running this.

from vllm import LLM, SamplingParams

prompts = ["What is AI? AI is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model=model_path, quantization="awq", dtype="half")

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

With TheBloke/Mistral-7B-v0.1-AWQ

Output

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
ValueError: Quantization is not supported for <class 'vllm.model_executor.models.mistral.MistralForCausalLM'>.

The regular Mistral-7B should be supported by vLLM, but the quantized model does not appear to be supported yet.

With TheBloke/vicuna-13B-v1.5-16K-AWQ

Output

RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Hmm. As with AutoAWQ, models with an extended context length may not work yet.
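To tell in advance whether a downloaded model is a context-extended variant, one option is to look at its config.json. The following is a minimal sketch, assuming a Hugging Face style model directory; `describe_context_config` and `is_context_extended` are hypothetical helper names, and whether RoPE scaling is actually what triggers the CUDA error above has not been confirmed.

```python
import json
from pathlib import Path
from typing import Any, Dict


def describe_context_config(model_dir: str) -> Dict[str, Any]:
    """Read the context-length related fields from a model's config.json.

    Assumes a Hugging Face style directory layout (config.json at the top level).
    """
    config_path = Path(model_dir) / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    return {
        "max_position_embeddings": config.get("max_position_embeddings"),
        "rope_scaling": config.get("rope_scaling"),
    }


def is_context_extended(model_dir: str) -> bool:
    """Heuristic: treat a model as context-extended if rope_scaling is set."""
    return describe_context_config(model_dir)["rope_scaling"] is not None
```

Calling `is_context_extended(model_path)` before `LLM(...)` would let a notebook skip or flag such models instead of hitting the error at load time.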

With TheBloke/vicuna-13B-v1.5-AWQ

So I also try Vicuna v1.5 with the standard context length.

Output

Processed prompts: 100%|██████████| 1/1 [00:00<00:00,  2.15it/s]
Prompt: 'What is AI? AI is', Generated text: ' a field of study that focuses on creating intelligent machines that can perform tasks'

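This time it worked.
Loading the model took about 2 minutes, which felt slower than loading with AutoAWQ.

I measure inference speed roughly, using the same approach as last time.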

def generate_text(prompt:str, max_tokens:int) -> str:
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=max_tokens)

    outputs = llm.generate([prompt], sampling_params)
    return outputs[0].outputs[0].text

import time

prompt = "What is Databricks? Databricks is"

max_new_tokens = 128

time_begin = time.time()

output = generate_text(prompt, max_new_tokens)

time_end = time.time()
time_total = time_end - time_begin

print(output)
print()
print(f"Response generated in {time_total:.2f} seconds, {max_new_tokens} tokens, {max_new_tokens / time_total:.2f} tokens/second")

Output

Processed prompts: 100%|██████████| 1/1 [00:02<00:00,  2.91s/it]
 a cloud-based data lake platform that enables businesses to store, manage, and analyze large amounts of data. It is built on top of Apache Spark, an open-source big data processing framework, and provides a unified platform for data engineering, data science, and data analytics.

What is Data Engineering? Data Engineering is a set of practices and technologies for creating, maintaining, and evolving the systems and architectures that power data-driven applications. Data engineering involves designing, building, and maintaining data pipelines, data stores, and data processing systems to enable businesses to collect, store

Response generated in 2.91 seconds, 128 tokens, 43.94 tokens/second
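The timing above divides the requested max_new_tokens by the elapsed time, which matches the real throughput only when the model actually generates all 128 tokens, and a single run can be noisy. Below is a minimal sketch of a repeatable measurement. `measure_throughput` and the `generate_ids` callable are hypothetical names introduced here, and the commented adapter assumes vLLM's `CompletionOutput.token_ids` attribute.

```python
import time
from statistics import mean
from typing import Callable, List, Sequence


def measure_throughput(
    generate_ids: Callable[[str, int], Sequence[int]],
    prompt: str,
    max_tokens: int,
    runs: int = 5,
) -> List[float]:
    """Return tokens/second for each run, counting tokens actually generated."""
    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        token_ids = generate_ids(prompt, max_tokens)
        elapsed = time.perf_counter() - start
        rates.append(len(token_ids) / elapsed)
    return rates


# Hypothetical adapter for the objects used in this article (llm, SamplingParams):
#
# def vllm_generate_ids(prompt, max_tokens):
#     params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=max_tokens)
#     return llm.generate([prompt], params)[0].outputs[0].token_ids
#
# rates = measure_throughput(vllm_generate_ids, "What is Databricks? Databricks is", 128)
# print(f"mean: {mean(rates):.2f} tokens/second over {len(rates)} runs")
```

Counting the returned token IDs keeps the figure honest when generation stops early at an end-of-sequence token.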

After repeating the measurement several times, inference speed came out at around 44 tokens/sec.
VRAM usage was 7.3GB after loading the model and 18.3GB after inference.
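As a rough sanity check on the 7.3GB seen after loading, the weight footprint can be estimated from the parameter count and the bits per weight. This is a back-of-the-envelope sketch only: it assumes about 13 billion parameters and 4-bit weights, and it ignores group-wise scales and zero points, any unquantized layers, and CUDA context overhead.

```python
def estimate_weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Estimate weight memory in GiB for a given parameter count and bit width."""
    return n_params * bits_per_weight / 8 / 1024**3


N_PARAMS = 13e9  # assumption: roughly 13 billion parameters

fp16_gib = estimate_weight_gib(N_PARAMS, 16)  # unquantized half precision
awq4_gib = estimate_weight_gib(N_PARAMS, 4)   # assumption: 4-bit quantized weights

print(f"fp16 weights : {fp16_gib:.1f} GiB")
print(f"4-bit weights: {awq4_gib:.1f} GiB")
```

The 4-bit estimate comes to roughly 6 GiB, which is in the same ballpark as the figure observed after loading. The growth after inference is not explained by the weights, so it presumably comes from runtime allocations such as the KV cache, which was not verified here.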

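Summary

Because vLLM's AWQ support is not yet optimized, inference is currently faster with AutoAWQ.
Support for quantized models also appears to be limited.

I have not yet been able to evaluate performance thoroughly, but AWQ looks like a quantization format that does well on both memory efficiency and inference speed.
It still feels like a work in progress, but I plan to get ready to use it.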
