Running inference with AWQ-format models on Databricks (Part 1)

Posted at 2023-10-03 (last updated 2023-10-03)

I am impressed by how much easier this was to use than I expected.

Introduction

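I only learned about it recently, but AWQ is a relatively new quantization format for LLMs.
The differences between the various quantization formats are explained in the article below.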

vLLM added AWQ support in 0.2.0, the latest version at the time of writing, and TheBloke has been uploading models in this format as well.

AWQ is said to beat earlier quantization approaches in both quality and efficiency, so I want to try it in the hope of faster inference.

I ran the tests on Databricks on AWS, using DBR 14.0 ML on a g5.xlarge instance (A10 GPU).

What is AWQ?

I have not read the paper properly, but according to the site above:

Efficient and accurate low-bit weight quantization (INT3/4) for LLMs, supporting instruction-tuned models and multi-modal LMs.

In other words, it is a quantization format that supports 3-bit and 4-bit weights.

What I tested

This time I try the following models, which differ in parameter count.

TheBloke/Mistral-7B-v0.1-AWQ
TheBloke/vicuna-13B-v1.5-16K-AWQ

I will also experiment with the following two inference libraries.

AutoAWQ: https://github.com/casper-hansen/AutoAWQ
vLLM: https://github.com/vllm-project/vllm

I will leave the vLLM experiments for next time and use AutoAWQ in this post.

What is AutoAWQ?

AutoAWQ is an easy-to-use package for 4-bit quantized models. AutoAWQ speeds up models by 2x while reducing memory requirements by 3x compared to FP16. AutoAWQ implements the Activation-aware Weight Quantization (AWQ) algorithm for quantizing LLMs. AutoAWQ was created and improved upon from the original work from MIT.

So it appears to let you run 4-bit AWQ-quantized models quickly and efficiently.
Its API is fairly close to Transformers, so it should be easy to pick up if you are already used to Transformers.

One caveat: the requirements are

Compute Capability 8.0 (sm80). Ampere and later architectures are supported.
CUDA Toolkit 11.8 and later.

so you need a GPU with a relatively recent architecture.
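A quick way to check the first requirement from a notebook is to ask PyTorch for the GPU's compute capability. The following is a minimal sketch, assuming PyTorch is installed (it ships with the ML runtime); the helper names `meets_min_capability` and `current_gpu_capability` are made up for this example and are not part of AutoAWQ.

```python
from typing import Optional, Tuple

# sm80, taken from the AutoAWQ requirements quoted above
MIN_CAPABILITY = (8, 0)


def meets_min_capability(capability: Tuple[int, int]) -> bool:
    """Return True if a (major, minor) compute capability is at least sm80."""
    return tuple(capability) >= MIN_CAPABILITY


def current_gpu_capability() -> Optional[Tuple[int, int]]:
    """Return the compute capability of GPU 0, or None if it cannot be queried."""
    try:
        import torch  # imported lazily so the helper above works without PyTorch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


capability = current_gpu_capability()
if capability is None:
    print("No CUDA GPU is visible from this Python process")
else:
    print(f"Compute capability {capability}: supported={meets_min_capability(capability)}")
```

Tuples compare element by element, so `(8, 6)` passes the check while `(7, 5)` does not.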

Let's try it.

Step 1. Download the models

The usual routine.

from huggingface_hub import snapshot_download

# Placeholder: path of the Unity Catalog volume where the models are stored
UC_VOLUME = "/Volumes/<Unity Catalog volume for storing models>"

model = "TheBloke/Mistral-7B-v0.1-AWQ"
local_dir = f"/tmp/{model}"
uc_dir = "/models--TheBloke--Mistral-7B-v0.1-AWQ"

snapshot_location = snapshot_download(
    repo_id=model,
    local_dir=local_dir,
    local_dir_use_symlinks=False,
)

dbutils.fs.cp(f"file:{local_dir}", f"{UC_VOLUME}{uc_dir}", recurse=True)

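I downloaded three models this way: the two listed above, plus TheBloke/CodeLlama-34B-Instruct-AWQ, which comes up in Step 4.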
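Repeating the download for several models is easier if the volume directory name is derived from the repository ID instead of typed by hand. This is a sketch under assumptions: `volume_dir_for` and `download_all` are hypothetical helpers, and `download_all` only repeats the `snapshot_download` and `dbutils.fs.cp` calls shown above, with the notebook's `dbutils` object passed in explicitly.

```python
MODELS = [
    "TheBloke/Mistral-7B-v0.1-AWQ",
    "TheBloke/vicuna-13B-v1.5-16K-AWQ",
    "TheBloke/CodeLlama-34B-Instruct-AWQ",
]


def volume_dir_for(repo_id: str) -> str:
    """Map 'org/name' to the '/models--org--name' directory naming used above."""
    return "/models--" + repo_id.replace("/", "--")


def download_all(uc_volume: str, dbutils) -> None:
    """Download every model in MODELS and copy it into the Unity Catalog volume.

    Not called here: it needs network access and a Databricks notebook.
    """
    from huggingface_hub import snapshot_download

    for repo_id in MODELS:
        local_dir = f"/tmp/{repo_id}"
        snapshot_download(
            repo_id=repo_id,
            local_dir=local_dir,
            local_dir_use_symlinks=False,
        )
        dbutils.fs.cp(
            f"file:{local_dir}",
            f"{uc_volume}{volume_dir_for(repo_id)}",
            recurse=True,
        )


# Show the directory name each repository ID maps to
for repo_id in MODELS:
    print(f"{repo_id} -> {volume_dir_for(repo_id)}")
```

In a notebook this would be run as `download_all(UC_VOLUME, dbutils)`.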

Step 2. Run inference

Install the required modules.

%pip install -U autoawq git+https://github.com/huggingface/transformers.git accelerate

dbutils.library.restartPython()

transformers is installed from GitHub because that is required to run Mistral-7B.
It depends on your network, but the installation should take a little over a minute.
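Because transformers comes from the GitHub main branch, it can help to record which versions were actually installed after the Python restart. A small sketch using only the standard library; `installed_version` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Distribution names as used in the %pip command above
for package in ("autoawq", "transformers", "accelerate"):
    print(f"{package}: {installed_version(package) or 'not installed'}")
```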

Now let's run inference, following the sample code.
First up is Mistral-7B-v0.1.

UC_VOLUME = "/Volumes/<Unity Catalog volume for storing models>"

model_dir = "/models--TheBloke--Mistral-7B-v0.1-AWQ"
model_path = f"{UC_VOLUME}{model_dir}"

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Load model
model = AutoAWQForCausalLM.from_quantized(
    model_path,
    fuse_layers=True,
    trust_remote_code=False,
    safetensors=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    model_path,
    trust_remote_code=False,
)

prompt = "What is AI? AI is"

tokens = tokenizer(prompt, return_tensors="pt").input_ids.cuda()

# Generate output
generation_output = model.generate(
    tokens, do_sample=True, temperature=0.7, top_p=0.95, top_k=40, max_new_tokens=32
)

print("Output: ", tokenizer.decode(generation_output[0]))

Output:  <s> What is AI? AI is a branch of computer science that has been around for a long time. It is a field of study that is concerned with the development of intelligent machines that can perform

And it worked.

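With 4-bit quantization, accuracy degradation is a concern, but simple generation like the above appears to work without problems.
In this post I mainly look at inference speed and memory efficiency.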

Step 3. Measure speed

It is not rigorous, but I took a simple measurement of inference speed.

First, wrap the inference in a function.

def generate_text(prompt: str, max_new_tokens: int = 512) -> str:
    """Sample a completion for `prompt` and return the decoded text."""

    tokens = tokenizer(
        prompt,
        return_tensors='pt'
    ).input_ids.cuda()

    # Generate output
    generation_output = model.generate(
        tokens,
        do_sample=True,
        temperature=0.7,
        top_p=0.95,
        top_k=40,
        max_new_tokens=max_new_tokens,
    )

    return tokenizer.decode(generation_output[0])

import time

prompt = "What is Databricks? Databricks is"

max_new_tokens = 128

time_begin = time.time()

output = generate_text(prompt, max_new_tokens)

time_end = time.time()
time_total = time_end - time_begin

print(output)
print()
print(f"Response generated in {time_total:.2f} seconds, {max_new_tokens} tokens, {max_new_tokens / time_total:.2f} tokens/second")

Result

<s> What is Databricks? Databricks is a platform that is used to build and run large-scale data applications. It is a platform that can help you to store, manage, and analyze your data.

## What is Robotics?

Robotics is the study of robots. It is the study of robots that can perform tasks that are usually done by humans.

## What are the Benefits of AI?

The benefits of AI are many. They include the following:

- AI can help you to make better decisions.
- AI can help you to solve problems.
- AI can help you to save time

Response generated in 2.03 seconds, 128 tokens, 63.03 tokens/second

I ran it several times and got roughly 63-65 tokens/sec.
Increasing max_new_tokens a bit should give a more stable measurement.
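The figure above divides `max_new_tokens` by the elapsed time, which overstates the rate whenever generation stops early at an end-of-sequence token. A sketch of a measurement that counts the tokens actually produced and repeats the run follows; `measure_tokens_per_second` is a hypothetical helper, and `fake_generate` is a stand-in so that the example runs without a GPU.

```python
import time
from statistics import mean
from typing import Callable, List, Sequence


def measure_tokens_per_second(
    generate_ids: Callable[[], Sequence[int]],
    prompt_length: int,
    runs: int = 3,
) -> List[float]:
    """Time `generate_ids` several times and return tokens/second for each run.

    `generate_ids` must return the full output sequence (prompt + new tokens),
    so the number of new tokens is measured rather than assumed.
    """
    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        output_ids = generate_ids()
        elapsed = time.perf_counter() - start
        new_tokens = len(output_ids) - prompt_length
        rates.append(new_tokens / elapsed)
    return rates


def fake_generate() -> List[int]:
    """Stand-in for a model call: 5 prompt tokens followed by 20 new tokens."""
    time.sleep(0.01)
    return list(range(5 + 20))


rates = measure_tokens_per_second(fake_generate, prompt_length=5, runs=3)
print(f"mean: {mean(rates):.1f} tokens/second over {len(rates)} runs")
```

With the model loaded in Step 2, `generate_ids` could be something like `lambda: model.generate(tokens, max_new_tokens=128)[0]` with `prompt_length=tokens.shape[1]`.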

In a previous article I tried ExLlama V2, which reached about 60 tokens/sec on the same g5.xlarge, so performance looks roughly comparable.
Even for a 7B model, more than 60 tokens/sec is plenty fast.

VRAM usage was about 3.6 GB after loading the model and rose to 10.8 GB once inference ran.
Perhaps it keeps a cache internally when inference runs?
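One way to dig into that question is to compare what PyTorch reports as allocated by live tensors with what its caching allocator has reserved from the driver, since process-level tools such as nvidia-smi include the reserved pool. This is a sketch only: the post does not say how the VRAM numbers were taken, and `gpu_memory_gb` is a hypothetical helper.

```python
from typing import Dict, Optional


def gpu_memory_gb(device: int = 0) -> Optional[Dict[str, float]]:
    """Return PyTorch's view of GPU memory in GB, or None without a CUDA GPU."""
    try:
        import torch  # imported lazily so the function degrades gracefully
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    gb = 1e9
    return {
        # Memory currently occupied by live tensors
        "allocated": torch.cuda.memory_allocated(device) / gb,
        # Memory held by the caching allocator, including free cached blocks
        "reserved": torch.cuda.memory_reserved(device) / gb,
        # High-water mark of allocated memory since the process started
        "peak_allocated": torch.cuda.max_memory_allocated(device) / gb,
    }


stats = gpu_memory_gb()
print(stats if stats is not None else "No CUDA GPU is visible from this Python process")
```

Calling it once after loading the model and once after generating would show whether the growth is in live tensors or in cached blocks.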

Step 4. Try other models

Next I use TheBloke/vicuna-13B-v1.5-16K-AWQ.
The code is almost the same, so I omit everything except the timing part.

import time

prompt = "What is Databricks? Databricks is"

max_new_tokens = 128

time_begin = time.time()

output = generate_text(prompt, max_new_tokens)

time_end = time.time()
time_total = time_end - time_begin

print(output)
print()
print(f"Response generated in {time_total:.2f} seconds, {max_new_tokens} tokens, {max_new_tokens / time_total:.2f} tokens/second")

Result

<s> What is Databricks? Databricks is a cloud-based data lake platform that allows organizations to store, process, and analyze large amounts of data in real-time.
What is Apache Spark? Apache Spark is an open-source data processing engine that is designed to process large amounts of data in real-time.
What is Apache Spark? Apache Spark is an open-source data processing engine that is designed to process large amounts of data in real-time.
What is Apache Spark? Apache Spark is an open-source data processing engine that is designed to process large amounts of data in real-time.
What is Apache Spark? Apache Spark is an open-

Response generated in 2.83 seconds, 128 tokens, 45.18 tokens/second

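The output is off, and I suspect the cause is that RoPE scaling is not supported. (The message No such comm: LSP_COMM_ID appeared while loading the model, so at the very least the model does not seem to have loaded properly.)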
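Whether a checkpoint asks for RoPE scaling can be read from the `rope_scaling` entry of its `config.json`. The sketch below only reads that field; it does not tell you whether AutoAWQ honors it. `read_rope_scaling` is a hypothetical helper, and the demo writes a throwaway config with illustrative values rather than reading the real model.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Optional


def read_rope_scaling(model_dir: str) -> Optional[Any]:
    """Return the `rope_scaling` entry of config.json, or None if it is absent."""
    config_path = Path(model_dir) / "config.json"
    with config_path.open(encoding="utf-8") as f:
        config = json.load(f)
    return config.get("rope_scaling")


# Demo against a temporary directory; the values are illustrative only
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "config.json").write_text(
        json.dumps({"rope_scaling": {"type": "linear", "factor": 4.0}}),
        encoding="utf-8",
    )
    demo_value = read_rope_scaling(tmp)

print(demo_value)
```

Against the downloaded model this would be `read_rope_scaling(model_path)`.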

Inference speed was around 45 tokens/sec.
VRAM usage was 6.2 GB after loading the model and rose to about 13.2 GB after inference.

I also tried TheBloke/CodeLlama-34B-Instruct-AWQ, but it hit a CUDA OOM. There was just barely not enough VRAM.
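A back-of-the-envelope estimate fits these observations. The sketch below assumes nominal parameter counts (7B, 13B, 34B) and counts only the 4-bit weights, ignoring quantization scales, zero points, and any layers kept in FP16, so treat the numbers as rough. `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(n_params: float, bits_per_weight: int = 4) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Nominal sizes taken from the model names; real parameter counts differ slightly
NOMINAL_PARAMS = {
    "Mistral-7B": 7e9,
    "vicuna-13B": 13e9,
    "CodeLlama-34B": 34e9,
}

for name, n_params in NOMINAL_PARAMS.items():
    print(f"{name}: ~{weight_memory_gb(n_params):.1f} GB of 4-bit weights")
```

The 7B and 13B estimates (about 3.5 GB and 6.5 GB) are close to the 3.6 GB and 6.2 GB measured after loading. For 34B the weights alone come to about 17 GB, and if inference adds roughly the 7 GB seen with the two smaller models, the total lands right at the 24 GB of the A10 in a g5.xlarge.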

Summary

AutoAWQ seems to work well in some cases and not in others, so I need to look into it a bit more. I am also curious about how it performs with Japanese-language models.

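That said, it is an easy-to-use module, and inference really is quite fast.
(A vicuna-13B-v1.5-16K model quantized to 8-bit with CTranslate2 ran at about 10 tokens/sec, so this is considerably faster.)
Also, perhaps because the model files are small, loading the model is quite fast too.
The module is still maturing, but it seems worth considering for serious use.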

Next time I will try AWQ-format models with vLLM.