Running internlm-chat-7b on a GPU

Posted at 2023-11-12 (last updated at 2023-11-12)

I wondered whether compressing internlm-chat-7b-v1_1 with GPTQ would make it
runnable on a GPU, so I ran an experiment.

Summary

The conversion to a GPTQ model went through without problems,
and the converted model could be run with about 8.4 GB of VRAM in use.
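
As a rough sanity check on that figure, the memory taken by the weights alone can be estimated from the parameter count and the bit width. The sketch below assumes about 7 billion parameters (read off the "7b" in the model name, not measured). It ignores quantization scales, any layers left unquantized, activations, the KV cache, and CUDA overhead, which is presumably where the rest of the roughly 8.4 GB goes.

```python
# Back-of-the-envelope estimate of weight memory only.
# ASSUMPTION: ~7e9 parameters, inferred from the "7b" in the model name.
N_PARAMS = 7e9


def weight_gb(n_params: float, bits_per_weight: int) -> float:
    """Memory needed for the weights alone, in GB (10**9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


fp16_gb = weight_gb(N_PARAMS, 16)
int4_gb = weight_gb(N_PARAMS, 4)
print(f"fp16 weights:  {fp16_gb:.1f} GB")
print(f"4-bit weights: {int4_gb:.1f} GB")
```

Under this assumption the weights shrink to a quarter of their 16-bit size, which is consistent with the model fitting comfortably on a 24 GB card.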

However, the Japanese output is slightly off, so it may be better to use the model in English.

GPTQ-converted model
When using it, please follow the license of internlm-chat-7b-v1_1.
URL: https://huggingface.co/wizcat/internlm-chat-7b-v1_1-gptq

Converting to a GPTQ model

The conversion takes about one hour on an RTX 3090.

Preparation

git clone https://huggingface.co/internlm/internlm-chat-7b-v1_1
pip install -U auto-gptq
pip install git+https://github.com/huggingface/optimum.git
pip install git+https://github.com/huggingface/transformers.git
pip install -U accelerate
pip install -U datasets
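
Since two of these packages are installed straight from GitHub, it can help to record which versions actually ended up in the environment before starting an hour-long conversion. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is made up for illustration.

```python
from importlib import metadata

# Packages installed in the preparation step
REQUIRED = ["auto-gptq", "optimum", "transformers", "accelerate", "datasets"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(REQUIRED).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```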

GPTQ conversion code

from transformers import AutoTokenizer, AutoModelForCausalLM, GPTQConfig

# Local clone of the original model (Windows-style relative path)
model_id = r".\internlm-chat-7b-v1_1"
# Output directory; the speed test below loads from this same name
save_dir = "internlm-chat-7b-v1_1-gptq"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
# 4-bit GPTQ quantization, calibrated on the "c4" dataset
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

# Passing quantization_config makes from_pretrained quantize the model while loading
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", quantization_config=gptq_config, trust_remote_code=True)

# Save the quantized model and the tokenizer into the same directory
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)
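
One quick way to confirm that the conversion actually compressed the model is to compare the on-disk size of the two directories. This is a sketch under the assumption that both directories sit in the current working directory; `dir_size_gb` is a hypothetical helper, not part of any library.

```python
from pathlib import Path


def dir_size_gb(path):
    """Total size of all files under `path`, in GB (10**9 bytes)."""
    root = Path(path)
    return sum(f.stat().st_size for f in root.rglob("*") if f.is_file()) / 1e9


# ASSUMPTION: both directories are in the current working directory
for name in ["internlm-chat-7b-v1_1", "internlm-chat-7b-v1_1-gptq"]:
    if Path(name).is_dir():
        print(f"{name}: {dir_size_gb(name):.2f} GB")
    else:
        print(f"{name}: not found")
```

Note that a directory created with git clone also contains a .git folder, so the figure for the original model may overstate the size of the weights themselves.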

Inference speed test

Inference speed test code

import time
from transformers import AutoTokenizer, AutoModelForCausalLM, GPTQConfig

# Directory written by the conversion step
model_path = r".\internlm-chat-7b-v1_1-gptq"

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Load the 4-bit GPTQ weights with the ExLlama kernels disabled
gptq_config = GPTQConfig(bits=4, disable_exllama=True)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", quantization_config=gptq_config, trust_remote_code=True)

model = model.eval()

testdata = ["おはよう", "hello", "你好",
            "pythonで1から10まで足し合わせるコードを書いて",
            "Write a code that adds up from 1 to 10 in python"]

for txt in testdata:
    start_time = time.perf_counter()
    # history=[] starts a fresh conversation for every prompt
    response, history = model.chat(tokenizer, txt, history=[])
    print(f"in:{txt}\nout:{response}")
    end_time = time.perf_counter()
    elapsed_time = end_time - start_time
    print(f"worktime:{elapsed_time}")
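
In the loop above the first print sits inside the timed region, so console output is counted as part of the work time. The effect is probably negligible next to multi-second generation, but a small wrapper that times only the call keeps the measurement clean. This is a sketch, not the code used for the results below; `timed` is a hypothetical helper, and a stand-in workload replaces the model so it runs anywhere.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload so the sketch runs without the model.
# With the model loaded, the call would look like:
#   (response, history), elapsed = timed(model.chat, tokenizer, txt, history=[])
total, elapsed = timed(sum, range(1_000_000))
print(f"result:{total} worktime:{elapsed:.4f}")
```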

Inference results

in:おはよう
out:おはようございます。何をお願いしますか？
worktime:5.678416800000001

in:hello
out:Hello! How can I assist you today?
worktime:2.8455946

in:你好
out:你好，我是人工智能助手，有什么可以帮助你的吗？
worktime:3.3212843000000003

in:pythonで1から10まで足し合わせるコードを書いて
out:以下のコードは、Pythonで1から10までの数値を足し合わせるものです。

for i in range(1, 11):
    print(i)

このコードは、range(1, 11)という语法を使用して、1から10までの数値を迭代します。然後、print()関数を使用して、それぞれの数値を表示します。
worktime:24.9393924

in:Write a code that adds up from 1 to 10 in python
out:Here's the code to add up from 1 to 10 in Python:

sum = 0
for i in range(1, 11):
    sum += i
print("The sum of numbers from 1 to 10 is:", sum)

This code initializes a variable sum to 0 and then uses a for loop to iterate through the numbers 1 to 10. For each number, it adds it to the sum variable. Finally, it prints out the sum of the numbers from 1 to 10.
worktime:28.824399199999995

Test environment
CPU:i9-12900F
GPU:RTX3090
RAM:128GB

Because the GPTQ-converted model could not be run on the CPU,
the CPU column in the table below shows inference times for the model before GPTQ conversion.

Input	CPU (before GPTQ conversion)	GPU (GPTQ)
おはよう	10.39s	6.05s
hello	11.08s	2.79s
你好	11.12s	2.344s
pythonで1から10まで足し合わせるコードを書いて	71.40s	29.44s
Write a code that adds up from 1 to 10 in python	114.46s	29.44s
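
To make the comparison easier to read, the ratio of CPU time to GPU time can be computed per prompt. The values in the sketch are copied from the table. Keep in mind that the CPU runs use the unquantized model and the GPU runs use the GPTQ model, and that response lengths may differ between runs, so the ratio is only a rough indicator.

```python
# (input, CPU seconds, GPU seconds), copied from the table
timings = [
    ("おはよう", 10.39, 6.05),
    ("hello", 11.08, 2.79),
    ("你好", 11.12, 2.344),
    ("pythonで1から10まで足し合わせるコードを書いて", 71.40, 29.44),
    ("Write a code that adds up from 1 to 10 in python", 114.46, 29.44),
]

# How many times faster the GPU run was for each prompt
speedups = [cpu / gpu for _, cpu, gpu in timings]
for (prompt, cpu, gpu), ratio in zip(timings, speedups):
    print(f"{prompt}: {ratio:.2f}x")
```

On these numbers the GPU run is between roughly 1.7 and 4.7 times faster than the CPU run.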

Conversation continuity test

Since this is not a model aimed at Japanese, a few characters come out slightly wrong,
but it works at a usable level.

msg:pythonで1から10まで足し合わせるコードを書いて
以下は、Pythonで1から10までの数値の总合を計算するコードです。

```
sum = 0
for i in range(1, 11):
    sum += i
print("1から10までの数値の总合は、", sum)
```

このコードは、`sum`変数を0に初期し、`for`文で、1から10（含め）の数値を逐次加算します。最後に、計算結果を表示します。
worktime:29.4449707
msg:そのコードの初期値を5に修正したコードをかいて
以下は、Pythonで1から10までの数値の总合を計算するコードを修正してみました。

```
sum = 5
for i in range(1, 11):
    sum += i
print("1から10までの数値の总合は、", sum)
```

このコードは、`sum`変数を5に初期し、`for`文で、1から10（含め）の数値を逐次加算します。最後に、計算結果を表示します。
worktime:30.212431100000003
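
The code used for this continuity test is not shown. A minimal sketch of how such a multi-turn run could be driven follows, assuming the history value returned by model.chat is passed back in on the next call instead of history=[] as in the speed test. `run_conversation` and `echo_chat` are hypothetical names, and `echo_chat` stands in for the model so the sketch runs anywhere.

```python
def run_conversation(chat_fn, messages):
    """Send messages in order, feeding the returned history into the next turn.

    chat_fn(message, history) must return (response, new_history).
    """
    history = []
    responses = []
    for msg in messages:
        response, history = chat_fn(msg, history)
        responses.append(response)
    return responses, history


# Stand-in for the model (hypothetical): labels each reply with its turn number.
def echo_chat(message, history):
    response = f"turn {len(history) + 1}: {message}"
    return response, history + [(message, response)]


# With the model loaded, chat_fn could be:
#   lambda msg, hist: model.chat(tokenizer, msg, history=hist)
responses, history = run_conversation(echo_chat, ["first", "second"])
for r in responses:
    print(r)
```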
