
Investigating the VRAM needed to run Llama 2 with llama-cpp-python

Introduction

I'm studying llama.cpp so I can run Llama 2 locally. In this post I use llama-cpp-python, a package that provides Python bindings for the llama.cpp library, to measure how much GPU memory each model uses.

My GPU is an RTX 3060 Ti with 8 GB of memory, so I also want to find a GPU-offload setting that keeps usage within 8 GB.

Environment

Google Colaboratory (GPU: T4)

Trying it out

I basically copy-pasted the LangChain tutorial.

1. Downloading the GGML Llama 2 models

# 7b ggml llama2
!wget -q -P ./models https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/resolve/main/llama-2-7b-chat.ggmlv3.q4_K_M.bin

# 13b ggml llama2
!wget -q -P ./models https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolve/main/llama-2-13b-chat.ggmlv3.q4_K_M.bin

llama2_7b_path = "/content/models/llama-2-7b-chat.ggmlv3.q4_K_M.bin"
llama2_13b_path = "/content/models/llama-2-13b-chat.ggmlv3.q4_K_M.bin"

🖊️ What is GGML?

A library that quantizes large language models so they can run on consumer-grade, personal hardware.
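As a rough sanity check on the downloaded file sizes, you can estimate the on-disk footprint of a quantized model from its parameter count. The average bits-per-weight figure below is my assumption (q4_K_M mixes 4-bit and higher-precision tensors, so ~4.8 bits/weight is only a ballpark; the exact number varies by tensor):

```python
def approx_size_gb(n_params_billion: float, bits_per_weight: float = 4.8) -> float:
    """Back-of-envelope size of a quantized model file in GiB.

    bits_per_weight is an assumed average for q4_K_M quantization.
    """
    n_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return n_bytes / 1024**3

print(f"7B:  ~{approx_size_gb(7):.1f} GiB")
print(f"13B: ~{approx_size_gb(13):.1f} GiB")
```

These estimates land close to the actual q4_K_M file sizes on Hugging Face, which is a quick way to confirm a download completed.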

2. Installing the libraries

!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install -q llama-cpp-python
!pip install -q langchain
from langchain.llms import LlamaCpp
from langchain import PromptTemplate, LLMChain
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.agents import load_tools
from langchain.agents import initialize_agent

3. Running the LLM

template = """Question: {question}

Answer: Let's work this out in a step by step way to be sure we have the right answer."""

prompt = PromptTemplate(template=template, input_variables=["question"])

callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
n_gpu_layers = 40 # change this
n_batch = 512

llm = LlamaCpp(
    model_path=llama2_7b_path, # change this
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
    callback_manager=callback_manager,
    verbose=True,
)
question = "What NFL team won the Super Bowl in the year Justin Bieber was born?"

llm_chain = LLMChain(prompt=prompt, llm=llm)
llm_chain.run(question)
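The VRAM figures in the sections below were read off during generation. A minimal sketch for capturing the same number programmatically from a Colab cell is to query `nvidia-smi` (this assumes an NVIDIA driver is available on the runtime):

```python
import subprocess

def parse_used_mib(csv_out: str) -> int:
    # With --format=csv,noheader,nounits, nvidia-smi prints one bare
    # number (MiB) per GPU; take the first GPU's value.
    return int(csv_out.strip().splitlines()[0])

def gpu_memory_used_mib() -> int:
    """Current GPU memory usage in MiB, as reported by nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return parse_used_mib(out)
```

Calling `gpu_memory_used_mib()` right after a generation finishes gives a snapshot comparable to the readings below.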

3-1. model=7b, n_gpu_layers=40

 Here are the Super Bowl winners for each year since 1967, when the NFL-AFL merger took place:

* 1967 - Green Bay Packers
* 1968 - Kansas City Chiefs
* 1969 - New York Jets
* 1970 - Baltimore Colts
* 1971 - Dallas Cowboys
* 1972 - Miami Dolphins
* 1973 - Oakland Raiders
* 1974 - Pittsburgh Steelers
* 1975 - Minnesota Vikings
* 1976 - Pittsburgh Steelers
* 1977 - Dallas Cowboys
* 1978 - Los Angeles Rams
* 1979 - San Francisco 49ers
* 1980 - Pittsburgh Steelers
* 1981 - Miami Dolphins
* 1982 - Washington Redskins
* 1983 - Los Angeles Raiders
* 1984 - San Francisco 49ers
* 198

Generation time: 9 s
VRAM: 5.6 GB

3-2. model=13b, n_gpu_layers=40

Step 1: When is Justin Bieber born?

Step 2: What NFL season did he get born during?

Step 3: Who won the Super Bowl during that season?

Please help with each of these steps so I can give you the correct answer!

Generation time: 9 s
VRAM: 9.1 GB

3-3. model=13b, n_gpu_layers=30

 Justin Bieber was born on March 1, 1994. The Super Bowl is played at the end of each NFL season, which typically falls in February.

So, if we assume that Justin Bieber was born in the early morning hours of March 1st (ignoring any possible pregnancy complications or premature birth), and the Super Bowl was held on the first Sunday in February (which is when the game is usually held), then we can calculate the year of the Super Bowl as follows:

* If the Super Bowl was held on the first Sunday in February 1994, then it would have been Super Bowl XXVIII.
* However, if the Super Bowl was held on the first Sunday in February 1995 (i.e., one year after Justin Bieber's birth), then it would have been Super Bowl XXX.

Therefore, the NFL team that won the Super Bowl in the year Justin Bieber was born is:

The Dallas Cowboys, who won Super Bowl XXX.

Generation time: 60 s
VRAM: 7.3 GB

3-4. model=13b, n_gpu_layers=20

Justin Bieber was born on March 1, 1994. The Super Bowl is played at the end of the NFL season which runs from September to December. Therefore, we can deduce that Justin Bieber was not alive during any Super Bowl game. Therefore, no team won the Super Bowl in the year Justin Bieber was born.

Generation time: 60 s
VRAM: 5.5 GB

Summary

Having run all of this, I don't think the raw generation times mean much without knowing how many tokens were generated per second.
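As a sketch of what I mean, tokens/s could be measured directly. The helper below is hypothetical: it relies on LangChain's `LlamaCpp` wrapper exposing the underlying `llama_cpp.Llama` object as `llm.client`, which is an internal detail, and re-tokenizes the output to count tokens:

```python
import time

def throughput_tok_per_s(n_tokens: int, elapsed_s: float) -> float:
    """Tokens generated per second of wall-clock time."""
    return n_tokens / elapsed_s

def measure_tps(llm, prompt_text: str) -> float:
    """Time one generation and count its output tokens.

    Assumes `llm` is a LangChain LlamaCpp instance whose `.client`
    attribute is the underlying llama_cpp.Llama (an internal detail).
    """
    start = time.perf_counter()
    output = llm(prompt_text)  # LlamaCpp instances are callable on a string
    elapsed = time.perf_counter() - start
    n_tokens = len(llm.client.tokenize(output.encode("utf-8")))
    return throughput_tok_per_s(n_tokens, elapsed)
```

Note that llama.cpp itself also prints per-token timings when `verbose=True`, so the log output above already contains this information.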

Also, VRAM usage may well vary with the prompt, so the figures measured here should be treated as rough guides only.

With that in mind, for my local setup I'll go with either model=13b, n_gpu_layers=20 or model=7b, n_gpu_layers=40.

The outputs themselves were underwhelming for every model, but I suspect they can be reined in somewhat with better prompting, so I'll keep experimenting there.
