I tried running FlexGen on Docker.
## Environment

- OS: Ubuntu 22.04.2 LTS
- GPU: NVIDIA GeForce RTX 3060
- CUDA: 12.1
- Docker: 24.0.2
## Setup

- Prepare a Dockerfile.

  ```dockerfile
  FROM pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel
  RUN apt update && apt upgrade -y
  RUN apt install -y git python3 python3-pip
  RUN git clone https://github.com/FMInference/FlexGen
  RUN cd FlexGen && pip install -e .
  ```
- Build the Docker image.

  ```shell
  $ sudo docker build -t flexgen-test .
  ```
- Start a container.

  ```shell
  $ sudo docker run -it --rm --gpus 0 --name flexgen flexgen-test /bin/bash
  ```
## Running FlexGen

### Benchmark
- Run with the model specified.
  The first run takes a while because the model has to be downloaded.

  ```shell
  # python3 -m flexgen.flex_opt --model facebook/opt-1.3b
  ```
Output:
```
<run_flexgen>: args.model: facebook/opt-1.3b
Downloading (…)okenizer_config.json: 100%|██████████| 685/685 [00:00<00:00, 1.54MB/s]
Downloading (…)lve/main/config.json: 100%|██████████| 651/651 [00:00<00:00, 1.60MB/s]
Downloading (…)olve/main/vocab.json: 100%|██████████| 899k/899k [00:00<00:00, 1.48MB/s]
Downloading (…)olve/main/merges.txt: 100%|██████████| 456k/456k [00:00<00:00, 28.7MB/s]
Downloading (…)cial_tokens_map.json: 100%|██████████| 221/221 [00:00<00:00, 470kB/s]
model size: 2.443 GB, cache size: 0.398 GB, hidden size (prefill): 0.008 GB
init weight...
Load the pre-trained pytorch weights of opt-1.3b from huggingface. The downloading and cpu loading can take dozens of minutes. If it seems to get stuck, you can monitor the progress by checking the memory usage of this process.
Downloading pytorch_model.bin: 100%|██████████| 2.63G/2.63G [02:18<00:00, 19.0MB/s]
Fetching 1 files: 100%|██████████| 1/1 [02:20<00:00, 140.88s/it]
Convert format: 100%|██████████| 1/1 [00:03<00:00, 3.62s/it]
warmup - generate
benchmark - generate
/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:293: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
  warnings.warn(
Outputs:
----------------------------------------------------------------------
0: Paris is the capital city of France. It is the most populous city in France, with an estimated population of 6,848,000 in 2016. It is the second most populous city
----------------------------------------------------------------------
3: Paris is the capital city of France. It is the most populous city in France, with an estimated population of 6,848,000 in 2016. It is the second most populous city
----------------------------------------------------------------------
TorchDevice: cuda:0
  cur_mem: 2.6505 GB, peak_mem: 3.2478 GB
TorchDevice: cpu
  cur_mem: 0.0000 GB, peak_mem: 0.0000 GB
model size: 2.443 GB    cache size: 0.398 GB    hidden size (p): 0.008 GB
peak gpu mem: 3.248 GB  projected: False
prefill latency: 0.292 s    prefill throughput: 7018.240 token/s
decode latency: 0.501 s     decode throughput: 247.357 token/s
total latency: 0.793 s      total throughput: 161.390 token/s
```
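As a sanity check, the throughput figures in the log line up with FlexGen's default benchmark shape, which I assume here to be 4 prompts of 512 input tokens each, generating 32 tokens per prompt (the shape is not printed in the log, so the numbers below are an assumption, not something the tool reports):

```python
# Rough sanity check of the reported throughput figures, assuming
# FlexGen's default benchmark shape: a batch of 4 prompts, 512 input
# tokens each, 32 generated tokens each (assumed, not shown in the log).
batch, prompt_len, gen_len = 4, 512, 32

prefill_latency = 0.292  # s, from the log above
decode_latency = 0.501   # s
total_latency = 0.793    # s

# Prefill processes all prompt tokens in one pass.
prefill_tps = batch * prompt_len / prefill_latency
# Decode produces gen_len - 1 tokens per prompt (the first generated
# token comes out of the prefill step).
decode_tps = batch * (gen_len - 1) / decode_latency
# Total throughput counts only generated tokens.
total_tps = batch * gen_len / total_latency

print(f"prefill ~{prefill_tps:.0f} token/s")  # log reports 7018.240
print(f"decode  ~{decode_tps:.0f} token/s")   # log reports 247.357
print(f"total   ~{total_tps:.0f} token/s")    # log reports 161.390
```

All three computed values land within a fraction of a percent of the logged ones, which suggests the assumed shape is right.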
- Change the model to facebook/opt-6.7b and run again.

  ```shell
  # python3 -m flexgen.flex_opt --model facebook/opt-6.7b
  ```
Output:
```
<run_flexgen>: args.model: facebook/opt-6.7b
model size: 12.386 GB, cache size: 1.062 GB, hidden size (prefill): 0.017 GB
init weight...
Load the pre-trained pytorch weights of opt-6.7b from huggingface. The downloading and cpu loading can take dozens of minutes. If it seems to get stuck, you can monitor the progress by checking the memory usage of this process.
Downloading (…)l-00002-of-00002.bin: 100%|██████████| 3.36G/3.36G [01:36<00:00, 34.7MB/s]
Downloading (…)l-00001-of-00002.bin: 100%|██████████| 9.96G/9.96G [03:24<00:00, 48.7MB/s]
Fetching 2 files: 100%|██████████| 2/2 [03:25<00:00, 102.84s/it]
Convert format: 100%|██████████| 2/2 [00:38<00:00, 19.22s/it]
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/workspace/FlexGen/flexgen/flex_opt.py", line 1327, in <module>
    run_flexgen(args)
  File "/workspace/FlexGen/flexgen/flex_opt.py", line 1219, in run_flexgen
    model = OptLM(opt_config, env, args.path, policy)
  File "/workspace/FlexGen/flexgen/flex_opt.py", line 637, in __init__
    self.init_all_weights()
  File "/workspace/FlexGen/flexgen/flex_opt.py", line 800, in init_all_weights
    self.init_weight(j)
  File "/workspace/FlexGen/flexgen/flex_opt.py", line 651, in init_weight
    self.layers[j].init_weight(self.weight_home[j], expanded_path)
  File "/workspace/FlexGen/flexgen/flex_opt.py", line 494, in init_weight
    weights = init_weight_list(weight_specs, self.policy, self.env)
  File "/workspace/FlexGen/flexgen/flex_opt.py", line 112, in init_weight_list
    weight = home.allocate(shape, dtype, pin_memory=pin_memory)
  File "/workspace/FlexGen/flexgen/pytorch_backend.py", line 190, in allocate
    data = torch.empty(shape, dtype=dtype, pin_memory=pin_memory, device=self.dev)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 11.76 GiB total capacity; 11.53 GiB already allocated; 29.69 MiB free; 11.53 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
It crashed with a CUDA out-of-memory error.
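The failure is predictable from FlexGen's own log: the reported model size (12.386 GB) plus KV cache (1.062 GB) already exceeds the 11.76 GiB capacity PyTorch reports for this GPU. A back-of-envelope check, using only numbers from the output above:

```python
# Back-of-envelope check: can opt-6.7b fit entirely on this GPU?
# All sizes are taken from the FlexGen / PyTorch output above.
model_size_gb = 12.386    # opt-6.7b weights, as reported by FlexGen
cache_size_gb = 1.062     # KV cache for the benchmark workload
gpu_capacity_gib = 11.76  # RTX 3060 capacity, as reported by PyTorch (GiB)

gpu_capacity_gb = gpu_capacity_gib * 1.024**3  # GiB -> GB

needed_gb = model_size_gb + cache_size_gb
print(f"needed: ~{needed_gb:.1f} GB, available: ~{gpu_capacity_gb:.1f} GB")
print("fits entirely on GPU:", needed_gb <= gpu_capacity_gb)
```

Roughly 13.4 GB is needed against about 12.6 GB of VRAM, so some offloading is unavoidable on this card.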
- Retry with the `--percent` option to offload part of the model to the CPU.

  ```shell
  # python3 -m flexgen.flex_opt --model facebook/opt-6.7b --percent 75 25 100 0 100 0
  ```
Output:
```
<run_flexgen>: args.model: facebook/opt-6.7b
model size: 12.386 GB, cache size: 1.062 GB, hidden size (prefill): 0.017 GB
init weight...
warmup - generate
benchmark - generate
/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:293: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
  warnings.warn(
/workspace/FlexGen/flexgen/utils.py:132: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  data_ptr = tensor.storage().data_ptr()
Outputs:
----------------------------------------------------------------------
0: Paris is the capital city of France and the most visited city in the world. It is the most visited city in the world, with more than 30 million visitors each year. Paris is the
----------------------------------------------------------------------
3: Paris is the capital city of France and the most visited city in the world. It is the most visited city in the world, with more than 30 million visitors each year. Paris is the
----------------------------------------------------------------------
TorchDevice: cuda:0
  cur_mem: 7.7936 GB, peak_mem: 9.2148 GB
TorchDevice: cpu
  cur_mem: 5.0003 GB, peak_mem: 0.0000 GB
model size: 12.386 GB   cache size: 1.062 GB    hidden size (p): 0.017 GB
peak gpu mem: 9.215 GB  projected: False
prefill latency: 1.195 s    prefill throughput: 1714.413 token/s
decode latency: 10.083 s    decode throughput: 12.297 token/s
total latency: 11.278 s     total throughput: 11.349 token/s
```
It worked!
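Per the FlexGen README, `--percent` takes six numbers: the percentage of weights on GPU and CPU, of KV cache on GPU and CPU, and of activations on GPU and CPU (anything left over goes to disk). A rough sketch of the placement that `75 25 100 0 100 0` implies for opt-6.7b, using the sizes from the log:

```python
# Rough placement implied by --percent 75 25 100 0 100 0 for opt-6.7b.
# The six numbers are weights GPU%/CPU%, KV cache GPU%/CPU%,
# activations GPU%/CPU% (per the FlexGen README); sizes from the log.
model_size_gb = 12.386
cache_size_gb = 1.062

w_gpu_pct, w_cpu_pct = 75, 25
cache_gpu_pct = 100

weights_gpu_gb = model_size_gb * w_gpu_pct / 100
weights_cpu_gb = model_size_gb * w_cpu_pct / 100
cache_gpu_gb = cache_size_gb * cache_gpu_pct / 100

print(f"weights on GPU: ~{weights_gpu_gb:.2f} GB")
print(f"weights on CPU: ~{weights_cpu_gb:.2f} GB")
print(f"KV cache on GPU: ~{cache_gpu_gb:.2f} GB")
```

About 9.29 GB of weights on the GPU is roughly consistent with the observed peak GPU memory of 9.215 GB, and the CPU share accounts for most of the 5.0 GB of host memory in the log.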
## FlexGen API sample

While I was at it, I also tried the API sample.
- Using a smaller model here:

  ```shell
  # python3 -m flexgen.apps.completion --model facebook/opt-1.3b
  ```
Output:
```
Initialize...
Generate...
Outputs:
----------------------------------------------------------------------
0: Question: Where were the 2004 Olympics held? Answer: Athens, Greece Question: What is the longest river on the earth? Answer: The Nile Question: What is the name of the tallest mountain? Answer: Mount Kilimanjaro Question: Are there no higher mountains than Everest
----------------------------------------------------------------------
1: Extract the airport codes from this text. Text: "I want a flight from New York to San Francisco." Airport codes: JFK, SFO. Text: "I want you to book a flight from Phoenix to Las Vegas." Airport codes: PHX, LVG. Text: "I want to book a flight from New York to San Francisco." Airport codes: JFK, SFO
----------------------------------------------------------------------
Shutdown...
```
It answers properly.
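Both outputs are classic few-shot prompting: the prompt already contains worked examples and the model simply continues the pattern. A minimal sketch of how such a prompt can be assembled (the helper below is my own illustration, not part of FlexGen):

```python
# Hypothetical helper illustrating the few-shot Q&A prompt pattern
# visible in the completion app's output; this is not FlexGen code.
def build_few_shot_prompt(examples, query):
    """Join (question, answer) pairs and append the new question."""
    lines = []
    for question, answer in examples:
        lines.append(f"Question: {question}")
        lines.append(f"Answer: {answer}")
    lines.append(f"Question: {query}")
    lines.append("Answer:")  # the model completes from here
    return "\n".join(lines)

examples = [
    ("Where were the 2004 Olympics held?", "Athens, Greece"),
    ("What is the longest river on the earth?", "The Nile"),
]
prompt = build_few_shot_prompt(
    examples, "What is the name of the tallest mountain?")
print(prompt)
```

The model's continuation after the trailing "Answer:" is what the app prints as output.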
## Summary

The GPU I used is not particularly high-spec, so trying a larger model ran out of memory. The `--percent` option got it running by offloading part of the work to the CPU, at the cost of much slower generation. I had braced myself for this to be difficult, but it turned out to be surprisingly easy to get working. If you have a machine with a GPU, give it a casual try.