I tried running FlexGen on Docker.
## Environment

- OS: Ubuntu 22.04.2 LTS
- GPU: NVIDIA GeForce RTX 3060
- CUDA: 12.1
- Docker: 24.0.2
## Setup

- Prepare a Dockerfile.

  ```dockerfile
  FROM pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel
  RUN apt update && apt upgrade -y
  RUN apt install -y git python3 python3-pip
  RUN git clone https://github.com/FMInference/FlexGen
  RUN cd FlexGen && pip install -e .
  ```
- Build the Docker image.

  ```shell
  $ sudo docker build -t flexgen-test .
  ```
- Start a container.

  ```shell
  $ sudo docker run -it --rm --gpus 0 --name flexgen flexgen-test /bin/bash
  ```
## Running FlexGen

### Benchmark
- Run with the model specified.
  The first run takes a while because the model has to be downloaded.

  ```shell
  # python3 -m flexgen.flex_opt --model facebook/opt-1.3b
  ```
Output:
```
<run_flexgen>: args.model: facebook/opt-1.3b
Downloading (…)okenizer_config.json: 100%|██████████| 685/685 [00:00<00:00, 1.54MB/s]
Downloading (…)lve/main/config.json: 100%|██████████| 651/651 [00:00<00:00, 1.60MB/s]
Downloading (…)olve/main/vocab.json: 100%|██████████| 899k/899k [00:00<00:00, 1.48MB/s]
Downloading (…)olve/main/merges.txt: 100%|██████████| 456k/456k [00:00<00:00, 28.7MB/s]
Downloading (…)cial_tokens_map.json: 100%|██████████| 221/221 [00:00<00:00, 470kB/s]
model size: 2.443 GB, cache size: 0.398 GB, hidden size (prefill): 0.008 GB
init weight...
Load the pre-trained pytorch weights of opt-1.3b from huggingface. The downloading and cpu loading can take dozens of minutes. If it seems to get stuck, you can monitor the progress by checking the memory usage of this process.
Downloading pytorch_model.bin: 100%|██████████| 2.63G/2.63G [02:18<00:00, 19.0MB/s]
Fetching 1 files: 100%|██████████| 1/1 [02:20<00:00, 140.88s/it]
Convert format: 100%|██████████| 1/1 [00:03<00:00, 3.62s/it]
warmup - generate
benchmark - generate
/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:293: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
  warnings.warn(
Outputs:
----------------------------------------------------------------------
0: Paris is the capital city of France. It is the most populous city in France, with an estimated population of 6,848,000 in 2016. It is the second most populous city
----------------------------------------------------------------------
3: Paris is the capital city of France. It is the most populous city in France, with an estimated population of 6,848,000 in 2016. It is the second most populous city
----------------------------------------------------------------------
TorchDevice: cuda:0
  cur_mem: 2.6505 GB, peak_mem: 3.2478 GB
TorchDevice: cpu
  cur_mem: 0.0000 GB, peak_mem: 0.0000 GB
model size: 2.443 GB    cache size: 0.398 GB    hidden size (p): 0.008 GB
peak gpu mem: 3.248 GB  projected: False
prefill latency: 0.292 s    prefill throughput: 7018.240 token/s
decode latency: 0.501 s     decode throughput: 247.357 token/s
total latency: 0.793 s      total throughput: 161.390 token/s
```
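As a sanity check, the throughput figures in the log line up with FlexGen's default benchmark shape, which I assume here to be 4 prompts of 512 input tokens each, generating 32 tokens per prompt (the shape is not printed in the log, so the numbers below are an assumption, not something the tool reports):

```python
# Rough sanity check of the reported throughput figures, assuming
# FlexGen's default benchmark shape: a batch of 4 prompts, 512 input
# tokens each, 32 generated tokens each (assumed, not shown in the log).
batch, prompt_len, gen_len = 4, 512, 32

prefill_latency = 0.292  # s, from the log above
decode_latency = 0.501   # s
total_latency = 0.793    # s

# Prefill processes all prompt tokens in one pass.
prefill_tps = batch * prompt_len / prefill_latency
# Decode produces gen_len - 1 tokens per prompt (the first generated
# token comes out of the prefill step).
decode_tps = batch * (gen_len - 1) / decode_latency
# Total throughput counts only generated tokens.
total_tps = batch * gen_len / total_latency

print(f"prefill ~{prefill_tps:.0f} token/s")  # log reports 7018.240
print(f"decode  ~{decode_tps:.0f} token/s")   # log reports 247.357
print(f"total   ~{total_tps:.0f} token/s")    # log reports 161.390
```

All three computed values land within a fraction of a percent of the logged ones, which suggests the assumed shape is right.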
- Change the model to facebook/opt-6.7b and run again.

  ```shell
  # python3 -m flexgen.flex_opt --model facebook/opt-6.7b
  ```
Output:
```
<run_flexgen>: args.model: facebook/opt-6.7b
model size: 12.386 GB, cache size: 1.062 GB, hidden size (prefill): 0.017 GB
init weight...
Load the pre-trained pytorch weights of opt-6.7b from huggingface. The downloading and cpu loading can take dozens of minutes. If it seems to get stuck, you can monitor the progress by checking the memory usage of this process.
Downloading (…)l-00002-of-00002.bin: 100%|██████████| 3.36G/3.36G [01:36<00:00, 34.7MB/s]
Downloading (…)l-00001-of-00002.bin: 100%|██████████| 9.96G/9.96G [03:24<00:00, 48.7MB/s]
Fetching 2 files: 100%|██████████| 2/2 [03:25<00:00, 102.84s/it]
Convert format: 100%|██████████| 2/2 [00:38<00:00, 19.22s/it]
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/workspace/FlexGen/flexgen/flex_opt.py", line 1327, in <module>
    run_flexgen(args)
  File "/workspace/FlexGen/flexgen/flex_opt.py", line 1219, in run_flexgen
    model = OptLM(opt_config, env, args.path, policy)
  File "/workspace/FlexGen/flexgen/flex_opt.py", line 637, in __init__
    self.init_all_weights()
  File "/workspace/FlexGen/flexgen/flex_opt.py", line 800, in init_all_weights
    self.init_weight(j)
  File "/workspace/FlexGen/flexgen/flex_opt.py", line 651, in init_weight
    self.layers[j].init_weight(self.weight_home[j], expanded_path)
  File "/workspace/FlexGen/flexgen/flex_opt.py", line 494, in init_weight
    weights = init_weight_list(weight_specs, self.policy, self.env)
  File "/workspace/FlexGen/flexgen/flex_opt.py", line 112, in init_weight_list
    weight = home.allocate(shape, dtype, pin_memory=pin_memory)
  File "/workspace/FlexGen/flexgen/pytorch_backend.py", line 190, in allocate
    data = torch.empty(shape, dtype=dtype, pin_memory=pin_memory, device=self.dev)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 11.76 GiB total capacity; 11.53 GiB already allocated; 29.69 MiB free; 11.53 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
It crashed with a CUDA out-of-memory error.
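The failure is predictable from FlexGen's own log: the reported model size (12.386 GB) plus KV cache (1.062 GB) already exceeds the 11.76 GiB capacity PyTorch reports for this GPU. A back-of-envelope check, using only numbers from the output above:

```python
# Back-of-envelope check: can opt-6.7b fit entirely on this GPU?
# All sizes are taken from the FlexGen / PyTorch output above.
model_size_gb = 12.386    # opt-6.7b weights, as reported by FlexGen
cache_size_gb = 1.062     # KV cache for the benchmark workload
gpu_capacity_gib = 11.76  # RTX 3060 capacity, as reported by PyTorch (GiB)

gpu_capacity_gb = gpu_capacity_gib * 1.024**3  # GiB -> GB

needed_gb = model_size_gb + cache_size_gb
print(f"needed: ~{needed_gb:.1f} GB, available: ~{gpu_capacity_gb:.1f} GB")
print("fits entirely on GPU:", needed_gb <= gpu_capacity_gb)
```

Roughly 13.4 GB is needed against about 12.6 GB of VRAM, so some offloading is unavoidable on this card.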
- Retry with the `--percent` option to offload part of the model to the CPU.

  ```shell
  # python3 -m flexgen.flex_opt --model facebook/opt-6.7b --percent 75 25 100 0 100 0
  ```
Output:
```
<run_flexgen>: args.model: facebook/opt-6.7b
model size: 12.386 GB, cache size: 1.062 GB, hidden size (prefill): 0.017 GB
init weight...
warmup - generate
benchmark - generate
/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:293: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
  warnings.warn(
/workspace/FlexGen/flexgen/utils.py:132: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  data_ptr = tensor.storage().data_ptr()
Outputs:
----------------------------------------------------------------------
0: Paris is the capital city of France and the most visited city in the world. It is the most visited city in the world, with more than 30 million visitors each year. Paris is the
----------------------------------------------------------------------
3: Paris is the capital city of France and the most visited city in the world. It is the most visited city in the world, with more than 30 million visitors each year. Paris is the
----------------------------------------------------------------------
TorchDevice: cuda:0
  cur_mem: 7.7936 GB, peak_mem: 9.2148 GB
TorchDevice: cpu
  cur_mem: 5.0003 GB, peak_mem: 0.0000 GB
model size: 12.386 GB   cache size: 1.062 GB    hidden size (p): 0.017 GB
peak gpu mem: 9.215 GB  projected: False
prefill latency: 1.195 s    prefill throughput: 1714.413 token/s
decode latency: 10.083 s    decode throughput: 12.297 token/s
total latency: 11.278 s     total throughput: 11.349 token/s
```
It worked!
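Per the FlexGen README, `--percent` takes six numbers: the percentage of weights on GPU and CPU, of KV cache on GPU and CPU, and of activations on GPU and CPU (anything left over goes to disk). A rough sketch of the placement that `75 25 100 0 100 0` implies for opt-6.7b, using the sizes from the log:

```python
# Rough placement implied by --percent 75 25 100 0 100 0 for opt-6.7b.
# The six numbers are weights GPU%/CPU%, KV cache GPU%/CPU%,
# activations GPU%/CPU% (per the FlexGen README); sizes from the log.
model_size_gb = 12.386
cache_size_gb = 1.062

w_gpu_pct, w_cpu_pct = 75, 25
cache_gpu_pct = 100

weights_gpu_gb = model_size_gb * w_gpu_pct / 100
weights_cpu_gb = model_size_gb * w_cpu_pct / 100
cache_gpu_gb = cache_size_gb * cache_gpu_pct / 100

print(f"weights on GPU: ~{weights_gpu_gb:.2f} GB")
print(f"weights on CPU: ~{weights_cpu_gb:.2f} GB")
print(f"KV cache on GPU: ~{cache_gpu_gb:.2f} GB")
```

About 9.29 GB of weights on the GPU is roughly consistent with the observed peak GPU memory of 9.215 GB, and the CPU share accounts for most of the 5.0 GB of host memory in the log.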
## FlexGen API sample

While I was at it, I also tried the API sample.
- Using a smaller model here:

  ```shell
  # python3 -m flexgen.apps.completion --model facebook/opt-1.3b
  ```
Output:
```
Initialize...
Generate...
Outputs:
----------------------------------------------------------------------
0: Question: Where were the 2004 Olympics held? Answer: Athens, Greece Question: What is the longest river on the earth? Answer: The Nile Question: What is the name of the tallest mountain? Answer: Mount Kilimanjaro Question: Are there no higher mountains than Everest
----------------------------------------------------------------------
1: Extract the airport codes from this text. Text: "I want a flight from New York to San Francisco." Airport codes: JFK, SFO. Text: "I want you to book a flight from Phoenix to Las Vegas." Airport codes: PHX, LVG. Text: "I want to book a flight from New York to San Francisco." Airport codes: JFK, SFO
----------------------------------------------------------------------
Shutdown...
```
It answers properly.
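Both outputs are classic few-shot prompting: the prompt already contains worked examples and the model simply continues the pattern. A minimal sketch of how such a prompt can be assembled (the helper below is my own illustration, not part of FlexGen):

```python
# Hypothetical helper illustrating the few-shot Q&A prompt pattern
# visible in the completion app's output; this is not FlexGen code.
def build_few_shot_prompt(examples, query):
    """Join (question, answer) pairs and append the new question."""
    lines = []
    for question, answer in examples:
        lines.append(f"Question: {question}")
        lines.append(f"Answer: {answer}")
    lines.append(f"Question: {query}")
    lines.append("Answer:")  # the model completes from here
    return "\n".join(lines)

examples = [
    ("Where were the 2004 Olympics held?", "Athens, Greece"),
    ("What is the longest river on the earth?", "The Nile"),
]
prompt = build_few_shot_prompt(
    examples, "What is the name of the tallest mountain?")
print(prompt)
```

The model's continuation after the trailing "Answer:" is what the app prints as output.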
## Summary

The GPU I used is not particularly high-spec, so trying a larger model ran out of memory. The `--percent` option got it running by offloading part of the work to the CPU, at the cost of much slower generation. I had braced myself for this to be difficult, but it turned out to be surprisingly easy to get working. If you have a machine with a GPU, give it a casual try.