Trying Out FlexGen

I tried running FlexGen on Docker.

Environment

  • OS: Ubuntu 22.04.2 LTS
  • GPU: NVIDIA GeForce RTX 3060
  • CUDA: 12.1
  • Docker: 24.0.2
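
As a quick sanity check, the driver and Docker versions can be confirmed up front:

    $ nvidia-smi       # prints the driver version and the CUDA version it supports
    $ docker --version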

Setting up the environment

  • Prepare a Dockerfile. (The base image ships CUDA 11.7; that is fine on this CUDA 12.1 host, since inside a container only the host driver needs to be new enough.)

    # CUDA-enabled PyTorch base image
    FROM pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel
    
    RUN apt update && apt upgrade -y
    RUN apt install -y git python3 python3-pip
    
    # Install FlexGen from source (clones into /workspace in this image)
    RUN git clone https://github.com/FMInference/FlexGen
    RUN cd FlexGen && pip install -e .
    
  • Build the Docker image.

    $ sudo docker build -t flexgen-test .
    
  • Start the container.

    $ sudo docker run -it --rm --gpus 0 --name flexgen flexgen-test /bin/bash
    
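It's worth confirming that the GPU is actually visible inside the container before going further (nvidia-smi is injected by the NVIDIA container runtime when --gpus is passed):

    # nvidia-smi

Also note that --rm discards everything downloaded inside the container on exit. Mounting the Hugging Face cache as a volume avoids re-downloading the models each time; a variant of the command above (assuming the container runs as root, so the cache sits at /root/.cache/huggingface; FlexGen's converted weights may be cached elsewhere):

    $ sudo docker run -it --rm --gpus 0 -v $HOME/.cache/huggingface:/root/.cache/huggingface --name flexgen flexgen-test /bin/bash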

Running FlexGen

Benchmark

  • Run it, specifying a model.

    The first run takes a while because the model has to be downloaded.

    # python3 -m flexgen.flex_opt --model facebook/opt-1.3b
    

    Result:

    <run_flexgen>: args.model: facebook/opt-1.3b
    Downloading (…)okenizer_config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████| 685/685 [00:00<00:00, 1.54MB/s]
    Downloading (…)lve/main/config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████| 651/651 [00:00<00:00, 1.60MB/s]
    Downloading (…)olve/main/vocab.json: 100%|█████████████████████████████████████████████████████████████████████████████████████| 899k/899k [00:00<00:00, 1.48MB/s]
    Downloading (…)olve/main/merges.txt: 100%|█████████████████████████████████████████████████████████████████████████████████████| 456k/456k [00:00<00:00, 28.7MB/s]
    Downloading (…)cial_tokens_map.json: 100%|████████████████████████████████████████████████████████████████████████████████████████| 221/221 [00:00<00:00, 470kB/s]
    model size: 2.443 GB, cache size: 0.398 GB, hidden size (prefill): 0.008 GB
    init weight...
    Load the pre-trained pytorch weights of opt-1.3b from huggingface. The downloading and cpu loading can take dozens of minutes. If it seems to get stuck, you can monitor the progress by checking the memory usage of this process.
    Downloading pytorch_model.bin: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 2.63G/2.63G [02:18<00:00, 19.0MB/s]
    Fetching 1 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [02:20<00:00, 140.88s/it]
    Convert format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00,  3.62s/it]
    warmup - generate                                                                                                                                                 
    benchmark - generate
    /opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:293: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
      warnings.warn(
    Outputs:
    ----------------------------------------------------------------------
    0: Paris is the capital city of France. It is the most populous city in France, with an estimated population of 6,848,000 in 2016. It is the second most populous city
    ----------------------------------------------------------------------
    3: Paris is the capital city of France. It is the most populous city in France, with an estimated population of 6,848,000 in 2016. It is the second most populous city
    ----------------------------------------------------------------------
    
    TorchDevice: cuda:0
      cur_mem: 2.6505 GB,  peak_mem: 3.2478 GB
    TorchDevice: cpu
      cur_mem: 0.0000 GB,  peak_mem: 0.0000 GB
    model size: 2.443 GB    cache size: 0.398 GB    hidden size (p): 0.008 GB
    peak gpu mem: 3.248 GB  projected: False
    prefill latency: 0.292 s        prefill throughput: 7018.240 token/s
    decode latency: 0.501 s decode throughput: 247.357 token/s
    total latency: 0.793 s  total throughput: 161.390 token/s
    
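    As a sanity check on those numbers: the benchmark appears to run 4 prompts of 512 tokens with 32 generated tokens each (outputs 0 and 3 above suggest a batch of 4), which matches the reported throughput:

    prefill: 4 × 512 tokens / 0.292 s ≈ 7013 token/s
    decode:  4 × 31 tokens  / 0.501 s ≈ 248 token/s
    total:   4 × 32 tokens  / 0.793 s ≈ 161 token/s
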
  • Change the model to facebook/opt-6.7b and run again.

    # python3 -m flexgen.flex_opt --model facebook/opt-6.7b
    

    Result:

    <run_flexgen>: args.model: facebook/opt-6.7b
    model size: 12.386 GB, cache size: 1.062 GB, hidden size (prefill): 0.017 GB
    init weight...
    Load the pre-trained pytorch weights of opt-6.7b from huggingface. The downloading and cpu loading can take dozens of minutes. If it seems to get stuck, you can monitor the progress by checking the memory usage of this process.
    Downloading (…)l-00002-of-00002.bin: 100%|██████████████████████████████████████████████████████████████████████| 3.36G/3.36G [01:36<00:00, 34.7MB/s]
    Downloading (…)l-00001-of-00002.bin: 100%|██████████████████████████████████████████████████████████████████████| 9.96G/9.96G [03:24<00:00, 48.7MB/s]
    Fetching 2 files: 100%|██████████████████████████████████████████████████████████████████████████████████| 2/2 [03:25<00:00, 102.84s/it]
    Convert format: 100%|██████████████████████████████████████████████████████████████████████████████████| 2/2 [00:38<00:00, 19.22s/it]
    Traceback (most recent call last):                                                                                                                                                                                                           
      File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
        exec(code, run_globals)
      File "/workspace/FlexGen/flexgen/flex_opt.py", line 1327, in <module>
        run_flexgen(args)
      File "/workspace/FlexGen/flexgen/flex_opt.py", line 1219, in run_flexgen
        model = OptLM(opt_config, env, args.path, policy)
      File "/workspace/FlexGen/flexgen/flex_opt.py", line 637, in __init__
        self.init_all_weights()
      File "/workspace/FlexGen/flexgen/flex_opt.py", line 800, in init_all_weights
        self.init_weight(j)
      File "/workspace/FlexGen/flexgen/flex_opt.py", line 651, in init_weight
        self.layers[j].init_weight(self.weight_home[j], expanded_path)
      File "/workspace/FlexGen/flexgen/flex_opt.py", line 494, in init_weight
        weights = init_weight_list(weight_specs, self.policy, self.env)
      File "/workspace/FlexGen/flexgen/flex_opt.py", line 112, in init_weight_list
        weight = home.allocate(shape, dtype, pin_memory=pin_memory)
      File "/workspace/FlexGen/flexgen/pytorch_backend.py", line 190, in allocate
        data = torch.empty(shape, dtype=dtype, pin_memory=pin_memory, device=self.dev)
    torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 11.76 GiB total capacity; 11.53 GiB already allocated; 29.69 MiB free; 11.53 GiB reserved in total by PyTorch) 
    If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
    

    It crashed with an out-of-memory error. No surprise there: the opt-6.7b weights alone are 12.386 GB in fp16, already more than the RTX 3060's 12 GB (11.76 GiB) of VRAM.

  • Let's adjust the placement with the --percent option and try again. Its six numbers are the percentages of weights on GPU/CPU, attention (KV) cache on GPU/CPU, and activations on GPU/CPU; here 75% of the weights stay on the GPU and 25% are offloaded to CPU RAM.

    # python3 -m flexgen.flex_opt --model facebook/opt-6.7b --percent 75 25 100 0 100 0
    

    Result:

    <run_flexgen>: args.model: facebook/opt-6.7b
    model size: 12.386 GB, cache size: 1.062 GB, hidden size (prefill): 0.017 GB
    init weight...
    warmup - generate
    benchmark - generate
    /opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:293: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
      warnings.warn(
    /workspace/FlexGen/flexgen/utils.py:132: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. 
    This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
      data_ptr = tensor.storage().data_ptr()
    Outputs:
    ----------------------------------------------------------------------
    0: Paris is the capital city of France and the most visited city in the world. It is the most visited city in the world, with more than 30 million visitors each year. Paris is the
    ----------------------------------------------------------------------
    3: Paris is the capital city of France and the most visited city in the world. It is the most visited city in the world, with more than 30 million visitors each year. Paris is the
    ----------------------------------------------------------------------
    
    TorchDevice: cuda:0
      cur_mem: 7.7936 GB,  peak_mem: 9.2148 GB
    TorchDevice: cpu
      cur_mem: 5.0003 GB,  peak_mem: 0.0000 GB
    model size: 12.386 GB   cache size: 1.062 GB    hidden size (p): 0.017 GB
    peak gpu mem: 9.215 GB  projected: False
    prefill latency: 1.195 s        prefill throughput: 1714.413 token/s
    decode latency: 10.083 s        decode throughput: 12.297 token/s
    total latency: 11.278 s total throughput: 11.349 token/s
    

    It worked! Decode throughput did fall from 247 token/s to 12.3 token/s, though: besides the larger model, the 25% of the weights held in CPU RAM has to be streamed to the GPU at every decoding step.
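
    Two follow-ups I didn't try, going by the options described in FlexGen's README: pushing even more of the weights to the CPU frees additional VRAM at a further speed cost, and --compress-weight stores weights with 4-bit group-wise quantization, which might let all of opt-6.7b fit on the GPU:

    # python3 -m flexgen.flex_opt --model facebook/opt-6.7b --percent 50 50 100 0 100 0
    # python3 -m flexgen.flex_opt --model facebook/opt-6.7b --percent 100 0 100 0 100 0 --compress-weight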

FlexGen API sample

While I was at it, I also tried the API sample.

  • Using one of the smaller models...

    # python3 -m flexgen.apps.completion --model facebook/opt-1.3b
    

    Result:

    Initialize...
    Generate...
    Outputs:
    ----------------------------------------------------------------------
    0: Question: Where were the 2004 Olympics held?
    Answer: Athens, Greece
    Question: What is the longest river on the earth?
    Answer: The Nile
    Question: What is the name of the tallest mountain?
    Answer: Mount Kilimanjaro
    Question: Are there no higher mountains than Everest
    ----------------------------------------------------------------------
    1: Extract the airport codes from this text.
    Text: "I want a flight from New York to San Francisco."
    Airport codes: JFK, SFO.
    Text: "I want you to book a flight from Phoenix to Las Vegas."
    Airport codes: PHX, LVG.
    Text: "I want to book a flight from New York to San Francisco."
    Airport codes: JFK, SFO
    ----------------------------------------------------------------------
    Shutdown...
    

    It answers the questions properly.
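
    The two prompts are the sample app's built-in examples; I didn't dig into its options, but it's an ordinary command-line app, so --help lists what can be changed:

    # python3 -m flexgen.apps.completion --help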

Summary

The GPU I used this time is not particularly high-spec, so trying a larger model crashed with an out-of-memory error.
The --percent option got it running after a fashion, but with the CPU shouldering part of the work, processing becomes much slower.

I had braced myself for this to be difficult, but it turned out to be surprisingly easy to get running.
If you have a machine with a GPU, give it a casual try.
