
Trying GPT-NeoX 4-bit quantization on Google Colab

Last updated at 2023-07-02. Posted at 2023-07-02.

I tried 4-bit quantization of GPT-NeoX on Google Colab, so here is a summary.

GPT-NeoX

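GPT-NeoX-20B is an open-source, 20-billion-parameter autoregressive language model developed by EleutherAI and trained on the Pile dataset. Its architecture resembles GPT-3 and is almost identical to that of GPT-J-6B. The training dataset contains a wide variety of English text, which reflects the general-purpose nature of the model. At the time of its release, it was claimed to be the largest publicly available pretrained general-purpose autoregressive language model.

Some of the features and advantages of GPT-NeoX-20B are as follows:

It uses rotary position embeddings instead of learned embeddings for token position encoding.
It computes the attention and feed-forward layers in parallel rather than in series, which improves throughput by about 15%.
It is a particularly strong few-shot reasoner: when evaluated five-shot, its performance improves substantially more than that of similarly sized GPT-3 and FairSeq models.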
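
To make the first point more concrete, here is a minimal pure-Python sketch of the idea behind rotary position embeddings: each pair of dimensions is rotated by an angle proportional to the token position, so the dot product of a rotated query and key depends only on their relative distance. The function names `rotary` and `dot` and the `base` value are illustrative assumptions. The actual GPT-NeoX implementation differs in details, such as how dimensions are paired.

```python
import math

def rotary(vec, position, base=10000.0):
    """Rotate consecutive pairs of dimensions by position-dependent angles."""
    dim = len(vec)
    assert dim % 2 == 0, "the vector length must be even"
    out = []
    for i in range(0, dim, 2):
        # Each pair gets its own frequency; later pairs rotate more slowly.
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

# Same relative offset (2 positions), different absolute positions.
score_near = dot(rotary(q, 3), rotary(k, 1))
score_far = dot(rotary(q, 10), rotary(k, 8))
print(score_near, score_far)
```

The two printed scores should agree up to floating-point error, which is the relative-position property that lets the model do without a learned position table.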

Model list

The following models related to GPT-NeoX are available; the list below has 33 entries.
(as of July 2, 2023)

Model Name	Task	Last Updated	Stars	Forks
EleutherAI/gpt-neox-20b	Text Generation	Updated 11 days ago	48.3k	398
EleutherAI/neox-ckpt-pythia-1.4b		Updated Mar 14
EleutherAI/neox-ckpt-pythia-1.4b-deduped		Updated Feb 3
EleutherAI/neox-ckpt-pythia-1.4b-deduped-v1		Updated Apr 21
EleutherAI/neox-ckpt-pythia-1.4b-v1		Updated Apr 21	1
EleutherAI/neox-ckpt-pythia-12b		Updated Mar 14		3
EleutherAI/neox-ckpt-pythia-12b-deduped		Updated Jan 17		3
EleutherAI/neox-ckpt-pythia-12b-deduped-v1		Updated Apr 21
EleutherAI/neox-ckpt-pythia-12b-v1		Updated Apr 21
EleutherAI/neox-ckpt-pythia-160m		Updated Mar 14
EleutherAI/neox-ckpt-pythia-160m-deduped-v0		Updated Apr 6
EleutherAI/neox-ckpt-pythia-160m-deduped-v1		Updated Apr 21	1
EleutherAI/neox-ckpt-pythia-160m-v1		Updated Apr 21
EleutherAI/neox-ckpt-pythia-1b		Updated Mar 14
EleutherAI/neox-ckpt-pythia-1b-deduped-v0		Updated Apr 6
EleutherAI/neox-ckpt-pythia-1b-deduped-v1		Updated about 1 month ago
EleutherAI/neox-ckpt-pythia-1b-v1		Updated Apr 21
EleutherAI/neox-ckpt-pythia-2.8b-deduped		Updated Mar 15
EleutherAI/neox-ckpt-pythia-2.8b-deduped-v1		Updated Apr 21
EleutherAI/neox-ckpt-pythia-2.8b-v1		Updated Apr 21
EleutherAI/neox-ckpt-pythia-410m		Updated Mar 14
EleutherAI/neox-ckpt-pythia-410m-deduped		Updated Feb 3
EleutherAI/neox-ckpt-pythia-410m-deduped-v1		Updated Apr 21	1
EleutherAI/neox-ckpt-pythia-410m-v1		Updated Apr 21
EleutherAI/neox-ckpt-pythia-6.9b		Updated Mar 14		2
EleutherAI/neox-ckpt-pythia-6.9b-deduped		Updated Mar 2		1
EleutherAI/neox-ckpt-pythia-6.9b-deduped-v1		Updated Apr 21
EleutherAI/neox-ckpt-pythia-6.9b-v1		Updated Apr 21
EleutherAI/neox-ckpt-pythia-70m		Updated Mar 14
EleutherAI/neox-ckpt-pythia-70m-deduped-v0		Updated Apr 6
EleutherAI/neox-ckpt-pythia-70m-deduped-v1		Updated about 1 month ago		1
EleutherAI/neox-ckpt-pythia-70m-v1		Updated Apr 21
EleutherAI/neox-ckpts-pythia-2.8b		Updated Mar 14

References

Hugging Face model: https://huggingface.co/EleutherAI/gpt-neox-20b
GitHub: https://github.com/EleutherAI/gpt-neox
Paper: https://arxiv.org/abs/2204.06745

Running on Colab

I ran this on Google Colab with a T4 GPU and standard memory; about 13 of the 15 GB of VRAM were used. The steps are as follows.

Install the packages.

!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git

Prepare the tokenizer and the model.

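import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "EleutherAI/gpt-neox-20b"
pretrain_cache_dir = "/content/model/v00"  # directory for downloaded model files

# 4-bit NF4 quantization with double quantization enabled.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id, cache_dir=pretrain_cache_dir)

# cache_dir is an argument of from_pretrained, not of BitsAndBytesConfig.
# device_map={"": 0} places the whole model on GPU 0.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map={"": 0},
    cache_dir=pretrain_cache_dir,
)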
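
As a rough illustration of what 4-bit weight quantization means, the sketch below quantizes one block of weights to 16 or fewer integer levels plus a single scale (the block's absolute maximum), then restores approximate values. This is a simplified stand-in and not the NF4 scheme requested in the config: NF4 uses a non-uniform set of levels, whereas this toy uses a uniform grid, and the real work happens inside bitsandbytes on the GPU. The helper names `quantize_block` and `dequantize_block` are hypothetical.

```python
def quantize_block(values):
    """Quantize one block of floats to integer codes in [-7, 7] plus one scale."""
    absmax = max(abs(v) for v in values) or 1.0  # avoid dividing by zero
    codes = [round(v / absmax * 7) for v in values]
    return codes, absmax

def dequantize_block(codes, absmax):
    """Map integer codes back to approximate float values."""
    return [c / 7 * absmax for c in codes]

# Toy weights, made up for illustration.
weights = [0.12, -0.50, 0.03, 0.44, -0.27, 0.00, 0.31, -0.08]

codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(codes)             # small integers that fit in 4 bits
print(scale, max_error)  # one float per block, and the worst rounding error
```

Each block stores one extra float for its scale. Double quantization, as enabled by `bnb_4bit_use_double_quant=True`, reduces that overhead by quantizing the scales themselves.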

Check the model

print(model)

Result

GPTNeoXForCausalLM(
  (gpt_neox): GPTNeoXModel(
    (embed_in): Embedding(50432, 6144)
    (layers): ModuleList(
      (0-43): 44 x GPTNeoXLayer(
        (input_layernorm): LayerNorm((6144,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm((6144,), eps=1e-05, elementwise_affine=True)
        (attention): GPTNeoXAttention(
          (rotary_emb): RotaryEmbedding()
          (query_key_value): Linear4bit(in_features=6144, out_features=18432, bias=True)
          (dense): Linear4bit(in_features=6144, out_features=6144, bias=True)
        )
        (mlp): GPTNeoXMLP(
          (dense_h_to_4h): Linear4bit(in_features=6144, out_features=24576, bias=True)
          (dense_4h_to_h): Linear4bit(in_features=24576, out_features=6144, bias=True)
          (act): FastGELUActivation()
        )
      )
    )
    (final_layer_norm): LayerNorm((6144,), eps=1e-05, elementwise_affine=True)
  )
  (embed_out): Linear(in_features=6144, out_features=50432, bias=False)
)
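
The layer shapes printed above allow a back-of-the-envelope estimate of how much memory the weights need. The sketch below counts the `Linear4bit` weights at 4 bits each and assumes the two vocabulary matrices (`embed_in`, `embed_out`) are kept in 16-bit precision. It ignores biases, layer norms, quantization constants, activations, and the CUDA context, so the real footprint will be higher than this figure.

```python
# Shapes copied from the printed model.
NUM_LAYERS = 44
HIDDEN = 6144
VOCAB = 50432

# (in_features, out_features) of the four Linear4bit modules in each layer.
linear4bit_shapes = [
    (HIDDEN, 3 * HIDDEN),  # attention.query_key_value
    (HIDDEN, HIDDEN),      # attention.dense
    (HIDDEN, 4 * HIDDEN),  # mlp.dense_h_to_4h
    (4 * HIDDEN, HIDDEN),  # mlp.dense_4h_to_h
]

params_4bit = NUM_LAYERS * sum(i * o for i, o in linear4bit_shapes)
params_16bit = 2 * VOCAB * HIDDEN  # embed_in and embed_out (assumed 16-bit)

bytes_4bit = params_4bit // 2   # two 4-bit weights per byte
bytes_16bit = params_16bit * 2  # two bytes per 16-bit weight
total_gib = (bytes_4bit + bytes_16bit) / 1024**3

print(f"4-bit parameters : {params_4bit:,}")
print(f"16-bit parameters: {params_16bit:,}")
print(f"weights only     : {total_gib:.2f} GiB")
```

Under these assumptions the weights alone come to a little over 10 GiB, which is in the same range as the VRAM usage reported earlier once runtime overhead is added.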

Check the tokenizer

# English

text = "I am a programmer."

tokens = tokenizer.tokenize(text)
ids = tokenizer.encode(text)

# Print the tokens
print(len(tokens))
print(tokens)
print(ids)

Result

5
['I', 'Ġam', 'Ġa', 'Ġprogrammer', '.']
[42, 717, 247, 34513, 15]

# Japanese

text = "私はプログラマーです。"

tokens = tokenizer.tokenize(text)
ids = tokenizer.encode(text)

# Print the tokens
print(len(tokens))
print(tokens)
print(ids)

Result

10
['ç§ģ', 'ãģ¯', 'ãĥĹ', 'ãĥŃ', 'ãĤ°', 'ãĥ©', 'ãĥŀ', 'ãĥ¼', 'ãģ§ãģĻ', 'ãĢĤ']
[45804, 6418, 22655, 24404, 29287, 17694, 31229, 6996, 20776, 4340]
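
The Japanese tokens look garbled because the tokenizer works on UTF-8 bytes and displays each byte as a printable stand-in character. The same mechanism turns a leading space into `Ġ` in the English tokens. The sketch below rebuilds the byte-to-character table known from GPT-2 style byte-level BPE and uses it to convert text in both directions. That this tokenizer uses exactly the same table is an assumption, although the tokens shown above are consistent with it. The helper names `to_byte_level` and `from_byte_level` are hypothetical.

```python
def bytes_to_unicode():
    """Byte -> printable character table used by GPT-2 style byte-level BPE."""
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    table = {b: chr(b) for b in printable}
    shift = 0
    for b in range(256):
        if b not in table:
            # Non-printable bytes are mapped to code points from U+0100 upward.
            table[b] = chr(256 + shift)
            shift += 1
    return table

BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}

def to_byte_level(text):
    """Show text the way a byte-level tokenizer displays it."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))

def from_byte_level(token_text):
    """Recover the original text from its byte-level display form."""
    return bytes(CHAR_TO_BYTE[c] for c in token_text).decode("utf-8")

print(to_byte_level("私"))    # three stand-in characters, one per UTF-8 byte
print(to_byte_level(" am"))   # the leading space becomes a visible character
print(from_byte_level(to_byte_level("私はプログラマーです。")))
```

A single token can end in the middle of a multi-byte character, so `from_byte_level` is only safe on the concatenation of all tokens, not necessarily on each token separately.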

Check inference

# English

text = "I am"
device = "cuda:0"

inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Result

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
I am not sure if this is the right place to ask this question, but I am not sure where else

# Japanese

text = "私は"
device = "cuda:0"

inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Result

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
私は、
私は、私は、私は、私は、私は、私は、
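
The repeated 私は、 in the Japanese output is typical of greedy decoding, which always picks the highest-scoring next token. The toy sketch below is not the real model. It uses a made-up scoring function, `toy_next_logits`, that prefers repeating the last token, and shows how a repetition penalty (positive scores of already-seen tokens are divided by the penalty, negative ones multiplied) can break such a loop. All names and numbers here are illustrative assumptions.

```python
def apply_repetition_penalty(logits, seen_ids, penalty):
    """Lower the scores of tokens that already appeared in the sequence."""
    out = list(logits)
    for tid in set(seen_ids):
        out[tid] = out[tid] * penalty if out[tid] < 0 else out[tid] / penalty
    return out

def toy_next_logits(ids, vocab_size=3):
    """Made-up model: strongly prefers repeating the last token."""
    last = ids[-1]
    logits = [0.0] * vocab_size
    logits[last] = 3.0
    logits[(last + 1) % vocab_size] = 2.5
    return logits

def greedy_decode(next_logits, prompt_ids, steps, penalty=1.0):
    """Greedy decoding: append the arg-max token at every step."""
    ids = list(prompt_ids)
    for _ in range(steps):
        logits = apply_repetition_penalty(next_logits(ids), ids, penalty)
        ids.append(max(range(len(logits)), key=logits.__getitem__))
    return ids

print(greedy_decode(toy_next_logits, [0], steps=4))               # stuck on one token
print(greedy_decode(toy_next_logits, [0], steps=4, penalty=1.5))  # moves on to other tokens
```

With the real model, the corresponding knobs would be `generate` arguments such as `repetition_penalty` or `do_sample=True`. Whether they improve the Japanese output of this mostly English model was not tested here.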
