Fine-tuning Llama 3.1 70B on an AWS P4d instance

Last updated at 2024-08-09 / Posted at 2024-08-09

I ran supervised fine-tuning of Meta-Llama-3.1-70B-Instruct on an Amazon EC2 P4d instance without LoRA or quantization. This article walks through the steps.

The p4d.24xlarge size of the P4d instance family has eight NVIDIA A100 Tensor Core GPUs with 40 GB of memory each, for 320 GB of GPU memory in total. It also provides 1152 GiB of instance memory and 8 TB of NVMe-based SSD instance storage.

Fine-tuning needs an estimated 500 GB or more of GPU memory, so the DeepSpeed integration in Transformers is used to make up for the GPU memory shortfall.
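As a rough cross-check of these numbers, the following back-of-the-envelope sketch compares the total GPU memory of a p4d.24xlarge with the model states of full fine-tuning. The nominal 70e9 parameter count, the 2-byte bf16 weights and gradients, and the two fp32 Adam moment tensors per parameter are assumptions made for illustration. The article does not say how its 500 GB estimate was derived, and activations, buffers, and any fp32 master copy of the weights are ignored here.

```python
# Back-of-the-envelope memory arithmetic (assumed byte counts, not measurements).
GB = 10**9

num_gpus = 8                  # GPUs in a p4d.24xlarge
gpu_mem_gb = 40               # memory per A100 in this instance type
total_gpu_gb = num_gpus * gpu_mem_gb

n_params = 70 * 10**9         # assumption: nominal "70B" parameter count
weights_gb = n_params * 2 / GB       # bf16 weights, 2 bytes per parameter
grads_gb = n_params * 2 / GB         # bf16 gradients, 2 bytes per parameter
adam_gb = n_params * 2 * 4 / GB      # two fp32 Adam moments, 4 bytes each

model_states_gb = weights_gb + grads_gb + adam_gb

print(f"total GPU memory      : {total_gpu_gb} GB")
print(f"weights + gradients   : {weights_gb + grads_gb:.0f} GB")
print(f"Adam moment tensors   : {adam_gb:.0f} GB")
print(f"model states in total : {model_states_gb:.0f} GB")
```

Under these assumptions the weights and gradients alone come close to the 320 GB available, while the optimizer states do not fit at all, which is in line with the estimate exceeding 500 GB.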

Environment

Amazon EC2 instance
Instance type: p4d.24xlarge
Region: ap-northeast-1
AMI: Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.3.0 (Ubuntu 20.04) 20240611

$ python --version
Python 3.11.9
$ python
>>> import transformers, deepspeed
>>> transformers.__version__
'4.43.3'
>>> deepspeed.__version__
'0.14.4'
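To compare another machine against the versions listed above, a small standard-library sketch like the following may help. The helper names `installed_version` and `compare_versions` are hypothetical and exist only for this illustration.

```python
from importlib import metadata
from typing import Dict, Optional, Tuple

# Versions reported in the article's environment.
EXPECTED = {"transformers": "4.43.3", "deepspeed": "0.14.4"}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def compare_versions(expected: Dict[str, str]) -> Dict[str, Tuple[str, Optional[str]]]:
    """Map each package name to (expected version, installed version)."""
    return {name: (want, installed_version(name)) for name, want in expected.items()}


for name, (want, have) in compare_versions(EXPECTED).items():
    status = "ok" if want == have else "differs"
    print(f"{name}: expected {want}, installed {have} ({status})")
```

A differing version is not necessarily a problem; the check only makes it visible when results deviate from the article.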

Dataset

I used a gozaru dataset (bbz662bbz/databricks-dolly-15k-ja-gozarinnemon), with thanks to its author.

Preparation

Llama 3.1 requires an access request. Request access from the Llama-3.1 model card on Hugging Face Hub, then issue a user access token.

Start the EC2 instance and log in as the user ubuntu.

Install libaio-dev, which DeepSpeed requires

sudo apt-get update
sudo apt-get install libaio-dev

Activate the Python virtual environment

source activate pytorch

Install the Python packages

pip install huggingface_hub[cli]
pip install transformers[deepspeed] trl accelerate datasets

Log in with the Hugging Face CLI

HF_TOKEN="hf_XXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
huggingface-cli login --token $HF_TOKEN

Download the model from Hugging Face Hub

MODEL_ID=meta-llama/Meta-Llama-3.1-70B-Instruct
huggingface-cli download $MODEL_ID --local-dir /opt/dlami/nvme/models/$MODEL_ID
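Because the download, the NVMe offload files, and the saved model all land on the instance store, it can be worth checking free space before starting. The sketch below uses only the standard library. The helper `has_free_space` is hypothetical, and the 140 GB figure assumes roughly 70e9 parameters stored as 2-byte bf16 values; the actual repository may be larger than that.

```python
import os
import shutil

GB = 10**9


def has_free_space(path: str, required_gb: float) -> bool:
    """Return True if the filesystem holding `path` has at least `required_gb` free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * GB


# Assumption: ~70e9 parameters at 2 bytes each (bf16).
approx_weights_gb = 70 * 10**9 * 2 / GB

# /opt/dlami/nvme is the mount point used in this article; fall back to the
# current directory so the sketch also runs on other machines.
nvme_mount = "/opt/dlami/nvme"
target = nvme_mount if os.path.isdir(nvme_mount) else "."

enough = has_free_space(target, approx_weights_gb)
print(f"{target}: room for about {approx_weights_gb:.0f} GB of weights? {enough}")
```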

DeepSpeed configuration

DeepSpeed's ZeRO can make up for insufficient GPU memory by offloading to CPU memory or NVMe.

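Estimating the memory requirements shows that CPU offload would need roughly 1600 GB of instance memory. The 1152 GiB of instance memory on p4d.24xlarge does not meet this requirement, so this time ZeRO-Infinity is used to offload to NVMe. Only the optimizer is offloaded; the parameters are not.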
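The comparison mixes GiB and GB, so the following sketch converts the instance memory to GB before checking it against the quoted estimate. It only restates the two figures from the text; the 1600 GB value is the article's estimate, not something recomputed here.

```python
GIB = 2**30   # bytes in one GiB
GB = 10**9    # bytes in one GB

instance_mem_gib = 1152           # p4d.24xlarge instance memory
required_cpu_offload_gb = 1600    # estimate quoted in the article

# Express the instance memory in GB so both sides use the same unit.
instance_mem_gb = instance_mem_gib * GIB / GB
cpu_offload_fits = instance_mem_gb >= required_cpu_offload_gb

print(f"instance memory : {instance_mem_gb:.2f} GB")
print(f"required        : {required_cpu_offload_gb} GB")
print(f"CPU offload fits: {cpu_offload_fits}")
```

The conclusion does not depend on the unit: 1152 is below 1600 whether both values are read as GB or as GiB.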

Create the DeepSpeed configuration file

zero3_train.json

{
    "bf16": {
        "enabled": "auto"
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "nvme",
            "nvme_path": "/opt/dlami/nvme/zero_param",
            "pin_memory": true,
            "buffer_count": 4,
            "fast_init": false
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
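To confirm what a configuration like this offloads, a small sketch can read the JSON and report the ZeRO stage and offload targets. The helper `summarize_offload` is hypothetical, and the embedded JSON is a trimmed copy of the relevant keys from zero3_train.json. A missing offload section is reported as "none", meaning those tensors are not offloaded.

```python
import json


def summarize_offload(config: dict) -> dict:
    """Report the ZeRO stage and where optimizer states and parameters are offloaded."""
    zero = config.get("zero_optimization", {})
    return {
        "stage": zero.get("stage"),
        # No offload section for a tensor group means it is not offloaded.
        "optimizer": zero.get("offload_optimizer", {}).get("device", "none"),
        "param": zero.get("offload_param", {}).get("device", "none"),
    }


# Trimmed copy of the relevant part of zero3_train.json.
config_text = """
{
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "nvme",
            "nvme_path": "/opt/dlami/nvme/zero_param"
        }
    }
}
"""

summary = summarize_offload(json.loads(config_text))
print(summary)
```

To inspect the real file instead, the dictionary could be loaded with `json.load` from zero3_train.json and passed to the same helper.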

Fine-tuning

Create the Python script file

train.py

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer, DataCollatorForCompletionOnlyLM

args = SFTConfig(
    output_dir="/opt/dlami/nvme/output",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={'use_reentrant':False},
    learning_rate=8e-6,
    weight_decay=0.0,
    bf16=True,
    fp16=False,
    deepspeed='./zero3_train.json',  # DeepSpeed configuration file
    max_seq_length=8192,
    packing=False,
    save_strategy='no',
)


dataset = load_dataset("bbz662bbz/databricks-dolly-15k-ja-gozarinnemon")

model_id = "/opt/dlami/nvme/models/meta-llama/Meta-Llama-3.1-70B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)

chat_template = """<|start_header_id|>user<|end_header_id|>

{0}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{1}<|eot_id|>"""

response_template = """<|start_header_id|>assistant<|end_header_id|>

"""

# Apply the prompt template
def formatting_func(rows):
    output_texts = []
    for i, instruction in enumerate(rows['instruction']):
        user_content = rows['input'][i] + instruction
        text = chat_template.format(user_content, rows['output'][i])
        output_texts.append(text)
    return output_texts

tokenizer.pad_token = "<|finetune_right_pad_id|>"

collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
)
model.config.use_cache = False

trainer = SFTTrainer(
    model,
    tokenizer=tokenizer,
    train_dataset=dataset["train"].select(range(64)),
    formatting_func=formatting_func,
    data_collator=collator,
    args=args,
)

trainer.train()

save_dir = "/opt/dlami/nvme/models/llama-3-gozaru"
trainer.save_model(save_dir)
tokenizer.save_pretrained(save_dir)

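When using a pretrained model, the SFTConfig object must be created before AutoModelForCausalLM.from_pretrained() is called (see Constructing Massive Models). With that order, Transformers already knows that ZeRO-3 is enabled when it loads the weights and can partition them across the GPUs instead of building the full model in every process.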
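To see what the prompt template in train.py produces, the sketch below runs the same formatting logic on a hypothetical one-row batch and splits the result at the response template. DataCollatorForCompletionOnlyLM works on token IDs rather than strings, so the string split is only an illustration of where the boundary between the ignored prompt and the trained completion falls.

```python
# Templates copied from train.py.
chat_template = """<|start_header_id|>user<|end_header_id|>

{0}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{1}<|eot_id|>"""

response_template = """<|start_header_id|>assistant<|end_header_id|>

"""


def formatting_func(rows):
    """Same logic as train.py: one formatted string per row of the batch."""
    output_texts = []
    for i, instruction in enumerate(rows["instruction"]):
        user_content = rows["input"][i] + instruction
        output_texts.append(chat_template.format(user_content, rows["output"][i]))
    return output_texts


# Hypothetical batch with the three columns that train.py reads.
rows = {
    "instruction": ["What is the capital of Japan?"],
    "input": [""],
    "output": ["It is Tokyo."],
}

text = formatting_func(rows)[0]

# Everything up to and including the response template is prompt;
# the remainder is the completion that contributes to the loss.
prompt_part, completion_part = text.split(response_template)

print(repr(prompt_part))
print(repr(completion_part))
```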

Run the deepspeed command with train.py as its argument

LIBRARY_PATH=$LD_LIBRARY_PATH deepspeed train.py

LIBRARY_PATH is set here because the static linking step at DeepSpeed run time failed with the link error /usr/bin/ld: cannot find -lcurand.
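GCC consults LIBRARY_PATH when resolving `-l` options at link time, whereas LD_LIBRARY_PATH is used by the dynamic loader at run time, which is why copying one into the other can help. To check whether libcurand is actually present in the directories being passed along, a sketch such as the following could be used. The helper `find_library` is hypothetical.

```python
import glob
import os


def find_library(name: str, search_path: str) -> list:
    """List files matching lib<name>.so* in each directory of a path-separated string."""
    matches = []
    for directory in search_path.split(os.pathsep):
        if directory:  # skip empty entries such as a trailing separator
            pattern = os.path.join(directory, f"lib{name}.so*")
            matches.extend(sorted(glob.glob(pattern)))
    return matches


hits = find_library("curand", os.environ.get("LD_LIBRARY_PATH", ""))
print(hits if hits else "libcurand not found in LD_LIBRARY_PATH")
```

If the list is empty, setting LIBRARY_PATH from LD_LIBRARY_PATH would not be expected to resolve the error, and the CUDA library directory would need to be located first.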

Fine-tuning took about 35 minutes.

Measurement from the time command

real    35m6.142s
user    388m46.755s
sys     101m36.467s
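To put the wall-clock time in context, the following sketch counts the optimizer steps implied by the settings in train.py. It assumes that the deepspeed launcher starts one process per GPU on all eight GPUs, which the article does not state explicitly.

```python
import math

num_examples = 64                  # dataset["train"].select(range(64))
num_gpus = 8                       # assumption: one process per GPU
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
num_train_epochs = 3

# Examples consumed per optimizer step across all GPUs.
effective_batch_size = (
    per_device_train_batch_size * num_gpus * gradient_accumulation_steps
)

# Micro-batches each GPU processes per epoch, then optimizer steps per epoch.
micro_batches_per_gpu = math.ceil(
    num_examples / (per_device_train_batch_size * num_gpus)
)
steps_per_epoch = max(micro_batches_per_gpu // gradient_accumulation_steps, 1)
total_steps = steps_per_epoch * num_train_epochs

wall_seconds = 35 * 60 + 6.142     # "real 35m6.142s"

print(f"effective batch size: {effective_batch_size}")
print(f"optimizer steps     : {total_steps}")
print(f"wall-clock seconds  : {wall_seconds:.3f}")
```

The measured time presumably also covers startup, model loading, and saving, so dividing it by the step count would overstate the cost of a single optimizer step.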

Inference

Check the effect of the fine-tuning.

test.py

import transformers
import torch

pipeline = transformers.pipeline(
    "text-generation",
    model="/opt/dlami/nvme/models/llama-3-gozaru",
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

messages = [
    {"role": "user", "content": "日本の首都は？"},  # "What is the capital of Japan?"
]

outputs = pipeline(
    messages,
    max_new_tokens=256,
)

print(outputs[0]["generated_text"][-1])

Run

python test.py

Result

{'role': 'assistant', 'content': '我、りんえもんは思う。 日本の首都は東京でござる。知らんけど。'}

The response begins with "我、りんえもんは思う" ("I, Rinemon, think") and ends with "知らんけど" ("not that I know for sure"), as expected.
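The same check can be written down as a tiny sketch, which is handy when trying several prompts. The function `follows_gozaru_style` is hypothetical and tests only the opening and closing phrases mentioned above; the sample message is the output shown in this article.

```python
def follows_gozaru_style(text: str) -> bool:
    """Check for the opening and closing phrases noted in the article."""
    body = text.strip()
    return body.startswith("我、りんえもんは思う") and body.endswith("知らんけど。")


# Assistant message printed by test.py in the article.
message = {
    "role": "assistant",
    "content": "我、りんえもんは思う。 日本の首都は東京でござる。知らんけど。",
}

print(follows_gozaru_style(message["content"]))
```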

That's all. Thank you for reading to the end.
