How to Run SFT Training with ms-swift Megatron Using Singularity

Posted at 2025-11-12, last updated at 2025-11-12

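1. Overview

I have been working on model development as a member of Team RAMEN in Phase 2 of the Matsuo Lab LLM Competition 2025.

In earlier articles I covered Singularity and converting checkpoints from Megatron format to Hugging Face format with ms-swift.

This article describes how to run SFT training with ms-swift on Megatron-format checkpoints.

For Singularity itself, see the following article.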

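2. What is SFT?

SFT stands for Supervised Fine-Tuning.

It is a supervised learning method applied to a pretrained base model, using labeled input-output examples.

For details, see the following reference.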

3. Training with ms-swift

3.1 Training on a single node

Create the Slurm job script that launches Singularity and name it sft_singlenode.sh.

#!/bin/bash
#SBATCH --job-name=sft_megatron_singlenode
#SBATCH -p P05
#SBATCH --nodelist=xxx-xxx58
#SBATCH --nodes=1
#SBATCH --gpus-per-node=8
#SBATCH --cpus-per-task=128
#SBATCH --time=40:00:00
#SBATCH --output=logs/%x-%j.out
#SBATCH --error=logs/%x-%j.err

echo "start job"
# For multi-node runs, increase NNODES (and update --nodelist and --nodes to match)
export NNODES=1
export NPROC_PER_NODE=8
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=29500

MODEL=${1}
echo "start srun"
srun --jobid $SLURM_JOBID --gpus-per-node=${NPROC_PER_NODE}  singularity run -w --nv -B /home {full path to the Singularity environment} \
bash sft_singlenode_exec.sh ${MODEL} ${MASTER_PORT}
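
One practical detail before submitting: Slurm does not create the directory named in --output and --error, so logs/ has to exist beforehand, and the job script expects the model path as its first argument. The following is a minimal sketch of a pre-submission check. The function name prepare_submit is hypothetical and is not part of ms-swift or Slurm.

```shell
#!/bin/bash
# Hypothetical helper (not part of ms-swift or Slurm): run it before sbatch.
# Usage: prepare_submit <model_path>
prepare_submit() {
    local model="${1:-}"
    if [ -z "${model}" ]; then
        echo "usage: prepare_submit <model_path>" >&2
        return 1
    fi
    # The job script writes to logs/%x-%j.out; create the directory up front.
    mkdir -p logs
    echo "ready to submit with model: ${model}"
}
```

Calling prepare_submit with the model path in the directory that holds sft_singlenode.sh, and only then running sbatch, avoids a job that fails before producing any log.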

Create the SFT script and name it sft_singlenode_exec.sh.

# For more information on multi-node training launch methods, refer to:
# https://github.com/modelscope/ms-swift/tree/main/examples/train/multi-node

set -eu

# Debug information
echo "=== Debug Info ==="
# Use defaults so that "set -u" does not abort when these are unset
echo "MEGATRON_LM_PATH: ${MEGATRON_LM_PATH:-<unset>}"
echo "MODELSCOPE_CACHE: ${MODELSCOPE_CACHE:-<unset>}"
ls -la /workspace/Megatron-LM/ 2>/dev/null || echo "Megatron-LM directory not found"
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}'); print(f'Device count: {torch.cuda.device_count()}')"
echo "=================="

# Set CUDA environment variables explicitly
export CUDA_HOME=/usr/local/cuda
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:${LD_LIBRARY_PATH:-}
export NVIDIA_DRIVER_CAPABILITIES=compute,utility

export PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True'
export CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}
# Take the node count from Slurm (this overrides the value exported by the job script)
export NNODES=$SLURM_NNODES
export NODE_RANK=$SLURM_PROCID
export MASTER_ADDR=${MASTER_ADDR}
export MASTER_PORT=${MASTER_PORT}
export NPROC_PER_NODE=${NPROC_PER_NODE}
export NCCL_DEBUG=INFO

echo "=== Debug Info ==="
echo "NNODES: $NNODES"
echo "NODE_RANK: $NODE_RANK"
echo "MASTER_ADDR: $MASTER_ADDR"
echo "MASTER_PORT: $MASTER_PORT"
echo "NPROC_PER_NODE: $NPROC_PER_NODE"
echo "=================="

ulimit -s unlimited
ulimit -v unlimited
ulimit -n 65536
ulimit -u 32768

export WANDB_ENTITY=ken05-matuo-llm-88_llm_2025_suzuki

MODEL=${1}
MASTER_PORT=${2}

USE_HF=1  MASTER_PORT=${MASTER_PORT} megatron sft \
    --load ${MODEL} \
    --dataset /home/xxxx/xxxx/xxxx/train.parquet \
    --val_dataset /home/xxxx/xxxx/xxxx/validation.parquet \
    --split_dataset_ratio 0.01 \
    --train_type lora \
    --lazy_tokenize true \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --expert_model_parallel_size 8 \
    --sequence_parallel true \
    --micro_batch_size 1 \
    --global_batch_size 16 \
    --recompute_granularity full \
    --recompute_method uniform \
    --recompute_num_layers 1 \
    --finetune true \
    --cross_entropy_loss_fusion true \
    --lr 1e-4 \
    --lr_warmup_fraction 0.05 \
    --min_lr 1e-5 \
    --max_epochs 1 \
    --save megatron_output/singlenode/${MODEL} \
    --eval_interval 200 \
    --save_interval 200 \
    --max_length 16384 \
    --num_workers 8 \
    --dataset_num_proc 8 \
    --no_save_optim true \
    --no_save_rng true \
    --attention_backend flash \
    --wandb_exp_name msswift_megatron \
    --wandb_project sft_singlenode \
    --wandb_save_dir wandb_logs
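
As a sanity check on the batch settings above: in Megatron-style training, the data-parallel size is the number of GPUs divided by the product of the tensor, pipeline, and context parallel sizes, and the global batch size must be a multiple of micro batch size × data-parallel size. The sketch below computes the resulting number of gradient accumulation steps. The function name is hypothetical, and the sketch assumes that expert parallelism shards experts across the data-parallel ranks rather than shrinking the data-parallel size.

```shell
#!/bin/bash
# Hypothetical helper: gradient accumulation steps for a Megatron-style layout.
# Args: world_size tp pp cp micro_batch_size global_batch_size
grad_accum_steps() {
    local world=$1 tp=$2 pp=$3 cp=$4 micro=$5 global=$6
    local model_parallel=$(( tp * pp * cp ))
    if (( world % model_parallel != 0 )); then
        echo "world size ${world} is not divisible by TP*PP*CP=${model_parallel}" >&2
        return 1
    fi
    local dp=$(( world / model_parallel ))
    local per_step=$(( micro * dp ))
    if (( global % per_step != 0 )); then
        echo "global batch ${global} is not a multiple of micro*DP=${per_step}" >&2
        return 1
    fi
    echo $(( global / per_step ))
}

# Single-node settings from this section: 8 GPUs, no TP/PP/CP, micro 1, global 16.
grad_accum_steps 8 1 1 1 1 16
```

With the single-node values this prints 2: each optimizer step accumulates two micro-batches on each of the 8 data-parallel ranks.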

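How to run with Singularity

# Log in to the node
cd (path to the ms-swift SFT directory)
# The first argument is the model path that the script passes to --load
sbatch sft_singlenode.sh (path to the model)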
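
After submission, the job output goes to the files named by the --output and --error patterns in the job script, where %x is the job name and %j is the job ID. The following is a small sketch for locating the log of the single-node job. The helper name and the MODEL_PATH variable are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: expand the logs/%x-%j.out pattern used in the job script.
# Args: job_name job_id
sft_log_path() {
    local job_name=$1 job_id=$2
    echo "logs/${job_name}-${job_id}.out"
}

# Example (needs a Slurm cluster, so it is left commented out):
# job_id=$(sbatch --parsable sft_singlenode.sh "${MODEL_PATH}")
# tail -f "$(sft_log_path sft_megatron_singlenode "${job_id}")"
```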

For the Singularity environment, see the following.

3.2 Training on multiple nodes

Create the Slurm job script that launches Singularity and name it sft_multinode.sh.

#!/bin/bash
#SBATCH --job-name=sft_megatron_multinode
#SBATCH -p P05
#SBATCH --nodelist=xxx-xxx[58-61]
#SBATCH --nodes=4
#SBATCH --gpus-per-node=8
#SBATCH --cpus-per-task=128
#SBATCH --time=40:00:00
#SBATCH --output=logs/%x-%j.out
#SBATCH --error=logs/%x-%j.err

echo "start job"
# For multi-node runs, increase NNODES (and update --nodelist and --nodes to match)
export NNODES=4
export NPROC_PER_NODE=8
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=29500

MODEL=${1}
echo "start srun"
srun --jobid $SLURM_JOBID --gpus-per-node=${NPROC_PER_NODE}  singularity run -w --nv -B /home {full path to the Singularity environment} \
bash sft_multinode_exec.sh ${MODEL} ${MASTER_PORT}
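
In a multi-node job every node needs a distinct rank. The exec scripts in this article take it from SLURM_PROCID, which matches the node index only when srun starts one task per node (the case here, since no --ntasks-per-node is given). A hedged alternative is to prefer SLURM_NODEID, which Slurm sets to the relative index of the node within the job. The helper name below is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: choose the node rank for the distributed launcher.
# Prefers SLURM_NODEID, falls back to SLURM_PROCID, then to 0 outside Slurm.
resolve_node_rank() {
    echo "${SLURM_NODEID:-${SLURM_PROCID:-0}}"
}

export NODE_RANK="$(resolve_node_rank)"
echo "NODE_RANK: ${NODE_RANK}"
```

The fallback to 0 lets the same script run on a single machine outside Slurm for quick debugging.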

Create the SFT script and name it sft_multinode_exec.sh.

# For more information on multi-node training launch methods, refer to:
# https://github.com/modelscope/ms-swift/tree/main/examples/train/multi-node

set -eu

# Debug information
echo "=== Debug Info ==="
# Use defaults so that "set -u" does not abort when these are unset
echo "MEGATRON_LM_PATH: ${MEGATRON_LM_PATH:-<unset>}"
echo "MODELSCOPE_CACHE: ${MODELSCOPE_CACHE:-<unset>}"
ls -la /workspace/Megatron-LM/ 2>/dev/null || echo "Megatron-LM directory not found"
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}'); print(f'Device count: {torch.cuda.device_count()}')"
echo "=================="

# Set CUDA environment variables explicitly
export CUDA_HOME=/usr/local/cuda
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:${LD_LIBRARY_PATH:-}
export NVIDIA_DRIVER_CAPABILITIES=compute,utility

export PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True'
export CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}
# Take the node count from Slurm (this overrides the value exported by the job script)
export NNODES=$SLURM_NNODES
export NODE_RANK=$SLURM_PROCID
export MASTER_ADDR=${MASTER_ADDR}
export MASTER_PORT=${MASTER_PORT}
export NPROC_PER_NODE=${NPROC_PER_NODE}
export NCCL_DEBUG=INFO

echo "=== Debug Info ==="
echo "NNODES: $NNODES"
echo "NODE_RANK: $NODE_RANK"
echo "MASTER_ADDR: $MASTER_ADDR"
echo "MASTER_PORT: $MASTER_PORT"
echo "NPROC_PER_NODE: $NPROC_PER_NODE"
echo "=================="

ulimit -s unlimited
ulimit -v unlimited
ulimit -n 65536
ulimit -u 32768

export WANDB_ENTITY=ken05-matuo-llm-88_llm_2025_suzuki

MODEL=${1}
MASTER_PORT=${2}

USE_HF=1  MASTER_PORT=${MASTER_PORT} megatron sft \
    --load ${MODEL} \
    --dataset /home/xxxx/xxxx/xxxx/train.parquet \
    --val_dataset /home/xxxx/xxxx/xxxx/validation.parquet \
    --split_dataset_ratio 0.01 \
    --train_type lora \
    --lazy_tokenize true \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --tensor_model_parallel_size 2 \
    --pipeline_model_parallel_size 2 \
    --expert_model_parallel_size 4 \
    --context_parallel_size 2 \
    --moe_permute_fusion true \
    --moe_grouped_gemm true \
    --moe_shared_expert_overlap true \
    --moe_aux_loss_coeff 1e-3 \
    --moe_expert_capacity_factor 1.0 \
    --moe_token_dispatcher_type alltoall \
    --sequence_parallel true \
    --micro_batch_size 4 \
    --global_batch_size 32 \
    --recompute_granularity full \
    --recompute_method uniform \
    --recompute_num_layers 1 \
    --finetune true \
    --cross_entropy_loss_fusion true \
    --lr 1e-4 \
    --lr_warmup_fraction 0.05 \
    --min_lr 1e-5 \
    --max_epochs 1 \
    --save megatron_output/multinode/${MODEL} \
    --eval_interval 200 \
    --save_interval 400 \
    --max_length 16384 \
    --num_workers 8 \
    --dataset_num_proc 8 \
    --no_save_optim true \
    --no_save_rng true \
    --attention_backend flash \
    --wandb_exp_name msswift_megatron \
    --wandb_project sft_multinode \
    --wandb_save_dir wandb_logs

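How to run with Singularity

# Log in to the node
cd (path to the ms-swift SFT directory)
# The first argument is the model path that the script passes to --load
sbatch sft_multinode.sh (path to the model)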

4. Summary

Using Singularity, I was able to run SFT training with the ms-swift framework on both a single node and multiple nodes.

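Project credit

This project was carried out as part of the foundation model development project under "Safety Verification and Demonstration toward the Social Implementation of a Japanese Medical-Specialized LLM" of the New Energy and Industrial Technology Development Organization (NEDO).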
