Training with DFT in ms-swift Megatron using singularity

Posted at 2025-11-12, last updated at 2025-11-12

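1. Overview

As a member of team RAMEN, I worked on model development in Phase 2 of the Matsuo Lab LLM Competition 2025.

Earlier articles covered singularity and, among other things, converting from Megatron format to Hugging Face format with ms-swift.

This article describes how to train with DFT on a Megatron-format model in ms-swift.

For singularity, see the following article.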

2. What is DFT?

DFT here is the loss enabled by the ms-swift option --enable_dft_loss, which the ms-swift documentation describes as Dynamic Fine-Tuning.

It is a post-training method that uses neither human preference labels nor a reward model.

For details, see the following.

3. Training with ms-swift

3.1 Single-node training

※ Create the job script that launches singularity. Name the file dft_singlenode.sh.

#!/bin/bash
#SBATCH --job-name=sft_megatron_singlenode
#SBATCH -p P05
#SBATCH --nodelist=xxx-xxxx
#SBATCH --nodes=1
#SBATCH --gpus-per-node=8
#SBATCH --cpus-per-task=128
#SBATCH --time=40:00:00
#SBATCH --output=logs/%x-%j.out
#SBATCH --error=logs/%x-%j.err

echo "start job"
# For multi-node runs, increase NNODES (and update the --nodelist entries and the --nodes value)
export NNODES=1
export NPROC_PER_NODE=8
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=29500

MODEL=${1}
echo "start srun"
srun --jobid $SLURM_JOBID --gpus-per-node=${NPROC_PER_NODE}  singularity run -w --nv -B /home {full path to the singularity environment} \
bash dft_singlenode_exec.sh ${MODEL} ${MASTER_PORT}

※ Create the DFT script. Name the file dft_singlenode_exec.sh.

　Adding "--enable_dft_loss true" to the parameters is what turns the run into DFT.

# For more information on multi-node training launch methods, refer to:
# https://github.com/modelscope/ms-swift/tree/main/examples/train/multi-node

set -eu

# Debug information
echo "=== Debug Info ==="
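# Fall back to a placeholder so that `set -u` does not abort when a variable is unset
echo "MEGATRON_LM_PATH: ${MEGATRON_LM_PATH:-<not set>}"
echo "MODELSCOPE_CACHE: ${MODELSCOPE_CACHE:-<not set>}"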
ls -la /workspace/Megatron-LM/ 2>/dev/null || echo "Megatron-LM directory not found"
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}'); print(f'Device count: {torch.cuda.device_count()}')"
echo "=================="

# Set CUDA environment variables explicitly
export CUDA_HOME=/usr/local/cuda
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:${LD_LIBRARY_PATH:-}
export NVIDIA_DRIVER_CAPABILITIES=compute,utility

export PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True'
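# Re-export the values inherited from the sbatch script (with `set -u`, an unset one aborts here)
export CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}
# Node count and node rank come from Slurm; this assumes srun starts one task per node
export NNODES=$SLURM_NNODES
export NODE_RANK=$SLURM_PROCID
export MASTER_ADDR=${MASTER_ADDR}
export MASTER_PORT=${MASTER_PORT}
export NPROC_PER_NODE=${NPROC_PER_NODE}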
export NCCL_DEBUG=INFO

echo "=== Debug Info ==="
echo "NNODES: $NNODES"
echo "NODE_RANK: $NODE_RANK"
echo "MASTER_ADDR: $MASTER_ADDR"
echo "MASTER_PORT: $MASTER_PORT"
echo "NPROC_PER_NODE: $NPROC_PER_NODE"
echo "=================="

ulimit -s unlimited
ulimit -v unlimited
ulimit -n 65536
ulimit -u 32768

export WANDB_ENTITY=ken05-matuo-llm-88_llm_2025_suzuki

MODEL=${1}
MASTER_PORT=${2}

USE_HF=1  MASTER_PORT=${MASTER_PORT} megatron sft \
    --load ${MODEL} \
    --dataset /home/xxxxx/xxxxx/xxxxx/train.parquet \
    --val_dataset /home/xxxxx/xxxxx/xxxxx/validation.parquet \
    --split_dataset_ratio 0.01 \
    --enable_dft_loss true \
    --train_type lora \
    --lazy_tokenize true \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --expert_model_parallel_size 8 \
    --sequence_parallel true \
    --micro_batch_size 1 \
    --global_batch_size 16 \
    --recompute_granularity full \
    --recompute_method uniform \
    --recompute_num_layers 1 \
    --finetune true \
    --cross_entropy_loss_fusion true \
    --lr 1e-4 \
    --lr_warmup_fraction 0.05 \
    --min_lr 1e-5 \
    --max_epochs 1 \
    --save megatron_output/singlenode/${MODEL} \
    --eval_interval 200 \
    --save_interval 200 \
    --max_length 2048 \
    --num_workers 8 \
    --dataset_num_proc 8 \
    --no_save_optim true \
    --no_save_rng true \
    --attention_backend flash \
    --wandb_exp_name msswift_megatron \
    --wandb_project dft_singlenode \
    --wandb_save_dir wandb_logs
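
Since the DFT run differs from a plain SFT run only by that one flag, the same script can serve both. The following is a minimal sketch and not part of the original scripts: the function name `build_loss_args` and the variable `ENABLE_DFT` are hypothetical names introduced here; only `--enable_dft_loss true` comes from the command above.

```shell
#!/bin/sh
# Hypothetical helper: print the DFT flag only when the first argument is "true".
build_loss_args() {
    if [ "${1:-false}" = "true" ]; then
        # Same flag as in the megatron sft command above
        echo "--enable_dft_loss true"
    fi
}

# Usage sketch (commented out so that this block has no side effects).
# The command substitution is left unquoted on purpose so that it splits
# into two words, or into nothing when DFT is disabled:
#   megatron sft --load "${MODEL}" $(build_loss_args "${ENABLE_DFT:-false}")
```

With `ENABLE_DFT=true` the flag is appended; with any other value the command runs as ordinary SFT.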

How to run with singularity

# Log in to the node
cd (path to the ms-swift DFT scripts)
sbatch dft_singlenode.sh (path to the Megatron-format model)
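
The job script writes its logs to logs/%x-%j.out and logs/%x-%j.err, where %x is the job name and %j the job ID. Slurm does not create that directory, so it has to exist before submission. The sketch below shows one way to prepare it and to locate the log file; `slurm_log_path` is a hypothetical helper, and passing the model path as the first argument is an assumption based on `MODEL=${1}` in the job script.

```shell
#!/bin/sh
# Hypothetical helper: the path that "--output=logs/%x-%j.out" expands to.
# Arguments: job name, job ID, extension (defaults to "out").
slurm_log_path() {
    printf 'logs/%s-%s.%s\n' "$1" "$2" "${3:-out}"
}

# Slurm does not create the directory used by --output/--error
mkdir -p logs

# Typical sequence (commented out because it needs a Slurm cluster):
#   JOB_ID=$(sbatch --parsable dft_singlenode.sh "${MODEL}")
#   squeue -j "${JOB_ID}"
#   tail -f "$(slurm_log_path sft_megatron_singlenode "${JOB_ID}" out)"
```

The job name in the last line is the one set by `#SBATCH --job-name` in dft_singlenode.sh.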

For the singularity environment, see the following.

3.2 Multi-node training

※ Create the job script that launches singularity. Name the file dft_multinode.sh.

#!/bin/bash
#SBATCH --job-name=dft_megatron_multinode
#SBATCH -p P05
#SBATCH --nodelist=xxxx-xxxx[60,62-64]
#SBATCH --nodes=4
#SBATCH --gpus-per-node=8
#SBATCH --cpus-per-task=128
#SBATCH --time=40:00:00
#SBATCH --output=logs/%x-%j.out
#SBATCH --error=logs/%x-%j.err

echo "start job"
export NNODES=4
export NPROC_PER_NODE=8
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=9901

MODEL=${1}
echo "start srun"
srun --jobid $SLURM_JOBID --gpus-per-node=${NPROC_PER_NODE}  singularity run -w --nv -B /home {full path to the singularity environment} \
bash dft_multinode_exec.sh ${MODEL}
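
The job scripts read the model path from the first argument (`MODEL=${1}`) and forward it unquoted, so a missing argument only surfaces later inside the container. A small guard near the top of the job script can stop the job earlier. This is a sketch under that assumption; `require_model_arg` is a hypothetical name and is not part of the original scripts.

```shell
#!/bin/sh
# Hypothetical guard: fail early when no model path was given.
require_model_arg() {
    if [ -z "${1:-}" ]; then
        echo "usage: sbatch dft_multinode.sh <path to the Megatron-format model>" >&2
        return 1
    fi
}

# Usage sketch inside the job script (commented out here):
#   require_model_arg "${1:-}" || exit 1
#   MODEL=${1}
```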

※ Create the DFT script. Name the file dft_multinode_exec.sh.

　Adding "--enable_dft_loss true" to the parameters is what turns the run into DFT.

# For more information on multi-node training launch methods, refer to:
# https://github.com/modelscope/ms-swift/tree/main/examples/train/multi-node

set -eu

# Debug information
echo "=== Debug Info ==="
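# Fall back to a placeholder so that `set -u` does not abort when a variable is unset
echo "MEGATRON_LM_PATH: ${MEGATRON_LM_PATH:-<not set>}"
echo "MODELSCOPE_CACHE: ${MODELSCOPE_CACHE:-<not set>}"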
ls -la /workspace/Megatron-LM/ 2>/dev/null || echo "Megatron-LM directory not found"
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}'); print(f'Device count: {torch.cuda.device_count()}')"
echo "=================="

export CUDA_HOME=/usr/local/cuda
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:${LD_LIBRARY_PATH:-}
export NVIDIA_DRIVER_CAPABILITIES=compute,utility

export PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True'
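# Re-export the values inherited from the sbatch script (with `set -u`, an unset one aborts here)
export CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}
# Node count and node rank come from Slurm; this assumes srun starts one task per node
export NNODES=$SLURM_NNODES
export NODE_RANK=$SLURM_PROCID
export MASTER_ADDR=${MASTER_ADDR}
export MASTER_PORT=${MASTER_PORT}
export NPROC_PER_NODE=${NPROC_PER_NODE}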
export NCCL_DEBUG=INFO

echo "=== Debug Info ==="
echo "NNODES: $NNODES"
echo "NODE_RANK: $NODE_RANK"
echo "MASTER_ADDR: $MASTER_ADDR"
echo "MASTER_PORT: $MASTER_PORT"
echo "NPROC_PER_NODE: $NPROC_PER_NODE"
echo "=================="

ulimit -s unlimited
ulimit -v unlimited
ulimit -n 65536
ulimit -u 32768

export WANDB_ENTITY=ken05-matuo-llm-88_llm_2025_suzuki

MODEL=${1}

USE_HF=1  MASTER_PORT=${MASTER_PORT} megatron sft \
    --load ${MODEL} \
    --dataset /home/xxxx/xxxx/xxxx/train.parquet \
    --val_dataset /home/xxxx/xxxx/xxxx/validation.parquet \
    --split_dataset_ratio 0.01 \
    --train_type lora \
    --lazy_tokenize true \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --tensor_model_parallel_size 2 \
    --pipeline_model_parallel_size 2 \
    --expert_model_parallel_size 4 \
    --context_parallel_size 2 \
    --moe_permute_fusion true \
    --moe_grouped_gemm true \
    --moe_shared_expert_overlap true \
    --moe_aux_loss_coeff 1e-3 \
    --moe_expert_capacity_factor 1.0 \
    --moe_token_dispatcher_type alltoall \
    --sequence_parallel true \
    --micro_batch_size 1 \
    --global_batch_size 16 \
    --recompute_granularity full \
    --recompute_method uniform \
    --recompute_num_layers 1 \
    --finetune true \
    --cross_entropy_loss_fusion true \
    --lr 2e-5 \
    --lr_warmup_fraction 0.05 \
    --min_lr 1e-5 \
    --max_epochs 1 \
    --save megatron_output/multinode/${MODEL} \
    --eval_interval 200 \
    --save_interval 200 \
    --max_length 16384 \
    --num_workers 8 \
    --dataset_num_proc 8 \
    --no_save_optim true \
    --no_save_rng true \
    --enable_dft_loss true \
    --attention_backend flash \
    --wandb_exp_name msswift_megatron \
    --wandb_project dft_multinode \
    --wandb_save_dir wandb_logs
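
The parallel sizes in this command have to fit the number of GPUs. As a rough check, the sketch below assumes the usual Megatron-LM convention that the data-parallel size is the world size divided by TP x PP x CP, and that the number of gradient accumulation steps is the global batch size divided by (micro batch size x data-parallel size). The function names are hypothetical, and expert parallelism is deliberately left out of this check.

```shell
#!/bin/sh
# data_parallel_size WORLD_SIZE TP PP CP
data_parallel_size() {
    model_parallel=$(( $2 * $3 * $4 ))
    if [ $(( $1 % model_parallel )) -ne 0 ]; then
        echo "world size $1 is not divisible by TP*PP*CP=$model_parallel" >&2
        return 1
    fi
    echo $(( $1 / model_parallel ))
}

# grad_accum_steps GLOBAL_BATCH MICRO_BATCH DP
grad_accum_steps() {
    per_step=$(( $2 * $3 ))
    if [ $(( $1 % per_step )) -ne 0 ]; then
        echo "global batch $1 is not divisible by micro_batch*DP=$per_step" >&2
        return 1
    fi
    echo $(( $1 / per_step ))
}

# Multi-node settings from this article: 4 nodes x 8 GPUs, TP=2, PP=2, CP=2,
# global_batch_size=16, micro_batch_size=1
DP=$(data_parallel_size $(( 4 * 8 )) 2 2 2)
ACCUM=$(grad_accum_steps 16 1 "$DP")
echo "data parallel size: $DP, gradient accumulation steps: $ACCUM"
```

Under these assumptions the multi-node run uses a data-parallel size of 4 with 4 accumulation steps, while the single-node run (8 GPUs, no tensor, pipeline, or context parallelism) uses a data-parallel size of 8 with 2 accumulation steps.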

How to run with singularity

# Log in to the node
cd (path to the ms-swift DFT scripts)
sbatch dft_multinode.sh (path to the Megatron-format model)

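4. Summary

Using singularity, I was able to run DFT training with the ms-swift framework.

Project credit

This project was carried out as part of the foundation model development project within "Safety Verification and Demonstration toward the Social Implementation of a Japanese Medical-Specialized LLM" of the New Energy and Industrial Technology Development Organization (NEDO), a National Research and Development Agency.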
