Team RAMEN's GSPO verification

Last updated at 2025-11-06, Posted at 2025-11-06

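1. Overview

Team RAMEN took part in Phase 2 of the Matsuo Lab LLM Competition 2025.

While running post-training experiments as Team RAMEN, GSPO came up as one of the candidate methods.

Due to time constraints we only adapted a reference script from GitHub, but it ran without errors.

This article briefly documents that experiment.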

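2. What is GSPO?

GSPO stands for Group Sequence Policy Optimization.

It is a reinforcement learning method, published in 2025, that is designed to train MoE (Mixture of Experts) models efficiently.

See the following for details.

Reference URL: https://arxiv.org/abs/2507.18071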
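
As a short summary of the paper linked above (notation follows the paper as we understand it; please check the original for the exact statement), GSPO differs from GRPO mainly in that the importance ratio is defined per sequence, with length normalization, rather than per token. Here $x$ is a prompt, $y_1, \dots, y_G$ are the $G$ responses sampled for it, and $r(x, y_i)$ is the reward.

```latex
% Length-normalized, sequence-level importance ratio
s_i(\theta)
  = \left( \frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\mathrm{old}}}(y_i \mid x)} \right)^{1/|y_i|}
  = \exp\!\left( \frac{1}{|y_i|} \sum_{t=1}^{|y_i|}
      \log \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}
                {\pi_{\theta_{\mathrm{old}}}(y_{i,t} \mid x, y_{i,<t})} \right)

% Group-relative advantage (same as in GRPO)
\hat{A}_i
  = \frac{r(x, y_i) - \mathrm{mean}\big(\{r(x, y_j)\}_{j=1}^{G}\big)}
         {\mathrm{std}\big(\{r(x, y_j)\}_{j=1}^{G}\big)}

% Clipped objective, applied once per sequence
J_{\mathrm{GSPO}}(\theta)
  = \mathbb{E}\!\left[ \frac{1}{G} \sum_{i=1}^{G}
      \min\!\Big( s_i(\theta)\, \hat{A}_i,\;
                  \mathrm{clip}\big(s_i(\theta),\, 1 - \varepsilon,\, 1 + \varepsilon\big)\, \hat{A}_i \Big) \right]
```

Because the ratio is a geometric mean over tokens, it stays very close to 1, which is why the clipping range used with GSPO is much smaller than the values typically used with token-level ratios.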

3. GSPO experiment

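3.1 Library

We used the verl library to run the GSPO experiment.

Container images are also published on Docker Hub and can be run with Singularity, but this time we built the environment with the conda setup used for the competition.

3.2 Reference source

We adapted the following script on GitHub to the competition environment.

Reference URL: https://github.com/volcengine/verl/blob/main/recipe/gspo/test_gspo_qwen30b_a3b_ep.sh

3.3 Rewriting the source for the competition

We rewrote the script as shown below.

It largely follows the script in 3.2, and the reward function is used as is.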

#!/bin/bash
#SBATCH --job-name=gspo-qwen
#SBATCH -p P05
#SBATCH --nodelist=osk-gpu62
#SBATCH --gres=gpu:8
#SBATCH --cpus-per-task=64
#SBATCH --time=12:00:00
#SBATCH --mem=0
#SBATCH --output=/home/Competition2025/P05/shareP05/nishimae/traning_log/gspo/slurm_logs/%j.out
#SBATCH --error=/home/Competition2025/P05/shareP05/nishimae/traning_log/gspo/slurm_logs/%j.err

# Reset the module environment (drops whatever modules are currently loaded)
module reset

# Load NCCL (NVIDIA Collective Communications Library) version 2.22.3
module load nccl/2.22.3

# Load HPC-X (high-performance communication library) version 2.18.1, built for CUDA 12 and GCC
module load hpcx/2.18.1-gcc-cuda12/hpcx-mt

module load miniconda/24.7.1-py311

source /home/appli/miniconda3/24.7.1-py311/etc/profile.d/conda.sh

# Confirm that the conda command is available.
which conda && echo "====" && conda --version

# Directory of the conda environment installed in step 0
export CONDA_PATH="/home/Competition2025/P05/shareP05/train/envs/train_env_hara_gspo/conda_env"

source ~/.bashrc

conda init

conda config --set auto_activate_base false

# Deactivate twice as a precaution, in case a Python environment is already active.
conda deactivate
conda deactivate

# Activate the Python environment created above.
conda activate $CONDA_PATH
python -m pip install lark==1.2.2
python -m pip install groq

cd /home/Competition2025/P05/shareP05/nishimae/traning_log/gspo

export NCCL_SOCKET_IFNAME=enp25s0np0
export NVTE_FUSED_ATTN=0
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

unset ROCR_VISIBLE_DEVICES

# ulimit settings
ulimit -v unlimited
ulimit -m unlimited
ulimit -c unlimited
ulimit -n 65535

export WANDB_ENTITY='ken05-matuo-llm-88_llm_2025_suzuki'
export WANDB_PROJECT="gspo-qwen3"
export WANDB_RUN_NAME="gspo-30BA3B-${SLURM_JOB_ID}"

# Set how many GPUs we actually have on this node.
export GPUS_PER_NODE=8
NNODES=${SLURM_JOB_NUM_NODES}
export NNODES

export RAY_LOGGING_LEVEL=DEBUG
export HYDRA_FULL_ERROR=1

# Workaround to avoid taking the server down: redirect temp directories
export TEMP="/nvme34/P03U011/tmp"
export TMPDIR="/nvme34/P03U011/tmp"
export TMP="/nvme34/P03U011/tmp"

export RAY_DISABLE_DASHBOARD=1

rm -rf /nvme34/P03U011/tmp/ray
mkdir -p /nvme34/P03U011/tmp/ray

echo "Using $NNODES nodes for training..."

# ------------------------------------- Setup xp params ---------------------------------------
project_name='RL-GSPO'

adv_estimator=grpo
loss_mode=gspo
loss_agg_mode="seq-mean-token-mean"
MODEL_PATH="/home/Competition2025/P05/shareP05/models/Qwen3-30B-A3B-Thinking-2507"
offload=false # it's a small model, offloading will just slow-down training
rollout_engine=vllm
rollout_mode=sync # can be async to speedup large scale xps
gpu_memory_utilization=0.9
reward_manager=dapo
shuffle_dataset=true
first_time_dataset_prep=true # prepare dataset

test_freq=10
save_freq=10
total_epochs=10
total_training_steps=500
val_before_train=false

use_kl_in_reward=false
kl_coef=0.0
use_kl_loss=false
kl_loss_coef=0.0

clip_ratio_low=0.0003 # as recommended by the paper, see Sec. 5.1
clip_ratio_high=0.0004 # as recommended by the paper, see Sec. 5.1
train_batch_size=512
ppo_mini_batch_size=128 # maintain 4 mini-batches as recommended by the paper, see Sec. 5.1
ppo_micro_batch_size_per_gpu=8 # setup depending on your GPU memory
n_resp_per_prompt=16

max_prompt_length=$((1024 * 2))
max_response_length=$((1024 * 4))
# dapo reward manager params
enable_overlong_buffer=false # true
overlong_buffer_len=$((1024 * 2))
overlong_penalty_factor=1.0

# Paths and namings
SFT_MODEL=$(basename $MODEL_PATH)
exp_name="${loss_mode}-epslow-${clip_ratio_low}-epshigh-${clip_ratio_high}-${SFT_MODEL}-RL"
CKPTS_DIR=./checkpoints/${loss_mode}/${exp_name}

# Sampling params at rollouts
temperature=1.0
top_p=1.0
top_k=-1 # 0 for HF rollout, -1 for vLLM rollout
val_top_p=0.7

# Performance Related Parameter
sp_size=1
use_dynamic_bsz=true
actor_ppo_max_token_len=$(((max_prompt_length + max_response_length) * 2))
infer_ppo_max_token_len=$(((max_prompt_length + max_response_length) * 3))
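offload=true # NOTE: overrides the offload=false set earlier in this script; this is the value actually passed to FSDP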
gen_tp=1
entropy_checkpointing=true # This enables entropy recomputation specifically for the entropy calculation, lowering memory usage during training.

# ------------------------------------- train/val data preparation ---------------------------------------
if [ "$first_time_dataset_prep" = true ]; then
    echo "Preprocessing GSM8K dataset..."
    python $CONDA_PATH/../deps/verl/examples/data_preprocess/gsm8k.py --local_save_dir $HOME/data/gsm8k/
fi

gsm8k_train_path=~/data/gsm8k/train.parquet
gsm8k_test_path=~/data/gsm8k/test.parquet

# set the paths
train_files="['$gsm8k_train_path']"
test_files="['$gsm8k_test_path']"

python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=${adv_estimator} \
    actor_rollout_ref.actor.policy_loss.loss_mode=${loss_mode} \
    data.train_files="${train_files}" \
    data.val_files="${test_files}" \
    data.shuffle=$shuffle_dataset \
    data.prompt_key=prompt \
    data.truncation='error' \
    data.filter_overlong_prompts=true \
    data.train_batch_size=${train_batch_size} \
    data.max_prompt_length=${max_prompt_length} \
    data.max_response_length=${max_response_length} \
    actor_rollout_ref.rollout.n=${n_resp_per_prompt} \
    algorithm.use_kl_in_reward=${use_kl_in_reward} \
    algorithm.kl_ctrl.kl_coef=${kl_coef} \
    actor_rollout_ref.actor.use_kl_loss=${use_kl_loss} \
    actor_rollout_ref.actor.kl_loss_coef=${kl_loss_coef} \
    actor_rollout_ref.actor.clip_ratio_low=${clip_ratio_low} \
    actor_rollout_ref.actor.clip_ratio_high=${clip_ratio_high} \
    actor_rollout_ref.model.use_remove_padding=true \
    actor_rollout_ref.actor.use_dynamic_bsz=${use_dynamic_bsz} \
    actor_rollout_ref.ref.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
    actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
    actor_rollout_ref.actor.ppo_max_token_len_per_gpu=${actor_ppo_max_token_len} \
    actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
    actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
    actor_rollout_ref.rollout.name=${rollout_engine} \
    actor_rollout_ref.rollout.mode=${rollout_mode} \
    actor_rollout_ref.model.path="${MODEL_PATH}" \
    actor_rollout_ref.model.enable_gradient_checkpointing=true \
    actor_rollout_ref.ref.fsdp_config.model_dtype=bf16 \
    actor_rollout_ref.actor.fsdp_config.model_dtype=bf16 \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.actor.optim.lr_warmup_steps_ratio=0.05 \
    actor_rollout_ref.actor.optim.weight_decay=0.1 \
    actor_rollout_ref.actor.ppo_mini_batch_size=${ppo_mini_batch_size} \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=${ppo_micro_batch_size_per_gpu} \
    actor_rollout_ref.actor.fsdp_config.param_offload=${offload} \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=${offload} \
    actor_rollout_ref.actor.entropy_coeff=0 \
    actor_rollout_ref.actor.grad_clip=1.0 \
    actor_rollout_ref.actor.loss_agg_mode=${loss_agg_mode} \
    actor_rollout_ref.actor.ulysses_sequence_parallel_size=${sp_size} \
    actor_rollout_ref.rollout.gpu_memory_utilization=${gpu_memory_utilization} \
    actor_rollout_ref.rollout.tensor_model_parallel_size=${gen_tp} \
    actor_rollout_ref.rollout.enable_chunked_prefill=true \
    actor_rollout_ref.rollout.max_num_batched_tokens=$((max_prompt_length + max_response_length)) \
    actor_rollout_ref.rollout.temperature=${temperature} \
    actor_rollout_ref.rollout.top_p=${top_p} \
    actor_rollout_ref.rollout.top_k=${top_k} \
    actor_rollout_ref.rollout.val_kwargs.temperature=${temperature} \
    actor_rollout_ref.rollout.val_kwargs.top_p=${val_top_p} \
    actor_rollout_ref.rollout.val_kwargs.top_k=${top_k} \
    actor_rollout_ref.rollout.val_kwargs.do_sample=true \
    actor_rollout_ref.rollout.val_kwargs.n=1 \
    actor_rollout_ref.ref.fsdp_config.param_offload=${offload} \
    actor_rollout_ref.ref.ulysses_sequence_parallel_size=${sp_size} \
    actor_rollout_ref.actor.entropy_checkpointing=${entropy_checkpointing} \
    reward_model.reward_manager=${reward_manager} \
    +reward_model.reward_kwargs.overlong_buffer_cfg.enable=${enable_overlong_buffer} \
    +reward_model.reward_kwargs.overlong_buffer_cfg.len=${overlong_buffer_len} \
    +reward_model.reward_kwargs.overlong_buffer_cfg.penalty_factor=${overlong_penalty_factor} \
    +reward_model.reward_kwargs.overlong_buffer_cfg.log=false \
    +reward_model.reward_kwargs.max_resp_len=${max_response_length} \
    trainer.logger='["console","wandb"]' \
    trainer.project_name="${project_name}" \
    trainer.experiment_name="${exp_name}" \
    trainer.n_gpus_per_node="${GPUS_PER_NODE}" \
    trainer.nnodes="${NNODES}" \
    trainer.val_before_train=${val_before_train} \
    trainer.test_freq=${test_freq} \
    trainer.save_freq=${save_freq} \
    trainer.total_epochs=${total_epochs} \
    trainer.total_training_steps=${total_training_steps} \
    trainer.default_local_dir="${CKPTS_DIR}" \
    trainer.resume_mode=auto \
    trainer.log_val_generations=2 \
    "$@"
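
The batch-related settings in the script interact with each other, so it can help to work out the derived quantities before submitting a job. The following is a minimal sketch that only redoes the arithmetic with the values from the script above; it does not call verl, and the variable names `responses_per_step` and `updates_per_step` are ours, introduced for illustration.

```shell
#!/bin/bash
# Values copied from the script above.
train_batch_size=512        # prompts per training step
ppo_mini_batch_size=128     # prompts per policy update
n_resp_per_prompt=16        # responses sampled per prompt
max_prompt_length=$((1024 * 2))
max_response_length=$((1024 * 4))
clip_ratio_low=0.0003
clip_ratio_high=0.0004

# Responses generated per training step (prompts x samples per prompt).
responses_per_step=$((train_batch_size * n_resp_per_prompt))

# Policy updates per training step (mini-batches per rollout batch).
updates_per_step=$((train_batch_size / ppo_mini_batch_size))

# Per-GPU token budgets used when use_dynamic_bsz=true.
actor_ppo_max_token_len=$(((max_prompt_length + max_response_length) * 2))
infer_ppo_max_token_len=$(((max_prompt_length + max_response_length) * 3))

echo "responses per step      : ${responses_per_step}"
echo "policy updates per step : ${updates_per_step}"
echo "actor token budget/GPU  : ${actor_ppo_max_token_len}"
echo "infer token budget/GPU  : ${infer_ppo_max_token_len}"

# Interval in which the sequence-level ratio is left unclipped.
awk -v lo="$clip_ratio_low" -v hi="$clip_ratio_high" \
    'BEGIN { printf "unclipped ratio range   : [%.4f, %.4f]\n", 1 - lo, 1 + hi }'
```

With these settings each step generates 8192 responses. That matches the step 1 log in 3.4, where perf/total_num_tokens (9331567) is approximately 8192 times the sum of the mean prompt and response lengths.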

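3.4 Results

The progress bar showed roughly 40 minutes per step, so completing all 500 steps would take about 330 hours.

We also considered changing the reward function and using an API other than vLLM, but because of time constraints we stopped here.

Still, we were glad to get GSPO running.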

Training Progress:   1%|          | 3/500 [2:00:06<331:28:23, 2401.01s/it]
step:1 - global_seqlen/min:1048431 - global_seqlen/max:1305505 - global_seqlen/minmax_diff:257074 - global_seqlen/balanced_min:1166445 - global_seqlen/balanced_max:1166446 - global_seqlen/mean:1166445.875 - actor/entropy:0.17531120777130127 - actor/pg_loss:-0.00023988876564014824 - actor/pg_clipfrac:0.3401229108132205 - actor/ppo_kl:-6.084418607571528e-05 - actor/pg_clipfrac_lower:0.0 - actor/grad_norm:0.0941162109375 - perf/mfu/actor:0.015284848191420949 - perf/max_memory_allocated_gb:112.15361452102661 - perf/max_memory_reserved_gb:119.193359375 - perf/cpu_memory_used_gb:98.97510528564453 - actor/lr:0.0 - training/global_step:1 - training/epoch:0 - critic/score/mean:0.7080078125 - critic/score/max:1.0 - critic/score/min:0.0 - critic/rewards/mean:0.7080078125 - critic/rewards/max:1.0 - critic/rewards/min:0.0 - critic/advantages/mean:-0.03459210693836212 - critic/advantages/max:3.7499847412109375 - critic/advantages/min:-3.7499847412109375 - critic/returns/mean:-0.03459210693836212 - critic/returns/max:3.7499847412109375 - critic/returns/min:-3.7499847412109375 - response_length/mean:1053.1951904296875 - response_length/max:4096.0 - response_length/min:225.0 - response_length/clip_ratio:0.019775390625 - response_length_non_aborted/mean:1053.1951904296875 - response_length_non_aborted/max:4096.0 - response_length_non_aborted/min:225.0 - response_length_non_aborted/clip_ratio:0.019775390625 - response/aborted_ratio:0.0 - prompt_length/mean:85.912109375 - prompt_length/max:196.0 - prompt_length/min:46.0 - prompt_length/clip_ratio:0.0 - timing_s/start_profile:0.0001772550167515874 - timing_s/generate_sequences:300.4525146484375 - timing_s/generation_timing/max:367.5038757324219 - timing_s/generation_timing/min:258.537109375 - timing_s/generation_timing/topk_ratio:0.125 - timing_s/gen:389.5034919740865 - timing_s/reward:2.791593419970013 - timing_s/old_log_prob:211.36980652296916 - timing_s/adv:0.2772369439480826 - timing_s/update_actor:1866.206732786959 - 
timing_s/step:2471.1838159790495 - timing_s/stop_profile:5.347910337150097e-05 - timing_per_token_ms/gen:0.045145300146803374 - timing_per_token_ms/update_actor:0.1999885692067537 - timing_per_token_ms/adv:2.9709580818321576e-05 - perf/total_num_tokens:9331567 - perf/time_per_step:2471.1838159790495 - perf/throughput:472.0190652988191
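
To see where the time goes, the timing fields in the log above can be turned into shares of one step, and the remaining time can be recomputed from the progress bar. This is a small sketch using values copied (and rounded) from the step 1 log; the helper names `share` and `hours_for` are hypothetical and exist only for this example.

```shell
#!/bin/bash
# Per-step timings in seconds, copied from the step:1 log line and rounded.
t_step=2471.18            # timing_s/step
t_gen=389.50              # timing_s/gen
t_old_log_prob=211.37     # timing_s/old_log_prob
t_update_actor=1866.21    # timing_s/update_actor

# Percentage of the total that one phase accounts for.
share() {
    awk -v part="$1" -v total="$2" 'BEGIN { printf "%.1f", 100 * part / total }'
}

# Hours needed for a number of steps at a given seconds-per-iteration rate.
hours_for() {
    awk -v steps="$1" -v sec="$2" 'BEGIN { printf "%.1f", steps * sec / 3600 }'
}

echo "gen          : $(share "$t_gen" "$t_step")% of step time"
echo "old_log_prob : $(share "$t_old_log_prob" "$t_step")% of step time"
echo "update_actor : $(share "$t_update_actor" "$t_step")% of step time"

# The progress bar shows 3/500 steps done at 2401.01 s/it.
echo "remaining    : $(hours_for 497 2401.01) h for 497 steps"
```

In this run the actor update accounts for about three quarters of each step, so that phase, rather than rollout generation, would be the first place to look for speedups.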

4. Summary

We were able to experiment with GSPO, one of the reinforcement learning methods.

The work is still incomplete, but we hope this article is of some help.

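Project credit

This project was carried out as part of the foundation model development project under "Safety Verification and Demonstration toward Social Implementation of a Japanese Medical-Domain LLM" of the New Energy and Industrial Technology Development Organization (NEDO), a National Research and Development Agency.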
