Fine-Tuning Phi with Azure Machine Learning

Posted at 2025-08-07

This is a record of fine-tuning Phi-3-mini-4k-instruct on Azure Machine Learning.
To be honest, I followed a somewhat dated article without changing much, so there are probably better ways to do this.

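Prerequisites

An Azure Machine Learning resource has already been created
Quota has already been requested for the VM size used by the compute cluster

Caveats

Things I couldn't do or chose not to do

Docker optimization: I kept the Dockerfile the same as the reference, so many of the versions are old. For production use, you should update to current versions and think about stability.
Phi-4 series not used: Phi-3 is admittedly dated for 2025...
Checking accuracy after fine-tuning: I did not verify whether accuracy actually improved.
Detailed code review: I only checked that it runs, so there are many places where the code is not written well.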

Steps

1. Create the compute cluster

From the Manage section of the Azure Machine Learning Studio menu, I created a compute cluster that uses the Standard_NC40ads_H100_v5 VM size.
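
The same cluster could probably be created from the Python SDK instead of the Studio UI. The following is a minimal sketch, not what was actually run for this article: the helper names build_cluster_spec and create_cluster, and the idle_time_before_scale_down value, are assumptions for illustration. The cluster name gpu-cluster and the VM size are the ones used later in this article.

```python
from typing import Any, Dict


def build_cluster_spec(
    name: str = "gpu-cluster",
    size: str = "Standard_NC40ads_H100_v5",
    max_instances: int = 1,
) -> Dict[str, Any]:
    """Collect the keyword arguments for an AmlCompute cluster definition."""
    if max_instances < 1:
        raise ValueError("max_instances must be at least 1")
    return {
        "name": name,
        "size": size,
        "min_instances": 0,  # let the cluster scale down to zero nodes when idle
        "max_instances": max_instances,
        "idle_time_before_scale_down": 600,  # seconds; illustrative value
    }


def create_cluster(ml_client, spec: Dict[str, Any]):
    """Create or update the cluster through an existing MLClient."""
    # Imported lazily so build_cluster_spec works without the Azure SDK installed.
    from azure.ai.ml.entities import AmlCompute

    return ml_client.compute.begin_create_or_update(AmlCompute(**spec)).result()
```

create_cluster expects the same kind of MLClient that the execution program in section 3.3 builds, and the quota for the VM size still has to be approved beforehand.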

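2. Create the environment

I created the environment from Assets -> Environments in the Azure Machine Learning Studio menu, building it from a Dockerfile.
Honestly, this was the hardest part. Installing flash-attn produced a variety of errors, and I arrived at this form while having an AI analyze them. That said, both Ubuntu and Python are old versions here, so this needs to be optimized before production use. For this article, I stopped at getting it to run without errors.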

Dockerfile

FROM mcr.microsoft.com/aifx/acpt/stable-ubuntu2004-cu124-py310-torch241:biweekly.202410.2

USER root

# support Deepspeed launcher requirement of passwordless ssh login
RUN apt-get update && \
    apt-get -y upgrade && \
    apt-get install -y --no-install-recommends \
      openssh-server \
      openssh-client && \
    rm -rf /var/lib/apt/lists/*

RUN pip install --upgrade pip

# Install pip dependencies
RUN pip install --no-cache-dir \
    azureml-acft-accelerator==0.0.63 \
    azureml_acft_common_components==0.0.63 \
    azureml-acft-contrib-hf-nlp==0.0.63 \
    azureml-evaluate-mlflow==0.0.63 \
    azureml-metrics[text]==0.0.63 \
    mltable==1.6.1 \
    mpi4py==4.0.1 \
    sentencepiece==0.2.0 \
    transformers==4.46.1 \
    datasets==3.1.0 \
    accelerate==1.1.0 \
    diffusers==0.31.0 \
    onnxruntime==1.20.0 \
    rouge-score==0.1.2 \
    sacrebleu==2.4.3 \
    bitsandbytes==0.44.1 \
    einops==0.8.0 \
    aiohttp==3.10.10 \
    peft==0.13.2 \
    deepspeed==0.15.3 \
    trl==0.12.0 \
    tiktoken==0.8.0 \
    packaging==24.1 \
    timm==1.0.11 \
    azure-identity

# flash-attn: pass the build-time environment variable (MAX_JOBS) when installing
RUN MAX_JOBS=4 pip install flash-attn==2.6.3 --no-build-isolation

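3. Run the Python scripts

This is the local Python environment. From the local PC I upload data to Azure ML and create and run jobs.
It may include unnecessary packages, but here is pyproject.toml. I'm using Poetry on Ubuntu 22.04.5.

pyproject.toml (excerpt of the tool.poetry.dependencies section)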

[tool.poetry.dependencies]
python = "^3.11"
jupyterlab = "^4.4.5"
datasets = "^4.0.0"
azure-ai-ml = "^1.28.1"
azure-identity = "^1.23.1"
transformers = "^4.54.0"
peft = "^0.16.0"
trl = "^0.19.1"
bitsandbytes = "^0.46.1"
torch = "^2.7.1"
accelerate = "^1.9.0"

3.1. Data preparation

Download the training data and save it on the local PC.

import os
from random import randrange

from datasets import load_dataset

# Example: fetch a small slice (2%) of the UltraChat dataset
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:2%]")

# Make sure the output directory exists before writing
os.makedirs("data", exist_ok=True)

# train/test split
splits = dataset.train_test_split(test_size=0.2)
splits["train"].to_json("data/train.jsonl", orient="records", lines=True)
splits["test"].to_json("data/eval.jsonl", orient="records", lines=True)

print("Train size:", len(splits["train"]), "Eval size:", len(splits["test"]))
# Print one random record to check the structure
print(dataset[randrange(len(dataset))])

Result

/home/fukuhara/repositories/fine-tune-aml/.venv/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm
Creating json from Arrow format: 100%|██████████| 4/4 [00:00<00:00,  7.50ba/s]
Creating json from Arrow format: 100%|██████████| 1/1 [00:00<00:00,  9.16ba/s]
Train size: 3325 Eval size: 832
{'prompt': 'What factors contributed to the rise of the Ottoman Empire, and how did it maintain its power in the early modern era?', 'prompt_id': 'f7c6c1885a6ec92f0ac9a0ad6193529aeebc15a4f66c622dcb64366d640bb89c', 'messages': [{'content': 'What factors contributed to the rise of the Ottoman Empire, and how did it maintain its power in the early modern era?', 'role': 'user'}, {'content': 'The rise of the Ottoman Empire was due to various factors, (rest omitted)

Upload the files from Assets -> Data in the Azure Machine Learning Studio menu (details omitted).
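
As an alternative to the Studio upload, the two JSONL files could likely be registered as data assets from the SDK. This is a sketch under assumptions: the helper names are made up for illustration, and the asset names phi35-train-data and phi35-eval-data with version 1 are taken from the job definition in section 3.3.

```python
from typing import Dict


def build_data_asset_spec(name: str, local_path: str, version: str = "1") -> Dict[str, str]:
    """Keyword arguments for a uri_file data asset pointing at a local file."""
    return {"name": name, "path": local_path, "type": "uri_file", "version": version}


def asset_reference(spec: Dict[str, str]) -> str:
    """The azureml:<name>:<version> string that a job Input refers to."""
    return f"azureml:{spec['name']}:{spec['version']}"


def register_data_asset(ml_client, spec: Dict[str, str]):
    """Upload the file and register it as a data asset through an existing MLClient."""
    # Imported lazily so the pure helpers above work without the Azure SDK installed.
    from azure.ai.ml.entities import Data

    return ml_client.data.create_or_update(Data(**spec))


train_spec = build_data_asset_spec("phi35-train-data", "data/train.jsonl")
eval_spec = build_data_asset_spec("phi35-eval-data", "data/eval.jsonl")
print(asset_reference(train_spec), asset_reference(eval_spec))
```

The strings printed at the end match the path values passed to Input(...) in the job definition.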

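3.2. Fine-tuning program

This is the fine-tuning program that runs on Azure ML. Place it in the src directory as train.py.
Most of it is copied and pasted from the reference, so there are many parts I don't fully understand.
The part I struggled with was saving the model. It took trial and error to find that saving with mlflow.transformers.log_model places the model correctly in the job's outputs. This script currently runs with mlflow 2.22.1; note that in 3.x the artifact_path argument of mlflow.transformers.log_model is deprecated.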
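
If the script has to survive an mlflow upgrade, one option is to choose the keyword by version. The sketch below assumes that name is the 3.x replacement for artifact_path (worth confirming against the mlflow release notes); the helper name log_model_location_kwargs is hypothetical. Whether the job output path used for model registration stays the same under 3.x was not verified here.

```python
from typing import Dict


def log_model_location_kwargs(mlflow_version: str, location: str = "model") -> Dict[str, str]:
    """Pick the keyword that tells log_model where to store the model.

    Assumption: mlflow 3.x prefers `name`, while 2.x uses `artifact_path`.
    """
    major = int(mlflow_version.split(".")[0])
    if major >= 3:
        return {"name": location}
    return {"artifact_path": location}


# Intended use inside train.py (not executed here):
#   mlflow.transformers.log_model(
#       transformers_model={"model": trainer.model, "tokenizer": trainer.tokenizer},
#       **log_model_location_kwargs(mlflow.__version__),
#   )
print(log_model_location_kwargs("2.22.1"))
```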

src/train.py

import argparse
import logging
import os
import sys

import datasets
import mlflow
import torch
import transformers
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

logger = logging.getLogger(__name__)


###################
# Hyper-parameters
###################
training_config = {
    "bf16": True,
    "do_eval": False,
    "learning_rate": 5.0e-06,
    "log_level": "info",
    "logging_steps": 20,
    "logging_strategy": "steps",
    "lr_scheduler_type": "cosine",
    "num_train_epochs": 1,
    "max_steps": -1,
    "output_dir": "./checkpoint_dir",
    "overwrite_output_dir": True,
    "per_device_eval_batch_size": 4,
    "per_device_train_batch_size": 4,
    "remove_unused_columns": True,
    "save_steps": 100,
    "save_total_limit": 1,
    "seed": 0,
    "gradient_checkpointing": True,
    "gradient_checkpointing_kwargs": {"use_reentrant": False},
    "gradient_accumulation_steps": 1,
    "warmup_ratio": 0.2,
    "report_to": [],  # disables automatic logging to MLflow and other trackers
}

peft_config = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "bias": "none",
    "task_type": "CAUSAL_LM",
    "target_modules": "all-linear",
    "modules_to_save": None,
}
train_conf = TrainingArguments(**training_config)
peft_conf = LoraConfig(**peft_config)

###############
# Setup logging
###############
logging.basicConfig(
    format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
    handlers=[logging.StreamHandler(sys.stdout)],
)
log_level = train_conf.get_process_log_level()
logger.setLevel(log_level)
datasets.utils.logging.set_verbosity(log_level)
transformers.utils.logging.set_verbosity(log_level)
transformers.utils.logging.enable_default_handler()
transformers.utils.logging.enable_explicit_format()

# Log on each process a small summary
logger.warning(
    f"Process rank: {train_conf.local_rank}, device: {train_conf.device}, n_gpu: {train_conf.n_gpu}"
    + f" distributed training: {bool(train_conf.local_rank != -1)}, 16-bits training: {train_conf.fp16}"
)
logger.info(f"Training/evaluation parameters {train_conf}")
logger.info(f"PEFT parameters {peft_conf}")

################
# Model Loading
################
checkpoint_path = "microsoft/Phi-3-mini-4k-instruct"
# checkpoint_path = "microsoft/Phi-3.5-mini-instruct"  # Phi-3.5
# checkpoint_path = "microsoft/Phi-3-mini-128k-instruct"
model_kwargs = dict(
    use_cache=False,
    trust_remote_code=True,
    attn_implementation="flash_attention_2",  # load the model with flash-attention support
    torch_dtype=torch.bfloat16,
    device_map=None,
)
model = AutoModelForCausalLM.from_pretrained(checkpoint_path, **model_kwargs)
tokenizer = AutoTokenizer.from_pretrained(checkpoint_path)
tokenizer.model_max_length = 2048
tokenizer.pad_token = tokenizer.unk_token  # use unk rather than eos token to prevent endless generation
tokenizer.pad_token_id = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)
tokenizer.padding_side = 'right'

##################
# Data Processing
##################
def apply_chat_template(
    example,
    tokenizer,
):
    messages = example["messages"]
    # Add an empty system message if there is none
    if messages[0]["role"] != "system":
        messages.insert(0, {"role": "system", "content": ""})
    example["text"] = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=False)
    return example


def main(args):
    train_dataset = load_dataset('json', data_files=args.train_file, split='train')
    test_dataset = load_dataset('json', data_files=args.eval_file, split='train')
    column_names = list(train_dataset.features)

    processed_train_dataset = train_dataset.map(
        apply_chat_template,
        fn_kwargs={"tokenizer": tokenizer},
        num_proc=10,
        remove_columns=column_names,
        desc="Applying chat template to train_sft",
    )

    processed_test_dataset = test_dataset.map(
        apply_chat_template,
        fn_kwargs={"tokenizer": tokenizer},
        num_proc=10,
        remove_columns=column_names,
        desc="Applying chat template to test_sft",
    )

    ###########
    # Training
    ###########
    trainer = SFTTrainer(
        model=model,
        args=train_conf,
        peft_config=peft_conf,
        train_dataset=processed_train_dataset,
        eval_dataset=processed_test_dataset,
        max_seq_length=2048,
        dataset_text_field="text",
        tokenizer=tokenizer,
        packing=True
    )

    train_result = trainer.train()
    metrics = train_result.metrics
    trainer.log_metrics("train", metrics)
    trainer.save_metrics("train", metrics)
    trainer.save_state()

    #############
    # Evaluation
    #############
    tokenizer.padding_side = 'left'
    metrics = trainer.evaluate()
    metrics["eval_samples"] = len(processed_test_dataset)
    trainer.log_metrics("eval", metrics)
    trainer.save_metrics("eval", metrics)


    #############
    # Save model
    #############

    with mlflow.start_run():
        # Logging this way registers the model under the job's outputs,
        # but the name is generated dynamically (e.g. mlflow_log_model_1992684870)
        mlflow.transformers.log_model(
            transformers_model={
                "model": trainer.model,
                "tokenizer": trainer.tokenizer,
            },
            # With artifact_path=args.model_dir the model is not written to outputs/model
            artifact_path="model",  # this is the 'model' part of the path used at registration
        )


def parse_args():
    # setup argparse
    parser = argparse.ArgumentParser()
    parser.add_argument("--train-file", type=str, required=True)
    parser.add_argument("--eval-file",  type=str, required=True)
    # parser.add_argument("--model-dir",  type=str, default="./outputs/model")
    # NOTE: the options below are accepted so the job command can pass them,
    # but main() does not read them; the hyper-parameters come from training_config.
    parser.add_argument("--epochs", default=10, type=int, help="number of epochs")
    parser.add_argument(
        "--batch-size",
        default=16,
        type=int,
        help="mini batch size for each gpu/process",
    )
    parser.add_argument("--learning-rate", default=0.001, type=float, help="learning rate")
    parser.add_argument("--momentum", default=0.9, type=float, help="momentum")
    parser.add_argument(
        "--print-freq",
        default=200,
        type=int,
        help="frequency of printing training statistics",
    )

    # parse args
    args = parser.parse_args()

    # return args
    return args


# run script
if __name__ == "__main__":
    # parse args
    args = parse_args()
    # call main function
    main(args)
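
One thing to be aware of: train.py parses --epochs, --batch-size and --learning-rate, but main() only reads --train-file and --eval-file, so the values passed by the job have no effect and the run uses training_config as written (the training log later in this article shows learning_rate=5e-06 and per_device_train_batch_size=4). A minimal sketch of wiring the options through could look like the following; ARG_TO_CONFIG and apply_cli_overrides are hypothetical names, and the mapping is my assumption about what each option is meant to control.

```python
import argparse
from typing import Any, Dict

# Assumed mapping from CLI option (argparse attribute name) to TrainingArguments key.
ARG_TO_CONFIG = {
    "epochs": "num_train_epochs",
    "batch_size": "per_device_train_batch_size",
    "learning_rate": "learning_rate",
}


def apply_cli_overrides(config: Dict[str, Any], args: argparse.Namespace) -> Dict[str, Any]:
    """Return a copy of config where CLI options that were given replace the defaults."""
    merged = dict(config)
    for arg_name, config_key in ARG_TO_CONFIG.items():
        value = getattr(args, arg_name, None)
        if value is not None:  # None means the option was not passed
            merged[config_key] = value
    return merged


base = {"learning_rate": 5.0e-06, "num_train_epochs": 1, "per_device_train_batch_size": 4}
cli = argparse.Namespace(epochs=3, batch_size=8, learning_rate=None)
print(apply_cli_overrides(base, cli))
```

For this to work in train.py, the argparse defaults would need to be None, and TrainingArguments(**training_config) would have to be built inside main() after the merge rather than at module level.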

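3.3. Execution program

This program registers and launches the fine-tuning job, then deploys the model, runs inference, and deletes the deployment.
It is the whole program, so it is long.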

import json

from azure.ai.ml import MLClient, Input
from azure.ai.ml import command
from azure.ai.ml.entities import Model, ManagedOnlineEndpoint, ManagedOnlineDeployment, ProbeSettings, OnlineRequestSettings
from azure.ai.ml.constants import AssetTypes
from azure.identity import DefaultAzureCredential
from urllib.error import HTTPError
from urllib.parse import urlparse
from urllib.request import Request, urlopen

credential = DefaultAzureCredential()
subscription_id = "<subscription ID>"
resource_group = "<resource group name>"
workspace = "<workspace name>"
ml_client = MLClient(credential, subscription_id, resource_group, workspace)

job = command(
    inputs=dict(
        train_file=Input(
            type="uri_file",
            path="azureml:phi35-train-data:1",
        ),
        eval_file=Input(
            type="uri_file",
            path="azureml:phi35-eval-data:1",
        ),        
        epoch=1,
        batchsize=64,
        lr = 0.01,
        momentum = 0.9,
        prtfreq = 200,
    ),
    # Saving the model with mlflow.transformers.log_model also took care of creating the output
    # outputs={
    #     "model": Output(type="uri_folder")#, path="azureml://datastores/workspaceblobstore/paths/custom_path/")
    # },
    code="./src",  # local path where the code is stored
    compute = 'gpu-cluster',
    # command="accelerate launch train.py --train-file ${{inputs.train_file}} --eval-file ${{inputs.eval_file}} --epochs ${{inputs.epoch}} --batch-size ${{inputs.batchsize}} --learning-rate ${{inputs.lr}} --momentum ${{inputs.momentum}} --print-freq ${{inputs.prtfreq}} --model-dir ${{outputs.model}}",
    command="accelerate launch train.py --train-file ${{inputs.train_file}} --eval-file ${{inputs.eval_file}} --epochs ${{inputs.epoch}} --batch-size ${{inputs.batchsize}} --learning-rate ${{inputs.lr}} --momentum ${{inputs.momentum}} --print-freq ${{inputs.prtfreq}}",
    environment="llm-training3:1",
    distribution={
        "type": "PyTorch",
        "process_count_per_instance": 1,
    },
)
returned_job  = ml_client.jobs.create_or_update(job)

ml_client.jobs.stream(returned_job.name)

print("Job ID:", returned_job.name)
job = ml_client.jobs.get(name=returned_job.name)
print(job.status)

# Register the model
phi_model = Model(
    # Path of the model saved by mlflow.transformers.log_model in train.py
    path=f"azureml://jobs/{job.name}/outputs/artifacts/model",
    type=AssetTypes.MLFLOW_MODEL,
    name="phi3-finetuned",
    description="Phi3 model fine-tuned on custom dataset",
)

registered_model = ml_client.models.create_or_update(phi_model)


# Create the online endpoint
ENDPOINT_NAME = "test-endpoint-for-phi3"
endpoint = ManagedOnlineEndpoint(
    name=ENDPOINT_NAME,
    description="Online endpoint for test",
    auth_mode="key",
)
ml_client.begin_create_or_update(endpoint).wait()

DEPLOYMENT_NAME = "Deploy-test"

deployment = ManagedOnlineDeployment(
    name=DEPLOYMENT_NAME,
    endpoint_name=ENDPOINT_NAME,
    model=f"{registered_model.name}:{registered_model.version}",
    instance_type="Standard_NC40ads_H100_v5",
    instance_count=1,
    liveness_probe=ProbeSettings(initial_delay=600),
    request_settings=OnlineRequestSettings(request_timeout_ms=90000),
)
ml_client.online_deployments.begin_create_or_update(deployment).wait()

endpoint.traffic = {DEPLOYMENT_NAME: 100}
updated_online_endpoint = ml_client.begin_create_or_update(endpoint).result()

score_url = updated_online_endpoint.scoring_uri
parsed = urlparse(score_url)
uri = f"{str(parsed.scheme)}://{str(parsed.hostname)}"

data = {"input_data": [{"message": "Hello"}]}

body = str.encode(json.dumps(data))
auth_keys = ml_client.online_endpoints.get_keys(name=ENDPOINT_NAME)
headers = {'Content-Type':'application/json', 'Accept': 'application/json', 'Authorization':('Bearer '+ auth_keys.primary_key)}

req = Request(uri+"/score", body, headers)

try:
    response = urlopen(req)

    result = response.read()
    print(result)
except HTTPError as error:
    print("The request failed with status code: " + str(error.code))

    # Print the headers - they include the request ID and the timestamp, which are useful for debugging the failure
    print(error.info())
    print(error.read().decode("utf8", 'ignore'))

ml_client.online_endpoints.begin_delete(name=ENDPOINT_NAME).wait()
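
The request-building part of the program above can be isolated into a small function, which makes it easier to check the headers and body without calling the endpoint. This is only a sketch using the standard library: build_score_request is a hypothetical name, and it assumes the endpoint's scoring_uri already points at the scoring route, so the URL is used as-is instead of being rebuilt from scheme and hostname.

```python
import json
from urllib.request import Request


def build_score_request(scoring_uri: str, api_key: str, payload: dict) -> Request:
    """Build (but do not send) a key-authenticated POST request for an online endpoint."""
    body = json.dumps(payload).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "Accept": "application/json",
        "Authorization": f"Bearer {api_key}",
    }
    return Request(scoring_uri, data=body, headers=headers, method="POST")


# Placeholder URL and key for illustration; nothing is sent over the network here.
req = build_score_request(
    "https://example.invalid/score",
    "dummy-key",
    {"input_data": [{"message": "Hello"}]},
)
print(req.get_method(), req.full_url)
```

Passing the returned object to urlopen, as the program above does, would send it.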

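Below I pick out the key points from the full program above.

3.3.1. Running the fine-tuning job

This is the code that runs the fine-tuning.
The hard part was working out how to save the model so that it connects to the model registration step that follows.
Defining Outputs and having mlflow.transformers.log_model save to that path did not work. mlflow.transformers.log_model appears to create the output automatically, and once I changed the path specified on the model registration side, the model loaded correctly.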

job = command(
    inputs=dict(
        train_file=Input(
            type="uri_file",
            path="azureml:phi35-train-data:1",
        ),
        eval_file=Input(
            type="uri_file",
            path="azureml:phi35-eval-data:1",
        ),        
        epoch=1,
        batchsize=64,
        lr = 0.01,
        momentum = 0.9,
        prtfreq = 200,
    ),
    # Saving the model with mlflow.transformers.log_model also took care of creating the output
    # outputs={
    #     "model": Output(type="uri_folder")#, path="azureml://datastores/workspaceblobstore/paths/custom_path/")
    # },
    code="./src",  # local path where the code is stored
    compute = 'gpu-cluster',
    # command="accelerate launch train.py --train-file ${{inputs.train_file}} --eval-file ${{inputs.eval_file}} --epochs ${{inputs.epoch}} --batch-size ${{inputs.batchsize}} --learning-rate ${{inputs.lr}} --momentum ${{inputs.momentum}} --print-freq ${{inputs.prtfreq}} --model-dir ${{outputs.model}}",
    command="accelerate launch train.py --train-file ${{inputs.train_file}} --eval-file ${{inputs.eval_file}} --epochs ${{inputs.epoch}} --batch-size ${{inputs.batchsize}} --learning-rate ${{inputs.lr}} --momentum ${{inputs.momentum}} --print-freq ${{inputs.prtfreq}}",
    environment="llm-training3:1",
    distribution={
        "type": "PyTorch",
        "process_count_per_instance": 1,
    },
)
returned_job  = ml_client.jobs.create_or_update(job)
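
The link between the artifact_path passed to mlflow.transformers.log_model in train.py and the path given at model registration can be sketched as below. The URI format is the one used in the full program in this article; the helper names job_artifact_uri and register_mlflow_model are hypothetical, and I have not checked this format against other SDK versions.

```python
def job_artifact_uri(job_name: str, artifact_path: str = "model") -> str:
    """Build the azureml:// URI of a model logged under artifact_path by a given job."""
    if not job_name:
        raise ValueError("job_name must not be empty")
    return f"azureml://jobs/{job_name}/outputs/artifacts/{artifact_path.strip('/')}"


def register_mlflow_model(ml_client, job_name: str, model_name: str = "phi3-finetuned"):
    """Register the logged model as an MLflow model through an existing MLClient."""
    # Imported lazily so job_artifact_uri works without the Azure SDK installed.
    from azure.ai.ml.entities import Model

    model = Model(path=job_artifact_uri(job_name), type="mlflow_model", name=model_name)
    return ml_client.models.create_or_update(model)


# "my-job-name" is a placeholder, not a real job ID.
print(job_artifact_uri("my-job-name"))
```

The point is that the last path segment has to match the artifact_path value in train.py, which is "model" in this article.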

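Overview screen after the job completed.
Note: "Registered model" is already shown in this view, but it only appears after the next step.

The model is saved under the "model" path.
Let's look at user_logs/std_log_process_0.txt.

Log (long)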

user_logs/std_log_process_0.txt

The following values were not passed to `accelerate launch` and had defaults used instead:
	`--num_processes` was set to a value of `1`
	`--num_machines` was set to a value of `1`
	`--mixed_precision` was set to a value of `'no'`
	`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
2025-08-07 03:17:45 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1 distributed training: True, 16-bits training: False
2025-08-07 03:17:45 - INFO - __main__ - Training/evaluation parameters TrainingArguments(
_n_gpu=1,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
batch_eval_metrics=False,
bf16=True,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_steps=None,
eval_strategy=no,
eval_use_gather_object=False,
evaluation_strategy=None,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=True,
gradient_checkpointing_kwargs={'use_reentrant': False},
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_for_metrics=[],
include_inputs_for_metrics=False,
include_num_input_tokens_seen=False,
include_tokens_per_second=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=5e-06,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=info,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=./checkpoint_dir/runs/Aug07_03-17-44_e6cf3c2d7500455db8e799b5be01f472000001,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=20,
logging_strategy=steps,
lr_scheduler_kwargs={},
lr_scheduler_type=cosine,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
neftune_noise_alpha=None,
no_cuda=False,
num_train_epochs=1,
optim=adamw_torch,
optim_args=None,
optim_target_modules=None,
output_dir=./checkpoint_dir,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=4,
per_device_train_batch_size=4,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=[],
restore_callback_states_from_checkpoint=False,
resume_from_checkpoint=None,
run_name=./checkpoint_dir,
save_on_each_node=False,
save_only_model=False,
save_safetensors=True,
save_steps=100,
save_strategy=steps,
save_total_limit=1,
seed=0,
skip_memory_metrics=True,
split_batches=None,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torch_empty_cache_steps=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_liger_kernel=False,
use_mps_device=False,
warmup_ratio=0.2,
warmup_steps=0,
weight_decay=0.0,
)
2025-08-07 03:17:45 - INFO - __main__ - PEFT parameters LoraConfig(peft_type=<PeftType.LORA: 'LORA'>, auto_mapping=None, base_model_name_or_path=None, revision=None, task_type='CAUSAL_LM', inference_mode=False, r=16, target_modules='all-linear', lora_alpha=32, lora_dropout=0.05, fan_in_fan_out=False, bias='none', use_rslora=False, modules_to_save=None, init_lora_weights=True, layers_to_transform=None, layers_pattern=None, rank_pattern={}, alpha_pattern={}, megatron_config=None, megatron_core='megatron.core', loftq_config={}, use_dora=False, layer_replication=None, runtime_config=LoraRuntimeConfig(ephemeral_gpu_offload=False))
[INFO|configuration_utils.py:679] 2025-08-07 03:17:45,592 >> loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--microsoft--Phi-3-mini-4k-instruct/snapshots/0a67737cc96d2554230f90338b163bc6380a2a85/config.json
[WARNING|dynamic_module_utils.py:421] 2025-08-07 03:17:45,773 >> A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3-mini-4k-instruct:
- configuration_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
[INFO|configuration_utils.py:679] 2025-08-07 03:17:45,774 >> loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--microsoft--Phi-3-mini-4k-instruct/snapshots/0a67737cc96d2554230f90338b163bc6380a2a85/config.json
[INFO|configuration_utils.py:746] 2025-08-07 03:17:45,775 >> Model config Phi3Config {
  "_name_or_path": "microsoft/Phi-3-mini-4k-instruct",
  "architectures": [
    "Phi3ForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "auto_map": {
    "AutoConfig": "microsoft/Phi-3-mini-4k-instruct--configuration_phi3.Phi3Config",
    "AutoModelForCausalLM": "microsoft/Phi-3-mini-4k-instruct--modeling_phi3.Phi3ForCausalLM"
  },
  "bos_token_id": 1,
  "embd_pdrop": 0.0,
  "eos_token_id": 32000,
  "hidden_act": "silu",
  "hidden_size": 3072,
  "initializer_range": 0.02,
  "intermediate_size": 8192,
  "max_position_embeddings": 4096,
  "model_type": "phi3",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "original_max_position_embeddings": 4096,
  "pad_token_id": 32000,
  "resid_pdrop": 0.0,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "sliding_window": 2047,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.46.1",
  "use_cache": false,
  "vocab_size": 32064
}

[WARNING|dynamic_module_utils.py:421] 2025-08-07 03:17:46,030 >> A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3-mini-4k-instruct:
- modeling_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
[INFO|modeling_utils.py:3937] 2025-08-07 03:17:46,400 >> loading weights file model.safetensors from cache at /root/.cache/huggingface/hub/models--microsoft--Phi-3-mini-4k-instruct/snapshots/0a67737cc96d2554230f90338b163bc6380a2a85/model.safetensors.index.json

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]
Downloading shards:  50%|█████     | 1/2 [00:04<00:04,  4.85s/it]
Downloading shards: 100%|██████████| 2/2 [00:09<00:00,  4.48s/it]
Downloading shards: 100%|██████████| 2/2 [00:09<00:00,  4.54s/it]
[INFO|modeling_utils.py:1670] 2025-08-07 03:17:55,474 >> Instantiating Phi3ForCausalLM model under default dtype torch.bfloat16.
[WARNING|logging.py:328] 2025-08-07 03:17:55,479 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[INFO|configuration_utils.py:1096] 2025-08-07 03:17:55,480 >> Generate config GenerationConfig {
  "bos_token_id": 1,
  "eos_token_id": 32000,
  "pad_token_id": 32000,
  "use_cache": false
}


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards:  50%|█████     | 1/2 [00:00<00:00,  6.19it/s]
Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00,  6.64it/s]
Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00,  6.56it/s]
[INFO|modeling_utils.py:4800] 2025-08-07 03:17:55,852 >> All model checkpoint weights were used when initializing Phi3ForCausalLM.

[INFO|modeling_utils.py:4808] 2025-08-07 03:17:55,852 >> All the weights of Phi3ForCausalLM were initialized from the model checkpoint at microsoft/Phi-3-mini-4k-instruct.
If your task is similar to the task the model of the checkpoint was trained on, you can already use Phi3ForCausalLM for predictions without further training.
[INFO|configuration_utils.py:1051] 2025-08-07 03:17:56,036 >> loading configuration file generation_config.json from cache at /root/.cache/huggingface/hub/models--microsoft--Phi-3-mini-4k-instruct/snapshots/0a67737cc96d2554230f90338b163bc6380a2a85/generation_config.json
[INFO|configuration_utils.py:1096] 2025-08-07 03:17:56,037 >> Generate config GenerationConfig {
  "bos_token_id": 1,
  "eos_token_id": [
    32000,
    32001,
    32007
  ],
  "pad_token_id": 32000
}

[INFO|tokenization_utils_base.py:2211] 2025-08-07 03:17:58,109 >> loading file tokenizer.model from cache at /root/.cache/huggingface/hub/models--microsoft--Phi-3-mini-4k-instruct/snapshots/0a67737cc96d2554230f90338b163bc6380a2a85/tokenizer.model
[INFO|tokenization_utils_base.py:2211] 2025-08-07 03:17:58,109 >> loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--microsoft--Phi-3-mini-4k-instruct/snapshots/0a67737cc96d2554230f90338b163bc6380a2a85/tokenizer.json
[INFO|tokenization_utils_base.py:2211] 2025-08-07 03:17:58,109 >> loading file added_tokens.json from cache at /root/.cache/huggingface/hub/models--microsoft--Phi-3-mini-4k-instruct/snapshots/0a67737cc96d2554230f90338b163bc6380a2a85/added_tokens.json
[INFO|tokenization_utils_base.py:2211] 2025-08-07 03:17:58,109 >> loading file special_tokens_map.json from cache at /root/.cache/huggingface/hub/models--microsoft--Phi-3-mini-4k-instruct/snapshots/0a67737cc96d2554230f90338b163bc6380a2a85/special_tokens_map.json
[INFO|tokenization_utils_base.py:2211] 2025-08-07 03:17:58,109 >> loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--microsoft--Phi-3-mini-4k-instruct/snapshots/0a67737cc96d2554230f90338b163bc6380a2a85/tokenizer_config.json
[INFO|tokenization_utils_base.py:2475] 2025-08-07 03:17:58,149 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Using custom data configuration default-778cde16c49f6168
2025-08-07 03:17:58 - INFO - datasets.builder - Using custom data configuration default-778cde16c49f6168
Loading Dataset Infos from /opt/conda/envs/ptca/lib/python3.10/site-packages/datasets/packaged_modules/json
2025-08-07 03:17:58 - INFO - datasets.info - Loading Dataset Infos from /opt/conda/envs/ptca/lib/python3.10/site-packages/datasets/packaged_modules/json
Generating dataset json (/root/.cache/huggingface/datasets/json/default-778cde16c49f6168/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092)
2025-08-07 03:17:58 - INFO - datasets.builder - Generating dataset json (/root/.cache/huggingface/datasets/json/default-778cde16c49f6168/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092)
Downloading and preparing dataset json/default to /root/.cache/huggingface/datasets/json/default-778cde16c49f6168/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092...
2025-08-07 03:17:58 - INFO - datasets.builder - Downloading and preparing dataset json/default to /root/.cache/huggingface/datasets/json/default-778cde16c49f6168/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092...
Downloading took 0.0 min
2025-08-07 03:17:58 - INFO - datasets.download.download_manager - Downloading took 0.0 min
Checksum Computation took 0.0 min
2025-08-07 03:17:58 - INFO - datasets.download.download_manager - Checksum Computation took 0.0 min
Generating train split
2025-08-07 03:17:59 - INFO - datasets.builder - Generating train split

Generating train split: 0 examples [00:00, ? examples/s]
Generating train split: 3325 examples [00:00, 39425.17 examples/s]
Unable to verify splits sizes.
2025-08-07 03:17:59 - INFO - datasets.utils.info_utils - Unable to verify splits sizes.
Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-778cde16c49f6168/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092. Subsequent calls will reuse this data.
2025-08-07 03:17:59 - INFO - datasets.builder - Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-778cde16c49f6168/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092. Subsequent calls will reuse this data.
Using custom data configuration default-1be9985046771f19
2025-08-07 03:17:59 - INFO - datasets.builder - Using custom data configuration default-1be9985046771f19
Loading Dataset Infos from /opt/conda/envs/ptca/lib/python3.10/site-packages/datasets/packaged_modules/json
2025-08-07 03:17:59 - INFO - datasets.info - Loading Dataset Infos from /opt/conda/envs/ptca/lib/python3.10/site-packages/datasets/packaged_modules/json
Generating dataset json (/root/.cache/huggingface/datasets/json/default-1be9985046771f19/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092)
2025-08-07 03:17:59 - INFO - datasets.builder - Generating dataset json (/root/.cache/huggingface/datasets/json/default-1be9985046771f19/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092)
Downloading and preparing dataset json/default to /root/.cache/huggingface/datasets/json/default-1be9985046771f19/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092...
2025-08-07 03:17:59 - INFO - datasets.builder - Downloading and preparing dataset json/default to /root/.cache/huggingface/datasets/json/default-1be9985046771f19/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092...
Downloading took 0.0 min
2025-08-07 03:17:59 - INFO - datasets.download.download_manager - Downloading took 0.0 min
Checksum Computation took 0.0 min
2025-08-07 03:17:59 - INFO - datasets.download.download_manager - Checksum Computation took 0.0 min
Generating train split
2025-08-07 03:17:59 - INFO - datasets.builder - Generating train split

Generating train split: 0 examples [00:00, ? examples/s]
Generating train split: 832 examples [00:00, 43153.62 examples/s]
Unable to verify splits sizes.
2025-08-07 03:17:59 - INFO - datasets.utils.info_utils - Unable to verify splits sizes.
Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-1be9985046771f19/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092. Subsequent calls will reuse this data.
2025-08-07 03:17:59 - INFO - datasets.builder - Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-1be9985046771f19/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092. Subsequent calls will reuse this data.
Process #0 will write at /root/.cache/huggingface/datasets/json/default-778cde16c49f6168/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-936df8b9bc388c25_00000_of_00010.arrow
2025-08-07 03:17:59 - INFO - datasets.arrow_dataset - Process #0 will write at /root/.cache/huggingface/datasets/json/default-778cde16c49f6168/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-936df8b9bc388c25_00000_of_00010.arrow
Process #1 will write at /root/.cache/huggingface/datasets/json/default-778cde16c49f6168/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-936df8b9bc388c25_00001_of_00010.arrow
2025-08-07 03:17:59 - INFO - datasets.arrow_dataset - Process #1 will write at /root/.cache/huggingface/datasets/json/default-778cde16c49f6168/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-936df8b9bc388c25_00001_of_00010.arrow
Process #2 will write at /root/.cache/huggingface/datasets/json/default-778cde16c49f6168/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-936df8b9bc388c25_00002_of_00010.arrow
2025-08-07 03:17:59 - INFO - datasets.arrow_dataset - Process #2 will write at /root/.cache/huggingface/datasets/json/default-778cde16c49f6168/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-936df8b9bc388c25_00002_of_00010.arrow
Process #3 will write at /root/.cache/huggingface/datasets/json/default-778cde16c49f6168/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-936df8b9bc388c25_00003_of_00010.arrow
2025-08-07 03:17:59 - INFO - datasets.arrow_dataset - Process #3 will write at /root/.cache/huggingface/datasets/json/default-778cde16c49f6168/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-936df8b9bc388c25_00003_of_00010.arrow
Process #4 will write at /root/.cache/huggingface/datasets/json/default-778cde16c49f6168/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-936df8b9bc388c25_00004_of_00010.arrow
2025-08-07 03:17:59 - INFO - datasets.arrow_dataset - Process #4 will write at /root/.cache/huggingface/datasets/json/default-778cde16c49f6168/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-936df8b9bc388c25_00004_of_00010.arrow
Process #5 will write at /root/.cache/huggingface/datasets/json/default-778cde16c49f6168/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-936df8b9bc388c25_00005_of_00010.arrow
2025-08-07 03:17:59 - INFO - datasets.arrow_dataset - Process #5 will write at /root/.cache/huggingface/datasets/json/default-778cde16c49f6168/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-936df8b9bc388c25_00005_of_00010.arrow
Process #6 will write at /root/.cache/huggingface/datasets/json/default-778cde16c49f6168/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-936df8b9bc388c25_00006_of_00010.arrow
2025-08-07 03:17:59 - INFO - datasets.arrow_dataset - Process #6 will write at /root/.cache/huggingface/datasets/json/default-778cde16c49f6168/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-936df8b9bc388c25_00006_of_00010.arrow
Process #7 will write at /root/.cache/huggingface/datasets/json/default-778cde16c49f6168/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-936df8b9bc388c25_00007_of_00010.arrow
2025-08-07 03:17:59 - INFO - datasets.arrow_dataset - Process #7 will write at /root/.cache/huggingface/datasets/json/default-778cde16c49f6168/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-936df8b9bc388c25_00007_of_00010.arrow
Process #8 will write at /root/.cache/huggingface/datasets/json/default-778cde16c49f6168/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-936df8b9bc388c25_00008_of_00010.arrow
2025-08-07 03:17:59 - INFO - datasets.arrow_dataset - Process #8 will write at /root/.cache/huggingface/datasets/json/default-778cde16c49f6168/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-936df8b9bc388c25_00008_of_00010.arrow
Process #9 will write at /root/.cache/huggingface/datasets/json/default-778cde16c49f6168/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-936df8b9bc388c25_00009_of_00010.arrow
2025-08-07 03:17:59 - INFO - datasets.arrow_dataset - Process #9 will write at /root/.cache/huggingface/datasets/json/default-778cde16c49f6168/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-936df8b9bc388c25_00009_of_00010.arrow
Spawning 10 processes
2025-08-07 03:17:59 - INFO - datasets.arrow_dataset - Spawning 10 processes

Applying chat template to train_sft (num_proc=10):   0%|          | 0/3325 [00:00<?, ? examples/s]
Caching processed dataset at /root/.cache/huggingface/datasets/json/default-778cde16c49f6168/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-936df8b9bc388c25_00000_of_00010.arrow
2025-08-07 03:17:59 - INFO - datasets.arrow_dataset - Caching processed dataset at /root/.cache/huggingface/datasets/json/default-778cde16c49f6168/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-936df8b9bc388c25_00000_of_00010.arrow
Caching processed dataset at /root/.cache/huggingface/datasets/json/default-778cde16c49f6168/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-936df8b9bc388c25_00001_of_00010.arrow
2025-08-07 03:17:59 - INFO - datasets.arrow_dataset - Caching processed dataset at /root/.cache/huggingface/datasets/json/default-778cde16c49f6168/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-936df8b9bc388c25_00001_of_00010.arrow
Caching processed dataset at /root/.cache/huggingface/datasets/json/default-778cde16c49f6168/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-936df8b9bc388c25_00002_of_00010.arrow
2025-08-07 03:17:59 - INFO - datasets.arrow_dataset - Caching processed dataset at /root/.cache/huggingface/datasets/json/default-778cde16c49f6168/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-936df8b9bc388c25_00002_of_00010.arrow

Applying chat template to train_sft (num_proc=10):  10%|█         | 333/3325 [00:00<00:01, 1915.39 examples/s]
Caching processed dataset at /root/.cache/huggingface/datasets/json/default-778cde16c49f6168/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-936df8b9bc388c25_00003_of_00010.arrow
2025-08-07 03:17:59 - INFO - datasets.arrow_dataset - Caching processed dataset at /root/.cache/huggingface/datasets/json/default-778cde16c49f6168/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-936df8b9bc388c25_00003_of_00010.arrow
Caching processed dataset at /root/.cache/huggingface/datasets/json/default-778cde16c49f6168/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-936df8b9bc388c25_00004_of_00010.arrow
2025-08-07 03:17:59 - INFO - datasets.arrow_dataset - Caching processed dataset at /root/.cache/huggingface/datasets/json/default-778cde16c49f6168/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-936df8b9bc388c25_00004_of_00010.arrow
Caching processed dataset at /root/.cache/huggingface/datasets/json/default-778cde16c49f6168/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-936df8b9bc388c25_00005_of_00010.arrow
2025-08-07 03:17:59 - INFO - datasets.arrow_dataset - Caching processed dataset at /root/.cache/huggingface/datasets/json/default-778cde16c49f6168/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-936df8b9bc388c25_00005_of_00010.arrow
Caching processed dataset at /root/.cache/huggingface/datasets/json/default-778cde16c49f6168/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-936df8b9bc388c25_00006_of_00010.arrow
2025-08-07 03:17:59 - INFO - datasets.arrow_dataset - Caching processed dataset at /root/.cache/huggingface/datasets/json/default-778cde16c49f6168/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-936df8b9bc388c25_00006_of_00010.arrow
Caching processed dataset at /root/.cache/huggingface/datasets/json/default-778cde16c49f6168/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-936df8b9bc388c25_00007_of_00010.arrow
2025-08-07 03:17:59 - INFO - datasets.arrow_dataset - Caching processed dataset at /root/.cache/huggingface/datasets/json/default-778cde16c49f6168/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-936df8b9bc388c25_00007_of_00010.arrow
Caching processed dataset at /root/.cache/huggingface/datasets/json/default-778cde16c49f6168/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-936df8b9bc388c25_00008_of_00010.arrow
2025-08-07 03:17:59 - INFO - datasets.arrow_dataset - Caching processed dataset at /root/.cache/huggingface/datasets/json/default-778cde16c49f6168/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-936df8b9bc388c25_00008_of_00010.arrow

Applying chat template to train_sft (num_proc=10):  80%|████████  | 2661/3325 [00:00<00:00, 11205.53 examples/s]
Caching processed dataset at /root/.cache/huggingface/datasets/json/default-778cde16c49f6168/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-936df8b9bc388c25_00009_of_00010.arrow
2025-08-07 03:17:59 - INFO - datasets.arrow_dataset - Caching processed dataset at /root/.cache/huggingface/datasets/json/default-778cde16c49f6168/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-936df8b9bc388c25_00009_of_00010.arrow

Applying chat template to train_sft (num_proc=10): 100%|██████████| 3325/3325 [00:00<00:00, 8346.41 examples/s] 
Concatenating 10 shards
2025-08-07 03:18:00 - INFO - datasets.arrow_dataset - Concatenating 10 shards
Process #0 will write at /root/.cache/huggingface/datasets/json/default-1be9985046771f19/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-b0b6c9097b85e021_00000_of_00010.arrow
2025-08-07 03:18:00 - INFO - datasets.arrow_dataset - Process #0 will write at /root/.cache/huggingface/datasets/json/default-1be9985046771f19/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-b0b6c9097b85e021_00000_of_00010.arrow
Process #1 will write at /root/.cache/huggingface/datasets/json/default-1be9985046771f19/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-b0b6c9097b85e021_00001_of_00010.arrow
2025-08-07 03:18:00 - INFO - datasets.arrow_dataset - Process #1 will write at /root/.cache/huggingface/datasets/json/default-1be9985046771f19/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-b0b6c9097b85e021_00001_of_00010.arrow
Process #2 will write at /root/.cache/huggingface/datasets/json/default-1be9985046771f19/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-b0b6c9097b85e021_00002_of_00010.arrow
2025-08-07 03:18:00 - INFO - datasets.arrow_dataset - Process #2 will write at /root/.cache/huggingface/datasets/json/default-1be9985046771f19/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-b0b6c9097b85e021_00002_of_00010.arrow
Process #3 will write at /root/.cache/huggingface/datasets/json/default-1be9985046771f19/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-b0b6c9097b85e021_00003_of_00010.arrow
2025-08-07 03:18:00 - INFO - datasets.arrow_dataset - Process #3 will write at /root/.cache/huggingface/datasets/json/default-1be9985046771f19/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-b0b6c9097b85e021_00003_of_00010.arrow
Process #4 will write at /root/.cache/huggingface/datasets/json/default-1be9985046771f19/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-b0b6c9097b85e021_00004_of_00010.arrow
2025-08-07 03:18:00 - INFO - datasets.arrow_dataset - Process #4 will write at /root/.cache/huggingface/datasets/json/default-1be9985046771f19/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-b0b6c9097b85e021_00004_of_00010.arrow
Process #5 will write at /root/.cache/huggingface/datasets/json/default-1be9985046771f19/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-b0b6c9097b85e021_00005_of_00010.arrow
2025-08-07 03:18:00 - INFO - datasets.arrow_dataset - Process #5 will write at /root/.cache/huggingface/datasets/json/default-1be9985046771f19/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-b0b6c9097b85e021_00005_of_00010.arrow
Process #6 will write at /root/.cache/huggingface/datasets/json/default-1be9985046771f19/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-b0b6c9097b85e021_00006_of_00010.arrow
2025-08-07 03:18:00 - INFO - datasets.arrow_dataset - Process #6 will write at /root/.cache/huggingface/datasets/json/default-1be9985046771f19/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-b0b6c9097b85e021_00006_of_00010.arrow
Process #7 will write at /root/.cache/huggingface/datasets/json/default-1be9985046771f19/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-b0b6c9097b85e021_00007_of_00010.arrow
2025-08-07 03:18:00 - INFO - datasets.arrow_dataset - Process #7 will write at /root/.cache/huggingface/datasets/json/default-1be9985046771f19/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-b0b6c9097b85e021_00007_of_00010.arrow
Process #8 will write at /root/.cache/huggingface/datasets/json/default-1be9985046771f19/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-b0b6c9097b85e021_00008_of_00010.arrow
2025-08-07 03:18:00 - INFO - datasets.arrow_dataset - Process #8 will write at /root/.cache/huggingface/datasets/json/default-1be9985046771f19/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-b0b6c9097b85e021_00008_of_00010.arrow
Process #9 will write at /root/.cache/huggingface/datasets/json/default-1be9985046771f19/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-b0b6c9097b85e021_00009_of_00010.arrow
2025-08-07 03:18:00 - INFO - datasets.arrow_dataset - Process #9 will write at /root/.cache/huggingface/datasets/json/default-1be9985046771f19/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-b0b6c9097b85e021_00009_of_00010.arrow
Spawning 10 processes
2025-08-07 03:18:00 - INFO - datasets.arrow_dataset - Spawning 10 processes

Applying chat template to test_sft (num_proc=10):   0%|          | 0/832 [00:00<?, ? examples/s]
Caching processed dataset at /root/.cache/huggingface/datasets/json/default-1be9985046771f19/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-b0b6c9097b85e021_00000_of_00010.arrow
2025-08-07 03:18:00 - INFO - datasets.arrow_dataset - Caching processed dataset at /root/.cache/huggingface/datasets/json/default-1be9985046771f19/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-b0b6c9097b85e021_00000_of_00010.arrow
Caching processed dataset at /root/.cache/huggingface/datasets/json/default-1be9985046771f19/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-b0b6c9097b85e021_00001_of_00010.arrow
2025-08-07 03:18:00 - INFO - datasets.arrow_dataset - Caching processed dataset at /root/.cache/huggingface/datasets/json/default-1be9985046771f19/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-b0b6c9097b85e021_00001_of_00010.arrow

Applying chat template to test_sft (num_proc=10):  10%|█         | 84/832 [00:00<00:01, 526.85 examples/s]
Caching processed dataset at /root/.cache/huggingface/datasets/json/default-1be9985046771f19/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-b0b6c9097b85e021_00002_of_00010.arrow
2025-08-07 03:18:00 - INFO - datasets.arrow_dataset - Caching processed dataset at /root/.cache/huggingface/datasets/json/default-1be9985046771f19/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-b0b6c9097b85e021_00002_of_00010.arrow
Caching processed dataset at /root/.cache/huggingface/datasets/json/default-1be9985046771f19/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-b0b6c9097b85e021_00003_of_00010.arrow
2025-08-07 03:18:00 - INFO - datasets.arrow_dataset - Caching processed dataset at /root/.cache/huggingface/datasets/json/default-1be9985046771f19/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-b0b6c9097b85e021_00003_of_00010.arrow
Caching processed dataset at /root/.cache/huggingface/datasets/json/default-1be9985046771f19/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-b0b6c9097b85e021_00004_of_00010.arrow
2025-08-07 03:18:00 - INFO - datasets.arrow_dataset - Caching processed dataset at /root/.cache/huggingface/datasets/json/default-1be9985046771f19/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-b0b6c9097b85e021_00004_of_00010.arrow
Caching processed dataset at /root/.cache/huggingface/datasets/json/default-1be9985046771f19/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-b0b6c9097b85e021_00005_of_00010.arrow
2025-08-07 03:18:00 - INFO - datasets.arrow_dataset - Caching processed dataset at /root/.cache/huggingface/datasets/json/default-1be9985046771f19/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-b0b6c9097b85e021_00005_of_00010.arrow
Caching processed dataset at /root/.cache/huggingface/datasets/json/default-1be9985046771f19/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-b0b6c9097b85e021_00006_of_00010.arrow
2025-08-07 03:18:00 - INFO - datasets.arrow_dataset - Caching processed dataset at /root/.cache/huggingface/datasets/json/default-1be9985046771f19/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-b0b6c9097b85e021_00006_of_00010.arrow
Caching processed dataset at /root/.cache/huggingface/datasets/json/default-1be9985046771f19/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-b0b6c9097b85e021_00007_of_00010.arrow
2025-08-07 03:18:00 - INFO - datasets.arrow_dataset - Caching processed dataset at /root/.cache/huggingface/datasets/json/default-1be9985046771f19/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-b0b6c9097b85e021_00007_of_00010.arrow

Applying chat template to test_sft (num_proc=10):  80%|████████  | 666/832 [00:00<00:00, 3002.39 examples/s]
Caching processed dataset at /root/.cache/huggingface/datasets/json/default-1be9985046771f19/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-b0b6c9097b85e021_00008_of_00010.arrow
2025-08-07 03:18:00 - INFO - datasets.arrow_dataset - Caching processed dataset at /root/.cache/huggingface/datasets/json/default-1be9985046771f19/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-b0b6c9097b85e021_00008_of_00010.arrow
Caching processed dataset at /root/.cache/huggingface/datasets/json/default-1be9985046771f19/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-b0b6c9097b85e021_00009_of_00010.arrow
2025-08-07 03:18:00 - INFO - datasets.arrow_dataset - Caching processed dataset at /root/.cache/huggingface/datasets/json/default-1be9985046771f19/0.0.0/f4e89e8750d5d5ffbef2c078bf0ddfedef29dc2faff52a6255cf513c05eb1092/cache-b0b6c9097b85e021_00009_of_00010.arrow

Applying chat template to test_sft (num_proc=10): 100%|██████████| 832/832 [00:00<00:00, 2281.69 examples/s]
Concatenating 10 shards
2025-08-07 03:18:00 - INFO - datasets.arrow_dataset - Concatenating 10 shards
/opt/conda/envs/ptca/lib/python3.10/site-packages/huggingface_hub/utils/_deprecation.py:100: FutureWarning: Deprecated argument(s) used in '__init__': max_seq_length, dataset_text_field, packing. Will not be supported from version '0.13.0'.

Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.
  warnings.warn(message, FutureWarning)
[INFO|training_args.py:2147] 2025-08-07 03:18:00,600 >> PyTorch: setting up devices
/opt/conda/envs/ptca/lib/python3.10/site-packages/trl/trainer/sft_trainer.py:212: UserWarning: You passed a `packing` argument to the SFTTrainer, the value you passed will override the one in the `SFTConfig`.
  warnings.warn(
/opt/conda/envs/ptca/lib/python3.10/site-packages/trl/trainer/sft_trainer.py:300: UserWarning: You passed a `max_seq_length` argument to the SFTTrainer, the value you passed will override the one in the `SFTConfig`.
  warnings.warn(
/opt/conda/envs/ptca/lib/python3.10/site-packages/trl/trainer/sft_trainer.py:328: UserWarning: You passed a `dataset_text_field` argument to the SFTTrainer, the value you passed will override the one in the `SFTConfig`.
  warnings.warn(
Using custom data configuration default-86a243d7579dcc57
2025-08-07 03:18:00 - INFO - datasets.builder - Using custom data configuration default-86a243d7579dcc57
Loading Dataset Infos from /opt/conda/envs/ptca/lib/python3.10/site-packages/datasets/packaged_modules/generator
2025-08-07 03:18:00 - INFO - datasets.info - Loading Dataset Infos from /opt/conda/envs/ptca/lib/python3.10/site-packages/datasets/packaged_modules/generator
Generating dataset generator (/root/.cache/huggingface/datasets/generator/default-86a243d7579dcc57/0.0.0)
2025-08-07 03:18:00 - INFO - datasets.builder - Generating dataset generator (/root/.cache/huggingface/datasets/generator/default-86a243d7579dcc57/0.0.0)
Downloading and preparing dataset generator/default to /root/.cache/huggingface/datasets/generator/default-86a243d7579dcc57/0.0.0...
2025-08-07 03:18:00 - INFO - datasets.builder - Downloading and preparing dataset generator/default to /root/.cache/huggingface/datasets/generator/default-86a243d7579dcc57/0.0.0...
Generating train split
2025-08-07 03:18:00 - INFO - datasets.builder - Generating train split

Generating train split: 0 examples [00:00, ? examples/s]
[WARNING|tokenization_utils_base.py:4089] 2025-08-07 03:18:03,251 >> Token indices sequence length is longer than the specified maximum sequence length for this model (2099 > 2048). Running this sequence through the model will result in indexing errors

Generating train split: 1 examples [00:02,  2.37s/ examples]
Generating train split: 595 examples [00:02, 337.85 examples/s]
Generating train split: 1000 examples [00:05, 216.07 examples/s]
Generating train split: 1618 examples [00:05, 432.08 examples/s]
Generating train split: 2000 examples [00:06, 358.84 examples/s]
Generating train split: 2234 examples [00:06, 335.33 examples/s]
Unable to verify splits sizes.
2025-08-07 03:18:07 - INFO - datasets.utils.info_utils - Unable to verify splits sizes.
Dataset generator downloaded and prepared to /root/.cache/huggingface/datasets/generator/default-86a243d7579dcc57/0.0.0. Subsequent calls will reuse this data.
2025-08-07 03:18:07 - INFO - datasets.builder - Dataset generator downloaded and prepared to /root/.cache/huggingface/datasets/generator/default-86a243d7579dcc57/0.0.0. Subsequent calls will reuse this data.
Using custom data configuration default-2adadf842f27ad3d
2025-08-07 03:18:07 - INFO - datasets.builder - Using custom data configuration default-2adadf842f27ad3d
Loading Dataset Infos from /opt/conda/envs/ptca/lib/python3.10/site-packages/datasets/packaged_modules/generator
2025-08-07 03:18:07 - INFO - datasets.info - Loading Dataset Infos from /opt/conda/envs/ptca/lib/python3.10/site-packages/datasets/packaged_modules/generator
Generating dataset generator (/root/.cache/huggingface/datasets/generator/default-2adadf842f27ad3d/0.0.0)
2025-08-07 03:18:07 - INFO - datasets.builder - Generating dataset generator (/root/.cache/huggingface/datasets/generator/default-2adadf842f27ad3d/0.0.0)
Downloading and preparing dataset generator/default to /root/.cache/huggingface/datasets/generator/default-2adadf842f27ad3d/0.0.0...
2025-08-07 03:18:07 - INFO - datasets.builder - Downloading and preparing dataset generator/default to /root/.cache/huggingface/datasets/generator/default-2adadf842f27ad3d/0.0.0...
Generating train split
2025-08-07 03:18:07 - INFO - datasets.builder - Generating train split

Generating train split: 0 examples [00:00, ? examples/s]
Generating train split: 1 examples [00:01,  1.57s/ examples]
Generating train split: 560 examples [00:01, 465.61 examples/s]
Generating train split: 566 examples [00:01, 330.02 examples/s]
Unable to verify splits sizes.
2025-08-07 03:18:09 - INFO - datasets.utils.info_utils - Unable to verify splits sizes.
Dataset generator downloaded and prepared to /root/.cache/huggingface/datasets/generator/default-2adadf842f27ad3d/0.0.0. Subsequent calls will reuse this data.
2025-08-07 03:18:09 - INFO - datasets.builder - Dataset generator downloaded and prepared to /root/.cache/huggingface/datasets/generator/default-2adadf842f27ad3d/0.0.0. Subsequent calls will reuse this data.
e6cf3c2d7500455db8e799b5be01f472000001:91:91 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
e6cf3c2d7500455db8e799b5be01f472000001:91:91 [0] NCCL INFO Bootstrap : Using eth0:10.0.0.5<0>
e6cf3c2d7500455db8e799b5be01f472000001:91:91 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
e6cf3c2d7500455db8e799b5be01f472000001:91:91 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v6 (v6)
e6cf3c2d7500455db8e799b5be01f472000001:91:91 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
e6cf3c2d7500455db8e799b5be01f472000001:91:91 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
e6cf3c2d7500455db8e799b5be01f472000001:91:91 [0] NCCL INFO cudaDriverVersion 12060
NCCL version 2.19.4+cuda12.4
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Plugin Path : /opt/nccl-rdma-sharp-plugins/lib/libnccl-net.so
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO P2P plugin IBext
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO NET/IB : No device found.
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO NET/Socket : Using [0]eth0:10.0.0.5<0>
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Using non-device net plugin version 0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Using network Socket
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO comm 0x310774a0 rank 0 nranks 1 cudaDev 0 nvmlDev 0 busId 100000 commId 0x3707de77de238827 - Init START
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO NCCL_TOPO_FILE set by environment to /opt/microsoft/ndv5-topo.xml

e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] graph/xml.cc:315 NCCL WARN Could not open XML topology file /opt/microsoft/ndv5-topo.xml : No such file or directory
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0004:00/VMBUS:00/00000041-0001-0000-3130-444532333231/pci0001:00/0001:00:00.0/../max_link_speed, ignoring
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0004:00/VMBUS:00/00000041-0001-0000-3130-444532333231/pci0001:00/0001:00:00.0/../max_link_width, ignoring
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Topology detection: network path /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0004:00/VMBUS:00/6045bd62-5ebf-6045-bd62-5ebf6045bd62 is not a PCI device (vmbus). Attaching to first CPU
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO === System : maxBw 5000.0 totalBw 0.0 ===
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO CPU/0 (1/2/-1)
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO + PCI[5000.0] - NIC/0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO + PCI[48.0] - GPU/100000 (0)
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO ==========================================
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO GPU/100000 :GPU/100000 (0/5000.000000/LOC) CPU/0 (1/48.000000/PHB) 
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffffffff
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 16, bw 60.000000/60.000000, type LOC/PIX, sameChannels 1
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO  0 : GPU/0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO  1 : GPU/0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO  2 : GPU/0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO  3 : GPU/0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO  4 : GPU/0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO  5 : GPU/0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO  6 : GPU/0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO  7 : GPU/0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO  8 : GPU/0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO  9 : GPU/0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO 10 : GPU/0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO 11 : GPU/0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO 12 : GPU/0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO 13 : GPU/0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO 14 : GPU/0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO 15 : GPU/0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Pattern 3, crossNic 0, nChannels 16, bw 60.000000/60.000000, type LOC/PIX, sameChannels 1
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO  0 : GPU/0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO  1 : GPU/0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO  2 : GPU/0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO  3 : GPU/0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO  4 : GPU/0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO  5 : GPU/0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO  6 : GPU/0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO  7 : GPU/0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO  8 : GPU/0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO  9 : GPU/0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO 10 : GPU/0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO 11 : GPU/0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO 12 : GPU/0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO 13 : GPU/0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO 14 : GPU/0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO 15 : GPU/0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Tree 0 : -1 -> 0 -> -1/-1/-1
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Tree 16 : -1 -> 0 -> -1/-1/-1
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Tree 1 : -1 -> 0 -> -1/-1/-1
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Tree 17 : -1 -> 0 -> -1/-1/-1
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Tree 2 : -1 -> 0 -> -1/-1/-1
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Tree 18 : -1 -> 0 -> -1/-1/-1
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Tree 3 : -1 -> 0 -> -1/-1/-1
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Tree 19 : -1 -> 0 -> -1/-1/-1
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Tree 4 : -1 -> 0 -> -1/-1/-1
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Tree 20 : -1 -> 0 -> -1/-1/-1
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Tree 5 : -1 -> 0 -> -1/-1/-1
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Tree 21 : -1 -> 0 -> -1/-1/-1
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Tree 6 : -1 -> 0 -> -1/-1/-1
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Tree 22 : -1 -> 0 -> -1/-1/-1
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Tree 7 : -1 -> 0 -> -1/-1/-1
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Tree 23 : -1 -> 0 -> -1/-1/-1
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Tree 8 : -1 -> 0 -> -1/-1/-1
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Tree 24 : -1 -> 0 -> -1/-1/-1
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Tree 9 : -1 -> 0 -> -1/-1/-1
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Tree 25 : -1 -> 0 -> -1/-1/-1
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Tree 10 : -1 -> 0 -> -1/-1/-1
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Tree 26 : -1 -> 0 -> -1/-1/-1
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Tree 11 : -1 -> 0 -> -1/-1/-1
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Tree 27 : -1 -> 0 -> -1/-1/-1
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Tree 12 : -1 -> 0 -> -1/-1/-1
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Tree 28 : -1 -> 0 -> -1/-1/-1
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Tree 13 : -1 -> 0 -> -1/-1/-1
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Tree 29 : -1 -> 0 -> -1/-1/-1
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Tree 14 : -1 -> 0 -> -1/-1/-1
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Tree 30 : -1 -> 0 -> -1/-1/-1
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Tree 15 : -1 -> 0 -> -1/-1/-1
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Tree 31 : -1 -> 0 -> -1/-1/-1
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Channel 00/32 :    0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Channel 01/32 :    0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Channel 02/32 :    0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Channel 03/32 :    0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Channel 04/32 :    0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Channel 05/32 :    0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Channel 06/32 :    0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Channel 07/32 :    0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Channel 08/32 :    0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Channel 09/32 :    0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Channel 10/32 :    0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Channel 11/32 :    0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Channel 12/32 :    0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Channel 13/32 :    0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Channel 14/32 :    0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Channel 15/32 :    0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Channel 16/32 :    0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Channel 17/32 :    0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Channel 18/32 :    0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Channel 19/32 :    0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Channel 20/32 :    0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Channel 21/32 :    0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Channel 22/32 :    0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Channel 23/32 :    0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Channel 24/32 :    0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Channel 25/32 :    0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Channel 26/32 :    0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Channel 27/32 :    0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Channel 28/32 :    0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Channel 29/32 :    0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Channel 30/32 :    0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Channel 31/32 :    0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Ring 00 : 0 -> 0 -> 0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Ring 01 : 0 -> 0 -> 0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Ring 02 : 0 -> 0 -> 0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Ring 03 : 0 -> 0 -> 0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Ring 04 : 0 -> 0 -> 0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Ring 05 : 0 -> 0 -> 0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Ring 06 : 0 -> 0 -> 0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Ring 07 : 0 -> 0 -> 0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Ring 08 : 0 -> 0 -> 0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Ring 09 : 0 -> 0 -> 0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Ring 10 : 0 -> 0 -> 0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Ring 11 : 0 -> 0 -> 0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Ring 12 : 0 -> 0 -> 0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Ring 13 : 0 -> 0 -> 0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Ring 14 : 0 -> 0 -> 0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Ring 15 : 0 -> 0 -> 0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Ring 16 : 0 -> 0 -> 0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Ring 17 : 0 -> 0 -> 0
[2025-08-07 03:18:10,761] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[INFO|trainer.py:698] 2025-08-07 03:18:11,827 >> Using auto half precision backend
[INFO|trainer.py:2313] 2025-08-07 03:18:12,092 >> ***** Running training *****
[INFO|trainer.py:2314] 2025-08-07 03:18:12,092 >>   Num examples = 2,234
[INFO|trainer.py:2315] 2025-08-07 03:18:12,092 >>   Num Epochs = 1
[INFO|trainer.py:2316] 2025-08-07 03:18:12,092 >>   Instantaneous batch size per device = 4
[INFO|trainer.py:2319] 2025-08-07 03:18:12,092 >>   Total train batch size (w. parallel, distributed & accumulation) = 4
[INFO|trainer.py:2320] 2025-08-07 03:18:12,092 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:2321] 2025-08-07 03:18:12,092 >>   Total optimization steps = 559
[INFO|trainer.py:2322] 2025-08-07 03:18:12,094 >>   Number of trainable parameters = 25,165,824

  0%|          | 0/559 [00:00<?, ?it/s]
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Ring 18 : 0 -> 0 -> 0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Ring 19 : 0 -> 0 -> 0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Ring 20 : 0 -> 0 -> 0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Ring 21 : 0 -> 0 -> 0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Ring 22 : 0 -> 0 -> 0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Ring 23 : 0 -> 0 -> 0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Ring 24 : 0 -> 0 -> 0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Ring 25 : 0 -> 0 -> 0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Ring 26 : 0 -> 0 -> 0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Ring 27 : 0 -> 0 -> 0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Ring 28 : 0 -> 0 -> 0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Ring 29 : 0 -> 0 -> 0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Ring 30 : 0 -> 0 -> 0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Ring 31 : 0 -> 0 -> 0
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 [2] -1/-1/-1->0->-1 [3] -1/-1/-1->0->-1 [4] -1/-1/-1->0->-1 [5] -1/-1/-1->0->-1 [6] -1/-1/-1->0->-1 [7] -1/-1/-1->0->-1 [8] -1/-1/-1->0->-1 [9] -1/-1/-1->0->-1 [10] -1/-1/-1->0->-1 [11] -1/-1/-1->0->-1 [12] -1/-1/-1->0->-1 [13] -1/-1/-1->0->-1 [14] -1/-1/-1->0->-1 [15] -1/-1/-1->0->-1 [16] -1/-1/-1->0->-1 [17] -1/-1/-1->0->-1 [18] -1/-1/-1->0->-1 [19] -1/-1/-1->0->-1 [20] -1/-1/-1->0->-1 [21] -1/-1/-1->0->-1 [22] -1/-1/-1->0->-1 [23] -1/-1/-1->0->-1 [24] -1/-1/-1->0->-1 [25] -1/-1/-1->0->-1 [26] -1/-1/-1->0->-1 [27] -1/-1/-1->0->-1 [28] -1/-1/-1->0->-1 [29] -1/-1/-1->0->-1 [30] -1/-1/-1->0->-1 [31] -1/-1/-1->0->-1
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO P2P Chunksize set to 131072
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Connected all rings
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Connected all trees
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO 32 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO MSCCL: No external scheduler found, using internal implementation
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO MSCCL: Internal Scheduler will use /usr/lib/x86_64-linux-gnu/msccl-algorithms as algorithm directory and /usr/lib/x86_64-linux-gnu/../share/nccl/msccl-algorithms as share algorithm directory and /usr/share/nccl/msccl-algorithms as package installed share algorithm directory 
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO Using MSCCL Algo files from /usr/share/nccl/msccl-algorithms
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO MSCCL: Initialization finished
e6cf3c2d7500455db8e799b5be01f472000001:91:283 [0] NCCL INFO comm 0x310774a0 rank 0 nranks 1 cudaDev 0 nvmlDev 0 busId 100000 commId 0x3707de77de238827 - Init COMPLETE
[rank0]:[W807 03:18:12.932341718 reducer.cpp:1400] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]

  0%|          | 1/559 [00:01<11:55,  1.28s/it]
  0%|          | 2/559 [00:02<10:34,  1.14s/it]
  1%|          | 3/559 [00:03<10:10,  1.10s/it]
  1%|          | 4/559 [00:04<09:59,  1.08s/it]
  1%|          | 5/559 [00:05<09:52,  1.07s/it]
  1%|          | 6/559 [00:06<09:48,  1.06s/it]
  1%|▏         | 7/559 [00:07<09:45,  1.06s/it]
  1%|▏         | 8/559 [00:08<09:43,  1.06s/it]
  2%|▏         | 9/559 [00:09<09:42,  1.06s/it]
  2%|▏         | 10/559 [00:10<09:40,  1.06s/it]
  2%|▏         | 11/559 [00:11<09:39,  1.06s/it]
  2%|▏         | 12/559 [00:12<09:38,  1.06s/it]
  2%|▏         | 13/559 [00:13<09:37,  1.06s/it]
  3%|▎         | 14/559 [00:14<09:37,  1.06s/it]
  3%|▎         | 15/559 [00:16<09:36,  1.06s/it]
  3%|▎         | 16/559 [00:17<09:35,  1.06s/it]
  3%|▎         | 17/559 [00:18<09:34,  1.06s/it]
  3%|▎         | 18/559 [00:19<09:33,  1.06s/it]
  3%|▎         | 19/559 [00:20<09:33,  1.06s/it]
  4%|▎         | 20/559 [00:21<09:32,  1.06s/it]
  4%|▍         | 21/559 [00:22<09:31,  1.06s/it]
  4%|▍         | 22/559 [00:23<09:30,  1.06s/it]
  4%|▍         | 23/559 [00:24<09:30,  1.06s/it]
  4%|▍         | 24/559 [00:25<09:29,  1.06s/it]
  4%|▍         | 25/559 [00:26<09:29,  1.07s/it]
  5%|▍         | 26/559 [00:27<09:27,  1.07s/it]
  5%|▍         | 27/559 [00:28<09:26,  1.07s/it]
  5%|▌         | 28/559 [00:29<09:26,  1.07s/it]
  5%|▌         | 29/559 [00:30<09:25,  1.07s/it]
  5%|▌         | 30/559 [00:32<09:52,  1.12s/it]
  6%|▌         | 31/559 [00:33<09:43,  1.11s/it]
  6%|▌         | 32/559 [00:34<09:37,  1.10s/it]
  6%|▌         | 33/559 [00:35<09:32,  1.09s/it]
  6%|▌         | 34/559 [00:36<09:28,  1.08s/it]
  6%|▋         | 35/559 [00:37<09:25,  1.08s/it]
  6%|▋         | 36/559 [00:38<09:24,  1.08s/it]
  7%|▋         | 37/559 [00:39<09:21,  1.08s/it]
  7%|▋         | 38/559 [00:40<09:19,  1.07s/it]
  7%|▋         | 39/559 [00:41<09:17,  1.07s/it]
  7%|▋         | 40/559 [00:42<09:16,  1.07s/it]
  7%|▋         | 41/559 [00:43<09:14,  1.07s/it]
  8%|▊         | 42/559 [00:45<09:13,  1.07s/it]
  8%|▊         | 43/559 [00:46<09:13,  1.07s/it]
  8%|▊         | 44/559 [00:47<09:12,  1.07s/it]
  8%|▊         | 45/559 [00:48<09:11,  1.07s/it]
  8%|▊         | 46/559 [00:49<09:11,  1.07s/it]
  8%|▊         | 47/559 [00:50<09:10,  1.07s/it]
  9%|▊         | 48/559 [00:51<09:08,  1.07s/it]
  9%|▉         | 49/559 [00:52<09:08,  1.08s/it]
  9%|▉         | 50/559 [00:53<09:07,  1.08s/it]
  9%|▉         | 51/559 [00:54<09:06,  1.08s/it]
  9%|▉         | 52/559 [00:55<09:05,  1.08s/it]
  9%|▉         | 53/559 [00:56<09:04,  1.08s/it]
 10%|▉         | 54/559 [00:57<09:03,  1.08s/it]
 10%|▉         | 55/559 [00:59<09:02,  1.08s/it]
 10%|█         | 56/559 [01:00<09:02,  1.08s/it]
 10%|█         | 57/559 [01:01<09:01,  1.08s/it]
 10%|█         | 58/559 [01:02<09:00,  1.08s/it]
 11%|█         | 59/559 [01:03<09:26,  1.13s/it]
 11%|█         | 60/559 [01:04<09:18,  1.12s/it]
 11%|█         | 61/559 [01:05<09:12,  1.11s/it]
 11%|█         | 62/559 [01:06<09:06,  1.10s/it]
 11%|█▏        | 63/559 [01:07<09:03,  1.10s/it]
 11%|█▏        | 64/559 [01:08<09:00,  1.09s/it]
 12%|█▏        | 65/559 [01:10<08:58,  1.09s/it]
 12%|█▏        | 66/559 [01:11<08:57,  1.09s/it]
 12%|█▏        | 67/559 [01:12<08:55,  1.09s/it]
 12%|█▏        | 68/559 [01:13<08:54,  1.09s/it]
 12%|█▏        | 69/559 [01:14<08:53,  1.09s/it]
 13%|█▎        | 70/559 [01:15<08:52,  1.09s/it]
 13%|█▎        | 71/559 [01:16<08:51,  1.09s/it]
 13%|█▎        | 72/559 [01:17<08:50,  1.09s/it]
 13%|█▎        | 73/559 [01:18<08:50,  1.09s/it]
 13%|█▎        | 74/559 [01:19<08:49,  1.09s/it]
 13%|█▎        | 75/559 [01:20<08:48,  1.09s/it]
 14%|█▎        | 76/559 [01:22<08:47,  1.09s/it]
 14%|█▍        | 77/559 [01:23<08:46,  1.09s/it]
 14%|█▍        | 78/559 [01:24<08:45,  1.09s/it]
 14%|█▍        | 79/559 [01:25<08:45,  1.09s/it]
 14%|█▍        | 80/559 [01:26<08:43,  1.09s/it]
 14%|█▍        | 81/559 [01:27<08:43,  1.09s/it]
 15%|█▍        | 82/559 [01:28<08:42,  1.09s/it]
 15%|█▍        | 83/559 [01:29<08:41,  1.10s/it]
 15%|█▌        | 84/559 [01:30<08:40,  1.10s/it]
 15%|█▌        | 85/559 [01:31<08:39,  1.10s/it]
 15%|█▌        | 86/559 [01:32<08:38,  1.10s/it]
 16%|█▌        | 87/559 [01:34<08:37,  1.10s/it]
 16%|█▌        | 88/559 [01:35<08:36,  1.10s/it]
 16%|█▌        | 89/559 [01:36<08:35,  1.10s/it]
 16%|█▌        | 90/559 [01:37<08:35,  1.10s/it]
 16%|█▋        | 91/559 [01:38<08:34,  1.10s/it]
 16%|█▋        | 92/559 [01:39<08:33,  1.10s/it]
 17%|█▋        | 93/559 [01:40<08:32,  1.10s/it]
 17%|█▋        | 94/559 [01:41<08:56,  1.15s/it]
 17%|█▋        | 95/559 [01:43<08:47,  1.14s/it]
 17%|█▋        | 96/559 [01:44<08:41,  1.13s/it]
 17%|█▋        | 97/559 [01:45<08:36,  1.12s/it]
 18%|█▊        | 98/559 [01:46<08:33,  1.11s/it]
 18%|█▊        | 99/559 [01:47<08:30,  1.11s/it]
 18%|█▊        | 100/559 [01:48<08:27,  1.11s/it]
[INFO|trainer.py:3801] 2025-08-07 03:20:00,667 >> Saving model checkpoint to ./checkpoint_dir/checkpoint-100
[INFO|configuration_utils.py:679] 2025-08-07 03:20:01,025 >> loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--microsoft--Phi-3-mini-4k-instruct/snapshots/0a67737cc96d2554230f90338b163bc6380a2a85/config.json
[INFO|configuration_utils.py:746] 2025-08-07 03:20:01,025 >> Model config Phi3Config {
  "_name_or_path": "Phi-3-mini-4k-instruct",
  "architectures": [
    "Phi3ForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "auto_map": {
    "AutoConfig": "microsoft/Phi-3-mini-4k-instruct--configuration_phi3.Phi3Config",
    "AutoModelForCausalLM": "microsoft/Phi-3-mini-4k-instruct--modeling_phi3.Phi3ForCausalLM"
  },
  "bos_token_id": 1,
  "embd_pdrop": 0.0,
  "eos_token_id": 32000,
  "hidden_act": "silu",
  "hidden_size": 3072,
  "initializer_range": 0.02,
  "intermediate_size": 8192,
  "max_position_embeddings": 4096,
  "model_type": "phi3",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "original_max_position_embeddings": 4096,
  "pad_token_id": 32000,
  "resid_pdrop": 0.0,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "sliding_window": 2047,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.46.1",
  "use_cache": true,
  "vocab_size": 32064
}

[INFO|tokenization_utils_base.py:2646] 2025-08-07 03:20:01,182 >> tokenizer config file saved in ./checkpoint_dir/checkpoint-100/tokenizer_config.json
[INFO|tokenization_utils_base.py:2655] 2025-08-07 03:20:01,182 >> Special tokens file saved in ./checkpoint_dir/checkpoint-100/special_tokens_map.json
/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]

 18%|█▊        | 101/559 [01:50<10:13,  1.34s/it]
 18%|█▊        | 102/559 [01:51<09:39,  1.27s/it]
 18%|█▊        | 103/559 [01:52<09:15,  1.22s/it]
 19%|█▊        | 104/559 [01:53<08:57,  1.18s/it]
 19%|█▉        | 105/559 [01:54<08:45,  1.16s/it]
 19%|█▉        | 106/559 [01:55<08:36,  1.14s/it]
 19%|█▉        | 107/559 [01:57<08:29,  1.13s/it]
 19%|█▉        | 108/559 [01:58<08:25,  1.12s/it]
 19%|█▉        | 109/559 [01:59<08:21,  1.11s/it]
 20%|█▉        | 110/559 [02:00<08:18,  1.11s/it]
 20%|█▉        | 111/559 [02:01<08:15,  1.11s/it]
 20%|██        | 112/559 [02:02<08:14,  1.11s/it]
 20%|██        | 113/559 [02:03<08:12,  1.10s/it]
 20%|██        | 114/559 [02:04<08:10,  1.10s/it]
 21%|██        | 115/559 [02:05<08:09,  1.10s/it]
 21%|██        | 116/559 [02:06<08:08,  1.10s/it]
 21%|██        | 117/559 [02:08<08:07,  1.10s/it]
 21%|██        | 118/559 [02:09<08:05,  1.10s/it]
 21%|██▏       | 119/559 [02:10<08:04,  1.10s/it]
 21%|██▏       | 120/559 [02:11<08:03,  1.10s/it]
 22%|██▏       | 121/559 [02:12<08:02,  1.10s/it]
 22%|██▏       | 122/559 [02:13<08:24,  1.15s/it]
 22%|██▏       | 123/559 [02:14<08:16,  1.14s/it]
 22%|██▏       | 124/559 [02:15<08:10,  1.13s/it]
 22%|██▏       | 125/559 [02:17<08:06,  1.12s/it]
 23%|██▎       | 126/559 [02:18<08:03,  1.12s/it]
 23%|██▎       | 127/559 [02:19<08:00,  1.11s/it]
 23%|██▎       | 128/559 [02:20<07:57,  1.11s/it]
 23%|██▎       | 129/559 [02:21<07:55,  1.11s/it]
 23%|██▎       | 130/559 [02:22<07:54,  1.11s/it]
 23%|██▎       | 131/559 [02:23<07:52,  1.11s/it]
 24%|██▎       | 132/559 [02:24<07:51,  1.10s/it]
 24%|██▍       | 133/559 [02:25<07:50,  1.11s/it]
 24%|██▍       | 134/559 [02:26<07:49,  1.10s/it]
 24%|██▍       | 135/559 [02:28<07:48,  1.10s/it]
 24%|██▍       | 136/559 [02:29<07:47,  1.10s/it]
 25%|██▍       | 137/559 [02:30<07:45,  1.10s/it]
 25%|██▍       | 138/559 [02:31<07:44,  1.10s/it]
 25%|██▍       | 139/559 [02:32<07:43,  1.10s/it]
 25%|██▌       | 140/559 [02:33<07:42,  1.10s/it]
 25%|██▌       | 141/559 [02:34<07:41,  1.10s/it]
 25%|██▌       | 142/559 [02:35<07:40,  1.10s/it]
 26%|██▌       | 143/559 [02:36<07:39,  1.10s/it]
 26%|██▌       | 144/559 [02:38<07:37,  1.10s/it]
 26%|██▌       | 145/559 [02:39<07:37,  1.10s/it]
 26%|██▌       | 146/559 [02:40<07:36,  1.10s/it]
 26%|██▋       | 147/559 [02:41<07:35,  1.11s/it]
 26%|██▋       | 148/559 [02:42<07:34,  1.11s/it]
 27%|██▋       | 149/559 [02:43<07:33,  1.11s/it]
 27%|██▋       | 150/559 [02:44<07:32,  1.11s/it]
 27%|██▋       | 151/559 [02:45<07:30,  1.11s/it]
 27%|██▋       | 152/559 [02:46<07:29,  1.11s/it]
 27%|██▋       | 153/559 [02:47<07:28,  1.11s/it]
 28%|██▊       | 154/559 [02:49<07:27,  1.11s/it]
 28%|██▊       | 155/559 [02:50<07:26,  1.10s/it]
 28%|██▊       | 156/559 [02:51<07:25,  1.11s/it]
 28%|██▊       | 157/559 [02:52<07:46,  1.16s/it]
 28%|██▊       | 158/559 [02:53<07:38,  1.14s/it]
 28%|██▊       | 159/559 [02:54<07:32,  1.13s/it]
 29%|██▊       | 160/559 [02:55<07:28,  1.12s/it]
 29%|██▉       | 161/559 [02:56<07:24,  1.12s/it]
 29%|██▉       | 162/559 [02:58<07:22,  1.11s/it]
 29%|██▉       | 163/559 [02:59<07:20,  1.11s/it]
 29%|██▉       | 164/559 [03:00<07:18,  1.11s/it]
 30%|██▉       | 165/559 [03:01<07:17,  1.11s/it]
 30%|██▉       | 166/559 [03:02<07:15,  1.11s/it]
 30%|██▉       | 167/559 [03:03<07:14,  1.11s/it]
 30%|███       | 168/559 [03:04<07:13,  1.11s/it]
 30%|███       | 169/559 [03:05<07:11,  1.11s/it]
 30%|███       | 170/559 [03:06<07:10,  1.11s/it]
 31%|███       | 171/559 [03:08<07:09,  1.11s/it]
 31%|███       | 172/559 [03:09<07:08,  1.11s/it]
 31%|███       | 173/559 [03:10<07:07,  1.11s/it]
 31%|███       | 174/559 [03:11<07:05,  1.11s/it]
 31%|███▏      | 175/559 [03:12<07:04,  1.11s/it]
 31%|███▏      | 176/559 [03:13<07:03,  1.11s/it]
 32%|███▏      | 177/559 [03:14<07:02,  1.11s/it]
 32%|███▏      | 178/559 [03:15<07:01,  1.11s/it]
 32%|███▏      | 179/559 [03:16<07:00,  1.11s/it]
 32%|███▏      | 180/559 [03:18<06:58,  1.11s/it]
 32%|███▏      | 181/559 [03:19<06:57,  1.11s/it]
 33%|███▎      | 182/559 [03:20<06:56,  1.11s/it]
 33%|███▎      | 183/559 [03:21<06:55,  1.11s/it]
 33%|███▎      | 184/559 [03:22<06:54,  1.11s/it]
 33%|███▎      | 185/559 [03:23<06:53,  1.11s/it]
 33%|███▎      | 186/559 [03:24<06:52,  1.11s/it]
 33%|███▎      | 187/559 [03:25<06:51,  1.11s/it]
 34%|███▎      | 188/559 [03:27<07:09,  1.16s/it]
 34%|███▍      | 189/559 [03:28<07:03,  1.14s/it]
 34%|███▍      | 190/559 [03:29<06:57,  1.13s/it]
 34%|███▍      | 191/559 [03:30<06:53,  1.12s/it]
 34%|███▍      | 192/559 [03:31<06:50,  1.12s/it]
 35%|███▍      | 193/559 [03:32<06:48,  1.11s/it]
 35%|███▍      | 194/559 [03:33<06:46,  1.11s/it]
 35%|███▍      | 195/559 [03:34<06:44,  1.11s/it]
 35%|███▌      | 196/559 [03:35<06:42,  1.11s/it]
 35%|███▌      | 197/559 [03:36<06:41,  1.11s/it]
 35%|███▌      | 198/559 [03:38<06:39,  1.11s/it]
 36%|███▌      | 199/559 [03:39<06:38,  1.11s/it]
 36%|███▌      | 200/559 [03:40<06:37,  1.11s/it]
[INFO|trainer.py:3801] 2025-08-07 03:21:52,412 >> Saving model checkpoint to ./checkpoint_dir/checkpoint-200
[INFO|configuration_utils.py:679] 2025-08-07 03:21:52,767 >> loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--microsoft--Phi-3-mini-4k-instruct/snapshots/0a67737cc96d2554230f90338b163bc6380a2a85/config.json
[INFO|configuration_utils.py:746] 2025-08-07 03:21:52,768 >> Model config Phi3Config {
  "_name_or_path": "Phi-3-mini-4k-instruct",
  "architectures": [
    "Phi3ForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "auto_map": {
    "AutoConfig": "microsoft/Phi-3-mini-4k-instruct--configuration_phi3.Phi3Config",
    "AutoModelForCausalLM": "microsoft/Phi-3-mini-4k-instruct--modeling_phi3.Phi3ForCausalLM"
  },
  "bos_token_id": 1,
  "embd_pdrop": 0.0,
  "eos_token_id": 32000,
  "hidden_act": "silu",
  "hidden_size": 3072,
  "initializer_range": 0.02,
  "intermediate_size": 8192,
  "max_position_embeddings": 4096,
  "model_type": "phi3",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "original_max_position_embeddings": 4096,
  "pad_token_id": 32000,
  "resid_pdrop": 0.0,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "sliding_window": 2047,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.46.1",
  "use_cache": true,
  "vocab_size": 32064
}

[INFO|tokenization_utils_base.py:2646] 2025-08-07 03:21:52,921 >> tokenizer config file saved in ./checkpoint_dir/checkpoint-200/tokenizer_config.json
[INFO|tokenization_utils_base.py:2655] 2025-08-07 03:21:52,922 >> Special tokens file saved in ./checkpoint_dir/checkpoint-200/special_tokens_map.json
[INFO|trainer.py:3893] 2025-08-07 03:21:53,189 >> Deleting older checkpoint [checkpoint_dir/checkpoint-100] due to args.save_total_limit
/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]

 36%|███▌      | 201/559 [03:42<08:04,  1.35s/it]
 36%|███▌      | 202/559 [03:43<07:36,  1.28s/it]
 36%|███▋      | 203/559 [03:44<07:16,  1.23s/it]
 36%|███▋      | 204/559 [03:45<07:02,  1.19s/it]
 37%|███▋      | 205/559 [03:46<06:52,  1.17s/it]
 37%|███▋      | 206/559 [03:47<06:45,  1.15s/it]
 37%|███▋      | 207/559 [03:48<06:39,  1.14s/it]
 37%|███▋      | 208/559 [03:49<06:35,  1.13s/it]
 37%|███▋      | 209/559 [03:51<06:32,  1.12s/it]
 38%|███▊      | 210/559 [03:52<06:29,  1.12s/it]
 38%|███▊      | 211/559 [03:53<06:27,  1.11s/it]
 38%|███▊      | 212/559 [03:54<06:25,  1.11s/it]
 38%|███▊      | 213/559 [03:55<06:23,  1.11s/it]
 38%|███▊      | 214/559 [03:56<06:22,  1.11s/it]
 38%|███▊      | 215/559 [03:57<06:21,  1.11s/it]
 39%|███▊      | 216/559 [03:59<06:37,  1.16s/it]
 39%|███▉      | 217/559 [04:00<06:31,  1.14s/it]
 39%|███▉      | 218/559 [04:01<06:26,  1.13s/it]
 39%|███▉      | 219/559 [04:02<06:22,  1.12s/it]
 39%|███▉      | 220/559 [04:03<06:19,  1.12s/it]
 40%|███▉      | 221/559 [04:04<06:16,  1.12s/it]
 40%|███▉      | 222/559 [04:05<06:14,  1.11s/it]
 40%|███▉      | 223/559 [04:06<06:13,  1.11s/it]
 40%|████      | 224/559 [04:07<06:11,  1.11s/it]
 40%|████      | 225/559 [04:08<06:10,  1.11s/it]
 40%|████      | 226/559 [04:10<06:08,  1.11s/it]
 41%|████      | 227/559 [04:11<06:07,  1.11s/it]
 41%|████      | 228/559 [04:12<06:06,  1.11s/it]
 41%|████      | 229/559 [04:13<06:05,  1.11s/it]
 41%|████      | 230/559 [04:14<06:04,  1.11s/it]
 41%|████▏     | 231/559 [04:15<06:02,  1.11s/it]
 42%|████▏     | 232/559 [04:16<06:01,  1.11s/it]
 42%|████▏     | 233/559 [04:17<06:00,  1.11s/it]
 42%|████▏     | 234/559 [04:18<05:59,  1.11s/it]
 42%|████▏     | 235/559 [04:20<05:58,  1.11s/it]
 42%|████▏     | 236/559 [04:21<05:57,  1.11s/it]
 42%|████▏     | 237/559 [04:22<05:56,  1.11s/it]
 43%|████▎     | 238/559 [04:23<05:55,  1.11s/it]
 43%|████▎     | 239/559 [04:24<05:54,  1.11s/it]
 43%|████▎     | 240/559 [04:25<05:53,  1.11s/it]
 43%|████▎     | 241/559 [04:26<05:52,  1.11s/it]
 43%|████▎     | 242/559 [04:27<05:51,  1.11s/it]
 43%|████▎     | 243/559 [04:28<05:50,  1.11s/it]
 44%|████▎     | 244/559 [04:29<05:48,  1.11s/it]
 44%|████▍     | 245/559 [04:31<05:47,  1.11s/it]
 44%|████▍     | 246/559 [04:32<05:46,  1.11s/it]
 44%|████▍     | 247/559 [04:33<05:45,  1.11s/it]
 44%|████▍     | 248/559 [04:34<05:44,  1.11s/it]
 45%|████▍     | 249/559 [04:35<05:43,  1.11s/it]
 45%|████▍     | 250/559 [04:36<05:42,  1.11s/it]
 45%|████▍     | 251/559 [04:37<05:57,  1.16s/it]
 45%|████▌     | 252/559 [04:39<05:51,  1.15s/it]
 45%|████▌     | 253/559 [04:40<05:46,  1.13s/it]
 45%|████▌     | 254/559 [04:41<05:43,  1.13s/it]
 46%|████▌     | 255/559 [04:42<05:40,  1.12s/it]
 46%|████▌     | 256/559 [04:43<05:38,  1.12s/it]
 46%|████▌     | 257/559 [04:44<05:36,  1.11s/it]
 46%|████▌     | 258/559 [04:45<05:34,  1.11s/it]
 46%|████▋     | 259/559 [04:46<05:33,  1.11s/it]
 47%|████▋     | 260/559 [04:47<05:31,  1.11s/it]
 47%|████▋     | 261/559 [04:49<05:30,  1.11s/it]
 47%|████▋     | 262/559 [04:50<05:29,  1.11s/it]
 47%|████▋     | 263/559 [04:51<05:28,  1.11s/it]
 47%|████▋     | 264/559 [04:52<05:26,  1.11s/it]
 47%|████▋     | 265/559 [04:53<05:25,  1.11s/it]
 48%|████▊     | 266/559 [04:54<05:24,  1.11s/it]
 48%|████▊     | 267/559 [04:55<05:23,  1.11s/it]
 48%|████▊     | 268/559 [04:56<05:22,  1.11s/it]
 48%|████▊     | 269/559 [04:57<05:21,  1.11s/it]
 48%|████▊     | 270/559 [04:58<05:20,  1.11s/it]
 48%|████▊     | 271/559 [05:00<05:18,  1.11s/it]
 49%|████▊     | 272/559 [05:01<05:17,  1.11s/it]
 49%|████▉     | 273/559 [05:02<05:16,  1.11s/it]
 49%|████▉     | 274/559 [05:03<05:15,  1.11s/it]
 49%|████▉     | 275/559 [05:04<05:14,  1.11s/it]
 49%|████▉     | 276/559 [05:05<05:13,  1.11s/it]
 50%|████▉     | 277/559 [05:06<05:12,  1.11s/it]
 50%|████▉     | 278/559 [05:07<05:10,  1.11s/it]
 50%|████▉     | 279/559 [05:08<05:09,  1.11s/it]
 50%|█████     | 280/559 [05:10<05:08,  1.11s/it]
 50%|█████     | 281/559 [05:11<05:07,  1.11s/it]
 50%|█████     | 282/559 [05:12<05:06,  1.11s/it]
 51%|█████     | 283/559 [05:13<05:19,  1.16s/it]
 51%|█████     | 284/559 [05:14<05:14,  1.14s/it]
 51%|█████     | 285/559 [05:15<05:10,  1.13s/it]
 51%|█████     | 286/559 [05:16<05:06,  1.12s/it]
 51%|█████▏    | 287/559 [05:17<05:04,  1.12s/it]
 52%|█████▏    | 288/559 [05:19<05:02,  1.12s/it]
 52%|█████▏    | 289/559 [05:20<05:00,  1.11s/it]
 52%|█████▏    | 290/559 [05:21<04:58,  1.11s/it]
 52%|█████▏    | 291/559 [05:22<04:57,  1.11s/it]
 52%|█████▏    | 292/559 [05:23<04:55,  1.11s/it]
 52%|█████▏    | 293/559 [05:24<04:54,  1.11s/it]
 53%|█████▎    | 294/559 [05:25<04:53,  1.11s/it]
 53%|█████▎    | 295/559 [05:26<04:52,  1.11s/it]
 53%|█████▎    | 296/559 [05:27<04:51,  1.11s/it]
 53%|█████▎    | 297/559 [05:29<04:49,  1.11s/it]
 53%|█████▎    | 298/559 [05:30<04:48,  1.11s/it]
 53%|█████▎    | 299/559 [05:31<04:47,  1.11s/it]
 54%|█████▎    | 300/559 [05:32<04:46,  1.11s/it]
[INFO|trainer.py:3801] 2025-08-07 03:23:44,435 >> Saving model checkpoint to ./checkpoint_dir/checkpoint-300
[INFO|configuration_utils.py:679] 2025-08-07 03:23:44,792 >> loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--microsoft--Phi-3-mini-4k-instruct/snapshots/0a67737cc96d2554230f90338b163bc6380a2a85/config.json
[INFO|configuration_utils.py:746] 2025-08-07 03:23:44,793 >> Model config Phi3Config {
  "_name_or_path": "Phi-3-mini-4k-instruct",
  "architectures": [
    "Phi3ForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "auto_map": {
    "AutoConfig": "microsoft/Phi-3-mini-4k-instruct--configuration_phi3.Phi3Config",
    "AutoModelForCausalLM": "microsoft/Phi-3-mini-4k-instruct--modeling_phi3.Phi3ForCausalLM"
  },
  "bos_token_id": 1,
  "embd_pdrop": 0.0,
  "eos_token_id": 32000,
  "hidden_act": "silu",
  "hidden_size": 3072,
  "initializer_range": 0.02,
  "intermediate_size": 8192,
  "max_position_embeddings": 4096,
  "model_type": "phi3",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "original_max_position_embeddings": 4096,
  "pad_token_id": 32000,
  "resid_pdrop": 0.0,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "sliding_window": 2047,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.46.1",
  "use_cache": true,
  "vocab_size": 32064
}

[INFO|tokenization_utils_base.py:2646] 2025-08-07 03:23:44,945 >> tokenizer config file saved in ./checkpoint_dir/checkpoint-300/tokenizer_config.json
[INFO|tokenization_utils_base.py:2655] 2025-08-07 03:23:44,945 >> Special tokens file saved in ./checkpoint_dir/checkpoint-300/special_tokens_map.json
[INFO|trainer.py:3893] 2025-08-07 03:23:45,231 >> Deleting older checkpoint [checkpoint_dir/checkpoint-200] due to args.save_total_limit
/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]

 54%|█████▍    | 301/559 [05:34<05:50,  1.36s/it]
 54%|█████▍    | 302/559 [05:35<05:29,  1.28s/it]
 54%|█████▍    | 303/559 [05:36<05:14,  1.23s/it]
 54%|█████▍    | 304/559 [05:37<05:03,  1.19s/it]
 55%|█████▍    | 305/559 [05:38<04:56,  1.17s/it]
 55%|█████▍    | 306/559 [05:39<04:50,  1.15s/it]
 55%|█████▍    | 307/559 [05:40<04:45,  1.13s/it]
 55%|█████▌    | 308/559 [05:42<04:42,  1.13s/it]
 55%|█████▌    | 309/559 [05:43<04:39,  1.12s/it]
 55%|█████▌    | 310/559 [05:44<04:37,  1.12s/it]
 56%|█████▌    | 311/559 [05:45<04:36,  1.11s/it]
 56%|█████▌    | 312/559 [05:46<04:47,  1.17s/it]
 56%|█████▌    | 313/559 [05:47<04:42,  1.15s/it]
 56%|█████▌    | 314/559 [05:48<04:38,  1.13s/it]
 56%|█████▋    | 315/559 [05:49<04:34,  1.13s/it]
 57%|█████▋    | 316/559 [05:51<04:32,  1.12s/it]
 57%|█████▋    | 317/559 [05:52<04:30,  1.12s/it]
 57%|█████▋    | 318/559 [05:53<04:28,  1.11s/it]
 57%|█████▋    | 319/559 [05:54<04:26,  1.11s/it]
 57%|█████▋    | 320/559 [05:55<04:25,  1.11s/it]
 57%|█████▋    | 321/559 [05:56<04:23,  1.11s/it]
 58%|█████▊    | 322/559 [05:57<04:22,  1.11s/it]
 58%|█████▊    | 323/559 [05:58<04:21,  1.11s/it]
 58%|█████▊    | 324/559 [05:59<04:20,  1.11s/it]
 58%|█████▊    | 325/559 [06:01<04:19,  1.11s/it]
 58%|█████▊    | 326/559 [06:02<04:17,  1.11s/it]
 58%|█████▊    | 327/559 [06:03<04:16,  1.11s/it]
 59%|█████▊    | 328/559 [06:04<04:15,  1.11s/it]
 59%|█████▉    | 329/559 [06:05<04:14,  1.11s/it]
 59%|█████▉    | 330/559 [06:06<04:13,  1.11s/it]
 59%|█████▉    | 331/559 [06:07<04:12,  1.11s/it]
 59%|█████▉    | 332/559 [06:08<04:11,  1.11s/it]
 60%|█████▉    | 333/559 [06:09<04:09,  1.11s/it]
 60%|█████▉    | 334/559 [06:10<04:08,  1.11s/it]
 60%|█████▉    | 335/559 [06:12<04:07,  1.11s/it]
 60%|██████    | 336/559 [06:13<04:06,  1.11s/it]
 60%|██████    | 337/559 [06:14<04:05,  1.11s/it]
 60%|██████    | 338/559 [06:15<04:04,  1.11s/it]
 61%|██████    | 339/559 [06:16<04:03,  1.11s/it]
 61%|██████    | 340/559 [06:17<04:02,  1.11s/it]
 61%|██████    | 341/559 [06:18<04:01,  1.11s/it]
 61%|██████    | 342/559 [06:19<03:59,  1.11s/it]
 61%|██████▏   | 343/559 [06:20<03:58,  1.11s/it]
 62%|██████▏   | 344/559 [06:22<03:57,  1.11s/it]
 62%|██████▏   | 345/559 [06:23<04:07,  1.16s/it]
 62%|██████▏   | 346/559 [06:24<04:03,  1.14s/it]
 62%|██████▏   | 347/559 [06:25<03:59,  1.13s/it]
 62%|██████▏   | 348/559 [06:26<03:57,  1.12s/it]
 62%|██████▏   | 349/559 [06:27<03:54,  1.12s/it]
 63%|██████▎   | 350/559 [06:28<03:52,  1.11s/it]
 63%|██████▎   | 351/559 [06:29<03:51,  1.11s/it]
 63%|██████▎   | 352/559 [06:31<03:50,  1.11s/it]
 63%|██████▎   | 353/559 [06:32<03:48,  1.11s/it]
 63%|██████▎   | 354/559 [06:33<03:47,  1.11s/it]
 64%|██████▎   | 355/559 [06:34<03:46,  1.11s/it]
 64%|██████▎   | 356/559 [06:35<03:45,  1.11s/it]
 64%|██████▍   | 357/559 [06:36<03:43,  1.11s/it]
 64%|██████▍   | 358/559 [06:37<03:42,  1.11s/it]
 64%|██████▍   | 359/559 [06:38<03:41,  1.11s/it]
 64%|██████▍   | 360/559 [06:39<03:40,  1.11s/it]
 65%|██████▍   | 361/559 [06:40<03:39,  1.11s/it]
 65%|██████▍   | 362/559 [06:42<03:37,  1.11s/it]
 65%|██████▍   | 363/559 [06:43<03:36,  1.11s/it]
 65%|██████▌   | 364/559 [06:44<03:35,  1.11s/it]
 65%|██████▌   | 365/559 [06:45<03:34,  1.11s/it]
 65%|██████▌   | 366/559 [06:46<03:33,  1.11s/it]
 66%|██████▌   | 367/559 [06:47<03:32,  1.11s/it]
 66%|██████▌   | 368/559 [06:48<03:31,  1.11s/it]
 66%|██████▌   | 369/559 [06:49<03:30,  1.11s/it]
 66%|██████▌   | 370/559 [06:50<03:29,  1.11s/it]
 66%|██████▋   | 371/559 [06:52<03:28,  1.11s/it]
 67%|██████▋   | 372/559 [06:53<03:26,  1.11s/it]
 67%|██████▋   | 373/559 [06:54<03:25,  1.11s/it]
 67%|██████▋   | 374/559 [06:55<03:24,  1.11s/it]
 67%|██████▋   | 375/559 [06:56<03:23,  1.11s/it]
 67%|██████▋   | 376/559 [06:57<03:22,  1.11s/it]
 67%|██████▋   | 377/559 [06:58<03:31,  1.16s/it]
 68%|██████▊   | 378/559 [06:59<03:27,  1.15s/it]
 68%|██████▊   | 379/559 [07:01<03:24,  1.13s/it]
 68%|██████▊   | 380/559 [07:02<03:21,  1.13s/it]
 68%|██████▊   | 381/559 [07:03<03:19,  1.12s/it]
 68%|██████▊   | 382/559 [07:04<03:17,  1.12s/it]
 69%|██████▊   | 383/559 [07:05<03:15,  1.11s/it]
 69%|██████▊   | 384/559 [07:06<03:14,  1.11s/it]
 69%|██████▉   | 385/559 [07:07<03:13,  1.11s/it]
 69%|██████▉   | 386/559 [07:08<03:11,  1.11s/it]
 69%|██████▉   | 387/559 [07:09<03:10,  1.11s/it]
 69%|██████▉   | 388/559 [07:11<03:09,  1.11s/it]
 70%|██████▉   | 389/559 [07:12<03:08,  1.11s/it]
 70%|██████▉   | 390/559 [07:13<03:07,  1.11s/it]
 70%|██████▉   | 391/559 [07:14<03:06,  1.11s/it]
 70%|███████   | 392/559 [07:15<03:04,  1.11s/it]
 70%|███████   | 393/559 [07:16<03:03,  1.11s/it]
 70%|███████   | 394/559 [07:17<03:02,  1.11s/it]
 71%|███████   | 395/559 [07:18<03:01,  1.11s/it]
 71%|███████   | 396/559 [07:19<03:00,  1.11s/it]
 71%|███████   | 397/559 [07:21<02:59,  1.11s/it]
 71%|███████   | 398/559 [07:22<02:58,  1.11s/it]
 71%|███████▏  | 399/559 [07:23<02:57,  1.11s/it]
 72%|███████▏  | 400/559 [07:24<02:56,  1.11s/it]
[INFO|trainer.py:3801] 2025-08-07 03:25:36,447 >> Saving model checkpoint to ./checkpoint_dir/checkpoint-400
[INFO|configuration_utils.py:679] 2025-08-07 03:25:36,803 >> loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--microsoft--Phi-3-mini-4k-instruct/snapshots/0a67737cc96d2554230f90338b163bc6380a2a85/config.json
[INFO|tokenization_utils_base.py:2646] 2025-08-07 03:25:36,957 >> tokenizer config file saved in ./checkpoint_dir/checkpoint-400/tokenizer_config.json
[INFO|tokenization_utils_base.py:2655] 2025-08-07 03:25:36,957 >> Special tokens file saved in ./checkpoint_dir/checkpoint-400/special_tokens_map.json
[INFO|trainer.py:3893] 2025-08-07 03:25:37,224 >> Deleting older checkpoint [checkpoint_dir/checkpoint-300] due to args.save_total_limit
/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]

 72%|███████▏  | 401/559 [07:26<03:33,  1.35s/it]
 72%|███████▏  | 402/559 [07:27<03:20,  1.28s/it]
 72%|███████▏  | 403/559 [07:28<03:11,  1.23s/it]
 72%|███████▏  | 404/559 [07:29<03:04,  1.19s/it]
 72%|███████▏  | 405/559 [07:30<03:07,  1.22s/it]
 73%|███████▎  | 406/559 [07:31<03:01,  1.18s/it]
 73%|███████▎  | 407/559 [07:33<02:56,  1.16s/it]
 73%|███████▎  | 408/559 [07:34<02:52,  1.14s/it]
 73%|███████▎  | 409/559 [07:35<02:49,  1.13s/it]
 73%|███████▎  | 410/559 [07:36<02:47,  1.12s/it]
 74%|███████▎  | 411/559 [07:37<02:45,  1.12s/it]
 74%|███████▎  | 412/559 [07:38<02:43,  1.11s/it]
 74%|███████▍  | 413/559 [07:39<02:42,  1.11s/it]
 74%|███████▍  | 414/559 [07:40<02:40,  1.11s/it]
 74%|███████▍  | 415/559 [07:41<02:39,  1.11s/it]
 74%|███████▍  | 416/559 [07:43<02:38,  1.11s/it]
 75%|███████▍  | 417/559 [07:44<02:37,  1.11s/it]
 75%|███████▍  | 418/559 [07:45<02:36,  1.11s/it]
 75%|███████▍  | 419/559 [07:46<02:35,  1.11s/it]
 75%|███████▌  | 420/559 [07:47<02:34,  1.11s/it]
 75%|███████▌  | 421/559 [07:48<02:32,  1.11s/it]
 75%|███████▌  | 422/559 [07:49<02:31,  1.11s/it]
 76%|███████▌  | 423/559 [07:50<02:30,  1.11s/it]
 76%|███████▌  | 424/559 [07:51<02:29,  1.11s/it]
 76%|███████▌  | 425/559 [07:53<02:28,  1.11s/it]
 76%|███████▌  | 426/559 [07:54<02:27,  1.11s/it]
 76%|███████▋  | 427/559 [07:55<02:26,  1.11s/it]
 77%|███████▋  | 428/559 [07:56<02:25,  1.11s/it]
 77%|███████▋  | 429/559 [07:57<02:23,  1.11s/it]
 77%|███████▋  | 430/559 [07:58<02:22,  1.11s/it]
 77%|███████▋  | 431/559 [07:59<02:21,  1.11s/it]
 77%|███████▋  | 432/559 [08:00<02:20,  1.11s/it]
 77%|███████▋  | 433/559 [08:01<02:19,  1.11s/it]
 78%|███████▊  | 434/559 [08:02<02:18,  1.11s/it]
 78%|███████▊  | 435/559 [08:04<02:17,  1.11s/it]
 78%|███████▊  | 436/559 [08:05<02:16,  1.11s/it]
 78%|███████▊  | 437/559 [08:06<02:15,  1.11s/it]
 78%|███████▊  | 438/559 [08:07<02:14,  1.11s/it]
 79%|███████▊  | 439/559 [08:08<02:12,  1.11s/it]
 79%|███████▊  | 440/559 [08:09<02:18,  1.16s/it]
 79%|███████▉  | 441/559 [08:10<02:15,  1.15s/it]
 79%|███████▉  | 442/559 [08:12<02:12,  1.13s/it]
 79%|███████▉  | 443/559 [08:13<02:10,  1.13s/it]
 79%|███████▉  | 444/559 [08:14<02:08,  1.12s/it]
 80%|███████▉  | 445/559 [08:15<02:07,  1.12s/it]
 80%|███████▉  | 446/559 [08:16<02:05,  1.11s/it]
 80%|███████▉  | 447/559 [08:17<02:04,  1.11s/it]
 80%|████████  | 448/559 [08:18<02:03,  1.11s/it]
 80%|████████  | 449/559 [08:19<02:02,  1.11s/it]
 81%|████████  | 450/559 [08:20<02:00,  1.11s/it]
 81%|████████  | 451/559 [08:21<01:59,  1.11s/it]
 81%|████████  | 452/559 [08:23<01:58,  1.11s/it]
 81%|████████  | 453/559 [08:24<01:57,  1.11s/it]
 81%|████████  | 454/559 [08:25<01:56,  1.11s/it]
 81%|████████▏ | 455/559 [08:26<01:55,  1.11s/it]
 82%|████████▏ | 456/559 [08:27<01:54,  1.11s/it]
 82%|████████▏ | 457/559 [08:28<01:52,  1.11s/it]
 82%|████████▏ | 458/559 [08:29<01:51,  1.11s/it]
 82%|████████▏ | 459/559 [08:30<01:50,  1.11s/it]
 82%|████████▏ | 460/559 [08:31<01:49,  1.11s/it]
 82%|████████▏ | 461/559 [08:33<01:48,  1.11s/it]
 83%|████████▎ | 462/559 [08:34<01:47,  1.11s/it]
 83%|████████▎ | 463/559 [08:35<01:46,  1.11s/it]
 83%|████████▎ | 464/559 [08:36<01:45,  1.11s/it]
 83%|████████▎ | 465/559 [08:37<01:44,  1.11s/it]
 83%|████████▎ | 466/559 [08:38<01:42,  1.11s/it]
 84%|████████▎ | 467/559 [08:39<01:41,  1.11s/it]
 84%|████████▎ | 468/559 [08:40<01:40,  1.11s/it]
 84%|████████▍ | 469/559 [08:41<01:39,  1.11s/it]
 84%|████████▍ | 470/559 [08:42<01:38,  1.11s/it]
 84%|████████▍ | 471/559 [08:44<01:41,  1.16s/it]
 84%|████████▍ | 472/559 [08:45<01:39,  1.14s/it]
 85%|████████▍ | 473/559 [08:46<01:37,  1.13s/it]
 85%|████████▍ | 474/559 [08:47<01:35,  1.12s/it]
 85%|████████▍ | 475/559 [08:48<01:34,  1.12s/it]
 85%|████████▌ | 476/559 [08:49<01:32,  1.12s/it]
 85%|████████▌ | 477/559 [08:50<01:31,  1.11s/it]
 86%|████████▌ | 478/559 [08:52<01:30,  1.11s/it]
 86%|████████▌ | 479/559 [08:53<01:28,  1.11s/it]
 86%|████████▌ | 480/559 [08:54<01:27,  1.11s/it]
 86%|████████▌ | 481/559 [08:55<01:26,  1.11s/it]
 86%|████████▌ | 482/559 [08:56<01:25,  1.11s/it]
 86%|████████▋ | 483/559 [08:57<01:24,  1.11s/it]
 87%|████████▋ | 484/559 [08:58<01:23,  1.11s/it]
 87%|████████▋ | 485/559 [08:59<01:21,  1.11s/it]
 87%|████████▋ | 486/559 [09:00<01:20,  1.11s/it]
 87%|████████▋ | 487/559 [09:01<01:19,  1.11s/it]
 87%|████████▋ | 488/559 [09:03<01:18,  1.11s/it]
 87%|████████▋ | 489/559 [09:04<01:17,  1.11s/it]
 88%|████████▊ | 490/559 [09:05<01:16,  1.11s/it]
 88%|████████▊ | 491/559 [09:06<01:15,  1.11s/it]
 88%|████████▊ | 492/559 [09:07<01:14,  1.11s/it]
 88%|████████▊ | 493/559 [09:08<01:13,  1.11s/it]
 88%|████████▊ | 494/559 [09:09<01:11,  1.11s/it]
 89%|████████▊ | 495/559 [09:10<01:10,  1.11s/it]
 89%|████████▊ | 496/559 [09:11<01:09,  1.11s/it]
 89%|████████▉ | 497/559 [09:13<01:08,  1.11s/it]
 89%|████████▉ | 498/559 [09:14<01:07,  1.11s/it]
 89%|████████▉ | 499/559 [09:15<01:09,  1.16s/it]
 89%|████████▉ | 500/559 [09:16<01:07,  1.14s/it]
[INFO|trainer.py:3801] 2025-08-07 03:27:28,655 >> Saving model checkpoint to ./checkpoint_dir/checkpoint-500
[INFO|configuration_utils.py:679] 2025-08-07 03:27:29,016 >> loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--microsoft--Phi-3-mini-4k-instruct/snapshots/0a67737cc96d2554230f90338b163bc6380a2a85/config.json
[INFO|tokenization_utils_base.py:2646] 2025-08-07 03:27:29,172 >> tokenizer config file saved in ./checkpoint_dir/checkpoint-500/tokenizer_config.json
[INFO|tokenization_utils_base.py:2655] 2025-08-07 03:27:29,172 >> Special tokens file saved in ./checkpoint_dir/checkpoint-500/special_tokens_map.json
[INFO|trainer.py:3893] 2025-08-07 03:27:29,440 >> Deleting older checkpoint [checkpoint_dir/checkpoint-400] due to args.save_total_limit
/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]

 90%|████████▉ | 501/559 [09:18<01:20,  1.38s/it]
 90%|████████▉ | 502/559 [09:19<01:14,  1.30s/it]
 90%|████████▉ | 503/559 [09:20<01:09,  1.24s/it]
 90%|█████████ | 504/559 [09:21<01:06,  1.20s/it]
 90%|█████████ | 505/559 [09:22<01:03,  1.17s/it]
 91%|█████████ | 506/559 [09:24<01:01,  1.15s/it]
 91%|█████████ | 507/559 [09:25<00:59,  1.14s/it]
 91%|█████████ | 508/559 [09:26<00:57,  1.13s/it]
 91%|█████████ | 509/559 [09:27<00:56,  1.12s/it]
 91%|█████████ | 510/559 [09:28<00:54,  1.12s/it]
 91%|█████████▏| 511/559 [09:29<00:53,  1.11s/it]
 92%|█████████▏| 512/559 [09:30<00:52,  1.11s/it]
 92%|█████████▏| 513/559 [09:31<00:51,  1.11s/it]
 92%|█████████▏| 514/559 [09:32<00:49,  1.11s/it]
 92%|█████████▏| 515/559 [09:33<00:48,  1.11s/it]
 92%|█████████▏| 516/559 [09:35<00:47,  1.11s/it]
 92%|█████████▏| 517/559 [09:36<00:46,  1.11s/it]
 93%|█████████▎| 518/559 [09:37<00:45,  1.11s/it]
 93%|█████████▎| 519/559 [09:38<00:44,  1.11s/it]
 93%|█████████▎| 520/559 [09:39<00:43,  1.11s/it]
 93%|█████████▎| 521/559 [09:40<00:42,  1.11s/it]
 93%|█████████▎| 522/559 [09:41<00:40,  1.11s/it]
 94%|█████████▎| 523/559 [09:42<00:39,  1.11s/it]
 94%|█████████▎| 524/559 [09:43<00:38,  1.11s/it]
 94%|█████████▍| 525/559 [09:45<00:37,  1.11s/it]
 94%|█████████▍| 526/559 [09:46<00:36,  1.11s/it]
 94%|█████████▍| 527/559 [09:47<00:37,  1.16s/it]
 94%|█████████▍| 528/559 [09:48<00:35,  1.14s/it]
 95%|█████████▍| 529/559 [09:49<00:33,  1.13s/it]
 95%|█████████▍| 530/559 [09:50<00:32,  1.13s/it]
 95%|█████████▍| 531/559 [09:51<00:31,  1.12s/it]
 95%|█████████▌| 532/559 [09:52<00:30,  1.12s/it]
 95%|█████████▌| 533/559 [09:54<00:28,  1.11s/it]
 96%|█████████▌| 534/559 [09:55<00:27,  1.11s/it]
 96%|█████████▌| 535/559 [09:56<00:26,  1.11s/it]
 96%|█████████▌| 536/559 [09:57<00:25,  1.11s/it]
 96%|█████████▌| 537/559 [09:58<00:24,  1.11s/it]
 96%|█████████▌| 538/559 [09:59<00:23,  1.11s/it]
 96%|█████████▋| 539/559 [10:00<00:22,  1.11s/it]
 97%|█████████▋| 540/559 [10:01<00:21,  1.11s/it]
 97%|█████████▋| 541/559 [10:02<00:19,  1.11s/it]
 97%|█████████▋| 542/559 [10:04<00:18,  1.11s/it]
 97%|█████████▋| 543/559 [10:05<00:17,  1.11s/it]
 97%|█████████▋| 544/559 [10:06<00:16,  1.11s/it]
 97%|█████████▋| 545/559 [10:07<00:15,  1.11s/it]
 98%|█████████▊| 546/559 [10:08<00:14,  1.11s/it]
 98%|█████████▊| 547/559 [10:09<00:13,  1.11s/it]
 98%|█████████▊| 548/559 [10:10<00:12,  1.11s/it]
 98%|█████████▊| 549/559 [10:11<00:11,  1.11s/it]
 98%|█████████▊| 550/559 [10:12<00:09,  1.11s/it]
 99%|█████████▊| 551/559 [10:14<00:08,  1.11s/it]
 99%|█████████▊| 552/559 [10:15<00:07,  1.11s/it]
 99%|█████████▉| 553/559 [10:16<00:06,  1.11s/it]
 99%|█████████▉| 554/559 [10:17<00:05,  1.11s/it]
 99%|█████████▉| 555/559 [10:18<00:04,  1.11s/it]
 99%|█████████▉| 556/559 [10:19<00:03,  1.11s/it]
100%|█████████▉| 557/559 [10:20<00:02,  1.11s/it]
100%|█████████▉| 558/559 [10:21<00:01,  1.11s/it]
100%|██████████| 559/559 [10:22<00:00,  1.06it/s]
[INFO|trainer.py:3801] 2025-08-07 03:28:34,421 >> Saving model checkpoint to ./checkpoint_dir/checkpoint-559
[INFO|configuration_utils.py:679] 2025-08-07 03:28:34,789 >> loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--microsoft--Phi-3-mini-4k-instruct/snapshots/0a67737cc96d2554230f90338b163bc6380a2a85/config.json
[INFO|tokenization_utils_base.py:2646] 2025-08-07 03:28:34,942 >> tokenizer config file saved in ./checkpoint_dir/checkpoint-559/tokenizer_config.json
[INFO|tokenization_utils_base.py:2655] 2025-08-07 03:28:34,942 >> Special tokens file saved in ./checkpoint_dir/checkpoint-559/special_tokens_map.json
[INFO|trainer.py:3893] 2025-08-07 03:28:35,207 >> Deleting older checkpoint [checkpoint_dir/checkpoint-500] due to args.save_total_limit
[INFO|trainer.py:2584] 2025-08-07 03:28:35,251 >> 

Training completed. Do not forget to share your model on huggingface.co/models =)



                                                 

100%|██████████| 559/559 [10:23<00:00,  1.06it/s]
100%|██████████| 559/559 [10:23<00:00,  1.11s/it]
[INFO|trainer.py:4117] 2025-08-07 03:28:35,255 >> 
***** Running Evaluation *****
[INFO|trainer.py:4119] 2025-08-07 03:28:35,255 >>   Num examples = 566
[INFO|trainer.py:4122] 2025-08-07 03:28:35,255 >>   Batch size = 4
{'loss': 1.156, 'grad_norm': 0.20893405377864838, 'learning_rate': 8.928571428571429e-07, 'epoch': 0.04}
{'loss': 1.1415, 'grad_norm': 0.22294564545154572, 'learning_rate': 1.7857142857142859e-06, 'epoch': 0.07}
{'loss': 1.0875, 'grad_norm': 0.20138047635555267, 'learning_rate': 2.6785714285714285e-06, 'epoch': 0.11}
{'loss': 1.174, 'grad_norm': 0.17881540954113007, 'learning_rate': 3.5714285714285718e-06, 'epoch': 0.14}
{'loss': 1.1465, 'grad_norm': 0.22206445038318634, 'learning_rate': 4.464285714285715e-06, 'epoch': 0.18}
{'loss': 1.0602, 'grad_norm': 0.27950939536094666, 'learning_rate': 4.996049425354717e-06, 'epoch': 0.21}
{'loss': 1.106, 'grad_norm': 0.20527702569961548, 'learning_rate': 4.951748725674643e-06, 'epoch': 0.25}
{'loss': 1.0925, 'grad_norm': 0.17080006003379822, 'learning_rate': 4.8590858913041775e-06, 'epoch': 0.29}
{'loss': 1.0448, 'grad_norm': 0.1579141467809677, 'learning_rate': 4.719888749226442e-06, 'epoch': 0.32}
{'loss': 1.0583, 'grad_norm': 0.14792706072330475, 'learning_rate': 4.536903042046778e-06, 'epoch': 0.36}
{'loss': 1.0526, 'grad_norm': 0.18974100053310394, 'learning_rate': 4.313738266661979e-06, 'epoch': 0.39}
{'loss': 1.0898, 'grad_norm': 0.15398603677749634, 'learning_rate': 4.054796474886038e-06, 'epoch': 0.43}
{'loss': 1.0667, 'grad_norm': 0.12285584956407547, 'learning_rate': 3.7651854404804757e-06, 'epoch': 0.47}
{'loss': 0.9937, 'grad_norm': 0.14537231624126434, 'learning_rate': 3.450617905418834e-06, 'epoch': 0.5}
{'loss': 1.0999, 'grad_norm': 0.12442641705274582, 'learning_rate': 3.117298892809953e-06, 'epoch': 0.54}
{'loss': 1.0669, 'grad_norm': 0.13244427740573883, 'learning_rate': 2.7718033092965267e-06, 'epoch': 0.57}
{'loss': 1.0391, 'grad_norm': 0.1277468353509903, 'learning_rate': 2.420946251291103e-06, 'epoch': 0.61}
{'loss': 1.0698, 'grad_norm': 0.1201210543513298, 'learning_rate': 2.0716485733325834e-06, 'epoch': 0.64}
{'loss': 1.0936, 'grad_norm': 0.12373249232769012, 'learning_rate': 1.730800370303683e-06, 'epoch': 0.68}
{'loss': 1.0583, 'grad_norm': 0.13216498494148254, 'learning_rate': 1.4051250664000515e-06, 'epoch': 0.72}
{'loss': 1.0508, 'grad_norm': 0.13129547238349915, 'learning_rate': 1.1010467917732783e-06, 'epoch': 0.75}
{'loss': 1.0734, 'grad_norm': 0.11844677478075027, 'learning_rate': 8.245636629187121e-07, 'epoch': 0.79}
{'loss': 1.0682, 'grad_norm': 0.11956647038459778, 'learning_rate': 5.811294664243752e-07, 'epoch': 0.82}
{'loss': 1.0093, 'grad_norm': 0.11379951983690262, 'learning_rate': 3.7554607993613823e-07, 'epoch': 0.86}
{'loss': 1.0195, 'grad_norm': 0.1225791648030281, 'learning_rate': 2.118687523966559e-07, 'epoch': 0.89}
{'loss': 1.0074, 'grad_norm': 0.12060265988111496, 'learning_rate': 9.332611195910585e-08, 'epoch': 0.93}
{'loss': 1.0939, 'grad_norm': 0.1429004669189453, 'learning_rate': 2.2256479464999315e-08, 'epoch': 0.97}
{'train_runtime': 623.1565, 'train_samples_per_second': 3.585, 'train_steps_per_second': 0.897, 'train_loss': 1.071784788892606, 'epoch': 1.0}
***** train metrics *****
  epoch                    =        1.0
  total_flos               = 95815219GF
  train_loss               =     1.0718
  train_runtime            = 0:10:23.15
  train_samples_per_second =      3.585
  train_steps_per_second   =      0.897

  0%|          | 0/142 [00:00<?, ?it/s]
  1%|▏         | 2/142 [00:00<00:22,  6.36it/s]
  2%|▏         | 3/142 [00:00<00:31,  4.46it/s]
  3%|▎         | 4/142 [00:00<00:36,  3.83it/s]
  4%|▎         | 5/142 [00:01<00:38,  3.57it/s]
  4%|▍         | 6/142 [00:01<00:39,  3.42it/s]
  5%|▍         | 7/142 [00:01<00:40,  3.32it/s]
  6%|▌         | 8/142 [00:02<00:40,  3.27it/s]
  6%|▋         | 9/142 [00:02<00:41,  3.24it/s]
  7%|▋         | 10/142 [00:02<00:41,  3.20it/s]
  8%|▊         | 11/142 [00:03<00:41,  3.18it/s]
  8%|▊         | 12/142 [00:03<00:41,  3.17it/s]
  9%|▉         | 13/142 [00:03<00:40,  3.16it/s]
 10%|▉         | 14/142 [00:04<00:40,  3.15it/s]
 11%|█         | 15/142 [00:04<00:40,  3.15it/s]
 11%|█▏        | 16/142 [00:04<00:40,  3.15it/s]
 12%|█▏        | 17/142 [00:05<00:39,  3.15it/s]
 13%|█▎        | 18/142 [00:05<00:39,  3.15it/s]
 13%|█▎        | 19/142 [00:05<00:39,  3.15it/s]
 14%|█▍        | 20/142 [00:06<00:38,  3.15it/s]
 15%|█▍        | 21/142 [00:06<00:38,  3.15it/s]
 15%|█▌        | 22/142 [00:06<00:38,  3.14it/s]
 16%|█▌        | 23/142 [00:06<00:37,  3.14it/s]
 17%|█▋        | 24/142 [00:07<00:37,  3.14it/s]
 18%|█▊        | 25/142 [00:07<00:37,  3.14it/s]
 18%|█▊        | 26/142 [00:07<00:36,  3.14it/s]
 19%|█▉        | 27/142 [00:08<00:36,  3.14it/s]
 20%|█▉        | 28/142 [00:08<00:36,  3.14it/s]
 20%|██        | 29/142 [00:08<00:35,  3.14it/s]
 21%|██        | 30/142 [00:09<00:35,  3.15it/s]
 22%|██▏       | 31/142 [00:09<00:35,  3.15it/s]
 23%|██▎       | 32/142 [00:09<00:34,  3.15it/s]
 23%|██▎       | 33/142 [00:10<00:34,  3.15it/s]
 24%|██▍       | 34/142 [00:10<00:34,  3.14it/s]
 25%|██▍       | 35/142 [00:10<00:34,  3.14it/s]
 25%|██▌       | 36/142 [00:11<00:33,  3.14it/s]
 26%|██▌       | 37/142 [00:11<00:33,  3.14it/s]
 27%|██▋       | 38/142 [00:11<00:33,  3.14it/s]
 27%|██▋       | 39/142 [00:12<00:32,  3.14it/s]
 28%|██▊       | 40/142 [00:12<00:32,  3.14it/s]
 29%|██▉       | 41/142 [00:12<00:32,  3.15it/s]
 30%|██▉       | 42/142 [00:13<00:31,  3.15it/s]
 30%|███       | 43/142 [00:13<00:31,  3.15it/s]
 31%|███       | 44/142 [00:13<00:31,  3.14it/s]
 32%|███▏      | 45/142 [00:13<00:30,  3.14it/s]
 32%|███▏      | 46/142 [00:14<00:30,  3.14it/s]
 33%|███▎      | 47/142 [00:14<00:30,  3.14it/s]
 34%|███▍      | 48/142 [00:14<00:30,  3.13it/s]
 35%|███▍      | 49/142 [00:15<00:29,  3.13it/s]
 35%|███▌      | 50/142 [00:15<00:29,  3.13it/s]
 36%|███▌      | 51/142 [00:15<00:29,  3.13it/s]
 37%|███▋      | 52/142 [00:16<00:28,  3.13it/s]
 37%|███▋      | 53/142 [00:16<00:28,  3.14it/s]
 38%|███▊      | 54/142 [00:16<00:28,  3.14it/s]
 39%|███▊      | 55/142 [00:17<00:27,  3.14it/s]
 39%|███▉      | 56/142 [00:17<00:27,  3.14it/s]
 40%|████      | 57/142 [00:17<00:27,  3.15it/s]
 41%|████      | 58/142 [00:18<00:26,  3.14it/s]
 42%|████▏     | 59/142 [00:18<00:26,  3.14it/s]
 42%|████▏     | 60/142 [00:18<00:26,  3.14it/s]
 43%|████▎     | 61/142 [00:19<00:25,  3.14it/s]
 44%|████▎     | 62/142 [00:19<00:25,  3.14it/s]
 44%|████▍     | 63/142 [00:19<00:25,  3.14it/s]
 45%|████▌     | 64/142 [00:20<00:24,  3.13it/s]
 46%|████▌     | 65/142 [00:20<00:24,  3.14it/s]
 46%|████▋     | 66/142 [00:20<00:24,  3.13it/s]
 47%|████▋     | 67/142 [00:21<00:23,  3.14it/s]
 48%|████▊     | 68/142 [00:21<00:23,  3.14it/s]
 49%|████▊     | 69/142 [00:21<00:23,  3.14it/s]
 49%|████▉     | 70/142 [00:21<00:22,  3.15it/s]
 50%|█████     | 71/142 [00:22<00:22,  3.14it/s]
 51%|█████     | 72/142 [00:22<00:22,  3.14it/s]
 51%|█████▏    | 73/142 [00:22<00:21,  3.14it/s]
 52%|█████▏    | 74/142 [00:23<00:21,  3.14it/s]
 53%|█████▎    | 75/142 [00:23<00:21,  3.14it/s]
 54%|█████▎    | 76/142 [00:23<00:21,  3.14it/s]
 54%|█████▍    | 77/142 [00:24<00:20,  3.14it/s]
 55%|█████▍    | 78/142 [00:24<00:20,  3.14it/s]
 56%|█████▌    | 79/142 [00:24<00:20,  3.14it/s]
 56%|█████▋    | 80/142 [00:25<00:19,  3.14it/s]
 57%|█████▋    | 81/142 [00:25<00:19,  3.14it/s]
 58%|█████▊    | 82/142 [00:25<00:19,  3.14it/s]
 58%|█████▊    | 83/142 [00:26<00:18,  3.15it/s]
 59%|█████▉    | 84/142 [00:26<00:18,  3.15it/s]
 60%|█████▉    | 85/142 [00:26<00:18,  3.14it/s]
 61%|██████    | 86/142 [00:27<00:17,  3.14it/s]
 61%|██████▏   | 87/142 [00:27<00:17,  3.14it/s]
 62%|██████▏   | 88/142 [00:27<00:17,  3.14it/s]
 63%|██████▎   | 89/142 [00:28<00:16,  3.14it/s]
 63%|██████▎   | 90/142 [00:28<00:16,  3.14it/s]
 64%|██████▍   | 91/142 [00:28<00:16,  3.14it/s]
 65%|██████▍   | 92/142 [00:28<00:15,  3.14it/s]
 65%|██████▌   | 93/142 [00:29<00:15,  3.14it/s]
 66%|██████▌   | 94/142 [00:29<00:15,  3.14it/s]
 67%|██████▋   | 95/142 [00:29<00:14,  3.14it/s]
 68%|██████▊   | 96/142 [00:30<00:14,  3.14it/s]
 68%|██████▊   | 97/142 [00:30<00:14,  3.15it/s]
 69%|██████▉   | 98/142 [00:30<00:14,  3.14it/s]
 70%|██████▉   | 99/142 [00:31<00:13,  3.14it/s]
 70%|███████   | 100/142 [00:31<00:13,  3.14it/s]
 71%|███████   | 101/142 [00:31<00:13,  3.14it/s]
 72%|███████▏  | 102/142 [00:32<00:12,  3.14it/s]
 73%|███████▎  | 103/142 [00:32<00:12,  3.14it/s]
 73%|███████▎  | 104/142 [00:32<00:12,  3.14it/s]
 74%|███████▍  | 105/142 [00:33<00:11,  3.14it/s]
 75%|███████▍  | 106/142 [00:33<00:11,  3.14it/s]
 75%|███████▌  | 107/142 [00:33<00:11,  3.13it/s]
 76%|███████▌  | 108/142 [00:34<00:10,  3.13it/s]
 77%|███████▋  | 109/142 [00:34<00:10,  3.13it/s]
 77%|███████▋  | 110/142 [00:34<00:10,  3.13it/s]
 78%|███████▊  | 111/142 [00:35<00:09,  3.14it/s]
 79%|███████▉  | 112/142 [00:35<00:09,  3.14it/s]
 80%|███████▉  | 113/142 [00:35<00:09,  3.14it/s]
 80%|████████  | 114/142 [00:35<00:08,  3.14it/s]
 81%|████████  | 115/142 [00:36<00:08,  3.14it/s]
 82%|████████▏ | 116/142 [00:36<00:08,  3.13it/s]
 82%|████████▏ | 117/142 [00:36<00:07,  3.13it/s]
 83%|████████▎ | 118/142 [00:37<00:07,  3.13it/s]
 84%|████████▍ | 119/142 [00:37<00:07,  3.13it/s]
 85%|████████▍ | 120/142 [00:37<00:07,  3.13it/s]
 85%|████████▌ | 121/142 [00:38<00:06,  3.14it/s]
 86%|████████▌ | 122/142 [00:38<00:06,  3.14it/s]
 87%|████████▋ | 123/142 [00:38<00:06,  3.13it/s]
 87%|████████▋ | 124/142 [00:39<00:05,  3.13it/s]
 88%|████████▊ | 125/142 [00:39<00:05,  3.13it/s]
 89%|████████▊ | 126/142 [00:39<00:05,  3.14it/s]
 89%|████████▉ | 127/142 [00:40<00:04,  3.14it/s]
 90%|█████████ | 128/142 [00:40<00:04,  3.14it/s]
 91%|█████████ | 129/142 [00:40<00:04,  3.15it/s]
 92%|█████████▏| 130/142 [00:41<00:03,  3.15it/s]
 92%|█████████▏| 131/142 [00:41<00:03,  3.14it/s]
 93%|█████████▎| 132/142 [00:41<00:03,  3.14it/s]
 94%|█████████▎| 133/142 [00:42<00:02,  3.14it/s]
 94%|█████████▍| 134/142 [00:42<00:02,  3.14it/s]
 95%|█████████▌| 135/142 [00:42<00:02,  3.13it/s]
 96%|█████████▌| 136/142 [00:43<00:01,  3.13it/s]
 96%|█████████▋| 137/142 [00:43<00:01,  3.13it/s]
 97%|█████████▋| 138/142 [00:43<00:01,  3.13it/s]
 98%|█████████▊| 139/142 [00:43<00:00,  3.13it/s]
 99%|█████████▊| 140/142 [00:44<00:00,  3.13it/s]
 99%|█████████▉| 141/142 [00:44<00:00,  3.14it/s]
100%|██████████| 142/142 [00:44<00:00,  3.64it/s]
100%|██████████| 142/142 [00:44<00:00,  3.17it/s]
[WARNING|trainer.py:760] 2025-08-07 03:29:23,502 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
[WARNING|base.py:892] 2025-08-07 03:29:23,680 >> Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
2025/08/07 03:29:23 INFO mlflow.transformers: Overriding save_pretrained to False for PEFT models, following the Transformers behavior. The PEFT adaptor and config will be saved, but the base model weights will not and reference to the HuggingFace Hub repository will be logged instead.
[INFO|configuration_utils.py:679] 2025-08-07 03:29:24,324 >> loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--microsoft--Phi-3-mini-4k-instruct/snapshots/0a67737cc96d2554230f90338b163bc6380a2a85/config.json
[INFO|configuration_utils.py:746] 2025-08-07 03:29:24,324 >> Model config Phi3Config {
  "_name_or_path": "Phi-3-mini-4k-instruct",
  "architectures": [
    "Phi3ForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "auto_map": {
    "AutoConfig": "microsoft/Phi-3-mini-4k-instruct--configuration_phi3.Phi3Config",
    "AutoModelForCausalLM": "microsoft/Phi-3-mini-4k-instruct--modeling_phi3.Phi3ForCausalLM"
  },
  "bos_token_id": 1,
  "embd_pdrop": 0.0,
  "eos_token_id": 32000,
  "hidden_act": "silu",
  "hidden_size": 3072,
  "initializer_range": 0.02,
  "intermediate_size": 8192,
  "max_position_embeddings": 4096,
  "model_type": "phi3",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "original_max_position_embeddings": 4096,
  "pad_token_id": 32000,
  "resid_pdrop": 0.0,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "sliding_window": 2047,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.46.1",
  "use_cache": true,
  "vocab_size": 32064
}

2025/08/07 03:29:24 INFO mlflow.transformers: Skipping saving pretrained model weights to disk as the save_pretrained argumentis set to False. The reference to the HuggingFace Hub repository microsoft/Phi-3-mini-4k-instruct will be logged instead.
2025/08/07 03:29:25 WARNING mlflow.utils.requirements_utils: Found torchvision version (0.19.1+cu124) contains a local version label (+cu124). MLflow logged a pip requirement for this package as 'torchvision==0.19.1' without the local version label to make it installable from PyPI. To specify pip requirements containing local version labels, please use `conda_env` or `pip_requirements`.
2025/08/07 03:29:25 INFO mlflow.transformers: A local checkpoint path or PEFT model is given as the `transformers_model`. To avoid loading the full model into memory, we don't infer the pip requirement for the model. Instead, we will use the default requirements, but it may not capture all required pip libraries for the model. Consider providing the pip requirements explicitly.
***** eval metrics *****
  epoch                   =        1.0
  eval_loss               =     1.0459
  eval_runtime            = 0:00:45.08
  eval_samples            =        832
  eval_samples_per_second =     12.554
  eval_steps_per_second   =       3.15
🏃 View run maroon_roti_7xb8lpyy24 at: https://japaneast.api.azureml.ms/mlflow/v2.0/subscriptions/884e9b80-487d-43ed-bb19-a874dab1b483/resourceGroups/rg-finetune-jpe/providers/Microsoft.MachineLearningServices/workspaces/aml-finetune-jpe/#/experiments/18129c90-e2bb-4555-9e64-5490551e5603/runs/maroon_roti_7xb8lpyy24
🧪 View experiment at: https://japaneast.api.azureml.ms/mlflow/v2.0/subscriptions/884e9b80-487d-43ed-bb19-a874dab1b483/resourceGroups/rg-finetune-jpe/providers/Microsoft.MachineLearningServices/workspaces/aml-finetune-jpe/#/experiments/18129c90-e2bb-4555-9e64-5490551e5603

3.3.2. Model registration

I did not realize that artifacts had to be part of the path, and spent about half a day on trial and error.

# Register the model
phi_model = Model(
    # Path of the model saved by mlflow.transformers.log_model in train.py
    path=f"azureml://jobs/{job.name}/outputs/artifacts/model",
    type=AssetTypes.MLFLOW_MODEL,
    name="phi3-finetuned",
    description="Phi3 model fine-tuned on custom dataset",
)

registered_model = ml_client.models.create_or_update(phi_model)

This registers the model.
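
The path above follows the pattern azureml://jobs/<job name>/outputs/<output name>/<relative path>. A minimal sketch of a helper that assembles it, so the artifacts segment is not forgotten. The function name job_output_uri and the job name "my-job" are hypothetical, and it assumes the model was logged under the relative path "model", as in the snippet above.

```python
def job_output_uri(job_name: str, relative_path: str, output_name: str = "artifacts") -> str:
    """Build an azureml:// URI pointing at a path inside a job's outputs."""
    if not job_name:
        raise ValueError("job_name must not be empty")
    # Strip stray slashes so the segments join with exactly one "/" each.
    relative_path = relative_path.strip("/")
    return f"azureml://jobs/{job_name}/outputs/{output_name}/{relative_path}"


# "my-job" is a placeholder; with a real job object, pass job.name instead.
print(job_output_uri("my-job", "model"))
```

The result can be passed as the path argument of Model in place of the hand-written f-string.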

3.3.3. Endpoint registration

Register an endpoint for inference.

# Create the endpoint
ENDPOINT_NAME = "test-endpoint-for-phi3"
endpoint = ManagedOnlineEndpoint(
    name=ENDPOINT_NAME,
    description="Online endpoint for test",
    auth_mode="key",
)
ml_client.begin_create_or_update(endpoint).wait()

At this point the endpoint is registered without any deployment.
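
Endpoint names have to be unique within an Azure region, so a fixed name like the one above can collide when the steps are repeated. A minimal sketch that appends a short random suffix. The helper name unique_endpoint_name is hypothetical, and the 32-character limit is an assumption that should be checked against the current Azure Machine Learning limits.

```python
import uuid

# Assumption: maximum endpoint name length; verify against the current limits.
MAX_ENDPOINT_NAME_LENGTH = 32


def unique_endpoint_name(prefix: str, suffix_length: int = 6) -> str:
    """Return prefix plus a random hex suffix, e.g. 'my-endpoint-3fa9c1'."""
    suffix = uuid.uuid4().hex[:suffix_length]
    name = f"{prefix}-{suffix}"
    if len(name) > MAX_ENDPOINT_NAME_LENGTH:
        raise ValueError(f"endpoint name too long ({len(name)} characters): {name}")
    return name


print(unique_endpoint_name("test-endpoint-for-phi3"))
```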

3.3.4. Deployment registration

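Register the deployment. It took about 9 minutes.
From this point on the instance is billed while it runs, so be careful not to leave it unattended.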

DEPLOYMENT_NAME = "Deploy-test"

deployment = ManagedOnlineDeployment(
    name=DEPLOYMENT_NAME,
    endpoint_name=ENDPOINT_NAME,
    model=f"{registered_model.name}:{registered_model.version}",
    instance_type="Standard_NC40ads_H100_v5",
    instance_count=1,
    liveness_probe=ProbeSettings(initial_delay=600),
    request_settings=OnlineRequestSettings(request_timeout_ms=90000),
)
ml_client.online_deployments.begin_create_or_update(deployment).wait()

You can see that the deployment has been added.

The values that can be specified for instance_type can be checked as follows.

for sku in ml_client.compute.list_sizes():
    print(f"{sku.name=}, {sku.family=}, {sku.v_cp_us=}, {sku.gpus=}, {sku.memory_gb=}")

sku.name='Standard_A1_v2', sku.family='standardAv2Family', sku.v_cp_us=1, sku.gpus=0, sku.memory_gb=2.0
sku.name='Standard_A2m_v2', sku.family='standardAv2Family', sku.v_cp_us=2, sku.gpus=0, sku.memory_gb=16.0
sku.name='Standard_A2_v2', sku.family='standardAv2Family', sku.v_cp_us=2, sku.gpus=0, sku.memory_gb=4.0
sku.name='Standard_A4m_v2', sku.family='standardAv2Family', sku.v_cp_us=4, sku.gpus=0, sku.memory_gb=32.0
(rest omitted)
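
The full list is long, so it can help to narrow it down to sizes that report a GPU. A minimal sketch, assuming each entry exposes the name and gpus attributes printed above. The helper name gpu_sizes is hypothetical, and the sample objects are stand-ins with illustrative values rather than real API output.

```python
from types import SimpleNamespace


def gpu_sizes(sizes, min_gpus=1):
    """Return the names of VM sizes that report at least min_gpus GPUs."""
    # Treat a missing GPU count (None) as zero.
    return [s.name for s in sizes if (s.gpus or 0) >= min_gpus]


# Stand-in objects for illustration; with a real workspace,
# pass ml_client.compute.list_sizes() instead.
sample = [
    SimpleNamespace(name="Standard_A1_v2", gpus=0),
    SimpleNamespace(name="Standard_NC40ads_H100_v5", gpus=1),
]
print(gpu_sizes(sample))
```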

3.3.5. Traffic update

Update the endpoint's traffic so that 100% is assigned to the deployment.

endpoint.traffic = {DEPLOYMENT_NAME: 100}
updated_online_endpoint = ml_client.begin_create_or_update(endpoint).result()

100% of the traffic is now assigned to the deployment.
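
The traffic setting is a plain dict that maps deployment names to percentages. A minimal sketch of a check to run before sending it, assuming the service expects the percentages to total either 0 or 100 (worth confirming in the official documentation). The helper name validate_traffic and the deployment names "blue" and "green" are hypothetical.

```python
def validate_traffic(traffic: dict) -> dict:
    """Check a {deployment_name: percent} mapping before assigning it."""
    for name, percent in traffic.items():
        if not 0 <= percent <= 100:
            raise ValueError(f"{name}: percentage {percent} is outside 0-100")
    total = sum(traffic.values())
    # Assumption: the total must be 0 (no traffic) or exactly 100.
    if total not in (0, 100):
        raise ValueError(f"traffic must total 0 or 100, got {total}")
    return traffic


print(validate_traffic({"Deploy-test": 100}))
print(validate_traffic({"blue": 90, "green": 10}))
```

The second call shows how the same dict could split traffic across two deployments.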

3.3.6. Running inference

Run inference against the endpoint.

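import json
from urllib.error import HTTPError
from urllib.parse import urlparse
from urllib.request import Request, urlopen

score_url = updated_online_endpoint.scoring_uri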
parsed = urlparse(score_url)
uri = f"{str(parsed.scheme)}://{str(parsed.hostname)}"
data = {"input_data": [{"message": "Hello"}]}

body = str.encode(json.dumps(data))
auth_keys = ml_client.online_endpoints.get_keys(name=ENDPOINT_NAME)
headers = {'Content-Type':'application/json', 'Accept': 'application/json', 'Authorization':('Bearer '+ auth_keys.primary_key)}

req = Request(uri+"/score", body, headers)

try:
    response = urlopen(req)

    result = response.read()
    print(result)
except HTTPError as error:
    print("The request failed with status code: " + str(error.code))

    # Print the headers - they include the request ID and the timestamp, which are useful for debugging the failure
    print(error.info())
    print(error.read().decode("utf8", 'ignore'))

The result comes back successfully.

b'["Hello, I\'m interested in learning more about the history of the United States. Can you tell"]'
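
The response body is raw bytes containing a JSON array of strings. A minimal sketch that decodes it into ordinary Python text, assuming the response always has the shape shown above. The helper name first_generation is hypothetical.

```python
import json

# The body returned in the run above.
raw = b'["Hello, I\'m interested in learning more about the history of the United States. Can you tell"]'


def first_generation(raw_body: bytes) -> str:
    """Decode the response bytes and return the first generated string."""
    payload = json.loads(raw_body.decode("utf-8"))
    if not isinstance(payload, list) or not payload:
        raise ValueError(f"unexpected response shape: {payload!r}")
    return str(payload[0])


print(first_generation(raw))
```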

3.3.7. Endpoint deletion

Finally, do not forget to delete the endpoint.

ml_client.online_endpoints.begin_delete(name=ENDPOINT_NAME).wait()
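
Since a forgotten endpoint keeps costing money, one option is to tie the deletion to a context manager so that it runs even when the inference code raises. A minimal sketch with a hypothetical helper, temporary_endpoint, that takes the create and delete steps as plain callables. The print-based stand-ins below are only for illustration. With a real workspace they would wrap the create and delete calls shown in this article.

```python
from contextlib import contextmanager


@contextmanager
def temporary_endpoint(create, delete):
    """Run create() on entry and always run delete() on exit."""
    create()
    try:
        yield
    finally:
        # Runs even if the body of the with-block raises.
        delete()


# Stand-in callables for illustration only.
with temporary_endpoint(
    create=lambda: print("endpoint created"),
    delete=lambda: print("endpoint deleted"),
):
    print("running inference")
```

If create() itself fails, delete() is not called, so a partially created endpoint would still need a manual check in the Studio.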
