I decided to apply the method from yesterday's article to Qwen2.5-0.5B and compare it against ModernBERT-base. On Google Colaboratory (with a GPU), it goes like this:
!pip install transformers datasets evaluate seqeval accelerate
s='$1=="transformers"{printf("-b v%s",$2)}'
!test -d transformers || git clone `pip3 list | awk '{s}'` https://github.com/huggingface/transformers
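The awk one-liner above pins the clone of the transformers repository to the installed library version: it scans `pip3 list` for the `transformers` row and emits a matching `-b vX.Y.Z` branch flag, so the example scripts match the installed package. In the Colab cell, `'{s}'` is IPython's interpolation of the Python variable `s`; in a plain shell the equivalent would be `awk "$s"`. A self-contained check of the awk program (the version number here is a made-up sample, not the one Colab would report):

```shell
# Turn a `pip3 list` row into a `git clone -b vX.Y.Z` branch flag.
s='$1=="transformers"{printf("-b v%s",$2)}'
printf 'transformers 4.44.2\n' | awk "$s"
# prints: -b v4.44.2
```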
!test -d UD_English-EWT || git clone --depth=1 https://github.com/UniversalDependencies/UD_English-EWT
import json
def makejson(conllu_file,json_file):
  with open(conllu_file,"r",encoding="utf-8") as r, open(json_file,"w",encoding="utf-8") as w:
    d={"tokens":[],"tags":[]}
    for s in r:
      if s.strip()=="":
        if d["tokens"]:
          print(json.dumps(d),file=w)
          d={"tokens":[],"tags":[]}
      else:
        t=s.split("\t")
        if len(t)==10 and t[0].isdecimal():
          d["tokens"].append(t[1])
          d["tags"].append(t[3])
makejson("UD_English-EWT/en_ewt-ud-train.conllu","train.json")
makejson("UD_English-EWT/en_ewt-ud-dev.conllu","dev.json")
makejson("UD_English-EWT/en_ewt-ud-test.conllu","test.json")
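To see the shape of the JSON Lines that `run_ner.py` consumes, here is a self-contained run of the same conversion logic on a made-up two-token CoNLL-U fragment (the sentence is not from UD_English-EWT):

```python
import io, json

# Made-up CoNLL-U fragment: one sentence, two 10-column word lines,
# terminated by the blank line that ends a sentence.
conllu = (
    "1\tIt\tit\tPRON\tPRP\t_\t2\tnsubj\t_\t_\n"
    "2\tworks\twork\tVERB\tVBZ\t_\t0\troot\t_\t_\n"
    "\n"
)
out = []
d = {"tokens": [], "tags": []}
for s in io.StringIO(conllu):
    if s.strip() == "":
        if d["tokens"]:
            out.append(json.dumps(d))
            d = {"tokens": [], "tags": []}
    else:
        t = s.split("\t")
        # keep only 10-column word lines; "1-2"-style multi-word-token
        # ranges and comment lines fail the isdecimal() test and are skipped
        if len(t) == 10 and t[0].isdecimal():
            d["tokens"].append(t[1])
            d["tags"].append(t[3])
print(out[0])
# prints: {"tokens": ["It", "works"], "tags": ["PRON", "VERB"]}
```

Each sentence becomes one JSON object with parallel `tokens` and `tags` lists, which is the `--train_file`/`--validation_file` format the token-classification example expects.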
!env WANDB_DISABLED=true python3 transformers/examples/pytorch/token-classification/run_ner.py --task_name pos --model_name_or_path Qwen/Qwen2.5-0.5B --train_file train.json --validation_file dev.json --test_file test.json --output_dir ./Qwen2.5-0.5B-english-upos --overwrite_output_dir --do_train --do_eval --do_predict
On my (Koichi Yasuoka's) machine, the following metrics were printed in just under an hour, and Qwen2.5-0.5B-english-upos was complete.
***** train metrics *****
epoch = 3.0
total_flos = 3490858GF
train_loss = 0.1989
train_runtime = 0:53:51.30
train_samples = 12544
train_samples_per_second = 11.646
train_steps_per_second = 1.456
***** eval metrics *****
epoch = 3.0
eval_accuracy = 0.9099
eval_f1 = 0.8934
eval_loss = 0.4152
eval_precision = 0.8943
eval_recall = 0.8926
eval_runtime = 0:00:20.36
eval_samples = 2001
eval_samples_per_second = 98.271
eval_steps_per_second = 12.327
***** predict metrics *****
predict_accuracy = 0.913
predict_f1 = 0.8958
predict_loss = 0.3968
predict_precision = 0.8959
predict_recall = 0.8956
predict_runtime = 0:00:24.82
predict_samples_per_second = 83.682
predict_steps_per_second = 10.475
The F1 score is around 0.89 for both eval and predict, which frankly is not very good. Let's try running it anyway.
from transformers import AutoTokenizer,AutoModelForTokenClassification,TokenClassificationPipeline
tkz=AutoTokenizer.from_pretrained("Qwen2.5-0.5B-english-upos")
mdl=AutoModelForTokenClassification.from_pretrained("Qwen2.5-0.5B-english-upos")
nlp=TokenClassificationPipeline(model=mdl,tokenizer=tkz,device=0)
print(nlp("It don't mean a thing if it ain't got that swing"))
Tagging "It don't mean a thing if it ain't got that swing" with the freshly built Qwen2.5-0.5B-english-upos, I got the following result on my machine.
[{'entity': 'PRON', 'score': 0.9992797, 'index': 0, 'word': 'It', 'start': 0, 'end': 2}, {'entity': 'AUX', 'score': 0.91161925, 'index': 1, 'word': 'Ġdon', 'start': 2, 'end': 6}, {'entity': 'PART', 'score': 0.9997278, 'index': 2, 'word': "'t", 'start': 6, 'end': 8}, {'entity': 'VERB', 'score': 0.9998871, 'index': 3, 'word': 'Ġmean', 'start': 8, 'end': 13}, {'entity': 'DET', 'score': 0.9993187, 'index': 4, 'word': 'Ġa', 'start': 13, 'end': 15}, {'entity': 'NOUN', 'score': 0.99999917, 'index': 5, 'word': 'Ġthing', 'start': 15, 'end': 21}, {'entity': 'SCONJ', 'score': 0.9998536, 'index': 6, 'word': 'Ġif', 'start': 21, 'end': 24}, {'entity': 'PRON', 'score': 0.9999038, 'index': 7, 'word': 'Ġit', 'start': 24, 'end': 27}, {'entity': 'AUX', 'score': 0.57848704, 'index': 8, 'word': 'Ġain', 'start': 27, 'end': 31}, {'entity': 'PART', 'score': 0.99963856, 'index': 9, 'word': "'t", 'start': 31, 'end': 33}, {'entity': 'VERB', 'score': 0.9222952, 'index': 10, 'word': 'Ġgot', 'start': 33, 'end': 37}, {'entity': 'ADV', 'score': 0.8038623, 'index': 11, 'word': 'Ġthat', 'start': 37, 'end': 42}, {'entity': 'ADJ', 'score': 0.44765362, 'index': 12, 'word': 'Ġswing', 'start': 42, 'end': 48}]
The spaces being garbled into "Ġ" is something I could work around, but the model fails to get "that" and "swing" right. As I wrote in my earlier article 『GPT系モデルの系列ラベリングによる品詞付与』 ("POS tagging by sequence labeling with GPT-family models"), Qwen2ForTokenClassification does not seem to reach good POS-tagging accuracy. A pity.
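The "Ġ" is the marker that GPT-2-style byte-level BPE tokenizers prepend to tokens that follow a space. Since the pipeline output already carries character offsets, one workaround (a sketch; the entries below are abridged by hand from the output above) is to recover each surface token by slicing the original sentence:

```python
# Recover readable tokens by slicing the original sentence with the
# start/end offsets instead of trusting the byte-level "word" field.
def clean(entities, text):
    return [dict(e, word=text[e["start"]:e["end"]].strip()) for e in entities]

sentence = "It don't mean a thing if it ain't got that swing"
sample = [  # abridged from the pipeline output above
    {"entity": "PRON", "word": "It", "start": 0, "end": 2},
    {"entity": "AUX", "word": "Ġdon", "start": 2, "end": 6},
    {"entity": "PART", "word": "'t", "start": 6, "end": 8},
]
print([e["word"] for e in clean(sample, sentence)])
# prints: ['It', 'don', "'t"]
```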