UPOS Tagging of NINJAL Long-Unit-Word Universal Dependencies with Lfm2ForTokenClassification

Posted at 2025-07-28

Continuing from yesterday's article, I tried fine-tuning LFM2-350M for UPOS tagging on UD_Japanese-GSDLUW, with the help of run_ner.py. On Google Colaboratory (GPU runtime), it goes like this.

!pip install "transformers>=4.54.0" accelerate datasets evaluate seqeval
!test -d LFM2-350M || git clone --depth=1 https://huggingface.co/LiquidAI/LFM2-350M
import json
# Write a custom model file: the token-classification class is just the generic head mixed into Lfm2PreTrainedModel.
with open("LFM2-350M/model.py","w",encoding="utf-8") as w:
  print("""from transformers import Lfm2PreTrainedModel
from transformers.modeling_layers import GenericForTokenClassification
class Lfm2ForTokenClassification(GenericForTokenClassification,Lfm2PreTrainedModel):
  pass""",file=w)
# Register that class in config.json so AutoModelForTokenClassification can find it via trust_remote_code.
with open("LFM2-350M/config.json","r",encoding="utf-8") as r:
  d=json.load(r)
if "auto_map" not in d:
  d["auto_map"]={}
d["auto_map"]["AutoModelForTokenClassification"]="model.Lfm2ForTokenClassification"
with open("LFM2-350M/config.json","w",encoding="utf-8") as w:
  json.dump(d,w,indent=2)
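# awk program that turns the installed transformers version into "-b vX.Y.Z";
# IPython expands {s} in the ! line below, so the cloned examples match the pip-installed release.
s='$1=="transformers"{printf("-b v%s",$2)}'
!test -d transformers || git clone `pip list | awk '{s}'` https://github.com/huggingface/transformers
!test -d UD_Japanese-GSDLUW || git clone --depth=1 https://github.com/UniversalDependencies/UD_Japanese-GSDLUW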
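# Convert a CoNLL-U file into JSON Lines with "tokens" and "tags" fields for run_ner.py.
def makejson(conllu_file,json_file):
  with open(conllu_file,"r",encoding="utf-8") as r, open(json_file,"w",encoding="utf-8") as w:
    # Each sentence starts with the BOS token, tagged SYM as a placeholder.
    d,f={"tokens":["<|startoftext|>"],"tags":["SYM"]},False
    for s in r:
      if s.strip()=="":
        # A blank line ends the sentence; write it out unless it holds only the BOS token.
        if len(d["tokens"])>1:
          print(json.dumps(d),file=w)
        d,f={"tokens":["<|startoftext|>"],"tags":["SYM"]},False
      else:
        t=s.split("\t")
        # Keep only rows with a plain integer ID (skips comments and range/decimal IDs).
        if len(t)==10 and t[0].isdecimal():
          # Prepend a space if the previous word was not marked SpaceAfter=No.
          d["tokens"].append(" "+t[1] if f else t[1])
          d["tags"].append(t[3])  # column 4 is UPOS
          f=t[9].find("SpaceAfter=No")<0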
makejson("UD_Japanese-GSDLUW/ja_gsdluw-ud-train.conllu","train.json")
makejson("UD_Japanese-GSDLUW/ja_gsdluw-ud-dev.conllu","dev.json")
makejson("UD_Japanese-GSDLUW/ja_gsdluw-ud-test.conllu","test.json")
!env WANDB_DISABLED=true python transformers/examples/pytorch/token-classification/run_ner.py --task_name pos --model_name_or_path LFM2-350M --trust_remote_code --train_file train.json --validation_file dev.json --test_file test.json --output_dir ./LFM2-350M-japanese-luw-upos --overwrite_output_dir --do_train --do_eval --do_predict
!cp LFM2-350M/model.py LFM2-350M-japanese-luw-upos
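
To make the data format concrete, here is a minimal sketch of what the conversion produces. `conllu_to_records` is a hypothetical in-memory restatement of the makejson logic above, and the two sample sentences are made up for illustration (they are not copied from UD_Japanese-GSDLUW).

```python
import json

def conllu_to_records(lines):
    """Hypothetical in-memory version of makejson: CoNLL-U lines -> list of dicts."""
    records = []
    rec, space_before = {"tokens": ["<|startoftext|>"], "tags": ["SYM"]}, False
    for line in lines:
        if line.strip() == "":
            # Blank line ends a sentence; skip records holding only the BOS token.
            if len(rec["tokens"]) > 1:
                records.append(rec)
            rec, space_before = {"tokens": ["<|startoftext|>"], "tags": ["SYM"]}, False
        else:
            cols = line.split("\t")
            if len(cols) == 10 and cols[0].isdecimal():
                rec["tokens"].append(" " + cols[1] if space_before else cols[1])
                rec["tags"].append(cols[3])
                space_before = "SpaceAfter=No" not in cols[9]
    return records

def row(idx, form, upos, misc):
    # Build one 10-column CoNLL-U row; columns not used here are left as "_".
    return "\t".join([str(idx), form, "_", upos, "_", "_", "_", "_", "_", misc]) + "\n"

# Made-up sample: one Japanese sentence without spaces, one sentence with a space.
sample = [
    "# text = 雪国であった。\n",
    row(1, "雪国", "NOUN", "SpaceAfter=No"),
    row(2, "であった", "AUX", "SpaceAfter=No"),
    row(3, "。", "PUNCT", "SpaceAfter=No"),
    "\n",
    row(1, "Hello", "INTJ", "_"),
    row(2, "world", "NOUN", "SpaceAfter=No"),
    "\n",
]

records = conllu_to_records(sample)
for rec in records:
    print(json.dumps(rec, ensure_ascii=False))
```

The second record shows the only place where whitespace enters the tokens: a word gets a leading space when the preceding word lacks SpaceAfter=No, which almost never happens in Japanese text.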

On my (Koichi Yasuoka's) machine, the following metrics were printed after about 40 minutes, and LFM2-350M-japanese-luw-upos was complete.

***** train metrics *****
  epoch                    =        3.0
  total_flos               =  2191825GF
  train_loss               =     0.1819
  train_runtime            = 0:34:06.70
  train_samples            =       7050
  train_samples_per_second =     10.334
  train_steps_per_second   =      1.293

***** eval metrics *****
  epoch                   =        3.0
  eval_accuracy           =     0.9169
  eval_f1                 =     0.9067
  eval_loss               =     0.3976
  eval_precision          =     0.9076
  eval_recall             =     0.9059
  eval_runtime            = 0:00:07.89
  eval_samples            =        507
  eval_samples_per_second =     64.258
  eval_steps_per_second   =      8.111

***** predict metrics *****
  predict_accuracy           =     0.9047
  predict_f1                 =     0.8921
  predict_loss               =     0.4551
  predict_precision          =     0.8935
  predict_recall             =     0.8906
  predict_runtime            = 0:00:08.37
  predict_samples_per_second =     64.834
  predict_steps_per_second   =      8.119
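
As a quick sanity check on the numbers above, the reported F1 values can be recomputed from the reported precision and recall. This is only a sketch: it assumes the overall F1 is the harmonic mean of the overall precision and recall, and it works from the four-digit rounded values printed in the log.

```python
def f1(precision, recall):
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) copied from the eval and predict metrics above.
reported = {
    "eval": (0.9076, 0.9059, 0.9067),
    "predict": (0.8935, 0.8906, 0.8921),
}

diffs = {}
for split, (p, r, f) in reported.items():
    diffs[split] = abs(f1(p, r) - f)
    print(split, "recomputed:", round(f1(p, r), 4), "reported:", f)
```

Both recomputed values agree with the log to within rounding of the inputs.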

The eval and predict F1 scores are both around 0.90, so there seems to be some room for improvement. Let's try running it.

from transformers import AutoTokenizer,AutoModelForTokenClassification,TokenClassificationPipeline
tkz=AutoTokenizer.from_pretrained("LFM2-350M-japanese-luw-upos")
mdl=AutoModelForTokenClassification.from_pretrained("LFM2-350M-japanese-luw-upos",trust_remote_code=True)
nlp=TokenClassificationPipeline(model=mdl,tokenizer=tkz)
txt="国境の長いトンネルを抜けると雪国であった。"
doc=nlp(txt)
for t in doc:
  t["text"]=txt[t["start"]:t["end"]]
print(doc)

Tagging 「国境の長いトンネルを抜けると雪国であった。」 with the freshly built LFM2-350M-japanese-luw-upos gave the following result on my machine.

[{'entity': 'NOUN', 'score': np.float32(0.7175606), 'index': 1, 'word': 'åĽ½', 'start': 0, 'end': 1, 'text': '国'}, {'entity': 'NOUN', 'score': np.float32(0.9999825), 'index': 2, 'word': 'å¢ĥ', 'start': 1, 'end': 2, 'text': '境'}, {'entity': 'ADP', 'score': np.float32(1.0), 'index': 3, 'word': 'ãģ®', 'start': 2, 'end': 3, 'text': 'の'}, {'entity': 'NOUN', 'score': np.float32(0.9765905), 'index': 4, 'word': 'éķ·', 'start': 3, 'end': 4, 'text': '長'}, {'entity': 'ADJ', 'score': np.float32(0.95155585), 'index': 5, 'word': 'ãģĦ', 'start': 4, 'end': 5, 'text': 'い'}, {'entity': 'NOUN', 'score': np.float32(0.9999083), 'index': 6, 'word': 'ãĥĪ', 'start': 5, 'end': 6, 'text': 'ト'}, {'entity': 'NOUN', 'score': np.float32(0.9998863), 'index': 7, 'word': 'ãĥ³ãĥ', 'start': 6, 'end': 8, 'text': 'ンネ'}, {'entity': 'NOUN', 'score': np.float32(0.99999976), 'index': 8, 'word': 'į', 'start': 7, 'end': 8, 'text': 'ネ'}, {'entity': 'NOUN', 'score': np.float32(0.9999635), 'index': 9, 'word': 'ãĥ«', 'start': 8, 'end': 9, 'text': 'ル'}, {'entity': 'ADP', 'score': np.float32(1.0), 'index': 10, 'word': 'ãĤĴ', 'start': 9, 'end': 10, 'text': 'を'}, {'entity': 'VERB', 'score': np.float32(0.9999256), 'index': 11, 'word': 'æĬľ', 'start': 10, 'end': 11, 'text': '抜'}, {'entity': 'AUX', 'score': np.float32(0.43896234), 'index': 12, 'word': 'ãģĳ', 'start': 11, 'end': 12, 'text': 'け'}, {'entity': 'SCONJ', 'score': np.float32(0.99998295), 'index': 13, 'word': 'ãĤĭãģ¨', 'start': 12, 'end': 14, 'text': 'ると'}, {'entity': 'NOUN', 'score': np.float32(0.99985266), 'index': 14, 'word': 'éĽª', 'start': 14, 'end': 15, 'text': '雪'}, {'entity': 'NOUN', 'score': np.float32(0.99924785), 'index': 15, 'word': 'åĽ½', 'start': 15, 'end': 16, 'text': '国'}, {'entity': 'AUX', 'score': np.float32(0.99999976), 'index': 16, 'word': 'ãģ§ãģĤãģ£ãģŁ', 'start': 16, 'end': 20, 'text': 'であった'}, {'entity': 'PUNCT', 'score': np.float32(1.0), 'index': 17, 'word': 'ãĢĤ', 'start': 20, 'end': 21, 'text': '。'}]

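As expected, the tokenizer is not well suited to Japanese (or rather, to NINJAL long-unit words), and the tagging is mediocre as a result: for example, 長い is split into 長 (NOUN) and い (ADJ), and the け of 抜ける is tagged AUX with a score of only about 0.44. I suppose the tokenizer improvements I tried in 『GPT系言語モデルによる国語研長単位係り受け解析』 ("NINJAL long-unit-word dependency parsing with GPT-style language models") need to be tried on LFM2-350M as well.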
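
To see where the token boundaries hurt, the per-token output can be collapsed into character spans. The sketch below is a crude heuristic of my own, not something the pipeline does: `merge_spans` is a hypothetical helper that merges adjacent tokens carrying the same tag, and the (entity, start, end) triples are copied from the result above. Merging by tag alone would also glue together two separate neighbouring words with the same tag, so this is for inspection only.

```python
def merge_spans(text, tokens):
    """Merge adjacent same-tag tokens into (surface, tag) pairs (crude heuristic)."""
    spans = []
    for entity, start, end in tokens:
        last = spans[-1] if spans else None
        # Extend the previous span if the tag matches and the offsets touch or overlap.
        if last and last["entity"] == entity and start <= last["end"]:
            last["end"] = max(last["end"], end)
        else:
            spans.append({"entity": entity, "start": start, "end": end})
    return [(text[s["start"]:s["end"]], s["entity"]) for s in spans]

txt = "国境の長いトンネルを抜けると雪国であった。"
# (entity, start, end) copied from the pipeline output above.
tokens = [
    ("NOUN", 0, 1), ("NOUN", 1, 2), ("ADP", 2, 3), ("NOUN", 3, 4), ("ADJ", 4, 5),
    ("NOUN", 5, 6), ("NOUN", 6, 8), ("NOUN", 7, 8), ("NOUN", 8, 9), ("ADP", 9, 10),
    ("VERB", 10, 11), ("AUX", 11, 12), ("SCONJ", 12, 14), ("NOUN", 14, 15),
    ("NOUN", 15, 16), ("AUX", 16, 20), ("PUNCT", 20, 21),
]

merged = merge_spans(txt, tokens)
for surface, tag in merged:
    print(surface, tag)
```

With this view, 国境, トンネル, 雪国 and であった come out as single units, while 長い and 抜けると stay fragmented. In particular the single token ると straddles the end of 抜ける and the following と, so a per-token label cannot separate the two.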
