Part-of-Speech Tagging of NINJAL Long-Unit-Word Universal Dependencies with Lfm2ForTokenClassification

This is a continuation of yesterday's article: with the help of run_ner.py, I tried UPOS tagging of UD_Japanese-GSDLUW on LFM2-350M. On Google Colaboratory (GPU), it goes something like this.

!pip install "transformers>=4.54.0" accelerate datasets evaluate seqeval
!test -d LFM2-350M || git clone --depth=1 https://huggingface.co/LiquidAI/LFM2-350M
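# define Lfm2ForTokenClassification in LFM2-350M/model.py and register it in config.json via auto_map,
# so that AutoModelForTokenClassification (with trust_remote_code) can find it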
import json
with open("LFM2-350M/model.py","w",encoding="utf-8") as w:
  print("""from transformers import Lfm2PreTrainedModel
from transformers.modeling_layers import GenericForTokenClassification
class Lfm2ForTokenClassification(GenericForTokenClassification,Lfm2PreTrainedModel):
  pass""",file=w)
with open("LFM2-350M/config.json","r",encoding="utf-8") as r:
  d=json.load(r)
if "auto_map" not in d:
  d["auto_map"]={}
d["auto_map"]["AutoModelForTokenClassification"]="model.Lfm2ForTokenClassification"
with open("LFM2-350M/config.json","w",encoding="utf-8") as w:
  json.dump(d,w,indent=2)
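# clone the transformers source tree matching the installed version ({s} is expanded by IPython into the awk program above);
# run_ner.py lives under its examples directory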
s='$1=="transformers"{printf("-b v%s",$2)}'
!test -d transformers || git clone `pip list | awk '{s}'` https://github.com/huggingface/transformers
!test -d UD_Japanese-GSDLUW || git clone --depth=1 https://github.com/UniversalDependencies/UD_Japanese-GSDLUW
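# convert each CoNLL-U sentence into one JSON line of parallel "tokens"/"tags" lists for run_ner.py,
# prefixing a dummy <|startoftext|>/SYM pair and restoring spaces according to SpaceAfter=No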
def makejson(conllu_file,json_file):
  with open(conllu_file,"r",encoding="utf-8") as r, open(json_file,"w",encoding="utf-8") as w:
    d,f={"tokens":["<|startoftext|>"],"tags":["SYM"]},False
    for s in r:
      if s.strip()=="":
        if len(d["tokens"])>1:
          print(json.dumps(d),file=w)
        d,f={"tokens":["<|startoftext|>"],"tags":["SYM"]},False
      else:
        t=s.split("\t")
        if len(t)==10 and t[0].isdecimal():
          d["tokens"].append(" "+t[1] if f else t[1])
          d["tags"].append(t[3])
          f=t[9].find("SpaceAfter=No")<0
makejson("UD_Japanese-GSDLUW/ja_gsdluw-ud-train.conllu","train.json")
makejson("UD_Japanese-GSDLUW/ja_gsdluw-ud-dev.conllu","dev.json")
makejson("UD_Japanese-GSDLUW/ja_gsdluw-ud-test.conllu","test.json")
!env WANDB_DISABLED=true python transformers/examples/pytorch/token-classification/run_ner.py --task_name pos --model_name_or_path LFM2-350M --trust_remote_code --train_file train.json --validation_file dev.json --test_file test.json --output_dir ./LFM2-350M-japanese-luw-upos --overwrite_output_dir --do_train --do_eval --do_predict
!cp LFM2-350M/model.py LFM2-350M-japanese-luw-upos
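For reference, each line that makejson writes is a self-contained JSON object whose "tokens" and "tags" lists run in parallel, starting with the dummy <|startoftext|>/SYM pair. A quick way to inspect the first training example is something like this (a minimal sketch, assuming train.json has already been generated by the cell above):

import json
# read the first JSON line written by makejson and pair each token with its UPOS tag
with open("train.json","r",encoding="utf-8") as r:
  d=json.loads(r.readline())
print(list(zip(d["tokens"],d["tags"])))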

On my (Koichi Yasuoka's) machine, the following metrics came out after roughly 40 minutes, and LFM2-350M-japanese-luw-upos was built.

***** train metrics *****
  epoch                    =        3.0
  total_flos               =  2191825GF
  train_loss               =     0.1819
  train_runtime            = 0:34:06.70
  train_samples            =       7050
  train_samples_per_second =     10.334
  train_steps_per_second   =      1.293

***** eval metrics *****
  epoch                   =        3.0
  eval_accuracy           =     0.9169
  eval_f1                 =     0.9067
  eval_loss               =     0.3976
  eval_precision          =     0.9076
  eval_recall             =     0.9059
  eval_runtime            = 0:00:07.89
  eval_samples            =        507
  eval_samples_per_second =     64.258
  eval_steps_per_second   =      8.111

***** predict metrics *****
  predict_accuracy           =     0.9047
  predict_f1                 =     0.8921
  predict_loss               =     0.4551
  predict_precision          =     0.8935
  predict_recall             =     0.8906
  predict_runtime            = 0:00:08.37
  predict_samples_per_second =     64.834
  predict_steps_per_second   =      8.119

The eval and predict F1 scores are both around 0.90, so there still seems to be room for improvement. Let's try running the model for a moment.

from transformers import AutoTokenizer,AutoModelForTokenClassification,TokenClassificationPipeline
tkz=AutoTokenizer.from_pretrained("LFM2-350M-japanese-luw-upos")
mdl=AutoModelForTokenClassification.from_pretrained("LFM2-350M-japanese-luw-upos",trust_remote_code=True)
nlp=TokenClassificationPipeline(model=mdl,tokenizer=tkz)
txt="国境の長いトンネルを抜けると雪国であった。"
doc=nlp(txt)
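# attach to each prediction the surface string recovered from its character offsets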
for t in doc:
  t["text"]=txt[t["start"]:t["end"]]
print(doc)

When I used the freshly built LFM2-350M-japanese-luw-upos to tag 「国境の長いトンネルを抜けると雪国であった。」, I got the following result on my machine.

[{'entity': 'NOUN', 'score': np.float32(0.7175606), 'index': 1, 'word': 'åĽ½', 'start': 0, 'end': 1, 'text': '国'}, {'entity': 'NOUN', 'score': np.float32(0.9999825), 'index': 2, 'word': 'å¢ĥ', 'start': 1, 'end': 2, 'text': '境'}, {'entity': 'ADP', 'score': np.float32(1.0), 'index': 3, 'word': 'ãģ®', 'start': 2, 'end': 3, 'text': 'の'}, {'entity': 'NOUN', 'score': np.float32(0.9765905), 'index': 4, 'word': 'éķ·', 'start': 3, 'end': 4, 'text': '長'}, {'entity': 'ADJ', 'score': np.float32(0.95155585), 'index': 5, 'word': 'ãģĦ', 'start': 4, 'end': 5, 'text': 'い'}, {'entity': 'NOUN', 'score': np.float32(0.9999083), 'index': 6, 'word': 'ãĥĪ', 'start': 5, 'end': 6, 'text': 'ト'}, {'entity': 'NOUN', 'score': np.float32(0.9998863), 'index': 7, 'word': 'ãĥ³ãĥ', 'start': 6, 'end': 8, 'text': 'ンネ'}, {'entity': 'NOUN', 'score': np.float32(0.99999976), 'index': 8, 'word': 'į', 'start': 7, 'end': 8, 'text': 'ネ'}, {'entity': 'NOUN', 'score': np.float32(0.9999635), 'index': 9, 'word': 'ãĥ«', 'start': 8, 'end': 9, 'text': 'ル'}, {'entity': 'ADP', 'score': np.float32(1.0), 'index': 10, 'word': 'ãĤĴ', 'start': 9, 'end': 10, 'text': 'を'}, {'entity': 'VERB', 'score': np.float32(0.9999256), 'index': 11, 'word': 'æĬľ', 'start': 10, 'end': 11, 'text': '抜'}, {'entity': 'AUX', 'score': np.float32(0.43896234), 'index': 12, 'word': 'ãģij', 'start': 11, 'end': 12, 'text': 'け'}, {'entity': 'SCONJ', 'score': np.float32(0.99998295), 'index': 13, 'word': 'ãĤĭãģ¨', 'start': 12, 'end': 14, 'text': 'ると'}, {'entity': 'NOUN', 'score': np.float32(0.99985266), 'index': 14, 'word': 'éĽª', 'start': 14, 'end': 15, 'text': '雪'}, {'entity': 'NOUN', 'score': np.float32(0.99924785), 'index': 15, 'word': 'åĽ½', 'start': 15, 'end': 16, 'text': '国'}, {'entity': 'AUX', 'score': np.float32(0.99999976), 'index': 16, 'word': 'ãģ§ãģĤãģ£ãģŁ', 'start': 16, 'end': 20, 'text': 'であった'}, {'entity': 'PUNCT', 'score': np.float32(1.0), 'index': 17, 'word': 'ãĢĤ', 'start': 20, 'end': 21, 'text': '。'}]

As expected, the tokenizer is not well suited to Japanese (or rather, to NINJAL Long Unit Words), and as a result the POS tagging is mediocre: the 'word' fields above are byte-level BPE pieces, and some of them (e.g. 'ãĥ³ãĥ' and 'į' around 「ンネ」) even cut through a single character. It seems the tokenizer improvements I tried in 『GPT系言語モデルによる国語研長単位係り受け解析』 may need to be tried on LFM2-350M as well.
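To see how far the byte-level tokenizer strays from Long-Unit-Word boundaries, one can simply look at how it segments the test sentence. A minimal sketch, assuming the LFM2-350M-japanese-luw-upos directory built above:

from transformers import AutoTokenizer
tkz=AutoTokenizer.from_pretrained("LFM2-350M-japanese-luw-upos")
txt="国境の長いトンネルを抜けると雪国であった。"
# the byte-level BPE pieces (the same strings that appear in the 'word' fields above)
print(tkz.convert_ids_to_tokens(tkz(txt)["input_ids"]))
# decode each piece separately to see where pieces cut across character boundaries
print([tkz.decode([i]) for i in tkz(txt)["input_ids"]])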
