
UPOS tagging of English Universal Dependencies with NeoBERTForTokenClassification

Posted at 2025-03-02

Following up on yesterday's article on NeoBERT: Stefan Schweter has written NeoBERTForTokenClassification, so, with the help of run_ner.py from transformers v4.49.0, I tried UPOS tagging on UD_English-EWT. On Google Colaboratory (GPU), it goes like this:

!pip install -U transformers datasets evaluate seqeval accelerate xformers flash_attn torchvision
!test -d NeoBERT || git clone --depth=1 https://huggingface.co/chandar-lab/NeoBERT
!curl -L https://huggingface.co/stefan-it/neobert-ner-conll03/raw/main/model.py > NeoBERT/model.py
import json
with open("NeoBERT/config.json","r",encoding="utf-8") as r:
  d=json.load(r)
d["auto_map"]["AutoModelForTokenClassification"]="model.NeoBERTForTokenClassification"
with open("NeoBERT/config.json","w",encoding="utf-8") as w:
  json.dump(d,w,indent=2)
# {s} below is expanded by Colab's shell magic, pinning the clone to the installed transformers version
s='$1=="transformers"{printf("-b v%s",$2)}'
!test -d transformers || git clone `pip3 list | awk '{s}'` https://github.com/huggingface/transformers
!test -d UD_English-EWT || git clone --depth=1 https://github.com/UniversalDependencies/UD_English-EWT
def makejson(conllu_file,json_file):
  # Convert a CoNLL-U file into the JSON-Lines format expected by run_ner.py,
  # wrapping each sentence in [CLS]/[SEP] tokens tagged as SYM
  with open(conllu_file,"r",encoding="utf-8") as r, open(json_file,"w",encoding="utf-8") as w:
    d={"tokens":["[CLS]"],"tags":["SYM"]}
    for s in r:
      if s.strip()=="":
        # a blank line ends a sentence; flush it if it has any tokens
        if len(d["tokens"])>1:
          d["tokens"].append("[SEP]")
          d["tags"].append("SYM")
          print(json.dumps(d),file=w)
        d={"tokens":["[CLS]"],"tags":["SYM"]}
      else:
        t=s.split("\t")
        # keep only single-word lines: FORM (column 2) and UPOS (column 4)
        if len(t)==10 and t[0].isdecimal():
          d["tokens"].append(t[1])
          d["tags"].append(t[3])
makejson("UD_English-EWT/en_ewt-ud-train.conllu","train.json")
makejson("UD_English-EWT/en_ewt-ud-dev.conllu","dev.json")
makejson("UD_English-EWT/en_ewt-ud-test.conllu","test.json")
!env WANDB_DISABLED=true python3 transformers/examples/pytorch/token-classification/run_ner.py --task_name pos --model_name_or_path NeoBERT --trust_remote_code --train_file train.json --validation_file dev.json --test_file test.json --output_dir ./NeoBERT-english-upos --overwrite_output_dir --do_train --do_eval --do_predict
!cp NeoBERT/*.py NeoBERT-english-upos

On my (Koichi Yasuoka's) machine, the following metrics were produced in about 30 minutes, yielding NeoBERT-english-upos.

***** train metrics *****
  epoch                    =        3.0
  total_flos               =  1907895GF
  train_loss               =     0.0529
  train_runtime            = 0:26:18.59
  train_samples            =      12544
  train_samples_per_second =     23.839
  train_steps_per_second   =       2.98

***** eval metrics *****
  epoch                   =        3.0
  eval_accuracy           =     0.9809
  eval_f1                 =     0.9782
  eval_loss               =     0.0842
  eval_precision          =     0.9779
  eval_recall             =     0.9786
  eval_runtime            = 0:00:12.95
  eval_samples            =       2001
  eval_samples_per_second =    154.408
  eval_steps_per_second   =     19.369

***** predict metrics *****
  predict_accuracy           =     0.9814
  predict_f1                 =     0.9785
  predict_loss               =     0.0786
  predict_precision          =      0.977
  predict_recall             =       0.98
  predict_runtime            = 0:00:13.69
  predict_samples_per_second =     151.67
  predict_steps_per_second   =     18.986

Both eval and predict F1 come in just under 0.98, so it seems to be doing reasonably well. Let's give it a quick try.

from transformers import AutoTokenizer,AutoModelForTokenClassification,TokenClassificationPipeline
tkz=AutoTokenizer.from_pretrained("NeoBERT-english-upos")
mdl=AutoModelForTokenClassification.from_pretrained("NeoBERT-english-upos",trust_remote_code=True)
nlp=TokenClassificationPipeline(model=mdl,tokenizer=tkz,device=0)
print(nlp("It don't mean a thing if it ain't got that swing"))

Tagging "It don't mean a thing if it ain't got that swing" with the freshly trained NeoBERT-english-upos, I got the following result on my machine.

[{'entity': 'PRON', 'score': 0.9999176, 'index': 1, 'word': 'it', 'start': 0, 'end': 2}, {'entity': 'AUX', 'score': 0.99964607, 'index': 2, 'word': 'don', 'start': 3, 'end': 6}, {'entity': 'PART', 'score': 0.9090374, 'index': 3, 'word': "'", 'start': 6, 'end': 7}, {'entity': 'PART', 'score': 0.9998319, 'index': 4, 'word': 't', 'start': 7, 'end': 8}, {'entity': 'VERB', 'score': 0.999897, 'index': 5, 'word': 'mean', 'start': 9, 'end': 13}, {'entity': 'DET', 'score': 0.9998585, 'index': 6, 'word': 'a', 'start': 14, 'end': 15}, {'entity': 'NOUN', 'score': 0.9997912, 'index': 7, 'word': 'thing', 'start': 16, 'end': 21}, {'entity': 'SCONJ', 'score': 0.9997054, 'index': 8, 'word': 'if', 'start': 22, 'end': 24}, {'entity': 'PRON', 'score': 0.9999281, 'index': 9, 'word': 'it', 'start': 25, 'end': 27}, {'entity': 'AUX', 'score': 0.99948126, 'index': 10, 'word': 'ain', 'start': 28, 'end': 31}, {'entity': 'PART', 'score': 0.97454005, 'index': 11, 'word': "'", 'start': 31, 'end': 32}, {'entity': 'PART', 'score': 0.9998357, 'index': 12, 'word': 't', 'start': 32, 'end': 33}, {'entity': 'VERB', 'score': 0.9996866, 'index': 13, 'word': 'got', 'start': 34, 'end': 37}, {'entity': 'DET', 'score': 0.99983764, 'index': 14, 'word': 'that', 'start': 38, 'end': 42}, {'entity': 'NOUN', 'score': 0.99974066, 'index': 15, 'word': 'swing', 'start': 43, 'end': 48}]
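For readability, the subword pieces in the output above can be merged back into surface words by their character offsets: consecutive entries whose spans touch (no whitespace gap) belong to one word. A minimal sketch, where `merge_pieces` is my own helper and not part of transformers:

```python
def merge_pieces(entities):
    """Merge pipeline entities whose character spans touch (start equals the
    previous piece's end) into one surface word, keeping the first piece's tag."""
    merged = []
    for e in entities:
        if merged and e["start"] == merged[-1][2]:
            word, tag, _ = merged[-1]
            merged[-1] = (word + e["word"], tag, e["end"])
        else:
            merged.append((e["word"], e["entity"], e["end"]))
    return [(word, tag) for word, tag, _ in merged]

# a few entries taken from the pipeline output above
pieces = [
    {"entity": "AUX", "word": "don", "start": 3, "end": 6},
    {"entity": "PART", "word": "'", "start": 6, "end": 7},
    {"entity": "PART", "word": "t", "start": 7, "end": 8},
    {"entity": "VERB", "word": "mean", "start": 9, "end": 13},
]
print(merge_pieces(pieces))  # [("don't", 'AUX'), ('mean', 'VERB')]
```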

As expected, "don't" and "ain't" fall short. The UD_English-EWT convention is to split "don't" into "do" and "n't", and "ain't" into "ai" and "n't", tagging them AUX and PART respectively, but in the result above they are split into "don" + "'" + "t" and "ain" + "'" + "t". In that respect the result is almost the same as deberta-v3-base, yet on F1 deberta-v3-base is still more accurate than NeoBERT. Hmm, perhaps NeoBERTForTokenClassification needs some more improvement.
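For reference, the splitting convention described above can be mimicked with a tiny regex. This is my own illustration of the expected token boundaries, not how UD_English-EWT is actually tokenized internally:

```python
import re

def ud_style_tokens(text):
    """Split English "n't" contractions the way UD_English-EWT does:
    "don't" -> ["do", "n't"], "ain't" -> ["ai", "n't"]."""
    tokens = []
    for word in text.split():
        m = re.match(r"(?i)^(\w+)(n't)$", word)
        tokens.extend(m.groups() if m else [word])
    return tokens

print(ud_style_tokens("It don't mean a thing if it ain't got that swing"))
# ['It', 'do', "n't", 'mean', 'a', 'thing', 'if', 'it', 'ai', "n't", 'got', 'that', 'swing']
```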
