POS tagging of English Universal Dependencies with DebertaV2ForTokenClassification


I decided to apply the approach from yesterday's article to deberta-v3-base and compare it with ModernBERT-base. On Google Colaboratory (GPU), it goes like this:

!pip install transformers datasets evaluate seqeval accelerate sentencepiece
s='$1=="transformers"{printf("-b v%s",$2)}'
!test -d transformers || git clone `pip3 list | awk '{s}'` https://github.com/huggingface/transformers
!test -d UD_English-EWT || git clone --depth=1 https://github.com/UniversalDependencies/UD_English-EWT
import json
def makejson(conllu_file,json_file):
  with open(conllu_file,"r",encoding="utf-8") as r, open(json_file,"w",encoding="utf-8") as w:
    d={"tokens":["[CLS]"],"tags":["SYM"]}
    for s in r:
      if s.strip()=="":
        if len(d["tokens"])>1:
          d["tokens"].append("[SEP]")
          d["tags"].append("SYM")
          print(json.dumps(d),file=w)
        d={"tokens":["[CLS]"],"tags":["SYM"]}
      else:
        t=s.split("\t")
        if len(t)==10 and t[0].isdecimal():
          d["tokens"].append(t[1])
          d["tags"].append(t[3])
makejson("UD_English-EWT/en_ewt-ud-train.conllu","train.json")
makejson("UD_English-EWT/en_ewt-ud-dev.conllu","dev.json")
makejson("UD_English-EWT/en_ewt-ud-test.conllu","test.json")
!env WANDB_DISABLED=true python3 transformers/examples/pytorch/token-classification/run_ner.py --task_name pos --model_name_or_path microsoft/deberta-v3-base --train_file train.json --validation_file dev.json --test_file test.json --output_dir ./deberta-v3-base-english-upos --overwrite_output_dir --do_train --do_eval --do_predict
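For reference, makejson emits one JSON object per sentence, each pairing a token list (wrapped in [CLS]/[SEP]) with a UPOS tag list, which is the format run_ner.py consumes. Here is a self-contained sketch of the same parsing logic applied to an invented two-token CoNLL-U fragment (the sentence and its tags are made up for illustration):

```python
import json

# Invented CoNLL-U fragment: comment line, two 10-column token rows,
# and a blank line that terminates the sentence.
rows = [
    "# sent_id = demo",
    "\t".join(["1", "It", "it", "PRON", "PRP", "_", "0", "_", "_", "_"]),
    "\t".join(["2", "works", "work", "VERB", "VBZ", "_", "0", "_", "_", "_"]),
    "",
]

records = []
d = {"tokens": ["[CLS]"], "tags": ["SYM"]}
for s in rows:
    if s.strip() == "":
        # Blank line: close the sentence, padding with [SEP]/SYM.
        if len(d["tokens"]) > 1:
            d["tokens"].append("[SEP]")
            d["tags"].append("SYM")
            records.append(d)
        d = {"tokens": ["[CLS]"], "tags": ["SYM"]}
    else:
        t = s.split("\t")
        # Keep only 10-column rows whose ID is a plain integer
        # (skips comments, multiword ranges like "1-2", empty nodes).
        if len(t) == 10 and t[0].isdecimal():
            d["tokens"].append(t[1])  # FORM column
            d["tags"].append(t[3])    # UPOS column

print(json.dumps(records[0]))
```

Comment lines, multiword-token ranges such as "1-2", and empty nodes such as "3.1" all fail the `isdecimal()` check, so only the basic word lines survive.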

On my machine (I am Koichi Yasuoka), the following metrics were printed after roughly 22 minutes, and deberta-v3-base-english-upos was built.

***** train metrics *****
  epoch                    =        3.0
  total_flos               =   789063GF
  train_loss               =     0.0529
  train_runtime            = 0:20:06.49
  train_samples            =      12544
  train_samples_per_second =     31.191
  train_steps_per_second   =      3.899

***** eval metrics *****
  epoch                   =        3.0
  eval_accuracy           =      0.983
  eval_f1                 =     0.9804
  eval_loss               =     0.0749
  eval_precision          =       0.98
  eval_recall             =     0.9808
  eval_runtime            = 0:00:08.77
  eval_samples            =       2001
  eval_samples_per_second =    228.122
  eval_steps_per_second   =     28.615

***** predict metrics *****
  predict_accuracy           =     0.9844
  predict_f1                 =     0.9815
  predict_loss               =     0.0675
  predict_precision          =     0.9802
  predict_recall             =     0.9828
  predict_runtime            = 0:00:09.73
  predict_samples_per_second =    213.346
  predict_steps_per_second   =     26.707

With F1 just above 0.98 on both eval and predict, this is a pretty decent result. Let's take it for a spin.

from transformers import AutoTokenizer,AutoModelForTokenClassification,TokenClassificationPipeline
tkz=AutoTokenizer.from_pretrained("deberta-v3-base-english-upos")
mdl=AutoModelForTokenClassification.from_pretrained("deberta-v3-base-english-upos")
nlp=TokenClassificationPipeline(model=mdl,tokenizer=tkz,device=0)
print(nlp("It don't mean a thing if it ain't got that swing"))

Tagging "It don't mean a thing if it ain't got that swing" with the freshly built deberta-v3-base-english-upos, I got the following result on my machine.

[{'entity': 'PRON', 'score': 0.99985194, 'index': 1, 'word': '▁It', 'start': 0, 'end': 2}, {'entity': 'AUX', 'score': 0.9993647, 'index': 2, 'word': '▁don', 'start': 2, 'end': 6}, {'entity': 'PART', 'score': 0.40081334, 'index': 3, 'word': "'", 'start': 6, 'end': 7}, {'entity': 'PART', 'score': 0.9993881, 'index': 4, 'word': 't', 'start': 7, 'end': 8}, {'entity': 'VERB', 'score': 0.99983907, 'index': 5, 'word': '▁mean', 'start': 8, 'end': 13}, {'entity': 'DET', 'score': 0.9998281, 'index': 6, 'word': '▁a', 'start': 13, 'end': 15}, {'entity': 'NOUN', 'score': 0.99954957, 'index': 7, 'word': '▁thing', 'start': 15, 'end': 21}, {'entity': 'SCONJ', 'score': 0.99974984, 'index': 8, 'word': '▁if', 'start': 21, 'end': 24}, {'entity': 'PRON', 'score': 0.99986565, 'index': 9, 'word': '▁it', 'start': 24, 'end': 27}, {'entity': 'AUX', 'score': 0.9997147, 'index': 10, 'word': '▁ain', 'start': 27, 'end': 31}, {'entity': 'PART', 'score': 0.36902195, 'index': 11, 'word': "'", 'start': 31, 'end': 32}, {'entity': 'PART', 'score': 0.998618, 'index': 12, 'word': 't', 'start': 32, 'end': 33}, {'entity': 'VERB', 'score': 0.9987697, 'index': 13, 'word': '▁got', 'start': 33, 'end': 37}, {'entity': 'DET', 'score': 0.98608893, 'index': 14, 'word': '▁that', 'start': 37, 'end': 42}, {'entity': 'NOUN', 'score': 0.9983407, 'index': 15, 'word': '▁swing', 'start': 42, 'end': 48}]

The "▁" glued onto word-initial tokens can be cleaned up somehow, but "don't" and "ain't" are still unsatisfying. The UD_English-EWT convention is to split "don't" into "do"+"n't" and "ain't" into "ai"+"n't", tagging them AUX and PART respectively, yet the result above splits them into "don"+"'"+"t" and "ain"+"'"+"t". Even so, judging from this result, deberta-v3-base looks more promising than ModernBERT-base. I wonder why that is.
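To make the mismatch concrete, here is a minimal post-processing sketch (the merge_pieces helper and the three-piece sample are my own illustration, not part of the pipeline API): regrouping subword pieces on the SentencePiece word-start marker "▁" recovers "don't" as one word, whereas UD_English-EWT wants the "do"+"n't" split, which no amount of subword merging can reproduce.

```python
# Hypothetical post-processing of TokenClassificationPipeline output:
# strip the SentencePiece marker "\u2581" and merge pieces that do not
# start a new word, keeping the tag of the word-initial piece.
def merge_pieces(results):
    merged = []
    for r in results:
        if r["word"].startswith("\u2581"):  # "▁" marks a word start
            merged.append({"word": r["word"][1:], "entity": r["entity"],
                           "start": r["start"], "end": r["end"]})
        elif merged:
            merged[-1]["word"] += r["word"]  # continuation piece
            merged[-1]["end"] = r["end"]
        else:
            merged.append(dict(r))
    return merged

# The three pieces that "don't" was cut into above (scores omitted):
sample = [
    {"entity": "AUX", "word": "\u2581don", "start": 2, "end": 6},
    {"entity": "PART", "word": "'", "start": 6, "end": 7},
    {"entity": "PART", "word": "t", "start": 7, "end": 8},
]
# Merges to a single word "don't" carrying only the AUX tag -
# the PART tag of "n't" is lost, unlike the EWT "do"+"n't" analysis.
print(merge_pieces(sample))
```

A proper fix would need the tokenizer itself to split at the "do"/"n't" boundary, which is what the next step (retraining or remapping the tokenizer) would have to address.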
