POS tagging of English Universal Dependencies with DebertaV2ForTokenClassification


I decided to apply the approach from yesterday's article to deberta-v3-base and compare it with ModernBERT-base. On Google Colaboratory (GPU), it goes like this:

!pip install transformers datasets evaluate seqeval accelerate sentencepiece
s='$1=="transformers"{printf("-b v%s",$2)}'
!test -d transformers || git clone `pip3 list | awk '{s}'` https://github.com/huggingface/transformers
!test -d UD_English-EWT || git clone --depth=1 https://github.com/UniversalDependencies/UD_English-EWT
import json
def makejson(conllu_file,json_file):
  with open(conllu_file,"r",encoding="utf-8") as r, open(json_file,"w",encoding="utf-8") as w:
    d={"tokens":["[CLS]"],"tags":["SYM"]}
    for s in r:
      if s.strip()=="":
        if len(d["tokens"])>1:
          d["tokens"].append("[SEP]")
          d["tags"].append("SYM")
          print(json.dumps(d),file=w)
        d={"tokens":["[CLS]"],"tags":["SYM"]}
      else:
        t=s.split("\t")
        if len(t)==10 and t[0].isdecimal():
          d["tokens"].append(t[1])
          d["tags"].append(t[3])
makejson("UD_English-EWT/en_ewt-ud-train.conllu","train.json")
makejson("UD_English-EWT/en_ewt-ud-dev.conllu","dev.json")
makejson("UD_English-EWT/en_ewt-ud-test.conllu","test.json")
!env WANDB_DISABLED=true python3 transformers/examples/pytorch/token-classification/run_ner.py --task_name pos --model_name_or_path microsoft/deberta-v3-base --train_file train.json --validation_file dev.json --test_file test.json --output_dir ./deberta-v3-base-english-upos --overwrite_output_dir --do_train --do_eval --do_predict
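For reference, makejson emits one JSON object per sentence, each pairing a token list (wrapped in [CLS]/[SEP]) with a UPOS tag list, which is the format run_ner.py consumes. Here is a self-contained sketch of the same parsing logic applied to an invented two-token CoNLL-U fragment (the sentence and its tags are made up for illustration):

```python
import json

# Invented CoNLL-U fragment: comment line, two 10-column token rows,
# and a blank line that terminates the sentence.
rows = [
    "# sent_id = demo",
    "\t".join(["1", "It", "it", "PRON", "PRP", "_", "0", "_", "_", "_"]),
    "\t".join(["2", "works", "work", "VERB", "VBZ", "_", "0", "_", "_", "_"]),
    "",
]

records = []
d = {"tokens": ["[CLS]"], "tags": ["SYM"]}
for s in rows:
    if s.strip() == "":
        # Blank line: close the sentence, padding with [SEP]/SYM.
        if len(d["tokens"]) > 1:
            d["tokens"].append("[SEP]")
            d["tags"].append("SYM")
            records.append(d)
        d = {"tokens": ["[CLS]"], "tags": ["SYM"]}
    else:
        t = s.split("\t")
        # Keep only 10-column rows whose ID is a plain integer
        # (skips comments, multiword ranges like "1-2", empty nodes).
        if len(t) == 10 and t[0].isdecimal():
            d["tokens"].append(t[1])  # FORM column
            d["tags"].append(t[3])    # UPOS column

print(json.dumps(records[0]))
```

Comment lines, multiword-token ranges such as "1-2", and empty nodes such as "3.1" all fail the `isdecimal()` check, so only the basic word lines survive.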

On my machine (I am Koichi Yasuoka), the following metrics were printed after roughly 22 minutes, and deberta-v3-base-english-upos was built.

***** train metrics *****
  epoch                    =        3.0
  total_flos               =   789063GF
  train_loss               =     0.0529
  train_runtime            = 0:20:06.49
  train_samples            =      12544
  train_samples_per_second =     31.191
  train_steps_per_second   =      3.899

***** eval metrics *****
  epoch                   =        3.0
  eval_accuracy           =      0.983
  eval_f1                 =     0.9804
  eval_loss               =     0.0749
  eval_precision          =       0.98
  eval_recall             =     0.9808
  eval_runtime            = 0:00:08.77
  eval_samples            =       2001
  eval_samples_per_second =    228.122
  eval_steps_per_second   =     28.615

***** predict metrics *****
  predict_accuracy           =     0.9844
  predict_f1                 =     0.9815
  predict_loss               =     0.0675
  predict_precision          =     0.9802
  predict_recall             =     0.9828
  predict_runtime            = 0:00:09.73
  predict_samples_per_second =    213.346
  predict_steps_per_second   =     26.707

With F1 just above 0.98 on both eval and predict, this is a pretty decent result. Let's take it for a spin.

from transformers import AutoTokenizer,AutoModelForTokenClassification,TokenClassificationPipeline
tkz=AutoTokenizer.from_pretrained("deberta-v3-base-english-upos")
mdl=AutoModelForTokenClassification.from_pretrained("deberta-v3-base-english-upos")
nlp=TokenClassificationPipeline(model=mdl,tokenizer=tkz,device=0)
print(nlp("It don't mean a thing if it ain't got that swing"))

Tagging "It don't mean a thing if it ain't got that swing" with the freshly built deberta-v3-base-english-upos, I got the following result on my machine.

[{'entity': 'PRON', 'score': 0.99985194, 'index': 1, 'word': '▁It', 'start': 0, 'end': 2}, {'entity': 'AUX', 'score': 0.9993647, 'index': 2, 'word': '▁don', 'start': 2, 'end': 6}, {'entity': 'PART', 'score': 0.40081334, 'index': 3, 'word': "'", 'start': 6, 'end': 7}, {'entity': 'PART', 'score': 0.9993881, 'index': 4, 'word': 't', 'start': 7, 'end': 8}, {'entity': 'VERB', 'score': 0.99983907, 'index': 5, 'word': '▁mean', 'start': 8, 'end': 13}, {'entity': 'DET', 'score': 0.9998281, 'index': 6, 'word': '▁a', 'start': 13, 'end': 15}, {'entity': 'NOUN', 'score': 0.99954957, 'index': 7, 'word': '▁thing', 'start': 15, 'end': 21}, {'entity': 'SCONJ', 'score': 0.99974984, 'index': 8, 'word': '▁if', 'start': 21, 'end': 24}, {'entity': 'PRON', 'score': 0.99986565, 'index': 9, 'word': '▁it', 'start': 24, 'end': 27}, {'entity': 'AUX', 'score': 0.9997147, 'index': 10, 'word': '▁ain', 'start': 27, 'end': 31}, {'entity': 'PART', 'score': 0.36902195, 'index': 11, 'word': "'", 'start': 31, 'end': 32}, {'entity': 'PART', 'score': 0.998618, 'index': 12, 'word': 't', 'start': 32, 'end': 33}, {'entity': 'VERB', 'score': 0.9987697, 'index': 13, 'word': '▁got', 'start': 33, 'end': 37}, {'entity': 'DET', 'score': 0.98608893, 'index': 14, 'word': '▁that', 'start': 37, 'end': 42}, {'entity': 'NOUN', 'score': 0.9983407, 'index': 15, 'word': '▁swing', 'start': 42, 'end': 48}]

The "▁" glued onto word-initial tokens can be cleaned up somehow, but "don't" and "ain't" are still unsatisfying. The UD_English-EWT convention is to split "don't" into "do"+"n't" and "ain't" into "ai"+"n't", tagging them AUX and PART respectively, yet the result above splits them into "don"+"'"+"t" and "ain"+"'"+"t". Even so, judging from this result, deberta-v3-base looks more promising than ModernBERT-base. I wonder why that is.
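To make the mismatch concrete, here is a minimal post-processing sketch (the merge_pieces helper and the three-piece sample are my own illustration, not part of the pipeline API): regrouping subword pieces on the SentencePiece word-start marker "▁" recovers "don't" as one word, whereas UD_English-EWT wants the "do"+"n't" split, which no amount of subword merging can reproduce.

```python
# Hypothetical post-processing of TokenClassificationPipeline output:
# strip the SentencePiece marker "\u2581" and merge pieces that do not
# start a new word, keeping the tag of the word-initial piece.
def merge_pieces(results):
    merged = []
    for r in results:
        if r["word"].startswith("\u2581"):  # "▁" marks a word start
            merged.append({"word": r["word"][1:], "entity": r["entity"],
                           "start": r["start"], "end": r["end"]})
        elif merged:
            merged[-1]["word"] += r["word"]  # continuation piece
            merged[-1]["end"] = r["end"]
        else:
            merged.append(dict(r))
    return merged

# The three pieces that "don't" was cut into above (scores omitted):
sample = [
    {"entity": "AUX", "word": "\u2581don", "start": 2, "end": 6},
    {"entity": "PART", "word": "'", "start": 6, "end": 7},
    {"entity": "PART", "word": "t", "start": 7, "end": 8},
]
# Merges to a single word "don't" carrying only the AUX tag -
# the PART tag of "n't" is lost, unlike the EWT "do"+"n't" analysis.
print(merge_pieces(sample))
```

A proper fix would need the tokenizer itself to split at the "do"/"n't" boundary, which is what the next step (retraining or remapping the tokenizer) would have to address.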
