一昨日の記事のModernBERT-baseをもとに、transformers v4.47.1のrun_ner.pyの助けを借りて、UD_English-EWTのUPOS品詞付与を試してみた。Google Colaboratory (GPU版)だと、こんな感じ。
# Install fine-tuning dependencies (Colab shell magic).
!pip install transformers datasets evaluate seqeval accelerate triton
# Shallow-clone the ModernBERT-base checkpoint unless it is already present.
!test -d ModernBERT-base || git clone --depth=1 https://huggingface.co/answerdotai/ModernBERT-base
# The installed transformers release (v4.47.1) does not ship ModernBERT yet,
# so copy its configuration module from the transformers main branch into the
# checkpoint directory, rewriting relative imports ("from ...") to absolute
# "from transformers." imports so the file works as remote code.
!test -f ModernBERT-base/configuration_modernbert.py || ( curl -L https://github.com/huggingface/transformers/raw/refs/heads/main/src/transformers/models/modernbert/configuration_modernbert.py | sed 's/^from \.\.\./from transformers./' > ModernBERT-base/configuration_modernbert.py )
# Same for the modeling module; additionally replace the is_triton_available
# import with a stub that always returns True (triton was pip-installed above).
!test -f ModernBERT-base/modeling_modernbert.py || ( curl -L https://github.com/huggingface/transformers/raw/refs/heads/main/src/transformers/models/modernbert/modeling_modernbert.py | sed -e 's/^from \.\.\./from transformers./' -e 's/^from .* import is_triton_available/is_triton_available = lambda: True/' > ModernBERT-base/modeling_modernbert.py )
import json

# Register the remote-code classes in config.json via "auto_map" so that
# AutoConfig / AutoModel* with trust_remote_code=True can find the
# configuration/modeling modules copied into the checkpoint directory.
# NOTE(review): the pasted source lost its indentation; the write is placed
# inside the `if` so an already-patched config is left untouched (idempotent
# on rerun) — confirm against the original notebook if behavior must match.
with open("ModernBERT-base/config.json", "r", encoding="utf-8") as r:
    d = json.load(r)
if "auto_map" not in d:
    d["auto_map"] = {
        "AutoConfig": "configuration_modernbert.ModernBertConfig",
        "AutoModel": "modeling_modernbert.ModernBertModel",
        "AutoModelForMaskedLM": "modeling_modernbert.ModernBertForMaskedLM",
        "AutoModelForSequenceClassification": "modeling_modernbert.ModernBertForSequenceClassification",
        "AutoModelForTokenClassification": "modeling_modernbert.ModernBertForTokenClassification"
    }
    with open("ModernBERT-base/config.json", "w", encoding="utf-8") as w:
        json.dump(d, w, indent=2)
# awk program that turns the `pip list` line "transformers  4.47.1" into the
# git argument "-b v4.47.1".  In the ! line below, IPython substitutes the
# Python variable through the {s} placeholder before running the shell.
s='$1=="transformers"{printf("-b v%s",$2)}'
# Clone the transformers source tree at the tag matching the installed wheel,
# so examples/pytorch/token-classification/run_ner.py agrees with the
# installed library version.
!test -d transformers || git clone `pip list | awk '{s}'` https://github.com/huggingface/transformers
# Shallow-clone the UD_English-EWT treebank (train/dev/test .conllu files).
!test -d UD_English-EWT || git clone --depth=1 https://github.com/UniversalDependencies/UD_English-EWT
def makejson(conllu_file, json_file):
    """Convert a CoNLL-U file into the JSON-Lines format run_ner.py expects.

    Each sentence becomes one line of the form
    {"tokens": [FORM, ...], "tags": [UPOS, ...]}.  Multiword-token ranges
    (IDs like "3-4") and empty nodes (IDs like "5.1") are skipped because
    their ID field is not purely decimal.

    Args:
        conllu_file: path to the input .conllu treebank file.
        json_file: path of the JSON-Lines file to write (overwritten).
    """
    with open(conllu_file, "r", encoding="utf-8") as r, \
         open(json_file, "w", encoding="utf-8") as w:
        d = {"tokens": [], "tags": []}
        for s in r:
            if s.strip() == "":
                # A blank line terminates the current sentence.
                if d["tokens"]:
                    print(json.dumps(d), file=w)
                d = {"tokens": [], "tags": []}
            else:
                t = s.split("\t")
                # 10 tab-separated fields with a decimal ID = a regular token
                # (comment lines and MWT/empty-node rows fail this check).
                if len(t) == 10 and t[0].isdecimal():
                    d["tokens"].append(t[1])  # FORM column
                    d["tags"].append(t[3])    # UPOS column
        # Flush a trailing sentence in case the file lacks a final blank line.
        if d["tokens"]:
            print(json.dumps(d), file=w)
# Convert each UD_English-EWT split into the JSON-Lines layout run_ner.py reads.
for split in ("train", "dev", "test"):
    makejson("UD_English-EWT/en_ewt-ud-" + split + ".conllu", split + ".json")
# Fine-tune ModernBERT-base for UPOS tagging with the stock run_ner.py example
# (W&B disabled; --trust_remote_code loads the modules copied into the
# checkpoint directory; train/eval/predict on the three JSON splits).
!env WANDB_DISABLED=true python transformers/examples/pytorch/token-classification/run_ner.py --task_name pos --model_name_or_path ModernBERT-base --trust_remote_code --train_file train.json --validation_file dev.json --test_file test.json --output_dir ./ModernBERT-base-english-upos --overwrite_output_dir --do_train --do_eval --do_predict
# Ship the remote-code .py files with the fine-tuned checkpoint so it can be
# loaded stand-alone with trust_remote_code=True.
!cp ModernBERT-base/*.py ModernBERT-base-english-upos
私(安岡孝一)の手元では、20分ほどで以下のmetricsが出力されて、ModernBERT-base-english-uposが出来上がった。
***** train metrics *****
epoch = 3.0
total_flos = 1131708GF
train_loss = 0.0585
train_runtime = 0:19:12.79
train_samples = 12544
train_samples_per_second = 32.644
train_steps_per_second = 4.081
***** eval metrics *****
epoch = 3.0
eval_accuracy = 0.9779
eval_f1 = 0.9737
eval_loss = 0.1094
eval_precision = 0.9742
eval_recall = 0.9732
eval_runtime = 0:00:14.20
eval_samples = 2001
eval_samples_per_second = 140.845
eval_steps_per_second = 17.667
***** predict metrics *****
predict_accuracy = 0.9766
predict_f1 = 0.9727
predict_loss = 0.1092
predict_precision = 0.9711
predict_recall = 0.9743
predict_runtime = 0:00:10.60
predict_samples_per_second = 195.791
predict_steps_per_second = 24.509
eval・predictともにF1値が0.97強なので、まあ、イイセンいってるようだ。ちょっと動かしてみよう。
# Smoke-test the fine-tuned tagger with a token-classification pipeline.
from transformers import AutoTokenizer,AutoModelForTokenClassification,TokenClassificationPipeline
tkz=AutoTokenizer.from_pretrained("ModernBERT-base-english-upos")
# trust_remote_code is required because the checkpoint ships its own
# configuration/modeling modules (referenced through auto_map in config.json).
mdl=AutoModelForTokenClassification.from_pretrained("ModernBERT-base-english-upos",trust_remote_code=True)
# device=0 runs inference on the first GPU.
nlp=TokenClassificationPipeline(model=mdl,tokenizer=tkz,device=0)
print(nlp("It don't mean a thing if it ain't got that swing"))
出来立てのModernBERT-base-english-uposで「It don't mean a thing if it ain't got that swing」に品詞付与してみたところ、私の手元では以下の結果が得られた。
[{'entity': 'PRON', 'score': 0.9999982, 'index': 1, 'word': 'It', 'start': 0, 'end': 2}, {'entity': 'AUX', 'score': 0.9999801, 'index': 2, 'word': 'Ġdon', 'start': 2, 'end': 6}, {'entity': 'PART', 'score': 0.99992585, 'index': 3, 'word': "'t", 'start': 6, 'end': 8}, {'entity': 'VERB', 'score': 0.9997824, 'index': 4, 'word': 'Ġmean', 'start': 8, 'end': 13}, {'entity': 'DET', 'score': 0.9999635, 'index': 5, 'word': 'Ġa', 'start': 13, 'end': 15}, {'entity': 'NOUN', 'score': 0.99820757, 'index': 6, 'word': 'Ġthing', 'start': 15, 'end': 21}, {'entity': 'SCONJ', 'score': 0.99993324, 'index': 7, 'word': 'Ġif', 'start': 21, 'end': 24}, {'entity': 'PRON', 'score': 0.999997, 'index': 8, 'word': 'Ġit', 'start': 24, 'end': 27}, {'entity': 'AUX', 'score': 0.99998164, 'index': 9, 'word': 'Ġain', 'start': 27, 'end': 31}, {'entity': 'PART', 'score': 0.99943155, 'index': 10, 'word': "'t", 'start': 31, 'end': 33}, {'entity': 'VERB', 'score': 0.9996099, 'index': 11, 'word': 'Ġgot', 'start': 33, 'end': 37}, {'entity': 'DET', 'score': 0.99999225, 'index': 12, 'word': 'Ġthat', 'start': 37, 'end': 42}, {'entity': 'NOUN', 'score': 0.99994314, 'index': 13, 'word': 'Ġswing', 'start': 42, 'end': 48}]
うーん、空白が「Ġ」に化けてしまっているのは、まあ何とかするとしても、「don't」と「ain't」がイマイチだ。UD_English-EWTでは「don't」は「do」「n't」に、「ain't」は「ai」「n't」に切って、AUXとPARTを品詞付与するのが流儀なのだが、上の結果では、単語の切れ目が1文字ズレてしまっている。他にも、embeddingsガラミで多少ヤヤコシイ問題も見つけちゃったし、もう少しModernBertForTokenClassificationの正式サポートを待った方がいいのかな。