
UPOS Tagging of English Universal Dependencies with ModernBertForTokenClassification

Building on ModernBERT-base from my article the day before yesterday, I tried UPOS tagging of UD_English-EWT with the help of run_ner.py from transformers v4.47.1. On Google Colaboratory (a GPU instance), it goes like this.

!pip install transformers datasets evaluate seqeval accelerate triton
# clone ModernBERT-base and, since transformers v4.47.1 does not ship ModernBERT yet,
# fetch its configuration/modeling code from the main branch, rewriting the relative
# imports so the files can be loaded as remote code
!test -d ModernBERT-base || git clone --depth=1 https://huggingface.co/answerdotai/ModernBERT-base
!test -f ModernBERT-base/configuration_modernbert.py || ( curl -L https://github.com/huggingface/transformers/raw/refs/heads/main/src/transformers/models/modernbert/configuration_modernbert.py | sed 's/^from \.\.\./from transformers./' > ModernBERT-base/configuration_modernbert.py )
!test -f ModernBERT-base/modeling_modernbert.py || ( curl -L https://github.com/huggingface/transformers/raw/refs/heads/main/src/transformers/models/modernbert/modeling_modernbert.py | sed -e 's/^from \.\.\./from transformers./' -e 's/^from .* import is_triton_available/is_triton_available = lambda: True/' > ModernBERT-base/modeling_modernbert.py )
import json
# register the fetched classes in auto_map so that trust_remote_code can find them
with open("ModernBERT-base/config.json","r",encoding="utf-8") as r:
  d=json.load(r)
if "auto_map" not in d:
  d["auto_map"]={
    "AutoConfig":"configuration_modernbert.ModernBertConfig",
    "AutoModel":"modeling_modernbert.ModernBertModel",
    "AutoModelForMaskedLM":"modeling_modernbert.ModernBertForMaskedLM",
    "AutoModelForSequenceClassification":"modeling_modernbert.ModernBertForSequenceClassification",
    "AutoModelForTokenClassification":"modeling_modernbert.ModernBertForTokenClassification"
  }
  with open("ModernBERT-base/config.json","w",encoding="utf-8") as w:
    json.dump(d,w,indent=2)
s='$1=="transformers"{printf("-b v%s",$2)}'
!test -d transformers || git clone `pip list | awk '{s}'` https://github.com/huggingface/transformers
!test -d UD_English-EWT || git clone --depth=1 https://github.com/UniversalDependencies/UD_English-EWT
def makejson(conllu_file,json_file):
  # convert CoNLL-U to JSON lines, one sentence per line,
  # pairing FORM (column 2) with UPOS (column 4)
  with open(json_file,"w",encoding="utf-8") as w:
    with open(conllu_file,"r",encoding="utf-8") as r:
      d={"tokens":[],"tags":[]}
      for s in r:
        if s.strip()=="":
          if d["tokens"]:
            print(json.dumps(d),file=w)
          d={"tokens":[],"tags":[]}
        else:
          t=s.split("\t")
          # isdecimal() skips multiword-token ranges (e.g. 2-3) and empty nodes (e.g. 2.1)
          if len(t)==10 and t[0].isdecimal():
            d["tokens"].append(t[1])
            d["tags"].append(t[3])
makejson("UD_English-EWT/en_ewt-ud-train.conllu","train.json")
makejson("UD_English-EWT/en_ewt-ud-dev.conllu","dev.json")
makejson("UD_English-EWT/en_ewt-ud-test.conllu","test.json")
!env WANDB_DISABLED=true python transformers/examples/pytorch/token-classification/run_ner.py --task_name pos --model_name_or_path ModernBERT-base --trust_remote_code --train_file train.json --validation_file dev.json --test_file test.json --output_dir ./ModernBERT-base-english-upos --overwrite_output_dir --do_train --do_eval --do_predict
!cp ModernBERT-base/*.py ModernBERT-base-english-upos
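
For reference, each line that makejson writes is a single sentence as a JSON object with parallel "tokens" and "tags" lists. A quick sanity check (my own addition, not part of the run above) would be:

import json
# peek at the first sentence of the generated JSON-lines file
with open("train.json","r",encoding="utf-8") as r:
  first=json.loads(r.readline())
print(first["tokens"][:5],first["tags"][:5])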

On my machine (I'm Koichi Yasuoka), it took about 20 minutes to produce the following metrics and finish building ModernBERT-base-english-upos.

***** train metrics *****
  epoch                    =        3.0
  total_flos               =  1131708GF
  train_loss               =     0.0585
  train_runtime            = 0:19:12.79
  train_samples            =      12544
  train_samples_per_second =     32.644
  train_steps_per_second   =      4.081

***** eval metrics *****
  epoch                   =        3.0
  eval_accuracy           =     0.9779
  eval_f1                 =     0.9737
  eval_loss               =     0.1094
  eval_precision          =     0.9742
  eval_recall             =     0.9732
  eval_runtime            = 0:00:14.20
  eval_samples            =       2001
  eval_samples_per_second =    140.845
  eval_steps_per_second   =     17.667

***** predict metrics *****
  predict_accuracy           =     0.9766
  predict_f1                 =     0.9727
  predict_loss               =     0.1092
  predict_precision          =     0.9711
  predict_recall             =     0.9743
  predict_runtime            = 0:00:10.60
  predict_samples_per_second =    195.791
  predict_steps_per_second   =     24.509

With F1 a bit above 0.97 for both eval and predict, it seems to be doing reasonably well. Let's take it for a spin.

from transformers import AutoTokenizer,AutoModelForTokenClassification,TokenClassificationPipeline
tkz=AutoTokenizer.from_pretrained("ModernBERT-base-english-upos")
mdl=AutoModelForTokenClassification.from_pretrained("ModernBERT-base-english-upos",trust_remote_code=True)
nlp=TokenClassificationPipeline(model=mdl,tokenizer=tkz,device=0)
print(nlp("It don't mean a thing if it ain't got that swing"))

Tagging "It don't mean a thing if it ain't got that swing" with the freshly built ModernBERT-base-english-upos, I got the following result on my machine.

[{'entity': 'PRON', 'score': 0.9999982, 'index': 1, 'word': 'It', 'start': 0, 'end': 2}, {'entity': 'AUX', 'score': 0.9999801, 'index': 2, 'word': 'Ġdon', 'start': 2, 'end': 6}, {'entity': 'PART', 'score': 0.99992585, 'index': 3, 'word': "'t", 'start': 6, 'end': 8}, {'entity': 'VERB', 'score': 0.9997824, 'index': 4, 'word': 'Ġmean', 'start': 8, 'end': 13}, {'entity': 'DET', 'score': 0.9999635, 'index': 5, 'word': 'Ġa', 'start': 13, 'end': 15}, {'entity': 'NOUN', 'score': 0.99820757, 'index': 6, 'word': 'Ġthing', 'start': 15, 'end': 21}, {'entity': 'SCONJ', 'score': 0.99993324, 'index': 7, 'word': 'Ġif', 'start': 21, 'end': 24}, {'entity': 'PRON', 'score': 0.999997, 'index': 8, 'word': 'Ġit', 'start': 24, 'end': 27}, {'entity': 'AUX', 'score': 0.99998164, 'index': 9, 'word': 'Ġain', 'start': 27, 'end': 31}, {'entity': 'PART', 'score': 0.99943155, 'index': 10, 'word': "'t", 'start': 31, 'end': 33}, {'entity': 'VERB', 'score': 0.9996099, 'index': 11, 'word': 'Ġgot', 'start': 33, 'end': 37}, {'entity': 'DET', 'score': 0.99999225, 'index': 12, 'word': 'Ġthat', 'start': 37, 'end': 42}, {'entity': 'NOUN', 'score': 0.99994314, 'index': 13, 'word': 'Ġswing', 'start': 42, 'end': 48}]

Hmm. The spaces showing up as "Ġ" is something I could work around, but "don't" and "ain't" come out wrong. The UD_English-EWT convention is to split "don't" into "do"+"n't" and "ain't" into "ai"+"n't", tagging them AUX and PART, yet in the result above the word boundary is off by one character. I have also run into some rather thorny problems around the embeddings, so perhaps I should wait a little longer for official support of ModernBertForTokenClassification.
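
Incidentally, the "Ġ" is just the byte-level BPE marker for a leading space, and the start/end offsets index into the original string, so the surface forms can be recovered from the input text itself. A minimal sketch (my own post-processing, not part of the pipeline API), which also makes the off-by-one boundary visible:

text="It don't mean a thing if it ain't got that swing"
for t in nlp(text):
  # start/end point into the original string; strip() drops the leading
  # space that "Ġ" stands for
  print(text[t["start"]:t["end"]].strip(),t["entity"],sep="\t")

This prints "don"/AUX and "'t"/PART where the UD_English-EWT convention would want "do" and "n't".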
