The Importance of the Tokenizer in Japanese ModernBERT

Posted at 2025-02-13, last updated at 2025-02-13

This continues yesterday's article. modernbert-ja-130m comes with the following caveat.

Since the unigram language model is used as a tokenizer, the token boundaries often do not align with the morpheme boundaries, resulting in poor performance in token classification tasks such as named entity recognition and span extraction.

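Apparently the model is not suited to named entity recognition or part-of-speech tagging, and there seems to be no intention of following the approach in "Making Sentencepiece segmentation more MeCab-like" (『Sentencepieceの分割をMeCabっぽくする』). Even so, I would still like to find some way to use it.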

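After mulling it over in my own way, I (Koichi Yasuoka) applied the tokenizer refinement from "Dependency parsing in NINJAL long-unit words with GPT-style language models" (『GPT系言語モデルによる国語研長単位係り受け解析』) to modernbert-ja-130m as well, and built a prototype model for POS tagging and dependency parsing in NINJAL long-unit words (国語研長単位), modernbert-japanese-130m-ud-embeds. For now, let us look at the effect of the tokenizer refinement on UPOS tagging. Before the refinement, the result looks like this.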

>>> from transformers import pipeline
>>> nlp=pipeline("token-classification","KoichiYasuoka/modernbert-japanese-130m-ud-embeds",aggregation_strategy="simple")
>>> print(nlp("国境の長いトンネルを抜けると雪国であった。"))
[{'entity_group': 'NOUN', 'score': 0.9999975, 'word': '国境', 'start': 0, 'end': 2}, {'entity_group': 'ADJ', 'score': 0.99999857, 'word': 'の長い', 'start': 2, 'end': 5}, {'entity_group': 'NOUN', 'score': 0.9999852, 'word': 'トンネル', 'start': 5, 'end': 9}, {'entity_group': 'VERB', 'score': 0.99999464, 'word': 'を抜け', 'start': 9, 'end': 12}, {'entity_group': 'SCONJ.', 'score': 0.9999615, 'word': 'ると', 'start': 12, 'end': 14}, {'entity_group': 'PROPN', 'score': 0.9998966, 'word': '雪国', 'start': 14, 'end': 16}, {'entity_group': 'AUX.', 'score': 0.99999976, 'word': 'であった', 'start': 16, 'end': 20}, {'entity_group': 'PUNCT.', 'score': 1.0, 'word': '。', 'start': 20, 'end': 21}]

After the tokenizer refinement, it looks like this.

>>> from transformers import pipeline
>>> nlp=pipeline("upos","KoichiYasuoka/modernbert-japanese-130m-ud-embeds",aggregation_strategy="simple",trust_remote_code=True)
>>> print(nlp("国境の長いトンネルを抜けると雪国であった。"))
[{'start': 0, 'end': 2, 'score': 0.9999995, 'entity_group': 'NOUN', 'text': '国境'}, {'start': 2, 'end': 3, 'score': 1.0, 'entity_group': 'ADP.', 'text': 'の'}, {'start': 3, 'end': 5, 'score': 0.9999945, 'entity_group': 'ADJ', 'text': '長い'}, {'start': 5, 'end': 9, 'score': 0.9999995, 'entity_group': 'NOUN', 'text': 'トンネル'}, {'start': 9, 'end': 10, 'score': 1.0, 'entity_group': 'ADP.', 'text': 'を'}, {'start': 10, 'end': 13, 'score': 0.9999993, 'entity_group': 'VERB', 'text': '抜ける'}, {'start': 13, 'end': 14, 'score': 0.9996749, 'entity_group': 'SCONJ.', 'text': 'と'}, {'start': 14, 'end': 16, 'score': 0.99962664, 'entity_group': 'PROPN', 'text': '雪国'}, {'start': 16, 'end': 19, 'score': 0.99982184, 'entity_group': 'AUX.', 'text': 'であっ'}, {'start': 19, 'end': 20, 'score': 1.0, 'entity_group': 'AUX.', 'text': 'た'}, {'start': 20, 'end': 21, 'score': 0.9999999, 'entity_group': 'PUNCT.', 'text': '。'}]
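To make the difference between the two outputs concrete, the character spans they report can be compared directly. The following is only a minimal sketch and not part of the original experiment: the span lists are copied by hand from the two outputs above, the helper name `span_overlap` is hypothetical, and treating the refined segmentation as the reference is an assumption made purely for illustration (it is not the gold annotation).

```python
# Character spans (start, end) copied by hand from the two pipeline outputs.
original = [(0, 2), (2, 5), (5, 9), (9, 12), (12, 14), (14, 16), (16, 20), (20, 21)]
refined = [(0, 2), (2, 3), (3, 5), (5, 9), (9, 10), (10, 13), (13, 14),
           (14, 16), (16, 19), (19, 20), (20, 21)]


def span_overlap(predicted, reference):
    """Return (precision, recall, f1) over exact span matches.

    Hypothetical helper: a predicted span counts as correct only when
    both its start and end offsets coincide with a reference span.
    """
    matched = len(set(predicted) & set(reference))
    precision = matched / len(predicted)
    recall = matched / len(reference)
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f1


text = "国境の長いトンネルを抜けると雪国であった。"
p, r, f = span_overlap(original, refined)
print(f"precision={p:.2%} recall={r:.2%} f1={f:.2%}")

# Spans from the original tokenizer that do not appear in the refined output.
crossing = [text[s:e] for s, e in original if (s, e) not in set(refined)]
print(crossing)
```

Only four of the eight spans from the original tokenizer coincide with spans in the refined output. The other four (の長い, を抜け, ると, であった) each merge or split words, so no single UPOS label can be correct for them.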

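In the refined output, it is a pity that 「雪国」 is tagged PROPN, but the rest is perfect. Following the February 9 article, let us also compare dependency-parsing accuracy. On Google Colaboratory, it looks like this.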

# Install the dependencies (Colab cell)
!pip install transformers triton sentencepiece
mdl="KoichiYasuoka/modernbert-japanese-130m-ud-embeds"
org="sbintuitions/modernbert-ja-130m"
import os,sys,subprocess
from transformers import pipeline,AutoTokenizer
# Fetch the UD_Japanese-GSDLUW test set unless it is already present
url="https://github.com/UniversalDependencies/UD_Japanese-GSDLUW"
f=os.path.join(os.path.basename(url),"ja_gsdluw-ud-test.conllu")
os.system(f"test -f {f} || git clone --depth=1 {url}")
# Fetch the CoNLL 2018 evaluation script unless it is already present
url="https://universaldependencies.org/conll18/conll18_ud_eval.py"
c=os.path.basename(url)
os.system(f"test -f {c} || curl -LO {url}")
# Collect the raw sentences from the "# text =" comment lines
with open(f,"r",encoding="utf-8") as r:
  s=[t[8:].strip() for t in r if t.startswith("# text =")]
# Parse every sentence twice: with the original and with the refined tokenizer
for tkz in ["original","refined"]:
  nlp=pipeline("universal-dependencies",mdl,trust_remote_code=True)
  if tkz=="original":
    # Swap in the tokenizer of sbintuitions/modernbert-ja-130m
    nlp.tokenizer=AutoTokenizer.from_pretrained(org)
  with open("result.conllu","w",encoding="utf-8") as w:
    for t in s:
      w.write(nlp(t))
  # Score the parser output against the gold test set
  p=subprocess.run([sys.executable,c,"-v",f,"result.conllu"],
    encoding="utf-8",stdout=subprocess.PIPE,stderr=subprocess.STDOUT)
  # Save one report per tokenizer
  os.system(f"mkdir -p result/{mdl}")
  with open(f"result/{mdl}/{tkz}.txt","w",encoding="utf-8") as w:
    print(f"\n*** {mdl} ({tkz} tokenizer)",p.stdout,sep="\n",file=w)
# Show both reports
!cat result/{mdl}/*.txt

In my environment, the following results were output.

*** KoichiYasuoka/modernbert-japanese-130m-ud-embeds (original tokenizer)
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     63.93 |     41.13 |     50.06 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     63.93 |     41.13 |     50.06 |
UPOS       |     62.36 |     40.12 |     48.83 |     97.55
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     63.93 |     41.13 |     50.06 |    100.00
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     29.80 |     19.17 |     23.33 |     46.61
LAS        |     29.24 |     18.81 |     22.90 |     45.74
CLAS       |     18.04 |     13.88 |     15.69 |     33.76
MLAS       |     15.78 |     12.15 |     13.73 |     29.54
BLEX       |      0.00 |      0.00 |      0.00 |      0.00

*** KoichiYasuoka/modernbert-japanese-130m-ud-embeds (refined tokenizer)
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     97.88 |     97.71 |     97.79 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     97.88 |     97.71 |     97.79 |
UPOS       |     96.61 |     96.44 |     96.53 |     98.70
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     97.85 |     97.68 |     97.76 |     99.97
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     91.11 |     90.96 |     91.04 |     93.09
LAS        |     90.42 |     90.27 |     90.34 |     92.38
CLAS       |     84.95 |     84.89 |     84.92 |     87.70
MLAS       |     83.05 |     83.00 |     83.03 |     85.75
BLEX       |      0.00 |      0.00 |      0.00 |      0.00

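Comparing UPOS/LAS/MLAS F1 scores, the values are 48.83/22.90/13.73 before the tokenizer refinement and 96.53/90.34/83.03 after it. The tokenizer is not everything, but it is without doubt a very important factor. I do wish a little more care went into designing this part.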
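The F1 scores quoted here are the harmonic mean of the Precision and Recall columns, F1 = 2PR/(P+R). As a quick sanity check, the sketch below recomputes them from the table rows. The numbers are copied by hand from the two reports above, and the helper name `f1_score` is hypothetical. Because the printed precision and recall are already rounded to two decimals, the recomputed value may differ from the reported one in the last digit.

```python
# (precision, recall, reported F1) copied by hand from the two reports.
rows = {
    ("original", "UPOS"): (62.36, 40.12, 48.83),
    ("original", "LAS"): (29.24, 18.81, 22.90),
    ("original", "MLAS"): (15.78, 12.15, 13.73),
    ("refined", "UPOS"): (96.61, 96.44, 96.53),
    ("refined", "LAS"): (90.42, 90.27, 90.34),
    ("refined", "MLAS"): (83.05, 83.00, 83.03),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (hypothetical helper)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


for (tokenizer, metric), (p, r, reported) in rows.items():
    recomputed = f1_score(p, r)
    print(f"{tokenizer:8} {metric:4} recomputed={recomputed:.2f} reported={reported:.2f}")
```

It is also worth noting that the AligndAcc column for UPOS changes far less (97.55 versus 98.70) than the F1 score does. This is consistent with the observation that segmentation, rather than tagging of correctly segmented words, accounts for most of the gap.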
