Released: modernbert-japanese-{30m,70m,130m,310m}-ud-embeds, models for NINJAL long-unit-word POS tagging and dependency parsing

Posted at 2025-02-28

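SB Intuitions' ModernBERT-Ja series is a family of Japanese ModernBERT models with an input/output width of 8192 tokens, released in four sizes: 35 million, 66.8 million, 126 million, and 300 million parameters. With 8192 tokens available, the adjacency matrix for dependency parsing fits at 126×126 once it is converted to an upper triangular matrix, which is very welcome. However, as I wrote in my February 13 article, the tokenizer has quirks, and as it stands it is not suited to POS tagging or dependency parsing. Let's take a look.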
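
As a brief aside on the 126×126 figure, here is a minimal sketch of the arithmetic. It assumes the upper triangle is stored with its diagonal, one cell per token position; the exact layout the models use is not spelled out here, and the function name is made up for illustration.

```python
def upper_triangle_cells(n: int) -> int:
    """Cells on or above the diagonal of an n x n matrix (assumed layout)."""
    return n * (n + 1) // 2

MAX_TOKENS = 8192  # input/output width stated for ModernBERT-Ja

n = 126
full = n * n                       # 15876 cells: too many for 8192 positions
upper = upper_triangle_cells(n)    # 8001 cells: fits within 8192 positions
print(full, upper, upper <= MAX_TOKENS)  # 15876 8001 True
```

Under that assumption the full matrix would overflow the context, while its upper triangle leaves some room to spare. Now, back to the tokenizer.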

>>> from transformers import AutoTokenizer
>>> tkz=AutoTokenizer.from_pretrained("sbintuitions/modernbert-ja-30m")
>>> print(tkz.convert_ids_to_tokens(tkz("国境の長いトンネルを抜けると雪国であった。","夜の底が白くなった。")["input_ids"]))
['<s>', '国境', 'の長い', 'トンネル', 'を抜け', 'ると', '雪', '国', 'であった', '。', '</s>', '<s>', '夜の', '底', 'が', '白', 'くなった', '。', '</s>']

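The tokens 「の長い」, 「を抜け」, 「ると」, and 「くなった」 completely ignore word breaks (morpheme boundaries), which makes them quite hard to use. I had no choice but to refine the tokenizer, by isolating each hiragana character ahead of the original pre-tokenizer.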

>>> from transformers import AutoTokenizer
>>> from tokenizers import Regex
>>> from tokenizers.pre_tokenizers import Sequence,Split
>>> tkz=AutoTokenizer.from_pretrained("sbintuitions/modernbert-ja-30m")
>>> tkz.backend_tokenizer.pre_tokenizer=Sequence([Split(Regex("[ぁ-ん]"),"isolated"),tkz.backend_tokenizer.pre_tokenizer])
>>> print(tkz.convert_ids_to_tokens(tkz("国境の長いトンネルを抜けると雪国であった。","夜の底が白くなった。")["input_ids"]))
['<s>', '国境', 'の', '長', 'い', 'トンネル', 'を', '抜', 'け', 'る', 'と', '雪', '国', 'で', 'あ', 'っ', 'た', '。', '</s>', '<s>', '夜', 'の', '底', 'が', '白', 'く', 'な', 'っ', 'た', '。', '</s>']
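
To make the effect of the added Split step easier to see, here is a rough pure-Python imitation using only the standard `re` module. It is a sketch of the pre-split alone, not of the tokenizer: the function name is hypothetical, and the real pipeline still runs the original pre-tokenizer and the subword model on each piece afterwards (which is why 雪国 ends up as 雪 and 国 above).

```python
import re

# Same character class as in the Split(Regex("[ぁ-ん]"), "isolated") call above.
# The capturing group makes re.split keep each matched hiragana as its own piece.
HIRAGANA = re.compile(r"([ぁ-ん])")

def isolate_hiragana(text: str) -> list[str]:
    """Split text so that every hiragana character stands alone."""
    return [piece for piece in HIRAGANA.split(text) if piece]

print(isolate_hiragana("国境の長いトンネルを抜けると"))
# ['国境', 'の', '長', 'い', 'トンネル', 'を', '抜', 'け', 'る', 'と']
```

Katakana and kanji runs stay intact because they fall outside the `[ぁ-ん]` range; only hiragana is cut into single characters.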

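The hiragana may be split a little too finely, but for now I refined the tokenizer this way and built prototype models for NINJAL long-unit-word POS tagging and dependency parsing, modernbert-japanese-{30m,70m,130m,310m}-ud-embeds. Following my article from three days ago, let's compare dependency-parsing accuracy. On Google Colaboratory, it looks like this.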

!pip install transformers triton
# UD models (refined tokenizer) and the original ModernBERT-Ja models; {} is the size
mdl="KoichiYasuoka/modernbert-japanese-{}-ud-embeds"
org="sbintuitions/modernbert-ja-{}"
import os,sys,subprocess
from transformers import pipeline,AutoTokenizer
# Fetch the UD_Japanese-GSDLUW test set unless it is already present
url="https://github.com/UniversalDependencies/UD_Japanese-GSDLUW"
f=os.path.join(os.path.basename(url),"ja_gsdluw-ud-test.conllu")
os.system(f"test -f {f} || git clone --depth=1 {url}")
# Fetch the CoNLL 2018 evaluation script unless it is already present
url="https://universaldependencies.org/conll18/conll18_ud_eval.py"
c=os.path.basename(url)
os.system(f"test -f {c} || curl -LO {url}")
# Collect the raw sentences from the "# text =" comment lines
with open(f,"r",encoding="utf-8") as r:
  s=[t[8:].strip() for t in r if t.startswith("# text =")]
rst=[]
for z in ["30m","70m","130m","310m"]:
  for tkz in ["original","refined"]:
    nlp=pipeline("universal-dependencies",mdl.format(z),trust_remote_code=True)
    # "original": swap in the unmodified ModernBERT-Ja tokenizer
    if tkz=="original":
      nlp.tokenizer=AutoTokenizer.from_pretrained(org.format(z))
    # Parse every test sentence and write the output as CoNLL-U
    with open("result.conllu","w",encoding="utf-8") as w:
      for t in s:
        w.write(nlp(t))
    # Score the output against the gold test set
    p=subprocess.run([sys.executable,c,"-v",f,"result.conllu"],
      encoding="utf-8",stdout=subprocess.PIPE,stderr=subprocess.STDOUT)
    # Save one report per model and tokenizer
    os.system("mkdir -p "+os.path.join("result",mdl.format(z)))
    rst.append(os.path.join("result",mdl.format(z),tkz+".txt"))
    with open(rst[-1],"w",encoding="utf-8") as w:
      print(f"\n*** {mdl.format(z)} ({tkz} tokenizer)",p.stdout,sep="\n",file=w)
!cat {" ".join(rst)}

On my (Koichi Yasuoka's) machine, the following results were output.

*** KoichiYasuoka/modernbert-japanese-30m-ud-embeds (original tokenizer)
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     63.29 |     41.54 |     50.16 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     63.29 |     41.54 |     50.16 |
UPOS       |     61.15 |     40.14 |     48.47 |     96.63
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     63.29 |     41.54 |     50.16 |    100.00
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     28.68 |     18.82 |     22.73 |     45.31
LAS        |     28.12 |     18.46 |     22.29 |     44.44
CLAS       |     16.62 |     13.20 |     14.71 |     31.43
MLAS       |     14.55 |     11.55 |     12.88 |     27.51
BLEX       |      0.00 |      0.00 |      0.00 |      0.00

*** KoichiYasuoka/modernbert-japanese-30m-ud-embeds (refined tokenizer)
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     97.68 |     97.61 |     97.64 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     97.68 |     97.61 |     97.64 |
UPOS       |     95.88 |     95.82 |     95.85 |     98.16
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     97.64 |     97.57 |     97.61 |     99.96
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     90.15 |     90.09 |     90.12 |     92.30
LAS        |     89.31 |     89.25 |     89.28 |     91.43
CLAS       |     82.35 |     82.80 |     82.58 |     85.47
MLAS       |     80.03 |     80.47 |     80.25 |     83.07
BLEX       |      0.00 |      0.00 |      0.00 |      0.00

*** KoichiYasuoka/modernbert-japanese-70m-ud-embeds (original tokenizer)
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     63.42 |     41.32 |     50.04 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     63.42 |     41.32 |     50.04 |
UPOS       |     61.66 |     40.17 |     48.65 |     97.22
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     63.42 |     41.32 |     50.04 |    100.00
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     29.29 |     19.08 |     23.11 |     46.18
LAS        |     28.67 |     18.68 |     22.62 |     45.21
CLAS       |     17.75 |     13.86 |     15.56 |     33.26
MLAS       |     15.41 |     12.04 |     13.52 |     28.89
BLEX       |      0.00 |      0.00 |      0.00 |      0.00

*** KoichiYasuoka/modernbert-japanese-70m-ud-embeds (refined tokenizer)
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     97.78 |     97.82 |     97.80 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     97.78 |     97.82 |     97.80 |
UPOS       |     96.30 |     96.35 |     96.32 |     98.49
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     97.74 |     97.78 |     97.76 |     99.96
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     91.17 |     91.22 |     91.19 |     93.25
LAS        |     90.30 |     90.34 |     90.32 |     92.35
CLAS       |     84.88 |     84.98 |     84.93 |     87.53
MLAS       |     82.47 |     82.56 |     82.52 |     85.05
BLEX       |      0.00 |      0.00 |      0.00 |      0.00

*** KoichiYasuoka/modernbert-japanese-130m-ud-embeds (original tokenizer)
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     63.93 |     41.13 |     50.06 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     63.93 |     41.13 |     50.06 |
UPOS       |     62.36 |     40.12 |     48.83 |     97.55
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     63.93 |     41.13 |     50.06 |    100.00
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     29.80 |     19.17 |     23.33 |     46.61
LAS        |     29.24 |     18.81 |     22.90 |     45.74
CLAS       |     18.04 |     13.88 |     15.69 |     33.76
MLAS       |     15.78 |     12.15 |     13.73 |     29.54
BLEX       |      0.00 |      0.00 |      0.00 |      0.00

*** KoichiYasuoka/modernbert-japanese-130m-ud-embeds (refined tokenizer)
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     97.88 |     97.71 |     97.79 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     97.88 |     97.71 |     97.79 |
UPOS       |     96.61 |     96.44 |     96.53 |     98.70
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     97.85 |     97.68 |     97.76 |     99.97
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     91.11 |     90.96 |     91.04 |     93.09
LAS        |     90.42 |     90.27 |     90.34 |     92.38
CLAS       |     84.95 |     84.89 |     84.92 |     87.70
MLAS       |     83.05 |     83.00 |     83.03 |     85.75
BLEX       |      0.00 |      0.00 |      0.00 |      0.00

*** KoichiYasuoka/modernbert-japanese-310m-ud-embeds (original tokenizer)
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     63.42 |     41.47 |     50.14 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     63.42 |     41.47 |     50.14 |
UPOS       |     61.92 |     40.49 |     48.96 |     97.64
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     63.41 |     41.46 |     50.13 |     99.98
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     29.25 |     19.12 |     23.12 |     46.11
LAS        |     28.79 |     18.82 |     22.76 |     45.40
CLAS       |     18.27 |     14.21 |     15.99 |     33.79
MLAS       |     16.13 |     12.54 |     14.11 |     29.82
BLEX       |      0.00 |      0.00 |      0.00 |      0.00

*** KoichiYasuoka/modernbert-japanese-310m-ud-embeds (refined tokenizer)
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     98.00 |     97.97 |     97.99 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     98.00 |     97.97 |     97.99 |
UPOS       |     96.86 |     96.83 |     96.84 |     98.84
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     97.97 |     97.93 |     97.95 |     99.96
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     92.20 |     92.17 |     92.18 |     94.08
LAS        |     91.53 |     91.49 |     91.51 |     93.39
CLAS       |     86.75 |     86.71 |     86.73 |     89.22
MLAS       |     84.64 |     84.60 |     84.62 |     87.05
BLEX       |      0.00 |      0.00 |      0.00 |      0.00
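
As a reading aid for the reports above, the following sketch recomputes F1 from the printed precision and recall of the Tokens row for the 30m model with the original tokenizer. It works from the rounded percentages, whereas the evaluation script works from raw counts, so the last digit could differ in other rows. Because precision is correct/predicted and recall is correct/gold, their ratio also gives predicted/gold.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Tokens row, 30m model, original tokenizer (values copied from the report)
p, r = 63.29, 41.54

print(round(f1(p, r), 2))  # 50.16, matching the F1 Score column
print(round(r / p, 3))     # 0.656: predicted tokens per gold token
```

Read this way, the original tokenizer yields only about two-thirds as many tokens as the gold standard has words, which is consistent with tokens that run across word boundaries.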

Let's put the UPOS/LAS/MLAS F1 scores into a table.

	Original tokenizer	Refined tokenizer
modernbert-japanese-30m-ud-embeds	48.47/22.29/12.88	95.85/89.28/80.25
modernbert-japanese-70m-ud-embeds	48.65/22.62/13.52	96.32/90.32/82.52
modernbert-japanese-130m-ud-embeds	48.83/22.90/13.73	96.53/90.34/83.03
modernbert-japanese-310m-ud-embeds	48.96/22.76/14.11	96.84/91.51/84.62
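
The cells of this table can be pulled out of the evaluation reports mechanically. The sketch below is one way to do it, assuming the pipe-separated layout shown in the reports above; the function name is hypothetical, and the sample text holds three rows copied from the 30m refined-tokenizer report.

```python
def f1_scores(report: str) -> dict[str, float]:
    """Map each metric name to its F1 Score (third numeric column)."""
    scores = {}
    for line in report.splitlines():
        cells = [c.strip() for c in line.split("|")]
        # Skip the title, header, and ruler lines, which lack a numeric F1 cell
        if len(cells) < 4 or cells[0] in ("", "Metric"):
            continue
        try:
            scores[cells[0]] = float(cells[3])
        except ValueError:
            continue
    return scores

sample = """\
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
UPOS       |     95.88 |     95.82 |     95.85 |     98.16
LAS        |     89.31 |     89.25 |     89.28 |     91.43
MLAS       |     80.03 |     80.47 |     80.25 |     83.07
"""

scores = f1_scores(sample)
cell = "/".join(f"{scores[m]:.2f}" for m in ("UPOS", "LAS", "MLAS"))
print(cell)  # 95.85/89.28/80.25
```

Applied to each file under the result directory, the same function would yield one cell per model and tokenizer.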

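Judging from these results, refining the tokenizer contributes to the accuracy of POS tagging and dependency parsing: LAS rises from about 22 to 23 with the original tokenizer to about 89 to 92 with the refined one. Model scale (number of parameters), on the other hand, contributes slightly but has less effect than I expected, with only about two LAS points separating the 30m and 310m models. This applies only to POS tagging and dependency parsing, but I personally would be happy if tokenizers were designed with a bit more care.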
