Released llm-jp-modernbert-base-ud-embeds, a model for NINJAL long-unit-word POS tagging and dependency parsing

Posted at 2025-04-29

llm-jp-modernbert-base has been released, so as a first step I tried out its tokenizer.

>>> from transformers import AutoTokenizer
>>> tkz=AutoTokenizer.from_pretrained("llm-jp/llm-jp-modernbert-base")
>>> print(tkz.convert_ids_to_tokens(tkz("国境の長いトンネルを抜けると雪国であった。","夜の底が白くなった。")["input_ids"]))
['<CLS|LLM-jp>', '▁国境', 'の', '長い', 'トンネル', 'を', '抜ける', 'と', '雪', '国', 'であった。', '<SEP|LLM-jp>', '▁夜', 'の', '底', 'が', '白', 'くなった', '。', '<SEP|LLM-jp>']

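The metaspace at the start of each sentence gets in the way. Worse, 「であった。」 becomes a single token with the full stop still attached, which makes the tokenizer quite hard to use. So, following the February 25 article, I decided to improve the tokenizer.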

>>> from transformers import AutoTokenizer
>>> from tokenizers import Regex
>>> from tokenizers.pre_tokenizers import Sequence,Split,Whitespace,Punctuation
>>> tkz=AutoTokenizer.from_pretrained("llm-jp/llm-jp-modernbert-base")
>>> tkz.backend_tokenizer.normalizer=None
>>> tkz.backend_tokenizer.pre_tokenizer=Sequence([Split(Regex("[ぁ-ん]"),"isolated"),Whitespace(),Punctuation()])
>>> print(tkz.convert_ids_to_tokens(tkz("国境の長いトンネルを抜けると雪国であった。","夜の底が白くなった。")["input_ids"]))
['<CLS|LLM-jp>', '国境', 'の', '長', 'い', 'トンネル', 'を', '抜', 'け', 'る', 'と', '雪', '国', 'で', 'あ', 'っ', 'た', '。', '<SEP|LLM-jp>', '夜', 'の', '底', 'が', '白', 'く', 'な', 'っ', 'た', '。', '<SEP|LLM-jp>']
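To see what the hiragana rule alone contributes, here is a minimal sketch in plain Python. It only approximates the "isolated" split on `[ぁ-ん]` with the standard `re` module; it is not the tokenizers library's implementation, and it does not emulate the `Whitespace()` and `Punctuation()` steps or the model's vocabulary lookup. The helper name `isolate_hiragana` is hypothetical.

```python
import re

# Same character class as in the Split(...) rule above: one hiragana character.
# The capturing group makes re.split keep each match as its own piece.
HIRAGANA = re.compile(r"([ぁ-ん])")

def isolate_hiragana(text):
    """Cut text so that every hiragana character stands alone (hypothetical helper)."""
    # re.split yields empty strings between adjacent matches; drop them.
    return [piece for piece in HIRAGANA.split(text) if piece]

pieces = isolate_hiragana("国境の長いトンネルを抜けると雪国であった。")
print(pieces)
```

In this sketch 「雪国」 stays in one piece, because neither character is hiragana. In the real tokenizer output above it appears as 「雪」 and 「国」, which presumably comes from the vocabulary lookup that runs after pre-tokenization rather than from the split rule.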

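Splitting hiragana this finely may be overdoing it, but for now I improved the tokenizer this way and built a prototype of llm-jp-modernbert-base-ud-embeds, a model for NINJAL (国語研) long-unit-word POS tagging and dependency parsing. Following the February 28 article, let's compare dependency-parsing accuracy. On Google Colaboratory, it goes like this.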

!pip install transformers
mdl="KoichiYasuoka/llm-jp-modernbert-base-ud-embeds"
org="llm-jp/llm-jp-modernbert-base"
import os,sys,subprocess
from transformers import pipeline,AutoTokenizer
# Fetch the UD_Japanese-GSDLUW test set unless it is already present
url="https://github.com/UniversalDependencies/UD_Japanese-GSDLUW"
f=os.path.join(os.path.basename(url),"ja_gsdluw-ud-test.conllu")
os.system(f"test -f {f} || git clone --depth=1 {url}")
# Fetch the CoNLL 2018 evaluation script unless it is already present
url="https://universaldependencies.org/conll18/conll18_ud_eval.py"
c=os.path.basename(url)
os.system(f"test -f {c} || curl -LO {url}")
# Collect the raw sentences from the "# text =" comment lines
with open(f,"r",encoding="utf-8") as r:
  s=[t[8:].strip() for t in r if t.startswith("# text =")]
# Parse every sentence twice: with the original and with the refined tokenizer
for tkz in ["original","refined"]:
  nlp=pipeline("universal-dependencies",mdl,trust_remote_code=True)
  if tkz=="original":
    # Swap in the unmodified llm-jp tokenizer, with its normalizer disabled
    nlp.tokenizer=AutoTokenizer.from_pretrained(org)
    nlp.tokenizer.backend_tokenizer.normalizer=None
  with open("result.conllu","w",encoding="utf-8") as w:
    for t in s:
      w.write(nlp(t))
  # Score result.conllu against the gold test file
  p=subprocess.run([sys.executable,c,"-v",f,"result.conllu"],
    encoding="utf-8",stdout=subprocess.PIPE,stderr=subprocess.STDOUT)
  os.system(f"mkdir -p result/{mdl}")
  with open(f"result/{mdl}/{tkz}.txt","w",encoding="utf-8") as w:
    print(f"\n*** {mdl} ({tkz} tokenizer)",p.stdout,sep="\n",file=w)
!cat result/{mdl}/*.txt

On my machine (I am Koichi Yasuoka), the following results were output.

*** KoichiYasuoka/llm-jp-modernbert-base-ud-embeds (original tokenizer)
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     73.38 |     56.64 |     63.93 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     73.38 |     56.64 |     63.93 |
UPOS       |     71.66 |     55.30 |     62.43 |     97.65
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     73.38 |     56.64 |     63.93 |    100.00
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     50.92 |     39.30 |     44.36 |     69.39
LAS        |     50.16 |     38.71 |     43.70 |     68.35
CLAS       |     40.93 |     36.44 |     38.55 |     52.37
MLAS       |     34.05 |     30.31 |     32.07 |     43.56
BLEX       |      0.00 |      0.00 |      0.00 |      0.00

*** KoichiYasuoka/llm-jp-modernbert-base-ud-embeds (refined tokenizer)
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     98.26 |     98.48 |     98.37 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     98.26 |     98.48 |     98.37 |
UPOS       |     96.79 |     97.01 |     96.90 |     98.51
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     98.24 |     98.46 |     98.35 |     99.98
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     92.25 |     92.45 |     92.35 |     93.88
LAS        |     91.34 |     91.54 |     91.44 |     92.96
CLAS       |     85.87 |     86.08 |     85.97 |     88.25
MLAS       |     83.61 |     83.81 |     83.71 |     85.93
BLEX       |      0.00 |      0.00 |      0.00 |      0.00

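Comparing UPOS/LAS/MLAS, the F1 scores are 62.43/43.70/32.07 before the tokenizer improvement and 96.90/91.44/83.71 after it. This exceeds the results in the February 13 article, so as Japanese ModernBERT models go, this is a rather good one. Now, if only the tokenizer were a little easier to use.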
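As a quick sanity check on the tables, the F1 column can be recomputed from the Precision and Recall columns as their harmonic mean, and the gain from the refined tokenizer can be read off per metric. This is a rough sketch that works from the rounded percentages printed above, so the last digit could in principle differ from what the evaluation script derives from raw counts. The names `f1_score` and `gains` are my own.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (Precision, Recall) of the Tokens row, copied from the two tables above
tokens = {"original": (73.38, 56.64), "refined": (98.26, 98.48)}
for name, (p, r) in tokens.items():
    print(f"Tokens F1 ({name} tokenizer): {f1_score(p, r):.2f}")

# F1 scores quoted in the text: before and after the tokenizer improvement
before = {"UPOS": 62.43, "LAS": 43.70, "MLAS": 32.07}
after = {"UPOS": 96.90, "LAS": 91.44, "MLAS": 83.71}
gains = {metric: after[metric] - before[metric] for metric in before}
for metric, gain in gains.items():
    print(f"{metric}: +{gain:.2f} points")
```

The recomputed Tokens F1 values agree with the tables, which suggests that most of the gap between the two runs comes from tokenization itself: with the original tokenizer only about 57% of the gold tokens are recovered, and every downstream metric is capped by that.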
