
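SB Intuitions' ModernBERT-Ja series is a family of Japanese ModernBERT models with an 8192-token input/output window, released in four sizes: 35 million, 66.8 million, 126 million, and 300 million parameters. A window of 8192 tokens is very welcome for dependency parsing: once the adjacency matrix is converted to an upper triangular matrix, a 126×126 matrix fits. However, as I wrote in my February 13 article, the tokenizer has quirks, and as it stands it is not suited to part-of-speech tagging or dependency parsing. Let's take a look.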
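As a quick check on the 126×126 figure before turning to the tokenizer, here is a minimal sketch of the arithmetic. It assumes the upper triangle includes the diagonal and ignores special tokens and any other input the models may need, none of which is spelled out here. The function names are hypothetical.

```python
WINDOW = 8192  # token window stated for ModernBERT-Ja

def upper_triangle_cells(n):
    """Number of cells in the upper triangle of an n x n matrix,
    diagonal included (an assumption about the layout)."""
    return n * (n + 1) // 2

def flat_index(i, j, n):
    """Row-major position of cell (i, j), with i <= j, inside the
    flattened upper triangle of an n x n matrix."""
    return i * n - i * (i - 1) // 2 + (j - i)

n = 126
print(upper_triangle_cells(n), upper_triangle_cells(n) <= WINDOW)
```

This only confirms that the triangle of a 126×126 matrix fits in the window. Whether 126 is the exact ceiling depends on the layout details assumed away above. The tokenizer session follows.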

>>> from transformers import AutoTokenizer
>>> tkz=AutoTokenizer.from_pretrained("sbintuitions/modernbert-ja-30m")
>>> print(tkz.convert_ids_to_tokens(tkz("国境の長いトンネルを抜けると雪国であった。","夜の底が白くなった。")["input_ids"]))
['<s>', '国境', 'の長い', 'トンネル', 'を抜け', 'ると', '雪', '国', 'であった', '。', '</s>', '<s>', '夜の', '底', 'が', '白', 'くなった', '。', '</s>']


>>> from transformers import AutoTokenizer
>>> from tokenizers import Regex
>>> from tokenizers.pre_tokenizers import Sequence,Split
>>> tkz=AutoTokenizer.from_pretrained("sbintuitions/modernbert-ja-30m")
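>>> # Tokens such as 'の長い' and 'を抜け' above straddle word boundaries, so isolate
>>> # every hiragana character first, then apply the model's original pre-tokenizer
>>> tkz.backend_tokenizer.pre_tokenizer=Sequence([Split(Regex("[ぁ-ん]"),"isolated"),tkz.backend_tokenizer.pre_tokenizer])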
>>> print(tkz.convert_ids_to_tokens(tkz("国境の長いトンネルを抜けると雪国であった。","夜の底が白くなった。")["input_ids"]))
['<s>', '国境', 'の', '長', 'い', 'トンネル', 'を', '抜', 'け', 'る', 'と', '雪', '国', 'で', 'あ', 'っ', 'た', '。', '</s>', '<s>', '夜', 'の', '底', 'が', '白', 'く', 'な', 'っ', 'た', '。', '</s>']
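To see what the added Split step does on its own, the sketch below mimics the "isolated" behaviour with Python's standard re module. It is an approximation for illustration, not the tokenizers library's implementation, and the helper name is hypothetical.

```python
import re

def isolate_hiragana(text):
    """Split text so that each hiragana character (ぁ-ん) becomes its own
    piece and every other run of characters stays together."""
    return re.findall("[ぁ-ん]|[^ぁ-ん]+", text)

pieces = isolate_hiragana("国境の長いトンネルを抜けると雪国であった。")
print(pieces)
```

The subword model then runs inside each piece, which is why 雪国 still comes out as two tokens in the session above.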

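Splitting hiragana this finely may be overkill, but for now I used this approach to refine the tokenizer and built prototype models for NINJAL long-unit-word part-of-speech tagging and dependency parsing, modernbert-japanese-{30m,70m,130m,310m}-ud-embeds. Following the article from three days ago, let's compare dependency-parsing accuracy. On Google Colaboratory it goes like this.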

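!pip install transformers triton
import os,sys,subprocess
from transformers import pipeline,AutoTokenizer
# NOTE: several lines of this script were lost from the page. Every line marked "assumed" is a reconstruction, not the original.
url="https://github.com/UniversalDependencies/UD_Japanese-GSDLUW"  # assumed: long-unit-word test corpus
f=os.path.join(os.path.basename(url),"ja_gsdluw-ud-test.conllu")  # assumed
os.system(f"test -f {f} || git clone --depth=1 {url}")
url="https://universaldependencies.org/conll18/conll18_ud_eval.py"  # assumed: evaluation script matching the report format
c=os.path.basename(url)  # assumed
os.system(f"test -f {c} || curl -LO {url}")
with open(f,"r",encoding="utf-8") as r:
  s=[t[8:].strip() for t in r if t.startswith("# text =")]
mdl="KoichiYasuoka/modernbert-japanese-{}-ud-embeds"  # assumed: matches the report headers
rst=[]  # assumed: paths of the per-run report files
for z in ["30m","70m","130m","310m"]:
  for tkz in ["original","refined"]:
    if tkz=="original":
      tokenizer=AutoTokenizer.from_pretrained("sbintuitions/modernbert-ja-"+z)  # assumed
    else:
      tokenizer=AutoTokenizer.from_pretrained(mdl.format(z))  # assumed
    nlp=pipeline("universal-dependencies",mdl.format(z),trust_remote_code=True,tokenizer=tokenizer)  # assumed
    with open("result.conllu","w",encoding="utf-8") as w:
      for t in s:
        w.write(nlp(t))  # assumed: the pipeline returns CoNLL-U text
    p=subprocess.run([sys.executable,c,"-v",f,"result.conllu"],encoding="utf-8",stdout=subprocess.PIPE,stderr=subprocess.STDOUT)  # assumed
    os.system("mkdir -p "+os.path.join("result",mdl.format(z)))
    rst.append(os.path.join("result",mdl.format(z),tkz+".txt"))  # assumed
    with open(rst[-1],"w",encoding="utf-8") as w:
      print(f"\n*** {mdl.format(z)} ({tkz} tokenizer)",p.stdout,sep="\n",file=w)
!cat {" ".join(rst)}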


*** KoichiYasuoka/modernbert-japanese-30m-ud-embeds (original tokenizer)
Metric     | Precision |    Recall |  F1 Score | AligndAcc
Tokens     |     63.29 |     41.54 |     50.16 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     63.29 |     41.54 |     50.16 |
UPOS       |     61.15 |     40.14 |     48.47 |     96.63
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     63.29 |     41.54 |     50.16 |    100.00
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     28.68 |     18.82 |     22.73 |     45.31
LAS        |     28.12 |     18.46 |     22.29 |     44.44
CLAS       |     16.62 |     13.20 |     14.71 |     31.43
MLAS       |     14.55 |     11.55 |     12.88 |     27.51
BLEX       |      0.00 |      0.00 |      0.00 |      0.00

*** KoichiYasuoka/modernbert-japanese-30m-ud-embeds (refined tokenizer)
Metric     | Precision |    Recall |  F1 Score | AligndAcc
Tokens     |     97.68 |     97.61 |     97.64 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     97.68 |     97.61 |     97.64 |
UPOS       |     95.88 |     95.82 |     95.85 |     98.16
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     97.64 |     97.57 |     97.61 |     99.96
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     90.15 |     90.09 |     90.12 |     92.30
LAS        |     89.31 |     89.25 |     89.28 |     91.43
CLAS       |     82.35 |     82.80 |     82.58 |     85.47
MLAS       |     80.03 |     80.47 |     80.25 |     83.07
BLEX       |      0.00 |      0.00 |      0.00 |      0.00

*** KoichiYasuoka/modernbert-japanese-70m-ud-embeds (original tokenizer)
Metric     | Precision |    Recall |  F1 Score | AligndAcc
Tokens     |     63.42 |     41.32 |     50.04 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     63.42 |     41.32 |     50.04 |
UPOS       |     61.66 |     40.17 |     48.65 |     97.22
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     63.42 |     41.32 |     50.04 |    100.00
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     29.29 |     19.08 |     23.11 |     46.18
LAS        |     28.67 |     18.68 |     22.62 |     45.21
CLAS       |     17.75 |     13.86 |     15.56 |     33.26
MLAS       |     15.41 |     12.04 |     13.52 |     28.89
BLEX       |      0.00 |      0.00 |      0.00 |      0.00

*** KoichiYasuoka/modernbert-japanese-70m-ud-embeds (refined tokenizer)
Metric     | Precision |    Recall |  F1 Score | AligndAcc
Tokens     |     97.78 |     97.82 |     97.80 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     97.78 |     97.82 |     97.80 |
UPOS       |     96.30 |     96.35 |     96.32 |     98.49
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     97.74 |     97.78 |     97.76 |     99.96
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     91.17 |     91.22 |     91.19 |     93.25
LAS        |     90.30 |     90.34 |     90.32 |     92.35
CLAS       |     84.88 |     84.98 |     84.93 |     87.53
MLAS       |     82.47 |     82.56 |     82.52 |     85.05
BLEX       |      0.00 |      0.00 |      0.00 |      0.00

*** KoichiYasuoka/modernbert-japanese-130m-ud-embeds (original tokenizer)
Metric     | Precision |    Recall |  F1 Score | AligndAcc
Tokens     |     63.93 |     41.13 |     50.06 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     63.93 |     41.13 |     50.06 |
UPOS       |     62.36 |     40.12 |     48.83 |     97.55
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     63.93 |     41.13 |     50.06 |    100.00
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     29.80 |     19.17 |     23.33 |     46.61
LAS        |     29.24 |     18.81 |     22.90 |     45.74
CLAS       |     18.04 |     13.88 |     15.69 |     33.76
MLAS       |     15.78 |     12.15 |     13.73 |     29.54
BLEX       |      0.00 |      0.00 |      0.00 |      0.00

*** KoichiYasuoka/modernbert-japanese-130m-ud-embeds (refined tokenizer)
Metric     | Precision |    Recall |  F1 Score | AligndAcc
Tokens     |     97.88 |     97.71 |     97.79 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     97.88 |     97.71 |     97.79 |
UPOS       |     96.61 |     96.44 |     96.53 |     98.70
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     97.85 |     97.68 |     97.76 |     99.97
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     91.11 |     90.96 |     91.04 |     93.09
LAS        |     90.42 |     90.27 |     90.34 |     92.38
CLAS       |     84.95 |     84.89 |     84.92 |     87.70
MLAS       |     83.05 |     83.00 |     83.03 |     85.75
BLEX       |      0.00 |      0.00 |      0.00 |      0.00

*** KoichiYasuoka/modernbert-japanese-310m-ud-embeds (original tokenizer)
Metric     | Precision |    Recall |  F1 Score | AligndAcc
Tokens     |     63.42 |     41.47 |     50.14 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     63.42 |     41.47 |     50.14 |
UPOS       |     61.92 |     40.49 |     48.96 |     97.64
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     63.41 |     41.46 |     50.13 |     99.98
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     29.25 |     19.12 |     23.12 |     46.11
LAS        |     28.79 |     18.82 |     22.76 |     45.40
CLAS       |     18.27 |     14.21 |     15.99 |     33.79
MLAS       |     16.13 |     12.54 |     14.11 |     29.82
BLEX       |      0.00 |      0.00 |      0.00 |      0.00

*** KoichiYasuoka/modernbert-japanese-310m-ud-embeds (refined tokenizer)
Metric     | Precision |    Recall |  F1 Score | AligndAcc
Tokens     |     98.00 |     97.97 |     97.99 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     98.00 |     97.97 |     97.99 |
UPOS       |     96.86 |     96.83 |     96.84 |     98.84
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     97.97 |     97.93 |     97.95 |     99.96
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     92.20 |     92.17 |     92.18 |     94.08
LAS        |     91.53 |     91.49 |     91.51 |     93.39
CLAS       |     86.75 |     86.71 |     86.73 |     89.22
MLAS       |     84.64 |     84.60 |     84.62 |     87.05
BLEX       |      0.00 |      0.00 |      0.00 |      0.00


UPOS/LAS/MLAS (F1 Score) before and after refining the tokenizer:

Model                              | Original tokenizer | Refined tokenizer
modernbert-japanese-30m-ud-embeds  | 48.47/22.29/12.88  | 95.85/89.28/80.25
modernbert-japanese-70m-ud-embeds  | 48.65/22.62/13.52  | 96.32/90.32/82.52
modernbert-japanese-130m-ud-embeds | 48.83/22.90/13.73  | 96.53/90.34/83.03
modernbert-japanese-310m-ud-embeds | 48.96/22.76/14.11  | 96.84/91.51/84.62
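
The summary figures can be read straight off the F1 Score column of each report. The sketch below shows one way to pull them out. The helper name is hypothetical, and the sample is an excerpt of the 30m refined-tokenizer report printed earlier.

```python
def f1_summary(report, metrics=("UPOS", "LAS", "MLAS")):
    """Return the F1 Score column of the requested metrics, joined by '/'."""
    f1 = {}
    for line in report.splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        # Data rows look like: name | precision | recall | f1 | aligned accuracy
        if len(cells) >= 4 and cells[0] != "Metric":
            f1[cells[0]] = cells[3]
    return "/".join(f1[m] for m in metrics)

sample = """\
Metric     | Precision |    Recall |  F1 Score | AligndAcc
Tokens     |     97.68 |     97.61 |     97.64 |
UPOS       |     95.88 |     95.82 |     95.85 |     98.16
LAS        |     89.31 |     89.25 |     89.28 |     91.43
MLAS       |     80.03 |     80.47 |     80.25 |     83.07
"""
print(f1_summary(sample))
```

Applied to each saved report file, this yields one cell of the summary per model and tokenizer.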


