Release of Japanese ModernBERT with a single-character tokenizer

I have built three prototype Japanese ModernBERT models (small, base, large) with a single-character tokenizer, pre-trained on Aozora Bunko plus Japanese Wikipedia. Let's look at the number of parameters in each model. On Google Colaboratory, it goes something like this:

!pip install accelerate
!accelerate estimate-memory KoichiYasuoka/modernbert-small-japanese-char --library_name transformers
!accelerate estimate-memory KoichiYasuoka/modernbert-base-japanese-char --library_name transformers
!accelerate estimate-memory KoichiYasuoka/modernbert-large-japanese-char --library_name transformers

On my (Koichi Yasuoka's) machine, the following results were output:

┌──────────────────────────────────────────────────────────────────────────────────────┐
│       Memory Usage for loading `KoichiYasuoka/modernbert-small-japanese-char`        │
├───────┬─────────────┬──────────┬─────────────────────────────────────────────────────┤
│ dtype │Largest Layer│Total Size│                 Training using Adam                 │
├───────┼─────────────┼──────────┼─────────────────────────────────────────────────────┤
│float32│   22.9 MB   │ 97.99 MB │                      391.95 MB                      │
│float16│   11.45 MB  │ 48.99 MB │                      195.97 MB                      │
│  int8 │   5.72 MB   │ 24.5 MB  │                       97.99 MB                      │
│  int4 │   2.86 MB   │ 12.25 MB │                       48.99 MB                      │
└───────┴─────────────┴──────────┴─────────────────────────────────────────────────────┘

┌────────────────────────────────────────────────────────────────────────────────────┐
│       Memory Usage for loading `KoichiYasuoka/modernbert-base-japanese-char`       │
├───────┬─────────────┬──────────┬───────────────────────────────────────────────────┤
│ dtype │Largest Layer│Total Size│                Training using Adam                │
├───────┼─────────────┼──────────┼───────────────────────────────────────────────────┤
│float32│   68.51 MB  │560.07 MB │                      2.19 GB                      │
│float16│   34.25 MB  │280.03 MB │                      1.09 GB                      │
│  int8 │   17.13 MB  │140.02 MB │                     560.07 MB                     │
│  int4 │   8.56 MB   │ 70.01 MB │                     280.03 MB                     │
└───────┴─────────────┴──────────┴───────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────────────────────────┐
│       Memory Usage for loading `KoichiYasuoka/modernbert-large-japanese-char`        │
├───────┬─────────────┬──────────┬─────────────────────────────────────────────────────┤
│ dtype │Largest Layer│Total Size│                 Training using Adam                 │
├───────┼─────────────┼──────────┼─────────────────────────────────────────────────────┤
│float32│   91.32 MB  │ 1.46 GB  │                       5.84 GB                       │
│float16│   45.66 MB  │747.89 MB │                       2.92 GB                       │
│  int8 │   22.83 MB  │373.94 MB │                       1.46 GB                       │
│  int4 │   11.41 MB  │186.97 MB │                      747.89 MB                      │
└───────┴─────────────┴──────────┴─────────────────────────────────────────────────────┘
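
For the exact parameter counts behind these estimates, something along the following lines should do; a minimal sketch, assuming each checkpoint loads with AutoModelForMaskedLM:

# Sketch: count parameters directly (assumes AutoModelForMaskedLM can load each checkpoint)
from transformers import AutoModelForMaskedLM
for mdl in [
  "KoichiYasuoka/modernbert-small-japanese-char",
  "KoichiYasuoka/modernbert-base-japanese-char",
  "KoichiYasuoka/modernbert-large-japanese-char"
]:
  model=AutoModelForMaskedLM.from_pretrained(mdl)
  print(mdl,sum(p.numel() for p in model.parameters()))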

From smallest to largest, they have 24.5 million, 140 million, and 374 million parameters. I have also fine-tuned each of them into a POS-tagging and dependency-parsing model for NINJAL Long Unit Words, so let's compare their accuracy using the method from my February 9 article. On Google Colaboratory (GPU), it goes something like this:

!pip install transformers triton
models=[
  "KoichiYasuoka/modernbert-small-japanese-char-ud-embeds",
  "KoichiYasuoka/modernbert-base-japanese-char-ud-embeds",
  "KoichiYasuoka/modernbert-large-japanese-char-ud-embeds"
]
import os,sys,subprocess
# Fetch the UD_Japanese-GSDLUW test set (long-unit-word treebank)
url="https://github.com/UniversalDependencies/UD_Japanese-GSDLUW"
f=os.path.join(os.path.basename(url),"ja_gsdluw-ud-test.conllu")
os.system(f"test -f {f} || git clone --depth=1 {url}")
# Fetch the CoNLL 2018 shared-task evaluation script
url="https://universaldependencies.org/conll18/conll18_ud_eval.py"
c=os.path.basename(url)
os.system(f"test -f {c} || curl -LO {url}")
# Collect the raw sentences from the test set
with open(f,"r",encoding="utf-8") as r:
  s=[t[8:].strip() for t in r if t.startswith("# text =")]
for mdl in models:
  from transformers import pipeline
  nlp=pipeline("universal-dependencies",mdl,trust_remote_code=True,
    aggregation_strategy="simple",device=0)
  # Parse every sentence and write the CoNLL-U output
  with open("result.conllu","w",encoding="utf-8") as w:
    for t in s:
      w.write(nlp(t))
  # Score the parses against the gold test set
  p=subprocess.run([sys.executable,c,"-v",f,"result.conllu"],
    encoding="utf-8",stdout=subprocess.PIPE,stderr=subprocess.STDOUT)
  os.system(f"mkdir -p result/{mdl}")
  with open(f"result/{mdl}/result.txt","w",encoding="utf-8") as w:
    print(f"\n*** {mdl}",p.stdout,sep="\n",file=w)
!( cd result && cat `find {" ".join(models)} -name result.txt` )

On my machine, the following results were output:

*** KoichiYasuoka/modernbert-small-japanese-char-ud-embeds
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     97.55 |     97.82 |     97.69 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     97.55 |     97.82 |     97.69 |
UPOS       |     94.78 |     95.04 |     94.91 |     97.16
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     97.51 |     97.78 |     97.65 |     99.96
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     89.31 |     89.56 |     89.43 |     91.55
LAS        |     87.95 |     88.20 |     88.07 |     90.16
CLAS       |     80.82 |     80.89 |     80.86 |     83.72
MLAS       |     76.72 |     76.78 |     76.75 |     79.47
BLEX       |      0.00 |      0.00 |      0.00 |      0.00

*** KoichiYasuoka/modernbert-base-japanese-char-ud-embeds
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     97.68 |     97.88 |     97.78 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     97.68 |     97.88 |     97.78 |
UPOS       |     95.13 |     95.32 |     95.22 |     97.38
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     97.67 |     97.87 |     97.77 |     99.99
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     89.78 |     89.96 |     89.87 |     91.91
LAS        |     88.64 |     88.82 |     88.73 |     90.74
CLAS       |     82.15 |     81.86 |     82.00 |     84.63
MLAS       |     78.38 |     78.10 |     78.24 |     80.74
BLEX       |      0.00 |      0.00 |      0.00 |      0.00

*** KoichiYasuoka/modernbert-large-japanese-char-ud-embeds
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     97.81 |     98.13 |     97.97 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     97.81 |     98.13 |     97.97 |
UPOS       |     95.14 |     95.45 |     95.30 |     97.27
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     97.76 |     98.08 |     97.92 |     99.95
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     90.15 |     90.44 |     90.29 |     92.16
LAS        |     88.89 |     89.18 |     89.04 |     90.88
CLAS       |     82.02 |     82.08 |     82.05 |     84.70
MLAS       |     77.96 |     78.01 |     77.99 |     80.51
BLEX       |      0.00 |      0.00 |      0.00 |      0.00

Judging by UPOS/LAS/MLAS, small scores 94.91/88.07/76.75, base scores 95.22/88.73/78.24, and large scores 95.30/89.04/77.99. Each model properly handles an input/output width of 8192 characters, so please give them a try.
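
As a quick way to try one of them, here is a minimal usage sketch (the example sentence is an arbitrary one of my choosing; the pipeline call mirrors the evaluation script above, and the tokenizer check assumes the plain char checkpoint loads with AutoTokenizer):

from transformers import AutoTokenizer,pipeline
# Check the single-character tokenization (assumption: one token per character)
tkz=AutoTokenizer.from_pretrained("KoichiYasuoka/modernbert-base-japanese-char")
print(tkz.tokenize("全学年にわたって小学校の国語の教科書に挿し絵が用いられている"))
# Parse the same sentence; the custom "universal-dependencies" pipeline returns CoNLL-U text
nlp=pipeline("universal-dependencies","KoichiYasuoka/modernbert-base-japanese-char-ud-embeds",
  trust_remote_code=True,aggregation_strategy="simple")
print(nlp("全学年にわたって小学校の国語の教科書に挿し絵が用いられている"))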
