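I built three prototype Aozora Bunko + Japanese Wikipedia ModernBERT models (small, base, large) that use a single-character tokenizer. Let's look at the parameter count of each model. On Google Colaboratory, it goes like this.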
!pip install accelerate
!accelerate estimate-memory KoichiYasuoka/modernbert-small-japanese-char --library_name transformers
!accelerate estimate-memory KoichiYasuoka/modernbert-base-japanese-char --library_name transformers
!accelerate estimate-memory KoichiYasuoka/modernbert-large-japanese-char --library_name transformers
│ Memory Usage for loading `KoichiYasuoka/modernbert-small-japanese-char` │
│ dtype │Largest Layer│Total Size│ Training using Adam │
│float32│ 22.9 MB │ 97.99 MB │ 391.95 MB │
│float16│ 11.45 MB │ 48.99 MB │ 195.97 MB │
│ int8 │ 5.72 MB │ 24.5 MB │ 97.99 MB │
│ int4 │ 2.86 MB │ 12.25 MB │ 48.99 MB │
│ Memory Usage for loading `KoichiYasuoka/modernbert-base-japanese-char` │
│ dtype │Largest Layer│Total Size│ Training using Adam │
│float32│ 68.51 MB │560.07 MB │ 2.19 GB │
│float16│ 34.25 MB │280.03 MB │ 1.09 GB │
│ int8 │ 17.13 MB │140.02 MB │ 560.07 MB │
│ int4 │ 8.56 MB │ 70.01 MB │ 280.03 MB │
│ Memory Usage for loading `KoichiYasuoka/modernbert-large-japanese-char` │
│ dtype │Largest Layer│Total Size│ Training using Adam │
│float32│ 91.32 MB │ 1.46 GB │ 5.84 GB │
│float16│ 45.66 MB │747.89 MB │ 2.92 GB │
│ int8 │ 22.83 MB │373.94 MB │ 1.46 GB │
│ int4 │ 11.41 MB │186.97 MB │ 747.89 MB │
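As a rough cross-check of the tables above, a printed size can be turned back into a parameter count by dividing by the bytes per parameter. The sketch below does this for the int8 Total Size values. The helper name `param_count` is hypothetical, and whether "MB" means 10^6 or 2^20 bytes is an assumption, so both readings are printed. The last line compares the large model's float32 size (4 bytes per parameter) with the printed 1.46 GB under each reading.

```python
MB_DECIMAL = 10 ** 6   # assumption: "MB" read as 1,000,000 bytes
MB_BINARY = 2 ** 20    # assumption: "MB" read as 1,048,576 bytes

# int8 "Total Size" values copied from the tables above (1 byte per parameter)
int8_total_mb = {"small": 24.5, "base": 140.02, "large": 373.94}

def param_count(size_mb, bytes_per_param=1, unit=MB_DECIMAL):
    """Hypothetical helper: convert a printed size in MB into a parameter count."""
    return size_mb * unit / bytes_per_param

for name, mb in int8_total_mb.items():
    dec = param_count(mb, unit=MB_DECIMAL) / 1e6
    bin_ = param_count(mb, unit=MB_BINARY) / 1e6
    print(f"{name}: {dec:.1f}M (decimal MB) or {bin_:.1f}M (binary MB)")

# Unit check on the large model: 4 x 373.94 MB expressed in GB,
# first with 1 GB = 1024 MB, then with 1 GB = 1000 MB.
print(round(4 * 373.94 / 1024, 2), round(4 * 373.94 / 1000, 2))
```

Only the 1024-based conversion reproduces the printed 1.46 GB, which suggests the sizes are in binary units. Under that reading the parameter counts come out about 5% larger than the plain MB figures.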
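From smallest to largest, they come to roughly 24.5 million, 140 million, and 374 million parameters (these figures match the int8 Total Size column, at one byte per parameter). For each of them I also fine-tuned a NINJAL long-unit-word POS-tagging and dependency-parsing model, so let's compare their accuracy using the method from my February 9 article. On Google Colaboratory (GPU runtime), it goes like this.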
!pip install transformers triton
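import os,sys,subprocess
# clone the repository that provides the CoNLL-U file f, and download the file c, unless they already exist
os.system(f"test -f {f} || git clone --depth=1 {url}")
os.system(f"test -f {c} || curl -LO {url}")
# collect the raw sentences from the "# text =" comment lines ("# text =" is 8 characters long)
with open(f,"r",encoding="utf-8") as r:
  s=[t[8:].strip() for t in r if t.startswith("# text =")]
for mdl in models:
  from transformers import pipeline
  # the output for each sentence in s goes to result.conllu
  with open("result.conllu","w",encoding="utf-8") as w:
    for t in s:
  # store the captured output p.stdout under result/<model name>/result.txt
  os.system(f"mkdir -p result/{mdl}")
  with open(f"result/{mdl}/result.txt","w",encoding="utf-8") as w:
    print(f"\n*** {mdl}",p.stdout,sep="\n",file=w)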
!( cd result && cat `find {" ".join(models)} -name result.txt` )
*** KoichiYasuoka/modernbert-small-japanese-char-ud-embeds
Metric | Precision | Recall | F1 Score | AligndAcc
Tokens | 97.55 | 97.82 | 97.69 |
Sentences | 100.00 | 100.00 | 100.00 |
Words | 97.55 | 97.82 | 97.69 |
UPOS | 94.78 | 95.04 | 94.91 | 97.16
XPOS | 0.00 | 0.00 | 0.00 | 0.00
UFeats | 97.51 | 97.78 | 97.65 | 99.96
AllTags | 0.00 | 0.00 | 0.00 | 0.00
Lemmas | 0.00 | 0.00 | 0.00 | 0.00
UAS | 89.31 | 89.56 | 89.43 | 91.55
LAS | 87.95 | 88.20 | 88.07 | 90.16
CLAS | 80.82 | 80.89 | 80.86 | 83.72
MLAS | 76.72 | 76.78 | 76.75 | 79.47
BLEX | 0.00 | 0.00 | 0.00 | 0.00
*** KoichiYasuoka/modernbert-base-japanese-char-ud-embeds
Metric | Precision | Recall | F1 Score | AligndAcc
Tokens | 97.68 | 97.88 | 97.78 |
Sentences | 100.00 | 100.00 | 100.00 |
Words | 97.68 | 97.88 | 97.78 |
UPOS | 95.13 | 95.32 | 95.22 | 97.38
XPOS | 0.00 | 0.00 | 0.00 | 0.00
UFeats | 97.67 | 97.87 | 97.77 | 99.99
AllTags | 0.00 | 0.00 | 0.00 | 0.00
Lemmas | 0.00 | 0.00 | 0.00 | 0.00
UAS | 89.78 | 89.96 | 89.87 | 91.91
LAS | 88.64 | 88.82 | 88.73 | 90.74
CLAS | 82.15 | 81.86 | 82.00 | 84.63
MLAS | 78.38 | 78.10 | 78.24 | 80.74
BLEX | 0.00 | 0.00 | 0.00 | 0.00
*** KoichiYasuoka/modernbert-large-japanese-char-ud-embeds
Metric | Precision | Recall | F1 Score | AligndAcc
Tokens | 97.81 | 98.13 | 97.97 |
Sentences | 100.00 | 100.00 | 100.00 |
Words | 97.81 | 98.13 | 97.97 |
UPOS | 95.14 | 95.45 | 95.30 | 97.27
XPOS | 0.00 | 0.00 | 0.00 | 0.00
UFeats | 97.76 | 98.08 | 97.92 | 99.95
AllTags | 0.00 | 0.00 | 0.00 | 0.00
Lemmas | 0.00 | 0.00 | 0.00 | 0.00
UAS | 90.15 | 90.44 | 90.29 | 92.16
LAS | 88.89 | 89.18 | 89.04 | 90.88
CLAS | 82.02 | 82.08 | 82.05 | 84.70
MLAS | 77.96 | 78.01 | 77.99 | 80.51
BLEX | 0.00 | 0.00 | 0.00 | 0.00
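To make the comparison across the three sizes easier to read, the F1 columns above can be put side by side. The following is a minimal sketch, not part of the evaluation script: the scores are copied by hand from the tables, and the helper names `gains_over` and `best_model` are hypothetical.

```python
# F1 scores copied from the three result tables above
f1 = {
    "small": {"UPOS": 94.91, "UAS": 89.43, "LAS": 88.07, "CLAS": 80.86, "MLAS": 76.75},
    "base":  {"UPOS": 95.22, "UAS": 89.87, "LAS": 88.73, "CLAS": 82.00, "MLAS": 78.24},
    "large": {"UPOS": 95.30, "UAS": 90.29, "LAS": 89.04, "CLAS": 82.05, "MLAS": 77.99},
}

def gains_over(baseline, scores):
    """Hypothetical helper: per-metric F1 difference of each other model from `baseline`."""
    ref = scores[baseline]
    return {
        name: {metric: round(value - ref[metric], 2) for metric, value in vals.items()}
        for name, vals in scores.items()
        if name != baseline
    }

def best_model(metric, scores):
    """Hypothetical helper: name of the model with the highest F1 for `metric`."""
    return max(scores, key=lambda name: scores[name][metric])

# F1 gain of base and large over small, metric by metric
for name, diff in gains_over("small", f1).items():
    print(name, diff)

# which model scores highest on each metric
for metric in f1["small"]:
    print(metric, "best:", best_model(metric, f1))
```

On these numbers, large has the highest F1 for UPOS, UAS, LAS, and CLAS, while base is slightly ahead on MLAS (78.24 against 77.99). The step from base to large is smaller than the step from small to base on every metric listed except UAS.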