Releasing kanripo Classical Chinese (Kanbun) ModernBERT Models

Posted at 2025-05-15

I have built three prototype kanripo Classical Chinese (Kanbun) ModernBERT models: small, base, and large. Let's look at the parameter count of each model. On Google Colaboratory, it goes like this.

!pip install accelerate
!accelerate estimate-memory KoichiYasuoka/modernbert-small-classical-chinese --library_name transformers
!accelerate estimate-memory KoichiYasuoka/modernbert-base-classical-chinese --library_name transformers
!accelerate estimate-memory KoichiYasuoka/modernbert-large-classical-chinese --library_name transformers

On my (Koichi Yasuoka's) machine, I got the following output.

┌──────────────────────────────────────────────────────────────────────────────────────────────┐
│         Memory Usage for loading `KoichiYasuoka/modernbert-small-classical-chinese`          │
├───────┬─────────────┬──────────┬─────────────────────────────────────────────────────────────┤
│ dtype │Largest Layer│Total Size│                     Training using Adam                     │
├───────┼─────────────┼──────────┼─────────────────────────────────────────────────────────────┤
│float32│   24.49 MB  │ 76.52 MB │                          306.09 MB                          │
│float16│   12.25 MB  │ 38.26 MB │                          153.05 MB                          │
│  int8 │   6.12 MB   │ 19.13 MB │                             N/A                             │
│  int4 │   3.06 MB   │ 9.57 MB  │                             N/A                             │
└───────┴─────────────┴──────────┴─────────────────────────────────────────────────────────────┘

┌────────────────────────────────────────────────────────────────────────────────────────────┐
│         Memory Usage for loading `KoichiYasuoka/modernbert-base-classical-chinese`         │
├───────┬─────────────┬──────────┬───────────────────────────────────────────────────────────┤
│ dtype │Largest Layer│Total Size│                    Training using Adam                    │
├───────┼─────────────┼──────────┼───────────────────────────────────────────────────────────┤
│float32│   73.47 MB  │494.36 MB │                          1.93 GB                          │
│float16│   36.74 MB  │247.18 MB │                         988.71 MB                         │
│  int8 │   18.37 MB  │123.59 MB │                            N/A                            │
│  int4 │   9.18 MB   │ 61.79 MB │                            N/A                            │
└───────┴─────────────┴──────────┴───────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────────────────────────────────┐
│         Memory Usage for loading `KoichiYasuoka/modernbert-large-classical-chinese`          │
├───────┬─────────────┬──────────┬─────────────────────────────────────────────────────────────┤
│ dtype │Largest Layer│Total Size│                     Training using Adam                     │
├───────┼─────────────┼──────────┼─────────────────────────────────────────────────────────────┤
│float32│   97.96 MB  │ 1.37 GB  │                            5.5 GB                           │
│float16│   48.98 MB  │703.59 MB │                           2.75 GB                           │
│  int8 │   24.49 MB  │ 351.8 MB │                             N/A                             │
│  int4 │   12.25 MB  │ 175.9 MB │                             N/A                             │
└───────┴─────────────┴──────────┴─────────────────────────────────────────────────────────────┘
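
Parameter counts can be read off these tables by dividing a size by the number of bytes per parameter. The following is a minimal sketch of that arithmetic, not the author's procedure: it assumes the float16 "Total Size" row stores exactly 2 bytes per parameter, and it computes the result under both readings of "MB" (10^6 bytes and 2^20 bytes), because the tables do not say which is meant. The names `FLOAT16_TOTAL_MB` and `estimate_params` are made up for this illustration.

```python
# float16 "Total Size" values copied from the three tables above, as printed.
FLOAT16_TOTAL_MB = {"small": 38.26, "base": 247.18, "large": 703.59}

# Assumption: a float16 parameter occupies exactly 2 bytes.
BYTES_PER_PARAM = 2


def estimate_params(size_mb: float, bytes_per_mb: int) -> float:
    """Estimate the parameter count from a float16 size given in 'MB'."""
    return size_mb * bytes_per_mb / BYTES_PER_PARAM


estimates = {}
for name, size_mb in FLOAT16_TOTAL_MB.items():
    decimal = estimate_params(size_mb, 10**6)  # reading MB as 10^6 bytes
    binary = estimate_params(size_mb, 2**20)   # reading MB as 2^20 bytes
    estimates[name] = (decimal, binary)
    print(f"{name:5s} decimal: {decimal / 1e6:7.2f}M  binary: {binary / 1e6:7.2f}M")
```

The decimal reading gives about 19.13M, 123.59M, and 351.80M parameters. The binary reading comes out roughly 5% higher. In the large table, 2 × 703.59 MB is printed as 1.37 GB, which matches division by 1024 rather than 1000, so the binary reading may be the more accurate one. That is an inference from the printed numbers, not something the tool output states.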

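From smallest to largest, the models have about 19.13 million, 124 million, and 352 million parameters. I also fine-tuned POS-tagging and dependency-parsing models, so let's compare their accuracy using the method from my article of August 26, 2024. On Google Colaboratory (GPU runtime), it goes like this.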

!pip install transformers triton
# Fine-tuned UD (POS-tagging and dependency-parsing) models to compare
models=[
  "KoichiYasuoka/modernbert-small-classical-chinese-ud-embeds",
  "KoichiYasuoka/modernbert-base-classical-chinese-ud-embeds",
  "KoichiYasuoka/modernbert-large-classical-chinese-ud-embeds"
]
import os,sys,subprocess
# Fetch the UD_Classical_Chinese-Kyoto treebank and copy its train/dev/test splits
url="https://github.com/UniversalDependencies/UD_Classical_Chinese-Kyoto"
d=os.path.basename(url)
!test -d {d} || git clone --depth=1 {url}
!for F in train dev test ; do cp {d}/*-$$F.conllu $$F.conllu ; done
# Fetch the CoNLL 2018 shared task evaluation script
url="https://universaldependencies.org/conll18/conll18_ud_eval.py"
c=os.path.basename(url)
!test -f {c} || curl -LO {url}
# Collect the raw sentences from the "# text =" comment lines of the test split
with open("test.conllu","r",encoding="utf-8") as r:
  s=[t[8:].strip() for t in r if t.startswith("# text =")]
for mdl in models:
  from transformers import pipeline
  # Parse every test sentence with this model and write the CoNLL-U output
  nlp=pipeline("universal-dependencies",mdl,trust_remote_code=True)
  with open("result.conllu","w",encoding="utf-8") as w:
    for t in s:
      w.write(nlp(t))
  # Score the output against the gold test set and save the report per model
  p=subprocess.run([sys.executable,c,"-v","test.conllu","result.conllu"],
    encoding="utf-8",stdout=subprocess.PIPE,stderr=subprocess.STDOUT)
  os.system(f"mkdir -p result/{mdl}")
  with open(f"result/{mdl}/result.txt","w",encoding="utf-8") as w:
    print(f"\n*** {mdl}",p.stdout,sep="\n",file=w)
# Print the saved reports in model order
!( cd result && cat `find {" ".join(models)} -name result.txt` )

On my machine, I got the following output.

*** KoichiYasuoka/modernbert-small-classical-chinese-ud-embeds
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     98.08 |     98.50 |     98.29 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     98.08 |     98.50 |     98.29 |
UPOS       |     90.90 |     91.29 |     91.10 |     92.68
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     92.30 |     92.69 |     92.49 |     94.10
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |     96.38 |     96.79 |     96.59 |     98.27
UAS        |     82.41 |     82.76 |     82.58 |     84.02
LAS        |     76.99 |     77.32 |     77.15 |     78.50
CLAS       |     76.44 |     76.51 |     76.47 |     77.81
MLAS       |     73.68 |     73.75 |     73.72 |     75.01
BLEX       |     75.51 |     75.58 |     75.55 |     76.86

*** KoichiYasuoka/modernbert-base-classical-chinese-ud-embeds
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     97.98 |     98.44 |     98.21 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     97.98 |     98.44 |     98.21 |
UPOS       |     90.54 |     90.96 |     90.75 |     92.40
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     91.75 |     92.17 |     91.96 |     93.64
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |     96.27 |     96.71 |     96.49 |     98.25
UAS        |     80.74 |     81.11 |     80.93 |     82.40
LAS        |     75.76 |     76.11 |     75.93 |     77.32
CLAS       |     75.12 |     75.21 |     75.17 |     76.54
MLAS       |     72.38 |     72.46 |     72.42 |     73.74
BLEX       |     74.15 |     74.23 |     74.19 |     75.55

*** KoichiYasuoka/modernbert-large-classical-chinese-ud-embeds
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     98.04 |     98.47 |     98.26 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     98.04 |     98.47 |     98.26 |
UPOS       |     90.14 |     90.54 |     90.34 |     91.94
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     91.59 |     92.00 |     91.79 |     93.42
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |     96.33 |     96.75 |     96.54 |     98.25
UAS        |     79.77 |     80.12 |     79.95 |     81.37
LAS        |     74.86 |     75.19 |     75.02 |     76.36
CLAS       |     74.27 |     74.53 |     74.40 |     75.81
MLAS       |     71.33 |     71.57 |     71.45 |     72.80
BLEX       |     73.36 |     73.61 |     73.48 |     74.88
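
To put the three reports side by side, the F1 column can be pulled out with a few lines of Python. This is only a sketch: `parse_f1` is a helper invented for this illustration, it assumes the `|`-separated layout shown above, and the sample reports are trimmed to the LAS/MLAS/BLEX rows copied from the output.

```python
def parse_f1(report: str) -> dict[str, float]:
    """Extract the F1 Score column from a report in the layout shown above."""
    scores = {}
    for line in report.splitlines():
        cells = [c.strip() for c in line.split("|")]
        # Data rows have five cells: metric, precision, recall, F1, aligned accuracy.
        if len(cells) != 5 or cells[0] in ("", "Metric"):
            continue
        try:
            scores[cells[0]] = float(cells[3])
        except ValueError:
            continue  # skip anything that is not a numeric row
    return scores


HEADER = (
    "Metric     | Precision |    Recall |  F1 Score | AligndAcc\n"
    "-----------+-----------+-----------+-----------+-----------\n"
)

# Rows copied from the three reports above (LAS/MLAS/BLEX only).
reports = {
    "small": HEADER
    + "LAS        |     76.99 |     77.32 |     77.15 |     78.50\n"
    + "MLAS       |     73.68 |     73.75 |     73.72 |     75.01\n"
    + "BLEX       |     75.51 |     75.58 |     75.55 |     76.86\n",
    "base": HEADER
    + "LAS        |     75.76 |     76.11 |     75.93 |     77.32\n"
    + "MLAS       |     72.38 |     72.46 |     72.42 |     73.74\n"
    + "BLEX       |     74.15 |     74.23 |     74.19 |     75.55\n",
    "large": HEADER
    + "LAS        |     74.86 |     75.19 |     75.02 |     76.36\n"
    + "MLAS       |     71.33 |     71.57 |     71.45 |     72.80\n"
    + "BLEX       |     73.36 |     73.61 |     73.48 |     74.88\n",
}

f1 = {name: parse_f1(text) for name, text in reports.items()}
for metric in ("LAS", "MLAS", "BLEX"):
    row = "  ".join(f"{name}={f1[name][metric]:.2f}" for name in f1)
    print(f"{metric:5s} {row}")
```

In the notebook, the same helper could be applied to the contents of each saved `result.txt` instead of the inline strings.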

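In terms of LAS/MLAS/BLEX, small scores 77.15/73.72/75.55, base scores 75.93/72.42/74.19, and large scores 75.02/71.45/73.48. Every model does properly support an input/output width of 8192 characters, but the fact that accuracy drops as the model gets larger means there is not enough data for the model size. Now, what should I do about that?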
