Release of Japanese ModernBERT with a single-character tokenizer

I have built three prototype Japanese ModernBERT models (small, base, large) with a single-character tokenizer, pre-trained on Aozora Bunko plus Japanese Wikipedia. Let's look at the number of parameters in each model. On Google Colaboratory, it goes something like this:

!pip install accelerate
!accelerate estimate-memory KoichiYasuoka/modernbert-small-japanese-char --library_name transformers
!accelerate estimate-memory KoichiYasuoka/modernbert-base-japanese-char --library_name transformers
!accelerate estimate-memory KoichiYasuoka/modernbert-large-japanese-char --library_name transformers

On my (Koichi Yasuoka's) machine, the following results were output:

┌──────────────────────────────────────────────────────────────────────────────────────┐
│       Memory Usage for loading `KoichiYasuoka/modernbert-small-japanese-char`        │
├───────┬─────────────┬──────────┬─────────────────────────────────────────────────────┤
│ dtype │Largest Layer│Total Size│                 Training using Adam                 │
├───────┼─────────────┼──────────┼─────────────────────────────────────────────────────┤
│float32│   22.9 MB   │ 97.99 MB │                      391.95 MB                      │
│float16│   11.45 MB  │ 48.99 MB │                      195.97 MB                      │
│  int8 │   5.72 MB   │ 24.5 MB  │                       97.99 MB                      │
│  int4 │   2.86 MB   │ 12.25 MB │                       48.99 MB                      │
└───────┴─────────────┴──────────┴─────────────────────────────────────────────────────┘

┌────────────────────────────────────────────────────────────────────────────────────┐
│       Memory Usage for loading `KoichiYasuoka/modernbert-base-japanese-char`       │
├───────┬─────────────┬──────────┬───────────────────────────────────────────────────┤
│ dtype │Largest Layer│Total Size│                Training using Adam                │
├───────┼─────────────┼──────────┼───────────────────────────────────────────────────┤
│float32│   68.51 MB  │560.07 MB │                      2.19 GB                      │
│float16│   34.25 MB  │280.03 MB │                      1.09 GB                      │
│  int8 │   17.13 MB  │140.02 MB │                     560.07 MB                     │
│  int4 │   8.56 MB   │ 70.01 MB │                     280.03 MB                     │
└───────┴─────────────┴──────────┴───────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────────────────────────┐
│       Memory Usage for loading `KoichiYasuoka/modernbert-large-japanese-char`        │
├───────┬─────────────┬──────────┬─────────────────────────────────────────────────────┤
│ dtype │Largest Layer│Total Size│                 Training using Adam                 │
├───────┼─────────────┼──────────┼─────────────────────────────────────────────────────┤
│float32│   91.32 MB  │ 1.46 GB  │                       5.84 GB                       │
│float16│   45.66 MB  │747.89 MB │                       2.92 GB                       │
│  int8 │   22.83 MB  │373.94 MB │                       1.46 GB                       │
│  int4 │   11.41 MB  │186.97 MB │                      747.89 MB                      │
└───────┴─────────────┴──────────┴─────────────────────────────────────────────────────┘
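
For the exact parameter counts behind these estimates, something along the following lines should do; a minimal sketch, assuming each checkpoint loads with AutoModelForMaskedLM:

# Sketch: count parameters directly (assumes AutoModelForMaskedLM can load each checkpoint)
from transformers import AutoModelForMaskedLM
for mdl in [
  "KoichiYasuoka/modernbert-small-japanese-char",
  "KoichiYasuoka/modernbert-base-japanese-char",
  "KoichiYasuoka/modernbert-large-japanese-char"
]:
  model=AutoModelForMaskedLM.from_pretrained(mdl)
  print(mdl,sum(p.numel() for p in model.parameters()))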

From smallest to largest, they have 24.5 million, 140 million, and 374 million parameters. I have also fine-tuned each of them into a POS-tagging and dependency-parsing model for NINJAL Long Unit Words, so let's compare their accuracy using the method from my February 9 article. On Google Colaboratory (GPU), it goes something like this:

!pip install transformers triton
models=[
  "KoichiYasuoka/modernbert-small-japanese-char-ud-embeds",
  "KoichiYasuoka/modernbert-base-japanese-char-ud-embeds",
  "KoichiYasuoka/modernbert-large-japanese-char-ud-embeds"
]
import os,sys,subprocess
# Fetch the UD_Japanese-GSDLUW test set (long-unit-word treebank)
url="https://github.com/UniversalDependencies/UD_Japanese-GSDLUW"
f=os.path.join(os.path.basename(url),"ja_gsdluw-ud-test.conllu")
os.system(f"test -f {f} || git clone --depth=1 {url}")
# Fetch the CoNLL 2018 shared-task evaluation script
url="https://universaldependencies.org/conll18/conll18_ud_eval.py"
c=os.path.basename(url)
os.system(f"test -f {c} || curl -LO {url}")
# Collect the raw sentences from the test set
with open(f,"r",encoding="utf-8") as r:
  s=[t[8:].strip() for t in r if t.startswith("# text =")]
for mdl in models:
  from transformers import pipeline
  nlp=pipeline("universal-dependencies",mdl,trust_remote_code=True,
    aggregation_strategy="simple",device=0)
  # Parse every sentence and write the CoNLL-U output
  with open("result.conllu","w",encoding="utf-8") as w:
    for t in s:
      w.write(nlp(t))
  # Score the parses against the gold test set
  p=subprocess.run([sys.executable,c,"-v",f,"result.conllu"],
    encoding="utf-8",stdout=subprocess.PIPE,stderr=subprocess.STDOUT)
  os.system(f"mkdir -p result/{mdl}")
  with open(f"result/{mdl}/result.txt","w",encoding="utf-8") as w:
    print(f"\n*** {mdl}",p.stdout,sep="\n",file=w)
!( cd result && cat `find {" ".join(models)} -name result.txt` )

On my machine, the following results were output:

*** KoichiYasuoka/modernbert-small-japanese-char-ud-embeds
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     97.55 |     97.82 |     97.69 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     97.55 |     97.82 |     97.69 |
UPOS       |     94.78 |     95.04 |     94.91 |     97.16
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     97.51 |     97.78 |     97.65 |     99.96
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     89.31 |     89.56 |     89.43 |     91.55
LAS        |     87.95 |     88.20 |     88.07 |     90.16
CLAS       |     80.82 |     80.89 |     80.86 |     83.72
MLAS       |     76.72 |     76.78 |     76.75 |     79.47
BLEX       |      0.00 |      0.00 |      0.00 |      0.00

*** KoichiYasuoka/modernbert-base-japanese-char-ud-embeds
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     97.68 |     97.88 |     97.78 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     97.68 |     97.88 |     97.78 |
UPOS       |     95.13 |     95.32 |     95.22 |     97.38
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     97.67 |     97.87 |     97.77 |     99.99
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     89.78 |     89.96 |     89.87 |     91.91
LAS        |     88.64 |     88.82 |     88.73 |     90.74
CLAS       |     82.15 |     81.86 |     82.00 |     84.63
MLAS       |     78.38 |     78.10 |     78.24 |     80.74
BLEX       |      0.00 |      0.00 |      0.00 |      0.00

*** KoichiYasuoka/modernbert-large-japanese-char-ud-embeds
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     97.81 |     98.13 |     97.97 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     97.81 |     98.13 |     97.97 |
UPOS       |     95.14 |     95.45 |     95.30 |     97.27
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     97.76 |     98.08 |     97.92 |     99.95
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     90.15 |     90.44 |     90.29 |     92.16
LAS        |     88.89 |     89.18 |     89.04 |     90.88
CLAS       |     82.02 |     82.08 |     82.05 |     84.70
MLAS       |     77.96 |     78.01 |     77.99 |     80.51
BLEX       |      0.00 |      0.00 |      0.00 |      0.00

Judging by UPOS/LAS/MLAS, small scores 94.91/88.07/76.75, base scores 95.22/88.73/78.24, and large scores 95.30/89.04/77.99. Each model properly handles an input/output width of 8192 characters, so please give them a try.
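
As a quick way to try one of them, here is a minimal usage sketch (the example sentence is an arbitrary one of my choosing; the pipeline call mirrors the evaluation script above, and the tokenizer check assumes the plain char checkpoint loads with AutoTokenizer):

from transformers import AutoTokenizer,pipeline
# Check the single-character tokenization (assumption: one token per character)
tkz=AutoTokenizer.from_pretrained("KoichiYasuoka/modernbert-base-japanese-char")
print(tkz.tokenize("全学年にわたって小学校の国語の教科書に挿し絵が用いられている"))
# Parse the same sentence; the custom "universal-dependencies" pipeline returns CoNLL-U text
nlp=pipeline("universal-dependencies","KoichiYasuoka/modernbert-base-japanese-char-ud-embeds",
  trust_remote_code=True,aggregation_strategy="simple")
print(nlp("全学年にわたって小学校の国語の教科書に挿し絵が用いられている"))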
