Building NINJAL Long-Unit-Word Dependency Parsing Models from NINJAL Short-Unit-Word ModernBERT

Posted at 2025-02-15

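The NINJAL short-unit-word (unidic-lite) Japanese ModernBERT has been released in base and large variants, so, building on these and borrowing the methods from my February 9 and February 13 articles, I tried making NINJAL long-unit-word dependency parsing models with the upper-triangular-matrix algorithm. In both cases the tokenizer reuses the BertMecabTokenizerFast I wrote in my diary entry of July 21, 2023. Let's compare dependency parsing accuracy on ja_gsdluw-ud-test.conllu. On Google Colaboratory (GPU runtime), it goes like this.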

# Install dependencies (Google Colaboratory cell)
!pip install transformers triton fugashi unidic-lite
# Models to compare: triangular and embeds variants, each in base and large
models=[
  "KoichiYasuoka/modernbert-base-japanese-unidic-ud-triangular",
  "KoichiYasuoka/modernbert-large-japanese-unidic-ud-triangular",
  "KoichiYasuoka/modernbert-base-japanese-unidic-ud-embeds",
  "KoichiYasuoka/modernbert-large-japanese-unidic-ud-embeds"
]
import os,sys,subprocess
# Fetch the UD_Japanese-GSDLUW test set unless it is already present
url="https://github.com/UniversalDependencies/UD_Japanese-GSDLUW"
f=os.path.join(os.path.basename(url),"ja_gsdluw-ud-test.conllu")
os.system(f"test -f {f} || git clone --depth=1 {url}")
# Fetch the CoNLL 2018 evaluation script unless it is already present
url="https://universaldependencies.org/conll18/conll18_ud_eval.py"
c=os.path.basename(url)
os.system(f"test -f {c} || curl -LO {url}")
# Collect the raw sentences from the "# text =" comment lines
with open(f,"r",encoding="utf-8") as r:
  s=[t[8:].strip() for t in r if t.startswith("# text =")]
for mdl in models:
  from transformers import pipeline
  nlp=pipeline("universal-dependencies",mdl,trust_remote_code=True,
    aggregation_strategy="simple",device=0)
  # Parse every sentence and write the CoNLL-U output to one file
  with open("result.conllu","w",encoding="utf-8") as w:
    for t in s:
      w.write(nlp(t))
  # Score the parser output against the gold test set
  p=subprocess.run([sys.executable,c,"-v",f,"result.conllu"],
    encoding="utf-8",stdout=subprocess.PIPE,stderr=subprocess.STDOUT)
  # Save the evaluation table under result/<model name>/
  os.system(f"mkdir -p result/{mdl}")
  with open(f"result/{mdl}/result.txt","w",encoding="utf-8") as w:
    print(f"\n*** {mdl}",p.stdout,sep="\n",file=w)
# Print all saved evaluation tables
!( cd result && cat `find {" ".join(models)} -name result.txt` )

On my (Koichi Yasuoka's) machine, the following results were output.

*** KoichiYasuoka/modernbert-base-japanese-unidic-ud-triangular
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     97.04 |     96.61 |     96.82 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     97.04 |     96.61 |     96.82 |
UPOS       |     94.14 |     93.72 |     93.93 |     97.01
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     97.01 |     96.58 |     96.79 |     99.97
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     87.26 |     86.86 |     87.06 |     89.91
LAS        |     85.85 |     85.46 |     85.66 |     88.47
CLAS       |     77.29 |     76.94 |     77.12 |     81.41
MLAS       |     73.30 |     72.96 |     73.13 |     77.20
BLEX       |      0.00 |      0.00 |      0.00 |      0.00

*** KoichiYasuoka/modernbert-large-japanese-unidic-ud-triangular
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     97.15 |     96.77 |     96.96 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     97.15 |     96.77 |     96.96 |
UPOS       |     94.24 |     93.87 |     94.06 |     97.01
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     97.11 |     96.73 |     96.92 |     99.96
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     87.20 |     86.85 |     87.02 |     89.75
LAS        |     85.78 |     85.44 |     85.61 |     88.30
CLAS       |     77.12 |     76.94 |     77.03 |     81.20
MLAS       |     72.72 |     72.55 |     72.63 |     76.56
BLEX       |      0.00 |      0.00 |      0.00 |      0.00

*** KoichiYasuoka/modernbert-base-japanese-unidic-ud-embeds
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     97.21 |     96.85 |     97.03 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     97.21 |     96.85 |     97.03 |
UPOS       |     94.01 |     93.67 |     93.84 |     96.71
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     97.17 |     96.82 |     96.99 |     99.96
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     87.69 |     87.37 |     87.53 |     90.21
LAS        |     86.16 |     85.85 |     86.00 |     88.63
CLAS       |     77.92 |     77.49 |     77.70 |     81.67
MLAS       |     73.94 |     73.53 |     73.74 |     77.50
BLEX       |      0.00 |      0.00 |      0.00 |      0.00

*** KoichiYasuoka/modernbert-large-japanese-unidic-ud-embeds
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     97.29 |     96.77 |     97.03 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     97.29 |     96.77 |     97.03 |
UPOS       |     94.51 |     94.01 |     94.26 |     97.15
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     97.23 |     96.71 |     96.97 |     99.94
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     88.01 |     87.53 |     87.77 |     90.46
LAS        |     86.73 |     86.27 |     86.50 |     89.15
CLAS       |     78.60 |     78.10 |     78.35 |     82.49
MLAS       |     74.80 |     74.32 |     74.56 |     78.50
BLEX       |      0.00 |      0.00 |      0.00 |      0.00
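
As a reading aid for the tables above: the F1 Score column is the harmonic mean of the Precision and Recall columns. The sketch below recomputes it for a few rows of the modernbert-large-japanese-unidic-ud-embeds table. The function name `f1_score` is my own for illustration, and because the inputs are already rounded to two decimals, the recomputed values should agree with the table only up to rounding.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0  # e.g. the XPOS/Lemmas rows, which are all zero
    return 2 * precision * recall / (precision + recall)

# (Precision, Recall) pairs copied from the
# modernbert-large-japanese-unidic-ud-embeds table
rows = {
    "UPOS": (94.51, 94.01),
    "UAS": (88.01, 87.53),
    "LAS": (86.73, 86.27),
    "MLAS": (74.80, 74.32),
}

for metric, (p, r) in rows.items():
    print(f"{metric:5s} F1 = {f1_score(p, r):.2f}")
```

Precision is lower when the system produces more words than it gets right, and recall is lower when gold words are missed, so the word segmentation scores in the Tokens/Words rows put a ceiling on every metric below them.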

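Judging by UPOS/LAS/MLAS, modernbert-large-japanese-unidic-ud-embeds is the best at 94.26/86.50/74.56, but it still falls short of modernbert-japanese-130m-ud-embeds (the improved-tokenizer version) that I prototyped in the article from the day before yesterday. So which factor is actually driving the accuracy here, I wonder.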
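
To make this kind of comparison less error-prone than reading four tables by eye, the saved output can be parsed and ranked. The following is a minimal sketch, assuming the text layout shown above (a `*** model` header followed by `|`-separated rows); `parse_eval` is a hypothetical helper, and the sample embeds only the UPOS/LAS/MLAS rows copied from the results above.

```python
def parse_eval(text):
    """Map each '*** model' header to {metric: F1 score} from the rows below it."""
    scores, model = {}, None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("***"):
            model = line[3:].strip()
            scores[model] = {}
        elif model and "|" in line:
            cells = [c.strip() for c in line.split("|")]
            try:
                scores[model][cells[0]] = float(cells[3])  # F1 Score column
            except (ValueError, IndexError):
                continue  # header row or malformed line
    return scores

# Rows copied from the evaluation output above (abbreviated to three metrics)
sample = """\
*** KoichiYasuoka/modernbert-base-japanese-unidic-ud-triangular
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
UPOS       |     94.14 |     93.72 |     93.93 |     97.01
LAS        |     85.85 |     85.46 |     85.66 |     88.47
MLAS       |     73.30 |     72.96 |     73.13 |     77.20
*** KoichiYasuoka/modernbert-large-japanese-unidic-ud-triangular
UPOS       |     94.24 |     93.87 |     94.06 |     97.01
LAS        |     85.78 |     85.44 |     85.61 |     88.30
MLAS       |     72.72 |     72.55 |     72.63 |     76.56
*** KoichiYasuoka/modernbert-base-japanese-unidic-ud-embeds
UPOS       |     94.01 |     93.67 |     93.84 |     96.71
LAS        |     86.16 |     85.85 |     86.00 |     88.63
MLAS       |     73.94 |     73.53 |     73.74 |     77.50
*** KoichiYasuoka/modernbert-large-japanese-unidic-ud-embeds
UPOS       |     94.51 |     94.01 |     94.26 |     97.15
LAS        |     86.73 |     86.27 |     86.50 |     89.15
MLAS       |     74.80 |     74.32 |     74.56 |     78.50
"""

scores = parse_eval(sample)
# Rank models by LAS F1, best first
ranking = sorted(scores, key=lambda m: scores[m]["LAS"], reverse=True)
for m in ranking:
    s = scores[m]
    print(f'{s["UPOS"]:.2f}/{s["LAS"]:.2f}/{s["MLAS"]:.2f}  {m}')
```

In the Colab session above, the same function could be fed the concatenated contents of the result.txt files instead of the embedded sample.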
