Released sarashina2.2-{0.5b,1b}-ud-embeds: models for NINJAL long-unit-word POS tagging and dependency parsing

Posted at 2025-03-09

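SB Intuitions' Sarashina2.2 series is a family of Japanese LLaMA models with an input/output width of 8192 tokens, released in 0.5-billion, 1-billion, and 3-billion-parameter sizes. However, the tokenizers of SB Intuitions' language models are a persistent weak point, and as they stand they are not suited to POS tagging or dependency parsing. So, using the method from my February 28 article, I refined the tokenizer so that hiragana is split more finely, and built prototype NINJAL long-unit-word POS-tagging and dependency-parsing models, sarashina2.2-{0.5b,1b}-ud-embeds. Let's compare their dependency-parsing accuracy with and without the refined tokenizer. On Google Colaboratory, it goes like this.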

!pip install transformers sentencepiece
# mdl: models with the refined tokenizer; org: repositories providing the original tokenizer
mdl="KoichiYasuoka/sarashina2.2-{}-ud-embeds"
org="sbintuitions/sarashina2.2-{}"
import os,sys,subprocess
from transformers import pipeline,AutoTokenizer
# Fetch the UD_Japanese-GSDLUW test set unless it is already present
url="https://github.com/UniversalDependencies/UD_Japanese-GSDLUW"
f=os.path.join(os.path.basename(url),"ja_gsdluw-ud-test.conllu")
os.system(f"test -f {f} || git clone --depth=1 {url}")
# Fetch the CoNLL 2018 evaluation script unless it is already present
url="https://universaldependencies.org/conll18/conll18_ud_eval.py"
c=os.path.basename(url)
os.system(f"test -f {c} || curl -LO {url}")
# Collect the raw sentences from the "# text =" comment lines
with open(f,"r",encoding="utf-8") as r:
  s=[t[8:].strip() for t in r if t.startswith("# text =")]
rst=[]
for z in ["0.5b","1b"]:
  for tkz in ["original","refined"]:
    nlp=pipeline("universal-dependencies",mdl.format(z),trust_remote_code=True)
    # For the "original" run, swap in the unmodified Sarashina2.2 tokenizer
    if tkz=="original":
      nlp.tokenizer=AutoTokenizer.from_pretrained(org.format(z))
    # Parse every sentence and write the output to result.conllu
    with open("result.conllu","w",encoding="utf-8") as w:
      for t in s:
        w.write(nlp(t))
    # Score the output against the gold test set
    p=subprocess.run([sys.executable,c,"-v",f,"result.conllu"],
      encoding="utf-8",stdout=subprocess.PIPE,stderr=subprocess.STDOUT)
    # Save each report under result/<model>/<tokenizer>.txt
    os.system("mkdir -p "+os.path.join("result",mdl.format(z)))
    rst.append(os.path.join("result",mdl.format(z),tkz+".txt"))
    with open(rst[-1],"w",encoding="utf-8") as w:
      print(f"\n*** {mdl.format(z)} ({tkz} tokenizer)",p.stdout,sep="\n",file=w)
# Show all four reports
!cat {" ".join(rst)}

On my (Koichi Yasuoka's) machine, the following results were output.

*** KoichiYasuoka/sarashina2.2-0.5b-ud-embeds (original tokenizer)
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     62.55 |     38.61 |     47.75 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     62.55 |     38.61 |     47.75 |
UPOS       |     59.63 |     36.80 |     45.52 |     95.33
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     62.55 |     38.61 |     47.75 |    100.00
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     26.58 |     16.41 |     20.29 |     42.50
LAS        |     26.04 |     16.07 |     19.88 |     41.63
CLAS       |     12.93 |      9.69 |     11.07 |     27.01
MLAS       |     11.14 |      8.35 |      9.54 |     23.27
BLEX       |      0.00 |      0.00 |      0.00 |      0.00

*** KoichiYasuoka/sarashina2.2-0.5b-ud-embeds (refined tokenizer)
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     94.20 |     94.86 |     94.53 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     94.20 |     94.86 |     94.53 |
UPOS       |     91.49 |     92.13 |     91.81 |     97.12
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     94.14 |     94.80 |     94.47 |     99.94
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     79.86 |     80.42 |     80.14 |     84.78
LAS        |     78.54 |     79.09 |     78.82 |     83.38
CLAS       |     67.06 |     68.13 |     67.59 |     72.89
MLAS       |     62.80 |     63.80 |     63.30 |     68.26
BLEX       |      0.00 |      0.00 |      0.00 |      0.00

*** KoichiYasuoka/sarashina2.2-1b-ud-embeds (original tokenizer)
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     62.27 |     38.48 |     47.57 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     62.27 |     38.48 |     47.57 |
UPOS       |     59.95 |     37.04 |     45.79 |     96.26
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     62.26 |     38.47 |     47.56 |     99.98
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     25.96 |     16.04 |     19.83 |     41.69
LAS        |     25.45 |     15.73 |     19.44 |     40.87
CLAS       |     14.80 |     11.03 |     12.64 |     30.72
MLAS       |     12.79 |      9.53 |     10.92 |     26.56
BLEX       |      0.00 |      0.00 |      0.00 |      0.00

*** KoichiYasuoka/sarashina2.2-1b-ud-embeds (refined tokenizer)
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     94.36 |     94.79 |     94.58 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     94.36 |     94.79 |     94.58 |
UPOS       |     92.45 |     92.87 |     92.66 |     97.98
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     94.33 |     94.76 |     94.55 |     99.97
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     79.34 |     79.71 |     79.53 |     84.09
LAS        |     78.13 |     78.49 |     78.31 |     82.80
CLAS       |     72.05 |     72.88 |     72.46 |     78.18
MLAS       |     68.58 |     69.36 |     68.97 |     74.41
BLEX       |      0.00 |      0.00 |      0.00 |      0.00
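As a reading aid for these reports, here is a minimal sketch that recomputes one F1 score from its precision and recall, and estimates how many words each tokenizer yields relative to the gold standard. It assumes the usual definitions (precision = correct / predicted, recall = correct / gold); the function names `f1` and `predicted_to_gold_ratio` are made up for this illustration, and since the inputs are already rounded to two decimals, the results are only approximate.

```python
def f1(precision, recall):
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

def predicted_to_gold_ratio(precision, recall):
    # correct = precision * predicted = recall * gold,
    # so predicted / gold = recall / precision
    return recall / precision

# Words row of the 0.5b model with the original tokenizer
print(round(f1(62.55, 38.61), 2))                       # 47.75
print(round(predicted_to_gold_ratio(62.55, 38.61), 2))  # 0.62

# Words row of the 0.5b model with the refined tokenizer
print(round(predicted_to_gold_ratio(94.20, 94.86), 2))  # 1.01
```

Under these assumptions, the original tokenizer produces only about 62% as many words as the gold annotation, which is consistent with its segmentation being too coarse for long-unit words, while the refined tokenizer produces roughly the same number of words as the gold data.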

Let's tabulate the UPOS/LAS/MLAS F1 scores.

	Before tokenizer refinement	After tokenizer refinement
sarashina2.2-0.5b-ud-embeds	45.52/19.88/9.54	91.81/78.82/63.30
sarashina2.2-1b-ud-embeds	45.79/19.44/10.92	92.66/78.31/68.97
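The cells of this table can also be pulled out of a saved report automatically. The following is only a sketch under the assumption that the report has the pipe-separated layout shown above, with the F1 score in the fourth column; the helper name `f1_scores` is hypothetical and is not part of conll18_ud_eval.py.

```python
def f1_scores(report, metrics=("UPOS", "LAS", "MLAS")):
    """Return {metric: F1} for the requested rows of a report."""
    scores = {}
    for line in report.splitlines():
        cells = [c.strip() for c in line.split("|")]
        # Data rows look like: Metric | Precision | Recall | F1 Score | AligndAcc
        if len(cells) >= 4 and cells[0] in metrics:
            scores[cells[0]] = float(cells[3])
    return scores

# Excerpt of the 0.5b report with the refined tokenizer
sample = """\
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
UPOS       |     91.49 |     92.13 |     91.81 |     97.12
LAS        |     78.54 |     79.09 |     78.82 |     83.38
MLAS       |     62.80 |     63.80 |     63.30 |     68.26
"""
scores = f1_scores(sample)
print("/".join(f"{scores[m]:.2f}" for m in ("UPOS", "LAS", "MLAS")))
# prints 91.81/78.82/63.30
```

Applied to each of the files under result/, the same function would give the four cells of the table.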

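Compared with the results in my February 25 article, the LLM-jp-3 series appears to achieve higher parsing accuracy than the Sarashina2.2 series, even though both are LLaMA models. I suspect the tokenizer is not the only factor here.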
