
Releasing sarashina2.2-{0.5b,1b}-ud-embeds, NINJAL Long-Unit-Word POS Tagging and Dependency Parsing Models

SB Intuitions' Sarashina2.2 series is a family of Japanese LLaMA models with an input/output width of 8192 tokens, released in 0.5-billion-, 1-billion-, and 3-billion-parameter variants. However, SB Intuitions' language models have a persistent weakness in their tokenizer, which makes them unsuitable as-is for POS tagging and dependency parsing. So, using the method from my February 28 article, I refined the tokenizer so that hiragana is split more finely, and built prototype NINJAL Long-Unit-Word POS tagging and dependency parsing models, sarashina2.2-{0.5b,1b}-ud-embeds. Let's compare their dependency-parsing accuracy. In Google Colaboratory, it goes like this:

```python
!pip install transformers sentencepiece
mdl="KoichiYasuoka/sarashina2.2-{}-ud-embeds"
org="sbintuitions/sarashina2.2-{}"
import os,sys,subprocess
from transformers import pipeline,AutoTokenizer
# Fetch the UD_Japanese-GSDLUW test set (if not already present)
url="https://github.com/UniversalDependencies/UD_Japanese-GSDLUW"
f=os.path.join(os.path.basename(url),"ja_gsdluw-ud-test.conllu")
os.system(f"test -f {f} || git clone --depth=1 {url}")
# Fetch the official CoNLL 2018 shared-task evaluation script
url="https://universaldependencies.org/conll18/conll18_ud_eval.py"
c=os.path.basename(url)
os.system(f"test -f {c} || curl -LO {url}")
# Collect the raw sentences of the test set
with open(f,"r",encoding="utf-8") as r:
  s=[t[8:].strip() for t in r if t.startswith("# text =")]
rst=[]
for z in ["0.5b","1b"]:
  for tkz in ["original","refined"]:
    nlp=pipeline("universal-dependencies",mdl.format(z),trust_remote_code=True)
    if tkz=="original":
      # Swap the unrefined Sarashina2.2 tokenizer back in for comparison
      nlp.tokenizer=AutoTokenizer.from_pretrained(org.format(z))
    # Parse every sentence, then score the output against the gold treebank
    with open("result.conllu","w",encoding="utf-8") as w:
      for t in s:
        w.write(nlp(t))
    p=subprocess.run([sys.executable,c,"-v",f,"result.conllu"],
      encoding="utf-8",stdout=subprocess.PIPE,stderr=subprocess.STDOUT)
    os.system("mkdir -p "+os.path.join("result",mdl.format(z)))
    rst.append(os.path.join("result",mdl.format(z),tkz+".txt"))
    with open(rst[-1],"w",encoding="utf-8") as w:
      print(f"\n*** {mdl.format(z)} ({tkz} tokenizer)",p.stdout,sep="\n",file=w)
!cat {" ".join(rst)}
```

On my machine, I (Koichi Yasuoka) got the following output:

```
*** KoichiYasuoka/sarashina2.2-0.5b-ud-embeds (original tokenizer)
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     62.55 |     38.61 |     47.75 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     62.55 |     38.61 |     47.75 |
UPOS       |     59.63 |     36.80 |     45.52 |     95.33
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     62.55 |     38.61 |     47.75 |    100.00
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     26.58 |     16.41 |     20.29 |     42.50
LAS        |     26.04 |     16.07 |     19.88 |     41.63
CLAS       |     12.93 |      9.69 |     11.07 |     27.01
MLAS       |     11.14 |      8.35 |      9.54 |     23.27
BLEX       |      0.00 |      0.00 |      0.00 |      0.00

*** KoichiYasuoka/sarashina2.2-0.5b-ud-embeds (refined tokenizer)
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     94.20 |     94.86 |     94.53 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     94.20 |     94.86 |     94.53 |
UPOS       |     91.49 |     92.13 |     91.81 |     97.12
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     94.14 |     94.80 |     94.47 |     99.94
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     79.86 |     80.42 |     80.14 |     84.78
LAS        |     78.54 |     79.09 |     78.82 |     83.38
CLAS       |     67.06 |     68.13 |     67.59 |     72.89
MLAS       |     62.80 |     63.80 |     63.30 |     68.26
BLEX       |      0.00 |      0.00 |      0.00 |      0.00

*** KoichiYasuoka/sarashina2.2-1b-ud-embeds (original tokenizer)
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     62.27 |     38.48 |     47.57 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     62.27 |     38.48 |     47.57 |
UPOS       |     59.95 |     37.04 |     45.79 |     96.26
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     62.26 |     38.47 |     47.56 |     99.98
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     25.96 |     16.04 |     19.83 |     41.69
LAS        |     25.45 |     15.73 |     19.44 |     40.87
CLAS       |     14.80 |     11.03 |     12.64 |     30.72
MLAS       |     12.79 |      9.53 |     10.92 |     26.56
BLEX       |      0.00 |      0.00 |      0.00 |      0.00

*** KoichiYasuoka/sarashina2.2-1b-ud-embeds (refined tokenizer)
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     94.36 |     94.79 |     94.58 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     94.36 |     94.79 |     94.58 |
UPOS       |     92.45 |     92.87 |     92.66 |     97.98
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     94.33 |     94.76 |     94.55 |     99.97
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     79.34 |     79.71 |     79.53 |     84.09
LAS        |     78.13 |     78.49 |     78.31 |     82.80
CLAS       |     72.05 |     72.88 |     72.46 |     78.18
MLAS       |     68.58 |     69.36 |     68.97 |     74.41
BLEX       |      0.00 |      0.00 |      0.00 |      0.00
```
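As a side note on why the original tokenizer fares so badly on the Tokens/Words rows: conll18_ud_eval only credits a token when its character span exactly matches the gold segmentation, so a tokenizer that merges long-unit words scores near zero even if every character is covered. A toy sketch of that span-alignment idea (this is an illustration, not the conll18_ud_eval implementation, and the segmentations below are hypothetical examples):

```python
def spans(tokens):
    """Convert a token list into a set of (start, end) character spans."""
    out, pos = set(), 0
    for t in tokens:
        out.add((pos, pos + len(t)))
        pos += len(t)
    return out

def token_f1(gold, system):
    """F1 over exactly matching character spans, as in strict token scoring."""
    g, s = spans(gold), spans(system)
    tp = len(g & s)
    if tp == 0:
        return 0.0
    prec, rec = tp / len(s), tp / len(g)
    return 2 * prec * rec / (prec + rec)

gold   = ["国語研", "の", "係り受け", "解析"]  # gold long-unit words
coarse = ["国語研の", "係り受け解析"]          # coarse split: no span matches
fine   = ["国語研", "の", "係り受け", "解析"]  # refined split: all spans match

print(token_f1(gold, coarse))  # → 0.0
print(token_f1(gold, fine))    # → 1.0
```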

Let's put UPOS/LAS/MLAS into a table.

| | original tokenizer | refined tokenizer |
|---|---|---|
| sarashina2.2-0.5b-ud-embeds | 45.52/19.88/9.54 | 91.81/78.82/63.30 |
| sarashina2.2-1b-ud-embeds | 45.79/19.44/10.92 | 92.66/78.31/68.97 |
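The F1 columns in that table can be pulled out of the evaluator's report mechanically rather than by hand. A minimal sketch, working on the plain-text format that conll18_ud_eval prints (the `sample` rows here are taken from the 0.5b refined-tokenizer results above):

```python
# Extract F1 scores from a conll18_ud_eval "-v" report.
sample = """Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
UPOS       |     91.49 |     92.13 |     91.81 |     97.12
LAS        |     78.54 |     79.09 |     78.82 |     83.38
MLAS       |     62.80 |     63.80 |     63.30 |     68.26
"""

def f1_scores(report, metrics=("UPOS", "LAS", "MLAS")):
    """Return {metric: F1} for the requested rows of the report."""
    out = {}
    for line in report.splitlines():
        cells = [c.strip() for c in line.split("|")]
        # Data rows have 5 "|"-separated cells; F1 Score is the fourth.
        if len(cells) >= 4 and cells[0] in metrics:
            out[cells[0]] = float(cells[3])
    return out

print(f1_scores(sample))  # → {'UPOS': 91.81, 'LAS': 78.82, 'MLAS': 63.3}
```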

Comparing with my February 25 article, the LLM-jp-3 series seems to achieve higher parsing accuracy than the Sarashina2.2 series, even though both are LLaMA models. I suspect the tokenizer is not the only factor here.
