Release of llm-jp-3-{150m,440m,980m,1.8b}-ud-embeds: NINJAL Long-Unit-Word POS-Tagging and Dependency-Parsing Models

Posted at 2025-02-25

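The LLM-jp-3 series is a family of Japanese LLaMA models with an input/output width of 4096 tokens, published at 150M, 440M, 980M, 1.8B, 3.7B, 7.2B, and 13B parameters. A 172B-parameter model apparently exists as well, but it requires a license agreement. At 150M parameters, the smallest model is smaller than the base model of the Aozora Bunko ModernBERT that I (Koichi Yasuoka) built. The next one, at 440M parameters, is slightly smaller than the large Aozora Bunko ModernBERT. With sizes that close, it is only human to want to compare them.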

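However, with an input/output width of 4096 tokens, even after reducing the adjacency matrix for dependency parsing to its upper triangle, roughly 89×89 is the limit; anything larger requires thinning out empty rows. On top of that, the LLM-jp-3 tokenizer is quite quirky. Let's take a look.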
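Before that, a quick sketch of where the 89×89 figure comes from. It assumes that every cell of the upper triangle (diagonal included) occupies one token position, and the `reserved` parameter is a hypothetical allowance for special tokens; neither detail is stated in this article.

```python
def triangle_cells(n: int) -> int:
    """Number of cells in the upper triangle of an n x n matrix, diagonal included."""
    return n * (n + 1) // 2


def max_matrix_size(width: int = 4096, reserved: int = 0) -> int:
    """Largest n whose upper triangle fits into `width - reserved` positions."""
    n = 0
    while triangle_cells(n + 1) <= width - reserved:
        n += 1
    return n


if __name__ == "__main__":
    # `reserved` is an assumed overhead, not a value taken from the models.
    for reserved in (0, 2):
        n = max_matrix_size(4096, reserved)
        print(reserved, n, triangle_cells(n))
```

Under these assumptions the bound is 90 (4095 cells) with nothing reserved, and it drops to 89 (4005 cells) as soon as two positions are set aside, which is consistent with "roughly 89×89". Now for the tokenizer itself.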

>>> from transformers import AutoTokenizer
>>> tkz=AutoTokenizer.from_pretrained("llm-jp/llm-jp-3-150m")
>>> print(tkz.convert_ids_to_tokens(tkz("国境の長いトンネルを抜けると雪国であった。","夜の底が白くなった。")["input_ids"]))
['<s>', '▁国境', 'の', '長い', 'トンネル', 'を', '抜ける', 'と', '雪', '国', 'であった。', '<s>', '▁夜', 'の', '底', 'が', '白', 'くなった', '。']

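The sentence-initial metaspace (▁) gets in the way. Worse, 「であった。」 ends up as a single token with the full stop still attached, which makes it hard to work with. There was no way around it, so I decided to refine the tokenizer.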

>>> from transformers import AutoTokenizer
>>> from tokenizers import Regex
>>> from tokenizers.pre_tokenizers import Sequence,Split,Whitespace,Punctuation
>>> tkz=AutoTokenizer.from_pretrained("llm-jp/llm-jp-3-150m")
>>> tkz.backend_tokenizer.normalizer=None
>>> tkz.backend_tokenizer.pre_tokenizer=Sequence([Split(Regex("[ぁ-ん]"),"isolated"),Whitespace(),Punctuation()])
>>> print(tkz.convert_ids_to_tokens(tkz("国境の長いトンネルを抜けると雪国であった。","夜の底が白くなった。")["input_ids"]))
['<s>', '国境', 'の', '長', 'い', 'トンネル', 'を', '抜', 'け', 'る', 'と', '雪', '国', 'で', 'あ', 'っ', 'た', '。', '<s>', '夜', 'の', '底', 'が', '白', 'く', 'な', 'っ', 'た', '。']
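To see why 「であった。」 can no longer survive as one token, it helps to look at the pre-tokenization step in isolation. The following is a rough standard-library emulation of the `Sequence([Split(...),Whitespace(),Punctuation()])` pre-tokenizer above, not the `tokenizers` implementation itself: the regular expression and the `pre_tokenize` helper are my own approximation, and they treat every non-space, non-word character as its own piece, which may differ from `Punctuation()` for some symbols.

```python
import re

# Approximation of the pre-tokenizer used above (an assumption, not the library code):
#   - each hiragana character (ぁ-ん) becomes its own piece
#   - runs of other word characters (kanji, katakana, digits, ...) stay together
#   - every remaining non-space character becomes its own piece
PIECE = re.compile(r"[ぁ-ん]|[^ぁ-ん\W]+|[^\w\s]")


def pre_tokenize(text: str) -> list[str]:
    """Split text into the pieces inside which subword tokens must stay."""
    return PIECE.findall(text)


if __name__ == "__main__":
    print(pre_tokenize("国境の長いトンネルを抜けると雪国であった。"))
    print(pre_tokenize("夜の底が白くなった。"))
```

Because a subword token cannot cross a piece boundary, every hiragana character and the full stop are forced into separate tokens. In this emulation 雪国 remains a single piece; the further split into 雪 and 国 seen in the real output presumably comes from the model's vocabulary rather than from pre-tokenization.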

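The hiragana may be split a little too finely, but for now I refined the tokenizer this way and built prototype NINJAL long-unit-word POS-tagging and dependency-parsing models, llm-jp-3-{150m,440m,980m,1.8b}-ud-embeds. Following my article from two days ago, let's compare their dependency-parsing accuracy. On Google Colaboratory, it goes like this.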

!pip install transformers
mdl="KoichiYasuoka/llm-jp-3-{}-ud-embeds"
org="llm-jp/llm-jp-3-{}"
import os,sys,subprocess
from transformers import pipeline
# Fetch the UD_Japanese-GSDLUW test set and the CoNLL 2018 evaluation script (skipped if already present)
url="https://github.com/UniversalDependencies/UD_Japanese-GSDLUW"
f=os.path.join(os.path.basename(url),"ja_gsdluw-ud-test.conllu")
os.system(f"test -f {f} || git clone --depth=1 {url}")
url="https://universaldependencies.org/conll18/conll18_ud_eval.py"
c=os.path.basename(url)
os.system(f"test -f {c} || curl -LO {url}")
# Collect the raw sentences from the "# text =" comment lines of the test file
with open(f,"r",encoding="utf-8") as r:
  s=[t[8:].strip() for t in r if t.startswith("# text =")]
rst=[]
for z in ["150m","440m","980m","1.8b"]:
  for tkz in ["original","refined"]:
    nlp=pipeline("universal-dependencies",mdl.format(z),trust_remote_code=True)
    # "refined" keeps the tokenizer loaded by the pipeline;
    # "original" swaps in the unmodified LLM-jp-3 tokenizer
    if tkz=="original":
      from transformers import AutoTokenizer
      from tokenizers.normalizers import Replace
      nlp.tokenizer=AutoTokenizer.from_pretrained(org.format(z))
      nlp.tokenizer.backend_tokenizer.normalizer=Replace(" ","▁")
      nlp.tokenizer.model_input_names=["input_ids","attention_mask"]
    # Parse every test sentence and write the CoNLL-U output to one file
    with open("result.conllu","w",encoding="utf-8") as w:
      for t in s:
        w.write(nlp(t))
    # Score the output against the gold file and save one report per model and tokenizer
    p=subprocess.run([sys.executable,c,"-v",f,"result.conllu"],
      encoding="utf-8",stdout=subprocess.PIPE,stderr=subprocess.STDOUT)
    os.system("mkdir -p "+os.path.join("result",mdl.format(z)))
    rst.append(os.path.join("result",mdl.format(z),tkz+".txt"))
    with open(rst[-1],"w",encoding="utf-8") as w:
      print(f"\n*** {mdl.format(z)} ({tkz} tokenizer)",p.stdout,sep="\n",file=w)
!cat {" ".join(rst)}

On my machine, the following results were output.

*** KoichiYasuoka/llm-jp-3-150m-ud-embeds (original tokenizer)
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     71.64 |     54.32 |     61.79 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     71.64 |     54.32 |     61.79 |
UPOS       |     69.66 |     52.81 |     60.07 |     97.23
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     71.64 |     54.32 |     61.79 |    100.00
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     47.63 |     36.11 |     41.08 |     66.49
LAS        |     46.85 |     35.52 |     40.41 |     65.40
CLAS       |     36.98 |     32.70 |     34.71 |     49.87
MLAS       |     30.73 |     27.17 |     28.84 |     41.43
BLEX       |      0.00 |      0.00 |      0.00 |      0.00

*** KoichiYasuoka/llm-jp-3-150m-ud-embeds (refined tokenizer)
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     95.08 |     95.82 |     95.45 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     95.08 |     95.82 |     95.45 |
UPOS       |     93.44 |     94.17 |     93.81 |     98.28
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     95.05 |     95.79 |     95.42 |     99.97
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     84.29 |     84.94 |     84.62 |     88.65
LAS        |     83.47 |     84.12 |     83.79 |     87.79
CLAS       |     75.88 |     77.03 |     76.45 |     81.62
MLAS       |     72.83 |     73.93 |     73.37 |     78.33
BLEX       |      0.00 |      0.00 |      0.00 |      0.00

*** KoichiYasuoka/llm-jp-3-440m-ud-embeds (original tokenizer)
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     71.12 |     53.89 |     61.32 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     71.12 |     53.89 |     61.32 |
UPOS       |     69.22 |     52.45 |     59.68 |     97.33
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     71.11 |     53.88 |     61.31 |     99.98
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     45.75 |     34.67 |     39.44 |     64.32
LAS        |     45.09 |     34.17 |     38.88 |     63.40
CLAS       |     35.03 |     31.83 |     33.35 |     49.02
MLAS       |     29.30 |     26.62 |     27.89 |     41.00
BLEX       |      0.00 |      0.00 |      0.00 |      0.00

*** KoichiYasuoka/llm-jp-3-440m-ud-embeds (refined tokenizer)
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     95.57 |     96.07 |     95.82 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     95.57 |     96.07 |     95.82 |
UPOS       |     93.90 |     94.39 |     94.15 |     98.25
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     95.57 |     96.07 |     95.82 |    100.00
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     83.76 |     84.20 |     83.98 |     87.64
LAS        |     82.97 |     83.40 |     83.19 |     86.81
CLAS       |     77.57 |     79.38 |     78.46 |     83.99
MLAS       |     74.59 |     76.32 |     75.45 |     80.76
BLEX       |      0.00 |      0.00 |      0.00 |      0.00

*** KoichiYasuoka/llm-jp-3-980m-ud-embeds (original tokenizer)
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     72.53 |     54.93 |     62.52 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     72.53 |     54.93 |     62.52 |
UPOS       |     70.84 |     53.64 |     61.05 |     97.66
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     72.52 |     54.92 |     62.50 |     99.98
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     48.47 |     36.71 |     41.78 |     66.83
LAS        |     47.71 |     36.13 |     41.12 |     65.78
CLAS       |     36.87 |     33.10 |     34.88 |     49.74
MLAS       |     31.12 |     27.94 |     29.44 |     41.98
BLEX       |      0.00 |      0.00 |      0.00 |      0.00

*** KoichiYasuoka/llm-jp-3-980m-ud-embeds (refined tokenizer)
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     95.70 |     96.14 |     95.92 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     95.70 |     96.14 |     95.92 |
UPOS       |     94.34 |     94.78 |     94.56 |     98.58
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     95.67 |     96.12 |     95.89 |     99.97
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     85.80 |     86.20 |     86.00 |     89.66
LAS        |     85.01 |     85.40 |     85.20 |     88.83
CLAS       |     78.39 |     79.53 |     78.96 |     83.70
MLAS       |     75.75 |     76.85 |     76.30 |     80.88
BLEX       |      0.00 |      0.00 |      0.00 |      0.00

*** KoichiYasuoka/llm-jp-3-1.8b-ud-embeds (original tokenizer)
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     72.85 |     54.10 |     62.09 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     72.85 |     54.10 |     62.09 |
UPOS       |     71.00 |     52.73 |     60.52 |     97.47
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     72.83 |     54.09 |     62.08 |     99.98
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     47.60 |     35.36 |     40.58 |     65.35
LAS        |     46.89 |     34.83 |     39.97 |     64.37
CLAS       |     35.28 |     31.06 |     33.03 |     47.83
MLAS       |     29.92 |     26.33 |     28.01 |     40.56
BLEX       |      0.00 |      0.00 |      0.00 |      0.00

*** KoichiYasuoka/llm-jp-3-1.8b-ud-embeds (refined tokenizer)
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     95.74 |     96.14 |     95.94 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     95.74 |     96.14 |     95.94 |
UPOS       |     94.28 |     94.68 |     94.48 |     98.47
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     95.72 |     96.13 |     95.92 |     99.98
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     86.74 |     87.10 |     86.92 |     90.59
LAS        |     85.96 |     86.33 |     86.14 |     89.79
CLAS       |     79.07 |     79.73 |     79.40 |     84.05
MLAS       |     76.39 |     77.03 |     76.71 |     81.20
BLEX       |      0.00 |      0.00 |      0.00 |      0.00
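As a reading aid for the reports above, here is a minimal sketch of how the columns relate. The F1 column is the harmonic mean of precision and recall, and the ratio recall/precision equals the number of system words divided by the number of gold words. The figures are the Tokens rows of the 150m model copied from the output above; since they are already rounded to two decimals, a recomputed value can differ from the report in the last digit.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def system_to_gold_ratio(precision: float, recall: float) -> float:
    """precision = correct/system and recall = correct/gold, so recall/precision = system/gold."""
    return recall / precision


# (precision, recall, reported F1) from the Tokens rows of llm-jp-3-150m-ud-embeds above
rows = {
    "original": (71.64, 54.32, 61.79),
    "refined": (95.08, 95.82, 95.45),
}

if __name__ == "__main__":
    for name, (p, r, reported) in rows.items():
        print(name, round(f1(p, r), 2), reported, round(system_to_gold_ratio(p, r), 3))
```

Read this way, the original tokenizer yields only about three quarters as many words as the gold annotation, while the refined tokenizer yields slightly more words than the gold annotation.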

Let's put the UPOS/LAS/MLAS F1 scores into a table.

	Before tokenizer refinement	After tokenizer refinement
llm-jp-3-150m-ud-embeds	60.07/40.41/28.84	93.81/83.79/73.37
llm-jp-3-440m-ud-embeds	59.68/38.88/27.89	94.15/83.19/75.45
llm-jp-3-980m-ud-embeds	61.05/41.12/29.44	94.56/85.20/76.30
llm-jp-3-1.8b-ud-embeds	60.52/39.97/28.01	94.48/86.14/76.71

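As far as these results go, refining the tokenizer clearly contributes to POS-tagging and dependency-parsing accuracy: LAS rises from around 40 to between 83 and 86. Model size (parameter count), on the other hand, helps slightly but matters far less than I expected. This only concerns POS tagging and dependency parsing, of course, but personally I would be glad if tokenizers were designed with a little more care.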
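The comparison can be made explicit with a few lines of arithmetic on the table. This is only a sketch over the F1 scores copied from above; the helper names `las_gain` and `las_spread_after` are mine.

```python
# (UPOS, LAS, MLAS) F1 scores copied from the table: (before, after) tokenizer refinement
scores = {
    "150m": ((60.07, 40.41, 28.84), (93.81, 83.79, 73.37)),
    "440m": ((59.68, 38.88, 27.89), (94.15, 83.19, 75.45)),
    "980m": ((61.05, 41.12, 29.44), (94.56, 85.20, 76.30)),
    "1.8b": ((60.52, 39.97, 28.01), (94.48, 86.14, 76.71)),
}
LAS = 1  # index of LAS inside each (UPOS, LAS, MLAS) triple


def las_gain(size: str) -> float:
    """LAS improvement from refining the tokenizer, for one model size."""
    before, after = scores[size]
    return round(after[LAS] - before[LAS], 2)


def las_spread_after() -> float:
    """Difference between the best and worst LAS across sizes, after refinement."""
    values = [after[LAS] for _, after in scores.values()]
    return round(max(values) - min(values), 2)


if __name__ == "__main__":
    for size in scores:
        print(size, las_gain(size))
    print("spread across sizes:", las_spread_after())
```

The tokenizer change is worth more than 43 LAS points at every size, whereas going from the worst to the best model size after refinement changes LAS by less than 3 points.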
