How does DiffLlama-1B tokenize the opening of 『雪国』 (Snow Country)?

Posted at 2025-04-14, last updated at 2025-04-14

Continuing from my article of the day before yesterday, I checked the tokenizer of the Japanese DiffLlama model "DiffLlama-1B".

>>> from transformers import AutoTokenizer
>>> tkz=AutoTokenizer.from_pretrained("kajuma/DiffLlama-1B")
>>> print(tkz.convert_ids_to_tokens(tkz("国境の長いトンネルを抜けると雪国であった。","夜の底が白くなった。")["input_ids"]))
['国境', 'の長い', 'トンネル', 'を抜け', 'ると', '雪', '国', 'であった', '。', '夜の', '底', 'が', '白', 'くなった', '。']

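The tokens 「の長い」, 「を抜け」, 「ると」, and 「くなった」 completely ignore word boundaries (morpheme boundaries), which makes this tokenizer rather hard to use. With no better option, I decided to improve the tokenizer using the method from my February 28 article.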

>>> from transformers import AutoTokenizer
>>> from tokenizers import Regex
>>> from tokenizers.pre_tokenizers import Sequence,Split
>>> tkz=AutoTokenizer.from_pretrained("kajuma/DiffLlama-1B")
>>> tkz.backend_tokenizer.pre_tokenizer=Sequence([Split(Regex("[ぁ-ん]"),"isolated"),tkz.backend_tokenizer.pre_tokenizer])
>>> print(tkz.convert_ids_to_tokens(tkz("国境の長いトンネルを抜けると雪国であった。","夜の底が白くなった。")["input_ids"]))
['国境', 'の', '長', 'い', 'トンネル', 'を', '抜', 'け', 'る', 'と', '雪', '国', 'で', 'あ', 'っ', 'た', '。', '夜', 'の', '底', 'が', '白', 'く', 'な', 'っ', 'た', '。']
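
To make the effect of the added pre-tokenization step easier to see on its own, here is a minimal sketch in plain Python that emulates an "isolated" split on the same character class `[ぁ-ん]`. The function name `split_isolated` is hypothetical and this is only an emulation with the standard `re` module, not the `tokenizers` implementation. It stops at the pre-tokenization stage, so pieces such as 「雪国」 are still whole here and are only broken up later by the model's own subword vocabulary.

```python
import re

# Same character class as in the Split(Regex("[ぁ-ん]"), "isolated") call above.
HIRAGANA = re.compile(r"[ぁ-ん]")

def split_isolated(text, pattern=HIRAGANA):
    """Emulate an 'isolated' split: every match becomes its own piece,
    and the text between matches is kept as separate pieces."""
    pieces = []
    last = 0
    for m in pattern.finditer(text):
        if m.start() > last:
            pieces.append(text[last:m.start()])  # non-hiragana run before the match
        pieces.append(m.group())                 # the single hiragana character
        last = m.end()
    if last < len(text):
        pieces.append(text[last:])               # trailing non-hiragana run
    return pieces

print(split_isolated("国境の長いトンネルを抜けると雪国であった。"))
```

Because every hiragana character is cut off before subword merging, no token can span a hiragana character and its neighbours, which is why merged tokens like 「の長い」 no longer appear.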

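The hiragana may be split a little too finely, but for now I refined the tokenizer this way and built a prototype model, DiffLlama-1B-ud-embeds, for NINJAL (国語研) long-unit-word POS tagging and dependency parsing. Following my February 25 article, let's compare the dependency-parsing accuracy. On Google Colaboratory, it goes like this.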

# Install dependencies (Google Colaboratory cell)
!pip install transformers flash-attn
mdl="KoichiYasuoka/DiffLlama-1B-ud-embeds"  # prototype UD model
org="kajuma/DiffLlama-1B"  # base model, source of the original tokenizer
import os,sys,subprocess
from transformers import pipeline,AutoTokenizer
# Fetch the UD_Japanese-GSDLUW test set unless it is already present
url="https://github.com/UniversalDependencies/UD_Japanese-GSDLUW"
f=os.path.join(os.path.basename(url),"ja_gsdluw-ud-test.conllu")
os.system(f"test -f {f} || git clone --depth=1 {url}")
# Fetch the CoNLL 2018 evaluation script unless it is already present
url="https://universaldependencies.org/conll18/conll18_ud_eval.py"
c=os.path.basename(url)
os.system(f"test -f {c} || curl -LO {url}")
# Collect the raw sentences from the "# text =" comment lines
with open(f,"r",encoding="utf-8") as r:
  s=[t[8:].strip() for t in r if t.startswith("# text =")]
rst=[]
for tkz in ["original","refined"]:
  nlp=pipeline("universal-dependencies",mdl,trust_remote_code=True)
  # For the "original" run, swap in the base model's tokenizer
  if tkz=="original":
    nlp.tokenizer=AutoTokenizer.from_pretrained(org)
  # Parse every sentence and write the CoNLL-U output
  with open("result.conllu","w",encoding="utf-8") as w:
    for t in s:
      w.write(nlp(t))
  # Score the output against the gold test file
  p=subprocess.run([sys.executable,c,"-v",f,"result.conllu"],
    encoding="utf-8",stdout=subprocess.PIPE,stderr=subprocess.STDOUT)
  os.system("mkdir -p "+os.path.join("result",mdl))
  rst.append(os.path.join("result",mdl,tkz+".txt"))
  with open(rst[-1],"w",encoding="utf-8") as w:
    print(f"\n*** {mdl} ({tkz} tokenizer)",p.stdout,sep="\n",file=w)
# Show both score tables
!cat {" ".join(rst)}

On my (Koichi Yasuoka's) machine, I obtained the following results.

*** KoichiYasuoka/DiffLlama-1B-ud-embeds (original tokenizer)
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     61.44 |     40.51 |     48.82 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     61.44 |     40.51 |     48.82 |
UPOS       |     57.95 |     38.20 |     46.05 |     94.32
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     61.43 |     40.50 |     48.81 |     99.98
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     28.16 |     18.57 |     22.38 |     45.83
LAS        |     27.51 |     18.13 |     21.86 |     44.77
CLAS       |     16.28 |     12.43 |     14.10 |     31.05
MLAS       |     13.26 |     10.13 |     11.48 |     25.29
BLEX       |      0.00 |      0.00 |      0.00 |      0.00

*** KoichiYasuoka/DiffLlama-1B-ud-embeds (refined tokenizer)
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     93.27 |     94.28 |     93.78 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     93.27 |     94.28 |     93.78 |
UPOS       |     88.88 |     89.84 |     89.36 |     95.29
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     93.21 |     94.22 |     93.71 |     99.93
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     79.59 |     80.46 |     80.02 |     85.33
LAS        |     77.94 |     78.79 |     78.36 |     83.56
CLAS       |     69.12 |     69.91 |     69.51 |     75.43
MLAS       |     63.56 |     64.29 |     63.92 |     69.36
BLEX       |      0.00 |      0.00 |      0.00 |      0.00

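Judging by UPOS/LAS/MLAS, the F1 scores are 46.05/21.86/11.48 before the tokenizer refinement and 89.36/78.36/63.92 after it. The refinement clearly has an effect, but the parsing accuracy still does not rise as far as I would like. Hmm, perhaps this is the limit of DiffLlama as a model.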
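
As a reading aid for the two tables, the sketch below recomputes the F1 Score and AligndAcc columns from the rounded Precision and Recall values quoted above. The helper names `f1` and `aligned_acc` are hypothetical, chosen for illustration, and are not functions of conll18_ud_eval.py. Since that script works from raw counts rather than rounded percentages, the recomputed numbers can differ from the tables in the last digit.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

def aligned_acc(metric_precision, words_precision):
    """Accuracy on aligned words: correct / aligned.
    Both precisions share the system word count as denominator,
    so their ratio cancels it out."""
    return 100 * metric_precision / words_precision if words_precision else 0.0

# (Words P, Words R, UPOS P, LAS P) copied from the tables above
runs = {
    "original": (61.44, 40.51, 57.95, 27.51),
    "refined":  (93.27, 94.28, 88.88, 77.94),
}

for name, (wp, wr, upos_p, las_p) in runs.items():
    print(name,
          "Words F1 ~ %.2f" % f1(wp, wr),
          "UPOS AligndAcc ~ %.2f" % aligned_acc(upos_p, wp),
          "LAS AligndAcc ~ %.2f" % aligned_acc(las_p, wp))
```

Read this way, the UPOS accuracy on aligned words barely moves between the two runs (94.32 versus 95.29), while the Words F1 rises from 48.82 to 93.78 and the LAS on aligned words from 44.77 to 83.56.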
