Measuring the "accuracy" of Japanese tokenizers on the test sets of UD_Japanese-GSD and UD_Japanese-Modern

Posted at 2024-04-12

I applied the idea from yesterday's article to UD_Japanese-GSD and tried measuring the "accuracy" of the tokenizers used by Japanese models. On Google Colaboratory, it looks like this.

!pip install transformers sentencepiece spacy-alignments fugashi unidic-lite
models=["tohoku-nlp/bert-base-japanese-v2","rinna/japanese-gpt-neox-3.6b","stockmark/gpt-neox-japanese-1.4b","tokyotech-llm/Swallow-MS-7b-v0.1","Rakuten/RakutenAI-7B","K-walk/chimaki-2b-base"]
ud="UD_Japanese-GSD"
!test -d $ud || git clone --depth=1 https://github.com/universaldependencies/$ud
!cp $ud/*-test.conllu test.conllu
from transformers import AutoTokenizer
from spacy_alignments import get_alignments
for mdl in models:
  tkz=AutoTokenizer.from_pretrained(mdl)
  gold=system=correct=0
  with open("test.conllu","r",encoding="utf-8") as r:
    for k in r:
      if k.startswith("# text ="):
        txt=k[8:].strip()
        frm=[]
      elif k.strip()=="":
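        # Gold spans: (start,end) character offsets of each UD word form within the sentence text
        g=[(t[0],t[-1]+1) for t in get_alignments(list(txt),frm)[1]]
        try:
          # Fast tokenizers report character offsets directly; drop empty spans such as special tokens
          s=[t for t in tkz(txt,return_offsets_mapping=True)["offset_mapping"] if t[0]<t[1]]
        except Exception:
          # No offset mapping available (slow tokenizers), so align the token strings to the characters instead
          s=[(t[0],t[-1]+1) if t>[] else (0,0) for t in get_alignments(list(txt),tkz.tokenize(txt))[1]]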
        gold+=len(g)
        system+=len(s)
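        # Walk both span lists in order of start offset; a system token is correct only if its start and end both match a gold word
        i=j=0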
        while i<len(g) and j<len(s):
          if s[j][0]<g[i][0]:
            j+=1
          elif g[i][0]<s[j][0]:
            i+=1
          else:
            correct+=g[i][1]==s[j][1]
            i+=1
            j+=1
      else:
        t=k.split("\t")
        if len(t)==10 and t[0].isdecimal():
          frm.append(t[1])
  print("\n***",mdl)
  print("Precision",correct/system if system else 0.0)
  print("Recall   ",correct/gold)
  print("F1 Score ",2*correct/(system+gold))
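To make the scoring concrete, here is a small pure-Python sketch of the same span-matching idea on a single hand-made sentence. The helper names `forms_to_spans` and `score` are hypothetical and do not appear in the program above. `forms_to_spans` is a simplified stand-in for the `get_alignments` call and assumes every form occurs verbatim in the text, in order. The segmentations are toy examples, not taken from the treebank. The set intersection counts the same matches as the two-pointer loop above as long as the spans in each list are distinct and non-overlapping.

```python
def forms_to_spans(text, forms):
    """Map each form to its (start, end) character span in text.

    Simplified stand-in for get_alignments: assumes each form occurs
    verbatim in text, in the given order.
    """
    spans = []
    pos = 0
    for form in forms:
        start = text.index(form, pos)  # raises ValueError if the form is missing
        end = start + len(form)
        spans.append((start, end))
        pos = end
    return spans


def score(gold, system):
    """Precision, recall and F1 over spans whose start and end both match."""
    correct = len(set(gold) & set(system))
    precision = correct / len(system) if system else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = len(system) + len(gold)
    f1 = 2 * correct / total if total else 0.0
    return precision, recall, f1


# Toy data (an assumption for illustration, not from UD_Japanese-GSD)
text = "国境の長いトンネル"
gold_forms = ["国境", "の", "長い", "トンネル"]          # 4 gold words
system_tokens = ["国境", "の", "長", "い", "トンネル"]   # 5 system tokens

gold_spans = forms_to_spans(text, gold_forms)
system_spans = forms_to_spans(text, system_tokens)
precision, recall, f1 = score(gold_spans, system_spans)
print("Precision", precision)
print("Recall   ", recall)
print("F1 Score ", f1)
```

Here the system splits 長い into two tokens, so neither 長 nor い matches a gold span: 3 of the 5 system tokens and 3 of the 4 gold words agree. Over-segmentation like this lowers precision more than recall.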

On my machine (I am Koichi Yasuoka), I got the following results.

*** tohoku-nlp/bert-base-japanese-v2
Precision 0.8708618721461188
Recall    0.9364738376553629
F1 Score  0.9024768946395564

*** rinna/japanese-gpt-neox-3.6b
Precision 0.5398446733074456
Recall    0.4906398649685438
F1 Score  0.5140675241157556

*** stockmark/gpt-neox-japanese-1.4b
Precision 0.4952922917744439
Recall    0.36727021635721957
F1 Score  0.42178069518480993

*** tokyotech-llm/Swallow-MS-7b-v0.1
Precision 0.7043613286524439
Recall    0.8004449900260856
F1 Score  0.7493356316885729

*** Rakuten/RakutenAI-7B
Precision 0.3627563076453998
Recall    0.3651219886450821
F1 Score  0.3639353037892402

*** K-walk/chimaki-2b-base
Precision 0.49307685929856565
Recall    0.45626822157434405
F1 Score  0.4739589559673242

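UD_Japanese-GSD follows the NINJAL Short Unit Word (国語研短単位) standard, so these results can be read as showing how close each tokenizer's segmentation is to NINJAL short units. However, those are the short units of contemporary Japanese (new character forms and new kana usage). Switching to the short units of modern-era Japanese, by changing the third line of the program to ud="UD_Japanese-Modern", gave the following results.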

*** tohoku-nlp/bert-base-japanese-v2
Precision 0.5946605983238052
Recall    0.7392024285911412
F1 Score  0.659099996924118

*** rinna/japanese-gpt-neox-3.6b
Precision 0.43397942363440006
Recall    0.5267696978059887
F1 Score  0.47589366410072614

*** stockmark/gpt-neox-japanese-1.4b
Precision 0.36781874927754016
Recall    0.43907823927142264
F1 Score  0.4003019247704114

*** tokyotech-llm/Swallow-MS-7b-v0.1
Precision 0.47191065325181725
Recall    0.676348834000276
F1 Score  0.5559304732469448

*** Rakuten/RakutenAI-7B
Precision 0.29508700102354146
Recall    0.3978197874982751
F1 Score  0.33883763295528

*** K-walk/chimaki-2b-base
Precision 0.4500943841434639
Recall    0.5264247274734373
F1 Score  0.48527634675316417
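As a reading aid, the two listings can be put side by side with a few lines of arithmetic. This sketch only copies the F1 scores printed above (rounded to four decimals) and computes the difference per model. The variable names are my own. Since the two test sets differ in size and content, the differences are indicative only.

```python
# F1 scores copied from the two result listings above, rounded to four decimals
f1_gsd = {
    "tohoku-nlp/bert-base-japanese-v2": 0.9025,
    "rinna/japanese-gpt-neox-3.6b": 0.5141,
    "stockmark/gpt-neox-japanese-1.4b": 0.4218,
    "tokyotech-llm/Swallow-MS-7b-v0.1": 0.7493,
    "Rakuten/RakutenAI-7B": 0.3639,
    "K-walk/chimaki-2b-base": 0.4740,
}
f1_modern = {
    "tohoku-nlp/bert-base-japanese-v2": 0.6591,
    "rinna/japanese-gpt-neox-3.6b": 0.4759,
    "stockmark/gpt-neox-japanese-1.4b": 0.4003,
    "tokyotech-llm/Swallow-MS-7b-v0.1": 0.5559,
    "Rakuten/RakutenAI-7B": 0.3388,
    "K-walk/chimaki-2b-base": 0.4853,
}

# Change in F1 when moving from UD_Japanese-GSD to UD_Japanese-Modern
delta = {m: round(f1_modern[m] - f1_gsd[m], 4) for m in f1_gsd}

# Print models from the largest drop to the largest gain
for m, d in sorted(delta.items(), key=lambda kv: kv[1]):
    print(f"{m:34s} {f1_gsd[m]:.4f} -> {f1_modern[m]:.4f} ({d:+.4f})")
```

In these numbers, tohoku-nlp/bert-base-japanese-v2 loses the most F1 but still scores highest on both test sets, while K-walk/chimaki-2b-base is the only model whose F1 is slightly higher on UD_Japanese-Modern.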

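Admittedly, I don't think many fields need a model for modern-era Japanese written in old character forms and old kana usage, but I hope these results are a useful reference all the same.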
