Measuring the POS-tagging "accuracy" of Japanese generative AI with UD_Japanese-GSDLUW

Posted at 2024-06-17

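Following up on yesterday's article, I took the method of "UPOS tagging in NINJAL Long Unit Words via few-shot prompting" and turned it into something like a benchmark using UD_Japanese-GSDLUW. As before, this is 12-shot prompting: the script takes the number of shots to be the integer ratio of training sentences to test sentences, and gives each test sentence its own slice of the training data as examples.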
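To make the prompt format concrete before the full script, here is a minimal sketch of how the few-shot prompt is laid out and how the answer line is pulled back out of the generated text. The helper names `build_prompt` and `parse_answer` are mine, not part of the script, and the two example sentences are hand-made toys rather than sentences from UD_Japanese-GSDLUW.

```python
def build_prompt(shots, text):
    """Lay out few-shot examples as alternating ###text: / ###UPOS: lines.

    shots: list of (sentence, "form_UPOS|form_UPOS|...") pairs (hypothetical helper input)
    text:  the sentence to be tagged
    """
    lines = []
    for sentence, tagged in shots:
        lines.append("###text:" + sentence)
        lines.append("###UPOS:" + tagged)
    lines.append("###text:" + text)
    lines.append("###UPOS:")  # left open for the model to complete
    return "\n".join(lines)


def parse_answer(generated, n_shots):
    """Take the line that completes the final ###UPOS: and split it into tokens."""
    line = generated.split("\n")[n_shots * 2 + 1]
    return line[len("###UPOS:"):].split("|")


# Toy data, assumed for illustration only
shots = [("これはペンです", "これ_PRON|は_ADP|ペン_NOUN|です_AUX")]
prompt = build_prompt(shots, "私は学生です")

# Simulated model output: a text-generation pipeline returns prompt + continuation
generated = prompt + "私_PRON|は_ADP|学生_NOUN|です_AUX\n###text:次の文"
print(parse_answer(generated, len(shots)))
```

Anything the model keeps generating after the answer line (here the stray `###text:次の文`) is simply ignored, because only one line is read back.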

#! /usr/bin/python3
# pip3 install transformers accelerate spacy-alignments
model="tokyotech-llm/Swallow-MS-7b-v0.1"
ud="https://github.com/UniversalDependencies/UD_Japanese-GSDLUW"
import os
d=os.path.basename(ud)
os.system(f"test -d {d} || git clone --depth=1 {ud}")
os.system("for F in train dev test; do cp "+d+"/*-$F.conllu $F.conllu; done")
with open("train.conllu","r",encoding="utf-8") as r:
  trn=r.read().strip().split("\n\n")
with open("test.conllu","r",encoding="utf-8") as r:
  tst=r.read().strip().split("\n\n")
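# Extract (raw text, "form_UPOS|form_UPOS|...") from one CoNLL-U sentence block
def ext(x):
  u=v=""
  for s in x.split("\n"):
    if s.startswith("# text ="):
      u=s[8:].strip()
    else:
      t=s.split("\t")
      # Only plain integer IDs are token lines; comments and other ID types are skipped
      if t[0].isdigit():
        v+="|"+t[1]+"_"+t[3]
  return (u,v[1:])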
from transformers import pipeline
nlp=pipeline("text-generation",model,max_new_tokens=256,device_map="auto")
from spacy_alignments import get_alignments
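# Give each test sentence its own slice of j training sentences as few-shot examples
i,j=0,int(len(trn)/len(tst))
gold=system=correct=0
for t in tst:
  w="\n".join("###text:"+"\n###UPOS:".join(ext(x)) for x in trn[i*j:i*j+j])
  u,v=ext(t)
  w+="\n###text:"+u+"\n###UPOS:"
  g=v.split("|")
  # The output is prompt + continuation; line j*2+1 is the completed final "###UPOS:" line
  s=nlp(w)[0]["generated_text"].split("\n")[j*2+1][8:].split("|")
  gold+=len(g)
  system+=len(s)
  # A system token is correct only if it aligns to exactly one gold token and is identical to it
  correct+=sum(1 for t,k in zip(s,get_alignments(g,s)[1]) if len(k)==1 and t==g[k[0]])
  i+=1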
print("\n***",model)
print("Precision",correct/system if system else 0.0)
print("Recall   ",correct/gold)
print("F1 Score ",2*correct/(system+gold))
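The scoring step may be easier to follow on a toy example. The sketch below is a simplified stand-in, not the script's actual logic: instead of `spacy_alignments.get_alignments`, it matches tokens by character span over the concatenated word forms, which gives the same idea (a token counts only if both its boundaries and its UPOS agree with gold) but can behave differently when the model's output drifts from the input text. The function names `spans` and `score` and the example tokens are assumptions for illustration.

```python
def spans(tokens):
    """Map "form_UPOS" tokens to {(start, end): upos} over the concatenated forms."""
    out, pos = {}, 0
    for tok in tokens:
        form, sep, upos = tok.rpartition("_")
        if not sep:
            # Malformed token without a tag: advance past it but never count it
            pos += len(tok)
            continue
        out[(pos, pos + len(form))] = upos
        pos += len(form)
    return out


def score(gold, system):
    """Return (precision, recall, F1) using span-and-tag matching."""
    g, s = spans(gold), spans(system)
    correct = sum(1 for span, upos in s.items() if g.get(span) == upos)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = len(system) + len(gold)
    f1 = 2 * correct / total if total else 0.0
    return precision, recall, f1


# Toy data: the system splits one long unit into two short ones
gold = ["東京大学_PROPN", "に_ADP", "行く_VERB"]
system = ["東京_PROPN", "大学_NOUN", "に_ADP", "行く_VERB"]
p, r, f = score(gold, system)
print("Precision", p)
print("Recall   ", r)
print("F1 Score ", f)
```

Here only に and 行く match in both span and tag, so 2 of the 4 system tokens and 2 of the 3 gold tokens are correct. A segmentation mismatch such as 東京大学 versus 東京 + 大学 costs both precision and recall, which is why long-unit segmentation errors weigh so heavily in this kind of benchmark.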

Running the script while changing the model on line 3, I (Koichi Yasuoka) obtained the following results on my machine.

*** tokyotech-llm/Swallow-MS-7b-v0.1
Precision 0.6013539282990084
Recall    0.6048139624088991
F1 Score  0.6030789825970548

*** stabilityai/japanese-stablelm-base-gamma-7b
Precision 0.6442735248167287
Recall    0.5140966628308401
F1 Score  0.5718704997599872

*** rinna/llama-3-youko-8b
Precision 0.6153669197147458
Recall    0.513041810510165
F1 Score  0.559564899069135

*** rinna/youri-7b
Precision 0.5705988383737232
Recall    0.5464135021097046
F1 Score  0.5582443421181542

*** Rakuten/RakutenAI-7B
Precision 0.5327243293246994
Recall    0.4417913310318374
F1 Score  0.48301530719228347

*** cyberagent/calm2-7b
Precision 0.4641642228739003
Recall    0.3794591484464902
F1 Score  0.41755922545243496

*** lightblue/suzume-llama-3-8B-japanese
Precision 0.4584841761065047
Recall    0.37648638281549673
F1 Score  0.4134590068980043
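As a quick sanity check on the figures above, the F1 that the script prints as 2*correct/(system+gold) is algebraically the harmonic mean of precision and recall, so it can be recomputed from the two printed values. The sketch below does that, and also derives the system-to-gold token ratio, which equals recall/precision. The helper names are mine, and the interpretation of the ratio is only a reading of the numbers, not something I measured separately.

```python
# (precision, recall, F1) exactly as printed above
results = {
    "tokyotech-llm/Swallow-MS-7b-v0.1": (0.6013539282990084, 0.6048139624088991, 0.6030789825970548),
    "stabilityai/japanese-stablelm-base-gamma-7b": (0.6442735248167287, 0.5140966628308401, 0.5718704997599872),
    "rinna/llama-3-youko-8b": (0.6153669197147458, 0.513041810510165, 0.559564899069135),
    "rinna/youri-7b": (0.5705988383737232, 0.5464135021097046, 0.5582443421181542),
    "Rakuten/RakutenAI-7B": (0.5327243293246994, 0.4417913310318374, 0.48301530719228347),
    "cyberagent/calm2-7b": (0.4641642228739003, 0.3794591484464902, 0.41755922545243496),
    "lightblue/suzume-llama-3-8B-japanese": (0.4584841761065047, 0.37648638281549673, 0.4134590068980043),
}


def harmonic_f1(p, r):
    """F1 as the harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


def system_to_gold_ratio(p, r):
    """system/gold = (correct/p)/(correct/r) = r/p."""
    return r / p


for name, (p, r, f1) in results.items():
    consistent = abs(harmonic_f1(p, r) - f1) < 1e-6
    print(f"{name}: consistent={consistent}, system/gold={system_to_gold_ratio(p, r):.3f}")
```

For most models precision is noticeably higher than recall, so the ratio is below 1: they produced fewer tokens than the gold standard contains. That would be consistent with answers being cut short or words being merged, though I have not checked which.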

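Unfortunately, the F1 scores are low, and I have to say they fall short as POS-tagging "accuracy". Even the year before last, in "青空文庫DeBERTaモデルによる国語研長単位係り受け解析" (dependency parsing in NINJAL Long Unit Words with Aozora Bunko DeBERTa models), the F1 score for "UPOS" was 0.95 or higher, so it seems fair to say that POS tagging by generative AI is, at this point, still far from practical. Hmm, too bad.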
