Measuring the part-of-speech tagging "accuracy" of Japanese generative AI with UD_Japanese-GSD

Posted at 2024-06-16

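In my June 11 article I described doing UPOS part-of-speech tagging over NINJAL Short Unit Words (国語研短単位) with few-shot prompting. Here I turned that approach into something like a benchmark using UD_Japanese-GSD. UD_Japanese-GSD has 543 sentences in ja_gsd-ud-test.conllu and 7050 sentences in ja_gsd-ud-train.conllu, so I implemented it to process one test sentence for every 12 sentences taken from train (7050 // 543 = 12). In other words, this is 12-shot prompting.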

#! /usr/bin/python3
# pip3 install transformers accelerate spacy-alignments
model="tokyotech-llm/Swallow-MS-7b-v0.1"
ud="https://github.com/UniversalDependencies/UD_Japanese-GSD"
import os
d=os.path.basename(ud)
os.system(f"test -d {d} || git clone --depth=1 {ud}")
os.system("for F in train dev test; do cp "+d+"/*-$F.conllu $F.conllu; done")
with open("train.conllu","r",encoding="utf-8") as r:
  trn=r.read().strip().split("\n\n")
with open("test.conllu","r",encoding="utf-8") as r:
  tst=r.read().strip().split("\n\n")
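def ext(x):
  # Turn one CoNLL-U sentence block into (raw text, "form_UPOS|form_UPOS|...")
  u=v=""
  for s in x.split("\n"):
    if s.startswith("# text ="):
      u=s[8:].strip()
    else:
      t=s.split("\t")
      # Keep plain token rows only; comments and ranges such as "1-2" fail isdigit()
      if t[0].isdigit():
        v+="|"+t[1]+"_"+t[3]
  return (u,v[1:])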
from transformers import pipeline
nlp=pipeline("text-generation",model,max_new_tokens=256,device_map="auto")
from spacy_alignments import get_alignments
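# j = number of train sentences (shots) per test sentence; i = index of the current test sentence
i,j=0,int(len(trn)/len(tst))
gold=system=correct=0
for t in tst:
  # Prompt: j solved examples from train, then the test sentence with an empty "###UPOS:" line
  w="\n".join("###text:"+"\n###UPOS:".join(ext(x)) for x in trn[i*j:i*j+j])
  u,v=ext(t)
  w+="\n###text:"+u+"\n###UPOS:"
  g=v.split("|")
  # generated_text echoes the prompt, so the answer is on line j*2+1; [8:] strips "###UPOS:"
  s=nlp(w)[0]["generated_text"].split("\n")[j*2+1][8:].split("|")
  gold+=len(g)
  system+=len(s)
  # A system token is correct if it aligns to exactly one gold token and equals it (form and UPOS)
  correct+=sum(1 for t,k in zip(s,get_alignments(g,s)[1]) if len(k)==1 and t==g[k[0]])
  i+=1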
print("\n***",model)
print("Precision",correct/system if system else 0.0)
print("Recall   ",correct/gold)
print("F1 Score ",2*correct/(system+gold))
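
To make the prompt layout concrete, here is a minimal sketch that rebuilds it for two toy sentences instead of twelve. The sentences, their tags, and the helper names `row` and `block` are hypothetical illustrations and are not taken from UD_Japanese-GSD. `ext` is restated with longer variable names but is meant to follow the same logic as in the script.

```python
def ext(conllu_block):
    """Return (raw text, 'form_UPOS|form_UPOS|...') for one CoNLL-U sentence block."""
    text, tagged = "", []
    for line in conllu_block.split("\n"):
        if line.startswith("# text ="):
            text = line[8:].strip()
        else:
            cols = line.split("\t")
            if cols[0].isdigit():          # plain token rows only
                tagged.append(cols[1] + "_" + cols[3])
    return text, "|".join(tagged)

def row(i, form, upos):
    # Hypothetical helper: only ID, FORM and UPOS are filled in,
    # the remaining CoNLL-U columns are left as "_".
    return "\t".join([str(i), form, "_", upos] + ["_"] * 6)

def block(text, pairs):
    # Hypothetical helper: build a minimal CoNLL-U sentence block.
    lines = ["# text = " + text]
    lines += [row(i, form, upos) for i, (form, upos) in enumerate(pairs, 1)]
    return "\n".join(lines)

# Toy "train" sentences (assumed tags, for illustration only)
shots = [
    block("猫が好きだ", [("猫", "NOUN"), ("が", "ADP"), ("好き", "ADJ"), ("だ", "AUX")]),
    block("本を読む", [("本", "NOUN"), ("を", "ADP"), ("読む", "VERB")]),
]
target = "犬が走る"  # toy "test" sentence

# Same layout as the script: solved examples, then the open question
prompt = "\n".join("###text:" + "\n###UPOS:".join(ext(x)) for x in shots)
prompt += "\n###text:" + target + "\n###UPOS:"
print(prompt)

# The model is expected to continue this line of the echoed prompt
answer_line = len(shots) * 2 + 1
print(answer_line, repr(prompt.split("\n")[answer_line]))
```

With `len(shots)` examples the prompt has `len(shots)*2+2` lines, which is why the script reads the answer from line `j*2+1` of the generated text.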

Running the script while changing model on its line 3, I (Koichi Yasuoka) obtained the following results on my machine.

*** tokyotech-llm/Swallow-MS-7b-v0.1
Precision 0.6868431771894093
Recall    0.6468467086082553
F1 Score  0.6662452092141136

*** stabilityai/japanese-stablelm-base-gamma-7b
Precision 0.6976513992621255
Recall    0.5948289090072119
F1 Score  0.6421501635813973

*** rinna/youri-7b
Precision 0.633976833976834
Recall    0.5669019487494246
F1 Score  0.598566163068573

*** rinna/llama-3-youko-8b
Precision 0.6315032797196514
Recall    0.5392051557465092
F1 Score  0.581715846542234

*** Rakuten/RakutenAI-7B
Precision 0.5840498913351602
Recall    0.47422126745435017
F1 Score  0.5234365076004573

*** cyberagent/calm2-7b
Precision 0.5589371980676329
Recall    0.44383918981126286
F1 Score  0.49478275744098527

*** lightblue/suzume-llama-3-8B-japanese
Precision 0.5029414697377605
Recall    0.38698787785791006
F1 Score  0.4374105710445302
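
As a reading aid for the three figures in each block, the sketch below scores one toy sentence whose system output merges two gold tokens. It is a simplification under stated assumptions: it matches tokens by character offsets over the concatenated surface text, whereas the script uses `get_alignments` from spacy-alignments, so the two may disagree on edge cases. The function names `spans` and `score` and the toy tokens are hypothetical.

```python
def spans(tokens):
    """Map 'form_UPOS' tokens to ((start, end), upos) over the concatenated forms."""
    out, pos = [], 0
    for tok in tokens:
        form, sep, upos = tok.rpartition("_")
        if not sep:                 # malformed token without a tag
            form, upos = tok, ""
        out.append(((pos, pos + len(form)), upos))
        pos += len(form)
    return out

def score(gold, system):
    """Precision, recall and F1, counting a system token as correct when a
    gold token has the same character span and the same UPOS."""
    gold_items = set(spans(gold))
    correct = sum(1 for item in spans(system) if item in gold_items)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = len(system) + len(gold)
    f1 = 2 * correct / total if total else 0.0
    return precision, recall, f1

# Toy data (hypothetical): the system glued "好き" and "だ" into one token
gold = "猫_NOUN|が_ADP|好き_ADJ|だ_AUX".split("|")
system = "猫_NOUN|が_ADP|好きだ_ADJ".split("|")
p, r, f = score(gold, system)
print("Precision", p)
print("Recall   ", r)
print("F1 Score ", f)

# 2*correct/(system+gold) is algebraically the harmonic mean of P and R;
# checked here against the Swallow-MS-7b-v0.1 figures reported above.
P, R = 0.6868431771894093, 0.6468467086082553
harmonic = 2 * P * R / (P + R)
print(harmonic)
```

A single segmentation mismatch costs both precision and recall here, since the merged token matches no gold span and the two gold tokens it covers go unmatched.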

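Unfortunately the F1 scores are low, and I have to say they are insufficient as part-of-speech tagging "accuracy". Is 12-shot simply too few for part-of-speech tagging, or is generative AI unsuited to part-of-speech tagging in the first place? I wonder how I should go about telling which.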
