
Japanese POS Tagging via Few-Shot Prompting with TinySwallow-1.5B

Applying the method from my article of June 11, 2024 to TinySwallow-1.5B, I tried UPOS (Universal Part-Of-Speech) tagging on NINJAL Short Unit Words. On Google Colaboratory, it looks like this:

!pip install transformers
model="SakanaAI/TinySwallow-1.5B"
# Few-shot examples: each sentence is a list of (token,UPOS) pairs,
# serialized into paired "###text:"/"###UPOS:" lines for the prompt
class TextUPOSList(list):
  __str__=lambda self:"\n".join("###text:"+"".join(t for t,u in s)+"\n###UPOS:"+"|".join(t+"_"+u for t,u in s) for s in self)+"\n"
ex=TextUPOSList()
ex.append([("一","NUM"),("直線","NOUN"),("に","ADP"),("伸びる","VERB"),("電撃","NOUN"),("を","ADP"),("放ち","VERB"),("、","PUNCT"),("電撃","NOUN"),("ダメージ","NOUN"),("を","ADP"),("与える","VERB"),("。","PUNCT")])
ex.append([("色々","ADV"),("と","ADP"),("面白い","ADJ"),("メニュー","NOUN"),("の","ADP"),("ある","VERB"),("店","NOUN"),("。","PUNCT")])
ex.append([("しかも","CCONJ"),("、","PUNCT"),("ここ","PRON"),("は","ADP"),("コース","NOUN"),("が","ADP"),("リーズナブル","ADJ"),("な","AUX"),("の","SCONJ"),("です","AUX"),("。","PUNCT")])
ex.append([("彼","PRON"),("は","ADP"),("コンピュータ","NOUN"),("を","ADP"),("個人","NOUN"),("の","ADP"),("持ち物","NOUN"),("に","ADP"),("し","VERB"),("まし","AUX"),("た","AUX"),("。","PUNCT")])
ex.append([("2007","NUM"),("年","NOUN"),("9","NUM"),("月","NOUN"),("現在","ADV"),("、","PUNCT"),("以下","NOUN"),("の","ADP"),("メーカー","NOUN"),("から","ADP"),("対応","NOUN"),("製品","NOUN"),("が","ADP"),("発売","VERB"),("さ","AUX"),("れ","AUX"),("て","SCONJ"),("いる","VERB"),("。","PUNCT")])
from transformers import pipeline
tgn=pipeline("text-generation",model,max_new_tokens=128)
# Append the target sentence to the few-shot prompt, then keep only the
# two generated lines (text and UPOS) that follow the examples
nlp=lambda t:"\n".join(tgn(str(ex)+f"###text:{t}\n###UPOS:")[0]["generated_text"].split("\n")[len(ex)*2:len(ex)*2+2])
print(nlp("国境の長いトンネルを抜けると雪国であった。"))
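To make the prompt structure concrete, here is a minimal, dependency-free sketch of the `###text:`/`###UPOS:` format that `TextUPOSList.__str__` builds (it uses only one made-up shot instead of the five above):

```python
# Minimal reproduction of the few-shot prompt format built by TextUPOSList:
# each example contributes a "###text:" line with the raw sentence and a
# "###UPOS:" line with token_TAG pairs joined by "|".
examples = [
    [("色々", "ADV"), ("と", "ADP"), ("面白い", "ADJ"), ("。", "PUNCT")],
]
prompt = "\n".join(
    "###text:" + "".join(tok for tok, _ in sent)
    + "\n###UPOS:" + "|".join(f"{tok}_{tag}" for tok, tag in sent)
    for sent in examples
) + "\n"
# The target sentence is appended with an empty UPOS line for the model to complete.
prompt += "###text:国境の長いトンネルを抜けると雪国であった。\n###UPOS:"
print(prompt)
```

Because the prompt ends in an empty `###UPOS:` line, the model's continuation is expected to be the tag sequence for the target sentence, which is why `nlp` slices out exactly the two lines after the examples.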

When I had it tag the opening sentence of Snow Country, the following result was output on my (Koichi Yasuoka's) machine:

###text:国境の長いトンネルを抜けると雪国であった。
###UPOS:国境_NOUN|の_ADP|長い_NOUN|トンネル_NOUN|を_ADP|抜ける_VERB|とVERB|雪NOUN|国_NOUN|であつAUX|った。_PUNCT

Judging from this result, TinySwallow-1.5B failed to tag 「長い_ADJ」 and 「と_ADP|雪国_NOUN|で_AUX|あっ_AUX|た_AUX|。_PUNCT」 correctly. So let's take a 12-Shot Prompting benchmark with the method from my article of June 17, 2024. On Google Colaboratory, it looks like this:

!pip install transformers accelerate spacy-alignments
model="SakanaAI/TinySwallow-1.5B"
ud="https://github.com/UniversalDependencies/UD_Japanese-GSD"
import os
d=os.path.basename(ud)
# Clone UD_Japanese-GSD (unless already present) and copy out its CoNLL-U files
os.system(f"test -d {d} || git clone --depth=1 {ud}")
os.system("for F in train dev test; do cp "+d+"/*-$F.conllu $F.conllu; done")
with open("train.conllu","r",encoding="utf-8") as r:
  trn=r.read().strip().split("\n\n")
with open("test.conllu","r",encoding="utf-8") as r:
  tst=r.read().strip().split("\n\n")
# Extract (raw text,"token_UPOS|..." sequence) from one CoNLL-U sentence
def ext(x):
  u=v=""
  for s in x.split("\n"):
    if s.startswith("# text ="):
      u=s[8:].strip()
    else:
      t=s.split("\t")
      if t[0].isdigit():
        v+="|"+t[1]+"_"+t[3]
  return (u,v[1:])
from transformers import pipeline
nlp=pipeline("text-generation",model,max_new_tokens=256,device_map="auto")
from spacy_alignments import get_alignments
# Each test sentence gets its own block of j (roughly 12) training sentences as shots
i,j=0,int(len(trn)/len(tst))
gold=system=correct=0
for t in tst:
  w="\n".join("###text:"+"\n###UPOS:".join(ext(x)) for x in trn[i*j:i*j+j])
  u,v=ext(t)
  w+="\n###text:"+u+"\n###UPOS:"
  g=v.split("|")
  s=nlp(w)[0]["generated_text"].split("\n")[j*2+1][8:].split("|")
  gold+=len(g)
  system+=len(s)
  # Align system tokens to gold tokens and count one-to-one exact matches
  correct+=sum(1 for t,k in zip(s,get_alignments(g,s)[1]) if len(k)==1 and t==g[k[0]])
  i+=1
print("\n***",model)
print("Precision",correct/system if system else 0.0)
print("Recall   ",correct/gold)
print("F1 Score ",2*correct/(system+gold))

On my machine, the following results were output:

*** SakanaAI/TinySwallow-1.5B
Precision 0.5506707248337279
Recall    0.3747890133497008
F1 Score  0.4460168911207487

With an F1 of 44.60, the "accuracy" of the POS tagging has to be called far too low. The 1.5 billion parameters and the 32768-token context window are attractive, but now, what should I do?
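For reference, the token-level scoring in the benchmark above can be sketched without spacy-alignments; here a hand-made one-to-one alignment stands in for `get_alignments`, and the gold/system tokens are a made-up miniature of the Snow Country output:

```python
# Toy illustration of the precision/recall/F1 computation in the benchmark.
# gold and system are "token_TAG" strings; a system token counts as correct
# only when it aligns one-to-one with a gold token and matches it exactly.
gold = ["国境_NOUN", "の_ADP", "長い_ADJ", "トンネル_NOUN"]
system = ["国境_NOUN", "の_ADP", "長い_NOUN", "トンネル_NOUN"]
# Hand-made alignment (system index i -> list of gold indices), standing in
# for spacy_alignments.get_alignments(gold, system)[1].
align = [[0], [1], [2], [3]]
correct = sum(1 for t, k in zip(system, align) if len(k) == 1 and t == gold[k[0]])
precision = correct / len(system)
recall = correct / len(gold)
f1 = 2 * correct / (len(system) + len(gold))
print(precision, recall, f1)  # → 0.75 0.75 0.75
```

The character-level alignment is what lets the benchmark score a system output whose tokenization disagrees with the gold segmentation, as in the 「雪_NOUN|国_NOUN」 split above: mis-segmented tokens simply fail the one-to-one match and count against both precision and recall.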
