With an eye on Tyler A. Chang, Catherine Arnett, Zhuowen Tu, and Benjamin K. Bergen's "Goldfish: Monolingual Language Models for 350 Languages", I tried measuring the tokenizer "accuracy" of the Classical Chinese (Kanbun) GPT-2 models goldfish-models/lzh_hant_{5mb,10mb,full} against lzh_kyoto-ud-test.conllu from UD_Classical_Chinese-Kyoto. On Google Colaboratory, it goes something like this.
!pip install transformers sentencepiece spacy-alignments
models=["goldfish-models/lzh_hant_5mb","goldfish-models/lzh_hant_10mb","goldfish-models/lzh_hant_full"]
ud="UD_Classical_Chinese-Kyoto"
!test -d $ud || git clone --depth=1 https://github.com/universaldependencies/$ud
!cp $ud/*-test.conllu test.conllu
from transformers import AutoTokenizer
from spacy_alignments import get_alignments
for mdl in models:
  tkz=AutoTokenizer.from_pretrained(mdl)
  gold=system=correct=0
  with open("test.conllu","r",encoding="utf-8") as r:
    for k in r:
      if k.startswith("# text ="):
        # sentence text; its gold words are collected into frm below
        txt=k[8:].strip()
        frm=[]
      elif k.strip()=="":
        # end of sentence: gold word spans vs. spans from the tokenizer's offset_mapping
        g=[(t[0],t[-1]+1) for t in get_alignments(list(txt),frm)[1]]
        s=[t for t in tkz(txt,return_offsets_mapping=True)["offset_mapping"] if t[0]<t[1]]
        gold+=len(g)
        system+=len(s)
        i=j=0
        while i<len(g) and j<len(s):
          if s[j][0]<g[i][0]:
            j+=1
          elif g[i][0]<s[j][0]:
            i+=1
          else:
            # same start position: count as correct if the end positions also agree
            correct+=g[i][1]==s[j][1]
            i+=1
            j+=1
      else:
        t=k.split("\t")
        if len(t)==10 and t[0].isdecimal():
          frm.append(t[1])
  print("\n***",mdl)
  print("Precision",correct/system if system else 0.0)
  print("Recall ",correct/gold)
  print("F1 Score ",2*correct/(system+gold))
On my (Koichi Yasuoka's) machine, this produced the following results.
*** goldfish-models/lzh_hant_5mb
Precision 0.6529041552926903
Recall 0.624718856562432
F1 Score 0.6385006117681955
*** goldfish-models/lzh_hant_10mb
Precision 0.6401664058623716
Recall 0.608466952042371
F1 Score 0.6239142968735469
*** goldfish-models/lzh_hant_full
Precision 0.6355484019314785
Recall 0.6016106798229703
F1 Score 0.6181140514349609
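To get a feel for where the mismatches come from, one quick check (my own addition, not part of the measurement above) is the length distribution of the subwords the tokenizer actually produces on the test sentences: the gold words in UD_Classical_Chinese-Kyoto are mostly single characters, so longer subwords are almost bound to straddle word boundaries. A minimal sketch, reusing the test.conllu prepared above and the full-size model:

from collections import Counter
from transformers import AutoTokenizer
tkz=AutoTokenizer.from_pretrained("goldfish-models/lzh_hant_full")
c=Counter()
with open("test.conllu","r",encoding="utf-8") as r:
  for k in r:
    if k.startswith("# text ="):
      txt=k[8:].strip()
      for b,e in tkz(txt,return_offsets_mapping=True)["offset_mapping"]:
        if b<e:
          c[e-b]+=1   # subword length in characters
print(c.most_common())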
Honestly, these numbers are quite low. If I build a POS-tagging model on top of this tokenizer as it is, the tagging accuracy obviously won't get any better, so now, what am I to do?
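For reference, here is a rough sketch of mine (not something measured above) that scores a trivial one-character-per-token segmentation with the same logic; its numbers give an idea of how a purely character-level tokenizer would compare on this metric, before touching the model itself.

from spacy_alignments import get_alignments
gold=system=correct=0
with open("test.conllu","r",encoding="utf-8") as r:
  for k in r:
    if k.startswith("# text ="):
      txt=k[8:].strip()
      frm=[]
    elif k.strip()=="":
      g=[(t[0],t[-1]+1) for t in get_alignments(list(txt),frm)[1]]
      s=[(x,x+1) for x in range(len(txt))]   # every character is one token
      gold+=len(g)
      system+=len(s)
      i=j=0
      while i<len(g) and j<len(s):
        if s[j][0]<g[i][0]:
          j+=1
        elif g[i][0]<s[j][0]:
          i+=1
        else:
          correct+=g[i][1]==s[j][1]
          i+=1
          j+=1
    else:
      t=k.split("\t")
      if len(t)==10 and t[0].isdecimal():
        frm.append(t[1])
print("Precision",correct/system if system else 0.0)
print("Recall ",correct/gold)
print("F1 Score ",2*correct/(system+gold))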