Measuring the tokenizer "accuracy" of goldfish-models/lzh_hant_{5mb,10mb,full} on the UD_Classical_Chinese-Kyoto test set

Posted at 2024-08-27

With one eye on Tyler A. Chang, Catherine Arnett, Zhuowen Tu, Benjamin K. Bergen, "Goldfish: Monolingual Language Models for 350 Languages", I measured the tokenizer "accuracy" of the Classical Chinese (漢文) GPT2 models goldfish-models/lzh_hant_{5mb,10mb,full} against lzh_kyoto-ud-test.conllu from UD_Classical_Chinese-Kyoto. On Google Colaboratory, it goes like this.

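!pip install transformers sentencepiece spacy-alignments
models=["goldfish-models/lzh_hant_5mb","goldfish-models/lzh_hant_10mb","goldfish-models/lzh_hant_full"]
ud="UD_Classical_Chinese-Kyoto"
# Fetch the treebank once and copy its test split to test.conllu
!test -d $ud || git clone --depth=1 https://github.com/universaldependencies/$ud
!cp $ud/*-test.conllu test.conllu
from transformers import AutoTokenizer
from spacy_alignments import get_alignments
for mdl in models:
  tkz=AutoTokenizer.from_pretrained(mdl)
  gold=system=correct=0
  with open("test.conllu","r",encoding="utf-8") as r:
    for k in r:
      if k.startswith("# text ="):
        # Start of a sentence: keep the raw text and reset the list of word forms
        txt=k[8:].strip()
        frm=[]
      elif k.strip()=="":
        # End of a sentence: gold words as (start,end) character offsets into txt
        g=[(t[0],t[-1]+1) for t in get_alignments(list(txt),frm)[1]]
        # Tokenizer output as (start,end) offsets, skipping empty spans
        s=[t for t in tkz(txt,return_offsets_mapping=True)["offset_mapping"] if t[0]<t[1]]
        gold+=len(g)
        system+=len(s)
        # Walk both sorted span lists; a token counts as correct only when
        # its start and its end both coincide with a gold word
        i=j=0
        while i<len(g) and j<len(s):
          if s[j][0]<g[i][0]:
            j+=1
          elif g[i][0]<s[j][0]:
            i+=1
          else:
            correct+=g[i][1]==s[j][1]
            i+=1
            j+=1
      else:
        # Word line: 10 tab-separated columns with a plain integer ID; collect FORM
        t=k.split("\t")
        if len(t)==10 and t[0].isdecimal():
          frm.append(t[1])
  print("\n***",mdl)
  # Guard every ratio against an empty denominator
  print("Precision",correct/system if system else 0.0)
  print("Recall   ",correct/gold if gold else 0.0)
  print("F1 Score ",2*correct/(system+gold) if system+gold else 0.0)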
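
To make the counting in the loop above concrete, here is a minimal sketch on a made-up example that needs no model download. The helper names `spans_from_forms` and `span_prf` and the toy segmentation are assumptions for illustration, not part of the script above. The sketch also assumes that the sentence text is exactly the concatenation of the word forms (so spacy-alignments is not needed) and that the spans in each list do not overlap, in which case intersecting the two span sets should give the same count as the two-pointer walk.

```python
def spans_from_forms(forms):
    """Turn word forms into (start, end) character spans, assuming the
    sentence text is the forms joined with no separators."""
    spans, pos = [], 0
    for f in forms:
        spans.append((pos, pos + len(f)))
        pos += len(f)
    return spans

def span_prf(gold, system):
    """Precision, recall and F1 over exactly matching (start, end) spans."""
    correct = len(set(gold) & set(system))
    p = correct / len(system) if system else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * correct / (len(system) + len(gold)) if (gold or system) else 0.0
    return p, r, f

# Hypothetical gold words and hypothetical tokenizer output for the same six characters
gold = spans_from_forms(["子", "曰", "學", "而", "時", "習"])
system = spans_from_forms(["子曰", "學", "而", "時習"])
p, r, f = span_prf(gold, system)
print(p, r, f)
```

In this toy case only the two single-character tokens line up with gold words on both edges, so two of the four system tokens and two of the six gold words are counted as correct.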

On my (Koichi Yasuoka's) machine, the following results were printed.

*** goldfish-models/lzh_hant_5mb
Precision 0.6529041552926903
Recall    0.624718856562432
F1 Score  0.6385006117681955

*** goldfish-models/lzh_hant_10mb
Precision 0.6401664058623716
Recall    0.608466952042371
F1 Score  0.6239142968735469

*** goldfish-models/lzh_hant_full
Precision 0.6355484019314785
Recall    0.6016106798229703
F1 Score  0.6181140514349609
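
As a quick sanity check on these figures: since Precision is correct/system and Recall is correct/gold, the printed F1 Score 2*correct/(system+gold) should equal the harmonic mean 2PR/(P+R), and Recall/Precision should equal system/gold. The sketch below re-derives both from the reported numbers only; the dictionary name `reported` and the helper `harmonic_f1` are illustrative assumptions.

```python
# (Precision, Recall, F1 Score) exactly as printed above
reported = {
    "lzh_hant_5mb":  (0.6529041552926903, 0.624718856562432,  0.6385006117681955),
    "lzh_hant_10mb": (0.6401664058623716, 0.608466952042371,  0.6239142968735469),
    "lzh_hant_full": (0.6355484019314785, 0.6016106798229703, 0.6181140514349609),
}

def harmonic_f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

ratios = {}
for name, (p, r, f1) in reported.items():
    ratios[name] = r / p  # equals system/gold, the token-count ratio
    print(name, harmonic_f1(p, r), f1, ratios[name])
```

For all three models the ratio comes out below 1, that is, the tokenizer emits fewer tokens than the treebank has words, which suggests that it tends to merge characters that the gold segmentation keeps apart.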

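Frankly, these scores are quite low. If I build a part-of-speech tagging model on these tokenizers as they are, without modifying them, the analysis accuracy will naturally be held back. So, what should I do about it?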
