Measuring the "accuracy" of Thai tokenizers on the test set of UD_Thai-PUD

Posted at 2024-04-11

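With one eye on the Evaluation page of the CoNLL 2018 Shared Task, I wrote a program that measures the "accuracy" of the tokenizers used by Thai language models. On Google Colaboratory, it looks like this.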

!pip install transformers sentencepiece spacy-alignments
# Tokenizers to evaluate (Hugging Face model IDs)
models=["KoichiYasuoka/roberta-base-thai-syllable","scb10x/typhoon-7b","openthaigpt/openthaigpt-1.0.0-7b-chat"]
ud="UD_Thai-PUD"
# Fetch the treebank once and copy its test set to test.conllu
!test -d $ud || git clone --depth=1 https://github.com/universaldependencies/$ud
!cp $ud/*-test.conllu test.conllu
from transformers import AutoTokenizer
from spacy_alignments import get_alignments
for mdl in models:
  tkz=AutoTokenizer.from_pretrained(mdl)
  gold=system=correct=0
  with open("test.conllu","r",encoding="utf-8") as r:
    for k in r:
      if k.startswith("# text ="):
        # New sentence: keep the raw text and start collecting its word forms
        txt=k[8:].strip()
        frm=[]
      elif k.strip()=="":
        # End of sentence: gold word spans as (start,end) character offsets
        g=[(t[0],t[-1]+1) for t in get_alignments(list(txt),frm)[1]]
        # System token spans from the tokenizer, dropping empty spans (e.g. special tokens)
        s=[t for t in tkz(txt,return_offsets_mapping=True)["offset_mapping"] if t[0]<t[1]]
        gold+=len(g)
        system+=len(s)
        # Walk both span lists in order; a token is correct when its start and end both match a gold word
        i=j=0
        while i<len(g) and j<len(s):
          if s[j][0]<g[i][0]:
            j+=1
          elif g[i][0]<s[j][0]:
            i+=1
          else:
            correct+=g[i][1]==s[j][1]
            i+=1
            j+=1
      else:
        # Word line: 10 tab-separated fields with a plain integer ID
        t=k.split("\t")
        if len(t)==10 and t[0].isdecimal():
          frm.append(t[1])
  print("\n***",mdl)
  print("Precision",correct/system if system else 0.0)
  print("Recall   ",correct/gold)
  print("F1 Score ",2*correct/(system+gold))
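
To make the span comparison easier to follow, here is a minimal sketch on a toy sentence that needs no model download. It is not the program above: `gold_spans` and `count_correct` are hypothetical helper names, the gold spans are found by a plain left-to-right string search instead of `get_alignments` (which assumes every word form occurs verbatim in the text), and the system spans are made up rather than produced by a real tokenizer.

```python
def gold_spans(text, forms):
    """Locate each gold word form in text, left to right, as (start, end) offsets."""
    spans, pos = [], 0
    for form in forms:
        # Assumption: the form occurs verbatim at or after pos; otherwise ValueError is raised
        start = text.index(form, pos)
        pos = start + len(form)
        spans.append((start, pos))
    return spans

def count_correct(gold, system):
    """Count system spans whose start and end both coincide with a gold span."""
    correct = i = j = 0
    while i < len(gold) and j < len(system):
        if system[j][0] < gold[i][0]:
            j += 1          # system token starts before the current gold word
        elif gold[i][0] < system[j][0]:
            i += 1          # gold word starts before the current system token
        else:
            # Same start: correct only if the end offsets also agree
            correct += gold[i][1] == system[j][1]
            i += 1
            j += 1
    return correct

text = "ฉันรักแมว"                                  # toy sentence, three words of three characters each
gold = gold_spans(text, ["ฉัน", "รัก", "แมว"])
system = [(0, 3), (3, 5), (5, 6), (6, 9)]           # hypothetical tokenizer output
correct = count_correct(gold, system)
print("gold", gold)
print("correct", correct, "of", len(system), "system tokens and", len(gold), "gold words")
```

In this toy case the tokenizer splits the middle word in two, so neither of its pieces counts as correct, while the first and last tokens do.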

On my machine (I am Koichi Yasuoka), I obtained the following results.

*** KoichiYasuoka/roberta-base-thai-syllable
Precision 0.4651299687083682
Recall    0.6725651823313323
F1 Score  0.5499368120295244

*** scb10x/typhoon-7b
Precision 0.35939561475152354
Recall    0.5125436788818206
F1 Score  0.4225201270404018

*** openthaigpt/openthaigpt-1.0.0-7b-chat
Precision 0.37973561490731106
Recall    0.44140310008063793
F1 Score  0.4082537446394166
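
The F1 score is printed as `2*correct/(system+gold)`, which is the harmonic mean of precision and recall written in terms of the raw counts. The following sketch checks that identity; `prf` is a hypothetical helper, the counts 2, 4 and 3 are made up for illustration, and the last lines recompute F1 from the precision and recall reported above for the first model.

```python
def prf(correct, system, gold):
    """Precision, recall and F1 from span counts, guarding against empty inputs."""
    precision = correct / system if system else 0.0
    recall = correct / gold if gold else 0.0
    f1 = 2 * correct / (system + gold) if system + gold else 0.0
    return precision, recall, f1

# Hypothetical counts: 2 correct tokens, 4 system tokens, 3 gold words
p, r, f = prf(correct=2, system=4, gold=3)
harmonic = 2 * p * r / (p + r)
print(p, r, f, harmonic)

# Precision and recall reported above for KoichiYasuoka/roberta-base-thai-syllable
rp, rr = 0.4651299687083682, 0.6725651823313323
reported_f1 = 0.5499368120295244
recomputed_f1 = 2 * rp * rr / (rp + rr)
print(recomputed_f1, reported_f1)
```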

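Of course, the word segmentation in UD_Thai-PUD is not perfect, and, as I wrote in my diary entry of June 16, 2020, it shows problems here and there. We probably ought to build a good test set for this, but reaching agreement on one looks like hard work.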
