Measuring the tokenizer "accuracy" of Chinese ModernBERT models on the test sets of Traditional Chinese UD_Chinese-GSD and Simplified Chinese UD_Chinese-GSDSimp

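Chinese ModernBERT models have been released one after another: "ming030890/modernbert-base-chinese", "TurboPascal/ChineseModernBert", and "feynmanzhao/chinese-modernbert-large-wwm". Each uses a different tokenizer, though, and it is not at all clear which one is best. So I measured the "accuracy" of each tokenizer, i.e. how well its tokens match the gold word segmentation, on zh_gsd-ud-test.conllu from the Traditional Chinese UD_Chinese-GSD and on zh_gsdsimp-ud-test.conllu from the Simplified Chinese UD_Chinese-GSDSimp. On Google Colaboratory, it looks like this.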

!pip install transformers spacy-alignments
import os,glob
from transformers import AutoTokenizer
from spacy_alignments import get_alignments
models=["feynmanzhao/chinese-modernbert-large-wwm","TurboPascal/ChineseModernBert","ming030890/modernbert-base-chinese"]
# Fetch both treebanks unless they have already been cloned
for ud in ["UD_Chinese-GSD","UD_Chinese-GSDSimp"]:
  os.system(f"test -d {ud} || git clone --depth=1 https://github.com/UniversalDependencies/{ud}")
for mdl in models:
  # The feynmanzhao model ships its own tokenizer code, hence trust_remote_code
  tkz=AutoTokenizer.from_pretrained(mdl,trust_remote_code=mdl.startswith("feynmanzhao"))
  for ud in glob.glob("UD_Chinese-*/*-test.conllu"):
    gold=system=correct=0
    with open(ud,"r",encoding="utf-8") as r:
      for k in r:
        if k.startswith("# text ="):
          # Start of a sentence: keep the raw text, reset the list of gold word forms
          txt=k[8:].strip()
          frm=[]
        elif k.strip()=="":
          # End of a sentence: gold word spans as (start,end) character offsets in txt
          g=[(t[0],t[-1]+1) for t in get_alignments(list(txt),frm)[1]]
          # Tokenizer spans, dropping empty ones such as special tokens
          s=[t for t in tkz(txt,return_offsets_mapping=True)["offset_mapping"] if t[0]<t[1]]
          gold+=len(g)
          system+=len(s)
          # Walk both sorted span lists; a span is correct when start and end both match
          i=j=0
          while i<len(g) and j<len(s):
            if s[j][0]<g[i][0]:
              j+=1
            elif g[i][0]<s[j][0]:
              i+=1
            else:
              correct+=g[i][1]==s[j][1]
              i+=1
              j+=1
        else:
          # Token line: take FORM of plain word lines, skipping ranges (1-2) and empty nodes (1.1)
          t=k.split("\t")
          if len(t)==10 and t[0].isdecimal():
            frm.append(t[1])
    print("\n***",mdl,os.path.basename(ud))
    print("Precision",correct/system if system else 0.0)
    print("Recall   ",correct/gold)
    print("F1 Score ",2*correct/(system+gold))
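
To make the scoring concrete, the span-matching step of the script above can be pulled out into a small standalone function. This is only a sketch: the names `count_matching_spans` and `prf` are made up for illustration, and the 6-character sentence with its spans is a hypothetical example, not data from the treebanks.

```python
def count_matching_spans(gold, system):
    """Count system spans whose start and end both equal a gold span.

    Both arguments are lists of (start, end) character offsets,
    sorted by start and non-overlapping within each list.
    """
    i = j = correct = 0
    while i < len(gold) and j < len(system):
        if system[j][0] < gold[i][0]:
            j += 1                      # system span starts earlier: skip it
        elif gold[i][0] < system[j][0]:
            i += 1                      # gold span starts earlier: skip it
        else:
            # Same start: the span counts only if the end matches too
            correct += gold[i][1] == system[j][1]
            i += 1
            j += 1
    return correct


def prf(gold, system):
    """Return (precision, recall, F1) over exact span matches."""
    correct = count_matching_spans(gold, system)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = len(gold) + len(system)
    f1 = 2 * correct / total if total else 0.0
    return precision, recall, f1


# Hypothetical 6-character sentence with three gold words
gold = [(0, 2), (2, 3), (3, 6)]
# Hypothetical tokenizer output that splits the first word in two
system = [(0, 1), (1, 2), (2, 3), (3, 6)]
precision, recall, f1 = prf(gold, system)
print(precision, recall, f1)
```

Here two of the four system spans coincide with gold words, so Precision is 2/4, Recall is 2/3, and F1 is 2*2/(4+3). A tokenizer that cuts words into smaller pieces is penalized on both Precision and Recall, because neither piece matches the gold span.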

On my (Koichi Yasuoka's) machine, the following results were output.

*** feynmanzhao/chinese-modernbert-large-wwm zh_gsd-ud-test.conllu
Precision 0.3976962715974538
Recall    0.5461205461205462
F1 Score  0.46023783632090365

*** feynmanzhao/chinese-modernbert-large-wwm zh_gsdsimp-ud-test.conllu
Precision 0.5329080893012468
Recall    0.6120546120546121
F1 Score  0.5697458152510849

*** TurboPascal/ChineseModernBert zh_gsd-ud-test.conllu
Precision 0.4729781525488693
Recall    0.6163836163836164
F1 Score  0.5352418130557363

*** TurboPascal/ChineseModernBert zh_gsdsimp-ud-test.conllu
Precision 0.5626244010320678
Recall    0.6353646353646354
F1 Score  0.5967861750791726

*** ming030890/modernbert-base-chinese zh_gsd-ud-test.conllu
Precision 0.35356692438449305
Recall    0.5367965367965368
F1 Score  0.42632814307911004

*** ming030890/modernbert-base-chinese zh_gsdsimp-ud-test.conllu
Precision 0.35356692438449305
Recall    0.5367965367965368
F1 Score  0.42632814307911004

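Hmm, every one of them has a Recall below 0.7, which is pretty rough. On top of that, "feynmanzhao/chinese-modernbert-large-wwm" uses its own custom tokenizer implementation, so it does not look like it can even be modified. Well, what should I do about this?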
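
One possible next step is to look at which gold words a tokenizer fails to reproduce. The following is a minimal sketch under stated assumptions: the function name `missed_gold_words`, the labels "split" and "merged", and the example sentence with its segmentations are all hypothetical, and the spans use the same (start, end) character offsets as the script above.

```python
def missed_gold_words(text, gold, system):
    """List gold words whose exact span is absent from the system spans.

    Each missed word is labeled "split" when a system token starts
    strictly inside it, and "merged" otherwise (the word is swallowed
    by a longer token). A word that is both cut and merged is
    reported as "split".
    """
    system_spans = set(system)
    system_starts = {start for start, _ in system}
    missed = []
    for start, end in gold:
        if (start, end) in system_spans:
            continue                    # exact match: nothing to report
        inside = any(start < p < end for p in system_starts)
        missed.append((text[start:end], "split" if inside else "merged"))
    return missed


# Hypothetical sentence and segmentations, for illustration only
text = "我喜歡自然語言"
gold = [(0, 1), (1, 3), (3, 5), (5, 7)]      # 我 / 喜歡 / 自然 / 語言
system = [(0, 1), (1, 2), (2, 3), (3, 7)]    # 我 / 喜 / 歡 / 自然語言
result = missed_gold_words(text, gold, system)
print(result)
```

Counting the two labels separately over a whole test set would indicate whether a tokenizer mostly loses Recall by cutting words into smaller pieces or by gluing several words into one token.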
