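TurkuNLP has released a large batch of Finnish ModernBERT models, so, with a slightly refined tokenizer, I built prototype Finnish POS-tagging and dependency-parsing models, modernbert-{tiny,base,large}-finnish-ud-embeds. Let's measure their parsing accuracy on fi_tdt-ud-test.conllu from UD_Finnish-TDT.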
#! /usr/bin/python3
mdl="KoichiYasuoka/modernbert-{}-finnish-ud-embeds"
org="TurkuNLP/finnish-modernbert-{}"
import os,sys,subprocess
from transformers import pipeline,AutoTokenizer
# Fetch the UD_Finnish-TDT treebank and the CoNLL 2018 evaluation script if missing
url="https://github.com/UniversalDependencies/UD_Finnish-TDT"
f=os.path.join(os.path.basename(url),"fi_tdt-ud-test.conllu")
os.system(f"test -f {f} || git clone --depth=1 {url}")
url="https://universaldependencies.org/conll18/conll18_ud_eval.py"
c=os.path.basename(url)
os.system(f"test -f {c} || curl -LO {url}")
# Collect the raw sentences from the "# text =" comment lines of the test set
with open(f,"r",encoding="utf-8") as r:
  s=[t[8:].strip() for t in r if t.startswith("# text =")]
rst=[]
for z in ["tiny","base","large"]:
  for tkz in ["original","refined"]:
    nlp=pipeline("universal-dependencies",mdl.format(z),trust_remote_code=True)
    if tkz=="original":
      # Swap in the unmodified TurkuNLP tokenizer and disable multiword handling
      nlp.tokenizer=AutoTokenizer.from_pretrained(org.format(z))
      nlp.multiword={}
    # Parse every sentence and write the CoNLL-U output to one file
    with open("result.conllu","w",encoding="utf-8") as w:
      for t in s:
        w.write(nlp(t))
    # Score the output against the gold test set
    p=subprocess.run([sys.executable,c,"-v",f,"result.conllu"],
      encoding="utf-8",stdout=subprocess.PIPE,stderr=subprocess.STDOUT)
    os.system("mkdir -p "+os.path.join("result",mdl.format(z)))
    rst.append(os.path.join("result",mdl.format(z),tkz+".txt"))
    with open(rst[-1],"w",encoding="utf-8") as w:
      print(f"\n*** {mdl.format(z)} ({tkz} tokenizer)",p.stdout,sep="\n",file=w)
os.system(f'cat {" ".join(rst)}')
On my (Koichi Yasuoka's) machine, the script produced the following output.
*** KoichiYasuoka/modernbert-tiny-finnish-ud-embeds (original tokenizer)
Metric | Precision | Recall | F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens | 99.50 | 99.13 | 99.32 |
Sentences | 100.00 | 100.00 | 100.00 |
Words | 99.38 | 98.88 | 99.13 |
UPOS | 94.76 | 94.28 | 94.52 | 95.35
XPOS | 0.00 | 0.00 | 0.00 | 0.00
UFeats | 91.35 | 90.89 | 91.12 | 91.92
AllTags | 0.00 | 0.00 | 0.00 | 0.00
Lemmas | 0.00 | 0.00 | 0.00 | 0.00
UAS | 84.47 | 84.05 | 84.26 | 85.00
LAS | 80.43 | 80.02 | 80.23 | 80.93
CLAS | 79.22 | 79.13 | 79.17 | 79.36
MLAS | 71.95 | 71.86 | 71.91 | 72.08
BLEX | 0.00 | 0.00 | 0.00 | 0.00
*** KoichiYasuoka/modernbert-tiny-finnish-ud-embeds (refined tokenizer)
Metric | Precision | Recall | F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens | 99.59 | 99.61 | 99.60 |
Sentences | 100.00 | 100.00 | 100.00 |
Words | 99.66 | 99.55 | 99.60 |
UPOS | 95.11 | 95.01 | 95.06 | 95.44
XPOS | 0.00 | 0.00 | 0.00 | 0.00
UFeats | 91.64 | 91.54 | 91.59 | 91.95
AllTags | 0.00 | 0.00 | 0.00 | 0.00
Lemmas | 0.00 | 0.00 | 0.00 | 0.00
UAS | 84.62 | 84.53 | 84.57 | 84.91
LAS | 80.57 | 80.48 | 80.52 | 80.84
CLAS | 79.33 | 79.25 | 79.29 | 79.47
MLAS | 72.01 | 71.94 | 71.98 | 72.14
BLEX | 0.00 | 0.00 | 0.00 | 0.00
*** KoichiYasuoka/modernbert-base-finnish-ud-embeds (original tokenizer)
Metric | Precision | Recall | F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens | 99.53 | 99.16 | 99.35 |
Sentences | 100.00 | 100.00 | 100.00 |
Words | 99.40 | 98.91 | 99.16 |
UPOS | 96.29 | 95.81 | 96.05 | 96.87
XPOS | 0.00 | 0.00 | 0.00 | 0.00
UFeats | 93.77 | 93.30 | 93.53 | 94.33
AllTags | 0.00 | 0.00 | 0.00 | 0.00
Lemmas | 0.00 | 0.00 | 0.00 | 0.00
UAS | 91.44 | 90.98 | 91.21 | 91.99
LAS | 88.58 | 88.14 | 88.36 | 89.11
CLAS | 87.65 | 87.50 | 87.57 | 87.72
MLAS | 82.00 | 81.86 | 81.93 | 82.07
BLEX | 0.00 | 0.00 | 0.00 | 0.00
*** KoichiYasuoka/modernbert-base-finnish-ud-embeds (refined tokenizer)
Metric | Precision | Recall | F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens | 99.65 | 99.69 | 99.67 |
Sentences | 100.00 | 100.00 | 100.00 |
Words | 99.72 | 99.63 | 99.68 |
UPOS | 96.65 | 96.55 | 96.60 | 96.91
XPOS | 0.00 | 0.00 | 0.00 | 0.00
UFeats | 94.09 | 94.00 | 94.05 | 94.35
AllTags | 0.00 | 0.00 | 0.00 | 0.00
Lemmas | 0.00 | 0.00 | 0.00 | 0.00
UAS | 91.68 | 91.59 | 91.63 | 91.93
LAS | 88.77 | 88.69 | 88.73 | 89.02
CLAS | 87.63 | 87.51 | 87.57 | 87.70
MLAS | 81.98 | 81.87 | 81.92 | 82.05
BLEX | 0.00 | 0.00 | 0.00 | 0.00
*** KoichiYasuoka/modernbert-large-finnish-ud-embeds (original tokenizer)
Metric | Precision | Recall | F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens | 99.50 | 99.07 | 99.29 |
Sentences | 100.00 | 100.00 | 100.00 |
Words | 99.38 | 98.82 | 99.10 |
UPOS | 96.42 | 95.88 | 96.14 | 97.02
XPOS | 0.00 | 0.00 | 0.00 | 0.00
UFeats | 94.32 | 93.79 | 94.06 | 94.91
AllTags | 0.00 | 0.00 | 0.00 | 0.00
Lemmas | 0.00 | 0.00 | 0.00 | 0.00
UAS | 92.01 | 91.50 | 91.75 | 92.58
LAS | 89.79 | 89.28 | 89.53 | 90.35
CLAS | 88.90 | 88.59 | 88.74 | 88.86
MLAS | 83.76 | 83.47 | 83.61 | 83.72
BLEX | 0.00 | 0.00 | 0.00 | 0.00
*** KoichiYasuoka/modernbert-large-finnish-ud-embeds (refined tokenizer)
Metric | Precision | Recall | F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens | 99.67 | 99.69 | 99.68 |
Sentences | 100.00 | 100.00 | 100.00 |
Words | 99.74 | 99.63 | 99.69 |
UPOS | 96.84 | 96.74 | 96.79 | 97.09
XPOS | 0.00 | 0.00 | 0.00 | 0.00
UFeats | 94.74 | 94.64 | 94.69 | 94.98
AllTags | 0.00 | 0.00 | 0.00 | 0.00
Lemmas | 0.00 | 0.00 | 0.00 | 0.00
UAS | 92.36 | 92.26 | 92.31 | 92.60
LAS | 90.09 | 90.00 | 90.05 | 90.33
CLAS | 88.97 | 88.72 | 88.84 | 88.90
MLAS | 83.85 | 83.61 | 83.73 | 83.79
BLEX | 0.00 | 0.00 | 0.00 | 0.00
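
As a reading aid for the reports above: the F1 Score column is the harmonic mean of the Precision and Recall columns. The sketch below recomputes it for a few rows copied from the tiny / original-tokenizer run. The helper name `f1` is mine and is not part of conll18_ud_eval.py, and because the printed precision and recall are already rounded to two decimals, the recomputed value can differ from the reported one in the last digit.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) copied from the tiny / original-tokenizer report.
rows = {
    "Tokens": (99.50, 99.13, 99.32),
    "UPOS": (94.76, 94.28, 94.52),
    "LAS": (80.43, 80.02, 80.23),
}

for name, (p, r, reported) in rows.items():
    # Inputs are rounded, so agreement is only expected to within about 0.01.
    print(f"{name}: recomputed {f1(p, r):.2f}, reported {reported:.2f}")
```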
Let's put the UPOS/LAS/MLAS F1 scores into a table.
| | Original tokenizer | Refined tokenizer |
|---|---|---|
| modernbert-tiny-finnish-ud-embeds | 94.52/80.23/71.91 | 95.06/80.52/71.98 |
| modernbert-base-finnish-ud-embeds | 96.05/88.36/81.93 | 96.60/88.73/81.92 |
| modernbert-large-finnish-ud-embeds | 96.14/89.53/83.61 | 96.79/90.05/83.73 |
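
The UPOS/LAS/MLAS cells in the table above can be extracted mechanically from the saved reports rather than copied by hand. The following is a minimal sketch, assuming the pipe-separated layout printed by `conll18_ud_eval.py -v`; the function names `f1_scores` and `summary_cell` are hypothetical helpers of mine, and `SAMPLE` is an excerpt of the tiny / refined-tokenizer report.

```python
# Excerpt of one evaluation report (values copied from the tiny / refined run).
SAMPLE = """\
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
UPOS       |     95.11 |     95.01 |     95.06 |     95.44
LAS        |     80.57 |     80.48 |     80.52 |     80.84
MLAS       |     72.01 |     71.94 |     71.98 |     72.14
"""

def f1_scores(report, metrics=("UPOS", "LAS", "MLAS")):
    """Return {metric: F1} for the requested rows of an evaluation report."""
    scores = {}
    for line in report.splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        # Data rows have at least four cells; the F1 score is the fourth one.
        if len(cells) >= 4 and cells[0] in metrics:
            scores[cells[0]] = float(cells[3])
    return scores

def summary_cell(report):
    """Format the F1 scores as a UPOS/LAS/MLAS table cell."""
    scores = f1_scores(report)
    return "/".join(f"{scores[m]:.2f}" for m in ("UPOS", "LAS", "MLAS"))

print(summary_cell(SAMPLE))  # prints 95.06/80.52/71.98
```

To apply this to the files written by the script, one would read each `result/.../{original,refined}.txt` file and pass its contents to `summary_cell`.
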
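The number of model parameters (49M, 136M, and 382M) clearly contributes to parsing accuracy. Or rather, tiny (49M) may simply be too small for a Finnish ModernBERT. I wonder whether this has something to do with the Finnish ModernBERT models being designed for an input/output width of 16000 tokens.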
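
To make the size effect concrete, the sketch below computes the F1 gain at each step up in model size, using the refined-tokenizer column of the table and the parameter counts quoted above. The function name `step_gains` is a hypothetical helper of mine, and this is only arithmetic on the reported scores, not an additional experiment.

```python
# (model, parameters in millions, (UPOS, LAS, MLAS) F1 with the refined tokenizer)
results = [
    ("tiny", 49, (95.06, 80.52, 71.98)),
    ("base", 136, (96.60, 88.73, 81.92)),
    ("large", 382, (96.79, 90.05, 83.73)),
]

def step_gains(rows):
    """Return the per-metric F1 difference between each model and the next larger one."""
    gains = []
    for (small, _, f_small), (big, _, f_big) in zip(rows, rows[1:]):
        diff = tuple(round(b - a, 2) for a, b in zip(f_small, f_big))
        gains.append((f"{small}->{big}", diff))
    return gains

for label, (upos, las, mlas) in step_gains(results):
    print(f"{label}: UPOS {upos:+.2f}, LAS {las:+.2f}, MLAS {mlas:+.2f}")
```

On these figures the step from tiny to base is worth about 8 points of LAS and almost 10 points of MLAS, while the step from base to large adds less than 2 points on either, which fits the impression that tiny is the outlier.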