After spending ten days building modernbert-base-thai-cc100, I fine-tuned several Thai dependency-parsing models on it, and it seems I have managed to beat the scores from my article of September 12, 2024. The benchmark program for Google Colaboratory (GPU) looks like this:
!pip install -U transformers triton esupar
models=[
  "KoichiYasuoka/modernbert-base-thai-cc100-ud-square",
  "KoichiYasuoka/modernbert-base-thai-cc100-ud-triangular",
  "KoichiYasuoka/modernbert-base-thai-cc100-ud-embeds",
  "KoichiYasuoka/modernbert-base-thai-cc100-upos"
]
import os,sys,subprocess
url="https://github.com/nlp-chula/TUD"
f=os.path.join(os.path.basename(url),"TUD","test.conllu")
os.system(f"test -f {f} || git clone --depth=1 {url}")
url="https://universaldependencies.org/conll18/conll18_ud_eval.py"
c=os.path.basename(url)
os.system(f"test -f {c} || curl -LO {url}")
with open(f,"r",encoding="utf-8") as r:
  s=[t[8:].strip() for t in r if t.startswith("# text =")]
for mdl in models:
  if mdl.endswith("-upos"):
    import esupar,deplacy
    p=esupar.load(mdl)
    nlp=lambda t:deplacy.to_conllu(p(t))
  else:
    from transformers import pipeline
    nlp=pipeline("universal-dependencies",mdl,trust_remote_code=True,
      aggregation_strategy="simple",device=0)
  with open("result.conllu","w",encoding="utf-8") as w:
    for t in s:
      w.write(nlp(t))
  p=subprocess.run([sys.executable,c,"-v",f,"result.conllu"],
    encoding="utf-8",stdout=subprocess.PIPE,stderr=subprocess.STDOUT)
  os.system(f"mkdir -p result/{mdl}")
  with open(f"result/{mdl}/result.txt","w",encoding="utf-8") as w:
    print(f"\n*** {mdl}",p.stdout,sep="\n",file=w)
!( cd result && cat `find {" ".join(models)} -name result.txt` )
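The sentence list `s` in the script comes from the `# text =` comment lines of the CoNLL-U test file. A minimal sketch of that extraction step on a toy input (the sample sentence is my own, not taken from TUD):

```python
# "# text =" is exactly 8 characters, so t[8:] keeps only the raw
# sentence, and strip() removes the surrounding whitespace and newline.
conllu = """\
# sent_id = 1
# text = วันนี้อากาศดี
1\tวันนี้\t_\tNOUN\t_\t_\t0\troot\t_\t_
"""
s = [t[8:].strip() for t in conllu.splitlines() if t.startswith("# text =")]
print(s)  # ['วันนี้อากาศดี']
```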
On my machine, I (Koichi Yasuoka) got the following output:
*** KoichiYasuoka/modernbert-base-thai-cc100-ud-square
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     92.34 |     90.73 |     91.53 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     92.34 |     90.73 |     91.53 |
UPOS       |     81.81 |     80.39 |     81.09 |     88.60
XPOS       |     92.34 |     90.73 |     91.53 |    100.00
UFeats     |     90.05 |     88.48 |     89.26 |     97.52
AllTags    |     79.94 |     78.55 |     79.24 |     86.57
Lemmas     |     92.34 |     90.73 |     91.53 |    100.00
UAS        |     72.08 |     70.82 |     71.44 |     78.05
LAS        |     60.34 |     59.29 |     59.81 |     65.34
CLAS       |     56.96 |     54.07 |     55.48 |     60.85
MLAS       |     49.18 |     46.68 |     47.90 |     52.54
BLEX       |     56.96 |     54.07 |     55.48 |     60.85

*** KoichiYasuoka/modernbert-base-thai-cc100-ud-triangular
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     92.11 |     91.16 |     91.63 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     92.11 |     91.16 |     91.63 |
UPOS       |     81.44 |     80.61 |     81.02 |     88.42
XPOS       |     92.11 |     91.16 |     91.63 |    100.00
UFeats     |     89.77 |     88.85 |     89.30 |     97.46
AllTags    |     79.51 |     78.69 |     79.10 |     86.32
Lemmas     |     92.11 |     91.16 |     91.63 |    100.00
UAS        |     72.32 |     71.57 |     71.94 |     78.51
LAS        |     61.15 |     60.52 |     60.84 |     66.39
CLAS       |     57.88 |     55.40 |     56.61 |     61.83
MLAS       |     49.91 |     47.77 |     48.82 |     53.32
BLEX       |     57.88 |     55.40 |     56.61 |     61.83

*** KoichiYasuoka/modernbert-base-thai-cc100-ud-embeds
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     92.20 |     91.37 |     91.78 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     92.20 |     91.37 |     91.78 |
UPOS       |     81.18 |     80.45 |     80.81 |     88.05
XPOS       |     92.20 |     91.37 |     91.78 |    100.00
UFeats     |     89.68 |     88.87 |     89.27 |     97.26
AllTags    |     79.22 |     78.51 |     78.87 |     85.93
Lemmas     |     92.20 |     91.37 |     91.78 |    100.00
UAS        |     71.64 |     71.00 |     71.32 |     77.71
LAS        |     61.18 |     60.63 |     60.90 |     66.35
CLAS       |     57.76 |     55.45 |     56.58 |     61.87
MLAS       |     49.86 |     47.88 |     48.85 |     53.41
BLEX       |     57.76 |     55.45 |     56.58 |     61.87

*** KoichiYasuoka/modernbert-base-thai-cc100-upos
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     90.68 |     88.57 |     89.62 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     90.68 |     88.57 |     89.62 |
UPOS       |     75.39 |     73.63 |     74.50 |     83.13
XPOS       |     90.68 |     88.57 |     89.62 |    100.00
UFeats     |     87.79 |     85.75 |     86.76 |     96.81
AllTags    |     72.96 |     71.26 |     72.10 |     80.46
Lemmas     |     90.68 |     88.57 |     89.62 |    100.00
UAS        |     71.14 |     69.48 |     70.30 |     78.44
LAS        |     59.58 |     58.19 |     58.88 |     65.70
CLAS       |     55.86 |     52.45 |     54.10 |     60.79
MLAS       |     44.03 |     41.34 |     42.64 |     47.91
BLEX       |     55.86 |     52.45 |     54.10 |     60.79
That said, the LAS/MLAS/BLEX of modernbert-base-thai-cc100-ud-embeds comes out to 60.90/48.85/56.58, and on LAS it still loses to camembert-thai-base-upos. This is proving to be a hard problem.
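For the record, figures like the LAS/MLAS/BLEX triple quoted above can be pulled out of each result.txt mechanically: the F1 Score is the third numeric column of the verbose conll18_ud_eval.py table. A minimal sketch (the `extract_f1` helper and its regex are my own, not part of the evaluation script):

```python
import re

def extract_f1(report, metrics=("LAS", "MLAS", "BLEX")):
    # Match lines of the form "NAME | prec | rec | f1 | alignedacc"
    # and keep the F1 (third numeric field) for the requested metrics.
    scores = {}
    for line in report.splitlines():
        m = re.match(r"(\w+)\s*\|\s*([\d.]+)\s*\|\s*([\d.]+)\s*\|\s*([\d.]+)", line)
        if m and m.group(1) in metrics:
            scores[m.group(1)] = float(m.group(4))
    return scores

# The modernbert-base-thai-cc100-ud-embeds rows from the output above:
sample = """\
LAS        |     61.18 |     60.63 |     60.90 |     66.35
MLAS       |     49.86 |     47.88 |     48.85 |     53.41
BLEX       |     57.76 |     55.45 |     56.58 |     61.87
"""
print(extract_f1(sample))  # {'LAS': 60.9, 'MLAS': 48.85, 'BLEX': 56.58}
```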