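I applied the upper-triangular-matrix algorithm from 『青空文庫ModernBERTモデルによる国語研長単位係り受け解析』 (NINJAL long-unit-word dependency parsing with Aozora Bunko ModernBERT models) to Classical Chinese BERT, RoBERTa, and ModernBERT, and prototyped dependency-parsing models. ModernBERT has room for a 126×126 upper triangular matrix, but BERT and RoBERTa only have room for about 30×30, so I wanted to compare how much parsing accuracy each of them reaches. On Google Colaboratory, it goes like this.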
!pip install transformers triton
import os,sys,subprocess
from transformers import pipeline
# Models to compare: BERT, RoBERTa (base/large), ModernBERT (small/base/large)
models=[
  "KoichiYasuoka/bert-ancient-chinese-base-ud-embeds",
  "KoichiYasuoka/roberta-classical-chinese-base-ud-embeds",
  "KoichiYasuoka/roberta-classical-chinese-large-ud-embeds",
  "KoichiYasuoka/modernbert-small-classical-chinese-ud-embeds",
  "KoichiYasuoka/modernbert-base-classical-chinese-ud-embeds",
  "KoichiYasuoka/modernbert-large-classical-chinese-ud-embeds"
]
# Fetch UD_Classical_Chinese-Kyoto and copy its train/dev/test files here
url="https://github.com/UniversalDependencies/UD_Classical_Chinese-Kyoto"
d=os.path.basename(url)
!test -d {d} || git clone --depth=1 {url}
!for F in train dev test ; do cp {d}/*-$$F.conllu $$F.conllu ; done
# Fetch the CoNLL 2018 shared task evaluation script
url="https://universaldependencies.org/conll18/conll18_ud_eval.py"
c=os.path.basename(url)
!test -f {c} || curl -LO {url}
for mdl in models:
  nlp=pipeline("universal-dependencies",mdl,trust_remote_code=True)
  for f in ["dev.conllu","test.conllu"]:
    # Parse the raw sentences taken from the "# text =" comment lines
    with open(f,"r",encoding="utf-8") as r:
      s=[t[8:].strip() for t in r if t.startswith("# text =")]
    with open("result-"+f,"w",encoding="utf-8") as w:
      for t in s:
        try:
          w.write(nlp(t))
        except Exception:
          break  # stop at the first sentence the model cannot parse
  # Score both files against the gold data and save one report per model
  os.makedirs(f"result/{mdl}",exist_ok=True)
  with open(f"result/{mdl}/result.txt","w",encoding="utf-8") as w:
    for f in ["dev.conllu","test.conllu"]:
      p=subprocess.run([sys.executable,c,"-v",f,"result-"+f],
        encoding="utf-8",stdout=subprocess.PIPE,stderr=subprocess.STDOUT)
      print(f"\n*** {mdl} ({f})",p.stdout,sep="\n",file=w)
# Show all reports in the order of the models list
!( cd result && cat `find {" ".join(models)} -name result.txt` )
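
The loop above stops writing at the first sentence the pipeline fails on, so a single over-long sentence truncates the whole result file. As a rough sketch for spotting such sentences beforehand, the following lists the `# text =` sentences longer than a given limit. It assumes that the character count approximates the number of tokens the model sees, which is not verified for these tokenizers; `long_sentences`, `limit`, and the in-memory `sample` are hypothetical names and data for illustration only.

```python
def long_sentences(lines, limit):
    """Return the sentences from '# text =' comment lines longer than `limit` characters."""
    prefix = "# text ="
    texts = [t[len(prefix):].strip() for t in lines if t.startswith(prefix)]
    return [t for t in texts if len(t) > limit]

# Tiny in-memory stand-in for a .conllu file (illustrative, not the real treebank)
sample = [
    "# sent_id = demo-1\n",
    "# text = 學而時習之\n",
    "# text = 杜如晦房玄齡虞世南褚亮姚志廉李玄道蔡允恭薛元敬顏相時蘇勗于志寧"
    "蘇世長薛收李守素陸德明孔穎達蓋文達許敬宗爲文學館學士\n",
]

for t in long_sentences(sample, 30):
    print(len(t), t)
```

With a real file, `long_sentences(open("dev.conllu", encoding="utf-8"), 30)` would be the corresponding call.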
Evaluating on both lzh_kyoto-ud-dev.conllu and lzh_kyoto-ud-test.conllu, I (Koichi Yasuoka) got the following output on my machine.
*** KoichiYasuoka/bert-ancient-chinese-base-ud-embeds (dev.conllu)
Traceback (most recent call last):
File "/content/conll18_ud_eval.py", line 532, in <module>
main()
File "/content/conll18_ud_eval.py", line 500, in main
evaluation = evaluate_wrapper(args)
^^^^^^^^^^^^^^^^^^^^^^
File "/content/conll18_ud_eval.py", line 484, in evaluate_wrapper
return evaluate(gold_ud, system_ud)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/content/conll18_ud_eval.py", line 441, in evaluate
raise UDError(
UDError: The concatenation of tokens in gold file and in system file differ!
First 20 differing characters in gold file: '杜如晦房玄齡虞世南褚亮姚志廉李玄道蔡允恭' and system file: ''
*** KoichiYasuoka/bert-ancient-chinese-base-ud-embeds (test.conllu)
Metric | Precision | Recall | F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens | 98.31 | 98.69 | 98.50 |
Sentences | 100.00 | 100.00 | 100.00 |
Words | 98.31 | 98.69 | 98.50 |
UPOS | 92.42 | 92.77 | 92.60 | 94.00
XPOS | 0.00 | 0.00 | 0.00 | 0.00
UFeats | 93.34 | 93.70 | 93.52 | 94.94
AllTags | 0.00 | 0.00 | 0.00 | 0.00
Lemmas | 96.57 | 96.95 | 96.76 | 98.23
UAS | 84.77 | 85.09 | 84.93 | 86.22
LAS | 79.78 | 80.09 | 79.93 | 81.15
CLAS | 79.34 | 79.35 | 79.34 | 80.52
MLAS | 77.06 | 77.07 | 77.07 | 78.21
BLEX | 78.26 | 78.27 | 78.26 | 79.42
*** KoichiYasuoka/roberta-classical-chinese-base-ud-embeds (dev.conllu)
Traceback (most recent call last):
File "/content/conll18_ud_eval.py", line 532, in <module>
main()
File "/content/conll18_ud_eval.py", line 500, in main
evaluation = evaluate_wrapper(args)
^^^^^^^^^^^^^^^^^^^^^^
File "/content/conll18_ud_eval.py", line 484, in evaluate_wrapper
return evaluate(gold_ud, system_ud)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/content/conll18_ud_eval.py", line 441, in evaluate
raise UDError(
UDError: The concatenation of tokens in gold file and in system file differ!
First 20 differing characters in gold file: '杜如晦房玄齡虞世南褚亮姚志廉李玄道蔡允恭' and system file: ''
*** KoichiYasuoka/roberta-classical-chinese-base-ud-embeds (test.conllu)
Metric | Precision | Recall | F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens | 98.23 | 98.70 | 98.47 |
Sentences | 100.00 | 100.00 | 100.00 |
Words | 98.23 | 98.70 | 98.47 |
UPOS | 92.42 | 92.86 | 92.64 | 94.08
XPOS | 0.00 | 0.00 | 0.00 | 0.00
UFeats | 93.38 | 93.82 | 93.60 | 95.06
AllTags | 0.00 | 0.00 | 0.00 | 0.00
Lemmas | 96.48 | 96.93 | 96.71 | 98.21
UAS | 84.28 | 84.68 | 84.48 | 85.79
LAS | 79.41 | 79.78 | 79.60 | 80.84
CLAS | 78.86 | 78.96 | 78.91 | 80.12
MLAS | 76.64 | 76.73 | 76.69 | 77.86
BLEX | 77.75 | 77.84 | 77.80 | 78.99
*** KoichiYasuoka/roberta-classical-chinese-large-ud-embeds (dev.conllu)
Traceback (most recent call last):
File "/content/conll18_ud_eval.py", line 532, in <module>
main()
File "/content/conll18_ud_eval.py", line 500, in main
evaluation = evaluate_wrapper(args)
^^^^^^^^^^^^^^^^^^^^^^
File "/content/conll18_ud_eval.py", line 484, in evaluate_wrapper
return evaluate(gold_ud, system_ud)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/content/conll18_ud_eval.py", line 441, in evaluate
raise UDError(
UDError: The concatenation of tokens in gold file and in system file differ!
First 20 differing characters in gold file: '圖畫功臣長孫無忌趙郡王孝恭杜如晦魏徵房玄' and system file: ''
*** KoichiYasuoka/roberta-classical-chinese-large-ud-embeds (test.conllu)
Metric | Precision | Recall | F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens | 98.84 | 99.01 | 98.93 |
Sentences | 100.00 | 100.00 | 100.00 |
Words | 98.84 | 99.01 | 98.93 |
UPOS | 93.32 | 93.49 | 93.41 | 94.42
XPOS | 0.00 | 0.00 | 0.00 | 0.00
UFeats | 93.87 | 94.04 | 93.96 | 94.98
AllTags | 0.00 | 0.00 | 0.00 | 0.00
Lemmas | 97.07 | 97.25 | 97.16 | 98.22
UAS | 86.77 | 86.93 | 86.85 | 87.79
LAS | 81.99 | 82.14 | 82.06 | 82.96
CLAS | 81.66 | 81.40 | 81.53 | 82.31
MLAS | 78.84 | 78.59 | 78.71 | 79.46
BLEX | 80.52 | 80.27 | 80.39 | 81.16
*** KoichiYasuoka/modernbert-small-classical-chinese-ud-embeds (dev.conllu)
Metric | Precision | Recall | F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens | 96.20 | 97.28 | 96.74 |
Sentences | 100.00 | 100.00 | 100.00 |
Words | 96.20 | 97.28 | 96.74 |
UPOS | 88.71 | 89.70 | 89.20 | 92.21
XPOS | 0.00 | 0.00 | 0.00 | 0.00
UFeats | 89.95 | 90.96 | 90.45 | 93.51
AllTags | 0.00 | 0.00 | 0.00 | 0.00
Lemmas | 94.50 | 95.56 | 95.03 | 98.23
UAS | 78.67 | 79.55 | 79.11 | 81.78
LAS | 73.43 | 74.25 | 73.83 | 76.33
CLAS | 72.68 | 73.53 | 73.10 | 75.79
MLAS | 70.01 | 70.82 | 70.41 | 73.00
BLEX | 71.69 | 72.53 | 72.11 | 74.76
*** KoichiYasuoka/modernbert-small-classical-chinese-ud-embeds (test.conllu)
Metric | Precision | Recall | F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens | 98.07 | 98.44 | 98.25 |
Sentences | 100.00 | 100.00 | 100.00 |
Words | 98.07 | 98.44 | 98.25 |
UPOS | 90.94 | 91.28 | 91.11 | 92.73
XPOS | 0.00 | 0.00 | 0.00 | 0.00
UFeats | 92.28 | 92.62 | 92.45 | 94.09
AllTags | 0.00 | 0.00 | 0.00 | 0.00
Lemmas | 96.36 | 96.72 | 96.54 | 98.26
UAS | 82.20 | 82.51 | 82.36 | 83.82
LAS | 77.01 | 77.30 | 77.16 | 78.53
CLAS | 76.50 | 76.46 | 76.48 | 77.81
MLAS | 73.84 | 73.79 | 73.81 | 75.09
BLEX | 75.59 | 75.54 | 75.57 | 76.88
*** KoichiYasuoka/modernbert-base-classical-chinese-ud-embeds (dev.conllu)
Metric | Precision | Recall | F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens | 95.62 | 97.14 | 96.37 |
Sentences | 100.00 | 100.00 | 100.00 |
Words | 95.62 | 97.14 | 96.37 |
UPOS | 88.27 | 89.67 | 88.96 | 92.31
XPOS | 0.00 | 0.00 | 0.00 | 0.00
UFeats | 89.52 | 90.95 | 90.23 | 93.63
AllTags | 0.00 | 0.00 | 0.00 | 0.00
Lemmas | 93.93 | 95.42 | 94.67 | 98.23
UAS | 77.23 | 78.46 | 77.84 | 80.77
LAS | 72.24 | 73.39 | 72.81 | 75.55
CLAS | 71.46 | 72.60 | 72.02 | 74.95
MLAS | 68.83 | 69.93 | 69.38 | 72.20
BLEX | 70.49 | 71.62 | 71.05 | 73.94
*** KoichiYasuoka/modernbert-base-classical-chinese-ud-embeds (test.conllu)
Metric | Precision | Recall | F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens | 97.88 | 98.49 | 98.18 |
Sentences | 100.00 | 100.00 | 100.00 |
Words | 97.88 | 98.49 | 98.18 |
UPOS | 90.96 | 91.52 | 91.24 | 92.92
XPOS | 0.00 | 0.00 | 0.00 | 0.00
UFeats | 91.97 | 92.53 | 92.25 | 93.96
AllTags | 0.00 | 0.00 | 0.00 | 0.00
Lemmas | 96.16 | 96.76 | 96.46 | 98.24
UAS | 81.20 | 81.71 | 81.45 | 82.96
LAS | 76.09 | 76.55 | 76.32 | 77.73
CLAS | 75.43 | 75.65 | 75.54 | 76.95
MLAS | 72.87 | 73.08 | 72.98 | 74.34
BLEX | 74.49 | 74.71 | 74.60 | 75.99
*** KoichiYasuoka/modernbert-large-classical-chinese-ud-embeds (dev.conllu)
Metric | Precision | Recall | F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens | 95.49 | 97.03 | 96.25 |
Sentences | 100.00 | 100.00 | 100.00 |
Words | 95.49 | 97.03 | 96.25 |
UPOS | 87.70 | 89.12 | 88.40 | 91.84
XPOS | 0.00 | 0.00 | 0.00 | 0.00
UFeats | 89.13 | 90.57 | 89.84 | 93.34
AllTags | 0.00 | 0.00 | 0.00 | 0.00
Lemmas | 93.81 | 95.33 | 94.56 | 98.24
UAS | 76.24 | 77.47 | 76.85 | 79.84
LAS | 71.16 | 72.31 | 71.73 | 74.52
CLAS | 70.29 | 71.50 | 70.89 | 73.91
MLAS | 67.66 | 68.83 | 68.24 | 71.15
BLEX | 69.33 | 70.53 | 69.92 | 72.91
*** KoichiYasuoka/modernbert-large-classical-chinese-ud-embeds (test.conllu)
Metric | Precision | Recall | F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens | 97.91 | 98.49 | 98.20 |
Sentences | 100.00 | 100.00 | 100.00 |
Words | 97.91 | 98.49 | 98.20 |
UPOS | 90.74 | 91.27 | 91.00 | 92.67
XPOS | 0.00 | 0.00 | 0.00 | 0.00
UFeats | 92.08 | 92.62 | 92.35 | 94.04
AllTags | 0.00 | 0.00 | 0.00 | 0.00
Lemmas | 96.20 | 96.76 | 96.48 | 98.25
UAS | 81.24 | 81.72 | 81.48 | 82.98
LAS | 76.03 | 76.48 | 76.25 | 77.65
CLAS | 75.29 | 75.52 | 75.40 | 76.81
MLAS | 72.44 | 72.66 | 72.55 | 73.90
BLEX | 74.34 | 74.57 | 74.46 | 75.84
Let's put the LAS/MLAS/BLEX F1 scores into a table.
Model | Parameters | Token width | lzh_kyoto-ud-dev.conllu | lzh_kyoto-ud-test.conllu |
---|---|---|---|---|
bert-ancient-chinese-base-ud-embeds | 110M | 512 | × | 79.93/77.07/78.26 |
roberta-classical-chinese-base-ud-embeds | 101M | 514 | × | 79.60/76.69/77.80 |
roberta-classical-chinese-large-ud-embeds | 316M | 514 | × | 82.06/78.71/80.39 |
modernbert-small-classical-chinese-ud-embeds | 19.13M | 8192 | 73.83/70.41/72.11 | 77.16/73.81/75.57 |
modernbert-base-classical-chinese-ud-embeds | 124M | 8192 | 72.81/69.38/71.05 | 76.32/72.98/74.60 |
modernbert-large-classical-chinese-ud-embeds | 352M | 8192 | 71.73/68.24/69.92 | 76.25/72.55/74.46 |
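
The LAS/MLAS/BLEX cells in the table above are the F1 Score column of the verbose evaluator output. A minimal sketch of pulling those three values out of one report might look like the following; `f1_scores` is a hypothetical helper, and the `report` string is an excerpt of the roberta-classical-chinese-large-ud-embeds (test.conllu) output shown earlier.

```python
def f1_scores(report, metrics=("LAS", "MLAS", "BLEX")):
    """Join the F1 values of `metrics` with '/', or return '×' if any is missing."""
    scores = {}
    for line in report.splitlines():
        cells = [c.strip() for c in line.split("|")]
        # Data rows look like: name | precision | recall | F1 | aligned accuracy
        if len(cells) >= 4 and cells[0] in metrics:
            scores[cells[0]] = cells[3]
    if all(m in scores for m in metrics):
        return "/".join(scores[m] for m in metrics)
    return "×"  # e.g. the evaluator ended with a traceback instead of a table

report = """\
Metric | Precision | Recall | F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
UAS | 86.77 | 86.93 | 86.85 | 87.79
LAS | 81.99 | 82.14 | 82.06 | 82.96
CLAS | 81.66 | 81.40 | 81.53 | 82.31
MLAS | 78.84 | 78.59 | 78.71 | 79.46
BLEX | 80.52 | 80.27 | 80.39 | 81.16
"""

print(f1_scores(report))  # 82.06/78.71/80.39
```

Matching the metric name exactly keeps the CLAS row from being mistaken for LAS.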
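lzh_kyoto-ud-dev.conllu contains sentences of more than 40 tokens, such as 「杜如晦房玄齡虞世南褚亮姚志廉李玄道蔡允恭薛元敬顏相時蘇勗于志寧蘇世長薛收李守素陸德明孔穎達蓋文達許敬宗爲文學館學士」 and 「圖畫功臣長孫無忌趙郡王孝恭杜如晦魏徵房玄齡高士廉尉遲敬德李靖蕭瑀段志玄劉弘基屈突通殷開山柴紹長孫順德張亮侯君集張公謹程知節虞世南劉政會唐儉李勣秦叔寶等於凌煙閣」, and their upper triangular matrices do not fit into the token width of 512 or 514 that BERT and RoBERTa offer. The sentences in lzh_kyoto-ud-test.conllu, on the other hand, all fit within 30 tokens, so the upper-triangular-matrix algorithm works with BERT and RoBERTa as well. The upshot: as long as every sentence fits within 30 tokens, the RoBERTa large model performs best, and for longer sentences the ModernBERT small model is the one to choose. Hmm, if I pushed the "thinning" of the upper-triangular-matrix algorithm a little further, couldn't BERT and RoBERTa handle sentences of around 50 tokens?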
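
To make the size argument concrete, here is a back-of-the-envelope sketch of how many positions an n×n upper triangular matrix takes. The cell count n(n+1)/2 is plain arithmetic. The accounting in `max_tokens` (one extra position per token plus two special tokens) is only an assumption, chosen because it reproduces the 126 and 30 quoted in this post; the actual input layout of the models may differ. Both function names are hypothetical.

```python
def triangle_cells(n):
    """Number of cells in an n×n upper triangular matrix, diagonal included."""
    return n * (n + 1) // 2

def max_tokens(width, per_token=1, special=2):
    """Largest n with triangle_cells(n) + per_token*n + special <= width.

    per_token and special are assumed overheads, not taken from the models.
    """
    n = 0
    while triangle_cells(n + 1) + per_token * (n + 1) + special <= width:
        n += 1
    return n

for width in (512, 8192):
    print(width, max_tokens(width))  # 512 -> 30, 8192 -> 126

for n in (30, 40, 50, 126):
    print(n, triangle_cells(n))      # 465, 820, 1275, 8001
```

Under these assumptions a 40-token sentence already needs 820 cells and a 50-token one 1275, well beyond a width of 512, which is why handling such sentences with BERT or RoBERTa would require dropping a large share of the cells.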