Applying the method from yesterday's article to "HPLT Bert for Belarusian" and UD_Belarusian-HSE, I built a prototype Belarusian dependency-parsing model, ltgbert-base-belarusian-ud-goeswith. Here is a benchmark program (Google Colaboratory GPU version) using be_hse-ud-test.conllu, which also covers the Belarusian models I have built in the past.
!pip install esupar
models=["KoichiYasuoka/ltgbert-base-belarusian-ud-goeswith","KoichiYasuoka/deberta-base-belarusian-upos","KoichiYasuoka/deberta-base-belarusian-ud-goeswith","KoichiYasuoka/roberta-small-belarusian-upos"]
import os,sys,subprocess,esupar,deplacy
from transformers import pipeline
url="https://github.com/UniversalDependencies/UD_Belarusian-HSE"
d=os.path.basename(url)
os.system(f"test -d {d} || git clone --depth=1 {url}")
os.system("for F in train dev test ; do cp "+d+"/*-$F.conllu $F.conllu ; done")
url="https://universaldependencies.org/conll18/conll18_ud_eval.py"
c=os.path.basename(url)
os.system(f"test -f {c} || curl -LO {url}")
with open("test.conllu","r",encoding="utf-8") as r:
  s=[t[8:].strip() for t in r if t.startswith("# text =")]
for mdl in models:
  if mdl.endswith("-upos"):
    nlp=esupar.load(mdl)
  else:
    nlp=pipeline("universal-dependencies",mdl,trust_remote_code=True,aggregation_strategy="simple",device=0)
  with open("result.conllu","w",encoding="utf-8") as w:
    for t in s:
      print(deplacy.to_conllu(nlp(t)).strip(),end="\n\n",file=w)
  p=subprocess.run([sys.executable,c,"-v","test.conllu","result.conllu"],encoding="utf-8",stdout=subprocess.PIPE,stderr=subprocess.STDOUT)
  print(f"\n*** {mdl}",p.stdout,sep="\n",flush=True)
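For reference, the LAS that conll18_ud_eval.py reports counts, over aligned words, those whose HEAD and DEPREL both match the gold standard. A minimal sketch of that computation, on toy CoNLL-U data, assuming identical tokenization so that precision, recall, and F1 coincide (the actual script additionally aligns mismatched tokenizations):

```python
# Toy LAS computation over two CoNLL-U strings with identical tokenization.
# In CoNLL-U, each word line has 10 tab-separated fields; HEAD is the 7th
# field (index 6) and DEPREL the 8th (index 7).
def word_lines(conllu):
  # Keep only word lines: non-empty, not a comment, integer ID
  # (this also skips multiword-token ranges like "1-2").
  return [t.split("\t") for t in conllu.split("\n")
          if t and not t.startswith("#") and t.split("\t")[0].isdigit()]

def las(gold_conllu, system_conllu):
  gold = word_lines(gold_conllu)
  system = word_lines(system_conllu)
  assert len(gold) == len(system)  # identical tokenization assumed
  # A word is correct when both its head and its dependency relation match.
  correct = sum(1 for g, s in zip(gold, system)
                if g[6] == s[6] and g[7] == s[7])
  return correct / len(gold)

# Toy example ("Гэта мая кніга" = "This is my book"); the system output
# below mislabels one DEPREL (amod instead of det), so LAS is 2/3.
gold = """1\tГэта\tгэта\tPRON\t_\t_\t3\tnsubj\t_\t_
2\tмая\tмой\tDET\t_\t_\t3\tdet\t_\t_
3\tкніга\tкніга\tNOUN\t_\t_\t0\troot\t_\t_"""
system = """1\tГэта\tгэта\tPRON\t_\t_\t3\tnsubj\t_\t_
2\tмая\tмой\tDET\t_\t_\t3\tamod\t_\t_
3\tкніга\tкніга\tNOUN\t_\t_\t0\troot\t_\t_"""
print(las(gold, system))  # 2 of 3 words have correct head+deprel
```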
On my (Koichi Yasuoka's) setup, the following results were output.
*** KoichiYasuoka/ltgbert-base-belarusian-ud-goeswith
Metric | Precision | Recall | F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens | 99.57 | 99.47 | 99.52 |
Sentences | 100.00 | 100.00 | 100.00 |
Words | 99.57 | 99.47 | 99.52 |
UPOS | 98.37 | 98.27 | 98.32 | 98.80
XPOS | 0.00 | 0.00 | 0.00 | 0.00
UFeats | 95.43 | 95.33 | 95.38 | 95.84
AllTags | 0.00 | 0.00 | 0.00 | 0.00
Lemmas | 0.00 | 0.00 | 0.00 | 0.00
UAS | 92.37 | 92.28 | 92.33 | 92.77
LAS | 90.39 | 90.30 | 90.35 | 90.79
CLAS | 88.58 | 88.38 | 88.48 | 88.84
MLAS | 82.88 | 82.69 | 82.78 | 83.12
BLEX | 0.00 | 0.00 | 0.00 | 0.00
*** KoichiYasuoka/deberta-base-belarusian-upos
Metric | Precision | Recall | F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens | 99.76 | 99.89 | 99.83 |
Sentences | 100.00 | 100.00 | 100.00 |
Words | 99.76 | 99.89 | 99.83 |
UPOS | 98.99 | 99.12 | 99.06 | 99.23
XPOS | 0.00 | 0.00 | 0.00 | 0.00
UFeats | 97.17 | 97.29 | 97.23 | 97.40
AllTags | 0.00 | 0.00 | 0.00 | 0.00
Lemmas | 0.00 | 0.00 | 0.00 | 0.00
UAS | 88.94 | 89.05 | 88.99 | 89.15
LAS | 86.73 | 86.84 | 86.78 | 86.93
CLAS | 84.24 | 84.15 | 84.20 | 84.27
MLAS | 80.28 | 80.19 | 80.24 | 80.31
BLEX | 0.00 | 0.00 | 0.00 | 0.00
*** KoichiYasuoka/deberta-base-belarusian-ud-goeswith
Metric | Precision | Recall | F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens | 99.47 | 99.66 | 99.56 |
Sentences | 100.00 | 100.00 | 100.00 |
Words | 99.47 | 99.66 | 99.56 |
UPOS | 97.92 | 98.11 | 98.01 | 98.44
XPOS | 0.00 | 0.00 | 0.00 | 0.00
UFeats | 94.78 | 94.96 | 94.87 | 95.28
AllTags | 0.00 | 0.00 | 0.00 | 0.00
Lemmas | 0.00 | 0.00 | 0.00 | 0.00
UAS | 89.32 | 89.49 | 89.41 | 89.80
LAS | 86.89 | 87.05 | 86.97 | 87.35
CLAS | 84.51 | 84.28 | 84.40 | 84.62
MLAS | 78.47 | 78.26 | 78.37 | 78.58
BLEX | 0.00 | 0.00 | 0.00 | 0.00
*** KoichiYasuoka/roberta-small-belarusian-upos
Metric | Precision | Recall | F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens | 99.45 | 99.50 | 99.48 |
Sentences | 100.00 | 100.00 | 100.00 |
Words | 99.45 | 99.50 | 99.48 |
UPOS | 94.66 | 94.71 | 94.68 | 95.18
XPOS | 0.00 | 0.00 | 0.00 | 0.00
UFeats | 83.66 | 83.70 | 83.68 | 84.12
AllTags | 0.00 | 0.00 | 0.00 | 0.00
Lemmas | 0.00 | 0.00 | 0.00 | 0.00
UAS | 87.21 | 87.25 | 87.23 | 87.69
LAS | 84.80 | 84.84 | 84.82 | 85.27
CLAS | 81.93 | 81.90 | 81.91 | 82.37
MLAS | 62.22 | 62.20 | 62.21 | 62.56
BLEX | 0.00 | 0.00 | 0.00 | 0.00
ltgbert-base-belarusian-ud-goeswith is far ahead of the others, with UPOS/LAS/MLAS of 98.32/90.35/82.78. That said, according to Table 3.5 of HPLT Deliverable 4.1, "First language models trained" (February 29, 2024), LAS for Belarusian reportedly reached 91.1, so there still seems to be some room for improvement.