Release of the Belarusian dependency parsing model ltgbert-base-belarusian-ud-goeswith

I applied the method from yesterday's article to "HPLT Bert for Belarusian" and UD_Belarusian-HSE, and built a prototype Belarusian dependency parsing model, ltgbert-base-belarusian-ud-goeswith. A benchmark program against be_hse-ud-test.conllu (Google Colaboratory GPU version), which also covers the Belarusian models I have built in the past, looks like this:

!pip install esupar
# Models to benchmark: the new LTG-BERT goeswith model plus earlier Belarusian models
models=["KoichiYasuoka/ltgbert-base-belarusian-ud-goeswith","KoichiYasuoka/deberta-base-belarusian-upos","KoichiYasuoka/deberta-base-belarusian-ud-goeswith","KoichiYasuoka/roberta-small-belarusian-upos"]
import os,sys,subprocess,esupar,deplacy
from transformers import pipeline
# Fetch UD_Belarusian-HSE and copy its train/dev/test CoNLL-U files
url="https://github.com/UniversalDependencies/UD_Belarusian-HSE"
d=os.path.basename(url)
os.system(f"test -d {d} || git clone --depth=1 {url}")
os.system("for F in train dev test ; do cp "+d+"/*-$F.conllu $F.conllu ; done")
# Fetch the CoNLL 2018 shared-task evaluation script
url="https://universaldependencies.org/conll18/conll18_ud_eval.py"
c=os.path.basename(url)
os.system(f"test -f {c} || curl -LO {url}")
# Extract the raw sentences ("# text =" lines) from the test set
with open("test.conllu","r",encoding="utf-8") as r:
  s=[t[8:].strip() for t in r if t.startswith("# text =")]
for mdl in models:
  # -upos models are loaded via esupar; -ud-goeswith models via a transformers pipeline
  if mdl.endswith("-upos"):
    nlp=esupar.load(mdl)
  else:
    nlp=pipeline("universal-dependencies",mdl,trust_remote_code=True,aggregation_strategy="simple",device=0)
  # Parse every test sentence and write the output in CoNLL-U format
  with open("result.conllu","w",encoding="utf-8") as w:
    for t in s:
      print(deplacy.to_conllu(nlp(t)).strip(),end="\n\n",file=w)
  # Score the output against the gold test set
  p=subprocess.run([sys.executable,c,"-v","test.conllu","result.conllu"],encoding="utf-8",stdout=subprocess.PIPE,stderr=subprocess.STDOUT)
  print(f"\n*** {mdl}",p.stdout,sep="\n",flush=True)

On my (Koichi Yasuoka's) machine, the following results were output:

*** KoichiYasuoka/ltgbert-base-belarusian-ud-goeswith
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     99.57 |     99.47 |     99.52 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     99.57 |     99.47 |     99.52 |
UPOS       |     98.37 |     98.27 |     98.32 |     98.80
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     95.43 |     95.33 |     95.38 |     95.84
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     92.37 |     92.28 |     92.33 |     92.77
LAS        |     90.39 |     90.30 |     90.35 |     90.79
CLAS       |     88.58 |     88.38 |     88.48 |     88.84
MLAS       |     82.88 |     82.69 |     82.78 |     83.12
BLEX       |      0.00 |      0.00 |      0.00 |      0.00

*** KoichiYasuoka/deberta-base-belarusian-upos
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     99.76 |     99.89 |     99.83 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     99.76 |     99.89 |     99.83 |
UPOS       |     98.99 |     99.12 |     99.06 |     99.23
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     97.17 |     97.29 |     97.23 |     97.40
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     88.94 |     89.05 |     88.99 |     89.15
LAS        |     86.73 |     86.84 |     86.78 |     86.93
CLAS       |     84.24 |     84.15 |     84.20 |     84.27
MLAS       |     80.28 |     80.19 |     80.24 |     80.31
BLEX       |      0.00 |      0.00 |      0.00 |      0.00

*** KoichiYasuoka/deberta-base-belarusian-ud-goeswith
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     99.47 |     99.66 |     99.56 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     99.47 |     99.66 |     99.56 |
UPOS       |     97.92 |     98.11 |     98.01 |     98.44
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     94.78 |     94.96 |     94.87 |     95.28
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     89.32 |     89.49 |     89.41 |     89.80
LAS        |     86.89 |     87.05 |     86.97 |     87.35
CLAS       |     84.51 |     84.28 |     84.40 |     84.62
MLAS       |     78.47 |     78.26 |     78.37 |     78.58
BLEX       |      0.00 |      0.00 |      0.00 |      0.00

*** KoichiYasuoka/roberta-small-belarusian-upos
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     99.45 |     99.50 |     99.48 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     99.45 |     99.50 |     99.48 |
UPOS       |     94.66 |     94.71 |     94.68 |     95.18
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     83.66 |     83.70 |     83.68 |     84.12
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     87.21 |     87.25 |     87.23 |     87.69
LAS        |     84.80 |     84.84 |     84.82 |     85.27
CLAS       |     81.93 |     81.90 |     81.91 |     82.37
MLAS       |     62.22 |     62.20 |     62.21 |     62.56
BLEX       |      0.00 |      0.00 |      0.00 |      0.00

ltgbert-base-belarusian-ud-goeswith comes out far ahead, with UPOS/LAS/MLAS of 98.32/90.35/82.78. That said, according to Table 3.5 of HPLT Deliverable 4.1, "First language models trained" (February 29, 2024), LAS for Belarusian has reportedly reached 91.1, so there still seems to be some room for improvement.
