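I spent ten days building modernbert-base-thai-cc100 and then fine-tuned several Thai dependency-parsing models on top of it. It appears that I have managed to beat the record from my article of September 12, 2024. The benchmark program for Google Colaboratory (GPU runtime) looks like this.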

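# Install the libraries used below (Colab cell; "!" runs a shell command)
!pip install -U transformers triton esupar
# Models to benchmark: three "ud-*" parsers and one "-upos" model loaded via esupar
models=[
  "KoichiYasuoka/modernbert-base-thai-cc100-ud-square",
  "KoichiYasuoka/modernbert-base-thai-cc100-ud-triangular",
  "KoichiYasuoka/modernbert-base-thai-cc100-ud-embeds",
  "KoichiYasuoka/modernbert-base-thai-cc100-upos"
]
import os,sys,subprocess
# Clone the TUD treebank unless its test set is already present
url="https://github.com/nlp-chula/TUD"
f=os.path.join(os.path.basename(url),"TUD","test.conllu")
os.system(f"test -f {f} || git clone --depth=1 {url}")
# Download the CoNLL 2018 evaluation script unless it is already present
url="https://universaldependencies.org/conll18/conll18_ud_eval.py"
c=os.path.basename(url)
os.system(f"test -f {c} || curl -LO {url}")
# Collect the raw test sentences from the "# text =" comment lines
with open(f,"r",encoding="utf-8") as r:
  s=[t[8:].strip() for t in r if t.startswith("# text =")]
for mdl in models:
  if mdl.endswith("-upos"):
    # esupar model: parse, then serialize to CoNLL-U with deplacy
    import esupar,deplacy
    p=esupar.load(mdl)
    nlp=lambda t:deplacy.to_conllu(p(t))
  else:
    # ud-* models: "universal-dependencies" pipeline loaded with trust_remote_code
    from transformers import pipeline
    nlp=pipeline("universal-dependencies",mdl,trust_remote_code=True,
      aggregation_strategy="simple",device=0)
  # Parse every test sentence and write the CoNLL-U output to a single file
  with open("result.conllu","w",encoding="utf-8") as w:
    for t in s:
      w.write(nlp(t))
  # Score the output against the gold test set (-v prints the full table)
  p=subprocess.run([sys.executable,c,"-v",f,"result.conllu"],
    encoding="utf-8",stdout=subprocess.PIPE,stderr=subprocess.STDOUT)
  # Save the score table under result/<model name>/
  os.system(f"mkdir -p result/{mdl}")
  with open(f"result/{mdl}/result.txt","w",encoding="utf-8") as w:
    print(f"\n*** {mdl}",p.stdout,sep="\n",file=w)
# Print all score tables in the order of the models list
!( cd result && cat `find {" ".join(models)} -name result.txt` )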

On my (Koichi Yasuoka's) machine, the program printed the following results.

*** KoichiYasuoka/modernbert-base-thai-cc100-ud-square
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     92.34 |     90.73 |     91.53 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     92.34 |     90.73 |     91.53 |
UPOS       |     81.81 |     80.39 |     81.09 |     88.60
XPOS       |     92.34 |     90.73 |     91.53 |    100.00
UFeats     |     90.05 |     88.48 |     89.26 |     97.52
AllTags    |     79.94 |     78.55 |     79.24 |     86.57
Lemmas     |     92.34 |     90.73 |     91.53 |    100.00
UAS        |     72.08 |     70.82 |     71.44 |     78.05
LAS        |     60.34 |     59.29 |     59.81 |     65.34
CLAS       |     56.96 |     54.07 |     55.48 |     60.85
MLAS       |     49.18 |     46.68 |     47.90 |     52.54
BLEX       |     56.96 |     54.07 |     55.48 |     60.85

*** KoichiYasuoka/modernbert-base-thai-cc100-ud-triangular
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     92.11 |     91.16 |     91.63 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     92.11 |     91.16 |     91.63 |
UPOS       |     81.44 |     80.61 |     81.02 |     88.42
XPOS       |     92.11 |     91.16 |     91.63 |    100.00
UFeats     |     89.77 |     88.85 |     89.30 |     97.46
AllTags    |     79.51 |     78.69 |     79.10 |     86.32
Lemmas     |     92.11 |     91.16 |     91.63 |    100.00
UAS        |     72.32 |     71.57 |     71.94 |     78.51
LAS        |     61.15 |     60.52 |     60.84 |     66.39
CLAS       |     57.88 |     55.40 |     56.61 |     61.83
MLAS       |     49.91 |     47.77 |     48.82 |     53.32
BLEX       |     57.88 |     55.40 |     56.61 |     61.83

*** KoichiYasuoka/modernbert-base-thai-cc100-ud-embeds
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     92.20 |     91.37 |     91.78 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     92.20 |     91.37 |     91.78 |
UPOS       |     81.18 |     80.45 |     80.81 |     88.05
XPOS       |     92.20 |     91.37 |     91.78 |    100.00
UFeats     |     89.68 |     88.87 |     89.27 |     97.26
AllTags    |     79.22 |     78.51 |     78.87 |     85.93
Lemmas     |     92.20 |     91.37 |     91.78 |    100.00
UAS        |     71.64 |     71.00 |     71.32 |     77.71
LAS        |     61.18 |     60.63 |     60.90 |     66.35
CLAS       |     57.76 |     55.45 |     56.58 |     61.87
MLAS       |     49.86 |     47.88 |     48.85 |     53.41
BLEX       |     57.76 |     55.45 |     56.58 |     61.87

*** KoichiYasuoka/modernbert-base-thai-cc100-upos
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     90.68 |     88.57 |     89.62 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     90.68 |     88.57 |     89.62 |
UPOS       |     75.39 |     73.63 |     74.50 |     83.13
XPOS       |     90.68 |     88.57 |     89.62 |    100.00
UFeats     |     87.79 |     85.75 |     86.76 |     96.81
AllTags    |     72.96 |     71.26 |     72.10 |     80.46
Lemmas     |     90.68 |     88.57 |     89.62 |    100.00
UAS        |     71.14 |     69.48 |     70.30 |     78.44
LAS        |     59.58 |     58.19 |     58.88 |     65.70
CLAS       |     55.86 |     52.45 |     54.10 |     60.79
MLAS       |     44.03 |     41.34 |     42.64 |     47.91
BLEX       |     55.86 |     52.45 |     54.10 |     60.79
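
For readers who want to post-process these tables, here is a minimal sketch of a parser for this table layout, together with a sanity check that the F1 column is the harmonic mean of precision and recall. The names `SAMPLE`, `parse_eval_table` and `f1` are hypothetical helpers introduced for illustration and are not part of conll18_ud_eval.py. The sample rows are copied from the -ud-embeds table above.

```python
# Three rows copied from the -ud-embeds table above, in the same layout.
SAMPLE = """\
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Words      |     92.20 |     91.37 |     91.78 |
UPOS       |     81.18 |     80.45 |     80.81 |     88.05
LAS        |     61.18 |     60.63 |     60.90 |     66.35
"""

def parse_eval_table(text):
    """Parse one score table into {metric: {column: float or None}}."""
    # The separator row contains "+" but no "|", so this filter drops it.
    lines = [ln for ln in text.splitlines() if "|" in ln]
    header = [h.strip() for h in lines[0].split("|")][1:]
    scores = {}
    for ln in lines[1:]:
        cells = [c.strip() for c in ln.split("|")]
        # Empty cells (e.g. AligndAcc for Words) become None.
        scores[cells[0]] = {
            h: (float(c) if c else None) for h, c in zip(header, cells[1:])
        }
    return scores

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

if __name__ == "__main__":
    for metric, row in parse_eval_table(SAMPLE).items():
        recomputed = f1(row["Precision"], row["Recall"])
        print(f"{metric}: reported {row['F1 Score']:.2f}, recomputed {recomputed:.2f}")
```

Because the table values are rounded to two decimals, the recomputed F1 can differ from the reported one in the last digit.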

That said, modernbert-base-thai-cc100-ud-embeds scores 60.90/48.85/56.58 on LAS/MLAS/BLEX, and its LAS still loses to camembert-thai-base-upos. This is harder than it looks.
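
To make the comparison among the four models in this article easier to see, the following sketch picks the best model for each of LAS, MLAS and BLEX. The F1 values are copied from the tables above. The short model labels and the helper `best_per_metric` are assumptions made for illustration. The camembert-thai-base-upos scores are not listed in this article, so that model is left out.

```python
# F1 scores copied from the four tables above (model names shortened).
F1_SCORES = {
    "ud-square":     {"LAS": 59.81, "MLAS": 47.90, "BLEX": 55.48},
    "ud-triangular": {"LAS": 60.84, "MLAS": 48.82, "BLEX": 56.61},
    "ud-embeds":     {"LAS": 60.90, "MLAS": 48.85, "BLEX": 56.58},
    "upos":          {"LAS": 58.88, "MLAS": 42.64, "BLEX": 54.10},
}

def best_per_metric(table):
    """Return {metric: name of the model with the highest score}."""
    metrics = next(iter(table.values())).keys()
    return {m: max(table, key=lambda name: table[name][m]) for m in metrics}

if __name__ == "__main__":
    for metric, name in best_per_metric(F1_SCORES).items():
        print(f"{metric}: {name} ({F1_SCORES[name][metric]:.2f})")
```

The differences among the three ud-* models are within about one point on each metric, so the ranking should be read with that in mind.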
