Release of TinySwallow-1.5B-ud-embeds, a NINJAL Long-Unit-Word POS-tagging and dependency-parsing model

I applied the idea from my February 14 article to TinySwallow-1.5B and built TinySwallow-1.5B-ud-embeds, a prototype NINJAL Long-Unit-Word POS-tagging and dependency-parsing model. As a first check, let's see what the single-character hiragana tokenizer buys us in UPOS tagging. Before the tokenizer refinement, the output looks like this.

>>> from transformers import pipeline
>>> nlp=pipeline("token-classification","KoichiYasuoka/TinySwallow-1.5B-ud-embeds",aggregation_strategy="simple")
>>> print(nlp("国境の長いトンネルを抜けると雪国であった。"))
[{'entity_group': 'NOUN', 'score': 0.9213227, 'word': '国境', 'start': 0, 'end': 2}, {'entity_group': 'ADP.', 'score': 0.9983113, 'word': 'の', 'start': 2, 'end': 3}, {'entity_group': 'ADJ', 'score': 0.7436655, 'word': '長い', 'start': 3, 'end': 5}, {'entity_group': 'NOUN', 'score': 0.98790246, 'word': 'トンネル', 'start': 5, 'end': 9}, {'entity_group': 'ADP.', 'score': 0.9931259, 'word': 'を', 'start': 9, 'end': 10}, {'entity_group': 'VERB', 'score': 0.9929379, 'word': '抜ける', 'start': 10, 'end': 13}, {'entity_group': 'SCONJ.', 'score': 0.97754556, 'word': 'と', 'start': 13, 'end': 14}, {'entity_group': 'NOUN', 'score': 0.67460227, 'word': '雪国', 'start': 14, 'end': 16}, {'entity_group': 'AUX.', 'score': 0.79544526, 'word': 'であった', 'start': 16, 'end': 20}, {'entity_group': 'PUNCT.', 'score': 0.99999905, 'word': '。', 'start': 20, 'end': 21}]
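
Note that「であった」(positions 16 to 20) comes out as a single AUX span here: the stock tokenizer can merge those hiragana into multi-character subwords, so the token classifier never gets a chance to place a boundary after「であっ」. One can inspect the subword segmentation directly; this is a minimal sketch, assuming the original SakanaAI/TinySwallow-1.5B tokenizer is available from the Hub.

>>> from transformers import AutoTokenizer
>>> tkz=AutoTokenizer.from_pretrained("SakanaAI/TinySwallow-1.5B")
>>> # any subword spanning the boundary inside「であった」prevents the
>>> # tagger from splitting AUX「であっ」from AUX「た」
>>> print(tkz.tokenize("国境の長いトンネルを抜けると雪国であった。"))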

After refining the tokenizer, it looks like this.

>>> from transformers import pipeline
>>> nlp=pipeline("upos","KoichiYasuoka/TinySwallow-1.5B-ud-embeds",aggregation_strategy="simple",trust_remote_code=True)
>>> print(nlp("国境の長いトンネルを抜けると雪国であった。"))
[{'start': 0, 'end': 2, 'score': 0.85084575, 'entity_group': 'NOUN', 'text': '国境'}, {'start': 2, 'end': 3, 'score': 0.9983113, 'entity_group': 'ADP.', 'text': 'の'}, {'start': 3, 'end': 5, 'score': 0.028116763, 'entity_group': 'ADJ', 'text': '長い'}, {'start': 5, 'end': 9, 'score': 0.9939516, 'entity_group': 'NOUN', 'text': 'トンネル'}, {'start': 9, 'end': 10, 'score': 0.99305433, 'entity_group': 'ADP.', 'text': 'を'}, {'start': 10, 'end': 13, 'score': 0.9814558, 'entity_group': 'VERB', 'text': '抜ける'}, {'start': 13, 'end': 14, 'score': 0.985485, 'entity_group': 'SCONJ.', 'text': 'と'}, {'start': 14, 'end': 16, 'score': 0.15250096, 'entity_group': 'NOUN', 'text': '雪国'}, {'start': 16, 'end': 19, 'score': 0.0011507535, 'entity_group': 'AUX.', 'text': 'であっ'}, {'start': 19, 'end': 20, 'score': 0.99542135, 'entity_group': 'AUX.', 'text': 'た'}, {'start': 20, 'end': 21, 'score': 0.99996924, 'entity_group': 'PUNCT.', 'text': '。'}]
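
The refined tokenizer, by contrast, is meant to treat each hiragana as a single token, which is exactly what allows a tag boundary to fall between「であっ」and「た」. A quick way to verify this (a sketch, reusing the nlp pipeline loaded above):

>>> # with the single-character hiragana tokenizer, で/あ/っ/た should each
>>> # be its own token, so the AUX boundary after「であっ」is representable
>>> print(nlp.tokenizer.tokenize("であった"))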

「であっ」and「た」are now tokenized correctly. Perfect. Following my February 9 article, let's also compare dependency-parsing accuracy. On Google Colaboratory, it goes like this.

!pip install transformers
mdl="KoichiYasuoka/TinySwallow-1.5B-ud-embeds"
org="SakanaAI/TinySwallow-1.5B"
import os,sys,subprocess
from transformers import pipeline,AutoTokenizer
# fetch the UD_Japanese-GSDLUW test set (clone only if not already present)
url="https://github.com/UniversalDependencies/UD_Japanese-GSDLUW"
f=os.path.join(os.path.basename(url),"ja_gsdluw-ud-test.conllu")
os.system(f"test -f {f} || git clone --depth=1 {url}")
# fetch the official CoNLL 2018 shared-task evaluation script
url="https://universaldependencies.org/conll18/conll18_ud_eval.py"
c=os.path.basename(url)
os.system(f"test -f {c} || curl -LO {url}")
# pull the raw sentence texts out of the gold-standard CoNLL-U file
with open(f,"r",encoding="utf-8") as r:
  s=[t[8:].strip() for t in r if t.startswith("# text =")]
# parse the test set twice: once with the original TinySwallow-1.5B
# tokenizer swapped in, once with the refined tokenizer in the model repo
for tkz in ["original","refined"]:
  nlp=pipeline("universal-dependencies",mdl,trust_remote_code=True)
  if tkz=="original":
    nlp.tokenizer=AutoTokenizer.from_pretrained(org)
  with open("result.conllu","w",encoding="utf-8") as w:
    for t in s:
      w.write(nlp(t))
  # score the parses against the gold standard and save the report
  p=subprocess.run([sys.executable,c,"-v",f,"result.conllu"],
    encoding="utf-8",stdout=subprocess.PIPE,stderr=subprocess.STDOUT)
  os.system(f"mkdir -p result/{mdl}")
  with open(f"result/{mdl}/{tkz}.txt","w",encoding="utf-8") as w:
    print(f"\n*** {mdl} ({tkz} tokenizer)",p.stdout,sep="\n",file=w)
!cat result/{mdl}/*.txt
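
For reference, the "universal-dependencies" pipeline returns each parse as a CoNLL-U formatted string, which is why the script above can write nlp(t) straight into result.conllu. A minimal single-sentence sketch:

from transformers import pipeline
nlp=pipeline("universal-dependencies","KoichiYasuoka/TinySwallow-1.5B-ud-embeds",trust_remote_code=True)
# prints one CoNLL-U sentence block (ID, FORM, UPOS, HEAD, DEPREL, ...)
print(nlp("国境の長いトンネルを抜けると雪国であった。"))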

On my (Koichi Yasuoka's) machine, this produced the following results.

*** KoichiYasuoka/TinySwallow-1.5B-ud-embeds (original tokenizer)
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     75.59 |     61.57 |     67.86 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     75.59 |     61.57 |     67.86 |
UPOS       |     73.43 |     59.82 |     65.93 |     97.15
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     75.59 |     61.57 |     67.86 |    100.00
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     47.91 |     39.03 |     43.02 |     63.39
LAS        |     47.20 |     38.45 |     42.38 |     62.45
CLAS       |     34.99 |     31.91 |     33.38 |     49.30
MLAS       |     31.04 |     28.31 |     29.61 |     43.74
BLEX       |      0.00 |      0.00 |      0.00 |      0.00

*** KoichiYasuoka/TinySwallow-1.5B-ud-embeds (refined tokenizer)
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     95.39 |     95.61 |     95.50 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     95.39 |     95.61 |     95.50 |
UPOS       |     93.57 |     93.79 |     93.68 |     98.09
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     95.37 |     95.59 |     95.48 |     99.98
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     86.83 |     87.03 |     86.93 |     91.02
LAS        |     86.10 |     86.30 |     86.20 |     90.26
CLAS       |     78.89 |     80.01 |     79.45 |     84.92
MLAS       |     75.70 |     76.78 |     76.24 |     81.49
BLEX       |      0.00 |      0.00 |      0.00 |      0.00

Comparing UPOS/LAS/MLAS, the scores are 65.93/42.38/29.61 before the tokenizer refinement and 93.68/86.20/76.24 after. (XPOS, Lemmas, and the metrics that depend on them, AllTags and BLEX, come out as 0.00 presumably because the model leaves those CoNLL-U fields unfilled.) Still, TinySwallow-1.5B is a language model that delivers a 32,768-token context width with only 1.5 billion parameters, so I do wish the parsing accuracy were a bit higher.
