>>> from transformers import AutoTokenizer
>>> tkz=AutoTokenizer.from_pretrained("llm-jp/llm-jp-3-150m")
>>> print(tkz.convert_ids_to_tokens(tkz("国境の長いトンネルを抜けると雪国であった。","夜の底が白くなった。")["input_ids"]))
['<s>', '▁国境', 'の', '長い', 'トンネル', 'を', '抜ける', 'と', '雪', '国', 'であった。', '<s>', '▁夜', 'の', '底', 'が', '白', 'くなった', '。']


>>> from transformers import AutoTokenizer
>>> from tokenizers import Regex
>>> from tokenizers.pre_tokenizers import Sequence,Split,Whitespace,Punctuation
>>> tkz=AutoTokenizer.from_pretrained("llm-jp/llm-jp-3-150m")
>>> tkz.backend_tokenizer.normalizer=None
>>> tkz.backend_tokenizer.pre_tokenizer=Sequence([Split(Regex("[ぁ-ん]"),"isolated"),Whitespace(),Punctuation()])
>>> print(tkz.convert_ids_to_tokens(tkz("国境の長いトンネルを抜けると雪国であった。","夜の底が白くなった。")["input_ids"]))
['<s>', '国境', 'の', '長', 'い', 'トンネル', 'を', '抜', 'け', 'る', 'と', '雪', '国', 'で', 'あ', 'っ', 'た', '。', '<s>', '夜', 'の', '底', 'が', '白', 'く', 'な', 'っ', 'た', '。']
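
To see what the `Split(Regex("[ぁ-ん]"),"isolated")` step contributes on its own, here is a minimal sketch that approximates it with the standard-library `re` module instead of the `tokenizers` library. The helper name `isolate_hiragana` is hypothetical, and the sketch deliberately leaves out the `Whitespace()` and `Punctuation()` steps, so treat it as an illustration of the idea rather than a reimplementation of the pre-tokenizer.

```python
import re

# Same character class as the pre-tokenizer above: U+3041 (ぁ) through U+3093 (ん).
HIRAGANA = re.compile("([ぁ-ん])")

def isolate_hiragana(text):
    """Split text so that every hiragana character becomes its own piece.

    Hypothetical helper: a standard-library approximation of
    Split(Regex("[ぁ-ん]"), "isolated"). Whitespace() and Punctuation()
    are not reproduced here.
    """
    # The capturing group makes re.split keep the matched characters;
    # empty strings produced between adjacent matches are dropped.
    return [piece for piece in HIRAGANA.split(text) if piece]

print(isolate_hiragana("国境の長いトンネルを抜けると雪国であった。"))
# ['国境', 'の', '長', 'い', 'トンネル', 'を', '抜', 'け', 'る', 'と', '雪国', 'で', 'あ', 'っ', 'た', '。']
```

Note that 雪国 is still a single piece at this stage. The further split into 雪 and 国 in the tokenizer output above presumably comes from the model's subword vocabulary, not from the pre-tokenizer.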

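Splitting hiragana into single characters does feel a little too fine-grained, but for now I refined the tokenizer this way and built prototype models llm-jp-3-{150m,440m,980m,1.8b}-ud-embeds for NINJAL long-unit-word (国語研長単位) POS tagging and dependency parsing. Following the article from the day before yesterday, let's compare their dependency-parsing accuracy. On Google Colaboratory, it goes like this.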

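!pip install transformers
import os,sys,subprocess
from transformers import pipeline
# NOTE: every line marked "assumed" is a reconstruction that the text above does not confirm
url="https://github.com/UniversalDependencies/UD_Japanese-GSDLUW"  # assumed: gold treebank repository
f=os.path.join(os.path.basename(url),"ja_gsdluw-ud-test.conllu")  # assumed: gold test file
os.system(f"test -f {f} || git clone --depth=1 {url}")
url="https://universaldependencies.org/conll18/conll18_ud_eval.py"  # assumed: evaluation script
c=os.path.basename(url)  # assumed
os.system(f"test -f {c} || curl -LO {url}")
# collect the raw sentences from the "# text =" comment lines of the gold file
with open(f,"r",encoding="utf-8") as r:
  s=[t[8:].strip() for t in r if t.startswith("# text =")]
mdl="KoichiYasuoka/llm-jp-3-{}-ud-embeds"  # pattern taken from the result headers below
rst=[]  # paths of the per-run result files
for z in ["150m","440m","980m","1.8b"]:
  for tkz in ["original","refined"]:
    nlp=pipeline("universal-dependencies",mdl.format(z),trust_remote_code=True)  # assumed
    if tkz=="original":
      from transformers import AutoTokenizer
      from tokenizers.normalizers import Replace
      nlp.tokenizer=AutoTokenizer.from_pretrained("llm-jp/llm-jp-3-"+z)  # assumed: swap in the stock tokenizer
      nlp.tokenizer.backend_tokenizer.normalizer=Replace(" ","▁")
    with open("result.conllu","w",encoding="utf-8") as w:
      for t in s:
        w.write(nlp(t))  # assumed: the pipeline returns a CoNLL-U string
    p=subprocess.run([sys.executable,c,"-v",f,"result.conllu"],encoding="utf-8",stdout=subprocess.PIPE,stderr=subprocess.STDOUT)  # assumed
    os.system("mkdir -p "+os.path.join("result",mdl.format(z)))
    rst.append(os.path.join("result",mdl.format(z),tkz+".txt"))  # assumed file name
    with open(rst[-1],"w",encoding="utf-8") as w:
      print(f"\n*** {mdl.format(z)} ({tkz} tokenizer)",p.stdout,sep="\n",file=w)
!cat {" ".join(rst)}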


*** KoichiYasuoka/llm-jp-3-150m-ud-embeds (original tokenizer)
Metric     | Precision |    Recall |  F1 Score | AligndAcc
Tokens     |     71.64 |     54.32 |     61.79 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     71.64 |     54.32 |     61.79 |
UPOS       |     69.66 |     52.81 |     60.07 |     97.23
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     71.64 |     54.32 |     61.79 |    100.00
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     47.63 |     36.11 |     41.08 |     66.49
LAS        |     46.85 |     35.52 |     40.41 |     65.40
CLAS       |     36.98 |     32.70 |     34.71 |     49.87
MLAS       |     30.73 |     27.17 |     28.84 |     41.43
BLEX       |      0.00 |      0.00 |      0.00 |      0.00

*** KoichiYasuoka/llm-jp-3-150m-ud-embeds (refined tokenizer)
Metric     | Precision |    Recall |  F1 Score | AligndAcc
Tokens     |     95.08 |     95.82 |     95.45 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     95.08 |     95.82 |     95.45 |
UPOS       |     93.44 |     94.17 |     93.81 |     98.28
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     95.05 |     95.79 |     95.42 |     99.97
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     84.29 |     84.94 |     84.62 |     88.65
LAS        |     83.47 |     84.12 |     83.79 |     87.79
CLAS       |     75.88 |     77.03 |     76.45 |     81.62
MLAS       |     72.83 |     73.93 |     73.37 |     78.33
BLEX       |      0.00 |      0.00 |      0.00 |      0.00

*** KoichiYasuoka/llm-jp-3-440m-ud-embeds (original tokenizer)
Metric     | Precision |    Recall |  F1 Score | AligndAcc
Tokens     |     71.12 |     53.89 |     61.32 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     71.12 |     53.89 |     61.32 |
UPOS       |     69.22 |     52.45 |     59.68 |     97.33
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     71.11 |     53.88 |     61.31 |     99.98
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     45.75 |     34.67 |     39.44 |     64.32
LAS        |     45.09 |     34.17 |     38.88 |     63.40
CLAS       |     35.03 |     31.83 |     33.35 |     49.02
MLAS       |     29.30 |     26.62 |     27.89 |     41.00
BLEX       |      0.00 |      0.00 |      0.00 |      0.00

*** KoichiYasuoka/llm-jp-3-440m-ud-embeds (refined tokenizer)
Metric     | Precision |    Recall |  F1 Score | AligndAcc
Tokens     |     95.57 |     96.07 |     95.82 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     95.57 |     96.07 |     95.82 |
UPOS       |     93.90 |     94.39 |     94.15 |     98.25
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     95.57 |     96.07 |     95.82 |    100.00
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     83.76 |     84.20 |     83.98 |     87.64
LAS        |     82.97 |     83.40 |     83.19 |     86.81
CLAS       |     77.57 |     79.38 |     78.46 |     83.99
MLAS       |     74.59 |     76.32 |     75.45 |     80.76
BLEX       |      0.00 |      0.00 |      0.00 |      0.00

*** KoichiYasuoka/llm-jp-3-980m-ud-embeds (original tokenizer)
Metric     | Precision |    Recall |  F1 Score | AligndAcc
Tokens     |     72.53 |     54.93 |     62.52 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     72.53 |     54.93 |     62.52 |
UPOS       |     70.84 |     53.64 |     61.05 |     97.66
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     72.52 |     54.92 |     62.50 |     99.98
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     48.47 |     36.71 |     41.78 |     66.83
LAS        |     47.71 |     36.13 |     41.12 |     65.78
CLAS       |     36.87 |     33.10 |     34.88 |     49.74
MLAS       |     31.12 |     27.94 |     29.44 |     41.98
BLEX       |      0.00 |      0.00 |      0.00 |      0.00

*** KoichiYasuoka/llm-jp-3-980m-ud-embeds (refined tokenizer)
Metric     | Precision |    Recall |  F1 Score | AligndAcc
Tokens     |     95.70 |     96.14 |     95.92 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     95.70 |     96.14 |     95.92 |
UPOS       |     94.34 |     94.78 |     94.56 |     98.58
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     95.67 |     96.12 |     95.89 |     99.97
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     85.80 |     86.20 |     86.00 |     89.66
LAS        |     85.01 |     85.40 |     85.20 |     88.83
CLAS       |     78.39 |     79.53 |     78.96 |     83.70
MLAS       |     75.75 |     76.85 |     76.30 |     80.88
BLEX       |      0.00 |      0.00 |      0.00 |      0.00

*** KoichiYasuoka/llm-jp-3-1.8b-ud-embeds (original tokenizer)
Metric     | Precision |    Recall |  F1 Score | AligndAcc
Tokens     |     72.85 |     54.10 |     62.09 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     72.85 |     54.10 |     62.09 |
UPOS       |     71.00 |     52.73 |     60.52 |     97.47
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     72.83 |     54.09 |     62.08 |     99.98
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     47.60 |     35.36 |     40.58 |     65.35
LAS        |     46.89 |     34.83 |     39.97 |     64.37
CLAS       |     35.28 |     31.06 |     33.03 |     47.83
MLAS       |     29.92 |     26.33 |     28.01 |     40.56
BLEX       |      0.00 |      0.00 |      0.00 |      0.00

*** KoichiYasuoka/llm-jp-3-1.8b-ud-embeds (refined tokenizer)
Metric     | Precision |    Recall |  F1 Score | AligndAcc
Tokens     |     95.74 |     96.14 |     95.94 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     95.74 |     96.14 |     95.94 |
UPOS       |     94.28 |     94.68 |     94.48 |     98.47
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     95.72 |     96.13 |     95.92 |     99.98
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     86.74 |     87.10 |     86.92 |     90.59
LAS        |     85.96 |     86.33 |     86.14 |     89.79
CLAS       |     79.07 |     79.73 |     79.40 |     84.05
MLAS       |     76.39 |     77.03 |     76.71 |     81.20
BLEX       |      0.00 |      0.00 |      0.00 |      0.00
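
As a quick sanity check on how the columns relate, the F1 Score column can be recomputed from Precision and Recall. The sketch below assumes the usual definition of F1 as the harmonic mean of the two. Because the evaluator presumably works from raw counts, recomputing from the rounded percentages may occasionally differ in the last digit. The function name `f1_score` is just for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Tokens row of llm-jp-3-150m-ud-embeds with the original tokenizer
print(f"{f1_score(71.64, 54.32):.2f}")  # 61.79
# Tokens row of the same model with the refined tokenizer
print(f"{f1_score(95.08, 95.82):.2f}")  # 95.45
```

With the original tokenizer, precision is well above recall on the Tokens row, which means the system produces noticeably fewer tokens than the gold standard contains. With the refined tokenizer the two are nearly balanced.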


Model                   | Before tokenizer refinement | After tokenizer refinement
                        | UPOS/LAS/MLAS (F1)          | UPOS/LAS/MLAS (F1)
llm-jp-3-150m-ud-embeds | 60.07/40.41/28.84           | 93.81/83.79/73.37
llm-jp-3-440m-ud-embeds | 59.68/38.88/27.89           | 94.15/83.19/75.45
llm-jp-3-980m-ud-embeds | 61.05/41.12/29.44           | 94.56/85.20/76.30
llm-jp-3-1.8b-ud-embeds | 60.52/39.97/28.01           | 94.48/86.14/76.71
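
The UPOS/LAS/MLAS triples in this summary are the F1 Score entries of the evaluation tables printed earlier. A minimal sketch of pulling them out of such a table follows. The function name `parse_eval_table` and the `sample` string are hypothetical; the sample rows are copied from the llm-jp-3-150m-ud-embeds (refined tokenizer) result.

```python
def parse_eval_table(text):
    """Map each metric name to its F1 score from a pipe-separated evaluation table."""
    scores = {}
    for line in text.splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        # Keep only 5-column data rows; skip the header row and any other text.
        if len(cells) != 5 or cells[0] == "Metric":
            continue
        scores[cells[0]] = float(cells[3])  # fourth column is "F1 Score"
    return scores

sample = """\
Metric     | Precision |    Recall |  F1 Score | AligndAcc
UPOS       |     93.44 |     94.17 |     93.81 |     98.28
LAS        |     83.47 |     84.12 |     83.79 |     87.79
MLAS       |     72.83 |     73.93 |     73.37 |     78.33
"""

scores = parse_eval_table(sample)
print("/".join(f"{scores[m]:.2f}" for m in ("UPOS", "LAS", "MLAS")))
# 93.81/83.79/73.37
```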


