What does ModernBERT-base-Turkish-uncased-mlm fill into the [MASK] of "İyi [MASK] sözünün üstüne gelir"?

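I came across a Turkish model called ModernBERT-base-Turkish-uncased-mlm and decided to try it out. In Turkish, however, the lowercase form of "I" is "ı", whereas BertNormalizer with lowercase=True turns "I" into "i", so I tweaked the tokenizer slightly. On Google Colaboratory, it looks like this.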

!pip install transformers
from transformers import pipeline
from tokenizers.normalizers import Sequence,Replace,BertNormalizer
fmp=pipeline("fill-mask","99eren99/ModernBERT-base-Turkish-uncased-mlm")
# Map the Turkish capitals explicitly ("İ" to "i", "I" to "ı") before the generic lowercasing step
fmp.tokenizer.backend_tokenizer.normalizer=Sequence([Replace("İ","i"),Replace("I","ı"),BertNormalizer(lowercase=True,strip_accents=False)])
# Ask the model for its top candidates for [MASK]
print(fmp("İyi [MASK] sözünün üstüne gelir"))
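The effect of the two Replace steps can be illustrated in plain Python, without loading the model. This is only a sketch: `naive_lower` and `turkish_lower` are hypothetical helper names, and Python's `str.lower()` stands in for the generic lowercasing step rather than reproducing the tokenizer's actual implementation.

```python
def naive_lower(text: str) -> str:
    # Language-agnostic lowercasing: "I" becomes "i",
    # and "İ" becomes "i" followed by a combining dot (U+0307).
    return text.lower()


def turkish_lower(text: str) -> str:
    # Handle the two Turkish capitals first, then lowercase everything else.
    return text.replace("İ", "i").replace("I", "ı").lower()


if __name__ == "__main__":
    for word in ["IŞIK", "İyi"]:
        # repr() makes the stray combining dot visible in the naive result
        print(word, repr(naive_lower(word)), repr(turkish_lower(word)))
```

With the Turkish-aware order, "IŞIK" becomes "ışık" and "İyi" becomes a clean "iyi", which is the form an uncased Turkish vocabulary would be expected to contain.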

On my (Koichi Yasuoka's) machine, the fill-mask call returned the following result.

[{'score': 0.10560256242752075, 'token': 1993, 'token_str': 'bir', 'sequence': 'iyi bir sözünün üstüne gelir'}, {'score': 0.10515685379505157, 'token': 2997, 'token_str': 'kişi', 'sequence': 'iyi kişi sözünün üstüne gelir'}, {'score': 0.03464006632566452, 'token': 2419, 'token_str': 'insan', 'sequence': 'iyi insan sözünün üstüne gelir'}, {'score': 0.03446153923869133, 'token': 6, 'token_str': '"', 'sequence': 'iyi " sözünün üstüne gelir'}, {'score': 0.03443867713212967, 'token': 2156, 'token_str': 'her', 'sequence': 'iyi her sözünün üstüne gelir'}]

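The model fills the [MASK] in "İyi [MASK] sözünün üstüne gelir" with "bir", "kişi", "insan", '"', and "her". "insan" does appear in third place, but the '"' in fourth place is a mystery. The result looks reasonably good, but I wonder whether the model can be used for Turkish dependency parsing.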
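If candidates such as the stray '"' are unwanted, one possible approach is to post-filter the pipeline output. The sketch below is not part of the original experiment: `keep_word_candidates` is a hypothetical helper, the `predictions` list is copied (scores and token strings only) from the result above, and `str.isalpha()` is a crude heuristic for "looks like a word".

```python
# Scores and token strings copied from the fill-mask result shown earlier
predictions = [
    {"score": 0.10560256242752075, "token_str": "bir"},
    {"score": 0.10515685379505157, "token_str": "kişi"},
    {"score": 0.03464006632566452, "token_str": "insan"},
    {"score": 0.03446153923869133, "token_str": '"'},
    {"score": 0.03443867713212967, "token_str": "her"},
]


def keep_word_candidates(preds):
    # Keep only candidates whose token consists entirely of letters,
    # which drops punctuation such as the quotation mark.
    return [p for p in preds if p["token_str"].strip().isalpha()]


if __name__ == "__main__":
    for p in keep_word_candidates(predictions):
        print(f'{p["token_str"]}\t{p["score"]:.4f}')
```

This only hides the punctuation candidate after the fact; it does not change how the model scores the tokens.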
