ModernBERT was announced on December 19, so I decided to give it a try. Support in transformers, however, has been pushed back to v4.48 or later, so getting it to run on the current transformers v4.47.1 takes the help of trust_remote_code=True. On Google Colaboratory, it goes something like this.
!pip install transformers triton
!test -d ModernBERT-base || git clone --depth=1 https://huggingface.co/answerdotai/ModernBERT-base
!test -f ModernBERT-base/configuration_modernbert.py || ( curl -L https://github.com/huggingface/transformers/raw/refs/heads/main/src/transformers/models/modernbert/configuration_modernbert.py | sed 's/^from \.\.\./from transformers./' > ModernBERT-base/configuration_modernbert.py )
!test -f ModernBERT-base/modeling_modernbert.py || ( curl -L https://github.com/huggingface/transformers/raw/refs/heads/main/src/transformers/models/modernbert/modeling_modernbert.py | sed -e 's/^from \.\.\./from transformers./' -e 's/^from .* import is_triton_available/import importlib\nis_triton_available = lambda: importlib.util.find_spec("triton") is not None/' > ModernBERT-base/modeling_modernbert.py )
import json
with open("ModernBERT-base/config.json","r",encoding="utf-8") as r:
  d=json.load(r)
if "auto_map" not in d:
  d["auto_map"]={
    "AutoConfig":"configuration_modernbert.ModernBertConfig",
    "AutoModel":"modeling_modernbert.ModernBertModel",
    "AutoModelForMaskedLM":"modeling_modernbert.ModernBertForMaskedLM",
    "AutoModelForSequenceClassification":"modeling_modernbert.ModernBertForSequenceClassification",
    "AutoModelForTokenClassification":"modeling_modernbert.ModernBertForTokenClassification"
  }
  with open("ModernBERT-base/config.json","w",encoding="utf-8") as w:
    json.dump(d,w,indent=2)
from transformers import AutoTokenizer,AutoModelForMaskedLM,FillMaskPipeline
tkz=AutoTokenizer.from_pretrained("ModernBERT-base")
mdl=AutoModelForMaskedLM.from_pretrained("ModernBERT-base",trust_remote_code=True)
fmp=FillMaskPipeline(tokenizer=tkz,model=mdl)
print(fmp("It don't [MASK] a thing if it ain't got that swing"))
I tried what ModernBERT-base fills in for the [MASK] in "It don't [MASK] a thing if it ain't got that swing", and on my end I (Koichi Yasuoka) got the following result.
[{'score': 0.8329909443855286, 'token': 1599, 'token_str': ' mean', 'sequence': "It don't mean a thing if it ain't got that swing"}, {'score': 0.026023954153060913, 'token': 513, 'token_str': ' do', 'sequence': "It don't do a thing if it ain't got that swing"}, {'score': 0.020469805225729942, 'token': 2647, 'token_str': ' matter', 'sequence': "It don't matter a thing if it ain't got that swing"}, {'score': 0.009303263388574123, 'token': 320, 'token_str': ' be', 'sequence': "It don't be a thing if it ain't got that swing"}, {'score': 0.008411774411797523, 'token': 1056, 'token_str': ' make', 'sequence': "It don't make a thing if it ain't got that swing"}]
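FillMaskPipeline returns its candidates as a list of dicts already sorted by score; if one wanted the top fill programmatically, a trivial sketch over the structure shown above (list truncated to two entries for brevity) would be:

```python
# Candidate list in the shape FillMaskPipeline returns, truncated to two entries
results=[{"score":0.8329909443855286,"token":1599,"token_str":" mean"},
         {"score":0.026023954153060913,"token":513,"token_str":" do"}]
# Pick the highest-scoring candidate (the list is already score-sorted,
# but max() keeps this robust if that ever changes)
top=max(results,key=lambda r:r["score"])
print(top["token_str"])  # prints " mean"
```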
With " mean" at 83%, far ahead of everything else, ModernBERT-base apparently knows this line. Note, though, that " mean" comes with a leading space: in this respect, ModernBERT's tokenizer is more GPT-style than BERT-style. Hmm, POS tagging and dependency parsing could be a real chore against this tokenizer.
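To illustrate the headache: GPT-2-style byte-level BPE marks a word-initial space with "Ġ" instead of marking continuations with "##" as BERT's WordPiece does, so before assigning one POS tag per word you first have to glue subtokens back into words. A minimal sketch of that bookkeeping (the token sequence below is illustrative, not actual ModernBERT output):

```python
def merge_bpe_tokens(tokens):
    """Merge GPT-2-style byte-level BPE tokens back into whitespace-separated
    words: a leading 'Ġ' encodes a space, so it starts a new word."""
    words=[]
    for t in tokens:
        if t.startswith("Ġ") or not words:
            words.append(t.lstrip("Ġ"))
        else:
            words[-1]+=t  # continuation token: append to the current word
    return words

# Illustrative subtoken sequence
print(merge_bpe_tokens(["It","Ġdon","'t","Ġmean","Ġa","Ġthing"]))
# prints ['It', "don't", 'mean', 'a', 'thing']
```

Fast tokenizers also expose offset mappings (return_offsets_mapping=True), which would make this alignment less fragile, but either way it is extra plumbing that a WordPiece-style tokenizer spares you.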