
Trying out LUKE, a language model that achieved SOTA (Python, transformers, named entity recognition)

Posted at 2022-10-30, last updated at 2022-10-30

Hello nyan.
I'm Mizuiro Sakura (水色桜).
This time I'm going to try out LUKE (Language Understanding with Knowledge-based Embeddings), developed by Studio Ousia.

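LUKE

LUKE is a new pretrained language model introduced in a paper from Studio Ousia. It produces deeply contextualized representations of both words and entities (here, "entity" means a named entity), and it uses an entity-aware self-attention mechanism. A Japanese version of LUKE was released free of charge on October 27, 2022. This article uses the English version, because I could not get the Japanese version to work properly. The English LUKE reported state-of-the-art results on five benchmarks: SQuAD v1.1 (extractive question answering), CoNLL-2003 (named entity recognition), ReCoRD (cloze-style question answering), TACRED (relation classification), and Open Entity (entity typing). The Japanese version is reported to outperform existing Japanese models on four Japanese datasets included in the JGLUE benchmark.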
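
To make "entity-aware self-attention" a little more concrete, here is how I understand the idea from the LUKE paper, written as a formula. This is a sketch from memory, not a quotation: the symbols x_i, K, d, and the four query matrices are notation I am assuming here. The attention score between a query token x_i and a key token x_j uses a different query matrix depending on whether each token is a word or an entity.

```latex
e_{ij} = \frac{(K x_j)^{\top} (Q_{ij}\, x_i)}{\sqrt{d}},
\qquad
Q_{ij} =
\begin{cases}
Q        & \text{if } x_i \text{ and } x_j \text{ are both words} \\
Q_{w2e}  & \text{if } x_i \text{ is a word and } x_j \text{ is an entity} \\
Q_{e2w}  & \text{if } x_i \text{ is an entity and } x_j \text{ is a word} \\
Q_{e2e}  & \text{if } x_i \text{ and } x_j \text{ are both entities}
\end{cases}
```

The attention weights are then the softmax of e_{ij} over j, as in an ordinary Transformer. The key and value projections are shared, so the only change is which query matrix is used.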

The program I tried

I used a program distributed by the official project. It classifies the relation between two named entities. I have added my own explanatory comments to it.

entity_recog.py

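from transformers import LukeForEntityPairClassification, LukeTokenizer

# Load the LUKE model fine-tuned on TACRED (relation classification)
model = LukeForEntityPairClassification.from_pretrained('studio-ousia/luke-large-finetuned-tacred')
# Load the matching tokenizer (splits the text into subword tokens and marks the entity spans)
tokenizer = LukeTokenizer.from_pretrained('studio-ousia/luke-large-finetuned-tacred')

# Character offsets (start, end) of the two entities whose relation is classified.
# Note: the end offset is exclusive, so 'Taro' is (0, 4) and 'Keio university' is (15, 30).
# The values below are the ones used in the run whose output is shown after this script.
entity_spans = [(0, 3), (15, 29)]
text = 'Taro belong to Keio university'  # sentence to analyze
inputs = tokenizer(text, entity_spans=entity_spans, return_tensors="pt")  # tokenize the text into PyTorch tensors
outputs = model(**inputs)  # run the model

logits = outputs.logits  # raw (unnormalized) scores, one per relation label
predicted_class_idx = int(logits[0].argmax())  # index of the highest-scoring label

print("Predicted class:", model.config.id2label[predicted_class_idx])  # print the relation label for that index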
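
Counting character offsets by hand is error-prone, so here is a small helper that derives them from the entity strings. It is only a sketch: `find_span` is a hypothetical name of my own, not part of transformers, and it assumes (as I read the Hugging Face documentation) that `entity_spans` takes character-based (start, end) tuples with an exclusive end.

```python
from typing import Tuple


def find_span(text: str, mention: str, start: int = 0) -> Tuple[int, int]:
    """Return (start, end) character offsets of the first occurrence of mention.

    The end offset is exclusive, so text[start:end] == mention.
    """
    begin = text.find(mention, start)
    if begin == -1:
        raise ValueError(f"{mention!r} not found in {text!r}")
    return begin, begin + len(mention)


text = "Taro belong to Keio university"
entity_spans = [find_span(text, "Taro"), find_span(text, "Keio university")]

print(entity_spans)  # [(0, 4), (15, 30)]
for start, end in entity_spans:
    # Each slice should reproduce the entity string exactly.
    print(repr(text[start:end]))
```

These spans differ by one character at the end from the hand-written `(0, 3)` and `(15, 29)` in the script, so the predicted label may not be the same if you swap them in.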

Running entity_recog.py prints:

Predicted class: org:parents

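In TACRED, org:parents is the label for an organization's parent organization, so the model has recognized that the two spans are linked by a belonging-type relation.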
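
The script only prints the single highest-scoring label. To see how confident the model is, and what the runner-up labels are, the logits can be passed through a softmax. The sketch below does this in plain Python. The logits and the label map are hypothetical values made up for illustration, not real model output, and `top_k_relations` is a name I invented.

```python
import math
from typing import Dict, List, Tuple


def top_k_relations(
    logits: List[float], id2label: Dict[int, str], k: int = 3
) -> List[Tuple[str, float]]:
    """Turn raw logits into softmax probabilities and return the k most likely labels."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Hypothetical logits and label map, for illustration only.
example_logits = [0.2, 2.5, 1.0, -0.7]
example_id2label = {
    0: "no_relation",
    1: "org:parents",
    2: "per:schools_attended",
    3: "per:employee_of",
}

for label, prob in top_k_relations(example_logits, example_id2label):
    print(f"{label}: {prob:.3f}")
```

With the real model, the same function should work on `outputs.logits[0].detach().tolist()` and `model.config.id2label`.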
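There are still few articles and explanations about LUKE, and not much material from the official side either, so I found it a little hard to get into.
If I get the Japanese version working, I plan to publish that as well.
Well then, bye nyan~.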

References

GitHub repository of Studio Ousia, the developers of LUKE
https://github.com/studio-ousia/luke
