Coreference Resolution
Coreference resolution is the task of finding two or more expressions in a text that refer to the same entity and linking them together.
(For a more detailed explanation, I found this article helpful: http://adsmedia.hatenablog.com/entry/2017/02/20/084846)
In this post, mostly as a note to myself, I'll walk through how to run coreference resolution using AllenNLP's pretrained coreference resolution model.
[An example of coreference resolution]
"The legal pressures facing Michael Cohen are growing in a wide-ranging investigation of his personal business affairs and his work on behalf of his former client, President Trump. In addition to his work for Mr. Trump, he pursued his own business interests, including ventures in real estate, personal loans and investments in taxi medallions."
(Official tutorial/demo: https://demo.allennlp.org/coreference-resolution/MTE2NTIwOQ==)
In the demo, the expressions highlighted in blue and in red each refer to the same entity.
## Installing AllenNLP
pip install allennlp
## Downloading the pretrained coreference resolution model
wget https://s3-us-west-2.amazonaws.com/allennlp/models/coref-model-2018.02.05.tar.gz
## Extracting the archive
tar -zxvf coref-model-2018.02.05.tar.gz
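Incidentally, Predictor.from_path can also be pointed at the .tar.gz archive itself, in which case AllenNLP unpacks it internally, so the extraction step may not be strictly required (this is my understanding of the archive-loading behavior, so treat it as an assumption rather than a guarantee):

from allennlp.predictors.predictor import Predictor

# Loading the downloaded archive directly instead of the extracted directory
# (assumption: load_archive handles .tar.gz paths transparently).
predictor = Predictor.from_path("coref-model-2018.02.05.tar.gz")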
## Source code
Assign the text you want to analyze to the variable doc1.
import pprint

from allennlp.predictors.predictor import Predictor


def my_coref(text, model):
    # Run the predictor and pull out the coreference clusters and the tokenized document
    d = model.predict(document=text)
    clusters = d['clusters']
    words = d['document']
    coref_phrase_list = []
    for cluster in clusters:
        # Each span is an inclusive [start, end] token offset; join the tokens back into a phrase
        l = [(' '.join(words[c[0]:c[1] + 1]), (c[0], c[1])) for c in cluster]
        coref_phrase_list.append(l)
    return coref_phrase_list


if __name__ == '__main__':
    predictor = Predictor.from_path("coref-model-2018.02.05")
    doc1 = """Paul Allen was born on January 21, 1953, in Seattle, Washington, to Kenneth Sam Allen and Edna Faye Allen. Allen attended Lakeside School, a private school in Seattle, where he befriended Bill Gates."""
    d = predictor.predict(document=doc1)
    pprint.pprint(d)
    pprint.pprint(my_coref(doc1, predictor))
## Results
# pprint.pprint(d)
# d is a dict
# d['clusters'][i]: a list of token-offset spans [start, end], one span per expression (word or phrase) that refers to the same entity
# d['document']: the list of tokens in the input text
# d['predicted_antecedents']: ??
# d['top_spans']: ??
{'clusters': [[[0, 1], [24, 24], [36, 36]], [[11, 11], [33, 33]]],
'document': ['Paul',
'Allen',
'was',
'born',
'on',
'January',
'21',
',',
'1953',
',',
'in',
'Seattle',
',',
'Washington',
',',
'to',
'Kenneth',
'Sam',
'Allen',
'and',
'Edna',
'Faye',
'Allen',
'.',
'Allen',
'attended',
'Lakeside',
'School',
',',
'a',
'private',
'school',
'in',
'Seattle',
',',
'where',
'he',
'befriended',
'Bill',
'Gates',
'.'],
'predicted_antecedents': [-1,
-1,
-1,
-1,
-1,
-1,
-1,
-1,
-1,
-1,
9,
-1,
8,
-1,
3,
-1],
'top_spans': [[0, 1],
[6, 6],
[8, 8],
[11, 11],
[11, 13],
[13, 13],
[16, 18],
[16, 22],
[17, 18],
[20, 22],
[24, 24],
[25, 25],
[33, 33],
[33, 39],
[36, 36],
[38, 39]]}
# pprint.pprint(my_coref(doc1, predictor))
[[('Paul Allen', (0, 1)), ('Allen', (24, 24)), ('he', (36, 36))],
[('Seattle', (11, 11)), ('Seattle', (33, 33))]]
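For reference, here is a minimal sketch of how the [start, end] offsets in d['clusters'] map back onto the tokens in d['document'] (this is essentially what my_coref() does internally); note that the end offset is inclusive:

# Reconstruct the surface form of each mention in the first cluster.
words = d['document']
first_cluster = d['clusters'][0]    # [[0, 1], [24, 24], [36, 36]]
mentions = [' '.join(words[start:end + 1]) for start, end in first_cluster]
print(mentions)                     # ['Paul Allen', 'Allen', 'he']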
## Summary
In this post, I used AllenNLP to try out coreference resolution with very little effort.
If you know what the values of 'predicted_antecedents' and 'top_spans' represent, I'd be glad to hear about it in the comments.
*my_coref() is a function I wrote myself.