
Coreference Resolution with AllenNLP


Coreference Resolution

Coreference resolution is the task of finding two or more expressions in a text that refer to the same entity and linking them together.
(For a more detailed explanation, this article is a good starting point: http://adsmedia.hatenablog.com/entry/2017/02/20/084846)

In this post, mostly as a note to my future self, I'll walk through running coreference resolution with AllenNLP's pretrained model.


[What coreference resolution looks like]
"The legal pressures facing Michael Cohen are growing in a wide-ranging investigation of his personal business affairs and his work on behalf of his former client, President Trump. In addition to his work for Mr. Trump, he pursued his own business interests, including ventures in real estate, personal loans and investments in taxi medallions."
(Official demo: https://demo.allennlp.org/coreference-resolution/MTE2NTIwOQ==)

In the demo, the expressions highlighted in blue and in red each refer to the same entity.


Installing AllenNLP

pip install allennlp

Downloading the pretrained coreference resolution model

wget https://s3-us-west-2.amazonaws.com/allennlp/models/coref-model-2018.02.05.tar.gz

Extracting the archive

tar -zxvf coref-model-2018.02.05.tar.gz

 

Source code

Assign any text you like to the variable doc1.

import pprint
from allennlp.predictors.predictor import Predictor

def my_coref(text, model):
    # Run the predictor and regroup its clusters into surface phrases.
    d = model.predict(document = text)

    clusters = d['clusters']  # per entity: list of [start, end] word offsets
    words = d['document']     # tokenized input text

    coref_phrase_list = []
    for cluster in clusters:
        # end offsets are inclusive, hence the +1 in the slice
        l = [(' '.join(words[c[0]:c[1]+1]), (c[0], c[1])) for c in cluster]
        coref_phrase_list.append(l)

    return coref_phrase_list

if __name__ == '__main__':
    # Predictor.from_path also accepts the .tar.gz archive path directly
    predictor = Predictor.from_path("coref-model-2018.02.05")
    doc1 = """Paul Allen was born on January 21, 1953, in Seattle, Washington, to Kenneth Sam Allen and Edna Faye Allen. Allen attended Lakeside School, a private school in Seattle, where he befriended Bill Gates."""

    d = predictor.predict(document=doc1)

    pprint.pprint(d)

    pprint.pprint(my_coref(doc1, predictor))
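Before looking at the output, note that the [start, end] offsets AllenNLP returns are inclusive at both ends, which is why my_coref slices with c[1]+1. A minimal stand-alone illustration with a hand-made word list (no model required):

```python
# A word list like the one AllenNLP returns in d['document'],
# and inclusive [start, end] spans into it.
words = ['Paul', 'Allen', 'was', 'born', 'in', 'Seattle', '.']

def span_text(words, start, end):
    """Join the words covered by an inclusive [start, end] span."""
    return ' '.join(words[start:end + 1])  # +1 because `end` is inclusive

print(span_text(words, 0, 1))  # Paul Allen
print(span_text(words, 5, 5))  # Seattle
```

Forgetting the +1 here is an easy off-by-one: `words[0:1]` alone would give only 'Paul'.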

Results

# pprint.pprint(d)
# d is a dict
# d['clusters'][i]: word-offset pairs [start, end], one per mention, for expressions (phrases or words) that refer to the same entity
# d['document']: word list of the input text
# d['predicted_antecedents']: ??
# d['top_spans']: ??

{'clusters': [[[0, 1], [24, 24], [36, 36]], [[11, 11], [33, 33]]],
 'document': ['Paul',
              'Allen',
              'was',
              'born',
              'on',
              'January',
              '21',
              ',',
              '1953',
              ',',
              'in',
              'Seattle',
              ',',
              'Washington',
              ',',
              'to',
              'Kenneth',
              'Sam',
              'Allen',
              'and',
              'Edna',
              'Faye',
              'Allen',
              '.',
              'Allen',
              'attended',
              'Lakeside',
              'School',
              ',',
              'a',
              'private',
              'school',
              'in',
              'Seattle',
              ',',
              'where',
              'he',
              'befriended',
              'Bill',
              'Gates',
              '.'],
 'predicted_antecedents': [-1,
                           -1,
                           -1,
                           -1,
                           -1,
                           -1,
                           -1,
                           -1,
                           -1,
                           -1,
                           9,
                           -1,
                           8,
                           -1,
                           3,
                           -1],
 'top_spans': [[0, 1],
               [6, 6],
               [8, 8],
               [11, 11],
               [11, 13],
               [13, 13],
               [16, 18],
               [16, 22],
               [17, 18],
               [20, 22],
               [24, 24],
               [25, 25],
               [33, 33],
               [33, 39],
               [36, 36],
               [38, 39]]}



# pprint.pprint(my_coref(doc1, predictor))
[[('Paul Allen', (0, 1)), ('Allen', (24, 24)), ('he', (36, 36))],
 [('Seattle', (11, 11)), ('Seattle', (33, 33))]]
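Incidentally, cross-checking the numbers in the sample output suggests an interpretation for 'predicted_antecedents': entry i seems to be a backward offset into top_spans, where a value k >= 0 marks top_spans[i - 1 - k] as the antecedent of top_spans[i], and -1 means the span opens no link. This is inferred from the output above, not from AllenNLP's documentation, so treat it as a guess:

```python
# Spans and antecedent indices copied from the pprint output above.
top_spans = [[0, 1], [6, 6], [8, 8], [11, 11], [11, 13], [13, 13],
             [16, 18], [16, 22], [17, 18], [20, 22], [24, 24], [25, 25],
             [33, 33], [33, 39], [36, 36], [38, 39]]
predicted_antecedents = [-1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
                         9, -1, 8, -1, 3, -1]

def antecedent_links(top_spans, predicted_antecedents):
    """Pair each span with its guessed antecedent span.

    Assumes predicted_antecedents[i] == k (k >= 0) points back to
    top_spans[i - 1 - k], and -1 means "no antecedent"."""
    pairs = []
    for i, k in enumerate(predicted_antecedents):
        if k >= 0:
            pairs.append((top_spans[i - 1 - k], top_spans[i]))
    return pairs

# Recovers exactly the three links that make up d['clusters']:
# [0,1]->[24,24], [11,11]->[33,33], [24,24]->[36,36]
print(antecedent_links(top_spans, predicted_antecedents))
```

Under this reading, 'top_spans' would be the model's candidate mention spans, and chaining the recovered links together yields the two clusters shown above.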

Summary

In this post, I tried out coreference resolution the easy way with AllenNLP.
If you know what the values of 'predicted_antecedents' and 'top_spans' mean, I'd be glad to hear about it in the comments.

*my_coref() is my own helper function.
