1. tamiyamoto


Coreference Resolution

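Coreference resolution (roughly 共参照解析 or 照応解析 in Japanese) is the task of finding two or more expressions in a text that refer to the same entity and linking them together.

In this post, partly as a note to myself, I walk through how to run coreference resolution with AllenNLP's pretrained Coreference Resolution model.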


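For example, in the passage below, "his" and "he" refer to Michael Cohen, and "Mr. Trump" refers to the same person as "President Trump":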
The legal pressures facing Michael Cohen are growing in a wide-ranging investigation of his personal business affairs and his work on behalf of his former client, President Trump. In addition to his work for Mr. Trump, he pursued his own business interests, including ventures in real estate, personal loans and investments in taxi medallions.
(Official tutorial: https://demo.allennlp.org/coreference-resolution/MTE2NTIwOQ==)


Installing AllenNLP

pip install allennlp

Downloading the pretrained Coreference Resolution model

wget https://s3-us-west-2.amazonaws.com/allennlp/models/coref-model-2018.02.05.tar.gz

Extracting the archive

tar -zxvf coref-model-2018.02.05.tar.gz
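
If you would rather do the download and extraction from Python than from the shell, the two steps above can be sketched roughly as follows. The function names download() and extract() are hypothetical helpers of mine, and extracting into a directory of its own is my assumption rather than what the tar command above does (tar extracts into the current directory).

```python
import tarfile
import urllib.request
from pathlib import Path

# URL taken from the wget command above.
MODEL_URL = "https://s3-us-west-2.amazonaws.com/allennlp/models/coref-model-2018.02.05.tar.gz"


def download(url: str, dest: Path) -> Path:
    """Download url to dest, skipping the download if dest already exists."""
    if not dest.exists():
        urllib.request.urlretrieve(url, str(dest))
    return dest


def extract(archive: Path, target_dir: Path) -> Path:
    """Extract a .tar.gz archive into target_dir (created if missing)."""
    target_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(str(archive), "r:gz") as tar:
        tar.extractall(str(target_dir))
    return target_dir


# Usage (commented out because it needs network access):
# archive = download(MODEL_URL, Path("coref-model-2018.02.05.tar.gz"))
# extract(archive, Path("coref-model-2018.02.05"))
```

Only extract archives from sources you trust, since tarfile.extractall() writes whatever paths the archive contains.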

 

Source code

Assign any text you like to the variable doc1.

import pprint
from allennlp.predictors.predictor import Predictor


def my_coref(text, model):
    """Return, for each coreference cluster, a list of (phrase, (start, end)) tuples."""
    d = model.predict(document=text)

    clusters = d['clusters']  # token-offset spans, grouped by entity
    words = d['document']     # tokens of the input text

    coref_phrase_list = []
    for cluster in clusters:
        # Span ends are inclusive, hence the +1 when slicing.
        phrases = [(' '.join(words[start:end + 1]), (start, end)) for start, end in cluster]
        coref_phrase_list.append(phrases)

    return coref_phrase_list


if __name__ == '__main__':
    # Load the pretrained model downloaded above.
    predictor = Predictor.from_path("coref-model-2018.02.05")
    doc1 = """Paul Allen was born on January 21, 1953, in Seattle, Washington, to Kenneth Sam Allen and Edna Faye Allen. Allen attended Lakeside School, a private school in Seattle, where he befriended Bill Gates."""

    # Raw prediction output
    d = predictor.predict(document=doc1)
    pprint.pprint(d)

    # Phrases grouped by cluster
    pprint.pprint(my_coref(doc1, predictor))

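Results

# pprint.pprint(d)
# d is a dict object
# d['clusters'][i]: list of token offsets [start, end] (end inclusive) for the expressions (phrases or single words) that refer to the same entity
# d['document']: list of tokens of the input text
# d['predicted_antecedents']: ??
# d['top_spans']: ??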

{'clusters': [[[0, 1], [24, 24], [36, 36]], [[11, 11], [33, 33]]],
 'document': ['Paul',
              'Allen',
              'was',
              'born',
              'on',
              'January',
              '21',
              ',',
              '1953',
              ',',
              'in',
              'Seattle',
              ',',
              'Washington',
              ',',
              'to',
              'Kenneth',
              'Sam',
              'Allen',
              'and',
              'Edna',
              'Faye',
              'Allen',
              '.',
              'Allen',
              'attended',
              'Lakeside',
              'School',
              ',',
              'a',
              'private',
              'school',
              'in',
              'Seattle',
              ',',
              'where',
              'he',
              'befriended',
              'Bill',
              'Gates',
              '.'],
 'predicted_antecedents': [-1,
                           -1,
                           -1,
                           -1,
                           -1,
                           -1,
                           -1,
                           -1,
                           -1,
                           -1,
                           9,
                           -1,
                           8,
                           -1,
                           3,
                           -1],
 'top_spans': [[0, 1],
               [6, 6],
               [8, 8],
               [11, 11],
               [11, 13],
               [13, 13],
               [16, 18],
               [16, 22],
               [17, 18],
               [20, 22],
               [24, 24],
               [25, 25],
               [33, 33],
               [33, 39],
               [36, 36],
               [38, 39]]}
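
For the two fields marked ??, one reading that is at least consistent with this particular output: 'top_spans' looks like the list of candidate mention spans the model kept, and 'predicted_antecedents' has one entry per span, where -1 means "no antecedent" and a value k >= 0 points k + 1 spans back in 'top_spans'. This is an inference from the numbers above, not something confirmed here against the AllenNLP documentation, and antecedent_links() is a hypothetical helper of mine. A minimal sketch that checks the reading against the output:

```python
# Values copied from the output above.
top_spans = [[0, 1], [6, 6], [8, 8], [11, 11], [11, 13], [13, 13],
             [16, 18], [16, 22], [17, 18], [20, 22], [24, 24], [25, 25],
             [33, 33], [33, 39], [36, 36], [38, 39]]
predicted_antecedents = [-1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
                         9, -1, 8, -1, 3, -1]


def antecedent_links(top_spans, predicted_antecedents):
    """Return (mention_span, antecedent_span) pairs.

    Assumption: a value k >= 0 at position i points k + 1 spans back,
    i.e. to top_spans[i - 1 - k]; -1 means the span has no antecedent.
    """
    links = []
    for i, k in enumerate(predicted_antecedents):
        if k < 0:
            continue
        links.append((tuple(top_spans[i]), tuple(top_spans[i - 1 - k])))
    return links


for mention, antecedent in antecedent_links(top_spans, predicted_antecedents):
    print(mention, "->", antecedent)
```

Under this reading the links are (24, 24) -> (0, 1), (33, 33) -> (11, 11) and (36, 36) -> (24, 24). Following the links backwards gives exactly the two groups listed in d['clusters'].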



# pprint.pprint(my_coref(doc1, predictor))
[[('Paul Allen', (0, 1)), ('Allen', (24, 24)), ('he', (36, 36))],
 [('Seattle', (11, 11)), ('Seattle', (33, 33))]]
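
Once the clusters are available, a common next step is to rewrite the text so that every later mention is replaced by the first mention of its cluster. Below is a minimal sketch of that idea using the tokens and clusters from the output above. resolve() is a hypothetical helper, not an AllenNLP function, and it assumes mentions do not overlap or nest.

```python
def resolve(words, clusters):
    """Replace every later mention in a cluster with the cluster's first mention."""
    replacement = {}  # start index -> (inclusive end index, replacement tokens)
    for cluster in clusters:
        first_start, first_end = cluster[0]
        head = words[first_start:first_end + 1]
        for start, end in cluster[1:]:
            replacement[start] = (end, head)

    out = []
    i = 0
    while i < len(words):
        if i in replacement:
            end, head = replacement[i]
            out.extend(head)  # emit the first mention instead
            i = end + 1       # skip the tokens of the replaced mention
        else:
            out.append(words[i])
            i += 1
    return out


# Tokens and clusters copied from the output above.
words = ("Paul Allen was born on January 21 , 1953 , in Seattle , Washington , "
         "to Kenneth Sam Allen and Edna Faye Allen . Allen attended Lakeside School , "
         "a private school in Seattle , where he befriended Bill Gates .").split()
clusters = [[[0, 1], [24, 24], [36, 36]], [[11, 11], [33, 33]]]

print(" ".join(resolve(words, clusters)))
```

With this input, the second sentence becomes "Paul Allen attended Lakeside School , a private school in Seattle , where Paul Allen befriended Bill Gates ."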

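Summary

In this post I used AllenNLP to try out Coreference Resolution with very little effort.
If you know what the values of 'predicted_antecedents' and 'top_spans' represent, I would appreciate a comment.

*my_coref() is a function I wrote myself.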