
Coreference Resolution with AllenNLP


Coreference Resolution

Coreference resolution is the task of finding two or more expressions in a text that refer to the same entity and linking them together.
(For a more detailed explanation, this article is a good starting point: http://adsmedia.hatenablog.com/entry/2017/02/20/084846)

In this post, mostly as a note to my future self, I'll walk through running coreference resolution with AllenNLP's pretrained model.


[What coreference resolution looks like]
"The legal pressures facing Michael Cohen are growing in a wide-ranging investigation of his personal business affairs and his work on behalf of his former client, President Trump. In addition to his work for Mr. Trump, he pursued his own business interests, including ventures in real estate, personal loans and investments in taxi medallions."
(Official demo: https://demo.allennlp.org/coreference-resolution/MTE2NTIwOQ==)

In the demo, the expressions highlighted in blue and in red each refer to the same entity.


Installing AllenNLP

pip install allennlp

Downloading the pretrained coreference resolution model

wget https://s3-us-west-2.amazonaws.com/allennlp/models/coref-model-2018.02.05.tar.gz

Extracting the archive

tar -zxvf coref-model-2018.02.05.tar.gz

 

Source code

Assign any text you like to the variable doc1.

import pprint
from allennlp.predictors.predictor import Predictor

def my_coref(text, model):
    # Run the predictor and regroup its clusters into surface phrases.
    d = model.predict(document = text)

    clusters = d['clusters']  # per entity: list of [start, end] word offsets
    words = d['document']     # tokenized input text

    coref_phrase_list = []
    for cluster in clusters:
        # end offsets are inclusive, hence the +1 in the slice
        l = [(' '.join(words[c[0]:c[1]+1]), (c[0], c[1])) for c in cluster]
        coref_phrase_list.append(l)

    return coref_phrase_list

if __name__ == '__main__':
    # Predictor.from_path also accepts the .tar.gz archive path directly
    predictor = Predictor.from_path("coref-model-2018.02.05")
    doc1 = """Paul Allen was born on January 21, 1953, in Seattle, Washington, to Kenneth Sam Allen and Edna Faye Allen. Allen attended Lakeside School, a private school in Seattle, where he befriended Bill Gates."""

    d = predictor.predict(document=doc1)

    pprint.pprint(d)

    pprint.pprint(my_coref(doc1, predictor))
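Before looking at the output, note that the [start, end] offsets AllenNLP returns are inclusive at both ends, which is why my_coref slices with c[1]+1. A minimal stand-alone illustration with a hand-made word list (no model required):

```python
# A word list like the one AllenNLP returns in d['document'],
# and inclusive [start, end] spans into it.
words = ['Paul', 'Allen', 'was', 'born', 'in', 'Seattle', '.']

def span_text(words, start, end):
    """Join the words covered by an inclusive [start, end] span."""
    return ' '.join(words[start:end + 1])  # +1 because `end` is inclusive

print(span_text(words, 0, 1))  # Paul Allen
print(span_text(words, 5, 5))  # Seattle
```

Forgetting the +1 here is an easy off-by-one: `words[0:1]` alone would give only 'Paul'.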

Results

# pprint.pprint(d)
# d is a dict
# d['clusters'][i]: word-offset pairs [start, end], one per mention, for expressions (phrases or words) that refer to the same entity
# d['document']: word list of the input text
# d['predicted_antecedents']: ??
# d['top_spans']: ??

{'clusters': [[[0, 1], [24, 24], [36, 36]], [[11, 11], [33, 33]]],
 'document': ['Paul',
              'Allen',
              'was',
              'born',
              'on',
              'January',
              '21',
              ',',
              '1953',
              ',',
              'in',
              'Seattle',
              ',',
              'Washington',
              ',',
              'to',
              'Kenneth',
              'Sam',
              'Allen',
              'and',
              'Edna',
              'Faye',
              'Allen',
              '.',
              'Allen',
              'attended',
              'Lakeside',
              'School',
              ',',
              'a',
              'private',
              'school',
              'in',
              'Seattle',
              ',',
              'where',
              'he',
              'befriended',
              'Bill',
              'Gates',
              '.'],
 'predicted_antecedents': [-1,
                           -1,
                           -1,
                           -1,
                           -1,
                           -1,
                           -1,
                           -1,
                           -1,
                           -1,
                           9,
                           -1,
                           8,
                           -1,
                           3,
                           -1],
 'top_spans': [[0, 1],
               [6, 6],
               [8, 8],
               [11, 11],
               [11, 13],
               [13, 13],
               [16, 18],
               [16, 22],
               [17, 18],
               [20, 22],
               [24, 24],
               [25, 25],
               [33, 33],
               [33, 39],
               [36, 36],
               [38, 39]]}



# pprint.pprint(my_coref(doc1, predictor))
[[('Paul Allen', (0, 1)), ('Allen', (24, 24)), ('he', (36, 36))],
 [('Seattle', (11, 11)), ('Seattle', (33, 33))]]
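Incidentally, cross-checking the numbers in the sample output suggests an interpretation for 'predicted_antecedents': entry i seems to be a backward offset into top_spans, where a value k >= 0 marks top_spans[i - 1 - k] as the antecedent of top_spans[i], and -1 means the span opens no link. This is inferred from the output above, not from AllenNLP's documentation, so treat it as a guess:

```python
# Spans and antecedent indices copied from the pprint output above.
top_spans = [[0, 1], [6, 6], [8, 8], [11, 11], [11, 13], [13, 13],
             [16, 18], [16, 22], [17, 18], [20, 22], [24, 24], [25, 25],
             [33, 33], [33, 39], [36, 36], [38, 39]]
predicted_antecedents = [-1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
                         9, -1, 8, -1, 3, -1]

def antecedent_links(top_spans, predicted_antecedents):
    """Pair each span with its guessed antecedent span.

    Assumes predicted_antecedents[i] == k (k >= 0) points back to
    top_spans[i - 1 - k], and -1 means "no antecedent"."""
    pairs = []
    for i, k in enumerate(predicted_antecedents):
        if k >= 0:
            pairs.append((top_spans[i - 1 - k], top_spans[i]))
    return pairs

# Recovers exactly the three links that make up d['clusters']:
# [0,1]->[24,24], [11,11]->[33,33], [24,24]->[36,36]
print(antecedent_links(top_spans, predicted_antecedents))
```

Under this reading, 'top_spans' would be the model's candidate mention spans, and chaining the recovered links together yields the two clusters shown above.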

Summary

In this post, I tried out coreference resolution the easy way with AllenNLP.
If you know what the values of 'predicted_antecedents' and 'top_spans' mean, I'd be glad to hear about it in the comments.

*my_coref() is my own helper function.
