1. tamiyamoto

    Posted

Changes in title
+Coreference Resolution with AllenNLP
Changes in tags
Changes in body
@@ -0,0 +1,164 @@
+# Coreference Resolution
+Coreference resolution (sometimes loosely called anaphora resolution) is the task of finding and linking **two or more expressions that refer to the same entity** within a text.
+
+
+
+*****
+The legal pressures facing <font color="Blue">**Michael Cohen**</font> are growing in a wide-ranging investigation of <font color="Blue">**his**</font> personal business affairs and <font color="Blue">**his**</font> work on behalf of <font color="Blue">**his**</font> former client, <font color="Red">**President Trump**</font>. In addition to <font color="Blue">**his**</font> work for <font color="Red">**Mr. Trump**</font>, <font color="Blue">**he**</font> pursued <font color="Blue">**his**</font> own business interests, including ventures in real estate, personal loans and investments in taxi medallions.
+*****
+
+In this post I walk through coreference resolution with AllenNLP's pretrained coreference resolution model, partly as a note to myself.
+
+The official demo is available here (https://demo.allennlp.org/coreference-resolution/MTE2NTIwOQ==)
+
+## Installing AllenNLP
+```
+pip install allennlp
+```
+
+
+## Downloading the pretrained coreference resolution model
+```
+wget https://s3-us-west-2.amazonaws.com/allennlp/models/coref-model-2018.02.05.tar.gz
+```
+
+## Extracting the archive
+```
+tar -zxvf coref-model-2018.02.05.tar.gz
+```
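If you would rather unpack the archive from Python than from the shell, the standard-library `tarfile` module does the same job. This is only a minimal sketch: `extract_archive` is a helper name I made up, and the destination directory in the usage comment is my own choice, not something AllenNLP requires.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Unpack a .tar.gz archive into dest_dir and return the top-level entry names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest)
    return sorted(p.name for p in dest.iterdir())


# Hypothetical usage (assumes the archive was downloaded to the current directory):
# extract_archive("coref-model-2018.02.05.tar.gz", "coref-model-2018.02.05")
```

Unpacking into a dedicated directory keeps the model files together instead of scattering them across the working directory.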
+ 
+## Source code
+Assign any text you like to the variable doc1.
+
+```
+import pprint
+from allennlp.predictors.predictor import Predictor
+
+
+def my_coref(text, model):
+    """Return each coreference cluster as a list of (phrase, (start, end)) tuples."""
+    d = model.predict(document=text)
+
+    clusters = d['clusters']  # inclusive token-offset spans, grouped by entity
+    words = d['document']     # the tokenized input text
+
+    coref_phrase_list = []
+    for cluster in clusters:
+        # Span ends are inclusive, so slice up to end + 1.
+        phrases = [(' '.join(words[start:end + 1]), (start, end)) for start, end in cluster]
+        coref_phrase_list.append(phrases)
+
+    return coref_phrase_list
+
+
+if __name__ == '__main__':
+    predictor = Predictor.from_path("coref-model-2018.02.05")
+    doc1 = """Paul Allen was born on January 21, 1953, in Seattle, Washington, to Kenneth Sam Allen and Edna Faye Allen. Allen attended Lakeside School, a private school in Seattle, where he befriended Bill Gates."""
+
+    # Raw prediction: a dict with the token list and the predicted clusters.
+    d = predictor.predict(document=doc1)
+    pprint.pprint(d)
+
+    # The same clusters, rendered as readable phrases.
+    pprint.pprint(my_coref(doc1, predictor))
+```
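The cluster spans can also be used to rewrite the text, for example by replacing every later mention with the first mention of its cluster. The sketch below does this without loading the model: `tokens` and `clusters` are hard-coded copies of the `document` and `clusters` values the model returned for doc1, and `replace_with_first_mention` is a helper name of my own. It assumes that the first span in each cluster is the earliest mention and that the spans being replaced do not overlap.

```python
def replace_with_first_mention(tokens, clusters):
    """Replace each later mention in a cluster with that cluster's first mention."""
    # Map the start offset of every later mention to (inclusive end, replacement tokens).
    replacements = {}
    for cluster in clusters:
        first_start, first_end = cluster[0]
        head = tokens[first_start:first_end + 1]
        for start, end in cluster[1:]:
            replacements[start] = (end, head)

    resolved = []
    i = 0
    while i < len(tokens):
        if i in replacements:
            end, head = replacements[i]
            resolved.extend(head)
            i = end + 1  # skip past the replaced mention
        else:
            resolved.append(tokens[i])
            i += 1
    return resolved


# Values copied from the model output for doc1 (hard-coded so no model is needed).
tokens = ['Paul', 'Allen', 'was', 'born', 'on', 'January', '21', ',', '1953', ',',
          'in', 'Seattle', ',', 'Washington', ',', 'to', 'Kenneth', 'Sam', 'Allen',
          'and', 'Edna', 'Faye', 'Allen', '.', 'Allen', 'attended', 'Lakeside',
          'School', ',', 'a', 'private', 'school', 'in', 'Seattle', ',', 'where',
          'he', 'befriended', 'Bill', 'Gates', '.']
clusters = [[[0, 1], [24, 24], [36, 36]], [[11, 11], [33, 33]]]

resolved = replace_with_first_mention(tokens, clusters)
print(' '.join(resolved))
```

With these inputs, the standalone "Allen" and the pronoun "he" both become "Paul Allen", while the second "Seattle" is replaced by itself.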
+
+
+## Results
+```
+# pprint.pprint(d)
+# d is a dict
+# d['clusters'][i]: list of token-offset spans for expressions (phrases or single words) that refer to the same thing
+# d['document']: the input text as a list of tokens
+# d['predicted_antecedents']: ??
+# d['top_spans']: ??
+
+{'clusters': [[[0, 1], [24, 24], [36, 36]], [[11, 11], [33, 33]]],
+ 'document': ['Paul',
+ 'Allen',
+ 'was',
+ 'born',
+ 'on',
+ 'January',
+ '21',
+ ',',
+ '1953',
+ ',',
+ 'in',
+ 'Seattle',
+ ',',
+ 'Washington',
+ ',',
+ 'to',
+ 'Kenneth',
+ 'Sam',
+ 'Allen',
+ 'and',
+ 'Edna',
+ 'Faye',
+ 'Allen',
+ '.',
+ 'Allen',
+ 'attended',
+ 'Lakeside',
+ 'School',
+ ',',
+ 'a',
+ 'private',
+ 'school',
+ 'in',
+ 'Seattle',
+ ',',
+ 'where',
+ 'he',
+ 'befriended',
+ 'Bill',
+ 'Gates',
+ '.'],
+ 'predicted_antecedents': [-1,
+ -1,
+ -1,
+ -1,
+ -1,
+ -1,
+ -1,
+ -1,
+ -1,
+ -1,
+ 9,
+ -1,
+ 8,
+ -1,
+ 3,
+ -1],
+ 'top_spans': [[0, 1],
+ [6, 6],
+ [8, 8],
+ [11, 11],
+ [11, 13],
+ [13, 13],
+ [16, 18],
+ [16, 22],
+ [17, 18],
+ [20, 22],
+ [24, 24],
+ [25, 25],
+ [33, 33],
+ [33, 39],
+ [36, 36],
+ [38, 39]]}
+
+
+
+# pprint.pprint(my_coref(doc1, predictor))
+[[('Paul Allen', (0, 1)), ('Allen', (24, 24)), ('he', (36, 36))],
+ [('Seattle', (11, 11)), ('Seattle', (33, 33))]]
+
+```
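A tentative reading of the two remaining fields, inferred from the output above rather than taken from documentation: `top_spans` looks like the list of candidate mention spans the model kept after pruning, and `predicted_antecedents[i]` looks like the choice of antecedent for `top_spans[i]`, with -1 meaning "no antecedent". In this output a value k is consistent with "the span k + 1 positions earlier in `top_spans`", that is, `top_spans[i - (k + 1)]`. That offset encoding is an assumption about this particular model version. The sketch below (`rebuild_clusters` is my own helper name) rebuilds the clusters under that assumption, using values hard-coded from the output.

```python
def rebuild_clusters(top_spans, predicted_antecedents):
    """Rebuild clusters, assuming value k points at the span k + 1 positions earlier."""
    span_to_cluster = {}
    clusters = []
    for i, k in enumerate(predicted_antecedents):
        if k < 0:
            continue  # -1: this span has no predicted antecedent
        antecedent = tuple(top_spans[i - (k + 1)])
        mention = tuple(top_spans[i])
        if antecedent not in span_to_cluster:
            # The antecedent starts a new cluster.
            span_to_cluster[antecedent] = len(clusters)
            clusters.append([antecedent])
        cluster_id = span_to_cluster[antecedent]
        clusters[cluster_id].append(mention)
        span_to_cluster[mention] = cluster_id
    return [[list(span) for span in cluster] for cluster in clusters]


# Values copied from the output above.
top_spans = [[0, 1], [6, 6], [8, 8], [11, 11], [11, 13], [13, 13], [16, 18],
             [16, 22], [17, 18], [20, 22], [24, 24], [25, 25], [33, 33],
             [33, 39], [36, 36], [38, 39]]
predicted_antecedents = [-1] * 10 + [9, -1, 8, -1, 3, -1]
clusters = [[[0, 1], [24, 24], [36, 36]], [[11, 11], [33, 33]]]

rebuilt = rebuild_clusters(top_spans, predicted_antecedents)
print(rebuilt == clusters)
```

For this example the rebuilt clusters match `d['clusters']`: span 10 ("Allen") links back to span 0 ("Paul Allen"), span 12 (the second "Seattle") to span 3 (the first "Seattle"), and span 14 ("he") to span 10.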
+
+# Summary
+In this post I used AllenNLP to run a simple coreference resolution.
+If you know for certain what the values of 'predicted_antecedents' and 'top_spans' represent, I would appreciate a comment.
+
+*my_coref() is a function I wrote myself.