This is my record of problem 56, "Coreference resolution", from Chapter 6 "Processing English text" of the 100 Language Processing Knock 2015.
Up to the previous problem, Chapter 6 was warm-up-level easy, but from here the problems start to require some real thought.
Reference links
Link | Notes |
---|---|
056.共参照解析.ipynb | GitHub link to the answer program |
素人の言語処理100本ノック:56 | Source I copied much of the code from |
Stanford Core NLP official site | The Stanford Core NLP page to look at first |
Environment
Type | Version | Notes |
---|---|---|
OS | Ubuntu18.04.01 LTS | Running in a virtual machine |
pyenv | 1.2.16 | I use pyenv because I sometimes work with multiple Python environments |
Python | 3.8.1 | Python 3.8.1 on pyenv; packages are managed with venv |
Stanford CoreNLP | 3.9.2 | I installed it a year ago and don't remember the details... It was still the latest version a year later, so I kept using it |
openJDK | 1.8.0_242 | Reused a JDK I had already installed for other purposes |
Chapter 6: Processing English text
What you learn
Through English text processing with Stanford Core NLP, we get an overview of various fundamental technologies in natural language processing.
Stanford Core NLP, stemming, part-of-speech tagging, named entity extraction, coreference resolution, dependency parsing, phrase structure parsing, S-expressions
Knock content
Perform the following processing on the English text (nlp.txt).
56. Coreference resolution
Based on the coreference resolution results from Stanford Core NLP, replace the referring expressions (mentions) in the text with their representative mentions. When replacing, do so in a way that keeps the original mention visible, e.g. "representative mention (mention)".
Note: instead of "representative mention (mention)" I output 「representative mention」 (mention), simply because I find it easier to read.
Supplementary notes on the task (about "coreference")
Coreference is, apparently, when multiple noun phrases refer to the same entity.
The image on the official Stanford CoreNLP site illustrates this well.
The details of how Stanford CoreNLP handles it are described on the Deterministic Coreference Resolution System page.
Answer
Answer program 056.共参照解析.ipynb
```python
import xml.etree.ElementTree as ET

# Parse the analysis result XML
root = ET.parse('./nlp.txt.xml')

# Enumerate the coreferences and build a dictionary of the positions to replace
# with the representative mention.
# The dictionary has the form {(sentence id, start token id): (end token id, representative mention)} ...
replaces = {}
for coreference in root.iterfind('./document/coreference/coreference'):

    # Get the representative mention
    representative = coreference.findtext('./mention[@representative="true"]/text')

    # Enumerate the mentions other than the representative one and add them to the dictionary
    for mention in coreference.iterfind('./mention'):
        if mention.get('representative') == None:
            sentence_id = mention.findtext('sentence')
            start = mention.findtext('start')
            end = int(mention.findtext('end')) - 1  # shift end by one

            # Even if the key is already in the dictionary, just overwrite it (last one wins)
            replaces[(sentence_id, start)] = (end, representative)

# Print the body text, applying the replacements in replaces
for sentence in root.iterfind('./document/sentences/sentence'):
    sentence_id = sentence.get('id')

    for token in sentence.iterfind('./tokens/token'):
        token_id = token.get('id')

        # Start of a replacement
        if (sentence_id, token_id) in replaces:

            # Take the end position and the representative mention out of the dictionary
            (end, representative) = replaces[(sentence_id, token_id)]

            # Insert the representative mention plus the opening bracket (end='' suppresses the newline)
            print('「' + representative + '」 (', end='')

        # Print the token (end='' suppresses the newline)
        print(token.findtext('word'), end='')

        # If this is the end of a replacement, insert the closing bracket (end='' suppresses the newline)
        if int(token_id) == end:
            print(')', end='')
            end = 0

        # Don't worry about the space added before sentence-final periods etc. (end='' suppresses the newline)
        print(' ', end='')

    print()  # newline per sentence
```
Answer explanation
Paths in the XML file
The table below shows the mapping between paths in the XML file and the coreference information. Having the same tag name coreference at both the 3rd and 4th levels is not a mistake (you would normally expect the 3rd level to be called coreferences, though...).
A mention tag at the 5th level whose representative attribute is true is the "representative mention"; the other mentions are the referring expressions.
Output | Level 1 | Level 2 | Level 3 | Level 4 | Level 5 | Level 6 |
---|---|---|---|---|---|---|
sentence id of the coreference | root | document | coreference | coreference | mention | sentence |
start token id of the coreference | root | document | coreference | coreference | mention | start |
end token id of the coreference | root | document | coreference | coreference | mention | end |
text of the coreference | root | document | coreference | coreference | mention | text |
The XML file itself is on GitHub.
```xml
<root>
  <document>
    <docId>nlp.txt</docId>
    <sentences>
    --snip--
    <coreference>
      <coreference>
        <mention representative="true">
          <sentence>1</sentence>
          <start>7</start>
          <end>16</end>
          <head>12</head>
          <text>the free encyclopedia Natural language processing -LRB- NLP -RRB-</text>
        </mention>
        <mention>
          <sentence>1</sentence>
          <start>17</start>
          <end>22</end>
          <head>18</head>
          <text>a field of computer science</text>
        </mention>
```
Building the mention replacement dictionary
I create a dictionary variable named replaces and store the data in it in the form {(sentence id, start token id): (end token id, representative mention)}. The dictionary keys could in principle collide, but I simply use a last-one-wins policy (a later record overwrites an earlier one).
```python
# Enumerate the coreferences and build a dictionary of the positions to replace
# with the representative mention.
# The dictionary has the form {(sentence id, start token id): (end token id, representative mention)} ...
replaces = {}
for coreference in root.iterfind('./document/coreference/coreference'):

    # Get the representative mention
    representative = coreference.findtext('./mention[@representative="true"]/text')

    # Enumerate the mentions other than the representative one and add them to the dictionary
    for mention in coreference.iterfind('./mention'):
        if mention.get('representative') == None:
            sentence_id = mention.findtext('sentence')
            start = mention.findtext('start')
            end = int(mention.findtext('end')) - 1  # shift end by one

            # Even if the key is already in the dictionary, just overwrite it (last one wins)
            replaces[(sentence_id, start)] = (end, representative)
```
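Because tuples are hashable they can be used directly as dictionary keys, and assigning to an existing key overwrites its value, which is exactly the "last one wins" behaviour described above. The values below are made up purely for illustration and are not taken from nlp.txt.xml.

```python
# Made-up values purely for illustration; not actual nlp.txt.xml data
replaces = {}

# First record for the key (sentence id '3', start token id '5')
replaces[('3', '5')] = (6, 'machine translation')

# A later record with the same key simply overwrites the earlier one (last one wins)
replaces[('3', '5')] = (7, 'statistical machine translation')

print(replaces)  # {('3', '5'): (7, 'statistical machine translation')}
```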
Printing the sentences
After that, I just read the body text from the sentences section and, wherever there is a mention to replace, perform the replacement and add the brackets.
```python
# Print the body text, applying the replacements in replaces
for sentence in root.iterfind('./document/sentences/sentence'):
    sentence_id = sentence.get('id')

    for token in sentence.iterfind('./tokens/token'):
        token_id = token.get('id')

        # Start of a replacement
        if (sentence_id, token_id) in replaces:

            # Take the end position and the representative mention out of the dictionary
            (end, representative) = replaces[(sentence_id, token_id)]

            # Insert the representative mention plus the opening bracket (end='' suppresses the newline)
            print('「' + representative + '」 (', end='')

        # Print the token (end='' suppresses the newline)
        print(token.findtext('word'), end='')

        # If this is the end of a replacement, insert the closing bracket (end='' suppresses the newline)
        if int(token_id) == end:
            print(')', end='')
            end = 0

        # Don't worry about the space added before sentence-final periods etc. (end='' suppresses the newline)
        print(' ', end='')

    print()  # newline per sentence
```
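To see where the brackets end up, here is a toy version of the same printing logic run on a hand-made token list; the token ids, mention span, and representative string are invented for illustration rather than read from the XML.

```python
# Invented example data: a four-token sentence whose first two tokens
# form a mention whose representative is 'many speech recognition systems'
tokens = ['These', 'systems', 'were', 'able']
representative = 'many speech recognition systems'
start, end = 1, 2  # 1-based token ids of the mention span

for token_id, word in enumerate(tokens, start=1):
    if token_id == start:
        # Start of the replacement: print the representative mention and the opening bracket
        print('「' + representative + '」 (', end='')
    print(word, end='')
    if token_id == end:
        # End of the replacement: close the bracket
        print(')', end='')
    print(' ', end='')
print()
# -> 「many speech recognition systems」 (These systems) were able
```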
Output (execution result)
Running the program produces the following output.
Natural language processing From Wikipedia , the free encyclopedia Natural language processing -LRB- NLP -RRB- is a field of computer science) , artificial intelligence , and linguistics concerned with the interactions between computers and human -LRB- natural -RRB- languages .
As such , 「NLP」 (NLP) is related to the area of humani-computer interaction .
Many challenges in 「NLP」 (NLP) involve natural language understanding , that is , enabling computers to derive meaning from human or natural language input , and others involve natural language generation .
History The history of 「NLP」 (NLP) generally starts in the 1950s , although work can be found from earlier periods .
In 1950 , Alan Turing published an article titled `` Computing Machinery and Intelligence '' which proposed what is now called the Turing test as a criterion of intelligence .
The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English .
The authors claimed that within three or five years , machine translation would be a solved problem .
However , real progress was much slower , and after the ALPAC report in 1966 , which found that ten year long research had failed to fulfill the expectations , funding for 「machine translation」 (machine translation) was dramatically reduced .
Little further research in 「machine translation」 (machine translation) was conducted until the late 1980s , when the first statistical machine translation systems were developed .
Some notably successful NLP systems developed in the 1960s were SHRDLU , a natural language system working in restricted `` blocks worlds '' with restricted vocabularies , and ELIZA , a simulation of a Rogerian psychotherapist , written by Joseph Weizenbaum between 1964 to 「1966」 (1966) .
Using almost no information about human thought or emotion , ELIZA sometimes provided a startlingly human-like interaction .
When the `` patient '' exceeded the very small knowledge base , 「ELIZA」 (ELIZA) might provide a generic response , for example , responding to `` My head hurts '' with `` Why do you say 「My head」 (your head) hurts ? ''
.
During the 1970s many programmers began to write ` conceptual ontologies ' , which structured real-world information into computer-understandable data .
Examples are MARGIE -LRB- Schank , 1975 -RRB- , SAM -LRB- Cullingford , 1978 -RRB- , PAM -LRB- Wilensky , 「1978」 (1978) -RRB- , TaleSpin -LRB- Meehan , 1976 -RRB- , QUALM -LRB- Lehnert , 1977 -RRB- , Politics -LRB- Carbonell , 1979 -RRB- , and Plot Units -LRB- Lehnert 1981 -RRB- .
During this time , many chatterbots were written including PARRY , Racter , and Jabberwacky .
Up to the 1980s , most 「NLP」 (NLP) systems were based on complex sets of hand-written rules .
Starting in the late 1980s , however , there was a revolution in 「NLP」 (NLP) with the introduction of machine learning algorithms for language processing .
This was due to both the steady increase in computational power resulting from Moore 's Law and the gradual lessening of the dominance of Chomskyan theories of linguistics -LRB- e.g. transformational grammar -RRB- , whose theoretical underpinnings discouraged the sort of corpus linguistics that underlies the machine-learning approach to language processing .
Some of the earliest-used machine learning algorithms , such as decision trees , produced systems of hard if-then rules similar to existing hand-written rules .
However , Part of speech tagging introduced the use of Hidden Markov Models to NLP , and increasingly , research has focused on statistical models , which make soft , probabilistic decisions based on attaching real-valued weights to the features making up the input data .
The cache language models upon which many speech recognition systems now rely are examples of such statistical models .
Such models are generally more robust when given unfamiliar input , especially input that contains errors -LRB- as is very common for real-world data -RRB- , and produce more reliable results when integrated into a larger system comprising multiple subtasks .
Many of the notable early successes occurred in the field of machine translation , due especially to work at IBM Research , where successively more complicated statistical models were developed .
「many speech recognition systems」 (These systems) were able to take advantage of existing multilingual textual corpora that had been produced by the Parliament of Canada and the European Union as a result of laws calling for the translation of all governmental proceedings into all official languages of the corresponding systems of government .
However , most other systems depended on corpora specifically developed for the tasks implemented by these systems , which was -LRB- and often continues to be -RRB- a major limitation in the success of 「many speech recognition systems」 (these systems) .
As a result , a great deal of research has gone into methods of more effectively learning from limited amounts of data .
Recent research has increasingly focused on unsupervised and semi-supervised learning algorithms .
Such algorithms are able to learn from data that has not been hand-annotated with the desired answers , or using a combination of annotated and non-annotated data .
Generally , this task is much more difficult than supervised learning , and typically produces less accurate results for a given amount of input data .
However , there is an enormous amount of non-annotated data available -LRB- including , among other things , the entire content of the World Wide Web -RRB- , which can often make up for the inferior results .
NLP using machine learning Modern NLP algorithms are based on machine learning , especially statistical machine learning .
「The machine-learning paradigm」 (The paradigm of machine learning) is different from that of most prior attempts at language processing .
Prior implementations of language-processing tasks typically involved the direct hand coding of large sets of rules .
The machine-learning paradigm calls instead for using general learning algorithms - often , although not always , grounded in statistical inference - to automatically learn such rules through the analysis of large corpora of typical real-world examples .
A corpus -LRB- plural , `` corpora '' -RRB- is a set of documents -LRB- or sometimes , individual sentences -RRB- that have been hand-annotated with the correct values to be learned .
Many different classes of machine learning algorithms have been applied to 「NLP」 (NLP) tasks .
「machine learning algorithms」 (These algorithms) take as input a large set of `` features '' that are generated from 「the input data」 (the input data) .
Some of 「machine learning algorithms」 (the earliest-used algorithms) , such as decision trees , produced systems of hard if-then rules similar to the systems of hand-written rules that were then common .
Increasingly , however , research has focused on statistical models , which make soft , probabilistic decisions based on attaching real-valued weights to each input feature .
Such models have the advantage that 「Such models」 (they) can express the relative certainty of many different possible answers rather than only one , producing more reliable results when such a model is included as a component of a larger system .
Systems based on machine-learning algorithms have many advantages over hand-produced rules : The learning procedures used during machine learning automatically focus on the most common cases , whereas when writing rules by hand it is often not obvious at all where the effort should be directed .
Automatic learning procedures can make use of statistical inference algorithms to produce models that are robust to unfamiliar input -LRB- e.g. containing words or structures that have not been seen before -RRB- and to erroneous input -LRB- e.g. with misspelled words or words accidentally omitted -RRB- .
Generally , handling such input gracefully with hand-written rules -- or more generally , creating systems of hand-written rules that make soft decisions -- extremely difficult , error-prone and time-consuming .
Systems based on automatically learning 「hand-written rules --」 (the rules) can be made more accurate simply by supplying more input data .
However , systems based on hand-written rules can only be made more accurate by increasing the complexity of the rules , which is a much more difficult task .
In particular , there is a limit to the complexity of systems based on hand-crafted rules , beyond which 「many speech recognition systems」 (the systems) become more and more unmanageable .
However , creating more data to input to machine-learning systems simply requires a corresponding increase in the number of man-hours worked , generally without significant increases in the complexity of the annotation process .
The subfield of 「NLP」 (NLP) devoted to learning approaches is known as Natural Language Learning -LRB- NLL -RRB- and 「NLP」 (its) conference CoNLL and peak body SIGNLL are sponsored by ACL , recognizing also their links with Computational Linguistics and Language Acquisition .
When the aims of computational language learning research is to understand more about human language acquisition , or psycholinguistics , 「NLL」 (NLL) overlaps into the related field of Computational Psycholinguistics .