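This is a record of knock 56, "Coreference Resolution", from "Chapter 6: Processing English Text" of 言語処理100本ノック 2015 (NLP 100 Exercise 2015).
Chapter 6: Processing English Text
This chapter surveys a range of fundamental natural language processing techniques by processing English text with Stanford Core NLP.
Stanford Core NLP, stemming, part-of-speech tagging, named entity recognition, coreference resolution, dependency parsing, phrase structure parsing, S-expressions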
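56. Coreference Resolution
Based on the results of Stanford Core NLP's coreference resolution, replace each mention in the text with its representative mention. When replacing, make sure the original mention can still be identified, for example by writing "representative mention (mention)".
The figure on the official Stanford CoreNLP site makes the idea easy to understand.
How Stanford CoreNLP does this is described in detail in Deterministic Coreference Resolution System.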
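As a rough illustration of the data the answer program reads, the sketch below parses a small hand-written XML fragment. The fragment is an assumption modeled on the element paths used in the program in this post (`document/coreference/coreference/mention` with `sentence`, `start`, `end`, and `text` children). The helper name `collect_replacements` and the sample mention texts are hypothetical, and real CoreNLP output contains more elements than shown here.

```python
import xml.etree.ElementTree as ET

# Hand-written toy fragment (an assumption, not real CoreNLP output).
SAMPLE_XML = """
<root>
  <document>
    <coreference>
      <coreference>
        <mention representative="true">
          <sentence>1</sentence>
          <start>1</start>
          <end>3</end>
          <text>Alan Turing</text>
        </mention>
        <mention>
          <sentence>2</sentence>
          <start>1</start>
          <end>2</end>
          <text>He</text>
        </mention>
      </coreference>
    </coreference>
  </document>
</root>
"""


def collect_replacements(xml_text):
    """Map (sentence id, start token id) -> (last token id, representative text)."""
    root = ET.fromstring(xml_text)
    replacements = {}
    for chain in root.iterfind('./document/coreference/coreference'):
        # The representative mention is the one flagged representative="true"
        representative = chain.findtext('./mention[@representative="true"]/text')
        for mention in chain.iterfind('./mention'):
            if mention.get('representative') is None:
                key = (mention.findtext('sentence'), mention.findtext('start'))
                # Shift "end" by one so it points at the mention's last token
                last = int(mention.findtext('end')) - 1
                replacements[key] = (last, representative)
    return replacements


print(collect_replacements(SAMPLE_XML))
```

In this toy fragment the only non-representative mention is "He" in sentence 2, so the dictionary has a single entry keyed by `('2', '1')`.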
Answer program: 056.共参照解析.ipynb
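import xml.etree.ElementTree as ET

# Parse the XML file containing the analysis results
root = ET.parse('./nlp.txt.xml')

# Enumerate the coreference chains and build a dictionary of the positions
# to be replaced with the representative mention.
# Format: {(sentence id, start token id): (end token id, representative mention), ...}
replaces = {}
for coreference in root.iterfind('./document/coreference/coreference'):
    # Get the representative mention
    representative = coreference.findtext('./mention[@representative="true"]/text')
    # Enumerate the non-representative mentions and add them to the dictionary
    for mention in coreference.iterfind('./mention'):
        if mention.get('representative') is None:
            sentence_id = mention.findtext('sentence')
            start = mention.findtext('start')
            end = int(mention.findtext('end')) - 1  # shift end to the mention's last token
            # If the key is already in the dictionary, just overwrite it (last one wins)
            replaces[(sentence_id, start)] = (end, representative)

# Print the text while applying the replacements in replaces
for sentence in root.iterfind('./document/sentences/sentence'):
    sentence_id = sentence.get('id')
    end = 0  # reset so a stale value from the loop above cannot print a stray ')'
    for token in sentence.iterfind('./tokens/token'):
        token_id = token.get('id')
        # Start of a replacement
        if (sentence_id, token_id) in replaces:
            # Look up the end position and the representative mention
            (end, representative) = replaces[(sentence_id, token_id)]
            # Insert the representative mention and an opening parenthesis (end='' suppresses the newline)
            print('「' + representative + '」 (', end='')
        # Print the token (end='' suppresses the newline)
        print(token.findtext('word'), end='')
        # At the end of a replacement, insert the closing parenthesis (end='' suppresses the newline)
        if int(token_id) == end:
            print(')', end='')
            end = 0
        # Do not worry about the space added before sentence-final periods etc. (end='' suppresses the newline)
        print(' ', end='')
    print()  # newline after each sentence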
Natural language processing From Wikipedia , the free encyclopedia Natural language processing -LRB- NLP -RRB- is a field of computer science , artificial intelligence , and linguistics concerned with the interactions between computers and human -LRB- natural -RRB- languages .
As such , 「NLP」 (NLP) is related to the area of human-computer interaction .
Many challenges in 「NLP」 (NLP) involve natural language understanding , that is , enabling computers to derive meaning from human or natural language input , and others involve natural language generation .
History The history of 「NLP」 (NLP) generally starts in the 1950s , although work can be found from earlier periods .
In 1950 , Alan Turing published an article titled `` Computing Machinery and Intelligence '' which proposed what is now called the Turing test as a criterion of intelligence .
The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English .
The authors claimed that within three or five years , machine translation would be a solved problem .
However , real progress was much slower , and after the ALPAC report in 1966 , which found that ten year long research had failed to fulfill the expectations , funding for 「machine translation」 (machine translation) was dramatically reduced .
Little further research in 「machine translation」 (machine translation) was conducted until the late 1980s , when the first statistical machine translation systems were developed .
Some notably successful NLP systems developed in the 1960s were SHRDLU , a natural language system working in restricted `` blocks worlds '' with restricted vocabularies , and ELIZA , a simulation of a Rogerian psychotherapist , written by Joseph Weizenbaum between 1964 to 「1966」 (1966) .
Using almost no information about human thought or emotion , ELIZA sometimes provided a startlingly human-like interaction .
When the `` patient '' exceeded the very small knowledge base , 「ELIZA」 (ELIZA) might provide a generic response , for example , responding to `` My head hurts '' with `` Why do you say 「My head」 (your head) hurts ? ''
During the 1970s many programmers began to write ` conceptual ontologies ' , which structured real-world information into computer-understandable data .
Examples are MARGIE -LRB- Schank , 1975 -RRB- , SAM -LRB- Cullingford , 1978 -RRB- , PAM -LRB- Wilensky , 「1978」 (1978) -RRB- , TaleSpin -LRB- Meehan , 1976 -RRB- , QUALM -LRB- Lehnert , 1977 -RRB- , Politics -LRB- Carbonell , 1979 -RRB- , and Plot Units -LRB- Lehnert 1981 -RRB- .
During this time , many chatterbots were written including PARRY , Racter , and Jabberwacky .
Up to the 1980s , most 「NLP」 (NLP) systems were based on complex sets of hand-written rules .
Starting in the late 1980s , however , there was a revolution in 「NLP」 (NLP) with the introduction of machine learning algorithms for language processing .
This was due to both the steady increase in computational power resulting from Moore 's Law and the gradual lessening of the dominance of Chomskyan theories of linguistics -LRB- e.g. transformational grammar -RRB- , whose theoretical underpinnings discouraged the sort of corpus linguistics that underlies the machine-learning approach to language processing .
Some of the earliest-used machine learning algorithms , such as decision trees , produced systems of hard if-then rules similar to existing hand-written rules .
However , Part of speech tagging introduced the use of Hidden Markov Models to NLP , and increasingly , research has focused on statistical models , which make soft , probabilistic decisions based on attaching real-valued weights to the features making up the input data .
The cache language models upon which many speech recognition systems now rely are examples of such statistical models .
Such models are generally more robust when given unfamiliar input , especially input that contains errors -LRB- as is very common for real-world data -RRB- , and produce more reliable results when integrated into a larger system comprising multiple subtasks .
Many of the notable early successes occurred in the field of machine translation , due especially to work at IBM Research , where successively more complicated statistical models were developed .
「many speech recognition systems」 (These systems) were able to take advantage of existing multilingual textual corpora that had been produced by the Parliament of Canada and the European Union as a result of laws calling for the translation of all governmental proceedings into all official languages of the corresponding systems of government .
However , most other systems depended on corpora specifically developed for the tasks implemented by these systems , which was -LRB- and often continues to be -RRB- a major limitation in the success of 「many speech recognition systems」 (these systems) .
As a result , a great deal of research has gone into methods of more effectively learning from limited amounts of data .
Recent research has increasingly focused on unsupervised and semi-supervised learning algorithms .
Such algorithms are able to learn from data that has not been hand-annotated with the desired answers , or using a combination of annotated and non-annotated data .
Generally , this task is much more difficult than supervised learning , and typically produces less accurate results for a given amount of input data .
However , there is an enormous amount of non-annotated data available -LRB- including , among other things , the entire content of the World Wide Web -RRB- , which can often make up for the inferior results .
NLP using machine learning Modern NLP algorithms are based on machine learning , especially statistical machine learning .
「The machine-learning paradigm」 (The paradigm of machine learning) is different from that of most prior attempts at language processing .
Prior implementations of language-processing tasks typically involved the direct hand coding of large sets of rules .
The machine-learning paradigm calls instead for using general learning algorithms - often , although not always , grounded in statistical inference - to automatically learn such rules through the analysis of large corpora of typical real-world examples .
A corpus -LRB- plural , `` corpora '' -RRB- is a set of documents -LRB- or sometimes , individual sentences -RRB- that have been hand-annotated with the correct values to be learned .
Many different classes of machine learning algorithms have been applied to 「NLP」 (NLP) tasks .
「machine learning algorithms」 (These algorithms) take as input a large set of `` features '' that are generated from 「the input data」 (the input data) .
Some of 「machine learning algorithms」 (the earliest-used algorithms) , such as decision trees , produced systems of hard if-then rules similar to the systems of hand-written rules that were then common .
Increasingly , however , research has focused on statistical models , which make soft , probabilistic decisions based on attaching real-valued weights to each input feature .
Such models have the advantage that 「Such models」 (they) can express the relative certainty of many different possible answers rather than only one , producing more reliable results when such a model is included as a component of a larger system .
Systems based on machine-learning algorithms have many advantages over hand-produced rules : The learning procedures used during machine learning automatically focus on the most common cases , whereas when writing rules by hand it is often not obvious at all where the effort should be directed .
Automatic learning procedures can make use of statistical inference algorithms to produce models that are robust to unfamiliar input -LRB- e.g. containing words or structures that have not been seen before -RRB- and to erroneous input -LRB- e.g. with misspelled words or words accidentally omitted -RRB- .
Generally , handling such input gracefully with hand-written rules -- or more generally , creating systems of hand-written rules that make soft decisions -- extremely difficult , error-prone and time-consuming .
Systems based on automatically learning 「hand-written rules --」 (the rules) can be made more accurate simply by supplying more input data .
However , systems based on hand-written rules can only be made more accurate by increasing the complexity of the rules , which is a much more difficult task .
In particular , there is a limit to the complexity of systems based on hand-crafted rules , beyond which 「many speech recognition systems」 (the systems) become more and more unmanageable .
However , creating more data to input to machine-learning systems simply requires a corresponding increase in the number of man-hours worked , generally without significant increases in the complexity of the annotation process .
The subfield of 「NLP」 (NLP) devoted to learning approaches is known as Natural Language Learning -LRB- NLL -RRB- and 「NLP」 (its) conference CoNLL and peak body SIGNLL are sponsored by ACL , recognizing also their links with Computational Linguistics and Language Acquisition .
When the aims of computational language learning research is to understand more about human language acquisition , or psycholinguistics , 「NLL」 (NLL) overlaps into the related field of Computational Psycholinguistics .
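The replacement pass of the answer program can also be sketched as a function that returns a string instead of printing, which makes it easy to check on a toy sentence. The function name `render_sentence`, its arguments, and the sample tokens are hypothetical and not part of the original notebook. The 「」 and ( ) markers follow the output above.

```python
def render_sentence(sentence_id, words, replacements):
    """Rebuild one sentence, wrapping each mention as 「representative」 (mention).

    words is the list of token strings (token ids are their 1-based positions).
    replacements maps (sentence id, start token id) to
    (last token id, representative text), using string ids for the key
    as in the answer program.
    """
    pieces = []
    last = 0  # 0 means no replacement is currently open
    for position, word in enumerate(words, start=1):
        key = (sentence_id, str(position))
        if key in replacements:
            # A mention starts here: prepend the representative mention
            last, representative = replacements[key]
            word = '「' + representative + '」 (' + word
        if position == last:
            # The mention ends here: close the parenthesis
            word += ')'
            last = 0
        pieces.append(word)
    return ' '.join(pieces)


# Toy input (an assumption): "He" in sentence 2 refers to "Alan Turing".
toy_replacements = {('2', '1'): (1, 'Alan Turing')}
print(render_sentence('2', ['He', 'published', 'an', 'article', '.'], toy_replacements))
```

Returning a string keeps the state (`last`) local to one sentence, so nothing can carry over from an earlier sentence or loop.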