Edited at

素人の言語処理100本ノック:50

More than 1 year has passed since last update.

言語処理100本ノック 2015の挑戦記録です。環境はUbuntu 16.04 LTS + Python 3.5.2 :: Anaconda 4.1.1 (64-bit)です。過去のノックの一覧はこちらからどうぞ。


第6章: 英語テキストの処理


英語のテキスト(nlp.txt)に対して,以下の処理を実行せよ.



50. 文区切り


(. or ; or : or ? or !) → 空白文字 → 英大文字というパターンを文の区切りと見なし,入力された文書を1行1文の形式で出力せよ.



出来上がったコード:


main.py

# coding: utf-8

import re

fname = 'nlp.txt'

def nlp_lines():
'''nlp.txtを1文ずつ読み込むジェネレータ
nlp.txtを順次読み込んで1文ずつ返す

戻り値:
1文の文字列
'''
with open(fname) as lines:

# 文切り出しの正規表現コンパイル
pattern = re.compile(r'''
(
^ # 行頭
.*? # 任意のn文字、最少マッチ
[\.|;|:|\?|!] # . or ; or : or ? or !
)
\s # 空白文字
(
[A-Z].* # 英大文字以降(=次の文以降)

)
''', re.MULTILINE + re.VERBOSE + re.DOTALL)

for line in lines:

line = line.strip() # 前後の空白文字除去
while len(line) > 0:

# 行から1文を取得
match = pattern.match(line)
if match:

# 切り出した文を返す
yield match.group(1) # 先頭の文
line = match.group(2) # 次の文以降

else:

# 区切りがないので、最後までが1文
yield line
line = ''

# 読み込み
for line in nlp_lines():
print(line)



実行結果:


端末:実行結果

Natural language processing

From Wikipedia, the free encyclopedia
Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages.
As such, NLP is related to the area of humani-computer interaction.
Many challenges in NLP involve natural language understanding, that is, enabling computers to derive meaning from human or natural language input, and others involve natural language generation.
History
The history of NLP generally starts in the 1950s, although work can be found from earlier periods.
In 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence.
The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English.
The authors claimed that within three or five years, machine translation would be a solved problem.
However, real progress was much slower, and after the ALPAC report in 1966, which found that ten year long research had failed to fulfill the expectations, funding for machine translation was dramatically reduced.
Little further research in machine translation was conducted until the late 1980s, when the first statistical machine translation systems were developed.
Some notably successful NLP systems developed in the 1960s were SHRDLU, a natural language system working in restricted "blocks worlds" with restricted vocabularies, and ELIZA, a simulation of a Rogerian psychotherapist, written by Joseph Weizenbaum between 1964 to 1966.
Using almost no information about human thought or emotion, ELIZA sometimes provided a startlingly human-like interaction.
When the "patient" exceeded the very small knowledge base, ELIZA might provide a generic response, for example, responding to "My head hurts" with "Why do you say your head hurts?".
During the 1970s many programmers began to write 'conceptual ontologies', which structured real-world information into computer-understandable data.
Examples are MARGIE (Schank, 1975), SAM (Cullingford, 1978), PAM (Wilensky, 1978), TaleSpin (Meehan, 1976), QUALM (Lehnert, 1977), Politics (Carbonell, 1979), and Plot Units (Lehnert 1981).
During this time, many chatterbots were written including PARRY, Racter, and Jabberwacky.
Up to the 1980s, most NLP systems were based on complex sets of hand-written rules.
Starting in the late 1980s, however, there was a revolution in NLP with the introduction of machine learning algorithms for language processing.
This was due to both the steady increase in computational power resulting from Moore's Law and the gradual lessening of the dominance of Chomskyan theories of linguistics (e.g. transformational grammar), whose theoretical underpinnings discouraged the sort of corpus linguistics that underlies the machine-learning approach to language processing.
Some of the earliest-used machine learning algorithms, such as decision trees, produced systems of hard if-then rules similar to existing hand-written rules.
However, Part of speech tagging introduced the use of Hidden Markov Models to NLP, and increasingly, research has focused on statistical models, which make soft, probabilistic decisions based on attaching real-valued weights to the features making up the input data.
The cache language models upon which many speech recognition systems now rely are examples of such statistical models.
Such models are generally more robust when given unfamiliar input, especially input that contains errors (as is very common for real-world data), and produce more reliable results when integrated into a larger system comprising multiple subtasks.
Many of the notable early successes occurred in the field of machine translation, due especially to work at IBM Research, where successively more complicated statistical models were developed.
These systems were able to take advantage of existing multilingual textual corpora that had been produced by the Parliament of Canada and the European Union as a result of laws calling for the translation of all governmental proceedings into all official languages of the corresponding systems of government.
However, most other systems depended on corpora specifically developed for the tasks implemented by these systems, which was (and often continues to be) a major limitation in the success of these systems.
As a result, a great deal of research has gone into methods of more effectively learning from limited amounts of data.
Recent research has increasingly focused on unsupervised and semi-supervised learning algorithms.
Such algorithms are able to learn from data that has not been hand-annotated with the desired answers, or using a combination of annotated and non-annotated data.
Generally, this task is much more difficult than supervised learning, and typically produces less accurate results for a given amount of input data.
However, there is an enormous amount of non-annotated data available (including, among other things, the entire content of the World Wide Web), which can often make up for the inferior results.
NLP using machine learning
Modern NLP algorithms are based on machine learning, especially statistical machine learning.
The paradigm of machine learning is different from that of most prior attempts at language processing.
Prior implementations of language-processing tasks typically involved the direct hand coding of large sets of rules.
The machine-learning paradigm calls instead for using general learning algorithms - often, although not always, grounded in statistical inference - to automatically learn such rules through the analysis of large corpora of typical real-world examples.
A corpus (plural, "corpora") is a set of documents (or sometimes, individual sentences) that have been hand-annotated with the correct values to be learned.
Many different classes of machine learning algorithms have been applied to NLP tasks.
These algorithms take as input a large set of "features" that are generated from the input data.
Some of the earliest-used algorithms, such as decision trees, produced systems of hard if-then rules similar to the systems of hand-written rules that were then common.
Increasingly, however, research has focused on statistical models, which make soft, probabilistic decisions based on attaching real-valued weights to each input feature.
Such models have the advantage that they can express the relative certainty of many different possible answers rather than only one, producing more reliable results when such a model is included as a component of a larger system.
Systems based on machine-learning algorithms have many advantages over hand-produced rules:
The learning procedures used during machine learning automatically focus on the most common cases, whereas when writing rules by hand it is often not obvious at all where the effort should be directed.
Automatic learning procedures can make use of statistical inference algorithms to produce models that are robust to unfamiliar input (e.g. containing words or structures that have not been seen before) and to erroneous input (e.g. with misspelled words or words accidentally omitted).
Generally, handling such input gracefully with hand-written rules -- or more generally, creating systems of hand-written rules that make soft decisions -- extremely difficult, error-prone and time-consuming.
Systems based on automatically learning the rules can be made more accurate simply by supplying more input data.
However, systems based on hand-written rules can only be made more accurate by increasing the complexity of the rules, which is a much more difficult task.
In particular, there is a limit to the complexity of systems based on hand-crafted rules, beyond which the systems become more and more unmanageable.
However, creating more data to input to machine-learning systems simply requires a corresponding increase in the number of man-hours worked, generally without significant increases in the complexity of the annotation process.
The subfield of NLP devoted to learning approaches is known as Natural Language Learning (NLL) and its conference CoNLL and peak body SIGNLL are sponsored by ACL, recognizing also their links with Computational Linguistics and Language Acquisition.
When the aims of computational language learning research is to understand more about human language acquisition, or psycholinguistics, NLL overlaps into the related field of Computational Psycholinguistics.

結果はGitHubにもアップしています。


ジェネレータ

これまでの問題と同じように、ファイルを必要な行だけ読み込んで順次処理する形にするため、ジェネレータにしました。

ファイルから1行読み込むとそこには複数の文が含まれているため、読み込んだらバッファ(line)に格納し、そこから最初の1文を切り出して返します。使わなかった2文目以降はバッファに残しておいて、バッファにある間はそこから文を順次切り出して返します。バッファが空になったら、またファイルから1行読み込んでバッファに格納します。この繰り返しです。

改行しかない行の扱いをちょっと悩みましたが、今回のコードではスキップしました。

(2016/12/17更新) 

shiracamusさんにアドバイスいただき、コードを更新しました。

以前のコードはテキストファイルをreadline()で1行ずつ読んでいたり、raise StopIterationでループの途中で抜けていたりと美しくなかったのが、すごくスッキリしました。

shiracamusさん、ありがとうございます!

51本目のノックは以上です。誤りなどありましたら、ご指摘いただけますと幸いです。


実行結果には、100本ノックで用いるコーパス・データで配布されているデータの一部が含まれます。この第6章で用いているデータのライセンスはクリエイティブ・コモンズ 表示-継承 3.0 非移植日本語訳)です。