
Language Processing 100 Knocks-50: Sentence Segmentation

Last updated at 2020-03-12, posted at 2020-03-10

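This is my record of the 50th knock, "Sentence segmentation", from "Chapter 6: Processing English Text" of Language Processing 100 Knocks 2015 (言語処理100本ノック 2015).
Compared with the difficult knock 49, this one is very easy and feels like a short breather. It splits sentences using a regular expression.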

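Reference links

Link	Notes
050.文区切り.ipynb	GitHub link to the answer program
An Amateur's Language Processing 100 Knocks: 50 (素人の言語処理100本ノック:50)	Source from which much of the code was copied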

Environment

Type	Version	Notes
OS	Ubuntu 18.04.01 LTS	Running in a virtual machine
pyenv	1.2.16	I use pyenv because I sometimes work with multiple Python environments
Python	3.8.1	Python 3.8.1 running on pyenv; packages are managed with venv

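Chapter 6: Processing English Text

Learning content

Through English text processing with Stanford Core NLP, this chapter surveys a range of foundational natural language processing techniques.

Stanford Core NLP, stemming, part-of-speech tagging, named entity recognition, coreference resolution, dependency parsing, phrase structure parsing, S-expressions

Knock content

Perform the following processing on the English text (nlp.txt).

50. Sentence segmentation

Treat the pattern (. or ; or : or ? or !) → whitespace character → uppercase English letter as a sentence boundary, and output the input document in a one-sentence-per-line format.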

Answer

Answer program: 050.文区切り.ipynb

import re

with open('./nlp.txt') as file_in, \
     open('./050.result.txt', 'w') as file_out:
    for line in file_in:
        if line != '\n':
            line = re.sub(r'''
                         (?<=[.;:?!])  # positive lookbehind: . or ; or : or ? or !
                         \s            # whitespace (the part replaced with a newline)
                         (?=[A-Z])     # positive lookahead: uppercase English letter
                       ''', '\n', line, flags=re.VERBOSE)
            print(line.rstrip(), file=file_out)
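
The same boundary rule can also be tried on an in-memory string without reading nlp.txt. The following is only a minimal sketch: the function name split_sentences and the sample text are made up for illustration and are not part of the original notebook, and it uses re.split instead of re.sub.

```python
import re

# Zero-width conditions around a single whitespace character:
# sentence-ending punctuation behind it, an uppercase letter ahead of it.
BOUNDARY = re.compile(r'(?<=[.;:?!])\s(?=[A-Z])')


def split_sentences(text):
    """Split text at every whitespace character that matches the knock-50 rule."""
    return BOUNDARY.split(text)


# Hypothetical sample text, not taken from nlp.txt.
sample = 'Is this NLP? Yes, it is. We use e.g. regex: It works!'
for sentence in split_sentences(sample):
    print(sentence)
```

Note that "e.g. regex" is not split, because the whitespace there is followed by a lowercase letter.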

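Answer explanation

Positive lookahead/lookbehind

This solution uses positive lookahead and lookbehind assertions in the regular expression.
They act as search conditions without becoming part of the match (here, the text being replaced). For details, see the article "ゼロから覚えるPython正規表現の基本とTips" (Python regular expression basics and tips, learned from scratch).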
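
To see why the assertions matter, the small comparison below may help. It is an illustrative sketch with a made-up input string, contrasting a pattern that consumes the punctuation and the capital letter with one that only checks for them.

```python
import re

# Hypothetical input, chosen only for illustration.
text = 'First sentence. Second sentence.'

# Without lookarounds, the punctuation and the capital letter are part of
# the match, so the replacement removes them together with the whitespace.
consumed = re.sub(r'[.;:?!]\s[A-Z]', '\n', text)

# With lookbehind/lookahead, only the whitespace character is matched,
# so the surrounding characters survive the replacement.
preserved = re.sub(r'(?<=[.;:?!])\s(?=[A-Z])', '\n', text)

print(repr(consumed))
print(repr(preserved))
```

In the first result the period and the "S" of "Second" are lost, while the second result keeps both and only swaps the space for a newline.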

Output (execution result)

Running the program produces the following output (first 10 lines only).

050.result.txt (first 10 lines only)

Natural language processing
From Wikipedia, the free encyclopedia
Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages.
As such, NLP is related to the area of humani-computer interaction.
Many challenges in NLP involve natural language understanding, that is, enabling computers to derive meaning from human or natural language input, and others involve natural language generation.
History
The history of NLP generally starts in the 1950s, although work can be found from earlier periods.
In 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence.
The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English.
The authors claimed that within three or five years, machine translation would be a solved problem.
