
NLP 100 Knocks (言語処理100本ノック) - 52: Stemming

Posted at 2020-03-12, last updated at 2020-03-12

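This is my record of knock 52, "Stemming", from "Chapter 6: Processing English Text" of NLP 100 Knocks 2015 (言語処理100本ノック 2015).
Stemming is something you are likely to use often in language processing. In fact, I use it again later in knock 71, one of the machine-learning knocks. Since it only requires calling a function, it is technically very simple.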

Reference links

Link	Notes
052.ステミング.ipynb	GitHub link to the answer program
素人の言語処理100本ノック:52	Source from which much of the code was copied

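Environment

Type	Version	Notes
OS	Ubuntu 18.04.01 LTS	Running in a virtual machine
pyenv	1.2.16	Used because I sometimes work with multiple Python environments
Python	3.8.1	Python 3.8.1 running on pyenv; packages are managed with venv

In the environment above, I use the following additional Python package, installed with plain pip. I did not use the stemming package suggested by the knock: it has not been updated since 2010, and nltk appears to be the more common choice today.

Type	Version
nltk	3.4.5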

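Chapter 6: Processing English Text

Learning objectives

Through English text processing with Stanford Core NLP, this chapter surveys various fundamental technologies of natural language processing.

Stanford Core NLP, stemming, part-of-speech tagging, named entity recognition, coreference resolution, dependency parsing, phrase structure parsing, S-expressions

Knock description

Perform the following processing on the English text (nlp.txt).

52. Stemming

Take the output of knock 51 as input, apply Porter's stemming algorithm, and output each word and its stem in tab-separated format. In Python, the stemming module is a good choice as an implementation of Porter's stemming algorithm.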

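Supplementary notes (about "stemming")

Stemming is the process of reducing a word to its stem, the leading part of the word that does not change under inflection (for example, stemming Natural gives natur). Stemming is used again later in knock 71.
There are several stemming algorithms; this article uses Porter's algorithm, which appears to be one of the best known. If you want the details, look up the algorithm itself.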
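To make the idea concrete, here is a minimal sketch that stems a few words with nltk's PorterStemmer. The word list is only an illustration, picked from words that appear in the output further down; the variable names are my own.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Illustrative words; "computer" and "computers" should share one stem
words = ["Natural", "computer", "computers", "interactions"]

# Map each word to its Porter stem (nltk lowercases the result by default)
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(word, "->", stem)
```

Because inflected forms such as computer and computers collapse to the same stem, they can be counted as one feature, which is what makes stemming useful in later knocks.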

Answer

Answer program: 052.ステミング.ipynb

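from nltk.stem.porter import PorterStemmer as PS

# Create the Porter stemmer once and reuse it for every word
ps = PS()

with open('./051.result.txt') as file_in, \
     open('./052.result.txt', 'w') as file_out:
    for token in file_in:
        # Skip blank lines
        if token != '\n':
            word = token.rstrip()
            # print() joins its arguments with a space, so the tab in the
            # output has a space on either side
            print(word, '\t', ps.stem(word), file=file_out)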

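Explanation of the answer

The program is short and there is not much to explain. Calling ps.stem() is all it takes to stem a word, so it is very easy to use. Note that print() separates its arguments with a space by default, so each output line has a space on either side of the tab.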
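If strictly tab-separated output is needed, one option is to build each line with an f-string so that nothing but the tab sits between the word and its stem. The following is a sketch only, not the program used for this article: the helper name stem_lines and the sample input are hypothetical, and it works on an in-memory list instead of 051.result.txt so that it runs on its own.

```python
from nltk.stem.porter import PorterStemmer


def stem_lines(lines):
    """Yield 'word<TAB>stem' for each non-blank input line (hypothetical helper)."""
    stemmer = PorterStemmer()
    for line in lines:
        word = line.rstrip()
        if word:  # skip blank lines
            yield f"{word}\t{stemmer.stem(word)}"


# Sample input in the one-word-per-line shape the program above reads
sample = ["Natural\n", "language\n", "processing\n", "\n", "computers\n"]

rows = list(stem_lines(sample))
for row in rows:
    print(row)
```

Passing an open file object instead of the sample list should work the same way, since the function only iterates over lines.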

Output (execution result)

Running the program produces the following output (first 30 lines shown).

052.result.txt (first 30 lines)

Natural 	 natur
language 	 languag
processing 	 process
From 	 from
Wikipedia 	 wikipedia
the 	 the
free 	 free
encyclopedia 	 encyclopedia
Natural 	 natur
language 	 languag
processing 	 process
(NLP) 	 (nlp)
is 	 is
a 	 a
field 	 field
of 	 of
computer 	 comput
science 	 scienc
artificial 	 artifici
intelligence 	 intellig
and 	 and
linguistics 	 linguist
concerned 	 concern
with 	 with
the 	 the
interactions 	 interact
between 	 between
computers 	 comput
and 	 and
human 	 human
