
A quick Python cheat sheet for natural-language preprocessing (text cleaning, splitting, stop words, dictionary building, numericalization)


Hi, this is Kakiuchi (@kakistuter).
I keep forgetting these steps, so I put together a quick Python cheat sheet for natural-language preprocessing.

1. Text cleaning

Removing odd characters and markup from ordinary text.
In other words, turning <p>Yes!!! Falling LOVE(⋈◍>◡<◍)。✧♡</p> → yes falling love.

Lowercasing

text = "This is a pen."
changed_text = text.lower()
print(changed_text)

Output:
'this is a pen.'

Replacement with replace()

text = "123-4567-8910"
changed_text = text.replace("-", " ")
print(changed_text)

Output:
'123 4567 8910'
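As a side note, when several characters need replacing at once, str.translate with a table built by str.maketrans does it in a single pass instead of chained replace() calls. (This example is my addition, not from the article.)

```python
# Replace both '-' and '/' with a space in one pass.
text = "123-4567/8910"
table = str.maketrans({"-": " ", "/": " "})
changed_text = text.translate(table)
print(changed_text)  # 123 4567 8910
```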

Replacement with a regular expression

import re
text = "Mathematics scores were 100 points!\nPhysical score was 100 points!"
changed_text = re.sub(r'[0-9]+', "0", text)
print(changed_text)

Output:
'Mathematics scores were 0 points!\nPhysical score was 0 points!'

Replacing newlines (\n)

First, split on newlines.

text = "This is a pen.\nI live in Osaka"
divided_text = text.splitlines()
print(divided_text)

Output:
['This is a pen.', 'I live in Osaka']

Then rejoin the pieces with spaces.

divided_text = ['This is a pen.', 'I live in Osaka']
joined_text = ' '.join(divided_text)
print(joined_text)

Output:
'This is a pen. I live in Osaka'
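The split-then-join above can also be done in one step with replace(), if all you want is to swap \n for a space. (A small sketch of my own; note that splitlines() additionally handles \r\n and other line endings, while this handles only \n.)

```python
# One-step newline replacement.
text = "This is a pen.\nI live in Osaka"
changed_text = text.replace("\n", " ")
print(changed_text)  # This is a pen. I live in Osaka
```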

Removing HTML tags

First, parse it into a complete HTML document.

from bs4 import BeautifulSoup
text = "<p>Mathematics scores were 100 points!</p><p>Physical score was 100 points!</p>"
changed_text = BeautifulSoup(text, "lxml")
print(changed_text)

Output:
<html><body><p>Mathematics scores were 100 points!</p><p>Physical score was 100 points!</p></body></html>

Next, extract only the text.

from bs4 import BeautifulSoup
text = "<p>Mathematics scores were 100 points!</p><p>Physical score was 100 points!</p>"
soup = BeautifulSoup(text, "lxml")
extracted_text = soup.get_text()
print(extracted_text)

Output:
'Mathematics scores were 100 points!Physical score was 100 points!'
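Putting the cleaning steps together, here is a sketch that reproduces the example from the intro: strip the HTML tags, lowercase, and drop everything that is not a letter or a space. (This combined pipeline is my addition; the regex that keeps only lowercase letters is one choice among many.)

```python
from bs4 import BeautifulSoup
import re

text = "<p>Yes!!! Falling LOVE(⋈◍>◡<◍)。✧♡</p>"
no_tags = BeautifulSoup(text, "html.parser").get_text()  # strip HTML tags
lowered = no_tags.lower()                                # lowercase
cleaned = re.sub(r"[^a-z\s]", "", lowered)               # drop non-letters
cleaned = " ".join(cleaned.split())                      # normalize spaces
print(cleaned)  # yes falling love
```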

2. Text splitting

Exactly what it sounds like.
In other words, turning yes falling love → ["yes", "falling", "love"].

Splitting on spaces

text = "This is a pen."
divided_text = text.split()
print(divided_text)

Output:
['This', 'is', 'a', 'pen.']

Splitting on newlines

text = "This is a pen.\nI live in Osaka"
divided_text = text.splitlines()
print(divided_text)

Output:
['This is a pen.', 'I live in Osaka']
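Note that split() keeps punctuation attached to words ('pen.'). A small regex tokenizer is one way to pull out the words only; this is my own sketch, not the only approach (NLTK's tokenizers are a more complete option).

```python
import re

text = "This is a pen."
# Keep runs of letters (and apostrophes, for words like "don't").
tokens = re.findall(r"[A-Za-z']+", text)
print(tokens)  # ['This', 'is', 'a', 'pen']
```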

3. Removing stop words

Exactly what it sounds like.
Assuming "yes" was judged unnecessary for this analysis,
this means turning ["yes", "falling", "love"] → ["falling", "love"].

Removing stop words

words = ['this', 'is', 'a', 'pen']
stop_words = ['is', 'a']
changed_words = [word for word in words if word not in stop_words]
print(changed_words)

Output:
['this', 'pen']

Getting a standard stop word list

from nltk.corpus import stopwords  # requires nltk.download('stopwords') once
stop_words = stopwords.words('english')
print(stop_words)

Output:
['i', 'me', 'my', 'myself', 'we',..., "won't", 'wouldn', "wouldn't"]
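Combining the two snippets above: holding the stop words in a set makes each lookup O(1), and lowercasing before the comparison makes the match case-insensitive. (A sketch of my own, using an inline stop word list instead of NLTK's.)

```python
words = ['This', 'is', 'a', 'pen']
stop_words = {'is', 'a'}  # a set, so membership tests are O(1)
changed_words = [w for w in words if w.lower() not in stop_words]
print(changed_words)  # ['This', 'pen']
```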

4. Building a dictionary

Exactly what it sounds like.
In other words, turning ["i", "am", "tom", "i", "am", "jack"] → {'i': 0, 'am': 1, 'tom': 2, 'jack': 3}.

from collections import Counter
words = ['this', 'is', 'a', 'pen', 'this', 'is', 'a', 'book']
dictionary = {}
for word, count in Counter(words).most_common():
    dictionary[word] = len(dictionary)
print(dictionary)

Output (dicts print in insertion order in Python 3.7+):
{'this': 0, 'is': 1, 'a': 2, 'pen': 3, 'book': 4}

Note

words = ['this', 'is', 'a', 'pen', 'this', 'is', 'a', 'book']
print(Counter(words).most_common())

This outputs
[('this', 2), ('is', 2), ('a', 2), ('pen', 1), ('book', 1)]
In other words, Counter(word_list).most_common() returns (word, count) pairs as a list, sorted from most to least frequent.
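A common variant of the dictionary above (my addition, not from the article) is to reserve index 0 for a special "unknown word" token, so that words outside the dictionary can still be encoded later.

```python
from collections import Counter

words = ['this', 'is', 'a', 'pen', 'this', 'is', 'a', 'book']
dictionary = {'<unk>': 0}  # reserve 0 for unknown words
for word, count in Counter(words).most_common():
    dictionary[word] = len(dictionary)
print(dictionary)
# {'<unk>': 0, 'this': 1, 'is': 2, 'a': 3, 'pen': 4, 'book': 5}
```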

5. Numericalizing sentences

Exactly what it sounds like.
In other words, using the dictionary to turn ["i", "am", "tom"] → [0, 1, 2].

words = ['this', 'is', 'a', 'pen']
dictionary = {'a': 2, 'book': 4, 'is': 1, 'pen': 3, 'this': 0}
sentence_vector = []
for word in words:
    word_index = dictionary[word]
    sentence_vector.append(word_index)
print(sentence_vector)

Output:
[0, 1, 2, 3]
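One caveat: dictionary[word] raises a KeyError for any word not in the dictionary. A sketch of one way around this (my addition) is dict.get with a fallback index for unknown words:

```python
words = ['this', 'is', 'my', 'pen']  # 'my' is not in the dictionary
dictionary = {'a': 2, 'book': 4, 'is': 1, 'pen': 3, 'this': 0}
unk_index = len(dictionary)  # 5, one past the known indices
sentence_vector = [dictionary.get(word, unk_index) for word in words]
print(sentence_vector)  # [0, 1, 5, 3]
```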
