# A quick cheat sheet for preprocessing natural language in Python (text cleaning, splitting, stop words, building a dictionary, converting to numbers)

Posted at 2018-03-10

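I put together a quick cheat sheet of Python techniques for preprocessing natural language text, so please use it when preparing your data. At the end I also introduce some recommended learning content that walks you through the flow of a data analysis, so take a look if you are interested.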

# 1. Text cleaning

In other words, turning `<p>Yes!!! Falling LOVE(⋈◍＞◡＜◍)。✧♡</p>` into `yes falling love`.

## Converting to lowercase

```python
text = "This is a pen."
# str.lower() returns a lowercased copy of the string
changed_text = text.lower()
print(changed_text)
```

`'this is a pen.'`

## Substitution with replace

```python
text = "123-4567-8910"
# Replace every hyphen with a space
changed_text = text.replace("-", " ")
print(changed_text)
```

`'123 4567 8910'`

## Substitution with regular expressions

```python
import re

text = "Mathematics scores were 100 points!\nPhysical score was 100 points!"
# Replace every run of digits with a single "0"
changed_text = re.sub(r'[0-9]+', "0", text)
print(changed_text)
```

`'Mathematics scores were 0 points!\nPhysical score was 0 points!'`

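## Replacing line breaks (`\n`)

First, split the text on line breaks.

```python
text = "This is a pen.\nI live in Osaka"
# splitlines() returns one list element per line, without the line breaks
divided_text = text.splitlines()
print(divided_text)
```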

`['This is a pen.', 'I live in Osaka']`

Then rejoin the pieces with spaces.

```python
divided_text = ['This is a pen.', 'I live in Osaka']
# Join the lines back together, separated by a single space
joined_divided_text = ' '.join(divided_text)
print(joined_divided_text)
```

`'This is a pen. I live in Osaka'`

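## Removing HTML tags

First, parse the fragment into a complete HTML document.

```python
from bs4 import BeautifulSoup

text = "<p>Mathematics scores were 100 points!</p><p>Physical score was 100 points!</p>"
# The lxml parser (requires the lxml package) wraps the fragment in <html><body>
changed_text = BeautifulSoup(text, "lxml")
print(changed_text)
```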

`<html><body><p>Mathematics scores were 100 points!</p><p>Physical score was 100 points!</p></body></html>`

```python
from bs4 import BeautifulSoup

text = "<p>Mathematics scores were 100 points!</p><p>Physical score was 100 points!</p>"
changed_text = BeautifulSoup(text, "lxml")
# get_text() drops the tags and returns only the text content
changed_changed_text = changed_text.get_text()
print(changed_changed_text)
```

`'Mathematics scores were 100 points!Physical score was 100 points!'`
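The individual steps above can be chained into one function. The following is only a minimal sketch: `clean_text` is a hypothetical helper name, and the tag removal uses a crude regular expression as a stand-in for BeautifulSoup so that the example runs with the standard library alone. It also assumes that only the letters a to z should survive, so digits and non-ASCII characters are discarded.

```python
import re


def clean_text(text):
    """Hypothetical helper: strip tags, lowercase, keep only the letters a-z."""
    # Crude tag removal with a regex (a stand-in for BeautifulSoup's get_text())
    text = re.sub(r"<[^>]+>", " ", text)
    # Lowercase everything
    text = text.lower()
    # Replace every run of characters other than a-z with a single space
    text = re.sub(r"[^a-z]+", " ", text)
    # Trim the spaces left over at both ends
    return text.strip()


print(clean_text("<p>Yes!!! Falling LOVE(⋈◍＞◡＜◍)。✧♡</p>"))
# yes falling love
```

For real-world HTML, a proper parser such as BeautifulSoup is likely the safer choice, since a regular expression does not handle comments, scripts, or malformed tags.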

# 2. Text splitting

In other words, turning `yes falling love` into `["yes", "falling", "love"]`.

## Splitting on whitespace

```python
text = "This is a pen."
# split() with no arguments splits on any run of whitespace
divided_text = text.split()
print(divided_text)
```

`['This', 'is', 'a', 'pen.']`

## Splitting on line breaks

```python
text = "This is a pen.\nI live in Osaka"
divided_text = text.splitlines()
print(divided_text)
```

`['This is a pen.', 'I live in Osaka']`
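Note that splitting on whitespace leaves punctuation attached to words (`'pen.'` above). One possible alternative, shown here as a sketch rather than as part of the original recipe, is to extract the word characters directly with `re.findall`. The variable name `tokens` is only illustrative.

```python
import re

text = "This is a pen.\nI live in Osaka"
# \w+ matches runs of letters, digits and underscores,
# so punctuation and line breaks are dropped in one pass
tokens = re.findall(r"\w+", text.lower())
print(tokens)
# ['this', 'is', 'a', 'pen', 'i', 'live', 'in', 'osaka']
```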

# 3. Removing stop words

In other words, turning `["yes", "falling", "love"]` into `["falling", "love"]`.

## Removing stop words

```python
words = ['this', 'is', 'a', 'pen']
stop_words = ['is', 'a']
# Keep only the words that are not in the stop word list
changed_words = [word for word in words if word not in stop_words]
print(changed_words)
```

`['this', 'pen']`

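## Getting a well-known stop word list

```python
import nltk
from nltk.corpus import stopwords

# The stop word corpus must be downloaded once before first use
nltk.download('stopwords')
stop_words = stopwords.words('english')
print(stop_words)
```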

`['i', 'me', 'my', 'myself', 'we',..., "won't", 'wouldn', "wouldn't"]`
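When the stop word list is long, storing it in a `set` makes each membership check fast, and lowercasing each word before the check avoids missing capitalized words. The sketch below uses a small hand-written set for illustration, not the NLTK list, and `remove_stop_words` is a hypothetical helper name.

```python
# Small hand-written stop word set for illustration (not the NLTK list)
stop_words = {"this", "is", "a", "i", "in"}


def remove_stop_words(words, stop_words):
    """Hypothetical helper: drop stop words, comparing in lowercase."""
    return [word for word in words if word.lower() not in stop_words]


print(remove_stop_words(["This", "is", "a", "pen"], stop_words))
# ['pen']
```

The list returned by `stopwords.words('english')` could be passed in the same way after wrapping it in `set(...)`.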

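# 4. Building a dictionary

```python
from collections import Counter

words = ['this', 'is', 'a', 'pen', 'this', 'is', 'a', 'book']
dictionary = {}
# Walk through the words from most to least frequent
for word, count in Counter(words).most_common():
    # Assign the next unused ID (0, 1, 2, ...) to each word
    dictionary[word] = len(dictionary)
print(dictionary)
```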

`{'a': 2, 'book': 4, 'is': 1, 'pen': 3, 'this': 0}`

### Supplement

```python
from collections import Counter

words = ['this', 'is', 'a', 'pen', 'this', 'is', 'a', 'book']
print(Counter(words).most_common())
```

The output is
`[('this', 2), ('is', 2), ('a', 2), ('pen', 1), ('book', 1)]`
In other words, `Counter(word_list).most_common()` returns a list of (word, count) pairs, sorted from the most frequent word to the least frequent.
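Because `most_common()` already returns the words in frequency order, the same word-to-ID mapping can also be written with `enumerate`. This is just an equivalent sketch; `ranked` and `id_to_word` are illustrative names, and the reverse mapping is an optional extra for turning IDs back into words.

```python
from collections import Counter

words = ['this', 'is', 'a', 'pen', 'this', 'is', 'a', 'book']
# most_common() yields (word, count) pairs, most frequent first
ranked = Counter(words).most_common()
# Number the words in that order: the most frequent word gets ID 0
dictionary = {word: index for index, (word, _count) in enumerate(ranked)}
# Reverse mapping from ID back to word
id_to_word = {index: word for word, index in dictionary.items()}
print(dictionary)
print(id_to_word)
```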

# 5. Converting a sentence to numbers

```python
words = ['this', 'is', 'a', 'pen']
dictionary = {'a': 2, 'book': 4, 'is': 1, 'pen': 3, 'this': 0}
sentence_vector = []
for word in words:
    # Look up the ID assigned to each word
    word_index = dictionary[word]
    sentence_vector.append(word_index)
print(sentence_vector)
```

`[0, 1, 2, 3]`
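Looking a word up with `dictionary[word]` raises a `KeyError` when the word is not in the dictionary. One common workaround, sketched here as an assumption rather than as part of the original recipe, is to reserve one extra ID for unknown words and use `dict.get`. The names `vectorize` and `unknown_id` are hypothetical.

```python
dictionary = {'this': 0, 'is': 1, 'a': 2, 'pen': 3, 'book': 4}
# Assumption: reserve the next free ID for words missing from the dictionary
unknown_id = len(dictionary)


def vectorize(words, dictionary, unknown_id):
    """Hypothetical helper: map each word to its ID, or to unknown_id if absent."""
    return [dictionary.get(word, unknown_id) for word in words]


print(vectorize(['this', 'is', 'a', 'cat'], dictionary, unknown_id))
# [0, 1, 2, 5]
```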

## [In closing] Learning content on data analysis methods (made by me)
