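Stopwords
Words that occur frequently but carry almost no distinctive or characteristic meaning of their own (the term also refers to the practice of excluding such words from processing)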
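To make the definition concrete, here is a minimal sketch that counts word frequencies in a tiny made-up sentence. The sample text and the variable names are assumptions for illustration only; on real corpora the effect is the same in spirit, with function words such as "the", "and", and "a" dominating the counts.

```python
from collections import Counter

# Hypothetical sample text, chosen only to illustrate the idea.
text = "the cat and the dog and the bird saw a fish and a frog"

# Count how often each whitespace-separated word occurs.
counts = Counter(text.split())

# The three most frequent words are all function words with little
# meaning of their own, which is why they are typical stopwords.
top_words = counts.most_common(3)
print(top_words)
```

The content words (cat, dog, bird, fish, frog) each occur only once, so a frequency-based view alone already separates them from the stopword candidates in this toy example.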
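Pitfalls in stopword processing
When the code includes a step that replaces the symbols in a sentence with spaces
→ it is easy to forget to add fragments such as 's' and 't' to the stopword list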
Example 1
import re
# stopwords are lowercase because the sentence is lowercased before matching
stopwords = ["i", "am", "you", "are", "he", "is", "a"]
sentence = "he's a good student."
sentence = sentence.lower()  # convert to lowercase
sentence = re.sub(r"[!-\?()' ‘’.,;/:-@[-`{-~]", " ", sentence)  # replace symbols with spaces
sentence = sentence.split(" ")  # split into words on single spaces
sentence_words = []
for word in sentence:
    if word in stopwords:  # skip words that are in the stopword list
        continue
    sentence_words.append(word)
print('sentence_words :', sentence_words)
sentence_words : ['s', 'good', 'student', '']
"he" was removed as a stopword, but the "s" left over from the contraction of "is" was not removed and remains in the output
Example 2 (Example 1 corrected by adding 's' to the stopword list)
import re
# stopwords are lowercase because the sentence is lowercased before matching;
# "s" is added to catch the fragment left over from "he's"
stopwords = ["i", "am", "you", "are", "he", "is", "a", "s"]
sentence = "he's a good student."
sentence = sentence.lower()  # convert to lowercase
sentence = re.sub(r"[!-\?()' ‘’.,;/:-@[-`{-~]", " ", sentence)  # replace symbols with spaces
sentence = sentence.split(" ")  # split into words on single spaces
sentence_words = []
for word in sentence:
    if word in stopwords:  # skip words that are in the stopword list
        continue
    sentence_words.append(word)
print('sentence_words :', sentence_words)
sentence_words : ['good', 'student', '']
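Only the words we want to keep remain (the trailing empty string is a by-product of splitting on the space that replaced the final period)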
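As a variation on Example 2, the same steps can be wrapped in a small function. This is only a sketch under a few assumptions: `remove_stopwords`, `SYMBOLS`, and `STOPWORDS` are hypothetical names, the symbol pattern is a simplified one rather than the pattern used in the examples, and `split()` is called without an argument so that the empty string at the end of the list does not appear.

```python
import re

# Assumption: a simplified pattern that treats anything other than
# lowercase letters, digits, and whitespace as a symbol.
SYMBOLS = re.compile(r"[^a-z0-9\s]")

# Stopword list from Example 2, stored as a set for fast membership tests.
STOPWORDS = {"i", "am", "you", "are", "he", "is", "a", "s"}


def remove_stopwords(sentence, stopwords=STOPWORDS):
    """Lowercase the sentence, replace symbols with spaces, drop stopwords."""
    cleaned = SYMBOLS.sub(" ", sentence.lower())
    # split() with no argument splits on runs of whitespace,
    # so it never yields empty strings.
    return [word for word in cleaned.split() if word not in stopwords]


print(remove_stopwords("he's a good student."))
```

With this sentence the function returns only `good` and `student`. Passing a different `stopwords` set makes it easy to check how a missing fragment such as "s" changes the result.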
Other stopwords that are easy to overlook
"s", (he's, she's)
"t", (don't, didn't)
"don",
"didn",
"aren",
"isn",
"re", (you're, they're)
"ll", (I'll, we'll)
"ve" (I've, you've)
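The fragments listed above can also be found mechanically. The sketch below is one possible way to do it, assuming a hand-written list of contractions and a simplified symbol pattern (not the pattern from the examples); `fragments` and `find_missing_fragments` are hypothetical helper names. It splits each contraction the same way the preprocessing would and reports the pieces that the stopword list does not yet cover.

```python
import re

# Assumption: a simplified pattern that treats anything other than
# lowercase letters, digits, and whitespace as a symbol.
SYMBOLS = re.compile(r"[^a-z0-9\s]")


def fragments(text):
    """Lowercase the text, replace symbols with spaces, and split into words."""
    return SYMBOLS.sub(" ", text.lower()).split()


def find_missing_fragments(contractions, stopwords):
    """Return the sorted fragments of the contractions not covered by stopwords."""
    pieces = {word for c in contractions for word in fragments(c)}
    return sorted(pieces - set(stopwords))


# Hand-written sample of contractions (an illustrative list, not exhaustive).
contractions = ["he's", "don't", "didn't", "aren't", "isn't",
                "you're", "I'll", "I've"]
stopwords = {"i", "am", "you", "are", "he", "is", "a"}

missing = find_missing_fragments(contractions, stopwords)
print(missing)
```

Any fragment reported here is a candidate to add to the stopword list before running the filter on real text.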