Pitfalls in stopword handling

When your preprocessing code replaces the symbols in a sentence with spaces
→ it is easy to forget to add fragments such as 's' and 't' to the stopword list

Example 1


import re

# All lowercase, because the sentence is lowercased before filtering
stopwords = ["i", "am", "you", "are", "he", "is", "a"]

sentence = "he's a good student."
sentence = sentence.lower()  # convert to lowercase
# Replace symbols with spaces (the range !-? also covers the digits 0-9)
sentence = re.sub(r"[!-\?()' ‘’.,;/:-@[-`{-~]", " ", sentence)
sentence = sentence.split(" ")  # split into words on single spaces
sentence_words = []
for word in sentence:
    if word in stopwords:  # skip stopwords
        continue
    sentence_words.append(word)

print('sentence_words : ', sentence_words)

sentence_words : ['s', 'good', 'student', '']

"he" was removed as a stopword, but the "s" left over from the contracted "is" was not removed and remains in the result.

Example 2 (Example 1 corrected by adding 's' to the stopwords)


import re

# All lowercase, because the sentence is lowercased before filtering
stopwords = ["i", "am", "you", "are", "he", "is", "a", "s"]

sentence = "he's a good student."
sentence = sentence.lower()  # convert to lowercase
# Replace symbols with spaces (the range !-? also covers the digits 0-9)
sentence = re.sub(r"[!-\?()' ‘’.,;/:-@[-`{-~]", " ", sentence)
sentence = sentence.split(" ")  # split into words on single spaces
sentence_words = []
for word in sentence:
    if word in stopwords:  # skip stopwords
        continue
    sentence_words.append(word)

print('sentence_words : ', sentence_words)

sentence_words : ['good', 'student', '']

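Now only the words we want to keep remain, apart from the trailing empty string that split(" ") leaves after the final period.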
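
As a small variation on Example 2, the same steps can be wrapped in a function. This is only a sketch under assumptions: the name `remove_stopwords` and the simplified pattern `[^a-z]+` (anything that is not a lowercase letter is treated as a symbol) are introduced here for illustration and are not part of the original code.

```python
import re

# Stopwords from Example 2, stored as a set for fast membership tests.
STOPWORDS = {"i", "am", "you", "are", "he", "is", "a", "s"}

def remove_stopwords(sentence, stopwords=STOPWORDS):
    """Lowercase, turn non-letters into spaces, split, and drop stopwords."""
    sentence = sentence.lower()
    # Assumption: every character outside a-z counts as a symbol.
    sentence = re.sub(r"[^a-z]+", " ", sentence)
    # split() with no argument splits on runs of whitespace and drops empty strings.
    return [word for word in sentence.split() if word not in stopwords]

print(remove_stopwords("he's a good student."))  # ['good', 'student']
```

Calling `split()` without an argument is a design choice here: it avoids the empty string that `split(" ")` produces when the sentence ends with a symbol.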

Other easily overlooked stopwords

"s", (he's, she's)
"t", (don't, didn't)
"don",
"didn",
"aren",
"isn",

"re", (you're, they're)
"ll", (I'll, we'll)
"ve" (I've, you've)

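Rather than collecting these fragments by hand, one option is to run a list of contractions through the same symbol-to-space step and look at what comes out. The following is a minimal sketch, assuming a hypothetical helper `contraction_fragments` and a hand-written sample list; neither appears in the original code.

```python
import re

# Same symbol pattern as in the examples above.
SYMBOLS = re.compile(r"[!-\?()' ‘’.,;/:-@[-`{-~]")

def contraction_fragments(contractions):
    """Return the pieces each contraction breaks into once symbols become spaces."""
    fragments = set()
    for word in contractions:
        fragments.update(SYMBOLS.sub(" ", word.lower()).split())
    return fragments

# Hypothetical sample list built from the contractions mentioned above.
contractions = ["he's", "don't", "didn't", "aren't", "isn't", "you're", "I'll", "I've"]
print(sorted(contraction_fragments(contractions)))
# ['aren', 'didn', 'don', 'he', 'i', 'isn', 'll', 're', 's', 't', 've', 'you']
```

Any fragment in the result that is missing from the stopword list is a candidate to add.
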
[RNN] Easily overlooked English stopwords in text and time-series data
