[RNN] Easy-to-overlook English stop words in text and time-series data

Posted at 2018-11-18

Stop words

Words that occur frequently but carry almost no distinctive meaning of their own (the term also refers to excluding such words from processing).

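Caveats when handling stop words

When your code replaces the symbols in a sentence with spaces
→ it is easy to forget to add fragments such as 's' and 't' to the stop words. Replacing the apostrophe with a space splits a contraction such as he's into he and s, so the leftover fragment needs its own stop word entry.

Example 1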


import re

# Entries are lowercase because the sentence is lowercased before matching
stopwords = ["i", "am", "you", "are", "he", "is", "a"]

sentence = "he's a good student."
sentence = sentence.lower()  # convert to lowercase
# Replace ASCII punctuation and curly quotes with spaces (digits are left alone)
sentence = re.sub(r"[!-/:-@\[-`{-~‘’]", " ", sentence)
sentence = sentence.split(" ")  # split into words on single spaces
sentence_words = []
for word in sentence:
    if word in stopwords:  # skip stop words
        continue
    sentence_words.append(word)

print('sentence_words :', sentence_words)

sentence_words : ['s', 'good', 'student', '']

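he was removed as a stop word, but the s left over from the contraction of is was not removed and remains in the output.

Example 2 (Example 1 corrected by adding 's' to the stop words)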


import re

# Same list as Example 1, plus the fragment "s" left over from he's
stopwords = ["i", "am", "you", "are", "he", "is", "a", "s"]

sentence = "he's a good student."
sentence = sentence.lower()  # convert to lowercase
# Replace ASCII punctuation and curly quotes with spaces (digits are left alone)
sentence = re.sub(r"[!-/:-@\[-`{-~‘’]", " ", sentence)
sentence = sentence.split(" ")  # split into words on single spaces
sentence_words = []
for word in sentence:
    if word in stopwords:  # skip stop words
        continue
    sentence_words.append(word)

print('sentence_words :', sentence_words)

sentence_words : ['good', 'student', '']

Now only the words we want to keep remain (apart from the empty string produced by splitting on the trailing space).
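
The trailing '' in both outputs comes from split(" "), which yields an empty string wherever the text ends with a space or two spaces are adjacent. As a small sketch (not part of the original examples), calling split() with no argument splits on any run of whitespace and drops those empty strings:

```python
# The sentence from Example 2 after its symbols were replaced with spaces
text = "he s a good student "

print(text.split(" "))  # ['he', 's', 'a', 'good', 'student', '']
print(text.split())     # ['he', 's', 'a', 'good', 'student']
```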

Other stop words that are easy to overlook

"s", (he's, she's)
"t",  (don't, didn't)
"don",
"didn",
"aren",
"isn",

"re", (you're, they're)
"ll", (I'll, we'll)
"ve" (I've, you've)

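The fragments listed above could be folded into a stop word list roughly as follows. This is a minimal sketch rather than the article's original code: the names BASE_STOPWORDS, CONTRACTION_FRAGMENTS, SYMBOLS, and remove_stopwords are hypothetical, and the pattern is assumed to cover only ASCII punctuation plus the curly quotes ‘ and ’.

```python
import re

# Hypothetical base list, taken from the examples in this article (lowercased)
BASE_STOPWORDS = {"i", "am", "you", "are", "he", "is", "a"}

# Fragments left behind when apostrophes are replaced with spaces
CONTRACTION_FRAGMENTS = {"s", "t", "don", "didn", "aren", "isn", "re", "ll", "ve"}

STOPWORDS = BASE_STOPWORDS | CONTRACTION_FRAGMENTS

# ASCII punctuation ranges plus curly single quotes
SYMBOLS = re.compile(r"[!-/:-@\[-`{-~‘’]")


def remove_stopwords(sentence):
    """Lowercase, replace symbols with spaces, and drop stop words."""
    cleaned = SYMBOLS.sub(" ", sentence.lower())
    # split() with no argument also discards empty strings
    return [word for word in cleaned.split() if word not in STOPWORDS]


print(remove_stopwords("You're right, I didn't know he's a good student."))
# ['right', 'know', 'good', 'student']
```

Keeping the fragments in their own set makes it easy to see which entries exist only because of the symbol-to-space step, and to drop them if the tokenization changes.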