
[For Beginners] Vectorizing Text Data with Bag of Words

Posted at 2021-04-17

Introduction

In this post we will preprocess text data.
We start with a basic technique, Bag of Words (BoW).

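Q. Why is preprocessing needed in the first place?

The answer is simple: when you want a machine learning model to learn from text data, the text cannot be used as is.

Machine learning models are not designed to take raw sentences as input, so the strings have to be converted into numbers.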

Environment

We will run everything on Google Colaboratory.

For the actual code, see this Google Colab notebook.

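Representing English sentences with Bag of Words

What is Bag of Words?

Bag of Words is a technique that represents a sentence by counting how many times each word appears in it. Word order is ignored.

We will represent the following three sentences with Bag of Words.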

This movie is very scary and long
This movie is not scary and slow
This movie is spooky and good
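
Before reaching for a library, the counting idea can be sketched in plain Python. This is only an illustration: lowercasing and splitting on whitespace is an assumption that stands in for a real tokenizer.

```python
from collections import Counter

sentences = [
    "This movie is very scary and long",
    "This movie is not scary and slow",
    "This movie is spooky and good",
]

# Assumption: lowercase + whitespace split is a simplified tokenizer.
counts = [Counter(sentence.lower().split()) for sentence in sentences]

# Show the word counts for each sentence.
for sentence, count in zip(sentences, counts):
    print(sentence, "->", dict(count))
```

A `Counter` returns 0 for a word it has never seen, so `counts[2]["scary"]` is 0 while `counts[0]["scary"]` is 1.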

First, put the three sentences into a pandas DataFrame.

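import pandas as pd

# One row per sentence, stored in a single 'text' column
df = pd.DataFrame({
    'text': [
        'This movie is very scary and long',
        'This movie is not scary and slow',
        'This movie is spooky and good'
    ]
})
df  # displays the DataFrame in a notebook cell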

Next, vectorize the sentences with scikit-learn's CountVectorizer and a regular expression.

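from sklearn.feature_extraction.text import CountVectorizer

# token_pattern keeps every word of one or more characters
vectorizer = CountVectorizer(token_pattern=r'(?u)\b\w+\b')
# Learn the vocabulary and build the sparse count matrix in one step
bag = vectorizer.fit_transform(df['text'])
# Convert the sparse matrix to a dense array to inspect it
bag.toarray()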
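
To see what the custom `token_pattern` changes, the sketch below applies both patterns with Python's `re` module. scikit-learn's documented default is `(?u)\b\w\w+\b`, which drops one-character tokens. The sample sentence is hypothetical and not part of the article's data; every word in the article's three sentences has at least two characters, so both patterns give the same result there.

```python
import re

DEFAULT_PATTERN = r"(?u)\b\w\w+\b"  # scikit-learn's default: 2+ characters
CUSTOM_PATTERN = r"(?u)\b\w+\b"     # pattern used in this article: 1+ characters

# Hypothetical sentence containing one-character words ("a", "I").
sample = "This is a movie I like"

default_tokens = re.findall(DEFAULT_PATTERN, sample.lower())
custom_tokens = re.findall(CUSTOM_PATTERN, sample.lower())

print(default_tokens)  # one-character words are missing
print(custom_tokens)   # one-character words are kept
```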

Finally, check which word corresponds to each index.

print(vectorizer.vocabulary_)
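
`vocabulary_` is a dictionary that maps each term to its column index in the count matrix. The plain-Python sketch below imitates that mapping for the three sentences. It assumes lowercasing, whitespace tokenization, and alphabetical index assignment, which is how CountVectorizer behaves with its default `lowercase=True` and no fixed vocabulary. It is not the library's actual implementation.

```python
sentences = [
    "This movie is very scary and long",
    "This movie is not scary and slow",
    "This movie is spooky and good",
]

# Assumption: lowercase + whitespace split approximates the tokenizer.
tokenized = [sentence.lower().split() for sentence in sentences]

# Collect the unique terms and assign column indices in alphabetical order.
terms = sorted({token for doc in tokenized for token in doc})
vocabulary = {term: index for index, term in enumerate(terms)}

# One row per sentence, one column per term, each cell a word count.
matrix = [[doc.count(term) for term in terms] for doc in tokenized]

print(vocabulary)
for row in matrix:
    print(row)
```

Under these assumptions there are 11 terms: `and` gets index 0 and `very` gets index 10. The dictionary printed by scikit-learn may list its keys in a different order, so sorting the items by index can make it easier to read.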
