Machine Learning: Text Features (CountVectorizer, TfidfVectorizer)

Last updated at 2019-07-23, posted at 2019-03-11

This post briefly describes how to vectorize text features with scikit-learn.

Vectorizing text data

Text data cannot be used as features as-is, so the information about which words appear in the text has to be converted into numbers first.

CountVectorizer

This method uses word counts as the features.
It simply counts how many times each word occurs in each text.
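
To make the counting concrete, here is a minimal pure-Python sketch of the same idea. It is not scikit-learn's implementation; the helper name `count_vectorize` is hypothetical, and the tokenization (lowercasing, keeping tokens of two or more word characters) is only an approximation of CountVectorizer's defaults.

```python
import re
from collections import Counter


def count_vectorize(docs):
    """Return (vocabulary, count matrix) for a list of strings.

    Hypothetical helper for illustration only.
    """
    # Lowercase each text and keep tokens of 2+ word characters.
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
    # The vocabulary is the alphabetically sorted set of all tokens.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One row per text, one column per vocabulary word.
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


docs = ['Apple computer of the apple mark', 'linux computer', 'windows computer']
vocabulary, matrix = count_vectorize(docs)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row is one text and each column is one vocabulary word, so the first row holds a 2 in the 'apple' column because 'Apple' and 'apple' are counted as the same word.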

Implementation code

Vectorize the strings in sample.

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Strings to vectorize
sample = np.array(['Apple computer of the apple mark', 'linux computer', 'windows computer'])

# CountVectorizer
vec_count = CountVectorizer()

# Vectorize: learn the vocabulary, then build the count matrix
vec_count.fit(sample)
X = vec_count.transform(sample)

print('Vocabulary size: {}'.format(len(vec_count.vocabulary_)))
print('Vocabulary content: {}'.format(vec_count.vocabulary_))

Vocabulary size: 7
Vocabulary content: {'apple': 0, 'computer': 1, 'of': 4, 'the': 5, 'mark': 3, 'linux': 2, 'windows': 6}

Let's look at the vectorized result.

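# get_feature_names_out() needs scikit-learn 1.0 or later; older versions use get_feature_names()
pd.DataFrame(X.toarray(), columns=vec_count.get_feature_names_out())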

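Each text has been vectorized by simply counting how many times each word appears.

However, this method has a drawback: words that appear many times end up with disproportionately large vector components.
The following method compensates for this drawback.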

TfidfVectorizer

This method is called TF-IDF (term frequency-inverse document frequency).
It multiplies TF (how often a word appears) by IDF (how rare the word is).

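TF, the frequency of the given word in a document:
$$\mathrm{TF} = \frac{\text{occurrences of the given word in the document}}{\text{occurrences of all words in the document}}$$
IDF, the inverse document frequency (rarity of the given word):
$$\mathrm{IDF} = \log\frac{\text{total number of documents}}{\text{number of documents containing the given word}}$$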

TF and IDF are then multiplied together:

TF-IDF (term frequency-inverse document frequency) = TF * IDF
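
As a worked example of the formulas above, the sketch below computes TF, IDF, and their product by hand. The function names `tf`, `idf`, and `tfidf` are hypothetical, the texts are split on whitespace, and the natural logarithm is an assumption because the formula does not state a base.

```python
import math


def tf(word, tokens):
    # Occurrences of the word divided by the total number of tokens in the text.
    return tokens.count(word) / len(tokens)


def idf(word, tokenized_docs):
    # Assumes the word occurs in at least one text; otherwise this divides by zero.
    containing = sum(1 for tokens in tokenized_docs if word in tokens)
    return math.log(len(tokenized_docs) / containing)


def tfidf(word, tokens, tokenized_docs):
    return tf(word, tokens) * idf(word, tokenized_docs)


docs = ['Apple computer of the apple mark', 'linux computer', 'windows computer']
tokenized = [doc.lower().split() for doc in docs]

# 'apple' is frequent in the first text and absent elsewhere, so it scores high.
print(tfidf('apple', tokenized[0], tokenized))
# 'computer' appears in every text, so its IDF is log(3/3) = 0.
print(tfidf('computer', tokenized[0], tokenized))
```

With these textbook formulas a word found in every text gets a weight of exactly zero, which is how TF-IDF damps words that are merely common.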

- tf-idf (Wikipedia)
https://ja.wikipedia.org/wiki/Tf-idf

Implementation code

Imports

Import the required libraries.

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Strings to vectorize
sample = np.array(['Apple computer of the apple mark', 'linux computer', 'windows computer'])

# TfidfVectorizer
vec_tfidf = TfidfVectorizer()

# Vectorize: learn the vocabulary and build the TF-IDF matrix in one step
X = vec_tfidf.fit_transform(sample)

print('Vocabulary size: {}'.format(len(vec_tfidf.vocabulary_)))
print('Vocabulary content: {}'.format(vec_tfidf.vocabulary_))

Vocabulary size: 7
Vocabulary content: {'apple': 0, 'computer': 1, 'of': 4, 'the': 5, 'mark': 3, 'linux': 2, 'windows': 6}

Let's look at the vectorized result.

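# get_feature_names_out() needs scikit-learn 1.0 or later; older versions use get_feature_names()
pd.DataFrame(X.toarray(), columns=vec_tfidf.get_feature_names_out())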

In text[0], 'computer' gets a weak component of 0.217, because it appears in every text.
In text[2], 'windows' gets a strong component of 0.861, because it appears only in that text.
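
Note that 'computer' is not zero here even though it appears in every text. With its default settings, TfidfVectorizer uses the raw count as TF, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and then scales each row to unit L2 length. The sketch below reproduces that calculation by hand, under the assumption that those defaults are in effect; the helper names are hypothetical and whitespace splitting stands in for the real tokenizer.

```python
import math

docs = ['Apple computer of the apple mark', 'linux computer', 'windows computer']
tokenized = [doc.lower().split() for doc in docs]
vocabulary = sorted({token for tokens in tokenized for token in tokens})
n_docs = len(tokenized)


def smooth_idf(word):
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1, so it never drops below 1.
    df = sum(1 for tokens in tokenized if word in tokens)
    return math.log((1 + n_docs) / (1 + df)) + 1


def tfidf_row(tokens):
    # Raw count times smoothed IDF, then L2-normalize the row.
    raw = [tokens.count(word) * smooth_idf(word) for word in vocabulary]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]


rows = [tfidf_row(tokens) for tokens in tokenized]
print(rows[0][vocabulary.index('computer')])
print(rows[2][vocabulary.index('windows')])
```

The two printed values agree with the figures quoted above to within rounding, which suggests the difference from the textbook formula comes from the smoothing and the normalization.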

That's all. In this post we briefly tried out vectorizing text features with scikit-learn.
