
Computing TF-IDF with scikit-learn

Last updated at 2014-11-04. Posted at 2014-06-24.

This post implements code to compute TF-IDF, which I touched on yesterday. As usual for machine learning tasks, I use scikit-learn.

For a well-known computation like this, it is better to rely on a mature library than to implement it yourself. The aim is to avoid reinventing the wheel and to ensure quality.

As preparation, place the natural-language documents to be processed in the docs directory under your home directory.

import os
import MeCab
from sklearn.feature_extraction.text import TfidfVectorizer

home = os.path.expanduser('~')
target_dir = os.path.join(home, 'docs')
token_dict = {}

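# Create the tagger once; building it on every call is slow
wakati = MeCab.Tagger("-O wakati")

def tokenize(text):
    """ Return the tokens produced by MeCab word segmentation (wakati-gaki) as a list """
    # parse() returns one space-separated string, so split it into tokens
    return wakati.parse(text).split()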

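# Read the files one by one and
# build a dictionary mapping each file name to its lowercased text
for subdir, dirs, files in os.walk(target_dir):
    for file in files:
        file_path = os.path.join(subdir, file)
        # Assumption: the documents are UTF-8 encoded (Python 3 open())
        with open(file_path, 'r', encoding='utf-8') as shakes:
            text = shakes.read()
        lowers = text.lower()
        token_dict[file] = lowers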

# Use scikit-learn's TF-IDF vectorizer
tfidf = TfidfVectorizer(tokenizer=tokenize, stop_words='english')
tfs = tfidf.fit_transform(token_dict.values())

print(token_dict)
print(tfs.toarray())

Once the index terms have been extracted, feeding them into an existing library is all it takes to get TF-IDF.
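To help read the printed array, here is a minimal pure-Python sketch of the weighting that TfidfVectorizer applies with its default settings, as far as I understand them: raw term counts as TF, the smoothed IDF ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. The function name `tfidf_rows` and the sample documents are hypothetical and stand in for the output of tokenize(). It is meant as an illustration, not a replacement for the library.

```python
import math
from collections import Counter


def tfidf_rows(tokenized_docs):
    """Return (vocabulary, rows) of L2-normalized TF-IDF weights.

    Assumes raw counts as TF and the smoothed IDF
    ln((1 + n) / (1 + df)) + 1 (scikit-learn's default settings).
    """
    n_docs = len(tokenized_docs)
    vocabulary = sorted({tok for doc in tokenized_docs for tok in doc})
    # Document frequency: number of documents that contain each term
    df = Counter(tok for doc in tokenized_docs for tok in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}

    rows = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        weights = [counts[t] * idf[t] for t in vocabulary]
        # Scale each document vector to unit length (L2 norm)
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocabulary, rows


# Hypothetical, already-segmented documents (what tokenize() would return)
docs = [
    ["猫", "は", "魚", "が", "好き"],
    ["犬", "は", "肉", "が", "好き"],
]
vocabulary, rows = tfidf_rows(docs)
for name, row in zip(["doc_a", "doc_b"], rows):
    print(name, {t: round(w, 3) for t, w in zip(vocabulary, row) if w > 0})
```

Each row corresponds to one document and each column to one term in the vocabulary. Terms that appear in only one document get a larger weight than terms shared by both, which is the same pattern to look for in the output of tfs.toarray().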

References

TfidfVectorizer
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer

Notes on building features + a quick look at scikit-learn (in Japanese)
http://sucrose.hatenablog.com/entry/2013/04/19/014258
