@RyuseiSato (SAI) posted at 2022-03-02

ValueError: empty vocabulary; perhaps the documents only contain stop words

Q&A

What I want to solve

Following the article below, I wrote a Python script that does text mining on Twitter data.

However, when I ran it, I got the error below, which says the documents may contain only stop words.
How can I fix this?

Problem / error

[ec2-user@ip-172-31-22-104 ~]$ python textmining.py
2022-01-20
Traceback (most recent call last):
  File "textmining.py", line 75, in <module>
    tfidf_matrix = tfidf_vectorizer.fit_transform(target_day_nouns)
  File "/home/ec2-user/.local/lib/python3.8/site-packages/sklearn/feature_extraction/text.py", line 2077, in fit_transform
    X = super().fit_transform(raw_documents)
  File "/home/ec2-user/.local/lib/python3.8/site-packages/sklearn/feature_extraction/text.py", line 1330, in fit_transform
    vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary_)
  File "/home/ec2-user/.local/lib/python3.8/site-packages/sklearn/feature_extraction/text.py", line 1220, in _count_vocab
    raise ValueError(
ValueError: empty vocabulary; perhaps the documents only contain stop words
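
The wording of this message can be misleading: scikit-learn raises it whenever no token at all is found, not only when stop words were filtered out. As a minimal sketch using only the standard library, the check below mimics the default `token_pattern` of `TfidfVectorizer`, `(?u)\b\w\w+\b`, which keeps only tokens of two or more word characters. The helper name `count_tokens` is hypothetical, and the snippet approximates the vectorizer's tokenization rather than reproducing all of its behaviour.

```python
import re

# Same regular expression as scikit-learn's default token_pattern:
# only runs of two or more word characters count as tokens.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def count_tokens(documents):
    """Return the total number of tokens the default pattern would find."""
    return sum(len(TOKEN_PATTERN.findall(doc)) for doc in documents)


# An empty document (for example, when the query returned no rows)
# contains no tokens, so the vocabulary would be empty.
print(count_tokens([""]))

# Space-separated nouns of two or more characters are counted;
# the single-character noun "雪" is dropped by the pattern.
print(count_tokens(["東京 天気 雪"]))
```

If the count for `target_day_nouns` is zero, printing `len(txts)` right after the database call would show whether any tweets were fetched at all.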

Relevant source code

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import MySQLdb
import MeCab
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

### Function to fetch Tweet data from MySQL
def fetch_target_day_n_random_tweets(target_day, n = 2000):
    with MySQLdb.connect(
            host="〇〇〇〇",
            user="〇〇〇〇",
            passwd="〇〇〇〇",
            db="textdata",
            charset="utf8") as conn:
        cursor = conn.cursor()
        SQL = """
        SELECT
          text
        FROM
          tweet
        WHERE
          DATE(raw_created_at + INTERVAL 9 HOUR) = '%s'
        LIMIT %s;
        """ %(target_day, str(n))
        cursor.execute(SQL)
        result = cursor.fetchall()
        l = [x[0] for x in result]
        return l

### Function to split text into words with MeCab (keeps nouns only)
def split_text_only_noun(text):
    tagger = MeCab.Tagger()
    text_str = text.encode('utf-8') # convert, because it misbehaves unless this is a str
    node = tagger.parseToNode(text_str)
    words = []
    while node:
        pos = node.feature.split(",")[0]
        if pos == "名詞":
            # convert back to unicode
            word = node.surface.decode("utf-8")
            words.append(word)
        node = node.__next__
    return " ".join(words)

### Extract the top n characteristic words of the i-th document from the TF-IDF result
def extract_feature_words(terms, tfidfs, i, n):
    tfidf_array = tfidfs[i]
    top_n_idx = tfidf_array.argsort()[-n:][::-1]
    words = [terms[idx] for idx in top_n_idx]
    return words

### Main processing
docs_count = 2000 # number of Tweets to fetch
target_days = [
    "2022-01-20",
]

target_day_nouns = []
for target_day in target_days:
    print (target_day)
    # Fetch data from MySQL
    txts = fetch_target_day_n_random_tweets(target_day, docs_count)
    # Extract nouns only
    each_nouns = [split_text_only_noun(txt) for txt in txts]
    all_nouns = " ".join(each_nouns)
    target_day_nouns.append(all_nouns)
# Compute TF-IDF
# (words that appear on a total of more than 6 days are excluded)
tfidf_vectorizer = TfidfVectorizer(
    use_idf=True,
    lowercase=False,
    max_df=6
)
tfidf_matrix = tfidf_vectorizer.fit_transform(target_day_nouns)

# List of words in index order
terms = tfidf_vectorizer.get_feature_names()
# TF-IDF matrix (as a numpy ndarray)
tfidfs = tfidf_matrix.toarray()

# Print the results
for i in range(0, len(target_days)):
    print ("\n------------------------------------------")
    print (target_days[i])
    for x in  extract_feature_words(terms, tfidfs, i, 10):
        print (x, end=' ')
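
The traceback shows Python 3.8, but `split_text_only_noun` still uses Python 2 idioms: `text.encode('utf-8')` before parsing and `node.surface.decode("utf-8")` afterwards. The short sketch below uses only built-in types to illustrate why these calls would likely fail once the function receives real text. It assumes the MeCab binding in use returns `str` for `node.surface`, which is not confirmed by the post; the sample string is made up.

```python
text = "今日は雪"  # made-up sample text

# In Python 3, str.encode() returns bytes, not str.
encoded = text.encode("utf-8")
print(type(encoded).__name__)

# Python 3 str objects have no decode() method, so calling
# .decode("utf-8") on a str raises AttributeError.
print(hasattr(text, "decode"))

# Only bytes need decoding; a str can be used as-is.
print(encoded.decode("utf-8") == text)
```

Because the error above is raised later, at `fit_transform`, it is possible that this function was never called with any text, which would be consistent with the query returning no rows.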

Tweet data stored in MySQL

mysql> SELECT * FROM tweet LIMIT 10\G
*************************** 1. row ***************************
            id: 1504730362676535298
       user_id: 85540685
          text: #hayırlıcumalar
Günün önerileri.. #işbirliği #sleepy #n11com
raw_created_at: 2022-03-18 17:04:14
*************************** 2. row ***************************
            id: 1504730362680803331
       user_id: 1115576562915528704
          text: RT @1612archives: 220318 #YESEO instagram update

“[
raw_created_at: 2022-03-18 17:04:14
*************************** 3. row ***************************
            id: 1504730362680860690
       user_id: 991587032
          text: RT @KonohaTreize: Si vous avez une fille en or dans les mains les frères faites pas n’importe quoi, c’est la pénurie en ce moment
raw_created_at: 2022-03-18 17:04:14
*************************** 4. row ***************************
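
The rows shown above have `raw_created_at` values of 2022-03-18, while the script queries `2022-01-20`. As a hedged sketch, the snippet below reproduces the `DATE(raw_created_at + INTERVAL 9 HOUR)` conversion from the SQL with Python's `datetime`, assuming `raw_created_at` is stored in UTC as the 9-hour offset suggests. The function name `jst_date` is hypothetical.

```python
from datetime import datetime, timedelta


def jst_date(raw_created_at):
    """Mirror DATE(raw_created_at + INTERVAL 9 HOUR) from the query."""
    stored = datetime.strptime(raw_created_at, "%Y-%m-%d %H:%M:%S")
    return (stored + timedelta(hours=9)).date().isoformat()


day = jst_date("2022-03-18 17:04:14")
print(day)                  # 2022-03-19
print(day == "2022-01-20")  # False
```

If no stored row maps to the target day, the `WHERE` clause matches nothing and `txts` is an empty list, so the single joined document passed to `fit_transform` would be an empty string. Running the script with a day that is known to be present in the table would confirm or rule this out.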
