Generating a word cloud from Twitter

Posted at 2018-07-11

I generated a word cloud as an exercise for learning Python.

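First, the script uses the Twitter API to fetch a given user's tweets, up to 200 of them, excluding retweets (the request in the code also excludes replies).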
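As a rough illustration of what "excluding retweets" means on the response side, the sketch below filters a decoded timeline by hand. It assumes the v1.1 JSON shape, in which a retweet carries a `retweeted_status` object. The helper name `original_tweet_texts` and the sample data are made up for this example; the program in this post relies on the `include_rts` request parameter instead.

```python
def original_tweet_texts(timeline):
    """Return the text of every tweet that is not a retweet.

    `timeline` is the decoded JSON list returned by statuses/user_timeline.
    Assumption: a retweet carries a `retweeted_status` object (v1.1 format).
    """
    return [tweet["text"] for tweet in timeline
            if "retweeted_status" not in tweet]


# Hand-written sample data, not real API output
sample_timeline = [
    {"text": "今日はPythonの勉強をした"},
    {"text": "RT @someone: おもしろい記事",
     "retweeted_status": {"text": "おもしろい記事"}},
    {"text": "wordcloudを試してみる"},
]

texts = original_tweet_texts(sample_timeline)
print(texts)
```

Filtering on the client like this would only matter if retweets were requested in the first place; with `include_rts` set to false the API is expected to leave them out already.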

Next, it runs those tweets through morphological analysis, keeps only the nouns, and hands the words, minus the stop words, to the wordcloud module.
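The noun-extraction step can be pictured with the sketch below. It works on text in the `surface<TAB>feature,feature,...` layout that MeCab prints with an IPADIC-style dictionary, so it runs without MeCab installed. The sample string is hand-written in that layout rather than captured from a real run, and `nouns_without_stop_words` is a hypothetical helper name.

```python
def nouns_without_stop_words(mecab_output, stop_words):
    """Collect noun surfaces from MeCab-style output, skipping stop words."""
    nouns = []
    for line in mecab_output.splitlines():
        # "EOS" and blank lines carry no tab, so they are skipped here
        if "\t" not in line:
            continue
        surface, features = line.split("\t", 1)
        part_of_speech = features.split(",")[0]
        if part_of_speech == "名詞" and surface not in stop_words:
            nouns.append(surface)
    return nouns


# Hand-written sample in IPADIC-style layout (assumption: not real MeCab output)
sample_output = (
    "猫\t名詞,一般,*,*,*,*,猫,ネコ,ネコ\n"
    "が\t助詞,格助詞,一般,*,*,*,が,ガ,ガ\n"
    "こと\t名詞,非自立,一般,*,*,*,こと,コト,コト\n"
    "を\t助詞,格助詞,一般,*,*,*,を,ヲ,ヲ\n"
    "魚\t名詞,一般,*,*,*,*,魚,サカナ,サカナ\n"
    "EOS\n"
)

nouns = nouns_without_stop_words(sample_output, stop_words={"こと"})
print(nouns)
```

Here the stop words are dropped before the words are joined. The program in this post instead passes them to `WordCloud` through its `stopwords` argument; either placement should give a similar picture.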

Finally, it writes the image out with matplotlib, and that is all there is to it.
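For the output step, a minimal sketch is shown below. It counts words with `collections.Counter` and, in a separate function, renders them with `WordCloud.generate_from_frequencies` and matplotlib. This is a variation, not the approach of the program in this post, which passes a space-joined string to `generate`. `count_words` and `save_cloud` are hypothetical names, and the third-party imports sit inside `save_cloud` so that the counting part runs without wordcloud or matplotlib installed.

```python
from collections import Counter


def count_words(words):
    """Map each word to the number of times it occurs."""
    return dict(Counter(words))


def save_cloud(frequencies, out_path, font_path):
    """Render a {word: count} mapping as a word cloud image file.

    Assumption: `font_path` points to a font with Japanese glyphs.
    """
    import matplotlib
    matplotlib.use("Agg")  # draw off-screen; no display needed
    import matplotlib.pyplot as plt
    from wordcloud import WordCloud

    cloud = WordCloud(background_color="white", font_path=font_path,
                      width=900, height=500)
    cloud.generate_from_frequencies(frequencies)

    fig = plt.figure(figsize=(15, 12))
    plt.imshow(cloud.to_array(), interpolation="bilinear")
    plt.axis("off")
    fig.savefig(out_path, bbox_inches="tight")
    plt.close(fig)


frequencies = count_words(["猫", "魚", "猫", "Python", "猫"])
print(frequencies)
```

Calling `save_cloud(frequencies, "cloud.png", font_path)` with a suitable font would write the image. `WordCloud.to_file` is another way to save the result when the matplotlib figure is not needed.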

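The program writes the timeline to a text file as soon as it is fetched. That is only because I started from a copy-pasted program whose job was to fetch a timeline and write it out as text.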

Program

import json
import re
import sys

import matplotlib.pyplot as plt
import MeCab
from requests_oauthlib import OAuth1Session
from wordcloud import WordCloud

import twitter_config  # local module that holds the four API credentials


# Fetch the user's timeline and write one tweet per line to a text file.
# Returns True on success, False if the API request failed.
def get_timeline(usr_name):
    CK = twitter_config.CONSUMER_KEY
    CS = twitter_config.CONSUMER_SECRET
    AT = twitter_config.ACCESS_TOKEN
    ATS = twitter_config.ACCESS_TOKEN_SECRET
    twitter = OAuth1Session(CK, CS, AT, ATS)

    url = "https://api.twitter.com/1.1/statuses/user_timeline.json"

    # count is applied before replies and retweets are filtered out,
    # so fewer than 200 tweets may come back
    params = {'screen_name': usr_name, 'count': 200,
              'exclude_replies': True, 'include_rts': False}

    req = twitter.get(url, params=params)

    if req.status_code != 200:
        print("ERROR: %d" % req.status_code)
        return False

    search_timeline = json.loads(req.text)
    with open(usr_name + '_tweet.txt', 'w', encoding='utf-8') as f:
        for tweet in search_timeline:
            f.write(tweet['text'] + '\n')
    return True


# Clean up the text: strip URLs
def format_text(text):
    return re.sub(r"(https?|ftp)(:\/\/[-_\.!~*\'()a-zA-Z0-9;\/?:\@&=\+\$,%#]+)", "", text)


# Extract the nouns
def pic_noun(texts):
    t = MeCab.Tagger()
    words = []
    for chunk in t.parse(texts).splitlines():
        # skip "EOS" and any other line that is not "surface<TAB>features"
        if '\t' not in chunk:
            continue
        surface, feature = chunk.split('\t', 1)
        if feature.split(',')[0] == '名詞':
            words.append(surface)
    return words


# Build the word cloud from the saved tweets and save it as a PNG
def word_cloud(usr_name):
    file_name = usr_name + '_tweet.txt'
    with open(file_name, 'r', encoding='utf-8') as f:
        texts = format_text(f.read())

    # one stop word per line; strip() keeps the last word intact
    # even when the file has no trailing newline
    with open('stop_words.txt', 'r', encoding='utf-8') as f:
        stop_list = [line.strip() for line in f if line.strip()]

    words = pic_noun(texts)
    text = ' '.join(words)

    # WordCloud settings (the font must contain Japanese glyphs)
    fpath = '/usr/share/fonts/truetype/fonts-japanese-gothic.ttf'
    wordcloud = WordCloud(background_color='white',
                          font_path=fpath, width=900, height=500,
                          stopwords=set(stop_list)).generate(text)

    plt.figure(figsize=(15, 12))
    plt.imshow(wordcloud)
    plt.axis('off')
    plt.savefig(usr_name + '.png')


if __name__ == '__main__':
    arg = sys.argv
    if len(arg) == 1:
        print('No user name given')
        sys.exit(1)

    # stop here if the fetch failed, so no missing or stale file is read
    if not get_timeline(arg[1]):
        sys.exit(1)
    word_cloud(arg[1])
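
The URL-stripping step in `format_text` can be tried in isolation with the sketch below. It uses a slightly trimmed version of the same pattern; `strip_urls` is a hypothetical name and the sample sentence is made up. Restricting the match to ASCII URL characters matters for Japanese text, where a URL is often followed by Japanese characters with no space in between.

```python
import re

# Scheme, "://", then a run of ASCII characters that may appear in a URL
URL_PATTERN = re.compile(
    r"(?:https?|ftp)://[-_.!~*'()a-zA-Z0-9;/?:@&=+$,%#]+"
)


def strip_urls(text):
    """Remove every URL matched by URL_PATTERN from the text."""
    return URL_PATTERN.sub("", text)


cleaned = strip_urls("詳細はhttps://example.com/a?b=1です")
print(cleaned)
```

A simpler pattern such as `https?://\S+` would also swallow the Japanese characters that directly follow the URL, which is one reason to keep the explicit character class.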

Example of a generated image

