More than 1 year has passed since last update.

X APIを使ったスクレイピングと解析の方法

Last updated at 2023-12-31Posted at 2023-12-31

はじめに

「X」は、リアルタイムの情報を共有し、トレンドやトピックについての洞察を提供するための重要なプラットフォームです。この記事では、Pythonを使用してX APIを活用し、特定のキーワードに関連するポストデータを収集し、そのデータを詳細に解析および可視化する方法について詳しく説明します。具体的には、2023年の漢字である「税」というキーワードに関連するデータを取得し、そのデータを分析し、頻出単語を抽出し、単語の共起ネットワークを可視化します。

必要なライブラリのインストール

まず、必要なライブラリをインストールしましょう。以下のコマンドを使用して、必要なライブラリをインストールします。

!pip install tweepy
!pip install japanize_matplotlib

X APIのセットアップ

XAPIを使用するために、APIキーとアクセストークンをセットアップします。これらの認証情報を使用して、Xからデータを収集できます。

API_KEY = "API_KEY"
API_KEY_SECRET = "API_KEY_SECRET"
ACCESS_TOKEN = "ACCESS_TOKEN"
ACCESS_TOKEN_SECRET = "ACCESS_TOKEN_SECRET"
BEARER_TOKEN = "BEARER_TOKEN"
SEARCH_KEYWORD = "税"  # 取得したいキーワードを設定
MAX_RESULTS = 100  # 収集する最大ポスト数を設定

ポストデータの収集

X APIを使用して、指定されたキーワードに関連する最大100件のポストデータを収集します。以下のステップを実行します。
(1).X APIへのリクエストを送信し、データを取得します。
(2).レスポンスデータをJSON形式に変換します。
(3).ポストデータをPandasのDataFrameに変換します。

import requests

URL = f"https://api.twitter.com/2/tweets/search/recent?query={SEARCH_KEYWORD}&tweet.fields=public_metrics,author_id&max_results={MAX_RESULTS}"
HEADERS = {
    "Authorization": f"Bearer {BEARER_TOKEN}"
}

def get_tweets_with_keyword():
    response = requests.get(URL, headers=HEADERS)
    if response.status_code != 200:
        raise Exception(f"Request returned {response.status_code}: {response.text}")

    response_data = response.json()
    tweets = response_data['data']

    # ポストデータをPandasのDataFrameに変換
    df = convert_to_dataframe(tweets)

    return df

データ解析

データの収集に成功したら、以下のステップで簡単な分析を行なってみましょう！

1.テキストデータから単語を抽出

ポストのテキストデータから日本語の単語を抽出し、頻出単語をカウントします。

import pandas as pd
import re
from collections import Counter

text_data = df['text']

# 日本語の文字と英語以外の文字を含む単語を抽出
words = []
for text in text_data:
    japanese_words = re.findall(r'[\u3040-\u30FF\u3400-\u4DBF\u4E00-\u9FFF\U00020000-\U0002EBEF]+', text)
    words.extend(japanese_words)

# 必要のない単語をリストに入れて除外
names = ["日"] 
words = [word for word in words if word not in names]

# 頻出単語のカウント
word_counts = Counter(words)

2.頻出単語を可視化

頻出単語のトップ30を可視化して表示してみます。

import matplotlib.pyplot as plt


# 最も頻出する単語のトップ30
most_common_words = word_counts.most_common(30)

words, counts = zip(*most_common_words)

plt.figure(figsize=(12, 8))
plt.barh(words, counts, color='skyblue')
plt.xlabel('頻出数')
plt.ylabel('単語')
plt.title('頻出する単語のトップ30')
plt.gca().invert_yaxis()
plt.show()

3.単語の共起ネットワークを可視化

ポスト内の単語の共起ネットワークを作成し、可視化します。

import networkx as nx
from itertools import combinations
from collections import defaultdict

# 共起ネットワークのエッジを計算する関数
def tokenize_text(text):
    cleaned_text = re.sub(r'#\w+|http\S+|[^\w\s]|(?i)[a-z]', '', str(text))
    return cleaned_text.lower().split()

def compute_cooccurrence(text):
    tokens = tokenize_text(text)
    return list(combinations(tokens, 4))

# 共起ネットワークのエッジの計算
edges = df['text'].apply(compute_cooccurrence).explode().dropna()

edge_freq = defaultdict(int)
for edge in edges:
    edge_freq[tuple(sorted(edge))] += 1

# エッジの頻度が１以上のものだけをネットワークに追加
G = nx.Graph()
for edge, freq in edge_freq.items():
    if freq >= 1:
        G.add_edge(edge[0], edge[1], weight=freq)

フォントや出力サイズなどを整えて可視化

plt.figure(figsize=(50, 50))
pos = nx.spring_layout(G, k=0.4)
nx.draw(G, pos, with_labels=True, node_color='skyblue', node_size=1000, font_size=25, font_weight='bold', font_family="IPAexGothic", width=[G[u][v]['weight']*0.1 for u,v in G.edges()])
plt.title("単語の共起ネットワーク")
plt.show()

このように単語の共起ネットワークを可視化することで、ポスト内の単語同士の関連性や頻出の組み合わせを視覚的に把握できます。

最後に

この記事を通じて、Xに関するデータの収集と簡単な分析を紹介しました。この手法を活用して、マーケティングや他の分野でデータ駆動の洞察を得る手助けになれば幸いです。是非、自身のプロジェクトなどに適用してみてください！

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up