
[Memo] Searching for and retrieving X posts that contain a given text (with hashtags, etc.)

Last updated at 2025-01-12, posted at 2024-09-14

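Miscellaneous notes

A note on billing timing for the Basic plan

The plan was listed at 100 USD per month, but signing up did not immediately charge a full month (as reported in September 2024).

First, one week was charged (prorated).
One week later, a full month was charged.

Keep this in mind if you schedule accounting procedures around your credit card's closing date.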

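Prerequisites

Create an App.

The credentials needed for authentication can be created and retrieved on the page reached via the "Keys and tokens" link (key icon) of the individual app under "Apps".

API Key and Secret (under Consumer Keys)
Bearer Token (under Authentication Tokens)
Access Token and Secret (under Authentication Tokens)

Note: User authentication settings probably do not need to be configured if you only retrieve data.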

Notes on Search (Basic plan)

From https://developer.x.com/en/portal/products/basic

Rate limits for
GET /2/tweets/search/recent

60 requests / 15 mins PER USER
60 requests / 15 mins PER APP

Note: because of these limits, take care when you want to send many different queries in one go.
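To make the limit concrete: 60 requests per 15 minutes averages out to one request every 15 seconds. The following is a minimal sketch of spacing out several queries under that assumption. `min_interval_seconds`, `run_throttled`, and the `search_fn` callable are hypothetical names used for illustration and are not part of tweepy.

```python
import time


def min_interval_seconds(max_requests=60, window_seconds=15 * 60):
    """Average gap between requests that stays within the rate limit."""
    return window_seconds / max_requests


def run_throttled(queries, search_fn, interval=None, sleep_fn=time.sleep):
    """Call search_fn once per query, pausing between consecutive calls.

    search_fn is a hypothetical callable that takes a query string;
    sleep_fn is injectable so the pause can be replaced in tests.
    """
    if interval is None:
        interval = min_interval_seconds()
    results = {}
    for i, query in enumerate(queries):
        if i > 0:
            sleep_fn(interval)  # wait before every call except the first
        results[query] = search_fn(query)
    return results
```

As far as I know, `tweepy.Client` also accepts `wait_on_rate_limit=True`, which makes the client wait for the limit to reset instead of raising an error. That may be enough on its own when steady pacing is not required.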

Other information

10 default results per response
100 results per response
512 query length
core operators

max_results must be between 10 and 100 inclusive.
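These limits can be checked before a request is sent. The sketch below is only an illustration: `check_search_params` is a hypothetical helper, and it assumes the 512 limit is counted in characters, which the plan page quoted above does not spell out.

```python
MAX_QUERY_LENGTH = 512  # "512 query length" (assumed to mean characters)
MIN_RESULTS, MAX_RESULTS = 10, 100


def check_search_params(query, max_results):
    """Reject over-long queries and clamp max_results into 10..100."""
    if len(query) > MAX_QUERY_LENGTH:
        raise ValueError(
            f"query is {len(query)} characters; the limit is {MAX_QUERY_LENGTH}"
        )
    return max(MIN_RESULTS, min(MAX_RESULTS, max_results))
```

Clamping rather than raising for `max_results` is a design choice. Raising an error would be equally reasonable if silent adjustment is undesirable.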

Functions

This memo uses tweepy, so install it beforehand.
Import the libraries you need.

from time import sleep

import tweepy

Here, the credentials needed for authentication and the fields to request are kept in a separate file.

import a_api_constant

API_Key = a_api_constant.API_KEY
API_Sec = a_api_constant.API_SEC
Token = a_api_constant.TOKEN
Token_Sec = a_api_constant.TOKEN_SEC
BEARER = a_api_constant.BEARER_TOKEN

TWEET_FIELDS_DEFAULT_SEARCHHT = a_api_constant.TWEET_FIELDS_DEFAULT
EXPANSIONS_DEFAULT_SEARCHHT = a_api_constant.EXPANSIONS_DEFAULT

An example of the default fields.


TWEET_FIELDS_DEFAULT=[
    "created_at",
    "lang",
    "entities",
    "attachments",
    "conversation_id",
    "geo",
    "edit_history_tweet_ids",
    "public_metrics",
]
EXPANSIONS_DEFAULT = ["attachments.media_keys", "author_id", "in_reply_to_user_id"]
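The credential part of the separate file is not shown in this memo. One possible layout, as a sketch only, reads the values from environment variables so that secrets stay out of the source. The constant names match the ones imported above, while the environment variable names (`X_API_KEY` and so on) and the `missing_credentials` helper are assumptions made for this example.

```python
# a_api_constant.py (credential part only; a sketch, not the author's file)
import os

# Environment variable names are hypothetical; use whatever your setup defines.
API_KEY = os.environ.get("X_API_KEY", "")
API_SEC = os.environ.get("X_API_SECRET", "")
TOKEN = os.environ.get("X_ACCESS_TOKEN", "")
TOKEN_SEC = os.environ.get("X_ACCESS_TOKEN_SECRET", "")
BEARER_TOKEN = os.environ.get("X_BEARER_TOKEN", "")


def missing_credentials():
    """Return the names of constants that are still empty."""
    values = {
        "API_KEY": API_KEY,
        "API_SEC": API_SEC,
        "TOKEN": TOKEN,
        "TOKEN_SEC": TOKEN_SEC,
        "BEARER_TOKEN": BEARER_TOKEN,
    }
    return [name for name, value in values.items() if not value]
```

Calling `missing_credentials()` at startup gives an early, readable hint when a variable has not been set.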

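The main functions.

Retrieve posts that contain the specified text, have hashtags, and are not reposts (RTs); a language can optionally be specified.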

# max_results: between 10 and 100.
# filter_lang: 'ja', 'en', etc.
def api_search_wtHTwoRT(search_text='(#abc)', filter_lang=None, max_results=50):
    
    client = tweepy.Client(
        bearer_token=BEARER,
        consumer_key=API_Key,
        consumer_secret=API_Sec,
        access_token=Token,
        access_token_secret=Token_Sec,
    )

    results = []
    search_query = search_text + ' -is:retweet has:hashtags'
    if filter_lang is not None:
        search_query = search_query + " lang:" + filter_lang
    # Print the final query, including the optional language filter.
    print(search_query)

    response = client.search_recent_tweets(
        query=search_query,
        tweet_fields=TWEET_FIELDS_DEFAULT_SEARCHHT,
        expansions=EXPANSIONS_DEFAULT_SEARCHHT,
        max_results=max_results,
    )

    tweets = response.data
    
    sleep(1)

    if tweets is not None:

        for tweet in tweets:
            obj = {}
            obj["tweet_id"] = tweet.id
            created_at = tweet.created_at
            created_at_str = created_at.isoformat()
            obj["created_at"] = created_at_str
            obj["text"] = tweet.text
            obj["author_id"] = tweet.author_id
            obj["conversation_id"] = tweet.conversation_id
            if tweet.in_reply_to_user_id is not None:
                obj["in_reply_to_user_id"] = tweet.in_reply_to_user_id
            if tweet.attachments is not None:
                obj["attachments"] = tweet.attachments
            if tweet.entities is not None:
                obj["entities"] = tweet.entities
            if tweet.geo is not None:
                obj["geo"] = tweet.geo

            results.append(obj)
    else:
        print('NOT FOUND')

    return results
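The query string in the function above is built by appending operators to the search text. The same assembly can be written as a small pure function, which makes it easy to check without calling the API. `build_query` and its `require_hashtags` and `exclude_retweets` flags are hypothetical names for this sketch. The operators themselves (`-is:retweet`, `has:hashtags`, `lang:`) are the ones used above.

```python
def build_query(search_text, filter_lang=None,
                require_hashtags=True, exclude_retweets=True):
    """Assemble a recent-search query from the text and optional filters."""
    parts = [search_text]
    if exclude_retweets:
        parts.append("-is:retweet")
    if require_hashtags:
        parts.append("has:hashtags")
    if filter_lang is not None:
        parts.append("lang:" + filter_lang)
    return " ".join(parts)


print(build_query("(#abc)", filter_lang="ja"))
```

Passing `require_hashtags=False` drops the `has:hashtags` operator, so one helper can serve both search variants.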

A version without the hashtag requirement.

def api_search_woRT(search_text='"abc"', filter_lang=None, max_results=50):
    
    client = tweepy.Client(
        bearer_token=BEARER,
        consumer_key=API_Key,
        consumer_secret=API_Sec,
        access_token=Token,
        access_token_secret=Token_Sec,
    )

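    # Initialize the list that collects the converted tweets.
    results = []
    search_query = search_text + ' -is:retweet'
    if filter_lang is not None:
        search_query = search_query + " lang:" + filter_lang
    # Print the final query, including the optional language filter.
    print(search_query)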

    response = client.search_recent_tweets(
        query=search_query,
        tweet_fields=TWEET_FIELDS_DEFAULT_SEARCHHT,
        expansions=EXPANSIONS_DEFAULT_SEARCHHT,
        max_results=max_results,
    )

    tweets = response.data

    sleep(1)

    if tweets is not None:

        for tweet in tweets:
            obj = {}
            obj["tweet_id"] = tweet.id
            created_at = tweet.created_at
            created_at_str = created_at.isoformat()
            obj["created_at"] = created_at_str
            obj["text"] = tweet.text
            obj["author_id"] = tweet.author_id
            obj["conversation_id"] = tweet.conversation_id
            if tweet.in_reply_to_user_id is not None:
                obj["in_reply_to_user_id"] = tweet.in_reply_to_user_id
            if tweet.attachments is not None:
                obj["attachments"] = tweet.attachments
            if tweet.entities is not None:
                obj["entities"] = tweet.entities
            if tweet.geo is not None:
                obj["geo"] = tweet.geo

            results.append(obj)
    else:
        print('NOT FOUND')

    return results
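Both functions convert each tweet to a dict in the same way, so the conversion could live in one shared helper. The sketch below assumes only that the tweet object exposes the attributes used above. `tweet_to_dict` and `OPTIONAL_FIELDS` are hypothetical names, and the `SimpleNamespace` object is a stand-in for a real tweet so that the example runs without API access.

```python
from datetime import datetime, timezone
from types import SimpleNamespace

# Fields that are copied only when they are present on the tweet.
OPTIONAL_FIELDS = ("in_reply_to_user_id", "attachments", "entities", "geo")


def tweet_to_dict(tweet):
    """Convert one tweet-like object into a plain dict."""
    obj = {
        "tweet_id": tweet.id,
        "created_at": tweet.created_at.isoformat(),
        "text": tweet.text,
        "author_id": tweet.author_id,
        "conversation_id": tweet.conversation_id,
    }
    for name in OPTIONAL_FIELDS:
        value = getattr(tweet, name, None)
        if value is not None:
            obj[name] = value
    return obj


# Stand-in for a tweet object (not real API output).
fake_tweet = SimpleNamespace(
    id=1,
    created_at=datetime(2024, 9, 14, tzinfo=timezone.utc),
    text="hello #abc",
    author_id=10,
    conversation_id=1,
    in_reply_to_user_id=None,
    attachments=None,
    entities={"hashtags": [{"tag": "abc"}]},
    geo=None,
)
row = tweet_to_dict(fake_tweet)
```

With such a helper, the loop body in each search function reduces to `results.append(tweet_to_dict(tweet))`.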

Addendum: to also retrieve public_metrics, add the following inside the loop.

            if tweet.public_metrics is not None:
                public_metrics_dict = tweet.public_metrics
                ordered_keys = ["retweet_count", "like_count", "reply_count", "quote_count"]
                for key in ordered_keys:
                    if key in public_metrics_dict:
                        obj[key] = public_metrics_dict[key]

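Addendum

Notes for saving the results as CSV and similar.

Long IDs may be treated as floats and saved in exponential notation (with an E).
It is worth guarding against this.

in_reply_to_user_id is not present on every tweet, so NaN values should also be prevented.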

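    # Assumes `import pandas as pd` and that df is a DataFrame built from the results.
    if "in_reply_to_user_id" in df.columns:
        # Format numeric IDs as plain digits instead of exponential notation.
        df["in_reply_to_user_id"] = df["in_reply_to_user_id"].apply(lambda x: f"{x:.0f}" if isinstance(x, (int, float)) and x != "" else x)
        # Replace missing values (NaN) with an empty string.
        df["in_reply_to_user_id"] = df["in_reply_to_user_id"].apply(lambda x: "" if (pd.isna(x) or x=="nan") else x)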
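One caveat is that a 64-bit float represents integers exactly only up to 2**53, so an ID that has already become a float may have lost its last digits before it is formatted. A complementary approach, sketched here with a hypothetical `id_to_str` helper, is to turn IDs into strings while they are still integers, for example when each result dict is built.

```python
import math


def id_to_str(value):
    """Return an ID as a plain decimal string, or "" when it is missing."""
    if value is None:
        return ""
    if isinstance(value, float):
        if math.isnan(value):
            return ""
        # Already a float: avoid exponential notation (precision may be lost).
        return f"{value:.0f}"
    if isinstance(value, int):
        return str(value)
    text = str(value)
    return "" if text == "nan" else text
```

For instance, storing `id_to_str(tweet.in_reply_to_user_id)` in the result dict keeps the column as text from the start, so no float conversion happens when the data is loaded into a DataFrame.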
