
[Python] Web Scraping Twitter

Last updated at 2018-05-18, posted at 2018-05-17

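Introduction

This is what scraping Twitter with Python might look like in code.

Note: scraping Twitter without the company's prior consent is explicitly prohibited, so please be careful!
Details here:
⇒ Precautions for web scraping

I hope this serves as reference material for scraping pages that load more content as you scroll.
The sources I consulted while writing this code are listed at the bottom of the page.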
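
As a minimal sketch of the general idea behind scraping a scroll-to-load page: each response carries a cursor (the source below reads it from `min_position`), and the next request sends that cursor back until none is returned. The `iter_pages` helper, the `fake_fetch` function and the `FAKE_PAGES` data are hypothetical stand-ins for real HTTP responses, not part of the original code.

```python
from typing import Callable, Iterator, Optional


def iter_pages(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield pages until the server stops returning a cursor."""
    cursor = None  # the first request is sent without a cursor
    while True:
        page = fetch_page(cursor)
        yield page
        cursor = page.get("min_position")
        if not cursor:
            break  # no cursor means there is nothing more to load


# Hypothetical in-memory "server" standing in for real HTTP responses.
FAKE_PAGES = {
    None: {"items": ["a", "b"], "min_position": "p1"},
    "p1": {"items": ["c"], "min_position": "p2"},
    "p2": {"items": ["d"], "min_position": None},
}


def fake_fetch(cursor: Optional[str]) -> dict:
    return FAKE_PAGES[cursor]


collected = [item for page in iter_pages(fake_fetch) for item in page["items"]]
print(collected)
```

A loop is used here instead of recursion only to keep the sketch short; the class below follows the same cursor by calling itself.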

Environment

Python3

Libraries used

HTTP requests: Requests
Scraping: BeautifulSoup4

Source

TweetCollector.py

# coding: UTF-8
import requests
from bs4 import BeautifulSoup
import csv
import time
from datetime import datetime

#
# Collects the tweets of a given Twitter user name.
#  - collectTweet stores the tweet data on the instance
#  - writeCSV saves the stored tweet data as a CSV file
#
class TweetCollector:
	# URL template for fetching a user's timeline
	__TWITTER_URL = (
		"https://twitter.com/i/profiles/show/"
		"%(user_name)s/timeline/tweets?include_available_features=1&include_entities=1"
		"%(max_position)s&reset_error_state=false"
	)
	
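	__user_name = ""	# Twitter user name to collect from
	__tweet_data = None	# Collected rows, one list per tweet (created per instance)
	
	#
	# Constructor
	#
	def __init__(self, user_name):
		self.__user_name = user_name
		# Create the list here so that instances do not share one class-level list
		self.__tweet_data = []
		
		# Header row
		row = [
			"Tweet ID",
			"Name",
			"User name",
			"Posted at",
			"Text",
			"Replies",
			"Retweets",
			"Likes"
		]
		self.__tweet_data.append(row)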
	
	#
	# Start collecting tweets
	#
	def collectTweet(self):
		self.nextTweet(0)
	
	#
	# Collect the next batch of tweets starting from the given position
	#
	def nextTweet(self, max_position):
		# A max_position of 0 means the first request, which sends no position parameter
		if max_position == 0:
			param_position = ""
		else:
			param_position = "&max_position=" + str(max_position)
		
		# Fill the parameters into the Twitter URL
		url = self.__TWITTER_URL % {
			'user_name': self.__user_name,
			'max_position': param_position
		}

		# Fetch the page, then parse the returned HTML fragment into tweet rows
		res = requests.get(url)
		data = res.json()	# parse the JSON body once and reuse it
		soup = BeautifulSoup(data["items_html"], "html.parser")
		
		items = soup.select(".js-stream-item")

		for item in items:
			# Each tweet has several count elements; look them up once
			counts = item.select(".ProfileTweet-actionCountForPresentation")
			row = []
			row.append(item.get("data-item-id")) # tweet ID
			# The encode/decode round trip drops characters that cp932 cannot represent
			row.append(item.select_one(".fullname").get_text().encode("cp932", "ignore").decode("cp932")) # name
			row.append(item.select_one(".username").get_text()) # user name
			row.append(item.select_one("._timestamp").get_text()) # posted at
			row.append(item.select_one(".js-tweet-text-container").get_text().strip().encode("cp932", "ignore").decode("cp932")) # text
			row.append(counts[0].get_text()) # reply count
			row.append(counts[1].get_text()) # retweet count
			row.append(counts[3].get_text()) # like count

			self.__tweet_data.append(row)

		print(str(max_position) + " fetched")
		time.sleep(2) # pause so as not to put load on the server
		
		# Recurse while there are more tweets
		if data["min_position"] is not None:
			self.nextTweet(data["min_position"])
	
	#
	# Write the collected tweets to a CSV file
	#
	def writeCSV(self):
		today = datetime.now().strftime("%Y%m%d%H%M")
		with open(self.__user_name + "-" + today + ".csv", "w") as f:
			writer = csv.writer(f, lineterminator='\n')
			writer.writerows(self.__tweet_data)


# Run
twc = TweetCollector("*TWITTER_NAME*") # specify the Twitter user name
twc.collectTweet()
twc.writeCSV()
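
The `.encode("cp932", "ignore").decode("cp932")` round trip in the source can be illustrated on its own. The following is a rough sketch, assuming the goal is a CSV whose text contains only characters representable in cp932 (the Japanese Windows code page); the `to_cp932_safe` helper and the sample string are hypothetical and not part of the original class.

```python
import csv
import io


def to_cp932_safe(text: str) -> str:
    """Drop characters that cp932 cannot represent (emoji, for example)."""
    return text.encode("cp932", "ignore").decode("cp932")


# Hypothetical tweet text containing an emoji (U+1F600).
sample = "今日は晴れ\U0001F600 sunny"
cleaned = to_cp932_safe(sample)
print(cleaned)

# Write to an in-memory buffer instead of a file to keep the sketch self-contained.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Tweet ID", "Text"])
writer.writerow(["1", cleaned])
print(buffer.getvalue())
```

Characters that cannot be encoded are removed silently, so the saved text can differ from the original tweet.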

References

■ Scraping pages that only load content when scrolled
Scraping the Twitter timeline

■ Laws relevant to scraping
[Scraping and the Law] A lawyer explains what is legally OK and what is not when scraping
