Crawling the URLs contained in a Twitter account's tweets with Python

Last updated at 2016-06-30, posted at 2016-06-19

Overview

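Get the URLs contained in the tweets of a specified Twitter account
Scrape specified information from the page at each URL with a specific string appended
Save the scraped information to PostgreSQL
Stop crawling on reaching data that has already been saved
Uses BeautifulSoup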
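
The first two steps can be sketched without touching the network. This is a minimal illustration, assuming each tweet's entities look like the `{'urls': [{'expanded_url': ...}]}` structure that crawl.py reads. The helper name `build_crawl_urls`, the sample URLs, and the `/detail` suffix are hypothetical and not part of the original script.

```python
def build_crawl_urls(tweets_entities, suffix):
    """Return one crawl URL per tweet that carries at least one URL."""
    crawl_urls = []
    for entities in tweets_entities:
        urls = entities.get("urls", [])
        if not urls:
            # Tweets without a link have nothing to crawl
            continue
        # Use the first URL of the tweet and append the fixed suffix
        crawl_urls.append(urls[0]["expanded_url"] + suffix)
    return crawl_urls


# Hypothetical sample data shaped like tweet.entities
sample = [
    {"urls": [{"expanded_url": "http://example.com/items/98"}]},
    {"urls": []},
    {"urls": [{"expanded_url": "http://example.com/items/97"}]},
]
print(build_crawl_urls(sample, "/detail"))
```

Skipping tweets with an empty `urls` list avoids an `IndexError` when an account posts a tweet without a link.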

Details

crawl.py

# -*- coding: utf-8 -*-
try:
    # Python 3
    from urllib import request
except ImportError:
    # Python 2
    import urllib2 as request

from bs4 import BeautifulSoup

import twpy
import time

# PostgreSQL connection
import psycopg2

def main():
    # Load the tweets
    api = twpy.api
    tweets = api.user_timeline(screen_name = "screen name of the target Twitter account")

    connector = psycopg2.connect(host="hoge",port=5432,dbname="hogehoge",user="hoge",password="hoge")
    max_hoge_id_fetcher = connector.cursor()
    cursor = connector.cursor()

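    max_hoge_id_fetcher.execute('select MAX(hoge_id) from hoge')

    # Get the newest hoge_id already saved in the DB
    # (MAX() returns NULL on an empty table, so fall back to 0)
    for row in max_hoge_id_fetcher:
        max_hoge_id = row[0] if row[0] is not None else 0
        print("Latest saved ID is " + str(max_hoge_id))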

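    # Read the tweets one at a time and crawl the URL in each
    for tweet in tweets:
        text = tweet.text
        urls = tweet.entities['urls']
        # Skip tweets that contain no URL
        if not urls:
            continue
        expanded_url = urls[0]['expanded_url']

        # Here the crawl target is the URL with a specific string appended
        crawl_url = expanded_url + "hogehoge"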
        response = request.urlopen(crawl_url)

        # Read the response and store it in body
        body = response.read()

        # Parse the HTML and put the result in soup
        soup = BeautifulSoup(body,'html.parser')

        hoge_id = soup.find('id').text

        print("Starting " + str(hoge_id))

        # Once the newest saved hoge_id is reached, stop crawling.
        if int(hoge_id) <= max_hoge_id:
            print('This data has already been saved.')
            break

        description = soup.find('description').text

        # (omitted)

        # Insert the data (the remaining columns are elided in the original)
        cursor.execute('insert into hoge(hoge_id,description) values(%s,%s)',(hoge_id,description))
        print("inserted!")

        # Sleep for 3 seconds between requests
        time.sleep(3)

    # Commit the changes
    connector.commit()

    cursor.close()
    connector.close()


if __name__ == '__main__':
    main()
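
The "stop at the newest saved ID" logic in crawl.py can be tried in isolation. The sketch below swaps PostgreSQL for the standard library's in-memory `sqlite3` so it runs without a server. That substitution is an assumption for illustration only, and `sqlite3` uses `?` placeholders where psycopg2 uses `%s`. The helper name `save_new_items` and the sample rows are hypothetical.

```python
import sqlite3


def save_new_items(conn, items):
    """Insert items (newest first) until an already saved id is reached."""
    cur = conn.cursor()
    cur.execute("select max(hoge_id) from hoge")
    # max() yields NULL (None) on an empty table, so fall back to 0
    max_id = cur.fetchone()[0] or 0
    inserted = []
    for hoge_id, description in items:
        if hoge_id <= max_id:
            # Everything from here on is assumed to be saved already
            break
        cur.execute(
            "insert into hoge (hoge_id, description) values (?, ?)",
            (hoge_id, description),
        )
        inserted.append(hoge_id)
    conn.commit()
    return inserted


# In-memory stand-in for the hoge table, with id 92 already saved
conn = sqlite3.connect(":memory:")
conn.execute("create table hoge (hoge_id integer primary key, description text)")
conn.execute("insert into hoge values (92, 'already saved')")

# Items arrive newest first, as a timeline does
items = [(i, "item %d" % i) for i in range(98, 90, -1)]
print(save_new_items(conn, items))
```

The early `break` is only safe because the timeline is read newest first. If the order were not guaranteed, checking each ID individually would be the safer choice.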

twpy.py

#!/usr/bin/env python
# -*- coding:utf-8 -*-

# Import the Tweepy library
import tweepy

# Set the keys
CONSUMER_KEY = 'hoge'
CONSUMER_SECRET = 'hoge'
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
ACCESS_TOKEN = 'hoge'
ACCESS_SECRET = 'hoge'
auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)

# Create the API instance
api = tweepy.API(auth)

# Ready to operate the Twitter API from Python.
print("Done!")
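
twpy.py hard-codes the four keys in the source file. One possible alternative, shown here only as a sketch, is to read them from environment variables so they stay out of version control. The `TWITTER_` prefix and the helper name `load_twitter_credentials` are assumptions and not part of the original script.

```python
import os

REQUIRED_KEYS = ("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")


def load_twitter_credentials(env=None):
    """Read the four keys from TWITTER_* variables, failing early if any is missing."""
    env = os.environ if env is None else env
    missing = ["TWITTER_" + key for key in REQUIRED_KEYS if not env.get("TWITTER_" + key)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {key: env["TWITTER_" + key] for key in REQUIRED_KEYS}


# Demonstration with an explicit mapping instead of the real environment
demo_env = {
    "TWITTER_CONSUMER_KEY": "hoge",
    "TWITTER_CONSUMER_SECRET": "hoge",
    "TWITTER_ACCESS_TOKEN": "hoge",
    "TWITTER_ACCESS_SECRET": "hoge",
}
creds = load_twitter_credentials(demo_env)
print(sorted(creds))
```

The returned values could then be passed to `tweepy.OAuthHandler` and `auth.set_access_token` in the same way twpy.py passes its constants.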

Example run

$ python crawl.py
Done!
Latest saved ID is 92
Starting 98
inserted!
Starting 97
inserted!
Starting 96
inserted!
Starting 95
inserted!
Starting 94
inserted!
Starting 93
inserted!
Starting 92
This data has already been saved.

Sites I referred to (a partial list)

Python: Scraping websites with BeautifulSoup4
Operating the Twitter API from Python very easily with Tweepy

Thank you very much.
