More than 1 year has passed since last update.

I felt like scraping again after a long while, so here are proxy handling, header spoofing, async processing, and more with requests


Code I often use when scraping with Python + requests

I tend to forget things, so this is a memo.

Spoofing headers

pip install fake-useragent

import requests
from fake_useragent import UserAgent

ua = UserAgent()
header = {'User-Agent': str(ua.chrome)}
res = requests.get('https://example.com/', headers=header)
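
If fake-useragent is not available in your environment, one possible fallback is to pick a User-Agent from a small hand-maintained list. This is only a sketch: `FALLBACK_USER_AGENTS` and `build_headers` are hypothetical names that do not come from any library, and the strings are illustrative values you would replace with your own.

```python
import random

# Hypothetical pool of User-Agent strings; replace with values you trust.
FALLBACK_USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
]

def build_headers(user_agents=FALLBACK_USER_AGENTS):
    """Return a headers dict with a randomly chosen User-Agent."""
    return {"User-Agent": random.choice(user_agents)}

header = build_headers()
print(header)
```

The resulting dict can be passed to `requests.get(..., headers=header)` in the same way as the one built from `ua.chrome`.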

Reference link

Using a proxy

import requests

proxies = {
    'http' : 'http://user:pass@proxyip.0.0.1:80',
    'https' : 'http://user:pass@proxyip.0.0.1:80',
}

res = requests.get('https://google.com/', proxies=proxies)

If requests.exceptions.SSLError occurs

import requests

# Suppress the InsecureRequestWarning that verify=False triggers
import urllib3
from urllib3.exceptions import InsecureRequestWarning
urllib3.disable_warnings(InsecureRequestWarning)

proxies = {
    'http' : 'http://user:pass@proxyip.0.0.1:80',
    'https' : 'http://user:pass@proxyip.0.0.1:80',
}

res = requests.get('https://google.com/', proxies=proxies, verify=False)

Reference link

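Using a random proxy

Save the proxies in proxy.txt, one per line in the form ip:port:user:pass (the field order that the code indexes into).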

import random

# Load the proxies
with open('proxy.txt') as f:
    proxies = f.readlines()
proxies = [pl.strip() for pl in proxies]

set_proxy = random.choice(proxies).split(":")
proxy = {
    'http' : f'http://{set_proxy[2]}:{set_proxy[3]}@{set_proxy[0]}:{set_proxy[1]}',
    'https' : f'http://{set_proxy[2]}:{set_proxy[3]}@{set_proxy[0]}:{set_proxy[1]}',
}
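
The same f-string is repeated for every proxy dict in this memo, so it may be worth pulling into a helper. The sketch below assumes the ip:port:user:pass line format implied by the indexing above. `build_proxy_dict` is a hypothetical name, and percent-encoding the credentials is an addition of mine for passwords that contain characters such as `@`.

```python
from urllib.parse import quote

def build_proxy_dict(line):
    """Convert one 'ip:port:user:pass' line into a requests-style proxies dict."""
    # maxsplit=3 keeps any extra colons inside the password intact.
    parts = line.strip().split(":", 3)
    if len(parts) != 4:
        raise ValueError(f"expected ip:port:user:pass, got {line!r}")
    ip, port, user, password = parts
    # Percent-encode credentials so characters such as '@' do not break the URL.
    url = f"http://{quote(user, safe='')}:{quote(password, safe='')}@{ip}:{port}"
    return {"http": url, "https": url}

print(build_proxy_dict("203.0.113.10:8080:alice:p@ss"))
```

With this in place, the random pick would become `proxy = build_proxy_dict(random.choice(proxies))`.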

Using proxies in order

This is fairly handy when you run a task repeatedly in a while or for loop, or when you do asynchronous processing with async.

import queue

# Load the proxies
with open('proxy.txt') as f:
    proxies = f.readlines()
proxies = [pl.strip() for pl in proxies]

q = queue.Queue()

if q.empty():
    for i in proxies:
        q.put(i)

set_proxy = q.get().split(":")
proxy = {
    'http' : f'http://{set_proxy[2]}:{set_proxy[3]}@{set_proxy[0]}:{set_proxy[1]}',
    'https' : f'http://{set_proxy[2]}:{set_proxy[3]}@{set_proxy[0]}:{set_proxy[1]}',
}
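
As an alternative to refilling a queue whenever it runs empty, the standard library's `itertools.cycle` gives the same in-order, wrap-around behavior. This is a minimal sketch, not the method used in this memo. The proxy lines are made-up placeholders in the ip:port:user:pass format, and `next_proxy` is a hypothetical helper name.

```python
import itertools

# Hypothetical proxy lines; in the article these come from proxy.txt.
proxy_lines = [
    "203.0.113.10:8080:user1:pass1",
    "203.0.113.11:8080:user2:pass2",
]

# cycle() yields the items in order and starts over after the last one.
proxy_cycle = itertools.cycle(proxy_lines)

def next_proxy():
    """Return the next proxy in order as a requests-style proxies dict."""
    ip, port, user, password = next(proxy_cycle).split(":", 3)
    url = f"http://{user}:{password}@{ip}:{port}"
    return {"http": url, "https": url}

# Three picks from a two-item list: the third wraps back to the first proxy.
picked = [next_proxy()["http"] for _ in range(3)]
print(picked)
```

One design difference: a `queue.Queue` is thread-safe, while a shared `cycle` iterator is only a good fit for single-threaded code such as a plain loop or an asyncio event loop.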

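Example combined with asyncio

Hm, does asyncio.ensure_future no longer exist these days...?
For reference, my runtime environment is Python 3.7.x.
If there is a newer way to write this, I would appreciate it if someone could leave a comment.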

import asyncio
import aiohttp
import queue
from fake_useragent import UserAgent

# Build the header
ua = UserAgent()
header = {'User-Agent': str(ua.chrome)}

# Load the proxies
with open('proxy.txt') as f:
    proxies = f.readlines()
proxies = [pl.strip() for pl in proxies]

# Create the queue
q = queue.Queue()

async def url_access(url):
    while True:
        if q.empty():
            for i in proxies:
                q.put(i)
    
        set_proxy = q.get().split(":")
        # aiohttp takes the proxy as a single URL string, not a requests-style dict
        proxy = f'http://{set_proxy[2]}:{set_proxy[3]}@{set_proxy[0]}:{set_proxy[1]}'
        # ssl=False skips certificate verification (replaces the deprecated verify_ssl=False)
        connector = aiohttp.TCPConnector(ssl=False)
        timeout = aiohttp.ClientTimeout(total=30)
        async with aiohttp.ClientSession(headers=header, connector=connector, timeout=timeout) as session:
            async with session.get(url, proxy=proxy) as response:
                result = await response.text()

        await asyncio.sleep(30)

loop = asyncio.get_event_loop()
Urls = ["https://www.google.com/","https://www.yahoo.co.jp/","https://www.bing.com/"]
for u in Urls:
    asyncio.ensure_future(url_access(u))
loop.run_forever()

There may be a more efficient way to write this, but it worked like this.
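
On the question of newer syntax: as far as I know, `asyncio.ensure_future` still exists, but `asyncio.run` (available since Python 3.7) together with `asyncio.gather` or `asyncio.create_task` is the more commonly recommended entry point now. The sketch below shows only that scheduling pattern. `fake_fetch` is a hypothetical stand-in for the aiohttp request so the example runs offline, the proxy lines are placeholders, and each task stops after a fixed number of rounds instead of looping forever.

```python
import asyncio
import itertools

# Hypothetical proxy lines in the same ip:port:user:pass form as proxy.txt.
PROXY_LINES = [
    "203.0.113.10:8080:user1:pass1",
    "203.0.113.11:8080:user2:pass2",
]

def to_proxy_url(line):
    """Turn 'ip:port:user:pass' into the single URL string aiohttp expects for proxy=."""
    ip, port, user, password = line.split(":", 3)
    return f"http://{user}:{password}@{ip}:{port}"

async def fake_fetch(url, proxy_url):
    """Stand-in for the real aiohttp request (assumption: no network here)."""
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return f"{url} via {proxy_url}"

async def url_access(url, proxy_pool, rounds=2):
    """Fetch one URL a few times, taking the next proxy from the pool each time."""
    results = []
    for _ in range(rounds):
        proxy_url = to_proxy_url(next(proxy_pool))
        results.append(await fake_fetch(url, proxy_url))
    return results

async def main(urls):
    proxy_pool = itertools.cycle(PROXY_LINES)
    # gather() schedules one task per URL and waits for all of them.
    return await asyncio.gather(*(url_access(u, proxy_pool) for u in urls))

# asyncio.run() creates the event loop, runs main(), and closes the loop.
results = asyncio.run(main(["https://example.com/a", "https://example.com/b"]))
print(results)
```

For a scraper that should keep polling indefinitely, the bounded `for` loop would go back to `while True` with an `await asyncio.sleep(...)` between rounds, as in the code above.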
