
Python programming: issuing HTTP requests with a timeout using urllib.request

Posted at 2021-10-19

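Introduction

This is a quick tip.
I was scraping a website (with sleeps between requests so as not to put load on it).
However, the script occasionally hung.

...Yes, I had forgotten to set a timeout.
As a result, it kept waiting until a response arrived, which in practice looked like a hang.

So I would like to revisit the code that issues the HTTP requests.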

What this article covers

Issuing HTTP requests with urllib.request

I verified the behavior with the following version.

Python: 3.6.10

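Sample code

Just the conclusion: here is the code before and after the change.

Code

In short, all you need to do is pass the timeout parameter to the urlopen function.
Example code follows.
For the complete original code, see **here**.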

Before

def get_links_paging(query=None, page=None):
    print(URL.format(urllib.parse.quote_plus(query, encoding='euc-jp'), page))
    try:
        # Set the User-Agent in the HTTP request headers
        req = Request(URL.format(urllib.parse.quote_plus(query, encoding='euc-jp'), page), data=None, headers={'User-Agent': USER_AGENT})
        with urlopen(req) as res:
            # Initialize BeautifulSoup from the HTML document
            soup = BeautifulSoup(res.read().decode('euc_jp', 'ignore'), 'html.parser')

After

def get_links_paging(query=None, page=None):
    print(URL.format(urllib.parse.quote_plus(query, encoding='euc-jp'), page))
    try:
        # Set the User-Agent in the HTTP request headers
        req = Request(URL.format(urllib.parse.quote_plus(query, encoding='euc-jp'), page), data=None, headers={'User-Agent': USER_AGENT})
        # Issue the HTTP request with a timeout of 10 seconds
        with urlopen(req, timeout=10) as res:
            # Initialize BeautifulSoup from the HTML document
            soup = BeautifulSoup(res.read().decode('euc_jp', 'ignore'), 'html.parser')

With this change, the script no longer waits forever.
When the timeout is reached, it now prints a message like the following and continues processing.

Exception:  The read operation timed out
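The excerpts above stop before the except clause, so here is a minimal, self-contained sketch of how the timeout might be caught so that processing can continue. The function name fetch_html and the USER_AGENT value are hypothetical and not taken from the original code. The exact message text can differ depending on where the timeout occurs and whether the connection uses HTTPS.

```python
import socket
import urllib.error
from urllib.request import Request, urlopen

# Hypothetical placeholder; the article's real USER_AGENT value is not shown.
USER_AGENT = "Mozilla/5.0 (compatible; example-scraper)"


def fetch_html(url, timeout=10.0, encoding="euc_jp"):
    """Return the decoded body of url, or None if it times out or cannot be opened."""
    req = Request(url, data=None, headers={"User-Agent": USER_AGENT})
    try:
        # timeout applies to blocking socket operations such as connect and read
        with urlopen(req, timeout=timeout) as res:
            return res.read().decode(encoding, "ignore")
    except socket.timeout as e:
        # Raised when the server accepts the connection but does not answer in time
        print("Exception: ", e)
    except urllib.error.URLError as e:
        # Failures while connecting (including connect timeouts) are wrapped in URLError
        print("Exception: ", e.reason)
    return None
```

A caller can then treat None as "skip this page and move on", for example `html = fetch_html(page_url, timeout=10)` followed by `if html is None: continue` inside the scraping loop.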

Incidentally, the requests module also lets you set a timeout.
There too, HTTP requests never time out by default.
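For comparison, the same idea with requests might look like the sketch below. It assumes the third-party requests package is installed. The function name fetch_html_with_requests and the USER_AGENT value are hypothetical.

```python
import requests

# Hypothetical placeholder; the article's real USER_AGENT value is not shown.
USER_AGENT = "Mozilla/5.0 (compatible; example-scraper)"


def fetch_html_with_requests(url, timeout=10.0, encoding="euc_jp"):
    """Return the decoded body of url, or None if the request times out."""
    try:
        # Without timeout=..., requests waits indefinitely for the server
        res = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=timeout)
    except requests.exceptions.Timeout as e:
        # Covers both connect timeouts and read timeouts
        print("Exception: ", e)
        return None
    # Decode the body with the site's encoding instead of the guessed one
    res.encoding = encoding
    return res.text
```

requests also accepts a tuple such as `timeout=(3, 10)` to set the connect and read timeouts separately.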

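Summary

This article showed how to issue HTTP requests with a timeout using urllib.request.
After issuing an HTTP request, wait a few seconds, and if there is no response, treat it as an error.
When you want to collect a lot of data, you cannot afford to stop at every stall! Set a timeout and keep moving forward!!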
