
How to access sites that you cannot reach with requests

Last updated at 2022-10-27. Posted at 2022-08-31.

Conclusion

Use a browser automation tool (such as Selenium)
Use a CLI tool that can get around the block

Problem

Accessing the site with Python's requests as usual raises an SSLError
- requests.exceptions.SSLError: HTTPSConnectionPool(host='ss148.litvp.com', port=443): Max retries exceeded with url:
The same site can be accessed from a browser
Many of the affected sites are behind Cloudflare

Why access fails

Bot protection
The server tells clients apart using signals such as the TLS fingerprint
The following page explains the details
- https://github.com/lwthiker/curl-impersonate/blob/main/README.md
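
To make the fingerprinting idea a little more concrete: one of the inputs to a TLS fingerprint is the list of cipher suites a client offers during the handshake. The sketch below only lists what Python's standard `ssl` module enables by default, which is the layer requests ultimately builds on. It does not compute a real fingerprint, and the helper name `default_cipher_names` is made up for illustration.

```python
import ssl


def default_cipher_names():
    """Return the cipher suite names enabled in Python's default TLS context."""
    context = ssl.create_default_context()
    # get_ciphers() returns one dict per enabled cipher; "name" is the OpenSSL name.
    return [cipher["name"] for cipher in context.get_ciphers()]


if __name__ == "__main__":
    names = default_cipher_names()
    print(f"{len(names)} cipher suites enabled by default")
    for name in names[:5]:
        print(" ", name)
```

A browser typically offers a different set in a different order, which is one way a server can tell the two apart.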

1. Use a browser automation tool

Pros
- It drives a real browser, so it can access the site the same way a real user environment does
Cons
- In exchange, it runs slowly and uses a lot of memory
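
As a rough illustration of this approach, here is a minimal sketch that assumes Selenium 4 and a local Chrome installation. The function name `fetch_with_browser` and the `wait_seconds` parameter are hypothetical and not part of the original article.

```python
def fetch_with_browser(url, wait_seconds=10):
    """Load url in headless Chrome and return the rendered HTML."""
    # Imported lazily so that this module can be loaded without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # run Chrome without a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.set_page_load_timeout(wait_seconds)
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always release the browser process
```

Starting and stopping a browser for every request is where most of the speed and memory cost comes from, so reusing one driver across several pages may be worth considering.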

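2. Use a CLI tool that can get around the block

Pros
- It makes only the requests it needs, so it is lighter than a browser automation tool
Cons
- Because it is a CLI tool, using it from Python or similar means running the command through something like subprocess
Examples
- Windows: Invoke-WebRequest, or curl-impersonate under WSL
- Mac, Linux: curl-impersonate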
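
For the Windows case, calling Invoke-WebRequest from Python might look roughly like the sketch below. It assumes Windows PowerShell is available as `powershell` on the PATH. The function names are hypothetical, and the output decoding is a guess that may need adjusting to the console encoding.

```python
import subprocess


def build_invoke_webrequest_command(url):
    """Build an argument list that asks PowerShell to print the response body."""
    # Inside a single-quoted PowerShell string, a literal ' is written as ''.
    quoted_url = url.replace("'", "''")
    script = f"(Invoke-WebRequest -Uri '{quoted_url}' -UseBasicParsing).Content"
    return ["powershell", "-NoProfile", "-Command", script]


def get_with_powershell(url):
    """Run the command on Windows and return the body as text."""
    result = subprocess.run(
        build_invoke_webrequest_command(url), capture_output=True, check=True
    )
    # PowerShell's output encoding depends on the system, so undecodable bytes are replaced.
    return result.stdout.decode(errors="replace")
```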

A wrapper for curl-impersonate

The escaping here is questionable. Treat this as a reference only.

import os
import subprocess
import sys

def get_from_curl_impersonate(url, headers=None):
    # -s: silent mode, hides progress output
    # -L: follow redirects (Location header)
    if os.name == 'nt':
        dirpath = '~/scripts/others/curl-impersonate/'
        replaced_url = url.replace('"', '""')
        command = f"wsl {dirpath}curl_chrome104 -s -L \"{replaced_url}\""
    elif os.name == 'posix':
        replaced_url = url.replace('"', '\\"')
        command = f'curl_chrome104 -s -L "{replaced_url}"'
    else:
        raise OSError('This OS is not supported yet')

    if headers is not None:
        headers_str = ''
        for key, value in headers.items():
            if os.name == 'nt':
                formatted_key = key.replace('"', '""')
                formatted_value = value.replace('"', '""')
                headers_str += f"-H \"{formatted_key}: {formatted_value}\" "
            else:
                formatted_key = key.replace('"', '\\"')
                formatted_value = value.replace('"', '\\"')
                headers_str += f'-H "{formatted_key}: {formatted_value}" '
        command += f" {headers_str}"

    result = subprocess.run(command, shell=True, capture_output=True)
    html_str = result.stdout.decode()

    if result.stderr is not None and result.stderr != b'':
        print(result.stderr.decode(), file=sys.stderr)

    return html_str
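
One way to sidestep the escaping concern on Mac and Linux is to skip the shell entirely and pass the arguments as a list, so that quotes in the URL or headers need no escaping. The following is a sketch under the assumption that `curl_chrome104` is on the PATH. The names `build_curl_impersonate_args` and `get_without_shell` are hypothetical, and the WSL case on Windows is not covered.

```python
import subprocess


def build_curl_impersonate_args(url, headers=None, binary="curl_chrome104"):
    """Build an argument list for curl-impersonate without going through a shell."""
    args = [binary, "-s", "-L"]
    for key, value in (headers or {}).items():
        # Each header is a single argument, so no quoting is needed.
        args += ["-H", f"{key}: {value}"]
    args.append(url)
    return args


def get_without_shell(url, headers=None):
    """Run curl-impersonate directly (shell=False) and return the body as text."""
    result = subprocess.run(
        build_curl_impersonate_args(url, headers), capture_output=True
    )
    return result.stdout.decode()
```

For example, `build_curl_impersonate_args('https://example.com/?a="b"')` keeps the double quotes in the URL untouched, because no shell ever parses the command line.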
