[Python] Encoding detection with requests and httpx

Posted at 2025-09-30

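Encoding detection in requests

The requests library exposes two main properties for finding out a response's encoding.

response.encoding

Determined from the charset declared in the Content-Type header.

response.apparent_encoding

The encoding guessed from the response body by the charset_normalizer or chardet library.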
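To make the header-based lookup concrete, here is a minimal stdlib-only sketch of roughly how a charset can be derived from a Content-Type value. The function name charset_from_content_type is hypothetical, and this is a simplified approximation rather than requests' actual implementation. It does mirror one behaviour of requests worth knowing: a text/* type without a charset parameter falls back to ISO-8859-1.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(content_type: Optional[str]) -> Optional[str]:
    """Approximate header-based encoding selection (hypothetical helper)."""
    if not content_type:
        return None
    # Reuse the stdlib header parser to split the media type and its parameters.
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_param("charset")
    if charset:
        return str(charset)
    # No charset declared: text/* falls back to ISO-8859-1, as requests does.
    if msg.get_content_maintype() == "text":
        return "ISO-8859-1"
    return None


print(charset_from_content_type("text/html; charset=Shift_JIS"))
print(charset_from_content_type("text/html"))
print(charset_from_content_type("application/octet-stream"))
```

The second call is the interesting one: the header says nothing about the encoding, yet a non-None value comes back, which is easy to mistake for a declared charset.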

Verification

import requests

def use_requests_with_content_type_encoding(url: str):
    print("Sending a request with requests (encoding from Content-Type)")
    response = requests.get(url, timeout=10)
    # response.encoding is derived from the Content-Type header
    encoding_by_content_type = response.encoding
    print(f"★ Encoding from Content-Type: {encoding_by_content_type}")
    if encoding_by_content_type is None:
        print("Could not determine the encoding")
    else:
        html_text = response.content.decode(encoding_by_content_type)
        print("-" * 20 + "Site content" + "-" * 20)
        print(html_text[:1000])

def use_requests_with_auto_encoding(url: str):
    print("Sending a request with requests (auto-detected encoding)")
    response = requests.get(url, timeout=10)
    # response.apparent_encoding is guessed from the response body
    encoding_by_auto = response.apparent_encoding
    print(f"★ Auto-detected encoding: {encoding_by_auto}")
    if encoding_by_auto is None:
        print("Could not determine the encoding")
    else:
        html_text = response.content.decode(encoding_by_auto)
        print("-" * 20 + "Site content" + "-" * 20)
        print(html_text[:1000])

def main():
    # Encoding from Content-Type: ISO-8859-1, auto-detected encoding: CP932
    url = "http://www.asyura2.com/index.html"
    use_requests_with_content_type_encoding(url)
    print("\n" + "=" * 40 + "\n")
    use_requests_with_auto_encoding(url)

if __name__ == "__main__":
    main()

Output (excerpt)

★ Encoding from Content-Type: ISO-8859-1
--------------------Site content--------------------
<TITLE>¢Côf¦Â@·×ÄÌ\ð\«A^ÀÉBµæ¤ÆµÄ¢é</TITLE>
★ Auto-detected encoding: CP932
--------------------Site content--------------------
<TITLE>★阿修羅♪掲示板　すべての虚構を暴き、真実に到達しようとしている</TITLE>

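For this site, the encoding taken from Content-Type cannot be relied on. ISO-8859-1 is the value requests falls back to for text responses whose Content-Type has no charset, so the header presumably did not declare one. The encoding guessed by the library, on the other hand, decoded the page correctly.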
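The garbled title above appeared without any exception being raised. The following small sketch, which uses only the standard library and a short sample string taken from the page title, shows why: ISO-8859-1 maps every byte to some character, so decoding never fails, it just produces one wrong character per byte.

```python
# Sample text taken from the page title, encoded the way the site serves it.
title = "★阿修羅♪掲示板"
raw = title.encode("cp932")

# ISO-8859-1 accepts any byte value, so this "succeeds" but yields mojibake.
wrong = raw.decode("iso-8859-1")

# Decoding with the encoding the bytes were actually written in restores the text.
right = raw.decode("cp932")

# Byte count, mojibake length (one character per byte), and real character count.
print(len(raw), len(wrong), len(right))
```

Because the wrong decode cannot fail, a try/except around decode() will not catch this kind of mistake; the encoding has to be chosen correctly up front.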

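Encoding detection in httpx

Like requests, httpx exposes the encoding derived from Content-Type as response.encoding.
Encoding guessing works differently, however. Instead of offering a separate property, httpx lets you pass a detection function as default_encoding when creating the client, and uses it when the Content-Type header carries no charset information.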
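The order in which these sources are consulted can be modelled in a few lines. This is a simplified sketch under my own assumptions, not httpx's actual code: resolve_encoding and its parameters are hypothetical names, and the rule that an unknown header charset is skipped is my reading of the library's behaviour.

```python
import codecs
from typing import Callable, Optional, Union

# default_encoding may be a fixed codec name or a function that inspects the body.
DefaultEncoding = Union[str, Callable[[bytes], Optional[str]]]


def resolve_encoding(
    header_charset: Optional[str],
    default_encoding: DefaultEncoding,
    content: bytes,
) -> str:
    """Simplified model of how the response encoding is chosen (hypothetical)."""
    # 1. A charset from Content-Type wins, provided Python has a codec for it.
    if header_charset is not None:
        try:
            codecs.lookup(header_charset)
            return header_charset
        except LookupError:
            pass
    # 2. Otherwise fall back to default_encoding: a fixed name or a detector.
    if isinstance(default_encoding, str):
        return default_encoding
    return default_encoding(content) or "utf-8"


print(resolve_encoding("shift_jis", "utf-8", b""))
print(resolve_encoding(None, lambda body: "cp932", b"..."))
```

The practical consequence is that a detection function passed as default_encoding is not consulted for responses whose header already names a usable charset.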

Verification

import codecs

import chardet
import httpx

def use_httpx_with_auto_encoding(url: str):
    print("Sending a request with httpx")
    # The context manager closes the client's connections when done
    with httpx.Client(
        follow_redirects=True,
        default_encoding=autodetect,
    ) as client:
        response = client.get(url)
    print(f"★ Encoding: {response.encoding}")
    html_text = response.text
    print("-" * 20 + "Site content" + "-" * 20)
    print(html_text[:1000])

def autodetect(content: bytes) -> str:
    # chardet reports None as the encoding when it cannot make a guess
    encoding = chardet.detect(content).get("encoding")
    try:
        # Check that Python actually has a codec for the detected name
        codecs.lookup(encoding)
    except (LookupError, TypeError):
        # LookupError: unknown codec name; TypeError: encoding is None
        print(f"Automatic encoding detection failed: {encoding}")
        return "utf-8"
    return encoding

def main():
    url = "http://www.asyura2.com/index.html"
    use_httpx_with_auto_encoding(url)

if __name__ == "__main__":
    main()

Output (excerpt)

★ Encoding: CP932
--------------------Site content--------------------
<TITLE>★阿修羅♪掲示板　すべての虚構を暴き、真実に到達しようとしている</TITLE>

The page was decoded correctly.

Next, let's see what happens when the detected encoding is not one of Python's standard encodings.

def main():
    url = "https://nl.ijs.si/janes/wp-content/uploads/2016/09/Graphic-Euphemisms-in-Slovenian-CMC.html"
    use_httpx_with_auto_encoding(url)

Output (excerpt)

Automatic encoding detection failed: EUC-TW
★ Encoding: utf-8
--------------------Site content--------------------
<!DOCTYPE html>
<html lang="en">

<head>

    <meta charset="utf-8">
    <title>Graphic Euphemisms in Slovenian CMC</title>

Python has no EUC-TW codec, so codecs.lookup raised LookupError, which autodetect caught before falling back to "utf-8". Because the page itself is UTF-8 (it declares charset="utf-8"), the fallback happened to decode it correctly.
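The fallback worked here only because the page really was UTF-8. As one possible alternative, not something the original code does, a fallback can try a short list of candidate encodings in strict mode and keep the first that decodes cleanly. The helper name decode_with_fallbacks and the candidate list are assumptions chosen for illustration.

```python
from typing import Iterable, Tuple


def decode_with_fallbacks(
    content: bytes,
    candidates: Iterable[str] = ("utf-8", "cp932"),
) -> Tuple[str, str]:
    """Return (text, encoding) for the first candidate that decodes strictly."""
    for name in candidates:
        try:
            return content.decode(name), name
        except (UnicodeDecodeError, LookupError):
            # Wrong encoding for these bytes, or a codec Python does not have.
            continue
    # Last resort: never raise, but mark undecodable bytes with U+FFFD.
    return content.decode("utf-8", errors="replace"), "utf-8"


text, used = decode_with_fallbacks("阿修羅".encode("cp932"))
print(used)
```

Order matters in the candidate list: strict UTF-8 rejects most non-UTF-8 input, so it is a reasonable first choice, whereas a single-byte encoding such as ISO-8859-1 accepts everything and should only ever come last.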
