Environment

Python 3.6.5
- requests 2.19.1
- chardet 3.0.4
- flask 1.0.2
Windows 10

Background

I want to access an external site with Python's requests module. The site does not declare a character encoding (the Content-Type header has no charset parameter).
I assume an API like the following.

server.py

from flask import Flask, make_response, jsonify
import json
app = Flask(__name__)

@app.route('/test_OK')
def test_OK():
    resp = make_response("あ")
    resp.headers['Content-Type'] = 'text/plain;'
    return resp

@app.route('/test_NG')
def test_NG():
    resp = make_response("testあ")
    resp.headers['Content-Type'] = 'text/plain;'
    return resp

console

> set FLASK_APP=server.py
> set FLASK_ENV=development
> python -m flask run
 * Serving Flask app "server.py"
 * Environment: production
   WARNING: Do not use the development server in a production environment.
   Use a production WSGI server instead.
 * Debug mode: off
 * Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)

Accessing test_OK with requests.get gives the following result.

IPython

In [149]: r_ok = requests.get("http://localhost:5000/test_OK")

In [150]: r_ok.headers
Out[150]: {'Content-Type': 'text/plain;', 'Content-Length': '3', 'Server': 'Werkzeug/0.14.1 Python/3.6.6', 'Date': 'Fri, 28 Sep 2018 09:46:06 GMT'}

In [151]: r_ok.text
Out[151]: 'ã\x81\x82'

In [152]: r_ok.encoding
Out[152]: 'ISO-8859-1'

In [153]: r_ok.apparent_encoding
Out[153]: 'utf-8'

Because r_ok.encoding is "ISO-8859-1", r_ok.text is garbled. apparent_encoding is the encoding detected by the chardet library.
Setting r_ok.encoding to the value of apparent_encoding fixed the garbled text.

IPython

In [154]: r_ok.encoding = r_ok.apparent_encoding

In [157]: r_ok.text
Out[157]: 'あ'
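
The garbled value 'ã\x81\x82' is simply the three UTF-8 bytes of "あ" read one byte at a time as ISO-8859-1. A minimal sketch that reproduces this without any HTTP request, using only the standard library (the variable names are illustrative):

```python
# UTF-8 bytes of "あ", i.e. what the server actually sent (Content-Length: 3)
body = "あ".encode("utf-8")

# ISO-8859-1 maps every byte to exactly one code point, so this never fails
garbled = body.decode("iso-8859-1")

# The mapping is lossless, so the original bytes can be recovered and re-decoded
recovered = garbled.encode("iso-8859-1").decode("utf-8")

print(body)
print(repr(garbled))
print(ascii(recovered))
```

Because ISO-8859-1 never loses information, text garbled this way can still be repaired after the fact.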

Reference: https://kanji.hatenablog.jp/entry/python-requests-beautifulsoup-encoding

Problem

I displayed the text property of test_NG in the same way, but it was garbled, because apparent_encoding was Windows-1254 rather than UTF-8.

IPython

In [158]: r_ng = requests.get("http://localhost:5000/test_NG")

In [159]: r_ng.encoding = r_ng.apparent_encoding

In [160]: r_ng.text
Out[160]: 'testã�‚'

In [161]: r_ng.apparent_encoding
Out[161]: 'Windows-1254'
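
The replacement character in the output above can be reproduced offline. The sketch below assumes that r.text decodes r.content with errors="replace" (which is how I understand requests to behave), and uses Python's cp1254 codec as the equivalent of Windows-1254:

```python
body = "testあ".encode("utf-8")  # b'test\xe3\x81\x82'

# 0xE3 -> 'ã', 0x81 is unassigned in cp1254 -> U+FFFD, 0x82 -> U+201A
garbled = body.decode("cp1254", errors="replace")
print(ascii(garbled))

# A strict decode rejects the unassigned byte instead of replacing it
try:
    body.decode("cp1254")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
print(strict_ok)
```

Because the byte 0x81 is replaced, the original text cannot be recovered from r.text after this decode. r.content still holds the raw bytes.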

The Windows-1254 encoding

Windows-1254 is the Windows code page for Turkish.
https://uic.jp/charset/show/windows-1254/

For some reason, Windows-1254 is not listed in chardet's "Supported encodings".
https://chardet.readthedocs.io/en/latest/supported-encodings.html

Cause

chardet's detection was wrong. No method can determine a character encoding with complete certainty.
That is obvious in hindsight, but it took me a while to suspect that chardet's detection might be wrong.

Fix

I fixed the encoding to UTF-8. These days any reasonably maintained site should be serving UTF-8, so this should not cause problems.

res.encoding = "utf-8"
res.text
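
Hard-coding UTF-8 works when the site really is UTF-8. As a more defensive variant, one could try a strict UTF-8 decode first and fall back to another encoding only when that fails. The following is a minimal sketch under that assumption. decode_body and its fallback parameter are hypothetical names, not part of requests:

```python
def decode_body(content: bytes, fallback: str = "iso-8859-1") -> str:
    """Decode bytes as UTF-8 if they are valid UTF-8, otherwise use fallback."""
    try:
        return content.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: decode with the fallback, replacing undecodable bytes
        return content.decode(fallback, errors="replace")


utf8_text = decode_body("testあ".encode("utf-8"))
sjis_text = decode_body("testあ".encode("shift_jis"), fallback="shift_jis")
print(ascii(utf8_text), ascii(sjis_text))
```

With requests, this could be called as decode_body(r.content, fallback=r.apparent_encoding or "iso-8859-1"), so that the chardet guess is used only when the bytes are not valid UTF-8.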

Notes

Encoding behavior of the requests module

When you receive a response, Requests makes a guess at the encoding to use for decoding the response when you access the Response.text attribute. Requests will first check for an encoding in the HTTP header, and if none is present, will use chardet to attempt to guess the encoding.

The only time Requests will not do this is if no explicit charset is present in the HTTP headers and the Content-Type header contains text. In this situation, RFC 2616 specifies that the default charset must be ISO-8859-1. Requests follows the specification in this case. If you require a different encoding, you can manually set the Response.encoding property, or use the raw Response.content.

Quoted from http://docs.python-requests.org/en/master/user/advanced/#encodings

- If no charset is declared, requests guesses the encoding with chardet.
- Exception: if no charset is declared and the Content-Type contains "text", the encoding is ISO-8859-1.
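
These two rules can be sketched as a small function. This is a simplified stand-in written for illustration, not the actual requests implementation. encoding_from_content_type is a hypothetical name, and the parameter parsing is deliberately naive:

```python
from typing import Optional


def encoding_from_content_type(content_type: Optional[str]) -> Optional[str]:
    """Return the encoding implied by a Content-Type header value, or None."""
    if not content_type:
        return None
    media_type, *params = [part.strip() for part in content_type.split(";")]
    for param in params:
        key, _, value = param.partition("=")
        if key.strip().lower() == "charset" and value:
            return value.strip("'\" ")
    if "text" in media_type.lower():
        return "ISO-8859-1"  # default for text/* without a charset
    return None  # the caller would fall back to detection (chardet)


for header in ("text/plain;", "text/html; charset=utf-8", "application/json;"):
    print(header, "->", encoding_from_content_type(header))
```

For the 'text/plain;' header used by test_OK and test_NG, this yields ISO-8859-1, which matches the r_ok.encoding value shown earlier.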

chardet detection results

Perhaps because the input is so short, the confidence is low, at roughly 0.5.
Still, why Turkish?

IPython

In [163]: chardet.detect( "あ".encode("utf-8"))
Out[163]: {'encoding': 'utf-8', 'confidence': 0.505, 'language': ''}

In [164]: chardet.detect( "testあ".encode("utf-8"))
Out[164]:
{'encoding': 'Windows-1254',
 'confidence': 0.5889255495043456,
 'language': 'Turkish'}

In [165]: chardet.detect( "tあ".encode("utf-8"))
Out[165]: {'encoding': 'utf-8', 'confidence': 0.505, 'language': ''}

In [166]: chardet.detect( "stあ".encode("utf-8"))
Out[166]: {'encoding': 'utf-8', 'confidence': 0.505, 'language': ''}

In [167]: chardet.detect( "estあ".encode("utf-8"))
Out[167]:
{'encoding': 'Windows-1254',
 'confidence': 0.5153098558163024,
 'language': 'Turkish'}
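
The value 0.505 in the UTF-8 results does not look like a coincidence. As far as I can tell from chardet's source (utf8prober.py), the UTF-8 prober derives its confidence only from the number of multi-byte characters it has seen, starting from a per-character probability of 0.5. The sketch below re-implements that formula as I understand it. Treat the constants and the function name utf8_confidence as assumptions rather than as chardet's API:

```python
ONE_CHAR_PROB = 0.5  # assumed per-character probability used by the UTF-8 prober


def utf8_confidence(num_multibyte_chars: int) -> float:
    """Confidence that input is UTF-8, given how many multi-byte chars were seen."""
    unlike = 0.99
    if num_multibyte_chars < 6:
        unlike *= ONE_CHAR_PROB ** num_multibyte_chars
        return 1.0 - unlike
    return unlike


for n in (1, 2, 3, 6):
    print(n, round(utf8_confidence(n), 5))
```

If this reading is right, a single "あ" always gives a UTF-8 confidence of 0.505, so any other candidate that scores slightly higher, such as the 0.589 and 0.515 reported for Windows-1254 above, would be chosen instead. Longer Japanese text should raise the UTF-8 score.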

The json method decodes as Unicode

RFC 4627 (the older JSON specification) defines JSON text as being encoded in Unicode.

JSON text SHALL be encoded in Unicode. The default encoding is UTF-8.

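The json method of the requests module follows this specification. When the response declares no charset, so that r.encoding is not set, the body is decoded as UTF-8, UTF-16, or UTF-32, and valid JSON is not decoded as Windows-1254. Note that Flask's jsonify escapes non-ASCII characters as \uXXXX by default (the JSON_AS_ASCII setting), so the response body in the following example is probably pure ASCII anyway.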

server.py

@app.route('/test_json')
def test_json():
    resp = jsonify({'results': "testあ"})
    resp.headers['Content-Type'] = 'application/json;'
    return resp

client.py

r = requests.get("http://localhost:5000/test_json")
print(r.json())
# {'results': 'testあ'}
# The text is not garbled

requests/requests/models.py

        if not self.encoding and self.content and len(self.content) > 3:
            # No encoding set. JSON RFC 4627 section 3 states we should expect
            # UTF-8, -16 or -32. Detect which one to use; If the detection or
            # decoding fails, fall back to `self.text` (using chardet to make
            # a best guess).
            encoding = guess_json_utf(self.content)
            if encoding is not None:
                try:
                    return complexjson.loads(
                        self.content.decode(encoding), **kwargs
                    )

Quoted from https://github.com/requests/requests/blob/a6cd380c640087218695bc7c62311a4843777e43/requests/models.py#L883-L893
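
A detection like guess_json_utf only needs the first few bytes, because JSON text starts with ASCII characters and the position of the null bytes reveals the UTF variant. Below is a simplified sketch modeled on that idea. guess_json_utf_sketch is a hypothetical name, and this is not the requests source:

```python
import codecs
import json
from typing import Optional


def guess_json_utf_sketch(data: bytes) -> Optional[str]:
    """Guess the UTF variant of JSON bytes from a BOM or the null-byte pattern."""
    sample = data[:4]
    # Byte order marks take priority (check UTF-32 before UTF-16)
    if sample in (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE):
        return "utf-32"
    if sample[:3] == codecs.BOM_UTF8:
        return "utf-8-sig"
    if sample[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE):
        return "utf-16"
    nulls = sample.count(b"\0")
    if nulls == 0:
        return "utf-8"
    if nulls == 2:
        if sample[::2] == b"\0\0":   # null bytes at positions 0 and 2
            return "utf-16-be"
        if sample[1::2] == b"\0\0":  # null bytes at positions 1 and 3
            return "utf-16-le"
    if nulls == 3:
        if sample[:3] == b"\0\0\0":
            return "utf-32-be"
        if sample[1:] == b"\0\0\0":
            return "utf-32-le"
    return None


text = json.dumps({"results": "testあ"}, ensure_ascii=False)
for codec in ("utf-8", "utf-16-le", "utf-32-be"):
    print(codec, "->", guess_json_utf_sketch(text.encode(codec)))
```

Nothing here involves statistical guessing, which is why the json path does not end up at Windows-1254 the way apparent_encoding did.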

Open questions

Why is Windows-1254 not listed in chardet's "Supported encodings"?
Why is "testあ" judged to be Turkish? How does encoding detection actually work?

apparent_encoding comes out as Windows-1254 with Python requests.
