
Encoding Detection in Python

Posted at 2015-06-15, last updated at 2015-06-16

The go-to library

chardet

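Overview of chardet

chardet reads a byte string and guesses the encoding in use from the byte patterns it finds.

There are two basic ways to use chardet:

Pass the byte string to the detect function
If the byte string is too large to pass in one go, create a UniversalDetector object and feed it the data a little at a time with the feed method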

Method 1

import chardet
from urllib.request import urlopen

# Read the whole response body as bytes and let chardet guess its encoding.
with urlopen('http://qiita.com/') as response:
    html = response.read()
    print(chardet.detect(html))  # e.g. {'confidence': 0.99, 'encoding': 'utf-8'}
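
The dictionary returned by detect is usually a means to an end: decoding the bytes. The following is a minimal sketch of that step, not part of chardet itself. The helper name decode_with_guess, the min_confidence threshold of 0.5, and the UTF-8 fallback are all assumptions chosen for illustration; the detect parameter is injectable so the helper can be tried with a stand-in detector.

```python
def decode_with_guess(data, detect=None, min_confidence=0.5, fallback="utf-8"):
    """Decode bytes using a detector's guess; return (text, encoding_used).

    Hypothetical helper: `min_confidence` and `fallback` are illustrative
    defaults, not values recommended by chardet.
    """
    if detect is None:
        # Imported lazily so a stand-in detector can be supplied instead.
        import chardet
        detect = chardet.detect

    result = detect(data)
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0

    # Distrust a missing or low-confidence guess and use the fallback instead.
    if encoding is None or confidence < min_confidence:
        encoding = fallback

    try:
        return data.decode(encoding), encoding
    except (UnicodeDecodeError, LookupError):
        # The guess was wrong or names an unknown codec: decode leniently.
        return data.decode(fallback, errors="replace"), fallback
```

Called as decode_with_guess(html) with the bytes from Method 1, it uses chardet.detect by default and returns both the decoded text and the encoding that was actually applied.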

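Method 2

The main interface of UniversalDetector:

detector.feed: method that feeds a byte string to the detector
detector.done: attribute that becomes True once the confidence exceeds a certain threshold; used to decide when to stop
detector.result: attribute that holds the result
detector.reset: method that resets the object to its initial state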

from chardet.universaldetector import UniversalDetector
from urllib.request import urlopen

detector = UniversalDetector()

with urlopen('http://qiita.com/') as response:
    # Feed the response one line at a time and stop as soon as chardet is confident.
    for line in response:
        detector.feed(line)
        if detector.done:
            break
detector.close()  # finalize the guess before reading the result
print(detector.result)  # e.g. {'confidence': 0.99, 'encoding': 'utf-8'}

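What happens here is simple: detector.feed passes the response to the detector one line at a time, detector.done is checked after each line to see whether detection has finished, and once the loop ends detector.close() finalizes the guess and the result is printed.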
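
The same feed/done/close flow also works for local files or any other binary stream, and reset makes it possible to reuse one detector across several inputs. The sketch below is one way to wrap that up; the function name detect_stream and the 4096-byte chunk size are assumptions for illustration, and the detector argument accepts any object with the interface listed above.

```python
def detect_stream(stream, detector=None, chunk_size=4096):
    """Feed a binary stream to a detector in chunks and return its result.

    Hypothetical helper: `chunk_size` is an arbitrary illustrative value.
    """
    if detector is None:
        # Imported lazily so a stand-in detector can be supplied instead.
        from chardet.universaldetector import UniversalDetector
        detector = UniversalDetector()

    detector.reset()  # clear any state left over from a previous input
    # Read fixed-size chunks until the stream returns an empty bytes object.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        detector.feed(chunk)
        if detector.done:  # confident enough; skip the rest of the stream
            break
    detector.close()  # finalize the guess before reading the result
    return detector.result
```

Passing a file opened in binary mode, for example open(path, 'rb'), should give the same kind of dictionary as in Method 2. To check many files, one UniversalDetector can be created once and passed as detector each time, since the function resets it before every input.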

Further reading

"How it works" in the chardet documentation
The first half of chapter 15 of Dive Into Python 3
