[Python][chardet] Automatically detecting a file's character encoding

Posted at 2017-03-03, last updated at 2017-03-03

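I looked into whether Python can automatically detect a file's character encoding, and it can, so here are my notes.

It turned out to be easy with the chardet package.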

Usage — chardet 2.3.0 documentation
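For data that is already in memory, chardet also provides a one-shot helper, `chardet.detect`, which takes a bytes object and returns a dict like the ones shown in this post. The following is only a minimal sketch: `detect_bytes` is a hypothetical wrapper name of my own, not part of chardet, and it assumes the whole input fits comfortably in memory.

```python
import chardet


def detect_bytes(data):
    """Return (encoding, confidence) for a bytes object.

    detect_bytes is a hypothetical helper for illustration only;
    chardet.detect is the actual library call.
    """
    result = chardet.detect(data)
    # 'encoding' can be None when chardet cannot make a guess.
    return result['encoding'], result['confidence']


if __name__ == '__main__':
    sample = ('こんにちは、世界。' * 50).encode('utf-8')
    print(detect_bytes(sample))
```

This is convenient for small inputs. For files, the incremental approach in the usage example avoids loading everything at once.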

Usage example

test.py

from chardet.universaldetector import UniversalDetector

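def check_encoding(file_path):
    detector = UniversalDetector()
    # Open in binary mode: the detector works on raw bytes, not decoded text.
    with open(file_path, mode='rb') as f:
        for binary in f:
            detector.feed(binary)
            # Stop reading as soon as the detector has reached a conclusion.
            if detector.done:
                break
    # close() finalizes the result once feeding is finished.
    detector.close()
    # Print the full result dict, then only the encoding name, one per line.
    print(detector.result)
    print(detector.result['encoding'])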

def main():
    check_encoding('/path/to/sjis.txt')
    check_encoding('/path/to/utf8.txt')

if __name__ == '__main__':
    main()

Example output

$ python test.py
{'encoding': 'CP932', 'confidence': 0.99}
CP932
{'encoding': 'utf-8', 'confidence': 0.99}
utf-8

Note that detection seems to take a little while on larger files.
(The UniversalDetector used above apparently does stop as soon as it has reached a determination.)
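One way to keep detection time bounded on large files is to feed the detector fixed-size chunks and give up after a byte limit, accepting whatever guess it has at that point. The sketch below is an assumption of mine, not something from the chardet documentation: the function name `check_encoding_capped` and the `chunk_size` and `max_bytes` defaults are hypothetical and should be tuned to your data.

```python
from chardet.universaldetector import UniversalDetector


def check_encoding_capped(file_path, chunk_size=4096, max_bytes=1024 * 1024):
    """Detect the encoding of file_path, reading roughly max_bytes at most.

    Hypothetical helper: chunk_size and max_bytes are illustrative defaults.
    """
    detector = UniversalDetector()
    bytes_read = 0
    with open(file_path, mode='rb') as f:
        while bytes_read < max_bytes:
            # Fixed-size reads keep each feed() call small, even for
            # files that contain very long lines or no newlines at all.
            chunk = f.read(chunk_size)
            if not chunk:
                break  # end of file
            detector.feed(chunk)
            bytes_read += len(chunk)
            if detector.done:
                break  # the detector has reached a conclusion
    detector.close()
    return detector.result
```

Returning the result dict instead of printing it lets the caller decide what to do with a low confidence value. The trade-off is that a guess based on only the first part of a file may be less reliable than one based on the whole file.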

References

Encoding detection in Python (Pythonにおけるエンコーディング判定) - Qiita
Usage — chardet 2.3.0 documentation
