[Python] Getting a file's character encoding

Posted at 2020-08-21

3. A function that detects the character encoding to use when reading a file

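・Checking and setting the character encoding every time I read a file is tedious, so I wrote a function that detects it automatically.
・It is especially handy when loading CSV files that contain Japanese and were created in Excel.
・It also handles files on the web (http/https URLs).
・Passing the return value as the encoding argument of open has worked without problems so far.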

from chardet.universaldetector import UniversalDetector
import requests


def check_encoding(file_path):
    '''Return the character encoding detected for a local file or an http(s) URL.'''
    detector = UniversalDetector()

    if file_path.startswith(('http://', 'https://')):
        # Stream the response so detection can stop before the whole body is downloaded.
        with requests.get(file_path, stream=True) as r:
            for binary in r.iter_content(chunk_size=4096):
                detector.feed(binary)
                if detector.done:
                    break
    else:
        with open(file_path, mode='rb') as f:
            for binary in f:
                detector.feed(binary)
                if detector.done:
                    break

    # close() finalizes the guess; result['encoding'] can be None if detection fails.
    detector.close()

    print("  ", detector.result, end=' => ')
    print(detector.result['encoding'])

    return detector.result['encoding']
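
As a minimal sketch of how the return value might be used: the hypothetical helper `read_csv_rows` below (not part of the original article) passes an encoding name to `open`. To keep the sketch runnable on its own, the demo hard-codes `'cp932'` where the value returned by `check_encoding` would normally go. The UTF-8 fallback for a `None` result is an assumption of this sketch, not something the article specifies.

```python
import csv
import os
import tempfile


def read_csv_rows(file_path, encoding, fallback='utf-8'):
    """Read a CSV file into a list of rows using the given encoding.

    `encoding` is meant to be the value returned by check_encoding().
    If it is None (detection failed), `fallback` is used instead; this
    fallback is an assumption of this sketch.
    """
    # newline='' lets the csv module handle line endings itself.
    with open(file_path, encoding=encoding or fallback, newline='') as f:
        return list(csv.reader(f))


# Demo: write a small CSV encoded as cp932, as Excel on Japanese Windows often does.
with tempfile.NamedTemporaryFile(suffix='.csv', delete=False) as tmp:
    tmp.write('名前,数量\r\nりんご,3\r\n'.encode('cp932'))
    demo_path = tmp.name

detected = 'cp932'  # stand-in for check_encoding(demo_path)
rows = read_csv_rows(demo_path, detected)
os.remove(demo_path)
print(rows)
```

Without an explicit `encoding`, `open` uses the platform's default, so the same file may load on one machine and fail on another.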

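・CSV files containing Japanese are often reported as Shift_JIS. Converting that name with the next function to cp932, Microsoft's superset of Shift_JIS, seems to work better.
・Pass the return value of the first function as the argument, and the function returns the encoding name to use.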

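def change_encoding(encoding):
    '''Map Shift_JIS aliases to cp932; return any other value unchanged.'''
    # Compare case-insensitively so 'Shift_JIS', 'SHIFT_JIS', and 'shift_jis' all match.
    if encoding is not None and encoding.lower() in ('shift_jis', 'sjis', 's_jis'):
        encoding = 'cp932'

    return encoding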
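
To illustrate why cp932 is the safer choice, here is a small sketch using only Python's built-in codecs. The helper `decodes_with` is a hypothetical name introduced for this example. It checks whether the bytes for '①', one of the Windows extension characters commonly found in Excel output, can be decoded with each codec.

```python
def decodes_with(data, encoding):
    """Return True if `data` (bytes) can be decoded with `encoding`."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# '①' is a Windows extension character: cp932 can encode it.
sample = '①'.encode('cp932')

strict_ok = decodes_with(sample, 'shift_jis')  # Python's strict Shift_JIS codec
cp932_ok = decodes_with(sample, 'cp932')       # Microsoft's superset
print(strict_ok, cp932_ok)
```

The strict `shift_jis` codec rejects these bytes while `cp932` accepts them, which is why mapping a detected "Shift_JIS" to cp932 before calling `open` tends to avoid decode errors.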

Reviews and corrections are welcome.
