
Auto-detecting a text file's encoding before processing it

Python

Posted at 2013-06-24

To find out a text's encoding, it apparently works well to try decoding it
with each candidate encoding in turn and use the first one that succeeds.

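def conv_encoding(data):
    """Decode bytes by trying each candidate codec; return (text, codec name)."""
    # Order matters: the first codec that decodes without error wins.
    # Caveats: 'latin_1' accepts any byte sequence, so 'ascii' after it is never
    # reached, and 7-bit ISO-2022-JP data already decodes as 'utf_8'.
    lookup = ('utf_8', 'euc_jp', 'euc_jis_2004', 'euc_jisx0213',
              'shift_jis', 'shift_jis_2004', 'shift_jisx0213',
              'iso2022jp', 'iso2022_jp_1', 'iso2022_jp_2', 'iso2022_jp_3',
              'iso2022_jp_ext', 'latin_1', 'ascii')
    for encoding in lookup:
        try:
            # Return as soon as one codec decodes the whole input.
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # No candidate worked (unreachable while 'latin_1' is in the list).
    raise LookupError('could not detect encoding')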

# Read the file and detect its encoding
# (path is assumed to be defined elsewhere; binary mode keeps the raw bytes)
with open(path, 'rb') as fp:
    text, encoding = conv_encoding(fp.read())

# Edit the content
...<any code>


# Write the file back in its original encoding
with open(path, 'wb') as fp:
    fp.write(text.encode(encoding))
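
The try-each-codec idea can be checked without touching any files. What follows is a minimal Python 3 sketch, not the function above: the helper name `detect_and_decode` and the shortened candidate list are assumptions made for illustration. It shows that text encoded as UTF-8, EUC-JP, or Shift_JIS round-trips through detection, and also that the result is only a guess, because the first codec that happens to decode the bytes wins.

```python
CANDIDATES = ('utf_8', 'euc_jp', 'shift_jis')  # shortened list, for illustration only


def detect_and_decode(data, candidates=CANDIDATES):
    """Hypothetical helper: return (text, codec) for the first codec that decodes data."""
    for codec in candidates:
        try:
            return data.decode(codec), codec
        except UnicodeDecodeError:
            continue
    raise LookupError('no candidate codec could decode the data')


sample = '日本語のテキスト'
for codec in CANDIDATES:
    raw = sample.encode(codec)
    text, detected = detect_and_decode(raw)
    # Re-encoding with the detected codec should reproduce the original bytes.
    print(codec, detected, text == sample, text.encode(detected) == raw)

# Caveat: half-width katakana encoded as Shift_JIS is also a valid EUC-JP
# byte sequence, so it is reported as 'euc_jp' and decoded to other characters.
ambiguous = 'ｱｲ'.encode('shift_jis')
wrong_text, wrong_codec = detect_and_decode(ambiguous)
print(wrong_codec, wrong_text == 'ｱｲ')
```

Short inputs are the most likely to be ambiguous in this way, so for files where a wrong guess would be costly it may be safer to narrow the candidate list to the encodings you actually expect.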
