More than 5 years have passed since last update.

Checking the character encoding of every file under a directory with Python and writing out the results

Posted at 2014-04-15

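Overview

The script walks every file under [DIR_NAME],
checks whether each one is a text file in one of the character encodings defined in [TARGET_ENCODING_LIST],
and writes the results to the file named by [OUTPUT_NAME].
If no encoding in the list matches, the file is reported as binary.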
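
The check itself amounts to trial decoding: try each candidate encoding in order and report the first one that decodes without error. The following is a minimal sketch of that idea on in-memory bytes; the names `CANDIDATES` and `first_decodable` are hypothetical and used only for illustration.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch only; CANDIDATES and first_decodable are hypothetical names.
CANDIDATES = ['utf-8', 'shift-jis', 'euc-jp', 'iso2022-jp']


def first_decodable(data, candidates=CANDIDATES):
    """Return the first candidate that decodes data without error, else 'binary'."""
    for enc in candidates:
        try:
            data.decode(enc)
            return enc
        except UnicodeError:
            # This encoding does not fit; try the next one.
            continue
    return 'binary'


text = u'\u65e5\u672c\u8a9e'  # the word "Japanese" written in kanji
print(first_decodable(text.encode('utf-8')))
print(first_decodable(text.encode('shift-jis')))
print(first_decodable(text.encode('iso2022-jp')))
```

Because the first successful decode wins, the order of the list matters. ISO-2022-JP uses only 7-bit bytes, so its output is also valid UTF-8, and with 'utf-8' listed first such data is reported as utf-8. Plain ASCII files are reported as utf-8 for the same reason.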

Environment

Windows 8 + Python 2.6.x

Code

check_encoding.py
#!/usr/bin/python
# -*- coding: utf-8 -*-
# vim: fileencoding=utf-8

import os
import sys

DIR_NAME = 'C:\\Program Files\\'
OUTPUT_NAME = 'result_file_encoding_list.txt'

# Encodings are tried in this order; the first one that decodes cleanly wins.
TARGET_ENCODING_LIST = [
    'utf-8',
    'shift-jis',
    'euc-jp',
    'iso2022-jp'
]

FLAG_STDOUT = True
#FLAG_STDOUT = False

write = sys.stdout.write


def guess_charset(data):
    """Return the first encoding that decodes data, or 'binary' if none does."""
    for enc in TARGET_ENCODING_LIST:
        try:
            data.decode(enc)
            return enc
        except UnicodeError:
            pass
    return 'binary'


out = open(OUTPUT_NAME, 'w')
try:
    for dirpath, dirs, files in os.walk(DIR_NAME):
        for fn in files:
            path = os.path.join(dirpath, fn)
            try:
                # Read raw bytes; text mode ('rU') would alter the data before decoding.
                fobj = open(path, 'rb')
                try:
                    data = fobj.read()
                finally:
                    fobj.close()
            except (IOError, OSError):
                # Skip files that cannot be opened or read (e.g. permission denied).
                continue
            line = path + ',' + guess_charset(data) + '\n'
            try:
                if FLAG_STDOUT:
                    write(line)
                out.write(line)
            except (IOError, UnicodeError):
                # Skip lines that cannot be written (e.g. unencodable file names).
                continue
finally:
    out.close()

Notes

The exception handling is rough.
If a file name contains Japanese characters, the output may come out garbled.
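
One possible way to reduce the garbling of Japanese file names is to handle paths as text (unicode) and write the result file with an explicit encoding instead of relying on the console code page. The sketch below is an assumption, not part of the original script: `write_paths` is a hypothetical helper, and UTF-8 is an arbitrary choice of output encoding.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch only; write_paths is a hypothetical helper.
import io
import os


def write_paths(root, output_path):
    """Write every file path under root to output_path, one per line, as UTF-8."""
    # io.open with an explicit encoding takes text strings on Python 2.6+ and
    # Python 3, so the file content does not depend on the console code page.
    # On Python 2, pass root as a unicode string (u'...') so that os.walk
    # returns unicode paths as well.
    with io.open(output_path, 'w', encoding='utf-8') as out:
        for dirpath, dirs, files in os.walk(root):
            for fn in files:
                out.write(os.path.join(dirpath, fn) + u'\n')
```

Calling `write_paths(u'C:\\Program Files\\', u'paths.txt')` would then produce a UTF-8 listing; printing the same paths to a console can still fail or garble if the console encoding cannot represent them.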
