More than 5 years have passed since last update.

Pythonであるディレクトリ以下のファイル全てに対して特定の文字列があるかチェックして対象行を出力

Last updated at 2014-04-15Posted at 2014-04-15

概要

[DIR_NAME]以下ファイル全てを対象に、
[TARGET_ENCODING_LIST]に定義されている文字コードのテキストファイルかチェックして、
テキストファイルなら、[SEARCH_WORD]があるか検索して、
結果を、[OUTPUT_NAME]のファイル名に出力します。

環境

Windows8＋Python2.6系

コード

find_directory.py

# !/usr/bin/python
# -*- coding: utf-8 -*-
# vim: fileencoding=utf-8

import os , sys , codecs

DIR_NAME = 'C:\\html\\HOGE\\'
OUTPUT_NAME = 'result_find_file_list.csv'

SEARCH_WORD = '<font'

TARGET_ENCODINGS = [
	'utf-8',
	'shift-jis',
	'euc-jp',
	'iso2022-jp'
]

FLAG_STDOUT = True
# FLAG_STDOUT = False

import os, sys

write = sys.stdout.write

def guess_charset(data):
	file = lambda d, encoding: d.decode(encoding) and encoding
	for enc in TARGET_ENCODINGS:
		try:
			file(data, enc)
			return enc
		except:
			pass
	return 'binary'

out = codecs.open(OUTPUT_NAME, 'w', 'shift-jis')
out.write('path,line_number,search,target_line\n')

for dirpath, dirs, files in os.walk(DIR_NAME):
	for fn in files:
		path = os.path.join(dirpath, fn)
		fobj = file(path, 'rU')
		data = fobj.read()
		fobj.close()
		try:
			enc = guess_charset(data)
		except:
			continue
		if enc == 'binary':
			continue
		count = 0
		try:
			for l in codecs.open(path, 'r', enc):
				count = count + 1
				if SEARCH_WORD in l:
					output = ''
					try:
						output = '"' + path + '","' + str(count) + '","' + SEARCH_WORD + '","' + l.replace('"',"'").replace('\r','').replace('\n','') + '"\r\n'
					except:
						continue
					if FLAG_STDOUT == True:
						write(output)
					out.write(output)
		except:
			continue

補足

例によって、例外処理は、適当です。
いろいろリファクタリングの余地ありですが、
明日実戦投入したいので、一旦、このまま投稿

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up