More than 5 years have passed since last update.

pythonでlog crawlerみたいツールを作ってみました！（引き続き拡張や改善中）

Last updated at 2018-02-01Posted at 2018-02-01

最近pythonが流行っていますので、独学のため、Qiitaに記事を書いています。
ちょうど仕事の現場でログ分析の作業があって、でもログの量が多くて手作業は無理でしたが。。。
あっれ、pythonでやってみようかというきっかけて本文ができました。

log crawlerって何？　作業の目標は？

どんなサーバーでも、このサーバーを監視、保守、分析するなど作業に、
様々なログが絶対に必要ですよね。

例えば、下記のようにログファイルがあります。（表示されたのは一部のみ）

log.txt

.
.
.
server-1:2018-01-01T01:32:26+09:00	linux.INFO  [test.pool.Executor] (EJB default - 12) doExecute end: para=[3/4], executedCount=[100/1225/5008, 200/1272/5100, 2345/1083/2750], times=100ms"}
server-1:2018-01-01T01:32:26+09:00	linux.INFO  [test.pool.Timer] (EJB timer - 13) doTrigger end, times=234234ms"}
server-1:2018-01-01T01:33:00+09:00	linux.INFO  [test.pool.Timer] (EJB timer - 4) doTrigger start"}
server-1:2018-01-01T01:33:26+09:00	linux.INFO  [test.pool.Executor] (EJB default - 8) doExecute end: thread=[1/4], executedCount=[353/1304/5008, 234/1273/5100, 657/1109/2750], times=255ms"}
server-1:2018-01-01T01:33:26+09:00	linux.INFO  [test.pool.Executor] (EJB default - 5) doExecute end: thread=[2/4], executedCount=[466/1236/5008, 568/1280/5100, 11/1153/2750], times=27275ms"}
server-1:2018-01-01T01:33:26+09:00	linux.INFO  [test.pool.Executor] (EJB default - 8) doExecute end: thread=[4/4], executedCount=[123/1243/5008, 12/1275/5100, 6776/1205/2750], times=333ms"}
server-1:2018-01-01T01:33:26+09:00	linux.INFO  [test.pool.Executor] (EJB default - 9) doExecute end: thread=[3/4], executedCount=[666/1225/5008, 0/1272/5100, 12423/1083/2750], times=12ms"}
server-1:2018-01-01T01:33:26+09:00	linux.INFO  [test.pool.Timer] (EJB timer - 15) doTrigger end, times=27646ms"}
server-1:2018-01-01T01:34:00+09:00	linux.INFO  [test.pool.Timer] (EJB timer - 46) doTrigger start"}
server-1:2018-01-01T01:34:25+09:00	linux.INFO  [test.pool.Executor] (EJB default - 13) doExecute end: thread=[1/4], executedCount=[234/1304/5008, 454/1273/5100, 567/1109/2750], times=3434ms"}
server-1:2018-01-01T01:34:25+09:00	linux.INFO  [test.pool.Executor] (EJB default - 56) doExecute end: thread=[2/4], executedCount=[777/1236/5008, 68/1280/5100, 0/1153/2750], times=56565ms"}
server-1:2018-01-01T01:34:25+09:00	linux.INFO  [test.pool.Executor] (EJB default - 78) doExecute end: thread=[4/4], executedCount=[889/1243/5008, 987/1275/5100, 46/1205/2750], times=55ms"}
server-1:2018-01-01T01:34:26+09:00	linux.INFO  [test.pool.Executor] (EJB default - 12) doExecute end: thread=[3/4], executedCount=[34/1225/5008, 234/1272/5100, 0/1083/2750], times=46ms"}
server-1:2018-01-01T01:34:26+09:00	linux.INFO  [test.pool.Timer] (EJB timer - 1) doTrigger end, times=12122ms"}
server-1:2018-01-01T01:35:00+09:00	linux.INFO  [test.pool.Timer] (EJB timer - 2) doTrigger start"}
server-1:2018-01-01T01:35:28+09:00	linux.INFO  [test.pool.Executor] (EJB default - 4) doExecute end: thread=[1/4], executedCount=[567/1328/5079, 3/1292/5166, 778/1117/4601], elapse=5656ms"}
server-1:2018-01-01T01:35:28+09:00	linux.INFO  [test.pool.Executor] (EJB default - 4) doExecute end: thread=[4/4], executedCount=[789/1259/5079, 0/1289/5166, 234/1219/4601], elapse=2765ms"}
server-1:2018-01-01T01:35:29+09:00	linux.INFO  [test.pool.Executor] (EJB default - 5) doExecute end: thread=[2/4], executedCount=[345/1256/5079, 0/1296/5166, 6787/1169/4601], elapse=7684ms"}
server-1:2018-01-01T01:35:29+09:00	linux.INFO  [test.pool.Executor] (EJB default - 9) doExecute end: thread=[3/4], executedCount=[6/1236/5079, 0/1289/5166, 234/1096/4601], elapse=44ms"}
server-1:2018-01-01T01:35:29+09:00	linux.INFO  [test.pool.Timer] (EJB timer - 2) doTrigger end, times=231ms"}
server-1:2018-01-01T01:36:00+09:00	linux.INFO  [test.pool.Timer] (EJB timer - 5) doTrigger start"}
.
.
.

で、要件は、性能などを分析するため、
Triggerごとで、startからendまで、4つthreadに、三つnode(分散サーバー)の実施済み件数（xxx/yyy/zzz）のxxx部分の総計と実施時間(time)を統計したいです。
でもファイル本体は1Gぐらいので、やっはり手作業で無理で、pythonで挑戦してみました。
簡単に言えば、

目標は要件により、ログを検索、統計、見やすく形に出力すること。

pythonがやることは：

input:一つまたは複数のログファイル
excute:検索、抽出、絞り、整形
output:一つまたは複数のファイル

できたツールの実施

c:[toolのpath]> python logCrawler.py input_test output_test log.txt

実施後、出力された結果

output.txt

処理開始時間	処理総件数	処理時間(ms)
2017-12-29T23:45:33	782	382
2017-12-30T07:32:42	900	529
2017-12-30T07:33:40	755	056
2017-12-30T08:46:00	5607	3404
2017-12-30T09:45:42	712	309
2017-12-30T10:45:37	616	424
2017-12-30T11:46:00	37685	1503
2017-12-30T02:45:32	515	45
2017-12-30T02:46:50	812	740
2017-12-31T08:46:00	851	276
2017-12-31T10:46:00	5258	800
2017-12-31T11:46:00	26194	9771
.
.
.

きれいになりましたよね。それで図を作るとか分析は簡単ですね。処理速度もすごく早いです！
さっさとコーディングにしようか！
PS：時間の都合で、コードの整理や拡張などまだそこまでやってないですが、
　　気持ちには、より高い柔軟性、拡張性のように、引き続き改善、拡張をやります。
　　不足や問題点があればどんどん指摘してください。

処理の流れ

①　ログファイルの読み込み
②　正規表現で得たい情報の検索、提出（重要）
③　整形や条件で絞り
④　ファイルに出力

メインメソッド

#########################
#     メイン処理
#########################
if __name__ == '__main__':

    # 初期化
    initialize(sys.argv)

    # ログを取得
    content = read_log_files()
    
    # 正規式を生成（重要）
    regex = make_regex()

    # 正規表現で分析、取得（重要）
    results = do_findall(regex, content)

    # 結果を出力プールに格納、整理
    output_pool = make_output_pool(results)
      
    # ファイルを出力
    output_files_from_pool(output_pool)
    
    # 完了
    quit()

それぞれメソッドは下記

0：コマンドラインからパスやログファイル名などの指定

# 初期化
def initialize(argv):
    # 引数のチェック
    if len(argv) < 4:
        print('引数を下記のように入力してください：\n'
              '引数[1]:入力ファイル(logファイル)のパス\n'
              '引数[2]:出力ファイル(outputファイル)のパス\n'
              '引数[3]:入力ファイル(logファイル)のファイル名\n')
        quit()

    # global宣言（しないと、変更しません）
    global i_file_path
    global o_file_path
    # i_file_nameが'*'で指定される場合は、全ファイルを検索対象に
    global i_file_name
    
    i_file_path = argv[1]
    o_file_path = argv[2]
    i_file_name = argv[3]
    
    print('初期化完了、引数１:{0}、引数２:{1}、引数３:{2}'.format(argv[1],argv[2],argv[3]))

1：ログファイルの読み込み

# パス下であるファイルを読み込み
# 複数のファイルを一括で読み込みできる
def read_log_files():

    all_log = ''
    
    # 現在のパス
    current_dir = os.path.dirname(os.path.abspath(__file__))
    
    # 現在のパス + 入力された相対パス
    input_dir = current_dir + '\\' + i_file_path
    # 入力パスが存在しない場合
    if not os.path.exists(input_dir):
        print('ERROR:指定されたパスが存在しません。\n{0}'.format(input_dir))
        quit()
        
    # 全ファイルを走査
    for parent,dirnames,filenames in os.walk(input_dir):
       for filename in filenames:
           # 対象ログを読み込み
           # *の場合は全ファイルを対象に
           if i_file_name == '*' or i_file_name == filename:
               full_path = os.path.join(parent,filename)
               
               # 対象ログの内容を結合
               all_log += get_log_from_file(full_path)
    
    if len(all_log) == 0:
        print('ERROR:指定されたパスにファイルまたは、ファイルの内容がありません。\n{0}'.format(input_dir))
        quit()
    
    print('ファイル読み込みは完了')
    
    return all_log


# 指定ファイルを読み込み
def get_log_from_file(file_name):

    # CP932のコードを?に
    log_file = open(file_name,'r',encoding='CP932', errors='replace')
    
    log = log_file.read()
    log_file.close()
    
    return log

２：正規表現で得たい情報の検索、提出

業務の核、すごく重要な部分。
ログの規則を見つけて、適切な正規式を作って、必要な情報を()で囲んで取り出した。

# 正規表現の作成
def make_regex():

    # 先頭
    start = '.*?doTrigger start.*?'
    
    # 実行時間 output_index(0)
    start_time = '(\d\d\d\d-\d\d-\d\dT\d\d:\d\d:\d\d)' + '.*?'
    
    # 処理数   3*4=12   output_index(1～12)
    # xx/xx/xx
    unit_count = ', (\d+)\/\d+\/\d+'
    run_count_1 = 'executedCount=\[(\d+)\/\d+\/\d+' + unit_count*2 + '.*?'
    run_count_4 = run_count_1 * 4
    
    # 処理時間  output_index(13)
    run_time = 'doTrigger end, times=(\d+)ms'
    
    # 末尾
    end = '.*?\n'
    
    regex = start + start_time + run_count_4 + run_time + end
    print('正規式作成完了:{0}'.format(regex))
    
    return regex


# 正規表現で分析、取得
def do_findall(regex, content):
    
    # 処理便利のため、"を除く
    content = re.sub('"', '', content)
    
    # findall実施
    # ★★★★★これでログに必要な情報を取りました！！！★★★★★
    results = re.findall(regex, content, re.S)
    
    if not results:
        print('結果件数：0')
        quit()
    
    print('走査完了、結果件数：{0}'.format(len(results)))
    
    return results

３：整形や条件で絞り

# 出力プールの作成
# 要件により、keyはファイル名として複数の出力ファイルの作成もできますが、今回は1ファイルに
def make_output_pool(results):

    output_pool = {}
    
    if results: #none以外の場合
       for result in results:
          # こちらで条件も指定できる
          # 10000件以上、または100000ms以上なら、統計になる
          # if int(result[2]) >= 10000 or int(result[5]) >= 100000:
             # 出力ファイル名
             key = 'output'
             # 出力ファイル名は、統計内容から取るのも可能
             #key = result[0] + '_' + result[1][:10]
             
             # 統計、加算
             tatol_count = 0
             for i in range(1,12):
                 tatol_count += int(result[i])
             
             # 1000件以下のデータが出力ほしいなら
             if tatol_count > 1000:
                 
                 # 出力の一行
                 line = result[0] + '\t'+ str(tatol_count) + '\t' + result[13] + '\n'
      
                 # 複数出力ファイルも対応
                 if key not in output_pool:
                    # タイトルの設定
                    output_pool[key] = '処理開始時間\t処理トータル件数\t処理時間(ms)\n'
                    # outputプールに格納
                    output_pool[key] += line
                 else:
                    # outputプールに追加
                    output_pool[key] += line
                    
    return output_pool

    print('出力プール構造完了、ファイル個数：{0}',len(output_pool))

４：ファイルに出力

# 用意されたプールの出力                
def output_files_from_pool(output_pool):
    for k, v in output_pool.items():
      output_file_name = o_file_path + '\\' + k + '.txt'
      file_output = open(output_file_name,'w')
      file_output.write(v)
      file_output.close()
      
      current_dir = os.path.dirname(os.path.abspath(__file__))
      print('ファイル出力完了：{0}'.format(current_dir + '\\' + o_file_path))

全コード

logCrawler.py

# -*- coding: utf-8 -*-

import re
import os
import sys

#########################
#    グローバル変数
#########################
# 入力ファイルのパス
i_file_path = ''
# 出力ファイルのパス
o_file_path = ''
# 入力ファイルのファイル名
i_file_name = ''


#########################
#   プライベートメソッド
#########################

# 初期化
def initialize(argv):
    # 引数のチェック
    if len(argv) < 4:
        print('引数を下記のように入力してください：\n'
              '引数[1]:入力ファイル(logファイル)のパス\n'
              '引数[2]:出力ファイル(outputファイル)のパス\n'
              '引数[3]:入力ファイル(logファイル)のファイル名\n')
        quit()

    # global宣言（しないと、変更しません）
    global i_file_path
    global o_file_path
    # i_file_nameが'*'で指定される場合は、全ファイルを検索対象に
    global i_file_name
    
    i_file_path = argv[1]
    o_file_path = argv[2]
    i_file_name = argv[3]
    
    print('初期化完了、引数１:{0}、引数２:{1}、引数３:{2}'.format(argv[1],argv[2],argv[3]))


# パス下であるファイルを読み込み
# 複数のファイルを一括で読み込みできる
def read_log_files():

    all_log = ''
    
    # 現在のパス
    current_dir = os.path.dirname(os.path.abspath(__file__))
    
    # 現在のパス + 入力された相対パス
    input_dir = current_dir + '\\' + i_file_path
    # 入力パスが存在しない場合
    if not os.path.exists(input_dir):
        print('ERROR:指定されたパスが存在しません。\n{0}'.format(input_dir))
        quit()
        
    # 全ファイルを走査
    for parent,dirnames,filenames in os.walk(input_dir):
       for filename in filenames:
           # 対象ログを読み込み
           # *の場合は全ファイルを対象に
           if i_file_name == '*' or i_file_name == filename:
               full_path = os.path.join(parent,filename)
               
               # 対象ログの内容を結合
               all_log += get_log_from_file(full_path)
    
    if len(all_log) == 0:
        print('ERROR:指定されたパスにファイルまたは、ファイルの内容がありません。\n{0}'.format(input_dir))
        quit()
    
    print('ファイル読み込みは完了')
    
    return all_log


# 指定ファイルを読み込み
def get_log_from_file(file_name):

    # CP932のコードを?に
    log_file = open(file_name,'r',encoding='CP932', errors='replace')
    
    log = log_file.read()
    log_file.close()
    
    return log
    

# 正規表現の作成
def make_regex():

    # 先頭
    start = '.*?doTrigger start.*?'
    
    # 実行時間 output_index(0)
    start_time = '(\d\d\d\d-\d\d-\d\dT\d\d:\d\d:\d\d)' + '.*?'
    
    # 処理数   3*4=12   output_index(1～12)
    # xx/xx/xx
    unit_count = ', (\d+)\/\d+\/\d+'
    run_count_1 = 'executedCount=\[(\d+)\/\d+\/\d+' + unit_count*2 + '.*?'
    run_count_4 = run_count_1 * 4
    
    # 処理時間  output_index(13)
    run_time = 'doTrigger end, times=(\d+)ms'
    
    # 末尾
    end = '.*?\n'
    
    regex = start + start_time + run_count_4 + run_time + end
    print('正規式作成完了:{0}'.format(regex))
    
    return regex


# 正規表現で分析、取得
def do_findall(regex, content):
    
    # 処理便利のため、"を除く
    content = re.sub('"', '', content)
    
    # findall実施
    # ★★★★★これでログに必要な情報を取りました！！！★★★★★
    results = re.findall(regex, content, re.S)
    
    if not results:
        print('結果件数：0')
        quit()
    
    print('走査完了、結果件数：{0}'.format(len(results)))
    
    return results


# 出力プールの作成
# 要件により、keyはファイル名として複数の出力ファイルの作成もできますが、今回は1ファイルに
def make_output_pool(results):

    output_pool = {}
    
    if results: #none以外の場合
       for result in results:
          # こちらで条件も指定できる
          # 10000件以上、または100000ms以上なら、統計になる
          # if int(result[2]) >= 10000 or int(result[5]) >= 100000:
             # 出力ファイル名
             key = 'output'
             # 出力ファイル名は、統計内容から取るのも可能
             #key = result[0] + '_' + result[1][:10]
             
             # 統計、加算
             tatol_count = 0
             for i in range(1,12):
                 tatol_count += int(result[i])
             
             # 1000件以下のデータが出力ほしいなら
             if tatol_count > 1000:
                 
                 # 出力の一行
                 line = result[0] + '\t'+ str(tatol_count) + '\t' + result[13] + '\n'
      
                 # 複数出力ファイルも対応
                 if key not in output_pool:
                    # タイトルの設定
                    output_pool[key] = '処理開始時間\t処理トータル件数\t処理時間(ms)\n'
                    # outputプールに格納
                    output_pool[key] += line
                 else:
                    # outputプールに追加
                    output_pool[key] += line
                    
    return output_pool

    print('出力プール構造完了、ファイル個数：{0}',len(output_pool))
               
# 用意されたプールの出力                
def output_files_from_pool(output_pool):
    for k, v in output_pool.items():
      output_file_name = o_file_path + '\\' + k + '.txt'
      file_output = open(output_file_name,'w')
      file_output.write(v)
      file_output.close()
      
      current_dir = os.path.dirname(os.path.abspath(__file__))
      print('ファイル出力完了：{0}'.format(current_dir + '\\' + o_file_path))


#########################
#     main処理
#########################
if __name__ == '__main__':

    # 初期化
    initialize(sys.argv)

    # ログを取得
    content = read_log_files()
    
    # 正規式を生成
    regex = make_regex()

    # 正規表現で分析、取得
    results = do_findall(regex, content)

    # 結果を出力プールに格納、整理
    output_pool = make_output_pool(results)
      
    # ファイルを出力
    output_files_from_pool(output_pool)
    
    # 完了
    quit()

まとめ

引き続き更新

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up