
Reading Apache access logs with a Python module

Last updated at 2019-02-20. Posted at 2019-02-19.

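Summary

I looked for a Python module that can analyze logs and found the following blog post:
- "Apache logをpaserする" (parsing Apache logs) on the blog そこはかとなく書くよん
The post introduces two parsers: apache-log-parser and apachelog.
I tried both and decided to use apache-log-parser.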

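Where `apachelog` fell short

GitHub: apachelog
Importing it raised an error.
The exception syntax in its source does not seem to be compatible with Python 3:
- https://github.com/nickmoorman/apachelog/issues/1
A small patch to the source would probably make it usable, but the project does not appear to be maintained, so I am holding off for now.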

What was good about `apache-log-parser`

GitHub: apache-log-parser
The blog post above said there was no documentation, but the GitHub page now has a proper README.
Following the README worked as written, so I went with this one.

What I did

Install the module
Check the log format
Test whether a line can be parsed

Installing the module

$ pip3 install apache-log-parser

Checking the server's log format

Find the configuration file that defines the log format:

$ locate httpd.conf
/etc/httpd/conf/httpd.conf

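Open the file and search for a keyword such as logformat.
The access log was being written in the combined format.
For the meaning of each directive, see the official Apache documentation: mod_log_config, Custom Log Formats.
- \" is an escaped " (double quote)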

LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
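To make the directives concrete, here is a minimal sketch of how the combined format lines up with an actual log line, using only the standard library `re` module. This is not how apache-log-parser works internally. The pattern name `COMBINED_RE` and its group names are hypothetical, chosen for illustration. The pattern also assumes that quoted values contain no escaped quotes, which is a simplification.

```python
import re

# Hypothetical pattern that mirrors the combined LogFormat, directive by directive.
COMBINED_RE = re.compile(
    r'(?P<h>\S+) '              # %h   remote host
    r'(?P<l>\S+) '              # %l   remote logname
    r'(?P<u>\S+) '              # %u   remote user
    r'\[(?P<t>[^\]]+)\] '       # %t   time the request was received
    r'"(?P<r>[^"]*)" '          # %r   first line of the request (quoted)
    r'(?P<s>\d{3}) '            # %>s  final status code
    r'(?P<b>\S+) '              # %b   response size in bytes ("-" if none)
    r'"(?P<referer>[^"]*)" '    # %{Referer}i     (quoted)
    r'"(?P<user_agent>[^"]*)"'  # %{User-Agent}i  (quoted)
)

line = ('xxx.xxx.xxx.xxx - - [18/Feb/2019:23:58:36 +0900] '
        '"GET /ja/index.html HTTP/1.1" 301 240 "-" '
        '"Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)"')

match = COMBINED_RE.match(line)
fields = match.groupdict() if match else {}
print(fields)
```

Each named group corresponds to one directive in the LogFormat line, so a line that fails to match usually means the format string and the real log disagree somewhere.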

Testing whether a single line can be parsed

Follow the apache-log-parser README.
Change the argument of apache_log_parser.make_parser to the log format you actually use.
Paste one real line from your log into test_data.

apache_log_parser

import apache_log_parser
from pprint import pprint

line_parser = apache_log_parser.make_parser("%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"")
test_data = 'xxx.xxx.xxx.xxx - - [18/Feb/2019:23:58:36 +0900] "GET /ja/index.html HTTP/1.1" 301 240 "-" "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)"'
log_line_data = line_parser(test_data)
pprint(log_line_data)

pprint output

{'remote_host': 'xxx.xxx.xxx.xxx',
 'remote_logname': '-',
 'remote_user': '-',
 'request_first_line': 'GET /ja/index.html HTTP/1.1',
 'request_header_referer': '-',
 'request_header_user_agent': 'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT '
                              '6.1; Trident/6.0)',
 'request_header_user_agent__browser__family': 'IE',
 'request_header_user_agent__browser__version_string': '10.0',
 'request_header_user_agent__is_mobile': False,
 'request_header_user_agent__os__family': 'Windows',
 'request_header_user_agent__os__version_string': '7',
 'request_http_ver': '1.1',
 'request_method': 'GET',
 'request_url': '/ja/index.html',
 'request_url_fragment': '',
 'request_url_hostname': None,
 'request_url_netloc': '',
 'request_url_password': None,
 'request_url_path': '/ja/index.html',
 'request_url_port': None,
 'request_url_query': '',
 'request_url_query_dict': {},
 'request_url_query_list': [],
 'request_url_query_simple_dict': {},
 'request_url_scheme': '',
 'request_url_username': None,
 'response_bytes_clf': '240',
 'status': '301',
 'time_received': '[18/Feb/2019:23:58:36 +0900]',
 'time_received_datetimeobj': datetime.datetime(2019, 2, 18, 23, 58, 36),
 'time_received_isoformat': '2019-02-18T23:58:36',
 'time_received_tz_datetimeobj': datetime.datetime(2019, 2, 18, 23, 58, 36, tzinfo='0900'),
 'time_received_tz_isoformat': '2019-02-18T23:58:36+09:00',
 'time_received_utc_datetimeobj': datetime.datetime(2019, 2, 18, 14, 58, 36, tzinfo='0000'),
 'time_received_utc_isoformat': '2019-02-18T14:58:36+00:00'}

The parsed result is a dict.
The keys have descriptive names (easier to read than apachelog's).
The timestamp is provided in several converted forms (datetime objects and ISO-format strings).
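As a rough illustration of where those timestamp variants come from, the raw `time_received` string can be converted with the standard library alone. This is a sketch, not the library's own code. The variable names are mine, and it assumes an English (C) locale so that `%b` matches month names such as `Feb`.

```python
from datetime import datetime, timezone

raw = '[18/Feb/2019:23:58:36 +0900]'  # value of time_received in the output above

# Strip the brackets, then parse day/month name/year:time and the UTC offset.
local_dt = datetime.strptime(raw.strip('[]'), '%d/%b/%Y:%H:%M:%S %z')
utc_dt = local_dt.astimezone(timezone.utc)   # same instant, expressed in UTC
naive_dt = local_dt.replace(tzinfo=None)     # local wall-clock time without offset

print(local_dt.isoformat())  # 2019-02-18T23:58:36+09:00
print(utc_dt.isoformat())    # 2019-02-18T14:58:36+00:00
print(naive_dt.isoformat())  # 2019-02-18T23:58:36
```

The three strings line up with `time_received_tz_isoformat`, `time_received_utc_isoformat`, and `time_received_isoformat` in the pprint output.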

Mapping between format directives and keys

Format	Key
`%h`	`remote_host`
`%l`	`remote_logname`
`%u`	`remote_user`
`%t`	`time_received`
`%r`	`request_first_line`
`%>s`	`status`
`%b`	`response_bytes_clf`
`%{Referer}i`	`request_header_referer`
`%{User-Agent}i`	`request_header_user_agent`

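Parsing a whole file

I wrote a function, read_apache_log, that reads a log from a file.
Arguments: the file name (ifn) and the log format (logformat).
Successfully parsed values (dicts) are appended to the list P.
Lines that raise ValueError are appended to the list E.
A summary of the parse results is printed.
- The idea is that if failures are few compared with successes, the access analysis results should not be affected much.
Return value: the list P of successfully parsed values.

Loading the modules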

import apache_log_parser
from pprint import pprint

The log-reading function

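def read_apache_log(ifn, logformat='%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"'):
    """Parse every line of the log file ifn and return the list of parsed dicts."""
    parser = apache_log_parser.make_parser(logformat)
    P = []  # successfully parsed lines (dicts)
    E = []  # raw lines that raised ValueError
    with open(ifn) as f:
        for line in f:
            try:
                parsed_line = parser(line)
                P.append(parsed_line)
            except ValueError:
                # Only ValueError is caught here; any other exception propagates.
                E.append(line)

    # print (not pprint) so the summary strings are shown without quotes
    print('=== Read Summary ===')
    print('Parsed     : {0}'.format(len(P)))
    print('ValueError : {0}'.format(len(E)))
    print('====================')

    return P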

Result of running it

ifn = 'access_log'  # a log file with 100 lines
read_apache_log(ifn)

# === Read Summary ===
# Parsed     : 100
# ValueError : 0
# ====================
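The "few failures compared with successes" idea can be turned into a number. Below is a minimal sketch that takes the parser as an argument so it runs without the real library. `parse_lines`, `failure_rate`, `toy_parser`, and the sample lines are all hypothetical stand-ins, not part of apache-log-parser.

```python
def parse_lines(lines, parser, errors=(ValueError,)):
    """Split lines into parsed results and raw lines that failed to parse."""
    parsed, failed = [], []
    for line in lines:
        try:
            parsed.append(parser(line))
        except errors:
            failed.append(line)
    return parsed, failed


def failure_rate(parsed, failed):
    """Return the fraction of lines that failed (0.0 when there were no lines)."""
    total = len(parsed) + len(failed)
    return len(failed) / total if total else 0.0


def toy_parser(line):
    """Stand-in for a real parser: accepts only lines with exactly three fields."""
    host, status, size = line.split()  # raises ValueError on a wrong field count
    return {'remote_host': host, 'status': status, 'response_bytes_clf': size}


sample = ['192.0.2.1 200 512', 'broken line', '192.0.2.2 404 -', '192.0.2.3 301 240']
parsed, failed = parse_lines(sample, toy_parser)
rate = failure_rate(parsed, failed)
print('Parsed: {0}, failed: {1}, failure rate: {2:.1%}'.format(len(parsed), len(failed), rate))
```

With the real library, the object returned by apache_log_parser.make_parser(logformat) would be passed as `parser`. Which exception types belong in `errors` is something to check against the installed version. Keeping the failed lines around also makes it easy to look at what did not parse.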

Saving to a CSV file

I used pandas to save the result as a CSV file.
The list of parsed log entries is passed to pandas.DataFrame.
pandas.DataFrame.to_csv writes it out.

import pandas as pd
ifn = 'access_log'
P = read_apache_log(ifn)
df = pd.DataFrame(P)
df.to_csv('output.csv')
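If pandas is not available, the same list of dicts can probably be written with the standard library `csv` module instead. This is an alternative sketch, not what I actually ran. `write_csv` and the sample records are hypothetical, and it assumes the list is non-empty and that every dict has the same keys.

```python
import csv
import io

# Hypothetical records shaped like a small subset of the parser output.
records = [
    {'remote_host': '192.0.2.1', 'status': '301', 'request_url': '/ja/index.html'},
    {'remote_host': '192.0.2.2', 'status': '200', 'request_url': '/ja/'},
]


def write_csv(rows, fileobj):
    """Write a list of dicts as CSV, using the keys of the first row as the header."""
    fieldnames = list(rows[0].keys())
    writer = csv.DictWriter(fileobj, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


buffer = io.StringIO()  # for a real file: open('output.csv', 'w', newline='')
write_csv(records, buffer)
print(buffer.getvalue())
```

On the pandas side, note that to_csv writes the DataFrame index as the first column by default. Passing index=False leaves it out.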

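What I want to do next

Now that I know Apache logs can be parsed easily, I want to try access analysis and aggregation.
Collecting request_header_user_agent__is_mobile would show how much of the access comes from mobile devices.
Sorting the entries for each remote_host chronologically (by time_received or a similar key) might make it possible to reconstruct tracking information?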
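Both ideas can be sketched with the standard library. The records below are hypothetical sample data that only borrow the key names from the parser output. With real data, the list returned by read_apache_log would take their place.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical records using the same key names as the parser output.
records = [
    {'remote_host': '192.0.2.1', 'request_url': '/ja/news.html',
     'request_header_user_agent__is_mobile': False,
     'time_received_datetimeobj': datetime(2019, 2, 18, 23, 59, 10)},
    {'remote_host': '192.0.2.2', 'request_url': '/ja/index.html',
     'request_header_user_agent__is_mobile': True,
     'time_received_datetimeobj': datetime(2019, 2, 18, 23, 58, 40)},
    {'remote_host': '192.0.2.1', 'request_url': '/ja/index.html',
     'request_header_user_agent__is_mobile': False,
     'time_received_datetimeobj': datetime(2019, 2, 18, 23, 58, 36)},
]

# Count mobile versus non-mobile requests.
mobile_counts = Counter(r['request_header_user_agent__is_mobile'] for r in records)

# Group requests by host, then order each host's requests by time received.
by_host = defaultdict(list)
for r in records:
    by_host[r['remote_host']].append(r)
paths = {
    host: [r['request_url']
           for r in sorted(rows, key=lambda r: r['time_received_datetimeobj'])]
    for host, rows in by_host.items()
}

print(mobile_counts)
print(paths)
```

One caveat: several clients can share one IP address, so a per-host sequence like this is only a rough approximation of an individual visitor's path.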
