6
3

Delete article

Deleted articles cannot be recovered.

Draft of this article would be also deleted.

Are you sure you want to delete this article?

More than 5 years have passed since last update.

Apacheログ→Python(Pandas)→DBロード処理

6
Last updated at Posted at 2018-01-26

Apacheログってのは大量のデータ件数になるわけで
要件によってはOSファイルのままの扱いでは日が暮れるようなことになってしまいます。

んじゃ、DBに入れてしまいましょうということで
ApacheログをPythonで処理して直接PostgreSQLに格納します。

う〜ん、これ便利。

ApacheログのParseはこの辺を参考に。
http://graphistry.github.io/pygraphistry/html/Tutorial%20Part%202%20(Apache%20Logs).html
http://blog.mmast.net/read-apache-access-log-pandas

ちょこっとカスタムして書くとこんな感じ。

import pandas as pd
from urllib.parse import unquote 

# ファイルの場所
url = '/tmp/access.log'

# column定義
fields = ['host', 'identity', 'user', 'time_part1', 'time_part2', 'cmd_path_proto', 
          'http_code', 'response_bytes', 'referer', 'user_agent', 'unknown']

# データ読み込み
data = pd.read_csv(url, sep=' ', header=None, names=fields, na_values=['-'])

# 時刻列parse
time = data.time_part1 + data.time_part2
time_trimmed = time.map(lambda s: s.strip('[]').split('+')[0])
data['time'] = pd.to_datetime(time_trimmed, format='%d/%b/%Y:%H:%M:%S')

# コマンド列parse
data['command'], data['path'], data['protocol'] = zip(*data['cmd_path_proto'].str.split().tolist())
data['path'] = data['path'].map(lambda s: unquote(s))

# 不要列削除
data1 = data.drop(['time_part1', 'time_part2', 'cmd_path_proto'], axis=1)

# 空データ削除
data2 = data1.dropna(axis=1, how='all')
# data frameまで完成
data2

	host	http_code	response_bytes	referer	user_agent	time	command	path	protocol
0	156.141.48.220	200	86	/item/software/2656	Mozilla/4.0 (compatible; MSIE 8.0; Windows NT ...	2018-01-25 17:33:51	GET	/category/games	HTTP/1.1
1	104.36.145.163	200	115	/search/?c=Cameras+Electronics	Mozilla/5.0 (Windows NT 6.0; rv:10.0.1) Gecko/...	2018-01-25 17:33:51	GET	/item/cameras/3857	HTTP/1.1
2	164.24.157.185	200	76	/item/electronics/1257	Mozilla/5.0 (Windows NT 6.0; rv:10.0.1) Gecko/...	2018-01-25 17:33:51	POST	/search/?c=Health	HTTP/1.1
3	76.195.39.107	200	40	/search/?c=Software	Mozilla/5.0 (compatible; Googlebot/2.1; +http:...	2018-01-25 17:33:51	GET	/item/software/1528	HTTP/1.1
4	224.138.43.127	200	63	/category/games	Mozilla/5.0 (compatible; MSIE 9.0; Windows NT ...	2018-01-25 17:33:51	GET	/category/software	HTTP/1.1
5	20.54.131.78	200	90	/search/?c=Networking+Books	Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.1...	2018-01-25 17:33:51	GET	/item/networking/3138	HTTP/1.1
6	92.213.33.71	200	128	/category/cameras	Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; r...	2018-01-25 17:33:51	GET	/item/sports/1164	HTTP/1.1
7	104.66.93.199	200	93	NaN	Mozilla/5.0 (Windows NT 6.0; rv:10.0.1) Gecko/...	2018-01-25 17:33:51	GET	/category/software	HTTP/1.1
8	20.36.49.62	200	62	NaN	Mozilla/5.0 (compatible; Googlebot/2.1; +http:...	2018-01-25 17:33:51	GET	/item/networking/1592	HTTP/1.1
9	84.198.67.176	200	72	/item/software/4560	Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...	2018-01-25 17:33:51	GET	/category/games	HTTP/1.1
10	224.180.221.171	200	56	/item/jewelry/3885	Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; r...	2018-01-25 17:33:51	GET	/category/cameras	HTTP/1.1
11	164.177.38.116	200	114	NaN	Mozilla/5.0 (compatible; Googlebot/2.1; +http:...	2018-01-25 17:33:51	GET	/category/sports?from=10	HTTP/1.1

データベースインスタンスの注意点はこの辺を。
https://www.marketechlabo.com/import-access-log-to-postgres/

やってみて気づいた点
1.テーブル名はすべてを大文字では作らない。一部大文字はOK。
全部大文字だとテーブルをダブルクオートで加工必要がある→いまいちすぎる仕様。

2.動作ログの表示
デフォルトだとログ表示しないのでオプションで設定する。動きが見えて楽しい。
engine = create_engine(url, echo=True)

3.psycopg2がないと怒られる
コード上ではつかっていないのだけどその奥で呼んでいる模様。
インストールしましょう。この辺のつながりはまた別途調査。

import pandas as pd
from sqlalchemy import create_engine
# import psycopg2
# url = 'postgresql://postgres@localhost:5432/web_analysys?charset=utf8'
url = 'postgresql://postgres@localhost:5432/web_analysys'
engine = create_engine(url, echo=True)

data2.to_sql('IHS_log', engine, if_exists='replace', index=False)

テーブル格納、きれいにできてます!

  host       | http_code | response_bytes |                                                                                             referer                                                                                              |                                                                                                 user_agent                                                                                                 |        time         | command |                path                | protocol 

-----------------+-----------+----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------+---------+------------------------------------+----------
156.141.48.220 | 200 | 86 | /item/software/2656 | Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; GTB7.2; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C) | 2018-01-25 17:33:51 | GET | /category/games | HTTP/1.1
104.36.145.163 | 200 | 115 | /search/?c=Cameras+Electronics | Mozilla/5.0 (Windows NT 6.0; rv:10.0.1) Gecko/20100101 Firefox/10.0.1 | 2018-01-25 17:33:51 | GET | /item/cameras/3857 | HTTP/1.1
164.24.157.185 | 200 | 76 | /item/electronics/1257 | Mozilla/5.0 (Windows NT 6.0; rv:10.0.1) Gecko/20100101 Firefox/10.0.1 | 2018-01-25 17:33:51 | POST | /search/?c=Health | HTTP/1.1
76.195.39.107 | 200 | 40 | /search/?c=Software | Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) | 2018-01-25 17:33:51 | GET | /item/software/1528 | HTTP/1.1

ログをみるとDataFrameから自動的にテーブルCREATEしてます。
ほう。。。


web_analysys=# \d IHS_log
                  Table "public.ihs_log" 
     Column     |            Type             | Modifiers 
----------------+-----------------------------+-----------
 host           | text                        | 
 http_code      | bigint                      | 
 response_bytes | bigint                      | 
 referer        | text                        | 
 user_agent     | text                        | 
 time           | timestamp without time zone | 
 command        | text                        | 
 path           | text                        | 
 protocol       | text                        | 


↓ログから拾うと。。。

CREATE TABLE "IHS_log" (
    host TEXT, 
    http_code BIGINT, 
    response_bytes BIGINT, 
    referer TEXT, 
    user_agent TEXT, 
    time TIMESTAMP WITHOUT TIME ZONE, 
    command TEXT, 
    path TEXT, 
    protocol TEXT
)
6
3
0

Register as a new user and use Qiita more conveniently

  1. You get articles that match your needs
  2. You can efficiently read back useful information
  3. You can use dark theme
What you can do with signing up
6
3

Delete article

Deleted articles cannot be recovered.

Draft of this article would be also deleted.

Are you sure you want to delete this article?