Visualizing access logs with holoviews

Last updated at 2018-09-05. Posted at 2018-09-05.

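Goals

Visualize Apache access logs.
Make frequently accessed URLs, servers, parameters and so on visible at a glance.
See how access changes over fixed time intervals.
Display the result in a Jupyter notebook.
Strictly speaking, a dedicated tool should be used for this, but this approach is for cases where circumstances do not allow it.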

Preparing dummy data

All scripts below are run in the same Jupyter notebook.

Downloading the data

Borrowed from an existing script and modified.
Running it creates a dummy access log file.

import time
import datetime
import random
import pytz 
timestr = time.strftime("%Y%m%d-%H%M%S")
jst = pytz.timezone('Asia/Tokyo')

f = open('access_log_'+timestr+'.log','w')

ips=["123.221.14.56","16.180.70.237","10.182.189.79","218.193.16.244","198.122.118.164","114.214.178.92","233.192.62.103","244.157.45.12","81.73.150.239","237.43.24.118"]
referers=["-","http://www.casualcyclist.com","http://bestcyclingreviews.com/top_online_shops","http://bleater.com","http://searchengine.com"]
resources=["/handle-bars","/stems","/wheelsets","/forks","/seatposts","/saddles","/shifters","/Store/cart.jsp?productID="]
useragents=["Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)","Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36","Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1","Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5355d Safari/8536.25","Mozilla/5.0 (Windows; U; Windows NT 6.1; rv:2.2) Gecko/20110201","Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 Firefox/31.0","Mozilla/5.0 (Windows; U; MSIE 9.0; WIndows NT 9.0; en-US))"]

otime = jst.localize(datetime.datetime(2013,10,10))


for i in range(0,500):
    increment = datetime.timedelta(seconds=random.randint(30,300))

    otime += increment
    uri = random.choice(resources)
    # Only the cart URL takes a product ID, so append a random one to it.
    if "Store" in uri:
        uri += str(random.randint(1000,1500))
    ip = random.choice(ips)
    useragent = random.choice(useragents)
    referer = random.choice(referers)
    # LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\""
    f.write('%s - - [%s] "GET %s HTTP/1.1" 200 %s "%s" "%s"\n' % (ip, otime.strftime('%d/%b/%Y:%H:%M:%S %z'), uri, random.randint(2000,5000), referer, useragent))
f.close()

Loading the data

Getting the files

import glob 

file_list = glob.glob('access_log*log')
file_list

Parsing the access log and building the DataFrame

import pandas as pd 
import re 

col_names = ['remote_host',
             'remotelog_name', 
             'remote_user', 
             'timestamp', 
             'request_url', 
             'status', 
             'response_bytes', 
             'referer', 
             'user_agent',
             'method',
             'url_full', 
             'url',
             'query', 
             'prot']

regexp = r'(.*?) (.*?) (.*?) \[(.*?)\] "(.*?)" (.*?) (.*?) "(.*?)" "(.*?)"$'
df_list = []

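for file_name in file_list:
    line_list = []

    with open(file_name, 'r') as f:
        for line in f:
            m = re.match(regexp, line)
            if m:
                tmp_list = list(m.groups())
                if tmp_list[4] != "-":
                    # Split the request line "METHOD URL PROTOCOL" into its parts.
                    method, url_full, prot = tmp_list[4].split(" ")
                    # Separate the path from the query string; pad so a URL without "?" still yields two values.
                    url, query = (url_full.split('?') + [" ", " "])[0:2]
                    tmp_list.extend([method, url_full, url, query, prot])
                else:
                    # No request line was logged, so leave the derived columns empty.
                    tmp_list.extend(["", "", "", "", ""])
                line_list.append(tmp_list)
            else:
                # Show lines that do not match the expected format.
                print(line)
    df_list.append(pd.DataFrame(line_list, columns=col_names))
df = pd.concat(df_list)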
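To see what the regular expression captures, here is a minimal sketch that applies the same pattern to a single line using only the standard library. The sample line is hypothetical (written in the shape the dummy generator produces), and `str.partition` is used here as an alternative way to split off the query string; it is not what the article's loop does.

```python
import re

# Same pattern as in the article: the nine fields of the log format.
LOG_PATTERN = re.compile(
    r'(.*?) (.*?) (.*?) \[(.*?)\] "(.*?)" (.*?) (.*?) "(.*?)" "(.*?)"$'
)

# Hypothetical sample line, shaped like the output of the dummy generator.
sample = (
    '10.182.189.79 - - [10/Oct/2013:00:03:12 +0900] '
    '"GET /Store/cart.jsp?productID=1234 HTTP/1.1" 200 3021 '
    '"http://bleater.com" '
    '"Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 Firefox/31.0"\n'
)

m = LOG_PATTERN.match(sample)
host, logname, user, timestamp, request, status, size, referer, agent = m.groups()

# The request field holds "METHOD URL PROTOCOL".
method, url_full, prot = request.split(" ")

# partition() returns an empty query when the URL has no "?".
url, _, query = url_full.partition("?")

print(host, timestamp, method, url, query, status)
```

Note that every captured value is a string, so `status` and `size` would need converting before any numeric aggregation.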

Processing the data

Reshape the data so that it is easy to aggregate.

Creating a datetime index for resample

A DatetimeIndex is more convenient for most of what follows.

# regex=False treats " +0900" as a literal string, so the "+" needs no escaping.
df['datetime'] = pd.to_datetime(df['timestamp'].str.replace(" +0900", "", regex=False), format="%d/%b/%Y:%H:%M:%S")
df = df.set_index('datetime')
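The timestamp handling can be sketched on a single value with the standard library. The first variant drops the fixed `+0900` offset as the article does; the second keeps it by parsing with `%z`, which is an alternative and not the article's method. The sample value is hypothetical, and `%b` assumes an English locale for the month name.

```python
from datetime import datetime, timedelta

raw = "10/Oct/2013:00:03:12 +0900"  # hypothetical value from the timestamp column

# Variant 1: strip the fixed offset and parse a naive datetime.
naive = datetime.strptime(raw.replace(" +0900", ""), "%d/%b/%Y:%H:%M:%S")

# Variant 2: parse the offset with %z to get a timezone-aware datetime.
aware = datetime.strptime(raw, "%d/%b/%Y:%H:%M:%S %z")

print(naive)
print(aware.utcoffset())
```

Stripping the offset is only safe when every line carries the same offset, as in the dummy data; logs spanning several time zones would need the second variant.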

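Creating a file extension column

This is useful during analysis, for example to narrow the data down to jsp requests only. Because this walkthrough uses dummy data, the column is created but not used.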

# expand=False returns a Series, which can be assigned directly to one column.
df['ext'] = df['url'].str.extract(r".+\.([a-zA-Z]+).*", expand=False)
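As a rough illustration of what the extension pattern picks out, the sketch below applies it to a few paths with the standard `re` module. The helper name `extension` and the sample paths are hypothetical, and `re.match` is only approximately equivalent to how pandas applies the pattern.

```python
import re

EXT_PATTERN = re.compile(r".+\.([a-zA-Z]+).*")

def extension(url):
    """Return the alphabetic run after the last dot, or None if there is no dot."""
    m = EXT_PATTERN.match(url)
    return m.group(1) if m else None

for url in ["/Store/cart.jsp", "/stems", "/static/app.min.js"]:
    print(url, "->", extension(url))
```

Because the leading `.+` is greedy, a path with several dots yields the part after the last one, and paths without a dot produce no extension at all.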

Counting requests

Count the number of requests per hour for each url.

df['count'] = 1
# Sum only the count column; summing the string columns is not meaningful.
df_rs = df.groupby("url")['count'].resample('h').sum()
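A small sketch of the hourly aggregation on a hypothetical four-row log may make the result easier to picture. It assumes pandas is installed and selects only the `count` column before resampling.

```python
import pandas as pd

# Hypothetical mini log: three hits on /stems and one on /forks.
sample = pd.DataFrame(
    {"url": ["/stems", "/stems", "/forks", "/stems"], "count": 1},
    index=pd.to_datetime(
        [
            "2013-10-10 00:05",
            "2013-10-10 00:40",
            "2013-10-10 01:10",
            "2013-10-10 02:20",
        ]
    ),
)
sample.index.name = "datetime"

# One row per (url, hour); the value is the number of hits in that hour.
hourly = sample.groupby("url")["count"].resample("h").sum()
print(hourly)
```

Hours with no hits that fall between a url's first and last request appear with a count of 0, so each url gets a continuous hourly series within its own time range.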

Plotting

Preparation

df_rs = df_rs.reset_index()

Creating the dataset

import holoviews as hv 
import numpy as np 
hv.extension('bokeh')
ds = hv.Dataset(df_rs, vdims=['count'])

Creating the heatmap

%%opts HeatMap[ width=800 height=300 tools=['hover']]

hv.HeatMap(ds.aggregate(['datetime','url'], np.sum)).options(xrotation=90, cmap='Reds')
