Visualizing access logs with holoviews

Posted at 2018-09-05

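Goals

  • Visualize apache access logs.
  • See at a glance which URLs, servers, parameters, and so on are accessed most often.
  • See how the counts change over fixed time intervals.
  • Display the result in a jupyter notebook.
  • Ideally you would use a dedicated tool for this kind of thing; this approach is for when circumstances rule that out.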

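Preparing dummy data

  • All scripts below are run in the same jupyter notebook.

Generating the data

  • Adapted, with modifications, from an existing script.
  • Running it creates a dummy access log file.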
import time
import datetime
import random
import pytz 
timestr = time.strftime("%Y%m%d-%H%M%S")
jst = pytz.timezone('Asia/Tokyo')

f = open('access_log_'+timestr+'.log','w')

ips=["123.221.14.56","16.180.70.237","10.182.189.79","218.193.16.244","198.122.118.164","114.214.178.92","233.192.62.103","244.157.45.12","81.73.150.239","237.43.24.118"]
referers=["-","http://www.casualcyclist.com","http://bestcyclingreviews.com/top_online_shops","http://bleater.com","http://searchengine.com"]
resources=["/handle-bars","/stems","/wheelsets","/forks","/seatposts","/saddles","/shifters","/Store/cart.jsp?productID="]
useragents=["Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)","Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36","Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1","Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5355d Safari/8536.25","Mozilla/5.0 (Windows; U; Windows NT 6.1; rv:2.2) Gecko/20110201","Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 Firefox/31.0","Mozilla/5.0 (Windows; U; MSIE 9.0; WIndows NT 9.0; en-US))"]

otime = jst.localize(datetime.datetime(2013,10,10))


for i in range(0,500):
    # Advance the clock by a random 30 to 300 seconds between requests
    increment = datetime.timedelta(seconds=random.randint(30,300))
    otime += increment

    uri = random.choice(resources)
    # Append a random product ID to the cart URL
    if "Store" in uri:
        uri += str(random.randint(1000,1500))
    ip = random.choice(ips)
    useragent = random.choice(useragents)
    referer = random.choice(referers)
    # LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\""
    f.write('%s - - [%s] "GET %s HTTP/1.1" 200 %s "%s" "%s"\n' % (ip, otime.strftime('%d/%b/%Y:%H:%M:%S %z'), uri, random.randint(2000,5000), referer, useragent))
f.close()
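
To make the log format concrete, here is a minimal sketch that builds a single line the same way the loop does. It uses only the standard library: the fixed UTC+9 `datetime.timezone` is an assumption standing in for the pytz `Asia/Tokyo` zone, and the IP, URL, and byte count are picked by hand rather than at random.

```python
import datetime

# Assumption: a fixed UTC+9 offset stands in for pytz's Asia/Tokyo zone
jst = datetime.timezone(datetime.timedelta(hours=9))
otime = datetime.datetime(2013, 10, 10, tzinfo=jst)
otime += datetime.timedelta(seconds=90)

# Same timestamp layout as the generator: day/month/year:time offset
stamp = otime.strftime('%d/%b/%Y:%H:%M:%S %z')

# Same line template as the generator, with hand-picked values
line = '%s - - [%s] "GET %s HTTP/1.1" 200 %s "%s" "%s"\n' % (
    "123.221.14.56", stamp, "/stems", 2500, "-",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)")
print(line, end="")
```

The trailing offset in the timestamp (`+0900`) matters later, because the parsing step has to deal with it before converting to datetimes.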


Loading the data

Getting the files

import glob 

file_list = glob.glob('access_log*log')
file_list

Parsing the access log and building a DataFrame

import pandas as pd 
import re 

col_names = ['remote_host',
             'remotelog_name', 
             'remote_user', 
             'timestamp', 
             'request_url', 
             'status', 
             'response_bytes', 
             'referer', 
             'user_agent',
             'method',
             'url_full', 
             'url',
             'query', 
             'prot']

regexp = r'(.*?) (.*?) (.*?) \[(.*?)\] "(.*?)" (.*?) (.*?) "(.*?)" "(.*?)"$'
df_list = []

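for file_name in file_list:
    line_list = []

    with open(file_name, 'r') as f:
        for line in f:
            m = re.match(regexp, line)
            if m:
                tmp_list = list(m.groups())
                if tmp_list[4] != "-":
                    # Split the request line, e.g. "GET /path?query HTTP/1.1"
                    method, url_full, prot = tmp_list[4].split(" ")

                    # Separate the path from the query string; pad with " " when there is no query
                    url, query = (url_full.split('?') + [" ", " "])[0:2]

                    tmp_list.extend([method, url_full, url, query, prot])
                else:
                    # No request line: fill the derived columns with empty strings
                    tmp_list.extend(["", "", "", "", ""])

                line_list.append(tmp_list)
            else:
                # Print lines that do not match the expected log format
                print(line)
    df_list.append(pd.DataFrame(line_list, columns=col_names))
df = pd.concat(df_list)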
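
To see what the regular expression and the two split steps produce, the following sketch runs them on a single line. The sample line is hypothetical, written by hand in the format the generator script produces; the pattern and the split expressions are copied from the parsing code.

```python
import re

regexp = r'(.*?) (.*?) (.*?) \[(.*?)\] "(.*?)" (.*?) (.*?) "(.*?)" "(.*?)"$'

# Hypothetical line in the format written by the generator script
sample = ('123.221.14.56 - - [10/Oct/2013:00:01:30 +0900] '
          '"GET /Store/cart.jsp?productID=1234 HTTP/1.1" 200 3456 '
          '"-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"\n')

# Nine capture groups: host, logname, user, timestamp, request, status, bytes, referer, user agent
fields = list(re.match(regexp, sample).groups())

# The request line splits into method, full URL, and protocol
method, url_full, prot = fields[4].split(" ")

# The full URL splits into path and query string
url, query = (url_full.split('?') + [" ", " "])[0:2]
print(fields[0], fields[3], method, url, query, prot)

# A URL without a query string falls back to the " " padding
plain_url, plain_query = ("/stems".split('?') + [" ", " "])[0:2]
print(repr(plain_url), repr(plain_query))
```

Note that the three-way unpacking of the request line assumes exactly two spaces in it; a malformed request would raise a ValueError, which the dummy data never triggers.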

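Processing the data

  • Reshape the data so that it is easy to aggregate.

Creating a datetime index for resampling

  • A DatetimeIndex is more convenient in many ways; resample, for one, works on it directly.
# Drop the fixed " +0900" offset (literal match, not a regex) and parse the rest
df['datetime'] = pd.to_datetime(df['timestamp'].str.replace(" +0900", "", regex=False), format="%d/%b/%Y:%H:%M:%S")
df = df.set_index('datetime')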
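
As a small illustration of the timestamp handling, the sketch below parses two hand-written timestamps in two ways. The first mirrors the approach above (strip the offset, parse naive datetimes). The second is an alternative not used in this article, and is shown only as an assumption about what you might prefer: parsing the offset with `%z`, which yields timezone-aware values.

```python
import pandas as pd

# Hypothetical timestamps in the log's format
stamps = pd.Series(["10/Oct/2013:00:01:30 +0900", "10/Oct/2013:01:15:00 +0900"])

# Approach used above: drop the literal offset, then parse as naive datetimes
naive = pd.to_datetime(stamps.str.replace(" +0900", "", regex=False),
                       format="%d/%b/%Y:%H:%M:%S")

# Alternative (not used in the article): parse the offset too, giving tz-aware values
aware = pd.to_datetime(stamps, format="%d/%b/%Y:%H:%M:%S %z")

print(naive)
print(aware)
```

The naive version is enough here because every line carries the same offset; the aware version would matter only if logs from several time zones were mixed.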

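Creating an extension column

  • Useful during analysis, for example to narrow the data down to jsp requests. It is created here only for illustration; with this dummy data it is not used afterwards.
df['ext'] = df['url'].str.extract(r".+\.([a-zA-Z]+).*", expand=False)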
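
To check what the extension pattern captures, here is a standard-library sketch that applies the same regular expression to two of the dummy URLs. The helper `extension` is a hypothetical name introduced for illustration; it uses `re.search`, on the assumption that this matches how `str.extract` picks the first match.

```python
import re

# Same pattern as the one passed to str.extract
ext_pattern = re.compile(r".+\.([a-zA-Z]+).*")

def extension(url):
    """Return the extension captured by the pattern, or None when there is none."""
    m = ext_pattern.search(url)
    return m.group(1) if m else None

for url in ["/Store/cart.jsp", "/stems"]:
    print(url, "->", extension(url))
```

URLs without a dot produce no match, so the corresponding rows of the `ext` column end up as missing values.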

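Counting requests

  • Count the number of requests per hour for each url.
df['count'] = 1
# Sum only the count column; the remaining columns hold strings
df_rs = df.groupby("url")['count'].resample('h').sum()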
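
The per-url hourly count can be tried on a tiny frame first. This is a minimal sketch with hypothetical data (three hits on /stems and one on /forks); the names `toy` and `hourly` are introduced only for this example.

```python
import pandas as pd

# Hypothetical mini log with a datetime index, one row per request
idx = pd.DatetimeIndex(
    ["2013-10-10 00:05:00", "2013-10-10 00:20:00",
     "2013-10-10 00:40:00", "2013-10-10 01:10:00"],
    name="datetime",
)
toy = pd.DataFrame(
    {"url": ["/stems", "/forks", "/stems", "/stems"], "count": 1},
    index=idx,
)

# Group by url, bin each group into hours, and sum the counts
hourly = toy.groupby("url")["count"].resample("h").sum().reset_index()
print(hourly)
```

Each url is resampled over its own time range only, so /forks gets a single hourly bin while /stems gets two.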

Plotting

Preparation

df_rs = df_rs.reset_index()

Creating the Dataset

import holoviews as hv 
import numpy as np 
hv.extension('bokeh')
ds = hv.Dataset(df_rs, vdims=['count'])

Creating the heatmap

%%opts HeatMap[ width=800 height=300 tools=['hover']]

hv.HeatMap(ds.aggregate(['datetime','url'], np.sum)).options(xrotation=90, cmap='Reds')

[Figure: bokeh_plot.png (the resulting heatmap)]
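
Each heatmap cell corresponds to the summed count for one (datetime, url) pair. As a rough cross-check that does not require rendering anything, the same grid can be sketched with a pandas pivot table. This is not how holoviews computes the plot internally, only an equivalent table; `toy_rs` is a hypothetical stand-in for `df_rs` after `reset_index()`.

```python
import pandas as pd

# Hypothetical stand-in for df_rs after reset_index()
toy_rs = pd.DataFrame({
    "url": ["/forks", "/stems", "/stems"],
    "datetime": pd.to_datetime(
        ["2013-10-10 00:00", "2013-10-10 00:00", "2013-10-10 01:00"]),
    "count": [1, 2, 1],
})

# Rows are urls, columns are hourly bins, cells are summed counts
grid = toy_rs.pivot_table(index="url", columns="datetime", values="count",
                          aggfunc="sum", fill_value=0)
print(grid)
```

Cells with no requests are filled with 0 here; in the heatmap such cells simply have no value.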
