
Checking the distribution of processing times with SciPy

Posted at 2021-03-20, last updated at 2021-03-20

Introduction

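When you measure processing time, the values turn out to vary from run to run.
Unlike a normal distribution, the measurements include a fair number of long values, and at first glance they appear to follow an exponential distribution.
In this article, I measure ping response times to see how processing times (response times) are actually distributed.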

Collecting ping data

The first step is to collect ping response times. I pinged Yahoo! Japan.

ping www.yahoo.co.jp

The collected data looks like the following. Only the first lines are shown here; there are 2485 lines in total.

64 bytes from 182.22.25.124: icmp_seq=0 ttl=55 time=3.673 ms
64 bytes from 182.22.25.124: icmp_seq=1 ttl=55 time=4.663 ms
64 bytes from 182.22.25.124: icmp_seq=2 ttl=55 time=3.395 ms
64 bytes from 182.22.25.124: icmp_seq=3 ttl=55 time=15.306 ms
64 bytes from 182.22.25.124: icmp_seq=4 ttl=55 time=12.701 ms
64 bytes from 182.22.25.124: icmp_seq=5 ttl=55 time=4.376 ms
64 bytes from 182.22.25.124: icmp_seq=6 ttl=55 time=3.259 ms
64 bytes from 182.22.25.124: icmp_seq=7 ttl=55 time=5.859 ms
64 bytes from 182.22.25.124: icmp_seq=8 ttl=55 time=4.249 ms
64 bytes from 182.22.25.124: icmp_seq=9 ttl=55 time=3.378 ms

Next, extract only the response times from this output.

import re
import numpy as np

# Capture the number between "time=" and " ms"
pattern = re.compile(r"time=([\d.]+) ms")

times = []
with open("ping.log") as f:
    for line in f:
        match = pattern.search(line)
        if match:  # skip lines that carry no response time (headers, timeouts)
            times.append(float(match.group(1)))
times = np.array(times)
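
As a quick check of the extraction step, the sketch below applies the same regular expression to the ten sample lines shown above instead of reading `ping.log`. The names `sample_log`, `mean_ms` and `median_ms` are introduced here for illustration only.

```python
import re
import numpy as np

# The ten sample lines quoted in this article (not the full 2485-line log)
sample_log = """\
64 bytes from 182.22.25.124: icmp_seq=0 ttl=55 time=3.673 ms
64 bytes from 182.22.25.124: icmp_seq=1 ttl=55 time=4.663 ms
64 bytes from 182.22.25.124: icmp_seq=2 ttl=55 time=3.395 ms
64 bytes from 182.22.25.124: icmp_seq=3 ttl=55 time=15.306 ms
64 bytes from 182.22.25.124: icmp_seq=4 ttl=55 time=12.701 ms
64 bytes from 182.22.25.124: icmp_seq=5 ttl=55 time=4.376 ms
64 bytes from 182.22.25.124: icmp_seq=6 ttl=55 time=3.259 ms
64 bytes from 182.22.25.124: icmp_seq=7 ttl=55 time=5.859 ms
64 bytes from 182.22.25.124: icmp_seq=8 ttl=55 time=4.249 ms
64 bytes from 182.22.25.124: icmp_seq=9 ttl=55 time=3.378 ms
"""

pattern = re.compile(r"time=([\d.]+) ms")

# finditer walks over every match in the text, one per ping reply
times = np.array([float(m.group(1)) for m in pattern.finditer(sample_log)])

mean_ms = times.mean()
median_ms = np.median(times)
print(f"n={len(times)}, mean={mean_ms:.3f} ms, median={median_ms:.3f} ms")
```

Even in this small excerpt the mean is larger than the median, which is what a few long response times on the right-hand side of the distribution would produce.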

Let's plot a histogram of the processing times.

import matplotlib.pyplot as plt

plt.hist(times, bins=100)
plt.show()

It looks like an exponential distribution.

Fitting an exponential distribution

Now let's use SciPy to fit an exponential distribution to the data.

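import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# `times` is the array computed above
loc, scale = stats.expon.fit(times)
# A fine grid over the data range gives a smooth PDF curve
x = np.linspace(0, times.max(), 500)
y = stats.expon.pdf(x, loc=loc, scale=scale)
# density=True scales the histogram so that its total area is 1
plt.hist(times, bins=100, density=True)
plt.plot(x, y)
plt.show()

Passing density=True normalises the histogram so that its area is 1. This puts it on the same scale as the probability density function, so the two can be compared directly.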
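
When overlaying a PDF on a histogram, the kind of normalisation matters. Weighting every sample by 1/N makes the bar heights sum to 1, whereas `density=True` makes the area under the bars equal to 1. The sketch below shows the difference with `np.histogram`, which takes the same `weights` and `density` arguments as `plt.hist`. The data is synthetic and the seed and scale are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for response times; not the measured ping data
sample = rng.exponential(scale=5.0, size=1000)
n = len(sample)

# Option 1: weight every sample by 1/N
heights_w, edges = np.histogram(sample, bins=100, weights=np.ones(n) / n)
# Option 2: let NumPy normalise to a probability density
heights_d, _ = np.histogram(sample, bins=100, density=True)

widths = np.diff(edges)
sum_w = heights_w.sum()               # sum of bar heights with 1/N weights
area_w = (heights_w * widths).sum()   # area with 1/N weights
area_d = (heights_d * widths).sum()   # area with density=True

print(f"1/N weights:  sum of heights={sum_w:.3f}, area={area_w:.3f}")
print(f"density=True: area={area_d:.3f}")
```

With 1/N weights the area equals the bin width, so the histogram only lines up with a PDF when the bins happen to be about one unit wide.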

Hmm? The exponential distribution does not line up with the data.

Fitting a log-normal distribution

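Another distribution with a long tail is the log-normal distribution.
It is often cited as a model for quantities such as height and weight, the populations of municipalities, and personal income.
Let's try fitting a log-normal distribution.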
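
Before running the fit, it may help to recall how `scipy.stats.lognorm` is parameterised. When `loc` is 0, the shape parameter corresponds to the standard deviation of log(x), and `scale` corresponds to exp of the mean of log(x). The sketch below checks this on synthetic data. The values of `mu` and `sigma`, the seed, and the use of `floc=0` are assumptions made for illustration and are not part of the ping measurement.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.5  # illustrative parameters of the underlying normal
sample = rng.lognormal(mean=mu, sigma=sigma, size=5000)

# Fix loc at 0 so that shape and scale map directly onto log(x)
shape, loc, scale = stats.lognorm.fit(sample, floc=0)

log_sample = np.log(sample)
print(f"shape      = {shape:.4f}, std(log x)  = {log_sample.std():.4f}")
print(f"log(scale) = {np.log(scale):.4f}, mean(log x) = {log_sample.mean():.4f}")
```

If `loc` is left free, as in the ping fit in this article, `shape` and `scale` describe the data after the fitted `loc` has been subtracted.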

shape, loc, scale = stats.lognorm.fit(times)
# A fine grid over the data range gives a smooth PDF curve
x = np.linspace(0, times.max(), 500)
y = stats.lognorm.pdf(x, shape, loc=loc, scale=scale)
# density=True scales the histogram so that its total area is 1
plt.hist(times, bins=100, density=True)
plt.plot(x, y)
plt.show()

This time the fit looks good.
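
Judging a fit by eye can be backed up with numbers. One possible approach, sketched below, compares the two candidate distributions by log-likelihood, AIC, and the Kolmogorov-Smirnov distance. The data here is synthetic (a shifted log-normal with arbitrary parameters) because the original `ping.log` is not included in this article, so the printed values say nothing about the real measurement.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-in for ping times: a 3 ms floor plus log-normal jitter
sample = 3.0 + rng.lognormal(mean=0.5, sigma=0.5, size=2000)

# Maximum-likelihood fits of both candidate distributions
e_loc, e_scale = stats.expon.fit(sample)
l_shape, l_loc, l_scale = stats.lognorm.fit(sample)

# Log-likelihood of the sample under each fitted model (higher is better)
ll_expon = stats.expon.logpdf(sample, loc=e_loc, scale=e_scale).sum()
ll_lognorm = stats.lognorm.logpdf(sample, l_shape, loc=l_loc, scale=l_scale).sum()

# AIC = 2k - 2*logL penalises the extra parameter of lognorm (lower is better)
aic_expon = 2 * 2 - 2 * ll_expon
aic_lognorm = 2 * 3 - 2 * ll_lognorm

# KS statistic: largest gap between the empirical and fitted CDFs
ks_expon = stats.kstest(sample, "expon", args=(e_loc, e_scale)).statistic
ks_lognorm = stats.kstest(sample, "lognorm", args=(l_shape, l_loc, l_scale)).statistic

print(f"expon:   logL={ll_expon:.1f}, AIC={aic_expon:.1f}, KS={ks_expon:.3f}")
print(f"lognorm: logL={ll_lognorm:.1f}, AIC={aic_lognorm:.1f}, KS={ks_lognorm:.3f}")
```

Because the parameters are estimated from the same sample, the usual KS p-value does not apply, so only the statistic is used here as a descriptive distance. To apply this to the measured data, replace `sample` with the `times` array.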

Conclusion

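Until now I had assumed that processing times follow an exponential distribution, but in this measurement they turned out to fit a log-normal distribution better. This result comes from ping response times, so other kinds of processing may give a different result. In any case, I now know how to fit distributions with SciPy, and I intend to make use of it from here on.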
