Time Series Clustering

Last updated at 2024-01-28, posted at 2023-12-18

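Time series clustering

Time series clustering analyzes data that changes over time and groups series with similar patterns or trends into the same cluster. This makes it possible to extract latent structure and patterns from the data.

In this article, I use the data from Kaggle's Predict Future Sales competition and cluster the monthly sales series of each shop_id.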
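To make "similar patterns" concrete: the clustering later in this article uses metric='euclidean', which compares two series point by point. Below is a minimal NumPy sketch of that distance. The three series and the helper name `euclidean` are made up for illustration and are not taken from the competition data.

```python
import numpy as np

# Three made-up monthly series (hypothetical values, not from the dataset)
months = np.arange(12)
rising_a = 10 + 2.0 * months   # steadily increasing
rising_b = 12 + 2.0 * months   # same trend, slightly higher level
falling = 32 - 2.0 * months    # steadily decreasing


def euclidean(x, y):
    """Point-by-point Euclidean distance between two equal-length series."""
    return float(np.sqrt(np.sum((x - y) ** 2)))


d_similar = euclidean(rising_a, rising_b)
d_different = euclidean(rising_a, falling)
print(d_similar, d_different)
```

The two rising series end up much closer to each other than either is to the falling one, which is the property a clustering algorithm relies on when it groups series together.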

Loading the libraries and data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
from tslearn.clustering import TimeSeriesKMeans
from tslearn.utils import to_time_series_dataset
from tslearn.preprocessing import TimeSeriesScalerMeanVariance

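# Font with Japanese glyphs (ships with macOS); change or drop this line on other systems
plt.rcParams['font.family'] = 'Hiragino Maru Gothic Pro'
warnings.filterwarnings('ignore')

# Daily sales records from the competition data
train = pd.read_csv('sales_train.csv')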

Visualizing the original series

Simple preprocessing and monthly aggregation

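# Keep only rows with a positive daily count (drops zero and negative counts)
train = train[train.item_cnt_day > 0]
print(train.shop_id.unique())

# Total items sold per shop per month (date_block_num is the month index)
grouped = train[['date_block_num', 'shop_id', 'item_cnt_day']].groupby(['date_block_num', 'shop_id']).sum().reset_index()

# One row per shop, one column per month; pandas 2.x requires keyword arguments here
grouped = grouped.pivot(index='shop_id', columns='date_block_num', values='item_cnt_day').fillna(0)
grouped.head()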
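To see what the aggregation and pivot steps do, here is a small sketch on a hypothetical five-row table. The column names match the real data, but the values are invented.

```python
import pandas as pd

# Hypothetical daily records for two shops over two months
toy = pd.DataFrame({
    'date_block_num': [0, 0, 0, 1, 1],
    'shop_id':        [5, 5, 7, 5, 5],
    'item_cnt_day':   [1.0, 2.0, 4.0, 3.0, 1.0],
})

# Monthly total per shop (long format)
monthly = toy.groupby(['date_block_num', 'shop_id'], as_index=False)['item_cnt_day'].sum()

# One row per shop, one column per month; a month with no sales becomes 0
wide = monthly.pivot(index='shop_id', columns='date_block_num',
                     values='item_cnt_day').fillna(0)
print(wide)
```

Shop 7 has no rows for month 1, so the pivot leaves a missing value there and fillna(0) turns it into a zero. Each row of the result is one shop's time series.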

Converting to a time series dataset and plotting the series

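# tslearn dataset of shape (n_shops, n_months, 1)
time_np = to_time_series_dataset(grouped.values)

fig, ax = plt.subplots(figsize=(20,8))

# Label each line with its actual shop_id taken from the pivot index
for shop_id, x in zip(grouped.index, time_np):
    ax.plot(x.ravel(), label='shop'+str(shop_id))

ax.legend(loc='upper left', bbox_to_anchor=(1.05, 1))
plt.show()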
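tslearn stores a dataset as a 3-D array of shape (number of series, series length, number of dimensions). The NumPy sketch below only mimics that layout with made-up numbers and does not call tslearn. It also shows why the plotting code further down calls ravel() on each series.

```python
import numpy as np

# Hypothetical wide table: 3 shops x 4 months
wide = np.array([
    [3.0, 4.0, 0.0, 2.0],
    [1.0, 0.0, 5.0, 6.0],
    [7.0, 8.0, 9.0, 0.0],
])

# Add a trailing axis for the single variable (the sales count)
dataset = wide[:, :, np.newaxis]
print(dataset.shape)  # (3, 4, 1)

# A single series has shape (4, 1); ravel() flattens it back to (4,)
first = dataset[0]
print(first.shape, first.ravel().shape)
```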

Finding the optimal number of clusters with the elbow method

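# Fit k-means for k = 1..11 and record the inertia of each fit
cluster_range = range(1, 12)
inertia = []
for n_clusters in cluster_range:
    km = TimeSeriesKMeans(n_clusters=n_clusters, metric='euclidean', random_state=0)
    km.fit(time_np)
    inertia.append(km.inertia_)

# Look for the point where the curve stops dropping sharply
plt.figure(figsize=(10, 6))
plt.plot(cluster_range, inertia, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('SSE')
plt.title('Elbow method')
plt.show()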
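For intuition about what the elbow curve measures, here is a rough sketch that runs a plain Lloyd's k-means in NumPy on synthetic series and prints the sum of squared errors for each k. This is not tslearn's implementation. The helper `kmeans_sse`, the three synthetic groups, and the noise level are all assumptions made for illustration.

```python
import numpy as np


def kmeans_sse(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means on the rows of X; returns the final SSE."""
    rng = np.random.default_rng(seed)
    # Start from k distinct rows chosen at random
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared Euclidean distance from every series to every center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).sum())


# Synthetic data: 3 groups of 20 series, 12 points each (not the shop data)
rng = np.random.default_rng(0)
t = np.arange(12)
groups = [10 + 0 * t, 10 + 3 * t, 50 - 3 * t]
X = np.vstack([g + rng.normal(0, 1.0, size=(20, 12)) for g in groups])

sse = {k: kmeans_sse(X, k) for k in range(1, 7)}
for k, v in sse.items():
    print(k, round(v, 1))
```

With well-separated groups like these, the SSE usually falls steeply up to the true number of groups and flattens afterwards, although an unlucky initialization can blur the elbow. On real data such as the shop series the bend is often less clear, so the choice of k remains a judgment call.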

Plotting each cluster

n = 4
km_euclidean = TimeSeriesKMeans(n_clusters=n, metric='euclidean', random_state=0)
labels_euclidean = km_euclidean.fit_predict(time_np)
print(labels_euclidean)

# One subplot per cluster: member series in gray, cluster center in red
fig, axes = plt.subplots(n, figsize=(8,16))
plt.subplots_adjust(hspace=0.5)
for i in range(n):
    ax = axes[i]

    for x in time_np[labels_euclidean == i]:
        ax.plot(x.ravel(), 'k-', alpha=0.2)
    ax.plot(km_euclidean.cluster_centers_[i].ravel(), 'r-')

    datanum = np.count_nonzero(labels_euclidean == i)
    # Position the annotation in axes coordinates so it stays visible at any y scale
    ax.text(0.5, 0.95, f'Cluster {i}: n = {datanum}', transform=ax.transAxes, ha='center', va='top')
    if i == 0:
        ax.set_title('Time series clustering')
plt.show()
# Output
#[2 2 2 2 0 0 3 0 2 2 2 2 0 2 0 0 0 2 0 0 2 0 0 2 0 1 0 3 1 0 0 1 2 2 2 0 2
# 0 0 2 2 0 3 0 0 0 0 0 2 2 0 0 0 0 1 0 0 3 0 0]
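The printed label array is in the same order as the rows that were fitted, so it can be joined back to the shop ids. Here is a small sketch with hypothetical shop ids and labels; the names `shop_ids`, `labels`, `assignment`, and `members` are placeholders, not variables from the code above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins: six shop ids and the cluster label assigned to each
shop_ids = pd.Index([2, 3, 5, 8, 13, 21], name='shop_id')
labels = np.array([0, 2, 0, 1, 2, 0])

# Attach each label to the shop it belongs to
assignment = pd.Series(labels, index=shop_ids, name='cluster')

# Collect the shop ids that fall into each cluster
members = {
    int(c): assignment.index[assignment == c].tolist()
    for c in sorted(assignment.unique())
}
for cluster, ids in members.items():
    print(cluster, ids)
```

In this article the same idea would be pd.Series(labels_euclidean, index=grouped.index), which makes it easy to look up which shops ended up in the small clusters.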

Plotting all clusters together

n = 4
km_euclidean = TimeSeriesKMeans(n_clusters=n, metric='euclidean', random_state=0)
labels_euclidean = km_euclidean.fit_predict(time_np)
print(labels_euclidean)

# One color per cluster label
palette = {0: 'r', 1: 'b', 2: 'g', 3: 'black'}

# Plot every shop's series in a single figure, colored by its cluster
fig = plt.figure(figsize=(20,8))
for shop, label in zip(grouped.index, labels_euclidean):
    plt.plot(grouped.columns, grouped.loc[shop], label=f'shop{shop} cluster{label}', color=palette[int(label)])
plt.legend(loc='upper left', bbox_to_anchor=(1.05, 1))
plt.title('Time series clustering')
plt.show()
# Output
#[2 2 2 2 0 0 3 0 2 2 2 2 0 2 0 0 0 2 0 0 2 0 0 2 0 1 0 3 1 0 0 1 2 2 2 0 2
# 0 0 2 2 0 3 0 0 0 0 0 2 2 0 0 0 0 1 0 0 3 0 0]

Closing

If you found this article useful, please hit the like button.
If it gets some likes, I would like to follow up with a visualization of the standardized data.
