More than 3 years have passed since last update.

株式会社ネットワールド

Pandas を利用した時系列データを分析するための事前処理

Posted at 2022-03-18

概要

時系列データに対する特徴量エンジニアリング手法のまとめを参考に Pandas を利用して時系列データを分析するための事前処理の手法を確認してみました。

ローカル環境

macOS Monterey 12.1
python 3.8.12
Jupyter Notebook 6.4.6

前提条件

この記事で取得した全ての Azure Subscription の使用料金データが１つのファイルにマージされていること（今回は 2021/9/1 〜 2022/2/28 のデータ）

Jupyter Notebook での実装

ファイルの読込

cmd_1

# Azure課金データのDataFrameへの読み込み
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

UsageCosts = pd.read_json('./MargeUsageCost_Pandas_202202.json')
UsageCosts.head()

output_1

                        Date	ResourceGroup	            Subscription	UsageCost
0	2022-02-01 00:00:00+00:00	cloud-shell-storage-southeastasia	PSG1	3.714739
1	2022-02-02 00:00:00+00:00	cloud-shell-storage-southeastasia	PSG1	3.714859
2	2022-02-03 00:00:00+00:00	cloud-shell-storage-southeastasia	PSG1	3.714864
3	2022-02-04 00:00:00+00:00	cloud-shell-storage-southeastasia	PSG1	3.714745
4	2022-02-05 00:00:00+00:00	cloud-shell-storage-southeastasia	PSG1	3.714739

DataFrameの情報

cmd_2

print('dataframeの行数・列数の確認==>\n', UsageCosts.shape)
print('indexの確認==>\n', UsageCosts.index)
print('columnの確認==>\n', UsageCosts.columns)
print('dataframeの各列のデータ型を確認==>\n', UsageCosts.dtypes)

output_2

dataframeの行数・列数の確認==>
 (16007, 4)
indexの確認==>
 RangeIndex(start=0, stop=16007, step=1)
columnの確認==>
 Index(['Date', 'ResourceGroup', 'Subscription', 'UsageCost'], dtype='object')
dataframeの各列のデータ型を確認==>
 Date             datetime64[ns, UTC]
ResourceGroup                 object
Subscription                  object
UsageCost                    float64
dtype: object

cmd_3

# DataFrameの要約情報
UsageCosts.info()

output_3

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16007 entries, 0 to 16006
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype              
---  ------         --------------  -----              
 0   Date           16007 non-null  datetime64[ns, UTC]
 1   ResourceGroup  16007 non-null  object             
 2   Subscription   16007 non-null  object             
 3   UsageCost      16007 non-null  float64            
dtypes: datetime64[ns, UTC](1), float64(1), object(2)
memory usage: 500.3+ KB

cmd_4

# DataFrameの基本統計量
UsageCosts.describe()

output_4

	       UsageCost
count	16007.000000
mean	  180.960370
std	      618.940738
min	        0.000000
25%	        1.836792
50%	       34.506695
75%	      146.397421
max	    23124.650846

DataFrameの可視化

cmd_5

# 日次トレンドの可視化
plt.figure(figsize=(20, 10))
sns.lineplot(x='Date',y='UsageCost',data=UsageCosts)
plt.xticks(rotation=45) 
plt.title('Dayly total Cost of Azure Usage', fontsize=24)
plt.xlabel('Date', fontsize=18)
plt.ylabel('Usage Cost', fontsize=18)

DataFrameの件数取得

cmd_6

# Date列の集計（日毎のデータ件数）
UsageCosts['Date'].value_counts()

output_6

2022-02-14 00:00:00+00:00    110
2022-02-16 00:00:00+00:00    106
2022-02-15 00:00:00+00:00    106
2022-02-17 00:00:00+00:00    105
2022-02-09 00:00:00+00:00    104
                            ... 
2021-10-06 00:00:00+00:00     71
2021-10-04 00:00:00+00:00     70
2021-10-02 00:00:00+00:00     69
2021-10-01 00:00:00+00:00     69
2021-10-03 00:00:00+00:00     68
Name: Date, Length: 181, dtype: int64

DataFrameの並び替え

cmd_7

# 'Date'列を昇順で並び替える
UsageCosts.sort_values(by="Date", ascending=True).head(10)

output_7

	                        Date	                   ResourceGroup	Subscription	UsageCost
10456	2021-09-01 00:00:00+00:00	cloud-shell-storage-southeastasia	market-sc	     1.100733
13174	2021-09-01 00:00:00+00:00	                   wjt-mattttttoe	WJT	           319.969126
5788	2021-09-01 00:00:00+00:00	               rg-nhhhhhi-dev-001	PSG2	       268.044610
9238	2021-09-01 00:00:00+00:00	          rg-cscedddddd-share-001	cscedu	        43.039960
5787	2021-09-01 00:00:00+00:00	     mc_rg_ituru_aks_cp_japaneast	PSG2	       789.369759
7139	2021-09-01 00:00:00+00:00	                          vdi-lab	tech-share 	    36.940008
13144	2021-09-01 00:00:00+00:00	                     asugggggg_rg	WJT	           156.922133
7018	2021-09-01 00:00:00+00:00	       cloud-shell-storage-westus	tech-share	     2.432905
5848	2021-09-01 00:00:00+00:00	              rg-nnngggg-prod-001	PSG2	        11.307652
13234	2021-09-01 00:00:00+00:00	           wjt-wayway-rg-analllll	WJT	           498.959830

DataFrameの列の削除

cmd_8

# 不必要と思われる列（ResourceGroup, Subscription）の削除
UsageCostOnly = UsageCosts.drop(['ResourceGroup', 'Subscription'], axis=1)
UsageCostOnly.head()

output_8

                        Date	UsageCost
0	2022-02-01 00:00:00+00:00	3.714739
1	2022-02-02 00:00:00+00:00	3.714859
2	2022-02-03 00:00:00+00:00	3.714864
3	2022-02-04 00:00:00+00:00	3.714745
4	2022-02-05 00:00:00+00:00	3.714739

DataFrameのindex化

cmd_9

# Date列をindexにする
UsageCostOnly.set_index('Date', inplace=True)
display(UsageCostOnly.sort_index())

output_9

	                          UsageCost
                    Date	
2021-09-01 00:00:00+00:00	   1.100733
2021-09-01 00:00:00+00:00	 319.969126
2021-09-01 00:00:00+00:00	 268.044610
2021-09-01 00:00:00+00:00	  43.039960
2021-09-01 00:00:00+00:00	 789.369759
...	...
2022-02-28 00:00:00+00:00	   1.859265
2022-02-28 00:00:00+00:00	   1.238062
2022-02-28 00:00:00+00:00	 181.850885
2022-02-28 00:00:00+00:00	  48.400929
2022-02-28 00:00:00+00:00	1596.952795
16007 rows × 1 columns

DataFrameの集計（resampleメソッド）

cmd_10

# resampleメソッドで、日単位や月単位で簡単に集計できる
# UsageCostOnly.resample('M').mean() #月単位で平均値を出力
SumUsageCostOnly = UsageCostOnly.resample('D').sum() #日単位で合計値を出力
display(SumUsageCostOnly)

output_10

	                          UsageCost
                    Date	
2021-09-01 00:00:00+00:00	11317.017010
2021-09-02 00:00:00+00:00	12802.763686
2021-09-03 00:00:00+00:00	14390.742330
2021-09-04 00:00:00+00:00	12581.432865
2021-09-05 00:00:00+00:00	12270.136471
...	...
2022-02-24 00:00:00+00:00	43935.955541
2022-02-25 00:00:00+00:00	45967.778398
2022-02-26 00:00:00+00:00	45035.188760
2022-02-27 00:00:00+00:00	44734.041632
2022-02-28 00:00:00+00:00	47055.269872
181 rows × 1 columns

DataFrameの可視化（index化データから）

cmd_11

# 月次トレンドの可視化
plt.figure(figsize=(20, 10))
sns.lineplot(x='Date',y='UsageCost',data=SumUsageCostOnly)
plt.xticks(rotation=45) 
plt.title('Dayly total Cost of Azure Usage', fontsize=24)
plt.xlabel('Date', fontsize=18)
plt.ylabel('Usage Cost', fontsize=18)

DataFrameのindex削除

cmd_12

# reset_indexする
SumUsageCostOnly.reset_index(inplace=True)

DataFrameへの列の追加

時系列分析に必要な特徴量の生成（年、月、日、曜日、旬）

cmd_13

# 年,月,日,曜日（0:月 - 6:日） の特徴量を生成
SumUsageCostOnly['Year'] = SumUsageCostOnly['Date'].map(lambda x: x.year)
SumUsageCostOnly['Month'] = SumUsageCostOnly['Date'].map(lambda x: x.month)
SumUsageCostOnly['Day'] = SumUsageCostOnly['Date'].map(lambda x: x.day)
SumUsageCostOnly['DayOfWeek'] = SumUsageCostOnly['Date'].map(lambda x: x.dayofweek)

# cutメソッドでヒストグラムのビン指定的な処理をする
# 上・中・下旬（1:上旬, 2:下旬, 3:下旬） の特徴量を生成
labels = [1, 2, 3]
SumUsageCostOnly['Period'] = pd.cut(list(SumUsageCosts['Day']),  bins=[0,10,20,31], labels=labels, right=True) # 0<day≦10, 10<day≦20, 20<day≦31

#カラムの順番を変更
SumUsageCostOnly = SumUsageCostOnly.loc[:,['Date', 'Year','Month','Day','DayOfWeek', 'Period', 'UsageCost']]
display(SumUsageCostOnly)

output_13

	                    Date	Year	Month	Day	DayOfWeek	Period	UsageCost
0	2021-09-01 00:00:00+00:00	2021	    9	  1	        2	   1	11317.017010
1	2021-09-02 00:00:00+00:00	2021	    9	  2	        3	   1	12802.763686
2	2021-09-03 00:00:00+00:00	2021	    9	  3	        4	   1	14390.742330
3	2021-09-04 00:00:00+00:00	2021	    9	  4	        5	   1	12581.432865
4	2021-09-05 00:00:00+00:00	2021	    9	  5	        6	   1	12270.136471
...	...	...	...	...	...	...	...
176	2022-02-24 00:00:00+00:00	2022	    2	 24	        3	   3	43935.955541
177	2022-02-25 00:00:00+00:00	2022	    2	 25	        4	   3	45967.778398
178	2022-02-26 00:00:00+00:00	2022	    2	 26	        5	   3	45035.188760
179	2022-02-27 00:00:00+00:00	2022	    2	 27	        6	   3	44734.041632
180	2022-02-28 00:00:00+00:00	2022	    2	 28	        0	   3	47055.269872
181 rows × 7 columns

DataFrameへのラグ情報（列）の追加

cmd_14

# ユニークキーでソート
LagUsageCostOnly = SumUsageCostOnly.sort_values(
    ['Year', 'Month', 'Day'],
    ascending=[True, True, True]
).reset_index(drop=True)

#ラグに使用する値のリスト
lags = [1,2,3]
for lag in lags:
    LagUsageCostOnly = pd.concat(
        [LagUsageCostOnly, LagUsageCostOnly['UsageCost'].shift(lag).rename('#Cost_'+str(lag))],
        axis=1
    )

LagUsageCostOnly.head(10)

DataFrameへの移動平均情報（列）の追加

cmd_15

# 'UsageCost'列について、頭からwindowサイズ5で移動平均を計算する
SumUsageCostOnly['5MA'] = SumUsageCostOnly['UsageCost'].rolling(window=5).mean()
SumUsageCostOnly.head(10)

DataFrameの２つの情報の可視化

cmd_16

# 月次トレンドと5日移動平均の可視化
plt.figure(figsize=(20, 10))
plt.plot(SumUsageCostOnly['Date'], SumUsageCostOnly['UsageCost'], color='black',linestyle='-', label='UsageCost')
plt.plot(SumUsageCostOnly['Date'], SumUsageCostOnly['5MA'], color='dodgerblue',linestyle='--', label='5MA')
plt.legend(fontsize=14)  # 凡例
plt.xticks(rotation=45) 
plt.title('Monthly total Cost of Azure Usage with 5MA', fontsize=24)
plt.xlabel('Date', fontsize=18)
plt.ylabel('Usage Cost', fontsize=18)

DataFrameの複雑な集計

cmd_17

# ちょっと複雑な集計 ：　年月別・期間別の課金合計の可視化
monthly_df = pd.DataFrame(SumUsageCostOnly.groupby(['Year', 'Month', 'Period'])['UsageCost'].sum())
pivot_monthly_df = monthly_df.reset_index().pivot(index=['Year', 'Month'], columns='Period', values='UsageCost')
pivot_monthly_df

DataFrameの可視化（棒グラフ）

cmd_18

# 棒グラフ
pivot_monthly_df.plot(kind='bar', alpha=0.6, figsize=(12,3))
plt.title('Monthly and Period Costs')

まとめ

Pandas を利用した時系列データを分析するための事前処理の方法がなんとなく理解できたかなぁ、、、、、

参考記事

以下の記事を参考にさせていただきました。感謝申し上げます
データ分析で頻出のPandas基本操作

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up