
Combining "groupby" with a "frequency-code date window" in the pandas rolling function

Last updated at 2024-04-04, posted at 2024-01-17

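Introduction

When building features from time-series data, you often work over a window of a fixed length, for example to compute a moving average.
The rolling function lets you write this concisely, and it also runs fast.

Basic usage is not covered here; see:
pandasで窓関数を適用するrollingを使って移動平均などを算出 | note.nkmk.me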

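Combining "groupby" with a "frequency-code date window"

When you aggregate with rolling per group, using a date column as the basis for the window, there are several pitfalls to watch out for.

Example: compute a 30-day moving average for each group and add it as a column

Data used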

import pandas as pd

data = {
    'Group': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'C'],
    'Date': ['2023-01-07 21:25:23', '2023-02-01 13:16:21', '2023-04-20 12:12:29', '2023-08-17 13:17:32', '2023-09-02 21:58:14',
             '2023-12-05 06:45:54', '2023-01-05 21:21:04', '2023-01-23 20:40:57', '2023-03-20 03:25:55', '2023-06-13 04:23:22',
             '2023-08-22 14:59:13', '2023-10-23 20:31:54', '2023-11-25 12:35:42', '2023-02-06 04:54:36', '2023-02-10 06:19:04',
             '2023-03-06 05:27:49', '2023-03-08 13:46:13', '2023-05-15 14:08:39', '2023-08-12 00:36:52', '2023-12-27 21:12:06'],
    'Measure': [53, 51, 44, 15, 55, 99, 6, 46, 59, 25, 3, 11, 77, 42, 42, 25, 70, 44, 31, 94]
}

df = pd.DataFrame(data)

	Group	Date	Measure
0	A	2023-01-07 21:25:23	53
1	A	2023-02-01 13:16:21	51
2	A	2023-04-20 12:12:29	44
3	A	2023-08-17 13:17:32	15
4	A	2023-09-02 21:58:14	55
5	A	2023-12-05 06:45:54	99
6	B	2023-01-05 21:21:04	6
7	B	2023-01-23 20:40:57	46
8	B	2023-03-20 03:25:55	59
9	B	2023-06-13 04:23:22	25
10	B	2023-08-22 14:59:13	3
11	B	2023-10-23 20:31:54	11
12	B	2023-11-25 12:35:42	77
13	C	2023-02-06 04:54:36	42
14	C	2023-02-10 06:19:04	42
15	C	2023-03-06 05:27:49	25
16	C	2023-03-08 13:46:13	70
17	C	2023-05-15 14:08:39	44
18	C	2023-08-12 00:36:52	31
19	C	2023-12-27 21:12:06	94

Code (compute a 30-day moving average for each group and add it as a column)

# Convert to datetime type
df['Date'] = pd.to_datetime(df['Date'])
# Sort and reset the index
df_rol = df.sort_values(by=['Group','Date'],ascending=[True,True]).reset_index(drop=True)

# Compute the 30-day moving average for each group
df_rol['Measure_mean'] = df_rol.groupby('Group').rolling(window='30D',on='Date')['Measure'].mean().reset_index()['Measure']

Result

	Group	Date	Measure	Measure_mean
0	A	2023-01-07 21:25:23	53	53.0000
1	A	2023-02-01 13:16:21	51	52.0000
2	A	2023-04-20 12:12:29	44	44.0000
3	A	2023-08-17 13:17:32	15	15.0000
4	A	2023-09-02 21:58:14	55	35.0000
5	A	2023-12-05 06:45:54	99	99.0000
6	B	2023-01-05 21:21:04	6	6.0000
7	B	2023-01-23 20:40:57	46	26.0000
8	B	2023-03-20 03:25:55	59	59.0000
9	B	2023-06-13 04:23:22	25	25.0000
10	B	2023-08-22 14:59:13	3	3.0000
11	B	2023-10-23 20:31:54	11	11.0000
12	B	2023-11-25 12:35:42	77	77.0000
13	C	2023-02-06 04:54:36	42	42.0000
14	C	2023-02-10 06:19:04	42	42.0000
15	C	2023-03-06 05:27:49	25	36.3333
16	C	2023-03-08 13:46:13	70	45.6667
17	C	2023-05-15 14:08:39	44	44.0000
18	C	2023-08-12 00:36:52	31	31.0000
19	C	2023-12-27 21:12:06	94	94.0000

Explanation

# Convert to datetime type
df['Date'] = pd.to_datetime(df['Date'])
# Sort and reset the index
df_rol = df.sort_values(by=['Group','Date'],ascending=[True,True]).reset_index(drop=True)

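sort_values: The rows must be sorted by the group column and then by the date column. A time-based window needs the dates to be in order within each group, otherwise an error is raised, and sorting by group first keeps the row order in line with the order of the groupby output.
reset_index: When the resulting Series is attached to df as a new column, pandas matches values by index label. If the index of df does not line up with the 0, 1, 2, ... index of the Series, the values end up on the wrong rows.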
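
As a rough illustration of the sort_values point, the sketch below uses a small made-up DataFrame (not the data from this article) in which the dates inside one group are out of order. With recent pandas versions this is expected to raise a ValueError; the exact message differs between versions, so it is only printed here, and the variable names are chosen purely for illustration.

```python
import pandas as pd

# Hypothetical toy data: dates within group A are deliberately out of order.
unsorted = pd.DataFrame({
    'Group': ['A', 'A', 'A'],
    'Date': pd.to_datetime(['2023-03-01', '2023-01-01', '2023-02-01']),
    'Measure': [1, 2, 3],
})

# A time-based window on unordered dates is rejected.
try:
    unsorted.groupby('Group').rolling(window='30D', on='Date')['Measure'].mean()
    sort_error = None
except ValueError as exc:
    sort_error = str(exc)
print(sort_error)

# After sorting by group and then date, the same call goes through.
ordered = unsorted.sort_values(by=['Group', 'Date']).reset_index(drop=True)
rolled = ordered.groupby('Group').rolling(window='30D', on='Date')['Measure'].mean()
print(rolled.to_numpy())
```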

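Example: what goes wrong without reset_index

Code (keep only rows where Measure is 10 or more)

# Keep only rows where Measure is 10 or more (.copy() avoids a SettingWithCopyWarning below)
df_cl = df[df['Measure'] >= 10].copy()
# Convert to datetime type
df_cl['Date'] = pd.to_datetime(df_cl['Date'])
# Sort, deliberately skipping reset_index()
df_cl_rol = df_cl.sort_values(by=['Group','Date'],ascending=[True,True])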

Result

	Group	Date	Measure
0	A	2023-01-07 21:25:23	53
1	A	2023-02-01 13:16:21	51
2	A	2023-04-20 12:12:29	44
3	A	2023-08-17 13:17:32	15
4	A	2023-09-02 21:58:14	55
5	A	2023-12-05 06:45:54	99
7	B	2023-01-23 20:40:57	46
8	B	2023-03-20 03:25:55	59
9	B	2023-06-13 04:23:22	25
11	B	2023-10-23 20:31:54	11
12	B	2023-11-25 12:35:42	77
13	C	2023-02-06 04:54:36	42
14	C	2023-02-10 06:19:04	42
15	C	2023-03-06 05:27:49	25
16	C	2023-03-08 13:46:13	70
17	C	2023-05-15 14:08:39	44
18	C	2023-08-12 00:36:52	31
19	C	2023-12-27 21:12:06	94

Code (compute a 30-day moving average for each group and add it as a column)

# Compute the 30-day moving average for each group
df_cl_rol['Measure_mean'] = df_cl_rol.groupby('Group').rolling(window='30D',on='Date')['Measure'].mean().reset_index()['Measure']

Result

	Group	Date	Measure	Measure_mean
0	A	2023-01-07 21:25:23	53	53.0000
1	A	2023-02-01 13:16:21	51	52.0000
2	A	2023-04-20 12:12:29	44	44.0000
3	A	2023-08-17 13:17:32	15	15.0000
4	A	2023-09-02 21:58:14	55	35.0000
5	A	2023-12-05 06:45:54	99	99.0000
7	B	2023-01-23 20:40:57	46	59.0000
8	B	2023-03-20 03:25:55	59	25.0000
9	B	2023-06-13 04:23:22	25	11.0000
11	B	2023-10-23 20:31:54	11	42.0000
12	B	2023-11-25 12:35:42	77	42.0000
13	C	2023-02-06 04:54:36	42	36.3333
14	C	2023-02-10 06:19:04	42	45.6667
15	C	2023-03-06 05:27:49	25	44.0000
16	C	2023-03-08 13:46:13	70	31.0000
17	C	2023-05-15 14:08:39	44	94.0000
18	C	2023-08-12 00:36:52	31	nan
19	C	2023-12-27 21:12:06	94	nan

In fact, no error message is raised at all! 😱
The computed values simply end up on the wrong rows, so the column is garbage...
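
The reason nothing is raised is that assigning a Series to a DataFrame column matches on index labels, not on position. Here is a minimal sketch of that behaviour with made-up values (the names left and right are only for illustration and do not appear in the article's code):

```python
import pandas as pd

# A DataFrame whose index has a gap, as happens after filtering rows.
left = pd.DataFrame({'x': [10, 20, 30]}, index=[0, 2, 3])
# A Series with a fresh 0, 1, 2 index, like the output of reset_index().
right = pd.Series([1.0, 2.0, 3.0])

# Assignment matches index labels: label 2 receives right[2],
# and label 3 has no counterpart in right, so it becomes NaN.
left['y'] = right
print(left)
```

The value at label 1 of right is silently dropped and label 3 of left is filled with NaN, which is the same pattern as the shifted values and the trailing nan rows in the result table.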

If cleansing or other preprocessing has left gaps in the index, do not forget to call reset_index before using rolling.
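
Here is a sketch of that advice applied to a small made-up dataset (a shortened, hypothetical variant of the article's data with times dropped). Option 1 follows the article and resets the index before rolling. Option 2 is an alternative that is not from the article: it keeps the original index and assigns the result by position with to_numpy(), which assumes the rows are already sorted by group and date so that they are in the same order as the groupby output.

```python
import pandas as pd

df = pd.DataFrame({
    'Group': ['A', 'A', 'B', 'B', 'B'],
    'Date': pd.to_datetime(['2023-01-07', '2023-02-01',
                            '2023-01-05', '2023-01-23', '2023-03-20']),
    'Measure': [53, 51, 6, 46, 59],
})

# Filtering drops the row with index label 2, leaving a gap in the index.
df_cl = df[df['Measure'] >= 10].copy()

# Option 1: restore a contiguous index before rolling (the article's advice).
fixed = df_cl.sort_values(by=['Group', 'Date']).reset_index(drop=True)
rolled = fixed.groupby('Group').rolling(window='30D', on='Date')['Measure'].mean()
fixed['Measure_mean'] = rolled.reset_index()['Measure']

# Option 2 (assumption, not from the article): keep the original index and
# assign by position, relying on the rows being sorted by Group and Date.
kept = df_cl.sort_values(by=['Group', 'Date'])
rolled_kept = kept.groupby('Group').rolling(window='30D', on='Date')['Measure'].mean()
kept['Measure_mean'] = rolled_kept.to_numpy()

print(fixed)
print(kept)
```

Both frames should carry the same moving averages; the only difference is that kept still has the index labels it had after filtering.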
