
Fixing the week number derived from a datetime index

Last updated at 2021-01-16, posted at 2021-01-05

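Introduction

When processing time-series data in pandas, a datetime index is convenient. However, when the week number is extracted from the index, dates in the last week of the year, such as 2019/12/30 and 12/31, come out as week 1 of 2019. These are my notes on how to fix that.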

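The problem

The last week of December and the first week of the following January can be the same week. There are several week-numbering schemes, but pandas appears to follow the ISO 8601 (European) convention, in which weeks start on Monday and week 1 is the week that contains the year's first Thursday.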

Reference: 2019 week calendar and list of week numbers

However, for 2019/12/30 and 12/31, df.index.year returns 2019 and df.index.week returns 1, so these dates are treated as week 1 of 2019, which is inconvenient for weekly aggregation and similar processing.
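The mismatch can be checked without pandas. The following is a minimal sketch using only the standard library's `date.isocalendar()`; the choice of dates is mine, picked to straddle the year boundary. It shows that the ISO week number belongs to an ISO year that can differ from the calendar year.

```python
from datetime import date

# Dates around the 2019/2020 year boundary (chosen for illustration).
days = [date(2019, 12, 29), date(2019, 12, 30), date(2019, 12, 31), date(2020, 1, 1)]

rows = []
for d in days:
    # isocalendar() returns (ISO year, ISO week number, ISO weekday).
    iso_year, iso_week, iso_weekday = d.isocalendar()
    rows.append((d.isoformat(), d.year, iso_year, iso_week))
    print(d.isoformat(), "calendar year:", d.year, "ISO year:", iso_year, "ISO week:", iso_week)
```

Under ISO 8601, 2019-12-30 (a Monday) is in week 1 of ISO year 2020, so pairing that week number with the calendar year 2019 is what produces the misleading "week 1 of 2019".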

The datetime index below serves as an example.

df.index

DatetimeIndex(['2019-12-29 22:00:00+00:00', '2019-12-29 22:00:00+00:00',
               '2019-12-29 22:00:00+00:00', '2019-12-29 22:00:00+00:00',
               '2019-12-29 22:00:00+00:00', '2019-12-29 22:08:00+00:00',
               '2019-12-29 23:00:00+00:00', '2019-12-30 01:47:00+00:00',
               '2019-12-30 02:48:00+00:00', '2019-12-30 02:48:00+00:00',
               '2019-12-30 12:34:00+00:00', '2019-12-30 14:51:00+00:00',
               '2019-12-30 14:53:00+00:00', '2019-12-30 14:56:00+00:00',
               '2019-12-31 04:50:00+00:00', '2019-12-31 13:41:00+00:00',
               '2019-12-31 14:42:00+00:00', '2019-12-31 14:45:00+00:00',
               '2019-12-31 15:56:00+00:00', '2019-12-31 15:56:00+00:00',
               '2019-12-31 15:58:00+00:00', '2019-12-31 15:58:00+00:00'],
              dtype='datetime64[ns, UTC]', name='date', freq=None)
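To reproduce the behaviour, a small DataFrame with a similar index could be built as follows. This is only a sketch: the column name `value` and its contents are made up, and only a handful of the timestamps above are used.

```python
import pandas as pd

# A few of the timestamps from the index shown above.
stamps = [
    "2019-12-29 22:00:00", "2019-12-29 23:00:00",
    "2019-12-30 01:47:00", "2019-12-30 14:56:00",
    "2019-12-31 04:50:00", "2019-12-31 15:58:00",
]

# utc=True yields a timezone-aware DatetimeIndex; name it "date" like the original.
index = pd.to_datetime(stamps, utc=True).rename("date")

# "value" is a hypothetical payload column, not part of the original data.
df = pd.DataFrame({"value": range(len(stamps))}, index=index)
print(df.index)
```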

Setting a MultiIndex on this DataFrame as shown below confirms that 12/29 falls in week 52, whereas 12/30 and 12/31 end up in week 1 of 2019.

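# Build a (year, month, week, date) MultiIndex from the datetime index.
# Note: DatetimeIndex.week was deprecated in pandas 1.1 and removed in 2.0;
# on newer versions, df.index.isocalendar().week gives the same ISO week number.
df_w = df.set_index([df.index.year, df.index.month,
                     df.index.week, df.index])
df_w.index.names = ['year', 'month', 'week', 'date']
df_w.sort_index(inplace=True)
df_w.index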

MultiIndex([(2019, 12,  1, '2019-12-30 01:47:00+00:00'),
            (2019, 12,  1, '2019-12-30 02:48:00+00:00'),
            (2019, 12,  1, '2019-12-30 02:48:00+00:00'),
            (2019, 12,  1, '2019-12-30 12:34:00+00:00'),
            (2019, 12,  1, '2019-12-30 14:51:00+00:00'),
            (2019, 12,  1, '2019-12-30 14:53:00+00:00'),
            (2019, 12,  1, '2019-12-30 14:56:00+00:00'),
            (2019, 12,  1, '2019-12-31 04:50:00+00:00'),
            (2019, 12,  1, '2019-12-31 13:41:00+00:00'),
            (2019, 12,  1, '2019-12-31 14:42:00+00:00'),
            (2019, 12,  1, '2019-12-31 14:45:00+00:00'),
            (2019, 12,  1, '2019-12-31 15:56:00+00:00'),
            (2019, 12,  1, '2019-12-31 15:56:00+00:00'),
            (2019, 12,  1, '2019-12-31 15:58:00+00:00'),
            (2019, 12,  1, '2019-12-31 15:58:00+00:00'),
            (2019, 12, 52, '2019-12-29 22:00:00+00:00'),
            (2019, 12, 52, '2019-12-29 22:00:00+00:00'),
            (2019, 12, 52, '2019-12-29 22:00:00+00:00'),
            (2019, 12, 52, '2019-12-29 22:00:00+00:00'),
            (2019, 12, 52, '2019-12-29 22:00:00+00:00'),
            (2019, 12, 52, '2019-12-29 22:08:00+00:00'),
            (2019, 12, 52, '2019-12-29 23:00:00+00:00')],
           names=['year', 'month', 'week', 'date'])
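Why this hurts weekly aggregation can be illustrated with a sketch like the one below. The data are hypothetical: one row from the real first week of 2019 plus two rows from the end of the year. The code uses `isocalendar()` (available from pandas 1.1) rather than the `week` attribute, so it should also run on recent pandas versions.

```python
import pandas as pd

# Hypothetical sample: early January 2019 plus the end of December 2019.
idx = pd.DatetimeIndex(
    ["2019-01-02 09:00", "2019-12-29 22:00", "2019-12-30 01:47"],
    tz="UTC", name="date",
)
df = pd.DataFrame({"value": [1, 10, 100]}, index=idx)

# Calendar year combined with ISO week number, as in the MultiIndex above.
year = df.index.year.to_numpy()
week = df.index.isocalendar()["week"].astype("int64").to_numpy()

# Group by (calendar year, ISO week) and sum.
totals = df["value"].groupby([year, week]).sum()
print(totals)
```

With these keys, the row for 2019-12-30 is added to the same (2019, 1) bucket as the row for 2019-01-02, even though the two dates are almost a year apart.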

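Solution

The reference site had the solution I was looking for, so I applied it to my case.

Reset the index temporarily
Extract the week with dt.week
Force the week to 52
Set the index again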

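# Move the datetime index back into a regular "date" column.
df.reset_index(inplace=True)
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
# Note: Series.dt.week was deprecated in pandas 1.1 and removed in 2.0;
# on newer versions, use df["date"].dt.isocalendar().week instead.
df["week"] = df["date"].dt.week
# Overwrite the week: force 2019-12-30 and 2019-12-31 into week 52.
df["week"] = df["date"].apply(
    lambda x: 52 if x.year == 2019 and x.month == 12 and x.day in [30, 31] else x.week)
# Rebuild the (year, month, week, date) MultiIndex.
df_w = df.set_index([df["year"], df["month"],
                     df["week"], df["date"]])
df_w.index.names = ['year', 'month', 'week', 'date']
df_w.sort_index(inplace=True)
df_w.index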

MultiIndex([(2019, 12, 52, '2019-12-29 22:00:00+00:00'),
            (2019, 12, 52, '2019-12-29 22:00:00+00:00'),
            (2019, 12, 52, '2019-12-29 22:00:00+00:00'),
            (2019, 12, 52, '2019-12-29 22:00:00+00:00'),
            (2019, 12, 52, '2019-12-29 22:00:00+00:00'),
            (2019, 12, 52, '2019-12-29 22:08:00+00:00'),
            (2019, 12, 52, '2019-12-29 23:00:00+00:00'),
            (2019, 12, 52, '2019-12-30 01:47:00+00:00'),
            (2019, 12, 52, '2019-12-30 02:48:00+00:00'),
            (2019, 12, 52, '2019-12-30 02:48:00+00:00'),
            (2019, 12, 52, '2019-12-30 12:34:00+00:00'),
            (2019, 12, 52, '2019-12-30 14:51:00+00:00'),
            (2019, 12, 52, '2019-12-30 14:53:00+00:00'),
            (2019, 12, 52, '2019-12-30 14:56:00+00:00'),
            (2019, 12, 52, '2019-12-31 04:50:00+00:00'),
            (2019, 12, 52, '2019-12-31 13:41:00+00:00'),
            (2019, 12, 52, '2019-12-31 14:42:00+00:00'),
            (2019, 12, 52, '2019-12-31 14:45:00+00:00'),
            (2019, 12, 52, '2019-12-31 15:56:00+00:00'),
            (2019, 12, 52, '2019-12-31 15:56:00+00:00'),
            (2019, 12, 52, '2019-12-31 15:58:00+00:00'),
            (2019, 12, 52, '2019-12-31 15:58:00+00:00')],
           names=['year', 'month', 'week', 'date'])

This makes it possible to aggregate by week within each year.
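The fix above is hard-coded for 2019. One way to generalize it to other years might look like the sketch below. The helper name `calendar_year_week` is hypothetical, and it relies on the fact that 28 December always lies in the last ISO week of its year.

```python
from datetime import date

import pandas as pd


def calendar_year_week(ts: pd.Timestamp) -> int:
    """Hypothetical helper: return the ISO week, except that late-December
    dates which ISO assigns to week 1 of the next year are folded into the
    last week of the current calendar year."""
    iso_year, iso_week, _ = ts.isocalendar()
    if ts.month == 12 and iso_week == 1:
        # 28 December is always in the last ISO week of its year.
        return date(ts.year, 12, 28).isocalendar()[1]
    return iso_week


dates = pd.Series(pd.to_datetime(
    ["2019-12-29 22:00", "2019-12-30 01:47", "2019-12-31 15:58", "2020-01-02 10:00"],
    utc=True,
))
weeks = dates.apply(calendar_year_week)
print(weeks.tolist())  # [52, 52, 52, 1]
```

This sketch only handles the end-of-December case discussed in this article. Early-January dates that ISO assigns to the last week of the previous year (for example 2021-01-01, which is in ISO week 53 of 2020) are left untouched.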

References

The solution to the same problem posted here was extremely helpful. Thanks!
Pandas - wrong week extracted week from date
