How to add a previous-month temperature column to a pandas time-series DataFrame

Posted at 2018-11-18 (last updated at 2018-11-18)

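What this article covers

While working with time-series data in pandas (Python), I got stuck trying to add next-month and previous-month information to a DataFrame. This article describes what went wrong and how I solved it.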

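Data and preprocessing

Time series

The base data is the actual electricity demand for 2017 published by TEPCO (Tokyo Electric Power Company).
http://www.tepco.co.jp/forecast/index-j.html
I load it with pandas and resample it to get monthly totals.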

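import pandas as pd

# Load the data
df_electricity = pd.read_csv("tepco_juyo-2017.csv", encoding="shift-jis", header=1).rename(columns={"実績(万kW)":"kW"})

# DATE and TIME are stored as text in separate columns,
# so join them and convert the result to a timestamp
df_electricity["datetime"] = pd.to_datetime(df_electricity["DATE"] + " " + df_electricity["TIME"])

# Resample to monthly totals of the kW column only (DATE and TIME are text and should not be summed).
# Note: pandas 2.2 and later prefer the alias "ME" for month-end.
df_monthly_electricity = df_electricity.set_index("datetime")["kW"].resample("M").sum().reset_index()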

This produces the following data.

datetime	kW
0	2017-01-31	2660552
1	2017-02-28	2399320
2	2017-03-31	2514744
3	2017-04-30	2120716
4	2017-05-31	2099972
5	2017-06-30	2173333
6	2017-07-31	2684309
7	2017-08-31	2573115
8	2017-09-30	2232854
9	2017-10-31	2215836
10	2017-11-30	2276771
11	2017-12-31	2657793
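To make the monthly aggregation step easier to try without the TEPCO file, here is a minimal sketch on synthetic data. The daily readings and the constant value 100 are made up for illustration, and the sketch groups by a monthly Period instead of calling resample, so that it does not depend on which month-end alias the installed pandas version accepts.

```python
import pandas as pd

# Synthetic stand-in for the real file: one made-up reading per day over two months.
days = pd.date_range("2017-01-01", "2017-02-28", freq="D")
df_demo = pd.DataFrame({"datetime": days, "kW": 100})

# Group the rows by calendar month and sum the numeric column.
monthly_demo = df_demo.groupby(df_demo["datetime"].dt.to_period("M"))["kW"].sum()
print(monthly_demo)
```

January has 31 rows and February 28, so the two totals differ only because of the number of days in each month.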

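Data to add

To the data above, I want to add Tokyo's mean temperature for each calendar month, averaged over the past three years and calculated from the monthly temperatures published by the Japan Meteorological Agency.
https://www.data.jma.go.jp/gmd/risk/obsdl/index.php
To map it onto the time-series DataFrame, I convert it to a dictionary.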

# Load the data
df_temperature = pd.read_csv("climate_temperature.csv", encoding="shift-jis", header=2, usecols=(0,1)).dropna()

# Extract the month as an int from the year/month column, then average per month
df_temperature['month'] = df_temperature["年月"].str.split("/").map(lambda x: int(x[1]))
df_monthly_temperature = df_temperature.groupby("month")["平均気温(℃)"].mean()

# Convert to a dictionary
mean_temperature_dict = df_monthly_temperature.to_dict()

The resulting dictionary looks like this.

mean_temperature_dict

{1: 5.5333333333333323,
 2: 6.5,
 3: 10.1,
 4: 14.866666666666665,
 5: 20.433333333333334,
 6: 22.166666666666668,
 7: 26.299999999999997,
 8: 26.733333333333331,
 9: 23.266666666666666,
 10: 17.966666666666665,
 11: 12.4,
 12: 8.2666666666666675}
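The same month-to-mean conversion can be sketched on a tiny hand-made table. The rows below are hypothetical (three years of January and February with made-up temperatures), and the "YYYY/M" layout of the 年月 column is an assumption inferred from the split on "/" in the code above.

```python
import pandas as pd

# Hypothetical sample data; the temperatures are invented for illustration.
df_demo = pd.DataFrame({
    "年月": ["2015/1", "2016/1", "2017/1", "2015/2", "2016/2", "2017/2"],
    "平均気温(℃)": [5.0, 6.0, 7.0, 6.0, 7.0, 8.0],
})

# Take the part after "/" as the month number, then average per month.
df_demo["month"] = df_demo["年月"].str.split("/").map(lambda x: int(x[1]))
demo_dict = df_demo.groupby("month")["平均気温(℃)"].mean().to_dict()
print(demo_dict)
```

The keys of the dictionary are the month numbers 1 and 2, which is what makes a later lookup by month possible.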

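Adding the previous month's mean temperature as a column

Converting the month to an int and subtracting

My first idea was the following.

Extract the month from the timestamps of the time-series DataFrame
Simply subtract 1 to get the previous month
Use map to look up the mean temperature for that month and add it as a column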

df_monthly_electricity["temperature_previous_month"] = \
(df_monthly_electricity["datetime"].dt.month-1).map(mean_temperature_dict)

However, this produces the following result:

df_monthly_electricity

	datetime	kW	temperature_previous_month
0	2017-01-31	2660552	NaN
1	2017-02-28	2399320	5.533333
2	2017-03-31	2514744	6.500000
3	2017-04-30	2120716	10.100000
4	2017-05-31	2099972	14.866667
5	2017-06-30	2173333	20.433333
6	2017-07-31	2684309	22.166667
7	2017-08-31	2573115	26.300000
8	2017-09-30	2232854	26.733333
9	2017-10-31	2215836	23.266667
10	2017-11-30	2276771	17.966667
11	2017-12-31	2657793	12.400000

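The previous-month mean temperature for January is not obtained correctly; it comes out as NaN.
This is because January minus one evaluates to month 0, which is not a key in the dictionary.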
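The cause can be reproduced in isolation. In this minimal sketch the lookup table is a dummy (the month number doubles as the "temperature"), so only the behaviour of map on a missing key is being illustrated.

```python
import pandas as pd

# Dummy lookup table with keys 1..12; the values are placeholders.
temps = {m: float(m) for m in range(1, 13)}

months = pd.Series([1, 2, 3])
shifted = months - 1            # 0, 1, 2
looked_up = shifted.map(temps)  # key 0 is absent, so the first value becomes NaN
print(looked_up)
```

Series.map does not raise on a key that is missing from the dictionary; it silently fills in NaN, which is why the problem only shows up in the January row.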

The calculation has to be done as date arithmetic, not as integer arithmetic.

Using datetime.timedelta

So I tried shifting the timestamps back by one month, as follows.

import datetime

# The traceback below was recorded with "+"; timedelta(months=1) fails the same way with "-".
df_monthly_electricity["datetime"] - datetime.timedelta(months=1)
# ---------------------------------------------------------------------------
# TypeError                                 Traceback (most recent call last)
# <ipython-input-88-4f5bea914dd4> in <module>()
# ----> 1 df_monthly_electricity["datetime"] + datetime.timedelta(months=1)
#
# TypeError: 'months' is an invalid keyword argument for this function

However, this raised an error.
According to the documentation, datetime.timedelta accepts weeks as its largest unit; months and years are not supported.
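A short standard-library sketch shows both halves of the problem: the months keyword is rejected, and approximating a month with a fixed number of days does not reliably land in the previous month. The choice of 30 days is just an illustrative guess at "one month".

```python
from datetime import date, timedelta

# timedelta has no "months" argument, so this raises TypeError.
error_message = None
try:
    timedelta(months=1)
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# Subtracting a fixed 30 days from 31 March stays inside March.
march_end = date(2017, 3, 31)
still_march = march_end - timedelta(days=30)
print(still_march)  # 2017-03-01
```

Because months have different lengths, no single number of days works for every row, which is why a month-aware offset is needed.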

Using relativedelta

So I imported relativedelta from dateutil, which can express month-level differences.

from dateutil.relativedelta import relativedelta

# The traceback below was recorded with "+" rather than "-".
df_monthly_electricity["datetime"] - relativedelta(months=1)

# ---------------------------------------------------------------------------
# TypeError                                 Traceback (most recent call last)
# <ipython-input-89-199b48c11213> in <module>()
#       1 from dateutil.relativedelta import relativedelta
#       2 
# ----> 3 df_monthly_electricity["datetime"] + relativedelta(months=1)
#
# (omitted)
#
# TypeError: incompatible type [object] for a datetime/timedelta operation

However, this also raises an error.
Apparently, unlike datetime.timedelta, relativedelta is not broadcast over a Series, at least in the pandas version used here.

So I worked around it with map and a lambda.

df_monthly_electricity["datetime"].map(lambda x: x - relativedelta(months=1))
# 0    2016-12-31
# 1    2017-01-28
# 2    2017-02-28
# 3    2017-03-30
# 4    2017-04-30
# 5    2017-05-30
# 6    2017-06-30
# 7    2017-07-31
# 8    2017-08-30
# 9    2017-09-30
# 10   2017-10-30
# 11   2017-11-30

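This finally shifts each date into the previous month correctly. The day of the month is clipped where necessary (2017-03-31 becomes 2017-02-28), but only the month is needed here.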
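The way relativedelta handles month ends can be checked on single dates. This is a minimal sketch using plain datetime.date values rather than pandas timestamps; the three dates are taken from the month-end index above.

```python
from datetime import date
from dateutil.relativedelta import relativedelta

one_month = relativedelta(months=1)

from_jan_end = date(2017, 1, 31) - one_month  # crosses the year boundary
from_mar_end = date(2017, 3, 31) - one_month  # February has no 31st, so the day is clipped
from_apr_end = date(2017, 4, 30) - one_month  # the 30th exists in March, so the day is kept

print(from_jan_end, from_mar_end, from_apr_end)
```

The results are not always month-end dates, but the month component is always the previous calendar month, which is all the temperature lookup needs.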

In the end, I was able to add the previous month's mean temperature like this.

df_monthly_electricity["temperature_previous_month"] = \
df_monthly_electricity["datetime"].map(lambda x: x - relativedelta(months=1)).dt.month.map(mean_temperature_dict)

df_monthly_electricity
# 	datetime	kW	temperature_previous_month
# 0	2017-01-31	2660552	8.266667
# 1	2017-02-28	2399320	5.533333
# 2	2017-03-31	2514744	6.500000
# 3	2017-04-30	2120716	10.100000
# 4	2017-05-31	2099972	14.866667
# 5	2017-06-30	2173333	20.433333
# 6	2017-07-31	2684309	22.166667
# 7	2017-08-31	2573115	26.300000
# 8	2017-09-30	2232854	26.733333
# 9	2017-10-31	2215836	23.266667
# 10	2017-11-30	2276771	17.966667
# 11	2017-12-31	2657793	12.400000

Having to use map twice is a little painful.
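One way to avoid the first map might be pandas' own pd.DateOffset, which can be subtracted from a datetime Series directly. This is an alternative that the original approach did not use, shown here as a minimal sketch on three hand-written month-end dates.

```python
import pandas as pd

month_ends = pd.Series(pd.to_datetime(["2017-01-31", "2017-02-28", "2017-03-31"]))

# DateOffset understands calendar months and applies to the whole Series at once.
previous = month_ends - pd.DateOffset(months=1)
previous_months = previous.dt.month
print(previous_months.tolist())  # [12, 1, 2]
```

With this, only the final dictionary lookup would still need map.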

Bonus

Actually, this problem can be solved very simply, just by doing the following.

mean_temperature_dict[0] = mean_temperature_dict[12]  # month 0 stands for December

df_monthly_electricity["temperature_previous_month"] = \
(df_monthly_electricity["datetime"].dt.month-1).map(mean_temperature_dict)

Nothing fancy: instead of touching the time series, tweak the dictionary so that month 0 (= December) also has a temperature.
This is much simpler.
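As a sketch of the same idea, the variant below builds a copy of the dictionary instead of modifying it in place, and compares it with a second integer-only alternative that wraps the month number with modular arithmetic. The lookup table is a dummy (the month number doubles as the "temperature"), and the modular version is an addition for comparison, not part of the original approach.

```python
import pandas as pd

# Dummy lookup table with keys 1..12; the values are placeholders.
temps = {m: float(m) for m in range(1, 13)}
months = pd.Series(range(1, 13))

# Dictionary trick on a copy: month 0 is given December's value.
temps_with_zero = {**temps, 0: temps[12]}
by_dict = (months - 1).map(temps_with_zero)

# Alternative: wrap the month itself, so 1 -> 12, 2 -> 1, ..., 12 -> 11.
by_modulo = ((months - 2) % 12 + 1).map(temps)

print(by_dict.equals(by_modulo))
```

Copying the dictionary keeps the extra key 0 out of any later code that iterates over the twelve real months.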

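In terms of execution speed, the latter is also slightly faster (about 860 µs versus 1.11 ms per loop in my measurement).

# Insist on computing with datetime types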
%timeit df_monthly_electricity["datetime"].map(lambda x: x - relativedelta(months=1)).dt.month.map(mean_temperature_dict)

# 1000 loops, best of 3: 1.11 ms per loop

# Tweak the dictionary instead
%%timeit
mean_temperature_dict[0] = mean_temperature_dict[12]
(df_monthly_electricity["datetime"].dt.month-1).map(mean_temperature_dict)

# 1000 loops, best of 3: 860 µs per loop

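Summary?

datetime.timedelta cannot add or subtract months.
relativedelta can do month arithmetic, but it is not broadcast over a pandas Series (at least in the version used here).
When you get stuck, try changing your point of view.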
