pandas' to_datetime is slow! (Specifying format is very effective!)

Last updated at 2019-07-06, posted at 2019-07-06

Problem

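While working on the Week 1 assignment of the Coursera course "How to Win a Data Science Competition: Learn from Top Kagglers", I ran into a situation that called for pandas' to_datetime.
Honestly, I did not strictly have to use to_datetime, but I had never used it before, so I boldly gave it a try, lol.
However, the dataset is huge, 2,935,849 rows, and the processing just would not finish!
The data looks like this. I want to apply to_datetime to the 'date' column!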

	date	shop_id	item_id	item_price	item_cnt_day
0	02.01.2013	59	22154	999.00	1.0
1	03.01.2013	25	2552	899.00	1.0
2	05.01.2013	25	2552	899.00	-1.0
3	06.01.2013	25	2554	1709.05	1.0
4	15.01.2013	25	2555	1099.00	1.0

For a start, I tried it like this.

t['date_'] = pd.to_datetime(t['date'])

Probable cause and solution

According to "to_datetime is too slow on large dataset", the parsing step appears to be what takes so much time.

In this instance, pandas falls back to dateutil.parser.parse for parsing the strings when no format string is supplied (more flexible, but also slower).

Passing format to to_datetime, or passing infer_datetime_format=True, apparently makes the parsing faster.
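As a minimal sketch of the format option, the snippet below rebuilds only the 'date' column from the five sample rows shown earlier and parses it with an explicit format. The variable name t follows the article; the small DataFrame is a stand-in for the real data, which is an assumption made purely for illustration.

```python
import pandas as pd

# Stand-in for the real data: only the 'date' column of the five sample rows.
t = pd.DataFrame({
    "date": ["02.01.2013", "03.01.2013", "05.01.2013", "06.01.2013", "15.01.2013"],
})

# %d = day of month, %m = month, %Y = four-digit year, separated by literal dots.
t["date_"] = pd.to_datetime(t["date"], format="%d.%m.%Y")

print(t["date_"].dt.strftime("%Y-%m-%d").tolist())
```

An explicit format also pins down the day/month order, so "02.01.2013" is read as 2 January 2013 and not as 1 February 2013.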

According to the pandas.to_datetime documentation:

infer_datetime_format : boolean, default False
　　　If True and no format is given, attempt to infer the format of the datetime strings, and if it can be inferred, switch to a faster method of parsing them. In some cases this can increase the parsing speed by ~5-10x.

format : string, default None
　　　strftime to parse time, eg “%d/%m/%Y”, note that “%f” will parse all the way up to nanoseconds.

Results

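Since 2,935,849 rows is a lot, I tried it on the first 50,000 rows and got the following results. The values are elapsed times (in seconds, judging from the full-data figure below).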

No arguments	infer_datetime_format=True	format='%d.%m.%Y'
13.87	13.85	0.25

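Specifying format turned out to be blazingly fast!!
For this '%d.%m.%Y' pattern, the format apparently could not be inferred properly, so setting infer_datetime_format to True did not give good results.
With format specified, all 2,935,849 rows were processed in 14.57 seconds. Excellent!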
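A comparison like the one above could be timed roughly as follows. This is only a sketch under stated assumptions: the 50,000 strings are synthetic stand-ins for the real 'date' column, the timed helper is a hypothetical name introduced here, and the call without format passes dayfirst=True (not used in the article) so that both calls read the strings day-first. Absolute numbers will differ by machine and pandas version, so they will not reproduce the table.

```python
import time

import pandas as pd

# Synthetic stand-in for the first 50,000 rows: day-first strings like "01.01.2013".
days = pd.date_range("2013-01-01", periods=1000, freq="D")
dates = pd.Series(days.strftime("%d.%m.%Y").tolist() * 50)


def timed(func):
    """Hypothetical helper: return (result, elapsed seconds) for a zero-argument callable."""
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start


# Explicit format, as in the fastest column of the table.
with_format, t_format = timed(lambda: pd.to_datetime(dates, format="%d.%m.%Y"))

# No format given; dayfirst=True is an assumption added so the results are comparable.
without_format, t_plain = timed(lambda: pd.to_datetime(dates, dayfirst=True))

print(f"format='%d.%m.%Y': {t_format:.2f} s")
print(f"no format (dayfirst=True): {t_plain:.2f} s")
```

Checking that both calls return the same timestamps before comparing their timings guards against measuring a fast but wrong parse.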
