
[Speed Comparison] Speeding Up Preprocessing of 40 Million Records

Overview

I compared several ways to speed up the preprocessing of a dataset with 40 million records. :star2:

The goal in this article is to split a Timestamp column into separate year, month, day, hour, minute, and second columns.

Conclusion

Using pandas' to_datetime method looks like the fastest option.

Environment

  • MacBook Pro (13-inch, 2017, Two Thunderbolt 3 ports)
  • macOS Mojave 10.14.6
  • 2.5 GHz Intel Core i7
  • 16 GB 2133 MHz LPDDR3

Preparation

First, generate 40 million timestamps.

prep1
# Write a header line, then append 40M random timestamps.
# $RANDOM is 0..32767, so the epochs fall in 1500000000..1532767000;
# date -r (BSD/macOS) formats an epoch as a timestamp string.
echo "created_at" > timestamp_sample.csv
for ((i=0;i<40000000;i++)); do date -r $(($RANDOM*1000+1500000000)) "+%Y-%m-%d %H:%M:%S" >> timestamp_sample.csv; done
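
As an aside: spawning date 40 million times makes this loop take a very long time. A minimal single-process sketch that generates roughly the same data, assuming NumPy is available (my own addition, not used for any of the timings below):

prep1-alt
import numpy as np
import pandas as pd

# Roughly the same epoch range as the shell loop ($RANDOM*1000 + 1500000000)
epochs = np.random.randint(1_500_000_000, 1_500_000_000 + 32_767 * 1_000,
                           size=40_000_000)
ts = pd.to_datetime(epochs, unit='s').strftime('%Y-%m-%d %H:%M:%S')
pd.DataFrame({'created_at': ts}).to_csv('timestamp_sample.csv', index=False)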

Load the CSV file into Python ahead of time.

prep2
import pandas as pd
import time
import sys, io, subprocess

t1 = time.time()
in_df = pd.read_csv('timestamp_sample.csv')
print(time.time()-t1) # took roughly 93 sec
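
Incidentally, read_csv can also parse the timestamps during the load itself via parse_dates, which would fold the to_datetime work of Method 1 into the read. A sketch only; the runs below keep created_at as plain strings, so this variant is not used for them:

prep2-alt
t1 = time.time()
# parse_dates converts created_at to datetime64 during the read
in_df = pd.read_csv('timestamp_sample.csv', parse_dates=['created_at'])
print(time.time()-t1)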

Method 1 🥇

Use pandas' to_datetime method (this looks like the fastest).

method1
t1 = time.time()
out_df = pd.DataFrame()
out_df['year'] = pd.to_datetime(in_df['created_at']).dt.year
out_df['month'] = pd.to_datetime(in_df['created_at']).dt.month
out_df['day'] = pd.to_datetime(in_df['created_at']).dt.day
out_df['hour'] = pd.to_datetime(in_df['created_at']).dt.hour
out_df['minute'] = pd.to_datetime(in_df['created_at']).dt.minute
out_df['second'] = pd.to_datetime(in_df['created_at']).dt.second
print(time.time()-t1) # roughly 60 sec
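
Note that this calls pd.to_datetime six times and re-parses the same column each time. A variant I would expect to be faster (untested here) parses once; passing an explicit format also tends to skip pandas' slower format inference:

method1-alt
t1 = time.time()
# Parse once, cache the datetime Series, then pull each component from it
dt = pd.to_datetime(in_df['created_at'], format='%Y-%m-%d %H:%M:%S')
out_df = pd.DataFrame({
    'year': dt.dt.year,
    'month': dt.dt.month,
    'day': dt.dt.day,
    'hour': dt.dt.hour,
    'minute': dt.dt.minute,
    'second': dt.dt.second,
})
print(time.time()-t1)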

Method 2 🥈

Use a shell script.
If you include the time to read the file, this is actually the fastest: 145 sec here vs. roughly 93 + 60 = 153 sec for read_csv plus Method 1.

convert_timestamp.sh
#!/bin/bash

# BSD sed inserts a newline in a replacement via "backslash + literal newline",
# so build that two-character sequence in $LF.
LF=$(printf '\\\012_')
LF=${LF%_}

# Take the created_at column, turn every non-digit (- space :) into a comma
# (2017-08-02 12:34:56 -> 2017,08,02,12,34,56), drop the mangled header line,
# then prepend the new header.
cut -d ',' -f 1 timestamp_sample.csv|\
sed -e 's/[^0-9]/,/g' |\
sed -e '1d' |\
sed -e '1s/^/year,month,day,hour,minute,second'"$LF"'/'
method2
t1 = time.time()
# Run the script (make it executable with chmod +x first) and parse its
# stdout directly as CSV, without writing an intermediate file.
res = subprocess.run(["./convert_timestamp.sh"], stdout=subprocess.PIPE)
df = pd.read_csv(io.BytesIO(res.stdout))
print(time.time()-t1) # 145 sec

Method 3 🥉

Use map

slice_func
# The layout is fixed ('YYYY-MM-DD HH:MM:SS'), so each field sits at a known offset.
def slice1(_str):
    return int(_str[0:4])    # year
def slice2(_str):
    return int(_str[5:7])    # month
def slice3(_str):
    return int(_str[8:10])   # day
def slice4(_str):
    return int(_str[11:13])  # hour
def slice5(_str):
    return int(_str[14:16])  # minute
def slice6(_str):
    return int(_str[17:19])  # second
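
For example, each function pulls one fixed-width field out of a timestamp string:

>>> slice1('2017-08-02 12:34:56')
2017
>>> slice4('2017-08-02 12:34:56')
12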

Method 3-1

method3-1
t1 = time.time()
out_df = pd.DataFrame()
out_df.loc[:, 'year'] = in_df.created_at.map(slice1)
out_df.loc[:, 'month'] = in_df.created_at.map(slice2)
out_df.loc[:, 'day'] = in_df.created_at.map(slice3)
out_df.loc[:, 'hour'] = in_df.created_at.map(slice4)
out_df.loc[:, 'minute'] = in_df.created_at.map(slice5)
out_df.loc[:, 'second'] = in_df.created_at.map(slice6)
print(time.time()-t1) # 191 sec

Method 3-2

It's faster to skip .loc: each .loc assignment presumably goes through index alignment, so building the columns first and assembling the DataFrame in one step comes out ahead (175 sec vs. 191 sec).

method3-2
t1 = time.time()
tmp1 = in_df.created_at.map(slice1)
tmp2 = in_df.created_at.map(slice2)
tmp3 = in_df.created_at.map(slice3)
tmp4 = in_df.created_at.map(slice4)
tmp5 = in_df.created_at.map(slice5)
tmp6 = in_df.created_at.map(slice6)
master = pd.DataFrame(
    data={
        'year': tmp1,
        'month': tmp2,
        'day': tmp3,
        'hour': tmp4,
        'minute': tmp5,
        'second': tmp6
    },
    columns=['year', 'month', 'day', 'hour', 'minute', 'second']
)
print(time.time()-t1) # 175 sec

Method 3-3

Using concat is faster still (169 sec vs. 175 sec).

method3-3
t1 = time.time()
tmp1 = in_df.created_at.map(slice1)
tmp2 = in_df.created_at.map(slice2)
tmp3 = in_df.created_at.map(slice3)
tmp4 = in_df.created_at.map(slice4)
tmp5 = in_df.created_at.map(slice5)
tmp6 = in_df.created_at.map(slice6)
master = pd.concat([tmp1, tmp2, tmp3, tmp4, tmp5, tmp6], axis=1)
master.columns=['year', 'month', 'day', 'hour', 'minute', 'second']
print(time.time()-t1) # 169 sec
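
As a variant of my own (not part of the original timings), pandas can also do the slicing vectorized through the .str accessor, which avoids calling a Python function per row:

method3-alt
t1 = time.time()
c = in_df.created_at
# .str.slice works on the whole column at once; astype(int) converts in bulk
master = pd.DataFrame({
    'year': c.str.slice(0, 4).astype(int),
    'month': c.str.slice(5, 7).astype(int),
    'day': c.str.slice(8, 10).astype(int),
    'hour': c.str.slice(11, 13).astype(int),
    'minute': c.str.slice(14, 16).astype(int),
    'second': c.str.slice(17, 19).astype(int),
})
print(time.time()-t1)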

Method 4

Use plain Python loops (as generator expressions)

Method 4-1

method4-1
t1 = time.time()
tmp1 = tuple(int(t[0:4]) for t in in_df.created_at.tolist())
tmp2 = tuple(int(t[5:7]) for t in in_df.created_at.tolist())
tmp3 = tuple(int(t[8:10]) for t in in_df.created_at.tolist())
tmp4 = tuple(int(t[11:13]) for t in in_df.created_at.tolist())
tmp5 = tuple(int(t[14:16]) for t in in_df.created_at.tolist())
tmp6 = tuple(int(t[17:19]) for t in in_df.created_at.tolist())
master = pd.DataFrame(
    data={
        'year': tmp1,
        'month': tmp2,
        'day': tmp3,
        'hour': tmp4,
        'minute': tmp5,
        'second': tmp6
    },
    columns=['year', 'month', 'day', 'hour', 'minute', 'second']
)
print(time.time()-t1) # 185 sec

Method 4-2

Factoring the slicing out into the functions is slightly faster (181 sec vs. 185 sec).

method4-2
t1 = time.time()
tmp1 = tuple(slice1(t) for t in in_df.created_at.tolist())
tmp2 = tuple(slice2(t) for t in in_df.created_at.tolist())
tmp3 = tuple(slice3(t) for t in in_df.created_at.tolist())
tmp4 = tuple(slice4(t) for t in in_df.created_at.tolist())
tmp5 = tuple(slice5(t) for t in in_df.created_at.tolist())
tmp6 = tuple(slice6(t) for t in in_df.created_at.tolist())
master = pd.DataFrame(
    data={
        'year': tmp1,
        'month': tmp2,
        'day': tmp3,
        'hour': tmp4,
        'minute': tmp5,
        'second': tmp6
    },
    columns=['year', 'month', 'day', 'hour', 'minute', 'second']
)
print(time.time()-t1) # 181 sec
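
Both variants above also call tolist() six times, and each call materializes a fresh 40-million-element list. Building the list once should trim that overhead; a sketch, not re-measured here:

method4-alt
t1 = time.time()
ts_list = in_df.created_at.tolist()  # build the 40M-element list only once
tmp1 = tuple(slice1(t) for t in ts_list)
tmp2 = tuple(slice2(t) for t in ts_list)
tmp3 = tuple(slice3(t) for t in ts_list)
tmp4 = tuple(slice4(t) for t in ts_list)
tmp5 = tuple(slice5(t) for t in ts_list)
tmp6 = tuple(slice6(t) for t in ts_list)
master = pd.DataFrame(
    data={
        'year': tmp1,
        'month': tmp2,
        'day': tmp3,
        'hour': tmp4,
        'minute': tmp5,
        'second': tmp6
    },
    columns=['year', 'month', 'day', 'hour', 'minute', 'second']
)
print(time.time()-t1)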

Method 5

Use zip: split each row into a 6-tuple, then transpose with zip(*) so that each resulting tuple holds one column's values.

method5
def sprit_timestamp(_str: str):
    # split 'YYYY-MM-DD HH:MM:SS' into a (year, month, day, hour, minute, second) tuple
    return int(_str[0:4]), int(_str[5:7]), int(_str[8:10]), int(_str[11:13]), int(_str[14:16]), int(_str[17:19])


t1 = time.time()
# zip(*) transposes 40M row 6-tuples into 6 column tuples of 40M values each
tmp = tuple(zip(*(sprit_timestamp(t) for t in in_df.created_at.tolist())))
master = pd.DataFrame(
    data={
        'year': tmp[0],
        'month': tmp[1],
        'day': tmp[2],
        'hour': tmp[3],
        'minute': tmp[4],
        'second': tmp[5]
    },
    columns=['year', 'month', 'day', 'hour', 'minute', 'second']
)
print(time.time()-t1) # 205 sec

Closing Thoughts

It turns out that execution time varies greatly depending on how the code is written.
That's worth keeping in mind when working with large datasets.

Next, I'd like to compare more complex processing, such as regular expressions and numeric computation.
