More than 5 years have passed since last update.

【速度比較】4000万レコードの前処理を速くしたい

Posted at 2019-09-22

概要

4000万レコードあるデータの前処理を速く処理する方法について比較しました。

この記事ではTimestampのColumn から 年,月,日,時,分,秒のColumnに分けることを目的とします

結論

Pandasのto_datetimeメソッド使うのが一番早そう。

実行環境

MacBook Pro (13-inch, 2017, Two Thunderbolt 3 ports)
macOS Mojave 10.14.6
2.5 GHz Intel Core i7
16 GB 2133 MHz LPDDR3

準備

4000万件のタイムスタンプを作ります。

準備1

echo "created_at" > timestamp_sample.csv
for ((i=0;i<40000000;i++)); do date -r $(($RANDOM*1000+1500000000)) "+%Y-%m-%d %H:%M:%S" >> timestamp_sample.csv; done

あらかじめCSVファイルをPythonでインポートしておきます。

準備2

import pandas as pd
import time
import sys, io, subprocess

t1 = time.time()
in_df = pd.read_csv('timestamp_sample.csv')
print(time.time()-t1) # 大体 93 sec位かかった

方法1 🥇

Pandasのto_datetimeメソッドを使う (これが一番速そう)

method1

t1 = time.time()
out_df = pd.DataFrame()
out_df['year'] = pd.to_datetime(in_df['created_at']).dt.year
out_df['month'] = pd.to_datetime(in_df['created_at']).dt.month
out_df['day'] = pd.to_datetime(in_df['created_at']).dt.day
out_df['hour'] = pd.to_datetime(in_df['created_at']).dt.hour
out_df['minute'] = pd.to_datetime(in_df['created_at']).dt.minute
out_df['second'] = pd.to_datetime(in_df['created_at']).dt.second
print(time.time()-t1) # 60secぐらい

方法2 🥈

シェルスクリプトを使う
ファイル読み込み込で考えるとこれが一番速い

convert_timestamp.sh

# !/bin/bash

LF=$(printf '\\\012_')
LF=${LF%_}

cut -d ',' -f 1 timestamp_sample.csv|\
sed -e 's/[^0-9]/,/g' |\
sed -e '1d' |\
sed -e '1s/^/year,month,day,hour,minute,second'"$LF"'/'

method2

t1 = time.time()
res = subprocess.run(["./convert_timestamp.sh"], stdout=subprocess.PIPE)
df = pd.read_csv(io.BytesIO(res.stdout))
print(time.time()-t1) # 145sec

方法3 🥉

mapを使う

slice_func

def slice1(_str):
    return int(_str[0:4])
def slice2(_str):
    return int(_str[5:7])
def slice3(_str):
    return int(_str[8:10])
def slice4(_str):
    return int(_str[11:13])
def slice5(_str):
    return int(_str[14:16])
def slice6(_str):
    return int(_str[17:19])

方法3-1

method3-1

t1 = time.time()
out_df = pd.DataFrame()
out_df.loc[:, 'year'] = in_df.created_at.map(slice1)
out_df.loc[:, 'month'] = in_df.created_at.map(slice2)
out_df.loc[:, 'day'] = in_df.created_at.map(slice3)
out_df.loc[:, 'hour'] = in_df.created_at.map(slice4)
out_df.loc[:, 'minute'] = in_df.created_at.map(slice5)
out_df.loc[:, 'second'] = in_df.created_at.map(slice6)
print(time.time()-t1) # 191sec

方法3-2

loc使わないほうが速い

method3-2

t1 = time.time()
tmp1 = in_df.created_at.map(slice1)
tmp2 = in_df.created_at.map(slice2)
tmp3= in_df.created_at.map(slice3)
tmp4 = in_df.created_at.map(slice4)
tmp5 = in_df.created_at.map(slice5)
tmp6 = in_df.created_at.map(slice6)
master = pd.DataFrame(
    data={
        'year': tmp1,
        'month': tmp2,
        'day': tmp3,
        'hour': tmp4,
        'minute': tmp5,
        'second': tmp6
    },
    columns=['year', 'month', 'day', 'hour', 'minute', 'second']
)
print(time.time()-t1) # 175sec

方法3-3

concat使ったほうが速い

method3-3

t1 = time.time()
tmp1 = in_df.created_at.map(slice1)
tmp2 = in_df.created_at.map(slice2)
tmp3= in_df.created_at.map(slice3)
tmp4 = in_df.created_at.map(slice4)
tmp5 = in_df.created_at.map(slice5)
tmp6 = in_df.created_at.map(slice6)
master = pd.concat([tmp1, tmp2, tmp3, tmp4, tmp5, tmp6], axis=1)
master.columns=['year', 'month', 'day', 'hour', 'minute', 'second']
print(time.time()-t1)# 169sec

方法4

forを使う

方法4-1

method4

t1 = time.time()
tmp1 = tuple(int(t[0:4]) for t in in_df.created_at.tolist())
tmp2 = tuple(int(t[5:7]) for t in in_df.created_at.tolist())
tmp3 = tuple(int(t[8:10]) for t in in_df.created_at.tolist())
tmp4 = tuple(int(t[11:13]) for t in in_df.created_at.tolist())
tmp5 = tuple(int(t[14:16]) for t in in_df.created_at.tolist())
tmp6 = tuple(int(t[17:19]) for t in in_df.created_at.tolist())
master = pd.DataFrame(
    data={
        'year': tmp1,
        'month': tmp2,
        'day': tmp3,
        'hour': tmp4,
        'minute': tmp5,
        'second': tmp6
    },
    columns=['year', 'month', 'day', 'hour', 'minute', 'second']
)
print(time.time()-t1)# 185sec

方法4-2

関数化したほうがちょっとだけ速い

method4-2

t1 = time.time()
tmp1 = tuple(slice1(t) for t in in_df.created_at.tolist())
tmp2 = tuple(slice2(t)  for t in in_df.created_at.tolist())
tmp3 = tuple(slice3(t)  for t in in_df.created_at.tolist())
tmp4 = tuple(slice4(t)  for t in in_df.created_at.tolist())
tmp5 = tuple(slice5(t)  for t in in_df.created_at.tolist())
tmp6 = tuple(slice6(t)  for t in in_df.created_at.tolist())
master = pd.DataFrame(
    data={
        'year': tmp1,
        'month': tmp2,
        'day': tmp3,
        'hour': tmp4,
        'minute': tmp5,
        'second': tmp6
    },
    columns=['year', 'month', 'day', 'hour', 'minute', 'second']
)
print(time.time()-t1) #181sec

方法5

zipを使う

method5

def sprit_timestamp(_str: str):
    return  int(_str[0:4]), int(_str[5:7]), int(_str[8:10]), int(_str[11:13]), int(_str[14:16]), int(_str[17:19])


t1 = time.time()
tmp = tuple(zip(*(sprit_timestamp(t) for t in in_df.created_at.tolist())))
master = pd.DataFrame(
    data={
        'year': tmp[0],
        'month': tmp[1],
        'day': tmp[2],
        'hour': tmp[3],
        'minute': tmp[4],
        'second': tmp[5]
    },
    columns=['year', 'month', 'day', 'hour', 'minute', 'second']
)
print(time.time()-t1)# 205sec

感想

書き方によって実行時間が大きく変わることがわかりました。
これから大量のデータを扱うときは気をつけたいですね。

次は正規表現や数値計算等の複雑な処理についての比較を行いたいと思います。

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up