Comparing the speed of append and map in Python

Python3

Posted at 2020-10-02

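This is a repost of something I originally wrote for internal use at my company.

Purpose

I want to check how much the execution time and memory usage differ between the code I first wrote (using append) and the code suggested to me in review (using map).

Test code

I prepared a program that merges two lists and prints the result. (Originally it was a program that merged fields with rows fetched from a DB and output the result as JSON.)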
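For context, the original use case might look roughly like the sketch below. The field names and row values are hypothetical stand-ins, since the real schema is not shown here; only the dict(zip(...)) pairing is taken from the test code.

```python
import json

# Hypothetical column names and rows standing in for a real DB result.
fields = ["id", "name", "email"]
rows = [
    (1, "alice", "alice@example.com"),
    (2, "bob", "bob@example.com"),
]


def merge(fields, rows):
    """Pair each row's values with the field names, one dict per row."""
    return [dict(zip(fields, row)) for row in rows]


records = merge(fields, rows)
print(json.dumps(records))
```

The test code below replaces fields and rows with two fixed three-element lists so that only the merging style differs between the two versions.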

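I used the following tools for the measurements.

memory_profiler: measures memory increases and decreases line by line. It shows both the current usage and the increment.
line_profiler: measures execution time line by line.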

Test 1

from memory_profiler import profile
from line_profiler import LineProfiler


@profile(precision=8)
def main():
    a = [1, 2, 3]
    b = ['1', '2', '3']

    results = []
    for _ in range(1000):
        results.append(dict(zip(a, b)))

    for r in results:
        print(r)  # {1: '1', 2: '2', 3: '3'}


if __name__ == "__main__":
    # To measure memory usage
    #    main()

    # To compare execution time
    prof = LineProfiler()
    prof.add_function(main)
    prof.runcall(main)
    prof.print_stats(output_unit=1e-9)

Test 2

Everything except the function body is the same as in Test 1, so it is omitted.

def main():
    a = [1, 2, 3]
    b = ['1', '2', '3']

    results = map(dict, (zip(a, b) for _ in range(1000)))

    for r in results:
        print(r) # {1: '1', 2: '2', 3: '3'}

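Results

The amount of data was small to begin with, but a difference still showed up.

Execution time

Test 2 was about 0.02 s faster (0.069483 s vs. 0.087002 s in total). In both runs, print(r) accounts for roughly 90% of the total time.
The for r in results: line took longer in Test 2, presumably because map returns an iterator and each dict is only built when the loop asks for it. Even so, it was faster overall than the code using append.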
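The lazy behaviour of map can be checked with a small sketch like the following. The build helper and the calls list are hypothetical names introduced only for this illustration; they record when each element is actually produced.

```python
calls = []


def build(pair):
    # Record each call so we can see when the work actually happens.
    calls.append(pair)
    return dict([pair])


pairs = [(1, "1"), (2, "2"), (3, "3")]

lazy = map(build, pairs)
calls_after_map = len(calls)    # nothing has run yet

first = next(lazy)
calls_after_first = len(calls)  # exactly one element was built

rest = list(lazy)
calls_after_all = len(calls)    # the remaining elements were built on demand

print(calls_after_map, calls_after_first, calls_after_all)  # 0 1 3
```

Creating the map object costs almost nothing, and the cost of building each dict shifts into the loop that consumes it. This is consistent with the profile, where the map line itself takes almost no time.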

Test 1

Total time: 0.087002 s
File: append.py
Function: main at line 6

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     6                                           def main():
     7         1       4000.0   4000.0      0.0      a = [1, 2, 3]
     8         1       1000.0   1000.0      0.0      b = ['1', '2', '3']
     9                                           
    10         1          0.0      0.0      0.0      results = []
    11      1001     395000.0    394.6      0.5      for _ in range(1000):
    12      1000    1973000.0   1973.0      2.3          results.append(dict(zip(a, b)))
    13                                           
    14      1001    5854000.0   5848.2      6.7      for r in results:
    15      1000   78775000.0  78775.0     90.5          print(r)

Test 2

Total time: 0.069483 s
File: map.py
Function: main at line 7

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     7                                           def main():
     8         1       4000.0   4000.0      0.0      a = [1, 2, 3]
     9         1       1000.0   1000.0      0.0      b = ['1', '2', '3']
    10                                           
    11         1       2000.0   2000.0      0.0      results = map(dict, (zip(a, b) for _ in range(1000)))
    12                                           
    13      1001    8476000.0   8467.5     12.2      for r in results:
    14      1000   61000000.0  61000.0     87.8          print(r)

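Memory usage

Comparing the totals, the map version used about 0.16 MiB less (38.24 MiB vs. 38.39 MiB at the end of the run).
This shows that the style used in Test 2 does not use more memory than it needs. (This is the important part!)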
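With only 1000 small dicts the gap is tiny, so here is a rough sketch for cross-checking it on a larger input. It uses the standard library's tracemalloc instead of memory_profiler, and the element count N and the helper names are assumptions made for this illustration, not part of the original measurement.

```python
import tracemalloc

a = [1, 2, 3]
b = ["1", "2", "3"]
N = 20_000  # larger than the article's 1000 so the gap is easier to see


def with_append():
    # Builds every dict up front and keeps them all alive in a list.
    results = []
    for _ in range(N):
        results.append(dict(zip(a, b)))
    count = 0
    for _ in results:
        count += 1
    return count


def with_map():
    # Builds one dict at a time; each is discarded after it is consumed.
    results = map(dict, (zip(a, b) for _ in range(N)))
    count = 0
    for _ in results:
        count += 1
    return count


def peak_bytes(func):
    """Return the peak traced allocation (in bytes) while func runs."""
    tracemalloc.start()
    func()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak


append_peak = peak_bytes(with_append)
map_peak = peak_bytes(with_map)
print(f"append: {append_peak} bytes, map: {map_peak} bytes")
```

The exact numbers depend on the Python version and platform, but the append version's peak should grow with N while the map version's should stay roughly flat.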

Test 1

Line #    Mem usage    Increment   Line Contents
================================================
     5  38.28125000 MiB  38.28125000 MiB   @profile(precision=8)
     6                             def main():
     7  38.28515625 MiB   0.00390625 MiB       a = [1, 2, 3]
     8  38.28515625 MiB   0.00000000 MiB       b = ['1', '2', '3']
     9                             
    10  38.28515625 MiB   0.00000000 MiB       results = []
    11  38.38671875 MiB   0.00390625 MiB       for _ in range(1000):
    12  38.38671875 MiB   0.00390625 MiB           results.append(dict(zip(a, b)))
    13                             
    14  38.39453125 MiB   0.00000000 MiB       for r in results:
    15  38.39453125 MiB   0.00781250 MiB           print(r)

Test 2

Line #    Mem usage    Increment   Line Contents
================================================
     5  38.22656250 MiB  38.22656250 MiB   @profile(precision=8)
     6                             def main():
     7  38.23046875 MiB   0.00390625 MiB       a = [1, 2, 3]
     8  38.23046875 MiB   0.00000000 MiB       b = ['1', '2', '3']
     9                             
    10  38.23828125 MiB   0.00000000 MiB       results = map(dict, (zip(a, b) for _ in range(1000)))
    11                             
    12  38.23828125 MiB   0.00000000 MiB       for r in results:
    13  38.23828125 MiB   0.00781250 MiB           print(r)

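Closing

I had not known that profiling tools like these existed. Being able to see the results line by line is fun!
It is not only a matter of fast versus slow. In a case like the one I was actually working on, where rows is a large amount of data fetched from a DB, code that does not use the map style, which avoids using more memory than necessary, could lead to an incident when run in production. That lesson really sank in.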
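As a minimal sketch of that DB case, the following streams rows through map and serializes them one at a time. The in-memory sqlite3 table, its schema, and the field names are hypothetical stand-ins for the real database, and whether rows are fetched lazily on the database side depends on the driver in use.

```python
import json
import sqlite3

# In-memory table standing in for the real database (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    ((i, f"user{i}") for i in range(1000)),
)

fields = ["id", "name"]
cursor = conn.execute("SELECT id, name FROM users ORDER BY id")

# The cursor is itself an iterator over rows, so map() pulls one row at a
# time instead of materializing every dict in a list first.
records = map(lambda row: dict(zip(fields, row)), cursor)

lines = 0
first_line = None
for record in records:
    line = json.dumps(record)
    if first_line is None:
        first_line = line
    lines += 1

conn.close()
print(first_line, lines)  # {"id": 0, "name": "user0"} 1000
```

Only one record dict is alive at any moment here, so memory use on the Python side does not grow with the number of rows.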

If anything looks off, please let me know.
