each内でfindしてる処理（線形探索）をハッシュ探索、バイナリ探索にするとどれくらい早いのか計測してみた。

Last updated at 2024-07-27Posted at 2024-07-27

前提

以下、過去の開発で実際にあった例です。

stations = [{id: 1, name: station1}, {id: 2, name: station2}, ...]
facility_counts_by_station = [{id: 1, count: 10}, {id: 3, count: 20}, ...]
idはどちらもstation_idを指す。

のような配列がそれぞれあり、stationsに同一idのcountをmergeしたい。

これを処理するうえで、線形探索、ハッシュ探索、バイナリ探索それぞれのアルゴリズムで処理速度がどれくらい違うのか比較してみた。

補足

ある施設の数を駅ごとに表示するため用意しているデータです。

1つ目の配列はActiveRecordから取得、2つ目の配列はElasticSearchのaggregationから取得しているため、それを突合させる必要がありました。

なお、stationに属するfacilityが0件のケースはfacility_counts_by_stationに要素が作られないため、stationsに比べて1/3のデータ量にしています。

今見るとElasticSearch側のレスポンスにnameも含めてcountが0でも要素を返してもらうようにすればいい気がしますが、そうできない事情があったのかもしれません・・・

実装

def main
  [1500, 15000, 150000].each do |size|
    stations, facility_counts_by_station = init(size)

    start_time = Time.now
    linear_result = merge_by_linear_search(stations, facility_counts_by_station)
    p "線形探索処理時間 (size=#{size}): #{Time.now - start_time}s"

    start_time = Time.now
    hash_result = merge_by_hash_search(stations, facility_counts_by_station)
    p "ハッシュ探索処理時間 (size=#{size}): #{Time.now - start_time}s"

    start_time = Time.now
    binary_result = merge_by_binary_search(stations, facility_counts_by_station)
    p "バイナリ探索処理時間 (size=#{size}): #{Time.now - start_time}s"
    
    if linear_result == hash_result && linear_result == binary_result
      p '正常'
    elsif linear_result != hash_result
      p 'linear_result != hash_result'
    elsif linear_result != binary_result
      p 'linear_result != binary_result'
    end
  end
end

def init(size)
  stations = []
  facility_counts_by_station = []

  size.times do |i|
    id = i + 1
    stations << { id: id, name: "station#{id}"}

    if i % 3 == 0
      facility_counts_by_station << { id: id, count: rand(100) }
    end
  end

  return stations, facility_counts_by_station
end

def merge_by_linear_search(stations, facility_counts_by_station)
  stations.map do |station|
    target = facility_counts_by_station.find{|h| h[:id] == station[:id]}
    station.merge({count: target&.[](:count) || 0})
  end
end

def merge_by_hash_search(stations, facility_counts_by_station)
  facility_counts_hash = {}

  facility_counts_by_station.each do |facility_count|
    facility_counts_hash[facility_count[:id]] = facility_count[:count]
  end

  stations.map do |station|
    station.merge({count: facility_counts_hash[station[:id]] || 0})
  end
end

def bsearch_original(sorted_arr, target_id)
  head = 0
  tail = sorted_arr.length - 1

  while head <= tail
    center = (tail + head) / 2

    if sorted_arr[center][:id] == target_id
      return sorted_arr[center]
    elsif sorted_arr[center][:id] > target_id
      tail = center - 1
    elsif sorted_arr[center][:id] < target_id
      head = center + 1
    end
  end

  return {}
end

def merge_by_binary_search(stations, facility_counts_by_station)
  stations.map do |station|
    target = bsearch_original(facility_counts_by_station, station[:id])
    station.merge({count: target&.[](:count) || 0})
  end
end

main

測定結果

"線形探索処理時間 (size=1500): 0.054236214s"
"ハッシュ探索処理時間 (size=1500): 0.00072462s"
"バイナリ探索処理時間 (size=1500): 0.002575072s"

"線形探索処理時間 (size=15000): 5.983366226s"
"ハッシュ探索処理時間 (size=15000): 0.011276715s"
"バイナリ探索処理時間 (size=15000): 0.03400995s"

"線形探索処理時間 (size=150000): 958.449376416s"
"ハッシュ探索処理時間 (size=150000): 0.141020173s"
"バイナリ探索処理時間 (size=150000): 0.353612058s"

このコードにおいて、

N1 = stations.length (= 1,500 or 15,000 or 150,000)
N2 = facility_counts_by_station.length (= 500 or 5,000 or 50,000)

線形探索: O(N1 * N2)
ハッシュ探索(ハッシュ変換込み): O(N1 + N2)
バイナリ探索: O(N1 * LogN2)

となるので、やはりハッシュが一番早かった。

おまけ

バイナリ探索のほうが早い場合はないのかと考えたところ、オーダーの式的にN2の数がN1に比べて圧倒的に多いときそうなりそうだと思ったので、N1とN2を逆にして(facility_counts_by_station.each内でstationsに対して探索)かつ、N1 : N2 = 1 : 100 にして計測してみた。

N1 = facility_counts_by_station.length (= 15 or 15)
N2 = stations.length (= 1,500 or 15,000)

"線形探索処理時間 (size=1500): 0.000868799s"
"ハッシュ探索処理時間 (size=1500): 0.0002049s"
"バイナリ探索処理時間 (size=1500): 2.16e-05s"

"線形探索処理時間 (size=15000): 0.099216185s"
"ハッシュ探索処理時間 (size=15000): 0.0035582s"
"バイナリ探索処理時間 (size=15000): 0.0003478s"

その通りになった。つまり、N2>>>N1ならバイナリ探索が速い。

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up