More than 5 years have passed since last update.

Rubyで表形式データから検索値リストによるデータ抽出処理をSQLっぽく書く（インデックス化による高速化も）

Last updated at 2015-06-14Posted at 2015-06-14

先に結論

Ruby で表形式データから検索値リストによるデータ抽出処理をSQLっぽく書くには、以下のようにする。

require 'active_support'
require 'active_support/core_ext'

list_1 = list_1.index_by {|v| v } # Ruby 2.2 以上なら、
list_2 = list_2.index_by {|v| v } # index_by(&:itself) が使える

result = rows.select {|row|
  row[:column_1].in?(list_1) and
  row[:column_2].in?(list_2)
}

これで CSV からのデータ抽出や複数 RDBMS を跨いだデータ集計などの作業が捗りそう。

やりたいこと

Ruby を使って CSV から特定の値を持つデータを抽出したり、
複数の RDBMS を跨いだデータを Ruby の層でいい感じにごにょごにょしたい。

具体的に言うと、例えば以下の様なデータがあったときに、

column_1	column_2	...
111	'aaa'	...
222	'bbb'	...
333	'ccc'	...
...	...	...

MySQL, PostgreSQL, Amazon Redshift などの RDBMS では以下のように書くデータ抽出処理を、

SELECT
  *
FROM
  `rows`
WHERE
  `column_1` IN(:list_1) AND  -- :list_1 の具体例: 111,222,333,...
  `column_2` IN(:list_2)      -- :list_2 の具体例: 'aaa','bbb','ccc',...
;

Ruby における以下のようなデータに対しても同じような感じで実現したい。

rows = [
  { column_1: 111, column_2: 'aaa', ... },
  { column_1: 222, column_2: 'bbb', ... },
  { column_1: 333, column_2: 'ccc', ... },
  ...
]

できれば SQL っぽい表現で、できれば高速に。

実現方法

Enumerable#select と Array#include? を使う（イマイチ）

result = rows.select {|row|
  list_1.include?(row[:column_1]) and
  list_2.include?(row[:column_2])
}

これは以下の点でイマイチ。

可読性：検索値リスト→検索対象値の順で書かれているため、ブロック内の記述が分かりにくい
処理速度： rows.size を m、list_*.size を n とすると、select ブロック内の計算量は O(m * n) になる

【改善①】Array#include? の代わりに Object#in? を使う（可読性アップ）

require 'active_support'
require 'active_support/core_ext'

result = rows.select {|row|
  row[:column_1].in?(list_1) and
  row[:column_2].in?(list_2)
}

value.in?(list) は list.include?(value) と等価（ActiveSupport による Object クラス拡張）
これにより、select ブロック内の記述が SQL と同様になり可読性がアップする
が、実質的な処理内容は変わっていないので計算量は O(m * n) のまま

【改善②】検索値リストを Enumerable#index_by でインデックス化する（処理速度アップ）

require 'active_support'
require 'active_support/core_ext'

list_1 = list_1.index_by {|v| v } # Ruby 2.2 以上なら、
list_2 = list_2.index_by {|v| v } # index_by(&:itself) が使える

result = rows.select {|row|
  row[:column_1].in?(list_1) and
  row[:column_2].in?(list_2)
}

[1, 2, 3].index_by(&:itself) は {1 => 1, 2 => 2, 3 => 3} を返す（ActiveSupport による Enumerable モジュール拡張）
Hash#include? は Hash#has_key? と等価なので、value.in?(list) は「Array の値探索処理」 → 「Hash のキーアクセス処理」になる
これにより、select ブロック内の探索計算量が O(m * n) → O(m) になる

パフォーマンス確認

測定スクリプト：

column_1（整数）と column_2（文字列）を持つ 100万レコードのデータに対し、
検索値リストの数を 1, 10, 100, 1000, 10000 にした場合のデータ抽出処理時間を計測する。

require 'active_support'
require 'active_support/core_ext'
require 'benchmark'

num_rows = 1_000_000
num_list_variations = [1, 10, 100, 1_000, 10_000]

num_list_variations.each do |num_list|
  rows = (1..num_rows).map {|v| { column_1: v.to_i, column_2: v.to_s } }
  list_1 = (1..num_list).map(&:to_i)
  list_2 = (1..num_list).map(&:to_s)

  puts "\n【レコード数: #{num_rows}, 検索値リスト数: #{num_list}】"

  Benchmark.bm(10) do |x|
    x.report('インデックス化なし') do
      result = rows.select {|row|
        row[:column_1].in?(list_1) and
        row[:column_2].in?(list_2)
      }
    end

    x.report('インデックス化あり') do
      list_1 = list_1.index_by(&:itself)
      list_2 = list_2.index_by(&:itself)

      result = rows.select {|row|
        row[:column_1].in?(list_1) and
        row[:column_2].in?(list_2)
      }
    end
  end
end

測定結果：

OS X Yosemite 10.10.3
2.6 GHz Intel Core i5
8 GB 1600 MHz DDR3
ruby 2.2.2p95 (2015-04-13 revision 50295)
activesupport (4.2.1)

【レコード数: 1000000, 検索値リスト数: 1】
                       user     system      total        real
インデックス化なし    0.170000   0.000000   0.170000 (  0.172847)
インデックス化あり    0.180000   0.000000   0.180000 (  0.180457)

【レコード数: 1000000, 検索値リスト数: 10】
                       user     system      total        real
インデックス化なし    0.230000   0.000000   0.230000 (  0.235147)
インデックス化あり    0.170000   0.000000   0.170000 (  0.173922)

【レコード数: 1000000, 検索値リスト数: 100】
                       user     system      total        real
インデックス化なし    0.880000   0.000000   0.880000 (  0.888002)
インデックス化あり    0.230000   0.000000   0.230000 (  0.229938)

【レコード数: 1000000, 検索値リスト数: 1000】
                       user     system      total        real
インデックス化なし    6.770000   0.010000   6.780000 (  6.772243)
インデックス化あり    0.190000   0.000000   0.190000 (  0.197593)

【レコード数: 1000000, 検索値リスト数: 10000】
                       user     system      total        real
インデックス化なし   66.680000   0.040000  66.720000 ( 66.787141)
インデックス化あり    0.220000   0.000000   0.220000 (  0.221374)

検索値リストの数が少ない場合は両者に差が出ないが、
これが多くなるにつれて検索値リストをインデックス化することの効果が顕著になる。

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up