PostgreSQL テーブル差集合のパフォーマンス
お題
次の記事を参照後、興味があってPostgreSQL におけるテーブル差集合のパフォーマンスを測定しました。
今回は灰色部分を求めます。
結論
測定結果は次のようになりました
not exists
>>> [Big Wall] >>> left join / is NULL
>= not in
> except all
[not in]
Planning Time: 0.047 ms
Execution Time: 475.355 ms
[not exists]
Planning Time: 0.024 ms
Execution Time: 3.766 ms
[left join / is NULL]
Planning Time: 0.033 ms
Execution Time: 349.660 ms
[except all]
Planning Time: 0.017 ms
Execution Time: 718.943 ms
差集合を得たい場合は exists
を使えば、パフォーマンス問題に悩まされないようです。
参考
データに応じて測定結果が変わりますが、データ量を増減させた時のパフォーマンス特性も掲載しておきます。
t2 の行数を変化させながら測定しています。
PostgreSQLの9.2以降はnot in も not existsと遜色ないという記事もあるが、そんなことはありませんでした。
実験
[環境]
- Ubuntu 22.04
- PostgreSQL 14
psql Cli から DB 作成後にアクセス
root@masami-L ~# createdb -U postgres sample
root@masami-L /# psql -U postgres -d sample
ベンチマーク用関数用意
sample=# CREATE FUNCTION f_test(ct int, sql text)
RETURNS void
LANGUAGE plpgsql AS
$func$
DECLARE
i int;
BEGIN
FOR i IN 1 .. $1 LOOP
EXECUTE sql;
END LOOP;
END
$func$;
CREATE FUNCTION
データ準備
sample=# create table t1 (
id serial,
val int default random()*1000);
CREATE TABLE
sample=# create table t2 (
id serial,
val int default random()*1000);
CREATE TABLE
sample=# insert into t1
select from generate_series(1,10000);
INSERT 0 10000
sample=# insert into t2
select from generate_series(1,10000);
INSERT 0 10000
■not in
実行計画の確認
sample=# explain analyze select * from t1 where (val) not in (select val from t2);
QUERY PLAN
--------------------------------------------------------------------------------------------------------------
Seq Scan on t1 (cost=172.12..344.25 rows=5085 width=8) (actual time=16.781..16.783 rows=0 loops=1)
Filter: (NOT (hashed SubPlan 1))
Rows Removed by Filter: 10000
SubPlan 1
-> Seq Scan on t2 (cost=0.00..146.70 rows=10170 width=4) (actual time=0.037..4.310 rows=10000 loops=1)
Planning Time: 1.118 ms
Execution Time: 17.252 ms
(7 rows)
パフォーマンスの確認
sample=# EXPLAIN ANALYZE
SELECT f_test(100, 'select * from t1 where (val) not in (select val from t2);');
QUERY PLAN
----------------------------------------------------------------------------------------
Result (cost=0.00..0.26 rows=1 width=4) (actual time=475.229..475.230 rows=1 loops=1)
Planning Time: 0.047 ms
Execution Time: 475.355 ms
(3 rows)
■not exists
実行計画の確認
sample=# explain analyze select * from t1 where not exists (select val from t2);
QUERY PLAN
----------------------------------------------------------------------------------------------------------
Result (cost=0.01..145.01 rows=10000 width=8) (actual time=0.028..0.030 rows=0 loops=1)
One-Time Filter: (NOT $0)
InitPlan 1 (returns $0)
-> Seq Scan on t2 (cost=0.00..145.00 rows=10000 width=0) (actual time=0.023..0.023 rows=1 loops=1)
-> Seq Scan on t1 (cost=0.01..145.01 rows=10000 width=8) (never executed)
Planning Time: 0.159 ms
Execution Time: 0.081 ms
(7 rows)
パフォーマンスの確認
sample=# EXPLAIN ANALYZE
SELECT f_test(100, 'select * from t1 where not exists (select val from t2);');
QUERY PLAN
------------------------------------------------------------------------------------
Result (cost=0.00..0.26 rows=1 width=4) (actual time=3.752..3.752 rows=1 loops=1)
Planning Time: 0.024 ms
Execution Time: 3.766 ms
(3 rows)
■left join / is NUL
実行計画の確認
sample=# explain analyze select * from t1 left join t2 using (val) where t2.val is NULL;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------
Hash Anti Join (cost=270.00..490.00 rows=1 width=12) (actual time=6.997..6.999 rows=0 loops=1)
Hash Cond: (t1.val = t2.val)
-> Seq Scan on t1 (cost=0.00..145.00 rows=10000 width=8) (actual time=0.013..1.048 rows=10000 loops=1)
-> Hash (cost=145.00..145.00 rows=10000 width=8) (actual time=3.554..3.555 rows=10000 loops=1)
Buckets: 16384 Batches: 1 Memory Usage: 519kB
-> Seq Scan on t2 (cost=0.00..145.00 rows=10000 width=8) (actual time=0.008..1.291 rows=10000 loops=1)
Planning Time: 1.331 ms
Execution Time: 7.039 ms
(8 rows)
パフォーマンスの確認
sample=# EXPLAIN ANALYZE
SELECT f_test(100, 'select * from t1 left join t2 using (val) where t2.val is NULL;');
QUERY PLAN
----------------------------------------------------------------------------------------
Result (cost=0.00..0.26 rows=1 width=4) (actual time=349.641..349.642 rows=1 loops=1)
Planning Time: 0.033 ms
Execution Time: 349.660 ms
(3 rows)
■except all
実行計画確認
sample=# explain analyze select * from t1
except all select * from t2;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
HashSetOp Except All (cost=0.00..690.00 rows=10000 width=12) (actual time=8.851..9.891 rows=9986 loops=1)
-> Append (cost=0.00..590.00 rows=20000 width=12) (actual time=0.008..5.381 rows=20000 loops=1)
-> Subquery Scan on "*SELECT* 1" (cost=0.00..245.00 rows=10000 width=12) (actual time=0.008..2.070 rows=10000 loops=1)
-> Seq Scan on t1 (cost=0.00..145.00 rows=10000 width=8) (actual time=0.005..0.759 rows=10000 loops=1)
-> Subquery Scan on "*SELECT* 2" (cost=0.00..245.00 rows=10000 width=12) (actual time=0.007..2.002 rows=10000 loops=1)
-> Seq Scan on t2 (cost=0.00..145.00 rows=10000 width=8) (actual time=0.006..0.747 rows=10000 loops=1)
Planning Time: 0.055 ms
Execution Time: 10.738 ms
(8 rows)
パフォーマンス測定
sample=# EXPLAIN ANALYZE
SELECT f_test(100, 'select * from t1
except all select * from t2;');
QUERY PLAN
----------------------------------------------------------------------------------------
Result (cost=0.00..0.26 rows=1 width=4) (actual time=718.931..718.932 rows=1 loops=1)
Planning Time: 0.017 ms
Execution Time: 718.943 ms
(3 rows)