PostgreSQL 14 Is Coming (4) - Changing the TOAST Compression Method

Last updated at 2021-06-08, posted at 2021-06-07

Introduction

Meow.
This time I looked into the ability to change the TOAST compression method, which was introduced in PostgreSQL 14.

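What is TOAST?

TOAST stands for "The Oversized-Attribute Storage Technique".

Very roughly speaking, it is the technique for fitting very long data into fixed-size pages (8KB by default).
From the user's point of view, PostgreSQL basically takes care of this on its own, without the user having to think about it.

Last year I did a bit of digging into TOAST myself. The slides from that are up on SpeakerDeck. → TOAST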

Changing the TOAST compression method

From the PostgreSQL 14 release notes. (The quote is as of beta1.)

Add ability to use LZ4 compression on TOAST data (Dilip Kumar)
This can be set at the column level, or set as a default via server setting default_toast_compression. The server must be compiled with --with-lz4 to support this feature; the default is still pglz.

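When TOAST compresses data, it previously used only the built-in compression method (pglz). From PostgreSQL 14, lz4 can be selected as well.

Preparing to use lz4 with PostgreSQL 14

This time I try it with PostgreSQL 14 beta1.
When building PostgreSQL from source, the first step is to run the configure script. Run configure with the --with-lz4 option.

However, adding --with-lz4 alone can make configure fail, depending on the environment.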

checking for liblz4... no
configure: error: Package requirements (liblz4) were not met:

No package 'liblz4' found

Consider adjusting the PKG_CONFIG_PATH environment variable if you
installed software in a non-standard prefix.

Alternatively, you may set the environment variables LZ4_CFLAGS
and LZ4_LIBS to avoid the need to call pkg-config.
See the pkg-config man page for more details.

This appears to happen because lz4-devel is not installed in the environment, so install lz4-devel with yum install and run configure again.
Once configure succeeds, run make and make install as usual.

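Specifying the compression method

There are broadly two ways to specify the compression method.

Specifying it with the default_toast_compression parameter

The first way is to set the PostgreSQL parameter default_toast_compression.

When this parameter is set to lz4, every TOAST-able column of a newly created table gets lz4 as its Compression.
The context of default_toast_compression is user, so you can set it to lz4 within a single session and run CREATE TABLE right afterwards.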
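
To check that the parameter really is settable per session, its entry in pg_settings can be inspected. This is a minimal sketch, assuming a PostgreSQL 14 server; the value lz4 is only accepted when the server was built with --with-lz4.

```sql
-- Show the current value, the context, and the values this build accepts.
SELECT name, setting, context, enumvals
FROM pg_settings
WHERE name = 'default_toast_compression';

-- Change it for the current session only, then go back to the configured default.
SET default_toast_compression = 'lz4';
SHOW default_toast_compression;
RESET default_toast_compression;
```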

testdb=# CREATE TABLE foo (id int, data text);
CREATE TABLE
testdb=# SET default_toast_compression = 'lz4';
SET
testdb=# CREATE TABLE bar (id int, data text);
CREATE TABLE
testdb=# \d+ foo
                                           Table "public.foo"
 Column |  Type   | Collation | Nullable | Default | Storage  | Compression | Stats target | Description
--------+---------+-----------+----------+---------+----------+-------------+--------------+-------------
 id     | integer |           |          |         | plain    |             |              |
 data   | text    |           |          |         | extended | pglz        |              |
Access method: heap

testdb=# \d+ bar
                                           Table "public.bar"
 Column |  Type   | Collation | Nullable | Default | Storage  | Compression | Stats target | Description
--------+---------+-----------+----------+---------+----------+-------------+--------------+-------------
 id     | integer |           |          |         | plain    |             |              |
 data   | text    |           |          |         | extended | lz4         |              |
Access method: heap

testdb=#

Changing it with the ALTER TABLE command

The ALTER TABLE command can also set the compression method for an individual column.

testdb=# \d+ foo
                                           Table "public.foo"
 Column |  Type   | Collation | Nullable | Default | Storage  | Compression | Stats target | Description
--------+---------+-----------+----------+---------+----------+-------------+--------------+-------------
 id     | integer |           |          |         | plain    |             |              |
 data   | text    |           |          |         | extended | pglz        |              |
Access method: heap

testdb=# ALTER TABLE foo ALTER data SET COMPRESSION lz4;
ALTER TABLE
testdb=# \d+ foo
                                           Table "public.foo"
 Column |  Type   | Collation | Nullable | Default | Storage  | Compression | Stats target | Description
--------+---------+-----------+----------+---------+----------+-------------+--------------+-------------
 id     | integer |           |          |         | plain    |             |              |
 data   | text    |           |          |         | extended | lz4         |              |
Access method: heap

testdb=#
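
Besides these two routes, CREATE TABLE in PostgreSQL 14 also accepts a COMPRESSION clause on a column, so the method can be chosen at creation time regardless of default_toast_compression. A small sketch; the table name baz is made up for illustration.

```sql
-- Hypothetical table: choose the compression method per column at creation time.
CREATE TABLE baz (
    id   int,
    data text COMPRESSION lz4
);

-- Switch the column back later; only values stored after this point use pglz.
ALTER TABLE baz ALTER COLUMN data SET COMPRESSION pglz;
```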

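What happens to data that is already stored?

Suppose a few rows are inserted into a column whose compression method is pglz, and the column is then switched to lz4. What happens to the data that was already inserted and compressed with pglz?
The short answer: data that is already stored (already compressed) is left unchanged, just as when the TOAST storage strategy is changed. As a result, rows compressed with different methods end up coexisting in the same column.

Wait, doesn't that make it impossible to tell which row uses which compression method?
I thought so, but there is a pg_column_compression() function that shows how each row's value was compressed.
Note that this function returns NULL for short values that were never compressed by TOAST.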

testdb=# \d+ foo
                                           Table "public.foo"
 Column |  Type   | Collation | Nullable | Default | Storage  | Compression | Stats target | Description
--------+---------+-----------+----------+---------+----------+-------------+--------------+-------------
 id     | integer |           |          |         | plain    |             |              |
 data   | text    |           |          |         | extended | pglz        |              |
Access method: heap

testdb=# INSERT INTO foo VALUES (1, repeat(generate_random_text(200), 80));
INSERT 0 1
testdb=# ALTER TABLE foo ALTER data SET COMPRESSION lz4;
ALTER TABLE
testdb=# INSERT INTO foo VALUES (2, repeat(generate_random_text(200), 80));
INSERT 0 1
testdb=# INSERT INTO foo VALUES (3, repeat(generate_random_text(200), 1));
INSERT 0 1
testdb=# \pset null (null)
Null display is "(null)".
testdb=# SELECT id, pg_column_compression(data), pg_column_size(data) FROM foo;
 id | pg_column_compression | pg_column_size
----+-----------------------+----------------
  1 | pglz                  |            415
  2 | lz4                   |            280
  3 | (null)                |            204
(3 rows)

testdb=#
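
When a column holds a mix of methods like this, it can help to see how many rows use each one. The query below is a sketch of one way to summarize that with the same two functions; the label 'uncompressed' is just a placeholder chosen here for the NULL case.

```sql
-- Count rows and stored bytes per compression method in foo.data.
SELECT coalesce(pg_column_compression(data), 'uncompressed') AS method,
       count(*)                                              AS row_count,
       sum(pg_column_size(data))                             AS stored_bytes
FROM foo
GROUP BY 1
ORDER BY 1;
```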

Note: generate_random_text is a homemade PL/pgSQL function.
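
The original function is not shown, so the following is only a hypothetical stand-in that makes the sessions reproducible. It assumes the function takes a length and returns that many random single-byte alphanumeric characters, which is at least consistent with the 204-byte size reported for the 200-character value (200 bytes plus a 4-byte header).

```sql
-- Hypothetical stand-in for the author's generate_random_text(); not the original code.
CREATE OR REPLACE FUNCTION generate_random_text(len integer)
RETURNS text
LANGUAGE plpgsql
VOLATILE
AS $$
DECLARE
    chars  constant text := 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789';
    result text := '';
BEGIN
    FOR i IN 1..len LOOP
        -- Pick one character at a uniformly random 1-based position in chars.
        result := result || substr(chars, 1 + floor(random() * length(chars))::int, 1);
    END LOOP;
    RETURN result;
END;
$$;
```

Because the function is VOLATILE, it is evaluated again for every row of a multi-row INSERT, so each row gets its own random string.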

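Effect of switching to lz4 on processing time

So what difference does changing the compression method from pglz to lz4 make?
Let's take a quick measurement of the time to compress and store data with INSERT, and the time to read it back with SELECT.

Tables for the test

Prepare three tables for the test: t_pglz, t_lz4, and t_external. Define the data column of each table as follows.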

Table name	Storage of data	Compression of data
t_pglz	extended	pglz
t_lz4	extended	lz4
t_external	external	pglz
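
The DDL for these tables is not shown in the original. One way to create them on PostgreSQL 14, as a sketch, is to use the COMPRESSION clause of CREATE TABLE and then change the storage strategy of t_external with ALTER TABLE.

```sql
-- extended storage is the default for text, so only the compression method is set here.
CREATE TABLE t_pglz     (id int, data text COMPRESSION pglz);
CREATE TABLE t_lz4      (id int, data text COMPRESSION lz4);
CREATE TABLE t_external (id int, data text COMPRESSION pglz);

-- external: move long values out of line but never compress them.
ALTER TABLE t_external ALTER COLUMN data SET STORAGE EXTERNAL;
```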

With \d+ they look like this.

testdb=# \d+ t*
                                        Table "public.t_external"
 Column |  Type   | Collation | Nullable | Default | Storage  | Compression | Stats target | Description
--------+---------+-----------+----------+---------+----------+-------------+--------------+-------------
 id     | integer |           |          |         | plain    |             |              |
 data   | text    |           |          |         | external | pglz        |              |
Access method: heap

                                          Table "public.t_lz4"
 Column |  Type   | Collation | Nullable | Default | Storage  | Compression | Stats target | Description
--------+---------+-----------+----------+---------+----------+-------------+--------------+-------------
 id     | integer |           |          |         | plain    |             |              |
 data   | text    |           |          |         | extended | lz4         |              |
Access method: heap

                                          Table "public.t_pglz"
 Column |  Type   | Collation | Nullable | Default | Storage  | Compression | Stats target | Description
--------+---------+-----------+----------+---------+----------+-------------+--------------+-------------
 id     | integer |           |          |         | plain    |             |              |
 data   | text    |           |          |         | extended | pglz        |              |
Access method: heap

testdb=#

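Storage time with INSERT

For each table, run one INSERT that stores 10,000 rows, each holding a random 200-character string repeated 80 times (16,000 characters), and measure the elapsed time reported by \timing.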

testdb=# INSERT INTO t_pglz VALUES (generate_series(1, 10000), repeat(generate_random_text(200), 80));
INSERT 0 10000
Time: 1894.367 ms (00:01.894)
testdb=# INSERT INTO t_lz4 VALUES (generate_series(1, 10000), repeat(generate_random_text(200), 80));
INSERT 0 10000
Time: 697.585 ms
testdb=# INSERT INTO t_external VALUES (generate_series(1, 10000), repeat(generate_random_text(200), 80));
INSERT 0 10000
Time: 4586.773 ms (00:04.587)

Next, measure the size of each table.

testdb=# SELECT pg_total_relation_size('t_pglz');
 pg_total_relation_size
------------------------
                4595712
(1 row)

Time: 0.545 ms
testdb=# SELECT pg_total_relation_size('t_lz4');
 pg_total_relation_size
------------------------
                3317760
(1 row)

Time: 0.396 ms
testdb=# SELECT pg_total_relation_size('t_external');
 pg_total_relation_size
------------------------
              167288832
(1 row)

Time: 0.378 ms
testdb=#

Scan

Now use EXPLAIN ANALYZE to get the Seq Scan time for a full scan of the stored data. (The comparison is made with the data already in the buffer cache.)

testdb=# EXPLAIN ANALYZE SELECT id, data FROM t_pglz ;
                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------
 Seq Scan on t_pglz  (cost=0.00..656.00 rows=10000 width=418) (actual time=0.007..4.587 rows=10000 loops=1)
 Planning Time: 0.035 ms
 Execution Time: 8.545 ms
(3 rows)

testdb=# EXPLAIN ANALYZE SELECT id, data FROM t_lz4;
                                                QUERY PLAN
-----------------------------------------------------------------------------------------------------------
 Seq Scan on t_lz4  (cost=0.00..500.00 rows=10000 width=284) (actual time=0.007..4.522 rows=10000 loops=1)
 Planning Time: 0.034 ms
 Execution Time: 8.510 ms
(3 rows)

testdb=# EXPLAIN ANALYZE SELECT id, data FROM t_external;
                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Seq Scan on t_external  (cost=0.00..164.00 rows=10000 width=22) (actual time=0.008..4.405 rows=10000 loops=1)
 Planning Time: 0.037 ms
 Execution Time: 8.393 ms
(3 rows)

testdb=#
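
One possible explanation for the near-identical timings is that EXPLAIN ANALYZE throws the result rows away instead of sending them to a client, so the data values may never be decompressed or fetched from the TOAST table. The following sketch, which is not part of the original measurement, passes the column through md5() so that each value has to be read in full.

```sql
-- md5() needs the whole value, so each row is detoasted and decompressed.
EXPLAIN ANALYZE SELECT id, md5(data) FROM t_pglz;
EXPLAIN ANALYZE SELECT id, md5(data) FROM t_lz4;
EXPLAIN ANALYZE SELECT id, md5(data) FROM t_external;
```

md5() adds the same hashing cost to all three tables, so only the differences between the three timings are meaningful, not the absolute values.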

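Rough summary

Putting it together, it looks like this.

Table name	INSERT time (ms)	Seq Scan time (ms)	Relation size (bytes)
t_pglz	1894.367	4.587	4595712
t_lz4	697.585	4.522	3317760
t_external	4586.773	4.405	167288832

Storage time favors lz4: the INSERT into t_lz4 was about 2.7 times faster than the one into t_pglz.
Seq Scan time shows no significant difference among pglz, lz4, and external.
- Maybe my test model is poor?
The compression ratio of lz4 also looks better than that of pglz: t_lz4 is about 28% smaller than t_pglz.
- The INSERT into external is probably slow because its relation is about 36 to 50 times larger than the other two.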
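
To put numbers on the comparison, the ratios can be derived directly from the measurements in the table above. This is plain arithmetic on the measured values, written as a query so it can be pasted into psql.

```sql
-- Ratios computed from the measured INSERT times (ms) and relation sizes (bytes).
SELECT round(1894.367 / 697.585, 2)                   AS lz4_insert_speedup,
       round(100.0 * (1 - 3317760 / 4595712.0), 1)    AS lz4_size_reduction_pct,
       round(167288832 / 4595712.0, 1)                AS external_vs_pglz_size,
       round(167288832 / 3317760.0, 1)                AS external_vs_lz4_size;
```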

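Closing

In this test, storing data with lz4 turned out to be considerably faster than with pglz, but I would also like to measure storage performance with other long strings of the kind found in the real world.
(I am somewhat curious how Japanese text from Aozora Bunko would fare.)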
