Re: Re: 履歴テーブルから最新の1件を取ってくる方法

Last updated at 2024-12-10Posted at 2024-12-10

この記事の目的

本日立て続けにXのタイムラインに流れてきた次の二つの記事へのalternativeを、DuckDBで書いてみようと思う

結論

最初に結論だが、手法としては二つ目の記事の choplin 氏の提案と概念は同じであり、使う関数が異なるだけである。実行環境がDuckDBという特殊性もあるが。

Prerequisite

brew install duckdb
duckdb

-- DuckDBに `SERIAL` は存在しない
CREATE SEQUENCE id START 1;
CREATE TABLE history (
    id INTEGER DEFAULT nextval('id') PRIMARY KEY,
    user_id INTEGER NOT NULL,
    data TEXT,
    created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);

INSERT INTO history (user_id, data, created_at) VALUES
(1, 'First entry of user1',  '2024-01-01 10:00:00'),
(1, 'Second entry of user1', '2024-01-02 09:30:00'),
(2, 'First entry of user2',  '2024-01-01 11:00:00'),
(2, 'Second entry of user2', '2024-01-02 08:45:00'),
(2, 'Third entry of user2',  '2024-01-03 07:15:00'),
(3, 'First entry of user3',  '2024-01-01 12:15:00');

解説

DuckDBでは unnest関数の引数に recursive:=true を指定して以下で書くことで展開できる。

max_by での書き方

SELECT 
  unnest(
    max_by(history, created_at, 1),
    recursive:=true
  )
FROM history
GROUP BY user_id;

┌───────┬─────────┬───────────────────────┬─────────────────────┐
│  id   │ user_id │         data          │     created_at      │
│ int32 │  int32  │        varchar        │      timestamp      │
├───────┼─────────┼───────────────────────┼─────────────────────┤
│     2 │       1 │ Second entry of user1 │ 2024-01-02 09:30:00 │
│     5 │       2 │ Third entry of user2  │ 2024-01-03 07:15:00 │
│     6 │       3 │ First entry of user3  │ 2024-01-01 12:15:00 │
└───────┴─────────┴───────────────────────┴─────────────────────┘

any_value での書き方

SELECT
  unnest(
    any_value(history ORDER BY created_at DESC),
    recursive:=true
  )
FROM history
GROUP BY user_id;

┌───────┬─────────┬───────────────────────┬─────────────────────┐
│  id   │ user_id │         data          │     created_at      │
│ int32 │  int32  │        varchar        │      timestamp      │
├───────┼─────────┼───────────────────────┼─────────────────────┤
│     2 │       1 │ Second entry of user1 │ 2024-01-02 09:30:00 │
│     5 │       2 │ Third entry of user2  │ 2024-01-03 07:15:00 │
│     6 │       3 │ First entry of user3  │ 2024-01-01 12:15:00 │
└───────┴─────────┴───────────────────────┴─────────────────────┘

PostgreSQLの例では複合型を使いさらに列に展開する場合 .* を使っているが、この記法は（PostgreSQLと同じSyntaxを採用しているDuckDBだが Composite Types は存在しないので）DuckDBではエラーとなる。

-- PostgreSQL
SELECT 
  (any_value(history ORDER BY created_at DESC)).*
FROM history 
GROUP BY user_id;

max_by と any_value の違い

実行計画は一か所だけ異なる。なぜだかよくわからんが any_value の実行計画には UNNEST が存在していないため、微妙に早い。結論としては any_value を使うのが良さそう。一方で max_by のメリットは、第三引数を2以上の指定で最新のn件を取得することができるという違いがある。

EXPLAIN ANALYZE SELECT
    unnest(
      max_by(history, created_at, 1),
      recursive:=1
    )
  FROM history
  GROUP BY user_id;
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││    Query Profiling Information    ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
EXPLAIN ANALYZE SELECT    unnest(     max_by(history, created_at, 1),     recursive:=1   ) FROM history GROUP BY user_id;
┌────────────────────────────────────────────────┐
│┌──────────────────────────────────────────────┐│
││              Total Time: 0.0008s             ││
│└──────────────────────────────────────────────┘│
└────────────────────────────────────────────────┘
┌───────────────────────────┐
│           QUERY           │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│      EXPLAIN_ANALYZE      │
│    ────────────────────   │
│           0 Rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         PROJECTION        │
│    ────────────────────   │
│             id            │
│          user_id          │
│            data           │
│         created_at        │
│                           │
│           3 Rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│           UNNEST          │
│    ────────────────────   │
│           3 Rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│   PERFECT_HASH_GROUP_BY   │
│    ────────────────────   │
│         Groups: #0        │
│                           │
│        Aggregates:        │
│     max_by(#1, #2, #3)    │
│                           │
│           3 Rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         PROJECTION        │
│    ────────────────────   │
│          user_id          │
│  struct_pack(id, user_id, │
│      data, created_at)    │
│         created_at        │
│             1             │
│                           │
│           6 Rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         TABLE_SCAN        │
│    ────────────────────   │
│          history          │
│                           │
│        Projections:       │
│          user_id          │
│             id            │
│            data           │
│         created_at        │
│                           │
│           6 Rows          │
│          (0.00s)          │
└───────────────────────────┘

EXPLAIN ANALYZE SELECT
    unnest(
      any_value(history ORDER BY created_at DESC),
      recursive:=1
    )
  FROM history
  GROUP BY user_id;
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││    Query Profiling Information    ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
EXPLAIN ANALYZE SELECT   unnest(     any_value(history ORDER BY created_at DESC),     recursive:=1   ) FROM history GROUP BY user_id;
┌────────────────────────────────────────────────┐
│┌──────────────────────────────────────────────┐│
││              Total Time: 0.0006s             ││
│└──────────────────────────────────────────────┘│
└────────────────────────────────────────────────┘
┌───────────────────────────┐
│           QUERY           │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│      EXPLAIN_ANALYZE      │
│    ────────────────────   │
│           0 Rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         PROJECTION        │
│    ────────────────────   │
│             id            │
│          user_id          │
│            data           │
│         created_at        │
│                           │
│           3 Rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│   PERFECT_HASH_GROUP_BY   │
│    ────────────────────   │
│         Groups: #0        │
│                           │
│        Aggregates:        │
│      arg_min(#1, #2)      │
│                           │
│           3 Rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         PROJECTION        │
│    ────────────────────   │
│          user_id          │
│  struct_pack(id, user_id, │
│      data, created_at)    │
│ create_sort_key(created_at│
│    , 'DESC NULLS LAST')   │
│                           │
│           6 Rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         TABLE_SCAN        │
│    ────────────────────   │
│          history          │
│                           │
│        Projections:       │
│          user_id          │
│             id            │
│            data           │
│         created_at        │
│                           │
│           6 Rows          │
│          (0.00s)          │
└───────────────────────────┘

経験を伴わない勝手な想像

PostgreSQLのテーブルにて、history系のテーブルでは業務ドメイン次第だがuuid型の user_id 列で cardinality が高いケースでは Hash index が、created_at など実時間の順番通り挿入される列は BRIN index などが効くのではないか、と勝手に想像してみた。

その他

PostgreSQLでは「DISTINCT ON（PostgreSQL拡張）」を使って書くことができるが、近年の処理系(Snowflake, BigQuery, DuckDB, etc.) では qualify句 でwindow関数の指定ができるのでよく目にするようになってきている。ただしこいつは標準SQLではないので移植性が低いことに注意。

SELECT *
FROM history
QUALIFY ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at DESC) = 1;

以上

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up