はじめに
こちらのそーだいさんの記事にある、履歴テーブルから最新の1件を取ってくる方法のTiDB(MySQL)版です。
TiDB/MySQLは特殊な関数がないので、オーソドックスにWindow関数を使ったやり方を試してみます。
それだけだと面白くないので、本記事ではより大きなデータセットを用いて、実行計画の改善を試し、改善度合いを見てみることにします。また、これはTiFlash(HTAP)のアクセラレーションの良い例でもあるので、TiFlashも使ってみました。
準備
まずは、大量のデータセットを用意します。テーブル定義はそーだいさんのブログに記載のテーブル定義をそのまま利用し、それにあわせたデータを作成するプログラムを作成しました。
言語は何でもいいと思いますが、ダミーデータを作成するライブラリがあると楽です。ここではgo言語の gofakeit を使いました。
大量データの場合、多くのケースにおいて直接INSERTするよりCSV等をバルクインサートする方が速いです。
このプログラムもCSVを書き出しているだけです。
package main
import (
"encoding/csv"
"fmt"
"math/rand"
"os"
"strconv"
"time"
"github.com/brianvoe/gofakeit/v7"
)
func main() {
if len(os.Args) != 5 {
fmt.Println("Usage: datagen <seed> <number_of_rows> <max_id> <output_file>")
return
}
seed, err := strconv.ParseInt(os.Args[1], 10, 64)
if err != nil {
fmt.Println("Invalid seed:", os.Args[1])
return
}
numRows, err := strconv.Atoi(os.Args[2])
if err != nil {
fmt.Println("Invalid number of rows:", os.Args[2])
return
}
maxID, err := strconv.Atoi(os.Args[3])
if err != nil {
fmt.Println("Invalid max_id:", os.Args[3])
return
}
outputFile := os.Args[4]
gofakeit.Seed(seed)
file, err := os.Create(outputFile)
if err != nil {
fmt.Println("Could not create file:", outputFile)
return
}
defer file.Close()
writer := csv.NewWriter(file)
defer writer.Flush()
// Write CSV header
writer.Write([]string{"id", "user_id", "data", "created_at"})
for i := 0; i < numRows; i++ {
id := i + 1
user_id := rand.Intn(maxID) + 1
data := gofakeit.Sentence(10)
created_at := gofakeit.Date().Format(time.RFC3339)
writer.Write([]string{strconv.Itoa(id), strconv.Itoa(user_id), data, created_at})
}
}
とりあえずこれで10000ユーザー/1千万行くらいのCSVファイルを生成します。900MB程度ありますが、M1Macでの生成時間は数分といったところです。これをS3に保存します。
TiDB Cloud側であらかじめテーブルを作成しておきます。テーブル定義はそーだいさんのブログのものをそのまま利用しました。
-- MySQL
CREATE TABLE history (
id INT AUTO_INCREMENT PRIMARY KEY,
user_id INT NOT NULL,
data TEXT,
created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP
);
このテーブルに対し、S3からデータをimportしておきます。これで準備は完了です。
ベースラインのクエリパフォーマンス
では、何もせずそのままの状態でクエリを実行してみることにします。10000ユーザーのそれぞれの最新の1件を取得するクエリです。10000件出力すると長いので、100件にしぼります。
window 関数
SELECT *
FROM (SELECT history.*,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at DESC) AS ranking
FROM history) latest
WHERE ranking = 1
limit 100;
+---------+---------+---------------------------------------------------------------------------------+---------------------+---------+
| id | user_id | data | created_at | ranking |
+---------+---------+---------------------------------------------------------------------------------+---------------------+---------+
| 8644123 | 3 | You some however enchanted example troop riches Eastern this along. | 2024-12-11 20:17:55 | 1 |
| 4604397 | 14 | It is what recently colorful out dynasty water besides very. | 2024-12-07 09:03:51 | 1 |
| 8960784 | 24 | Tense whereas Italian yearly first sandwich yourself data it about. | 2024-12-29 09:05:30 | 1 |
...
| 4613743 | 494 | Example always you each occasionally out down troupe hers thoughtfully. | 2024-11-15 03:28:30 | 1 |
| 7343243 | 498 | Result those many joy his were today in you substantial. | 2024-12-30 11:11:45 | 1 |
| 7143736 | 501 | Plane where often in upon across batch whom point near. | 2024-11-19 03:43:41 | 1 |
+---------+---------+---------------------------------------------------------------------------------+---------------------+---------+
100 rows in set (23.42 sec)
23.42秒かかっています。
実行計画も見てみます。
mysql> explain analyze SELECT * FROM (SELECT history.*, ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at DESC) AS ranking FROM history) latest WHERE ranking = 1 limit 100;
+------------------------------------+-------------+----------+-----------+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------+
| id | estRows | actRows | task | access object | execution info | operator info | memory | disk |
+------------------------------------+-------------+----------+-----------+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------+
| Limit_14 | 100.00 | 100 | root | | time:16.6s, loops:2, RU:20481.047892 | offset:?, count:? | N/A | N/A |
| └─Selection_15 | 100.00 | 100 | root | | time:16.6s, loops:1 | eq(Column#6, ?) | 122.2 KB | N/A |
| └─Shuffle_20 | 100.00 | 99328 | root | | time:16.6s, loops:97, ShuffleConcurrency:5 | execution info: concurrency:5, data sources:[TableReader_18] | N/A | N/A |
| └─Window_16 | 100.00 | 103424 | root | | time:1m41.4s, loops:102 | row_number()->Column#6 over(partition by firsthist.history.user_id order by firsthist.history.created_at desc rows between current row and current row) | N/A | N/A |
| └─Sort_19 | 10000000.00 | 107520 | root | | time:1m41.4s, loops:106 | firsthist.history.user_id, firsthist.history.created_at:desc | 248.3 MB | 0 Bytes |
| └─ShuffleReceiver_21 | 10000000.00 | 10000000 | root | | time:1m8.7s, loops:9774 | | N/A | N/A |
| └─TableReader_18 | 10000000.00 | 10000000 | root | | time:53.5ms, loops:9782, cop_task: {num: 291, max: 68.6ms, min: 1.05ms, avg: 39.1ms, p95: 59.3ms, max_proc_keys: 50144, p95_proc_keys: 50144, tot_proc: 8.71s, tot_wait: 25ms, copr_cache: disabled, build_task_duration: 31.5µs, max_distsql_concurrency: 13}, rpc_info:{Cop:{num_rpc:291, total_time:11.4s}} | data:TableFullScan_17 | 49.6 MB | N/A |
| └─TableFullScan_17 | 10000000.00 | 10000000 | cop[tikv] | table:history | tikv_task:{proc max:50ms, min:0s, avg: 26.3ms, p80:40ms, p95:50ms, iters:10920, tasks:291}, scan_detail: {total_process_keys: 10000000, total_process_keys_size: 1142807139, total_keys: 10000291, get_snapshot_time: 18.4ms, rocksdb: {key_skipped_count: 10000000, block: {cache_hit_count: 39879}}}, time_detail: {total_process_time: 8.71s, total_suspend_time: 18.6ms, total_wait_time: 25ms, total_kv_read_wall_time: 7.64s, tikv_wall_time: 8.79s} | keep order:false | N/A | N/A |
+------------------------------------+-------------+----------+-----------+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------+
8 rows in set (22.65 sec)
ここで、cop[tikv]
はTiKVレイヤーの処理を、root
はTiDBレイヤーでの処理を示しています。全量TiKVから読み出して、TiDBサーバ側でフィルタ処理をしているようです。元のSQLからみても想像がつく実行計画ではないでしょうか。
集計関数
今回のように、日付型など大小比較が可能なものについては、集計演算を使う方法もあります(Window関数がそれほど広まってなかったときは、こちらのやり方が主流だったと思います)
ただこちらは厳密には先のwindow関数とは異なっていて、最新のレコードが複数あった場合、複数出てきてしまいます。
select h1.*
from history h1 join
(select user_id, max(created_at) as latest from history group by user_id) h2
on h1.created_at = h2.latest and h1.user_id = h2.user_id
limit 100;
+---------+---------+---------------------------------------------------------------------------------+---------------------+
| id | user_id | data | created_at |
+---------+---------+---------------------------------------------------------------------------------+---------------------+
| 1894623 | 2150 | Nevertheless of will an hers late it anyway tomorrow so. | 2024-12-05 09:49:01 |
| 5684905 | 9067 | It how nest place he rather muster electricity soon where. | 2024-07-12 18:13:00 |
| 948625 | 878 | Party this caravan untie mob Congolese these turn kiss library. | 2024-12-04 09:58:19 |
...
| 999449 | 5354 | Gossip heavily child lean may next class hmm it others. | 2024-12-10 01:56:49 |
| 999484 | 2252 | It dance hundreds respects regularly from had tonight that where. | 2024-12-15 12:05:14 |
| 1003066 | 9095 | Now you then handsome consequently neither hence oops daily whenever. | 2024-11-30 04:09:12 |
+---------+---------+---------------------------------------------------------------------------------+---------------------+
100 rows in set (1.43 sec)
1.43秒です。なんか間違ってるんじゃないかと思ってcount(distinct)してみましたが、結果は正しかったです。
こちらも実行計画をみてみます。
mysql> explain analyze select h1.* from history h1 join (select user_id, max(created_at) as latest from history group by user_id) h2 on h1.created_at = h2.latest and h1.user_id = h2.user_id limit 100;
+------------------------------------+-------------+----------+-----------+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+----------+---------+
| id | estRows | actRows | task | access object | execution info | operator info | memory | disk |
+------------------------------------+-------------+----------+-----------+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+----------+---------+
| Limit_16 | 100.00 | 100 | root | | time:1.39s, loops:2, RU:30174.050516 | offset:?, count:? | N/A | N/A |
| └─HashJoin_17 | 100.00 | 1024 | root | | time:1.39s, loops:1, build_hash_table:{total:806.9ms, fetch:804ms, build:2.93ms}, probe:{concurrency:5, total:6.96s, max:1.39s, probe:952.7ms, fetch and wait:6.01s} | inner join, equal:[eq(firsthist.history.created_at, Column#9) eq(firsthist.history.user_id, firsthist.history.user_id)] | 878.7 KB | 0 Bytes |
| ├─Selection_21(Build) | 8000.00 | 10000 | root | | time:804.4ms, loops:11 | not(isnull(Column#9)) | 19.0 KB | N/A |
| │ └─HashAgg_26 | 10000.00 | 10000 | root | | time:803.7ms, loops:12, partial_worker:{wall_time:801.523879ms, concurrency:5, task_num:88, tot_wait:3.896902186s, tot_exec:110.531283ms, tot_time:4.007506382s, max:801.508052ms, p95:801.508052ms}, final_worker:{wall_time:804.546583ms, concurrency:5, task_num:25, tot_wait:17.448µs, tot_exec:9.896434ms, tot_time:4.020676143s, max:804.534267ms, p95:804.534267ms} | group by:firsthist.history.user_id, funcs:max(Column#10)->Column#9, funcs:firstrow(firsthist.history.user_id)->firsthist.history.user_id | 12.5 MB | N/A |
| │ └─TableReader_27 | 10000.00 | 332777 | root | | time:798.6ms, loops:89, cop_task: {num: 90, max: 721.9ms, min: 1.36ms, avg: 90ms, p95: 713.4ms, max_proc_keys: 915864, p95_proc_keys: 915737, tot_proc: 7.75s, tot_wait: 14.9ms, copr_cache: disabled, build_task_duration: 15.7µs, max_distsql_concurrency: 13}, rpc_info:{Cop:{num_rpc:90, total_time:8.09s}} | data:HashAgg_22 | 340.6 KB | N/A |
| │ └─HashAgg_22 | 10000.00 | 332777 | cop[tikv] | | tikv_task:{proc max:720ms, min:0s, avg: 86ms, p80:10ms, p95:710ms, iters:9774, tasks:90}, scan_detail: {total_process_keys: 10000000, total_process_keys_size: 1142807139, total_keys: 10000090, get_snapshot_time: 13.2ms, rocksdb: {key_skipped_count: 10000000, block: {cache_hit_count: 39153}}}, time_detail: {total_process_time: 7.75s, total_suspend_time: 14.2ms, total_wait_time: 14.9ms, total_kv_read_wall_time: 6.75s, tikv_wall_time: 7.79s} | group by:firsthist.history.user_id, funcs:max(firsthist.history.created_at)->Column#10 | N/A | N/A |
| │ └─TableFullScan_25 | 10000000.00 | 10000000 | cop[tikv] | table:history | tikv_task:{proc max:649ms, min:0s, avg: 75ms, p80:10ms, p95:630ms, iters:9774, tasks:90} | keep order:false | N/A | N/A |
| └─TableReader_20(Probe) | 125000.00 | 4875872 | root | | time:402ms, loops:4767, cop_task: {num: 183, max: 71.1ms, min: 983µs, avg: 34ms, p95: 66.5ms, max_proc_keys: 50144, p95_proc_keys: 50144, tot_proc: 4.4s, tot_wait: 20.5ms, copr_cache: disabled, build_task_duration: 34.1µs, max_distsql_concurrency: 13}, rpc_info:{Cop:{num_rpc:183, total_time:6.22s}} | data:TableFullScan_19 | 18.0 MB | N/A |
| └─TableFullScan_19 | 125000.00 | 4905568 | cop[tikv] | table:h1 | tikv_task:{proc max:50ms, min:0s, avg: 20.8ms, p80:40ms, p95:50ms, iters:5503, tasks:183}, scan_detail: {total_process_keys: 4905568, total_process_keys_size: 560603018, total_keys: 4905751, get_snapshot_time: 16.3ms, rocksdb: {key_skipped_count: 4905568, block: {cache_hit_count: 19694}}}, time_detail: {total_process_time: 4.4s, total_suspend_time: 9.64ms, total_wait_time: 20.5ms, total_kv_read_wall_time: 3.81s, tikv_wall_time: 4.49s} | keep order:false | N/A | N/A |
+------------------------------------+-------------+----------+-----------+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+----------+---------+
9 rows in set (1.40 sec)
こちらは、サブクエリ部分の集計演算と、それを元にHashJoinを行う二段階に分かれています。TiKVは集計演算をpushdownで処理できる機能を備えており、集計演算は効率的に分散処理されます。HashJoinも同様に分散処理されます。
このようにうまくTiKVに処理を散らすことで、高速に処理ができたのではないかと考えられます。
パフォーマンス改善
次に、パフォーマンスの改善方法について考えてみます。今回の例は created_at
で最新のものを取るという簡単な検索条件ですので、そーだいさんのブログで言及されているように、インデックスの効果がありそうです。
とりあえずどちらのクエリでも使われている {created_at, user_id}
のインデックスを作ってみます。
mysql> alter table history add index idx_created_at_user_id (created_at, user_id);
Query OK, 0 rows affected (48.05 sec)
window関数
window関数の実行計画に変更はなく、実行時間も変わりません。
mysql> explain analyze SELECT * FROM (SELECT history.*, ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at DESC) AS ranking FROM history) latest WHERE ranking = 1 limit 100;
+------------------------------------+-------------+----------+-----------+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------+
| id | estRows | actRows | task | access object | execution info | operator info | memory | disk |
+------------------------------------+-------------+----------+-----------+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------+
| Limit_14 | 100.00 | 100 | root | | time:17s, loops:2, RU:20467.833327 | offset:?, count:? | N/A | N/A |
| └─Selection_15 | 100.00 | 100 | root | | time:17s, loops:1 | eq(Column#6, ?) | 122.2 KB | N/A |
| └─Shuffle_20 | 100.00 | 99328 | root | | time:17s, loops:97, ShuffleConcurrency:5 | execution info: concurrency:5, data sources:[TableReader_18] | N/A | N/A |
| └─Window_16 | 100.00 | 103424 | root | | time:1m44.4s, loops:102 | row_number()->Column#6 over(partition by firsthist.history.user_id order by firsthist.history.created_at desc rows between current row and current row) | N/A | N/A |
| └─Sort_19 | 10000000.00 | 107520 | root | | time:1m44.4s, loops:106 | firsthist.history.user_id, firsthist.history.created_at:desc | 248.3 MB | 0 Bytes |
| └─ShuffleReceiver_21 | 10000000.00 | 10000000 | root | | time:1m9.8s, loops:9774 | | N/A | N/A |
| └─TableReader_18 | 10000000.00 | 10000000 | root | | time:49ms, loops:9782, cop_task: {num: 291, max: 63.9ms, min: 483.8µs, avg: 38.8ms, p95: 58.9ms, max_proc_keys: 50144, p95_proc_keys: 50144, tot_proc: 8.68s, tot_wait: 11.9ms, copr_cache: disabled, build_task_duration: 34.6µs, max_distsql_concurrency: 13}, rpc_info:{Cop:{num_rpc:291, total_time:11.3s}} | data:TableFullScan_17 | 49.6 MB | N/A |
| └─TableFullScan_17 | 10000000.00 | 10000000 | cop[tikv] | table:history | tikv_task:{proc max:50ms, min:0s, avg: 26.7ms, p80:40ms, p95:50ms, iters:10920, tasks:291}, scan_detail: {total_process_keys: 10000000, total_process_keys_size: 1142807139, total_keys: 10000291, get_snapshot_time: 5.32ms, rocksdb: {key_skipped_count: 10000000, block: {cache_hit_count: 39879}}}, time_detail: {total_process_time: 8.68s, total_suspend_time: 18.9ms, total_wait_time: 11.9ms, total_kv_read_wall_time: 7.78s, tikv_wall_time: 8.74s} | keep order:false | N/A | N/A |
+------------------------------------+-------------+----------+-----------+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------+
8 rows in set (23.38 sec)
集計関数
集計関数では、作成したインデックスが利用されていました。
mysql> alter table history drop index idx_user_id_created_at;
Query OK, 0 rows affected (0.55 sec)
mysql> explain analyze select h1.* from history h1 join (select user_id, max(created_at) as latest from history group by user_id) h2 on h1.created_at = h2.latest and h1.user_id = h2.user_id limit 100;
+------------------------------------+-------------+----------+-----------+------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+------+
| id | estRows | actRows | task | access object | execution info | operator info | memory | disk |
+------------------------------------+-------------+----------+-----------+------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+------+
| Limit_16 | 100.00 | 100 | root | | time:1.13s, loops:2, RU:11177.269251 | offset:?, count:? | N/A | N/A |
| └─IndexHashJoin_22 | 100.00 | 512 | root | | time:1.13s, loops:1, inner:{total:122.5ms, concurrency:5, task:8, construct:13ms, fetch:108.7ms, build:2.3ms, join:777.7µs} | inner join, inner:IndexLookUp_19, outer key:Column#9, firsthist.history.user_id, inner key:firsthist.history.created_at, firsthist.history.user_id, equal cond:eq(Column#9, firsthist.history.created_at), eq(firsthist.history.user_id, firsthist.history.user_id) | 3.53 MB | N/A |
| ├─Selection_30(Build) | 100.00 | 10000 | root | | time:1.1s, loops:15 | not(isnull(Column#9)) | 19.0 KB | N/A |
| │ └─HashAgg_36 | 100.00 | 10000 | root | | time:1.1s, loops:13, partial_worker:{wall_time:1.10178037s, concurrency:5, task_num:98, tot_wait:5.395027701s, tot_exec:113.363927ms, tot_time:5.508474641s, max:1.101718142s, p95:1.101718142s}, final_worker:{wall_time:1.10493501s, concurrency:5, task_num:25, tot_wait:16.524µs, tot_exec:10.326862ms, tot_time:5.522236628s, max:1.104834178s, p95:1.104834178s} | group by:firsthist.history.user_id, funcs:max(Column#10)->Column#9, funcs:firstrow(firsthist.history.user_id)->firsthist.history.user_id | 12.5 MB | N/A |
| │ └─IndexReader_37 | 100.00 | 363373 | root | | time:1.1s, loops:99, cop_task: {num: 100, max: 1.06s, min: 1.44ms, avg: 83.1ms, p95: 748.1ms, max_proc_keys: 1292773, p95_proc_keys: 928256, tot_proc: 8.12s, tot_wait: 21.8ms, copr_cache: disabled, build_task_duration: 27.1ms, max_distsql_concurrency: 15}, rpc_info:{Cop:{num_rpc:100, total_time:8.3s}} | index:HashAgg_31 | 312.9 KB | N/A |
| │ └─HashAgg_31 | 100.00 | 363373 | cop[tikv] | | tikv_task:{proc max:1.06s, min:0s, avg: 80.4ms, p80:10ms, p95:750ms, iters:9776, tasks:100}, scan_detail: {total_process_keys: 10000000, total_process_keys_size: 550000000, total_keys: 10000100, get_snapshot_time: 19.7ms, rocksdb: {key_skipped_count: 10000000, block: {cache_hit_count: 712, read_count: 19000, read_byte: 175.8 MB, read_time: 120.9ms}}}, time_detail: {total_process_time: 8.12s, total_suspend_time: 18.6ms, total_wait_time: 21.8ms, total_kv_read_wall_time: 7.06s, tikv_wall_time: 8.18s} | group by:firsthist.history.user_id, funcs:max(firsthist.history.created_at)->Column#10 | N/A | N/A |
| │ └─IndexFullScan_35 | 10000000.00 | 10000000 | cop[tikv] | table:history, index:idx_created_at_user_id(created_at, user_id) | tikv_task:{proc max:910ms, min:0s, avg: 70.6ms, p80:10ms, p95:650ms, iters:9776, tasks:100} | keep order:false | N/A | N/A |
| └─IndexLookUp_19(Probe) | 100.00 | 960 | root | | time:94.6ms, loops:12, index_task: {total_time: 37.4ms, fetch_handle: 37.4ms, build: 5.42µs, wait: 12.4µs}, table_task: {total_time: 22.4ms, num: 4, concurrency: 5}, next: {wait_index: 37.8ms, wait_table_lookup_build: 863.5µs, wait_table_lookup_resp: 21.5ms} | | 32.4 KB | N/A |
| ├─IndexRangeScan_17(Build) | 100.00 | 1408 | cop[tikv] | table:h1, index:idx_created_at_user_id(created_at, user_id) | time:37.3ms, loops:12, cop_task: {num: 9, max: 8.75ms, min: 0s, avg: 5.09ms, p95: 8.75ms, max_proc_keys: 288, p95_proc_keys: 288, tot_proc: 28.8ms, tot_wait: 270.3µs, copr_cache: disabled, build_task_duration: 268µs, max_distsql_concurrency: 3}, rpc_info:{Cop:{num_rpc:9, total_time:59ms}, rpc_errors:{context canceled:2}}, tikv_task:{proc max:10ms, min:0s, avg: 6.25ms, p80:10ms, p95:10ms, iters:23, tasks:8}, scan_detail: {total_process_keys: 1184, total_process_keys_size: 65120, total_keys: 2370, get_snapshot_time: 113.3µs, rocksdb: {key_skipped_count: 1184, block: {cache_hit_count: 8302}}}, time_detail: {total_process_time: 28.8ms, total_suspend_time: 479.1µs, total_wait_time: 270.3µs, total_kv_read_wall_time: 50ms, tikv_wall_time: 31.6ms} | range: decided by [eq(firsthist.history.created_at, Column#9) eq(firsthist.history.user_id, firsthist.history.user_id)], keep order:false | N/A | N/A |
| └─TableRowIDScan_18(Probe) | 100.00 | 960 | cop[tikv] | table:h1 | time:21.4ms, loops:8, cop_task: {num: 44, max: 5.3ms, min: 0s, avg: 899.6µs, p95: 3ms, max_proc_keys: 59, p95_proc_keys: 53, tot_proc: 12.7ms, tot_wait: 18.4ms, copr_cache: disabled, build_task_duration: 262.1µs, max_distsql_concurrency: 10, max_extra_concurrency: 1, store_batch_num: 24}, rpc_info:{Cop:{num_rpc:20, total_time:39.3ms}}, tikv_task:{proc max:10ms, min:0s, avg: 227.3µs, p80:0s, p95:0s, iters:54, tasks:44}, scan_detail: {total_process_keys: 960, total_process_keys_size: 109651, total_keys: 960, get_snapshot_time: 17.6ms, rocksdb: {block: {cache_hit_count: 1979}}}, time_detail: {total_process_time: 12.7ms, total_wait_time: 18.4ms, total_kv_read_wall_time: 10ms, tikv_wall_time: 16.5ms} | keep order:false | N/A | N/A |
+------------------------------------+-------------+----------+-----------+------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+------+
10 rows in set (1.17 sec)
今回の例では、インデックスは集計でもJOINのキーとしても利用されているので、IndexHashJoinの両側で使われているのがわかります。
ただ、性能はそこまで変わりません。TiDBの場合、IndexもKey-Valueとして保存されるので、今回のようにデータ量がそんなに変わらなければ、Index Full ScanもTable Full Scanもコストはほぼ一緒です。実際のレコードはもっと大きいのでIndex Full ScanとTable full Scanの差は出るかもしれません。
TiFlash(HTAP)
さて、TiDBはTiFlashというカラムナストレージも持っています。これは特に集計演算などの列を軸とする演算に向いています。また、多くの関数を実装しておりpushdownも可能です。こちらの効果を見てみましょう。
まずはTiflashにテーブルをレプリケーションします。
mysql> alter table history set tiflash replica 1;
window関数
mysql> explain analyze SELECT * FROM (SELECT history.*, ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at DESC) AS ranking FROM history) latest WHERE ranking = 1 limit 100;
+--------------------------------------+-------------+---------+--------------+---------------+--------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+------+
| id | estRows | actRows | task | access object | execution info | operator info | memory | disk |
+--------------------------------------+-------------+---------+--------------+---------------+--------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+------+
| Limit_14 | 100.00 | 100 | root | | time:1.17s, loops:2, RU:0.000000 | offset:?, count:? | N/A | N/A |
| └─TableReader_35 | 100.00 | 1024 | root | | time:1.17s, loops:1, cop_task: {num: 1, max: 0s, proc_keys: 0, copr_cache: disabled} | MppVersion: 2, data:ExchangeSender_34 | 105.3 KB | N/A |
| └─ExchangeSender_34 | 100.00 | 0 | mpp[tiflash] | | | ExchangeType: PassThrough | N/A | N/A |
| └─Selection_17 | 100.00 | 0 | mpp[tiflash] | | | eq(Column#6, ?), stream_count: 8 | N/A | N/A |
| └─Window_33 | 100.00 | 0 | mpp[tiflash] | | | row_number()->Column#6 over(partition by firsthist.history.user_id order by firsthist.history.created_at desc rows between current row and current row), stream_count: 8 | N/A | N/A |
| └─Sort_23 | 10000000.00 | 0 | mpp[tiflash] | | | firsthist.history.user_id, firsthist.history.created_at:desc, stream_count: 8 | N/A | N/A |
| └─ExchangeReceiver_22 | 10000000.00 | 0 | mpp[tiflash] | | | stream_count: 8 | N/A | N/A |
| └─ExchangeSender_21 | 10000000.00 | 0 | mpp[tiflash] | | | ExchangeType: HashPartition, Compression: FAST, Hash Cols: [name: firsthist.history.user_id, collate: binary], stream_count: 8 | N/A | N/A |
| └─TableFullScan_20 | 10000000.00 | 0 | mpp[tiflash] | table:history | | keep order:false | N/A | N/A |
+--------------------------------------+-------------+---------+--------------+---------------+--------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+------+
9 rows in set (1.25 sec)
TiKV、TiDBで行っていた処理のほとんどをTiFlashで実行しています。処理時間も20秒あまりから1.25秒と、劇的に改善しました。
集計関数
mysql> explain analyze select h1.* from history h1 join (select user_id, max(created_at) as latest from history group by user_id) h2 on h1.created_at = h2.latest and h1.user_id = h2.user_id limit 100;
+--------------------------------------------+-------------+----------+--------------+-------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------+------+
| id | estRows | actRows | task | access object | execution info | operator info | memory | disk |
+--------------------------------------------+-------------+----------+--------------+-------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------+------+
| Limit_16 | 100.00 | 100 | root | | time:158.4ms, loops:2, RU:2033.104218 | offset:?, count:? | N/A | N/A |
| └─IndexHashJoin_24 | 100.00 | 1024 | root | | time:158.4ms, loops:1, inner:{total:362.5ms, concurrency:5, task:6, construct:9.05ms, fetch:352.2ms, build:1.62ms, join:1.15ms} | inner join, inner:IndexLookUp_21, outer key:Column#9, firsthist.history.user_id, inner key:firsthist.history.created_at, firsthist.history.user_id, equal cond:eq(Column#9, firsthist.history.created_at), eq(firsthist.history.user_id, firsthist.history.user_id) | 2.77 MB | N/A |
| ├─TableReader_90(Build) | 100.00 | 10000 | root | | time:88.8ms, loops:18, cop_task: {num: 9, max: 0s, min: 0s, avg: 0s, p95: 0s, copr_cache: disabled} | MppVersion: 2, data:ExchangeSender_89 | 19.7 KB | N/A |
| │ └─ExchangeSender_89 | 100.00 | 10000 | mpp[tiflash] | | tiflash_task:{time:80.2ms, loops:8, threads:8} | ExchangeType: PassThrough | N/A | N/A |
| │ └─Selection_51 | 100.00 | 10000 | mpp[tiflash] | | tiflash_task:{time:80.2ms, loops:8, threads:8} | not(isnull(Column#9)) | N/A | N/A |
| │ └─Projection_85 | 100.00 | 10000 | mpp[tiflash] | | tiflash_task:{time:80.2ms, loops:8, threads:8} | Column#9, firsthist.history.user_id | N/A | N/A |
| │ └─HashAgg_86 | 100.00 | 10000 | mpp[tiflash] | | tiflash_task:{time:80.2ms, loops:8, threads:8} | group by:firsthist.history.user_id, funcs:max(Column#18)->Column#9, funcs:firstrow(firsthist.history.user_id)->firsthist.history.user_id, stream_count: 8 | N/A | N/A |
| │ └─ExchangeReceiver_88 | 100.00 | 10000 | mpp[tiflash] | | tiflash_task:{time:80.2ms, loops:8, threads:8} | stream_count: 8 | N/A | N/A |
| │ └─ExchangeSender_87 | 100.00 | 10000 | mpp[tiflash] | | tiflash_task:{time:86ms, loops:1, threads:1} | ExchangeType: HashPartition, Compression: FAST, Hash Cols: [name: firsthist.history.user_id, collate: binary], stream_count: 8 | N/A | N/A |
| │ └─HashAgg_83 | 100.00 | 10000 | mpp[tiflash] | | tiflash_task:{time:86ms, loops:1, threads:1} | group by:firsthist.history.user_id, funcs:max(firsthist.history.created_at)->Column#18 | N/A | N/A |
| │ └─TableFullScan_41 | 10000000.00 | 10000000 | mpp[tiflash] | table:history | tiflash_task:{time:36ms, loops:159, threads:8}, tiflash_scan:{mvcc_input_rows:0, mvcc_input_bytes:0, mvcc_output_rows:0, lm_skip_rows:0, local_regions:13, remote_regions:0, tot_learner_read:2ms, region_balance:{instance_num: 1, max/min: 13/13=1.000000}, delta_rows:0, delta_bytes:0, segments:11, stale_read_regions:0, tot_build_snapshot:0ms, tot_build_bitmap:0ms, tot_build_inputstream:1ms, min_local_stream:9ms, max_local_stream:29ms, dtfile:{data_scanned_rows:10000000, data_skipped_rows:0, mvcc_scanned_rows:0, mvcc_skipped_rows:0, lm_filter_scanned_rows:0, lm_filter_skipped_rows:0, tot_rs_index_check:0ms, tot_read:119ms}} | keep order:false | N/A | N/A |
| └─IndexLookUp_21(Probe) | 100.00 | 1250 | root | | time:340.3ms, loops:8, index_task: {total_time: 172.8ms, fetch_handle: 172.8ms, build: 3.65µs, wait: 16.6µs}, table_task: {total_time: 10.9ms, num: 3, concurrency: 5}, next: {wait_index: 107.9ms, wait_table_lookup_build: 930.4µs, wait_table_lookup_resp: 7.8ms} | | 73.2 KB | N/A |
| ├─IndexRangeScan_19(Build) | 100.00 | 4387 | cop[tikv] | table:h1, index:idx_created_at_user_id(created_at, user_id) | time:172.7ms, loops:7, cop_task: {num: 17, max: 44.6ms, min: 0s, avg: 19.8ms, p95: 44.6ms, max_proc_keys: 480, p95_proc_keys: 480, tot_proc: 84.9ms, tot_wait: 5.8ms, copr_cache: disabled, build_task_duration: 9.08ms, max_distsql_concurrency: 3}, rpc_info:{Cop:{num_rpc:17, total_time:344ms}, rpc_errors:{context canceled:1}}, tikv_task:{proc max:10ms, min:0s, avg: 6.47ms, p80:10ms, p95:10ms, iters:50, tasks:17}, scan_detail: {total_process_keys: 3907, total_process_keys_size: 214885, total_keys: 7823, get_snapshot_time: 5.43ms, rocksdb: {key_skipped_count: 3907, block: {cache_hit_count: 19577, read_count: 1, read_byte: 32.0 KB, read_time: 46µs}}}, time_detail: {total_process_time: 84.9ms, total_suspend_time: 3ms, total_wait_time: 5.8ms, total_kv_read_wall_time: 100ms, tikv_wall_time: 103.9ms} | range: decided by [eq(firsthist.history.created_at, Column#9) eq(firsthist.history.user_id, firsthist.history.user_id)], keep order:false | N/A | N/A |
| └─TableRowIDScan_20(Probe) | 100.00 | 1250 | cop[tikv] | table:h1 | time:8.96ms, loops:5, cop_task: {num: 33, max: 3.92ms, min: 0s, avg: 1.09ms, p95: 3.63ms, max_proc_keys: 112, p95_proc_keys: 104, tot_proc: 15.4ms, tot_wait: 13.4ms, copr_cache: disabled, build_task_duration: 230.5µs, max_distsql_concurrency: 11, store_batch_num: 8}, rpc_info:{Cop:{num_rpc:25, total_time:48.3ms}, rpc_errors:{context canceled:11}}, tikv_task:{proc max:0s, min:0s, avg: 0s, p80:0s, p95:0s, iters:38, tasks:22}, scan_detail: {total_process_keys: 1250, total_process_keys_size: 142970, total_keys: 1250, get_snapshot_time: 12.7ms, rocksdb: {block: {cache_hit_count: 3239, read_count: 1, read_byte: 32.0 KB, read_time: 16.1µs}}}, time_detail: {total_process_time: 15.4ms, total_suspend_time: 75.3µs, total_wait_time: 13.4ms, tikv_wall_time: 19.8ms} | keep order:false | N/A | N/A |
+--------------------------------------------+-------------+----------+--------------+-------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------+------+
14 rows in set (0.21 sec)
なんと200msになりました。集計演算はTiFlashで行い、HashJoinのテーブル側はTiKVで行うという合せ技です。TiFlashのデータ断面はTiKVと一致しているため、同じ実行計画の中で両方のエンジンを使うことも可能です。
おわりに
最新のレコードを引くというテーマをTiDBで実行してみました。今回おこなった改善案はごく一部に過ぎませんし、汎用的なものではありませんが、様々な効率化手段を試してみるきっかけになればと思います。