先日リリースされたMySQL 8.0.3 RCでは、オプティマイザでヒストグラム統計が使えるようになりました。
これを使うと、INDEXがないカラムでも、値の分布から行の絞り込みが可能になります。
- 8.9.6 Optimizer Statistics(MySQL 8.0 Reference Manual)
- 13.7.2.1 ANALYZE TABLE Syntax(MySQL 8.0 Reference Manual)
さっそく試してみよう…と思ったら、すでに試してブログ記事にされている方がいらっしゃいました。
- UCO-Tech(MySQL 8.0 RCヒストグラム統計)(UCOブログ)
基本的な使い方は↑のブログ記事のほうがわかりやすいと思いますのでそちらを見ていただくとして、私のほうはちょっとだけ遊んでみます。
1. 一応、基本的な使い方
- INDEXを作成していないカラムを対象に、**「ANALYZE TABLE 【テーブル名】 UPDATE HISTOGRAM ON 【カラム名】 WITH 【バケット数】 BUCKETS」**でヒストグラム統計を有効にすると使える
- バケット数の指定で全行をいくつに分割するかを決める
- 「カーディナリティ≦バケット数」の場合はタイプ「singleton」(有効な1つのバケット⇒1種類の値を持つ行の集合)
- 「カーディナリティ>バケット数」の場合はタイプ「equi-height」(1つのバケット⇒ある範囲の値を持つ行の集合)
- **「ANALYZE TABLE 【テーブル名】 DROP HISTOGRAM ON 【カラム名】」**でヒストグラム統計を無効にできる
2. 注意点
- **行を挿入・更新・削除してもヒストグラム統計は更新されない。**その後に「ANALYZE TABLE UPDATE HISTOGRAM」を実行することでヒストグラム統計が更新される
- テーブルスペース暗号化(こちらの記事を参照)しているテーブルには使えない
- **「innodb_read_only」**が有効なサーバでは使えない(オプティマイザ統計情報を保持するデータディクショナリがInnoDBを使っているので)
3. 実験
以下、長いので、飽きたら右側のリンクで「4. まとめ」まで読み飛ばしてください。
3-0. 準備
$ mysql -u root -p
Enter password:
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 9
Server version: 8.0.3-rc-log MySQL Community Server (GPL)
Copyright (c) 2000, 2017, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql> CREATE DATABASE hist_test;
Query OK, 1 row affected (0.01 sec)
mysql> USE hist_test;
Database changed
mysql> CREATE TABLE test_data (id INT(10) PRIMARY KEY AUTO_INCREMENT, str VARCHAR(10), val1 INT(1), val2 INT(1), val3 INT(1));
Query OK, 0 rows affected (0.02 sec)
(テストデータを投入/以下はその内容)
mysql> SELECT * FROM hist_test.test_data;
+-----+------------+------+------+------+
| id | str | val1 | val2 | val3 |
+-----+------------+------+------+------+
| 1 | abcdefg | 0 | 1 | 0 |
| 2 | hijklmn | 1 | 2 | 4 |
| 3 | opqrstu | 0 | 3 | 1 |
| 4 | vwxyzab | 1 | 4 | 3 |
| 5 | cdefghi | 0 | 5 | 2 |
(中略)
| 51 | vwxyzabcd | 0 | 1 | NULL |
| 52 | efghijklm | 0 | 2 | NULL |
| 53 | nopqrstuv | 0 | 3 | NULL |
| 54 | wxyzabcde | 0 | 4 | NULL |
| 55 | fghijklmn | 0 | 5 | NULL |
(中略)
| 76 | wxyzabcdef | 0 | 6 | 1 |
| 77 | ghijklmnop | 0 | 7 | 1 |
| 78 | qrstuvwxyz | 0 | 8 | 1 |
| 79 | aaa | 0 | 9 | 1 |
| 80 | bbb | 0 | 0 | 1 |
(中略)
| 96 | rrr | 0 | 6 | 0 |
| 97 | sss | 0 | 7 | 0 |
| 98 | ttt | 0 | 8 | 0 |
| 99 | uuu | 0 | 9 | 0 |
| 100 | vvv | 0 | 0 | 0 |
+-----+------------+------+------+------+
100 rows in set (0.00 sec)
3-1. 文字列/タイプ:equi-height
mysql> ANALYZE TABLE test_data UPDATE HISTOGRAM ON str WITH 10 BUCKETS;
+---------------------+-----------+----------+------------------------------------------------+
| Table | Op | Msg_type | Msg_text |
+---------------------+-----------+----------+------------------------------------------------+
| hist_test.test_data | histogram | status | Histogram statistics created for column 'str'. |
+---------------------+-----------+----------+------------------------------------------------+
1 row in set (0.00 sec)
mysql> SELECT * FROM INFORMATION_SCHEMA.COLUMN_STATISTICS;
+-------------+------------+-------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| SCHEMA_NAME | TABLE_NAME | COLUMN_NAME | HISTOGRAM |
+-------------+------------+-------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| hist_test | test_data | str | {"buckets": [["base64:type254:YWFh", "base64:type254:Y2RlZmdoaQ==", 0.1, 10], ["base64:type254:Y2RlZmdoaWo=", "base64:type254:ZWZnaGlqa2xt", 0.2, 10], ["base64:type254:ZWZnaGlqa2xtbg==", "base64:type254:aGho", 0.3, 10], ["base64:type254:aGlqa2xtbg==", "base64:type254:amtsbW5vcHFy", 0.4, 10], ["base64:type254:a2tr", "base64:type254:bW5vcHFycw==", 0.5, 10], ["base64:type254:bW5vcHFyc3Q=", "base64:type254:b3BxcnN0dXZ3", 0.6, 10], ["base64:type254:b3BxcnN0dXZ3eA==", "base64:type254:cnJy", 0.7, 10], ["base64:type254:cnN0dXZ3eA==", "base64:type254:dHV2d3h5emFi", 0.8, 10], ["base64:type254:dXV1", "base64:type254:d3h5emFiY2Q=", 0.9, 10], ["base64:type254:d3h5emFiY2Rl", "base64:type254:emFiY2RlZmdo", 1, 10]], "data-type": "string", "charset-id": 255, "null-values": 0, "last-updated": "2017-09-30 12:23:38.000000", "sampling-rate": 1, "histogram-type": "equi-height", "number-of-buckets-specified": 10} |
+-------------+------------+-------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
mysql> EXPLAIN SELECT * FROM test_data WHERE str='aaa';
+----+-------------+-----------+------------+------+---------------+------+---------+------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-----------+------------+------+---------------+------+---------+------+------+----------+-------------+
| 1 | SIMPLE | test_data | NULL | ALL | NULL | NULL | NULL | NULL | 100 | 1.00 | Using where |
+----+-------------+-----------+------------+------+---------------+------+---------+------+------+----------+-------------+
1 row in set, 1 warning (0.00 sec)
「rows」は「100」(全行)ですが、**「filtered」が「1.00」**なので対象行が絞り込まれていることがわかります。
バケットが10分割で等分(1バケット10行)されているので「filtered」は「10.00」(1割)になるのかと思いましたが、「1.00」でした。
ここで、オプティマイザトレース(こちらの記事を参照)を使ってみます。
mysql> EXPLAIN SELECT * FROM test_data WHERE str='aaa';
+----+-------------+-----------+------------+------+---------------+------+---------+------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-----------+------------+------+---------------+------+---------+------+------+----------+-------------+
| 1 | SIMPLE | test_data | NULL | ALL | NULL | NULL | NULL | NULL | 100 | 1.00 | Using where |
+----+-------------+-----------+------------+------+---------------+------+---------+------+------+----------+-------------+
1 row in set, 1 warning (0.00 sec)
mysql> SET optimizer_trace='enabled=on';
Query OK, 0 rows affected (0.00 sec)
mysql> SELECT * FROM test_data WHERE str='aaa';
+----+------+------+------+------+
| id | str | val1 | val2 | val3 |
+----+------+------+------+------+
| 79 | aaa | 0 | 9 | 1 |
+----+------+------+------+------+
1 row in set (0.00 sec)
mysql> SELECT * FROM INFORMATION_SCHEMA.OPTIMIZER_TRACE\G
*************************** 1. row ***************************
QUERY: SELECT * FROM test_data WHERE str='aaa'
TRACE: {
"steps": [
{
"join_preparation": {
"select#": 1,
"steps": [
{
"expanded_query": "/* select#1 */ select `test_data`.`id` AS `id`,`test_data`.`str` AS `str`,`test_data`.`val1` AS `val1`,`test_data`.`val2` AS `val2`,`test_data`.`val3` AS `val3` from `test_data` where (`test_data`.`str` = 'aaa')"
}
]
}
},
{
"join_optimization": {
"select#": 1,
"steps": [
{
"condition_processing": {
"condition": "WHERE",
"original_condition": "(`test_data`.`str` = 'aaa')",
"steps": [
{
"transformation": "equality_propagation",
"resulting_condition": "multiple equal('aaa', `test_data`.`str`)"
},
{
"transformation": "constant_propagation",
"resulting_condition": "multiple equal('aaa', `test_data`.`str`)"
},
{
"transformation": "trivial_condition_removal",
"resulting_condition": "multiple equal('aaa', `test_data`.`str`)"
}
]
}
},
{
"substitute_generated_columns": {
}
},
{
"table_dependencies": [
{
"table": "`test_data`",
"row_may_be_null": false,
"map_bit": 0,
"depends_on_map_bits": [
]
}
]
},
{
"ref_optimizer_key_uses": [
]
},
{
"rows_estimation": [
{
"table": "`test_data`",
"table_scan": {
"rows": 100,
"cost": 0.25
}
}
]
},
{
"considered_execution_plans": [
{
"plan_prefix": [
],
"table": "`test_data`",
"best_access_path": {
"considered_access_paths": [
{
"rows_to_scan": 100,
"access_type": "scan",
"resulting_rows": 100,
"cost": 10.25,
"chosen": true
}
]
},
"condition_filtering_pct": 100,
"rows_for_plan": 100,
"cost_for_plan": 10.25,
"chosen": true
}
]
},
{
"attaching_conditions_to_tables": {
"original_condition": "(`test_data`.`str` = 'aaa')",
"attached_conditions_computation": [
],
"attached_conditions_summary": [
{
"table": "`test_data`",
"attached": "(`test_data`.`str` = 'aaa')"
}
]
}
},
{
"refine_plan": [
{
"table": "`test_data`"
}
]
}
]
}
},
{
"join_execution": {
"select#": 1,
"steps": [
]
}
}
]
}
MISSING_BYTES_BEYOND_MAX_MEM_SIZE: 0
INSUFFICIENT_PRIVILEGES: 0
1 row in set (0.00 sec)
…長いので見落としているかもしれませんが、ヒストグラム統計に関する詳細情報にはまだ対応していないようです(costの計算には反映されているようですが、正確かどうかは…?)。
3-2. 数値/複数カラム
mysql> ANALYZE TABLE test_data UPDATE HISTOGRAM ON val1, val2, val3 WITH 5 BUCKETS;
+---------------------+-----------+----------+-------------------------------------------------+
| Table | Op | Msg_type | Msg_text |
+---------------------+-----------+----------+-------------------------------------------------+
| hist_test.test_data | histogram | status | Histogram statistics created for column 'val1'. |
| hist_test.test_data | histogram | status | Histogram statistics created for column 'val2'. |
| hist_test.test_data | histogram | status | Histogram statistics created for column 'val3'. |
+---------------------+-----------+----------+-------------------------------------------------+
3 rows in set (0.00 sec)
mysql> SELECT * FROM INFORMATION_SCHEMA.COLUMN_STATISTICS;
+-------------+------------+-------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| SCHEMA_NAME | TABLE_NAME | COLUMN_NAME | HISTOGRAM |
+-------------+------------+-------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| hist_test | test_data | str | {"buckets": [["base64:type254:YWFh", "base64:type254:Y2RlZmdoaQ==", 0.1, 10], ["base64:type254:Y2RlZmdoaWo=", "base64:type254:ZWZnaGlqa2xt", 0.2, 10], ["base64:type254:ZWZnaGlqa2xtbg==", "base64:type254:aGho", 0.3, 10], ["base64:type254:aGlqa2xtbg==", "base64:type254:amtsbW5vcHFy", 0.4, 10], ["base64:type254:a2tr", "base64:type254:bW5vcHFycw==", 0.5, 10], ["base64:type254:bW5vcHFyc3Q=", "base64:type254:b3BxcnN0dXZ3", 0.6, 10], ["base64:type254:b3BxcnN0dXZ3eA==", "base64:type254:cnJy", 0.7, 10], ["base64:type254:cnN0dXZ3eA==", "base64:type254:dHV2d3h5emFi", 0.8, 10], ["base64:type254:dXV1", "base64:type254:d3h5emFiY2Q=", 0.9, 10], ["base64:type254:d3h5emFiY2Rl", "base64:type254:emFiY2RlZmdo", 1, 10]], "data-type": "string", "charset-id": 255, "null-values": 0, "last-updated": "2017-09-30 12:31:52.000000", "sampling-rate": 1, "histogram-type": "equi-height", "number-of-buckets-specified": 10} |
| hist_test | test_data | val1 | {"buckets": [[0, 0.81], [1, 1]], "data-type": "int", "charset-id": 8, "null-values": 0, "last-updated": "2017-09-30 12:32:23.000000", "sampling-rate": 1, "histogram-type": "singleton", "number-of-buckets-specified": 5} |
| hist_test | test_data | val2 | {"buckets": [[0, 1, 0.2, 2], [2, 3, 0.4, 2], [4, 5, 0.6, 2], [6, 7, 0.8, 2], [8, 9, 1, 2]], "data-type": "int", "charset-id": 8, "null-values": 0, "last-updated": "2017-09-30 12:32:23.000000", "sampling-rate": 1, "histogram-type": "equi-height", "number-of-buckets-specified": 5} |
| hist_test | test_data | val3 | {"buckets": [[0, 0.18], [1, 0.31], [2, 0.44], [3, 0.5700000000000001], [4, 0.75]], "data-type": "int", "charset-id": 8, "null-values": 0.25, "last-updated": "2017-09-30 12:32:23.000000", "sampling-rate": 1, "histogram-type": "singleton", "number-of-buckets-specified": 5} |
+-------------+------------+-------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
4 rows in set (0.00 sec)
複数カラムをまとめて指定した場合、複合カラムではなく各カラム独立でヒストグラム統計がバケット分割されます。
「val1」(カーディナリティ:2)・「val2」(カーディナリティ:10)・「val3」(カーディナリティ:5)でそれぞれカーディナリティが違うので、独立して個別にバケット数を指定し、3カラムすべてのタイプが「singleton」になるようにしてみます。
mysql> ANALYZE TABLE test_data UPDATE HISTOGRAM ON val1 WITH 2 BUCKETS;
+---------------------+-----------+----------+-------------------------------------------------+
| Table | Op | Msg_type | Msg_text |
+---------------------+-----------+----------+-------------------------------------------------+
| hist_test.test_data | histogram | status | Histogram statistics created for column 'val1'. |
+---------------------+-----------+----------+-------------------------------------------------+
1 row in set (0.00 sec)
mysql> ANALYZE TABLE test_data UPDATE HISTOGRAM ON val2 WITH 10 BUCKETS;
+---------------------+-----------+----------+-------------------------------------------------+
| Table | Op | Msg_type | Msg_text |
+---------------------+-----------+----------+-------------------------------------------------+
| hist_test.test_data | histogram | status | Histogram statistics created for column 'val2'. |
+---------------------+-----------+----------+-------------------------------------------------+
1 row in set (0.00 sec)
mysql> SELECT * FROM INFORMATION_SCHEMA.COLUMN_STATISTICS;
+-------------+------------+-------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| SCHEMA_NAME | TABLE_NAME | COLUMN_NAME | HISTOGRAM |
+-------------+------------+-------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| hist_test | test_data | str | {"buckets": [["base64:type254:YWFh", "base64:type254:Y2RlZmdoaQ==", 0.1, 10], ["base64:type254:Y2RlZmdoaWo=", "base64:type254:ZWZnaGlqa2xt", 0.2, 10], ["base64:type254:ZWZnaGlqa2xtbg==", "base64:type254:aGho", 0.3, 10], ["base64:type254:aGlqa2xtbg==", "base64:type254:amtsbW5vcHFy", 0.4, 10], ["base64:type254:a2tr", "base64:type254:bW5vcHFycw==", 0.5, 10], ["base64:type254:bW5vcHFyc3Q=", "base64:type254:b3BxcnN0dXZ3", 0.6, 10], ["base64:type254:b3BxcnN0dXZ3eA==", "base64:type254:cnJy", 0.7, 10], ["base64:type254:cnN0dXZ3eA==", "base64:type254:dHV2d3h5emFi", 0.8, 10], ["base64:type254:dXV1", "base64:type254:d3h5emFiY2Q=", 0.9, 10], ["base64:type254:d3h5emFiY2Rl", "base64:type254:emFiY2RlZmdo", 1, 10]], "data-type": "string", "charset-id": 255, "null-values": 0, "last-updated": "2017-09-30 12:31:52.000000", "sampling-rate": 1, "histogram-type": "equi-height", "number-of-buckets-specified": 10} |
| hist_test | test_data | val1 | {"buckets": [[0, 0.81], [1, 1]], "data-type": "int", "charset-id": 8, "null-values": 0, "last-updated": "2017-09-30 12:34:17.000000", "sampling-rate": 1, "histogram-type": "singleton", "number-of-buckets-specified": 2} |
| hist_test | test_data | val2 | {"buckets": [[0, 0.1], [1, 0.2], [2, 0.30000000000000004], [3, 0.4], [4, 0.5], [5, 0.6], [6, 0.7], [7, 0.7999999999999999], [8, 0.8999999999999999], [9, 0.9999999999999999]], "data-type": "int", "charset-id": 8, "null-values": 0, "last-updated": "2017-09-30 12:34:23.000000", "sampling-rate": 1, "histogram-type": "singleton", "number-of-buckets-specified": 10} |
| hist_test | test_data | val3 | {"buckets": [[0, 0.18], [1, 0.31], [2, 0.44], [3, 0.5700000000000001], [4, 0.75]], "data-type": "int", "charset-id": 8, "null-values": 0.25, "last-updated": "2017-09-30 12:32:23.000000", "sampling-rate": 1, "histogram-type": "singleton", "number-of-buckets-specified": 5} |
+-------------+------------+-------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
4 rows in set (0.00 sec)
数値カラムが「singleton」になりました。
なお、「val3」カラムにNULL行が25%存在することがわかります。
複数カラムを組み合わせた条件でSELECTする場合のEXPLAINを見てみます。
mysql> EXPLAIN SELECT * FROM test_data WHERE val1=1 AND val2 < 5 AND val3 < 2;
+----+-------------+-----------+------------+------+---------------+------+---------+------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-----------+------------+------+---------------+------+---------+------+------+----------+-------------+
| 1 | SIMPLE | test_data | NULL | ALL | NULL | NULL | NULL | NULL | 100 | 2.94 | Using where |
+----+-------------+-----------+------------+------+---------------+------+---------+------+------+----------+-------------+
1 row in set, 1 warning (0.00 sec)
mysql> SELECT * FROM test_data WHERE val1=1 AND val2 < 5 AND val3 < 2;
+----+----------+------+------+------+
| id | str | val1 | val2 | val3 |
+----+----------+------+------+------+
| 10 | lmnopqr | 1 | 0 | 0 |
| 12 | zabcdef | 1 | 2 | 1 |
| 30 | yzabcdef | 1 | 0 | 0 |
| 32 | opqrstuv | 1 | 2 | 1 |
+----+----------+------+------+------+
4 rows in set (0.00 sec)
mysql> EXPLAIN SELECT * FROM test_data WHERE val1=1 AND val2 < 5 AND val3 IS NULL;
+----+-------------+-----------+------------+------+---------------+------+---------+------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-----------+------------+------+---------------+------+---------+------+------+----------+-------------+
| 1 | SIMPLE | test_data | NULL | ALL | NULL | NULL | NULL | NULL | 100 | 2.38 | Using where |
+----+-------------+-----------+------------+------+---------------+------+---------+------+------+----------+-------------+
1 row in set, 1 warning (0.00 sec)
NULLを含まない場合も含む場合も、filteredされているのがわかります。
実際にSELECTした結果と、若干ずれている気はしますが。
3-3. 結合(JOIN)
JOINはどうでしょうか?
mysql> CREATE TABLE join_data (id INT(10) PRIMARY KEY AUTO_INCREMENT, str VARCHAR(10) UNIQUE, val_j INT(1));
Query OK, 0 rows affected (0.02 sec)
(テストデータを投入/以下はその内容)
mysql> SELECT * FROM join_data;
+----+------+-------+
| id | str | val_j |
+----+------+-------+
| 1 | aaa | 1 |
| 2 | bbb | 2 |
| 3 | ccc | 3 |
| 4 | ddd | 4 |
| 5 | eee | 5 |
(中略)
| 22 | vvv | 2 |
| 23 | www | 3 |
| 24 | xxx | 4 |
| 25 | yyy | 5 |
| 26 | zzz | 6 |
+----+------+-------+
26 rows in set (0.01 sec)
mysql> EXPLAIN SELECT * FROM test_data t INNER JOIN join_data j ON t.str=j.str;
+----+-------------+-------+------------+------+---------------+------+---------+-----------------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+---------------+------+---------+-----------------+------+----------+-------------+
| 1 | SIMPLE | t | NULL | ALL | NULL | NULL | NULL | NULL | 100 | 100.00 | Using where |
| 1 | SIMPLE | j | NULL | ref | str | str | 43 | hist_test.t.str | 1 | 100.00 | NULL |
+----+-------------+-------+------------+------+---------------+------+---------+-----------------+------+----------+-------------+
2 rows in set, 1 warning (0.00 sec)
絞り込みは行われていないようです。~~行数が少なすぎて、~~フルスキャンしたほうが良いと判断したのかもしれません。
数千行×数千行にしても同じでした。
以下、INDEXを作成した場合の結果です。比較してみてください(駆動表と内部表が逆転しています)。
mysql> ALTER TABLE test_data ADD INDEX idx_str(str);
Query OK, 0 rows affected (0.02 sec)
Records: 0 Duplicates: 0 Warnings: 0
mysql> EXPLAIN SELECT * FROM test_data t INNER JOIN join_data j ON t.str=j.str;
+----+-------------+-------+------------+------+---------------+---------+---------+-----------------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+---------------+---------+---------+-----------------+------+----------+-------------+
| 1 | SIMPLE | j | NULL | ALL | str | NULL | NULL | NULL | 26 | 100.00 | Using where |
| 1 | SIMPLE | t | NULL | ref | idx_str | idx_str | 43 | hist_test.j.str | 1 | 100.00 | NULL |
+----+-------------+-------+------------+------+---------------+---------+---------+-----------------+------+----------+-------------+
2 rows in set, 1 warning (0.00 sec)
3-4. INDEXとヒストグラムの同時利用
同じカラムに対して同時利用することはできませんが、INDEXがあるカラムとヒストグラム統計を持つカラムを同時にWHERE句で指定してみます。
mysql> EXPLAIN SELECT * FROM test_data WHERE str LIKE 'a%' AND val2 > 5;
+----+-------------+-----------+------------+-------+---------------+---------+---------+------+------+----------+------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-----------+------------+-------+---------------+---------+---------+------+------+----------+------------------------------------+
| 1 | SIMPLE | test_data | NULL | range | idx_str | idx_str | 43 | NULL | 5 | 40.00 | Using index condition; Using where |
+----+-------------+-----------+------------+-------+---------------+---------+---------+------+------+----------+------------------------------------+
1 row in set, 1 warning (0.00 sec)
mysql> SELECT * FROM test_data WHERE str LIKE 'a%';
+----+------------+------+------+------+
| id | str | val1 | val2 | val3 |
+----+------------+------+------+------+
| 79 | aaa | 0 | 9 | 1 |
| 1 | abcdefg | 0 | 1 | 0 |
| 27 | abcdefgh | 0 | 7 | 3 |
| 40 | abcdefghi | 0 | 0 | 2 |
| 66 | abcdefghij | 0 | 6 | 4 |
+----+------------+------+------+------+
5 rows in set (0.00 sec)
mysql> SELECT * FROM test_data WHERE str LIKE 'a%' AND val2 > 5;
+----+------------+------+------+------+
| id | str | val1 | val2 | val3 |
+----+------------+------+------+------+
| 79 | aaa | 0 | 9 | 1 |
| 27 | abcdefgh | 0 | 7 | 3 |
| 66 | abcdefghij | 0 | 6 | 4 |
+----+------------+------+------+------+
3 rows in set (0.00 sec)
カラム「str」のINDEXとカラム「val2」のヒストグラム統計の両方が利用されているようですね。
3-5. UPDATE
mysql> EXPLAIN UPDATE test_data SET val2=0 WHERE str LIKE 'a%' AND val2 > 5;
+----+-------------+-----------+------------+-------+---------------+---------+---------+-------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-----------+------------+-------+---------------+---------+---------+-------+------+----------+-------------+
| 1 | UPDATE | test_data | NULL | range | idx_str | idx_str | 43 | const | 5 | 100.00 | Using where |
+----+-------------+-----------+------------+-------+---------------+---------+---------+-------+------+----------+-------------+
1 row in set (0.00 sec)
残念ながら利用されないようです(行数が少なすぎる可能性はあります行数を少し増やしてみてもダメでした)。
実際にUPDATEしてみます。
mysql> UPDATE test_data SET val2=0 WHERE str LIKE 'a%' AND val2 > 5;
Query OK, 3 rows affected (0.00 sec)
Rows matched: 3 Changed: 3 Warnings: 0
mysql> SELECT * FROM test_data WHERE str LIKE 'a%' AND val2 > 5;
Empty set (0.00 sec)
mysql> EXPLAIN SELECT * FROM test_data WHERE str LIKE 'a%' AND val2 > 5;
+----+-------------+-----------+------------+-------+---------------+---------+---------+------+------+----------+------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-----------+------------+-------+---------------+---------+---------+------+------+----------+------------------------------------+
| 1 | SIMPLE | test_data | NULL | range | idx_str | idx_str | 43 | NULL | 5 | 40.00 | Using index condition; Using where |
+----+-------------+-----------+------------+-------+---------------+---------+---------+------+------+----------+------------------------------------+
1 row in set, 1 warning (0.00 sec)
EXPLAIN結果を見ると、ヒストグラム統計はUPDATE前(3-4. のとき)と変わっていないようです。注意書きの通りですね。
「ANALYZE TABLE UPDATE HISTOGRAM」してみます。
mysql> ANALYZE TABLE test_data UPDATE HISTOGRAM ON val2 WITH 10 BUCKETS;
+---------------------+-----------+----------+-------------------------------------------------+
| Table | Op | Msg_type | Msg_text |
+---------------------+-----------+----------+-------------------------------------------------+
| hist_test.test_data | histogram | status | Histogram statistics created for column 'val2'. |
+---------------------+-----------+----------+-------------------------------------------------+
1 row in set (0.00 sec)
mysql> EXPLAIN SELECT * FROM test_data WHERE str LIKE 'a%' AND val2 > 5;
+----+-------------+-----------+------------+-------+---------------+---------+---------+------+------+----------+------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-----------+------------+-------+---------------+---------+---------+------+------+----------+------------------------------------+
| 1 | SIMPLE | test_data | NULL | range | idx_str | idx_str | 43 | NULL | 5 | 37.00 | Using index condition; Using where |
+----+-------------+-----------+------------+-------+---------------+---------+---------+------+------+----------+------------------------------------+
1 row in set, 1 warning (0.01 sec)
ちょっとわかりづらいですが、更新されました。ずれは大きいようですが。
3-6. パフォーマンス
200万行のテーブルで、INDEXと比較してみます。
mysql> CREATE TABLE flag_test_i (id INT(10) PRIMARY KEY AUTO_INCREMENT, str VARCHAR(100), flag INT(1), INDEX (flag));
Query OK, 0 rows affected (0.03 sec)
mysql> CREATE TABLE flag_test_h (id INT(10) PRIMARY KEY AUTO_INCREMENT, str VARCHAR(100), flag INT(1));
Query OK, 0 rows affected (0.02 sec)
mysql> SELECT COUNT(*) FROM flag_test_i;
+----------+
| COUNT(*) |
+----------+
| 2000000 |
+----------+
1 row in set (3.58 sec)
mysql> SELECT COUNT(*) FROM flag_test_h;
+----------+
| COUNT(*) |
+----------+
| 2000000 |
+----------+
1 row in set (2.03 sec)
mysql> SELECT * FROM flag_test_i ORDER BY id ASC LIMIT 5;
+----+------------------------------------------------------------------------------------------------------+------+
| id | str | flag |
+----+------------------------------------------------------------------------------------------------------+------+
| 1 | 1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890 | 0 |
| 2 | 1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890 | 0 |
| 3 | 1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890 | 0 |
| 4 | 1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890 | 0 |
| 5 | 1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890 | 0 |
+----+------------------------------------------------------------------------------------------------------+------+
5 rows in set (0.00 sec)
mysql> SELECT * FROM flag_test_h ORDER BY id ASC LIMIT 5;
+----+------------------------------------------------------------------------------------------------------+------+
| id | str | flag |
+----+------------------------------------------------------------------------------------------------------+------+
| 1 | 1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890 | 0 |
| 2 | 1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890 | 0 |
| 3 | 1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890 | 0 |
| 4 | 1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890 | 0 |
| 5 | 1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890 | 0 |
+----+------------------------------------------------------------------------------------------------------+------+
5 rows in set (0.00 sec)
↑からはわかりませんが、1万行あたり1行「flag=1」の行を作ってあります。
2つ目のテーブルでは、まだヒストグラム統計は取っていません。
念のためサーバ再起動を挟んで、「flag=1」の行をCOUNT(*)してみます。
mysql> USE hist_test;
No connection. Trying to reconnect...
Connection id: 7
Current database: *** NONE ***
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
mysql> SELECT COUNT(*) FROM flag_test_i WHERE FLAG=1;
+----------+
| COUNT(*) |
+----------+
| 200 |
+----------+
1 row in set (0.00 sec)
mysql> SELECT COUNT(*) FROM flag_test_h WHERE flag=1;
+----------+
| COUNT(*) |
+----------+
| 200 |
+----------+
1 row in set (1.61 sec)
INDEX有無による差は(当然)あります。
ここで、再度サーバ再起動を挟んで、2つ目のテーブルで「ANALYZE TABLE UPDATE HISTOGRAM」してから「flag=1」の行をCOUNT(*)してみます。
mysql> ANALYZE TABLE flag_test_h UPDATE HISTOGRAM ON flag WITH 2 BUCKETS;
+-----------------------+-----------+----------+-------------------------------------------------+
| Table | Op | Msg_type | Msg_text |
+-----------------------+-----------+----------+-------------------------------------------------+
| hist_test.flag_test_h | histogram | status | Histogram statistics created for column 'flag'. |
+-----------------------+-----------+----------+-------------------------------------------------+
1 row in set (1.39 sec)
mysql> USE hist_test;
No connection. Trying to reconnect...
Connection id: 7
Current database: *** NONE ***
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
mysql> SELECT COUNT(*) FROM flag_test_h WHERE FLAG=1;
+----------+
| COUNT(*) |
+----------+
| 200 |
+----------+
1 row in set (0.96 sec)
ヒストグラム統計なしの場合よりは速いですが、INDEXと比較するとやはり遅いようです。
4. まとめ
SELECT対象の絞り込みではある程度、普通に使えます。
以前、**「MySQL(InnoDB)でカーディナリティの低いカラムにINDEXを張る」**という記事を書きましたが、この記事で取り上げた目的(⇒ごく一部の行だけを抽出する)で使うのであれば、INDEXではなくヒストグラム統計を使うことを検討したほうが良いケースもあると思います。
但し、INDEXを使う場合と比較すると、やはりパフォーマンスは落ちるようですし、UPDATEにも対応していないようですので(行数が少なすぎただけかもしれませんが、もし本当に対応していないとすると、フラグの一括更新には使えません)、要件によって使い分けることになるでしょう。
なお、「ANALYZE TABLE UPDATE HISTOGRAM」を実行して、統計情報を更新しないと正しくコスト評価できない点が気になるところですが、これはマニュアルに
・ An index must be updated when table data is modified.
・ A histogram is created or updated only on demand, so it adds no overhead when table data is modified.
と書かれている通り、**「INDEXを使うと行挿入・更新・削除で常にオーバーヘッドが生じるが、ヒストグラム統計を使う場合は必要な時にだけANALYZE TABLE UPDATE HISTOGRAMすれば余計なオーバーヘッドが生じない」**という利点にもつながります。
例えば、バッチ処理で「flag=1の行を取り出して、1行ずつ『何らかの処理を施してからflagをUPDATE』する」ようなことをしている場合、バッチ処理の初め(だけ)に「ANALYZE TABLE UPDATE HISTOGRAM」することで(INDEXを使うよりも)バッチ処理以外の更新時オーバーヘッドを削減することができる、ということです。
【おまけ】
MySQL 8.0関連投稿記事へのリンクを集めました。