More than 3 years have passed since last update.

Colabで全文検索（その１：MySQL編）

Last updated at 2023-02-10Posted at 2022-02-26

ご覧いただきありがとうございます。
Google Colaboratoryにアカウントをお持ちの方は、上の「Open in Colab」と書かれた青いボタンを押せば直接notebookをColabで開けます。ぜひ動かしてみてください。
過去の記事も含め、全てのコードをGithubで公開しています。

各種全文検索ツールをColabで動かしてみるシリーズです。全7回の予定です。今回はMySQLです。

処理時間の計測はストレージのキャッシュとの兼ね合いがあるので、2回測ります。2回目は全てがメモリに載った状態での性能評価になります。ただ1回目もデータを投入した直後なので、メモリに載ってしまっている可能性があります。

準備

まずは検索対象のテキストを日本語wikiから取得して、Google Driveに保存します。

Google Driveに約１GBの空き容量が必要です。以前のデータが残っている場合は取得せず再利用します。

Google Driveのマウント

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive

jawikiの取得とjson形式への変換。90分ほど時間がかかります。

他の全文検索シリーズでも同じデータを使うので、他の記事も試す場合は wiki.json.bz2 を捨てずに残しておくことをおすすめします。

%%time
%cd /content/
import os
if not os.path.exists('/content/drive/MyDrive/wiki.json.bz2'):
    !wget https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles.xml.bz2
    !pip install wikiextractor
    !python -m wikiextractor.WikiExtractor --no-templates --processes 4 --json -b 10G -o - jawiki-latest-pages-articles.xml.bz2 | bzip2 -c > /content/drive/MyDrive/wiki.json.bz2

/content
Requirement already satisfied: wikiextractor in /usr/local/lib/python3.7/dist-packages (3.0.6)
INFO: Starting page extraction from jawiki-latest-pages-articles.xml.bz2.
INFO: Using 4 extract processes.
INFO: Extracted 100000 articles (266.4 art/s)
INFO: Extracted 200000 articles (409.4 art/s)
INFO: Extracted 300000 articles (504.7 art/s)
INFO: Extracted 400000 articles (556.6 art/s)
INFO: Extracted 500000 articles (610.1 art/s)
INFO: Extracted 600000 articles (650.8 art/s)
INFO: Extracted 700000 articles (659.1 art/s)
INFO: Extracted 800000 articles (679.1 art/s)
INFO: Extracted 900000 articles (614.8 art/s)
INFO: Extracted 1000000 articles (645.4 art/s)
INFO: Extracted 1100000 articles (654.3 art/s)
INFO: Extracted 1200000 articles (637.7 art/s)
INFO: Extracted 1300000 articles (610.6 art/s)
INFO: Extracted 1400000 articles (593.2 art/s)
INFO: Extracted 1500000 articles (641.7 art/s)
INFO: Extracted 1600000 articles (634.1 art/s)
INFO: Extracted 1700000 articles (602.7 art/s)
INFO: Extracted 1800000 articles (629.0 art/s)
INFO: Extracted 1900000 articles (621.6 art/s)
INFO: Extracted 2000000 articles (683.6 art/s)
INFO: Extracted 2100000 articles (688.4 art/s)
INFO: Finished 4-process extraction of 2119081 articles in 3690.6s (574.2 art/s)
CPU times: user 26.7 s, sys: 3.28 s, total: 30 s
Wall time: 1h 1min 34s

json形式に変換されたデータを確認

import json
import bz2

with bz2.open('/content/drive/MyDrive/wiki.json.bz2', 'rt', encoding='utf-8') as fin:
    for n, line in enumerate(fin):
        data = json.loads(line)
        print(data['title'].strip(), data['text'].replace('\n', '')[:40], sep='\t')
        if n == 5:
            break

アンパサンド	アンパサンド（&amp;, ）は、並立助詞「…と…」を意味する記号である。ラテン
言語	言語（げんご）は、広辞苑や大辞泉には次のように解説されている。『日本大百科事典』
日本語	 日本語（にほんご、にっぽんご）は、日本国内や、かつての日本領だった国、そして日
地理学	地理学（ちりがく、、、伊：geografia、）は、。地域や空間、場所、自然環境
EU (曖昧さ回避)	EU
国の一覧	国の一覧（くにのいちらん）は、世界の独立国の一覧。対象.国際法上国家と言えるか否

MySQLのインストール

!sudo apt update
!sudo apt install mysql-server mysql-client

Get:1 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Get:2 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ InRelease [3,626 B]
Ign:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease
・
・
Setting up mysql-server (5.7.37-0ubuntu0.18.04.1) ...
Processing triggers for systemd (237-3ubuntu10.53) ...
Processing triggers for man-db (2.8.3-2ubuntu0.1) ...
Processing triggers for libc-bin (2.27-3ubuntu1.3) ...
/sbin/ldconfig.real: /usr/local/lib/python3.7/dist-packages/ideep4py/lib/libmkldnn.so.0 is not a symbolic link

!mysql --version

mysql  Ver 14.14 Distrib 5.7.37, for Linux (x86_64) using  EditLine wrapper

MySQLの立ち上げ

!service mysql start

 * Starting MySQL database server mysqld
No directory, logging in with HOME=/
   ...done.

!service mysql status

 * /usr/bin/mysqladmin  Ver 8.42 Distrib 5.7.37, for Linux on x86_64
Copyright (c) 2000, 2022, Oracle and/or its affiliates.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Server version		5.7.37-0ubuntu0.18.04.1
Protocol version	10
Connection		Localhost via UNIX socket
UNIX socket		/var/run/mysqld/mysqld.sock
Uptime:			1 sec

Threads: 1  Questions: 8  Slow queries: 0  Opens: 105  Flush tables: 1  Open tables: 98  Queries per second avg: 8.000

DB作成

!echo "create database db" | mysql

Pythonクライアントのインストール

!pip install mysqlclient

Collecting mysqlclient
  Downloading mysqlclient-2.1.0.tar.gz (87 kB)
[K     |████████████████████████████████| 87 kB 3.1 MB/s 
[?25hBuilding wheels for collected packages: mysqlclient
  Building wheel for mysqlclient (setup.py) ... [?25l[?25hdone
  Created wheel for mysqlclient: filename=mysqlclient-2.1.0-cp37-cp37m-linux_x86_64.whl size=99970 sha256=b02277159390cd4b15d6de2a531b35a56ec5150e53609641eddc9ffd4b8d7ce1
  Stored in directory: /root/.cache/pip/wheels/97/d4/df/08cd6e1fa4a8691b268ab254bd0fa589827ab5b65638c010b4
Successfully built mysqlclient
Installing collected packages: mysqlclient
Successfully installed mysqlclient-2.1.0

データのインポート

テーブルを作成して、データを50万件登録します。10分ほど時間がかかります。

import MySQLdb
import json
import bz2
from tqdm.notebook import tqdm

db = MySQLdb.connect(host='localhost', user='root', db='db', charset='utf8mb4')
cursor = db.cursor()

cursor.execute('drop table if exists wiki_jp')
cursor.execute('create table wiki_jp('
 'id bigint unsigned not null auto_increment primary key,'
 'title tinytext collate utf8mb4_unicode_ci storage memory,'
 'body mediumtext collate utf8mb4_unicode_ci storage memory)')

limit = 500000
insert_wiki = 'insert into wiki_jp (title, body) values (%s, %s);'

with bz2.open('/content/drive/MyDrive/wiki.json.bz2', 'rt', encoding='utf-8') as fin:
    n = 0
    for line in tqdm(fin, total=limit*1.5):
        data = json.loads(line)
        title = data['title'].strip()
        body = data['text'].replace('\n', '')
        if len(title) > 0 and len(body) > 0:
            cursor.execute(insert_wiki, (title, body))
            n += 1
        if n == limit:
            break
db.commit()
db.close()

  0%|          | 0/750000.0 [00:00<?, ?it/s]

テーブル定義を確認します。

!echo "show columns from db.wiki_jp" | mysql

Field	Type	Null	Key	Default	Extra
id	bigint(20) unsigned	NO	PRI	NULL	auto_increment
title	tinytext	YES		NULL	
body	mediumtext	YES		NULL

登録件数を確認します。

!echo "select count(*) from db.wiki_jp" | mysql

count(*)
500000

インデックスを使わない検索

like検索でシーケンシャルに検索した場合を測定します。mysqlコマンドに-vvvオプションを付けると、出力の最後から3行目にマッチ数と処理時間が出力されるので、その部分だけをtailコマンドとheadコマンドで切り出しています。

まずはインデックスを使っていないことを確認します。

!echo "explain select sql_no_cache * from db.wiki_jp where body like '%日本語%'" | mysql

id	select_type	table	partitions	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	SIMPLE	wiki_jp	NULL	ALL	NULL	NULL	NULL	NULL	271545	11.11	Using where

bodyに「日本語」を含むレコードの数を取得します。

%%time
!echo "select sql_no_cache count(*) from db.wiki_jp where body like '%日本語%'" | mysql -vvv

--------------
select sql_no_cache count(*) from db.wiki_jp where body like '%日本語%'
--------------

+----------+
| count(*) |
+----------+
|    17006 |
+----------+
1 row in set, 1 warning (11.47 sec)

Bye
CPU times: user 107 ms, sys: 19.5 ms, total: 127 ms
Wall time: 11.6 s

bodyに「日本語」を含むレコードを取得します。

%%time
!echo "select sql_no_cache * from db.wiki_jp where body like '%日本語%'" | mysql -vvv | tail -3 | head -1

17006 rows in set, 1 warning (10.51 sec)
CPU times: user 126 ms, sys: 19.5 ms, total: 146 ms
Wall time: 13.6 s

2回目

%%time
!echo "select sql_no_cache * from db.wiki_jp where body like '%日本語%'" | mysql -vvv | tail -3 | head -1

17006 rows in set, 1 warning (10.25 sec)
CPU times: user 121 ms, sys: 16.8 ms, total: 138 ms
Wall time: 13.8 s

インデックスを使わない場合、単に数を集計するだけでもbodyにアクセスしなければならないので、処理時間はあまり変わりません。

全文検索用インデックスの作成

インデックスの作成には60分ほどかかります。

!echo "alter table db.wiki_jp add fulltext index ngram_idx (body) with parser ngram" | mysql -vvv

--------------
alter table db.wiki_jp add fulltext index ngram_idx (body) with parser ngram
--------------

Query OK, 0 rows affected, 1 warning (1 hour 33.16 sec)
Records: 0  Duplicates: 0  Warnings: 1

Bye

インデックスを使った検索

まずはインデックスを使っていることを確認します。

!echo "explain select sql_no_cache * from db.wiki_jp where match (body) against ('日本語' in boolean mode)" | mysql

id	select_type	table	partitions	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	SIMPLE	wiki_jp	NULL	fulltext	ngram_idx	ngram_idx	0	const	1	100.00	Using where; Ft_hints: no_ranking

bodyに「日本語」を含むレコードの数を取得します。

%%time
!echo "select sql_no_cache count(*) from db.wiki_jp where match (body) against ('日本語' in boolean mode)" | mysql -vvv

--------------
select sql_no_cache count(*) from db.wiki_jp where match (body) against ('日本語' in boolean mode)
--------------

+----------+
| count(*) |
+----------+
|    17006 |
+----------+
1 row in set, 1 warning (2.45 sec)

Bye
CPU times: user 22.2 ms, sys: 12 ms, total: 34.2 ms
Wall time: 2.53 s

bodyに「日本語」を含むレコードを取得します。

%%time
!echo "select sql_no_cache * from db.wiki_jp where match (body) against ('日本語' in boolean mode)" | mysql -vvv | tail -3 | head -1

17006 rows in set, 1 warning (4.31 sec)
CPU times: user 73.3 ms, sys: 14.2 ms, total: 87.5 ms
Wall time: 8.05 s

2回目

%%time
!echo "select sql_no_cache * from db.wiki_jp where match (body) against ('日本語' in boolean mode)" | mysql -vvv | tail -3 | head -1

17006 rows in set, 1 warning (3.97 sec)
CPU times: user 60.8 ms, sys: 14.2 ms, total: 75 ms
Wall time: 7.65 s

数の集計はまあまあ速いですが、他と比べると断然遅い。数の集計の場合でもインデックスだけで処理が済まず、bodyにアクセスしているのかもしれません。インデックスを使っても、使わない場合の半分にもならないのはいかがなものかと。

参考までにidとtitleのみを取得してみます。

%%time
!echo "select sql_no_cache id title from db.wiki_jp where match (body) against ('日本語' in boolean mode)" | mysql -vvv | tail -3 | head -1

17006 rows in set, 1 warning (3.03 sec)
CPU times: user 33.8 ms, sys: 6.21 ms, total: 40 ms
Wall time: 3.13 s

大して速くなりません。これはMySQLの欠点と言っても過言ではありませんね。

DBの停止

!service mysql stop

 * Stopping MySQL database server mysqld
   ...done.

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up