Using BM25-Sparse, a fast Python library for BM25, with Japanese text

Posted at 2025-01-17

Introduction

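BM25 (Best Matching 25) is a ranking method that performs well for term-based search and ranking. It is also widely used for the retrieval step of RAG.

Unlike vector-based search, it does not consider context or meaning, so it is somewhat weak against synonym paraphrases and, in Japanese, against differences such as whether a word is written in kanji or hiragana. Well, Japanese is just hard in the end...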
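To make "term-based" concrete, here is a minimal sketch of BM25 scoring in plain Python. It uses one common textbook variant of the formula with naive whitespace tokenization; the function name `bm25_scores` and the values `k1=1.5` and `b=0.75` are my own assumptions for illustration, and this is not necessarily the exact variant or the defaults that BM25S uses.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every document against the query with a textbook BM25 variant."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in corpus_tokens for term in set(doc))
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in set(query_tokens):
            if term not in tf:
                continue  # only exact term matches contribute
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates with k1; b controls length normalization.
            denom = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores

docs = [
    "a cat is a feline and likes to purr".split(),
    "a dog is the human's best friend and loves to play".split(),
    "a bird is a beautiful animal that can fly".split(),
    "a fish is a creature that lives in water and swims".split(),
]
scores = bm25_scores(["fish", "purr", "cat"], docs)
print(scores)
```

Only documents that contain the exact query terms receive a non-zero score, which is why synonyms and spelling variants are invisible to BM25.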

BM25-Sparse (BM25S) is a published library that makes BM25 easy to use from Python and also speeds it up.

By building on SciPy, it reportedly achieves faster search than Elasticsearch.
There are English samples but none for Japanese, so let's give it a try.

Experiments

Setup

First, install the following.

pip install bm25s
pip install PyStemmer

The Python version used here is 3.9.6.

English version

Let's start by running the Quickstart from the README.

import bm25s
import Stemmer  # optional: for stemming

# Create your corpus here
corpus = [
    "a cat is a feline and likes to purr",
    "a dog is the human's best friend and loves to play",
    "a bird is a beautiful animal that can fly",
    "a fish is a creature that lives in water and swims",
]

# optional: create a stemmer
stemmer = Stemmer.Stemmer("english")

# Tokenize the corpus and only keep the ids (faster and saves memory)
corpus_tokens = bm25s.tokenize(corpus, stopwords="en", stemmer=stemmer)

# Create the BM25 model and index the corpus
retriever = bm25s.BM25()
retriever.index(corpus_tokens)

# Query the corpus
query = "does the fish purr like a cat?"
query_tokens = bm25s.tokenize(query, stemmer=stemmer)

# Get top-k results as a tuple of (doc ids, scores). Both are arrays of shape (n_queries, k).
# To return docs instead of IDs, set the `corpus=corpus` parameter.
results, scores = retriever.retrieve(query_tokens, k=2)

for i in range(results.shape[1]):
    doc, score = results[0, i], scores[0, i]
    print(f"Rank {i+1} (score: {score:.2f}): {doc}")

# You can save the corpus along with the model
retriever.save("animal_index_bm25", corpus=corpus)

The output is as follows.

Rank 1 (score: 1.59): 0
Rank 2 (score: 0.48): 3

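For the query "does the fish purr like a cat?", corpus[0] "a cat is a feline and likes to purr" came out as the closest match. That makes sense, since it clearly shares the most words with the query (cat, purr, and like after stemming).

The second closest is corpus[3], which shares the word fish.

In addition, a folder named animal_index_bm25 is created in the directory where the script was run, and the following files are saved in it.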

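corpus.jsonl
- The corpus you defined, saved with sequential IDs attached (not created by default)
corpus.mmindex.json
- I'm not sure what this is; judging by the name, it is probably an index for memory-mapped access to corpus.jsonl (not created by default)
vocab.index.json
- The vocabulary (token to ID mapping)
params.index.json
- The parameters used for search

For reference, here is vocab.index.json: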

{"cat": 0, "fli": 1, "love": 2, "dog": 3, "best": 4, "human": 5, "beauti": 6, "anim": 7, "water": 8, "fish": 9, "purr": 10, "creatur": 11, "swim": 12, "bird": 13, "play": 14, "like": 15, "can": 16, "friend": 17, "live": 18, "felin": 19, "": 20}
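Since vocab.index.json is plain JSON that maps each token to an integer ID, it can be inspected with the standard library. The sketch below parses the first few entries of the mapping above (inlined as a string so that it runs on its own) and inverts it to look up tokens by ID. Note that the keys are stemmed forms such as "fli" rather than the original words.

```python
import json

# The first few entries of the vocab.index.json content shown above.
vocab_json = '{"cat": 0, "fli": 1, "love": 2, "dog": 3}'

vocab = json.loads(vocab_json)                       # token -> ID
id_to_token = {i: tok for tok, i in vocab.items()}   # ID -> token

print(id_to_token[1])
```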

Japanese version

Next, translate the corpus and the query into Japanese and run the code as is.

import bm25s
import Stemmer  # optional: for stemming

corpus = [
    "猫はネコ科の動物で、喉を鳴らすのが好きです",
    "犬は人間の親友であり、遊ぶのが大好きです",
    "鳥は飛べる美しい動物です",
    "魚は水中に生息し泳ぐ生き物である",
]

# optional: create a stemmer
stemmer = Stemmer.Stemmer("english")

# Tokenize the corpus and only keep the ids (faster and saves memory)
corpus_tokens = bm25s.tokenize(corpus, stopwords="en", stemmer=stemmer)

# Create the BM25 model and index the corpus
retriever = bm25s.BM25()
retriever.index(corpus_tokens)

# Query the corpus
query = "魚は猫のように喉を鳴らしますか？"
query_tokens = bm25s.tokenize(query, stemmer=stemmer)

# Get top-k results as a tuple of (doc ids, scores). Both are arrays of shape (n_queries, k).
# To return docs instead of IDs, set the `corpus=corpus` parameter.
results, scores = retriever.retrieve(query_tokens, k=2)

for i in range(results.shape[1]):
    doc, score = results[0, i], scores[0, i]
    print(f"Rank {i+1} (score: {score:.2f}): {doc}")

# You can save the corpus along with the model
retriever.save("animal_index_bm25", corpus=corpus)

The output was as follows.

Rank 1 (score: 0.00): 3
Rank 2 (score: 0.00): 2

All the scores are 0.00.
Let's look at the generated vocab.index.json.

{"遊ぶのが大好きです": 0, "魚は水中に生息し泳ぐ生き物である": 1, "犬は人間の親友であり": 2, "喉を鳴らすのが好きです": 3, "鳥は飛べる美しい動物です": 4, "猫はネコ科の動物で": 5, "": 6}

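Clearly the text is not being tokenized: each clause between punctuation marks has become a single "word". That is only to be expected, since we are using the English settings as is.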
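The vocabulary above is what you would get from a splitter that looks for runs of word characters. The sketch below uses the regular expression `(?u)\b\w\w+\b`, which I am assuming is close to what bm25s.tokenize does by default (this is an assumption, not something confirmed in this article). Japanese has no spaces between words, and kanji and kana all count as word characters, so the text is only cut at punctuation.

```python
import re

# Assumed default splitter: runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

english = "a cat is a feline and likes to purr"
japanese = "猫はネコ科の動物で、喉を鳴らすのが好きです"
query = "魚は猫のように喉を鳴らしますか？"

english_tokens = TOKEN_PATTERN.findall(english)
japanese_tokens = TOKEN_PATTERN.findall(japanese)
query_tokens = TOKEN_PATTERN.findall(query)

print(english_tokens)
print(japanese_tokens)
print(query_tokens)
```

Under this assumption the whole query becomes a single token that does not appear in the corpus vocabulary, which would explain why every score is 0.00.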

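At a glance, the settings involved are the Stemmer and the stopwords.

Unfortunately, PyStemmer has no Japanese stemmer. It is optional to begin with, so I'll skip it this time.

No Japanese stopwords are provided either.

Stopwords are simply passed as a list, so you could define your own, but I'll skip that this time as well.
If you need one, ready-made Japanese stopword lists can be found online.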
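For reference, applying a stopword list to text that has already been split into tokens takes only a few lines of plain Python. In the sketch below, `JA_STOPWORDS` and `remove_stopwords` are illustrative names of my own, the handful of particles is a toy set rather than a recommended list, and the token list is written by hand.

```python
# Toy stopword set for illustration only (particles, copula, punctuation).
JA_STOPWORDS = {"は", "の", "で", "を", "が", "に", "です", "、"}

def remove_stopwords(tokens, stopwords=JA_STOPWORDS):
    """Return the tokens that are not stopwords, preserving their order."""
    return [tok for tok in tokens if tok not in stopwords]

# Hand-written tokens for the first corpus sentence.
tokens = ["猫", "は", "ネコ", "科", "の", "動物", "で", "、",
          "喉", "を", "鳴らす", "の", "が", "好き", "です"]

filtered = remove_stopwords(tokens)
print(filtered)
```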

The immediate problem is that the text is not being tokenized, so we need some kind of morphological analysis library.
This time I'll use janome.

pip install janome

The installed version was 0.5.0.
bm25s.tokenization.Tokenized needs to be given the lists of token IDs and the vocabulary dictionary, so I modified the Quickstart as follows.

import bm25s
from janome.tokenizer import Tokenizer

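def jtokenize(corpus):
    """Tokenize Japanese texts with janome and return a bm25s Tokenized object."""
    t = Tokenizer()
    vocabs = {}  # token (base form) -> integer ID
    ids = []     # one list of token IDs per document
    for text in corpus:
        # Use base forms so that inflected words share a single ID
        vocab = [tk.base_form for tk in t.tokenize(text)]
        id_tmp = []
        for v in vocab:
            if v not in vocabs:
                vocabs[v] = len(vocabs)
            id_tmp.append(vocabs[v])
        ids.append(id_tmp)
    return bm25s.tokenization.Tokenized(ids=ids, vocab=vocabs)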

corpus = [
    "猫はネコ科の動物で、喉を鳴らすのが好きです",
    "犬は人間の親友であり、遊ぶのが大好きです",
    "鳥は飛べる美しい動物です",
    "魚は水中に生息し泳ぐ生き物である",
]

corpus_tokens = jtokenize(corpus)

# Create the BM25 model and index the corpus
retriever = bm25s.BM25()
retriever.index(corpus_tokens)

# Query the corpus
query = "魚は猫のように喉を鳴らしますか？"
query_tokens = jtokenize([query])

# Get top-k results as a tuple of (doc ids, scores). Both are arrays of shape (n_queries, k).
# To return docs instead of IDs, set the `corpus=corpus` parameter.
results, scores = retriever.retrieve(query_tokens, k=2)

for i in range(results.shape[1]):
    doc, score = results[0, i], scores[0, i]
    print(f"Rank {i+1} (score: {score:.2f}): {doc}")

# You can save the corpus along with the model
retriever.save("animal_index_bm25", corpus=corpus)

The output was as follows.

Rank 1 (score: 2.05): 0
Rank 2 (score: 1.05): 3

The scores differ, but the ranking is the same as in the English case.

vocab.index.json now contains proper tokens as well.

{"猫": 0, "は": 1, "ネコ": 2, "科": 3, "の": 4, "動物": 5, "で": 6, "、": 7, "喉": 8, "を": 9, "鳴らす": 10, "が": 11, "好き": 12, "です": 13, "犬": 14, "人間": 15, "親友": 16, "だ": 17, "ある": 18, "遊ぶ": 19, "大好き": 20, "鳥": 21, "飛べる": 22, "美しい": 23, "魚": 24, "水中": 25, "に": 26, "生息": 27, "する": 28, "泳ぐ": 29, "生き物": 30, "": 31}

Summary

This post introduced BM25S, which makes it quick to get BM25 running in Python.
Japanese is hard!
