【Python】RustベースのTfidfVectorizerとBM25Vectorizerを試す【lenlp】

Last updated at 2024-12-03Posted at 2024-12-03

はじめに

TFIDFやBM25のベクトライズができるpythonライブラリを探していたら、Rustベースで実装されたlenlpというライブラリを見つけたので触ってみます。

lenlp

Rustベースの自然言語処理のツールキットです。
pythonから使うことができpipでインストール可能です。
ライセンスはMITです。
ベクトライザはfit、transformなど、sklearnと類似のAPIで使うことができます。
CountVectorizer、TfidfVectorizer、BM25Vectorizerの３種類あり、BM25にも対応しているところがscikitlearn（以下sklearn）との違いとして嬉しいところです。

以下は公式ページのサンプルコードです。

from lenlp import sparse

vectorizer = sparse.TfidfVectorizer(
    ngram_range=(3, 5), # Range of n-grams
    analyzer="char_wb", # Options: word, char, char_wb
    normalize=True, # Lowercase and strip accents
    stop_words=["based"] # List of stop words
)

X = [
    "Hello World", 
    "Rust based vectorizer"
]

matrix = vectorizer.fit_transform(X)

fitしてtransformでOKという感じで、sklearnを触ったことがあると非常に親しみやすいです。
試してみる、という言葉通りだと、サンプルが動けば十分ではあるのですが、せっかくなので、sklearnと対比しながらより細かい挙動を理解したいと思います。
というのも、TFIDFやBM25は、計算式や前処理、後処理がバリエーションが豊富なので、デフォルト値を使うと、ライブラリ間で、重みやスコアが結構変わってしまうためです。この変化は意外とランキングに影響を与えます。

TfidfRetriever

fit

初期化オプション

初期化とその挙動確認のためのfitを実行してみます。sklearnと比べながら試してみます。

ライブラリをインストールします。

pip install lenlp scikit-learn numpy datasets

ライブラリをインポートして、検証用のデータセット（ag_news）をロードします。

from lenlp import sparse
from sklearn.feature_extraction.text import TfidfVectorizer as SklearnTfidfVectorizer
import time
import sys
import numpy as np

from datasets import load_dataset

# AG Newsデータセットのダウンロード
ag_news = load_dataset('ag_news')
ag_news_train = ag_news['train']

とりあえずデフォルト値でfitして、語彙数を比べてみます。

# sklearnのTF-IDFベクトライザで1000件の文書に対してfitし、語彙サイズを確認する
sklearn_tfidf_1000 = SklearnTfidfVectorizer()
sklearn_tfidf_1000.fit(ag_news_train["text"][:1000])
print(f"sklearn TF-IDF vocabulary size for 1000 documents: {len(sklearn_tfidf_1000.vocabulary_)}")

# lenlpのTF-IDFベクトライザで1000件の文書に対してfitし、語彙サイズを確認する
lenlp_tfidf_1000 = sparse.TfidfVectorizer()
lenlp_tfidf_1000.fit(ag_news_train["text"][:1000])
print(f"lenlp TF-IDF vocabulary size for 1000 documents: {len(lenlp_tfidf_1000.vocabulary)}")

sklearn TF-IDF vocabulary size for 1000 documents: 7520
lenlp TF-IDF vocabulary size for 1000 documents: 8264

sklearnのほうが語彙数が少なくなっています。
これは、sklearnのanalyzerの初期値はwordなのですが、wordでは空白でのsplit以外にもハイフンでsplitしたり、token_patternに応じてトークンをフィルタリングしたりしているからです。

後の検証のために、lenlpと挙動を合わせたいので、analyzerに関数を指定します。同時に、lenlpではnormalizeの初期値がtrueなのでfalseにしておきます。

# sklearnのTF-IDFベクトライザで1000件の文書に対してfitし、語彙サイズを確認する
sklearn_tfidf_1000 = SklearnTfidfVectorizer(analyzer=lambda x: x.split())
sklearn_tfidf_1000.fit(ag_news_train["text"][:1000])
print(f"sklearn TF-IDF vocabulary size for 1000 documents: {len(sklearn_tfidf_1000.vocabulary_)}")

# lenlpのTF-IDFベクトライザで1000件の文書に対してfitし、語彙サイズを確認する
lenlp_tfidf_1000 = sparse.TfidfVectorizer(normalize=False)
lenlp_tfidf_1000.fit(ag_news_train["text"][:1000])
print(f"lenlp TF-IDF vocabulary size for 1000 documents: {len(lenlp_tfidf_1000.vocabulary)}")

sklearn TF-IDF vocabulary size for 1000 documents: 11567
lenlp TF-IDF vocabulary size for 1000 documents: 11567

これで語彙数が揃いました。

以下、lenlpの初期化オプションについて補足です。

インスタンス生成時に指定できる初期化オプションはngram_range、analyzer、normalize、stop_wordsの4種類のみです。
ngram_rangeとstop_wordsはsklearnと同様で、その他は仕様が違います。

analyzerはword, char, char_wbのみを指定でき、sklearnのようにトーカナイズ用の関数を指定できないことに注意です。このため、日本語に対応させるには、分かち書きを別に実行しておく必要があります。またこのlenlpにおけるanalyzer=wordは空白での単なるスプリットです。sklearnのanalyzer=wordが空白でのsplitに加えていろいろな処理を同時にするので、比較するとシンプルです。個人的には、暗黙によくわからないことをされるよりはシンプルな処理のほうが好みです。

normalizeはベクトルではなく文章の正規化です。Trueにすると大文字は小文字になり、アクセント記号は除去されるようです。sklearnのlowercade=Trueかつstrip_accentsにNone以外（unicodeとasciiのどちらと同一なのかはわかりません）を指定した場合に相当します。sklearnのnorm（ベクトルの正規化）に相当するオプションはありません。

idf

lenlpが採用しているidfの計算式を確認しておきます。
文書数（matrix.shape[0]）と文書頻度（tf）に1を加算しており、これはsklearnでいうとsmooth_idf=Trueの場合に相当します。

python/lenlp/sparse/tfidf_vectorizer.py

    def update(self, matrix: csr_matrix) -> csr_matrix:
        """Update the idf values."""
        tf = (matrix > 0).sum(axis=0)
        self.idf = (
            np.squeeze(a=np.asarray(a=np.log((matrix.shape[0] + 1.0) / (tf + 1.0)))) + 1
        )

sklearnのidfと比べてみます。

# sklearnのidfとlenlpのidf値を比較

# sklearnのidf値を取得
sklearn_idf = sklearn_tfidf_1000.idf_

# lenlpのidf値を取得
lenlp_idf = lenlp_tfidf_1000.idf

# sklearnのidf値とlenlpのidf値を比較し、idfが高いものを10こ出力する
idf_values = []
for term, sklearn_value in list(zip(sklearn_tfidf_1000.get_feature_names_out(), sklearn_idf))[:1000]:
    lenlp_value = lenlp_idf[lenlp_tfidf_1000.vocabulary[term]] if term in lenlp_tfidf_1000.vocabulary else None
    idf_values.append((term, sklearn_value, lenlp_value))

# sklearnのidf値が低い順にソートし、低いほうのIDFを観察する。高いIDFはどれも同じ値で正しく取得できているが判断しにくいため。」
idf_values.sort(key=lambda x: x[1], reverse=False)
for i in range(10):
    term, sklearn_value, lenlp_value = idf_values[i]
    print(f"単語: {term}, sklearnのidf値: {sklearn_value}, lenlpのidf値: {lenlp_value}")

単語: -, sklearnのidf値: 1.812930216882996, lenlpのidf値: 1.812930216882996
単語: (Reuters), sklearnのidf値: 2.7439688053917064, lenlpのidf値: 2.7439688053917064
単語: (AP), sklearnのidf値: 2.833580964081394, lenlpのidf値: 2.833580964081394
単語: AP, sklearnのidf値: 3.0965704239428034, lenlpのidf値: 3.0965704239428034
単語: --, sklearnのidf値: 3.3237873006446486, lenlpのidf値: 3.3237873006446486
単語: A, sklearnのidf値: 3.3548778877146797, lenlpのidf値: 3.3548778877146797
単語: ATHENS, sklearnのidf値: 4.541458949328746, lenlpのidf値: 4.541458949328746
単語: ..., sklearnのidf値: 4.650658241293739, lenlpのidf値: 4.650658241293739
単語: (AFP), sklearnのidf値: 4.773260563386071, lenlpのidf値: 4.773260563386071
単語: AFP, sklearnのidf値: 4.773260563386071, lenlpのidf値: 4.773260563386071

lenlpとsklearnでidfが一致していました。

速度

fitの速度をsklearnとlenlpで比較してみます。
lenlpのほうがRustベースなので速いことが期待されます。

10万件のコーパスのfitを10回やって平均と標準偏差を出力します。

# sklearnのTF-IDFベクトライザのfit時間を10回計測し、平均値と標準偏差を求める
sklearn_times = []
for _ in range(10):
    start_time = time.time()
    sklearn_tfidf = SklearnTfidfVectorizer(analyzer=lambda x: x.split())
    sklearn_tfidf.fit(ag_news_train["text"][:100000])
    end_time = time.time()
    sklearn_times.append(end_time - start_time)

sklearn_mean = np.mean(sklearn_times)
sklearn_std = np.std(sklearn_times)
print(f"sklearn TF-IDF fit time for 100,000 documents: mean = {sklearn_mean:.4f} seconds, std = {sklearn_std:.4f} seconds")

# lenlpのTF-IDFベクトライザのfit時間を10回計測し、平均値と標準偏差を求める
lenlp_times = []
for _ in range(10):
    start_time = time.time()
    lenlp_tfidf = sparse.TfidfVectorizer(normalize=False)
    lenlp_tfidf.fit(ag_news_train["text"][:100000])
    end_time = time.time()
    lenlp_times.append(end_time - start_time)

lenlp_mean = np.mean(lenlp_times)
lenlp_std = np.std(lenlp_times)
print(f"lenlp TF-IDF fit time for 100,000 documents: mean = {lenlp_mean:.4f} seconds, std = {lenlp_std:.4f} seconds")

sklearn TF-IDF fit time for 100,000 documents: mean = 1.9273 seconds, std = 0.0813 seconds
lenlp TF-IDF fit time for 100,000 documents: mean = 1.8600 seconds, std = 0.0572 seconds

意外とスピードは変わりませんでした。誤差の範囲内と思われます。

transform

速度

transformの速度をsklearnと比較してみます。
10万件のコーパスのtransformを10回やって平均と標準偏差を出力します。

# transformの時間をはかる。
# sklearnのTF-IDFベクトライザのtransform時間を10回計測し、平均値と標準偏差を求める
sklearn_transform_times = []
sklearn_tfidf = SklearnTfidfVectorizer(analyzer=lambda x: x.split())
sklearn_tfidf.fit(ag_news_train["text"][:100000])
for _ in range(10):
    start_time = time.time()
    sklearn_tfidf.transform(ag_news_train["text"][:100000])
    end_time = time.time()
    sklearn_transform_times.append(end_time - start_time)

sklearn_transform_mean = np.mean(sklearn_transform_times)
sklearn_transform_std = np.std(sklearn_transform_times)
print(f"sklearn TF-IDF transform time for 100,000 documents: mean = {sklearn_transform_mean:.4f} seconds, std = {sklearn_transform_std:.4f} seconds")

# lenlpのTF-IDFベクトライザのtransform時間を10回計測し、平均値と標準偏差を求める
lenlp_transform_times = []
lenlp_tfidf = sparse.TfidfVectorizer(normalize=False)
lenlp_tfidf.fit(ag_news_train["text"][:100000])

for _ in range(10):
    start_time = time.time()
    lenlp_tfidf.transform(ag_news_train["text"][:100000])
    end_time = time.time()
    lenlp_transform_times.append(end_time - start_time)

lenlp_transform_mean = np.mean(lenlp_transform_times)
lenlp_transform_std = np.std(lenlp_transform_times)
print(f"lenlp TF-IDF transform time for 100,000 documents: mean = {lenlp_transform_mean:.4f} seconds, std = {lenlp_transform_std:.4f} seconds")

sklearn TF-IDF transform time for 100,000 documents: mean = 1.5869 seconds, std = 0.0751 seconds
lenlp TF-IDF transform time for 100,000 documents: mean = 1.0966 seconds, std = 0.0282 seconds

transformについては誤差の範囲を超えて、lenlpのほうが早そうです。実行時間で大体2/3くらいです。

類似度

sklearnと置き換えて使えるか確認するために、TFIDF重みベクトルから計算された類似度（内積）がsklearnと一致するか確認します。

# 1000件の文書をそれぞれtransform
sklearn_vectors_1000 = sklearn_tfidf_1000.fit_transform(ag_news_train["text"][:1000])
lenlp_vectors_1000 = lenlp_tfidf_1000.fit_transform(ag_news_train["text"][:1000])

# 内積を計算
sklearn_dot_product = sklearn_vectors_1000.dot(sklearn_vectors_1000.T)
lenlp_dot_product = lenlp_vectors_1000.dot(lenlp_vectors_1000.T)

# 内積が一致するか確認
dot_product_similar = np.allclose(sklearn_dot_product.toarray(), lenlp_dot_product.toarray())
print(f"内積が一致するか: {dot_product_similar}")

内積が一致するか: True

一致しました。np.allcloseのデフォルト値はatol=10e-8、rtol=10e-5なので、絶対値として10e-8以下の差、または相対値として10e-5以下の差には収まっていることになります。

ベクトルのサイズ

先の検証で、内積がsklearnの場合と一致することから、lenlpではsklearnのデフォルト（norm="l2"）と同様に、ベクトルのサイズが1になるように正規化されているようです。

念の為確認してみます。

vectors = lenlp_tfidf_1000.transform(ag_news_train["text"][:10])
# それぞれのベクトルのノルムを計算して出力する
for i, vector in enumerate(vectors):
    norm = np.linalg.norm(vector.toarray(), ord=2)
    print(f"ベクトル {i+1} のノルム（L2ノルム）: {norm}")

ベクトル 1 のノルム（L2ノルム）: 1.0
ベクトル 2 のノルム（L2ノルム）: 1.0
ベクトル 3 のノルム（L2ノルム）: 1.0
ベクトル 4 のノルム（L2ノルム）: 1.0
ベクトル 5 のノルム（L2ノルム）: 1.0
ベクトル 6 のノルム（L2ノルム）: 1.0
ベクトル 7 のノルム（L2ノルム）: 1.0
ベクトル 8 のノルム（L2ノルム）: 1.0
ベクトル 9 のノルム（L2ノルム）: 1.0
ベクトル 10 のノルム（L2ノルム）: 1.0

確かにサイズ1のようです。
注意点として、l2以外のnormで正規化、または正規化しない、ということはlenlpではできなさそうです。

日本語処理

mecabと組み合わせて日本語のfitとtransformをやってみます。
先に述べたように、lenlpではanalyzerに関数を指定できないため、分かち書きは関数の外側でやる必要があります。

mecabと辞書をインストールします。

pip install mecab-python3 ipadic

検証用の日本語データセットをダウンロードします（以下は、train, val, testに分ける必要はないのですが、面倒なのでサンプルコードをそのまま使っています）

dataset = load_dataset(
    "shunk031/livedoor-news-corpus", 
    train_ratio=0.8,
    val_ratio=0.1,
    test_ratio=0.1,
    random_state=42, 
    shuffle=True,
)
train_data = dataset['train']
print(f"Number of training samples: {len(train_data)}")

Number of training samples: 5894

分かち書き用の関数を作ります。
analyzer=wordによる空白分割を利用するため、スペースで分かち書きしたstrを返す関数にします。

import MeCab
mecab = MeCab.Tagger('-Owakati')
def tokenize(text: str) -> str:
    return mecab.parse(text).strip()
print(tokenize(train_data["content"][0]))

3 日 に 放送 さ れ た 「 サンデージャポン 」 （ TBS 系 ） 番組 内 で は 、 片山 さつき 議員 と 元 衆議院 議員 で 現在 は タレント 活動 を 行う 杉村 太蔵 が 、 河本 準一 母 の 生活 保護 受給 問題 について 議論 し（以下略）

fitやtransformに入力する前にtokenizeを適用することで、日本語を扱うことができます。analyzerに関数を指定できるsklearnと比べると使いにくいですね。

# fit
splitted_tokenized_texts = [tokenize(text) for text in train_data["content"]]
lenlp_tfidf.fit(splitted_tokenized_texts)

# transform
splitted_text = tokenize(train_data["content"][0])
transformed = lenlp_tfidf.transform([splitted_text])
print(transformed)

<Compressed Sparse Row sparse matrix of dtype 'float32'
	with 159 stored elements and shape (1, 67510)>
  Coords	Values
  (0, 0)	0.06647482514381409
  ...

BM25Vectorizer

TfidfVectorizerと異なる箇所を中心に見ていきます。

fit

初期化オプション

初期化時の引数はTfidfVectorizerと共通の4つ（analyzer, ngram_range、normalize、stop_words）に加えて、BM25固有のオプションであるb, k1, epsilonが存在します。

python/lenlp/sparse/bm25_vectorizer.py

class BM25Vectorizer(TfidfVectorizer):
    """BM25Vectorizer is a class that converts a collection of text documents to a sparse
    bm25 matrix.

    Parameters
    ----------
    analyzer
        {word, char, char_wb}, default=word.
        Whether the feature should be made of word n-gram or character n-grams. Option
        char_wb creates character n-grams only from text inside word boundaries;
        n-grams at the edges of words are padded with space.
    ngram_range
        tuple (min_n, max_n), default=(1, 1).
        The lower and upper boundary of the range of n-values for different n-grams to
        be extracted. All values of n such that min_n <= n <= max_n will be used.
    normalize
        bool, default=True.
        Whether to normalize the text before counting. It will lowercase the text and remove
        punctuation.
    stop_words
        list of str, default=None.
        A list of stop words that will be removed from the text.
    b
        The impact of document length normalization.  Default is `0.75`, Higher will
        penalize longer documents more.
    k1
        How quickly the impact of term frequency saturates.  Default is `1.5`, Higher
        will make term frequency more influential.
    epsilon
        Smoothing term. Default is `0`.

b, k1, epsilonの初期値をpythonでよく使われるrank-bm25と比べると、b、k1はそれぞれ0.75、1.5で共通、epsilonはlenlpは0で、rank-bm25の0.25と異なります。また、後述しますが、epsilonの使われ方が、rank-bm25とは異なります。

k1は1.2か2.0が標準的という意見もあったり、実際、elasticsearchでは初期値が1.2だったりするので、他のライブラリと挙動を揃えたい場合は、意識して調整が必要です。

idf

lenlpのidfの実装を確認します。

python/lenlp/sparse/bm25_vectorizer.py

    def update(self, matrix: csr_matrix) -> csr_matrix:
        """Update the idf values."""
        self.tf = (matrix > 0).sum(axis=0)
        len_documents = (matrix).sum(axis=1)
        self.average_len = len_documents.mean()
        self.count = matrix.shape[0]

        self.idf = np.squeeze(
            a=np.asarray(
                a=np.log((self.count - self.tf + 0.5) / (self.tf + 0.5) + 1),
                dtype=np.float32,
            )
        )

rank-bm25と比べます。

rank_bm25.py

    def _calc_idf(self, nd):
        """
        Calculates frequencies of terms in documents and in corpus.
        This algorithm sets a floor on the idf values to eps * average_idf
        """
        # collect idf sum to calculate an average idf for epsilon value
        idf_sum = 0
        # collect words with negative idf to set them a special epsilon value.
        # idf can be negative if word is contained in more than half of documents
        negative_idfs = []
        for word, freq in nd.items():
            idf = math.log(self.corpus_size - freq + 0.5) - math.log(freq + 0.5)
            self.idf[word] = idf
            idf_sum += idf
            if idf < 0:
                negative_idfs.append(word)
        self.average_idf = idf_sum / len(self.idf)

        eps = self.epsilon * self.average_idf
        for word in negative_idfs:
            self.idf[word] = eps

lenlpでは、idfの計算でlogをとるまえに1を加算することで、idfが負になることを防いでます。
elasticsearchでも同様の方法が採用されているようです。

                a=np.log((self.count - self.tf + 0.5) / (self.tf + 0.5) + 1),

rank-bm25では、logをとるまえに1を足さずに、idfが負の場合は平均idfとepsilonの積に置き換えるということをしています。

        self.average_idf = idf_sum / len(self.idf)

        eps = self.epsilon * self.average_idf
        for word in negative_idfs:
            self.idf[word] = eps

wikipediaによればBM25のIDFの計算では

idf値の最小値を0とし、一般的な用語を完全に無視する
idf値の最小値を定数とし、一般的な用語を完全に無視することを避けつつ、影響を減らす
idfが必ず正となる定義式に変える

のバリエーションがあるらしく、lenlpは3つめ、rank-bm25は2つめのマイナーチェンジ版という感じに見えます。rank-bm25のようにaverage_idfと定数epsilonの積をとる方法は、置き換え後の微小値のスケールを他とあわせられることを狙っているのだと思いますが、idfに外れ値があった場合、average_idfが思ったより大きい値になってしまうリスクもあるので、少し不安定だなと思います。
個人的には、lenlpの方法のほうがリスクが少ないように感じます。

transform

transformの実装を確認します。

python/lenlp/sparse/bm25_vectorizer.py

    def _transform(self, matrix: csr_matrix) -> csr_matrix:
        """Transform a count matrix to a bm25 matrix."""
        len_documents = (matrix).sum(axis=1)
        regularization = np.squeeze(
            a=np.asarray(
                a=(
                    self.k1 * (1 - self.b + self.b * (len_documents / self.average_len))
                ).flatten()
            )
        )

        denominator = matrix.tocsc()
        denominator.data += np.take(a=regularization, indices=denominator.indices)
        matrix.data = (
            (matrix.data * (self.k1 + 1)) / denominator.tocsr().data
        ) + self.epsilon

        matrix = matrix.multiply(other=self.idf).tocsr()
        inplace_csr_row_normalize_l2(matrix)
        return matrix

epsilon

epsilonはrank-bm25ではidfが負になった場合にそれを置き換える用途で使われています。一方で、lenlpでは以下のように、補正したtfに加算されています。補正したtfの下限値を定めているということだと思います。

python/lenlp/sparse/bm25_vectorizer.py

        matrix.data = (
            (matrix.data * (self.k1 + 1)) / denominator.tocsr().data
        ) + self.epsilon

ベクトル正規化

重みベクトルをサイズ1に正規化しているようです。
BM25は正規化しないのが普通なので、なぜかはよくわかりません。

python/lenlp/sparse/bm25_vectorizer.py

        inplace_csr_row_normalize_l2(matrix)

念の為サイズを確認してみます。

lenlp_bm25 = sparse.BM25Vectorizer(normalize=False)
lenlp_bm25.fit(ag_news_train["text"][:1000])
vectors = lenlp_bm25.transform(ag_news_train["text"][:10])

# それぞれのベクトルのノルムを計算して出力する
for i, vector in enumerate(vectors):
    norm = np.linalg.norm(vector.toarray(), ord=2)
    print(f"ベクトル {i+1} のノルム（L2ノルム）: {norm}")

ベクトル 1 のノルム（L2ノルム）: 1.0
ベクトル 2 のノルム（L2ノルム）: 1.0
ベクトル 3 のノルム（L2ノルム）: 1.0
ベクトル 4 のノルム（L2ノルム）: 1.0
ベクトル 5 のノルム（L2ノルム）: 1.0
ベクトル 6 のノルム（L2ノルム）: 1.0
ベクトル 7 のノルム（L2ノルム）: 1.0
ベクトル 8 のノルム（L2ノルム）: 1.0
ベクトル 9 のノルム（L2ノルム）: 1.0
ベクトル 10 のノルム（L2ノルム）: 1.0

やはり正規化されているようです。

rank-bm25との比較

rank-bm25とlenlpで出力が揃うように、ソースコードの調整を試みます。

lenlpからは、ベクトルの正規化処理をコメントアウトします。また、重み計算時のepsilonの加算をやめます。

from lenlp import sparse
from scipy.sparse import csr_matrix
class LenlpBM25Vectorizer(sparse.BM25Vectorizer):
    # override
    def _transform(self, matrix: csr_matrix) -> csr_matrix:
        """Transform a count matrix to a bm25 matrix."""
        len_documents = (matrix).sum(axis=1)
        regularization = np.squeeze(
            a=np.asarray(
                a=(
                    self.k1 * (1 - self.b + self.b * (len_documents / self.average_len))
                ).flatten()
            )
        )

        denominator = matrix.tocsc()
        denominator.data += np.take(a=regularization, indices=denominator.indices)
        matrix.data = (
            (matrix.data * (self.k1 + 1)) / denominator.tocsr().data
        )# + self.epsilon epsilonは使わない

        matrix = matrix.multiply(other=self.idf).tocsr()
        # inplace_csr_row_normalize_l2(matrix) # l2正規化しない
        return matrix

rank-bm25も、idfの計算をlenlpによせるように変更します。

from rank_bm25 import BM25Okapi
import math

class RankBM25Okapi(BM25Okapi):
    # override
    def _calc_idf(self, nd):
        """
        Calculates frequencies of terms in documents and in corpus.
        This algorithm sets a floor on the idf values to eps * average_idf
        """
        # collect idf sum to calculate an average idf for epsilon value
        idf_sum = 0
        # collect words with negative idf to set them a special epsilon value.
        # idf can be negative if word is contained in more than half of documents
        # idfが負になることはないので、コメントアウト
        # negative_idfs = []
        for word, freq in nd.items():
            #idf = math.log( self.corpus_size - freq + 0.5) - math.log(freq + 0.5)
            idf = math.log( (self.corpus_size - freq + 0.5) / (freq + 0.5) + 1) # logを取る前に1をたすことでidfが負にならないようにする
            self.idf[word] = idf
            idf_sum += idf
        # idfが負になることはないので、コメントアウト
        #    if idf < 0:
        #        negative_idfs.append(word)
        self.average_idf = idf_sum / len(self.idf)

        # idfが負になることはないので、コメントアウト
        #eps = self.epsilon * self.average_idf
        #for word in negative_idfs:
        #    self.idf[word] = eps

これで互いの計算結果が一致すると期待されるので試してみます。

まず語彙数とidfを比較します。

corpus = ag_news_train["text"][:1000]
tokenized_corpus = [text.split() for text in corpus]

lenlp_bm25 = LenlpBM25Vectorizer(normalize=False)
lenlp_bm25.fit(corpus)

rankbm25_bm25 = RankBM25Okapi(tokenized_corpus, k1 = 1.5)

# rankbm25_bm25のidfを取得
rankbm25_idf = rankbm25_bm25.idf

# lenlpのidfを取得
lenlp_vocab = lenlp_bm25.vocabulary
lenlp_idf = lenlp_bm25.idf

print(f"rankbm25のidfの長さ: {len(rankbm25_idf)}")
print(f"lenlpのidfの長さ: {len(lenlp_idf)}")



# 5個の共通の単語についてidfを比較
for i, (word, id) in enumerate(list(lenlp_vocab.items())[:5]):
    print(f"単語: {word}")
    print(f"rankbm25のidf: {rankbm25_idf[word]}")
    print(f"lenlpのidf: {lenlp_idf[id]}")
    print("-" * 30)

rankbm25のidfの長さ: 11567
lenlpのidfの長さ: 11567
単語: \$297
rankbm25のidf: 6.503289671207057
lenlpのidf: 6.503289699554443
------------------------------
単語: coast
rankbm25のidf: 6.503289671207057
lenlpのidf: 6.503289699554443
------------------------------
単語: Warmer
rankbm25のidf: 5.992464047441065
lenlpのidf: 5.992464065551758
------------------------------
単語: Personal
rankbm25のidf: 5.655991810819852
lenlpのidf: 5.655992031097412
------------------------------
単語: lawmakers.
rankbm25のidf: 6.503289671207057
lenlpのidf: 6.503289699554443
------------------------------

語彙数は一致しました。
idfもほぼ一致しています。浮動小数点計算に伴う誤差らしきものはあります。

スコアを確認します。
BM25はクエリのカウントベクトルと文書の重みベクトルの内積として計算されるので、BM25Vectorizerに加えてCountVectorizerも定義します。

corpus = ag_news_train["text"][:1000]
tokenized_corpus = [text.split() for text in corpus]

lenlp_count = sparse.CountVectorizer(normalize=False)
lenlp_count.fit(corpus)
lenlp_bm25 = LenlpBM25Vectorizer(normalize=False)
lenlp_bm25.fit(corpus)

rankbm25_bm25 = RankBM25Okapi(tokenized_corpus, k1 = 1.5)
lenlp_scores = lenlp_count.transform(corpus).dot(lenlp_bm25.transform(corpus).T)

rankbm25_scores = []
for query in tokenized_corpus:
    rankbm25_scores.append(rankbm25_bm25.get_scores(query))
rankbm25_scores = np.array(rankbm25_scores)


# ２つのscoresが同じか判定
scores_similar = np.allclose(lenlp_scores.toarray(), rankbm25_scores, atol=1e-8, rtol=1e-5)
print(f"２つのscoresが同じか: {scores_similar}")

２つのscoresが同じか: False

一致しませんでした。具体的に中身を見てみます。

for i in range(10):
    print(f"RankBM25 Score {i+1}: {rankbm25_scores[i][0]}")
    print(f"Lenlp Score {i+1}: {lenlp_scores.toarray()[i][0]}")
    print("-" * 30)

RankBM25 Score 1: 120.17851884569164
Lenlp Score 1: 119.9459457397461
------------------------------
RankBM25 Score 2: 8.223393887078299
Lenlp Score 2: 29.620906829833984
------------------------------
RankBM25 Score 3: 11.535474278898663
Lenlp Score 3: 32.01129150390625
------------------------------
RankBM25 Score 4: 7.142324337969737
Lenlp Score 4: 20.73758888244629
------------------------------
RankBM25 Score 5: 1.355759101922211
Lenlp Score 5: 7.254471778869629
------------------------------
RankBM25 Score 6: 7.142324337969737
Lenlp Score 6: 20.73758888244629
------------------------------
RankBM25 Score 7: 2.7579612688974473
Lenlp Score 7: 17.17241859436035
------------------------------
RankBM25 Score 8: 1.6768917197888855
Lenlp Score 8: 8.289098739624023
------------------------------
RankBM25 Score 9: 2.4368286510307726
Lenlp Score 9: 16.13779067993164
------------------------------
RankBM25 Score 10: 108.34672674860514
Lenlp Score 10: 105.46074676513672
------------------------------

確かに一致していません。
原因を探るため、lenlpのCountVectorizerとBM25Vectorizerのvocabularyを比べてみます。

for (word, id) in list(lenlp_bm25.vocabulary.items())[:10]:
    print(f"単語: {word}, bm25 id: {id}, count id: {lenlp_count.vocabulary[word]}")

単語: Roadblocks\\Today,, bm25 id: 6595, count id: 6598
単語: ubiquitous, bm25 id: 9969, count id: 9982
単語: Q2, bm25 id: 5974, count id: 5972
単語: bargain, bm25 id: 705, count id: 715
単語: needed, bm25 id: 5081, count id: 5089
単語: shelters..., bm25 id: 9388, count id: 9392
単語: Tampa, bm25 id: 9867, count id: 9862
単語: Service,, bm25 id: 10810, count id: 10813
単語: Lasting, bm25 id: 10870, count id: 10871
単語: pulls, bm25 id: 11351, count id: 11355

単語がBM25VectorizerとCountVectorizerで異なるIDに振られていました。
スコアが違ったのもこれが原因なようです。
BM25VectorizerとCountVectorizerを独立にfitしてvocabularyのidを揃える方法はよくわかりませんでした。
幸い、CountVectorizerはBM25Vectorizerと継承関係にある（祖父クラス）ので

sparse.CountVectorizer.transform(lenlp_bm25, corpus)

のようにしてlenlp_bm25と条件を揃えてCountVectorizerのメソッドを使うことができます。
あんまり多用したい方法ではないですが、これでなんとかしてみます。

corpus = ag_news_train["text"][:1000]
tokenized_corpus = [text.split() for text in corpus]

lenlp_bm25 = LenlpBM25Vectorizer(normalize=False)
lenlp_bm25.fit(corpus)

rankbm25_bm25 = RankBM25Okapi(tokenized_corpus, k1 = 1.5)
lenlp_scores = sparse.CountVectorizer.transform(lenlp_bm25, corpus).dot(lenlp_bm25.transform(corpus).T)

rankbm25_scores = []
for query in tokenized_corpus:
    rankbm25_scores.append(rankbm25_bm25.get_scores(query))
rankbm25_scores = np.array(rankbm25_scores)


# ２つのscoresが同じか判定
scores_similar = np.allclose(lenlp_scores.toarray(), rankbm25_scores, atol=1e-8, rtol=1e-5)
print(f"２つのscoresが同じか: {scores_similar}")

２つのscoresが同じか: True

無事一致しました。

改善案としては、以下のように、BM25Vectorzerのメソッドとして追加してしまう方が、sparse.CountVectorizerを直接呼び出すよりは、多少保守性は高いかもしれません。
五十歩百歩かもですが。

class LenlpBM25Vectorizer(sparse.BM25Vectorizer):
    """略"""
    def transform_to_count(self, raw_documents: list[str]) -> csr_matrix:
        return sparse.CountVectorizer.transform(self, raw_documents = raw_documents)

おわりに

たまたま見つけたRustベースの自然言語処理ツールlenlpを試してみました。
ベクトライザはsklearnとインターフェースが類似しており、親しみやすいです。
速度についてはfitは同速でしたが、transformは結構速そうでした。
ただ、分かち書きはクラスの外で行う必要があり手間なので、基本はsklearnで良いかなと思います。

BM25Vectorizerはsklearnにはないので魅力的です。ただ、実装には少しクセがあり、オーバーライドなどを多用していくと、自分で実装したほうがマシに感じてくるかもしれません。

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up