[langchain] Speeding up BM25Retriever (scikit-learn vs rank_bm25)

Posted at 2024-10-03

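Overview

I compared the speed of langchain's BM25Retriever used as-is (backed by rank_bm25) against a version rewritten to use a scikit-learn-based BM25 vectorizer internally.
In short, switching to the scikit-learn-based version looks likely to give a substantial speedup.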

Background

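langchain's BM25Retriever uses rank_bm25 internally.
BM25Retriever does not keep vectors for the documents being searched. Instead, it computes similarity scores with rank_bm25's get_scores function (which is called inside the get_top_n function of the BM25Okapi class). get_scores is implemented by checking term matches against the corpus for every query, so it takes a long time to run (reference).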
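
For reference, the textbook form of the Okapi BM25 score that has to be evaluated for every document at query time can be written as below. This is the standard formulation, not a transcription of rank_bm25's source; the exact IDF variant and the default values of $k_1$ and $b$ are assumptions here and may differ between implementations.

```latex
\mathrm{score}(D, Q) = \sum_{q \in Q} \mathrm{IDF}(q)\,
  \frac{f(q, D)\,(k_1 + 1)}
       {f(q, D) + k_1 \left(1 - b + b\,\dfrac{|D|}{\mathrm{avgdl}}\right)}
```

Here $f(q, D)$ is the frequency of query term $q$ in document $D$, $|D|$ is the length of the document, and $\mathrm{avgdl}$ is the average document length in the corpus. The sum runs over the query terms and is repeated for each document in the corpus.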

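By contrast, langchain's TFIDFRetriever keeps the vectors of the searchable documents in a property called tfidf_array and computes their similarity to the query vector with numpy-based vectorized operations, so it is considerably faster in comparison.
It would be nice to do the same in BM25Retriever, but rank_bm25 has no method for converting text into vectors, so this is not straightforward.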
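
The idea of keeping document vectors around can be illustrated with a minimal pure-Python sketch. It precomputes one sparse BM25 weight vector per document once, so that a query only needs to look up and sum stored weights. The function names `build_bm25_weights` and `search`, the whitespace tokenization, the IDF variant, and the values `k1=1.5` and `b=0.75` are all assumptions made for illustration; this is neither rank_bm25's nor langchain's code, and a real implementation would store the vectors as a sparse matrix.

```python
import math
from collections import Counter


def build_bm25_weights(docs, k1=1.5, b=0.75):
    """Precompute one sparse BM25 weight vector (term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents each term appears
    df = Counter(term for toks in tokenized for term in set(toks))
    # Assumed IDF variant that is always positive
    idf = {t: math.log((n_docs - n + 0.5) / (n + 0.5) + 1.0) for t, n in df.items()}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        norm = k1 * (1.0 - b + b * len(toks) / avgdl)
        vectors.append({t: idf[t] * f * (k1 + 1.0) / (f + norm) for t, f in tf.items()})
    return vectors


def search(vectors, query, k=2):
    """Score every document by summing the stored weights of the query terms."""
    terms = query.lower().split()
    scores = [sum(vec.get(t, 0.0) for t in terms) for vec in vectors]
    return sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)[:k]


docs = [
    "the cat sat on the mat",
    "dogs chase the cat",
    "stock markets fell sharply",
]
vectors = build_bm25_weights(docs)  # done once, when the retriever is built
top = search(vectors, "stock markets", k=1)
print(top)
```

Here `top` is `[2]`, since only the third document contains the query terms. The expensive part (term statistics and weights) is paid once at build time instead of on every query.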

Or so I thought, but recently someone published a scikit-learn-based BM25 vectorizer implementation that has the same API as TfidfVectorizer.

By borrowing this implementation and adapting TFIDFRetriever to BM25, it looked possible to improve on the speed of the original BM25Retriever, so I tried it.

Installation

pip install langchain-community==0.3.1 rank-bm25 scikit-learn datasets

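Implementation

Save bm25.py in the same directory as the main script so that BM25Vectorizer can be imported.

Next, create a BM25SklearnRetriever class that inherits from langchain's TFIDFRetriever. Override from_texts so that it uses BM25Vectorizer internally.
The other methods should not need any changes (I believe).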

from bm25 import BM25Vectorizer
from typing import Iterable, Optional, Dict, Any
from langchain_core.documents import Document
from langchain_community.retrievers import TFIDFRetriever

class BM25SklearnRetriever(TFIDFRetriever):

    @classmethod
    def from_texts(
        cls,
        texts: Iterable[str],
        metadatas: Optional[Iterable[dict]] = None,
        tfidf_params: Optional[Dict[str, Any]] = None,
        **kwargs: Any,
    ) -> TFIDFRetriever:

        tfidf_params = tfidf_params or {}
        # Use BM25Vectorizer instead of TfidfVectorizer
        # vectorizer = TfidfVectorizer(**tfidf_params)
        vectorizer = BM25Vectorizer(**tfidf_params)
        tfidf_array = vectorizer.fit_transform(texts)
        metadatas = metadatas or ({} for _ in texts)
        docs = [Document(page_content=t, metadata=m) for t, m in zip(texts, metadatas)]
        return cls(vectorizer=vectorizer, docs=docs, tfidf_array=tfidf_array, **kwargs)

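* With the implementation above, the corpus and the query are both transformed with the same weighting. Going by the definition of BM25 scoring, though, I suspect the query should really be transformed with CountVectorizer. On the other hand, that would break symmetry, which also seems questionable. Treating it like a TF-IDF similarity search departs from the original BM25 scoring definition, but that may be acceptable too. I can't say which is right, and implementing the alternative would be a hassle, so I will set this aside for now and move on.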
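
The two options in the note above can be contrasted with a small sketch. All vectors and weights below are made-up values for illustration, and the helper names `dot` and `cosine` are hypothetical. It assumes the TF-IDF-style retriever ranks by cosine similarity between two weighted vectors, while BM25-style scoring sums the document's weights over the raw query term counts.

```python
import math


def dot(a, b):
    """Dot product of two sparse vectors stored as term -> weight dicts."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def cosine(a, b):
    """Cosine similarity of two sparse vectors (0.0 if either is empty)."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    return dot(a, b) / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical BM25-weighted document vectors (values invented for illustration)
doc_vectors = [
    {"card": 1.2, "fraud": 2.0, "uk": 0.8},
    {"card": 1.0, "memory": 1.5, "nokia": 2.2, "sd": 1.9},
]

# Query as raw term counts (what a CountVectorizer-style transform would give)
query_counts = {"card": 1, "fraud": 1}
# Query transformed with the same BM25 weighting as the corpus (invented values)
query_weighted = {"card": 0.9, "fraud": 1.8}

# Asymmetric, BM25-style: sum the document weights of the query terms
asymmetric = [dot(query_counts, d) for d in doc_vectors]
# Symmetric, TF-IDF-retriever-style: cosine similarity of two weighted vectors
symmetric = [cosine(query_weighted, d) for d in doc_vectors]

print(asymmetric)
print(symmetric)
```

The cosine variant divides by the length of each document vector and weights the query terms as well, so the two scores are on different scales and their rankings are not guaranteed to agree on a real corpus.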

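Speed comparison

Compare the speed of BM25Retriever and BM25SklearnRetriever.

First, build a corpus for the experiment.
Take 100,000 texts from the ag_news training split.
Also take 1,000 queries from the test split.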

from datasets import load_dataset

# Load the AG News dataset
dataset = load_dataset('ag_news')

# Define sub-corpus size
corpus_size = 100000

# Extract the text data from the dataset
corpus = [item['text'] for item in dataset['train'].select(range(corpus_size))]  # First corpus_size training items

queries = [item['text'] for item in dataset['test'].select(range(1000))]

Create the two retrievers.
Configure each one to return the top 1,000 documents as search results.

from langchain_community.retrievers import BM25Retriever
rank_bm25_retriever = BM25Retriever.from_texts(corpus)
skearn_bm25_retriever = BM25SklearnRetriever.from_texts(corpus)

rank_bm25_retriever.k = 1000
skearn_bm25_retriever.k = 1000

Measure the time needed to run 10 queries.

import time
import tqdm

# Define the number of queries to search
num_queries = 10

# Measure the time taken to search the specified number of queries using rank_bm25_retriever
start_time = time.time()
for query in tqdm.tqdm(queries[:num_queries]):
    rank_bm25_retriever.invoke(query)
rank_bm25_time = time.time() - start_time

# Measure the time taken to search the specified number of queries using skearn_bm25_retriever
start_time = time.time()
for query in tqdm.tqdm(queries[:num_queries]):
    skearn_bm25_retriever.invoke(query)
skearn_bm25_time = time.time() - start_time

print(f"Time taken for rank_bm25_retriever to search {num_queries} queries: {rank_bm25_time:.2f} seconds")
print(f"Time taken for skearn_bm25_retriever to search {num_queries} queries: {skearn_bm25_time:.2f} seconds")

100%|██████████| 10/10 [00:13<00:00,  1.33s/it]
100%|██████████| 10/10 [00:00<00:00, 40.04it/s]
Time taken for rank_bm25_retriever to search 10 queries: 13.31 seconds
Time taken for skearn_bm25_retriever to search 10 queries: 0.25 seconds

BM25SklearnRetriever turned out to be much faster: 0.25 seconds versus 13.31 seconds, roughly 1/50 of the time.

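Comparing search results

Next, check whether the original BM25Retriever (using rank_bm25) and BM25SklearnRetriever return the same search results.

For 100 queries, compare how much the top 10 search results overlap.
The Jaccard coefficient is used for the comparison.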
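
As a reminder, the Jaccard coefficient of two result sets $A$ and $B$ is the size of their intersection divided by the size of their union. The worked numbers below are a hypothetical example, not a measured result.

```latex
J(A, B) = \frac{|A \cap B|}{|A \cup B|},
\qquad
\text{e.g. } |A| = |B| = 10,\ |A \cap B| = 4
\ \Rightarrow\
J = \frac{4}{10 + 10 - 4} = \frac{4}{16} = 0.25
```

Because the union grows as the lists diverge, two top-10 lists that share 4 of their 10 items already score only 0.25.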

num_queries = 100
# Collect the texts retrieved for each query using rank_bm25_retriever
rank_bm25_texts = []
for query in tqdm.tqdm(queries[:num_queries]):
    documents = rank_bm25_retriever.invoke(query)
    texts = [document.page_content for document in documents]
    rank_bm25_texts.append(texts)

# Collect the texts retrieved for each query using skearn_bm25_retriever
skearn_bm25_texts = []
for query in tqdm.tqdm(queries[:num_queries]):
    documents = skearn_bm25_retriever.invoke(query)
    texts = [document.page_content for document in documents]
    skearn_bm25_texts.append(texts)

import numpy as np
num_results = 10
def jaccard_similarity(list1, list2):
    set1 = set(list1)
    set2 = set(list2)
    intersection = len(set1.intersection(set2))
    union = len(set1.union(set2))
    return intersection / union

# Calculate the average Jaccard similarity between the two retrievers' top results
jaccard_scores = []
for i in range(num_queries):
    jaccard = jaccard_similarity(rank_bm25_texts[i][:num_results], skearn_bm25_texts[i][:num_results])
    jaccard_scores.append(jaccard)

average_jaccard_score = np.mean(jaccard_scores)
print(f"Average Jaccard similarity score for {num_queries} queries: {average_jaccard_score:.2f}")

100%|██████████| 100/100 [01:32<00:00,  1.08it/s]
100%|██████████| 100/100 [00:02<00:00, 39.48it/s]
Average Jaccard similarity score for 100 queries: 0.23

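The Jaccard coefficient was 0.23.
Frankly, that is not very high.
However, the default values of parameters that were not set explicitly differ between the two, so this may be about what to expect.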
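
One candidate for such a difference is tokenization. The sketch below assumes that the rank_bm25-backed retriever splits text on whitespace by default and that BM25Vectorizer follows TfidfVectorizer's defaults (lowercasing plus the token pattern `(?u)\b\w\w+\b`). Both assumptions should be checked against the versions actually in use; the function names are hypothetical.

```python
import re


def whitespace_tokens(text):
    """Plain whitespace splitting (assumed default preprocessing of BM25Retriever)."""
    return text.split()


def sklearn_style_tokens(text):
    """Lowercase, then keep runs of two or more word characters
    (assumed to match scikit-learn's default analyzer)."""
    return re.findall(r"(?u)\b\w\w+\b", text.lower())


text = "Calif. OKs World's Toughest Smog Rules (AP)"
ws_tokens = whitespace_tokens(text)
sk_tokens = sklearn_style_tokens(text)
print(ws_tokens)
print(sk_tokens)
```

Under these assumptions, whitespace splitting keeps case and punctuation (`'Calif.'`, `"World's"`, `'(AP)'`), while the scikit-learn-style analyzer yields `'calif'`, `'world'`, and `'ap'`. A query term such as `calif` would then match in one retriever and not in the other, which alone could change the rankings.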

Let's also look at the retrieved texts themselves, comparing the top 4 results.

# Output and compare the texts

num_queries = 10
# Define the number of texts to compare
num_texts_to_compare = 4

# Output and compare the texts for each query
for i in range(num_queries):
    print(f"Query {i+1}:")

    print("Rank BM25 Retriever Texts:")
    for text in rank_bm25_texts[i][:num_texts_to_compare]:
        print(text)
    print("\n")

    print("Skearn BM25 Retriever Texts:")
    for text in skearn_bm25_texts[i][:num_texts_to_compare]:
        print(text)
    print("\n")

    # Compare the texts
    rank_bm25_set = set(rank_bm25_texts[i][:num_texts_to_compare])
    skearn_bm25_set = set(skearn_bm25_texts[i][:num_texts_to_compare])
    common_texts = rank_bm25_set.intersection(skearn_bm25_set)
    print("number of ommon texts:", len(common_texts))

Query 1:
Rank BM25 Retriever Texts:
Federal-Mogul may sell T amp;N assets after pension offer snub Federal-Mogul, the engineering company whose bankruptcy in the US is threatening the pensions of thousands of workers at its subsidiary Turner  amp; Newall, is sizing up a sale of its UK businesses.
Unions in talks over Jaguar blow Unions are to hold an emergency meeting with workers at the doomed Browns Lane Jaguar plant in Coventry. Parent company Ford plans to stop car production at the plant, with 400 voluntary redundancies and 425 jobs moved to the Castle Bromwich factory.
N Korea stalls nuclear talks North Korea rules out further nuclear talks unless South Korea's own nuclear tests are "fully probed".
IBM in pension settlement talks NEW YORKIBM said yesterday it is in talks to settle a mammoth lawsuit alleging a pension plan adopted by the firm in the 1990s discriminated against 140,000 older workers.


Skearn BM25 Retriever Texts:
Federal-Mogul may sell T amp;N assets after pension offer snub Federal-Mogul, the engineering company whose bankruptcy in the US is threatening the pensions of thousands of workers at its subsidiary Turner  amp; Newall, is sizing up a sale of its UK businesses.
Pay dispute ends in South Africa South Africa's wage dispute ends after unions representing public sector workers accept a revised government offer.
Unions in talks over Jaguar blow Unions are to hold an emergency meeting with workers at the doomed Browns Lane Jaguar plant in Coventry. Parent company Ford plans to stop car production at the plant, with 400 voluntary redundancies and 425 jobs moved to the Castle Bromwich factory.
IBM in pension settlement talks NEW YORKIBM said yesterday it is in talks to settle a mammoth lawsuit alleging a pension plan adopted by the firm in the 1990s discriminated against 140,000 older workers.


number of ommon texts: 3
Query 2:
Rank BM25 Retriever Texts:
SpaceShipOne Wins  #36;10 Million Ansari X Prize in Historic 2nd Trip to Space (SPACE.com) SPACE.com - MOJAVE, CALIFORNIA - Human \  flight took a significant step forward today as the privately built SpaceShipOne \  flew into suborbital space for the second time in five days, apparently securing \  the  #36;10 million Ansari X Prize.
The Next Great Space Race: SpaceShipOne and Wild Fire to Go For the Gold (SPACE.com) SPACE.com - A piloted rocket ship race to claim a  #36;10 million Ansari X Prize purse for privately financed flight to the edge of space is heating up.
SpaceShipOne a Flight Away From  #36;10M Prize (AP) AP - SpaceShipOne is one flight away from clinching the Ansari X Prize, a  #36;10 million award for the first privately developed manned rocket to reach space twice within 14 days.
Private rocketship begins quest for \$10 million X Prize The first private manned rocket to reach space soared toward the edge of the atmosphere on Wednesday in a bid to earn the \$10 million Ansari X Prize.


Skearn BM25 Retriever Texts:
Canadian Team Joins Rocket Launch Contest (AP) AP - The  #36;10 million competition to send a private manned rocket into space started looking more like a race Thursday, when a Canadian team announced plans to launch its rocket three days after an American group intends to begin qualifying for the Ansari X prize.
SpaceShipOne Wins  #36;10 Million Ansari X Prize in Historic 2nd Trip to Space (SPACE.com) SPACE.com - MOJAVE, CALIFORNIA - Human \  flight took a significant step forward today as the privately built SpaceShipOne \  flew into suborbital space for the second time in five days, apparently securing \  the  #36;10 million Ansari X Prize.
SpaceShipOne wins \$10MX Prize SpaceShipOne has completed its second suborbital flight in five days, securing the \$10 million Ansari X Prize. SpaceShipOne is the first privately built, manned vehicle to reach space.
SpaceShipOne to Try for  #36;10 Million Ansari X-Prize (Reuters) Reuters - The U.S. team that took the first\privately funded, manned ship into space makes a bid to capture\a  #36;10 million prize this week -- signaling that commercial\space travel has nearly arrived.


number of ommon texts: 1
Query 3:
Rank BM25 Retriever Texts:
Volcanoes May Have Sparked Life on Earth, Study Says Peptides are chains of amino acids. They form the proteins that are the basis of living cells. Scientists have had little success in demonstrating a plausible chemical reaction that could have formed peptides 
Study Links Tree Rings to Global Warming (AP) AP - Did global warming spur severe drought in the Western United States? A new study co-authored by a tree-ring researcher at the University of Arizona shows a possible connection.
Scientists to Study the Genes of Soybeans (AP) AP - Indiana University has received a three-year,  #36;2.6 million grant to study genes that make soybean plants resist disease.
UAPB Gets  #36;2.5M Science Grant From NSF (AP) AP - The National Science Foundation has awarded a  #36;2.5 million grant to the University of Arkansas at Pine Bluff to steer minority students into the sciences, math and technology.


Skearn BM25 Retriever Texts:
Volcanoes May Have Sparked Life on Earth, Study Says Peptides are chains of amino acids. They form the proteins that are the basis of living cells. Scientists have had little success in demonstrating a plausible chemical reaction that could have formed peptides 
MAKING PEPTIDES ON EARLY EARTH Whether formed on Earth or brought here by meteorites, -amino acids are widely assumed to have been present in the prebiotic chemical soup.
Component of volcanic gas may have played a significant role in &lt;b&gt;...&lt;/b&gt; Scientists at The Scripps Research Institute and the Salk Institute for Biological Studies are reporting a possible answer to a longstanding question in research on the origins of life on Earth--how did the first amino acids form the first peptides?
Israeli, American Chemists Win Nobel  Two Israelis and an American won the Nobel Prize in  Chemistry yesterday for discovering the method by which cells tag proteins that are defective or have outlived their usefulness and direct them to the cellular machinery that grinds them up into reusable parts.


number of ommon texts: 1
Query 4:
Rank BM25 Retriever Texts:
Mariners will start anew with Hargrove as manager Mike Hargrove knows just what he #39;s getting into as the new manager of the last-place Seattle Mariners. After all, he lost 98, 95 and 91 games in his final three years with Baltimore.
Smudger on Sport ALL golfers will turn their attention to the BMW Open in Munich this weekend, when the last five places in the European Ryder Cup team will be decided.
HUGHES TO PICK SOUNESS BRAIN Mark Hughes will pick Graeme Souness brains about his new job as boss of Blackburn - but the Welshman insists he will bring his own brand of management to Ewood Park.
South Carolina Coach Holtz Gives Up Game (AP) AP - South Carolina coach Lou Holtz announced Monday he will retire, ending one of the most successful and colorful college football careers.


Skearn BM25 Retriever Texts:
Yankees GM knows his next mission Told his job is safe for now, Cashman will seek answers to team #39;s pitching collapse. By Mike Fitzpatrick. NEW YORK -- Brian Cashman #39;s job is safe -- at least for now.
Mariners will start anew with Hargrove as manager Mike Hargrove knows just what he #39;s getting into as the new manager of the last-place Seattle Mariners. After all, he lost 98, 95 and 91 games in his final three years with Baltimore.
South Carolina Coach Holtz Gives Up Game (AP) AP - South Carolina coach Lou Holtz announced Monday he will retire, ending one of the most successful and colorful college football careers.
Lightning Strike Injures 40 on Texas Field (AP) AP - About 40 players and coaches with the Grapeland High School football team in East Texas were injured, two of them critically, when lightning struck near their practice field Tuesday evening, authorities said.


number of ommon texts: 2
Query 5:
Rank BM25 Retriever Texts:
Calif. OKs World's Toughest Smog Rules (AP) AP - California air regulators Friday unanimously approved the world's most stringent rules to reduce auto emissions that contribute to global warming  #151; a move that could affect car and truck buyers from coast to coast.
Calif. Regulators Weigh Smog Restrictions (AP) AP - California air regulators on Thursday took up the world's most ambitious rules to reduce car emissions that contribute to global warming  #151; an effort that could have a sweeping effect on how the country fights vehicle pollution.
Calif. OKs Toughest Auto Emissions Rules LOS ANGELES - California has adopted the world's first rules to reduce greenhouse emissions for autos, taking what supporters see as a dramatic step toward cleaning up the environment but also ensuring higher costs for drivers.    The rules may lead to sweeping changes in vehicles nationwide, especially if other states opt to follow California's example...
 #36;1.3M Plan Aims to Save Calif. State Fish (AP) AP - Federal and state officials plan to announce Friday an agreement to spend  #36;1.3 million over the next five years to save California's state fish.


Skearn BM25 Retriever Texts:
Calif. Regulators Weigh Smog Restrictions (AP) AP - California air regulators on Thursday took up the world's most ambitious rules to reduce car emissions that contribute to global warming  #151; an effort that could have a sweeping effect on how the country fights vehicle pollution.
Calif. OKs World's Toughest Smog Rules (AP) AP - California air regulators Friday unanimously approved the world's most stringent rules to reduce auto emissions that contribute to global warming  #151; a move that could affect car and truck buyers from coast to coast.
Calif. Air Board Prepares to Vote on Car Emissions (Reuters) Reuters - California pollution regulators\conducted a review on Thursday of a far-reaching proposal to\order the automobile industry to cut greenhouse gas emissions\from new cars and trucks sold in the state.
San Francisco Plan Aims to Slash Greenhouse Gases (Reuters) Reuters - Three days after California\regulators adopted tough rules to cut car pollution, San\Francisco's mayor unveiled a plan on Monday to reduce\greenhouse gas emissions, saying cities must take action\because the Bush administration is ignoring global warming.


number of ommon texts: 2
Query 6:
Rank BM25 Retriever Texts:
Among the well-heeled at the world polo championships in Chantilly (AFP) AFP - Maybe it's something to do with the fact that the playing area is so vast that you need a good pair of binoculars to see the action if it's not taking place right in front of the stands.
You Say You Wanna Revolution Do you hate the government?  Do you want to smash the corporate slave state?  Are you an     anarchist, punk, eco-freak with a bad haircut and attitude?      Is your idea of a fun hobby sitting in your basement practicing your bomb-making skills?      Do you listen to Rage Against the Machine all the time and have     your walls lined with posters of Che     Guevara?  Do you actually want to do something to bring about the Revolution instead of getting stoned and rambling about the     Zapatistas?  Well here's something easy and powerful you can do     to help bring the walls down:    Vote for Bush.
Snubbing the RIAA, Part II A long time ago (or so it seems), I wrote Snubbing the RIAA, Part I, in an attempt to provide some sources of non-RIAA backed music to the readers of K5.  What follows is Part II, in which I present some more sources of great music, and some ways to tell if your favourite artist belongs to an RIAA-member label.
What's wrong with the CBC Up until recently I have been a staunch supporter of the CBC, it's ideals, the reason for it's existence. However, recent events have made me question whether or not this once fine institution has fallen victim to the ruthless grasp of corperate greed in North America.


Skearn BM25 Retriever Texts:
EMI's download music sales soar EMI sees download music sales rise by nearly 600 in the six months to the end of September and says they are becoming a major part of its business.
EMI sees music market improving EMI, the world #39;s third-largest music group, reported a drop in first-half profits on Friday but said the beleaguered industry was rebounding as online music sales start to take off.
The Great Write-In Vote Protest That Never Was What if, on election day, you wrote in your presidential vote? I'm not trying to persuade you to vote for one or the other, or even for a third party. It's not a trick, My real name isn't George M. Bush, and I'm not going to go to court and claim you were really voting for me. Just write it in. Make a statement that you vote for whom you choose and that it has nothing to do with whichever they feel like printing on the ballot.
UK's EMI Says to Face Music Industry Probe in U.S.  LONDON/NEW YORK (Reuters) - EMI Group PLC, the world's  third-largest music company, on Friday said it and other music  companies faced a New York probe into how music companies  influence what songs are played on the radio.


number of ommon texts: 0
Query 7:
Rank BM25 Retriever Texts:
Microsoft: Payout of Sasser bounty hinges on conviction Sven Jaschan, the alleged author of the Sasser worm and several variants of the Netsky virus, was charged this week by German police, but the informant who led authorities to the suspect will have to wait for a promised \$250,000 reward, Microsoft 
worm has turned for teen virus king Sven Jaschan, 18, from Germany, faces up to five years in prison for writing and spreading the Sasser and Netsky worms, said to have cost businesses around the world millions of pounds.
Firm justifies job for virus writer A German computer security firm has defended its decision to hire the self-confessed teenage author of the Sasser and Netsky worms.
Security firm hires teenage accused of writing Sasser virus Sven Jaschan, an 18-year-old from Waffensen in Lower Saxony, who is also thought to be behind the Netsky virus and is currently awaiting trial for writing the Sasser worm, could be about to start work with German firewall company Securepoint.


Skearn BM25 Retriever Texts:
Microsoft: Payout of Sasser bounty hinges on conviction Sven Jaschan, the alleged author of the Sasser worm and several variants of the Netsky virus, was charged this week by German police, but the informant who led authorities to the suspect will have to wait for a promised \$250,000 reward, Microsoft 
Teenager charged over Sasser worm The German teenager who allegedly wrote the Sasser and Netsky computer worms has been charged. Sven Jaschan, now 18, was arrested in May this year at his parents #39; home in Waffensen, North Germany.
Virus writer gets security job Virus writer Sven Jaschan, who claimed responsibility for the Sasser and Netsky worms, has been given a job at an internet security company.
worm has turned for teen virus king Sven Jaschan, 18, from Germany, faces up to five years in prison for writing and spreading the Sasser and Netsky worms, said to have cost businesses around the world millions of pounds.


number of ommon texts: 2
Query 8:
Rank BM25 Retriever Texts:
Distributed Social Whitelists \\Sam blogs about his wiki spam problems  and implements a posting throttle.\\With new spam-capable zombie PCs and with wikis that aren't updated very often\this isn't a solution.  If I were to go back to my wiki after a month it would\be covered with spam links.\\One strong solution is emergent and distributed social whitelists (AKA my\FOAFKey  proposal).\\With a FOAFKey enabled wiki/weblog you could allow a whitelist of a few hundred\thousand users to post to the wiki without any problems.  All they would need to\enter is their email address for confirmation (or the SHA1 hash of their email).\\For the rare times when the user isn't within the original whitelist we can just\have the user uplo ...\\
Selling good local produce provides food for thought FOOD - where would we be without it? A lot of businesses have made a lot of money from one of lifes few essentials by providing answers to the question of how much we can afford to pay for food, and the 
What are the best cities for business in Asia? One of our new categories in the APMF Sense of Place survey is for best Asian business city. After a couple of days, Singapore leads the pack, followed by Bangkok, Thailand and Hong Kong. Enter your vote and comments and make your views count. More new categories include best city for livability, and best tourism destinations.
Varitek: Captain courageous SEATTLE -- So, Jason Varitek, we all know you are the de facto captain of the Boston Red Sox. You are the quiet leader of this team, the man who turned a season around by shoving your mitt into the face of Alex Rodriguez, and a man who's got a 15-game hitting streak in the wake of last night's 13-2 ...


Skearn BM25 Retriever Texts:
Distributed Social Whitelists \\Sam blogs about his wiki spam problems  and implements a posting throttle.\\With new spam-capable zombie PCs and with wikis that aren't updated very often\this isn't a solution.  If I were to go back to my wiki after a month it would\be covered with spam links.\\One strong solution is emergent and distributed social whitelists (AKA my\FOAFKey  proposal).\\With a FOAFKey enabled wiki/weblog you could allow a whitelist of a few hundred\thousand users to post to the wiki without any problems.  All they would need to\enter is their email address for confirmation (or the SHA1 hash of their email).\\For the rare times when the user isn't within the original whitelist we can just\have the user uplo ...\\
Should Your Next Car Be New or Used? Used cars have a lot to offer -- if you know what you're doing.
Microsoft under your thumb New keyboard and mouse from the company's hardware division include fingerprint readers.
The Scalability of Full Content Feeds \\There has been a lot of talk  recently about the problem of RSS feeds which\include full content on high bandwidth sites such as MSDN blogs.\\When RSS is used for a site with both a great amount of users, and with frequent\updates, the bandwidth required to deliver realtime events can be problematic.\\Its a real problem.  The RSS model requires clients to download the ENTIRE feed\if even ONE item has been modified/added.  This means that if you have 55k\subscribers, with 15 RSS items in your feed, and you want to publish one more\post, all 14 additional posts need to be re-downloaded by the client.\\There are a lot of potential solutions. HTTP deltas are one solution but too\difficult to imp ...\\


number of ommon texts: 1
Query 9:
Rank BM25 Retriever Texts:
E-mail scam plays on US elections People are being warned about a scam e-mail which uses the US presidential poll to part them from their cash.
Police Step Up Hunt for Embassy Bombers Indonesian authorities stepped up their hunt today for the alleged masterminds behind the bomb attack on the Australian embassy, as Australias police chief warned that another suicide squad was at large - and may be planning more attacks.
Police chief's suspension lifted Police chief David Westwood, criticised after the Soham murders, has his suspension lifted, but is to retire early.
New EU industry chief warns against protectionism The European Union #39;s incoming industry chief warned Thursday against protectionist leanings of some European governments, as well as fears about new EU member states abusing their position.


Skearn BM25 Retriever Texts:
FDIC Warns About E-Mail 'Phishing' Scam (Reuters) Reuters - The FDIC on Friday issued an alert\about an increasingly common e-mail scam designed to steal\personal information and money from millions of unwary\consumers.
FDIC Warns Consumers on E-Mail  #39;Phishing #39; Scam The FDIC on Friday issued an alert about an increasingly common e-mail scam designed to steal personal information and money from millions of unwary consumers.
'Phishing' scam targets NatWest NatWest bank suspends some online banking facilities after bogus e-mails ask customers for their account details in a 'phishing' scam.
E-mail scam plays on US elections People are being warned about a scam e-mail which uses the US presidential poll to part them from their cash.


number of ommon texts: 1
Query 10:
Rank BM25 Retriever Texts:
Card fraud prevention 'pays off' Fraud on UK credit cards has fallen - but identity fraud is on the up, a survey from analysts Datamonitor finds.
Card fraud shows sharp increase Credit and debit card fraud rose by nearly a fifth to 478.8m in the year up to July, an industry body says.
Ad campaign touts multimedia cards Multimedia Card Association kicks off ad campaign to promote the small cards based on its specifications.
Nokia Signs Up for SD Memory Card Bowing to the growing popularity of the Secure Digital memory card standard, Nokia has signed a licensing agreement to use SD cards in its cell phones.


Skearn BM25 Retriever Texts:
Card fraud prevention 'pays off' Fraud on UK credit cards has fallen - but identity fraud is on the up, a survey from analysts Datamonitor finds.
Police arrest phishing mob suspect A suspected Russian gangster and phisherman, caught red-handed with \$200,000 worth of stolen goods and \$15,000 cash, has been charged in the US on several counts of identity and credit card fraud.
Secret Service Busts Cyber Gangs The US Secret Service Thursday announced arrests in eight states and six foreign countries of 28 suspected cybercrime gangsters on charges of identity theft, computer fraud, credit-card fraud, and conspiracy.
Secret Service Busts Cyber Gangs The US Secret Service Thursday announced arrests in eight states and six foreign countries of 28 suspected cybercrime gangsters on charges of identity theft, computer fraud, credit card fraud, and conspiracy.


number of ommon texts: 1

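The texts seem to overlap somewhat more than the low Jaccard coefficient would suggest.
These results alone do not show whether rank_bm25 or BM25Vectorizer is better, but it would be good to choose whichever suits the application.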

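Conclusion

Using the scikit-learn-based BM25Vectorizer internally made langchain's BM25Retriever faster.
The search results differed quite a bit, however, so whether the two can be used interchangeably in a given application deserves more careful examination.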
