Comparing Kendra Search and Vector Search for RAG - Introduction to Vectors with Amazon Titan Embeddings, Part 4

Posted at 2023-11-07, last updated at 2023-11-07

This article compares Kendra search and vector search against the same data source.

Documents used

The following document has already been indexed in both Kendra and a Vector DB (ChromaDB).
For the Vector DB, the document was split page by page and each page was vectorized with Titan Embeddings.

For how the data was stored in the Vector DB, see the following.

For an example that uses Aurora PostgreSQL pgvector instead of ChromaDB, see the following.

Programs used

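The prompt and other conditions are the same in both programs. Because the amount of text each search can return at a time differs, only the number of retrieved results is different (Kendra search: 20, vector search: 15).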

Kendra search program

from langchain.llms import Bedrock
from langchain.chains import RetrievalQA
from langchain.retrievers import AmazonKendraRetriever
from langchain.prompts import PromptTemplate
import streamlit as st

# Define the retriever (Kendra)
# Search documents registered as Japanese and return 20 results (top_k=20)
kendra_index_id="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"  # Replace with your own index ID

attribute_filter = {"EqualsTo": {"Key": "_language_code","Value": {"StringValue": "ja"}}}
retriever = AmazonKendraRetriever(index_id=kendra_index_id,attribute_filter=attribute_filter,top_k=20)

# Define the LLM
llm = Bedrock(
    model_id="anthropic.claude-v2",
    model_kwargs={"max_tokens_to_sample": 1000}
)

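# Define the prompt (kept in Japanese, as used in the experiment). It asks the model to
# explain the answer in detail using the documents as reference, to answer in Japanese
# unless another language is requested, to reply 「文書にありません」 ("not in the documents")
# when the documents do not cover the question, and to leave the question and tags out of the answer.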
prompt_template = """
  <documents>{context}</documents>
  \n\nHuman: 上記の内容を参考文書として、質問の内容に対して詳しく説明してください。言語の指定が無い場合は日本語で答えてください。
    もし質問の内容が参考文書に無かった場合は「文書にありません」と答えてください。回答内容には質問自体やタグは含めないでください。
    Take a deep breath and work on this problem step-by-step.
  <question>{question}</question>
  \n\nAssistant:"""
PROMPT = PromptTemplate(template=prompt_template, input_variables=["context", "question"])
chain_type_kwargs = {"prompt": PROMPT}

# Define the chain
qa = RetrievalQA.from_chain_type(retriever=retriever,llm=llm,chain_type_kwargs=chain_type_kwargs)

st.title("KendRAG")
input_text = st.text_input("Searches Kendra with the entered text and answers")
send_button = st.button("Send")

if send_button:
    # Run the chain and display the answer
    st.write(qa.run(input_text))

Vector search program

from langchain.embeddings import BedrockEmbeddings
import chromadb
from langchain.vectorstores import Chroma
from langchain.llms import Bedrock
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
import streamlit as st

# Define the embeddings
embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v1")

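# Define ChromaDB
# Raw string so the backslashes in the Windows path are not treated as escape sequences
persist_path = r'c:\Python\Embedding\chroma2'  # Storage location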
client = chromadb.PersistentClient(path=persist_path)
db = Chroma(
    collection_name="vector_store",
    embedding_function=embeddings,
    client=client
)

# Define the LLM
llm = Bedrock(
    model_id="anthropic.claude-v2",
    model_kwargs={"max_tokens_to_sample": 1000},
)

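# Define the prompt (kept in Japanese, as used in the experiment). It asks the model to
# explain the answer in detail using the documents as reference, to answer in Japanese
# unless another language is requested, to reply 「文書にありません」 ("not in the documents")
# when the documents do not cover the question, and to leave the question and tags out of the answer.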
prompt_template = """
  <documents>{context}</documents>
  \n\nHuman: 上記の内容を参考文書として、質問の内容に対して詳しく説明してください。言語の指定が無い場合は日本語で答えてください。
    もし質問の内容が参考文書に無かった場合は「文書にありません」と答えてください。回答内容には質問自体やタグは含めないでください。
    Take a deep breath and work on this problem step-by-step.
  <question>{question}</question>
  \n\nAssistant:"""
PROMPT = PromptTemplate(template=prompt_template, input_variables=["context", "question"])
chain_type_kwargs = {"prompt": PROMPT}

# Define the chain, using the top 15 search results
qa = RetrievalQA.from_chain_type(llm,retriever=db.as_retriever(search_kwargs={"k": 15}),chain_type_kwargs=chain_type_kwargs)

# Streamlit
st.title("VectoRAG")
input_text = st.text_input("Searches the Vector DB with the entered text and answers")
send_button = st.button("Send")

if send_button:
    # Run the chain and display the answer
    st.write(qa.run(input_text))

First, checking how Kendra search behaves

I start by asking about something that the indexed PDF covers in meaning.

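The answer seemed to be missing some information, so I checked CloudWatch Logs to see what was actually passed in the prompt (it is long, so it appears at the end of this article).
No page covering all of the "foundation models" (基盤モデル) was returned by the search, so the answer appears to have been generated from pages that use the expression "model" (モデル). In fact, the source PDF never uses the expression "foundation model", so this may not have been a good way to phrase the question.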
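As an alternative to reading CloudWatch Logs, the retrieved documents can also be inspected locally before they reach the LLM. The following is a minimal sketch, assuming the same LangChain version as the programs above, where retrievers expose `get_relevant_documents` and each returned document has `page_content` and `metadata`. The helper name `summarize_retrieved`, the `source` metadata key, and the stub classes are assumptions made for illustration; the stubs only exist so that the sketch runs without AWS access.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


def summarize_retrieved(retriever: Any, query: str) -> List[Tuple[int, int, str]]:
    """Return (rank, character count, source) for each retrieved document."""
    docs = retriever.get_relevant_documents(query)
    summary = []
    for rank, doc in enumerate(docs, start=1):
        # "source" is an assumed metadata key; fall back when it is missing.
        source = doc.metadata.get("source", "(unknown)")
        summary.append((rank, len(doc.page_content), source))
    return summary


@dataclass
class StubDocument:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class StubRetriever:
    """Hypothetical stand-in for a retriever; returns fixed documents."""

    def __init__(self, docs: List[StubDocument]) -> None:
        self._docs = docs

    def get_relevant_documents(self, query: str) -> List[StubDocument]:
        return list(self._docs)
```

With the real programs, passing `retriever` (Kendra) or `db.as_retriever(search_kwargs={"k": 15})` (ChromaDB) in place of the stub would give a per-result character count comparable to the table at the end of this article.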

What the results suggest

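Most search results are excerpts from within a single page, so a result is often smaller than a page-level chunk
A search result can also span pages, which may be an advantage over page-level chunking in some cases
Nothing definite can be said about retrieval accuracy, because the question used an expression that does not appear in the source PDF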
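To make the difference between page-level chunks and chunks that can span pages concrete, here is a small sketch in plain Python. It is not how Kendra builds its excerpts (that is not examined here); `chunk_by_page` and `chunk_by_window` are hypothetical helpers that only illustrate why a page-level chunk can never contain text from two pages, while a sliding window can.

```python
from typing import List


def chunk_by_page(pages: List[str]) -> List[str]:
    """One chunk per non-empty page; a chunk never crosses a page boundary."""
    return [page for page in pages if page.strip()]


def chunk_by_window(pages: List[str], size: int, overlap: int) -> List[str]:
    """Fixed-size character windows over the joined text; a chunk may span pages."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and 0 <= overlap < size")
    text = "".join(pages)
    step = size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


pages = ["AAAA", "BBBB"]           # two toy "pages"
print(chunk_by_page(pages))        # ['AAAA', 'BBBB']
print(chunk_by_window(pages, 4, 2))  # ['AAAA', 'AABB', 'BBBB']
```

The middle window `'AABB'` contains the end of one page and the start of the next, which is the kind of context a page-level split cannot return as a single chunk.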

Comparing Kendra search and vector search

Next, I run Kendra search and vector search with questions that use expressions found in the PDF.

Question 1

Question 2

Question 3

Question 4

Here the vector search answer includes content that is not in the PDF.

Question 5

Vector search did not do well here either. These queries use words that appear in the PDF, so Kendra may be the stronger option for word-based search.

Question 6

This time I used the word "イメージ" (image), which does not appear in the PDF, and vector search gave a somewhat better result.

Question 7

Here the vector search answer is rather careless.

Question 8

Vector search again gave a careless answer.

Summary

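Before the comparison I expected vector search, with its larger chunks (in this example), to be more accurate. Contrary to that expectation there was no large difference, and if anything Kendra search seemed to give good results more often.
That said, most of the questions here used words that appear in the source PDF, so the outcome might differ for use cases where the words differ but the meaning is similar.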
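The distinction between matching on words and matching on meaning can be sketched with two toy scoring functions. This is only an illustration under stated assumptions: `keyword_score` is a hypothetical, deliberately naive substring matcher (it is not Kendra's ranking), the sample sentence is made up, and the two-dimensional vectors are invented stand-ins for real embeddings. Cosine similarity is shown as one common way to compare embedding vectors.

```python
import math
from typing import List


def keyword_score(query_terms: List[str], document: str) -> int:
    """Count how many query terms appear verbatim in the document."""
    return sum(1 for term in query_terms if term in document)


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Made-up sentence: it contains "画像" but not the synonym "イメージ".
document = "画像を生成するモデル"
print(keyword_score(["画像"], document))      # 1
print(keyword_score(["イメージ"], document))  # 0

# Invented toy vectors: identical directions score 1.0, orthogonal ones 0.0.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

A word-based matcher gives no credit to a synonym that never appears in the text, whereas an embedding model can place the synonym's vector close to the original word's vector. That is consistent with Question 6, where the query used a word absent from the PDF.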

Example Kendra search results

The log showed "inputTokenCount": 11155.

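I tried to paste the full Excerpt contents, but Qiita rejected the post, so the excerpts are omitted and only the character count and source location are listed, for 10 results.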

No	Excerpt	Characters	Source
1	Document Excerpt:***	430	Part of p. 85
2	Document Excerpt:***	564	Part of p. 85
3	Document Excerpt:***	784	Middle of p. 142 to middle of p. 144
4	Document Excerpt:***	614	Part of p. 61
5	Document Excerpt:***	1170	Middle of p. 25 to middle of p. 26
6	Document Excerpt:***	703	Part of p. 56
7	Document Excerpt:***	964	Middle of p. 142 to middle of p. 144
8	Document Excerpt:***	682	Part of p. 56
9	Document Excerpt:***	1156	Middle of p. 25 to middle of p. 26
10	Document Excerpt:***	730	Part of p. 25
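The character counts in the table can be summarized with a few lines of Python. This is plain arithmetic on the 10 listed values only; the remaining results of the 20 retrieved are not shown in this article, so the figures below describe the listed excerpts and nothing more.

```python
# Character counts of the 10 excerpts listed in the table above.
excerpt_chars = [430, 564, 784, 614, 1170, 703, 964, 682, 1156, 730]

total = sum(excerpt_chars)
average = total / len(excerpt_chars)

print(f"total:   {total}")              # total:   7797
print(f"average: {average:.1f}")        # average: 779.7
print(f"min:     {min(excerpt_chars)}")  # min:     430
print(f"max:     {max(excerpt_chars)}")  # max:     1170
```

The listed excerpts range from 430 to 1170 characters and average roughly 780, which matches the earlier observation that most Kendra results are excerpts from part of a page.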
