Can AI derive a CPE from an application name? #2 RAG

Last updated at 2025-02-20, posted at 2025-02-15

Introduction

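Last time I fine-tuned a model with LoRA to derive CPEs.
However, the accuracy was mediocre, and the model cannot handle CPEs added after training.
This time I will use RAG so that the model consults external information when deriving a CPE.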

What I built

Goals

Have the model consult external information through RAG
Hopefully improve accuracy
Example input:
Visual Studio Code 0.2.9
Example output:
cpe:2.3:a:microsoft:visual_studio_code:0.2.9:*:*:*:*:*:*:*
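
To make the target format concrete, here is a minimal sketch of how the four fields map onto the output string. The function name `to_cpe23` is hypothetical and not part of the original code. It assumes the values are already valid CPE components and performs no escaping of special characters.

```python
def to_cpe23(part: str, vendor: str, product: str, version: str) -> str:
    """Build a CPE 2.3 formatted string; unspecified attributes become '*'."""
    # After the "cpe:2.3" prefix there are 11 attributes: part, vendor,
    # product, version, update, edition, language, sw_edition, target_sw,
    # target_hw, other. Only the first four are filled in here.
    fields = [part, vendor, product, version] + ["*"] * 7
    return "cpe:2.3:" + ":".join(fields)


print(to_cpe23("a", "microsoft", "visual_studio_code", "0.2.9"))
```

So the task is essentially to recover `part`, `vendor`, `product`, and `version` from a free-form application name.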

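Creating the VectorStore

For the embedding model I will try Multilingual-E5-base, which supports multiple languages.
For the VectorStore I will try Chroma, which looks easy to work with.
Let's start by loading the documents with langchain_community.document_loaders.CSVLoader.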

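# Import paths are assumed; the original snippet omits them.
from langchain_chroma import Chroma
from langchain_community.document_loaders import CSVLoader
from langchain_huggingface import HuggingFaceEmbeddings

csv_path = "datas/categorized_cpes.csv"  # path taken from the metadata shown below
# Load each CSV row as one document
documents = CSVLoader(csv_path, encoding="utf-8").load()
embedding = HuggingFaceEmbeddings(
    model_name="intfloat/multilingual-e5-base",
)
# Embed every document and persist the store under ./vectorstore
Chroma.from_documents(
    documents=documents,
    embedding=embedding,
    persist_directory="vectorstore",
)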

$ python create_vectorstore.py
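
To show what ends up being embedded, here is a rough standard-library sketch of how a CSV row turns into document text. It mimics the `column: value` layout that appears in the `page_content` values printed in the next section. The sample rows and the helper name `rows_to_page_contents` are hypothetical, and the column names are inferred from that output rather than from the actual CSV file.

```python
import csv
import io

# Hypothetical two-row sample; the real file has about 1.63 million rows.
SAMPLE_CSV = """title,part,vendor,product,version
Microsoft Visual Studio Code 0.2.9,a,microsoft,visual_studio_code,0.2.9
Microsoft Visual Studio Code 0.20.0,a,microsoft,visual_studio_code,0.20.0
"""


def rows_to_page_contents(csv_text: str) -> list[str]:
    """Render each CSV row as 'column: value' lines, one text per row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        "\n".join(f"{key}: {value}" for key, value in row.items())
        for row in reader
    ]


for text in rows_to_page_contents(SAMPLE_CSV):
    print(text, end="\n\n")
```

Every column, including `part` and `version`, becomes part of the text that is embedded, which is one of the things that could be tuned later.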

Checking the VectorStore

Loading 1.63 million documents took about an hour.
Next I load the persisted VectorStore and retrieve the top 3 documents with a similarity score of at least 0.85.

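# Import paths are assumed; the original snippet omits them.
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings

query = "Visual Studio Code 0.2.9"  # the example input from the goals above
embeddings = HuggingFaceEmbeddings(model_name="intfloat/multilingual-e5-base")
# Reopen the persisted store with the same embedding model used to build it
vectorstore = Chroma(persist_directory="vectorstore", embedding_function=embeddings)
# Return at most 3 documents whose relevance score is at least 0.85
retriever = vectorstore.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={
        "k": 3,
        "score_threshold": 0.85,
    },
)
retrieved_docs = retriever.invoke(query)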

The results look quite good.

retrieved_docs: [
  Document(
    id='94dd712f-9c16-4779-9089-538fddc99966',
    metadata={'row': 1303638, 'source': 'datas/categorized_cpes.csv'},
    page_content='title: Microsoft Visual Studio Code 0.2.9\npart: a\nvendor: microsoft\nproduct: visual_studio_code\nversion: 0.2.9'
  ),
  Document(
    id='33300921-f7f9-41e0-abf8-080befa9b1b5',
    metadata={'row': 850060, 'source': 'datas/categorized_cpes.csv'},
    page_content='title: Microsoft Visual Studio Code 0.2.9 for Python\npart: a\nvendor: microsoft\nproduct: visual_studio_code\nversion: 0.2.9'
  ),
  Document(
    id='9975855f-14d6-4f73-89fe-5128d2befca5',
    metadata={'row': 689877, 'source': 'datas/categorized_cpes.csv'},
    page_content='title: Microsoft Visual Studio Code 0.20.0\npart: a\nvendor: microsoft\nproduct: visual_studio_code\nversion: 0.20.0'
  )
]
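
Conceptually, the `k` and `score_threshold` settings combine a score filter with a top-k cut. The sketch below illustrates that idea with plain cosine similarity on toy 2-D vectors. It is only an illustration under that assumption: the relevance score LangChain actually computes depends on the vector store's distance metric, and the function names here are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_above_threshold(query_vec, docs, k=3, score_threshold=0.85):
    """docs is a list of (text, vector) pairs.

    Returns up to k (text, score) pairs with score >= score_threshold,
    ordered from most to least similar.
    """
    scored = [(text, cosine(query_vec, vec)) for text, vec in docs]
    kept = [pair for pair in scored if pair[1] >= score_threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vectors standing in for real embeddings.
docs = [
    ("Visual Studio Code 0.2.9", [1.0, 0.1]),
    ("Visual Studio Code 0.20.0", [0.9, 0.3]),
    ("Unrelated product", [0.0, 1.0]),
]
hits = top_k_above_threshold([1.0, 0.0], docs)
print(hits)
```

With these toy vectors the unrelated entry falls below the threshold and is dropped, so fewer than `k` results can come back.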

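Inserting the retrieved documents into the prompt template

I insert the information retrieved from the VectorStore into the prompt and instruct the model to refer to it.
LangChain recommends LCEL, but this time I wire it up by hand.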

prompt_template = PromptTemplate(
    template=(
        "Generate a JSON from the given text.\n"
        "{format_instructions}\n\n"
+       "Please refer to the information below.\n\n"
+       "### Following information:\n"
+       "{context}\n\n"
    ),
-   input_variables=[],
+   input_variables=["context"],
    partial_variables={"format_instructions": format_instructions}
)
chat_template = tokenizer.apply_chat_template(
    [
-       {"role": "system", "content": prompt_template.format()},
+       {"role": "system", "content": prompt_template.format(context=retrieved_docs)},
        {"role": "user", "content": query}
    ],
    tokenize=False,
    add_generation_prompt=True
)
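
In the diff above, `retrieved_docs` is passed to `format` as-is, so the prompt receives the string form of the whole list, including ids and metadata. One possible alternative, which I have not verified to change the results, is to pass only the `page_content` text. The helper name `format_context` below is hypothetical.

```python
def format_context(page_contents: list[str]) -> str:
    """Join retrieved texts into numbered blocks for the {context} slot."""
    blocks = [f"[{i}]\n{text}" for i, text in enumerate(page_contents, start=1)]
    return "\n\n".join(blocks)


# With LangChain Document objects, the call would look like:
#   format_context([doc.page_content for doc in retrieved_docs])
sample = [
    "title: Microsoft Visual Studio Code 0.2.9\npart: a\nvendor: microsoft\n"
    "product: visual_studio_code\nversion: 0.2.9",
    "title: Microsoft Visual Studio Code 0.20.0\npart: a\nvendor: microsoft\n"
    "product: visual_studio_code\nversion: 0.20.0",
]
print(format_context(sample))
```

This keeps the prompt shorter and removes fields such as `row` and `source` that the model does not need.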

Generating a CPE with RAG

$ generate.py 
Input: Visual Studio Code 0.2.9
Output:
#0: cpe:2.3:a:microsoft:visual_studio_code:0.2.9:*:*:*:*:*:*:*
#1: cpe:2.3:a:microsoft:visual_studio_code:0.2.9:*:*:*:*:*:*:*
#2: ```json
"a"    "microsoft"  "visual_studio_code"  "0.2.9"
```
Got invalid return object. Expected key `part` to be present, but got a
For troubleshooting, visit: https://python.langchain.com/docs/troubleshooting/errors/OUTPUT_PARSING_FAILURE 
#3: ```json
"a"    "microsoft"    "visual_studio_code"    "0.2.9"
```
Got invalid return object. Expected key `part` to be present, but got a
For troubleshooting, visit: https://python.langchain.com/docs/troubleshooting/errors/OUTPUT_PARSING_FAILURE 
#4: ```json
"a"    "microsoft"    "visual_studio_code"    "0.2.9"
```
Got invalid return object. Expected key `part` to be present, but got a
For troubleshooting, visit: https://python.langchain.com/docs/troubleshooting/errors/OUTPUT_PARSING_FAILURE 
#5: cpe:2.3:a:microsoft:visual_studio_code:0.2.9:*:*:*:*:*:*:*
#6: ```json
"a"      "microsoft" "visual_studio_code" "0.2.9"
```
Got invalid return object. Expected key `part` to be present, but got a
For troubleshooting, visit: https://python.langchain.com/docs/troubleshooting/errors/OUTPUT_PARSING_FAILURE 
#7: cpe:2.3:a:microsoft:visual_studio_code:0.2.9:*:*:*:*:*:*:*
#8: cpe:2.3:a:microsoft:visual_studio_code:0.2.9:*:*:*:*:*:*:*
#9: cpe:2.3:a:microsoft:visual_studio_code:0.2.9:*:*:*:*:*:*:*

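Parsing failures became more frequent (4 of the 10 runs), but the other 6 runs all produced the same, correct CPE.
Vendor names such as ms and ms-vim, which appeared when using LoRA alone, no longer show up.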
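
In every failed run the model emitted the four values in the right order but without keys. As a possible workaround, not something the original code does, a fallback parser could map those values by position. The sketch below assumes the expected keys are `part`, `vendor`, `product`, and `version`: `part` comes from the error message, and the rest are inferred from the document columns. The function name is hypothetical, and this does not explain why the model drops the keys.

```python
import json
import re

KEYS = ("part", "vendor", "product", "version")
FENCE = "`" * 3  # built indirectly so the literal fence does not appear here


def parse_cpe_fields(raw: str) -> dict:
    """Parse model output into CPE fields, tolerating the keyless form."""
    body = raw.strip()
    # Drop a surrounding code fence (optionally tagged "json") if present.
    if body.startswith(FENCE):
        body = body[len(FENCE):].removeprefix("json")
    if body.endswith(FENCE):
        body = body[: -len(FENCE)]
    body = body.strip()
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        # Fallback: take the quoted strings in order and map them by position.
        values = re.findall(r'"([^"]*)"', body)
        if len(values) != len(KEYS):
            raise ValueError(f"could not parse CPE fields from: {raw!r}") from None
        return dict(zip(KEYS, values))
    if not isinstance(data, dict) or any(key not in data for key in KEYS):
        raise ValueError(f"missing expected keys in: {raw!r}")
    return {key: data[key] for key in KEYS}


failed_output = FENCE + 'json\n"a"    "microsoft"  "visual_studio_code"  "0.2.9"\n' + FENCE
print(parse_cpe_fields(failed_output))
```

Positional mapping is fragile, so it only makes sense as a last resort after normal JSON parsing has failed.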

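Summary

Combining LoRA with RAG improved accuracy (or at least it appears so).
There seems to be plenty of room for improvement, for example in the VectorStore documents, whose construction I left entirely to CSVLoader.
It is also unclear why parsing failures became more frequent.
There is still plenty left to explore here.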
