Building a question-answering system with Haystack using Llama 3 and ElasticsearchDocumentStore (Part 1)

Last updated at 2024-08-14, posted at 2024-05-03

Introduction

I wanted to build a question-answering system over information I keep locally, so I tried out RAG using Haystack's pipeline structure.

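What is RAG (Retrieval-Augmented Generation)?

RAG (Retrieval-Augmented Generation) is a technique for natural language processing (NLP) tasks such as question answering, text generation, and summarization. It combines a retriever with a generative language model: it first retrieves relevant information from documents stored in advance (retrieval) and then uses that information to generate text (generation).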
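The retrieve-then-generate flow described above can be sketched in a few lines of plain Python. This is only a toy illustration: the corpus, the word-overlap scoring, and the function names `retrieve` and `build_prompt` are hypothetical and are not part of Haystack. A real system replaces the scoring with embedding search and passes the prompt to a language model.

```python
# Toy retrieve-then-generate flow. The corpus, the scoring rule, and the
# function names are hypothetical; they only illustrate the two RAG steps.

def retrieve(query, corpus, top_k=1):
    """Rank documents by how many query words they share (case-insensitive)."""
    query_words = set(query.lower().split())
    scored = [(len(query_words & set(doc.lower().split())), doc) for doc in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Keep only documents that share at least one word with the query
    return [doc for score, doc in scored[:top_k] if score > 0]


def build_prompt(query, documents):
    """Place the retrieved text in front of the question."""
    context = "\n".join(documents)
    return f"Context:\n{context}\n\nQuestion: {query}"


corpus = [
    "Elasticsearch is a search engine that stores documents in indices.",
    "Llama 3 is a large language model released by Meta.",
]
question = "Who released Llama 3?"
retrieved = retrieve(question, corpus)
prompt = build_prompt(question, retrieved)
print(prompt)  # this string is what would be handed to the generator
```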

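Goal of this article

I put the source information for the answers into an ElasticsearchDocumentStore, used the Elasticsearch retriever, and set up Llama 3 as the generator, then assembled and ran this RAG pipeline with Haystack.

In real use I would put more local data, such as design documents, into the DocumentStore and answer questions about it. This time, like the Haystack sample code I followed, I load data from Wikipedia
and build a RAG that can answer questions about Be:First, my wife's favorite group.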

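References

The main reference was llama3_rag.ipynb from the Haystack sample notebooks below.
  https://github.com/deepset-ai/haystack-cookbook/tree/main/notebooks
The following book and its sample code were also helpful.
O'REILLY "機械学習エンジニアのためのTransformers" (the Japanese edition of "Natural Language Processing with Transformers"), Chapter 7, Question Answering
  https://github.com/nlp-with-transformers/notebooks.git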

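Changes from llama3_rag.ipynb

・Fixed the Llama 3 loading step, which raised an error as written
・Changed the DocumentStore from InMemoryDocumentStore to ElasticsearchDocumentStore
  (InMemoryDocumentStore cannot keep up as the amount of stored data grows)
・Changed the Wikipedia pages that are loaded and the questions
・Changed the Hugging Face authentication method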

Environment

OS: Ubuntu 22.04.4 LTS
GPU: NVIDIA GeForce RTX 3080 + CUDA 12.3
Python: 3.12.2
(GPU utilization on the RTX 3080 is close to 100%, so this may not run without a GPU.)

Code

Installation

! pip install haystack-ai elasticsearch-haystack transformers "huggingface_hub>=0.22.0" sentence-transformers accelerate bitsandbytes

Hugging Face authentication

import getpass, os

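# Register on Hugging Face and obtain a token; setting it here logs you in automatically.
os.environ["HF_API_TOKEN"] = getpass.getpass("Token obtained from the Hugging Face website: ")

# Logging in through the notebook widget with the code below should also work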
#from huggingface_hub import notebook_login
#notebook_login()

Loading data from Wikipedia (extended to two pages)

! pip install wikipedia
from IPython.display import Image
from pprint import pprint
import rich
import random
import wikipedia
from haystack.dataclasses import Document

# Load information about Be:First and SKY-HI from their Wikipedia pages

title = "Be_First"
page = wikipedia.page(title=title, auto_suggest=False)
raw_docs = [Document(content=page.content, meta={"title": page.title, "url":page.url})]

title = "Mitsuhiro_Hidaka"
page = wikipedia.page(title=title, auto_suggest=False)
raw_docs2 = [Document(content=page.content, meta={"title": page.title, "url":page.url})]

Downloading and starting Elasticsearch

!wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-8.13.2-linux-x86_64.tar.gz
!wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-8.13.2-linux-x86_64.tar.gz.sha512
!shasum -a 512 -c elasticsearch-8.13.2-linux-x86_64.tar.gz.sha512 
!tar -xzf elasticsearch-8.13.2-linux-x86_64.tar.gz

Run the following in the folder where Elasticsearch was extracted:

./bin/elasticsearch -d -p pid

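(Starting it from within the notebook and waiting about 30 seconds should also work, but for some reason the following code led to an error when Elasticsearch was used.)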

!./elasticsearch-8.13.2/bin/elasticsearch -d -p pid
# Wait until Elasticsearch has started
!sleep 30
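A fixed `sleep 30` can be too short or needlessly long, so one alternative is to poll until the server answers. The sketch below is an assumption on my part, not something from the original notebook: `elasticsearch_is_up` and `wait_until` are hypothetical helpers, and the probe assumes the plain `http://localhost:9200` endpoint used in this article is reachable without authentication.

```python
import time
import urllib.error
import urllib.request


def elasticsearch_is_up(url="http://localhost:9200"):
    """Return True once the HTTP endpoint answers, False while it is unreachable.

    Assumes an unauthenticated HTTP endpoint, as used elsewhere in this article.
    """
    try:
        with urllib.request.urlopen(url, timeout=2):
            return True
    except (urllib.error.URLError, OSError):
        return False


def wait_until(probe, timeout=120.0, interval=2.0):
    """Call probe() every `interval` seconds until it returns True or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Usage (needs a running server, so it is left commented out here):
# if not wait_until(elasticsearch_is_up):
#     raise RuntimeError("Elasticsearch did not start in time")
```

Passing the probe in as a function keeps the waiting logic independent of Elasticsearch, so it can be tried out with a fake probe first.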

Haystack pipeline setup

Imports

from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack import Document
from haystack.components.embedders import SentenceTransformersTextEmbedder, SentenceTransformersDocumentEmbedder
from haystack.components.converters import TextFileToDocument
from haystack.components.writers import DocumentWriter
from haystack.components.preprocessors import DocumentSplitter
from haystack.utils import ComponentDevice
from haystack_integrations.document_stores.elasticsearch import ElasticsearchDocumentStore
from elasticsearch import Elasticsearch

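DocumentStore setup

Changed from InMemoryDocumentStore to ElasticsearchDocumentStore.
Because ElasticsearchDocumentStore keeps data from earlier runs,
re-running the notebook tries to register documents that already exist in the same index and fails, so the index is cleared first.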

#document_store = InMemoryDocumentStore()
# Changed from InMemoryDocumentStore to ElasticsearchDocumentStore
document_store = ElasticsearchDocumentStore(
    hosts=["http://localhost:9200"],
    index="test_index"  # explicitly specify the index name to use
)

# Elasticsearch client used to manage the index directly
es_client = Elasticsearch(hosts=["http://localhost:9200"])

# Index name
index_name = "test_index"

# Check whether the index exists and delete it
if es_client.indices.exists(index=index_name):
    es_client.indices.delete(index=index_name)

# Recreate the index
es_client.indices.create(index=index_name)

Building the pipeline (first half: indexing)

indexing_pipeline = Pipeline()
indexing_pipeline.add_component("splitter", DocumentSplitter(split_by="word", split_length=200))

indexing_pipeline.add_component(
    "embedder",
    SentenceTransformersDocumentEmbedder(
        model="Snowflake/snowflake-arctic-embed-l",  # good embedding model: https://huggingface.co/Snowflake/snowflake-arctic-embed-l
        device=ComponentDevice.from_str("cuda:0"),    # load the model on GPU
    ))
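# DuplicatePolicy.OVERWRITE replaces documents whose IDs already exist in the index
from haystack.document_stores.types import DuplicatePolicy
indexing_pipeline.add_component("writer", DocumentWriter(document_store=document_store, policy=DuplicatePolicy.OVERWRITE))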

# connect the components
indexing_pipeline.connect("splitter", "embedder")
indexing_pipeline.connect("embedder", "writer")
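To get a feel for what the splitter step does with `split_by="word"` and `split_length=200`, here is a simplified plain-Python sketch. It is not Haystack's implementation: `split_by_word` is a hypothetical helper that splits on any whitespace and assumes no overlap between chunks.

```python
def split_by_word(text, split_length=200):
    """Cut text into consecutive chunks of at most `split_length` whitespace-separated words."""
    words = text.split()
    return [
        " ".join(words[start:start + split_length])
        for start in range(0, len(words), split_length)
    ]


# A made-up 450-word text yields two full chunks and one shorter remainder
sample = " ".join(f"w{i}" for i in range(450))
chunks = split_by_word(sample, split_length=200)
print([len(chunk.split()) for chunk in chunks])  # [200, 200, 50]
```

Each chunk is then embedded and written to the document store as its own document, which is why a long Wikipedia page ends up as many retrievable pieces.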

Running the pipeline (first half)

indexing_pipeline.run({"splitter":{"documents":raw_docs}})
indexing_pipeline.run({"splitter":{"documents":raw_docs2}})

PromptBuilder setup

from haystack.components.builders import PromptBuilder

prompt_template = """
<|begin_of_text|><|start_header_id|>user<|end_header_id|>


Using the information contained in the context, give a comprehensive answer to the question.
If the answer cannot be deduced from the context, do not give an answer.

Context:
  {% for doc in documents %}
  {{ doc.content }} URL:{{ doc.meta['url'] }}
  {% endfor %};
  Question: {{query}}<|eot_id|>

<|start_header_id|>assistant<|end_header_id|>


"""
prompt_builder = PromptBuilder(template=prompt_template)
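To see roughly what string the generator receives, the template above can be imitated with ordinary string handling. The sketch below is a hand-rolled stand-in rather than the Jinja rendering that PromptBuilder performs: `render_prompt` is a hypothetical function, the sample document is made up, and whitespace may differ slightly from the real output.

```python
def render_prompt(documents, query):
    """Imitate the template: list each chunk with its URL, then append the question."""
    context_lines = [f"  {doc['content']} URL:{doc['url']}" for doc in documents]
    parts = [
        # Llama 3 chat markers opening the user turn
        "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n",
        "Using the information contained in the context, give a comprehensive answer to the question.\n",
        "If the answer cannot be deduced from the context, do not give an answer.\n\n",
        "Context:\n",
        "\n".join(context_lines) + "\n",
        f"  Question: {query}<|eot_id|>\n\n",
        # Opening the assistant turn tells the model to start answering
        "<|start_header_id|>assistant<|end_header_id|>\n\n",
    ]
    return "".join(parts)


# Hypothetical retrieved chunk, only for illustration
sample_docs = [{"content": "Example chunk text.", "url": "https://example.org/page"}]
rendered = render_prompt(sample_docs, "What is this page about?")
print(rendered)
```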

Registering Llama-3-8B-Instruct as the generator

import torch
from haystack.components.generators import HuggingFaceLocalGenerator
from pathlib import Path

generator = HuggingFaceLocalGenerator(
    model=Path("meta-llama/Meta-Llama-3-8B-Instruct/"),
    task="text-generation",
    huggingface_pipeline_kwargs={"device_map":"auto",
                                  "model_kwargs":{"load_in_4bit":True,
                                                  "bnb_4bit_use_double_quant":True,
                                                  "bnb_4bit_quant_type":"nf4",
                                                  "bnb_4bit_compute_dtype":torch.bfloat16}},
    generation_kwargs={"max_new_tokens": 500})

generator.warm_up()
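The `load_in_4bit` setting matters because of memory. As a back-of-the-envelope estimate, the weights alone can be sized as below. The parameter count of 8e9 is an approximation taken from the "8B" in the model name, `weight_memory_gib` is a hypothetical helper, and the figures ignore activations, the KV cache, and quantization overhead.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


params = 8e9  # rough parameter count suggested by the "8B" in the model name
bf16 = weight_memory_gib(params, 16)
nf4 = weight_memory_gib(params, 4)
print(f"bf16 weights:  about {bf16:.1f} GiB")  # about 14.9 GiB
print(f"4-bit weights: about {nf4:.1f} GiB")   # about 3.7 GiB
```

Under these assumptions, 4-bit quantization cuts the weight footprint to a quarter of the bf16 size, which is what makes a single consumer GPU plausible for this model.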

Building the pipeline (second half: query)

The retriever is changed to the Elasticsearch one.

#from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever

#from haystack_integrations.components.retrievers.elasticsearch import ElasticsearchEmbeddingRetriever,ElasticsearchBM25Retriever
from haystack_integrations.components.retrievers.elasticsearch import ElasticsearchEmbeddingRetriever

query_pipeline = Pipeline()

query_pipeline.add_component(
    "text_embedder",
    SentenceTransformersTextEmbedder(
        model="Snowflake/snowflake-arctic-embed-l",  # good embedding model: https://huggingface.co/Snowflake/snowflake-arctic-embed-l
        device=ComponentDevice.from_str("cuda:0"),  # load the model on GPU
        prefix="Represent this sentence for searching relevant passages: ",  # as explained in the model card (https://huggingface.co/Snowflake/snowflake-arctic-embed-l#using-huggingface-transformers), queries should be prefixed
    ))
#query_pipeline.add_component("retriever", InMemoryEmbeddingRetriever(document_store=document_store, top_k=5))
query_pipeline.add_component("retriever", ElasticsearchEmbeddingRetriever(document_store=document_store, top_k=5))
query_pipeline.add_component("prompt_builder", PromptBuilder(template=prompt_template))
query_pipeline.add_component("generator", generator)

# connect the components
query_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
query_pipeline.connect("retriever.documents", "prompt_builder.documents")
query_pipeline.connect("prompt_builder", "generator")
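Conceptually, the embedding retriever compares the query embedding with each stored chunk embedding and returns the `top_k` closest chunks. The plain-Python sketch below illustrates that idea, assuming cosine similarity as the measure. The two-dimensional vectors and the helper names are made up, and Elasticsearch performs this search internally in a far more efficient way.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_by_embedding(query_embedding, documents, top_k=5):
    """Return the top_k documents whose embeddings are most similar to the query."""
    ranked = sorted(
        documents,
        key=lambda doc: cosine_similarity(query_embedding, doc["embedding"]),
        reverse=True,
    )
    return ranked[:top_k]


# Made-up 2-dimensional embeddings, only for illustration
documents = [
    {"content": "chunk A", "embedding": [1.0, 0.0]},
    {"content": "chunk B", "embedding": [0.0, 1.0]},
    {"content": "chunk C", "embedding": [0.7, 0.7]},
]
best = top_k_by_embedding([0.9, 0.1], documents, top_k=2)
print([doc["content"] for doc in best])  # ['chunk A', 'chunk C']
```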

Helper function for asking questions

def get_generative_answer(query):

  results = query_pipeline.run({
      "text_embedder": {"text": query},
      "prompt_builder": {"query": query}
    }
  )

  answer = results["generator"]["replies"][0]
  rich.print(answer)

Results

get_generative_answer("What are the names of the members of Be first?")

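Question 1

What are the names of the members of Be first?

Answer 1

According to the provided context, the members of Be:First are as follows.

Sota (島雄, Shimao Sota)
Shunto (久保舜斗, Kubo Shunto)
Manato (廣瀨真人, Hirose Manato)
Ryuhei (黒田龍平, Kuroda Ryuhei)
Junon (池亀樹音, Ikegame Junon)
Ryoki (三山凌輝, Miyama Ryouki)
Leo (上村礼王, Kamimura Reo)

Comment

  Correct.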

get_generative_answer("What is the name of Be first's representative song?")

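Question 2

What is the name of Be first's representative song?

Answer 2

According to the provided context, Be:First's representative song is not explicitly named. However,
"Gifted" is mentioned as their first single and a chart-topping hit, and "Bye-Good-Bye", "Scream", "Boom Boom Back", and "Mainstream" are also listed as charting songs.

Comment 2

  I think this is a good answer.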

get_generative_answer("What group did SKY-HI originally belong to?")

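Question 3

What group did SKY-HI originally belong to?

Answer 3

According to the context, Sky-Hi (Mitsuhiro Hidaka) is a member of the J-pop group AAA and is also a rapper.
He is a member of the group and writes the rap lyrics for the band.

Comment 3

  Correct.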

get_generative_answer("What is the relationship between SKY-HI and Be first?")

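Question 4

What is the relationship between SKY-HI and Be first?

Answer 4

According to the context, Sky-Hi is the company that founded Be:First and is also the host of the reality
competition show "The First", which led to the formation of the group. In addition, rapper Mitsuhiro Hidaka of
Be:First is also a member of Sky-Hi and the founder of the company.

Comment 4

In reality, Sky-Hi is a stage name of Hidaka himself, and the company that founded Be:First is BMSG. However, BMSG is
not mentioned at all on Wikipedia, which I think is why the model got confused. It is interesting that the model knew in Question 3 that Sky-Hi is Mitsuhiro Hidaka, yet gets it wrong here.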

get_generative_answer("What is the name of the Be first fandom?")

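Question 5

What is the name of the Be first fandom?

Answer 5

The name of the Be:First fandom is "Besty" (written in all capital letters).

Comment 5

Correct.
Note: the model's answers were in English; the versions shown here are translations, not the verbatim output.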

Overall impressions

There are a few mistakes, but overall the answers were very good, so I think this setup will remain usable as I add other data.
