Practicing RAG

Posted at 2024-07-01

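Now that NotebookLM is out, it may be fine to leave personal-use RAG to it.
Still, the sales people around me seem to be in the middle of a RAG craze,
so I wrote a simple RAG myself, just enough to be conversant with it.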

What is RAG?

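When you pass text to a generative AI model, it internally produces a feature vector that represents that text (an embedding).
If two text inputs are close in meaning, sentence structure, and so on, their embeddings are similar sequences of numbers.
So the closer the difference between two embeddings is to zero (measured in this article as Euclidean distance), the closer the original texts are.
RAG (retrieval-augmented generation) uses this property: given a question or search query, it finds the text file most likely to serve as a reference, or goes on to answer based on that file.
First, generate embeddings for the text documents in advance.
Next, when a question is sent to the generative AI, create an embedding for the question and look for the document whose embedding is closest to it.
Finally, write a prompt that tells the generative AI to answer by referring to that file.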
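
As a rough illustration of the retrieval idea, here is a minimal sketch with made-up three-dimensional vectors (real embeddings from text-embedding-ada-002 have 1536 dimensions). The file names, the values, and the helper names `euclidean_distance` and `nearest_document` are assumptions for illustration only.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_document(query, docs):
    """Return the key of the document whose embedding is closest to the query."""
    return min(docs, key=lambda name: euclidean_distance(query, docs[name]))

# Toy embeddings (made-up values, not produced by any model)
doc_embeddings = {
    "doc_a.txt": [0.9, 0.1, 0.0],
    "doc_b.txt": [0.0, 0.8, 0.6],
}
query_embedding = [0.8, 0.2, 0.1]

best = nearest_document(query_embedding, doc_embeddings)
print(best)
```

The query vector sits much nearer to the first toy vector, so that document is the one that would be handed to the model as the reference.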

Implementation

Getting the embedding

Call the OpenAI API to get the embedding vector.

import http.client
import json

API_KEY = "Put your OpenAI API key here"

def get_embedding(content):
    # Set the request headers
    headers = {
        "Content-Type": "application/json",
        "Authorization": "Bearer " + API_KEY
    }

    # Set the request body
    payload = {
        "input": content,
        "model": "text-embedding-ada-002"
    }
    payload_json = json.dumps(payload)

    # Open the HTTPS connection
    conn = http.client.HTTPSConnection("api.openai.com")

    # Send the POST request
    conn.request("POST", "/v1/embeddings", body=payload_json, headers=headers)

    # Get the response
    response = conn.getresponse()

    # Read the response body and pull out the embedding vector
    data = response.read().decode()
    dictionary = json.loads(data)
    ebd = dictionary['data'][0]['embedding']

    # Close the HTTPS connection
    conn.close()
    return ebd
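
The function above assumes the request always succeeds. When the API rejects a request (for example, because of an invalid key), the JSON body carries an `error` object instead of `data`, and the lookup fails with a `KeyError`. A minimal sketch of a stricter parser follows; `parse_embedding_response` is a hypothetical helper name, not part of the original code.

```python
import json

def parse_embedding_response(status, body):
    """Extract the embedding from a /v1/embeddings response body, or raise on failure.

    status: HTTP status code (response.status in http.client)
    body:   decoded response body (a JSON string)
    """
    parsed = json.loads(body)
    if status != 200 or "error" in parsed:
        # Error responses carry {"error": {"message": ...}} instead of "data"
        message = parsed.get("error", {}).get("message", "unknown error")
        raise RuntimeError(f"Embedding request failed (HTTP {status}): {message}")
    return parsed["data"][0]["embedding"]
```

Inside `get_embedding`, this could stand in for the direct dictionary lookup, for example `ebd = parse_embedding_response(response.status, data)`.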

Building the embedding list

Prepare text files named リファレンス1.txt, リファレンス2.txt, ... (リファレンス means "reference") in advance,
create an embedding for each of them, and store the results in a DataFrame.

import glob
import pickle

import pandas as pd


text_path_list = glob.glob("path/to/リファレンス*.txt")

rows = []

for filepath in text_path_list:
    # Read the reference file and get its embedding
    with open(filepath, 'r') as file:
        content = file.read()
    ebd = get_embedding(content)

    # One row per file: the path, then one column per embedding component (v0, v1, ...)
    row = {'filepath': filepath}
    for j, val in enumerate(ebd):
        row['v'+str(j)] = val
    rows.append(row)

# Build the DataFrame once from all rows
df = pd.DataFrame(rows)

# Save the DataFrame so the embeddings do not have to be recomputed
with open('path/to/ebd.pickle', 'wb') as file:
    pickle.dump(df, file)
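
The article saves the DataFrame but never reads it back. A later session could reload it roughly as sketched below; `load_embeddings` is a hypothetical helper name, and the round trip uses a tiny made-up DataFrame in a temporary directory rather than the real file. Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.

```python
import os
import pickle
import tempfile

import pandas as pd

def load_embeddings(pickle_path):
    """Load an embedding DataFrame that was written with pickle.dump."""
    with open(pickle_path, "rb") as file:
        return pickle.load(file)

# Round-trip demo with made-up values
demo_df = pd.DataFrame([{"filepath": "a.txt", "v0": 0.1, "v1": 0.2}])
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "ebd.pickle")
    with open(demo_path, "wb") as file:
        pickle.dump(demo_df, file)
    loaded_df = load_embeddings(demo_path)

print(loaded_df.equals(demo_df))
```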

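Computing the distance between vectors

import math

# For ebd_df, pass the DataFrame of リファレンス*.txt embeddings created earlier.
# Returns the Euclidean distance from input_vector to each row's embedding.
def calc_ebd_dist(ebd_df, input_vector):
    dists = [0.0] * len(ebd_df)
    for i in range(len(ebd_df)):
        # Column 0 is 'filepath'; the remaining columns are the embedding
        tmp = ebd_df.iloc[i, 1:].values - input_vector
        tmp2 = sum([x**2 for x in tmp])
        dists[i] = math.sqrt(tmp2)
    return dists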
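
With many reference files, the row-by-row loop gets slow. The same distances can probably be computed in one step with NumPy, as in the sketch below. `calc_ebd_dist_vectorized` is a hypothetical name, and it assumes the same layout as above: the first column is `filepath` and every remaining column is an embedding component.

```python
import numpy as np
import pandas as pd

def calc_ebd_dist_vectorized(ebd_df, input_vector):
    """Euclidean distance from input_vector to every row's embedding."""
    # Drop the first column ('filepath') and convert the rest to a float matrix
    matrix = ebd_df.iloc[:, 1:].to_numpy(dtype=float)
    query = np.asarray(input_vector, dtype=float)
    # Broadcasting subtracts the query from each row; norm over axis=1 gives one distance per row
    return np.linalg.norm(matrix - query, axis=1).tolist()

# Demo with made-up two-dimensional vectors
demo_df = pd.DataFrame({
    "filepath": ["a.txt", "b.txt"],
    "v0": [0.0, 3.0],
    "v1": [0.0, 4.0],
})
dists = calc_ebd_dist_vectorized(demo_df, [0.0, 0.0])
print(dists)
```

Because it returns a plain list, the later `dists.index(min(dists))` lookup keeps working unchanged.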

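Building the query part

Create an embedding for the user input
Get the reference file whose embedding is closest to it
Include the reference content in the prompt and call the generative AI API

The implementation follows this flow.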

import openai

INPUT_TEXT = "Set the user's input here"

# Embed the user input and find the closest reference file
input_vector = get_embedding(INPUT_TEXT)
dists = calc_ebd_dist(df, input_vector)
min_dist_index = dists.index(min(dists))
ref_path = df['filepath'][min_dist_index]
with open(ref_path, 'r') as file:
    content = file.read()

def makePrompt(ref_doc, input_text):
    prompt  = "Answer the user's question based on the content of the following document.\n"
    prompt += "# Document \n\n"
    prompt += ref_doc + "\n\n"
    prompt += "# User's question \n\n"
    prompt += input_text + "\n\n"
    prompt += "# Your answer"
    return prompt

prompt = makePrompt(content, INPUT_TEXT)

# Note: openai.ChatCompletion is the interface of openai-python versions before 1.0
openai.api_key = API_KEY
response = openai.ChatCompletion.create(
    model="gpt-4o",
    messages=[
        {"role": "user",
         "content": prompt},
    ],
)

ret_msg = response.choices[0]["message"]["content"].strip()

print(ret_msg)

That completes the implementation.
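
To try the flow without calling any API, the three steps can be wrapped in one function that takes the embedding and chat calls as arguments. This is only a sketch under assumptions: `answer_with_reference`, `build_prompt`, `fake_embed`, and `echo_chat` are hypothetical names, the reference texts are made up, and `build_prompt` mirrors `makePrompt` with English wording. In real use, `get_embedding` and a function wrapping the chat API call would be passed in instead of the stubs.

```python
import math

def build_prompt(ref_doc, input_text):
    """Reference document first, then the user's question (same layout as makePrompt)."""
    return (
        "Answer the user's question based on the content of the following document.\n"
        f"# Document\n\n{ref_doc}\n\n"
        f"# User's question\n\n{input_text}\n\n"
        "# Your answer"
    )

def answer_with_reference(input_text, references, embed_fn, chat_fn):
    """references is a list of (document_text, embedding) pairs."""
    # Step 1: embed the user input
    query = embed_fn(input_text)
    # Step 2: pick the document whose embedding is closest (Euclidean distance)
    best_doc, _ = min(references, key=lambda pair: math.dist(query, pair[1]))
    # Step 3: put the document into the prompt and ask the model
    return chat_fn(build_prompt(best_doc, input_text))

# Stubs for a dry run with no network access
def fake_embed(text):
    return [1.0, 0.0] if "vacation" in text else [0.0, 1.0]

def echo_chat(prompt):
    return prompt  # a real chat_fn would send the prompt to the model

references = [
    ("Vacation requests go through the HR portal.", [1.0, 0.0]),
    ("Expense reports are due on the 5th.", [0.0, 1.0]),
]
result = answer_with_reference("How do I request vacation?", references, fake_embed, echo_chat)
print(result)
```

Passing the two calls in as parameters keeps the retrieval logic testable on its own, which helps when checking that the right reference file is being selected.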

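This approach is useful when answers must draw on documents that are not publicly available on the internet, so it can help with internal process improvements (or so I have heard from various case studies).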
