Basics of langchain-postgres and psycopg3

Posted at 2024-09-29

Trying out langchain-postgres and psycopg3

The Python environment is:

(myenv) ~/myenv $pip -V

pip 24.2 from /Users/tn/myenv/lib/python3.12/site-packages/pip (python 3.12)

(myenv) ~/myenv $python -V

Python 3.12.6

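The code below was run in VS Code (see *1 for installing and registering ipykernel). The vector extension must also be installed on the PostgreSQL server (see *2 for how to install the vector extension on PostgreSQL 16).

First, create a database on the already-running PostgreSQL server: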

!createdb sampledb1

Install psycopg3:

!pip install psycopg  # Note: the package name is psycopg, not psycopg3

Initialize the vector store (most of the code and data are borrowed from *3; the embedding model is an arbitrary choice):

from langchain_postgres import PGVector
from langchain_core.documents import Document
from langchain_ollama import OllamaEmbeddings

embedding = OllamaEmbeddings(
    model="bge-m3"
)

# No password is set for this user
connection = "postgresql+psycopg://tn@localhost:5432/sampledb1"
collection_name = "my_docs"

vectorstore = PGVector(
    embeddings=embedding,
    collection_name=collection_name,
    connection=connection,
    use_jsonb=True,
)
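
The connection string above follows the SQLAlchemy URL form postgresql+psycopg://user@host:port/dbname. As a minimal sketch, assuming you later add a password to the database user, the helper below builds such a URL with only the standard library and percent-encodes the credentials so that characters such as @ or / do not break the URL. The name make_pg_url is hypothetical and is not part of langchain-postgres.

```python
from urllib.parse import quote_plus


def make_pg_url(user, host, port, dbname, password=None):
    """Build a postgresql+psycopg URL (hypothetical helper for illustration)."""
    # Percent-encode credentials so characters such as '@' or '/' stay unambiguous.
    auth = quote_plus(user)
    if password:
        auth += ":" + quote_plus(password)
    return f"postgresql+psycopg://{auth}@{host}:{port}/{dbname}"


# Same URL as the one used above (no password).
connection = make_pg_url("tn", "localhost", 5432, "sampledb1")
print(connection)
```

The resulting string can be passed as the connection argument of PGVector in the same way as the hand-written one.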

Prepare the data (slightly extended and modified from the data in *3):

docs = [
    Document(page_content='there are cats in the pond', metadata={"id": 1, "location": "pond", "topic": "animals"}),
    Document(page_content='ducks are also found in the pond', metadata={"id": 2, "location": "pond", "topic": "animals"}),
    Document(page_content='fresh apples are available at the market', metadata={"id": 3, "location": "market", "topic": "food"}),
    Document(page_content='the market also sells fresh oranges', metadata={"id": 4, "location": "market", "topic": "food"}),
    Document(page_content='the new art exhibit is fascinating', metadata={"id": 5, "location": "museum", "topic": "art"}),
    Document(page_content='a sculpture exhibit is also at the museum', metadata={"id": 6, "location": "museum", "topic": "art"}),
    Document(page_content='a new coffee shop opened on Main Street', metadata={"id": 7, "location": "Main Street", "topic": "food"}),
    Document(page_content='the book club meets at the library', metadata={"id": 8, "location": "library", "topic": "reading"}),
    Document(page_content='the library hosts a weekly story time for kids', metadata={"id": 9, "location": "library", "topic": "reading"}),
    Document(page_content='there are tigers in the yard', metadata={"id": 10, "location": "zoo", "topic": "animals"}),
    Document(page_content='there are dogs in the backyard', metadata={"id": 11, "location": "my home", "topic": "animals"})
]

Write the documents to the database:

# The id column is character varying, so pass the ids as strings
vectorstore.add_documents(docs, ids=[str(doc.metadata['id']) for doc in docs])

As a bonus, here is an example of similarity_search_with_score:

results = vectorstore.similarity_search_with_score(query="lion", k=5)
for doc, score in results:
    # Print the score with six decimal places
    print(f"* [SIM={score:.6f}] {doc.page_content} [{doc.metadata}]")

* [SIM=0.457508] there are tigers in the yard [{'id': 10, 'topic': 'animals', 'location': 'zoo'}]
* [SIM=0.494071] there are dogs in the backyard [{'id': 11, 'topic': 'animals', 'location': 'my home'}]
* [SIM=0.540048] ducks are also found in the pond [{'id': 2, 'topic': 'animals', 'location': 'pond'}]
* [SIM=0.541976] there are cats in the pond [{'id': 1, 'topic': 'animals', 'location': 'pond'}]
* [SIM=0.557055] the book club meets at the library [{'id': 8, 'topic': 'reading', 'location': 'library'}]

The results are listed in ascending order of score, so the value labeled SIM appears to be a distance (smaller means closer) rather than a similarity. Whether this order really reflects closeness to "lion" is debatable.
Apply a filter:

vectorstore.similarity_search('lion', k=5, filter={
    'topic': { "$eq": 'animals'}
})

[Document(id='10', metadata={'id': 10, 'topic': 'animals', 'location': 'zoo'}, page_content='there are tigers in the yard'),
Document(id='11', metadata={'id': 11, 'topic': 'animals', 'location': 'my home'}, page_content='there are dogs in the backyard'),
Document(id='2', metadata={'id': 2, 'topic': 'animals', 'location': 'pond'}, page_content='ducks are also found in the pond'),
Document(id='1', metadata={'id': 1, 'topic': 'animals', 'location': 'pond'}, page_content='there are cats in the pond')]
As expected, the four documents with topic "animals" are returned.
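
To make the meaning of the filter concrete, here is a pure-Python sketch of what an "$eq" condition selects. It is only an illustration: the library evaluates the filter in the database, not like this. The helper name matches_eq is hypothetical, and the sketch handles only the "$eq" operator.

```python
# Metadata copied from a few of the documents above.
metadatas = [
    {"id": 1, "location": "pond", "topic": "animals"},
    {"id": 3, "location": "market", "topic": "food"},
    {"id": 10, "location": "zoo", "topic": "animals"},
]


def matches_eq(metadata, flt):
    """Return True if every {"field": {"$eq": value}} condition holds (only $eq is handled)."""
    return all(metadata.get(field) == cond["$eq"] for field, cond in flt.items())


flt = {"topic": {"$eq": "animals"}}
# Keep only the ids whose metadata satisfies the filter.
kept = [m["id"] for m in metadatas if matches_eq(m, flt)]
print(kept)
```

Only the entries whose topic equals "animals" survive, which is the same selection the filtered search above applies before ranking.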

Back to the main topic: confirm that the tables were created in the database:

(myenv) ~ $psql -h localhost -p 5432 -U tn -d sampledb1

psql (16.4)
Type "help" for help.

List the tables generated by langchain:

sampledb1=# \dt

Or, from Jupyter:

!psql -d sampledb1 -c "\dt"

                  List of relations
 Schema |          Name           | Type  | Owner
--------+-------------------------+-------+-------
 public | langchain_pg_collection | table | tn
 public | langchain_pg_embedding  | table | tn
(2 rows)

Show the table structure:

sampledb1=# \d langchain_pg_embedding

Or, from Jupyter:

!psql -d sampledb1 -c "\d langchain_pg_embedding"

      Table "public.langchain_pg_embedding"
    Column     |       Type        | Nullable
---------------+-------------------+----------
 id            | character varying | not null
 collection_id | uuid              |
 embedding     | vector            |
 document      | character varying |
 cmetadata     | jsonb             |

The documents are written to langchain_pg_embedding, so let's read the data back from that table.
To be safe, click the "Restart" button of the Jupyter kernel in VS Code, then run:

!pip install psycopg

The code below is based on *4.

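import psycopg

# Connect with a libpq-style connection string (see Note 1).
# In psycopg3 the with block closes the connection on exit.
with psycopg.connect("dbname=sampledb1 user=tn") as conn:
    with conn.cursor() as cur:
        # Column order: id, collection_id, embedding, document, cmetadata
        cur.execute('select * from langchain_pg_embedding')
        for row in cur:
            # "uuid" is collection_id, "page_content" is the embedding (as text),
            # and "page_content(string)" is the document text.
            formatted_output = f"id: {row[0]}\n" \
                            f"uuid: {row[1]}\n" \
                            f"page_content: {row[2][:100]}...\n" \
                            f"page_content(string): {row[3]}\n" \
                            f"metadata: {row[4]}\n"
            print(formatted_output)
            #print(row)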

Running it prints the following:
id: 1
uuid: 8259217a-01a8-4034-9442-2233ec98c7c7
page_content: [-0.041045193,0.009569716,-0.093480445,0.01990515,0.00062993786,-0.057626557,-0.02735376,-0.01263911...
page_content(string): there are cats in the pond
metadata: {'id': 1, 'topic': 'animals', 'location': 'pond'}

id: 2
uuid: 8259217a-01a8-4034-9442-2233ec98c7c7
page_content: [-0.0365837,0.0019632874,-0.0848764,-0.007010041,-0.028483586,-0.027577631,-1.8776509e-05,-0.0053111...
page_content(string): ducks are also found in the pond
metadata: {'id': 2, 'topic': 'animals', 'location': 'pond'}

id: 3
...
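
In the output above, the embedding column comes back as a string, which is why the code can slice it with [:100]. As a minimal sketch, assuming the text keeps the bracketed, comma-separated form shown above, the standard json module can turn it into a list of floats. The sample string below is a shortened, hand-copied excerpt, not a full vector.

```python
import json

# Shortened sample of the text form shown above; a real vector is much longer.
embedding_text = "[-0.0365837,0.0019632874,-1.8776509e-05]"

# The bracketed, comma-separated form happens to be valid JSON.
embedding = json.loads(embedding_text)
print(len(embedding), embedding[0])
```

An alternative, not used in this article, is the psycopg adapter provided by the pgvector Python package, which registers the vector type on the connection so that values arrive already parsed.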

To dump the data (here, only the langchain_pg_embedding table):

!pg_dump -U tn -h localhost -p 5432 -d sampledb1 -t langchain_pg_embedding -f langchain_pg_embedding_dump.sql
!cat langchain_pg_embedding_dump.sql

References:
*1: https://qiita.com/tnagata/items/a88febd0f8cea88e1be8
*2: https://qiita.com/tnagata/items/7e6ae9956bdcaf167d94
*3: https://github.com/langchain-ai/langchain-postgres/blob/main/examples/vectorstore.ipynb and https://api.python.langchain.com/en/latest/vectorstores/langchain_postgres.vectorstores.PGVector.html
*4: https://www.psycopg.org/psycopg3/docs/basic/usage.html

Note 1:

conn = psycopg.connect(dbname="sampledb1",host="localhost",port=5432,user="tn")

This form also works, but writing database=... as in psycopg2 raises an error here. The keyword must be dbname.
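
If you are porting connection settings written for psycopg2, a small conversion step can avoid this error. The sketch below is one possible approach, not something psycopg provides: the helper name to_psycopg3_kwargs is hypothetical, and it only renames the database key to dbname, leaving everything else untouched.

```python
def to_psycopg3_kwargs(params):
    """Return a copy of psycopg2-style connect kwargs with 'database' renamed to 'dbname'."""
    converted = dict(params)
    if "database" in converted:
        if "dbname" in converted:
            raise ValueError("both 'database' and 'dbname' were given")
        converted["dbname"] = converted.pop("database")
    return converted


# psycopg2-style settings (hypothetical example values matching this article).
old_style = {"database": "sampledb1", "host": "localhost", "port": 5432, "user": "tn"}
new_style = to_psycopg3_kwargs(old_style)
print(new_style)
```

The converted dictionary can then be passed as psycopg.connect(**new_style).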
