Trying out langchain-postgres and psycopg3
The Python environment is:
(myenv) ~/myenv $pip -V
pip 24.2 from /Users/tn/myenv/lib/python3.12/site-packages/pip (python 3.12)
(myenv) ~/myenv $python -V
Python 3.12.6
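The code below was run in VS Code (see *1 for installing and registering ipykernel). The vector extension also needs to be installed on the PostgreSQL server (see *2 for how to install it on PostgreSQL 16).
First, create a database on the already running PostgreSQL server: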
!createdb sampledb1
Install psycopg3:
!pip install psycopg # note: the package name is psycopg, not psycopg3
Initialize the vector store (most of the code and data are borrowed from *3; the embedding model is an arbitrary choice):
from langchain_postgres import PGVector
from langchain_core.documents import Document
from langchain_ollama import OllamaEmbeddings
embedding = OllamaEmbeddings(
    model="bge-m3"
)
# No password is set on this database
connection = "postgresql+psycopg://tn@localhost:5432/sampledb1"
collection_name = "my_docs"
vectorstore = PGVector(
    embeddings=embedding,
    collection_name=collection_name,
    connection=connection,
    use_jsonb=True,
)
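The connection string above has no password because my local server does not require one. If yours does, the password has to be URL-encoded before it goes into the string. Below is a minimal sketch of that; `build_connection_url` is a hypothetical helper of my own (not part of langchain-postgres), and the password in the usage note is made up.

```python
from urllib.parse import quote_plus


def build_connection_url(user, host, port, dbname, password=None):
    """Hypothetical helper: build a SQLAlchemy-style URL for the psycopg (v3) driver."""
    credentials = quote_plus(user)
    if password:
        # URL-encode the password so characters such as '@' or '/' do not break the URL
        credentials += ":" + quote_plus(password)
    return f"postgresql+psycopg://{credentials}@{host}:{port}/{dbname}"


# Same settings as in this article (no password)
print(build_connection_url("tn", "localhost", 5432, "sampledb1"))
# prints: postgresql+psycopg://tn@localhost:5432/sampledb1
```

With a made-up password such as "p@ss/word", the helper returns postgresql+psycopg://tn:p%40ss%2Fword@localhost:5432/sampledb1, which can be passed as connection= in the same way.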
Prepare the data (the data from *3, with a few additions and changes):
docs = [
    Document(page_content='there are cats in the pond', metadata={"id": 1, "location": "pond", "topic": "animals"}),
    Document(page_content='ducks are also found in the pond', metadata={"id": 2, "location": "pond", "topic": "animals"}),
    Document(page_content='fresh apples are available at the market', metadata={"id": 3, "location": "market", "topic": "food"}),
    Document(page_content='the market also sells fresh oranges', metadata={"id": 4, "location": "market", "topic": "food"}),
    Document(page_content='the new art exhibit is fascinating', metadata={"id": 5, "location": "museum", "topic": "art"}),
    Document(page_content='a sculpture exhibit is also at the museum', metadata={"id": 6, "location": "museum", "topic": "art"}),
    Document(page_content='a new coffee shop opened on Main Street', metadata={"id": 7, "location": "Main Street", "topic": "food"}),
    Document(page_content='the book club meets at the library', metadata={"id": 8, "location": "library", "topic": "reading"}),
    Document(page_content='the library hosts a weekly story time for kids', metadata={"id": 9, "location": "library", "topic": "reading"}),
    Document(page_content='there are tigers in the yard', metadata={"id": 10, "location": "zoo", "topic": "animals"}),
    Document(page_content='there are dogs in the backyard', metadata={"id": 11, "location": "my home", "topic": "animals"})
]
Write the documents to the database:
vectorstore.add_documents(docs, ids=[doc.metadata['id'] for doc in docs])
As a bonus, here is an example using similarity_search_with_score:
results = vectorstore.similarity_search_with_score(query="lion", k=5)
for doc, score in results:
    print(f"* [SIM={score:3f}] {doc.page_content} [{doc.metadata}]")
* [SIM=0.457508] there are tigers in the yard [{'id': 10, 'topic': 'animals', 'location': 'zoo'}]
* [SIM=0.494071] there are dogs in the backyard [{'id': 11, 'topic': 'animals', 'location': 'my home'}]
* [SIM=0.540048] ducks are also found in the pond [{'id': 2, 'topic': 'animals', 'location': 'pond'}]
* [SIM=0.541976] there are cats in the pond [{'id': 1, 'topic': 'animals', 'location': 'pond'}]
* [SIM=0.557055] the book club meets at the library [{'id': 8, 'topic': 'reading', 'location': 'library'}]
Whether these are really ordered by closeness to "lion" is debatable. (As far as I can tell, the score here is a distance, so smaller values mean a closer match.)
Apply a filter:
vectorstore.similarity_search('lion', k=5, filter={
    'topic': {"$eq": 'animals'}
})
[Document(id='10', metadata={'id': 10, 'topic': 'animals', 'location': 'zoo'}, page_content='there are tigers in the yard'),
Document(id='11', metadata={'id': 11, 'topic': 'animals', 'location': 'my home'}, page_content='there are dogs in the backyard'),
Document(id='2', metadata={'id': 2, 'topic': 'animals', 'location': 'pond'}, page_content='ducks are also found in the pond'),
Document(id='1', metadata={'id': 1, 'topic': 'animals', 'location': 'pond'}, page_content='there are cats in the pond')]
As expected, the four documents about animals were returned.
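To make explicit what the `$eq` filter selects, here is a rough pure-Python model of it applied to a few of the metadata dicts from above. It is only an illustration: `matches_eq` is a hypothetical helper, and as I understand it the real filtering is done by PGVector on the database side against the metadata column, not in Python like this.

```python
# A few metadata dicts copied from the sample documents above
metadata_rows = [
    {"id": 1, "location": "pond", "topic": "animals"},
    {"id": 3, "location": "market", "topic": "food"},
    {"id": 10, "location": "zoo", "topic": "animals"},
]


def matches_eq(metadata, flt):
    """Hypothetical helper: True if every field equals the value given under "$eq"."""
    return all(metadata.get(field) == cond["$eq"] for field, cond in flt.items())


selected = [m["id"] for m in metadata_rows if matches_eq(m, {"topic": {"$eq": "animals"}})]
print(selected)  # [1, 10]
```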
Back to the main topic: confirm that the tables were created in the database:
(myenv) ~ $psql -h localhost -p 5432 -U tn -d sampledb1
psql (16.4)
Type "help" for help.
List the tables that langchain created:
sampledb1=# \dt
Or, from Jupyter:
!psql -d sampledb1 -c "\dt"
                List of relations
 Schema |          Name           | Type  | Owner
--------+-------------------------+-------+-------
 public | langchain_pg_collection | table | tn
 public | langchain_pg_embedding  | table | tn
(2 rows)
Show the table structure:
sampledb1=# \d langchain_pg_embedding
Or, from Jupyter:
!psql -d sampledb1 -c "\d langchain_pg_embedding"
            Table "public.langchain_pg_embedding"
    Column     |       Type        | Collation | Nullable | Default
---------------+-------------------+-----------+----------+---------
 id            | character varying |           | not null |
 collection_id | uuid              |           |          |
 embedding     | vector            |           |          |
 document      | character varying |           |          |
 cmetadata     | jsonb             |           |          |
The documents are written to langchain_pg_embedding, so let's read the data back from it.
To be safe, click the "Restart" button of the Jupyter kernel in VS Code, then run:
!pip install psycopg
The code below is based on *4.
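import psycopg
conn = psycopg.connect("dbname=sampledb1 user=tn")  # see Note 1
cur = conn.cursor()
cur.execute('select * from langchain_pg_embedding')
# Column order: id, collection_id, embedding, document, cmetadata
for row in cur:
    # row[2] is the embedding vector (only the first 100 characters are shown);
    # row[3] is the original document text
    formatted_output = f"id: {row[0]}\n" \
        f"uuid: {row[1]}\n" \
        f"page_content: {row[2][:100]}...\n" \
        f"page_content(string): {row[3]}\n" \
        f"metadata: {row[4]}\n"
    print(formatted_output)
    #print(row)
cur.close()
conn.close()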
Running it prints the following:
id: 1
uuid: 8259217a-01a8-4034-9442-2233ec98c7c7
page_content: [-0.041045193,0.009569716,-0.093480445,0.01990515,0.00062993786,-0.057626557,-0.02735376,-0.01263911...
page_content(string): there are cats in the pond
metadata: {'id': 1, 'topic': 'animals', 'location': 'pond'}
id: 2
uuid: 8259217a-01a8-4034-9442-2233ec98c7c7
page_content: [-0.0365837,0.0019632874,-0.0848764,-0.007010041,-0.028483586,-0.027577631,-1.8776509e-05,-0.0053111...
page_content(string): ducks are also found in the pond
metadata: {'id': 2, 'topic': 'animals', 'location': 'pond'}
id: 3
...
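In the output above the embedding comes back as text (which is why slicing it with [:100] works), so to use it as numbers it has to be converted first. Here is a minimal sketch of that, assuming the value is a string of the form "[x1,x2,...]" as shown above; `parse_vector_text` is a hypothetical helper name and the sample string is shortened from the real output. I believe the pgvector Python package also provides an adapter for psycopg that does this conversion for you, but I have not used it here.

```python
import json


def parse_vector_text(text):
    """Hypothetical helper: convert text like '[0.1,0.2]' into a list of floats."""
    return [float(x) for x in json.loads(text)]


# Shortened sample in the same format as the embedding column printed above
sample = "[-0.041045193,0.009569716,-1.8776509e-05]"
vec = parse_vector_text(sample)
print(len(vec), vec[0])
```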
To dump the langchain_pg_embedding table from the database:
!pg_dump -U tn -h localhost -p 5432 -d sampledb1 -t langchain_pg_embedding -f langchain_pg_embedding_dump.sql
!cat langchain_pg_embedding_dump.sql
References:
*1: https://qiita.com/tnagata/items/a88febd0f8cea88e1be8
*2: https://qiita.com/tnagata/items/7e6ae9956bdcaf167d94
*3: https://github.com/langchain-ai/langchain-postgres/blob/main/examples/vectorstore.ipynb and https://api.python.langchain.com/en/latest/vectorstores/langchain_postgres.vectorstores.PGVector.html
*4: https://www.psycopg.org/psycopg3/docs/basic/usage.html
Note 1:
conn = psycopg.connect(dbname="sampledb1",host="localhost",port=5432,user="tn")
also works. Note, however, that writing database=... here, as psycopg2 allows, raises an error; you have to use dbname.