Loading CiNii data into the Chroma vector database with a DocumentLoader and trying nearest-neighbor search (with a comparison of embedding models)

Posted at 2024-01-08

Environment setup

On Windows 11 I had trouble getting compatible versions of Python, chromadb, and the other packages, so I created the environment with Miniforge as follows.

conda create -n env_chroma chromadb

The versions used this time are as follows.
Python 3.11.7
langchain 0.0.353
chromadb 0.4.21
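
To confirm which versions are actually installed in an environment, something like the following can be run. This is a minimal sketch using only the standard library; the helper name `installed_version` is my own and not part of any of the packages above.

```python
import sys
from importlib import metadata


def installed_version(package: str) -> str:
    """Return the installed version of a package, or 'not installed'."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return "not installed"


if __name__ == "__main__":
    # sys.version starts with the interpreter version, e.g. "3.11.7 ..."
    print("Python", sys.version.split()[0])
    for name in ("langchain", "chromadb"):
        print(name, installed_version(name))
```

Run inside the activated environment, this prints one line per package, which makes it easy to compare against the versions listed above.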

A DocumentLoader for CiNii

At first I tried to use LangChain's JSONLoader, but the Python bindings for jq did not support Windows. So, with reference to "Building wheels for collected package: jq failed in Windows #4396", I wrote a DocumentLoader for CiNii myself.

import json
from pathlib import Path
from typing import Callable, Dict, List, Optional, Union
import requests
import urllib.parse
from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader
import pprint

class JSONLoader(BaseLoader):
    def __init__(
        self,
        # Despite its name, file_path receives already-parsed JSON
        # (a list or dict), not a path on disk.
        file_path: Union[List, Dict],
        content_key: Optional[str] = None,
        metadata: Optional[str] = None,
    ):
        self.file_path = file_path
        self._content_key = content_key
        self.metadata_item = metadata

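    def create_documents(self, processed_data):
        """Wrap each flattened 'key: value' string in a Document."""
        documents = []
        for item in processed_data:
            # Any line containing '@id: ' becomes the metadata for the lines
            # that follow. This also matches nested keys such as
            # 'rdfs:seeAlso.@id: ...', not only the top-level '@id: ...'.
            if '@id: ' in item:
                self.metadata_item = item
            document = Document(page_content=item, metadata={'@id': self.metadata_item})
            documents.append(document)
        return documents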
    
    def process_item(self, item, prefix=""):
        if isinstance(item, dict):
            result = []
            for key, value in item.items():
                new_prefix = f"{prefix}.{key}" if prefix else key
                result.extend(self.process_item(value, new_prefix))
            return result
        elif isinstance(item, list):
            result = []
            for value in item:
                result.extend(self.process_item(value, prefix))
            return result
        else:
            return [f"{prefix}: {item}"]

    def process_json(self,data):
        if isinstance(data, list):
            processed_data = []
            for item in data:
                processed_data.extend(self.process_item(item))
            return processed_data
        elif isinstance(data, dict):
            return self.process_item(data)
        else:
            return []

    def load(self) -> List[Document]:
        """Flatten the parsed JSON held in self.file_path into Documents."""
        # The JSON was already parsed by requests, so no file is opened
        # and no JSON decoding happens here.
        processed_json = self.process_json(self.file_path)
        return self.create_documents(processed_json)

def filter_function(item):
    return item['items']

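# CiNii Research OpenSearch endpoint; results are requested as JSON.
url = 'https://cir.nii.ac.jp/opensearch/all?sortorder=0&format=json&'
# Fetch 10 hits for the query "大規模言語モデル" (large language model).
d = {'count': '10', 'q': '大規模言語モデル'}
d_qs = url + urllib.parse.urlencode(d)

response = requests.get(d_qs)
response_json = response.json()
# Keep only the list of hits under the 'items' key.
filtered_data = filter_function(response_json)

# The loader receives the parsed list directly, not a file path.
loader = JSONLoader(filtered_data)
data = loader.load()
pprint.pprint(data)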

This produces data like the following.

[Document(page_content='@id: https://cir.nii.ac.jp/crid/1390298668092493184', metadata={'@id': '@id: https://cir.nii.ac.jp/crid/1390298668092493184'}),
 Document(page_content='@type: item', metadata={'@id': '@id: https://cir.nii.ac.jp/crid/1390298668092493184'}),
 Document(page_content='title: 生成AIがやってきた！東北大学における注意喚起発出の経緯と方針，そして…', metadata={'@id': '@id: https://cir.nii.ac.jp/crid/1390298668092493184'}),
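
To make the flattening step easier to follow, here is a small standalone sketch of the same idea as `process_item`, run on a made-up record. The `sample` dictionary and its example.org URLs are hypothetical and only loosely shaped like a CiNii entry; the function name `flatten` is mine.

```python
def flatten(item, prefix=""):
    """Flatten nested dicts/lists into 'dotted.key: value' strings."""
    if isinstance(item, dict):
        lines = []
        for key, value in item.items():
            new_prefix = f"{prefix}.{key}" if prefix else key
            lines.extend(flatten(value, new_prefix))
        return lines
    if isinstance(item, list):
        lines = []
        for value in item:
            # List elements share the parent's prefix (no index is added).
            lines.extend(flatten(value, prefix))
        return lines
    return [f"{prefix}: {item}"]


# Hypothetical record, not real CiNii data.
sample = {
    "@id": "https://example.org/crid/0000",
    "title": "Example title",
    "rdfs:seeAlso": {"@id": "https://example.org/crid/0000.json"},
    "dc:subject": ["topic A", "topic B"],
}

flat = flatten(sample)
for line in flat:
    print(line)

# Lines that the loader's "'@id: ' in item" check would treat as metadata.
id_lines = [line for line in flat if "@id: " in line]
```

Note that `id_lines` contains two entries here: the top-level `@id` and the nested `rdfs:seeAlso.@id`. In the loader above, the same substring check means that fields appearing after `rdfs:seeAlso` get the `.json` URL as their metadata.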

Running the embedding

Next, the documents created above are loaded into Chroma.
There are several embedding models that can handle Japanese, so I compared four that caught my eye.

from langchain.vectorstores import Chroma
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
import os

def database(**args_dict):
    print(args_dict['db'], args_dict['embedding_function'])
    # create the open-source embedding function
    embedding_function = SentenceTransformerEmbeddings(
        model_name = args_dict['embedding_function'],
        encode_kwargs={"normalize_embeddings":False })
    path = os.path.join('./database2/', args_dict['db'])
    print(path)
    db = Chroma.from_documents(
        documents=data,
        embedding = embedding_function,
        persist_directory = path
    )
    if db:
        db.persist()
        # db = None
    else:
        print("Chroma DB has not been initialized.")

list_db = [
    {'db':'db1', 'embedding_function':'sentence-transformers/all-MiniLM-L6-v2'},
    {'db':'db2', 'embedding_function':'oshizo/sbert-jsnli-luke-japanese-base-lite'},
    {'db':'db3', 'embedding_function':'intfloat/multilingual-e5-small'},
    {'db':'db4', 'embedding_function':'sonoisa/sentence-bert-base-ja-mean-tokens-v2'}
]

for dict_db in list_db:
    database(**dict_db)

Nearest-neighbor search

Now the databases created above are loaded from disk and searched.

query = '教育情報'  # "educational information"
top_k=10

def load_from_disk(**args_dict):
    print(args_dict['db'], args_dict['embedding_function'])
    path = os.path.join('./database2/', args_dict['db'])
    embedding_function = SentenceTransformerEmbeddings(model_name=args_dict['embedding_function'])
    db = Chroma(persist_directory = path, embedding_function=embedding_function)
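    # Note: the keyword here is spelled `K`, but the parameter is `k`.
    # The argument appears to be ignored, so the default of 4 results is
    # returned, which matches the output shown for each database.
    # Pass k=top_k instead to actually get top_k results.
    docs = db.similarity_search_with_score(query, K=top_k)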
    print(len(docs))
    for mykey in docs:
        print(mykey)
    print('------\n')

for each_db in list_db:
    load_from_disk(**each_db)

The results are as follows (long, but pasted as is).

db1 sentence-transformers/all-MiniLM-L6-v2
4
(Document(page_content='title: キャラクタ対話システムのための文脈を用いた応答評価', metadata={'@id': '@id: https://cir.nii.ac.jp/crid/1390579830889855744'}), 1.272719383239746)
(Document(page_content='title: Remdis: リアルタイムマルチモーダル対話システム構築ツールキット', metadata={'@id': '@id: https://cir.nii.ac.jp/crid/1390016880936443648'}), 1.2835525274276733)
(Document(page_content='dc:subject: 注意喚起', metadata={'@id': 'rdfs:seeAlso.@id: https://cir.nii.ac.jp/crid/1390298668092493184.json'}), 1.3283870220184326)
(Document(page_content='dc:subject: 教育応用', metadata={'@id': 'rdfs:seeAlso.@id: https://cir.nii.ac.jp/crid/1390298668092493184.json'}), 1.3283870220184326)
------

db2 oshizo/sbert-jsnli-luke-japanese-base-lite
4
(Document(page_content='prism:publicationName: 教育システム情報学会誌', metadata={'@id': 'rdfs:seeAlso.@id: https://cir.nii.ac.jp/crid/1390298668092493184.json'}), 54.82743453979492)
(Document(page_content='dc:subject: 教育現場対応', metadata={'@id': 'rdfs:seeAlso.@id: https://cir.nii.ac.jp/crid/1390298668092493184.json'}), 59.101375579833984)
(Document(page_content='dc:publisher: 教育システム情報学会', metadata={'@id': 'rdfs:seeAlso.@id: https://cir.nii.ac.jp/crid/1390298668092493184.json'}), 59.296852111816406)
(Document(page_content='title: 生成AIがやってきた！東北大学における注意喚起発出の経緯と方針，そして…', metadata={'@id': '@id: https://cir.nii.ac.jp/crid/1390298668092493184'}), 65.97941589355469)
------

db3 intfloat/multilingual-e5-small
4
(Document(page_content='prism:publicationName: 教育システム情報学会誌', metadata={'@id': 'rdfs:seeAlso.@id: https://cir.nii.ac.jp/crid/1390298668092493184.json'}), 0.22750639915466309)
(Document(page_content='dc:publisher: 教育システム情報学会', metadata={'@id': 'rdfs:seeAlso.@id: https://cir.nii.ac.jp/crid/1390298668092493184.json'}), 0.23384930193424225)
(Document(page_content='dc:subject: 教育応用', metadata={'@id': 'rdfs:seeAlso.@id: https://cir.nii.ac.jp/crid/1390298668092493184.json'}), 0.2409597784280777)
(Document(page_content='dc:subject: 教育現場対応', metadata={'@id': 'rdfs:seeAlso.@id: https://cir.nii.ac.jp/crid/1390298668092493184.json'}), 0.25710949301719666)
------

db4 sonoisa/sentence-bert-base-ja-mean-tokens-v2
4
(Document(page_content='prism:publicationName: 教育システム情報学会誌', metadata={'@id': 'rdfs:seeAlso.@id: https://cir.nii.ac.jp/crid/1390298668092493184.json'}), 241.55291748046875)
(Document(page_content='dc:subject: 教育応用', metadata={'@id': 'rdfs:seeAlso.@id: https://cir.nii.ac.jp/crid/1390298668092493184.json'}), 260.86749267578125)
(Document(page_content='dc:publisher: 教育システム情報学会', metadata={'@id': 'rdfs:seeAlso.@id: https://cir.nii.ac.jp/crid/1390298668092493184.json'}), 305.72418212890625)
(Document(page_content='dc:subject: 教育現場対応', metadata={'@id': 'rdfs:seeAlso.@id: https://cir.nii.ac.jp/crid/1390298668092493184.json'}), 317.1845703125)

You can see that the results differ slightly from model to model (I am not in a position to judge which one is better...).
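
The scores also sit on very different scales (around 0.2 for one model, in the hundreds for another). One plausible reason, assuming the collections use Chroma's default squared L2 distance and that the models emit vectors of different lengths (the embeddings were stored with `normalize_embeddings` set to `False`), is that the score depends on vector length as well as direction. The toy sketch below illustrates this with made-up 2-dimensional vectors; none of the numbers come from the real models.

```python
import math


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank(query, vectors):
    """Return (index, distance) pairs sorted from closest to farthest."""
    scored = [(i, squared_l2(query, v)) for i, v in enumerate(vectors)]
    return sorted(scored, key=lambda pair: pair[1])


# Made-up vectors for illustration only.
query = [3.0, 4.0]
docs = [[6.0, 8.0], [4.0, 3.0], [-3.0, 4.0]]

raw = rank(query, docs)
# Same vectors, but every component is 10 times larger.
scaled = rank([10 * x for x in query], [[10 * x for x in d] for d in docs])
# Same vectors after scaling each one to unit length.
unit = rank(normalize(query), [normalize(d) for d in docs])

print("raw   :", raw)
print("scaled:", scaled)
print("unit  :", unit)
```

In this toy data, making every vector 10 times longer multiplies the distances by 100 without changing the order, while normalizing changes the order as well, because document 0 points in exactly the same direction as the query. For unit vectors the squared L2 distance equals 2 - 2 * cosine similarity, so scores from different models are only roughly comparable once the vectors are normalized.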

Next steps

Now that the data is loaded into a vector database,
LangChain RetrievalQA should also be possible, but I will leave that as future work.
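
As a rough picture of what that future step involves, the sketch below shows only the general idea of placing retrieved passages into a prompt before an LLM is called. It is not LangChain's RetrievalQA code and calls no LLM; the function `build_prompt`, the prompt wording, and the sample passages are all hypothetical.

```python
from typing import List, Tuple


def build_prompt(
    question: str,
    passages: List[Tuple[str, float]],
    max_passages: int = 3,
) -> str:
    """Combine the closest passages and the question into one prompt string."""
    # Scores are treated as distances: a lower score means a closer passage.
    closest = sorted(passages, key=lambda pair: pair[1])[:max_passages]
    context = "\n".join(f"- {text}" for text, _ in closest)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\n"
    )


# Hypothetical (text, distance) pairs, shaped like the search results above.
passages = [
    ("passage C", 0.90),
    ("passage A", 0.23),
    ("passage B", 0.26),
]

prompt = build_prompt("What is X?", passages, max_passages=2)
print(prompt)
```

The resulting string would then be passed to whichever LLM is used; choosing and calling that model is the part left for later.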
