Building a Simple RAG with Chroma: Registering Documents

Last updated at 2025-12-29, posted at 2025-12-26

Introduction

When you build a RAG (Retrieval-Augmented Generation) system, the VectorStore that holds your text as vectors plays a central role. This article explains how to register documents in Chroma, a free VectorStore you can use from Python.

Intended Audience

Anyone who wants to build a RAG system themselves

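What You Will Learn

What Chroma is
Choosing an Embedding model (with Japanese support)
Loading and splitting documents
Registering documents in the VectorStore
Persistence (keeping the data available after a restart)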

1. What Is Chroma?

Overview

Chroma is an open-source vector database. Text is converted into vectors (arrays of numbers) and stored, so that similarity search can run quickly.

Text → Embedding (vectorization) → Chroma (storage) → Similarity search

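Why Choose Chroma?

Feature	Chroma	Pinecone	FAISS
Cost	Free	Paid plans available	Free
Persistence	Easy	Cloud	Manual
Setup	pip install	Account creation	pip install
Local execution	✅	❌	✅

Conclusion: if you want something free that runs locally and persists data with little effort, Chroma is a good fit.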

2. Installing the Required Packages

pip install langchain-chroma langchain-huggingface langchain-text-splitters

What Each Package Does

Package	Role
`langchain-chroma`	Use Chroma from LangChain
`langchain-huggingface`	Use HuggingFace Embedding models
`langchain-text-splitters`	Split long text into chunks

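3. Choosing an Embedding Model

What Is an Embedding?

Embedding is the process of converting text into a numeric vector. Sentences with similar meanings map to vectors that are close together.

「猫が好き」 ("I like cats") → [0.12, -0.45, 0.78, ...]
「犬が好き」 ("I like dogs") → [0.15, -0.42, 0.75, ...]  ← close!
「今日は晴れ」 ("It is sunny today") → [-0.33, 0.21, -0.56, ...] ← far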
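
To make "close" and "far" concrete, the following minimal sketch computes cosine similarity for the three illustrative vectors above. The vectors are the truncated example values from this article, not real model output, and `cosine_similarity` is a helper name chosen for this sketch.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Illustrative 3-dimensional vectors taken from the example above
cat = [0.12, -0.45, 0.78]       # 「猫が好き」
dog = [0.15, -0.42, 0.75]       # 「犬が好き」
weather = [-0.33, 0.21, -0.56]  # 「今日は晴れ」

sim_cat_dog = cosine_similarity(cat, dog)
sim_cat_weather = cosine_similarity(cat, weather)
print(f"cat vs dog:     {sim_cat_dog:.3f}")
print(f"cat vs weather: {sim_cat_weather:.3f}")
```

A value near 1 means the two vectors point in almost the same direction, while a negative value means they point in roughly opposite directions.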

Japanese-Capable Models

Important: when working with Japanese text, choose a model that supports Japanese.

from langchain_huggingface import HuggingFaceEmbeddings

# ❌ Not suitable: English-only model
embeddings_ng = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"  # English only
)

# ✅ OK: multilingual model (supports Japanese)
embeddings = HuggingFaceEmbeddings(
    model_name="intfloat/multilingual-e5-small",
    model_kwargs={"device": "cpu"},
    encode_kwargs={"normalize_embeddings": True}
)

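Recommended Model Comparison

Model	Size	Japanese	Use case
`multilingual-e5-small`	approx. 470 MB	✅	General use (recommended)
`multilingual-e5-large`	approx. 2.2 GB	✅	When higher accuracy is needed
`all-MiniLM-L6-v2`	approx. 90 MB	❌	English only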
4. Preparing the Documents

Directory Layout

project/
├── documents/           # Where the text files go
│   ├── sample1.txt
│   ├── sample2.txt
│   └── ...
└── chroma_db/          # VectorStore location (created automatically)

Sample Document

documents/sample1.txt (Japanese sample text):

LangGraphは、LLMアプリケーションのためのフレームワークです。
グラフ構造を使ってAIワークフローを構築できます。

主な機能として、State（状態管理）、Node（処理）、Edge（接続）があります。
Checkpointerを使えば会話履歴を永続化できます。

5. Loading the Documents

The Document Object

LangChain handles text as Document objects.

from pathlib import Path
from langchain_core.documents import Document

def load_documents(documents_dir: str) -> list[Document]:
    """
    Load the text files in a directory.

    Args:
        documents_dir: Path to the directory containing the text files

    Returns:
        List of Document objects
    """
    docs_path = Path(documents_dir)
    documents = []

    # sorted() keeps the load order stable across platforms
    for txt_file in sorted(docs_path.glob("*.txt")):
        content = txt_file.read_text(encoding="utf-8")
        doc = Document(
            page_content=content,              # The text itself
            metadata={"source": txt_file.name}  # Metadata (source, etc.)
        )
        documents.append(doc)
        print(f"Loaded: {txt_file.name}")

    return documents

Example Run

documents = load_documents("documents")
# Output:
# Loaded: sample1.txt
# Loaded: sample2.txt

print(f"Number of documents loaded: {len(documents)}")
# Output: Number of documents loaded: 2

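6. Splitting the Documents

Why Splitting Is Needed

Embedding a long text as a single vector blends many topics into one representation, which tends to lower retrieval accuracy. Embedding models also have an input limit (multilingual-e5-small truncates input beyond 512 tokens). Splitting the text into appropriately sized chunks makes pinpoint retrieval possible.

Long text (5,000 characters)
    ↓ split
Chunk 1 (500 characters), Chunk 2 (500 characters), ...
    ↓
Embed each chunk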

RecursiveCharacterTextSplitter

from langchain_text_splitters import RecursiveCharacterTextSplitter

def split_documents(documents: list[Document]) -> list[Document]:
    """
    Split documents into appropriately sized chunks.

    Args:
        documents: List of Document objects

    Returns:
        List of split Document objects
    """
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,       # Maximum number of characters per chunk
        chunk_overlap=50,     # Number of characters shared between chunks
        length_function=len,
        separators=[
            "\n\n",           # Paragraph
            "\n",             # Line break
            "。",             # Japanese full stop
            "、",             # Japanese comma
            " ",              # Space
            ""                # Character level (last resort)
        ]
    )

    splits = text_splitter.split_documents(documents)
    print(f"Split {len(documents)} document(s) into {len(splits)} chunk(s)")

    return splits

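Parameter Reference

Parameter	Description	Recommended value
`chunk_size`	Maximum size of one chunk	300 to 1000
`chunk_overlap`	Overlap between adjacent chunks	About 10% of chunk_size
`separators`	Split priority order	Add Japanese punctuation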

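Why Is Overlap Needed?

Without overlap:
Chunk 1: "...these are the features of LangGraph."
Chunk 2: "The main features include..."

↑ The context is cut off!

With overlap:
Chunk 1: "...these are the features of LangGraph. The main features include"
Chunk 2: "The main features include State, ..."

↑ The context carries over!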
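
The effect of chunk_overlap can be seen with a simplified fixed-window splitter. This is a sketch for illustration only: it ignores separators entirely, so it is not how RecursiveCharacterTextSplitter works internally, and `split_with_overlap` is a name made up for this example.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows where each window repeats the
    last chunk_overlap characters of the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks

text = "abcdefghij"
no_overlap = split_with_overlap(text, chunk_size=4, chunk_overlap=0)
with_overlap = split_with_overlap(text, chunk_size=4, chunk_overlap=1)
print(no_overlap)    # ['abcd', 'efgh', 'ij']
print(with_overlap)  # ['abcd', 'defg', 'ghij']
```

With an overlap of 1, the last character of each chunk reappears at the start of the next one, which is the same idea as the 50-character overlap used in this article.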

7. Registering in the VectorStore

Bulk Registration with from_documents()

from langchain_huggingface import HuggingFaceEmbeddings
from langchain_chroma import Chroma  # provided by the langchain-chroma package installed above

def create_vector_store(
    splits: list[Document],
    persist_dir: str = "chroma_db"
) -> Chroma:
    """
    Register the split documents in the VectorStore.

    Args:
        splits: List of split Document objects
        persist_dir: Directory to persist the data to

    Returns:
        Chroma instance
    """
    # 1. Initialize the Embedding model
    print("Loading the Embedding model...")
    embeddings = HuggingFaceEmbeddings(
        model_name="intfloat/multilingual-e5-small",
        model_kwargs={"device": "cpu"},
        encode_kwargs={"normalize_embeddings": True}
    )

    # 2. Build the VectorStore
    print("Building the VectorStore...")
    vector_store = Chroma.from_documents(
        documents=splits,           # Split documents
        embedding=embeddings,       # Embedding model
        persist_directory=persist_dir  # Save location
    )

    print(f"VectorStore built. Saved to: {persist_dir}/")
    return vector_store

Processing Flow

splits (Document[])
    ↓
Vectorize with the Embedding model
    ↓
Store in Chroma (vector + text + metadata)
    ↓
Persist to disk (persist_directory)
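
To show what "vector + text + metadata" means in practice, here is a toy in-memory store. Everything in it is an assumption made for illustration: `toy_embed` counts characters instead of calling a real embedding model, and `ToyVectorStore` is a made-up class that only mimics the general shape of a vector store, not Chroma's implementation.

```python
import math
from collections import Counter
from dataclasses import dataclass, field

def toy_embed(text: str) -> Counter:
    """Stand-in for an embedding model: a sparse character-count vector."""
    return Counter(text)

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[ch] * b[ch] for ch in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

@dataclass
class ToyVectorStore:
    # Each record keeps the vector together with the original text and metadata
    records: list[tuple[Counter, str, dict]] = field(default_factory=list)

    def add(self, text: str, metadata: dict) -> None:
        self.records.append((toy_embed(text), text, metadata))

    def search(self, query: str, k: int = 1) -> list[tuple[str, dict]]:
        """Return the k stored texts most similar to the query."""
        q = toy_embed(query)
        ranked = sorted(self.records, key=lambda r: cosine(q, r[0]), reverse=True)
        return [(text, meta) for _, text, meta in ranked[:k]]

store = ToyVectorStore()
store.add("猫が好き", {"source": "sample1.txt"})
store.add("今日は晴れ", {"source": "sample2.txt"})
top_text, top_meta = store.search("猫")[0]
print(top_text, top_meta)
```

Because the text and metadata are stored next to the vector, a search can return the original chunk and its source file rather than just a list of numbers.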

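8. Persistence and Reloading

How Persistence Works

When persist_directory is specified, the VectorStore is saved to disk.

chroma_db/
├── chroma.sqlite3        # Metadata and document text
└── <uuid>/               # Index directory (the name is a generated UUID)
    ├── data_level0.bin   # Vector data
    └── ...

Loading an Existing VectorStore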

def load_existing_vector_store(persist_dir: str = "chroma_db") -> Chroma:
    """
    Load an existing VectorStore.

    Args:
        persist_dir: Directory where the VectorStore is saved

    Returns:
        Chroma instance
    """
    from pathlib import Path

    # Check that the store exists
    persist_path = Path(persist_dir)
    if not persist_path.exists() or not any(persist_path.iterdir()):
        raise FileNotFoundError(f"VectorStore not found: {persist_dir}")

    # Initialize the Embedding model (must be the same model used at build time)
    embeddings = HuggingFaceEmbeddings(
        model_name="intfloat/multilingual-e5-small",
        model_kwargs={"device": "cpu"},
        encode_kwargs={"normalize_embeddings": True}
    )

    # Load the existing VectorStore
    vector_store = Chroma(
        persist_directory=persist_dir,
        embedding_function=embeddings
    )

    print(f"Loaded VectorStore: {persist_dir}")
    return vector_store

Deciding Between a New Build and a Reload

def setup_vector_store(documents_dir: str, persist_dir: str) -> Chroma:
    """
    Set up the VectorStore (load it if it exists, otherwise build it).
    """
    persist_path = Path(persist_dir)

    # Load the existing VectorStore if there is one
    if persist_path.exists() and any(persist_path.iterdir()):
        print("Loading the existing VectorStore...")
        return load_existing_vector_store(persist_dir)

    # Otherwise build a new one
    print("Building a new VectorStore...")
    documents = load_documents(documents_dir)
    splits = split_documents(documents)
    return create_vector_store(splits, persist_dir)
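
The same existence check appears in both load_existing_vector_store() and setup_vector_store(). One way to keep the two consistent is a small shared helper. The sketch below uses a hypothetical name, `has_persisted_store`, and demonstrates it on a temporary directory rather than a real Chroma store. Note that it only checks that the directory is non-empty, not that its contents are a valid Chroma database.

```python
import tempfile
from pathlib import Path

def has_persisted_store(persist_dir: str) -> bool:
    """Return True if persist_dir is a directory that contains at least one entry."""
    path = Path(persist_dir)
    return path.is_dir() and any(path.iterdir())

with tempfile.TemporaryDirectory() as tmp:
    missing = has_persisted_store(str(Path(tmp) / "chroma_db"))  # directory does not exist
    empty = has_persisted_store(tmp)                             # exists but is empty
    (Path(tmp) / "chroma.sqlite3").write_text("placeholder")     # stand-in file, not a real DB
    populated = has_persisted_store(tmp)

print(missing, empty, populated)
```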

9. Complete Sample Code

rag_setup.py

"""
Script for building a Chroma VectorStore

Usage:
    python rag_setup.py
"""

from pathlib import Path
from langchain_core.documents import Document
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_chroma import Chroma  # provided by the langchain-chroma package
from langchain_text_splitters import RecursiveCharacterTextSplitter


def load_documents(documents_dir: str) -> list[Document]:
    """Load the documents."""
    docs_path = Path(documents_dir)
    documents = []

    # sorted() keeps the load order stable across platforms
    for txt_file in sorted(docs_path.glob("*.txt")):
        content = txt_file.read_text(encoding="utf-8")
        doc = Document(
            page_content=content,
            metadata={"source": txt_file.name}
        )
        documents.append(doc)
        print(f"  Loaded: {txt_file.name}")

    return documents


def split_documents(documents: list[Document]) -> list[Document]:
    """Split the documents into chunks."""
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=50,
        separators=["\n\n", "\n", "。", "、", " ", ""]
    )
    return text_splitter.split_documents(documents)


def build_vector_store(
    documents_dir: str = "documents",
    persist_dir: str = "chroma_db"
) -> Chroma:
    """Build the VectorStore."""

    print("=" * 50)
    print("Chroma VectorStore Build Tool")
    print("=" * 50)

    # 1. Initialize the Embedding model
    print("\n[1/4] Initializing the Embedding model...")
    embeddings = HuggingFaceEmbeddings(
        model_name="intfloat/multilingual-e5-small",
        model_kwargs={"device": "cpu"},
        encode_kwargs={"normalize_embeddings": True}
    )

    # 2. Load the documents
    print(f"\n[2/4] Loading documents... ({documents_dir}/)")
    documents = load_documents(documents_dir)

    if not documents:
        raise ValueError(f"No documents found: {documents_dir}/*.txt")

    # 3. Split the documents
    print("\n[3/4] Splitting documents...")
    splits = split_documents(documents)
    print(f"  {len(documents)} documents → {len(splits)} chunks")

    # 4. Build the VectorStore
    print("\n[4/4] Building the VectorStore...")
    vector_store = Chroma.from_documents(
        documents=splits,
        embedding=embeddings,
        persist_directory=persist_dir
    )

    print("\n" + "=" * 50)
    print("Build complete!")
    print(f"  Saved to: {persist_dir}/")
    print(f"  Chunks: {len(splits)}")
    print("=" * 50)

    return vector_store


if __name__ == "__main__":
    build_vector_store()

Running It

$ python rag_setup.py

==================================================
Chroma VectorStore Build Tool
==================================================

[1/4] Initializing the Embedding model...

[2/4] Loading documents... (documents/)
  Loaded: sample1.txt
  Loaded: sample2.txt

[3/4] Splitting documents...
  2 documents → 8 chunks

[4/4] Building the VectorStore...

==================================================
Build complete!
  Saved to: chroma_db/
  Chunks: 8
==================================================
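
One thing to watch for: as far as I understand the langchain-chroma behavior, running the script a second time against the same chroma_db/ adds the same chunks again, because fresh random IDs are generated when none are supplied. A possible mitigation is to derive deterministic IDs from each chunk and pass them through the `ids` argument of Chroma.from_documents(). The helper below, `make_chunk_ids`, is a hypothetical sketch. It only relies on each chunk having `page_content` and `metadata` attributes, so it is demonstrated here with plain stand-in objects instead of real Document instances.

```python
import hashlib
from types import SimpleNamespace

def make_chunk_ids(chunks) -> list[str]:
    """Build a repeatable ID per chunk from its source name, position, and text."""
    ids = []
    for index, chunk in enumerate(chunks):
        source = chunk.metadata.get("source", "unknown")
        raw = f"{source}:{index}:{chunk.page_content}"
        ids.append(hashlib.sha256(raw.encode("utf-8")).hexdigest())
    return ids

# Stand-ins for LangChain Document objects (same attribute names)
chunks = [
    SimpleNamespace(
        page_content="LangGraphは、LLMアプリケーションのためのフレームワークです。",
        metadata={"source": "sample1.txt"},
    ),
    SimpleNamespace(
        page_content="グラフ構造を使ってAIワークフローを構築できます。",
        metadata={"source": "sample1.txt"},
    ),
]
ids_first_run = make_chunk_ids(chunks)
ids_second_run = make_chunk_ids(chunks)
print(ids_first_run == ids_second_run)
```

In rag_setup.py this would amount to adding `ids=make_chunk_ids(splits)` to the Chroma.from_documents() call, so that re-running the script overwrites existing entries instead of duplicating them. Treat this as a starting point and confirm the behavior against your installed version.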

10. Troubleshooting

Q1. The model download is slow

Cause: on the first run, the model is downloaded from HuggingFace (about 470 MB).

Solution: download it in advance.

python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('intfloat/multilingual-e5-small')"

Q2. Out-of-memory errors

Cause: the model or the documents are too large.

Solution:

# Use a lighter model (note: English only)
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

# Use a smaller chunk size
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,  # 500 → 300
    chunk_overlap=30
)

Q3. Search accuracy is low for Japanese

Cause: an English-only model is being used.

Solution: use a multilingual model such as multilingual-e5-small.

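Summary

What We Covered

✅ Chroma is a free VectorStore that runs locally
✅ multilingual-e5-small is a good choice for Japanese
✅ Long text is split with RecursiveCharacterTextSplitter
✅ from_documents() registers documents in bulk
✅ persist_directory enables persistence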

Registration Flow

Text files
    ↓ load_documents()
Document objects
    ↓ split_documents()
Split chunks
    ↓ from_documents() + Embedding
VectorStore (Chroma)
    ↓ persist_directory
Persisted to disk

Coming Up Next

The second part explains how to search the VectorStore you registered.

How to use similarity_search()
Formatting the search results
Using it as a LangGraph tool
