Sample: reading part of an HTML file, in the style of a LangChain DocumentLoader

Posted at 2023-08-04

Overview

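I want to read HTML files and store them in a vector database, for use cases such as RAG with an LLM.
LangChain provides a DocumentLoader for HTML, but it cannot extract only part of an HTML document.
- https://python.langchain.com/docs/modules/data_connection/document_loaders/html
- For example, by selector, XPath, or id
This article provides a sample of an alternative approach.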
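
As a rough illustration of what extracting "part of the HTML" can look like, here is a minimal sketch using BeautifulSoup. The HTML string, the `lead` class, and the `sidebar` element are made up for this example; only the `main-content` id and the `h1` title come from the script later in this article. BeautifulSoup supports lookups by id and by CSS selector, but it has no XPath support of its own, so XPath would need another library such as lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, used only for this illustration.
html = """
<html><body>
  <h1>Sample page</h1>
  <div id="main-content"><p class="lead">First paragraph.</p><p>Second paragraph.</p></div>
  <div id="sidebar">Navigation</div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Look up a single element by its id.
by_id = soup.find(id="main-content")

# Look up a single element with a CSS selector.
by_selector = soup.select_one("div#main-content p.lead")

# Collect the text of every <p> that is a direct child of #main-content.
paragraphs = [p.get_text(strip=True) for p in soup.select("#main-content > p")]

# Join the text nodes with newlines so adjacent paragraphs do not run together.
main_text = by_id.get_text(separator="\n", strip=True)
print(main_text)
```

Note that `find` and `select_one` return `None` when nothing matches, so real pages may need a check before calling `get_text`.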

Steps

Use BeautifulSoup to read part of each HTML file.
Create LangChain Document objects manually.

Sample script

from pathlib import Path

from bs4 import BeautifulSoup
from langchain.docstore.document import Document
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

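# Collect the files under a directory, including its subdirectories
# (this pattern matches only the .htm extension)
html_dir = Path("some directory")
html_files = list(html_dir.glob("**/*.htm"))

# Prepare a splitter for breaking the extracted HTML text into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=0)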

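all_documents = []
for p in html_files:
    # Load the HTML file with BeautifulSoup (the parser is named explicitly)
    with open(p, "r", encoding="UTF-8") as f:
        soup = BeautifulSoup(f, "html.parser")

    # Extract part of the HTML (here, the element with a given id)
    main = soup.find(id="main-content")
    # Get the title element (here, the h1 tag)
    h1 = soup.find("h1")
    if main is None or h1 is None:
        # find() returns None when nothing matches; skip such files
        continue
    content = main.text
    title = h1.text

    # Build the Document by hand instead of using a LangChain DocumentLoader,
    # storing the file path and the title as metadata
    doc = Document(page_content=content, metadata={"source": p, "title": title})

    # Split the document with the splitter and add the chunks to the list
    split_docs = text_splitter.split_documents([doc])
    all_documents += split_docs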

# Vectorize the chunks, for example by following the LangChain samples
embeddings = OpenAIEmbeddings(deployment="text-embedding-ada-002")
db = Chroma(persist_directory="_vecorstore", embedding_function=embeddings)
for doc in all_documents:
    # Convert the Path stored in the metadata to a string before adding it to the store
    doc.metadata["source"] = str(doc.metadata["source"])
    db.add_texts([doc.page_content], [doc.metadata])
