Converting pages to Markdown section by section with Scrapy + Beautiful Soup + markdownify

Last updated at 2025-02-21. Posted at 2025-02-20.

Deciding how to split the data you feed into RAG is a tricky problem.

The "Chunking approaches" section of "RAG chunking phase" lists the following:

Sentence-based parsing
Fixed-size parsing, with overlap
Custom code
Language model augmentation
Document layout analysis
Prebuilt model
Custom model

This time I tried splitting with custom code.

Approach

Use the Items and Item Pipeline features built into Scrapy
Use Beautiful Soup for scraping
Use markdownify for the Markdown conversion

Creating the Items

Overall, the following four steps are needed, so I had each of the first three return a Scrapy Item.

Crawling
Scraping
Markdown conversion
Saving
- Saving the HTML
- Saving the split Markdown

I defined the Items as follows.

items.py

import scrapy
from bs4 import BeautifulSoup


class RawColumnPageItem(scrapy.Item):
    url: str = scrapy.Field()
    soup: BeautifulSoup = scrapy.Field()
    whole: str = scrapy.Field()


class SoupColumnPageItem(scrapy.Item):
    url: str = scrapy.Field()
    title: str = scrapy.Field()
    category: str = scrapy.Field()
    sections: list[BeautifulSoup] = scrapy.Field()
    writers = scrapy.Field() # list[soup element]
    whole: str = scrapy.Field()


class MarkdownColumnPageItem(scrapy.Item):
    url: str = scrapy.Field()
    title: str = scrapy.Field()
    category: str = scrapy.Field()
    sections: list[str] = scrapy.Field()
    whole: str = scrapy.Field()

Creating the crawler

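I slightly modified the crawler I wrote for a separate article so that it returns a BeautifulSoup object. Since the goal this time is only to check that things work, it processes just one page.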

intellilink_column.py

import scrapy
from scrapy.http import Response
from scraping_project.items import RawColumnPageItem
from bs4 import BeautifulSoup


class IntellilinkColumnSpider(scrapy.Spider):
    name = "intellilink_column"
    allowed_domains = ["www.intellilink.co.jp"]
    columnlisting = "https://www.intellilink.co.jp/columnlisting.aspx"
    start_urls = [columnlisting]

    def parse(self, response: Response):
        query_for_a = '//div[@class="c-block"]//a[starts-with(@href, "/column/")]'
        column_page_links = response.xpath(query_for_a)
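        # To crawl every column page, use follow_all instead:
        # yield from response.follow_all(column_page_links, self.make_raw_item)
        # For this check, follow only the first link and stop here,
        # which leaves the pagination code below unreachable.
        yield response.follow(column_page_links[0], self.make_raw_item)
        return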

        query_for_n = '//span[i[contains(@class, "fa-angle-right")]]/@data-navigate-to'
        if n := response.xpath(query_for_n).get():
            yield response.follow(f"{self.columnlisting}?page={n}", self.parse)

    def make_raw_item(self, response: Response):
        whole = response.text
        yield RawColumnPageItem(
            url=response.url,
            soup=BeautifulSoup(whole, "lxml"),
            whole=whole,
        )

Implementing the pipelines

I implemented them as follows.

I went back and forth on how to handle the Beautiful Soup processing, and settled on starting a new section whenever an h3 appears among the child tags of the div with the class c-news-richtext.
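
As a minimal sketch of this splitting rule, the same loop can be tried on a small fragment. The HTML below is made up for illustration and is not taken from the real pages, and `split_sections` is a hypothetical helper name. Note that `elem.find("h3")` searches the descendants of each child, so the sketch assumes every h3 sits inside a wrapper element; a child that is itself an h3 would not match. The sketch uses the built-in `html.parser` instead of lxml only to avoid an extra dependency.

```python
from copy import copy

from bs4 import BeautifulSoup, Tag

# Made-up fragment: each h3 is wrapped in a div (an assumption).
HTML = """
<div class="c-news-richtext">
<p>Overview paragraph.</p>
<div><h3>First heading</h3></div>
<p>Body of the first section.</p>
<div><h3>Second heading</h3></div>
<p>Body of the second section.</p>
</div>
"""


def split_sections(soup: BeautifulSoup) -> list[BeautifulSoup]:
    """Start a new section at every child tag that contains an h3."""
    sections = []
    section = BeautifulSoup("", "html.parser")
    for elem in soup.find("div", class_="c-news-richtext").children:
        if not isinstance(elem, Tag):
            continue  # skip whitespace strings between the tags
        if elem.find("h3"):
            sections.append(section)
            section = BeautifulSoup("", "html.parser")
        # Copy so that the element stays in the original soup.
        section.append(copy(elem))
    sections.append(section)
    return sections


sections = split_sections(BeautifulSoup(HTML, "html.parser"))
texts = [s.get_text(" ", strip=True) for s in sections]
for text in texts:
    print(text)
```

With this input the first section holds only the overview paragraph, and each later section starts with its heading.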

With markdownify, each section is converted to Markdown separately. The first section, which holds the overview, also gets the title and the author names.

pipelines.py

from copy import copy
from scraping_project.items import (
    RawColumnPageItem,
    SoupColumnPageItem,
    MarkdownColumnPageItem,
)
from bs4 import BeautifulSoup, Tag
from markdownify import MarkdownConverter
from pathlib import Path


class SoupPipeline:
    def process_item(self, item: RawColumnPageItem, spider):
        soup: BeautifulSoup = item["soup"]

        sections = []
        section = BeautifulSoup()
        for elem in soup.find("div", class_="c-news-richtext").children:
            if not isinstance(elem, Tag):
                continue

            if elem.find("h3"):
                sections.append(section)
                section = BeautifulSoup()
                section.append(copy(elem))
            else:
                section.append(copy(elem))
        sections.append(section)

        return SoupColumnPageItem(
            url=item["url"],
            title=soup.find("h1").string,
            category=soup.find("div", class_="c-block-content-header-news-category").string,
            sections=sections,
            writers=soup.find_all("div", class_="c-block-news-article-profile"),
            whole=item["whole"],
        )


class MarkdownifyPipeline:
    def md(self, soup, **options):
        return MarkdownConverter(**options).convert_soup(soup)

    def process_item(self, item: SoupColumnPageItem, spider):
        # include title and writers in the first section
        soup = BeautifulSoup(f'<h1>{item["title"]}</h1>', "lxml")
        for writer in item["writers"]:
            soup.append(copy(writer))
        item["sections"][0].insert(0, soup)

        return MarkdownColumnPageItem(
            url=item["url"],
            title=item["title"],
            category=item["category"],
            sections=[self.md(section) for section in item["sections"]],
            whole=item["whole"],
        )


class StorePipeline:
    html_store = Path.cwd().joinpath("html_store")
    html_store.mkdir(exist_ok=True)

    md_store = Path.cwd().joinpath("md_store")
    md_store.mkdir(exist_ok=True)

    url_suffix = ".aspx"
    
    def process_item(self, item: MarkdownColumnPageItem, spider):
        filename = "_".join(item["url"].split("/")[-2:])

        html_path = self.html_store.joinpath(filename).with_suffix(".html")
        html_path.write_text(item["whole"], encoding="utf-8")

        file_id = filename.removesuffix(self.url_suffix)
        for i, section in enumerate(item["sections"]):
            md_filename = f"{file_id}_sec{i}.md"
            md_path = self.md_store.joinpath(md_filename)
            md_path.write_text(section, encoding="utf-8")

        # Scrapy expects process_item to return the item (or raise DropItem).
        return item
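
To make the file naming in `StorePipeline` concrete, here is a small standalone sketch of the same string operations. The URL is a made-up example (the real column URLs may be shaped differently), and `output_names` is a hypothetical helper that is not part of the pipeline.

```python
from pathlib import Path


def output_names(url: str, n_sections: int, url_suffix: str = ".aspx"):
    """Return the HTML file name and the per-section Markdown file names."""
    # Join the last two path segments of the URL, as StorePipeline does.
    filename = "_".join(url.split("/")[-2:])
    # The HTML file swaps the URL suffix for .html.
    html_name = Path(filename).with_suffix(".html").name
    # The Markdown files drop the suffix and add a section index.
    file_id = filename.removesuffix(url_suffix)
    md_names = [f"{file_id}_sec{i}.md" for i in range(n_sections)]
    return html_name, md_names


# Hypothetical URL, used only for illustration.
html_name, md_names = output_names(
    "https://www.example.com/column/sample/page01.aspx", 2
)
print(html_name)  # sample_page01.html
print(md_names)   # ['sample_page01_sec0.md', 'sample_page01_sec1.md']
```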

I configured settings.py so that these run in order.
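
The setting in question is Scrapy's `ITEM_PIPELINES`. The fragment below is a sketch rather than the actual file: it assumes pipelines.py lives in the same `scraping_project` package as items.py, and the priority values are arbitrary choices. Scrapy runs the pipelines from the lowest value to the highest.

```python
# settings.py (fragment)
# Lower numbers run first; 100/200/300 are arbitrary values chosen here.
# The module path "scraping_project.pipelines" is an assumption.
ITEM_PIPELINES = {
    "scraping_project.pipelines.SoupPipeline": 100,
    "scraping_project.pipelines.MarkdownifyPipeline": 200,
    "scraping_project.pipelines.StorePipeline": 300,
}
```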

Troubles

`ModuleNotFoundError: No module named 'bs4'` is raised

This happened in an environment where I had installed the package into a venv with pip, uninstalled it, and then installed it again.

Deleting the venv folder and recreating it fixed the problem.

Appending with Beautiful Soup removes the tag from the original soup

As shown below, a Tag that you append disappears from the soup it came from.

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("<html><body><p>1</p><p>2</p></body></html>", "lxml")
>>> BeautifulSoup().append(soup.find("p"))
<p>1</p>
>>> soup
<html><body><p>2</p></body></html>

To prevent this, copy the tag first.

>>> from copy import copy
>>> soup = BeautifulSoup("<html><body><p>1</p><p>2</p></body></html>", "lxml")
>>> BeautifulSoup().append(copy(soup.find("p")))
<p>1</p>
>>> soup
<html><body><p>1</p><p>2</p></body></html>

This behavior surprised me a little.

Thoughts

Here is what I took away from this.

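Scrapy's items and pipelines are awkward to use
- Item fields carry no types, so editor completion does not work
- Pipelines are a single straight line and cannot form a graph with multiple outputs
- It seems better to skip them and use NamedTuple in place of Item
Beautiful Soup has its quirks
- lxml, an alternative library that supports XPath, also removes an appended element from its source, so there seems to be no choice but to get used to this quirk
Hand-written code has pros and cons
- Human judgment, such as deciding that headers and footers are unnecessary, helps reduce the storage used by the search engine (vector database)
- Chunks can be split exactly where intended, for example at headings and not at periods
- Ensuring quality is hard
  - I nearly overlooked the fact that append removes the tag from the original soup
- When Document Intelligence is good enough, using it would probably keep costs down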
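
The NamedTuple idea from the list above could look roughly like this. It is only a sketch: `MarkdownColumnPage` is a hypothetical name mirroring the fields of `MarkdownColumnPageItem`, all the values are made up, and plugging such a type into the crawler itself is not shown here.

```python
from typing import NamedTuple


class MarkdownColumnPage(NamedTuple):
    """Hypothetical typed stand-in for MarkdownColumnPageItem."""

    url: str
    title: str
    category: str
    sections: list[str]
    whole: str


# All values below are made up for illustration.
page = MarkdownColumnPage(
    url="https://www.example.com/column/sample/page01.aspx",
    title="Sample title",
    category="Sample category",
    sections=["Overview text.", "Section text."],
    whole="<html>...</html>",
)

# Fields are typed attributes, so an editor can complete `page.sections`.
print(len(page.sections))

# Instances are immutable; _replace returns a modified copy instead.
renamed = page._replace(title="Another title")
print(renamed.title, "/", page.title)
```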

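As an alternative to Scrapy there is Crawlee, which supports asyncio and type hints. I decided against it because I did not like its stance of trying to hide that it is a bot, and because I could not find a cache feature or a way to make it obey robots.txt from Python.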
