Converting pages to Markdown section by section with Scrapy + lxml + markdownify

Posted at 2025-02-20, last updated at 2025-02-20

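I recently wrote the article linked below.

While writing it, I found several points I wanted to improve, so I rewrote the code.

Using a NamedTuple for each Item and plain functions instead of a Pipeline seemed likely to be easier to read
I wanted to use asyncio, assuming the results will be saved to several cloud services where IO is slow
- For that reason as well, a Pipeline is not an option
I was surprised that an element appended to another Beautiful Soup tree disappears from the original soup
Beautiful Soup does not support XPath

The implementation is at the bottom of this article.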

Impressions

Items

I found NamedTuple easier to write than Scrapy's Item.
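As a rough illustration of why a NamedTuple is pleasant to work with as an item, here is a minimal sketch. `PageItem` and its fields are hypothetical and are not part of the implementation in this article.

```python
from typing import NamedTuple


class PageItem(NamedTuple):  # hypothetical item, for illustration only
    url: str
    title: str


item = PageItem(url="https://example.com/column/1", title="Hello")

# Fields are plain attributes, and instances are immutable.
# _replace returns a modified copy instead of changing the original.
updated = item._replace(title="Hello, world")

# _asdict gives a dict view when one is needed, for example for JSON output.
as_dict = item._asdict()
```

Because the fields are declared with type hints, editors and type checkers can catch a misspelled field name before the crawl runs.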

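Pipelines

Replacing the Pipeline class with plain functions did not make the code any harder to write. A Pipeline cannot be used when the processing should run concurrently with asyncio, so in that respect plain functions are the better choice.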
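The concurrency idea above can be sketched as follows. This is only an illustration under assumptions: `save_to_service` is a hypothetical blocking upload function (simulated with `time.sleep`), and `asyncio.to_thread` is one possible way to keep blocking calls from running one after another. A coroutine only yields control at an `await`, so blocking work has to be awaited through something like this, or through a client library that is itself asynchronous.

```python
import asyncio
import time


def save_to_service(name: str, payload: str) -> str:
    # Hypothetical blocking upload; time.sleep stands in for slow network IO.
    time.sleep(0.2)
    return f"{name}: stored {len(payload)} chars"


async def store_everywhere(payload: str) -> list[str]:
    # asyncio.to_thread runs each blocking call in a worker thread,
    # so the two uploads can overlap instead of running back to back.
    return await asyncio.gather(
        asyncio.to_thread(save_to_service, "service_a", payload),
        asyncio.to_thread(save_to_service, "service_b", payload),
    )


results = asyncio.run(store_everywhere("# heading"))
print(results)
```

`asyncio.gather` returns the results in the order the awaitables were passed, regardless of which one finishes first.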

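I switched from Beautiful Soup to lxml, and the rewrite took more effort than I expected.

XPath is convenient

I found it convenient that XPath alone can express a complex condition such as "a <div class="c-news-richtext"> that has no <div class="c-news-richtext"> among its descendants".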
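The condition described above can be tried on a small document. The HTML below and its `id` attributes are made up for illustration; the query is the same one used in the implementation at the end of this article.

```python
from lxml import etree

# Made-up HTML: one outer div wrapping two inner divs, plus a standalone div.
html = """
<html><body>
  <div class="c-news-richtext" id="outer">
    <div class="c-news-richtext" id="inner-1"><p>first</p></div>
    <div class="c-news-richtext" id="inner-2"><p>second</p></div>
  </div>
  <div class="c-news-richtext" id="standalone"><p>third</p></div>
</body></html>
"""

tree = etree.fromstring(html, etree.HTMLParser())

# Keep only the divs that do not contain another div of the same class.
query = '//div[@class="c-news-richtext" and not(.//div[@class="c-news-richtext"])]'
innermost_ids = [div.get("id") for div in tree.xpath(query)]
print(innermost_ids)
```

The outer wrapper is excluded because it has a matching descendant, so each block of content is picked up once rather than twice.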

Use relative paths with `.find()`

`.xpath()` walks the whole tree, so I use `.find()` when I only want the first match.

Note that the argument must be a relative path starting with `.`, not an absolute path starting with `/`.
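A small sketch of the difference, using made-up HTML. To my understanding lxml rejects an absolute path passed to `.find()` on an element with a `SyntaxError`, which is what the `try` block below demonstrates.

```python
from lxml import etree

html = "<html><body><div><h1>First</h1></div><h1>Second</h1></body></html>"
tree = etree.fromstring(html, etree.HTMLParser())

# A relative path starting with "." returns the first match in document order.
first = tree.find(".//h1")
first_text = first.text

# An absolute path starting with "/" is rejected when called on an element.
try:
    tree.find("//h1")
    absolute_ok = True
except SyntaxError:
    absolute_ok = False

print(first_text, absolute_ok)
```

`.find()` returns `None` when nothing matches, which is why the implementation checks `is None` before using the result.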

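Use `deepcopy` if you do not want the element removed from its original tree

lxml turned out to behave much like the Beautiful Soup behavior that surprised me and led me to try lxml in the first place. Appending an element that already has a parent moves it out of the original tree, so the element has to be copied first if the original should stay intact.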
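The move-versus-copy behavior can be seen in a few lines. The element names here are made up for illustration.

```python
from copy import deepcopy

from lxml import etree

source = etree.fromstring("<root><p>one</p><p>two</p></root>")
target = etree.Element("div")

# Appending an element that already has a parent moves it out of that parent.
target.append(source[0])
len_after_move = len(source)  # "one" has left the source tree

# Appending a deep copy leaves the source tree untouched.
target.append(deepcopy(source[0]))
len_after_copy = len(source)  # "two" is still in the source tree

texts = [p.text for p in target]
print(len_after_move, len_after_copy, texts)
```

This is why the implementation wraps every `append` and `addnext` argument in `deepcopy`: the parsed tree can then be queried again later without missing elements.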

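Spider

The caveats are the same as those described in the article below.

Overall

Adopting asyncio looks worthwhile.

Whether to use Beautiful Soup or lxml can probably be decided by personal preference.

This time I noticed that the category cannot be extracted from some pages. Problems like this will likely multiply as the crawl target grows, so I am unsure how much of it to handle myself. It would be nice if cloud services could build more accurate chunks on their own.

Implementation

The implementation ended up as follows.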

Items

items.py

from typing import NamedTuple, Any

class RawColumnPageItem(NamedTuple):
    url: str
    text: str

class LxmlColumnPageItem(NamedTuple):
    url: str
    title: str
    category: str
    sections: list[Any]  # list of lxml.etree._Element, one per section
    writers: list[Any]  # list of lxml.etree._Element, one per writer profile

class MarkdownColumnPageItem(NamedTuple):
    url: str
    title: str
    category: str
    sections: list[str]

Pipelines

pipelines.py

import logging
from copy import deepcopy
from pathlib import Path
from typing import Optional
from lxml import etree
from markdownify import markdownify as md
from scraping_project.items import (
    LxmlColumnPageItem,
    MarkdownColumnPageItem,
    RawColumnPageItem,
)

logger = logging.getLogger(__name__)

def process_raw(item: RawColumnPageItem) -> Optional[LxmlColumnPageItem]:
    parser = etree.HTMLParser(remove_blank_text=True)

    tree = etree.fromstring(item.text, parser)

    title_elem = tree.find('.//h1')
    if title_elem is None:
        logger.info(f"skipping {item.url} because no title is found")
        return

    category_elem = tree.find('.//div[@class="c-block-content-header-news-category"]')
    if category_elem is None:
        logger.info(f"skipping {item.url} because no category is found")
        return

    sections = []
    section = etree.Element("div")
    for elem in tree.xpath('//div[@class="c-news-richtext" and not(.//div[@class="c-news-richtext"])]'):
        if elem.find('.//h3') is not None:
            sections.append(section)
            section = etree.Element("div")
            section.append(deepcopy(elem))
        else:
            section.append(deepcopy(elem))
    sections.append(section)

    return LxmlColumnPageItem(
        url=item.url,
        title=title_elem.text,
        category=category_elem.text,
        sections=sections,
        writers=tree.xpath('//div[@class="c-block-news-article-profile"]'),
    )


def process_lxml(item: LxmlColumnPageItem) -> MarkdownColumnPageItem:
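    # Prepend the title and writers to the first section.
    # Note: this mutates item.sections[0] in place.
    t = etree.Element("h1")
    t.text = item.title
    item.sections[0].insert(0, t)
    # addnext() inserts directly after the title, so iterate in reverse
    # to keep the writers in their original document order.
    for writer in reversed(item.writers):
        t.addnext(deepcopy(writer))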

    return MarkdownColumnPageItem(
        url=item.url,
        title=item.title,
        category=item.category,
        sections=[md(etree.tostring(section, encoding="unicode")) for section in item.sections],
    )


def process_md(item: MarkdownColumnPageItem, md_store: Path, aspx_filename: str) -> None:
    suffix = ".aspx"
    file_id = aspx_filename.removesuffix(suffix)
    for i, section in enumerate(item.sections):
        md_filename = f"{file_id}_sec{i}.md"
        md_path = md_store.joinpath(md_filename)
        md_path.write_text(section, encoding="utf-8")

Spider

intellilink_column.py

import asyncio
import scrapy
from scrapy.http import Response
from pathlib import Path
from scraping_project.items import RawColumnPageItem
from scraping_project.pipelines import process_lxml, process_md, process_raw

class IntellilinkColumnSpider(scrapy.Spider):
    name = "intellilink_column"
    allowed_domains = ["www.intellilink.co.jp"]
    columnlisting = "https://www.intellilink.co.jp/columnlisting.aspx"
    start_urls = [columnlisting]

    html_store = Path.cwd().joinpath("html_store")
    html_store.mkdir(exist_ok=True)

    md_store = Path.cwd().joinpath("md_store")
    md_store.mkdir(exist_ok=True)

    async def parse(self, response: Response):
        query_for_a = '//div[@class="c-block"]//a[starts-with(@href, "/column/")]'
        column_page_links = response.xpath(query_for_a)
        for request in response.follow_all(column_page_links, self.store_column):
            yield request

        query_for_n = '//span[i[contains(@class, "fa-angle-right")]]/@data-navigate-to'
        if n := response.xpath(query_for_n).get():
            yield response.follow(f"{self.columnlisting}?page={n}", self.parse)

    async def store_html(self, aspx_filename: str, text: str):
        html_path = self.html_store.joinpath(aspx_filename).with_suffix(".html")
        html_path.write_text(text, encoding="utf-8")

    async def store_md(self, aspx_filename: str, url: str, text: str):
        raw = RawColumnPageItem(url, text)
        lxml = process_raw(raw)
        if lxml is None:
            return
        md = process_lxml(lxml)
        process_md(md, self.md_store, aspx_filename)

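    async def store_column(self, response: Response):
        # Join the last two URL path segments, e.g. ".../<dir>/<name>.aspx" -> "<dir>_<name>.aspx".
        aspx_filename = "_".join(response.url.split("/")[-2:])
        # Run both stores as tasks. They only overlap where a coroutine awaits,
        # so blocking calls such as write_text still run one after another.
        md_task = asyncio.create_task(self.store_md(aspx_filename, response.url, response.text))
        html_task = asyncio.create_task(self.store_html(aspx_filename, response.text))
        await asyncio.gather(md_task, html_task)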
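To make the file naming in the code above easier to follow, here is a small sketch that traces one URL through the same string operations. The URL is hypothetical and the real site's path layout may differ.

```python
from pathlib import Path

# Hypothetical column URL, used only to trace the naming logic.
url = "https://www.intellilink.co.jp/column/security/example.aspx"

# Same derivation as store_column: join the last two path segments.
aspx_filename = "_".join(url.split("/")[-2:])

# Same derivation as process_md: one Markdown file per section index.
file_id = aspx_filename.removesuffix(".aspx")
md_filenames = [f"{file_id}_sec{i}.md" for i in range(3)]

# Same derivation as store_html: swap the suffix for .html.
html_name = Path(aspx_filename).with_suffix(".html").name

print(aspx_filename, md_filenames, html_name)
```

Joining the last two segments keeps file names unique as long as no two pages share both the directory and the file name.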
