Collecting Web Site Documents with Scrapy

Last updated at 2025-02-20
Posted at 2025-02-20

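With the goal of feeding Web pages into a RAG system, I used Scrapy to collect the columns published on my company's Web site.

The environment is Windows. I use Git Bash as the console, although the VS Code terminal is PowerShell 7.

Creating the environment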

$ mkdir scrapy-study
$ cd scrapy-study
$ python -m venv .venv
$ . .venv/Scripts/activate
$ pip install scrapy

Creating the project

The project directory already exists, so I pass the current directory (.) to create the project in place.

$ scrapy startproject scraping_proj .

The result looks like this.

scrapy-study
├───scrapy.cfg         # deploy configuration
└───scraping_proj      # the project's Python module
    ├───__init__.py
    ├───items.py       # item definitions
    ├───middlewares.py # middleware definitions
    ├───pipelines.py   # item pipelines
    ├───settings.py    # project settings
    └───spiders        # directory that holds the Spiders
        └───__init__.py

Generating a Spider template

Next, I create a Spider, the component that crawls the Web.

The following command generates a template for a Spider named intellilink_column. The target domain is restricted to www.intellilink.co.jp.

$ scrapy genspider intellilink_column www.intellilink.co.jp

The following file is created in the spiders directory.

intellilink_column.py

import scrapy


class IntellilinkColumnSpider(scrapy.Spider):
    name = "intellilink_column"
    allowed_domains = ["www.intellilink.co.jp"]
    start_urls = ["https://www.intellilink.co.jp"]

    def parse(self, response):
        pass

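Settings

Following the article I used as a reference, I set the following in settings.py.

settings.py

DOWNLOAD_DELAY = 1                  # wait 1 second between consecutive requests
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # never send more than one request at a time to a single domain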

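I also enable the HTTP cache.

settings.py

HTTPCACHE_ENABLED = True            # cache responses on disk
HTTPCACHE_EXPIRATION_SECS = 0       # 0 means cached responses never expire
HTTPCACHE_DIR = "httpcache"         # a relative path is placed under the project's .scrapy directory
HTTPCACHE_IGNORE_HTTP_CODES = []    # do not exclude any status code from caching
HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"  # store the cache as files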

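Extracting the column URLs

First, I extract the column URLs from the first page of the column listing.

Trying it in the browser

Open the page in a browser, select the a tag of the first item in the column listing with the developer tools, then choose right-click > Copy > Copy XPath.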

/html/body/div[1]/div/main/div[5]/dl[1]/dd/dl/dd/a

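This is the result, but a positional path like this does not convey what is being selected.

Inspecting the relationships between elements in the developer tools shows that <div class="c-block"> is the listing area, so extracting the a tags inside it should be enough. Let's check this in the developer tools.

A div tag that is a descendant of the root and whose class is c-block is written in XPath as //div[@class="c-block"]. Since I want the a tags among its descendants, the expression becomes: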

//div[@class="c-block"]//a

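Let's try it. Pressing Ctrl+F in the Elements tab and searching for this expression gives 26 hits, even though the listing has only 25 items. The 26th hit turns out to be a link labeled "25 件" ("25 items").

To fix this, I narrow the match to links whose href starts with /column/.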

//div[@class="c-block"]//a[starts-with(@href, "/column/")]

Now exactly 25 items match.
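The narrowing step above can be reproduced offline. The following is only a rough sketch: it uses the standard library's xml.etree.ElementTree as a stand-in for the browser and Scrapy, and because ElementTree's XPath subset has no starts-with(), the prefix check is done in Python instead. The markup in SAMPLE_HTML and the helper name column_hrefs are hypothetical and much simpler than the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: two column links plus one page-size link.
SAMPLE_HTML = """
<html><body>
  <div class="c-block">
    <dl><dd><a href="/column/vulner/2025/012100.aspx">Column A</a></dd></dl>
    <dl><dd><a href="/column/dx/2024/032200.aspx">Column B</a></dd></dl>
    <p><a href="/columnlisting.aspx">25 items</a></p>
  </div>
</body></html>
"""


def column_hrefs(markup: str) -> list[str]:
    """Return the hrefs under div.c-block that start with /column/."""
    root = ET.fromstring(markup)
    # Same shape as //div[@class="c-block"]//a in the article.
    anchors = root.findall(".//div[@class='c-block']//a")
    # ElementTree has no starts-with(), so the prefix is checked in Python.
    hrefs = [anchor.get("href", "") for anchor in anchors]
    return [href for href in hrefs if href.startswith("/column/")]


if __name__ == "__main__":
    for href in column_hrefs(SAMPLE_HTML):
        print(href)
```

Note that the trailing slash in "/column/" matters: without it, "/columnlisting.aspx" would also pass the prefix check.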

Implementing it

intellilink_column.py

import scrapy
from scrapy.http import Response


class IntellilinkColumnSpider(scrapy.Spider):
    name = "intellilink_column"
    allowed_domains = ["www.intellilink.co.jp"]
    start_urls = ["https://www.intellilink.co.jp/columnlisting.aspx"]

    def parse(self, response: Response):
        column_page_links = response.xpath(
            '//div[@class="c-block"]//a[starts-with(@href, "/column/")]'
        )
        for link in column_page_links:
            print(link.attrib["href"])

To tell the VS Code Python extension the type of response, I import the Response type and annotate the argument of the parse method. This makes code completion work better.

start_urls is changed to the column listing, ["https://www.intellilink.co.jp/columnlisting.aspx"].

The xpath method returns a list of Selector objects. To read the href attribute of a Selector, use .attrib["href"].

If everything works as expected, running the Spider prints 25 URLs. Let's try it inside the venv. A Spider is run with scrapy crawl.

$ scrapy crawl intellilink_column

In the middle of a large amount of log output, URLs like the following are printed, as expected.

/column/vulner/2025/012100.aspx
:
/column/dx/2024/032200.aspx

Saving the linked pages

Let's save the 25 linked pages as HTML files.

intellilink_column.py

import scrapy
from scrapy.http import Response
from pathlib import Path

class IntellilinkColumnSpider(scrapy.Spider):
    name = "intellilink_column"
    allowed_domains = ["www.intellilink.co.jp"]
    start_urls = ["https://www.intellilink.co.jp/columnlisting.aspx"]
    html_store = Path.cwd().joinpath("html_store")
    html_store.mkdir(exist_ok=True)

    def parse(self, response: Response):
        column_page_links = response.xpath(
            '//div[@class="c-block"]//a[starts-with(@href, "/column/")]'
        )
        yield from response.follow_all(column_page_links, self.store_column)

    def store_column(self, response: Response):
        filestem = "_".join(response.url.split("/")[-2:])
        filepath = self.html_store.joinpath(filestem).with_suffix(".html")
        filepath.write_bytes(response.body)

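To save files, I import Path from pathlib.

follow_all returns an iterable of Request objects. yield from hands each of those Requests to Scrapy so that they are actually sent.

response.url holds the full URL string, starting with the scheme. The latter part of the URL has the form /column/<tag>/<year>/<number>.aspx, so <year>/<number> should be enough to make the file name unique.

A file with the aspx extension is not treated as HTML, so I change the extension to html. Be careful not to forget the leading . in the suffix.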
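The file name derivation can be checked on its own, without running a crawl. This is a minimal sketch using only the standard library; the helper name column_filepath is introduced here for illustration and is not part of the Spider.

```python
from pathlib import Path


def column_filepath(url: str, store: Path) -> Path:
    """Build <store>/<year>_<number>.html from a column URL."""
    # The last two path segments are <year> and <number>.aspx.
    filestem = "_".join(url.split("/")[-2:])
    # with_suffix replaces the trailing .aspx; the new suffix must start with a dot.
    return store.joinpath(filestem).with_suffix(".html")


if __name__ == "__main__":
    url = "https://www.intellilink.co.jp/column/vulner/2025/012100.aspx"
    print(column_filepath(url, Path("html_store")).name)  # 2025_012100.html
    try:
        Path("2025_012100.aspx").with_suffix("html")  # dot is missing
    except ValueError as error:
        print(f"ValueError: {error}")
```

Forgetting the dot does not silently produce a wrong name: with_suffix raises ValueError, so the mistake shows up on the first saved page.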

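Running the Spider saves the files in a directory named html_store under the current directory.

Opened in a browser, the pages look terrible, but the body text has been retrieved. It should be usable for text search. The header and footer content surrounding each column is also included, which is a concern.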

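Covering the next pages of the listing

Clicking > in the browser shows the next page of the listing. This navigation is handled by JavaScript, so handling it in Scrapy takes a little extra work.

In the developer tools, this element is wrapped in <span data-navigate-to="2">. Likewise, 1 is wrapped in <span data-navigate-to="1">, so data-navigate-to appears to hold the page number.

Actually navigating there changed the URL to https://www.intellilink.co.jp/columnlisting.aspx?page=2.

The XPath that retrieves the data-navigate-to value of > is shown below. It selects the data-navigate-to attribute of a span tag that has a child i tag whose class contains fa-angle-right.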

//span[i[contains(@class, "fa-angle-right")]]/@data-navigate-to
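What this expression selects, including the case where nothing matches, can be sketched with the standard library. ElementTree does not support contains(), so the sketch below walks the tree by hand; it approximates the XPath rather than evaluating it. The pager markup and the helper name next_page_number are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Hypothetical pager markup: the last span wraps the ">" icon.
FIRST_PAGE = (
    '<div>'
    '<span data-navigate-to="1">1</span>'
    '<span data-navigate-to="2">2</span>'
    '<span data-navigate-to="2"><i class="fa fa-angle-right"></i></span>'
    '</div>'
)
# On the last page there is assumed to be no ">" icon at all.
LAST_PAGE = (
    '<div>'
    '<span data-navigate-to="1">1</span>'
    '<span data-navigate-to="2">2</span>'
    '</div>'
)


def next_page_number(pager_markup: str) -> Optional[str]:
    """Return data-navigate-to of the span holding the fa-angle-right icon."""
    root = ET.fromstring(pager_markup)
    for span in root.iter("span"):
        for icon in span.findall("i"):  # direct i children only, like span[i[...]]
            if "fa-angle-right" in icon.get("class", ""):
                return span.get("data-navigate-to")
    return None  # no ">" link, so there is no next page


if __name__ == "__main__":
    print(next_page_number(FIRST_PAGE))
    print(next_page_number(LAST_PAGE))
```

The value comes back as a string, which is all that is needed to build the ?page= query.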

Now let's add this logic to the program. Only the relevant part is shown.

intellilink_column.py

    def parse(self, response: Response):
        ...  # link-following code omitted
        query = '//span[i[contains(@class, "fa-angle-right")]]/@data-navigate-to'
        if n := response.xpath(query).get():
            columnlisting = "https://www.intellilink.co.jp/columnlisting.aspx"
            yield response.follow(f"{columnlisting}?page={n}", self.parse)

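get() returns the value of the first Selector in the result list, which has length 1 here.

One thing to watch out for is that n is None on the last page of the column listing. I assign n inside the if statement and proceed only when n is not None.

When the parse method yields a Request, Scrapy fetches it and calls parse again with the resulting Response as its argument.

It appears to run correctly.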
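The way a yielded Request leads to another call of the callback can be pictured with a toy loop. This is a simplified model written for illustration only, not Scrapy's actual engine: the names FAKE_NEXT_PAGE, parse and crawl are hypothetical, pages are plain integers, and no network access takes place.

```python
from collections import deque
from typing import Callable, Iterator, Tuple

# Hypothetical site: listing page number -> next page number (None on the last page).
FAKE_NEXT_PAGE = {1: 2, 2: 3, 3: None}

# (page number, callback); a toy substitute for scrapy.Request.
ToyRequest = Tuple[int, Callable]


def parse(page: int) -> Iterator[ToyRequest]:
    """Toy callback: yield a request for the next listing page, if there is one."""
    if n := FAKE_NEXT_PAGE[page]:  # None on the last page, so nothing is yielded
        yield (n, parse)


def crawl(start_page: int) -> list:
    """Toy engine loop: run each callback and schedule whatever it yields."""
    visited = []
    seen = {start_page}
    queue = deque([(start_page, parse)])
    while queue:
        page, callback = queue.popleft()
        visited.append(page)
        for next_page, next_callback in callback(page):
            if next_page not in seen:  # skip pages that were already scheduled
                seen.add(next_page)
                queue.append((next_page, next_callback))
    return visited


if __name__ == "__main__":
    print(crawl(1))  # [1, 2, 3]
```

The loop ends by itself because the callback for the last page yields nothing, which mirrors the n is None case above.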

The final program

After factoring out the shared part of the constants and similar cleanup, the program ended up as follows.

intellilink_column.py

import scrapy
from scrapy.http import Response
from pathlib import Path

class IntellilinkColumnSpider(scrapy.Spider):
    name = "intellilink_column"
    allowed_domains = ["www.intellilink.co.jp"]
    columnlisting = "https://www.intellilink.co.jp/columnlisting.aspx"
    start_urls = [columnlisting]
    html_store = Path.cwd().joinpath("html_store")
    html_store.mkdir(exist_ok=True)

    def parse(self, response: Response):
        query_for_a = '//div[@class="c-block"]//a[starts-with(@href, "/column/")]'
        column_page_links = response.xpath(query_for_a)
        yield from response.follow_all(column_page_links, self.store_column)

        query_for_n = '//span[i[contains(@class, "fa-angle-right")]]/@data-navigate-to'
        if n := response.xpath(query_for_n).get():
            yield response.follow(f"{self.columnlisting}?page={n}", self.parse)

    def store_column(self, response: Response):
        filename = "_".join(response.url.split("/")[-2:])
        filepath = self.html_store.joinpath(filename).with_suffix(".html")
        filepath.write_bytes(response.body)

Thanks to Scrapy's cache, I can rerun the Spider after refactoring without putting load on the server.

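Impressions

It was a small implementation, but I learned the following.

- A project generated by scrapy startproject obeys robots.txt out of the box (ROBOTSTXT_OBEY = True in the generated settings.py), which is nice
- The cache can be enabled in the settings, so it is harder to put a heavy load on the crawl target by accident during development
  - There is also a feature that avoids visiting duplicate URLs
- On today's Web, where almost everything is implemented in JavaScript, I would like to crawl with a tool that also executes JavaScript
  - Playwright or Selenium may be a good choice
  - Combining such tools with Scrapy also looks promising
- XPath alone is often enough to express what I want to retrieve
  - I could not specify conditions in this much detail with CSS selectors

I hope this is of some help to someone.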

Addendum

I also made an asyncio version.
