Trying out asyncio with Scrapy

Scrapy

Posted at 2025-02-20. Last updated at 2025-02-20.

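The official Scrapy documentation has a page on asyncio that advertises partial support, but it gives no concrete examples and I could not picture how it would look in practice, so I tried it hands-on.

This is roughly an asyncio version of the program from the article below.

Program

To keep things simple, and as a matter of personal taste, I do not use Item or Pipeline.

With asyncio, each page is saved as both Markdown and HTML, with the two writes scheduled as concurrent tasks.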

intellilink_column.py

import asyncio
import scrapy
from scrapy.http import Response
from markdownify import markdownify as md
from pathlib import Path

class IntellilinkColumnSpider(scrapy.Spider):
    name = "intellilink_column"
    allowed_domains = ["www.intellilink.co.jp"]
    columnlisting = "https://www.intellilink.co.jp/columnlisting.aspx"
    start_urls = [columnlisting]

    html_store = Path.cwd().joinpath("html_store")
    html_store.mkdir(exist_ok=True)

    md_store = Path.cwd().joinpath("md_store")
    md_store.mkdir(exist_ok=True)

    async def parse(self, response: Response):
        query_for_a = '//div[@class="c-block"]//a[starts-with(@href, "/column/")]'
        column_page_links = response.xpath(query_for_a)
        for request in response.follow_all(column_page_links, self.store_column):
            yield request

        query_for_n = '//span[i[contains(@class, "fa-angle-right")]]/@data-navigate-to'
        if n := response.xpath(query_for_n).get():
            yield response.follow(f"{self.columnlisting}?page={n}", self.parse)

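    async def store_text(self, path: Path, text: str):
        # Note: write_text is a blocking call, so this coroutine does not
        # yield to the event loop while the file is being written.
        path.write_text(text, encoding="utf-8")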

    async def store_md(self, path: Path, html: str):
        await self.store_text(path, md(html))

    async def store_column(self, response: Response):
        filename = "_".join(response.url.split("/")[-2:])

        md_path = self.md_store.joinpath(filename).with_suffix(".md")
        md_task = asyncio.create_task(self.store_md(md_path, response.text))

        html_path = self.html_store.joinpath(filename).with_suffix(".html")
        html_task = asyncio.create_task(self.store_text(html_path, response.text))

        await asyncio.gather(md_task, html_task)
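The spider above calls `asyncio.create_task`, which needs a running asyncio event loop. As far as I understand the Scrapy asyncio documentation, that means Scrapy has to run on Twisted's asyncio reactor, which is selected with the `TWISTED_REACTOR` setting. The following is a minimal sketch, assuming a standard project layout with a `settings.py`; projects generated by recent versions of `scrapy startproject` may already contain this line.

```python
# settings.py (assumed location; adjust to your project layout)
# Ask Scrapy to install Twisted's asyncio-based reactor so that
# asyncio APIs such as asyncio.create_task work inside callbacks.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```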

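Caveats

My impression was that it basically works just by turning the methods into async def, but the following points needed attention.

`yield from` cannot be used

You need to rewrite it as a for loop with yield.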
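The rewrite can be illustrated without Scrapy. The sketch below uses plain asyncio; `make_requests` is a hypothetical stand-in for `response.follow_all(...)`, and `collect` is a helper added only to drive the async generator. It also compiles a tiny snippet to show that Python itself rejects `yield from` inside an async def.

```python
import asyncio

def make_requests(urls):
    # Hypothetical stand-in for response.follow_all(...): yields request-like values.
    return (f"GET {url}" for url in urls)

async def parse(urls):
    # Writing `yield from make_requests(urls)` here would not compile,
    # so each item is yielded explicitly from a for loop instead.
    for request in make_requests(urls):
        yield request

async def collect(urls):
    # Helper for this demo only: drain the async generator into a list.
    return [request async for request in parse(urls)]

# Confirm that `yield from` inside an async function is a SyntaxError.
try:
    compile("async def f():\n    yield from []\n", "<demo>", "exec")
    rejected = False
except SyntaxError:
    rejected = True

results = asyncio.run(collect(["/column/a", "/column/b"]))
print(rejected, results)
```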

If you forget `gather`, Scrapy exits without processing the tasks

Make sure to await the tasks with `asyncio.gather` before the callback returns.
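A similar effect can be reproduced with plain asyncio, outside Scrapy. In this sketch, `store`, `forget_gather`, and `with_gather` are hypothetical names, and `asyncio.sleep` stands in for slow I/O. It shows only the general asyncio behavior (tasks that nobody awaits are cancelled when the loop shuts down), not Scrapy's exact shutdown sequence.

```python
import asyncio

async def store(results, name):
    await asyncio.sleep(0.01)  # stand-in for slow I/O
    results.append(name)

async def forget_gather(results):
    # The tasks are created but never awaited, so this coroutine returns
    # immediately and the loop shuts down before they finish.
    asyncio.create_task(store(results, "md"))
    asyncio.create_task(store(results, "html"))

async def with_gather(results):
    md_task = asyncio.create_task(store(results, "md"))
    html_task = asyncio.create_task(store(results, "html"))
    # Waiting here keeps the coroutine alive until both tasks complete.
    await asyncio.gather(md_task, html_task)

forgotten, awaited = [], []
asyncio.run(forget_gather(forgotten))
asyncio.run(with_gather(awaited))
print(forgotten, sorted(awaited))
```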

Passing a regular function and its arguments to create_task raises a confusing error

It is hard to tell at a glance what went wrong, so take care to pass something defined with async def.
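As a rough illustration of the mistake, the sketch below passes the result of a regular function to `asyncio.create_task` and records the exception type, then does the same with an async def version. `store_sync` and `store_async` are hypothetical names; the exact error message differs between Python versions, so only the exception type is captured.

```python
import asyncio

def store_sync(text):
    # Regular function: calling it returns a plain value, not a coroutine.
    return len(text)

async def store_async(text):
    # async def: calling it returns a coroutine that create_task can schedule.
    return len(text)

async def main():
    outcome = {}
    try:
        asyncio.create_task(store_sync("abc"))  # receives an int, not a coroutine
    except TypeError as exc:
        outcome["sync"] = type(exc).__name__
    outcome["async"] = await asyncio.create_task(store_async("abc"))
    return outcome

outcome = asyncio.run(main())
print(outcome)
```

If the work really is a blocking regular function, one option on Python 3.9 or later is `asyncio.to_thread(func, *args)`, which returns a coroutine that runs the function in a worker thread and can therefore be passed to `create_task`.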
