
Extracting links from a web page with Scrapy

Posted at 2018-07-11 (last updated at 2018-07-11)

I tried doing with Scrapy the same thing I did earlier with Beautifulsoup.
See: Extracting links from a web page with Beautifulsoup

Program

scrapy01.py

# -*- coding: utf-8 -*-
#
#	scrapy01.py
#
#
#					Jul/11/2018
#
import scrapy

class FirstScrapySpider(scrapy.Spider):
    name = 'scrapy01'
    allowed_domains = ['ekzemplaro.org']
    start_urls = ['https://ekzemplaro.org']

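    def parse(self, response):
        # Select the href attribute of every <a> element on the page
        for unit in response.css('a::attr(href)').extract():
            # Print each link as written in the HTML (relative or absolute)
            print(unit)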
#

Execution result

$ scrapy runspider --loglevel=WARN scrapy01.py
en/
ekzemplaro/
audio_books/
librivox/
./audio/
http://www.hi-ho.ne.jp/linux
./raspberry/
./storytelling/
./crowdsourcing/
https://twitter.com/ekzemplaro
https://github.com/ekzemplaro/
qiita/
./test_dir/
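
The output above mixes relative links (`en/`, `./audio/`) with absolute ones. If absolute URLs are needed, one option is to resolve each link against the page URL. The sketch below uses only the standard library; `BASE_URL` (written here with a trailing slash) and `to_absolute` are hypothetical names for this illustration. Inside a spider, `response.urljoin(href)` should serve the same purpose.

```python
from urllib.parse import urljoin

# Assumed page URL the links were extracted from
BASE_URL = 'https://ekzemplaro.org/'


def to_absolute(base, hrefs):
    """Resolve each href against base; absolute hrefs are returned unchanged."""
    return [urljoin(base, href) for href in hrefs]


if __name__ == '__main__':
    # A few entries taken from the execution result
    sample = ['en/', './audio/', 'https://twitter.com/ekzemplaro']
    for url in to_absolute(BASE_URL, sample):
        print(url)
```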

Installing Scrapy on Arch Linux

sudo pacman -S scrapy
