

Web Scraping with BeautifulSoup4 (Sequential Pages)

Posted at 2016-07-05


A quick note on some code I wrote to build a URL list from the common kind of page whose URLs are sequentially numbered, so that the files can be downloaded in bulk later.
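For reference, zero-padded sequential URLs of the kind this script targets can be generated with Python's format-spec mini-language (`hoge.com` is the same placeholder domain used in the code below):

```python
# Build zero-padded sequential URLs (width 2: 01, 02, ..., 10, ...)
domain = 'http://hoge.com'
urls = ['{domain}/{index:0>2}/'.format(domain=domain, index=i)
        for i in range(1, 4)]
print(urls)
# ['http://hoge.com/01/', 'http://hoge.com/02/', 'http://hoge.com/03/']
```

The `0>2` spec means "pad with `0`, right-align, minimum width 2", so indices of 10 and above pass through unchanged.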

Installation

$ apt-get install python-lxml
$ pip install beautifulsoup4

Source

scraper.py
# -*- coding: utf-8 -*-

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

try:
    # Python 3
    from urllib import request
except ImportError:
    # Python 2
    import urllib2 as request

from bs4 import BeautifulSoup
import codecs
import time

def getSoup(url):
    response = request.urlopen(url)
    body = response.read()
    # Parse HTML
    return BeautifulSoup(body, 'lxml')

wait_sec = 3
domain = 'http://hoge.com'
result_file = 'list.txt'
i = 1
while True:
    # Zero-pad the page index to two digits: /01/, /02/, ...
    url = '{domain}/{index:0>2}/'.format(domain=domain, index=i)
    try:
        soup = getSoup(url)
    except IOError:
        # urlopen raises an IOError subclass once the page no longer exists
        break

    div = soup.find('div', attrs={'id': 'div_id'})
    all_a = div.find_all('a', attrs={'class': 'a_class'})
    src_list = [a.img['src'] for a in all_a]
    with codecs.open(result_file, 'a', 'utf-8') as f:
        # End with a newline so entries from the next page
        # don't run into the last line of this one
        f.write('\n'.join(src_list) + '\n')
    print(i)
    i += 1

    time.sleep(wait_sec)
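The `find`/`find_all` extraction in the loop can be sanity-checked offline against a small inline HTML snippet. The `div_id` and `a_class` names are the placeholders from the script above, and the stdlib `html.parser` is used here so the check runs even without lxml:

```python
from bs4 import BeautifulSoup

# Minimal page mimicking the structure the script expects
html = '''
<div id="div_id">
  <a class="a_class"><img src="/img/001.jpg"></a>
  <a class="a_class"><img src="/img/002.jpg"></a>
  <a class="other"><img src="/img/999.jpg"></a>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
div = soup.find('div', attrs={'id': 'div_id'})
# Only anchors with class "a_class" are collected
src_list = [a.img['src'] for a in div.find_all('a', attrs={'class': 'a_class'})]
print(src_list)  # ['/img/001.jpg', '/img/002.jpg']
```

The anchor with class `other` is skipped, which is why filtering on `attrs` matters when a page mixes link types.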

Reference pages

Python: Scraping websites with BeautifulSoup4

Scraping with Python and Beautiful Soup
