Downloading Files of Any Format with Python

Posted at 2014-12-24 (last updated at 2014-12-24)

This post is the December 24 entry for the Crawler/Scraping Advent Calendar 2014.

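Introduction

When browsing websites, there are times when you want to download all the files of a particular format (zip, pdf, and so on) in one go.

You could download them by hand, but in cases like this a scripting language such as Python or Ruby makes the job easy to automate.

This time I wrote a download script in Python.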

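Libraries

The standard library alone would have been enough, but this time I used the following libraries: requests and BeautifulSoup.

Installing the libraries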

pip install requests
pip install BeautifulSoup
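
As a rough illustration of the standard-library-only route mentioned above, the sketch below collects href values from anchor tags with html.parser. It is written for Python 3, and the class name LinkCollector, the function extract_links, and the sample HTML are hypothetical, not part of the original script.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html):
    """Return all href values found in the given HTML text."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs


# Made-up sample markup, used only for illustration
sample_html = (
    '<p><a href="/files/a.csv.zip">A</a> '
    '<a href="about.html">About</a> '
    '<a name="top">x</a></p>'
)
links = extract_links(sample_html)
```

Anchors without an href, such as the last one in the sample, are simply skipped.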

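Source code

The script works as follows.

Extract the links from the URL
From those links, pick out the URLs and file names with the target extension
Download the files a little at a time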

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import requests
import time

from BeautifulSoup import BeautifulSoup

BASE_URL = u"http://seanlahman.com/"
EXTENSION = u"csv.zip"

urls = [
    u"http://seanlahman.com/baseball-archive/statistics/",
]

for url in urls:

    download_urls = []
    r = requests.get(url)
    soup = BeautifulSoup(r.content)
    links = soup.findAll('a')

    # Extract the URLs that contain the target extension
    for link in links:

        href = link.get('href')

        if href and EXTENSION in href:
            download_urls.append(href)

    # Download the files (limited to three for now)
    for download_url in download_urls[:3]:

        # Sleep for one second between requests
        time.sleep(1)

        file_name = download_url.split("/")[-1]

        if BASE_URL in download_url:
            r = requests.get(download_url)
        else:
            r = requests.get(BASE_URL + download_url)
        
        # Save the file (binary mode, because r.content is raw bytes)
        if r.status_code == 200:
            with open(file_name, 'wb') as f:
                f.write(r.content)
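
The script above builds the download URL by concatenating BASE_URL and the href, which can go wrong when a link starts with a slash or points to another host. One possible adjustment, sketched below for Python 3, resolves each href against the page URL with urllib.parse.urljoin and matches the extension at the end of the path. The function name select_downloads and the sample hrefs are assumptions for illustration only.

```python
import posixpath
from urllib.parse import urljoin, urlparse


def select_downloads(page_url, hrefs, extension):
    """Return (absolute_url, file_name) pairs for links ending with extension."""
    results = []
    for href in hrefs:
        # urljoin handles both relative and absolute links
        absolute_url = urljoin(page_url, href)
        # Look only at the path, so a query string does not hide the extension
        path = urlparse(absolute_url).path
        if path.endswith(extension):
            results.append((absolute_url, posixpath.basename(path)))
    return results


# Hypothetical hrefs, not taken from the real page
page_url = "http://seanlahman.com/baseball-archive/statistics/"
hrefs = [
    "/files/example.csv.zip",
    "http://seanlahman.com/files/other.csv.zip?dl=1",
    "notes.html",
]
downloads = select_downloads(page_url, hrefs, "csv.zip")
```

Matching with endswith is stricter than the `in` test used in the script, so a link that merely contains the extension somewhere in the middle would no longer be picked up.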

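Conclusion

There is still plenty of room for improvement, such as error handling and adjusting the download URLs, but for now the script can download files of a given format (zip, pdf, and so on).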
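
As one way to approach the error handling mentioned here, the following Python 3 sketch downloads a single URL in fixed-size chunks and reports failure instead of raising. It uses urllib.request from the standard library rather than requests so that it runs without extra packages. The function name fetch_to_file, the chunk size, and the timeout are assumptions.

```python
import shutil
import urllib.request


def fetch_to_file(url, file_name, chunk_size=64 * 1024, timeout=30):
    """Download url to file_name in chunks; return True on success, False on failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            # Write in binary mode so zip or pdf data is not altered
            with open(file_name, "wb") as f:
                # Copy chunk_size bytes at a time instead of loading everything
                shutil.copyfileobj(response, f, chunk_size)
    except OSError as exc:
        # URLError, HTTPError (404 and so on) and timeouts are OSError subclasses
        print("failed to download %s: %s" % (url, exc))
        return False
    return True
```

A call such as fetch_to_file(download_url, file_name) could replace the requests.get and save steps in the loop. Note that a failure partway through a transfer may leave a partial file behind, which this sketch does not clean up.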

With a language like Python, scraping is very easy, so it is probably a good idea to build up your own collection of scripts, refining them to suit each site.

Reference links

About the Sean Lahman Database

http://www.slideshare.net/shinyorke/python-39061157
