Collecting cat pictures in seconds and aiming for the Neko Hills tribe

Python

Posted at 2013-12-10 (last updated at 2013-12-10)

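####Overview
Thanks to the script I wrote the other day,
"Automatically picking up cat pictures with feedparser",
my cat picture collecting has been going smoothly lately...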

But somehow, it runs slowly.
Not enough speed... this script does not have enough speed.
So I'll try modifying it to see whether it can be made faster.

####First, analyzing the current state
To start with, I measured how slow the current script actually is.
Below is the source with the timing logic built in.

get_cat.py

# -*- coding: utf-8 -*-
import feedparser
import urllib
import os
import time

def download_picture(q, count=10):
    u"""Fetch count pictures matching q."""
    count = str(count)
    feed = feedparser.parse("https://picasaweb.google.com/data/feed/base/all?q=" + q + "&max-results=" + count)
    for entry in feed['entries']:
        url = entry.content[0].src
        # Save into a folder named after the query, next to this script.
        save_dir = os.path.join(os.path.dirname(__file__), q)
        if not os.path.exists(save_dir):
            os.mkdir(save_dir)
        # Python 2 API: download url straight into the target file.
        urllib.urlretrieve(url, os.path.join(save_dir, os.path.basename(url)))
        print('download:' + url)

if __name__ == "__main__":
    time1=time.time()
    download_picture("cat", 10)
    time1_2=str(time.time()-time1)
    print("complete!("+time1_2+")")

Result

complete!(6.05635690689)

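Downloading 10 pictures took 6 seconds.
I tried a few more times, and it was still about 6 seconds each time.
At this rate I can only download about 144,000 pictures in 24 hours (86,400 s / 6 s × 10 pictures).
That is far from ideal.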
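
Since the estimate rests on a handful of manual runs, a small helper can repeat the measurement and extrapolate one batch to a full day. This is a minimal sketch and not part of the original script: `time_runs` and `pictures_per_day` are hypothetical names, and the stand-in workload is an assumption that takes the place of `download_picture`.

```python
import time


def time_runs(func, repeats=3):
    """Call func() `repeats` times and return the elapsed seconds of each call."""
    elapsed = []
    for _ in range(repeats):
        start = time.time()
        func()
        elapsed.append(time.time() - start)
    return elapsed


def pictures_per_day(batch_size, batch_seconds):
    """Extrapolate one timed batch to a 24-hour picture count."""
    seconds_per_day = 24 * 60 * 60
    return batch_size * seconds_per_day / float(batch_seconds)


if __name__ == "__main__":
    # Stand-in workload (assumption); swap in
    # lambda: download_picture("cat", 10) to time the real script.
    runs = time_runs(lambda: time.sleep(0.01), repeats=3)
    print("runs: " + ", ".join("%.3f" % r for r in runs))
    print("pictures/day at 6 s per 10 pictures: %d" % pictures_per_day(10, 6))
```

Timing several runs in one go makes it easier to tell a stable figure from a one-off fluctuation.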

####httplib2
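I heard through the grapevine about a library called httplib2.
Or rather, it seems to be more of a standard choice than the standard library itself.
Features of httplib2 (↓).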

Caching is available.
gzip compression is available.
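
To see why a cache matters for this kind of workload, here is a minimal sketch of a disk cache placed in front of a download function. It uses only the standard library and is not how httplib2 is implemented (httplib2 also honours HTTP cache headers, which this sketch ignores). `cached_fetch`, the `fetch` callable and the cache directory are hypothetical names for illustration.

```python
import hashlib
import os
import tempfile


def cached_fetch(url, fetch, cache_dir):
    """Return the body for url, calling fetch(url) only on a cache miss."""
    if not os.path.isdir(cache_dir):
        os.makedirs(cache_dir)
    # Use a hash of the URL as a filesystem-safe cache key.
    key = hashlib.sha1(url.encode("utf-8")).hexdigest()
    path = os.path.join(cache_dir, key)
    if os.path.exists(path):
        with open(path, "rb") as f:
            return f.read()      # cache hit: no download needed
    body = fetch(url)            # cache miss: do the slow download
    with open(path, "wb") as f:
        f.write(body)
    return body


if __name__ == "__main__":
    calls = []

    def fake_fetch(url):
        # Stand-in for a real HTTP download (assumption for the demo).
        calls.append(url)
        return b"fake image bytes"

    cache_dir = tempfile.mkdtemp()
    cached_fetch("http://example.com/cat.jpg", fake_fetch, cache_dir)
    cached_fetch("http://example.com/cat.jpg", fake_fetch, cache_dir)
    print("downloads performed: %d" % len(calls))
```

The second call for the same URL is answered from disk, which is the effect to look for when the same pictures are requested again.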

Wonderful.
Let's try it right away.

$ sudo pip install httplib2

Installed in a snap.
Then I modified the program.

get_cat2.py

# -*- coding: utf-8 -*-
import feedparser
import httplib2
import os
import time


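def download_picture(q, count=10):
    u"""Fetch count pictures matching q."""
    count = str(count)
    feed = feedparser.parse("https://picasaweb.google.com/data/feed/base/all?q=" + q + "&max-results=" + count)
    # Responses are cached on disk in the ".cache" directory.
    http = httplib2.Http(".cache")
    # The target folder is not checked here; it must already exist.
    save_dir = os.path.join(os.path.dirname(__file__), q)
    for entry in feed['entries']:
        url = entry.content[0].src
        # http.request() returns (response, content); content is the body.
        content = http.request(url)[1]
        with open(os.path.join(save_dir, os.path.basename(url)), 'wb') as f:
            f.write(content)
        print('download:' + url)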

if __name__ == "__main__":
    time1=time.time()
    download_picture("cat", 10)
    time1_2=str(time.time()-time1)
    print("complete!("+time1_2+")")

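What changed.

The import statements changed.
urllib.urlretrieve() became http.request().
It no longer checks whether the folder exists on every download (the folder created by the earlier run is reused).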
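
Because the revised script no longer creates the folder, it relies on the folder already being there. Below is a minimal sketch of a helper that would keep it safe on a fresh machine. `ensure_dir` is a hypothetical name and not something from the original script.

```python
import errno
import os
import tempfile


def ensure_dir(path):
    """Create path if it is missing; do nothing if it already exists."""
    try:
        os.makedirs(path)
    except OSError as e:
        # Ignore "already exists"; re-raise anything else (e.g. permissions).
        if e.errno != errno.EEXIST or not os.path.isdir(path):
            raise
    return path


if __name__ == "__main__":
    base = tempfile.mkdtemp()
    target = os.path.join(base, "cat")
    ensure_dir(target)   # creates the folder
    ensure_dir(target)   # second call is a no-op
    print(os.path.isdir(target))
```

Calling such a helper once before the download loop keeps the check out of the per-picture path while still handling a first run.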

Time to run it!

Execution results

First, the original program. Well, about what you'd expect.
complete!(5.79861092567)

The improved version. Hm?
complete!(5.06348490715)
The improved version, second run. Ooh...
complete!(1.20627403259)
The improved version, third run. Whoa!
complete!(0.768098115921)
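
To put the runs side by side, the reported times can be turned into speed-up factors relative to the original script. This is a small sketch using the numbers above. The function and variable names are only for illustration.

```python
def speedup(baseline_seconds, new_seconds):
    """How many times faster new_seconds is than baseline_seconds."""
    return baseline_seconds / new_seconds


if __name__ == "__main__":
    baseline = 5.79861092567  # original script
    runs = [
        ("httplib2, 1st run", 5.06348490715),
        ("httplib2, 2nd run", 1.20627403259),
        ("httplib2, 3rd run", 0.768098115921),
    ]
    for label, seconds in runs:
        print("%s: %.1fx" % (label, speedup(baseline, seconds)))
```

This works out to roughly 1.1x, 4.8x and 7.5x for the three runs.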

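So fast!
The first run barely changed, but from the second run on, presumably once the cache is warm, it is fast enough to be practical. Calling it "in seconds" would not be a lie.

Conclusion: using httplib2 makes cat picture collecting go faster.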
