Scraping with Python 3.5's async/await

Posted at 2015-09-24

There was this article a while back:
http://postd.cc/fast-scraping-in-python-with-asyncio/
Let's try the same thing with the new syntax in Python 3.5.

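Oh, the original article scrapes HTML, but I wanted to hit a bunch of URLs at once that return responses of roughly similar size, so I went with RSS instead. Ah, that's not scraping at all. Well, what it does is the same, so...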

import asyncio
import aiohttp
import feedparser
import time

async def print_first_title(url):
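    # aiohttp.request() could be awaited directly in the aiohttp of 2015;
    # in current releases aiohttp.ClientSession is the recommended client API.
    response = await aiohttp.request('GET', url)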
    body = await response.text()
    d = feedparser.parse(body)
    print(d.entries[0].title)

rss = []  # some list of RSS URLs; I used about 10 Yahoo News feeds

if __name__ == '__main__':
    start = time.time()
    loop = asyncio.get_event_loop()
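    # Passing bare coroutines to asyncio.wait() worked in 3.5;
    # Python 3.11+ needs tasks here, or asyncio.gather() instead.
    loop.run_until_complete(asyncio.wait([print_first_title(url) for url in rss]))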
    end = time.time()
    print("{0} ms".format((end - start) * 1000))

390.4871940612793 ms

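Well, it is easier to read, I guess? But on its own, it just looks like the thing we used to do with a decorator and yield from got a nicer-looking syntax. So far I don't see anything more to it than that myself. Is this a big deal?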
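For comparison, here is a minimal sketch of the two styles side by side. It is not from the original article: the function names `double_native` and `double_generator` are made up for illustration, and it uses `types.coroutine` as a stand-in for the `@asyncio.coroutine` decorator, because `asyncio.coroutine` was removed in Python 3.11. It also relies on `asyncio.run`, which needs Python 3.7 or later.

```python
import asyncio
import types


async def double_native(x):
    # Python 3.5+ style: `async def` plus `await`.
    await asyncio.sleep(0)
    return x * 2


@types.coroutine
def double_generator(x):
    # Older shape: a decorated generator that delegates with `yield from`.
    # (Stand-in for @asyncio.coroutine, which no longer exists in 3.11+.)
    yield from asyncio.sleep(0)
    return x * 2


async def main():
    # Both kinds of coroutine can be awaited the same way.
    return await double_native(21), await double_generator(21)


if __name__ == '__main__':
    print(asyncio.run(main()))
```

Both functions do the same thing, so at least in this sketch the difference really is mostly the spelling.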

More than that: like the original article, I'm using a library called aiohttp to make the network part asynchronous, and it is super handy! I had no idea it existed!!!

Just to be sure, a speed comparison against a version that doesn't use coroutines

import urllib.request
import feedparser
import time

def print_first_title(url):
    response = urllib.request.urlopen(url)
    body = response.read()
    d = feedparser.parse(body)
    print(d.entries[0].title)

rss = []

if __name__ == '__main__':
    start = time.time()
    for url in rss:
        print_first_title(url)
    end = time.time()
    print("{0} ms".format((end - start) * 1000))

1424.4353771209717 ms

Sloooow! The end!
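For anyone who wants to see the gap without hitting real feeds, here is a rough stdlib-only sketch. Everything in it is an assumption for illustration: `fake_fetch` and the latencies in `FAKE_LATENCIES` are made up, and `asyncio.sleep` stands in for waiting on the network. It uses `asyncio.gather` and `asyncio.run` (Python 3.7+) rather than the `asyncio.wait` call used above.

```python
import asyncio
import time

# Hypothetical per-request latencies in seconds, standing in for real RSS fetches.
FAKE_LATENCIES = [0.05, 0.10, 0.15]


async def fake_fetch(delay):
    # asyncio.sleep hands control back to the event loop, much like
    # waiting for a response on a socket would.
    await asyncio.sleep(delay)
    return delay


async def run_concurrently(delays):
    # Start every fake request at once and wait for all of them.
    start = time.perf_counter()
    await asyncio.gather(*(fake_fetch(d) for d in delays))
    return time.perf_counter() - start


async def run_sequentially(delays):
    # Wait for each fake request to finish before starting the next.
    start = time.perf_counter()
    for d in delays:
        await fake_fetch(d)
    return time.perf_counter() - start


async def compare(delays):
    return await run_concurrently(delays), await run_sequentially(delays)


if __name__ == '__main__':
    concurrent, sequential = asyncio.run(compare(FAKE_LATENCIES))
    print("concurrent: {0:.0f} ms".format(concurrent * 1000))
    print("sequential: {0:.0f} ms".format(sequential * 1000))
```

The concurrent run should take about as long as the slowest fake request, while the sequential run takes roughly the sum of all of them, which is the same shape as the two timings measured above.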