
Speeding up crawling with multi-core parallel processing in Python

Posted at 2019-05-02

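I needed to collect information from several sites and had been writing a simple crawler, but it became unbearably slow as the number of sites grew. So I tried parallelizing it with multiprocessing, and it seemed to get faster roughly in proportion to the number of cores, so I'm publishing the code.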

from multiprocessing import Pool
import time

import feedparser

feed_urls = [
    'https://www.theverge.com/rss/index.xml',
    'https://gizmodo.com/rss',
    'https://www.cnet.com/rss/all/',
    'https://techcrunch.com/feed/',
    'https://news.ycombinator.com/rss',
    'http://feeds.arstechnica.com/arstechnica/index/',
    'http://feeds.mashable.com/Mashable',
    'https://hub.packtpub.com/feed/'
]

def entry_text(entry):
    # Gather every searchable field. entry.content is a list of dicts, not a
    # string, so each item's 'value' is read; missing fields count as empty.
    parts = [entry.get('title', ''), entry.get('description', '')]
    parts += [c.get('value', '') for c in entry.get('content', [])]
    return ' '.join(parts).lower()

def function(url, keyword):
    # Fetch one feed, print the matching entries, and return how many matched.
    # Output from different worker processes may interleave.
    count = 0
    feed_result = feedparser.parse(url)
    for entry in feed_result.entries:
        if keyword in entry_text(entry):
            print(entry.get('title', ''))
            print(entry.get('link', ''))
            print()
            count += 1
    return count

def multi(keyword):
    # Pass the keyword to each worker explicitly: a global set in main() does
    # not reach the workers when processes are spawned rather than forked.
    with Pool(4) as p:  # maximum number of processes
        return p.starmap(function, [(url, keyword) for url in feed_urls])

def main():
    keyword = input('input keyword:').rstrip().lower()

    start = time.time()
    hit_count = sum(multi(keyword))
    print(hit_count)
    print(time.time() - start)

# The guard keeps worker processes from re-running main() on import.
if __name__ == '__main__':
    main()
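
To check the speedup rather than go by feel, a rough timing sketch like the one below may help. It is not part of the crawler above. It assumes each feed download can be approximated by a fixed `time.sleep` delay, and the names `run_serial` and `run_parallel` and the 0.2-second delay are made up for illustration. Because the stand-in does no CPU work, the speedup measured here tracks the pool size rather than the number of cores, so real feeds may behave differently.

```python
import time
from multiprocessing import Pool


def run_serial(delays):
    """Run the simulated fetches one after another; return elapsed seconds."""
    start = time.perf_counter()
    for d in delays:
        time.sleep(d)  # stand-in for one feed download (assumption)
    return time.perf_counter() - start


def run_parallel(delays, workers=4):
    """Run the same simulated fetches in a process pool; return elapsed seconds.

    The timer starts before the pool is created, so process startup cost
    is included in the measurement.
    """
    start = time.perf_counter()
    with Pool(workers) as pool:
        pool.map(time.sleep, delays)
    return time.perf_counter() - start


if __name__ == '__main__':
    delays = [0.2] * 8  # eight hypothetical feeds, 0.2 s each
    serial = run_serial(delays)
    parallel = run_parallel(delays)
    print(f'serial:   {serial:.2f} s')
    print(f'parallel: {parallel:.2f} s')
    print(f'speedup:  {serial / parallel:.1f}x')
```

Swapping the sleep for the real per-feed function and changing `workers` would show how far the gain holds for actual downloads.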

I used this as a reference.
