Building a Crawler in Python

Posted at 2022-07-21

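First, a crawler is a program that automatically traverses pages on the Web and collects their content; search engines use crawlers to gather the pages that appear in their search results. Crawlers are also called spiders.
Selenium is a browser automation tool with Python bindings. It drives browsers such as Chrome, Safari, and Firefox through WebDriver, which makes it handy for this kind of task.
With Python already installed, run the following from the VS Code terminal: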

pip install selenium

This installs Selenium. Check that the installation succeeded with the following command:

pip3 show selenium

Name: selenium
Version: 4.3.0
Summary: 
Home-page: https://www.selenium.dev
Author: 
Author-email: 
License: Apache 2.0...

If you see output like this, Selenium is installed.
To drive Chrome, Selenium needs a driver called chromedriver; without it the .py file will not run. Install chromedriver-binary as follows:
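The same check can also be done from Python. The following is a minimal sketch using the standard library's importlib.metadata (Python 3.8 or later); the helper name installed_version is hypothetical and not part of the original article.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("selenium")
    if version is None:
        print("selenium is not installed")
    else:
        print(f"selenium {version} is installed")
```

Returning None instead of raising keeps the check easy to use in a setup script.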

pip install chromedriver-binary

If it does not work, check your Chrome version and reinstall chromedriver-binary pinned to the matching version:

pip install chromedriver-binary==103.0.5060.53
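To make "matching version" concrete, here is a small sketch that compares two version strings. It assumes that Chrome and chromedriver are compatible when their first three components (major, minor, build) agree, which is how ChromeDriver's versioning is commonly described for releases of this era. The function names and the Chrome version in the example are made up for illustration.

```python
def build_prefix(version):
    """Return the first three components (major, minor, build) of a version string."""
    parts = version.strip().split(".")
    return tuple(int(p) for p in parts[:3])


def versions_match(chrome_version, driver_version):
    """True when Chrome and chromedriver share the same major.minor.build prefix."""
    return build_prefix(chrome_version) == build_prefix(driver_version)


if __name__ == "__main__":
    # The Chrome versions below are illustrative, not taken from the article.
    print(versions_match("103.0.5060.134", "103.0.5060.53"))
    print(versions_match("104.0.5112.79", "103.0.5060.53"))
```

You can read your actual Chrome version from chrome://version and compare it against the version you pin with pip.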

There are several ways to make the driver findable. One is to add the following import to the .py file you want to run:

import chromedriver_binary

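Another is to place chromedriver under /usr/local/bin and pass that path when creating the driver:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service('/usr/local/bin/chromedriver'))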

A third is to add the driver's directory to the PATH environment variable:

export PATH=$PATH:`chromedriver-path`
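Whichever approach you pick, it can help to confirm that chromedriver is actually visible on PATH. The sketch below uses shutil.which from the standard library; the helper name find_executable is hypothetical.

```python
import shutil


def find_executable(name="chromedriver"):
    """Return the full path of an executable found on PATH, or None."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_executable()
    if path is None:
        print("chromedriver was not found on PATH")
    else:
        print(f"chromedriver found at {path}")
```

Note that this only inspects the PATH of the current process, so run it from the same terminal you use to start the crawler.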

Apparently there is also a variant package that you can install with pip instead:

pip install chromedriver-binary-auto

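This package is meant to pick the chromedriver version that matches your installed Chrome. Once the path issue is resolved, list the packages you need in a requirements.txt file placed at the project root: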

requirements.txt

certifi==2022.6.15
charset-normalizer==2.1.0
idna==3.3
requests==2.28.1
urllib3==1.26.10

pip install -r requirements.txt

Install them all at once with the command above. Next, write a utils module (file):

utils.py

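import logging

import chromedriver_binary  # noqa: F401  (importing it adds the bundled chromedriver to PATH)
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.select import Select

# Shared WebDriver instance; created by setup()
driver = None


def setup(url):
    """Configure logging, start Chrome, and open the given URL."""
    global driver

    logging.basicConfig(format='%(asctime)s %(message)s',
                        datefmt='%Y/%m/%d %I:%M:%S %p', level=logging.INFO)

    driver = webdriver.Chrome()
    # Wait up to 2.5 seconds for an element to appear before a lookup fails
    driver.implicitly_wait(2.5)

    driver.get(url)


def switch_frame(name):
    """Move the driver's focus into the frame with the given name."""
    driver.switch_to.frame(name)


def find(name_or_id):
    """Look up an element by ID when the argument starts with '#', otherwise by name."""
    if name_or_id.startswith('#'):
        return driver.find_element(By.ID, name_or_id[1:])
    return driver.find_element(By.NAME, name_or_id)


def update_select_value(name_or_id, value):
    """Choose the option of a <select> element whose visible text equals value."""
    Select(find(name_or_id)).select_by_visible_text(value)


def run_script(code):
    """Run JavaScript in the current page and return its result."""
    return driver.execute_script(code)


def iframes():
    """Return all iframe elements in the current document."""
    return driver.find_elements(By.TAG_NAME, "iframe")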
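The '#' prefix convention used by find() can be looked at in isolation. The following sketch pulls the parsing step out into a pure function that needs no browser. The name to_locator is hypothetical, and the strategy strings "id" and "name" are the values that Selenium's By.ID and By.NAME constants hold.

```python
def to_locator(name_or_id):
    """Translate a '#id' or plain name string into a (strategy, value) pair."""
    if name_or_id.startswith("#"):
        return ("id", name_or_id[1:])  # same string value as By.ID
    return ("name", name_or_id)        # same string value as By.NAME


if __name__ == "__main__":
    print(to_locator("#chotatsuType"))
    print(to_locator("A300"))
```

With a live driver, the pair could be unpacked as driver.find_element(*to_locator("#link2")), which keeps the lookup rule testable without starting Chrome.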

Create a crawler module in the same directory and import utils from it.

crawler.py

from logging import info

from utils import *

setup('https://URL.com')
info('done opening url')

switch_frame('frmRIGHT')
info('switch to frmRIGHT')

update_select_value("#chotatsuType", "物品等")

find("#link2").click()
info('redirect to search page')

update_select_value("A300", "100")

update_select_value("A103", "電算業務")
run_script("doSearch1()")
# (rest omitted)

Run it from the terminal with the following command:

python3 crawler.py

The script opened the website at the specified URL in the browser and filled in the form automatically.
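The script in this article never closes the browser, so a Chrome window and a chromedriver process can be left behind when a step fails. One possible way to handle this, sketched here as an assumption rather than as part of the original code, is a small context manager that always calls quit(). The names quitting and FakeDriver are hypothetical; FakeDriver only stands in for a real WebDriver so that the pattern can be run without a browser.

```python
from contextlib import contextmanager


@contextmanager
def quitting(driver):
    """Yield the driver and always call driver.quit() afterwards."""
    try:
        yield driver
    finally:
        driver.quit()


class FakeDriver:
    """Stand-in for a WebDriver, used only to demonstrate the pattern."""

    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True


if __name__ == "__main__":
    fake = FakeDriver()
    with quitting(fake):
        pass  # crawling steps would go here
    print(fake.closed)
```

With a real driver, the crawling steps would sit inside the with block, and the browser would be closed even if one of them raised an exception.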
