Addendum

When I tried this again after a long break, it no longer worked.

python3 and pip must already be installed.

Is python3 installed?


python3 --version
# Python 3.6.8

Is pip available?


pip --version
# pip 9.0.3 from /usr/lib/python3.6/site-packages (python 3.6)
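
The same two checks can also be run from Python itself. The following is a minimal sketch using only the standard library; the function name `check_prerequisites` and the dictionary keys are made up for this illustration and are not part of any tool mentioned here.

```python
import importlib.util
import shutil
import sys


def check_prerequisites():
    """Report whether python3 and pip look usable from this interpreter."""
    return {
        # True when a python3 executable is found on PATH
        "python3_on_path": shutil.which("python3") is not None,
        # True when the pip module can be imported by this interpreter
        "pip_importable": importlib.util.find_spec("pip") is not None,
        # version of the running interpreter, e.g. "3.6.8"
        "python_version": ".".join(str(n) for n in sys.version_info[:3]),
    }


if __name__ == "__main__":
    for name, value in check_prerequisites().items():
        print(f"{name}: {value}")
```

If `pip_importable` comes out False, installing pip as described next is the likely fix.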

On AWS Lightsail, the steps look like this:


# install pip
sudo yum install python3-pip

# install selenium through pip
python3 -m pip install selenium


OK!

Install selenium.
Also install the font packages, to prevent garbled Japanese text in screenshots.


pip install selenium
yum -y install libX11 GConf2 fontconfig
yum -y install ipa-gothic-fonts ipa-mincho-fonts ipa-pgothic-fonts ipa-pmincho-fonts
fc-cache -fv

Install Chrome


vim /etc/yum.repos.d/google-chrome.repo

----- contents from here -----
[google-chrome]
name=google-chrome
baseurl=http://dl.google.com/linux/chrome/rpm/stable/x86_64
enabled=1
gpgcheck=1
gpgkey=https://dl.google.com/linux/linux_signing_key.pub

Install with yum


yum -y install google-chrome-stable libOSMesa
google-chrome --version
# Google Chrome 80.0.3987.106  <- important: note this version

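Download the driver that matches the version shown above.
Be sure to download exactly the same version.

Check the available versions at the page below (it appears to be shutting down)
~~https://sites.google.com/a/chromium.org/chromedriver/downloads~~
and then download: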


wget https://chromedriver.storage.googleapis.com/80.0.3987.106/chromedriver_linux64.zip
unzip chromedriver_linux64.zip
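
Copying the version number by hand is easy to get wrong, so here is a small sketch that derives the download URL from the output of `google-chrome --version`. It assumes the URL pattern of the wget line above; the helper names `parse_chrome_version` and `chromedriver_url` are invented for this example, and whether a driver build exists for every Chrome version is not guaranteed.

```python
import re

# URL pattern taken from the wget line above
DRIVER_URL = "https://chromedriver.storage.googleapis.com/{version}/chromedriver_linux64.zip"


def parse_chrome_version(version_output):
    """Extract '80.0.3987.106' from 'Google Chrome 80.0.3987.106'."""
    match = re.search(r"(\d+\.\d+\.\d+\.\d+)", version_output)
    if match is None:
        raise ValueError(f"no version number found in: {version_output!r}")
    return match.group(1)


def chromedriver_url(version_output):
    """Build the download URL for the driver with the same version string."""
    return DRIVER_URL.format(version=parse_chrome_version(version_output))


print(chromedriver_url("Google Chrome 80.0.3987.106"))
# prints https://chromedriver.storage.googleapis.com/80.0.3987.106/chromedriver_linux64.zip
```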

firefox

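I got stuck on this one in several ways too. Versions apparently matter here as well, so be careful.
Besides Firefox itself, geckodriver is also required.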


yum install firefox
wget https://github.com/mozilla/geckodriver/releases/download/v0.23.0/geckodriver-v0.23.0-linux64.tar.gz

Without the headless option, the following error occurs:
selenium.common.exceptions.WebDriverException: Message: invalid argument: can't kill an exited process


from selenium import webdriver
from selenium.webdriver.firefox.options import Options

# run Firefox without a display
options = Options()
options.add_argument('-headless')
driver = webdriver.Firefox(options=options)

driver.get('https://www.ugtop.com/spill.shtml')
driver.get_screenshot_as_file('/var/www/html/screenshot.png')
driver.quit()

Let's test it right away.
Change only yourpass to match your environment.


from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_experimental_option('prefs', {'intl.accept_languages': 'ja'})
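# width x height. A generous window size is better, because scrolling to collect data sometimes fails to pick everything up.
chrome_options.add_argument('--window-size=1200,6920')

# replace /yourpass with the directory that holds chromedriver
driver = webdriver.Chrome('/yourpass/chromedriver', options=chrome_options)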

driver.get('https://qiita.com')
driver.get_screenshot_as_file('screenshot.png')

Now the screenshot has been saved!

screenshot.png
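
Since the window size tends to need tuning per site, one option is to build the flag list in a small helper. This is only a sketch: `build_chrome_arguments` and `make_chrome_driver` are hypothetical names, and the driver is created in the same Selenium 3 style as the test script above (driver path as the first argument).

```python
def build_chrome_arguments(width=1200, height=6920, headless=True):
    """Return the command-line flags used in the test script above."""
    arguments = ["--no-sandbox", "--disable-dev-shm-usage"]
    if headless:
        arguments.insert(0, "--headless")
    arguments.append(f"--window-size={width},{height}")
    return arguments


def make_chrome_driver(driver_path, **kwargs):
    """Create a Chrome driver; selenium is imported here so the helper above works without it."""
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    chrome_options = Options()
    for argument in build_chrome_arguments(**kwargs):
        chrome_options.add_argument(argument)
    chrome_options.add_experimental_option("prefs", {"intl.accept_languages": "ja"})
    # positional driver path, as in the script above (Selenium 3 style)
    return webdriver.Chrome(driver_path, options=chrome_options)


print(build_chrome_arguments(width=1920, height=1080))
# prints ['--headless', '--no-sandbox', '--disable-dev-shm-usage', '--window-size=1920,1080']
```

Usage would then be something like `driver = make_chrome_driver('/yourpass/chromedriver', height=3000)`.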

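Install PyCharm

PyCharm is the Python counterpart of PhpStorm.
Upload features such as FTP are only available in the paid edition,
so go ahead and buy the paid edition.

Settings after installation

・Fixing garbled Japanese text
file > settings > editor > font
and choose, for example, Ricty Diminished.

・Turning off vim mode

Reference
https://sayanotsu.hatenablog.jp/entry/2018/04/11/205318

In file > settings > plugins, uncheck IdeaVim.

That is about it. After that, configure FTP so that you can upload.
Now you can develop locally and send the files to the server.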

Check what the scraper is currently doing

index.php


<html>
    <head>
        <meta http-equiv="refresh" content="1; URL=">
    </head>
    <body>
        <img src='screenshot.png?<?=time();?>'>
    </body>
</html>

This page reloads the screenshot from earlier every second, so you always see the latest screenshot. Use it to check progress.

Next, try BeautifulSoup

Selenium and BeautifulSoup can be used together.
BeautifulSoup is a library that makes HTML easier to parse.

First, install it


pip install beautifulsoup4
pip install html5lib

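Now modify the code a little

hoge.py


import logging
from bs4 import BeautifulSoup

# assumption: `log` is a standard logger (it is not defined in the original snippet)
logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger(__name__)

# print the entire HTML source
soup = BeautifulSoup(driver.page_source, "html5lib")
log.debug(soup)

This dumps the full HTML source of the fetched page.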

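Wait until the element is found

If you start operating on a page right after loading it, without waiting a little, things sometimes break.
So let's use a wait.

・Processing starts as soon as the element with ID sokuhime appears
・Wait at most 10 seconds; if it does not appear, the script is terminated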


from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


# fetch the given URL
driver.get('https://www.yahoo.co.jp')

try:
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "sokuhime"))
    )

    # parse the full HTML source
    soup = BeautifulSoup(driver.page_source, "html5lib")

    # get only the title tag
    log.debug(soup.title.string)

finally:
    # always close chrome, even when the wait times out
    driver.quit()
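
To make it clearer what the wait above does, here is a simplified polling loop written from scratch. It is not Selenium's implementation, just a sketch of the idea: call a condition repeatedly until it returns something truthy or the time limit passes. The name `wait_until` and its parameters are invented for this example, and it raises the built-in TimeoutError rather than Selenium's TimeoutException.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# stand-in condition that succeeds on the third call
calls = {"count": 0}


def element_appears():
    calls["count"] += 1
    return "element" if calls["count"] >= 3 else None


print(wait_until(element_appears, timeout=2, poll_interval=0.01))
# prints element
```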

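Get the src of images

Note that in Python a for statement ends with a colon (:).


    for v in soup.find_all("img"):
        # .get() avoids a KeyError for img tags without a src attribute
        print(v.get('src'))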

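Get only the srcs that match the regular expression girls

import re

    # compile once, outside the loop
    pattern = re.compile("girls")
    i = 0
    for v in soup.find_all("img"):
        src = v.get('src', '')
        if pattern.search(src):
            i += 1
            print(src)

    # number of matching images
    print(i)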
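
The filtering step can also be separated from the HTML parsing, which makes it easy to try out without a browser. A minimal sketch, assuming the src values have already been collected into a list; `filter_srcs` and the sample paths are made up for illustration.

```python
import re


def filter_srcs(srcs, pattern):
    """Return the src values that contain a match for the regular expression."""
    compiled = re.compile(pattern)
    # skip None, which is what tag.get('src') returns for an img without src
    return [src for src in srcs if src and compiled.search(src)]


sample = ["/img/girls/01.jpg", "/img/logo.png", None, "/girls_top.gif"]
matched = filter_srcs(sample, "girls")
print(matched)       # prints ['/img/girls/01.jpg', '/girls_top.gif']
print(len(matched))  # prints 2
```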

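Scrolling to the bottom of the page

To keep things simple, use time.sleep.


import time

try:
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "sokuhime"))
    )

    # scroll to the bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)

finally:
    # close the browser whether or not the steps above succeeded
    driver.quit()

After the data at the top of the site has been fetched, the browser scrolls to the bottom.
It then waits 5 seconds after scrolling.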
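
On pages that keep loading content as you scroll, a single scroll plus a fixed sleep may not reach the real bottom. One possible variation is to repeat the scroll until the page height stops growing. This is a sketch, not something from the original script: `scroll_to_bottom` is a hypothetical helper, and `FakeDriver` only stands in for a Selenium driver so the loop can be run without a browser.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=10):
    """Scroll down until document.body.scrollHeight stops growing; return the final height."""
    last_height = driver.execute_script("return document.body.scrollHeight;")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazily loaded content time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight;")
        if new_height == last_height:
            break
        last_height = new_height
    return last_height


class FakeDriver:
    """Stand-in for a Selenium driver: the page grows twice, then stops."""

    def __init__(self):
        self.heights = [1000, 2000, 3000, 3000]
        self.reads = 0

    def execute_script(self, script):
        if script.startswith("return"):
            height = self.heights[min(self.reads, len(self.heights) - 1)]
            self.reads += 1
            return height
        return None


print(scroll_to_bottom(FakeDriver(), pause=0))
# prints 3000
```

With a real driver the call would simply be `scroll_to_bottom(driver)`.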

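Download images

Sometimes you want to download a remote image.
In that case, do this:


import requests

def download_img(url, file_name):
    # stream=True lets the body be written in chunks instead of held in memory
    r = requests.get(url, stream=True, timeout=10)
    if r.status_code == 200:
        with open(file_name, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)

download_img("http://yahoo.co.jp/hoge.jpg", "hoge.jpg")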
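
The src values taken from img tags are often relative paths, and a local file name is needed for each download. The following sketch shows one way to handle both with the standard library; `absolute_url` and `file_name_from_url` are hypothetical helper names, and the page URL is made up for the example.

```python
import posixpath
from urllib.parse import urljoin, urlparse


def absolute_url(page_url, src):
    """Resolve a possibly relative img src against the URL of the page it came from."""
    return urljoin(page_url, src)


def file_name_from_url(url, default="download.bin"):
    """Use the last path segment of the URL as the local file name."""
    name = posixpath.basename(urlparse(url).path)
    return name or default


url = absolute_url("http://yahoo.co.jp/news/index.html", "../hoge.jpg")
print(url)                      # prints http://yahoo.co.jp/hoge.jpg
print(file_name_from_url(url))  # prints hoge.jpg
```

The two results could then be passed to download_img(url, file_name).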

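Click an element to move to another page

Without [0] before .click(), the following error occurs.
AttributeError: 'list' object has no attribute 'click'


    # find_elements_by_xpath returns a list, so take [0] before clicking
    # (later Selenium 4 releases removed this method; there, use driver.find_elements(By.XPATH, ...))
    driver.find_elements_by_xpath('/html/body/div[3]/div[9]/div/p/a[1]')[0].click()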
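
Indexing with [0] raises an IndexError when the XPath matches nothing, so a small guard can help. This is a sketch with invented names: `click_first` takes whatever list the find_elements call returned, and `FakeElement` is only a stand-in so the example runs without a browser.

```python
def click_first(elements):
    """Click the first element of a find_elements-style list; return False when the list is empty."""
    if not elements:
        return False
    elements[0].click()
    return True


class FakeElement:
    """Stand-in for a Selenium WebElement, used only for this demonstration."""

    def __init__(self):
        self.clicked = False

    def click(self):
        self.clicked = True


link = FakeElement()
print(click_first([link]), link.clicked)  # prints True True
print(click_first([]))                    # prints False
```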

When too many chrome processes are running

Running the top command sometimes shows several chrome processes left running.
In that case, kill them all at once.


pkill -f chrome

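Connect to MySQL from Python, and do a bulk insert too

A bulk insert means inserting many rows in one go.
This makes inserting data much faster than inserting row by row.
However, when a row already exists it should be updated instead, so this section also shows
the ON DUPLICATE KEY UPDATE approach: create the row if it is missing, update it if it is there.

Install


pip install mysql-connector-python

This example uses a users table
with only two fields, id and name.

hoge.py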


import mysql.connector

# connect to mysql
conn = mysql.connector.connect(
    host='localhost',
    port=3306,
    user='root',
    password='pass',
    database='hoge'
)


# connection check: True means the connection is OK
#print(conn.is_connected())


cur = conn.cursor()

# insert a new row
cur.execute("INSERT INTO users VALUES (10, 'hideki')")
# note: without conn.commit() the row is not actually inserted
conn.commit()

# update (committed by the conn.commit() at the end)
cur.execute('UPDATE users SET name=%s WHERE id=%s', ('花子', 6))



# BULK INSERT (multiple insert)

data = [
    (6, 'chanko'),
    (7, 'udon'),
    (9, 'shinki'),
]


# every column to overwrite has to be listed after UPDATE
# (ON DUPLICATE KEY UPDATE is MySQL syntax; the %s placeholders are filled in by mysql-connector)

query = "INSERT INTO users (id,name) VALUES (%s,%s) ON DUPLICATE KEY UPDATE id = VALUES(id),name = VALUES(name);"
cur.executemany(query, data)




# commit. Without this, nothing is inserted
conn.commit()
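
Writing out every column after UPDATE by hand gets tedious for wider tables, so the query string above could be generated instead. The sketch below builds the same kind of statement from a column list and splits the rows into batches; `build_upsert_sql` and `chunked` are hypothetical helpers, and no database connection is opened here.

```python
def build_upsert_sql(table, columns):
    """Build INSERT ... ON DUPLICATE KEY UPDATE for the given columns.

    table and columns are placed into the SQL text as-is, so they must come
    from trusted code, never from user input.
    """
    column_list = ", ".join(columns)
    placeholders = ", ".join(["%s"] * len(columns))
    updates = ", ".join(f"{name} = VALUES({name})" for name in columns)
    return (
        f"INSERT INTO {table} ({column_list}) VALUES ({placeholders}) "
        f"ON DUPLICATE KEY UPDATE {updates}"
    )


def chunked(rows, size):
    """Yield successive slices of rows so each executemany call stays small."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


query = build_upsert_sql("users", ["id", "name"])
print(query)
data = [(6, "chanko"), (7, "udon"), (9, "shinki")]
for batch in chunked(data, 2):
    print(batch)  # with a real cursor: cur.executemany(query, batch)
```

Only the row values go through the %s placeholders; the table and column names are fixed in code, which is why they have to be trusted.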

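Disguise your IP address

In other words, go through a proxy.
Some sites block repeated scraping requests from the same IP,
so access them through a proxy.

One caveat: for some reason this does not work with chrome.
So do the proxied scraping with firefox instead.

This time we use a handy library that finds free proxies and picks from them at random.
https://medium.com/ml-book/multiple-proxy-servers-in-selenium-web-driver-python-4e856136199d


pip install http-request-randomizer

hoge.py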


from http_request_randomizer.requests.proxy.requestProxy import RequestProxy

from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument('-headless')


# collect a list of free proxies
req_proxy = RequestProxy()
proxies = req_proxy.get_proxy_list()

# use the first proxy in the list
PROXY = proxies[0].get_address()

# e.g. 183.88.217.23:8080
webdriver.DesiredCapabilities.FIREFOX['proxy'] = {
    "httpProxy": PROXY,
    "ftpProxy": PROXY,
    "sslProxy": PROXY,
    "proxyType": "MANUAL",
}

driver = webdriver.Firefox(options=options)

driver.get('https://www.ugtop.com/spill.shtml')
driver.get_screenshot_as_file('/var/www/html/screenshot.png')
driver.quit()

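Looking at Kakunin-kun (the ugtop.com page opened above) now shows that the IP address is disguised.
It apparently can also be restricted to Japan only, but
that produces the error selenium.common.exceptions.WebDriverException: Message: Reached error, so it does not seem to work.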
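
The script above always takes the first proxy in the list. If a different proxy per run is wanted, the selection and the capability dictionary could be pulled into helpers like these. This is a sketch under assumptions: `proxy_capability` and `pick_proxy` are invented names, and the addresses are placeholders from a documentation IP range, not real proxies.

```python
import random


def proxy_capability(address):
    """Build the 'proxy' capability dict used above for one host:port address."""
    return {
        "httpProxy": address,
        "ftpProxy": address,
        "sslProxy": address,
        "proxyType": "MANUAL",
    }


def pick_proxy(addresses, rng=random):
    """Pick one address at random instead of always taking the first entry."""
    if not addresses:
        raise ValueError("proxy list is empty")
    return rng.choice(addresses)


# placeholder addresses; a real list would come from req_proxy.get_proxy_list()
candidates = ["192.0.2.10:8080", "192.0.2.11:3128"]
chosen = pick_proxy(candidates, rng=random.Random(0))
print(proxy_capability(chosen)["httpProxy"] in candidates)
# prints True
```

With the real list, the addresses would be `[p.get_address() for p in proxies]`, and the resulting dictionary would be assigned to webdriver.DesiredCapabilities.FIREFOX['proxy'] as before.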

centos7 python3 chrome selenium scraping basics
