
[Scraping] Python Scraping

Python

Last updated at 2023-05-12, posted at 2020-05-06

Environment

Linux Ubuntu Xfce

References

PythonによるWebスクレイピング
 Pythonクローリング＆スクレイピング-データ収集・解析のための実践開発ガイド
 実践 Selenium WebDriver

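Tools

Chrome
Chrome-Driver
BS4
Used to extract text and images
Selenium
Used to drive actions in the browser
pandas
Used to combine data and write it to files
liblzma-dev : a package pandas needs (for Python's lzma module), so install it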

Chrome

curl -sS -o - https://dl-ssl.google.com/linux/linux_signing_key.pub | sudo apt-key add -
echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" | sudo tee -a /etc/apt/sources.list.d/google-chrome.list
sudo apt-get -y update
sudo apt-get -y install google-chrome-stable

Other packages

sudo apt install chromium-chromedriver liblzma-dev \
&& pip install bs4 lxml selenium pandas

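Basics

What bs4 can do

bs4 provides many methods. Combined with regular expressions (re), they let you extract almost anything that is present in the HTML.

Use lxml as the parser

lxml is the fastest parser that Beautiful Soup supports. CSS selectors (select) are handled by Beautiful Soup itself, so they work the same way with it.

from bs4 import BeautifulSoup

html_doc = '<html>...</html>'
soup = BeautifulSoup(html_doc, 'lxml')

Always close and quit when you are done

If you skip this, leftover browser processes pile up.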

from selenium import webdriver
driver = webdriver.Chrome()
# Shut down the driver
driver.close()
driver.quit()
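
One way to make sure quit always runs, even when a scraping step raises an exception, is to wrap the driver in a context manager. The following is a minimal sketch, not part of Selenium itself: managed_driver and FakeDriver are hypothetical names introduced here, and FakeDriver only stands in for webdriver.Chrome so that the snippet runs without a browser.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(factory):
    """Create a driver with factory() and always quit it afterwards."""
    driver = factory()
    try:
        yield driver
    finally:
        # Runs even when the body raises, so no browser process is left behind
        driver.quit()


class FakeDriver:
    """Hypothetical stand-in for webdriver.Chrome (assumption for this sketch)."""

    def __init__(self):
        self.quit_called = False

    def quit(self):
        self.quit_called = True


with managed_driver(FakeDriver) as driver:
    pass  # Selenium operations would go here

print(driver.quit_called)
```

With a real browser the call would presumably look like `with managed_driver(webdriver.Chrome) as driver:`.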

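Operate the page with Selenium, then hand the HTML source to BS4

Once the source is handed over, the rest is a treasure hunt with BS4

from bs4 import BeautifulSoup
from selenium.webdriver import Chrome, ChromeOptions

options = ChromeOptions()
options.add_argument('--headless')  # run without a window
driver = Chrome(options=options)
url = 'https://www.example.com/'
driver.get(url)

# Selenium operations start
...
# Selenium operations end

html = driver.page_source.encode('utf-8')
driver.quit()  # the HTML is captured, so the browser can be shut down
soup = BeautifulSoup(html, "lxml")

# BS4 processing starts
...
# BS4 processing ends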

When the HTML has only a few tags, skip the find method

Access tags by name directly on the BeautifulSoup object

For a document with few tags, like this one:

from bs4 import BeautifulSoup

html_doc = '''
<html>
    <head>
        <title>hello soup</title>
    </head>
    <body>
        <p class="my-story">my story</p>
    </body>
</html>
'''
soup = BeautifulSoup(html_doc, 'lxml')
print(soup.title)
print(soup.title.text)
print(soup.p)
print(soup.p['class'])
print(soup.p.text)

Output

<title>hello soup</title>
hello soup
<p class="my-story">my story</p>
['my-story']
my story

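Know the four bs4 object types

Beautiful Soup has four kinds of objects: Tag, NavigableString, BeautifulSoup, and Comment

Of these, BeautifulSoup and Tag are the ones you use most often

BeautifulSoup objects and Tag objects

BeautifulSoup : converts the HTML source into a form Python can work with (a tree structure)
Tag : created when you call certain methods on a BeautifulSoup object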
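
To see the four object types side by side, a small document containing a comment can be inspected as follows. This is only an illustrative sketch: the HTML string is made up, and 'html.parser' is used here just so the snippet runs without lxml installed.

```python
from bs4 import BeautifulSoup

html_doc = '<p class="my-story">my story<!-- a note --></p>'
soup = BeautifulSoup(html_doc, 'html.parser')

tag = soup.p                   # Tag
text, comment = tag.contents   # NavigableString and Comment

type_names = [type(obj).__name__ for obj in (soup, tag, text, comment)]
print(type_names)
```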

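Understand the difference between find and find_all

Calling find and find_all on a BeautifulSoup object lets you search for anything, but to search precisely you need to know what each method returns

Object returned by each method
find → bs4.element.Tag
find_all → bs4.element.ResultSet

Return value when nothing is found
find → None
find_all → [] (an empty list)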
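
The difference in "nothing found" return values affects how results should be guarded. A minimal sketch, assuming a made-up document with no a tags ('html.parser' is used only so the snippet runs without lxml):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="my-story">my story</p>', 'html.parser')

missing_one = soup.find('a')      # no <a> tag exists, so this is None
missing_all = soup.find_all('a')  # an empty ResultSet, which behaves like []

# Guard before calling Tag methods; calling them on None raises AttributeError
href = missing_one.get('href') if missing_one is not None else None

# Looping over an empty ResultSet simply does nothing
hrefs = [a.get('href') for a in missing_all]

print(missing_one, len(missing_all), href, hrefs)
```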

bs4.element.Tag

As a rule of thumb, anything that returns a single element (find, select_one, or attribute access such as soup.a) gives you a Tag; find_all and select do not

from bs4 import BeautifulSoup

html_doc = '''
<html>
    <head>
        <title>hello soup</title>
    </head>
    <body>
        <p class="my-story">my story</p>
        <a class='brother' href='http://example.com/1' id='link1'>リンク１</a>
        <a class='brother' href='http://example.com/2' id='link2'>リンク２</a>
        <a class='brother' href='http://example.com/3' id='link3'>リンク３</a>
    </body>
</html>
'''

soup = BeautifulSoup(html_doc, 'lxml')

print('tag1')
tag1 = soup.find('a')
print(tag1)
print(type(tag1))


print('tag2')
tag2 = soup.a
print(tag2)
print(type(tag2))

bs4.element.ResultSet

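Returned by find_all and select (calling the object directly, as in soup('a'), is shorthand for find_all)

Picture a list holding many bs4.element.Tag objects (this mental model matters a lot)

Mental model of bs4.element.ResultSet

bs4.element.ResultSet = [bs4.element.Tag, bs4.element.Tag, bs4.element.Tag,...]

So you cannot search it directly; take an element out of the list first
Once you have taken one out, you can use the same methods as on the bs4.element.Tag above

Note: when a method "doesn't work", it is almost always because you are calling a bs4.element.Tag method on a bs4.element.ResultSet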

from bs4 import BeautifulSoup

html_doc = '''
<html>
    <head>
        <title>hello soup</title>
    </head>
    <body>
        <p class="my-story">my story</p>
        <a class='brother' href='http://example.com/1' id='link1'>リンク１</a>
        <a class='brother' href='http://example.com/2' id='link2'>リンク２</a>
        <a class='brother' href='http://example.com/3' id='link3'>リンク３</a>
    </body>
</html>
'''

soup = BeautifulSoup(html_doc, 'lxml')

print('tag3')
tag3 = soup.select('a:nth-of-type(2)') # the second a tag among its siblings
print(tag3)
print(type(tag3))

print('tag4')
tag4 = soup.select('.brother') # CSS class selector
print(tag4)
print(type(tag4))

print('tag5')
tag5 = soup.select('a[href]') # find tags by the presence of an attribute
print(tag5)
print(type(tag5))
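
Since a ResultSet is essentially a list of Tag objects, the usual pattern is to index into it or loop over it. The sketch below reuses the a tags from the example above, with link text changed to ASCII for brevity; 'html.parser' is an assumption made only so it runs without lxml.

```python
from bs4 import BeautifulSoup

html_doc = '''
<body>
    <a class='brother' href='http://example.com/1' id='link1'>link 1</a>
    <a class='brother' href='http://example.com/2' id='link2'>link 2</a>
    <a class='brother' href='http://example.com/3' id='link3'>link 3</a>
</body>
'''
soup = BeautifulSoup(html_doc, 'html.parser')
links = soup.select('a.brother')  # ResultSet holding three Tag objects

# links['href'] would fail: take a Tag out of the ResultSet first
first_href = links[0]['href']
all_hrefs = [a['href'] for a in links]

print(first_href)
print(all_hrefs)
```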

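Tips

Relax the print output limit

With the default setting, printing a large file fails with the error IOPub data rate exceeded., so raise the limit to an effectively unlimited value

Create the config file

jupyter notebook --generate-config

~/.jupyter/jupyter_notebook_config.py

# Before: 1000000 → after: 1e10
c.NotebookApp.iopub_data_rate_limit = 1e10

Alternatively, pass the same setting on the command line when starting the notebook

jupyter notebook --NotebookApp.iopub_data_rate_limit=1e10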

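Read and write quickly with pickle

pickle reads and writes in binary format (the 'b' in the code stands for binary), so it is fast

joblib offers similar functionality; it is a good choice when you want smaller files and can accept some loss of speed

Write (dump)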

import pickle

example = 'example'

with open('example.pickle', 'wb') as f:
    pickle.dump(example, f)

Read (load)

with open('example.pickle', 'rb') as f:
    example = pickle.load(f)
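
pickle is not limited to strings; plain Python containers such as lists and dicts round-trip as well. A minimal sketch, where the rows data and the temporary file location are assumptions made for illustration:

```python
import os
import pickle
import tempfile

# Hypothetical scraped rows: plain dicts and strings pickle without trouble
rows = [{'title': 'hello soup', 'url': 'http://example.com/1'}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'rows.pickle')
    with open(path, 'wb') as f:
        pickle.dump(rows, f)
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored == rows)
```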

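Handling the error when pickling bs4 objects

Trying to dump a bs4 object (bs4.BeautifulSoup and so on) fails with the error maximum recursion depth exceeded while pickling an object, so convert it to a string before saving

dump

import pickle
from bs4 import BeautifulSoup

example = BeautifulSoup('<p>example</p>', 'lxml')  # a bs4 object

with open('example.pickle', 'wb') as f:
    pickle.dump(str(example), f)  # convert to str before pickling

load

with open('example.pickle', 'rb') as f:
    example = BeautifulSoup(pickle.load(f), 'lxml')

Loading alone gives you a str, which bs4 methods cannot be used on
So convert it back to a bs4 object when loading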

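If the approach above still does not work

Dumping a dict sometimes fails
In that case, dumping with json is a good option

dump

import json

example = {'key': 'value'}  # placeholder dict

with open('example.json', 'w') as f:
    json.dump(example, f)

load

with open('example.json', 'r') as f:
    example = json.load(f)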
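
When the scraped text is Japanese, json.dump escapes it as \uXXXX sequences by default. The sketch below shows one way to keep the file human-readable with ensure_ascii=False; the dict contents and the temporary file location are assumptions for illustration.

```python
import json
import os
import tempfile

example = {'title': 'リンク１', 'url': 'http://example.com/1'}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example.json')
    # ensure_ascii=False writes non-ASCII characters as-is instead of \uXXXX
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(example, f, ensure_ascii=False, indent=2)
    with open(path, 'r', encoding='utf-8') as f:
        raw_text = f.read()

restored = json.loads(raw_text)
print(raw_text)
```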

Jupyter Notebook

Maximize the cell width

When viewing a pandas DataFrame, text gets cut off at the default cell width, so set the cell width to the maximum

~/.jupyter/custom/custom.css

.container { width:100% !important; }

Measure processing time

Use %time, which is available only in a Jupyter (IPython) environment
It is a built-in magic command, so no import is needed

Usage

%time example_function()
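
Outside Jupyter, where %time is not available, a rough equivalent can be written with time.perf_counter from the standard library. This is a sketch only: example_function here is a hypothetical workload standing in for a scraping step.

```python
import time


def example_function():
    """Hypothetical workload standing in for a scraping step."""
    return sum(range(100_000))


start = time.perf_counter()
result = example_function()
elapsed = time.perf_counter() - start  # wall-clock seconds

print(f'{elapsed:.4f} s')
```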

Regular expressions

Get the text before or after a slash in a URL

For example, to get scraping from https://www.example.com/topics/scraping
Split on / and take the last element

Code

url = 'https://www.example.com/topics/scraping'

print(url.split('/'))
#['https:', '', 'www.example.com', 'topics', 'scraping']
print(url.split('/')[-1])
#scraping
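
The plain split returns an empty string when the URL ends with a slash, and it keeps any query string attached. A slightly more defensive sketch using urllib.parse, plus the same idea with re, is shown below; last_path_segment is a hypothetical helper name and the second URL is a made-up variation.

```python
import re
from urllib.parse import urlparse


def last_path_segment(url):
    """Return the last non-empty path segment, ignoring query strings and trailing slashes."""
    path = urlparse(url).path
    segments = [part for part in path.split('/') if part]
    return segments[-1] if segments else ''


print(last_path_segment('https://www.example.com/topics/scraping'))
print(last_path_segment('https://www.example.com/topics/scraping/?page=2'))

# The same idea with a regular expression applied to the path only
path = urlparse('https://www.example.com/topics/scraping/?page=2').path
match = re.search(r'([^/]+)/?$', path)
print(match.group(1))
```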

Pandas

Pandas UserWarning: Could not import the lzma module. Your installed Python is incomplete

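A warning shown when Python lacks the lzma module that pandas uses. Install the package below (if Python was built from source, for example with pyenv, it may need to be rebuilt afterwards)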

sudo apt install liblzma-dev

Extract a DataFrame column as a list

Select the column you want, which gives you a Series
Call the tolist method on the Series

import pandas as pd
df = pd.DataFrame({'Column1': 'data1', 'Column2': 'data2', 'Column3': 'data3'}, index=[1, 2, 3])

# Extract Column3 as a list
col3 = df['Column3'].tolist()

Left-align DataFrame output

The default is right-aligned, which makes URLs and English text hard to read

df.style.set_properties(**{'text-align': 'left'})  # left-align
