
[Python] Scraping code I'm likely to reuse

Last updated at 2022-06-06. Posted at 2021-05-16.

BeautifulSoup

Imports

import requests
from bs4 import BeautifulSoup

Fetch a page from the web with Requests

r = requests.get(url)
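
A request can hang or come back with an error page, so a slightly more defensive fetch may be worth reusing. The following is a minimal sketch rather than the article's original code: the helper name `fetch_html`, the 10-second timeout, and the User-Agent value are all assumptions chosen for illustration.

```python
import requests


def fetch_html(url, timeout=10):
    """Fetch a page and return its HTML text (hypothetical helper)."""
    # Assumed header value; some sites reject requests without a User-Agent.
    headers = {"User-Agent": "my-scraper/0.1"}
    r = requests.get(url, headers=headers, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx instead of parsing an error page.
    r.raise_for_status()
    # Let requests guess the encoding from the body, which can help
    # when the server declares the wrong charset.
    r.encoding = r.apparent_encoding
    return r.text
```

The returned text can be passed to `BeautifulSoup` in the same way as `r.text`.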

Parse the HTML

soup = BeautifulSoup(r.text, 'lxml')

Get a specific tag

# Find an element by id
#  An id should appear only once per page

element = soup.find(id="id_name")

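# Find an element by class
#  The same class name may appear several times on a page; find() returns only the first match
#  Note: "class" is a reserved word in Python, so write "class_"

element = soup.find(class_="class_name")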

# Extract a link
# Use find to get the first link (<a>) element
link = soup.find('a')
# Use get to read the href attribute of the link element and print it
print(link.get('href'))
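
To try the three lookups without fetching a real page, they can be run against a small inline document. This is only a sketch: the HTML, the id `main`, and the class `note` are made up for illustration, and it uses the standard-library `html.parser` instead of `lxml` so that no extra parser is required.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for this illustration.
html = """
<div id="main">
  <p class="note">first note</p>
  <p class="note">second note</p>
  <a href="https://example.com/next">next</a>
</div>
"""

# "html.parser" ships with Python; 'lxml' also works if it is installed.
soup = BeautifulSoup(html, "html.parser")

main = soup.find(id="main")            # the element with id="main"
first_note = soup.find(class_="note")  # only the first match is returned
link = soup.find("a")

print(main.name)
print(first_note.get_text())
print(link.get("href"))

# find() returns None when nothing matches, so guard before using the result.
missing = soup.find(id="no-such-id")
if missing is None:
    print("not found")
```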

aaa = soup.find("div", class_="post clearfix")

# Workaround for &nbsp;: split() needs a str, so take the text first
aaa_text = '\n'.join(aaa.get_text().split())

Get the text

text = aaa.get_text()
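
The `&nbsp;` workaround relies on `str.split()` with no arguments treating the non-breaking space (`'\xa0'`) as whitespace. The sketch below shows this on a made-up snippet; the HTML content and variable names are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Made-up HTML: &nbsp; becomes the character '\xa0' once parsed.
html = "<div class='post clearfix'>price:&nbsp;100 yen\n  tax   included</div>"
soup = BeautifulSoup(html, "html.parser")

raw_text = soup.find("div", class_="post clearfix").get_text()

# split() with no arguments splits on any Unicode whitespace,
# including '\xa0', and discards empty pieces.
cleaned = "\n".join(raw_text.split())

print(repr(raw_text))
print(repr(cleaned))
```

Note that this puts every whitespace-separated word on its own line; joining with `" "` instead keeps the words on one line.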

Remove elements

# decompose
# Remove the <ul> tag
aaa.find('ul').decompose()

# decompose multiple elements
# Get all <a> tags and store them in a_tags (a list-like ResultSet)
a_tags = soup("a")
# Remove every <a> tag
for a_tag in a_tags:
    a_tag.decompose()

# How to use extract
# bbb is removed from aaa and returned
bbb = aaa.find("div", id="post_pagination").extract()
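
The difference between the two removal methods is easier to see side by side. The sketch below uses made-up HTML and names (`post`, `pagination`); it is an illustration, not the original page.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for this illustration.
html = (
    "<div id='post'><ul><li>menu</li></ul><p>body</p>"
    "<div id='post_pagination'>1 2 3</div></div>"
)
soup = BeautifulSoup(html, "html.parser")
post = soup.find("div", id="post")

# extract() detaches the element and returns it, so it can still be used.
pagination = post.find("div", id="post_pagination").extract()
print(pagination.get_text())

# decompose() detaches the element and destroys it; there is nothing to keep.
ul = post.find("ul")
if ul is not None:  # find() returns None when there is no <ul>
    ul.decompose()

print(post.get_text())
```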

NavigableString

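BeautifulSoup4 represents strings such as the text between tags as NavigableString objects.
A NavigableString is a subclass of str that also keeps a reference to the parse tree, so tree-navigation attributes such as .parent work on it, but it is not a plain string.
After converting it to a plain string with str(), the BeautifulSoup4 attributes and methods are no longer available.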
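
A short check of this behaviour, as a sketch on a made-up one-line document:

```python
from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")
node = soup.b.string            # a NavigableString, not a plain str

print(type(node).__name__)
print(isinstance(node, str))    # NavigableString subclasses str
print(node.parent.name)         # it still knows where it sits in the tree

plain = str(node)               # plain str with no link to the parse tree
print(type(plain).__name__)
print(hasattr(plain, "parent"))
```

Converting with `str()` is also useful when the text is stored for a long time, since a NavigableString keeps the whole parse tree reachable.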

Multiple elements

select

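# Select the third <a>; the match ends up in [0] of the result
# Note: select() returns a list, not a str
nexturl = nexturl.select("a:nth-of-type(3)")
# Take the URL from element [0] of the list only
nexturl[0] = nexturl[0].get("href")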
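
Because `select()` always returns a list-like result, `select_one()` can be more convenient when only the first match is needed. A minimal sketch on a made-up pager (the class `pager` and the URLs are assumptions):

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for this illustration.
html = """
<div class="pager">
  <a href="/page/1">1</a>
  <a href="/page/2">2</a>
  <a href="/page/3">3</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# select() returns a list-like result even when there is a single match.
matches = soup.select("div.pager a:nth-of-type(3)")
hrefs = [a.get("href") for a in matches]

# select_one() returns the first match directly, or None if nothing matched.
third = soup.select_one("div.pager a:nth-of-type(3)")
next_href = third.get("href") if third is not None else None

print(hrefs)
print(next_href)
```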

find_all

bbb = soup.find_all("div", class_="p-content-cassette__info__main")
for sub in bbb:
    sub2 = sub.get_text()
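
The loop above overwrites `sub2` on every pass, so in practice the texts are usually collected into a list. A sketch with made-up HTML that reuses the same class name:

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for this illustration.
html = """
<div class="p-content-cassette__info__main"> Title A </div>
<div class="p-content-cassette__info__main"> Title B </div>
<div class="other">ignored</div>
"""
soup = BeautifulSoup(html, "html.parser")

titles = []
for sub in soup.find_all("div", class_="p-content-cassette__info__main"):
    # strip=True trims leading and trailing whitespace from the text
    titles.append(sub.get_text(strip=True))

print(titles)
```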

Working around UnicodeEncodeError

# A single character
Story = Story.replace('\u6451', '')

# Handle all of them at once: drop every character that cp932 cannot encode
Story = Story.encode('cp932', "ignore")
Story = Story.decode('cp932')
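
The encode/decode round trip can be wrapped in a small helper. This is a sketch: the name `drop_unencodable` and the sample string are assumptions, and `errors="replace"` is shown as an alternative that leaves a visible marker instead of silently dropping characters.

```python
def drop_unencodable(text, encoding="cp932"):
    """Return text with every character the codec cannot encode removed."""
    return text.encode(encoding, errors="ignore").decode(encoding)


sample = "テスト\U0001F600"  # the emoji has no cp932 representation

# The emoji is dropped; the Japanese text survives the round trip.
cleaned = drop_unencodable(sample)

# errors="replace" substitutes '?' for each character that cannot be encoded.
marked = sample.encode("cp932", errors="replace").decode("cp932")
```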

Running at fixed intervals

Every second

import time
time.sleep(1)
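
When scraping several pages, the sleep usually goes between requests. A minimal sketch, assuming a hypothetical helper `process_with_interval` and using `len` as a stand-in for the real per-URL work (for example a call to `requests.get`):

```python
import time


def process_with_interval(items, handler, interval=1.0):
    """Call handler(item) for each item, sleeping between calls."""
    results = []
    for i, item in enumerate(items):
        if i > 0:
            time.sleep(interval)  # wait between calls, not before the first
        results.append(handler(item))
    return results


# Stand-in handler and a short interval, only to keep the demo quick.
urls = ["https://example.com/1", "https://example.com/2"]
lengths = process_with_interval(urls, len, interval=0.1)
print(lengths)
```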

Tables

Tables can be extracted with pandas.

import pandas as pd

url = 'https://ja.wikipedia.org/wiki/%E3%83%87%E3%82%A3%E3%82%BA%E3%83%8B%E3%83%BC%E4%BD%9C%E5%93%81'

# This page had 9 tables.
dflist = pd.read_html(url)
print(len(dflist))

# Use the 2nd table, 「ウォルトディズニーアニメーションスタジオ長編作品」 (Walt Disney Animation Studios feature films).
df = dflist[1]
print(df.head())

# Specify the header row and the index column
dflist = pd.read_html(url, header=0, index_col=0)
df = dflist[1]
print(df.head())

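# In the 「監督」 (director) column, values that span several rows become NaN on every row but the first,
# so forward-fill each NaN with the preceding value.
# (ffill() replaces fillna(method='ffill'), which is deprecated in recent pandas versions.)
df.ffill(inplace=True)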
print(df)
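
The forward fill can be checked without downloading the page. The sketch below uses a small hand-made DataFrame; the column names and values are invented to imitate a table where one director spans several rows.

```python
import pandas as pd

# Made-up rows; None stands for the cells that read_html leaves as NaN.
df = pd.DataFrame(
    {
        "title": ["Film A", "Film B", "Film C"],
        "director": ["Director X", None, "Director Y"],
    }
)

# ffill() returns a copy in which each NaN takes the value from the row above.
filled = df.ffill()
print(filled)
```

Using the returned copy instead of `inplace=True` leaves the original `df` untouched, which makes it easier to compare before and after.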
