
Web scraping with Python and BeautifulSoup

Python

Posted at 2018-07-31

References:

An easy-to-follow explanation of BeautifulSoup
https://qiita.com/itkr/items/513318a9b5b92bd56185

How to do web scraping

Libraries used:

requests
bs4 #BeautifulSoup

$ pip install requests beautifulsoup4

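BeautifulSoup can parse HTML source code, but it cannot fetch a URL by itself. The requests library fetches the URL and returns the page as bytes or as a string.
You can get by just fine without knowing much HTML.

Trying it out on a simple HTML page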

import requests
from bs4 import BeautifulSoup

# get method: pass it a URL
r = requests.get("https://pythonhow.com/example.html")
print(type(r))
>>><class 'requests.models.Response'>

# bytes
c = r.content
print(type(c))
>>><class 'bytes'>

# str
print(type(r.text))
>>><class 'str'>

# Run BeautifulSoup on the bytes of the page source
soup = BeautifulSoup(c, "html.parser")
print(type(soup))
>>><class 'bs4.BeautifulSoup'>

# The prettify method shows neatly indented HTML
print(soup.prettify())
>>><!DOCTYPE html>
<html>
 <head>
  <style>
   div.cities {
    background-color:black;
    color:white;
    margin:20px;
    padding:20px;
}
  </style> (rest omitted)

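Passing a URL to requests.get() retrieves the page source. It takes a URL, not a path to a local HTML file.
.content returns the source as bytes and .text returns it as str.
Passing the bytes to BeautifulSoup makes it parse the source into a tree that can be searched. The second argument names the parser, that is, the program used for syntactic analysis. For HTML, html.parser works and ships with Python. Other parsers such as lxml and html5lib can be installed separately (there is no "xml.parser"; XML is handled through lxml). The soup object created by this instantiation is what we use to collect information.
The prettify method returns the HTML in a neatly indented form.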
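
To make the bytes/str point concrete without touching the network, here is a minimal sketch. The HTML snippet and the names `html_bytes` and `html_text` are made up for illustration; they stand in for `r.content` and `r.text` from a real response.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-ins for r.content (bytes) and r.text (str).
html_bytes = b"<html><body><p>hello</p></body></html>"
html_text = html_bytes.decode("utf-8")

# BeautifulSoup accepts either form with the same parser name.
soup_from_bytes = BeautifulSoup(html_bytes, "html.parser")
soup_from_text = BeautifulSoup(html_text, "html.parser")

print(soup_from_bytes.p.text)
print(soup_from_text.p.text)

# prettify() returns the indented HTML as a str.
pretty = soup_from_bytes.prettify()
print(pretty)
```

Either input works here. Passing bytes lets BeautifulSoup work out the encoding itself, which is presumably why the article hands it `r.content`.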

# find_all searches for every matching tag of the form <***></***>
divs = soup.find_all("div", {"class": "cities"})
print(len(divs), type(divs))
>>>3 <class 'bs4.element.ResultSet'>
print(divs[0])
>>><div class="cities">
<h2>London</h2>
<p>London is the capital of England and its been a British settlement since 2000 years ago. </p>
</div>


# find returns only the first match
one = soup.find("div", {"class": "cities"})
print(one)
>>><div class="cities">
<h2>London</h2>
<p>London is the capital of England and its been a British settlement since 2000 years ago. </p>
</div>


for div in divs:
    # .text returns the contents as a str
    print(div.find_all("h2")[0].text)
>>>
London
Paris
Tokyo

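The find_all method finds every block enclosed by a given tag. You can narrow the match further by attributes, here {"class":"cities"}. The returned ResultSet can be handled like a list, and you can iterate over it.
find returns only the first matching block.
.text returns the content enclosed between <tag> and </tag> as a string.
Not shown here, but attributes inside a tag, as in <tag title=""></tag>, can be retrieved with the .get("attr_name") method.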
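
The four calls described above can be tried together on an inline snippet. This is only a sketch: the HTML below is hypothetical, loosely modeled on the example page, and the `title` attribute was added purely to demonstrate `.get()`.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, not the real example page.
html = """
<div class="cities" title="capital"><h2>London</h2><p>UK</p></div>
<div class="cities"><h2>Paris</h2><p>France</p></div>
<div class="other"><h2>Not a city</h2></div>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all: every <div> whose class is "cities" (list-like ResultSet).
cities = soup.find_all("div", {"class": "cities"})
names = [div.find("h2").text for div in cities]

# find: only the first match.
first = soup.find("div", {"class": "cities"})

# .get(): read an attribute; returns None when the attribute is absent.
title = first.get("title")
missing = cities[1].get("title")

print(names, title, missing)
```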

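Collecting information from Century 21

Right-click the element you want on the website (a string, an image, or an icon) and choose Inspect. The browser jumps to the source of the corresponding HTML tag.
There is no need to hunt through a huge amount of source code by hand.

Scraping source: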
https://pythonhow.com/real-estate/rock-springs-wy/LCWYROCKSPRINGS/

import pandas
import requests
from bs4 import BeautifulSoup

r = requests.get("https://pythonhow.com/real-estate/rock-springs-wy/LCWYROCKSPRINGS/")
soup = BeautifulSoup(r.content, "html.parser")

# Get the number of pages from the last pagination link
page_nr = soup.find_all("a", {"class": "Page"})[-1].text

base_url = "https://pythonhow.com/real-estate/rock-springs-wy/LCWYROCKSPRINGS/t=0&s="

l = []
# The s parameter advances in steps of 10 per page
for page in range(0, int(page_nr) * 10, 10):
    url = base_url + str(page) + ".html"
    print(url)
    r = requests.get(url)
    c = r.content
    soup = BeautifulSoup(c, "html.parser")
    # Attributes such as class=value are passed as a dict {}
    rows = soup.find_all("div", {"class": "propertyRow"})

    for item in rows:
        d = {}

        price = item.find_all("h4", {"class": "propPrice"})[0].text.replace("\n", "")
        d["Price"] = price

        # find_all returns an empty list when nothing matches -> IndexError
        address = item.find_all("span", {"class": "propAddressCollapse"})
        try:
            d["Address"] = address[0].text
        except IndexError:
            d["Address"] = None
        try:
            d["Locality"] = address[1].text
        except IndexError:
            d["Locality"] = None

        # find returns None when nothing matches -> AttributeError
        bed = item.find("span", {"class": "infoBed"})
        try:
            d["Bed"] = bed.find("b").text
        except AttributeError:
            d["Bed"] = None

        try:
            d["Area"] = item.find("span", {"class": "infoSqFt"}).find("b").text
        except AttributeError:
            d["Area"] = None
        try:
            d["Baths"] = item.find("span", {"class": "infoValueFullBath"}).find("b").text
        except AttributeError:
            d["Baths"] = None
        try:
            d["Half Baths"] = item.find("span", {"class": "infoValueHalfBath"}).find("b").text
        except AttributeError:
            d["Half Baths"] = None

        # Features come as label/value span pairs; pick out "Lot Size"
        for column_group in item.find_all("div", {"class": "columnGroup"}):
            feature_groups = column_group.find_all("span", {"class": "featureGroup"})
            feature_names = column_group.find_all("span", {"class": "featureName"})
            for feature_group, feature_name in zip(feature_groups, feature_names):
                if "Lot Size" in feature_group.text:
                    d["Lot Size"] = feature_name.text
        l.append(d)

df = pandas.DataFrame(l)
df.to_csv("Output.csv")

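This is the Century 21 page for Rock Springs, Wyoming, which was the instructor's pick. The listing spans 3 pages, so we collect all 3, for 26 properties in total. Because the site is an archived copy, the pages would not load properly without the .html suffix.
We first read the page count to decide how many times to loop.
The try...except blocks are there because the listings have gaps: the same tag does not always exist for every property. When a tag is missing, find returns None (and find_all returns an empty list), and None has no .text to read.
The Lot Size, that is, the land area of the property, sat inside a group of several features, so each feature is checked one by one to see whether it is the Lot Size.
With time to spare, I wrote the result to a .csv. Charts are a fine thing.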
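
The three ideas above (paged URLs, fields that may be missing, and the Lot Size lookup) can also be written as small helpers. This is a sketch under assumptions, not the article's code: the function names `page_urls`, `text_or_none`, and `lot_size`, the `example.com` URL, and the inline listing are all hypothetical, and the helpers check for None explicitly instead of using try...except.

```python
from bs4 import BeautifulSoup


def page_urls(base_url, page_count, step=10):
    """Build one URL per page; the offset grows by `step` on each page."""
    return [f"{base_url}{offset}.html" for offset in range(0, page_count * step, step)]


def text_or_none(parent, name, class_name, child=None):
    """Return the text of the first matching tag, or None if it is absent."""
    tag = parent.find(name, {"class": class_name})
    if tag is not None and child is not None:
        tag = tag.find(child)
    return tag.text if tag is not None else None


def lot_size(item):
    """Scan paired featureGroup/featureName spans for the 'Lot Size' entry."""
    for group in item.find_all("div", {"class": "columnGroup"}):
        labels = group.find_all("span", {"class": "featureGroup"})
        values = group.find_all("span", {"class": "featureName"})
        for label, value in zip(labels, values):
            if "Lot Size" in label.text:
                return value.text  # first match wins in this sketch
    return None


# Hypothetical listing with no infoSqFt span, to exercise the missing case.
html = """
<div class="propertyRow">
  <span class="infoBed"><b>3</b> Beds</span>
  <div class="columnGroup">
    <span class="featureGroup">Age: </span><span class="featureName">10</span>
    <span class="featureGroup">Lot Size: </span><span class="featureName">0.25 Acres</span>
  </div>
</div>
"""
item = BeautifulSoup(html, "html.parser").find("div", {"class": "propertyRow"})
row = {
    "Bed": text_or_none(item, "span", "infoBed", child="b"),
    "Area": text_or_none(item, "span", "infoSqFt", child="b"),
    "Lot Size": lot_size(item),
}
urls = page_urls("https://example.com/listing/s=", 3)
print(row)
print(urls)
```

Checking for None up front keeps each field to one line and avoids catching unrelated errors by accident. The try...except style used in the article works too, as long as the except clause names the specific exception.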

The more tidily the tags are organized, the easier the HTML is to scrape.
I want to keep that in mind when I write HTML myself.

Collecting the tags of アニオタwiki(仮)

https://www49.atwiki.jp/aniwotawiki/tag/?p=2
I looked at how many pages each tag has. I wanted to try out BeautifulSoup's get() method.

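import pandas
import requests
from bs4 import BeautifulSoup

# https://www49.atwiki.jp/aniwotawiki/tag/?p=1
base_url = "https://www49.atwiki.jp/aniwotawiki/tag/?p="
start = 0
end = 5
l = []  # created once, before the loop, so rows from every page accumulate
for page in range(start, end):
    url = base_url + str(page)
    r = requests.get(url)
    c = r.content
    soup = BeautifulSoup(c, "html.parser")
    print(type(soup))
    page_nr = soup.find_all("p")
    a = page_nr[0].find_all("a")

    for i in a:
        d = {}
        title = i.get("title")
        # Take the text inside the last pair of parentheses, searching from
        # the right so that parentheses earlier in the title do not interfere
        d["title_amount"] = title[title.rindex("(") + 1:title.rindex(")")]
        # Drops the first 10 characters and the last one from the style value
        d["font_size"] = i.get("style")[10:-1]
        d["tag_name"] = i.text
        d["link_url"] = i.get("href")
        l.append(d)

df = pandas.DataFrame(l)
df.to_csv("aniota_tag.csv")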

Attributes inside an HTML tag can be retrieved with the .get() method. (My Mac is a clunker, so I did not want to crawl thousands of pages.)
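
As a small offline illustration of .get() and of pulling the count out of a title, here is a sketch. The markup is hypothetical (the real title and style formats on the wiki may differ), and `count_from_title` is a helper name invented for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page's attributes may look different.
html = """
<p>
  <a href="/tag/a" title="foo (bar) (123)" style="font-size:12px;">foo (bar)</a>
  <a href="/tag/b" title="baz (7)" style="font-size:9px;">baz</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")


def count_from_title(title):
    """Return the text inside the last pair of parentheses in `title`."""
    return title[title.rindex("(") + 1:title.rindex(")")]


rows = []
for a in soup.find("p").find_all("a"):
    rows.append({
        "tag_name": a.text,
        "title_amount": count_from_title(a.get("title")),
        "link_url": a.get("href"),
        "id": a.get("id"),  # attribute not present, so .get() returns None
    })

print(rows)
```

Searching from the right with `rindex` keeps the digits in their original order and still works when the tag name itself contains parentheses, as in the first link.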

Afterword

Web scraping is easy and fun. Crawlers probably run on more complex code built on much the same mechanism.
'Meiji THE GREEK YOGURT 5つの果実, manufacturer suggested retail price 140 yen' >>>delicious<<<
