
Using machine learning to live in the cheapest rental: the scraping part

Last updated at 2021-03-16. Posted at 2021-03-11.

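Introduction

How do you choose a place when you rent a home?
Most people probably search a major site such as SUUMO or HOMES, narrow the results by their conditions, and pick from what comes up.
But there are a lot of search results, aren't there?
It is hard to know which listing to pick, and hard to judge whether a given rent is reasonable for a given floor area.
So in this series I scrape property data from SUUMO, feed it into a machine learning model, and look for good deals.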

Environment

・macOS Big Sur
・Jupyter Lab
・Python 3.7.0
・Google Chrome

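What I did

1. Scrape property data from SUUMO (this article covers only this step)
2. Preprocess the data
3. Train and predict with LightGBM
4. Move into the cheapest home and live happily ever after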

Before scraping

Always check a site's terms of use before you scrape it.
For SUUMO, scraping was fine as long as it stayed within personal use. See the SUUMO terms of use for the details.
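
Alongside reading the terms of use, a site's robots.txt can be checked programmatically. The sketch below uses Python's standard `urllib.robotparser` on a hypothetical robots.txt (it is not SUUMO's actual file), and the `is_allowed` helper is a name made up for this example. It complements reading the terms of use and does not replace it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only (not SUUMO's actual file).
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/chintai/"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/private/page"))
```

For a live site, `RobotFileParser.set_url()` followed by `read()` downloads the real file instead of parsing a string.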

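Let's get started

I skip the Jupyter Lab setup because Qiita already has plenty of clear articles on it.
First, we scrape from SUUMO.
On the SUUMO site, pick exactly one area in the area search. Here I choose Musashino City, which brings up a long list of properties in Musashino.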

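The results appear as thumbnail cards, 30 listings per page.
Each card shows the following:
・Building name (JR Chuo Line, Mitaka Station, 4 stories, 33 years old)
・Address (Nishikubo 3, Musashino-shi, Tokyo)
・Walking time from the nearest station (15 minutes)
・Building age (33 years)
・Number of stories (4)
・Floor of the unit (2nd floor)
・Floor area (54 m^2)
・Layout (3DK)
・Rent (the target variable this time; 125,000 yen)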
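This information alone is enough to build a fairly good model, but the results page does not expose finer details such as:
・Separate bath and toilet
・Structure (reinforced concrete, steel frame, or wood)
・Pets allowed
To get them, click "詳細を見る" (View details) on a listing. Further down the detail page the property is described in considerable detail, so I scrape that page as well to add more features.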
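
As a rough illustration of how such details can become features, the sketch below turns an equipment description into 0/1 flags by checking whether each keyword occurs in the text. The sample text and the `equipment_flags` helper are hypothetical. The keywords and feature names follow the script later in this article.

```python
# Hypothetical equipment text, written in the comma-separated style of a listing page.
SAMPLE_EQUIPMENT = "バストイレ別、エレベーター、ペット相談、室内洗濯置、オートロック"

# Feature name -> keyword to look for (names and keywords follow the article's script).
KEYWORDS = {
    "separate_bath": "バストイレ別",
    "elevator": "エレベーター",
    "pet": "ペット",
    "washing_machin": "室内洗濯置",
    "auto_lock": "オートロック",
    "washroom": "洗面所独立",
}

def equipment_flags(equipment_text):
    """Return a dict of 0/1 flags, 1 when the keyword occurs anywhere in the text."""
    return {name: int(keyword in equipment_text) for name, keyword in KEYWORDS.items()}

flags = equipment_flags(SAMPLE_EQUIPMENT)
print(flags)
```

Checking for a substring in the whole text, rather than for an exact item in a split list, also matches longer entries such as "ペット相談".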

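Scraping

To scrape, right-click the element you want and choose Inspect to look at its markup.
Taking the property name as an example,

it turns out to be stored in an element with class="section_h1-header-title".
Check every element you want in the same way, then scrape them with Requests and BeautifulSoup.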
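
A minimal sketch of that lookup, run against a hypothetical HTML fragment instead of a live page. Only the class name comes from the article. The rest of the markup and the property name are made up.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment; only the class name comes from the article.
SAMPLE_HTML = """
<html><body>
  <h1 class="section_h1-header-title">Sample Heights 201</h1>
  <p class="other">unrelated text</p>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
title_tag = soup.find(class_="section_h1-header-title")
# find() returns None when nothing matches, so guard before reading .text.
house_name = title_tag.text.strip() if title_tag is not None else None
print(house_name)
```

With a real page, the HTML string would come from `requests.get(url).content` instead.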

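pd.read_html() is handy

Here is one handy function worth introducing.
It is pd.read_html() from pandas. Pass it a URL and it returns the table-like parts of that page as a list. (More precisely, it returns one DataFrame for each HTML table tag it finds.)
The example below shows what this looks like.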

import pandas as pd

url="https://suumo.jp/chintai/jnc_000031595266/?bc=100230104207"
table=pd.read_html(url) # List of DataFrames, one per table on the page
table[3] # The fourth table on the page

This grabs everything at once, after which you can pick out just the elements you need.
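
Picking out a single cell can be sketched as follows. The DataFrame here is built by hand as a stand-in for one table returned by pd.read_html(), and its values are hypothetical. It assumes the table has no header row, in which case the columns are labeled 0, 1, 2, 3.

```python
import pandas as pd

# Hand-built stand-in for one table returned by pd.read_html(); values are hypothetical.
# A table without a header row gets integer column labels 0, 1, 2, 3.
details = pd.DataFrame([
    ["所在地", "東京都武蔵野市西久保３", None, None],
    ["間取り", "3DK", "専有面積", "54m2"],
])

address = details.iloc[0, 1]  # row 0, column 1
layout = details.iloc[1, 1]   # row 1, column 1
area = details.iloc[1, 3]     # row 1, column 3
print(address, layout, area)
```

The script in this article writes the same selection as `details.iloc[0,1:4][1]`: it slices columns 1 to 3 of row 0 and then takes the value labeled 1.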

Putting the code together

# Import modules
import requests
from bs4 import BeautifulSoup
from tqdm import tqdm # Progress bar for the for loops.
import pandas as pd
import time

# Create empty lists to hold the scraped fields.
house_name=[]
price=[]
kanrihi=[]
sikikin=[]
reikin=[]
hosyokin=[]
sikibiki=[]
address=[]
nearest_station=[]
layout=[]
area=[]
age=[]
floor=[]
direction=[]
house_type=[]
construction=[]
floor_max=[]
insurance=[]
parking=[]
separate_bath=[]
elevator=[]
pet=[]
washing_machin=[]
auto_lock=[]
washroom=[]
url_syosai_lists=[]

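# Appending &page=1 to the URL shows page 1 and &page=2 shows page 2, so first loop over the page numbers.
# The 80 in range is there because there were roughly 75 pages. The page count may change during scraping, so an approximate value is fine.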

for h in tqdm(range(1,80)):
    try:
        h=str(h)
        url="https://suumo.jp/jj/chintai/ichiran/FR301FC001/?ar=030&bs=040&ta=13&sc=13203&cb=0.0&ct=9999999&et=9999999&cn=9999999&mb=0&mt=9999999&shkr1=03&shkr2=03&shkr3=03&shkr4=03&fw2=&srch_navi=1" + "&page=" + h
        urls=requests.get(url)
        urls.encoding=urls.apparent_encoding # This step is surprisingly slow.
        soup=BeautifulSoup(urls.content,"html.parser")
        syosai=soup.find_all("a",class_="js-cassette_link_href cassetteitem_other-linktext") # Get every "View details" link.

        for i in range(len(syosai)): # Take the links in syosai one at a time and build url_syosai from the i-th link's href attribute.
            id_href=syosai[i].get("href") # The URL is in the "href" attribute.
            url_syosai="https://suumo.jp" + id_href
            url_syosai_lists.append(url_syosai)
        time.sleep(1) # Wait 1 second to reduce the load on the server.

    except IndexError: # The page count is approximate, so keep going even if an error occurs.
        continue

# Once the scraping above finishes, url_syosai_lists holds the URL of each property. Take them one at a time and extract the fields.
for j in tqdm(range(len(url_syosai_lists))):
    try:
        url=url_syosai_lists[j]
        urls=requests.get(url)
        time.sleep(1) # Wait 1 second to reduce the load on the server.
        urls.encoding=urls.apparent_encoding
        soup=BeautifulSoup(urls.content,"html.parser")

        tables=pd.read_html(url) # Read the page's tables once instead of once per table.
        details_df_1=tables[2]
        details_df_2=tables[3]

        house_name.append(soup.find(class_="section_h1-header-title").text) # Property name
        price.append(soup.find(class_="property_view_note-list").find_all("span")[0].text) # Rent
        kanrihi.append(soup.find(class_="property_view_note-list").find_all("span")[1].text.split("\xa0")[1]) # Management fee. The scraped text contains "\xa0", so split on it and take the second piece.
        sikikin.append(soup.find_all(class_="property_view_note-list")[1].find_all("span")[0].text.split("\xa0")[1]) # Security deposit
        reikin.append(soup.find_all(class_="property_view_note-list")[1].find_all("span")[1].text.split("\xa0")[1]) # Key money
        hosyokin.append(soup.find_all(class_="property_view_note-list")[1].find_all("span")[2].text.split("\xa0")[1]) # Guarantee deposit
        sikibiki.append(soup.find_all(class_="property_view_note-list")[1].find_all("span")[3].text.split("\xa0")[1]) # Non-refundable part of the deposit (shikibiki)
        address.append(details_df_1.iloc[0,1:4][1]) # Address
        nearest_station.append(details_df_1.iloc[1,1:4][1]) # Nearest station and walking time
        layout.append(details_df_1.iloc[2,1:2][1]) # Layout, e.g. 1LDK
        area.append(details_df_1.iloc[2,3:4][3]) # Floor area
        age.append(details_df_1.iloc[3,1:2][1]) # Building age
        floor.append(details_df_1.iloc[3,3:4][3]) # Floor of the unit
        direction.append(details_df_1.iloc[4,1:2][1]) # Facing direction, e.g. south
        house_type.append(details_df_1.iloc[4,3:4][3]) # Building type, e.g. condominium or apartment
        construction.append(details_df_2.iloc[0,3:4][3]) # Structure, e.g. reinforced concrete or wood
        floor_max.append(details_df_2.iloc[1,1:2][1]) # Total number of floors, e.g. 2階/5階
        insurance.append(details_df_2.iloc[2,1:2][1]) # Insurance and guarantee details
        parking.append(details_df_2.iloc[2,3:4][3]) # Parking availability
        # Extracting every piece of equipment would be tedious, so pick only the ones I care about.
        setsubi=soup.find_all(class_="inline_list")[1]
        text=setsubi.get_text() # Plain text of the equipment list, without HTML tags.

        # 1 if the keyword appears anywhere in the equipment text, otherwise 0.
        separate_bath.append(int("バストイレ別" in text)) # Separate bath and toilet
        elevator.append(int("エレベーター" in text)) # Elevator
        pet.append(int("ペット" in text)) # Pets allowed
        washing_machin.append(int("室内洗濯置" in text)) # Indoor space for a washing machine
        auto_lock.append(int("オートロック" in text)) # Auto-lock entrance
        washroom.append(int("洗面所独立" in text)) # Separate washroom

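    except (AttributeError, IndexError): # Skip properties whose page is missing an expected element.
        # Note: lists already appended to for this property keep their values, so later rows can shift.
        continue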

# Convert each list to a Series, then concatenate them.
house_name_s=pd.Series(house_name)     
price_s=pd.Series(price)
kanrihi_s=pd.Series(kanrihi)
sikikin_s=pd.Series(sikikin)
reikin_s=pd.Series(reikin)
hosyokin_s=pd.Series(hosyokin)
sikibiki_s=pd.Series(sikibiki)
address_s=pd.Series(address)
nearest_station_s=pd.Series(nearest_station)
layout_s=pd.Series(layout)
area_s=pd.Series(area)
age_s=pd.Series(age)
floor_s=pd.Series(floor)
direction_s=pd.Series(direction)
house_type_s=pd.Series(house_type)
construction_s=pd.Series(construction)
floor_max_s=pd.Series(floor_max)
insurance_s=pd.Series(insurance)
parking_s=pd.Series(parking)
separate_bath_s=pd.Series(separate_bath)
elevator_s=pd.Series(elevator)
pet_s=pd.Series(pet)
washing_machin_s=pd.Series(washing_machin)
auto_lock_s=pd.Series(auto_lock)
washroom_s=pd.Series(washroom)

df=pd.concat([house_name_s,price_s,kanrihi_s,sikikin_s,reikin_s,hosyokin_s,sikibiki_s,\
              address_s,nearest_station_s,layout_s,area_s,age_s,floor_s,direction_s,house_type_s,\
              construction_s,floor_max_s,insurance_s,parking_s,separate_bath_s,\
              elevator_s,pet_s,washing_machin_s,auto_lock_s,washroom_s],axis=1)

df.columns=["house_name","price","kanrihi","sikikin","reikin","hosyokin","sikibiki","address",\
            "nearest_station","layout","area","age","floor","direction","house_type","construction","floor_max",\
           "insurance","parking","separate_bath","elevator","pet","washing_machin","auto_lock","washroom"]

df.to_csv('suumo.csv',header=True, index=False) # Save as CSV
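
The first loop in the script builds each results-page URL by appending &page=N to a fixed search URL. A minimal sketch of the same idea using the standard library's `urlencode`: the `page_url` helper is hypothetical, and `SEARCH_PARAMS` is a shortened subset of the article's query string chosen for readability, so it is an assumption that the site would accept it as-is.

```python
from urllib.parse import urlencode

# Base of the results URL and a shortened, illustrative subset of its query parameters.
BASE_URL = "https://suumo.jp/jj/chintai/ichiran/FR301FC001/"
SEARCH_PARAMS = {"ar": "030", "bs": "040", "ta": "13", "sc": "13203"}

def page_url(page, base_url=BASE_URL, params=SEARCH_PARAMS):
    """Return the results URL for the given 1-based page number."""
    query = dict(params, page=page)
    return base_url + "?" + urlencode(query)

urls = [page_url(n) for n in range(1, 4)]
print(urls[0])
```

Keeping the parameters in a dict makes it easier to switch the search area later without editing a long URL string by hand.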

When scraping finishes, you get data like the following. (Musashino City has a lot of properties, so I switched to Nishitama District.)
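
Each row of that data should describe exactly one property. Because the script appends to one list per column, an error partway through a property can leave the lists with different lengths. One way to avoid this, sketched here under simplified assumptions, is to build the whole row first and append it only when every field was found. The sample pages, the field list, and the `extract` helper are hypothetical stand-ins for the real parsing.

```python
import csv
import io

# Hypothetical parsed pages: the second one lacks "price", as if an element were missing.
pages = [
    {"house_name": "Sample A", "price": "12.5万円"},
    {"house_name": "Sample B"},
    {"house_name": "Sample C", "price": "8万円"},
]
FIELDS = ["house_name", "price"]

def extract(page):
    """Build one complete row, raising KeyError if any field is missing."""
    return {field: page[field] for field in FIELDS}

rows = []
for page in pages:
    try:
        row = extract(page)
    except KeyError:
        continue  # Skip the whole property; nothing was appended for it yet.
    rows.append(row)

# Write the complete rows as CSV (to an in-memory buffer here instead of a file).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

A list of dicts like `rows` can also be passed directly to `pd.DataFrame(rows)`.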

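Incidentally, scraping takes about 5 hours, so I recommend running it on a remote server. (Nishitama District took just under an hour.)
I set up an Ubuntu server on AWS EC2, installed Anaconda, and ran the script there.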
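
To see where a figure of about 5 hours can come from, here is a back-of-the-envelope estimate. The 75 pages and 30 listings per page come from this article. The 8 seconds per listing (the one-second sleep plus download and parsing time) is an assumed value chosen for illustration, not a measurement, and `estimate_hours` is a hypothetical helper.

```python
def estimate_hours(pages, listings_per_page, seconds_per_listing):
    """Rough scraping time in hours for the detail pages alone."""
    total_listings = pages * listings_per_page
    return total_listings * seconds_per_listing / 3600

# 75 pages and 30 listings per page come from the article;
# 8 seconds per listing is an assumed figure, not a measurement.
hours = estimate_hours(pages=75, listings_per_page=30, seconds_per_listing=8)
print(hours)
```

Since the total scales linearly with the number of listings, choosing a smaller area is the simplest way to shorten a run.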
Next time, I will preprocess this data and train a model on it.
That's all. Thank you for reading.

* I had forgotten to include age when concatenating at the end. The code has been corrected. My apologies. 2021/03/16
