
Scraping 100 Fortnite Images

Last updated at 2020-03-26, posted at 2020-03-22

I tried scraping 100 Fortnite images from Yahoo.

・Mac
・python3

(1) Environment setup and directory structure

Create a directory named fortnite on the desktop.
Inside it, create an images folder (for the saved images) and scraping.py.

fortnite
├scraping.py
└images
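
The same layout can also be created from Python. The following is a minimal sketch, assuming a hypothetical helper `create_project_layout` (not part of the original article); a temporary directory stands in for the desktop so that running it leaves nothing behind.

```python
from pathlib import Path
import tempfile


def create_project_layout(base_dir):
    """Create fortnite/, fortnite/images/ and an empty fortnite/scraping.py."""
    project = Path(base_dir) / "fortnite"
    # parents=True also creates fortnite/ itself if it does not exist yet
    (project / "images").mkdir(parents=True, exist_ok=True)
    (project / "scraping.py").touch(exist_ok=True)
    return project


# A temporary directory is used here instead of the real desktop (assumption for the demo).
with tempfile.TemporaryDirectory() as tmp:
    project = create_project_layout(tmp)
    print(sorted(p.name for p in project.iterdir()))  # ['images', 'scraping.py']
```

To build the real layout, the temporary directory could be replaced with something like `Path.home() / "Desktop"`, depending on where the desktop folder lives on the machine.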

Create a virtual environment inside the directory.

python3 -m venv .
source bin/activate

Install the required packages and modules.

pip install beautifulsoup4
pip install requests
pip install lxml
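
To confirm that the installation worked inside the virtual environment, a quick check along these lines may help. It is only a sketch: `check_packages` is a hypothetical helper, and it relies on the fact that the beautifulsoup4 package is imported under the module name `bs4`.

```python
import importlib.util


def check_packages(module_names=("bs4", "requests", "lxml")):
    """Return a dict mapping each module name to True if it can be imported."""
    return {
        name: importlib.util.find_spec(name) is not None
        for name in module_names
    }


for name, available in check_packages().items():
    print(name, "OK" if available else "missing")
```

Any module reported as missing can be installed again with the matching pip command above.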

(2) Writing scraping.py

The Fortnite images are scraped from Yahoo's image search results.
https://search.yahoo.co.jp/image/search?p=%E3%83%95%E3%82%A9%E3%83%BC%E3%83%88%E3%83%8A%E3%82%A4%E3%83%88&ei=UTF-8&b=1
Each page holds 20 images, and together with the following pages there are well over 100.
The script scrapes from here and stores the images in the images folder.
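
The search URL above can be rebuilt from its parts, which makes the paging easier to see. This is a sketch under assumptions: `build_search_url` and `IMAGES_PER_PAGE` are hypothetical names, and reading the `b` parameter as a 1-based start offset that advances by 20 per page is inferred from the script below, not from any Yahoo documentation.

```python
from urllib.parse import urlencode

BASE_URL = "https://search.yahoo.co.jp/image/search"
IMAGES_PER_PAGE = 20  # assumption: matches the page step used in scraping.py


def build_search_url(keyword, page_index):
    """Return the image search URL for a zero-based page index."""
    params = {
        "p": keyword,                            # search keyword, percent-encoded as UTF-8
        "ei": "UTF-8",                           # input encoding
        "b": page_index * IMAGES_PER_PAGE + 1,   # assumed 1-based offset of the first result
    }
    return BASE_URL + "?" + urlencode(params)


for page_index in range(3):
    print(build_search_url("フォートナイト", page_index))
```

The first URL printed is identical to the one shown above, which confirms that the long `%E3%83%95...` string is simply the keyword フォートナイト in percent-encoded UTF-8.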

scraping.py

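from bs4 import BeautifulSoup
import requests
import os
import time


def main():
    # Each page has 20 images; page_key is the offset used to request the next page
    page_key = 0

    # Counter used to number the saved images
    num_m = 0

    # Absolute path of the images folder next to this script
    save_dir = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'images')

    for _ in range(6):
        URL = "https://search.yahoo.co.jp/image/search?p=%E3%83%95%E3%82%A9%E3%83%BC%E3%83%88%E3%83%8A%E3%82%A4%E3%83%88&ei=UTF-8&b={}".format(page_key + 1)
        res = requests.get(URL)
        res.encoding = res.apparent_encoding
        html_doc = res.text
        # "lxml" selects the lxml parser, so the lxml package must be installed
        soup = BeautifulSoup(html_doc, "lxml")

        # Collect the src attribute of every img tag inside div.gridmodule
        image_urls = []
        for div in soup.find_all("div", class_="gridmodule"):
            for img in div.find_all('img'):
                src = img.get('src')
                # Skip img tags that have no src attribute
                if src:
                    image_urls.append(src)

        for image_url in image_urls:
            image_res = requests.get(image_url)
            # Save using an absolute path
            with open(os.path.join(save_dir, '{}.jpeg'.format(num_m)), 'wb') as f:
                f.write(image_res.content)
            num_m += 1
            # Stop saving once 100 images have been written (ends the inner for loop)
            if num_m == 100:
                break

        # The else branch runs only when the inner loop finished without break
        else:
            # Wait one second to avoid putting load on the server
            time.sleep(1)
            # Move on to the next page
            page_key += 20
            continue
        # Reached only when the inner loop was stopped by break: stop the outer loop too
        break


if __name__ == '__main__':
    main()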

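Supplementary notes

・Using "Inspect" in Google Chrome to look for where the image URLs might be, I found them inside div tags whose class is gridmodule. The img tags are scraped from there.
・get('src') returns the value of the src attribute of an img tag.
・The src value obtained from the img tag is a URL, but it is only a str, so it is passed to requests to get a response object holding the response data. The response object includes text, encoding, status_code, and content. content is what returns the response body in binary form. (Reference: "How to use Requests (Python Library)")
・The file is opened with an absolute path and written in wb mode. (Reference: "Python: about os operations")
・Once 100 images have been saved, both the inner and the outer for loop are stopped. (Reference: "break in Python for loops (termination conditions)")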
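
The last note, stopping two nested for loops at once, relies on Python's `for ... else` clause. The sketch below isolates that pattern with plain lists instead of web pages; `collect_until_limit` and the sample data are hypothetical and only stand in for the download loop.

```python
def collect_until_limit(pages, limit):
    """Collect items page by page and stop both loops once `limit` items are collected."""
    collected = []
    for page in pages:
        for item in page:
            collected.append(item)
            if len(collected) == limit:
                break  # leaves the inner loop only
        else:
            # Runs only when the inner loop finished without break:
            # go on to the next page.
            continue
        # Reached only when the inner loop was stopped by break.
        break
    return collected


pages = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]
print(collect_until_limit(pages, 5))  # ['a', 'b', 'c', 'd', 'e']
```

Anything that should happen between pages, such as waiting or advancing the page offset, has to run before the `continue` in the `else` branch; code placed after the final `break` is never reached.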

Running the script confirms that the images are saved in the images folder.
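
Instead of opening the folder by hand, the result could also be checked by counting the saved files. This is a small sketch with a hypothetical helper `count_saved_images`; the demo writes three empty placeholder files into a temporary directory rather than touching the real images folder.

```python
from pathlib import Path
import tempfile


def count_saved_images(images_dir, suffix=".jpeg"):
    """Count the files in images_dir whose name ends with the given suffix."""
    return sum(
        1
        for path in Path(images_dir).iterdir()
        if path.is_file() and path.suffix == suffix
    )


# Demo with placeholder files in a temporary directory (assumption for illustration).
with tempfile.TemporaryDirectory() as tmp:
    for n in range(3):
        (Path(tmp) / "{}.jpeg".format(n)).write_bytes(b"")
    print(count_saved_images(tmp))  # 3
```

For the actual check, the function would be pointed at the images folder that sits next to scraping.py.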
