
Just collecting some images with Python

Last updated at 2019-12-08 / Posted at 2019-12-08

Introduction

This article is day 9 of the SLP KBIT Advent Calendar 2019.

I did some image scraping with Python.
I made a tool that shows every image on a site once you paste in the site's URL.

Environment

- Python 3.7.4

Full code

Folder structure

├── cgi-bin
│   └── image_scraping.py
├── style.css
└── index.html

image_scraping.py

# for scraping
import requests
from bs4 import BeautifulSoup
# for receiving form data
import cgi


# --- HTML that is output when the search form button is pressed ---
html_body1 = """
<!DOCTYPE html>
<html>
    <head>
        <title>Image scraping practice</title>
        <style>    /* CSS for the heading */
            h1 {
                position: relative;
                padding: 0.6em;
                background: #e0edff;
            }
            
            h1:after {
                position: absolute;
                content: '';
                top: 100%;
                left: 30px;
                border: 15px solid transparent;
                border-top: 15px solid #e0edff;
                width: 0;
                height: 0;
            }
        </style>
    </head>
    <body>
        <h1> Here are the collected images!</h1>
"""
# this part is repeated to line up the images
html_img = """
            <img src = "%s">
"""
html_body2 = """
    </body>
</html>
"""
# --- Download the web page at the URL received from the form ---
form = cgi.FieldStorage()  # receive the form data
url = form.getvalue('text', '')  # the URL expected to be entered in the form

res = requests.get(url)  # download the web page
res.raise_for_status()  # raise an exception here on an HTTP error status

soup = BeautifulSoup(res.text, "html.parser")  # name the parser explicitly (bs4 warns otherwise)
image_elem = soup.select("img")  # get all img tags

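# first part of the html
# CGI response header; the blank line that opens html_body1 ends the headers
print("Content-Type: text/html")
print(html_body1)

# --- Extract each src= and write the images into the html ---
for tmp in image_elem:
    # read the src value through attrs; get() gives None if the img has no src
    src_fact = tmp.attrs.get("src")
    if not src_fact:
        print("Image not found...")
    else:
        # output the image
        print(html_img % (src_fact))

# last part of the html
print(html_body2)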

index.html

<!DOCTYPE html>
<html>
    <head>
        <title>Image scraping practice</title>
        <link rel="stylesheet" href="style.css">
    </head>
    <body>
        <h1>Show the images on a site!</h1>
        <form id = "form1" action="cgi-bin/image_scraping.py" method="POST">
            <input id = "input_box" type="text" name="text" placeholder="Paste a URL here" />
            <input id="button1" type="submit" name="submit" value = "Search!" />
        </form>
    </body>
</html>

* **This part is a bonus.** I wanted it to look a little cuter, so I copy-pasted some CSS ([heading](https://saruwakakun.com/html-css/reference/h-design#section1), [search form](https://kagesai.net/search-form-design/)).

style.css

/* heading */
h1 {
    position: relative;
    padding: 0.6em;
    background: #e0edff;
}
  
h1:after {
    position: absolute;
    content: '';
    top: 100%;
    left: 30px;
    border: 15px solid transparent;
    border-top: 15px solid #e0edff;
    width: 0;
    height: 0;
}
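/* whole form */
#form1{
    position:relative;/* relative positioning for the form */
    max-width:300px;/* form size */
    margin-bottom:15px;/* margin below the form */
}
/* search box */
#input_box{
    position:absolute;/* absolute positioning inside the form */
    right:20;
    top:0;
    outline:0;/* remove the blue outline on click */
    height:50px;/* search box height */
    width: 320px;
    padding:0 10px;/* text position adjustment */
    border-radius:2px 0 0 2px;/* round the corners of the search box */
    background:#eee;/* search box background color */
}
/* search button */
#button1{
    width:70px;/* search button width */
    height:50px;/* search button height */
    position:absolute;/* absolute positioning for the search button */
    left:350px;/* search button position adjustment */
    top:3px;
    border-radius:0 2px 2px 0;/* round the corners of the search button */
    background:#7fbfff;/* search button background color */
    border:none;/* remove the search button border */
    color:#fff;/* search button text color */
    font-weight:bold;/* bold search button text */
    font-size:16px;/* search button font size */
}
/* search button on mouse-over */
#button1:hover{
    color:#666;/* font color on mouse-over */
}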

Scraping the images

Preparation

First, install Beautiful Soup 4, which is needed for scraping.

$ pip install beautifulsoup4

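I installed it while following the Beautiful Soup 4.4.0 documentation.
The requests module used below is a separate package, so it also has to be installed with pip if it is not there yet.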

Downloading the web page

The requests module is used to download files from a website.

Import the requests module.

import requests

Download the web page with the requests.get() function.
url is assumed to hold the site's URL. How it gets there is explained later.

# the value of url is received from the form data
res = requests.get(url)

This is the error handling.
If the download comes back with an HTTP error status (4xx or 5xx), raise_for_status() raises an exception.

res.raise_for_status()
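
As a rough sketch of how this error handling could be wrapped up, the hypothetical helper fetch_html below (not part of the original script) catches requests.exceptions.RequestException, the base class of the errors raised by requests.get() and raise_for_status(), and returns None instead of letting the CGI script crash. The 10-second timeout is an assumed value.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page text, or None if the download fails.

    fetch_html and the timeout value are illustrative assumptions,
    not part of the original script.
    """
    try:
        res = requests.get(url, timeout=timeout)
        res.raise_for_status()  # raises HTTPError on a 4xx/5xx status
    except requests.exceptions.RequestException:
        # covers invalid URLs, connection errors, timeouts and HTTP errors
        return None
    return res.text
```

With this shape, the caller can check for None and print a short message in the page rather than a traceback.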

Searching for img tags

The BeautifulSoup class is used to parse the HTML.
First, import BeautifulSoup from the bs4 module.

from bs4 import BeautifulSoup

Next, create a BeautifulSoup object.
It receives the earlier res as text (res.text).

soup = BeautifulSoup(res.text, "html.parser")

Elements are found with the select() method.

What I want this time are the src attributes of the img tags, so I pass img to select() and get a list of the matching tags.

image_elem = soup.select("img")

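Extracting src

Pick the src out of each item in the image_elem list.
There is more than one src, so loop with a for statement.

I used attrs to store the contents of src= in src_fact.
I found a page that explains attrs, but I could not really understand it. (Someone please explain...)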

for tmp in image_elem:
    # get() gives None if this img tag has no src attribute
    src_fact = tmp.attrs.get("src")
    if not src_fact:
        print("Image not found...")
    else:
        # the output step goes here
        pass
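
On the attrs question: in Beautiful Soup, each tag's attrs is an ordinary Python dict that maps attribute names to their values, so reading src through attrs is a plain dictionary lookup. The sketch below uses a made-up HTML string (sample_html is an assumption for illustration) to show that, along with Tag.get(), which returns None rather than raising KeyError when an attribute is missing.

```python
from bs4 import BeautifulSoup

# sample_html is made up for illustration
sample_html = '<img src="a.png" alt="first"><img alt="no source">'
soup = BeautifulSoup(sample_html, "html.parser")
tags = soup.select("img")

# attrs is a dict holding every attribute of the tag
first_attrs = tags[0].attrs
print(first_attrs)

# tag["src"] is shorthand for tag.attrs["src"]
first_src = tags[0]["src"]
print(first_src)

# get() returns None when the attribute is missing
missing_src = tags[1].get("src")
print(missing_src)
```

Since an img tag without src is legal HTML, the get() form is the safer choice inside the loop.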

Exchanging form data

Preparation

First, prepare a cgi-bin folder and put image_scraping.py in it.

Next, prepare index.html.
Add a form and a submit button. CSS is also used, so style.css is linked inside head.

index.html

<!DOCTYPE html>
<html>
    <head>
        <title>Image scraping practice</title>
        <link rel="stylesheet" href="style.css">
    </head>
    <body>
        <h1>Show the images on a site!</h1>
        <form id = "form1" action="cgi-bin/image_scraping.py" method="POST">
            <input id = "input_box" type="text" name="text" placeholder="Paste a URL here" />
            <input id="button1" type="submit" name="submit" value = "Search!" />
        </form>
    </body>
</html>

The cgi module

To pass data between pages with Python, I use the cgi module, with the script run by a CGI-capable server.

import cgi

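Receiving the form data

cgi.FieldStorage() receives the form data.
form.getvalue('text','') returns the value of the text field as a string (or '' if the field is missing), and that value is stored in url.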

image_scraping.py

form = cgi.FieldStorage()
url = form.getvalue('text','')
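
Because the default passed to getvalue() is an empty string, url can be empty or not a web address at all. Below is a minimal sketch of a check that could run before the download. The helper name is_http_url is hypothetical, and it only looks at the scheme and host, not at whether the site actually exists.

```python
from urllib.parse import urlparse


def is_http_url(url):
    """Return True if url has an http/https scheme and a host part."""
    parsed = urlparse(url.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


print(is_http_url("https://example.com/page"))  # True
print(is_http_url(""))                          # False
```
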

Displaying the images

The HTML is written inside the Python script.

image_scraping.py

html_body1 = """
<!DOCTYPE html>
<html>
    <head>
        <title>Image scraping practice</title>
        <style>/* style for the heading */
            h1 {
                position: relative;
                padding: 0.6em;
                background: #e0edff;
            }
            
            h1:after {
                position: absolute;/* absolute positioning for the heading */
                content: '';
                top: 100%;
                left: 30px;
                border: 15px solid transparent;
                border-top: 15px solid #e0edff;
                width: 0;
                height: 0;
            }
        </style>
    </head>
    <body>
        <h1> Here are the collected images!</h1>
"""
html_img = """
            <img src = "%s">
"""
html_body2 = """
    </body>
</html>
"""
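
The %s in html_img receives text taken from another site, so a src containing a double quote would break out of the attribute. One possible precaution is sketched here with the standard library's html.escape. The helper name img_tag is an assumption, and the template is copied from html_img above.

```python
import html

html_img = """
            <img src = "%s">
"""


def img_tag(src):
    """Fill the html_img template with an escaped src value."""
    # quote=True also escapes " and ', so the value stays inside the attribute
    return html_img % html.escape(src, quote=True)


print(img_tag("a.png"))
```
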

Print html_body1.

Next, html_img is printed repeatedly, with each src_fact value substituted in, so that multiple images are displayed.

Print html_body2 and that's it.

There is probably a smarter way to do this...

image_scraping.py


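# first part of the html
# CGI response header; the blank line that opens html_body1 ends the headers
print("Content-Type: text/html")
print(html_body1)

# --- Extract each src= and write the images into the html ---
for tmp in image_elem:
    # get() gives None if this img tag has no src attribute
    src_fact = tmp.attrs.get("src")
    if not src_fact:
        print("Image not found...")
    else:
        # output the image!
        print(html_img % (src_fact))

# last part of the html
print(html_body2)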
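
One thing the loop above does not handle: many pages write src as a relative path such as /img/logo.png, which the browser would resolve against localhost rather than the scraped site. Below is a hedged sketch that uses urllib.parse.urljoin to turn each src into an absolute URL. The function name absolute_srcs and the sample values are made up for illustration.

```python
from urllib.parse import urljoin


def absolute_srcs(page_url, srcs):
    """Resolve each src against the URL of the page it came from."""
    return [urljoin(page_url, src) for src in srcs]


# sample values, made up for illustration
page_url = "https://example.com/blog/post.html"
srcs = ["/img/logo.png", "photo.jpg", "https://cdn.example.org/a.png"]
resolved = absolute_srcs(page_url, srcs)
for item in resolved:
    print(item)
```

In the script, page_url would be the url value from the form, and each resolved item would take the place of src_fact when filling html_img.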

(Looking kind of nice...)

It felt like something was missing, so I applied the copy-pasted CSS, and it's done.

style.css

/* heading */
h1 {
    position: relative;/* relative positioning for the heading */
    padding: 0.6em;
    background: #e0edff;
}
  
h1:after {
    position: absolute;/* absolute positioning for the heading */
    content: '';
    top: 100%;
    left: 30px;
    border: 15px solid transparent;
    border-top: 15px solid #e0edff;
    width: 0;
    height: 0;
}
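/* whole form */
#form1{
    position:relative;/* relative positioning for the form */
    max-width:300px;/* form size */
    margin-bottom:15px;/* margin below the form */
}
/* search box */
#input_box{
    position:absolute;/* absolute positioning inside the form */
    right:20;
    top:0;
    outline:0;/* remove the blue outline on click */
    height:50px;/* search box height */
    width: 320px;
    padding:0 10px;/* text position adjustment */
    border-radius:2px 0 0 2px;/* round the corners of the search box */
    background:#eee;/* search box background color */
}
/* search button */
#button1{
    width:70px;/* search button width */
    height:50px;/* search button height */
    position:absolute;/* absolute positioning for the search button */
    left:350px;/* search button position adjustment */
    top:3px;
    border-radius:0 2px 2px 0;/* round the corners of the search button */
    background:#7fbfff;/* search button background color */
    border:none;/* remove the search button border */
    color:#fff;/* search button text color */
    font-weight:bold;/* bold search button text */
    font-size:16px;/* search button font size */
}
/* search button on mouse-over */
#button1:hover{
    color:#666;/* font color on mouse-over */
}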

Results

The code is written, so let's run it.

$ python -m http.server --cgi

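For now, open http://localhost:8000/.

It was displayed like this.

Now paste a URL into the search form.
I pasted the URL of a random site with lots of illustrations.

Then press the search button.

The images seem to be displayed properly.
Success! Yay!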

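Conclusion

I panicked because I was making so little progress.
I have nothing to do with the collected images yet, so I'd like to think of something.

Anyway, I did my best. Applause! 👏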

References

Easy image scraping tutorial with Python
https://to-be-loved.net/python-scraping-tutorial/

Beautiful Soup 4.4.0 documentation
https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Running CGI with Python (Python2, Python3)
https://dackdive.hateblo.jp/entry/2016/01/22/100000

Passing values between pages with the cgi module in Python 3.5
https://qiita.com/shuichi0712/items/84427a7722463a5cb4dd

What is robots.txt? Examples of how to write it and points to note
https://croja.jp/knowledge/robots_txt

(Last accessed: 12/08)
