Saving crawled text and images locally #Python

Saving data fetched from the web

Install requests and BeautifulSoup

$ pip install beautifulsoup4

$ pip install requests

Writing a txt file

Write items such as news headlines to a local txt file, one item per line.

title.py


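import time
import requests
from bs4 import BeautifulSoup

url = 'URL'  # page to crawl
res = requests.get(url)
soup = BeautifulSoup(res.content, 'html.parser')

# Replace "tag", "attribute" and "value" with the element you want to extract
for a in soup.find_all("tag", attrs={"attribute": "value"}):
    try:
        # get_text() also works when the tag has child tags (.string would be None)
        h1_text = a.get_text(strip=True)
        print(h1_text)
        # Append one line per item; 'file_path' is the output file
        with open('file_path', 'a', encoding='utf-8') as f:
            f.write(h1_text + "\n")
    except Exception as e:
        print(e)
    finally:
        time.sleep(2)  # pause after each item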

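Specify the output file (the first argument to open()) as an absolute path. Backslashes must be escaped inside a Python string literal, so

C:\workspace\sample.txt

is written as

c:\\workspace\\sample.txt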
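As a rough sketch of the path handling described here, a raw string literal or pathlib avoids doubling the backslashes by hand. The helper name append_lines is hypothetical and made up for illustration; it is one possible way to write lines, not the only one.

```python
from pathlib import Path, PureWindowsPath

# A raw string literal keeps backslashes as-is, so both spellings are equal.
assert r'C:\workspace\sample.txt' == 'C:\\workspace\\sample.txt'

# PureWindowsPath parses Windows-style paths on any OS (no file access).
win_path = PureWindowsPath(r'C:\workspace\sample.txt')
print(win_path.name)    # sample.txt
print(win_path.parent)  # C:\workspace


def append_lines(path, lines):
    """Append each non-empty line to `path`, one per line, as UTF-8.

    Hypothetical helper for illustration. None entries (for example a tag
    whose .string is None) are skipped instead of raising TypeError.
    """
    path = Path(path)
    with path.open('a', encoding='utf-8') as f:
        for line in lines:
            if line:
                f.write(line + '\n')
```

Passing encoding='utf-8' explicitly is a design choice here: it keeps Japanese headlines readable regardless of the default encoding of the machine running the script.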

Writing a jpg file

Download the images on a site and save them locally.

jpg.py

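import time
import shutil
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

url = 'URL'  # page to crawl
res = requests.get(url)
soup = BeautifulSoup(res.content, 'html.parser')
i = 0  # running index used in the saved file names

def download_img(url, file_name):
    # Stream the response body straight into a local file
    r = requests.get(url, stream=True)
    if r.status_code == 200:
        with open(file_name, 'wb') as f:
            r.raw.decode_content = True  # decode gzip/deflate-compressed responses
            shutil.copyfileobj(r.raw, f)

def download():
    global i
    for tag in soup.find_all("img"):
        try:
            src = tag.get("src")  # image location; may be relative to the page
            if not src:
                continue
            # urljoin turns a relative src into an absolute URL
            download_img(urljoin(url, src), 'file_name%d.jpg' % i)
            print(src)
            i = i + 1
        except Exception as e:
            print(e)
        finally:
            time.sleep(2)  # wait between downloads

try:
    while True:
        # Start downloading once the expected tag is found on the page
        if soup.find("tag_name", attrs={"src": "/common/img/blank.gif"}):
            download()
            break
        else:
            # Otherwise wait, then fetch the same page again
            time.sleep(2)
            res = requests.get(url)
            soup = BeautifulSoup(res.content, 'html.parser')
except Exception as e:
    print(e)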

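When I tried this, the site occasionally failed to show its images right after an update, so the script starts downloading only once it finds the expected tag.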
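A bounded variant of this wait-for-the-tag loop can be sketched as follows. It is only an illustration under assumptions: the function name fetch_until and its parameters (max_tries, delay, sleep) are hypothetical and not part of requests or BeautifulSoup. Giving up after a fixed number of attempts and pausing between them keeps the script from re-requesting the page indefinitely when the tag never appears.

```python
import time


def fetch_until(fetch, is_ready, max_tries=5, delay=2.0, sleep=time.sleep):
    """Call fetch() until is_ready(result) is true or max_tries is reached.

    Returns the first ready result, or None if every attempt failed.
    All names here are hypothetical; `sleep` is injectable for testing.
    """
    for attempt in range(max_tries):
        result = fetch()
        if is_ready(result):
            return result
        if attempt < max_tries - 1:
            sleep(delay)  # wait before asking the server again
    return None
```

In the script, fetch could be a function that requests the page and returns the parsed soup, and is_ready could check that soup.find(...) returns something other than None.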

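Closing notes

Crawling is convenient, but some sites prohibit it in their terms of service, so check the terms before you crawl.
Also, wait at least one second between requests and keep your crawling polite.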
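One way to enforce the minimum wait is a small throttle object that is called before every request. This is a minimal sketch, assuming a single-threaded script; the class name Throttle and its parameters are hypothetical, and the clock and sleep arguments exist only so the behavior can be checked without real waiting.

```python
import time


class Throttle:
    """Keep at least `min_interval` seconds between successive calls to wait()."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                # Sleep only for the part of the interval not yet elapsed
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling throttle.wait() immediately before each requests.get(...) spaces the requests out, while time already spent parsing or writing files counts toward the interval.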