
A Program That Automatically Saves Images from Nogizaka46 Blogs

Posted at 2020-06-26

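Introduction

Hello.
I'm @tfujitani, a Sakamichi fan.
I wrote a Python program that automatically saves every image from the blog of a Nogizaka46 member you specify, and I'm publishing it here.
Incidentally, the "kikkake" (trigger) for writing this program was that my oshi, Sayuri Inoue, was graduating (and now has). ("Kikkake" is a great song, isn't it?)
I wrote the program for that occasion and am only now getting around to publishing it.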

What I did

Tools used

・Python
・BeautifulSoup

Installing the scraping environment

pip install requests
pip install beautifulsoup4

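Python code

The script scrapes the blog with Beautiful Soup on Python 3.
You choose the member by the ID that appears in her blog URL.
For Manatsu Akimoto ( http://blog.nogizaka46.com/manatsu.akimoto/ ) it is "manatsu.akimoto";
for Riria Ito ( http://blog.nogizaka46.com/riria.itou/ ) it is "riria.itou".
You can also set the first and last month of the period you want to save.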
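As a rough sketch of how the member ID and the period turn into monthly archive URLs, the snippet below lists one `?d=YYYYMM` URL per month. The helper name `monthly_urls` and the `(year, month)` tuple arguments are assumptions made for illustration; they are not part of nogiblog.py.

```python
DOMAIN = "http://blog.nogizaka46.com/"  # same base URL as in nogiblog.py


def monthly_urls(member, start, end):
    """Return one archive URL per month from start to end, inclusive.

    start and end are (year, month) tuples. This is a hypothetical
    helper for illustration, not a function from the original script.
    """
    urls = []
    year, month = start
    while (year, month) <= end:
        # The month is zero-padded to two digits, e.g. ?d=202001
        urls.append(f"{DOMAIN}{member}/?d={year}{month:02d}")
        # Roll over from December to January of the next year
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return urls


urls = monthly_urls("riria.itou", (2019, 11), (2020, 2))
for u in urls:
    print(u)
```

Comparing `(year, month)` tuples also means an end date earlier than the start date simply yields an empty list instead of looping forever.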

nogiblog.py

# coding:utf-8
import sys
import urllib.request
from time import sleep

import requests
from bs4 import BeautifulSoup

domain = "http://blog.nogizaka46.com/"
member = "manatsu.akimoto"  # member whose blog is saved
url = domain + member + "/"
headers = {"User-Agent": "Mozilla/5.0"}


def getImages(soup, cnt, year, month):
    """Save every image found in the page's entries and return the updated counter."""
    member_path = "./" + member  # files are named e.g. ./manatsu.akimoto201301-1.jpeg
    for entry in soup.find_all("div", class_="entrybody"):  # every blog entry body
        for img in entry.find_all("img"):  # every image in the entry
            imgurl = img.get("src")
            if not imgurl:
                continue  # skip <img> tags without a src attribute
            cnt += 1
            imgurlnon = imgurl.replace('https', 'http')  # download over plain http
            filename = member_path + f"{year}{month:02d}" + "-" + str(cnt) + ".jpeg"
            try:
                urllib.request.urlretrieve(imgurlnon, filename)
            except Exception:
                print("error", imgurlnon)
    return cnt


if __name__ == "__main__":
    # first month to save
    year = 2012
    month = 12
    # last month to save
    endyear = 2020
    endmonth = 6

    while True:
        BlogPageURL = url + "?d=" + f"{year}{month:02d}"  # e.g. ?d=201301
        soup = BeautifulSoup(requests.get(BlogPageURL, headers=headers).content, 'html.parser')  # fetch the HTML
        print(year, month)
        sleep(3)
        cnt = 0
        ht = soup.find_all("div", class_="paginate")
        print("ht", ht)
        cnt = getImages(soup, cnt, year, month)  # save the images on page 1
        if len(ht) > 0:  # the month spans several pages: save the images on each of them
            ht_url = ht[0]
            print(ht_url)
            url_all = ht_url.find_all("a")
            for hturl in url_all[:-1]:  # the last link is the "next" arrow, which repeats page 2
                link = hturl.get("href")
                print("url", url + link)
                soup = BeautifulSoup(requests.get(url + link, headers=headers).content, 'html.parser')
                sleep(3)
                cnt = getImages(soup, cnt, year, month)  # keep counting so file names stay unique
        if (year, month) >= (endyear, endmonth):
            print("Finish")
            sys.exit()  # end of program
        if month == 12:
            month = 1
            year = year + 1
        else:
            month = month + 1
        print("update", year, month)

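Incidentally, the branch that checks `len(ht)` handles months whose posts span several pages. Illustrated with a screenshot, it looks like this.

The screenshot shows Manatsu Akimoto's blog for January 2013. The script saves the images on page 1, then collects the links to pages 2, 3, and 4 and saves the images on each of those pages.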

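Results

When I ran the script on Manatsu Akimoto's blog, I confirmed that the images were saved as shown below.
Note: the images for January 2013 are missing because the original images uploaded to the blog are themselves missing.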
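For reference, the file names follow from the string concatenation in `getImages`: the member ID, the year, the zero-padded month, a hyphen, and a running counter, always with a `.jpeg` extension. Because `member_path` has no trailing slash, the files land in the current directory with the member ID as a prefix. The sketch below reproduces that pattern; `image_filename` is a hypothetical helper, not part of nogiblog.py.

```python
def image_filename(member, year, month, cnt):
    """Build the name given to the cnt-th image of a month.

    Hypothetical helper that mirrors the concatenation in getImages.
    """
    return f"./{member}{year}{month:02d}-{cnt}.jpeg"


names = [image_filename("manatsu.akimoto", 2013, 1, n) for n in (1, 2)]
names.append(image_filename("manatsu.akimoto", 2012, 12, 1))
for name in names:
    print(name)
```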

Since `ht` in the program above may be hard to follow, here is the output for that part.
It is a little hard to read, but it shows the pagination block that is scraped for each month.

ht 
[<div class="paginate"> 1  | <a href="?p=2&amp;d=201301"> 2 </a> | <a href="?p=3&amp;d=201301"> 3 </a> | <a href="?p=4&amp;d=201301"> 4 </a> | <a href="?p=2&amp;d=201301">＞</a></div>, <div class="paginate"> 1  | <a href="?p=2&amp;d=201301"> 2 </a> | <a href="?p=3&amp;d=201301"> 3 </a> | <a href="?p=4&amp;d=201301"> 4 </a> | <a href="?p=2&amp;d=201301">＞</a></div>]

After that, you can see that once page 1 has been scraped, the script goes on to scrape pages 2 through 4:

url http://blog.nogizaka46.com/manatsu.akimoto/?p=2&d=201301
url http://blog.nogizaka46.com/manatsu.akimoto/?p=3&d=201301
url http://blog.nogizaka46.com/manatsu.akimoto/?p=4&d=201301
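To make the step from the `paginate` markup to these three URLs concrete, here is a minimal offline sketch. It swaps in the standard library's `html.parser` for Beautiful Soup so that it runs without a network connection or extra packages, and it uses `urljoin` where the script concatenates strings. `LinkCollector`, `BASE`, and `PAGINATE_HTML` are names made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE = "http://blog.nogizaka46.com/manatsu.akimoto/"

# One "paginate" block, copied from the output above (the page contains it twice).
PAGINATE_HTML = (
    '<div class="paginate"> 1  | <a href="?p=2&amp;d=201301"> 2 </a> | '
    '<a href="?p=3&amp;d=201301"> 3 </a> | <a href="?p=4&amp;d=201301"> 4 </a> | '
    '<a href="?p=2&amp;d=201301">＞</a></div>'
)


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


collector = LinkCollector()
collector.feed(PAGINATE_HTML)

# Drop the trailing "＞" link, which repeats page 2, then make the URLs absolute.
page_urls = [urljoin(BASE, href) for href in collector.hrefs[:-1]]
for page_url in page_urls:
    print("url", page_url)
```

Page 1 has no link of its own in the block, which is why the script saves its images before following the pagination links.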

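Conclusion

As an aside, when I saved Sayuri Inoue's blog, the total easily ran to several thousand images (2,385 after I deleted the unneeded ones).

You can really see what a hard worker Sayu is.

References

The article at https://qiita.com/xxPowderxx/items/e9726b8b8a114655d796 was extremely helpful.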
