Fetching images from pixiv with pixivpy

Posted at 2021-10-02

Environment

machine: Mac mini(M1, 2020)
OS: macOS Big Sur(11.5.2)
Python 3.9

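Background

I did this because I wanted images to play with a CNN. My motivation for trying a CNN was to sort a pile of unorganized images, and yet this article adds even more images to sort, which is a contradiction I am living with. For anyone who arrived from a search and just wants the images, the code should be in a state where copy-and-paste is enough. Images are saved at original quality.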

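Preparation

Installing `pixivpy`

As the official instructions say, pip install pixivpy is all it takes.

Installing Chrome

Chrome is needed to get through pixiv's authentication. If you absolutely do not want to install Chrome, use the more tedious of the two login methods mentioned later.

Installing `selenium`

Again, follow the official instructions. First run pip install selenium. Next, get the driver that lets selenium control Chrome, downloading the build that matches your own Chrome version. The official GitHub page shows an example using wget, but selenium will not work if the versions do not match, so check carefully. Extract the downloaded driver into the same folder as your working script.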
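Since a version mismatch is the main way this step goes wrong, here is a minimal sketch of the comparison, assuming you have copied the version text of Chrome and of the driver by hand. The helper names and the sample strings are hypothetical, and only the major version number is compared, which is a simplification.

```python
import re


def major_version(version_text):
    """Return the major version number found in a version string."""
    match = re.search(r"(\d+)\.\d+", version_text)
    if match is None:
        raise ValueError(f"no version number in {version_text!r}")
    return int(match.group(1))


def versions_match(chrome_text, driver_text):
    """True when Chrome and the driver share the same major version."""
    return major_version(chrome_text) == major_version(driver_text)


# Hypothetical sample strings, typed in for illustration only.
chrome = "Google Chrome 94.0.4606.61"
driver = "ChromeDriver 94.0.4606.41"
print(versions_match(chrome, driver))
```

If the check comes back False, download a different driver build before going any further.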

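Building a homemade downloader

As noted at the top of the GitHub page, hard-coding your password in the script no longer works. Two workarounds are suggested there, but the first one fails if you dawdle and is a hassle, so I use the second.
First, copy pixiv_auth.py (saved here as pixiv_auth_selenium.py). Running login() in it yields the information needed for authentication, but as is it just dumps the result to standard output, which is awkward to use, so I modify it slightly. Add the following code at lines 121-122.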

pixiv_auth_selenium.py

121    data = response.json()
122    return data['refresh_token']

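Now running login() returns the information needed for authentication (the refresh token). The homemade downloader uses this modified file. Let's go through the downloader piece by piece. Unfortunately, the downloader's authentication expires after 3600 seconds, so it has to be relaunched every hour.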
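Because of the 3600-second limit, it can help to know how much of the hour is left. The following is only a sketch of that bookkeeping: the function names and the 600-second safety margin are my own assumptions, and nothing here talks to pixiv.

```python
from time import time

TOKEN_LIFETIME_S = 3600  # expiry stated in the text above


def seconds_left(issued_at, now=None, lifetime=TOKEN_LIFETIME_S):
    """Seconds remaining before the authentication is expected to expire."""
    if now is None:
        now = time()
    return max(0.0, lifetime - (now - issued_at))


def should_stop(issued_at, now=None, margin=600):
    """True once fewer than `margin` seconds remain (margin is an assumption)."""
    return seconds_left(issued_at, now) <= margin


# Example with fixed timestamps so the result does not depend on the clock.
print(seconds_left(issued_at=0, now=1800))
print(should_stop(issued_at=0, now=3100))
```

Recording time() right after login() returns gives the issued_at value to feed in.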

scrape_pixiv.py

from pathlib import Path
from time import sleep
from time import time  # only needed by the disabled refresh attempt in collect()
from pixivpy3 import AppPixivAPI
from pixiv_auth_selenium import login, refresh



class IdolCollector:
    '''
    Image downloader for pixiv.
    It searches images by tag,
    so it is not limited to im@s.
    '''
    def __init__(self, root, count='ordinary'):
        '''
        On construction, Chrome opens the pixiv login page.
        Enter your ID and password there.

        Parameters
        ----------
        root: pathlib.Path of the folder to save images in
        count: 'ordinary' skips R-18 images;
               'R-18' and 'all' save and count them too

        Outputs
        -------
        A simple log file 'missed_illusts.txt' is created.
        If saving an image fails, its illust ID is written there.
        '''

        self.root = root
        self.refresh_token = login()
        self.api = AppPixivAPI()
        self.api.auth(refresh_token=self.refresh_token)
        assert count in ['ordinary', 'R-18', 'all'], 'count must be one of "ordinary" or "R-18" or "all"'
        self.count = count
        # create (or empty) the log of failed downloads
        with open('missed_illusts.txt', 'w'):
            pass

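In the constructor, root specifies where images are saved. Because the facial expressions differ greatly from those in the images I want to classify, R-18 is excluded by default. The login() called partway through is the modified one from earlier and returns the information needed for authentication. Chrome launches at this point; log in normally and Chrome closes, returning you to the console running python. The received credential is then passed to self.api.auth(refresh_token=self.refresh_token).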

scrape_pixiv.py

    def collect(
            self, idol,
            search_type='exact_match_for_tags',
            sort='popular_desc', duplicates=(),
            n_collect=1000):
        '''
        Make the directories, search for images and save them.

        Parameters
        ----------
        idol: an idol name or any other tag
        search_type: 'exact_match_for_tags'
                     or 'partial_match_for_tags'
                     or 'title_and_caption'
        sort: 'date_desc' or 'date_asc';
              premium members can also use 'popular_desc'
        duplicates: tags to exclude. Illusts carrying
                    any of these tags are not saved again,
                    which prevents duplicates.
        n_collect: number of search results to process.

        Return
        ------
        True on success and False on failure
        '''

        folder = self.root/'ordinary'/idol
        if not folder.exists():
            folder.mkdir(parents=True)

        if self.count in ['R-18', 'all']:
            folder = self.root/'R18'/idol
            if not folder.exists():
                folder.mkdir(parents=True)

        i = 0  # number of search results handled so far
        print(f'search {idol}')

        # first page of search results
        json_result = self.api.search_illust(
                            idol, search_type, sort
                        )

        # error case
        try:
            # a session or token timeout is caught here.
            if json_result.error is not None:
                msgs = json_result.error
                for key, msg in msgs.items():
                    print(f'{key}: {msg}\n')
                # return only after every message has been printed
                return False
        except Exception:
            print(f'{json_result} does not have illust.')
            return False

        if json_result.illusts is not None:
            for illust in json_result.illusts:
                i += self.save_img(illust, idol, duplicates)
        else:
            # no images
            print(f'{json_result} does not have illust.')

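This is the first half of the method that walks through the search results. It handles the first 30 results returned by the search. idol specifies the tag to search for. I signed up for a premium membership as a necessary expense (though the first month is free), so sort is set to popular_desc, popularity order. Passing already-saved tags as a list in duplicates prevents the same image from being saved twice.
Both halves are heavy on try and if guards, but the error that actually occurs is the hourly authentication expiry. I also anticipate hitting the last page when a search has few results, but that has not happened yet, so I do not know how it behaves. The part that actually saves images is self.save_img, defined in the downloader and shown later.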

scrape_pixiv.py

        while i <= n_collect:
            # Attempt to refresh the token. It never succeeded,
            # so it stays disabled inside this string literal.
            '''
            if (time()-t_refresh)>(50*60):
                print(f'old token: {refresh_token}')
                refresh_token = refresh(refresh_token)
                api.auth(refresh_token=refresh_token)
                sleep(1)
                t_refresh = time()
                print('refresh token')
            '''

            # stop when there is no next page
            if not json_result.next_url:
                break

            next_qs = self.api.parse_qs(
                                    json_result.next_url
                                )
            json_result = self.api.search_illust(**next_qs)

            # stop on an empty response
            if not json_result:
                break

            if json_result.illusts is not None:
                for illust in json_result.illusts:
                    i += self.save_img(illust, idol, duplicates)
            else:
                try:
                    if json_result.error is not None:
                        msgs = json_result.error
                        for key, msg in msgs.items():
                            print(f'{key}: {msg}\n')
                        return False
                except Exception:
                    print(f'{json_result} does not have illust.')

        print(f'{i} results were saved for {idol}.')
        return True

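This is the second half of the scanning method. The while loop runs until the number of results specified by n_collect has been handled. Because the count is per search result (per post) rather than per image, the number of images obtained is 3-4 times larger. The suspicious commented-out block is a relic of my struggle to keep the authentication alive. The method returns True when it finishes normally and False when it hits an unrecoverable error. That way, if authentication expires while the next method is calling it in a for loop, further pointless requests are avoided, which reduces the load on the server.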
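The paging pattern in this loop (take next_url, turn it into keyword arguments, search again) can be tried offline. The sketch below uses FakeAPI, a hypothetical stand-in with three canned pages; it only mimics the two calls used above and is not the pixivpy implementation. The URLs and IDs are made up.

```python
from urllib.parse import urlparse, parse_qsl


class FakeAPI:
    """Hypothetical stand-in for the API object, serving canned pages."""

    def __init__(self):
        base = "https://example.invalid/search?word=tag&offset="
        self.pages = {
            0: {"illusts": [1, 2, 3], "next_url": base + "3"},
            3: {"illusts": [4, 5, 6], "next_url": base + "6"},
            6: {"illusts": [7], "next_url": None},
        }

    def search_illust(self, word, offset=0):
        return self.pages[int(offset)]

    def parse_qs(self, next_url):
        # Turn the query string of next_url into keyword arguments.
        return dict(parse_qsl(urlparse(next_url).query))


def collect_ids(api, word, n_collect):
    """Follow next_url until n_collect results are seen or pages run out."""
    result = api.search_illust(word)
    collected = list(result["illusts"])
    while len(collected) < n_collect and result["next_url"]:
        next_qs = api.parse_qs(result["next_url"])
        result = api.search_illust(**next_qs)
        collected.extend(result["illusts"])
    return collected


print(collect_ids(FakeAPI(), "tag", n_collect=5))
```

Note that the loop only checks the count between pages, so it can overshoot n_collect by up to one page, just as the real loop does.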

scrape_pixiv.py

    def collect_idols(
            self, idols, duplicates=None, n_collect=1000):
        '''
        Collect images for several idols.
        Only calls self.collect in a for loop.

        Parameters
        -----------
        idols: list of idol names or tags.
        duplicates: list of tags to exclude, used
                    to prevent duplicates. Each finished
                    idol is appended to it.
        n_collect: number of search results to process per idol.
        '''

        # avoid sharing one mutable default list between calls
        if duplicates is None:
            duplicates = []

        for idol in idols:
            if self.collect(
                        idol, duplicates=duplicates,
                        n_collect=n_collect):
                duplicates.append(idol)
            else:
                print(f'failed on {idol}.')
                break

This just calls the collect method from above for each tag and stops at the first failure.

scrape_pixiv.py

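    def save_img(self, illust, idol, duplicated_idols=None, skip_r18=True):
        # skip_r18 is unused; R-18 handling follows self.count
        n = 1  # value added to the caller's result counter
        tags = [t['name'] for t in illust.tags]

        if 'R-18' in tags:
            folder = 'R18'
            if self.count == 'ordinary':
                # neither counted nor saved
                return 0
        else:
            folder = 'ordinary'

        # exclude illusts tagged with an already-collected idol
        dup = set(duplicated_idols or [])&set(tags)
        if dup:
            return n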

        jpg = self.root/folder/idol/(str(illust.id)+'_p0.jpg')
        png = self.root/folder/idol/(str(illust.id)+'_p0.png')
        path = self.root/folder/idol
        path = str(path)

        if jpg.exists() or png.exists():
            return n

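This is the first half of the image-saving method. pixivpy's AppPixivAPI returns tags in the form [{'name': '萩原雪歩'}, {'name': 'アイドルマスター'}], so the first step converts that into a plain list, ['萩原雪歩', 'アイドルマスター']. If there is an R-18 tag (and count is 'ordinary'), the method exits without counting the result. If a tag overlaps with one saved in the past, or the image is already saved, the method exits without saving but still counts the result.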
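The branching described here can be checked on sample data without touching pixiv. This is a sketch only: decide and its (counted, save) return value are hypothetical names of mine, and the "already on disk" check is left out because it needs real files.

```python
def tag_names(raw_tags):
    """Flatten [{'name': ...}, ...] into a plain list of tag names."""
    return [t["name"] for t in raw_tags]


def decide(raw_tags, count="ordinary", already_collected=()):
    """Return (counted, save) for one search result."""
    tags = tag_names(raw_tags)
    if "R-18" in tags and count == "ordinary":
        return (0, False)  # skipped and not counted
    if set(already_collected) & set(tags):
        return (1, False)  # counted, but saved earlier under another tag
    return (1, True)       # counted and saved


sample = [{"name": "萩原雪歩"}, {"name": "アイドルマスター"}]
print(tag_names(sample))
print(decide(sample, already_collected=["萩原雪歩"]))
```

Using a set intersection means a single shared tag is enough to treat the result as a duplicate.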

scrape_pixiv.py

        if len(illust.meta_pages) == 0:
            # single-image post
            try:
                img = illust.meta_single_page.original_image_url
                self.api.download(img, path=path)
                sleep(1)  # pause to limit server load
            except Exception:
                print(f'failure {illust.id}')
                with open('missed_illusts.txt', 'a') as f:
                    f.write(f'{illust.id}\n')
                return 0
        else:
            # multi-image post
            for page in illust.meta_pages:
                img = page.image_urls.original
                try:
                    self.api.download(img, path=path)
                    sleep(1)  # pause to limit server load
                except Exception:
                    print(f'failure {illust.id}')
                    with open('missed_illusts.txt', 'a') as f:
                        # newline added so logged IDs do not run together
                        f.write(f'{illust.id}\n')

        return n

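The second half branches on whether the post has one image or several. That said, apart from where the URL stored in img comes from, the only difference is whether a for loop is needed. The official pixivpy samples had no example of saving at original quality, but reading the JSON returned by AppPixivAPI().search_illust() carefully turned up the word original, and pulling out the corresponding value made original-quality saving work. One more caveat: pixivpy does not support pathlib, so the save destination has to be given as a string rather than a pathlib object.
Copy the pieces of scrape_pixiv.py shown so far, top to bottom, into a single file.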
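To make the single-image and multi-image cases concrete, here is a small sketch that pulls the original-quality URLs out of plain dicts shaped like the fields used in save_img. The sample posts and URLs are made up, and original_urls is a hypothetical helper, not part of pixivpy.

```python
from pathlib import Path


def original_urls(illust):
    """Return the original-quality image URLs of one post."""
    pages = illust["meta_pages"]
    if len(pages) == 0:
        # single-image post: the URL lives under meta_single_page
        return [illust["meta_single_page"]["original_image_url"]]
    # multi-image post: one URL per page
    return [page["image_urls"]["original"] for page in pages]


# Made-up sample data for illustration only.
single = {
    "meta_pages": [],
    "meta_single_page": {"original_image_url": "https://example.invalid/1_p0.png"},
}
multi = {
    "meta_pages": [
        {"image_urls": {"original": "https://example.invalid/2_p0.jpg"}},
        {"image_urls": {"original": "https://example.invalid/2_p1.jpg"}},
    ],
    "meta_single_page": {},
}

# The save folder is built with pathlib and converted to str before use.
save_dir = str(Path("Pixiv") / "ordinary" / "tag")

print(original_urls(single))
print(len(original_urls(multi)))
```

Collecting the URLs into a list first would let a single download loop serve both cases.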

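Running the homemade downloader

The downloader itself is done once the code above is pasted together, but the entry point is still missing. I implement it as a main function and hard-code the tags (idol names) to search for. Suppose you want to save 14 idols and, after re-authenticating every hour, the first 9 are already done; then you write the following. The save destination is a "Pixiv" folder under the folder where the script is run. n_collect is set to save 1,000 results, but as mentioned earlier, 3-4 times as many images are saved unless the duplicate exclusion kicks in. Since this downloader saves at original quality, 1,000 results (about 4,000 images) came to roughly 4 GB. Duplicate exclusion takes effect more and more over the course of running 52 idols, but I am still bracing for about 100 GB.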

scrape_pixiv.py

def main():
    idols = ['萩原雪歩', '宮尾美也', '田中琴葉',
             '所恵美', '天空橋朋花', '高坂海美',
             '星井美希', '島原エレナ', '篠宮可憐',
             '春日未来', '最上静香', '七尾百合子',
             '白石紬', '桜守歌織',
             ]
    duplicates = []
    for _ in range(9):
        duplicates.append(idols.pop(0))

    root = Path.cwd()/'Pixiv'
    collector = IdolCollector(root)
    collector.collect_idols(idols, duplicates, n_collect=1000)


if __name__=='__main__':
    main()

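Combine everything up to here into scrape_pixiv.py and place pixiv_auth_selenium.py in the same folder. Then run python scrape_pixiv.py from the console. Chrome launches and shows the login screen; log in normally, and automatic searching and saving then run for just one hour. To limit server load the script pauses for 1 second after each image, so throughput is around 3,000 images per hour.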
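The 3,000 images per hour figure follows from simple arithmetic, sketched below. The 0.2-second average download time is purely an assumption chosen to match that figure; only the 1-second pause comes from the script.

```python
def images_per_hour(sleep_s=1.0, download_s=0.2):
    """Images saved per hour given the pause and an assumed download time."""
    return round(3600 / (sleep_s + download_s))


def hours_needed(n_images, sleep_s=1.0, download_s=0.2):
    """Hours of running time needed to save n_images."""
    return n_images / images_per_hour(sleep_s, download_s)


print(images_per_hour())    # pause plus assumed download time
print(images_per_hour(1.0, 0.0))  # upper bound set by the pause alone
print(hours_needed(4000))   # about 4,000 images for 1,000 results
```

Under these assumptions one tag with about 4,000 images takes a bit more than one login session.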

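Reflections

I made the code copy-pasteable, but it is probably still tough if you are not used to python, and hard to read for people who know what they are doing, so this ended up a half-baked article... I hope it at least helps those who are stuck on saving at original quality or on login, given how little documentation and help pixivpy has.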
