
A proposed fix for pixivpy failing to retrieve all followed users when you follow more than 5000 (Extra 4)

Last updated at 2021-11-19, posted at 2021-04-05

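Introduction

Python has an unofficial library for pixiv called pixivpy.
It turns out that the method from my earlier article cannot retrieve every user you follow.

So I analyzed the cause and retrieved the users with a new method.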

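Background

A comment from @fukubucho_:

Sorry, I have one more question: does the ID getter have some kind of retrieval limit?
I follow 6600 users, but no matter how many times I re-fetch the IDs I only get 30×167=5010 of them...

Given this comment, I investigated.
This article assumes that you have read the earlier article and are using its scripts.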

1. Investigating the cause

The script I currently use to retrieve the users I follow is shown below.

pixiv_follow_id_getter.py

# -*- coding: utf-8 -*-
"""
Created on Tue Nov 16 20:58:22 2021

@author: yuki
"""

from pixivpy3 import *
import json
import copy
from time import sleep
import datetime


# Load client.json
f = open("client.json", "r")
client_info = json.load(f)
f.close()





# 2021/2/21 login method changed
# 2021/11/9 removed api (PixivAPI())
aapi = AppPixivAPI()
aapi.auth(refresh_token = client_info["refresh_token"])





# Get the IDs of the users currently followed
server_ids = []
user_following = aapi.user_following(client_info["user_id"])
print("Users you currently follow on pixiv:")
while True:
    try:
        # Each page holds up to 30 users. The loop relies on an exception to stop:
        # a shorter last page, a missing next_url, or an API error response all raise here.
        for i in range(30):
            server_ids.append(user_following.user_previews[i].user.id)
            print("{:<10}".format(user_following.user_previews[i].user.id) + user_following.user_previews[i].user.name)
        next_qs = aapi.parse_qs(user_following.next_url)
        sleep(1)
        user_following = aapi.user_following(**next_qs)
        sleep(1)
    except Exception:
        print("\nProbably finished\n")
        break



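# From here on: make sure that users unfollowed on pixiv are not removed from client_info,
# and likewise users that could not be fetched because of an error

# Compare with the current json and append only the new IDs
# A list is an object, so plain assignment would share the same list; use copy instead
new_ids = copy.copy(client_info["ids"])
for i in range(len(server_ids)):
    if (server_ids[i] not in client_info["ids"]):
        new_ids.append(server_ids[i])
        print("{:<10}".format(server_ids[i]) + " was added")

# Show the counts
print("Current number of followed users:")
print(len(server_ids))
print("Number of IDs in the list before the update:")
print(len(client_info["ids"]))
print("Number of IDs in the list after the update:")
print(len(new_ids))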

# for i in range(len(new_ids)):
    #print(new_ids[i] not in client_info["ids"])


# Write the result back to client.json
client_info["ids"] = new_ids
client_info["version"] = datetime.datetime.now().strftime('%Y%m%d')

with open("client.json", "w") as n:
    json.dump(client_info, n, indent=2, ensure_ascii=False)




# Wouldn't comparing each ID as it is fetched be better for memory use?
# See also the script for more than 5000 followed users

Consider the following part of this script.

.py

 next_qs = aapi.parse_qs(user_following.next_url)
 user_following = aapi.user_following(**next_qs)

This is the part I investigated.

next_qs

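The contents of next_qs on successive pages are

.py

{'offset': 30, 'restrict': 'public', 'user_id': <number>}
{'offset': 60, 'restrict': 'public', 'user_id': <number>}
{'offset': 90, 'restrict': 'public', 'user_id': <number>}
...

This shows that offset is the running index into the list of followed users, counted from the first one, and that it advances by 30 per page.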

Next, I set

.py

next_qs = {'offset': 15000, 'restrict': 'public', 'user_id': <number>}

and called user_following with it. The response was

.py

{'error': {'user_message': '', 'message': '{"offset":["Offset must be no more than 5000"]}', 'reason': '', 'user_message_details': {}}}

Because the message says "Offset must be no more than 5000", requests for followed users at offsets beyond 5000 are rejected.
I then investigated whether this limit comes from pixivpy or from the API.
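The 30×167=5010 figure from the comment is consistent with this limit. The following is a minimal sketch of the arithmetic, assuming 30 users per page (as the next_qs offsets suggest) and that offsets up to and including 5000 are accepted (as the error message suggests). The function name reachable_users is made up for this illustration.

```python
PAGE_SIZE = 30     # users per page, inferred from the next_qs offsets (assumption)
MAX_OFFSET = 5000  # taken from "Offset must be no more than 5000" (assumption: inclusive)


def reachable_users(total_following, page_size=PAGE_SIZE, max_offset=MAX_OFFSET):
    """Count the users that can be fetched by paging from offset 0 in steps of page_size."""
    fetched = 0
    offset = 0
    while offset <= max_offset and offset < total_following:
        # The last page may hold fewer than page_size users.
        fetched += min(page_size, total_following - offset)
        offset += page_size
    return fetched


print(reachable_users(6600))  # 5010
print(reachable_users(100))   # 100
```

The last accepted offset is 4980, which is the 167th page. The next offset would be 5010, which exceeds 5000, so paging stops after 5010 users.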

user_following

I looked into user_following, the function that receives next_qs.

aapi.py

    # List of followed users
    def user_following(self, user_id, restrict='public', offset=None, req_auth=True):
        url = '%s/v1/user/following' % self.hosts
        params = {
            'user_id': user_id,
            'restrict': restrict,
        }
        if (offset):
            params['offset'] = offset

        r = self.no_auth_requests_call('GET', url, params=params, req_auth=req_auth)
        return self.parse_result(r)

As shown above, the function simply passes offset through to the request. Nothing here imposes a limit.

no_auth_requests_call

Next I looked at no_auth_requests_call, which user_following calls.

aapi.py

    # Check auth and set BearerToken to headers
    def no_auth_requests_call(self, method, url, headers={}, params=None, data=None, req_auth=True):
        if self.hosts != "https://app-api.pixiv.net":
            headers['host'] = 'app-api.pixiv.net'
        if headers.get('User-Agent', None) == None and headers.get('user-agent', None) == None:
            # Set User-Agent if not provided
            headers['App-OS'] = 'ios'
            headers['App-OS-Version'] = '12.2'
            headers['App-Version'] = '7.6.2'
            headers['User-Agent'] = 'PixivIOSApp/7.6.2 (iOS 12.2; iPhone9,1)'

        if (not req_auth):
            return self.requests_call(method, url, headers, params, data)
        else:
            self.require_auth()
            headers['Authorization'] = 'Bearer %s' % self.access_token
            return self.requests_call(method, url, headers, params, data)

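This function is where the API is actually called.
Accessing https://app-api.pixiv.net, the API that pixiv uses, directly returns

.py

{"error":{"user_message":"\u6307\u5b9a\u3055\u308c\u305f\u30a8\u30f3\u30c9\u30dd\u30a4\u30f3\u30c8\u306f\u5b58\u5728\u3057\u307e\u305b\u3093","message":"","reason":"","user_message_details":{}}}

Here user_message is the escaped form of "指定されたエンドポイントは存在しません" ("The specified endpoint does not exist"). Compare this with the earlier response:

.py

{'error': {'user_message': '', 'message': '{"offset":["Offset must be no more than 5000"]}', 'reason': '', 'user_message_details': {}}}

The two responses have the same structure.

This shows that the inability to retrieve more than about 5000 IDs comes from pixiv's side (the API), not from pixivpy.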
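The structural comparison can also be checked mechanically. The sketch below uses only the two responses quoted in this article and the standard json module. It checks that both have the same keys under "error", which is the basis for concluding that the offset error is produced by the API server itself.

```python
import json

# Response from the API root, copied from the article as raw JSON text.
root_raw = r'{"error":{"user_message":"\u6307\u5b9a\u3055\u308c\u305f\u30a8\u30f3\u30c9\u30dd\u30a4\u30f3\u30c8\u306f\u5b58\u5728\u3057\u307e\u305b\u3093","message":"","reason":"","user_message_details":{}}}'
root = json.loads(root_raw)

# Response for the oversized offset, as printed by pixivpy (a Python dict).
offset_error = {'error': {'user_message': '', 'message': '{"offset":["Offset must be no more than 5000"]}', 'reason': '', 'user_message_details': {}}}

# Both responses wrap the same set of fields inside "error".
same_structure = set(root["error"]) == set(offset_error["error"])
print(same_structure)  # True

# json.loads turns the \uXXXX escapes back into the Japanese message.
decoded_message = root["error"]["user_message"]

# The "message" field of the offset error is itself a JSON document.
offset_detail = json.loads(offset_error["error"]["message"])
print(offset_detail["offset"][0])  # Offset must be no more than 5000
```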

However, this API appears to be the one used by the mobile app, so the limit has probably never been an issue for people who use pixiv normally from a PC.

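2. Alternative approach

I thought that if AppPixivAPI() does not work then PixivAPI() might, but it has no equivalent function.

So as a last resort, I will use selenium to open the following list on my own page and parse the HTML to collect the IDs of the users I follow.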
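To illustrate the idea of pulling user IDs out of page HTML, here is a minimal sketch that uses only the standard library's html.parser rather than selenium and BeautifulSoup. The sample HTML is invented for this illustration and is not pixiv's real markup. The assumption that user links have the form /users/<id> is based only on the URL pattern used elsewhere in this article.

```python
import re
from html.parser import HTMLParser

# Assumed link shape: /users/<numeric id> (hypothetical, based on the URLs in this article)
USER_HREF = re.compile(r"^/users/(\d+)$")


class UserLinkParser(HTMLParser):
    """Collect unique user IDs from <a href="/users/<id>"> links, in page order."""

    def __init__(self):
        super().__init__()
        self.user_ids = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        match = USER_HREF.match(href)
        if match:
            user_id = int(match.group(1))
            # The same user may be linked more than once (icon and name).
            if user_id not in self.user_ids:
                self.user_ids.append(user_id)


# Invented sample markup, only for demonstrating the parser.
sample_html = """
<div><a href="/users/111"><img src="a.png"></a></div>
<div><a href="/users/111">alice</a></div>
<div><a href="/users/222">bob</a></div>
<div><a href="/artworks/999">not a user</a></div>
"""

parser = UserLinkParser()
parser.feed(sample_html)
print(parser.user_ids)  # [111, 222]
```

Matching on the link pattern instead of a generated class name is one possible design choice. Whether it works on the real page depends on markup that this sketch does not verify.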

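Prerequisites

REFRESH_TOKEN

Obtain a REFRESH_TOKEN by following the guide linked here.

selenium

selenium with Chrome is used when obtaining the REFRESH_TOKEN, and it is also required for this script.

client.json

Following the guide linked here,

prepare a file named client.json and fill in each field.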
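As a rough check before running the script, the sketch below verifies that client.json contains the keys that the scripts in this article read. The key list is taken from those scripts. The values in the sample are placeholders, and the helper name missing_keys is made up for this illustration.

```python
import json

# Keys read by the scripts in this article ("version" is written by the script itself).
REQUIRED_KEYS = ["refresh_token", "user_id", "pixiv_id", "password", "ids"]


def missing_keys(client_info):
    """Return the required keys that are absent from client_info."""
    return [key for key in REQUIRED_KEYS if key not in client_info]


# Placeholder content only; replace every value with your own.
sample_text = """
{
  "refresh_token": "YOUR_REFRESH_TOKEN",
  "user_id": 12345678,
  "pixiv_id": "your_login_id",
  "password": "your_password",
  "ids": []
}
"""

sample = json.loads(sample_text)
print(missing_keys(sample))          # []
print(missing_keys({"user_id": 1}))  # ['refresh_token', 'pixiv_id', 'password', 'ids']
```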

Script

pixiv_follow_id_getter_over5000.py

# -*- coding: utf-8 -*-
"""
Created on Thu Jul  9 15:10:40 2020
@author: yuki
"""

# Modified for distribution, based on 1.3.1
# aapi cannot fetch followed users beyond 5000, so scrape them from the web instead





from pixivpy3 import *
import json
import copy
from time import sleep
import datetime
from bs4 import BeautifulSoup
from selenium.webdriver.chrome.options import Options
from selenium import webdriver
from tqdm import tqdm



f = open("client.json", "r")
client_info = json.load(f)
f.close()


# 2021/2/21 login method changed
# 2021/11/9 removed api (PixivAPI())
aapi = AppPixivAPI()
aapi.auth(refresh_token = client_info["refresh_token"])



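# Check how many users are followed
user_detail = aapi.user_detail(client_info["user_id"])
total_follow_users = user_detail.profile.total_follow_users
print("You follow " + str(total_follow_users) + " users in total")
# Number of pages to visit, assuming 24 users per page on the web following list
total_follow_page = total_follow_users//24 +1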



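# From here on, get the IDs with selenium
# Note: set_headless() and chrome_options= are older selenium APIs;
# newer selenium releases deprecate or remove them.
options = Options()
# Disable headless mode (with True, no browser window is shown)
options.set_headless(False)


# Launch the browser
driver = webdriver.Chrome("chromedriver.exe", chrome_options=options)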
driver.get("https://accounts.pixiv.net/login")

login_id = driver.find_element_by_xpath('//*[@id="LoginComponent"]/form/div[1]/div[1]/input')
login_id.send_keys(client_info["pixiv_id"])
password = driver.find_element_by_xpath('//*[@id="LoginComponent"]/form/div[1]/div[2]/input')
password.send_keys(client_info["password"])

login_btn = driver.find_element_by_xpath('//*[@id="LoginComponent"]/form/button')
login_btn.click()

# If your connection is slow and the script moves on before login completes, causing an error, increase this wait
sleep(4)



server_ids = []
for i in range(1, total_follow_page+1):

    # Go to page i of the following list
    url = "https://www.pixiv.net/users/" +str(client_info["user_id"]) + "/following?p={}".format(i)
    driver.get(url)
    
    # Wait a little so that processing does not start before the page is rendered
    sleep(2)
    
    html = driver.page_source.encode('utf-8')
    #print(html)
    #https://hideharaaws.hatenablog.com/entry/2016/05/06/175056
    # With lxml, some pages could not be parsed for some reason
    soup = BeautifulSoup(html, "html5lib")
    # The class_ below has changed since last time, so check it periodically
    users = soup.find_all("div", class_ = "sc-19z9m4s-5 iqZEnZ")

    
    for user in users:
        href = user.find("a")["href"]
        # Convert to int so the IDs compare equal to the integer IDs stored by the API-based script
        user_id = int(href.split("/")[-1])
        user_name = user.find("a").text
        print("{:<10}".format(user_id) + user_name)
        server_ids.append(user_id)
    

    





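# From here on: make sure that users unfollowed on pixiv are not removed from client_info,
# and likewise users that could not be fetched because of an error

# Compare with the current json and append only the new IDs
# A list is an object, so plain assignment would share the same list; use copy instead
new_ids = copy.copy(client_info["ids"])
for i in range(len(server_ids)):
    if (server_ids[i] not in client_info["ids"]):
        new_ids.append(server_ids[i])
        print("{:<10}".format(server_ids[i]) + " was added")

# Show the counts
print("Current number of followed users:")
print(len(server_ids))
print("Number of IDs in the list before the update:")
print(len(client_info["ids"]))
print("Number of IDs in the list after the update:")
print(len(new_ids))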

# for i in range(len(new_ids)):
    #print(new_ids[i] not in client_info["ids"])


# Write the result back to client.json
client_info["ids"] = new_ids
client_info["version"] = datetime.datetime.now().strftime('%Y%m%d')

with open("client.json", "w") as n:
    json.dump(client_info, n, indent=2, ensure_ascii=False)
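The script above computes the number of pages as total_follow_users//24 + 1. When the total is an exact multiple of 24 this visits one extra page, which is harmless because an empty page adds no users. For comparison, here is a small sketch of ceiling division, assuming 24 users per page as in the script. Both function names are made up for this illustration.

```python
USERS_PER_PAGE = 24  # page size used by the script above


def page_count_floor_plus_one(total):
    """Page count as computed in the script: floor division plus one."""
    return total // USERS_PER_PAGE + 1


def page_count_ceil(total):
    """Smallest number of pages that can hold `total` users (ceiling division)."""
    return -(-total // USERS_PER_PAGE)


for total in (6600, 48, 50, 0):
    print(total, page_count_floor_plus_one(total), page_count_ceil(total))
# 6600 276 275
# 48 3 2
# 50 3 3
# 0 1 0
```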

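Differences from the previous script

1. The IDs are taken directly from the HTML with selenium instead of from the API.
2. Because a run may fail to fetch every user, the result is compared with the contents of client.json and only newly found IDs are appended when writing.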
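The append-only merge described in item 2 can be sketched as a standalone function. This is an illustration under stated assumptions rather than the script's exact code: it uses a set for membership tests and converts every ID to int, since IDs scraped from URLs are strings while IDs from the API are integers. The name merge_ids is made up for this example.

```python
def merge_ids(existing_ids, fetched_ids):
    """Keep every existing ID and append fetched IDs that are not present yet.

    Returns (merged, added). Nothing is removed, so users who were unfollowed
    or missed in this run stay in the list.
    """
    merged = [int(user_id) for user_id in existing_ids]
    seen = set(merged)
    added = []
    for user_id in fetched_ids:
        user_id = int(user_id)  # scraped IDs arrive as strings
        if user_id not in seen:
            seen.add(user_id)
            merged.append(user_id)
            added.append(user_id)
    return merged, added


merged, added = merge_ids([11, 22, 33], ["22", "44", "55", "44"])
print(merged)  # [11, 22, 33, 44, 55]
print(added)   # [44, 55]
```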

Usage

Prepare the prerequisites above.
If chromedriver.exe is not in the current directory, change "chromedriver.exe" in

.py

driver = webdriver.Chrome("chromedriver.exe", chrome_options=options)

to the path of your chromedriver.exe, then run the script.

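Notes

If you run the script many times in a short period, login starts to require verification (the kind where you pick from nine photos), and the automated login no longer succeeds.
In that case, either wait a while, or make sure that

.py

# Enable headless mode (set to False to make the browser actually appear)
options.set_headless(True)

is set to False (the script above already uses False), and right after

.py

login_btn = driver.find_element_by_xpath('//*[@id="LoginComponent"]/form/button')
login_btn.click()

insert a reasonably long sleep, complete the verification by hand during that time, and click the login button yourself.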

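Summary

I found that pixivpy cannot retrieve followed users beyond roughly 5000.
The cause is that pixivpy uses pixiv's app API, which rejects offsets above 5000,
so I fell back to retrieving them by brute force with selenium.
If you know of an API used by the web version of pixiv, please let me know.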
