[Python] Error when web-scraping images
Q&A
Closed
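What I want to solve
I am building a web-scraping program based on a reference book, but it fails with an error and does not run.
My guess is that I am either fetching the image data incorrectly or passing it to Excel incorrectly. If anyone knows how to fix this, I would appreciate your help!
[Additional information]
・The program is based on a sample from the book 『Python&Excel自動処理全部入り』 (shown below)
・I am a Python beginner, so I may be making a mistake that looks elementary to you
・I always run the program on a company PC inside the corporate network
・The requests module does not work properly, possibly because of the proxy, so I would like to use selenium's webdriver instead
Note: pip install also fails unless I install offline, for example from .whl files.
Note: another sample in the same book, which fetches h2 tags from a given URL, did work once I replaced requests with selenium.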
import time
from io import BytesIO
from urllib import parse
import requests
from bs4 import BeautifulSoup
from openpyxl import Workbook
from openpyxl.drawing.image import Image
url = 'https://book.impress.co.jp/'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
books = soup.select('div.block-sub-box-body > ol > li')
wb = Workbook()
ws = wb.active
ws.column_dimensions['A'].width = 50
ws.column_dimensions['B'].width = 40
for i, book in enumerate(books):
    time.sleep(1)
    row_no = i + 1
    ws.cell(row_no, 1).value = book.text
    ws.cell(row_no, 2).value = parse.urljoin(url, book.find('a')['href'])
    image_url = book.find('img')['src']
    image_r = requests.get(parse.urljoin('http:', image_url))
    image = Image(BytesIO(image_r.content))
    image.width = 80
    image.height = 120
    ws.add_image(image, ws.cell(row_no, 3).coordinate)
    ws.row_dimensions[row_no].height = 100
wb.save('ブックランキング.xlsx')
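Regarding the note that requests may be failing because of the proxy: requests accepts a `proxies` mapping, so one thing to try is passing the corporate proxy explicitly. The following is a minimal sketch, assuming a hypothetical proxy at `proxy.example.co.jp:8080`; the real host, port, and any credentials would have to come from your network administrator. `build_proxies` and `fetch_html` are illustrative names and are not part of the book's sample.

```python
from urllib.parse import quote


def build_proxies(host, port, user=None, password=None):
    """Return a proxies mapping in the form that requests expects."""
    auth = ''
    if user is not None:
        # Percent-encode credentials so characters like '@' or ':' stay unambiguous.
        auth = quote(user, safe='')
        if password is not None:
            auth += ':' + quote(password, safe='')
        auth += '@'
    proxy_url = f'http://{auth}{host}:{port}'
    return {'http': proxy_url, 'https': proxy_url}


def fetch_html(url, proxies, timeout=10):
    """Fetch a page through the given proxy and return its HTML text."""
    # Third-party import kept local so build_proxies can be tried without requests installed.
    import requests

    r = requests.get(url, proxies=proxies, timeout=timeout)
    r.raise_for_status()
    return r.text


# Hypothetical proxy address, for illustration only.
proxies = build_proxies('proxy.example.co.jp', 8080)
print(proxies)
```

If this works on your network, `fetch_html('https://book.impress.co.jp/', proxies)` could stand in for `requests.get(url).text` in the sample above. If the proxy requires an authentication scheme that requests does not handle by itself, this sketch will not be enough.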
Problem / error that occurs
Exception managing MicrosoftEdge: error sending request for url (https://msedgedriver.azureedge.net/LATEST_RELEASE_118_WINDOWS): error trying to connect: dns error: This is usually a temporary error during hostname resolution and means that the local server did not receive a response from an authoritative server. (os error 11002)
[12424:7920:1025/183532.481:ERROR:policy_logger.cc(154)] :components\enterprise\browser\controller\chrome_browser_cloud_management_controller.cc(163) Cloud management controller initialization aborted as CBCM is not enabled.
[12424:7920:1025/183532.497:ERROR:assistance_home_client.cc(32)] File path C:\Users\499513\AppData\Local\Temp\scoped_dir4588_1904738743\Default
DevTools listening on ws://127.0.0.1:54317/devtools/browser/31f2ae26-1bf0-44a7-834b-6abae52be10a
[12424:7920:1025/183532.681:ERROR:edge_auth_errors.cc(504)] EDGE_IDENTITY: Get Default OS Account failed: Error: Primary Error: kImplicitSignInFailure, Secondary Error: kAccountProviderFetchError, Platform error: 0, Error string:
[12424:7920:1025/183532.982:ERROR:smartscreen_dns_resolver.cc(110)] SmartScreenDnsResolver::OnComplete Error: -105 DidTimeOut: 0 URL: https://book.impress.co.jp/
[12424:7920:1025/183533.691:ERROR:smartscreen_dns_resolver.cc(110)] SmartScreenDnsResolver::OnComplete Error: -105 DidTimeOut: 0 URL: https://book.impress.co.jp/
[12424:7920:1025/183533.786:ERROR:smartscreen_dns_resolver.cc(110)] SmartScreenDnsResolver::OnComplete Error: -105 DidTimeOut: 0 URL: https://book.impress.co.jp/
[12424:7920:1025/183534.434:ERROR:fallback_task_provider.cc(124)] Every renderer should have at least one task provided by a primary task provider. If a "Renderer" fallback task is shown, it is a bug. If you have repro steps, please file a new bug and tag it as a dependency of crbug.com/739782.
[12424:7920:1025/183534.532:ERROR:fallback_task_provider.cc(124)] Every renderer should have at least one task provided by a primary task provider. If a "Renderer" fallback task is shown, it is a bug. If you have repro steps, please file a new bug and tag it as a dependency of crbug.com/739782.
[12424:7920:1025/183535.685:ERROR:smartscreen_dns_resolver.cc(110)] SmartScreenDnsResolver::OnComplete Error: -7 DidTimeOut: 1 URL: https://platform.twitter.com/widgets/widget_iframe.d37472b4a6622d0b1fff46ad904f6896.html?origin=https%3A%2F%2Fbook.impress.co.jp
[12424:7920:1025/183535.783:ERROR:smartscreen_dns_resolver.cc(110)] SmartScreenDnsResolver::OnComplete Error: -7 DidTimeOut: 1 URL: https://cdn.cxense.com/sp1.html#ver=2.8.33&typ=pgv&rnd=lo5k8bpk7fn9fdas&sid=1132885330118579441&loc=https%3A%2F%2Fbook.impress.co.jp%2F&new=1&arf=0&ltm=1698226533706&ref=&tzo=-540&wsz=988x523&res=1280x720&dpr=1.5&col=24&bln=ja&chs=UTF-8&cks=lo5k8bqwklum6xjo&ckp=lo5k8bpkt30s6hhe&glb=&cp_userState=anon
[12424:7920:1025/183536.940:ERROR:smartscreen_dns_resolver.cc(110)] SmartScreenDnsResolver::OnComplete Error: -105 DidTimeOut: 0 URL: https://book.impress.co.jp/
[12424:7920:1025/183537.916:ERROR:smartscreen_dns_resolver.cc(110)] SmartScreenDnsResolver::OnComplete Error: -7 DidTimeOut: 1 URL: https://platform.twitter.com/widgets/follow_button.d37472b4a6622d0b1fff46ad904f6896.ja.html#dnt=false&id=twitter-widget-0&lang=ja&screen_name=impress_corp&show_count=false&show_screen_name=false&size=l&time=1698226535900
[12424:7920:1025/183538.729:ERROR:fallback_task_provider.cc(124)] Every renderer should have at least one task provided by a primary task provider. If a "Renderer" fallback task is shown, it is a bug. If you have repro steps, please file a new bug and tag it as a dependency of crbug.com/739782.
[12424:7920:1025/183539.025:ERROR:smartscreen_dns_resolver.cc(110)] SmartScreenDnsResolver::OnComplete Error: -105 DidTimeOut: 0 URL: https://book.impress.co.jp/
[12424:7920:1025/183539.085:ERROR:smartscreen_dns_resolver.cc(110)] SmartScreenDnsResolver::OnComplete Error: -105 DidTimeOut: 0 URL: http://img.ips.co.jp/ij/23/1123101004/1123101004-240x.jpg
Traceback (most recent call last):
  File "C:\Users\499513\AppData\Local\Programs\Python\Python311\Lib\site-packages\PIL\Image.py", line 3222, in open
    fp.seek(0)
    ^^^^^^^
AttributeError: 'function' object has no attribute 'seek'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "D:\12_Python\python_excel\Chapter07\booklist_img_selenium.py", line 33, in <module>
    image = Image(image_r)
            ^^^^^^^^^^^^^^
  File "C:\Users\499513\AppData\Local\Programs\Python\Python311\Lib\site-packages\openpyxl\drawing\image.py", line 32, in __init__
    image = _import_image(img)
            ^^^^^^^^^^^^^^^^^^
  File "C:\Users\499513\AppData\Local\Programs\Python\Python311\Lib\site-packages\openpyxl\drawing\image.py", line 16, in _import_image
    img = PILImage.open(img)
          ^^^^^^^^^^^^^^^^^^
  File "C:\Users\499513\AppData\Local\Programs\Python\Python311\Lib\site-packages\PIL\Image.py", line 3224, in open
    fp = io.BytesIO(fp.read())
                    ^^^^^^^
AttributeError: 'function' object has no attribute 'read'
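The final `AttributeError` says that Pillow was handed a function rather than a file-like object. That is what you get when a method is referenced without parentheses: the method itself is passed along instead of its return value. The word "function" in the message appears to come from CPython forwarding attribute lookups on a bound method to the underlying function. A minimal sketch of the difference follows; `FakeDriver` is a hypothetical stand-in so the snippet runs without a browser, and it is not Selenium's actual class.

```python
from io import BytesIO


class FakeDriver:
    """Hypothetical stand-in for a WebDriver; returns fixed bytes, not a real screenshot."""

    def get_screenshot_as_png(self):
        return b'\x89PNG\r\n\x1a\n' + b'dummy'


driver = FakeDriver()

not_called = driver.get_screenshot_as_png   # the method object itself
called = driver.get_screenshot_as_png()     # the bytes that the method returns

print(type(not_called).__name__)  # method
print(type(called).__name__)      # bytes

# Neither value has .read() or .seek(), which Pillow needs when it is not given a path.
print(hasattr(not_called, 'read'), hasattr(called, 'read'))

# Wrapping the bytes in BytesIO produces a file-like object with both methods.
stream = BytesIO(called)
print(hasattr(stream, 'read'), hasattr(stream, 'seek'))
```

So two things seem to be needed at the failing line: the parentheses, and a `BytesIO` wrapper around the returned bytes, as the book's sample does with `image_r.content`.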
Relevant source code
import time
from urllib import parse
from selenium import webdriver
from bs4 import BeautifulSoup
from openpyxl import Workbook
from openpyxl.drawing.image import Image
url = 'https://book.impress.co.jp/'
driver = webdriver.Edge()
driver.get(url)
r = driver.page_source
soup = BeautifulSoup(r, 'html.parser')
books = soup.select('div.block-sub-box-body > ol > li')
wb = Workbook()
ws = wb.active
ws.column_dimensions['A'].width = 50
ws.column_dimensions['B'].width = 40
for i, book in enumerate(books):
    time.sleep(1)
    row_no = i + 1
    ws.cell(row_no, 1).value = book.text
    ws.cell(row_no, 2).value = parse.urljoin(url, book.find('a')['href'])
    image_url = book.find('img')['src']
    driver.get(parse.urljoin('http:', image_url))
    image_r = driver.get_screenshot_as_png
    image = Image(image_r)
    image.width = 80
    image.height = 120
    ws.add_image(image, ws.cell(row_no, 3).coordinate)
    ws.row_dimensions[row_no].height = 100
wb.save('ブックランキング.xlsx')
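One way the loop might be repaired, offered as a sketch rather than a verified fix: call `get_screenshot_as_png()` with parentheses and wrap the returned bytes in `BytesIO` before handing them to openpyxl's `Image`. The code below also passes a local driver path through `Service`, on the assumption that the first log line means Selenium could not download msedgedriver on this network. `EDGE_DRIVER_PATH` is a hypothetical location, and `screenshot_stream` and `main` are illustrative names. The third-party imports sit inside `main()` only so that the helper can be exercised without a browser.

```python
import time
from io import BytesIO
from urllib import parse

URL = 'https://book.impress.co.jp/'
EDGE_DRIVER_PATH = r'C:\tools\msedgedriver.exe'  # hypothetical path; adjust to your PC


def screenshot_stream(driver, image_url):
    """Open image_url in the browser and return a PNG screenshot as a file-like object."""
    driver.get(image_url)
    png_bytes = driver.get_screenshot_as_png()  # note the (): call the method
    return BytesIO(png_bytes)                   # Image needs a path or a file-like object


def main():
    from bs4 import BeautifulSoup
    from openpyxl import Workbook
    from openpyxl.drawing.image import Image
    from selenium import webdriver
    from selenium.webdriver.edge.service import Service

    # Use a driver binary already on disk instead of letting Selenium download one.
    driver = webdriver.Edge(service=Service(executable_path=EDGE_DRIVER_PATH))
    try:
        driver.get(URL)
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        books = soup.select('div.block-sub-box-body > ol > li')

        wb = Workbook()
        ws = wb.active
        ws.column_dimensions['A'].width = 50
        ws.column_dimensions['B'].width = 40

        for row_no, book in enumerate(books, start=1):
            time.sleep(1)  # be polite to the server
            ws.cell(row_no, 1).value = book.text
            ws.cell(row_no, 2).value = parse.urljoin(URL, book.find('a')['href'])
            image_url = parse.urljoin('http:', book.find('img')['src'])
            image = Image(screenshot_stream(driver, image_url))
            image.width = 80
            image.height = 120
            ws.add_image(image, ws.cell(row_no, 3).coordinate)
            ws.row_dimensions[row_no].height = 100

        wb.save('ブックランキング.xlsx')
    finally:
        driver.quit()  # close the browser even if an error occurs
```

Calling `main()` runs the whole flow. Note that a page screenshot captures the entire browser viewport, not just the cover image, so the picture in Excel may include blank margins. Taking the screenshot from the `img` element itself (Selenium's `WebElement.screenshot_as_png` property) might give a tighter result, but I have not tested that on this site.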
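What I tried myself
・Confirmed that there are no missing-module errors
・Confirmed that webdriver launches the browser (it closes automatically partway through the run)
・Confirmed that no xlsx file is created after the run, presumably because the script stops on the error
・Checked whether any data types around the image handling are wrong
・Asked Bing Chat about flaws in the code (the points it raised have been fixed)
・Confirmed that VSCode reports "No problems have been detected in the workspace"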
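For the data-type check in the list above, a quick runtime inspection can be more telling than the editor's static analysis, since an editor reporting no problems does not rule out a type error at run time. The following is a minimal sketch; `describe_image_input` is a hypothetical helper name. It reports whether a value is raw bytes, a file-like object, a path string, or a callable that was probably meant to be called.

```python
from io import BytesIO

PNG_SIGNATURE = b'\x89PNG\r\n\x1a\n'  # the first 8 bytes of every PNG file


def describe_image_input(value):
    """Describe a value that is about to be passed to openpyxl's Image."""
    if isinstance(value, (bytes, bytearray)):
        kind = 'PNG bytes' if bytes(value[:8]) == PNG_SIGNATURE else 'bytes (not PNG)'
        return kind + '; wrap in io.BytesIO before passing to Image'
    if isinstance(value, str):
        return 'path string; usable if the file exists'
    if hasattr(value, 'read') and hasattr(value, 'seek'):
        return 'file-like object; can be passed to Image'
    if callable(value):
        return 'callable; it was probably meant to be called with ()'
    return f'unexpected type: {type(value).__name__}'


print(describe_image_input(PNG_SIGNATURE + b'...'))
print(describe_image_input(BytesIO(PNG_SIGNATURE)))
print(describe_image_input(len))  # any function or method falls into the "callable" case
```

Inserting `print(describe_image_input(image_r))` just before the `Image(image_r)` line should show which case applies to the script in question.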