Scraping through a proxy with Python
Q&A
Closed
What I want to solve
I am trying to run a scraper through a proxy so that it connects from a different IP address.
I wrote the code while referring to blog posts and similar articles, but an error occurs and I am stuck.
Could you please advise me on how to resolve it?
The IP address is one I purchased from a paid service.
Environment: Python 3, Windows 11
The problem / error that occurs
---------------------------------------------------------------------------
TimeoutError Traceback (most recent call last)
File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\urllib3\connection.py:174, in HTTPConnection._new_conn(self)
173 try:
--> 174 conn = connection.create_connection(
175 (self._dns_host, self.port), self.timeout, **extra_kw
176 )
178 except SocketTimeout:
File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\urllib3\util\connection.py:95, in create_connection(address, timeout, source_address, socket_options)
94 if err is not None:
---> 95 raise err
97 raise socket.error("getaddrinfo returns an empty list")
File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\urllib3\util\connection.py:85, in create_connection(address, timeout, source_address, socket_options)
84 sock.bind(source_address)
---> 85 sock.connect(sa)
86 return sock
TimeoutError: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or the established connection failed because the connected host has failed to respond.
During handling of the above exception, another exception occurred:
ConnectTimeoutError Traceback (most recent call last)
File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\urllib3\connectionpool.py:700, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
699 if is_new_proxy_conn and http_tunnel_required:
--> 700 self._prepare_proxy(conn)
702 # Make the request on the httplib connection object.
File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\urllib3\connectionpool.py:994, in HTTPSConnectionPool._prepare_proxy(self, conn)
992 conn.tls_in_tls_required = True
--> 994 conn.connect()
File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\urllib3\connection.py:358, in HTTPSConnection.connect(self)
356 def connect(self):
357 # Add certificate verification
--> 358 conn = self._new_conn()
359 hostname = self.host
File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\urllib3\connection.py:179, in HTTPConnection._new_conn(self)
178 except SocketTimeout:
--> 179 raise ConnectTimeoutError(
180 self,
181 "Connection to %s timed out. (connect timeout=%s)"
182 % (self.host, self.timeout),
183 )
185 except SocketError as e:
ConnectTimeoutError: (<urllib3.connection.HTTPSConnection object at 0x000001693B426860>, 'Connection to 194.110.89.26 timed out. (connect timeout=None)')
During handling of the above exception, another exception occurred:
MaxRetryError Traceback (most recent call last)
File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\requests\adapters.py:440, in HTTPAdapter.send(self, request, stream, timeout, verify, cert, proxies)
439 if not chunked:
--> 440 resp = conn.urlopen(
441 method=request.method,
442 url=url,
443 body=request.body,
444 headers=request.headers,
445 redirect=False,
446 assert_same_host=False,
447 preload_content=False,
448 decode_content=False,
449 retries=self.max_retries,
450 timeout=timeout
451 )
453 # Send the request.
454 else:
File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\urllib3\connectionpool.py:785, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
783 e = ProtocolError("Connection aborted.", e)
--> 785 retries = retries.increment(
786 method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
787 )
788 retries.sleep()
File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\urllib3\util\retry.py:592, in Retry.increment(self, method, url, response, error, _pool, _stacktrace)
591 if new_retry.is_exhausted():
--> 592 raise MaxRetryError(_pool, url, error or ResponseError(cause))
594 log.debug("Incremented Retry for (url='%s'): %r", url, new_retry)
MaxRetryError: HTTPSConnectionPool(host='www.google.com', port=443): Max retries exceeded with url: / (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x000001693B426860>, 'Connection to 194.110.89.26 timed out. (connect timeout=None)'))
During handling of the above exception, another exception occurred:
ConnectTimeout Traceback (most recent call last)
Input In [26], in <module>
7 url = 'https://www.google.com/'
9 proxies = {
10 'http':'https://194.110.89.26:24000',
11 'https':'https://194.110.89.26:24000'
12 }
---> 14 res = requests.get(url, proxies=proxies)
15 soup = BeautifulSoup(res.content, 'lxml')
16 print(soup.text)
File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\requests\api.py:75, in get(url, params, **kwargs)
64 def get(url, params=None, **kwargs):
65 r"""Sends a GET request.
66
67 :param url: URL for the new :class:`Request` object.
(...)
72 :rtype: requests.Response
73 """
---> 75 return request('get', url, params=params, **kwargs)
File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\requests\api.py:61, in request(method, url, **kwargs)
57 # By using the 'with' statement we are sure the session is closed, thus we
58 # avoid leaving sockets open which can trigger a ResourceWarning in some
59 # cases, and look like a memory leak in others.
60 with sessions.Session() as session:
---> 61 return session.request(method=method, url=url, **kwargs)
File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\requests\sessions.py:529, in Session.request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
524 send_kwargs = {
525 'timeout': timeout,
526 'allow_redirects': allow_redirects,
527 }
528 send_kwargs.update(settings)
--> 529 resp = self.send(prep, **send_kwargs)
531 return resp
File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\requests\sessions.py:645, in Session.send(self, request, **kwargs)
642 start = preferred_clock()
644 # Send the request
--> 645 r = adapter.send(request, **kwargs)
647 # Total elapsed time of the request (approximately)
648 elapsed = preferred_clock() - start
File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\requests\adapters.py:507, in HTTPAdapter.send(self, request, stream, timeout, verify, cert, proxies)
504 if isinstance(e.reason, ConnectTimeoutError):
505 # TODO: Remove this in 3.0.0: see #2811
506 if not isinstance(e.reason, NewConnectionError):
--> 507 raise ConnectTimeout(e, request=request)
509 if isinstance(e.reason, ResponseError):
510 raise RetryError(e, request=request)
ConnectTimeout: HTTPSConnectionPool(host='www.google.com', port=443): Max retries exceeded with url: / (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x000001693B426860>, 'Connection to 194.110.89.26 timed out. (connect timeout=None)'))
Relevant source code
import requests
from bs4 import BeautifulSoup

# Google URL
url = 'https://www.google.com/'

proxies = {
    'http': 'http://000.00.00.00:00000',
    'https': 'https://000.000.00.00:00000'
}

res = requests.get(url, proxies=proxies)
soup = BeautifulSoup(res.content, 'lxml')
print(soup.text)
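For reference, two details are often implicated in exactly this kind of ConnectTimeout: the https:// scheme in the proxy values (which asks for TLS to the proxy itself rather than a plain CONNECT tunnel) and the absence of a timeout. Below is a minimal sketch of the same request with those two things changed; the proxy address and port are placeholders, not a real endpoint:

```python
def build_proxies(host: str, port: int) -> dict:
    """Build a proxies mapping for requests.

    The value of the 'https' key names the proxy endpoint itself and is
    conventionally an http:// URL; requests then opens a CONNECT tunnel
    through that proxy to reach the https target. An https:// value asks
    for TLS to the proxy itself, which many proxies do not support and
    which can stall until the connection times out.
    """
    proxy = f"http://{host}:{port}"
    return {"http": proxy, "https": proxy}


if __name__ == "__main__":
    import requests  # third-party; imported here so the helper stays stdlib-only

    proxies = build_proxies("000.00.00.00", 24000)  # placeholder address
    # An explicit timeout turns an indefinite hang into a quick, visible error.
    res = requests.get("https://www.google.com/", proxies=proxies, timeout=10)
    print(res.status_code)
```

Whether the http:// scheme is correct for a given proxy depends on the provider, so this is only a sketch of one common configuration to try.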
What I have tried
・Changed the target URL.
・Changed the IP address to a different one.
・Added the following system environment variables:
http_proxy: http://000.00.00.00:00000
https_proxy: https://000.000.00.00:00000
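The environment-variable attempt can be double-checked from Python itself: the standard library reports the proxy settings it resolves from the environment, which is what requests (with its default trust_env) would also fall back to when no proxies= argument is passed. The address below is a placeholder used only to demonstrate the check:

```python
import os
import urllib.request

# Set the variables the way the system environment would.
os.environ["http_proxy"] = "http://127.0.0.1:8080"   # placeholder value
os.environ["https_proxy"] = "http://127.0.0.1:8080"  # placeholder value

# getproxies() returns the proxy mapping Python resolves from the
# environment, so a mismatch here explains surprising proxy behaviour.
proxies = urllib.request.getproxies()
print(proxies.get("http"), proxies.get("https"))
```

If this prints values different from what was set in the system settings, the Python process is not seeing those variables (for example, because the shell or IDE was started before they were added).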