@BELL55 posted at 2022-11-08

Python: garbled text (mojibake) when scraping

Q&A

Closed

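What I want to solve

Scraping target: https://db.netkeiba.com/race/202204010808/
When I scrape the page above, I see the following behavior.
① With requests
  The text comes out garbled.
  A different page of the same type does not come out garbled.
  Example: https://db.netkeiba.com/race/202204010807/
② With selenium
  The text is not garbled.

Switching to approach ② solved the problem, but I would appreciate it if someone could explain why this happens.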

Relevant source code

Case ①

import requests
from bs4 import BeautifulSoup

session = requests.session()
url="https://db.netkeiba.com/race/202204010808/"
responce = session.get(url)
soup = BeautifulSoup(responce.content, "html.parser")

Case ②

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup

options = Options()
options.add_argument('--headless')
options.add_experimental_option('excludeSwitches', ['enable-logging'])
driverpath = "path to the driver"
# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(driverpath), options=options)

url = "https://db.netkeiba.com/race/202204010808/"
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
driver.quit()  # close the headless browser when done

0 likes

2 Answers

@HalHarada posted at 2022-11-08

soup = BeautifulSoup(responce.content, 'html.parser', from_encoding='utf-8')

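Is the remote server really serving utf-8? Could it be cp932, or eucJP?

In ①, Python (BeautifulSoup) connects to the server directly. Does BeautifulSoup default to utf-8?
In ②, on the other hand, the chain is
Python (BeautifulSoup) -> webdriver -> Chrome -> server
so Chrome is the one handling the character encoding.
(Written by someone with time to kill at the station.)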
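
One way to approach the question of which encoding the server uses is to look at what the page itself declares. The following is a minimal sketch using only the standard library. `declared_charset` is a hypothetical helper name, and the sample bytes are made up for illustration, not taken from the site. With requests, the bytes would come from `responce.content`, and `responce.encoding` holds what requests inferred from the HTTP headers.

```python
import re

# Matches charset=... inside a <meta> tag, for example
# <meta charset="euc-jp"> or
# <meta http-equiv="Content-Type" content="text/html; charset=EUC-JP">.
_META_CHARSET = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?\s*([A-Za-z0-9_.:-]+)",
    re.IGNORECASE,
)


def declared_charset(raw: bytes, scan_limit: int = 4096):
    """Return the charset named in a <meta> tag near the top of the page, or None."""
    match = _META_CHARSET.search(raw[:scan_limit])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()


# Illustrative bytes only (not the real page).
sample = b'<html><head><meta charset="EUC-JP"><title>x</title></head></html>'
print(declared_charset(sample))
```

If the declared charset and the one BeautifulSoup reports in `soup.original_encoding` disagree, that is a hint that detection went wrong somewhere.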

1 Like

@kawagoe6884 posted at 2022-11-09

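Cause of the garbled text

BeautifulSoup decoded res.content as 'windows-1252', which is why the text was garbled.
The fallback to 'windows-1252' may have been triggered by 﨏, which seems to be a platform-dependent character?
For the URL that is not garbled, the detected encoding was 'euc-jp'.

Details below.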

mojibake.py

import requests
from bs4 import BeautifulSoup

# URL whose text comes out garbled
url = "https://db.netkeiba.com/race/202204010808/"
res = requests.get(url)
soup = BeautifulSoup(res.content, "html.parser")
soup.original_encoding
# 'windows-1252'
soup.select("h1")[1].text
# '4ºÐ°Ê¾å1¾¡¥¯¥é¥¹'

# URL whose text is not garbled
url = "https://db.netkeiba.com/race/202204010807/"
res = requests.get(url)
soup = BeautifulSoup(res.content, "html.parser")
soup.original_encoding
# 'euc-jp'
soup.select("h1")[1].text
# '4歳以上1勝クラス'

# Even the URL that is not garbled becomes garbled when the encoding is forced to 'windows-1252'
url = "https://db.netkeiba.com/race/202204010807/"
res = requests.get(url)
soup = BeautifulSoup(res.content, "html.parser", from_encoding="windows-1252")
soup.original_encoding
# 'windows-1252'
soup.select("h1")[1].text
# '4ºÐ°Ê¾å1¾¡¥¯¥é¥¹'
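
The garbled heading can be reproduced without any network access, which makes the mechanism easier to see. The sketch below assumes, as `original_encoding` reported for the working page, that the bytes are EUC-JP. Reading those bytes as windows-1252 turns every byte into one Latin character. It uses only Python's built-in codecs.

```python
original = "4歳以上1勝クラス"

# The heading as EUC-JP bytes (assumed to match what the site sends).
raw = original.encode("euc-jp")

# Decoding those bytes with the wrong codec yields one Latin character
# per byte instead of Japanese text.
garbled = raw.decode("windows-1252")
print(garbled)

# Nothing was lost for this particular string, so the damage is reversible:
# re-encode with the wrong codec, then decode with the right one.
repaired = garbled.encode("windows-1252").decode("euc-jp")
print(repaired)
```

The round trip works here because every byte in this string has a windows-1252 mapping. That is not guaranteed for arbitrary pages, so decoding correctly in the first place is the safer route.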

Workaround

From the official documentation:

If one parser isn’t working on a certain document, the best solution is to try a different parser.

So in case ①, changing the parser as follows retrieves the text without garbling:

soup = BeautifulSoup(res.content, "lxml")
# or
soup = BeautifulSoup(res.content, "html5lib")

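So why does merely changing the parser fix the garbling?
The docstring in BeautifulSoup's __init__.py says

It works better if lxml and/or html5lib is installed.

so I assume that is simply how it works. (I have not read the source from start to finish, so this is a guess.)

If the text is still garbled with the approach above,
approach ② should leave almost no site garbled.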
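
A further option, not taken from the answer above, is to decode the bytes yourself and hand BeautifulSoup a `str`, so that no encoding detection is involved. The sketch rests on an assumption this thread does not confirm: that a strict EUC-JP decode fails on a single byte sequence outside the codec's table (possibly the one for 﨏), and detection then falls back to windows-1252. `decode_html` is a hypothetical helper name, and the sample bytes are synthetic.

```python
def decode_html(raw: bytes, encoding: str = "euc-jp"):
    """Decode page bytes with a known encoding.

    Returns (text, lossy). `lossy` is True when undecodable bytes had to be
    replaced with U+FFFD, so one bad character costs one character rather
    than the whole page.
    """
    try:
        return raw.decode(encoding), False
    except UnicodeDecodeError:
        return raw.decode(encoding, errors="replace"), True


# Synthetic example: valid EUC-JP text with one invalid byte in the middle.
broken = "4歳以上".encode("euc-jp") + b"\xff" + "1勝クラス".encode("euc-jp")
text, lossy = decode_html(broken)
print(lossy, text)

# With requests and BeautifulSoup this would be used roughly as:
#   text, lossy = decode_html(res.content)
#   soup = BeautifulSoup(text, "html.parser")
```

The trade-off is that the encoding has to be known in advance, which is reasonable for a single site but not for arbitrary pages.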

1 Like

Comments

@BELL55
Questioner
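Thank you for the thorough answer.
It had never occurred to me to change the parser.
The sites I was using as references all used "html.parser"...

Based on your answer, I plan to modify my code so that it switches to approach ② whenever the detected encoding is 'windows-1252'.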
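
The dispatch described here could look roughly like the sketch below. All names (`SUSPECT_ENCODINGS`, `needs_browser_fallback`, `pick_html`) are hypothetical. The browser step is passed in as a callable so the logic can be tried without Selenium. In real use `detected` would be `soup.original_encoding`, and `render` would wrap `driver.get(url)` followed by returning `driver.page_source`.

```python
from typing import Callable, Optional

# Encodings that, for this Japanese site, suggest a failed detection rather
# than the page's real charset (an assumption based on the thread above).
SUSPECT_ENCODINGS = {"windows-1252"}


def needs_browser_fallback(detected: Optional[str]) -> bool:
    """True when the detected encoding is missing or looks like a fallback guess."""
    return detected is None or detected.lower() in SUSPECT_ENCODINGS


def pick_html(fast_html: str, detected: Optional[str], render: Callable[[], str]) -> str:
    """Return fast_html, or call render() when the detection looks wrong."""
    if needs_browser_fallback(detected):
        return render()
    return fast_html


# Stand-in for the Selenium step, for demonstration only.
def fake_render() -> str:
    return "<p>rendered by browser</p>"


print(pick_html("<p>from requests</p>", "euc-jp", fake_render))
print(pick_html("<p>garbled</p>", "windows-1252", fake_render))
```

Because the browser is only started when detection looks wrong, most pages still go through the faster requests path.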
