How to fix garbled text (mojibake) when web scraping with Python

Posted at 2020-04-04, last updated at 2020-04-04

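Background

This information is current as of April 2020.
When I scraped a web page with Python, the Japanese text came out garbled, so I wrote up how I dealt with it.
Aimed at scraping beginners.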

Environment

Windows: Windows 10
Python: Python 3.8.0
Requests: requests 2.23.0
Beautiful Soup: beautifulsoup4 4.8.2

Setup

We will use requests and Beautiful Soup 4,
so install them with pip or conda, whichever suits your environment.

# With pip
$ pip install requests
$ pip install beautifulsoup4

# With conda
$ conda install -c anaconda requests
$ conda install -c anaconda beautifulsoup4
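
To confirm that both packages were actually installed, a quick check like the one below may help. This is only a sketch: `installed_version` is a helper name I made up for illustration, and it relies on `importlib.metadata` from the standard library (Python 3.8 or later).

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing.

    Hypothetical helper for illustration; not part of requests or bs4.
    """
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Distribution names as used by pip (note: "beautifulsoup4", not "bs4").
for name in ("requests", "beautifulsoup4"):
    print(name, installed_version(name) or "not installed")
```

If either line prints "not installed", rerun the install command for that package.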

Scraping

When I scraped a page with requests and Beautiful Soup 4, the Japanese text came out garbled.
* In most cases, the following code seems to scrape pages correctly.

sample.py

import requests

url = 'URL'
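response = requests.get(url)  # at this point the text may still be garbled
response.encoding = response.apparent_encoding  # the "magic spell": use the encoding guessed from the body

print(response.text)  # the decoded text ends up in .text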

(Adapted from: https://qiita.com/naka-j/items/ef38498273e036c26f4d)
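
To see why garbling happens at all, it helps to look at the bytes directly. As far as I understand, when the server's Content-Type header is a text type without a charset, requests falls back to ISO-8859-1 when building response.text. The sketch below does not use requests; it only simulates that situation with plain bytes, and the sample string is my own.

```python
# Simulate the raw bytes of a UTF-8 page (roughly what response.content would hold).
raw = "こんにちは".encode("utf-8")

# Decoding with the wrong codec does not raise an error; it silently yields mojibake.
garbled = raw.decode("iso-8859-1")

# Decoding with the correct codec recovers the original text.
fixed = raw.decode("utf-8")

print(repr(garbled))
print(fixed)
```

Setting response.encoding to response.apparent_encoding plays the role of picking the right codec here, except that the codec is guessed from the bytes, so the guess can be wrong on some pages.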

In my case, however, that did not work, so I tried the following instead:

.py

import requests
from bs4 import BeautifulSoup

url = 'URL'
r = requests.get(url)
# Pass the raw bytes (r.content), not r.text, so Beautiful Soup detects the encoding itself
soup = BeautifulSoup(r.content, 'html.parser')

(Adapted from: https://orangain.hatenablog.com/entry/encoding-in-requests-and-beautiful-soup)
With this, the garbled text was fixed!
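
My understanding of why passing bytes helps is that Beautiful Soup, when given bytes rather than a str, runs its own encoding detection and takes the page's meta charset declaration into account. The following is a minimal offline sketch of that behavior; the Shift_JIS sample HTML is made up for illustration and no network request is involved.

```python
from bs4 import BeautifulSoup

# A made-up page that declares Shift_JIS in its <meta> tag.
html = (
    '<html><head><meta charset="shift_jis">'
    "<title>日本語のページ</title></head>"
    "<body><p>こんにちは</p></body></html>"
)

# Encode to bytes, standing in for what r.content would contain.
raw = html.encode("shift_jis")

# Given bytes, Beautiful Soup works out the encoding before parsing.
soup = BeautifulSoup(raw, "html.parser")

print(soup.original_encoding)  # the encoding Beautiful Soup settled on
print(soup.title.string)
print(soup.p.get_text())
```

If the detection is wrong for a particular page, the `from_encoding` argument of `BeautifulSoup` can be used to state the encoding explicitly.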

Summary

A good order to try seems to be: use response.apparent_encoding first, and if that does not work, pass response.content to Beautiful Soup instead!

References

Scraping with Python
Reducing garbled text when scraping with Requests and Beautiful Soup
Python - Building a crawler with Scrapy
