For those of you struggling with garbled text in BeautifulSoup

Python

Last updated at 2021-02-03 / Posted at 2021-02-03

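#Introduction
BeautifulSoup is (I believe) one of the most commonly used modules for scraping web pages in Python. After parsing a page I had fetched, I ran into garbled Japanese text (mojibake), so I am writing down how I fixed it as a note to myself.
#The problematic code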

# coding: utf-8
import requests
from bs4 import BeautifulSoup
url = "http://www2.he.tohoku.ac.jp/zengaku/zengaku_info_g.html"
site = requests.get(url)
soup = BeautifulSoup(site.text, "html.parser")
print(soup.find_all(id="content_box"))

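Running this as-is returns the tag structure correctly, but every Japanese string comes out garbled.
Suspecting an encoding bug, I tried changing the coding declaration on the first line to shift_jis and similar values, but it made no difference. That declaration only tells Python how the source file itself is encoded; it has no effect on how the HTTP response is decoded.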
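#Solution
It turned out to be simple.
The culprit is line 6, soup = BeautifulSoup(site.text, "html.parser"). It hands the parser a unicode string that requests has already decoded, so if requests picked the wrong encoding, the text is garbled before BeautifulSoup ever sees it.
Do this instead: soup = BeautifulSoup(site.content, "html.parser"). Given the raw bytes, BeautifulSoup detects the encoding itself.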
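
The symptom described earlier (tags intact, Japanese garbled) can be reproduced offline with the standard library alone. This is a minimal sketch under two assumptions the article does not state: that the page is encoded as Shift_JIS, and that requests fell back to ISO-8859-1 because the Content-Type header named no charset. The sample HTML string is made up for illustration.

```python
# Hypothetical page fragment; Shift_JIS is an assumed encoding, not confirmed by the article.
html = '<div id="content_box">お知らせ</div>'

# The raw bytes a server would send (what site.content would hold).
raw = html.encode("shift_jis")

# Roughly what site.text holds if requests falls back to ISO-8859-1.
garbled = raw.decode("iso-8859-1")

# The same bytes decoded with the encoding the page actually uses.
correct = raw.decode("shift_jis")

print(garbled)  # ASCII tags survive, the Japanese text does not
print(correct)  # matches the original string
```

Because the tag names and attributes are plain ASCII, they decode identically under both encodings, which is why only the Japanese portion looks broken.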

What changed

| site. | Type returned |
|---|---|
| text | unicode |
| content | bytes |
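
To illustrate why handing over bytes helps, here is a simplified sketch of how a parser can pick the encoding from the document itself when it still has the raw bytes. It is not BeautifulSoup's actual detection code, which considers more signals than this; the helper names `sniff_meta_charset` and `decode_html` and the sample page are hypothetical, and Shift_JIS is again an assumed encoding.

```python
import re


def sniff_meta_charset(raw: bytes, default: str = "utf-8") -> str:
    """Return the charset named in a <meta> tag, or `default` if none is found."""
    # Only inspect the start of the document, where the declaration normally lives.
    match = re.search(rb"<meta[^>]+charset=[\"']?([\w-]+)", raw[:1024], re.IGNORECASE)
    return match.group(1).decode("ascii") if match else default


def decode_html(raw: bytes) -> str:
    """Decode raw HTML bytes using the charset the document declares."""
    return raw.decode(sniff_meta_charset(raw), errors="replace")


# Made-up page, encoded the way the server is assumed to send it.
page = (
    '<html><head><meta charset="shift_jis"></head>'
    '<body><div id="content_box">お知らせ</div></body></html>'
).encode("shift_jis")

print(sniff_meta_charset(page))
print(decode_html(page))
```

Once the response has been decoded into a string with the wrong encoding, this kind of inspection can no longer fix the text, which is the practical difference between the two rows in the table.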

#References
How to fix garbled text in BeautifulSoup when it just won't go away (original title: BeautifulSoupの文字化けが止まらない時の解消方法)
