
Removing unwanted HTML tags with BeautifulSoup

Posted at 2017-05-12 (last updated at 2017-05-12)

Using BeautifulSoup for the first time

For some reason I ended up having to do some scraping at work, so I tried BeautifulSoup out in a hurry.

sc.py

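import urllib.request
import bs4

url = 'http://www.XXXXXX.jp'

# Download the page and parse it with Python's built-in HTML parser
html = urllib.request.urlopen(url)
soup = bs4.BeautifulSoup(html, 'html.parser')

# Titles via a CSS selector, prices via tag name and class
title = soup.select('.lxl-inCateList ul li a dl dt')
price = soup.find_all("dd", class_="l-price")

# .string gives the text inside each tag, without the tag itself
for i in title:
    a = (i.string)
    print(a)
for i in price:
    b = (i.string)
    print(b)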

The source itself is not exactly beautiful, but writing

a = (i.string)

was enough to strip the unwanted HTML tags, leaving only the text inside each element.
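One caveat worth knowing: `.string` only returns text when a tag contains a single string, and gives `None` when the tag has mixed content such as nested tags. The following is a minimal sketch of that difference, using made-up markup (the HTML and the helper name `tag_text` are assumptions for illustration, not part of the scraped site). `get_text()` is shown as one possible alternative for tags with nested markup.

```python
import bs4

# Hypothetical markup, invented for this illustration.
html = """
<dl>
  <dt>Plain title</dt>
  <dt>Title with <b>bold</b> part</dt>
</dl>
"""

soup = bs4.BeautifulSoup(html, "html.parser")


def tag_text(tag):
    """Return all text inside the tag, joined by spaces, nested tags removed."""
    return tag.get_text(" ", strip=True)


plain, nested = soup.find_all("dt")

# .string works for the first <dt>, but is None for the second one
# because that tag has more than one child.
string_values = [plain.string, nested.string]

# get_text() collects the text of every descendant instead.
text_values = [tag_text(plain), tag_text(nested)]

print(string_values)
print(text_values)
```

If the elements being scraped only ever hold plain text, `.string` as used in the script is sufficient.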

soup.find_all("dd", class_="l-price")

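Being able to look elements up by class and the like is incredibly convenient. I wish I had known about it sooner.
It makes sudden requests like "pull this and this off the site and put it together in a document" a whole lot easier.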
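As a rough sketch of how the two lookups from the script could be combined into rows for such a document, the example below runs them against a small inline HTML string. The markup, item names, and prices are hypothetical, loosely modeled on the selectors used above, and pairing titles with prices by position assumes both lists come back in the same order.

```python
import bs4

# Hypothetical markup shaped to match the selectors in the script.
html = """
<div class="lxl-inCateList">
  <ul>
    <li><a href="/item/1"><dl><dt>Item A</dt><dd class="l-price">1,000</dd></dl></a></li>
    <li><a href="/item/2"><dl><dt>Item B</dt><dd class="l-price">2,500</dd></dl></a></li>
  </ul>
</div>
"""

soup = bs4.BeautifulSoup(html, "html.parser")

# CSS selector lookup, as in the script
titles = soup.select(".lxl-inCateList ul li a dl dt")

# Tag name plus class lookup; class_ has a trailing underscore
# because "class" is a reserved word in Python.
prices = soup.find_all("dd", class_="l-price")

# Pair each title with its price by position (assumes matching order).
rows = [
    (t.get_text(strip=True), p.get_text(strip=True))
    for t, p in zip(titles, prices)
]

for name, price in rows:
    print(f"{name}\t{price}")
```

The tab-separated output can be pasted straight into a spreadsheet, which is often enough for a one-off request.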
