Web Scraping with Python

Last updated at 2018-04-17, posted at 2018-04-17

Background

As with the previous post, for no particular reason (lol).

Data

I came across this site:
http://beer-cruise.net/
It lists local craft beers from around Japan, so I'll try fetching all of them.

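Libraries

urllib2: fetches data from a URL (Python 2 only; in Python 3 the same function is urllib.request.urlopen)
BeautifulSoup: parses the fetched data as HTML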

Basic code

import urllib2
from bs4 import BeautifulSoup
# Target URL
url = 'http://beer-cruise.net/beer/Hokkaido.html'
# Fetch the page
html = urllib2.urlopen(url)
# Parse the HTML
soup = BeautifulSoup(html, "html.parser")

soup now holds the parsed HTML, so details can be pulled out with methods such as .find(tag name) and .get(attribute name).
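As a quick illustration of those two methods, here is a minimal sketch that runs on a small inline HTML fragment instead of the live site, so it needs no network access. The fragment is modelled on the target page, and the sketch is written for Python 3 (an assumption; the article's own code is Python 2).

```python
from bs4 import BeautifulSoup

# Inline fragment modelled on the target page; no download involved.
html = '''
<table>
<tr><th colspan="4"><a name="hokkaido">北海道</a></th></tr>
<tr>
<td align="left"><a href="Z00033.html" target="_top">えんがる太陽の丘ビール</a></td>
<td align="left">北海道遠軽町</td>
</tr>
</table>
'''

soup = BeautifulSoup(html, "html.parser")

first_td = soup.find("td")        # first matching tag, or None if there is none
all_tds = soup.find_all("td")     # list of every matching tag
link = first_td.find("a")         # find() also works on a tag, searching inside it

align = first_td.get("align")     # attribute value as a string
href = link.get("href")
missing = first_td.get("bgcolor") # None, because this td has no bgcolor attribute

print(align, href, missing, len(all_tds))
```

Note that .get() looks up an attribute on a tag that has already been found; it does not search the document the way .find() does.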

Inspection

First, look at the HTML source to decide which tags to extract.

<table bgcolor="#e0ffd0" border="1" cellspacing="0" cellpadding="4">
<tr valign="top">
<th align="left" bgcolor="#d0fcff" colspan="4"><a name="hokkaido">　　　北海道</a></th>
</tr>
<tr valign="top">
<td align="center">ブランド名</td>
<td align="center">ブルーパブ名</td>
<td align="center">製造元・企画元</td>
<td align="center">所在地</td>
</tr>
<tr valign="top">
<td align="left"><a href="Z00033.html" target="_top">えんがる太陽の丘ビール</a></td>
<td align="left">麦酒館ふぁーらいと<font color=red>［閉店］</font></td>
<td align="left"><img src="../image/label11.png" border="0">(株)遠軽農業振興公社</td>
<td align="left">北海道遠軽町</td>
…(omitted)

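From this:

Grabbing every tr should be enough
th holds the prefecture name (mind the leading full-width spaces)
td cells with align=left are the ones to extract
The first td holds the beer name
A font tag in the second td marks things like "closed"

That should work, I think.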

Implementation

import urllib2
from bs4 import BeautifulSoup
# Target URL
url = 'http://beer-cruise.net/beer/Hokkaido.html'
# Fetch the page
html = urllib2.urlopen(url)
# Parse the HTML
soup = BeautifulSoup(html, "html.parser")
# 1. Collect every tr tag
row_list = soup.find_all('tr')
prefecture = None
prefecture_beers = {}
for row in row_list:
    # 2. A th holds the prefecture name; strip the full-width spaces (U+3000)
    if row.find('th') is not None:
        prefecture = row.find('th').string.replace(u'\u3000', '')
    cols = row.find_all('td')
    if len(cols) < 2:
        continue
    # 3. Only td cells with align=left are data rows
    if cols[0].get('align') != 'left':
        continue
    # 4. Beer name
    beer_name = cols[0].text
    # 5. A font tag in the second td means closed or discontinued, so skip it
    if cols[1].find('font') is not None:
        continue
    if prefecture is None:
        continue
    if prefecture not in prefecture_beers:
        prefecture_beers[prefecture] = []
    prefecture_beers[prefecture].append(beer_name)
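For readers on Python 3, where urllib2 no longer exists, the same loop could be sketched roughly as below. This is a port under assumptions, not the article's original code: the function names parse_beers and fetch_and_parse are mine, the th text is read with get_text() instead of string, and the "Sample Ale" / "Sample Pub" row in the inline HTML is made up so the example can run offline.

```python
from urllib.request import urlopen  # Python 3 replacement for urllib2.urlopen

from bs4 import BeautifulSoup


def parse_beers(html):
    """Return {prefecture: [beer names]} using the same rules as the loop above."""
    soup = BeautifulSoup(html, "html.parser")
    prefecture = None
    prefecture_beers = {}
    for row in soup.find_all("tr"):
        header = row.find("th")
        if header is not None:
            # get_text() also works if the th ever holds more than one child
            prefecture = header.get_text().replace("\u3000", "").strip()
        cols = row.find_all("td")
        if len(cols) < 2 or cols[0].get("align") != "left":
            continue
        if cols[1].find("font") is not None:  # closed or discontinued
            continue
        if prefecture is None:
            continue
        prefecture_beers.setdefault(prefecture, []).append(cols[0].text)
    return prefecture_beers


def fetch_and_parse(url):
    """Download one page and parse it (needs network access; not run here)."""
    with urlopen(url) as response:
        return parse_beers(response.read())


# Offline check: the last data row is hypothetical sample data.
sample_html = '''
<table>
<tr><th colspan="4"><a name="hokkaido">　　　北海道</a></th></tr>
<tr><td align="center">ブランド名</td><td align="center">ブルーパブ名</td></tr>
<tr>
<td align="left"><a href="Z00033.html">えんがる太陽の丘ビール</a></td>
<td align="left">麦酒館ふぁーらいと<font color=red>［閉店］</font></td>
</tr>
<tr>
<td align="left"><a href="X.html">Sample Ale</a></td>
<td align="left">Sample Pub</td>
</tr>
</table>
'''
prefecture_beers = parse_beers(sample_html)
print(prefecture_beers)
```

Taking the HTML as an argument keeps the parsing step testable without hitting the site on every run.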

The dictionary prefecture_beers ends up mapping each prefecture name to its list of craft beer names.
One thing to note is that the beer name is read with text rather than string.

<td align="left"><a href="111108.html" target="_top">薄野地麦酒<br>夕張石炭ビール</a></td>

In this case there is a <br> tag in the middle of the name, so string returns None; that is why text is used instead.
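The difference is easy to check on that single cell. A minimal sketch (Python 3 assumed), with the optional get_text() call added by me to show how a separator could be placed between the two names:

```python
from bs4 import BeautifulSoup

html = ('<td align="left"><a href="111108.html" target="_top">'
        '薄野地麦酒<br>夕張石炭ビール</a></td>')
cell = BeautifulSoup(html, "html.parser").find("td")

# .string is only defined when a tag has a single child chain ending in one
# text node; here the <a> has three children (text, <br>, text).
as_string = cell.string

# .text concatenates every text node under the tag.
as_text = cell.text

# Optional: get_text() with a separator keeps the two names apart.
joined = cell.get_text(" ", strip=True)

print(as_string)
print(as_text)
print(joined)
```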

I forgot to mention that the URL used here only covers Hokkaido, so to cover the whole country you would add an outer loop over the site's tabs.
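That outer loop might look something like the sketch below (Python 3 assumed). Only Hokkaido.html appears in this article; "Tohoku.html" is a hypothetical page name, and fetch and parse are stand-ins for the download step and the parsing loop above, passed in as functions so the sketch runs offline.

```python
from urllib.parse import urljoin

BASE_URL = "http://beer-cruise.net/beer/"


def scrape_all(page_names, fetch, parse):
    """Merge per-page {prefecture: [beers]} results into one dictionary."""
    merged = {}
    for name in page_names:
        url = urljoin(BASE_URL, name)
        html = fetch(url)
        for prefecture, beers in parse(html).items():
            merged.setdefault(prefecture, []).extend(beers)
    return merged


# Stand-ins for illustration only: no real download or HTML parsing happens.
def fake_fetch(url):
    return url  # pretend the URL itself is the page body


def fake_parse(html):
    page = html.rsplit("/", 1)[-1]
    return {page: ["beer from " + page]}


# "Tohoku.html" is a hypothetical page name, not taken from the site.
result = scrape_all(["Hokkaido.html", "Tohoku.html"], fake_fetch, fake_parse)
print(result)
```

When running this against the real site, a short time.sleep() between requests would be a polite addition.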

