Python scraping (Beautiful Soup) + XPath: getting a link URL by specifying its link text

Last updated at 2017-01-25, posted at 2016-11-27

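I have been practicing scraping for a while now.
There was one thing I could not get working for quite some time,
and now that it works I am writing it up.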

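・I wanted to scrape text inside a table structure together with its link URL, as a pair (using a pandas DataFrame)
・The link URL in question was hard to pick out: the same table contained several a href elements, none of them had an identifying name, and a regular expression did not capture it easily
→ Specifying the link text and taking the URL that this text links to looked like a good approach, so I decided to use XPath
(A DataFrame returns an error when the row counts do not match, so I wanted to leave out unneeded data and get exactly the values I need)
・Beautiful Soup cannot use XPath, but it worked with lxml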
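
As a rough illustration of the idea above, here is a minimal sketch of selecting an href by its link text with lxml. The HTML fragment, the link texts ("edit", "detail") and the paths are made up for this example and do not come from the site I actually scraped. It is written for Python 3.

```python
import lxml.html

# Hypothetical table fragment: several links per row,
# none of them with an id or class that could identify them.
SAMPLE_HTML = """
<table>
  <tr>
    <td class="xx">Item A</td>
    <td><a href="/a/edit">edit</a> <a href="/a/detail">detail</a></td>
  </tr>
  <tr>
    <td class="xx">Item B</td>
    <td><a href="/b/edit">edit</a> <a href="/b/detail">detail</a></td>
  </tr>
</table>
"""

dom = lxml.html.fromstring(SAMPLE_HTML)

# text()='detail' keeps only <a> elements that have a text node equal to "detail";
# the trailing /@href returns the attribute values as strings.
hrefs = dom.xpath("//a[text()='detail']/@href")
print(hrefs)
```

Note that text()='detail' is an exact match, so surrounding whitespace or nested tags inside the link would need a different expression such as contains() or normalize-space().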

[Sites I used as references]
http://gci.t.u-tokyo.ac.jp/tutorial/crawling/
http://www.slideshare.net/tushuhei/python-xpath
http://qiita.com/tamonoki/items/a341657a86ff7a945224

scraping.py

# coding: utf-8
from bs4 import BeautifulSoup
import urllib2
import pandas as pd
import time
import lxml.html

aaa = []
bbb = []

for page in range(1,2):
	url = "http://www.～～～" + str(page)
	# Fetch the page once and reuse the same bytes for both parsers
	html = urllib2.urlopen(url).read()
	soup = BeautifulSoup(html, "lxml")
	dom = lxml.html.fromstring(html)

	for o1 in soup.find_all("td", class_="xx"):
		aaa.append(o1.string)

	for o2 in dom.xpath(u"//a[text()='xxx']/@href"): # get the href of links whose text is xxx
		bbb.append(o2)

	time.sleep(2)

df = pd.DataFrame({"aaa":aaa, "bbb":bbb})
print(df)
df.to_csv("xxxx.csv", index=False, encoding='utf-8')
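
The script above collects the two columns separately, so the DataFrame can only be built if both lists end up the same length. One possible way to keep each text and its URL aligned is to walk the table row by row and look for the link inside each row. The following is only a sketch under assumptions: the sample HTML, the helper name extract_rows and the link text "detail" are hypothetical, and it is written for Python 3.

```python
import lxml.html

# Hypothetical table: the second row has no "detail" link.
SAMPLE_HTML = """
<table>
  <tr><td class="xx">Item A</td><td><a href="/a/detail">detail</a></td></tr>
  <tr><td class="xx">Item B</td><td><a href="/b/edit">edit</a></td></tr>
  <tr><td class="xx">Item C</td><td><a href="/c/detail">detail</a></td></tr>
</table>
"""


def extract_rows(html, link_text):
    """Return one dict per table row, pairing the cell text with the link URL."""
    dom = lxml.html.fromstring(html)
    rows = []
    # Only rows that contain a <td> whose class attribute is exactly "xx"
    for tr in dom.xpath("//tr[td[@class='xx']]"):
        # string(...) gives the text content of the first matching cell
        text = tr.xpath("string(td[@class='xx'])").strip()
        # $t is an XPath variable; lxml fills it from the keyword argument
        hrefs = tr.xpath(".//a[text()=$t]/@href", t=link_text)
        # Use None when the row has no matching link, so the columns stay aligned
        rows.append({"aaa": text, "bbb": hrefs[0] if hrefs else None})
    return rows


rows = extract_rows(SAMPLE_HTML, "detail")
for row in rows:
    print(row)
```

A list of dicts like this can be passed straight to pd.DataFrame(rows), and rows where "bbb" is None can be dropped afterwards if they are not needed.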

It is brief, but that is all for today.
