
Python: WebCrawling

Posted at 2019-09-30

SAMPLE

F12 Developer Tools

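From this page, retrieve each of the following:
・User name
・Article titles (all of them)
・Article titles (only those whose title contains the string "T-SQL")
・The URL that the Terms link points to
・The number of Items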

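To copy the XPath of a node, right-click the node you want in the developer tools and choose
Copy => Copy XPath
The copied expression is only a plain positional path, however, so anything more complex (filtering by text, moving to a parent or sibling node) has to be written by hand.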
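
To illustrate that limitation, here is a minimal sketch comparing a positional XPath of the kind copied from the developer tools with a hand-written one that filters on a class and on text. The HTML fragment, its class names, and the variable names are made up for illustration and are not taken from the Qiita page.

```python
import lxml.html

# Hypothetical HTML fragment, used only for illustration.
SAMPLE_HTML = """
<html><body>
  <div id="main">
    <ul>
      <li class="item"><a href="/a">T-SQL: joins</a></li>
      <li class="item"><a href="/b">Python: crawling</a></li>
      <li class="item"><a href="/c">T-SQL: cursors</a></li>
    </ul>
  </div>
</body></html>
"""

doc = lxml.html.fromstring(SAMPLE_HTML)

# Positional path, in the style of a copied XPath: it points at exactly one node.
copied_style = doc.xpath('//*[@id="main"]/ul/li[2]/a/text()')

# Hand-written path with a condition: every link whose text contains "T-SQL".
hand_written = doc.xpath('//li[@class="item"]/a[contains(text(), "T-SQL")]/text()')

print(copied_style)
print(hand_written)
```

The positional path breaks as soon as the list order changes, while the conditional path keeps selecting the matching links.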

REFERENCE

クローラ作成に必須！XPATHの記法まとめ (Essential for building crawlers: a summary of XPath notation)

PYTHON

import requests
import lxml.html

url = 'https://qiita.com/kinoshita_yuri'

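## fetch the page and stop early on an HTTP error (4xx/5xx)
response_page = requests.get(url)
response_page.raise_for_status()
## parse the raw bytes into an lxml element tree
html_page = lxml.html.fromstring(response_page.content)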

## user name
xpath_name = 'string(//*[@class="col-md-3 col-sm-3 col-xs-12 newUserPageProfile"]/div[@class="newUserPageProfile_header"]/h3/span)'
html_name = html_page.xpath(xpath_name)

## article titles (all); the original predicate [position()] is always true, so it is dropped
xpath_title = '//*[@id="main"]/div/div/div[@class="col-md-9 col-sm-9 col-xs-12"]/div[@class="tableList"]/article/div/div[@class="ItemLink__title"]/a/text()'
html_title = html_page.xpath(xpath_title)

## article titles (only those containing "T-SQL"); separate names so the full list above is not overwritten
xpath_title_tsql = '//*[@id="main"]/div/div/div[@class="col-md-9 col-sm-9 col-xs-12"]/div[@class="tableList"]/article/div/div[@class="ItemLink__title"]/a[contains(text(),"T-SQL")]/text()'
html_title_tsql = html_page.xpath(xpath_title_tsql)

## terms link (find the text node "Terms", then take the href of its parent node)
xpath_terms = '//*[@class="footer_container"]/ul[@class="footer_links-left"]/li/a/text()[.="Terms"]/parent::node()/@href'
html_terms = html_page.xpath(xpath_terms)

## Items (find the text node "Items", then take the text of the preceding sibling of its parent node)
xpath_area = '//*[@id="main"]/div/div/div[2]/div[1]/div[2]/a[@class="col-xs-4 userActivityChart_stat active"]/span[@class="userActivityChart_statUnit"]/text()[.="Items"]/parent::node()/preceding-sibling::span/text()'
html_area = html_page.xpath(xpath_area)
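
The class names on the live page may have changed since this was posted, so the following is a minimal offline sketch of the same two navigation techniques (text node => parent => attribute, and text node => parent => preceding sibling). The HTML fragment and its class names are hypothetical and only loosely imitate a footer and a stats block.

```python
import lxml.html

# Hypothetical markup, used only for illustration.
SAMPLE_HTML = """
<html><body>
  <div class="stats">
    <a href="/items"><span class="count">12</span><span class="unit">Items</span></a>
    <a href="/followers"><span class="count">3</span><span class="unit">Followers</span></a>
  </div>
  <ul class="footer">
    <li><a href="/about">About</a></li>
    <li><a href="/terms">Terms</a></li>
  </ul>
</body></html>
"""

doc = lxml.html.fromstring(SAMPLE_HTML)

# Find the text node "Terms", step up to its parent <a>, then read @href.
terms_href = doc.xpath(
    '//ul[@class="footer"]/li/a/text()[.="Terms"]/parent::node()/@href'
)

# Find the text node "Items", step up to its <span>,
# then read the text of the <span> that precedes it.
items_count = doc.xpath(
    '//span[@class="unit"]/text()[.="Items"]'
    '/parent::node()/preceding-sibling::span/text()'
)

# xpath() returns a list for node-set results, so guard against an empty match.
print(terms_href[0] if terms_href else None)
print(items_count[0] if items_count else None)
```

Note that an expression wrapped in string(...), like the user-name query above, returns a single string rather than a list.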

Nodes and their positional relationships in the tree

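Name	Description
self	The node itself
child	The set of the node's child nodes
parent	The node's parent node
ancestor	The set of the node's ancestor nodes (including the parent)
descendant	The set of the node's descendant nodes
following	The set of nodes that appear after the node in document order (excluding its descendants)
preceding	The set of nodes that appear before the node in document order (excluding its ancestors)
following-sibling	The set of sibling nodes at the same level that appear after the node
preceding-sibling	The set of sibling nodes at the same level that appear before the node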
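
The axes in the table can be tried out on a small tree. This is a minimal sketch: the XML document and the helper name `tags` are made up for illustration, and the element `<d>` is used as the context node.

```python
import lxml.etree

# Hypothetical tree, used only to illustrate the axes.
SAMPLE_XML = "<root><a/><b><c/><d><g><h/></g></d><e/></b><f/></root>"
tree = lxml.etree.fromstring(SAMPLE_XML)

# Context node: <d>
d = tree.xpath('//d')[0]

AXES = [
    'self', 'child', 'parent', 'ancestor', 'descendant',
    'following', 'preceding', 'following-sibling', 'preceding-sibling',
]


def tags(axis):
    """Return the tag names of all elements on the given axis of <d>."""
    return [node.tag for node in d.xpath(f'{axis}::*')]


axis_result = {axis: tags(axis) for axis in AXES}

for axis, names in axis_result.items():
    print(f'{axis:18} {names}')
```

With `<d>` as the context node, `following` skips the descendants `<g>` and `<h>`, and `preceding` skips the ancestors `<root>` and `<b>`.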
