
Trying out XPath with Requests + lxml

Posted at 2018-04-28

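##Background

I am a Python beginner.
I am not yet comfortable with BeautifulSoup.
I want to use XPath, which I have worked with a little before.
I want to fetch content from pages that do not justify driving Chrome through Selenium.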

##Sample HTML

sample.html

<html>
    <head>
        <title>Sample</title>
    </head>
    <body>
        <div class="contents" id="item01">
            <h1 class="title">Sample #01</h1>
            <p>sample text</p>
        </div>
        <div class="contents" id="item02">
            <h1 class="title">Sample #02</h1>
            <ul class="list">
                <li>List #1</li>
                <li>List #2</li>
                <li>List #3</li>
            </ul>
        </div>
        <div class="contents" id="item03">
            <h1 class="title">Sample #03</h1>
            <img src="image01.jpg" alt="image01">
        </div>
    </body>
</html>

##Passing HTML fetched with Requests to lxml

import requests
import lxml.html

url = "http://hoge.hoge/sample.html"

response = requests.get(url)
html = lxml.html.fromstring(response.content)
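The fetch above assumes the request always succeeds. As a minimal sketch, the same step could be wrapped with a timeout and an HTTP status check. The function names `parse_html` and `fetch_tree` and the 10-second default are my own assumptions, not part of Requests or lxml.

```python
import requests
import lxml.html


def parse_html(content: bytes) -> lxml.html.HtmlElement:
    """Parse raw HTML bytes into an lxml element tree."""
    return lxml.html.fromstring(content)


def fetch_tree(url: str, timeout: float = 10.0) -> lxml.html.HtmlElement:
    """Download a page and parse it; raises on network or HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    return parse_html(response.content)


# Offline check: parse bytes directly instead of hitting the network.
tree = parse_html(b"<html><head><title>Sample</title></head><body></body></html>")
title = tree.xpath("//title/text()")[0]
print(title)  # Sample
```

Passing `response.content` (bytes) rather than `response.text` lets lxml detect the encoding from the document itself.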

##Getting tags
###Simple XPath

htmltag = html.xpath("//div")
# A list of lxml.html.HtmlElement objects is returned

>>> print(htmltag)
[<Element div at 0x***********>, <Element div at 0x***********>, <Element div at 0x***********>]
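Since the result is a plain Python list, it can be looped over like any other list. Here is a small sketch that parses the sample HTML from an inline string (the `SAMPLE` constant is my own stand-in for the fetched page) and reads each `div`'s `id`.

```python
import lxml.html

# Inline copy of sample.html so the example runs without a web server.
SAMPLE = """
<html><body>
  <div class="contents" id="item01"><h1 class="title">Sample #01</h1><p>sample text</p></div>
  <div class="contents" id="item02"><h1 class="title">Sample #02</h1>
    <ul class="list"><li>List #1</li><li>List #2</li><li>List #3</li></ul></div>
  <div class="contents" id="item03"><h1 class="title">Sample #03</h1>
    <img src="image01.jpg" alt="image01"></div>
</body></html>
"""

html = lxml.html.fromstring(SAMPLE)
divs = html.xpath("//div")

# Each list item is an HtmlElement; .get() reads an attribute by name.
div_ids = [div.get("id") for div in divs]
print(len(divs))  # 3
print(div_ids)    # ['item01', 'item02', 'item03']
```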

###XPath with a position index

htmltag = html.xpath("//div[2]/ul/li[1]")
# Returns a one-element list holding <li>List #1</li> (XPath indexes start at 1)
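One thing worth knowing about position indexes: `//div[1]` means "every `div` that is the first `div` child of its parent", not "the first `div` in the document". The two happen to coincide in the sample HTML. The sketch below shows `last()` on the sample, then uses a small nested document (`NESTED`, a hypothetical example of mine) to show the difference.

```python
import lxml.html

SAMPLE = """
<html><body>
  <div class="contents" id="item01"><h1 class="title">Sample #01</h1><p>sample text</p></div>
  <div class="contents" id="item02"><h1 class="title">Sample #02</h1>
    <ul class="list"><li>List #1</li><li>List #2</li><li>List #3</li></ul></div>
</body></html>
"""
html = lxml.html.fromstring(SAMPLE)

first_item = html.xpath("//div[2]/ul/li[1]")[0]
last_item = html.xpath("//div[2]/ul/li[last()]")[0]
print(first_item.text)  # List #1
print(last_item.text)   # List #3

# Hypothetical nested markup to show how [1] is scoped to each parent.
NESTED = (
    '<html><body><div id="outer"><div id="inner"></div></div>'
    '<div id="second"></div></body></html>'
)
nested = lxml.html.fromstring(NESTED)
per_parent = nested.xpath("//div[1]/@id")  # first div under each parent
overall = nested.xpath("(//div)[1]/@id")   # first div in the whole document
print(per_parent)  # ['outer', 'inner']
print(overall)     # ['outer']
```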

###XPath equivalents of CSS id/class selectors
####id

htmltag = html.xpath("//div[@id='item01']")
# A list is returned even when only one tag matches

>>> print(htmltag)
[<Element div at 0x***********>]

####class

divs = html.xpath("//div[@class='contents']")
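Note that `@class='contents'` compares the whole attribute value, so it behaves differently from the CSS selector `.contents` when an element carries several classes. The following sketch uses made-up markup (the `featured` and `table-of-contents` classes are my own assumptions) to compare an exact match, a substring match, and a commonly used whole-token idiom.

```python
import lxml.html

# Hypothetical markup with multi-class and look-alike class names.
MARKUP = (
    '<html><body>'
    '<div class="contents" id="a"></div>'
    '<div class="contents featured" id="b"></div>'
    '<div class="table-of-contents" id="c"></div>'
    '</body></html>'
)
html = lxml.html.fromstring(MARKUP)

# Exact match on the whole attribute value.
exact = html.xpath("//div[@class='contents']/@id")

# Substring match: also hits "table-of-contents".
loose = html.xpath("//div[contains(@class, 'contents')]/@id")

# Whole-token match: pad with spaces so only the class "contents" itself hits.
token = html.xpath(
    "//div[contains(concat(' ', normalize-space(@class), ' '), ' contents ')]/@id"
)

print(exact)  # ['a']
print(loose)  # ['a', 'b', 'c']
print(token)  # ['a', 'b']
```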

###XPath that matches tags containing a string

# All tags whose class name contains "i"
htmltag = html.xpath("//*[contains(@class, 'i')]")
for item in htmltag:
    print(item.tag)

h1 # <h1 class="title">Sample #01</h1>
h1 # <h1 class="title">Sample #02</h1>
ul # <ul class="list">
h1 # <h1 class="title">Sample #03</h1>
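`contains()` is not limited to `class`; it also works on other attributes and on text, and XPath 1.0 offers `starts-with()` as well. A short sketch against the sample HTML (again parsed from an inline `SAMPLE` string of my own):

```python
import lxml.html

SAMPLE = """
<html><body>
  <div class="contents" id="item01"><h1 class="title">Sample #01</h1><p>sample text</p></div>
  <div class="contents" id="item02"><h1 class="title">Sample #02</h1>
    <ul class="list"><li>List #1</li><li>List #2</li><li>List #3</li></ul></div>
  <div class="contents" id="item03"><h1 class="title">Sample #03</h1>
    <img src="image01.jpg" alt="image01"></div>
</body></html>
"""
html = lxml.html.fromstring(SAMPLE)

# Tags whose class contains "i" (same query as above, collected as tag names).
class_hits = [el.tag for el in html.xpath("//*[contains(@class, 'i')]")]

# Tags whose id starts with "item".
by_prefix = html.xpath("//*[starts-with(@id, 'item')]/@id")

# <li> tags whose text contains "#2".
by_text = html.xpath("//li[contains(text(), '#2')]/text()")

print(class_hits)  # ['h1', 'h1', 'ul', 'h1']
print(by_prefix)   # ['item01', 'item02', 'item03']
print(by_text)     # ['List #2']
```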

##Getting information from a tag
###Tag contents

htmltag = html.xpath("//div[1]/p")

>>> print(htmltag[0].text)
sample text
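`.text` only returns the text that comes before the first child element. When a tag mixes text and child tags, `text_content()` gathers everything. A minimal sketch, using a variation of the sample paragraph that I made up (the `<b>` tag is not in sample.html):

```python
import lxml.html

# Hypothetical variation of the sample paragraph with inline markup.
html = lxml.html.fromstring(
    "<html><body><p>sample <b>bold</b> text</p></body></html>"
)
p = html.xpath("//p")[0]

before_child = p.text           # text up to the first child tag only
full_text = p.text_content()    # text of the tag and all its descendants
pieces = p.xpath(".//text()")   # the same text as separate nodes

print(repr(before_child))  # 'sample '
print(full_text)           # sample bold text
print(pieces)              # ['sample ', 'bold', ' text']
```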

###Tag attributes

htmltag = html.xpath("//div[1]")

>>> print(htmltag[0].attrib)
{'class': 'contents', 'id': 'item01'}

>>> print(htmltag[0].attrib["class"])
contents

# The attribute can also be fetched directly with XPath
htmltag = html.xpath("//div[1]/@class")
>>> print(htmltag[0])
contents
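`attrib["name"]` raises `KeyError` when the attribute is missing, so for optional attributes `.get()` may be the safer choice. A small sketch using the `<img>` tag from the sample HTML (the `title` attribute is deliberately one that does not exist there):

```python
import lxml.html

html = lxml.html.fromstring(
    '<html><body><div class="contents" id="item03">'
    '<h1 class="title">Sample #03</h1>'
    '<img src="image01.jpg" alt="image01"></div></body></html>'
)
img = html.xpath("//div[@id='item03']/img")[0]

src = img.get("src")               # existing attribute
missing = img.get("title")         # absent attribute gives None, no exception
fallback = img.get("title", "")    # or a default of your choice

# With XPath, an absent attribute simply yields an empty list.
all_srcs = html.xpath("//img/@src")
no_titles = html.xpath("//img/@title")

print(src)        # image01.jpg
print(missing)    # None
print(all_srcs)   # ['image01.jpg']
print(no_titles)  # []
```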

###Tag name

htmltag = html.xpath("//div[1]")

>>> print(htmltag[0].tag)
div
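Putting the pieces together, here is a rough sketch that walks every `div.contents` in the sample HTML and records its `id`, its heading text, and the tag names of its children. The `summaries` structure is just my own way of collecting the results.

```python
import lxml.html

SAMPLE = """
<html><body>
  <div class="contents" id="item01"><h1 class="title">Sample #01</h1><p>sample text</p></div>
  <div class="contents" id="item02"><h1 class="title">Sample #02</h1>
    <ul class="list"><li>List #1</li><li>List #2</li><li>List #3</li></ul></div>
  <div class="contents" id="item03"><h1 class="title">Sample #03</h1>
    <img src="image01.jpg" alt="image01"></div>
</body></html>
"""
html = lxml.html.fromstring(SAMPLE)

summaries = []
for div in html.xpath("//div[@class='contents']"):
    summaries.append({
        "id": div.get("id"),
        # A leading "./" makes the XPath relative to this div.
        "title": div.xpath("./h1/text()")[0],
        # Iterating over an element yields its direct children.
        "children": [child.tag for child in div],
    })

for row in summaries:
    print(row["id"], row["title"], row["children"])
# item01 Sample #01 ['h1', 'p']
# item02 Sample #02 ['h1', 'ul']
# item03 Sample #03 ['h1', 'img']
```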
