
More than 5 years have passed since last update.

Trying out XPath with Requests + lxml

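Background

  • I am a Python beginner
  • I am not comfortable with BeautifulSoup
  • I want to use XPath, which I have touched a little before
  • I want to fetch content from pages that do not warrant driving Chrome through Selenium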

Sample HTML

sample.html
<html>
    <head>
        <title>Sample</title>
    </head>
    <body>
        <div class="contents" id="item01">
            <h1 class="title">Sample #01</h1>
            <p>sample text</p>
        </div>
        <div class="contents" id="item02">
            <h1 class="title">Sample #02</h1>
            <ul class="list">
                <li>List #1</li>
                <li>List #2</li>
                <li>List #3</li>
            </ul>
        </div>
        <div class="contents" id="item03">
            <h1 class="title">Sample #03</h1>
            <img src="image01.jpg" alt="image01">
        </div>
    </body>
</html>

Passing HTML fetched with Requests to lxml

import requests
import lxml.html

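# Placeholder URL; replace it with the page you want to fetch
url = "http://hoge.hoge/sample.html"

response = requests.get(url)
# Pass the raw bytes (response.content) so the parser can use the encoding declared in the HTML
html = lxml.html.fromstring(response.content)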
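
To try the XPath expressions in this article without a web server, the same parsing step can be run on an HTML string. This is a minimal sketch: `SAMPLE_HTML` and `doc` are names made up for illustration, and the markup is a trimmed copy of sample.html.

```python
import lxml.html

# Assumption: a trimmed-down copy of sample.html, inlined so no network access is needed.
SAMPLE_HTML = """
<html>
  <head><title>Sample</title></head>
  <body>
    <div class="contents" id="item01">
      <h1 class="title">Sample #01</h1>
      <p>sample text</p>
    </div>
  </body>
</html>
"""

# fromstring() accepts str or bytes; for a full document it returns the root <html> element.
doc = lxml.html.fromstring(SAMPLE_HTML)

page_title = doc.xpath("//title/text()")
print(doc.tag)      # html
print(page_title)   # ['Sample']
```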

Getting tags

A simple XPath

htmltag = html.xpath("//div")
# A list of lxml.html.HtmlElement objects is returned

>>> print(htmltag)
[<Element div at 0x***********>, <Element div at 0x***********>, <Element div at 0x***********>]

XPath with a position index

htmltag = html.xpath("//div[2]/ul/li[1]")
# <li>List #1</li>
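
One thing to watch with position indexes: `[1]` counts among siblings under the same parent, not across the whole document. The sketch below uses a cut-down version of the sample page (an assumption for illustration) to compare `//h1[1]` with `(//h1)[1]`.

```python
import lxml.html

# Assumption: a simplified version of sample.html with one <h1> per <div>.
doc = lxml.html.fromstring("""
<html><body>
  <div id="item01"><h1>Sample #01</h1></div>
  <div id="item02"><h1>Sample #02</h1></div>
  <div id="item03"><h1>Sample #03</h1></div>
</body></html>
""")

# //h1[1]: every <h1> that is the first <h1> child of its own parent
first_in_each_parent = doc.xpath("//h1[1]")

# (//h1)[1]: the first <h1> in the whole document
first_in_document = doc.xpath("(//h1)[1]")

print(len(first_in_each_parent))   # 3
print(len(first_in_document))      # 1
print(first_in_document[0].text)   # Sample #01
```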

XPath by id and class attributes (the equivalent of CSS selectors)

id

htmltag = html.xpath("//div[@id='item01']")
# A list is returned even when only one tag matches

>>> print(htmltag)
[<Element div at 0x***********>]
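
Because `xpath()` returns a list even for a single hit (and an empty list for no hit), indexing with `[0]` raises `IndexError` when nothing matches. A small wrapper can hide that. `first_match` below is a hypothetical helper written for this article, not part of lxml.

```python
import lxml.html

# Assumption: a trimmed-down copy of sample.html.
doc = lxml.html.fromstring("""
<html><body>
  <div class="contents" id="item01"><p>sample text</p></div>
</body></html>
""")

def first_match(element, expression):
    """Hypothetical helper: return the first XPath result, or None if nothing matched."""
    results = element.xpath(expression)
    return results[0] if results else None

found = first_match(doc, "//div[@id='item01']")
missing = first_match(doc, "//div[@id='item99']")

print(found.get("id"))   # item01
print(missing)           # None
```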

class

divs = html.xpath("//div[@class='contents']")
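
Note that `@class='contents'` compares the whole attribute string, so it misses elements that carry more than one class. A common XPath 1.0 idiom for matching a single class token is sketched below. The `featured` and `subcontents` class names are made up for the example and do not appear in sample.html.

```python
import lxml.html

# Assumption: hypothetical markup where one <div> has two classes.
doc = lxml.html.fromstring("""
<html><body>
  <div class="contents" id="item01"></div>
  <div class="contents featured" id="item02"></div>
  <div class="subcontents" id="item03"></div>
</body></html>
""")

# Exact string comparison: only class="contents" matches
exact = doc.xpath("//div[@class='contents']")

# Token match: pad the class list with spaces, then look for ' contents '
token = doc.xpath(
    "//div[contains(concat(' ', normalize-space(@class), ' '), ' contents ')]"
)

print([d.get("id") for d in exact])   # ['item01']
print([d.get("id") for d in token])   # ['item01', 'item02']
```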

XPath for tags whose attribute contains a string

# All tags whose class name contains "i"
htmltag = html.xpath("//*[contains(@class, 'i')]")
for item in htmltag:
    print(item.tag)

h1 # <h1 class="title">Sample #01</h1>
h1 # <h1 class="title">Sample #02</h1>
ul # <ul class="list">
h1 # <h1 class="title">Sample #03</h1>
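
`contains()` is not limited to `@class`. As a rough sketch, the same idea works on other attributes with `starts-with()` and on the text of a tag. The markup below is the `item02` part of sample.html.

```python
import lxml.html

# Assumption: only the item02 block of sample.html.
doc = lxml.html.fromstring("""
<html><body>
  <div class="contents" id="item02">
    <ul class="list">
      <li>List #1</li>
      <li>List #2</li>
      <li>List #3</li>
    </ul>
  </div>
</body></html>
""")

# <div> tags whose id begins with "item"
by_id_prefix = doc.xpath("//div[starts-with(@id, 'item')]")

# <li> tags whose text contains "#2"
by_text = doc.xpath("//li[contains(text(), '#2')]")

print([d.get("id") for d in by_id_prefix])  # ['item02']
print([li.text for li in by_text])          # ['List #2']
```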

Getting information from a tag

Tag content

htmltag = html.xpath("//div[1]/p")

>>> print(htmltag[0].text)
sample text
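
`.text` only holds the text that comes before the first child element, so it is empty for a tag like `<ul>` whose text lives in its children. The sketch below compares it with `text_content()` and with a `text()` XPath. The markup is the list from sample.html, written on one line (an assumption made so that no whitespace text appears).

```python
import lxml.html

# Assumption: the <ul> from sample.html with the whitespace between tags removed.
doc = lxml.html.fromstring("""
<html><body>
  <ul class="list"><li>List #1</li><li>List #2</li><li>List #3</li></ul>
</body></html>
""")

ul = doc.xpath("//ul")[0]
li_texts = doc.xpath("//ul/li/text()")

print(ul.text)             # None (no text before the first <li>)
print(ul.text_content())   # List #1List #2List #3
print(li_texts)            # ['List #1', 'List #2', 'List #3']
```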

Tag attributes

htmltag = html.xpath("//div[1]")

>>> print(htmltag[0].attrib)
{'class': 'contents', 'id': 'item01'}

>>> print(htmltag[0].attrib["class"])
contents

# The attribute can also be retrieved with XPath
htmltag = html.xpath("//div[1]/@class")
>>> print(htmltag[0])
contents
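
`attrib["name"]` raises `KeyError` when the attribute does not exist. When an attribute may be missing, `get()` seems safer. A minimal sketch using the `<img>` from sample.html (the `title` attribute is deliberately one the tag does not have):

```python
import lxml.html

# Assumption: only the item03 block of sample.html.
doc = lxml.html.fromstring("""
<html><body>
  <div class="contents" id="item03">
    <img src="image01.jpg" alt="image01">
  </div>
</body></html>
""")

img = doc.xpath("//img")[0]
src_values = doc.xpath("//img/@src")

print(img.get("src"))            # image01.jpg
print(img.get("title"))          # None (attribute not present)
print(img.get("title", "n/a"))   # n/a (explicit default)
print(src_values)                # ['image01.jpg']
```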

Tag name

htmltag = html.xpath("//div[1]")

>>> print(htmltag[0].tag)
div
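
`.tag` is handy when walking the children of an element, since iterating over an element yields its direct children. A small sketch, assuming a cut-down copy of sample.html (`tags_by_id` is a name made up for the example):

```python
import lxml.html

# Assumption: item01 and item03 from sample.html, each written on one line.
doc = lxml.html.fromstring("""
<html><body>
  <div class="contents" id="item01"><h1 class="title">Sample #01</h1><p>sample text</p></div>
  <div class="contents" id="item03"><h1 class="title">Sample #03</h1><img src="image01.jpg" alt="image01"></div>
</body></html>
""")

# Map each div's id to the tag names of its direct children
tags_by_id = {
    div.get("id"): [child.tag for child in div]
    for div in doc.xpath("//div[@class='contents']")
}

print(tags_by_id)   # {'item01': ['h1', 'p'], 'item03': ['h1', 'img']}
```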