More than 5 years have passed since last update.

Trying out XPath with Requests + lxml

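##Situation

  • Python beginner
  • Not yet comfortable with BeautifulSoup
  • Want to use XPath, which I have touched a little before
  • Want to fetch content from pages that do not warrant driving Chrome through Selenium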

##Sample HTML

sample.html
<html>
    <head>
        <title>Sample</title>
    </head>
    <body>
        <div class="contents" id="item01">
            <h1 class="title">Sample #01</h1>
            <p>sample text</p>
        </div>
        <div class="contents" id="item02">
            <h1 class="title">Sample #02</h1>
            <ul class="list">
                <li>List #1</li>
                <li>List #2</li>
                <li>List #3</li>
            </ul>
        </div>
        <div class="contents" id="item03">
            <h1 class="title">Sample #03</h1>
            <img src="image01.jpg" alt="image01">
        </div>
    </body>
</html>

##Passing HTML fetched with Requests to lxml

import requests
import lxml.html

url = "http://hoge.hoge/sample.html"

response = requests.get(url)
html = lxml.html.fromstring(response.content)
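As a slightly more defensive sketch of the same step, the fetch can be given a timeout and an HTTP status check before parsing. The helper names `parse_tree` and `fetch_tree` and the 10-second default are assumptions for illustration, not part of the original article. The parsing step is exercised on inline bytes so it runs without network access.

```python
import lxml.html
import requests


def parse_tree(content: bytes) -> lxml.html.HtmlElement:
    """Parse raw HTML bytes into an lxml element tree."""
    return lxml.html.fromstring(content)


def fetch_tree(url: str, timeout: float = 10.0) -> lxml.html.HtmlElement:
    """Download a page and parse it (hypothetical helper, not from the article)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return parse_tree(response.content)


# Offline check of the parsing step with a tiny inline document
tree = parse_tree(
    b"<html><head><title>Sample</title></head>"
    b"<body><p>sample text</p></body></html>"
)
title = tree.xpath("//title/text()")[0]
print(title)
```

Passing `response.content` (bytes) rather than `response.text` lets lxml work out the encoding from the document itself.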

##Getting tags
###Simple XPath

htmltag = html.xpath("//div")
# Returns a list of lxml.html.HtmlElement objects

>>> print(htmltag)
[<Element div at 0x***********>, <Element div at 0x***********>, <Element div at 0x***********>]
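Because the result is an ordinary Python list, it can be iterated over directly. The sketch below uses a shortened inline copy of the sample HTML (an assumption made so that it runs without a server) and also shows that an XPath with no matches yields an empty list.

```python
import lxml.html

# Shortened inline copy of sample.html so the example runs offline
html = lxml.html.fromstring("""
<html><body>
    <div class="contents" id="item01"><p>sample text</p></div>
    <div class="contents" id="item02"><ul class="list"><li>List #1</li></ul></div>
    <div class="contents" id="item03"><img src="image01.jpg" alt="image01"></div>
</body></html>
""")

divs = html.xpath("//div")
ids = [div.get("id") for div in divs]  # read the id attribute of each match
print(len(divs), ids)

# An XPath that matches nothing returns an empty list, not None
missing = html.xpath("//table")
print(missing)
```

Checking `if not result:` before indexing with `[0]` avoids an `IndexError` when a page does not contain the expected tag.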

###XPath with a position index

htmltag = html.xpath("//div[2]/ul/li[1]")
# <li>List #1</li>
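XPath positions are 1-based, and `//div[2]` means "every div that is the second div child of its parent", not "the second div in the document". In sample.html the two readings coincide because all divs are siblings. The sketch below uses hypothetical nested markup (not the article's sample) to show where they differ.

```python
import lxml.html

# Hypothetical nested markup, not the article's sample.html
html = lxml.html.fromstring("""
<html><body>
    <div id="outer">
        <div id="inner01"></div>
        <div id="inner02"></div>
    </div>
    <div id="second"></div>
</body></html>
""")

# Second div child of each parent
per_parent = [d.get("id") for d in html.xpath("//div[2]")]
# Second div in the whole document, in document order
overall = [d.get("id") for d in html.xpath("(//div)[2]")]

print(per_parent)
print(overall)
```

Wrapping the path in parentheses applies the index to the combined result set instead of to each parent separately.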

###XPath that selects by id or class, as a CSS selector would
####id

htmltag = html.xpath("//div[@id='item01']")
# A list is returned even when only one tag matches

>>> print(htmltag)
[<Element div at 0x***********>]

####class

divs = html.xpath("//div[@class='contents']")
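Note that `@class='contents'` compares the whole attribute string, so it does not match an element carrying several classes. The sketch below uses hypothetical markup (the extra `featured` and `table-of-contents` classes are not in sample.html) to compare an exact match, a substring match, and a common token-matching idiom.

```python
import lxml.html

# Hypothetical markup with multi-valued class attributes
html = lxml.html.fromstring("""
<html><body>
    <div class="contents" id="item01"></div>
    <div class="contents featured" id="item02"></div>
    <div class="table-of-contents" id="toc"></div>
</body></html>
""")


def ids(xpath: str) -> list:
    """Return the id attributes of every element matched by xpath."""
    return [el.get("id") for el in html.xpath(xpath)]


# Exact string comparison: misses "contents featured"
exact = ids("//div[@class='contents']")
# Substring match: also catches "table-of-contents"
loose = ids("//div[contains(@class, 'contents')]")
# Token match: pad with spaces so only the whole word "contents" matches
token = ids("//div[contains(concat(' ', normalize-space(@class), ' '), ' contents ')]")

print(exact, loose, token)
```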

###XPath to get tags containing a string

# All tags whose class name contains "i"
htmltag = html.xpath("//*[contains(@class, 'i')]")
for item in htmltag:
    print(item.tag)

h1 # <h1 class="title">Sample #01</h1>
h1 # <h1 class="title">Sample #02</h1>
ul # <ul class="list">
h1 # <h1 class="title">Sample #03</h1>
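The same kind of string test can be applied to text nodes and to other attributes. The following is a minimal sketch on a shortened inline copy of the sample HTML (an assumption so that it runs offline), using the standard XPath 1.0 functions `contains()` and `starts-with()`.

```python
import lxml.html

# Shortened inline copy of sample.html
html = lxml.html.fromstring("""
<html><body>
    <div class="contents" id="item01"><h1 class="title">Sample #01</h1></div>
    <div class="contents" id="item02">
        <h1 class="title">Sample #02</h1>
        <ul class="list"><li>List #1</li><li>List #2</li></ul>
    </div>
</body></html>
""")

# h1 tags whose text contains "#02"
by_text = html.xpath("//h1[contains(text(), '#02')]")
heading_text = by_text[0].text

# Any tag whose id starts with "item"
by_prefix = html.xpath("//*[starts-with(@id, 'item')]")
prefix_ids = [el.get("id") for el in by_prefix]

print(heading_text, prefix_ids)
```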

##Getting information inside a tag
###Tag contents

htmltag = html.xpath("//div[1]/p")

>>> print(htmltag[0].text)
sample text
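`.text` only holds the text that appears before the first child element, so for a container such as `<ul>` it is just whitespace. A minimal sketch, using the second div of the sample HTML inline, compares `.text` with `text_content()`, which collects the text of all descendants.

```python
import lxml.html

# The second div of sample.html, inlined so the example runs offline
html = lxml.html.fromstring("""
<html><body>
    <div class="contents" id="item02">
        <h1 class="title">Sample #02</h1>
        <ul class="list">
            <li>List #1</li>
            <li>List #2</li>
            <li>List #3</li>
        </ul>
    </div>
</body></html>
""")

ul = html.xpath("//ul")[0]

# .text of the <ul> is only the whitespace before the first <li>
print(repr(ul.text))

# Read each <li> individually
items = [li.text for li in ul.xpath("./li")]

# text_content() concatenates all descendant text; collapse the whitespace
joined = " ".join(ul.text_content().split())

print(items)
print(joined)
```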

###Tag attributes

htmltag = html.xpath("//div[1]")

>>> print(htmltag[0].attrib)
{'class': 'contents', 'id': 'item01'}

>>> print(htmltag[0].attrib["class"])
contents

# Can also be retrieved with XPath
htmltag = html.xpath("//div[1]/@class")
>>> print(htmltag[0])
contents
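Indexing `attrib` with a missing key raises `KeyError`, so `.get()` is often more convenient when an attribute may be absent. The sketch below reads the `<img>` attributes from the third div of the sample HTML, inlined here. The `"unknown"` default is an arbitrary value chosen for illustration.

```python
import lxml.html

# The third div of sample.html, inlined so the example runs offline
html = lxml.html.fromstring("""
<html><body>
    <div class="contents" id="item03">
        <h1 class="title">Sample #03</h1>
        <img src="image01.jpg" alt="image01">
    </div>
</body></html>
""")

img = html.xpath("//img")[0]

src = img.get("src")                        # existing attribute
width = img.get("width")                    # missing attribute gives None
width_or_default = img.get("width", "unknown")  # or a caller-supplied default

# The same value straight from XPath, as a list of strings
srcs = html.xpath("//img/@src")

print(src, width, width_or_default, srcs)
```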

###Tag name

htmltag = html.xpath("//div[1]")

>>> print(htmltag[0].tag)
div
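One place where `.tag` is handy is when walking the direct children of an element, since iterating over an element yields its children. A minimal sketch on the second div of the sample HTML, inlined here:

```python
import lxml.html

# The second div of sample.html, inlined so the example runs offline
html = lxml.html.fromstring("""
<html><body>
    <div class="contents" id="item02">
        <h1 class="title">Sample #02</h1>
        <ul class="list">
            <li>List #1</li>
            <li>List #2</li>
            <li>List #3</li>
        </ul>
    </div>
</body></html>
""")

div = html.xpath("//div[@id='item02']")[0]

# Iterating over an element yields its direct child elements
child_tags = [child.tag for child in div]

# Keep only the children with a particular tag name
lists_only = [child for child in div if child.tag == "ul"]

print(child_tags)
print(lists_only[0].get("class"))
```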