
Using Nokogiri (XPath)


Posted at 2022-08-29

I used Nokogiri to parse HTML, so I'm writing down how to use it as a personal memo before I forget.

Basic usage

The sample code in the official documentation is easy to follow:

#! /usr/bin/env ruby

require 'nokogiri'
require 'open-uri'

# Fetch and parse HTML document
doc = Nokogiri::HTML(URI.open('https://nokogiri.org/tutorials/installing_nokogiri.html'))

# Search for nodes by css
doc.css('nav ul.menu li a', 'article h2').each do |link|
  puts link.content
end

# Search for nodes by xpath
doc.xpath('//nav//ul//li/a', '//article//h2').each do |link|
  puts link.content
end

# Or mix and match
doc.search('nav ul.menu li a', '//article//h2').each do |link|
  puts link.content
end

From here on, these are my own notes.

Loading from a file

In my case I was parsing HTML that had already been downloaded, so I wrote it as follows:

doc = File.open("out.html") do |f| 
  Nokogiri::HTML(f, nil, 'UTF-8')
end

/html/body/div/table/tbody/tr[1]

Searching

There appear to be the following approaches:

CSS
XPath
mix and match

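I'm not sure which one is best, but in Chrome DevTools I could select the element I wanted in the Elements panel, right-click it, and choose Copy -> Copy XPath to check its XPath, so I went with XPath this time.

The part of the fetched HTML that I wanted to process looked like this: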

・・・
  <tbody>
      <tr class="">
        <td>ABC</td>
        <td>DEF</td>
        <td>GHI</td>
      </tr>
      <tr class="">
・・・

So I prefixed the expression with "//" to search the whole document for the target elements (tbody/tr), and then processed the td elements under each tr one by one.

nodes = doc.xpath("//tbody/tr")
nodes.each do |node|
  informations = node.xpath("td")
  informations.each_with_index do |information, i|
  ・・・
  end
end

I referred to the following article:

nokogiriの使い方メモ（XPathを使った場合） (Notes on using nokogiri, with XPath)

Debugging

I debugged by using p, pp, and similar methods to inspect the information returned by Nokogiri.
