Prerequisites
- About using Nokogiri
  - A library for handling XML and HTML comfortably in Ruby
  - Version 1.6.1
  - If your version matches the reference book, it should probably be fine
Sample
Suppose we have an HTML document like this:
<html>
<head>
<title>Lightweight language</title>
</head>
<body id="test_id">
<h1>Lightweight language</h1>
<div>
<ul>
<li>
<a href="link_ruby" title="ruby">Ruby</a>
</li>
<li>
<a href="link_python" title="python">Python</a>
</li>
<li>
<a href="link_php" title="php">PHP</a>
</li>
<li>
<a href="link_perl" title="perl">Perl</a>
</li>
</ul>
</div>
</body>
</html>
From this document, we want to extract each language and its link.
require 'open-uri'
require 'nokogiri'
html = Nokogiri::HTML.parse(URI.open('/tmp/test.html'))
# Get the a tags whose parents are body(id="test_id"), then div directly below it, then ul, then li
# If the tags only need to be somewhere below rather than directly below, put a space where `>` is:
# html.css('body[id="test_id"] a').each do |a|
html.css('body[id="test_id"] > div > ul > li > a').each do |a|
  p a.text.strip
  p a[:href]
end
=>
"Ruby"
"link_ruby"
"Python"
"link_python"
"PHP"
"link_php"
"Perl"
"link_perl"
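It can do plenty more; for example, it can return all the text content of the document joined together (the output below is shown with whitespace and newlines removed).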
html = Nokogiri::HTML.parse(open('/tmp/test.html'))
p html.text
=>
"LightweightlanguageLightweightlanguageRubyPythonPHPPerl"
Overview of usage
Build the tree, search it, then read from the nodes.
- Building the tree
  - Parse an HTML or XML document and convert it into a Nokogiri::HTML::Document (or a Nokogiri::XML::Document for XML)
- Searching the tree
  - Use the search methods to track down the nodes you are after
- Reading from nodes
  - Extract the data you want from the Nokogiri::XML::NodeSet or Nokogiri::XML::Element you obtained
Building the tree
Types of parsing
Besides DOM there are apparently also SAX, Reader, and Pull parsers, but this note covers only DOM parsing.
DOM parsing
Nokogiri can parse both HTML and XML documents.
doc = Nokogiri::HTML.parse(html_document) # parse an HTML document
doc = Nokogiri::XML.parse(xml_document)   # parse an XML document
- The first argument is an IO object or a String
  - You can also hand the result of open-uri straight to Nokogiri (apparently slightly less efficient than passing a string?)
- The third argument is the character encoding of the target page
  - When the target is not UTF-8, parsing usually goes wrong, so specify it
- The second argument is the URL and the fourth is the options; in most cases you can leave both out
Visualizing the objects
html = Nokogiri::HTML(open('/tmp/test.html'))
What kind of objects get created in memory?
Simplified, the document becomes a tree of objects: a Document at the root, an Element for each tag, and Text nodes for the text inside them.
About each object
- Nokogiri creates several kinds of objects
  - Nokogiri::XML::Document, Nokogiri::XML::Element, and so on
- Every object in the tree is a node
  - Each one inherits from Nokogiri::XML::Node
    - Nokogiri::HTML::Document < Nokogiri::XML::Document < Nokogiri::XML::Node
    - Nokogiri::XML::Element < Nokogiri::XML::Node
    - Nokogiri::XML::Text < Nokogiri::XML::CharacterData < Nokogiri::XML::Node
    - etc...
  - Nokogiri::XML::NodeSet is the exception: it is a collection of nodes that includes Enumerable, not a Node subclass
- The search methods are gathered in Nokogiri::XML::Node, so the same search methods can be used on any of these objects
There are many others, but here is a brief description of the objects that get created.
- Nokogiri::XML::Node
  - Defines the operations and search behavior for a node
  - Specifically, it includes Nokogiri::XML::Searchable to implement the various methods
    - Searchable is the interface for searching the DOM
- Nokogiri::XML::NodeSet
  - Holds a list of Nokogiri::XML::Node objects
  - The result of Nokogiri::XML::Searchable#css / #xpath
- Nokogiri::HTML::Document
  - An HTML document parsed by Nokogiri
    - The return value of Nokogiri::HTML.parse
- Nokogiri::XML::DTD
  - Validates whether the target document follows the structure defined by its DTD?
- Nokogiri::XML::Element
  - The object Nokogiri uses to handle an HTML element
    - It is a C extension, so it is fast
- Nokogiri::XML::Text
  - The object Nokogiri uses to handle HTML text
Searching the tree
Ways to search
Roughly three kinds
- XPath
  - A concise syntax for pointing at and extracting specific parts of an XML document
- CSS
  - The language for specifying the style of web pages
- Other
  - Locating nodes relatively, with child, parent, and so on
Typical flow
- Search a Nokogiri::HTML::Document object with a CSS selector or XPath and get a Nokogiri::XML::NodeSet object back as the result
  - A NodeSet is a list of Nodes and can be handled like an array, so use each or [] to pick out the Element you want
  - css and xpath return a NodeSet, but at, child, and some others return a single node, so check which one you are holding
Search methods
The Nokogiri::XML::Node methods you are likely to use most.
They work on both NodeSet and Element.
XPath
Find every li that has body(id="test_id"), div, and ul as its parent tags
html.xpath('//body[@id="test_id"]/div/ul/li').each do |li|
  p li.text
  p li.at('a')[:href]
end
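- Arguments
  - The leading / denotes the root of the hierarchy
  - Each further / specifies one more level of the tag hierarchy
    - When nothing sits between two slashes (//), the next tag only has to be somewhere below, not directly below
    - div and ul specify tag names
      - * can be used as a wildcard that matches any tag
  - To match on an attribute (the value of id, for example), use tagname[@attr=xxx]
  - You can pass more than one expression
- Return value
  - Returns a NodeSet
    - An empty NodeSet when nothing matches
- Other
  - XPath can currently do some things that CSS cannot (as of 2022/10/14)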
CSS
The same search rewritten with CSS
html.css('body[id="test_id"] > div > ul > li').each do |li|
  p li.text
  p li.at('a')[:href]
end
- Arguments
  - Use > or a space where XPath uses /
    - > is for selecting a tag directly under its parent tag
      - In XPath terms, //h3/a
        - a comes directly under h3
    - A space is for allowing any tags in between
      - In XPath terms, //h3//a
        - Any tags may sit between h3 and a
  - To match on an attribute (the value of id, for example), use tagname[attr=xxx]
    - To match on the contents of the class attribute, use tagname.classname
    - tagname[attr=xxx] has to match the whole value exactly
  - To search by text, use tagname:contains("text")
    - html.at('a:contains("Ruby")')
  - You can pass more than one selector
- Return value
  - Returns a NodeSet
    - An empty NodeSet when nothing matches
Supplement
- Convenience methods
  - search("query")
    - Accepts either XPath or CSS
    - Returns a NodeSet; an empty NodeSet when nothing matches
  - at("query")
    - Accepts either XPath or CSS
    - Returns the first matching node as an Element; nil when nothing matches
- Chrome and Firefox (Firebug) can extract the path to a given tag
  - In Chrome
    - Version: 106.0.5249.103 (Official Build) (x86_64)
    - Open the developer console and select the element in the Elements tab
    - Right-click and copy it with Copy > Copy Selector / Copy XPath
Other
Starting from a given node, you can reach any node relative to it.
- child
  - Returns the first child node (this can be a Text node, such as whitespace, rather than an Element)
- children
  - Returns the child nodes as a NodeSet
- previous_sibling, previous
  - Returns the preceding sibling node
    - The node that comes just before
- next_sibling, next
  - Returns the following sibling node
    - The node that comes just after
- parent
  - Returns the parent node
- ancestors
  - Returns the ancestor nodes as a NodeSet
Reading node information
Methods for getting information about a node.
Most work on both NodeSet and Element, with a few exceptions.
Reading the document content
Nokogiri::XML::Node and NodeSet mostly share the same API, but NodeSet lacks a few of the methods (content, to_str).
On a NodeSet, the method is applied to every node in the list and the results are returned.
- to_s
  - Returns the whole node, markup included, as one string
- content, text, inner_text, to_str
  - Returns the text content of the descendant nodes joined into one string
  - Lots of aliases
- to_html, to_s
  - Returns the HTML of the whole node as one string
    - inner_html plus the node itself
- to_xhtml, to_xml
  - Returns the whole node as an XHTML / XML string
- inner_html
  - Returns the HTML of the descendant nodes joined into one string
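Reading attributes
Attributes can be handled much like an ordinary Ruby Hash.
Use these on an Element.
- ["name"], get_attribute("name")
  - Returns the attribute value as a String; nil when the attribute is absent
- key?("name"), has_attribute?("name")
  - Returns true/false for whether the attribute exists
- keys
  - Returns the attribute names as an array of Strings
- values
  - Returns the attribute values as an array of Strings
- attributes
  - Returns a Hash of attribute names and attribute objects
- attribute("name")
  - Returns the attribute object
- each { |k,v| }
  - Calls the block with each attribute name and value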
Summary
- It is easier to follow if you picture yourself walking through a tree
- Knowing whether you are currently working with a NodeSet or an Element seems to avoid confusion
  - css and xpath return a NodeSet, while at, child, and the like return a single node
- The first sample seems to be enough for most cases
- When you need finer control over what you pick out, use children and the like
  - Search with CSS to get a NodeSet, then loop over it with each to read the information from each element
  - Searching by text looks quite useful
    - Written in CSS, something like html.at('a:contains("Ruby")')