
Released "Yasuri", a Web Scraping Library

Ruby

Last updated at 2015-10-24, posted at 2015-05-05

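Introduction

Hello. I love web scraping, and I wanted scraping in Ruby to be easier, so I wrote a library.
I have finally finished writing the README and USAGE, so I am releasing it.

The source is available on GitHub.
The English in the documentation is especially shaky, so I would be glad to receive corrections.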
tac0x2a/yasuri

It is also published as a gem, so you can try it out with the following command.

$ gem install yasuri

I also wrote up a simple sample with an explanation.
Easy Scraping with Yasuri

Please give it a try if you like.

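What is Yasuri

Yasuri (鑢) is a library that supports "Mechanize" to make web scraping easy.

Yasuri lets you write the common operations in scraping concisely.
For example:

Open multiple links in a page, scrape each linked page, and get the results as a Hash
Scrape multiple texts in a page, name them, and put them into a Hash
Scrape each of the tables that appear repeatedly in a page and get them as an Array
Scrape, in order, only the first three of the pages provided by pagination

All of these can be implemented easily.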

Example

require 'yasuri'
require 'mechanize'

# Build the node tree with the DSL
root = Yasuri.links_root '//*[@id="menu"]/ul/li/a' do
         text_title '//*[@id="contents"]/h2'
         text_content '//*[@id="contents"]/p[1]'
       end

agent = Mechanize.new
root_page = agent.get("http://some.scraping.page.net/")

result = root.inject(agent, root_page)
# => [ {"title" => "PageTitle1", "content" => "Page Contents1" },
#      {"title" => "PageTitle2", "content" => "Page Contents2" }, ...  ]

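This example opens each page linked from the xpath given to the LinkNode (links_root), and scrapes the two texts specified by the xpaths of the TextNodes (text_title, text_content).

In this way, Yasuri is a library in which you define the scraping targets and the structure of the result as a tree, and it returns the result as a Hash or Array.
Incidentally, instead of the DSL shown above, the tree can also be defined in JSON (or rather, as a Hash).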

src = <<-EOJSON
   { "node"     : "links",
     "name"     : "root",
     "path"     : "//*[@id='menu']/ul/li/a",
     "children" : [
                    { "node" : "text",
                      "name" : "title",
                      "path" : "//*[@id='contents']/h2"
                    },
                    { "node" : "text",
                      "name" : "content",
                      "path" : "//*[@id='contents']/p[1]"
                    }
                  ]
   }
EOJSON
root = Yasuri.json2tree(src)
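
To make the shape of that definition easier to see, here is a minimal sketch that parses the same JSON with Ruby's standard json library and prints one line per node. It does not use Yasuri at all; describe_tree is a hypothetical helper introduced only for this illustration, and it is not how Yasuri.json2tree works internally.

```ruby
require 'json'

# The same tree definition as above, as a JSON string.
src = <<-EOJSON
   { "node"     : "links",
     "name"     : "root",
     "path"     : "//*[@id='menu']/ul/li/a",
     "children" : [
                    { "node" : "text",
                      "name" : "title",
                      "path" : "//*[@id='contents']/h2"
                    },
                    { "node" : "text",
                      "name" : "content",
                      "path" : "//*[@id='contents']/p[1]"
                    }
                  ]
   }
EOJSON

# describe_tree is a hypothetical helper (not part of Yasuri).
# It returns one line per node, indented by its depth in the tree.
def describe_tree(node, depth = 0)
  line = "#{'  ' * depth}#{node['node']}:#{node['name']} (#{node['path']})"
  children = node.fetch('children', [])
  [line] + children.flat_map { |child| describe_tree(child, depth + 1) }
end

tree  = JSON.parse(src)
lines = describe_tree(tree)
puts lines
# links:root (//*[@id='menu']/ul/li/a)
#   text:title (//*[@id='contents']/h2)
#   text:content (//*[@id='contents']/p[1])
```

The "links" node is the parent, and the two "text" nodes under "children" become the keys of each Hash in the result.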

I have prepared a sample => Yasuri Sample

Example) Scraping each of multiple tables in a page

<!-- http://yasuri.example.net -->
<html>
  <head>
    <title>Books</title>
  </head>
  <body>
    <h1>1996</h1>
    <table>
      <thead>
        <tr><th>Title</th> <th>Publication Date</th></tr>
      </thead>
      <tr><td>The Perfect Insider</td>      <td>1996/4/5</td></tr>
      <tr><td>Doctors in Isolated Room</td> <td>1996/7/5</td></tr>
      <tr><td>Mathematical Goodbye</td>     <td>1996/9/5</td></tr>
    </table>

    <h1>1997</h1>
    <table>
      <thead>
        <tr><th>Title</th> <th>Publication Date</th></tr>
      </thead>
      <tr><td>Jack the Poetical Private</td> <td>1997/1/5</td></tr>
      <tr><td>Who Inside</td>                <td>1997/4/5</td></tr>
      <tr><td>Illusion Acts Like Magic</td>  <td>1997/10/5</td></tr>
    </table>

    <h1>1998</h1>
    <table>
      <thead>
        <tr><th>Title</th> <th>Publication Date</th></tr>
      </thead>
      <tr><td>Replaceable Summer</td>   <td>1998/1/7</td></tr>
      <tr><td>Switch Back</td>          <td>1998/4/5</td></tr>
      <tr><td>Numerical Models</td>     <td>1998/7/5</td></tr>
      <tr><td>The Perfect Outsider</td> <td>1998/10/5</td></tr>
    </table>
  </body>
</html>

agent = Mechanize.new
page = agent.get("http://yasuri.example.net")

node = Yasuri.struct_tables '/html/body/table' do
  struct_table './tr' do
    text_title    './td[1]'
    text_pub_date './td[2]'
  end
end

node.inject(agent, page)

# =>      [ { "table" => [ { "title"    => "The Perfect Insider",
#                           "pub_date" => "1996/4/5" },
#                         { "title"    => "Doctors in Isolated Room",
#                           "pub_date" => "1996/7/5" },
#                         { "title"    => "Mathematical Goodbye",
#                           "pub_date" => "1996/9/5" }]},
#          { "table" => [ { "title"    => "Jack the Poetical Private",
#                           "pub_date" => "1997/1/5" },
#                         { "title"    => "Who Inside",
#                           "pub_date" => "1997/4/5" },
#                         { "title"    => "Illusion Acts Like Magic",
#                           "pub_date" => "1997/10/5" }]},
#          { "table" => [ { "title"    => "Replaceable Summer",
#                           "pub_date" => "1998/1/7" },
#                         { "title"    => "Switch Back",
#                           "pub_date" => "1998/4/5" },
#                         { "title"    => "Numerical Models",
#                           "pub_date" => "1998/7/5" },
#                         { "title"    => "The Perfect Outsider",
#                           "pub_date" => "1998/10/5" }]}
#       ]
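
Since the result is just nested Arrays and Hashes, it can be post-processed with plain Ruby. The following is a minimal sketch, assuming the result has the shape shown above; the data is a shortened hand-written copy of that result, and the grouping by year is my own illustration, not a Yasuri feature.

```ruby
# A shortened copy of the result shown above (parts of the first two tables).
result = [
  { "table" => [ { "title" => "The Perfect Insider",      "pub_date" => "1996/4/5" },
                 { "title" => "Doctors in Isolated Room", "pub_date" => "1996/7/5" } ] },
  { "table" => [ { "title" => "Jack the Poetical Private", "pub_date" => "1997/1/5" } ] }
]

# Flatten the per-table arrays into one list of rows.
rows = result.flat_map { |t| t["table"] }

# Group the titles by publication year, taken from the "pub_date" string.
titles_by_year = rows.group_by { |row| row["pub_date"].split("/").first }
                     .transform_values { |rs| rs.map { |r| r["title"] } }

titles_by_year.each { |year, titles| puts "#{year}: #{titles.join(', ')}" }
# 1996: The Perfect Insider, Doctors in Isolated Room
# 1997: Jack the Poetical Private
```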

Example) Parsing each page provided by pagination

<!-- http://yasuri.example.net/page01.html -->
<html>
  <head><title>Page01</title></head>
  <body>
    <p>Patination01</p>

    <nav class='pagination'>
      <span class='prev'> &laquo; PreviousPage </span>
      <span class='page'> 1 </span>
      <span class='page'> <a href="./page02.html">2</a> </span>
      <span class='page'> <a href="./page03.html">3</a> </span>
      <span class='page'> <a href="./page04.html">4</a> </span>
      <span class='next'> <a href="./page02.html" class="next" rel="next">NextPage &raquo;</a> </span>
    </nav>

  </body>
</html>

Assume that page02.html through page04.html are structured the same way.

agent = Mechanize.new
page = agent.get("http://yasuri.example.net/page01.html")

node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']" , limit:3 do
         text_content '/html/body/p'
       end

node.inject(agent, page)
# => [ {"content" => "Patination01"},
#     {"content" => "Patination02"},
#     {"content" => "Patination03"}]
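
To illustrate the idea behind limit, here is a minimal sketch of following "next" links up to a maximum number of pages. It runs against an in-memory Hash standing in for the four pages, so it needs neither Mechanize nor Yasuri; walk_pages and the pages Hash are hypothetical names made up for this illustration, and this is not Yasuri's actual implementation.

```ruby
# In-memory stand-in for the four pages: each entry holds the page text and
# the key of the page its "next" link points to (nil on the last page).
pages = {
  "page01.html" => { content: "Patination01", next_page: "page02.html" },
  "page02.html" => { content: "Patination02", next_page: "page03.html" },
  "page03.html" => { content: "Patination03", next_page: "page04.html" },
  "page04.html" => { content: "Patination04", next_page: nil }
}

# walk_pages is a hypothetical helper, not part of Yasuri.
# Starting from start, it follows next_page links and visits at most
# limit pages, collecting one Hash per visited page.
def walk_pages(pages, start, limit:)
  results = []
  current = start
  while current && results.size < limit
    page = pages.fetch(current)
    results << { "content" => page[:content] }
    current = page[:next_page]
  end
  results
end

contents = walk_pages(pages, "page01.html", limit: 3)
contents.each { |c| puts c["content"] }
# Patination01
# Patination02
# Patination03
```

In this sketch the starting page counts toward the limit, which matches the three results shown above, and the walk also stops early when there is no further "next" link.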

For other examples, see USAGE.ja.md.

# Writing the documentation was the most tiring part...
