Scraping across page transitions with Mechanize

Posted at 2014-07-03

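There were a few slightly tedious points, so here are some notes.
The target site and the scraping flow look like this:

The list is split across several pages by a pager
Clicking a title on a list page opens its detail page
Use part of the detail page
Go on to click the other titles
Output as CSV (not really needed, but kept as a note)

That is roughly it. Everything is done with mechanize alone.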

require 'mechanize'
require 'csv'

class ScrapingPages

  def initialize
    @agent = Mechanize.new
    @data = []
  end

  def retrieve
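    # Scrape list pages 1 through 10
    (1..10).each do |i|
      page = @agent.get(url(i))
      each_section(page) do |section|
        # section is a Nokogiri node, so read the title text from it...
        title = section.css('h2.title > a').first.text
        # ...then find the matching link on the Mechanize page and click it
        detail = page.links_with(text: title).first.click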
        @data << {
          title: title,
          detail: detail.links.first.text
        }
      end
    end
  end

  def each_section(page)
    page.search('.articleBox').each do |section|
      yield section
    end
  end

  def url(current_page)
    "https://hogehoge.com/#{current_page}"
  end

  def make_file_as_csv
    CSV.open("./csv/scraping-#{Time.now.to_i}.csv", "wb", encoding: 'Shift_JIS') do |csv|
      csv << %w(title detail)
      @data.each do |record|
        csv << [record[:title], record[:detail]]
      end
    end
  end
end

scraping_pages = ScrapingPages.new
scraping_pages.retrieve
scraping_pages.make_file_as_csv

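The code differs here and there from what I actually ran, so skimming it is fine. The point is the click call.
mechanize uses nokogiri internally.
@agent and page above are Mechanize objects (Mechanize and Mechanize::Page),
but calling something like .css returns Nokogiri objects instead.

The click method exists only on the Mechanize side (on its link objects), so a small workaround is needed.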

title = section.css('h2.title > a').first.text
detail = page.links_with(text: title).first.click

# Calling .css('h2.title > a').first.click directly raises an error,
# because the Nokogiri node has no click method.

This is the spot. Store the text of the link to click in the title variable first,
then look the link up by that text on the Mechanize page object and click it.

The CSV part is included only as a note, because I often forget how to do it.

There may well be a better way, though.
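
As a memo for the CSV part, here is a small round-trip sketch using only the standard library. The helper names write_records and read_records are hypothetical, and creating the output directory with FileUtils.mkdir_p is my own addition, on the assumption that ./csv may not exist yet when the script runs.

```ruby
require 'csv'
require 'fileutils'
require 'tmpdir'

# Write records to a Shift_JIS CSV file, header row first.
# The parent directory is created if it is missing.
def write_records(path, records)
  FileUtils.mkdir_p(File.dirname(path))
  CSV.open(path, 'wb', encoding: 'Shift_JIS') do |csv|
    csv << %w[title detail]
    records.each { |record| csv << [record[:title], record[:detail]] }
  end
end

# Read the file back, transcoding Shift_JIS to UTF-8.
# Each row becomes a hash keyed by the header strings.
def read_records(path)
  CSV.read(path, encoding: 'Shift_JIS:UTF-8', headers: true).map(&:to_h)
end

Dir.mktmpdir do |dir|
  path = File.join(dir, 'csv', "scraping-#{Time.now.to_i}.csv")
  write_records(path, [{ title: '記事タイトル', detail: '詳細テキスト' }])
  p read_records(path)
end
```

Note that text containing characters Shift_JIS cannot represent would raise a conversion error on write, so UTF-8 output may be the safer choice unless the file has to open cleanly in a tool that expects Shift_JIS.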
