
Scraping: collecting data while navigating between pages

Posted at 2021-11-04

Development environment

ruby 2.6.5

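Using the article above as a reference,

I wrote a script that stores Yahoo News data
in a CSV file.

The idea is to follow the pagination here from page to page

and fetch the titles of all the articles.

Once this works, it looks like a lot of things could be automated.

The script is below.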

require 'nokogiri'
require 'open-uri'
require 'csv'

url_base = "https://news.yahoo.co.jp/topics/top-picks/"
add_url = "https://news.yahoo.co.jp/"

# Collect the href of every pagination link on the given page.
def get_categories(url)
  # URI.open replaces Kernel#open for URLs (deprecated in Ruby 2.7, removed in 3.0).
  html = URI.open(url)
  doc = Nokogiri::HTML.parse(html)
  doc.css(".pagination_item a").map { |link| link[:href] }
end

cat_list = get_categories(url_base)
cat_list.pop(1)                      # drop the last href (the "next" link)
cat_list.unshift("topics/top-picks") # add the current page, which has no link
infos = []

cat_list.each do |cat|
  url = add_url + cat
  html = URI.open(url)
  doc = Nokogiri::HTML.parse(html)
  titles = doc.css(".newsFeed_item_title")
  # Number the titles from 1; the counter restarts on every page.
  titles.each.with_index(1) do |title, i|
    infos << [i, title.text]
  end
end

# Write one [number, title] row per article.
CSV.open("result.csv", "w") do |csv|
  infos.each do |info|
    csv << info
  end
end

There is probably a fair amount of redundancy in it, but it works for now.
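One spot that could probably be tightened is how the page URLs are built. The script joins the base URL and each href by plain string concatenation, which yields a double slash if an href starts with "/". The sketch below uses URI.join from the standard library instead. The two hrefs are made-up examples, not values confirmed from the actual page.

```ruby
require "uri"

base = "https://news.yahoo.co.jp/"

# Hypothetical hrefs: one relative, one starting with a slash.
hrefs = ["topics/top-picks", "/topics/top-picks?page=2"]

# URI.join resolves each href against the base, so both forms
# end up as a clean absolute URL.
urls = hrefs.map { |href| URI.join(base, href).to_s }

puts urls
```

With this approach the href taken from the page and the hand-written path for the first page can be handled the same way.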

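The flow is:

① Fetch the pagination hrefs with nokogiri as an array
② Process the fetched hrefs (remove the href for "next", add the href of the current page, and so on)
③ Loop over the processed href array with each, fetching the article titles with nokogiri
④ Store the fetched titles in a CSV file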
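Step ② can be tried on its own without any network access. The following is a minimal sketch that assumes the last href in the array is the "next" link, as the script does. The helper name pages_to_visit and the sample hrefs are hypothetical and used only for illustration.

```ruby
# Hypothetical helper: turn the raw pagination hrefs into the pages to visit.
def pages_to_visit(hrefs, current_page)
  pages = hrefs.dup           # leave the caller's array untouched
  pages.pop                   # drop the last href, assumed to be the "next" link
  pages.unshift(current_page) # the current page has no link of its own
  pages
end

# Made-up hrefs; the last one stands in for the "next" link.
raw = [
  "/topics/top-picks?page=2",
  "/topics/top-picks?page=3",
  "/topics/top-picks?page=2"
]

pages = pages_to_visit(raw, "topics/top-picks")
puts pages
```

Keeping this step in a small method makes it easy to check the list before any requests are sent.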

When I actually run it,

a CSV file is created like this,
and the fetched information is stored in it.
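To get a feel for what the rows look like, here is a small sketch that builds the same [number, title] rows in memory with CSV.generate and reads them back with CSV.parse. The titles are placeholder samples, not real output from the script.

```ruby
require "csv"

# Placeholder titles; the real script collects these from each page.
titles = ["Sample title A", "Sample title B, with a comma"]

# Build the CSV text in memory, numbering the rows from 1.
csv_text = CSV.generate do |csv|
  titles.each.with_index(1) { |title, i| csv << [i, title] }
end

puts csv_text

# Parsing the text back gives arrays of strings, one per row.
rows = CSV.parse(csv_text)
```

The CSV library quotes any title that contains a comma, so titles are read back intact.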
