A Ruby script that scrapes images from Yahoo image search

Posted at 2018-01-22

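Script overview

This Ruby script scrapes the images from the Yahoo image search results for a given word and saves them. Someone had already written a Python script that does this, so I tried writing a Ruby script that does the same thing.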

Reference pages

Summary of the state of image collection on Yahoo, Bing, and Google
A scraping beginner tried collecting cute cat pictures from the web with a single Python command

Script code

picture_from_yahoo.rb

#!/usr/bin/env ruby

require 'open-uri'
require 'nokogiri'

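# Thin wrapper around open-uri that returns a response body and its MIME type.
class Fetcher
  def initialize(ua = '')
    @ua = ua
  end

  # Fetches url and returns [body, mime_type].
  def fetch(url)
    p url
    # Send a User-Agent header only when one was configured.
    headers = @ua.empty? ? {} : { 'User-Agent' => @ua }
    # Use URI.open: on Ruby 3.0 and later, Kernel#open no longer opens URLs.
    URI.open(url, headers) do |response|
      [response.read, response.content_type]
    end
  end
end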

$fetcher = Fetcher.new

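# Downloads each image found for word into a directory named after word.
def fetch_and_save_img(word)
  data_dir = word.to_s + '/'
  Dir.mkdir(data_dir) unless Dir.exist?(data_dir)

  img_url_list(word).each_with_index do |img_url, i|
    sleep(1) # Pause one second between requests.
    img, mime = $fetcher.fetch(img_url)
    next if mime.nil? || img.nil?
    # Pick the extension from the MIME type; skip anything that is not JPEG, PNG, or GIF.
    ext = ''
    if mime.include?('jpeg')
      ext = '.jpg'
    elsif mime.include?('png')
      ext = '.png'
    elsif mime.include?('gif')
      ext = '.gif'
    else
      next
    end
    result_file_name = data_dir + i.to_s + ext
    # Write in binary mode with write; text mode and puts can corrupt image data.
    File.open(result_file_name, 'wb') do |file|
      file.write(img)
    end
  end
end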

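# Returns the unique image URLs linked from the search results page for word.
def img_url_list(word)
  # URI.escape was removed in Ruby 3.0, so percent-encode only the search word.
  query = URI.encode_www_form_component(word)
  url = "https://search.yahoo.co.jp/image/search?n=60&p=#{query}&search.x=1"
  content, _mime = $fetcher.fetch(url)
  doc = Nokogiri::HTML.parse(content)
  a_list = doc.css('a[target="imagewin"]')
  # open-uri refuses some redirects (for example, from https to http).
  # To avoid those errors, keep only direct links to image files.
  # Duplicate entries are removed as well.
  a_list.map { |a| a.attr('href') }
        .compact
        .select { |str| str.start_with?('http') }
        .select { |str| str.end_with?('jpg', 'png', 'gif') }
        .uniq
end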

if ARGV[0].nil?
  puts 'No search word was given.'
  puts 'Usage: ruby picture_from_yahoo.rb [word]'
  exit
end

word = ARGV[0]
fetch_and_save_img word
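
The two pure-Ruby steps inside img_url_list, building the search URL and filtering the links, can be tried without any network access. The following is a minimal sketch, not part of the original script: the helper names build_search_url and filter_image_links and the sample links are hypothetical, and URI.encode_www_form is used here as one possible way to encode the query.

```ruby
require 'uri'

# Hypothetical helper: builds the search URL with a percent-encoded query string.
def build_search_url(word, count = 60)
  query = URI.encode_www_form({ 'n' => count, 'p' => word, 'search.x' => 1 })
  "https://search.yahoo.co.jp/image/search?#{query}"
end

# Hypothetical helper: keeps unique absolute links that end in an image extension.
def filter_image_links(hrefs)
  hrefs.compact
       .select { |href| href.start_with?('http') }
       .select { |href| href.end_with?('jpg', 'png', 'gif') }
       .uniq
end

# Made-up sample links standing in for the href values found on a result page.
sample_hrefs = [
  'https://example.com/cat.jpg',
  'https://example.com/cat.jpg',   # duplicate
  '/image/search?p=cat',           # relative link
  'https://example.com/page.html', # not an image file
  nil,                             # anchor without an href
  'http://example.org/kitten.png'
]

puts build_search_url('猫')
p filter_image_links(sample_hrefs)
```

Filtering on the file extension is a rough heuristic: image URLs that carry a query string or use an extension such as .jpeg are dropped.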

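Usage

Pass the search word as an argument when you run the script, and it saves the images from the image search results into a directory named after the word.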

./picture_from_yahoo.rb "猫"
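
To see where the files end up, here is a minimal sketch of the naming rule the script follows: the directory is the search word, the file name is the index of the link in the result list, and the extension comes from the MIME type. The constant EXTENSIONS and the helpers extension_for and output_path are hypothetical names introduced only for this illustration.

```ruby
# Assumed mapping, mirroring the three image types the script accepts.
EXTENSIONS = { 'jpeg' => '.jpg', 'png' => '.png', 'gif' => '.gif' }.freeze

# Hypothetical helper: returns the extension for a MIME type, or nil if unsupported.
def extension_for(mime)
  return nil if mime.nil?

  _key, ext = EXTENSIONS.find { |key, _| mime.include?(key) }
  ext
end

# Hypothetical helper: path of the saved image for a search word and link index,
# or nil when the MIME type is not one of the supported image types.
def output_path(word, index, mime)
  ext = extension_for(mime)
  ext && File.join(word.to_s, "#{index}#{ext}")
end

p output_path('猫', 0, 'image/jpeg')
p output_path('猫', 1, 'image/png')
p output_path('猫', 2, 'text/html')
```

Because the index counts every link, including the ones that are skipped, the saved file numbers can have gaps.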
