More than 5 years have passed since last update.

Heavy-Duty Scraping with Anonymous IPs ($10/month), Ruby + Nokogiri Source Included


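Introduction

This post shows how to rotate anonymous IPs so you can scrape aggressively without worrying about access limits.

This time I'll show a simple approach that needs only Ruby + Nokogiri, but you can do much the same by putting a proxy in front of PhantomJS or Selenium.

  • Don't abuse this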

http://proxymesh.com/

http://proxymesh.com/ First, sign up here.

There is also a free version, so use that if it covers your needs.

Note down the username ① and password ② you register here; you'll need them later.


Note down the Host ③ and Port ④ of the authorized proxy shown on the dashboard.
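Since these four values (①–④) are credentials, one option is to keep them out of the source and read them from environment variables. The following is only a minimal sketch: the variable names (`PROXYMESH_HOST` and so on) and the helper `load_proxy_settings` are made up for illustration and are not something ProxyMesh defines.

```ruby
# Hypothetical environment variable names; pick whatever fits your setup.
PROXY_ENV_KEYS = {
  host:     'PROXYMESH_HOST',     # ③ e.g. http://us.proxymesh.com
  port:     'PROXYMESH_PORT',     # ④ e.g. 31280
  username: 'PROXYMESH_USERNAME', # ①
  password: 'PROXYMESH_PASSWORD'  # ②
}.freeze

# Reads the four settings from an ENV-like object (ENV or a plain Hash)
# and fails early, listing whichever variables are missing or empty.
def load_proxy_settings(env = ENV)
  missing = PROXY_ENV_KEYS.values.reject { |key| env[key] && !env[key].empty? }
  raise ArgumentError, "missing: #{missing.join(', ')}" unless missing.empty?

  PROXY_ENV_KEYS.transform_values { |key| env[key] }
end
```

The returned hash has the keys `:host`, `:port`, `:username` and `:password`, so the values can be passed on in whatever order your own code expects.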

Go Go with Nokogiri

Using the proxy_http_basic_authentication option of #open does the trick

require "open-uri"

# Opens URLs through an HTTP proxy that requires basic authentication.
class OpenWithProxy
  def initialize(proxy_host, proxy_port, username, pass)
    @proxy_uri = URI.parse("#{proxy_host}:#{proxy_port}")
    @username  = username
    @pass      = pass
  end

  def open(url)
    # URI.open is used because Kernel#open no longer handles URLs as of Ruby 3.0.
    URI.open(url, proxy_http_basic_authentication: [@proxy_uri, @username, @pass])
  end
end
owp = OpenWithProxy.new('host from ③', 'port from ④', 'username from ①', 'password from ②')
# For example: owp = OpenWithProxy.new('http://us.proxymesh.com', '31280', 'username', 'password')
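Note that the class joins host and port with a colon and hands the result to `URI.parse`, so the host has to include the scheme (`http://...`), as in the example. A host given without a scheme is unlikely to be parsed as an HTTP URI. The helper below, `build_proxy_uri`, is a hypothetical addition (not part of the original code) that sketches one way to tolerate both forms.

```ruby
require 'uri'

# Builds the proxy URI, prepending "http://" when the host has no scheme.
# Assumption: the proxy speaks plain HTTP, as in the example above.
def build_proxy_uri(host, port)
  host = "http://#{host}" unless host.match?(%r{\A[a-z][a-z0-9+.-]*://}i)
  URI.parse("#{host}:#{port}")
end

uri = build_proxy_uri('us.proxymesh.com', 31280)
puts uri.host # us.proxymesh.com
puts uri.port # 31280
```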

Check that requests really go through the proxy

You can check your IP with http://api.ipify.org.

require 'json'
check_ip = -> { JSON.parse(owp.open('http://api.ipify.org?format=json').read)['ip'] }

check_ip.call
#=> "166.78.113.337"

check_ip.call
#=> "166.28.153.347"

check_ip.call
#=> "192.237.163.323"

...

Oh, it looks like one of 10 IPs is picked at random each time. Nice.
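To get a rough feel for how many distinct IPs actually come back, you could call the checker a number of times and count the results. This is just a sketch: `tally_ips` is a made-up helper, and the lambda here is a stand-in that returns fixed private addresses so the code runs without a proxy account. With the real setup you would pass `check_ip` instead.

```ruby
# Calls `check` the given number of times and counts how often each
# result came back. `check` is anything that responds to #call.
def tally_ips(check, times: 20)
  Array.new(times) { check.call }.each_with_object(Hash.new(0)) do |ip, counts|
    counts[ip] += 1
  end
end

# Stand-in for check_ip: cycles through fixed sample addresses.
fake_ips   = %w[10.0.0.1 10.0.0.2 10.0.0.1].cycle
fake_check = -> { fake_ips.next }

counts = tally_ips(fake_check, times: 6)
puts counts.size # number of distinct IPs seen
```

Keep in mind that every call is a real request through the proxy, so a large `times` value uses up quota.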

Hand it to Nokogiri and scrape away

require 'nokogiri'

doc = Nokogiri::HTML owp.open('https://www.google.com')

doc.title
#=> "Google"

Nice!
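When scraping in bulk, individual requests through a proxy can fail now and then. A small retry wrapper is one way to cope. The sketch below is an assumption on my part rather than anything from ProxyMesh or Nokogiri: `with_retries` is a hypothetical helper, and the failure is simulated so the example runs offline.

```ruby
# Runs the block, retrying up to `attempts` times in total and sleeping
# `wait` seconds between tries. Re-raises the last error when out of tries.
def with_retries(attempts: 3, wait: 1.0)
  tries = 0
  begin
    tries += 1
    yield
  rescue StandardError
    raise if tries >= attempts
    sleep wait
    retry
  end
end

# With the owp object above, usage might look like:
#   doc = with_retries { Nokogiri::HTML(owp.open('https://www.google.com')) }

# Offline demonstration: fail twice, then succeed.
calls = 0
result = with_retries(attempts: 3, wait: 0) do
  calls += 1
  raise IOError, 'simulated failure' if calls < 3
  'ok'
end
puts result # ok
puts calls  # 3
```

Because the IP appears to change on every request, a retry will often go out from a different address than the failed attempt.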
