More than 5 years have passed since last update.

Getting a soup from sites that rely on JavaScript

Posted at 2016-07-14

Notes on several ways to get a soup.
For Google Image Search and other sites that rely on JavaScript, it seems selenium is required.
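Why a plain fetch is not enough can be seen in a small sketch. BeautifulSoup only parses the HTML text it is given and never runs scripts, so content that a script would insert is simply absent from the soup. The page below is a made-up example (the `results` id and the script are hypothetical, not taken from Google), and `results_text` is a helper name introduced only for this illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical page: the <div> is empty in the HTML and is only filled in
# when a browser executes the <script>.
STATIC_HTML = """
<html><body>
<div id="results"></div>
<script>
document.getElementById("results").textContent = "3 images found";
</script>
</body></html>
"""


def results_text(html):
    """Return the text of the #results element as BeautifulSoup sees it."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find(id="results").get_text()


# The script was never executed, so the element is still empty.
print(repr(results_text(STATIC_HTML)))  # prints ''
```

A browser driven by selenium would run the script first, and `driver.page_source` would then contain the filled-in element.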

get_soup.py
#-*- coding:utf-8 -*-

from bs4 import BeautifulSoup

def get_soup_uulib2(url):
  import urllib2
  opener = urllib2.build_opener()
  opener.addheaders = [('User-agent', 'Mozilla/5.0')]
  page = opener.open(url)
  soup = BeautifulSoup(page,"lxml")
  return soup

def get_soup_urequests(url):
  import requests
  s = requests.Session()
  r = s.get(url)
  soup = BeautifulSoup(r.text,"lxml")
  return soup

def get_soup_uselenium(url):
  from selenium import webdriver
  # Requires chromedriver: https://sites.google.com/a/chromium.org/chromedriver/downloads
  chromedriver = "./chromedriver"
  driver = webdriver.Chrome(chromedriver)
  driver.get(url)
  page_source = driver.page_source
  driver.quit()  # close the browser once the rendered HTML has been captured
  soup = BeautifulSoup(page_source, "lxml")
  return soup
 
# JavaScript enabled (page rendered in a real browser via selenium)
print get_soup_uselenium("https://www.google.co.jp/search?q=ねこ")
# JavaScript disabled (plain HTTP fetch, scripts are not executed)
#print get_soup_uulib2("https://www.google.co.jp/search?q=ねこ")
#print get_soup_urequests("https://www.google.co.jp/search?q=ねこ")
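
The listing above targets Python 2 (`urllib2`, the `print` statement). On Python 3, a rough equivalent of `get_soup_uulib2` might look like the sketch below. The names `build_request`, `get_soup_urllib` and `SEARCH_URL` are made up for this example, and it assumes `bs4` and `lxml` are installed. One difference worth noting: Python 3's `urlopen` fails on raw non-ASCII characters in a URL, so the query is percent-encoded first.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen


def build_request(url, user_agent="Mozilla/5.0"):
    """Build a request carrying the same User-agent header as the urllib2 version."""
    return Request(url, headers={"User-Agent": user_agent})


def get_soup_urllib(url):
    # Imported here so that build_request stays usable without bs4 installed.
    from bs4 import BeautifulSoup

    with urlopen(build_request(url)) as page:
        return BeautifulSoup(page.read(), "lxml")


# Percent-encode the Japanese query before it goes into the URL.
SEARCH_URL = "https://www.google.co.jp/search?q=" + quote("ねこ")
```

Calling `get_soup_urllib(SEARCH_URL)` returns the soup of the static HTML only; as with the Python 2 version, no JavaScript is executed.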


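#Open issue
It is annoying that a browser window opens on every run.
I searched for things like "headless selenium" and tried various approaches, but gave up.
Could someone tell me how to do this?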

#Update 7/19
For headless operation (no browser window is opened),
use driver = webdriver.PhantomJS().
PhantomJS can be installed with, for example, brew install phantomjs.
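
PhantomJS development has since stopped and Selenium 4 no longer provides `webdriver.PhantomJS`, so on a current setup the same effect is usually obtained by running Chrome itself in headless mode. The following is a minimal sketch under those assumptions, not the method used in the original post: it is written for Python 3 and a recent Selenium, `headless_chrome_arguments` and `get_soup_headless_chrome` are names invented here, and it assumes Selenium can locate a chromedriver (on PATH, or resolved by Selenium itself in recent releases).

```python
def headless_chrome_arguments(window_size=(1280, 1024)):
    """Command-line switches asking Chrome to run without a visible window."""
    width, height = window_size
    return ["--headless", "--window-size={},{}".format(width, height)]


def get_soup_headless_chrome(url):
    # Imported here so the helper above can be used without selenium installed.
    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for argument in headless_chrome_arguments():
        options.add_argument(argument)

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # page_source reflects the DOM after the browser has run the page's scripts.
        return BeautifulSoup(driver.page_source, "lxml")
    finally:
        # Always shut the browser down, even if loading the page fails.
        driver.quit()
```

`get_soup_headless_chrome("https://www.google.co.jp/search?q=ねこ")` would then be a drop-in replacement for `get_soup_uselenium`, without a browser window appearing.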
