Scraping with Golang + PhantomJS

Posted at 2018-09-10

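Overview

This article shows how to scrape pages with Golang and PhantomJS, including pages that are rendered with JavaScript.

I used agouti to launch PhantomJS and goquery to parse the DOM.

Related articles

Setting up a Go environment

Dockerfile

The environment is built with Docker.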

Dockerfile
FROM centos:7

ARG GO_VER=1.11

# Golang
WORKDIR /usr/local/src
RUN yum install -y git
RUN curl -O https://dl.google.com/go/go${GO_VER}.linux-amd64.tar.gz
RUN tar -C /usr/local -xzf go${GO_VER}.linux-amd64.tar.gz

ENV GOPATH=/code
ENV PATH=$PATH:$GOPATH/bin
ENV PATH=$PATH:/usr/local/go/bin

# PhantomJS
# -y is required because docker build cannot answer the confirmation prompt
RUN yum install -y epel-release
RUN rpm -ivh http://repo.okay.com.mx/centos/6/x86_64/release/okay-release-1-1.noarch.rpm
RUN yum install -y phantomjs.x86_64
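
Before starting the driver, it can help to confirm that the image actually contains a phantomjs executable. The following is a minimal sketch using only the standard library; the helper name findBinary is hypothetical and not part of agouti.

```go
package main

import (
	"fmt"
	"os/exec"
)

// findBinary looks up an executable on PATH and returns where it was found.
// The name is a hypothetical helper for this sketch.
func findBinary(name string) (string, error) {
	path, err := exec.LookPath(name)
	if err != nil {
		return "", fmt.Errorf("%s not found on PATH: %v", name, err)
	}
	return path, nil
}

func main() {
	path, err := findBinary("phantomjs")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("phantomjs found at", path)
}
```

Running this inside the container gives a clearer message than a driver start failure when the PhantomJS install step did not succeed.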

Go

This code logs in to Qiita.
agouti alone can retrieve elements, but goquery seems more convenient for that.

main.go
package main

import (
        "fmt"
        "log"
        "strings"
        "time"

        "github.com/PuerkitoBio/goquery"
        "github.com/sclevine/agouti"
)

func main() {
        // Set User-Agent
        capabilities := agouti.NewCapabilities()
        capabilities["phantomjs.page.settings.userAgent"] = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36"
        capabilitiesOption := agouti.Desired(capabilities)
        driver := agouti.PhantomJS(capabilitiesOption)
        if err := driver.Start(); err != nil {
                log.Fatalf("Failed to start driver:%v", err)
        }
        defer driver.Stop()

        page, err := driver.NewPage(agouti.Browser("phantomjs"))
        if err != nil {
                log.Fatalf("Failed to open page:%v", err)
        }

        loginURL := "https://qiita.com/login"
        if err := page.Navigate(loginURL); err != nil {
                log.Fatalf("Failed to navigate:%v", err)
        }

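        // form
        identity := page.FindByID("identity")
        password := page.FindByID("password")
        if err := identity.Fill("Your Id Here."); err != nil {
                log.Fatalf("Failed to fill identity:%v", err)
        }
        if err := password.Fill("Your Password Here."); err != nil {
                log.Fatalf("Failed to fill password:%v", err)
        }
        // submit
        if err := page.FindByClass("loginSessionsForm_submit").Submit(); err != nil {
                log.Fatalf("Failed to login:%v", err)
        }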

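        // Wait 3 seconds for the post-login page to load
        time.Sleep(3 * time.Second)

        // Get the <title>
        title, err := page.Title()
        if err != nil {
                log.Fatalf("Failed to get title:%v", err)
        }
        log.Print(title)
        // Get the URL of the current page
        url, err := page.URL()
        if err != nil {
                log.Fatalf("Failed to get URL:%v", err)
        }
        log.Print(url)

        // Get the HTML source
        getSource, err := page.HTML()
        if err != nil {
                log.Fatalf("Failed to get HTML:%v", err)
        }
        log.Print(getSource)

        // Save a screenshot (WebDriver screenshots are PNG data)
        if err := page.Screenshot("/tmp/phantomjs_qiita.png"); err != nil {
                log.Fatalf("Failed to save screenshot:%v", err)
        }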

        // Parse DOM
        readerCurContents := strings.NewReader(getSource)
        doc, err := goquery.NewDocumentFromReader(readerCurContents)
        if err != nil {
                log.Fatal(err)
        }

        // This time, get the <title> with goquery
        fmt.Println(doc.Find("title").Text())
        // Get the user name (Attr returns the value and whether the attribute exists)
        if name, ok := doc.Find("#globalHeader > div > div.st-Header_end > div:nth-child(4) > div.st-Header_loginUser > img").Attr("alt"); ok {
                fmt.Println(name)
        }
}

Running this should log in successfully and save a screenshot of the page shown after login.

$ go run main.go

Error: unsafe-eval

When I tried to scrape github.com and called page.URL(), I got the following error.

request unsuccessful: Refused to evaluate a string as JavaScript because 'unsafe-eval' is not an allowed source of script in the following Content Security Policy directive: "script-src assets-cdn.github.com".

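According to the PhantomJS issue quoted below, it seems to work with PhantomJS 1.x.
I have not looked into how to make it work with 2.x. If anyone has information, I would appreciate hearing it.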

Thanks for suggestion, It worked with PhantomJS 1.9.8
