
Replacing goquery with colly for web scraping in Go

Last updated at 2022-09-07, posted at 2022-09-06

Introduction

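I previously implemented a Lambda function that uses goquery to fetch Tokyo Dome event information once a day, and only later learned that a library called colly is reportedly quite good.
I swapped the library out in a burst of late-night energy, and this is a record of that.
There is little information about this in Japanese, so I hope it helps anyone who wants to do scraping in Go.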

Notes from the goquery implementation: https://qiita.com/a_uchida/items/1cdd7a1a6003d8a93651

colly vs goquery

Star count

colly's star count has been growing much faster. colly is built on top of goquery, so that may be only natural.
If you look at colly's go.mod, you can see that it includes github.com/PuerkitoBio/goquery.

Simple examples

Let's look at an example that accesses the Google top page and prints its title.

goquery

package main

import (
	"fmt"
	"net/http"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	url := "https://www.google.com"

	// Fetch the page with net/http.
	res, err := http.Get(url)
	if err != nil {
		panic(err)
	}
	defer res.Body.Close()

	// Parse the response body into a goquery document.
	doc, err := goquery.NewDocumentFromReader(res.Body)
	if err != nil {
		panic(err)
	}

	// Print the text of the <title> element.
	title := doc.Find("title").Text()
	fmt.Println(title)
}

goquery can also fetch the HTML from a URL directly with NewDocument(url string), but that function is deprecated, so it is now preferable to fetch the HTML with net/http and load it with NewDocumentFromReader(r io.Reader).

colly

package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

func main() {
	c := colly.NewCollector()

	url := "https://www.google.com"

	// Called when a request fails.
	c.OnError(func(_ *colly.Response, err error) {
		panic(err)
	})

	// Called for every element that matches the selector.
	c.OnHTML("title", func(e *colly.HTMLElement) {
		fmt.Println(e.Text)
	})

	c.Visit(url)
}

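With colly, error handling is registered directly on the collector as a callback, which makes the code noticeably easier to read.
In addition, colly supports parallel requests, an upper limit on the number of concurrent requests, and delays between requests.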

The actual implementation

Old implementation (goquery)

package scraper

import (
	"fmt"
	"net/http"
	"strconv"
	"time"

	"github.com/PuerkitoBio/goquery"
)

// FetchTodayEvent scrapes the Tokyo Dome schedule page and returns today's event.
func FetchTodayEvent() string {
	jst, _ := time.LoadLocation("Asia/Tokyo")

	url := "https://www.tokyo-dome.co.jp/dome/event/schedule.html"
	res, err := http.Get(url)
	if err != nil {
		fmt.Println("Failed to scrape")
		panic(err)
	}
	defer res.Body.Close()

	doc, err := goquery.NewDocumentFromReader(res.Body)
	if err != nil {
		fmt.Println("Failed to parse HTML")
		panic(err)
	}

	selector := "div.c-mod-tab__body:nth-child(2) > table > tbody"
	innerSelector := "tr.c-mod-calender__item"
	dateSelector := "th > span:nth-child(1)"
	categorySelector := "td:nth-child(2) > div > div:nth-child(1) > p > span"
	titleSelector := "td > div > div:nth-child(2) > p.c-mod-calender__links"
	timeSelector := "td > div > div:nth-child(2) > p:nth-child(2)"

	selection := doc.Find(selector)

	var event string
	// Walk every calendar row and keep the one whose day matches today (JST).
	selection.Find(innerSelector).Each(func(index int, s *goquery.Selection) {
		date, _ := strconv.Atoi(s.Find(dateSelector).Text())
		category := s.Find(categorySelector).Text()
		title := s.Find(titleSelector).Text()
		info := s.Find(timeSelector).Text()

		if date == time.Now().In(jst).Day() {
			if title == "" {
				event = "イベントなし" // "no event"
			} else {
				event = title + "（" + category + "）" + "\n" + info
			}
		}
	})
	return event
}

New implementation (colly)

package scraper

import (
	"fmt"
	"time"

	"github.com/gocolly/colly"
)

func FetchTodayEvent() string {
	jst, _ := time.LoadLocation("Asia/Tokyo")

	url := "https://www.tokyo-dome.co.jp/dome/event/schedule.html"

	c := colly.NewCollector()

	selector := "div.c-mod-tab__body:nth-child(2) > table > tbody"
	innerSelector := "tr.c-mod-calender__item"
	dateSelector := "th > span:nth-child(1)"
	categorySelector := "td:nth-child(2) > div > div:nth-child(1) > p > span"
	titleSelector := "td > div > div:nth-child(2) > p.c-mod-calender__links"
	timeSelector := "td > div > div:nth-child(2) > p:nth-child(2)"

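	var event string
	// Called once for the calendar table body matched by selector.
	c.OnHTML(selector, func(e *colly.HTMLElement) {
		// Walk every calendar row and keep the one whose day matches today (JST).
		e.ForEach(innerSelector, func(_ int, s *colly.HTMLElement) {
			date := s.ChildText(dateSelector)
			category := s.ChildText(categorySelector)
			title := s.ChildText(titleSelector)
			info := s.ChildText(timeSelector)
			today := time.Now().In(jst).Format("02")

			if date == today {
				if title == "" {
					event = "イベントなし" // "no event"
				} else {
					event = title + "（" + category + "）" + "\n" + info
				}
			}
		})
	})

	// Visit returns an error if the request fails; handle it as the old version did.
	if err := c.Visit(url); err != nil {
		fmt.Println("Failed to scrape")
		panic(err)
	}

	return event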
}

Impressions

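Partly because the new implementation imports fewer packages, it feels somewhat more intuitive to see what the code is doing.
colly offers many features that goquery does not, and its star count is growing faster, so colly looks like a good choice when you want to do scraping in Go.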

Repository URL
