HTML Parsing with Go: From Beginner to Advanced

Posted at 2025-03-03

Leapcell: The Next-Gen Serverless Platform for Web Hosting

Installing and Using goquery

Installation

Run the following:

go get github.com/PuerkitoBio/goquery

Import

import "github.com/PuerkitoBio/goquery"

Loading a Page

We'll use IMDb's Most Popular Movies page as an example:

package main

import (
    "fmt"
    "log"
    "net/http"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    res, err := http.Get("https://www.imdb.com/chart/moviemeter/")
    if err != nil {
        log.Fatal(err)
    }
    defer res.Body.Close()
    if res.StatusCode != 200 {
        log.Fatalf("status code error: %d %s", res.StatusCode, res.Status)
    }

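Getting the Document Object

The following lines continue inside main:

    doc, err := goquery.NewDocumentFromReader(res.Body)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(doc.Find("title").Text()) // use doc so the program compiles

    // Other ways to create a document:
    // doc, err := goquery.NewDocumentFromReader(r) // r is any io.Reader
    // doc, err := goquery.NewDocumentFromReader(strings.NewReader("<p>Example content</p>"))
    // doc, err := goquery.NewDocument(url) // deprecated; fetch with net/http as shown here instead
}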

Selecting Elements

Element Selectors

Selects by HTML element name. For example, dom.Find("p") matches every p element. Calls can be chained:

ele.Find("h2").Find("a")

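Attribute Selectors

Filters elements by attribute and value. Several matching operators are available:

Find("div[my]")        // div elements that have a my attribute
Find("div[my=zh]")     // div elements whose my attribute equals zh
Find("div[my!=zh]")    // div elements whose my attribute is not zh (including divs with no my attribute)
Find("div[my|=zh]")    // div elements whose my attribute equals zh or starts with zh-
Find("div[my*=zh]")    // div elements whose my attribute contains the substring zh
Find("div[my~=zh]")    // div elements whose my attribute contains the whitespace-separated word zh
Find("div[my$=zh]")    // div elements whose my attribute ends with zh
Find("div[my^=zh]")    // div elements whose my attribute starts with zh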

`parent > child` Selector

Selects the direct children of an element. For example, dom.Find("div>p") matches p elements that are direct children of a div.

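`element + next` Adjacent Selector

Useful when the target elements have no distinguishing pattern of their own but the element before them does. For example, dom.Find("p[my=a]+p") matches the p element immediately following a p whose my attribute is a.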

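`element~next` Sibling Selector

Selects the siblings that follow an element under the same parent, whether or not they are adjacent. For example, dom.Find("p[my=a]~p") matches every p sibling that comes after a p whose my attribute is a.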

ID Selector

Starts with # and matches an element by its id. For example, dom.Find("#title") matches the element with id=title; you can also specify the tag, as in dom.Find("p#title").

ele.Find("#title")

Class Selector

Starts with . and matches elements that have the given class name, for example dom.Find(".content1"); you can also specify the tag, as in dom.Find("div.content1").

ele.Find(".title")

Selector OR (,) Operation

Combines multiple selectors separated by commas; an element is selected if it matches any one of them. For example, Find("div,span").

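package main

import (
    "fmt"
    "log"
    "strings"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    html := `<body>
                <div lang="zh">DIV1</div>
                <span>
                    <div>DIV5</div>
                </span>
            </body>`
    dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
    if err != nil {
        log.Fatalln(err)
    }
    dom.Find("div,span").Each(func(i int, selection *goquery.Selection) {
        // Html returns the inner HTML and an error, so handle both.
        inner, err := selection.Html()
        if err != nil {
            log.Fatalln(err)
        }
        fmt.Println(inner)
    })
}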

Filters

`:contains` Filter

Selects elements that contain the given text. For example, dom.Find("p:contains(a)") matches p elements whose text contains a.

dom.Find("div:contains(DIV2)").Each(func(i int, selection *goquery.Selection) {
    fmt.Println(selection.Text())
})

`:has(selector)`

Selects elements that contain at least one descendant matching the given selector.

`:empty`

Selects elements that have no child elements and no text.

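`:first-child` and `:first-of-type` Filters

Find("p:first-child") matches p elements that are the first child of their parent; Find("p:first-of-type") matches p elements that are the first p among their siblings, even when other elements come before them.

`:last-child` and `:last-of-type` Filters

The counterparts of :first-child and :first-of-type, matching from the end of the parent's children.

`:nth-child(n)` and `:nth-of-type(n)` Filters

:nth-child(n) matches an element that is the n-th child of its parent; :nth-of-type(n) matches an element that is the n-th of its type among its siblings. Counting starts at 1.

`:nth-last-child(n)` and `:nth-last-of-type(n)` Filters

The same, but counted from the end, so the last element counts as the first.

`:only-child` and `:only-of-type` Filters

Find(":only-child") matches elements that are the only child of their parent; Find(":only-of-type") matches elements that are the only one of their type among their siblings.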

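Getting Content

ele.Html() // inner HTML of the first matched element; returns (string, error)
ele.Text() // combined text of all matched elements and their descendants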

Iteration

Use the Each method to iterate over the selected elements:

ele.Find(".item").Each(func(index int, elA *goquery.Selection) {
    href, _ := elA.Attr("href")
    fmt.Println(href)
})

Built-in Functions

Positional Functions

Eq(index int) *Selection
First() *Selection
Get(index int) *html.Node
Index...() int
Last() *Selection
Slice(start, end int) *Selection

Expanding Functions

Add...()
AndSelf() // deprecated alias for AddBack()
Union()

Filtering Functions

End()
Filter...()
Has...()
Intersection()
Not...()

Iteration Functions

Each(f func(int, *Selection)) *Selection
EachWithBreak(f func(int, *Selection) bool) *Selection
Map(f func(int, *Selection) string) (result []string)

Document Manipulation Functions

After...()
Append...()
Before...()
Clone()
Empty()
Prepend...()
Remove...()
ReplaceWith...()
Unwrap()
Wrap...()
WrapAll...()
WrapInner...()

Attribute and Property Functions

Attr*(), RemoveAttr(), SetAttr()
AttrOr(e string, d string)
AddClass(), HasClass(), RemoveClass(), ToggleClass()
Html()
Length()
Size()
Text()

Query Functions

Contains()
Is...()

Tree Traversal Functions

Children...()
Contents()
Find...()
Next...() *Selection
NextAll() *Selection
Parent[s]...()
Prev...() *Selection
Siblings...()

Type Definitions

Document
Selection
Matcher

Helper Functions

NodeName
OuterHtml

Examples

Getting Started Example

package main

import (
    "fmt"
    "log"
    "strings"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    html := `<html>
            <body>
                <h1 id="title">O Captain! My Captain!</h1>
                <p class="content1">
                O Captain! my Captain! our fearful trip is done,
                The ship has weather’d every rack, the prize we sought is won,
                The port is near, the bells I hear, the people all exulting,
                While follow eyes the steady keel, the vessel grim and daring;
                </p>
            </body>
            </html>`
    dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
    if err != nil {
        log.Fatalln(err)
    }
    // Print the text of every p element.
    dom.Find("p").Each(func(i int, selection *goquery.Selection) {
        fmt.Println(selection.Text())
    })
}

Example: Crawling IMDb's Popular Movies

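package main

import (
    "fmt"
    "log"
    "net/http"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    // goquery.NewDocument(url) is deprecated, so fetch the page with net/http.
    res, err := http.Get("https://www.imdb.com/chart/moviemeter/")
    if err != nil {
        log.Fatal(err)
    }
    defer res.Body.Close()
    if res.StatusCode != http.StatusOK {
        log.Fatalf("status code error: %s", res.Status)
    }

    doc, err := goquery.NewDocumentFromReader(res.Body)
    if err != nil {
        log.Fatal(err)
    }
    // This selector depends on IMDb's markup and may need updating.
    doc.Find(".titleColumn a").Each(func(i int, selection *goquery.Selection) {
        title := selection.Text()
        href, _ := selection.Attr("href")
        fmt.Printf("Movie Name: %s, Link: https://www.imdb.com%s\n", title, href)
    })
}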

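The example above extracts movie titles and links from IMDb's popular movies page. In practice, adjust the selector and processing logic to your needs; the class names depend on the site's current markup.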

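Finally, I'd like to recommend Leapcell, an excellent platform for deploying Go services.

1. Multi-Language Support

Develop with JavaScript, Python, Go, or Rust.

2. Deploy Unlimited Projects for Free

Pay only for what you use; no requests, no charges.

3. Outstanding Cost Efficiency

Pay as you go, with no charges while idle.
Example: $25 supports 6.94 million requests at an average response time of 60 ms.

4. Streamlined Developer Experience

An intuitive UI for easy setup.
Fully automated CI/CD pipelines and GitOps integration.
Real-time metrics and logging for actionable insights.

5. Easy Scalability and High Performance

Auto-scaling that handles high concurrency with ease.
Zero operational overhead, so you can focus on building.

Explore more in the documentation!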

Leapcell Twitter: https://x.com/LeapcellHQ
