
Scraping with Julia

Posted at 2021-06-26

You can scrape a web page roughly as follows.
The URL used here does not exist, so replace it with a real one as needed.

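using HTTP, Gumbo, Cascadia

# Placeholder URL; replace it with the page you want to scrape.
url = "https://hoge.example"

# Fetch the page and parse the response body into an HTML document.
res = HTTP.request("GET", url)
doc = parsehtml(String(res.body))

# Collect every node that matches the CSS selector ".Hoge".
hogelist = eachmatch(Selector(".Hoge"), doc.root)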

HTTP.jl fetches the web page, and parsehtml() from Gumbo.jl parses the HTML.
With Cascadia.jl you can retrieve the HTMLNode objects that match a given selector, and an HTMLNode can be converted to a string with text(). For example:

julia> doc = parsehtml("<p id=\"hoge\"> Hello, world! </p>")
HTML Document:
<!DOCTYPE >
HTMLElement{:HTML}:<HTML>
  <head></head>
  <body>
    <p id="hoge">
      Hello, world!
    </p>
  </body>
</HTML>

julia> eachmatch(Selector("#hoge"), doc.root)
1-element Vector{HTMLNode}:
 HTMLElement{:p}:<p id="hoge">
  Hello, world!
</p>



julia> eachmatch(Selector("#hoge"), doc.root)[1] |> text
"Hello, world!"

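The same pattern extends to several matches at once. The following is a minimal sketch, not part of the original post: the inline HTML, the class name "Hoge" on list items, and the variable names are all hypothetical stand-ins for a downloaded page. It assumes that getattr() from Gumbo.jl is available for reading an attribute, and it uses text() in the same way as the session above.

```julia
using Gumbo, Cascadia

# Hypothetical inline HTML standing in for a downloaded page (an assumption for illustration).
html = """
<ul>
  <li class="Hoge"><a href="/a">First</a></li>
  <li class="Hoge"><a href="/b">Second</a></li>
  <li class="Other"><a href="/c">Third</a></li>
</ul>
"""

doc = parsehtml(html)

# Select every <a> element nested inside an element with class "Hoge",
# then collect (link text, href) pairs.
links = [(String(strip(text(a))), getattr(a, "href"))
         for a in eachmatch(Selector(".Hoge a"), doc.root)]

for (label, href) in links
    println(label, " -> ", href)
end
```

Working on an inline string like this is a convenient way to check a selector before pointing the code at a real URL.
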