
Getting Started with Scraping Garupa (BanG Dream! Girls Band Party!) [Julia]

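Overview

I previously wrote about collecting Garupa images with Python, but...

I also wanted to try scraping in a different programming language ('ω')ノ
So I tested whether the same thing could be done in Julia.

Note: The code here is simply what I got working after following examples I found while researching, so I can't guarantee that it is correct or idiomatic Julia. I also haven't optimized it.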

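Libraries used

  • HTTP.jl
    • HTTP client and server functionality for Julia
    • The article I used as a reference, 『Juliaで楽しくWebスクレイピング!』 ("Fun Web Scraping with Julia!"), used Requests.jl, but the Requests.jl GitHub repository recommends HTTP.jl, so I used that instead
  • Gumbo.jl
    • An HTML parser for Julia
    • Used to parse the fetched HTML (parsehtml)
  • Cascadia.jl
    • A CSS Selector library in Julia
    • Used to extract elements from the parsed page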

Environment

  • Windows 10 Home 64-bit
  • Julia 1.2.0

Fetching the URLs

Sample code

using Cascadia
using Gumbo
using HTTP

# Fetch the page and parse the HTML
r = HTTP.request("GET", "https://bangdream.gamedbs.jp/chara/show/7/658")
h = parsehtml(String(r.body))

# Check the first <a> inside every element with class "tc"
mlinks = eachmatch(Selector(".tc"), h.root)
for mlink in mlinks
    bs = eachmatch(Selector("a"), mlink)
    isempty(bs) && continue  # skip elements that contain no link
    href = bs[1].attributes["href"]
    # The hrefs are protocol-relative ("//..."), so prepend the scheme when printing
    if occursin("//bangdream.gamedbs.jp/images/chara/card/", href)
        println("https:$href")
    end
end

Output

PS > julia test.jl
https://bangdream.gamedbs.jp/images/chara/card/1566270291274_shpq83o4.png
https://bangdream.gamedbs.jp/images/chara/card/1566270291274_oi1y6bhe.png
https://bangdream.gamedbs.jp/images/chara/card/1566270291171_yxn7hr8i.png
https://bangdream.gamedbs.jp/images/chara/card/1566270291171_osng4369.png
PS >
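
The hrefs in this sample are protocol-relative (they start with `//`), which is why the code prepends `https:` before printing. As a minimal sketch, that step could be pulled into a small helper that uses only Base Julia. The name `absolute_url` and its `scheme` keyword are my own illustration and are not part of HTTP.jl or Cascadia.jl.

```julia
# Turn a protocol-relative href ("//host/path") into an absolute URL.
# `scheme` is a hypothetical keyword that defaults to "https".
# Anything that does not start with "//" is returned unchanged.
function absolute_url(href::AbstractString; scheme::AbstractString = "https")
    return startswith(href, "//") ? string(scheme, ":", href) : String(href)
end

href = "//bangdream.gamedbs.jp/images/chara/card/1566270291274_shpq83o4.png"
println(absolute_url(href))
```

Keeping this in one place means the same value can be used both for printing and for any later request.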

Saving the image files

Sample code

using Cascadia
using Gumbo
using HTTP

r = HTTP.request("GET", "https://bangdream.gamedbs.jp/chara/show/7/658")
h = parsehtml(String(r.body))
mlinks = eachmatch(Selector(".tc"), h.root)
for (i, mlink) in enumerate(mlinks)
    bs = eachmatch(Selector("a"), mlink)
    isempty(bs) && continue  # skip elements that contain no link
    href = bs[1].attributes["href"]
    if occursin("//bangdream.gamedbs.jp/images/chara/card/", href)
        url = "https:$href"  # href is protocol-relative, so add the scheme
        println(url)
        img = HTTP.request("GET", url)
        # The card images are PNG files, so save them with a .png extension
        filename = "test-" * string(i) * ".png"
        open(filename, "w") do file
            write(file, img.body)  # write the raw response bytes
        end
    end
end

Output

PS > julia test.jl
https://bangdream.gamedbs.jp/images/chara/card/1566270291274_shpq83o4.png
https://bangdream.gamedbs.jp/images/chara/card/1566270291274_oi1y6bhe.png
https://bangdream.gamedbs.jp/images/chara/card/1566270291171_yxn7hr8i.png
https://bangdream.gamedbs.jp/images/chara/card/1566270291171_osng4369.png
PS >
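
The saved file name is built from the loop index, so its extension has to be chosen by hand. A minimal sketch of taking the extension from the URL instead, using only Base Julia, might look like the following. The helper name `image_filename` and its `prefix` keyword are assumptions for illustration, and the sketch assumes the URL has no query string.

```julia
# Build a local file name such as "test-3.png" from a loop index and an image URL,
# reusing the extension found in the URL's last path segment.
function image_filename(i::Integer, url::AbstractString; prefix::AbstractString = "test-")
    name = String(last(split(url, '/')))  # part after the final "/"
    _, ext = splitext(name)               # ext is "" if there is no extension
    return string(prefix, i, ext)
end

url = "https://bangdream.gamedbs.jp/images/chara/card/1566270291274_shpq83o4.png"
println(image_filename(1, url))
```

With this, a PNG is saved as `.png` and a JPEG as `.jpg` without changing the loop itself.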


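Crawling through multiple URLs

As an experiment, I stored each character's name and the corresponding URL in a JSON file, and had the script follow the URLs listed in that file.
Note: I am omitting the JSON file itself this time. For its structure, please refer to the arrays in my earlier article where I did this with Python.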
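
Since the JSON file is not shown, here is a rough sketch of the shape the crawl code expects, inferred only from how the code reads it: a top-level object whose keys are character names and whose values have a `"url"` field. The `Dict` below is a hypothetical stand-in for the parsed file, the URLs are placeholders rather than the real pages, and `character_urls` is a helper name I made up for illustration.

```julia
# Hypothetical stand-in for the parsed JSON file: character key => Dict with a "url" entry.
# The URLs are placeholders, not the real pages.
data = Dict(
    "ran_mitake" => Dict("url" => "https://example.com/chara/ran_mitake"),
    "other_character" => Dict("url" => "https://example.com/chara/other_character"),
)

# Collect (key, url) pairs the same way the crawl loop reads them, sorted by key.
function character_urls(data)
    return sort([(key, d["url"]) for (key, d) in data])
end

for (key, url) in character_urls(data)
    println("name:$key    $url")
end
```

The real file may carry more fields per character. Only `"url"` is read by the sample code.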

Sample code

using Cascadia
using Gumbo
using HTTP
using JSON

function main()
    # Each entry in the JSON file maps a character name to an object with a "url" field
    data = JSON.parsefile("../../../0.json/bang-dream_gbp.json")
    for (key, d) in data
        url = d["url"]
        r = HTTP.request("GET", url)
        h = parsehtml(String(r.body))
        # Follow each link to a detail page (/chara/show/...)
        mlinks = eachmatch(Selector(".hvr-grow"), h.root)
        for mlink in mlinks
            bs = eachmatch(Selector("a"), mlink)
            isempty(bs) && continue  # skip elements that contain no link
            lhref = bs[1].attributes["href"]
            if occursin("//bangdream.gamedbs.jp/chara/show/", lhref)
                # Add the scheme if the link is protocol-relative ("//...")
                page = startswith(lhref, "//") ? "https:$lhref" : lhref
                r = HTTP.request("GET", page)
                h = parsehtml(String(r.body))
                # Card images
                mimages = eachmatch(Selector(".tc"), h.root)
                for mimage in mimages
                    is = eachmatch(Selector("a"), mimage)
                    isempty(is) && continue
                    href = is[1].attributes["href"]
                    if occursin("//bangdream.gamedbs.jp/images/chara/card/", href)
                        println("name:$key    https:$href")
                    end
                end
                # Live SD images (the URL is stored in the data-original attribute)
                mimages = eachmatch(Selector(".lazy"), h.root)
                for mimage in mimages
                    is = eachmatch(Selector("img"), mimage)
                    isempty(is) && continue
                    href = is[1].attributes["data-original"]
                    if occursin("//bangdream.gamedbs.jp/images/chara/livesd/", href)
                        println("name:$key    https:$href")
                    end
                end
            end
        end

        # Live2D images
        # NOTE: if a detail page was fetched above, `h` now refers to the last one parsed.
        # NOTE: a space-separated selector is a descendant selector. If these are four
        # classes on one element, the selector would be ".swimg.sbtn.radius.animated".
        limages = eachmatch(Selector(".swimg sbtn radius animated"), h.root)
        for limage in limages
            bs = eachmatch(Selector("span"), limage)
            isempty(bs) && continue
            href = bs[1].attributes["data-img-url"]
            if occursin("//bangdream.gamedbs.jp/images/chara/live2d/", href)
                println("name:$key    https:$href")
            end
        end

    end
end

main()

Output (excerpt)

If you add image-saving code to the sample above, you can collect the images (*´ω`)

PS > julia test.jl
name:ran_mitake    https://bangdream.gamedbs.jp/images/chara/card/1508507023043_09irdao4.png
name:ran_mitake    https://bangdream.gamedbs.jp/images/chara/card/1508509373043_zecsa2j0.png
name:ran_mitake    https://bangdream.gamedbs.jp/images/chara/livesd/1530435015002_azdjbfyt.png
name:ran_mitake    https://bangdream.gamedbs.jp/images/chara/livesd/1530435015003_nbiqsel1.png
name:ran_mitake    https://bangdream.gamedbs.jp/images/chara/livesd/1530435015004_ebow2dzj.png
name:ran_mitake    https://bangdream.gamedbs.jp/images/chara/card/1508507023044_wb8jpko6.png
name:ran_mitake    https://bangdream.gamedbs.jp/images/chara/card/1508509373044_e5tpd6rh.png
name:ran_mitake    https://bangdream.gamedbs.jp/images/chara/livesd/1530435016002_a0k2c1xe.png
name:ran_mitake    https://bangdream.gamedbs.jp/images/chara/livesd/1530435016003_mrkgf3nd.png
name:ran_mitake    https://bangdream.gamedbs.jp/images/chara/livesd/1530435016004_jdyv6wme.png
name:ran_mitake    https://bangdream.gamedbs.jp/images/chara/card/1508507023045_j9wcinm4.png
name:ran_mitake    https://bangdream.gamedbs.jp/images/chara/card/1508509373045_w0pg1hn9.png
name:ran_mitake    https://bangdream.gamedbs.jp/images/chara/card/1508509626024_ovckequa.png
name:ran_mitake    https://bangdream.gamedbs.jp/images/chara/card/1508509695024_e7ds8cjk.png
name:ran_mitake    https://bangdream.gamedbs.jp/images/chara/livesd/1530435017002_2ldub631.png
name:ran_mitake    https://bangdream.gamedbs.jp/images/chara/livesd/1530435017003_p65hf981.png
name:ran_mitake    https://bangdream.gamedbs.jp/images/chara/livesd/1530435017004_u5cdwfyp.png
name:ran_mitake    https://bangdream.gamedbs.jp/images/chara/card/1508507023046_l7d4bakz.png
name:ran_mitake    https://bangdream.gamedbs.jp/images/chara/card/1508509373046_rlo8fkp7.png
name:ran_mitake    https://bangdream.gamedbs.jp/images/chara/card/1508509626025_ti1xz4hd.png
name:ran_mitake    https://bangdream.gamedbs.jp/images/chara/card/1508509695025_9ferl4om.png
name:ran_mitake    https://bangdream.gamedbs.jp/images/chara/livesd/1530435018002_jl57m2uw.png
name:ran_mitake    https://bangdream.gamedbs.jp/images/chara/livesd/1530435018003_ak3t21iy.png
name:ran_mitake    https://bangdream.gamedbs.jp/images/chara/livesd/1530435018004_z94uwlkq.png
name:ran_mitake    https://bangdream.gamedbs.jp/images/chara/card/1508507023047_16mpwbj5.png
name:ran_mitake    https://bangdream.gamedbs.jp/images/chara/card/1508509373047_su3ipznb.png
name:ran_mitake    https://bangdream.gamedbs.jp/images/chara/card/1508509626026_kg96b3ed.png
name:ran_mitake    https://bangdream.gamedbs.jp/images/chara/card/1508509695026_hubqsg3n.png
name:ran_mitake    https://bangdream.gamedbs.jp/images/chara/livesd/1530435019002_23xmoiqw.png
name:ran_mitake    https://bangdream.gamedbs.jp/images/chara/livesd/1530435019003_pinhd0ur.png
name:ran_mitake    https://bangdream.gamedbs.jp/images/chara/livesd/1530435019004_yvjcihbg.png
name:ran_mitake    https://bangdream.gamedbs.jp/images/chara/card/1508507023048_3wgvqbkz.png
name:ran_mitake    https://bangdream.gamedbs.jp/images/chara/card/1508509373048_a6i3ry0w.png
name:ran_mitake    https://bangdream.gamedbs.jp/images/chara/card/1508507023049_tu2n0w8f.png
name:ran_mitake    https://bangdream.gamedbs.jp/images/chara/card/1508509373049_a0qdfn39.png
name:ran_mitake    https://bangdream.gamedbs.jp/images/chara/card/1508509626027_80o9szpw.png
name:ran_mitake    https://bangdream.gamedbs.jp/images/chara/card/1508509695027_0oxu4mf7.png
name:ran_mitake    https://bangdream.gamedbs.jp/images/chara/livesd/1530435020002_pnkqutir.png
name:ran_mitake    https://bangdream.gamedbs.jp/images/chara/livesd/1530435020003_im75319u.png
name:ran_mitake    https://bangdream.gamedbs.jp/images/chara/livesd/1530435020004_ulv3kt8e.png
name:ran_mitake    https://bangdream.gamedbs.jp/images/chara/card/1508507023050_d3ytxmcv.png
name:ran_mitake    https://bangdream.gamedbs.jp/images/chara/card/1508509373050_n8yoswbx.png
name:ran_mitake    https://bangdream.gamedbs.jp/images/chara/card/1511233845053_r8ompgcf.png
name:ran_mitake    https://bangdream.gamedbs.jp/images/chara/card/1511233846053_eo62m74a.png
name:ran_mitake    https://bangdream.gamedbs.jp/images/chara/livesd/1533007001002_odh98gks.png
name:ran_mitake    https://bangdream.gamedbs.jp/images/chara/livesd/1533007001003_8sp7yzib.png
name:ran_mitake    https://bangdream.gamedbs.jp/images/chara/livesd/1533007001004_jn74t2ox.png
name:ran_mitake    https://bangdream.gamedbs.jp/images/chara/card/1512707491059_q4bfr1ec.png
name:ran_mitake    https://bangdream.gamedbs.jp/images/chara/card/1512707491059_w2izyqpl.png
name:ran_mitake    https://bangdream.gamedbs.jp/images/chara/card/1512707492034_zmc9pn02.png
・
・
・
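
As one possible way to add the saving step mentioned above, the sketch below wraps it in a function that names each file after the last segment of its URL. The names `save_image` and `fake_fetch` are hypothetical. The download itself is passed in as a `fetch` function so that the sketch runs without any package, and the demonstration uses a fake fetcher and a placeholder URL rather than a real request.

```julia
# Save the bytes returned by `fetch(url)` into `dir`, named after the last URL segment.
# `fetch` is any function that maps a URL to a byte vector.
function save_image(fetch::Function, url::AbstractString, dir::AbstractString)
    mkpath(dir)  # create the directory if it does not exist
    path = joinpath(dir, String(last(split(url, '/'))))
    write(path, fetch(url))  # write the raw bytes to disk
    return path
end

# Offline demonstration: a fake fetcher that always returns three bytes.
fake_fetch(url) = UInt8[0x01, 0x02, 0x03]
saved = save_image(fake_fetch, "https://example.com/images/sample.png", mktempdir())
println(saved)
```

With HTTP.jl loaded, the fetcher could be something like `u -> HTTP.request("GET", u).body`, called at each place where the crawl currently prints a URL.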

Closing thoughts

For now, I'm satisfied because I was able to do what I set out to do. I'd like to keep polishing it little by little (*´ω`)
