Today's post
- This is day 3 of the Advent Calendar.
- I am writing it in the spirit of "for anything GPU-related, Flux.jl is probably all you need."
Feedback I received
- When I wrote the day-1 Advent Calendar post, "Setting up the CUDA-enabled Julia deep learning framework Flux.jl with Docker," someone on Twitter let me know that NGC (NVIDIA GPU Cloud) also provides a container with a Julia environment that can use NVIDIA GPUs.
Reading the NGC page for that container, the usage instructions are written really carefully. Honestly, just go read it... I was moved. (´・ω・`)b
Usage
Docker pull
$ sudo docker pull nvcr.io/hpc/julia:v1.2.0
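Just standard Docker, nothing specific to the NGC image: once the pull finishes, you can confirm the image landed locally.
$ sudo docker images nvcr.io/hpc/julia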
Starting the container
The container apparently ships a /workspace/examples/ directory containing test scripts for the GPU-related Julia packages, so let's run one of them.
$ sudo docker run --rm -it --gpus all nvcr.io/hpc/julia:v1.2.0 /workspace/examples/test_cudanative.jl
Testing CUDAnative
Resolving package versions...
Status `/tmp/jl_U6JyjR/Manifest.toml`
[79e6a3ab] Adapt v1.0.0
[fa961155] CEnum v0.2.0
[3895d2a7] CUDAapi v1.2.0
[c5f51814] CUDAdrv v4.0.2
[be33ccc6] CUDAnative v2.5.1
[a8cc5b0e] Crayons v4.0.0
[864edb3b] DataStructures v0.17.5
[929cbde3] LLVM v1.3.2
[bac558e1] OrderedCollections v1.1.0
[a759f4b9] TimerOutputs v0.5.0
[2a0f44e3] Base64 [`@stdlib/Base64`]
[8ba89e20] Distributed [`@stdlib/Distributed`]
[b77e0a4c] InteractiveUtils [`@stdlib/InteractiveUtils`]
[8f399da3] Libdl [`@stdlib/Libdl`]
[37e2e46d] LinearAlgebra [`@stdlib/LinearAlgebra`]
[56ddb016] Logging [`@stdlib/Logging`]
[d6f4376e] Markdown [`@stdlib/Markdown`]
[de0858da] Printf [`@stdlib/Printf`]
[9a3f8284] Random [`@stdlib/Random`]
[9e88b42a] Serialization [`@stdlib/Serialization`]
[6462fe0b] Sockets [`@stdlib/Sockets`]
[8dfed614] Test [`@stdlib/Test`]
[4ec0a83e] Unicode [`@stdlib/Unicode`]
[ Info: Testing using device GeForce GTX 1080
[ Info: Building the CUDAnative run-time library for your sm_61 device, this might take a while...
─────────────────────────────────────────────────────────────────────────────────────
                                             Time                     Allocations
                                   ────────────────────────   ─────────────────────────
                 Tot / % measured:       349s / 8.22%             10.4GiB / 15.7%
Section                     ncalls     time   %tot      avg     alloc   %tot      avg
─────────────────────────────────────────────────────────────────────────────────────
LLVM middle-end                399    11.3s  39.5%   28.4ms    670MiB  40.2%  1.68MiB
  IR generation                399    6.97s  24.3%   17.5ms    613MiB  36.8%  1.54MiB
    emission                   399    4.41s  15.4%   11.0ms    451MiB  27.1%  1.13MiB
    rewrite                    398    2.44s  8.50%   6.13ms    158MiB  9.50%   407KiB
      hide unreachable       1.58k    584ms  2.04%    370μs   20.8MiB  1.25%  13.5KiB
        find                 1.58k    366ms  1.28%    232μs    507KiB  0.03%     329B
        predecessors         1.58k    159ms  0.55%    101μs   13.4MiB  0.80%  8.70KiB
        replace              1.58k   51.1ms  0.18%   32.4μs   2.88MiB  0.17%  1.87KiB
      lower throw              398    531ms  1.85%   1.33ms   43.9MiB  2.64%   113KiB
      hide trap                398   49.5ms  0.17%    124μs   3.34MiB  0.20%  8.59KiB
      clean-up                 398   35.4ms  0.12%   88.9μs   2.88MiB  0.17%  7.40KiB
    linking                    398   35.0ms  0.12%   87.9μs    526KiB  0.03%  1.32KiB
  optimization                 393    2.10s  7.33%   5.35ms   50.7MiB  3.04%   132KiB
  device library                32    1.98s  6.91%   62.0ms   45.9KiB  0.00%  1.43KiB
  runtime library               66   79.7ms  0.28%   1.21ms   60.3KiB  0.00%     935B
CUDA object generation         275    9.68s  33.7%   35.2ms    289MiB  17.3%  1.05MiB
  compilation                  275    9.27s  32.3%   33.7ms    279MiB  16.8%  1.02MiB
  device runtime library         9    401ms  1.40%   44.6ms   9.01MiB  0.54%  1.00MiB
validation                     698    6.85s  23.9%   9.82ms    702MiB  42.1%  1.01MiB
LLVM back-end                  313    825ms  2.87%   2.64ms   4.86MiB  0.29%  15.9KiB
  machine-code generation      313    729ms  2.54%   2.33ms   1.02MiB  0.06%  3.33KiB
  preparation                  313   95.1ms  0.33%    304μs   3.83MiB  0.23%  12.5KiB
Julia front-end                400   4.15ms  0.01%   10.4μs   89.1KiB  0.01%     228B
strip debug info                70    363μs  0.00%   5.19μs     0.00B  0.00%    0.00B
─────────────────────────────────────────────────────────────────────────────────────
Test Summary: | Pass  Total
CUDAnative    |  482    482
Testing CUDAnative tests passed
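If you are curious what other example scripts ship with the image, one option is to override the entrypoint and list the directory. This is plain Docker usage rather than anything documented on the NGC page, so what you get back may differ from what I describe here.
$ sudo docker run --rm --entrypoint ls nvcr.io/hpc/julia:v1.2.0 /workspace/examples/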
Running my own code
$ cat sample.jl
using CuArrays

@info "CPU"
for i in 1:10
    A = rand(Float32,10000,10000)
    x = rand(Float32,10000)
    @show @elapsed z = A * x
end

@info "GPU"
for i in 1:10
    B = cu(rand(10000,10000))
    y = cu(rand(10000))
    @show @elapsed w = B * y
end
$ sudo docker run --rm -it --gpus all -v $PWD:/work -w /work nvcr.io/hpc/julia:v1.2.0 /work/sample.jl
┌ Warning: Could not find libcutensor, CuArrays.CUTENSOR will be unavailable.
└ @ CuArrays /usr/local/share/julia/packages/CuArrays/ir1DU/src/CuArrays.jl:104
[ Info: CPU
#= /work/sample.jl:7 =# @elapsed(z = A * x) = 0.009784724
#= /work/sample.jl:7 =# @elapsed(z = A * x) = 0.009777004
#= /work/sample.jl:7 =# @elapsed(z = A * x) = 0.009265234
#= /work/sample.jl:7 =# @elapsed(z = A * x) = 0.009485499
#= /work/sample.jl:7 =# @elapsed(z = A * x) = 0.009337323
#= /work/sample.jl:7 =# @elapsed(z = A * x) = 0.009510084
#= /work/sample.jl:7 =# @elapsed(z = A * x) = 0.009216996
#= /work/sample.jl:7 =# @elapsed(z = A * x) = 0.00951427
#= /work/sample.jl:7 =# @elapsed(z = A * x) = 0.009206806
#= /work/sample.jl:7 =# @elapsed(z = A * x) = 0.009453773
[ Info: GPU
#= /work/sample.jl:14 =# @elapsed(w = B * y) = 0.332208323
#= /work/sample.jl:14 =# @elapsed(w = B * y) = 8.1016e-5
#= /work/sample.jl:14 =# @elapsed(w = B * y) = 3.8336e-5
#= /work/sample.jl:14 =# @elapsed(w = B * y) = 3.5216e-5
#= /work/sample.jl:14 =# @elapsed(w = B * y) = 3.4769e-5
#= /work/sample.jl:14 =# @elapsed(w = B * y) = 3.6667e-5
#= /work/sample.jl:14 =# @elapsed(w = B * y) = 3.4929e-5
#= /work/sample.jl:14 =# @elapsed(w = B * y) = 3.62e-5
#= /work/sample.jl:14 =# @elapsed(w = B * y) = 3.587e-5
#= /work/sample.jl:14 =# @elapsed(w = B * y) = 3.7449e-5
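Two caveats about reading these numbers. The first GPU iteration (about 0.33 s) is dominated by JIT compilation of the GPU code, and the later microsecond-level timings are optimistic because B * y only launches the work on the GPU asynchronously, so @elapsed can return before the multiplication has actually finished. Below is a rough sketch of a fairer measurement. It assumes CUDAdrv is importable inside this container alongside CuArrays (I have not verified the image's full package list) and uses CUDAdrv.synchronize() to wait for the device before stopping the clock.
using CuArrays, CUDAdrv

# cu() converts the arrays to Float32 on the device, matching the Float32 CPU case.
B = cu(rand(10000,10000))
y = cu(rand(10000))

# Warm-up: the first multiplication triggers GPU code compilation, so don't time it.
B * y
CUDAdrv.synchronize()

for i in 1:10
    t = @elapsed begin
        w = B * y
        CUDAdrv.synchronize()  # block until the GPU has actually finished
    end
    @show t
end
The warm-up call pays the same one-time compilation cost that shows up as the 0.33 s first iteration in the output above, so only the steady-state multiply-plus-wait time ends up in t.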