Playing with the CUDA-enabled Julia container provided by NVIDIA GPU Cloud

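  • This is day 3 of the Advent calendar.
  • I figured anything GPU-related is fair game under the Flux.jl banner, so that is the spirit in which I'm writing this.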




Docker pull

$ sudo docker pull nvcr.io/hpc/julia:v1.2.0


The container apparently ships a /workspace/examples/ directory containing test code for Julia's GPU-related packages, so let's run one of them.

$ sudo docker run --rm -it --gpus all nvcr.io/hpc/julia:v1.2.0 /workspace/examples/test_cudanative.jl
   Testing CUDAnative
 Resolving package versions...
    Status `/tmp/jl_U6JyjR/Manifest.toml`
  [79e6a3ab] Adapt v1.0.0
  [fa961155] CEnum v0.2.0
  [3895d2a7] CUDAapi v1.2.0
  [c5f51814] CUDAdrv v4.0.2
  [be33ccc6] CUDAnative v2.5.1
  [a8cc5b0e] Crayons v4.0.0
  [864edb3b] DataStructures v0.17.5
  [929cbde3] LLVM v1.3.2
  [bac558e1] OrderedCollections v1.1.0
  [a759f4b9] TimerOutputs v0.5.0
  [2a0f44e3] Base64  [`@stdlib/Base64`]
  [8ba89e20] Distributed  [`@stdlib/Distributed`]
  [b77e0a4c] InteractiveUtils  [`@stdlib/InteractiveUtils`]
  [8f399da3] Libdl  [`@stdlib/Libdl`]
  [37e2e46d] LinearAlgebra  [`@stdlib/LinearAlgebra`]
  [56ddb016] Logging  [`@stdlib/Logging`]
  [d6f4376e] Markdown  [`@stdlib/Markdown`]
  [de0858da] Printf  [`@stdlib/Printf`]
  [9a3f8284] Random  [`@stdlib/Random`]
  [9e88b42a] Serialization  [`@stdlib/Serialization`]
  [6462fe0b] Sockets  [`@stdlib/Sockets`]
  [8dfed614] Test  [`@stdlib/Test`]
  [4ec0a83e] Unicode  [`@stdlib/Unicode`]
[ Info: Testing using device GeForce GTX 1080
[ Info: Building the CUDAnative run-time library for your sm_61 device, this might take a while...
                                             Time                   Allocations
                                     ──────────────────────   ───────────────────────
          Tot / % measured:                349s / 8.22%           10.4GiB / 15.7%

 Section                     ncalls     time   %tot     avg     alloc   %tot      avg
 LLVM middle-end                399    11.3s  39.5%  28.4ms    670MiB  40.2%  1.68MiB
   IR generation                399    6.97s  24.3%  17.5ms    613MiB  36.8%  1.54MiB
     emission                   399    4.41s  15.4%  11.0ms    451MiB  27.1%  1.13MiB
     rewrite                    398    2.44s  8.50%  6.13ms    158MiB  9.50%   407KiB
       hide unreachable       1.58k    584ms  2.04%   370μs   20.8MiB  1.25%  13.5KiB
         find                 1.58k    366ms  1.28%   232μs    507KiB  0.03%     329B
         predecessors         1.58k    159ms  0.55%   101μs   13.4MiB  0.80%  8.70KiB
         replace              1.58k   51.1ms  0.18%  32.4μs   2.88MiB  0.17%  1.87KiB
       lower throw              398    531ms  1.85%  1.33ms   43.9MiB  2.64%   113KiB
       hide trap                398   49.5ms  0.17%   124μs   3.34MiB  0.20%  8.59KiB
     clean-up                   398   35.4ms  0.12%  88.9μs   2.88MiB  0.17%  7.40KiB
     linking                    398   35.0ms  0.12%  87.9μs    526KiB  0.03%  1.32KiB
   optimization                 393    2.10s  7.33%  5.35ms   50.7MiB  3.04%   132KiB
   device library                32    1.98s  6.91%  62.0ms   45.9KiB  0.00%  1.43KiB
   runtime library               66   79.7ms  0.28%  1.21ms   60.3KiB  0.00%     935B
 CUDA object generation         275    9.68s  33.7%  35.2ms    289MiB  17.3%  1.05MiB
   compilation                  275    9.27s  32.3%  33.7ms    279MiB  16.8%  1.02MiB
   device runtime library         9    401ms  1.40%  44.6ms   9.01MiB  0.54%  1.00MiB
 validation                     698    6.85s  23.9%  9.82ms    702MiB  42.1%  1.01MiB
 LLVM back-end                  313    825ms  2.87%  2.64ms   4.86MiB  0.29%  15.9KiB
   machine-code generation      313    729ms  2.54%  2.33ms   1.02MiB  0.06%  3.33KiB
   preparation                  313   95.1ms  0.33%   304μs   3.83MiB  0.23%  12.5KiB
 Julia front-end                400   4.15ms  0.01%  10.4μs   89.1KiB  0.01%     228B
 strip debug info                70    363μs  0.00%  5.19μs     0.00B  0.00%    0.00B
Test Summary: | Pass  Total
CUDAnative    |  482    482
   Testing CUDAnative tests passed


$ cat sample.jl
using CuArrays

@info "CPU"
for i in 1:10
    A = rand(Float32,10000,10000)
    x = rand(Float32,10000)
    @show @elapsed z = A * x
end

@info "GPU"
for i in 1:10
    B = cu(rand(10000,10000))
    y = cu(rand(10000))
    @show @elapsed w = B * y
end
$ sudo docker run --rm -it --gpus all -v $PWD:/work -w /work nvcr.io/hpc/julia:v1.2.0 /work/sample.jl
┌ Warning: Could not find libcutensor, CuArrays.CUTENSOR will be unavailable.
└ @ CuArrays /usr/local/share/julia/packages/CuArrays/ir1DU/src/CuArrays.jl:104
[ Info: CPU
#= /work/sample.jl:7 =# @elapsed(z = A * x) = 0.009784724
#= /work/sample.jl:7 =# @elapsed(z = A * x) = 0.009777004
#= /work/sample.jl:7 =# @elapsed(z = A * x) = 0.009265234
#= /work/sample.jl:7 =# @elapsed(z = A * x) = 0.009485499
#= /work/sample.jl:7 =# @elapsed(z = A * x) = 0.009337323
#= /work/sample.jl:7 =# @elapsed(z = A * x) = 0.009510084
#= /work/sample.jl:7 =# @elapsed(z = A * x) = 0.009216996
#= /work/sample.jl:7 =# @elapsed(z = A * x) = 0.00951427
#= /work/sample.jl:7 =# @elapsed(z = A * x) = 0.009206806
#= /work/sample.jl:7 =# @elapsed(z = A * x) = 0.009453773
[ Info: GPU
#= /work/sample.jl:14 =# @elapsed(w = B * y) = 0.332208323
#= /work/sample.jl:14 =# @elapsed(w = B * y) = 8.1016e-5
#= /work/sample.jl:14 =# @elapsed(w = B * y) = 3.8336e-5
#= /work/sample.jl:14 =# @elapsed(w = B * y) = 3.5216e-5
#= /work/sample.jl:14 =# @elapsed(w = B * y) = 3.4769e-5
#= /work/sample.jl:14 =# @elapsed(w = B * y) = 3.6667e-5
#= /work/sample.jl:14 =# @elapsed(w = B * y) = 3.4929e-5
#= /work/sample.jl:14 =# @elapsed(w = B * y) = 3.62e-5
#= /work/sample.jl:14 =# @elapsed(w = B * y) = 3.587e-5
#= /work/sample.jl:14 =# @elapsed(w = B * y) = 3.7449e-5
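
One caveat when reading the GPU numbers above: the first iteration (0.33 s) most likely includes one-time initialization and compilation, and because GPU work is generally queued asynchronously, `@elapsed` on its own may capture only the launch cost rather than the full matrix-vector product. The following is a minimal sketch, not part of the original sample.jl, of a timing helper that makes one untimed warm-up call and synchronizes inside the timed region. `timed_runs` and its `sync` keyword are hypothetical names introduced for illustration. The demo runs on the CPU so it works anywhere; the commented-out lines assume that `CUDAdrv.synchronize()` is available in the CUDAdrv version shipped in the container.

```julia
# Hypothetical helper: time `f` `n` times after one untimed warm-up call.
# `sync` is called inside the timed region so that any asynchronously
# queued work is counted; by default it does nothing (CPU case).
function timed_runs(f; sync = () -> nothing, n = 10)
    f()        # warm-up: absorbs compilation and one-time initialization
    sync()
    times = Float64[]
    for _ in 1:n
        t = @elapsed begin
            f()
            sync()   # wait for completion before stopping the clock
        end
        push!(times, t)
    end
    return times
end

# CPU demo with a smaller problem size than sample.jl
A = rand(Float32, 1000, 1000)
x = rand(Float32, 1000)
cpu_times = timed_runs(() -> A * x; n = 5)
@show minimum(cpu_times)

# GPU usage (assumption: run inside the container with CuArrays and CUDAdrv):
# using CuArrays, CUDAdrv
# B = cu(rand(Float32, 10000, 10000))
# y = cu(rand(Float32, 10000))
# gpu_times = timed_runs(() -> B * y; sync = CUDAdrv.synchronize)
```

Allocating `B` and `y` once outside the loop also keeps host-to-device transfers out of the comparison, which seems the fairer setup if the goal is to compare only the cost of `A * x`.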

