
NVIDIA GPU CLOUD が提供している CUDA が使える Julia コンテナで遊ぶ

Posted at 2019-12-03

Today

  • This is day 3 of the Advent Calendar.
  • I'm writing this with the feeling that, for anything GPU-related, Flux.jl is probably all you need.

I got some feedback

If you read the page above, you'll see that the usage is explained really thoroughly. Just look at that... Goma-chan was moved. (´・ω・`)b

Usage

Docker pull

$ sudo docker pull nvcr.io/hpc/julia:v1.2.0

Starting the container

Apparently the container has a /workspace/examples/ directory containing test scripts for GPU-related Julia packages, so let's try running one.
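Before picking a script, you can peek at what the image actually ships. This is a minimal sketch; since the image appears to pass its arguments to julia (as the run below suggests), overriding the entrypoint is an assumption on my part, and the exact contents of /workspace/examples/ may differ between image versions:

```shell
# Peek at the bundled example scripts without starting a GPU session.
# --entrypoint overrides the image's default launcher so a plain shell
# command can run (assumption: /bin/ls exists in the image).
$ sudo docker run --rm --entrypoint /bin/ls nvcr.io/hpc/julia:v1.2.0 -l /workspace/examples/
```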

$ sudo docker run --rm -it --gpus all nvcr.io/hpc/julia:v1.2.0 /workspace/examples/test_cudanative.jl
   Testing CUDAnative
 Resolving package versions...
    Status `/tmp/jl_U6JyjR/Manifest.toml`
  [79e6a3ab] Adapt v1.0.0
  [fa961155] CEnum v0.2.0
  [3895d2a7] CUDAapi v1.2.0
  [c5f51814] CUDAdrv v4.0.2
  [be33ccc6] CUDAnative v2.5.1
  [a8cc5b0e] Crayons v4.0.0
  [864edb3b] DataStructures v0.17.5
  [929cbde3] LLVM v1.3.2
  [bac558e1] OrderedCollections v1.1.0
  [a759f4b9] TimerOutputs v0.5.0
  [2a0f44e3] Base64  [`@stdlib/Base64`]
  [8ba89e20] Distributed  [`@stdlib/Distributed`]
  [b77e0a4c] InteractiveUtils  [`@stdlib/InteractiveUtils`]
  [8f399da3] Libdl  [`@stdlib/Libdl`]
  [37e2e46d] LinearAlgebra  [`@stdlib/LinearAlgebra`]
  [56ddb016] Logging  [`@stdlib/Logging`]
  [d6f4376e] Markdown  [`@stdlib/Markdown`]
  [de0858da] Printf  [`@stdlib/Printf`]
  [9a3f8284] Random  [`@stdlib/Random`]
  [9e88b42a] Serialization  [`@stdlib/Serialization`]
  [6462fe0b] Sockets  [`@stdlib/Sockets`]
  [8dfed614] Test  [`@stdlib/Test`]
  [4ec0a83e] Unicode  [`@stdlib/Unicode`]
[ Info: Testing using device GeForce GTX 1080
[ Info: Building the CUDAnative run-time library for your sm_61 device, this might take a while...
 ────────────────────────────────────────────────────────────────────────────────────
                                             Time                   Allocations
                                     ──────────────────────   ───────────────────────
          Tot / % measured:                349s / 8.22%           10.4GiB / 15.7%

 Section                     ncalls     time   %tot     avg     alloc   %tot      avg
 ────────────────────────────────────────────────────────────────────────────────────
 LLVM middle-end                399    11.3s  39.5%  28.4ms    670MiB  40.2%  1.68MiB
   IR generation                399    6.97s  24.3%  17.5ms    613MiB  36.8%  1.54MiB
     emission                   399    4.41s  15.4%  11.0ms    451MiB  27.1%  1.13MiB
     rewrite                    398    2.44s  8.50%  6.13ms    158MiB  9.50%   407KiB
       hide unreachable       1.58k    584ms  2.04%   370μs   20.8MiB  1.25%  13.5KiB
         find                 1.58k    366ms  1.28%   232μs    507KiB  0.03%     329B
         predecessors         1.58k    159ms  0.55%   101μs   13.4MiB  0.80%  8.70KiB
         replace              1.58k   51.1ms  0.18%  32.4μs   2.88MiB  0.17%  1.87KiB
       lower throw              398    531ms  1.85%  1.33ms   43.9MiB  2.64%   113KiB
       hide trap                398   49.5ms  0.17%   124μs   3.34MiB  0.20%  8.59KiB
     clean-up                   398   35.4ms  0.12%  88.9μs   2.88MiB  0.17%  7.40KiB
     linking                    398   35.0ms  0.12%  87.9μs    526KiB  0.03%  1.32KiB
   optimization                 393    2.10s  7.33%  5.35ms   50.7MiB  3.04%   132KiB
   device library                32    1.98s  6.91%  62.0ms   45.9KiB  0.00%  1.43KiB
   runtime library               66   79.7ms  0.28%  1.21ms   60.3KiB  0.00%     935B
 CUDA object generation         275    9.68s  33.7%  35.2ms    289MiB  17.3%  1.05MiB
   compilation                  275    9.27s  32.3%  33.7ms    279MiB  16.8%  1.02MiB
   device runtime library         9    401ms  1.40%  44.6ms   9.01MiB  0.54%  1.00MiB
 validation                     698    6.85s  23.9%  9.82ms    702MiB  42.1%  1.01MiB
 LLVM back-end                  313    825ms  2.87%  2.64ms   4.86MiB  0.29%  15.9KiB
   machine-code generation      313    729ms  2.54%  2.33ms   1.02MiB  0.06%  3.33KiB
   preparation                  313   95.1ms  0.33%   304μs   3.83MiB  0.23%  12.5KiB
 Julia front-end                400   4.15ms  0.01%  10.4μs   89.1KiB  0.01%     228B
 strip debug info                70    363μs  0.00%  5.19μs     0.00B  0.00%    0.00B
 ────────────────────────────────────────────────────────────────────────────────────
Test Summary: | Pass  Total
CUDAnative    |  482    482
   Testing CUDAnative tests passed

Running your own code

$ cat sample.jl
using CuArrays

@info "CPU"
for i in 1:10
    A = rand(Float32,10000,10000)
    x = rand(Float32,10000)
    @show @elapsed z = A * x
end

@info "GPU"
for i in 1:10
    B = cu(rand(10000,10000))
    y = cu(rand(10000))
    @show @elapsed w = B * y
end
$ sudo docker run --rm -it --gpus all -v $PWD:/work -w /work nvcr.io/hpc/julia:v1.2.0 /work/sample.jl
┌ Warning: Could not find libcutensor, CuArrays.CUTENSOR will be unavailable.
└ @ CuArrays /usr/local/share/julia/packages/CuArrays/ir1DU/src/CuArrays.jl:104
[ Info: CPU
#= /work/sample.jl:7 =# @elapsed(z = A * x) = 0.009784724
#= /work/sample.jl:7 =# @elapsed(z = A * x) = 0.009777004
#= /work/sample.jl:7 =# @elapsed(z = A * x) = 0.009265234
#= /work/sample.jl:7 =# @elapsed(z = A * x) = 0.009485499
#= /work/sample.jl:7 =# @elapsed(z = A * x) = 0.009337323
#= /work/sample.jl:7 =# @elapsed(z = A * x) = 0.009510084
#= /work/sample.jl:7 =# @elapsed(z = A * x) = 0.009216996
#= /work/sample.jl:7 =# @elapsed(z = A * x) = 0.00951427
#= /work/sample.jl:7 =# @elapsed(z = A * x) = 0.009206806
#= /work/sample.jl:7 =# @elapsed(z = A * x) = 0.009453773
[ Info: GPU
#= /work/sample.jl:14 =# @elapsed(w = B * y) = 0.332208323
#= /work/sample.jl:14 =# @elapsed(w = B * y) = 8.1016e-5
#= /work/sample.jl:14 =# @elapsed(w = B * y) = 3.8336e-5
#= /work/sample.jl:14 =# @elapsed(w = B * y) = 3.5216e-5
#= /work/sample.jl:14 =# @elapsed(w = B * y) = 3.4769e-5
#= /work/sample.jl:14 =# @elapsed(w = B * y) = 3.6667e-5
#= /work/sample.jl:14 =# @elapsed(w = B * y) = 3.4929e-5
#= /work/sample.jl:14 =# @elapsed(w = B * y) = 3.62e-5
#= /work/sample.jl:14 =# @elapsed(w = B * y) = 3.587e-5
#= /work/sample.jl:14 =# @elapsed(w = B * y) = 3.7449e-5
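Two caveats about these GPU numbers: the first iteration (~0.33 s) includes one-time kernel compilation, and the later ~35 µs figures are suspiciously fast for a 10000×10000 multiply, since GPU kernels launch asynchronously and @elapsed may only be timing the launch. A fairer sketch looks like the following (assumption: this CuArrays version provides the CuArrays.@sync macro; if not, forcing a copy back to the host with collect() also synchronizes):

```julia
using CuArrays

B = cu(rand(Float32, 10000, 10000))
y = cu(rand(Float32, 10000))
B * y  # warm-up: triggers kernel compilation once, outside the timing loop

for i in 1:10
    # CuArrays.@sync waits for the kernel to actually finish, so @elapsed
    # measures the multiply itself rather than just the asynchronous launch
    @show @elapsed CuArrays.@sync B * y
end
```

With synchronization the per-iteration times should land somewhere between the launch-only figures above and the CPU's ~9 ms, depending on the GPU.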