Test environment
Ubuntu running on WSL2
Ubuntu 20.04
CUDA 11.8.0
In this environment it does not work and fails with the following error message:
[error] cuLinkAddData fails. This is usually caused by stale driver version.
Downgrading the CUDA package installed on Ubuntu to a version such as 11.1.1-1 makes it work.
Available CUDA package versions
masa@DESKTOP-HP:~/nxtest$ sudo apt list -a cuda
Listing... Done
cuda/unknown,now 11.8.0-1 amd64 [installed]
cuda/unknown 11.7.1-1 amd64
cuda/unknown 11.7.0-1 amd64
cuda/unknown 11.6.2-1 amd64
cuda/unknown 11.6.1-1 amd64
cuda/unknown 11.6.0-1 amd64
cuda/unknown 11.5.2-1 amd64
cuda/unknown 11.5.1-1 amd64
cuda/unknown 11.5.0-1 amd64
cuda/unknown 11.4.4-1 amd64
cuda/unknown 11.4.3-1 amd64
cuda/unknown 11.4.2-1 amd64
cuda/unknown 11.4.1-1 amd64
cuda/unknown 11.4.0-1 amd64
cuda/unknown 11.3.1-1 amd64
cuda/unknown 11.3.0-1 amd64
cuda/unknown 11.2.2-1 amd64
cuda/unknown 11.2.1-1 amd64
cuda/unknown 11.2.0-1 amd64
cuda/unknown 11.1.1-1 amd64
cuda/unknown 11.1.0-1 amd64
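To pick the currently installed version out of a listing like the one above, a small filter can help. This is a minimal sketch: the function name `installed_cuda_version` is hypothetical, and it assumes the `apt list -a cuda` output keeps the format shown here, with the version in the second column and an `[installed...]` tag on the active line.

```shell
# Hypothetical helper: read `apt list -a cuda` output on stdin and print
# the version of the line tagged [installed...].
# Assumes the column layout shown in the listing above.
installed_cuda_version() {
  awk '/\[installed/ { print $2; exit }'
}

# Example usage (commented out; apt prints a CLI-stability warning on stderr):
# apt list -a cuda 2>/dev/null | installed_cuda_version
```

With the listing above as input, the function prints the version string from the `[installed]` line.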
Steps to reproduce
$ export XLA_TARGET=cuda111
$ export XLA_BUILD=false
$ iex
Erlang/OTP 25 [erts-13.1.1] [source] [64-bit] [smp:16:16] [ds:16:16:10] [async-threads:1] [jit:ns]
Interactive Elixir (1.14.0) - press Ctrl+C to exit (type h() ENTER for help)
iex(1)> Mix.install([
...(1)> {:nx, "~> 0.3.0"},
...(1)> {:exla, "~> 0.3.0"},
...(1)> ],
...(1)> config: [
...(1)> nx: [
...(1)> default_backend: EXLA.Backend,
...(1)> default_defn_options: [compiler: EXLA],
...(1)> ]
...(1)> ]
...(1)> )
Resolving Hex dependencies...
Dependency resolution completed:
New:
complex 0.4.2
elixir_make 0.6.3
exla 0.3.0
nx 0.3.0
xla 0.3.0
* Getting nx (Hex package)
* Getting exla (Hex package)
* Getting elixir_make (Hex package)
* Getting xla (Hex package)
* Getting complex (Hex package)
==> complex
Compiling 2 files (.ex)
Generated complex app
==> nx
Compiling 24 files (.ex)
Generated nx app
==> elixir_make
Compiling 1 file (.ex)
Generated elixir_make app
==> xla
Compiling 2 files (.ex)
Generated xla app
21:04:26.560 [info] Found a matching archive (xla_extension-x86_64-linux-cuda111.tar.gz), going to download it
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 204M 100 204M 0 0 7981k 0 0:00:26 0:00:26 --:--:-- 6268k
21:04:52.742 [info] Successfully downloaded the XLA archive
==> exla
Unpacking /home/masa/.cache/xla/0.3.0/cache/download/xla_extension-x86_64-linux-cuda111.tar.gz into /home/masa/.cache/mix/installs/elixir-1.14.0-erts-13.1.1/c709e1c9414e09e688e98d555e300276/deps/exla/cache
g++ -fPIC -I/home/masa/.asdf/installs/erlang/25.1.1/erts-13.1.1/include -Icache/xla_extension/include -O3 -Wall -Wno-sign-compare -Wno-unused-parameter -Wno-missing-field-initializers -Wno-comment -shared -std=c++14 c_src/exla/exla.cc c_src/exla/exla_nif_util.cc c_src/exla/exla_client.cc -o cache/libexla.so -Lcache/xla_extension/lib -lxla_extension -Wl,-rpath,'$ORIGIN/lib'
Compiling 21 files (.ex)
Generated exla app
:ok
iex(2)> Nx.add(Nx.tensor([1]), Nx.tensor([1]))
21:05:33.943 [info] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
21:05:33.943 [info] XLA service 0x7efc2c5ffda0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
21:05:33.943 [info] StreamExecutor device (0): NVIDIA GeForce RTX 3060, Compute Capability 8.6
21:05:33.943 [info] Using BFC allocator.
21:05:33.943 [info] XLA backend allocating 10641368678 bytes on device 0 for BFCAllocator.
21:05:34.312 [info] Start cannot spawn child process: No such file or directory
21:05:34.312 [error] cuLinkAddData fails. This is usually caused by stale driver version.
21:05:34.312 [error] The CUDA linking API did not work. Please use XLA_FLAGS=--xla_gpu_force_compilation_parallelism=1 to bypass it, but expect to get longer compilation time due to the lack of multi-threading.
** (RuntimeError) no kernel image is available for execution on the device
in tensorflow/stream_executor/cuda/cuda_asm_compiler.cc(65): 'status'
(exla 0.3.0) lib/exla/computation.ex:92: EXLA.Computation.unwrap!/1
(exla 0.3.0) lib/exla/computation.ex:61: EXLA.Computation.compile/4
(exla 0.3.0) lib/exla/defn.ex:396: anonymous fn/9 in EXLA.Defn.compile/7
(exla 0.3.0) lib/exla/defn/locked_cache.ex:36: EXLA.Defn.LockedCache.run/2
(stdlib 4.1.1) timer.erl:235: :timer.tc/1
(exla 0.3.0) lib/exla/defn.ex:383: EXLA.Defn.compile/7
(exla 0.3.0) lib/exla/defn.ex:251: EXLA.Defn.__compile__/4
iex:2: (file)
iex(2)>
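The error log above itself suggests setting `XLA_FLAGS=--xla_gpu_force_compilation_parallelism=1` to bypass the CUDA linking API. I have not verified this here, and since the final error is "no kernel image is available for execution on the device", it may not be enough on its own. As a sketch, the environment from the reproduction steps with that flag added would look like this:

```shell
# Same environment as the reproduction steps above.
export XLA_TARGET=cuda111
export XLA_BUILD=false

# Flag taken verbatim from the [error] line in the log.
# Untested assumption: it may or may not resolve the failure on CUDA 11.8.
export XLA_FLAGS=--xla_gpu_force_compilation_parallelism=1

# Then start iex and run the same Mix.install / Nx.add steps:
# iex
```

The log warns that compilation may take longer with this flag because multi-threaded compilation is disabled.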
Trying 11.2.0 as well
$ sudo apt-get -y install cuda=11.2.0-1
It worked with 11.2.0.
$ iex
Erlang/OTP 25 [erts-13.1.1] [source] [64-bit] [smp:16:16] [ds:16:16:10] [async-threads:1] [jit:ns]
Interactive Elixir (1.14.0) - press Ctrl+C to exit (type h() ENTER for help)
iex(1)> Mix.install([
...(1)> {:nx, "~> 0.3.0"},
...(1)> {:exla, "~> 0.3.0"},
...(1)> ],
...(1)> config: [
...(1)> nx: [
...(1)> default_backend: EXLA.Backend,
...(1)> default_defn_options: [compiler: EXLA],
...(1)> ]
...(1)> ]
...(1)> )
Could not find Hex, which is needed to build dependency :nx
Shall I install Hex? (if running non-interactively, use "mix local.hex --force") [Yn] Nx.add(Nx.tensor([1]), Nx.tensor([1]))
** (Mix.Error) Could not find an SCM for dependency :nx from Mix.InstallProject
(mix 1.14.0) lib/mix.ex:513: Mix.raise/2
(mix 1.14.0) lib/mix/dep/loader.ex:195: Mix.Dep.Loader.with_scm_and_app/4
(mix 1.14.0) lib/mix/dep/loader.ex:141: Mix.Dep.Loader.to_dep/3
(elixir 1.14.0) lib/enum.ex:1658: Enum."-map/2-lists^map/1-0-"/2
(mix 1.14.0) lib/mix/dep/loader.ex:358: Mix.Dep.Loader.mix_children/2
(mix 1.14.0) lib/mix/dep/loader.ex:18: Mix.Dep.Loader.children/0
(mix 1.14.0) lib/mix/dep/converger.ex:80: Mix.Dep.Converger.all/4
iex:1: (file)
iex(1)>
nil
iex(2)> Mix.install([
...(2)> {:nx, "~> 0.3.0"},
...(2)> {:exla, "~> 0.3.0"},
...(2)> ],
...(2)> config: [
...(2)> nx: [
...(2)> default_backend: EXLA.Backend,
...(2)> default_defn_options: [compiler: EXLA],
...(2)> ]
...(2)> ]
...(2)> )
Could not find Hex, which is needed to build dependency :nx
Shall I install Hex? (if running non-interactively, use "mix local.hex --force") [Yn] y
* creating /home/masa/.asdf/installs/elixir/1.14.0-otp-25/.mix/archives/hex-1.0.1
Resolving Hex dependencies...
Dependency resolution completed:
New:
complex 0.4.2
elixir_make 0.6.3
exla 0.3.0
nx 0.3.0
xla 0.3.0
* Getting nx (Hex package)
* Getting exla (Hex package)
* Getting elixir_make (Hex package)
* Getting xla (Hex package)
* Getting complex (Hex package)
==> complex
Compiling 2 files (.ex)
Generated complex app
==> nx
Compiling 24 files (.ex)
Generated nx app
==> elixir_make
Compiling 1 file (.ex)
Generated elixir_make app
==> xla
Compiling 2 files (.ex)
Generated xla app
20:14:28.008 [info] Found a matching archive (xla_extension-x86_64-linux-cuda111.tar.gz), going to download it
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 204M 100 204M 0 0 15.6M 0 0:00:13 0:00:13 --:--:-- 23.1M
20:14:41.085 [info] Successfully downloaded the XLA archive
==> exla
Unpacking /home/masa/.cache/xla/0.3.0/cache/download/xla_extension-x86_64-linux-cuda111.tar.gz into /home/masa/.cache/mix/installs/elixir-1.14.0-erts-13.1.1/c709e1c9414e09e688e98d555e300276/deps/exla/cache
g++ -fPIC -I/home/masa/.asdf/installs/erlang/25.1.1/erts-13.1.1/include -Icache/xla_extension/include -O3 -Wall -Wno-sign-compare -Wno-unused-parameter -Wno-missing-field-initializers -Wno-comment -shared -std=c++14 c_src/exla/exla.cc c_src/exla/exla_nif_util.cc c_src/exla/exla_client.cc -o cache/libexla.so -Lcache/xla_extension/lib -lxla_extension -Wl,-rpath,'$ORIGIN/lib'
Compiling 21 files (.ex)
Generated exla app
:ok
iex(3)> Nx.add(Nx.tensor([1]), Nx.tensor([1]))
20:15:21.915 [info] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
20:15:21.915 [info] XLA service 0x7fb1c8002b90 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
20:15:21.915 [info] StreamExecutor device (0): NVIDIA GeForce RTX 3060, Compute Capability 8.6
20:15:21.915 [info] Using BFC allocator.
20:15:21.915 [info] XLA backend allocating 10641368678 bytes on device 0 for BFCAllocator.
20:15:22.332 [info] Start cannot spawn child process: No such file or directory
#Nx.Tensor<
s64[1]
EXLA.Backend<cuda:0, 0.1029217212.3535405091.41223>
[2]
>
iex(4)>
I then upgraded to 11.3.0-1, and it also worked correctly.
Results by CUDA version
11.8.0-1 NG (latest version)
11.7.0-1 NG
11.6.0-1 NG
11.3.0-1 OK
11.2.0-1 OK
11.1.1-1 OK
11.1.0-1 OK
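The results above can be turned into a quick check before installing a given package version. This is a rough sketch, not a statement about XLA compatibility in general: the function names are hypothetical, it relies on GNU `sort -V`, and it assumes that untested point releases inside an observed range (for example 11.7.1) behave like their tested neighbors. Versions between 11.3.0 and 11.6.0 were not tested, so they are reported as such.

```shell
# Return success if version $1 <= version $2 (requires GNU sort -V).
version_le() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Hypothetical helper: classify a CUDA apt package version using only
# the results listed above.
#   11.1.0 .. 11.3.0 -> OK
#   11.6.0 .. 11.8.0 -> NG
#   anything else    -> untested
classify_cuda_version() {
  ver="${1%-*}"   # strip the package revision, e.g. 11.8.0-1 -> 11.8.0
  if version_le 11.1.0 "$ver" && version_le "$ver" 11.3.0; then
    echo OK
  elif version_le 11.6.0 "$ver" && version_le "$ver" 11.8.0; then
    echo NG
  else
    echo untested
  fi
}

# Example usage:
# classify_cuda_version 11.2.0-1
```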
XLA startup commands (memo)
Mix.install([
{:nx, "~> 0.3.0"},
{:exla, "~> 0.3.0"},
],
config: [
nx: [
default_backend: EXLA.Backend,
default_defn_options: [compiler: EXLA],
]
]
)
Nx.add(Nx.tensor([1]), Nx.tensor([1]))
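export XLA_BUILD=true
With this set, I tried building against CUDA 11.8, but the build failed with an error.
I also tried building with the official Docker image, but got the same error.
I have filed an issue about this in elixir-nx/xla.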