
EXLA does not work with the latest CUDA 11.8

Posted at 2022-10-22

Test environment

Ubuntu running on WSL2
Ubuntu 20.04
CUDA 11.8.0

EXLA fails with the following error message:

[error] cuLinkAddData fails. This is usually caused by stale driver version.

It works once the cuda package installed on Ubuntu is changed to an older version such as 11.1.1-1.

Available CUDA versions

masa@DESKTOP-HP:~/nxtest$ sudo apt list -a cuda
Listing... Done
cuda/unknown,now 11.8.0-1 amd64 [installed]
cuda/unknown 11.7.1-1 amd64
cuda/unknown 11.7.0-1 amd64
cuda/unknown 11.6.2-1 amd64
cuda/unknown 11.6.1-1 amd64
cuda/unknown 11.6.0-1 amd64
cuda/unknown 11.5.2-1 amd64
cuda/unknown 11.5.1-1 amd64
cuda/unknown 11.5.0-1 amd64
cuda/unknown 11.4.4-1 amd64
cuda/unknown 11.4.3-1 amd64
cuda/unknown 11.4.2-1 amd64
cuda/unknown 11.4.1-1 amd64
cuda/unknown 11.4.0-1 amd64
cuda/unknown 11.3.1-1 amd64
cuda/unknown 11.3.0-1 amd64
cuda/unknown 11.2.2-1 amd64
cuda/unknown 11.2.1-1 amd64
cuda/unknown 11.2.0-1 amd64
cuda/unknown 11.1.1-1 amd64
cuda/unknown 11.1.0-1 amd64

How to reproduce the problem

$ export XLA_TARGET=cuda111
$ export XLA_BUILD=false
$ iex
Erlang/OTP 25 [erts-13.1.1] [source] [64-bit] [smp:16:16] [ds:16:16:10] [async-threads:1] [jit:ns]

Interactive Elixir (1.14.0) - press Ctrl+C to exit (type h() ENTER for help)
iex(1)> Mix.install([
...(1)>   {:nx, "~> 0.3.0"},
...(1)>   {:exla, "~> 0.3.0"},
...(1)> ],
...(1)> config: [
...(1)>           nx: [
...(1)>             default_backend: EXLA.Backend,
...(1)>             default_defn_options: [compiler: EXLA],
...(1)>           ]
...(1)>         ]
...(1)> )
Resolving Hex dependencies...
Dependency resolution completed:
New:
  complex 0.4.2
  elixir_make 0.6.3
  exla 0.3.0
  nx 0.3.0
  xla 0.3.0
* Getting nx (Hex package)
* Getting exla (Hex package)
* Getting elixir_make (Hex package)
* Getting xla (Hex package)
* Getting complex (Hex package)
==> complex
Compiling 2 files (.ex)
Generated complex app
==> nx
Compiling 24 files (.ex)
Generated nx app
==> elixir_make
Compiling 1 file (.ex)
Generated elixir_make app
==> xla
Compiling 2 files (.ex)
Generated xla app

21:04:26.560 [info] Found a matching archive (xla_extension-x86_64-linux-cuda111.tar.gz), going to download it
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                                                                                                Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  204M  100  204M    0     0  7981k      0  0:00:26  0:00:26 --:--:-- 6268k

21:04:52.742 [info] Successfully downloaded the XLA archive
==> exla
Unpacking /home/masa/.cache/xla/0.3.0/cache/download/xla_extension-x86_64-linux-cuda111.tar.gz into /home/masa/.cache/mix/installs/elixir-1.14.0-erts-13.1.1/c709e1c9414e09e688e98d555e300276/deps/exla/cache
g++ -fPIC -I/home/masa/.asdf/installs/erlang/25.1.1/erts-13.1.1/include -Icache/xla_extension/include -O3 -Wall -Wno-sign-compare -Wno-unused-parameter -Wno-missing-field-initializers -Wno-comment -shared -std=c++14 c_src/exla/exla.cc c_src/exla/exla_nif_util.cc c_src/exla/exla_client.cc -o cache/libexla.so -Lcache/xla_extension/lib -lxla_extension -Wl,-rpath,'$ORIGIN/lib'
Compiling 21 files (.ex)
Generated exla app
:ok
iex(2)> Nx.add(Nx.tensor([1]), Nx.tensor([1]))

21:05:33.943 [info] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.

21:05:33.943 [info] XLA service 0x7efc2c5ffda0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:

21:05:33.943 [info]   StreamExecutor device (0): NVIDIA GeForce RTX 3060, Compute Capability 8.6

21:05:33.943 [info] Using BFC allocator.

21:05:33.943 [info] XLA backend allocating 10641368678 bytes on device 0 for BFCAllocator.

21:05:34.312 [info] Start cannot spawn child process: No such file or directory

21:05:34.312 [error] cuLinkAddData fails. This is usually caused by stale driver version.

21:05:34.312 [error] The CUDA linking API did not work. Please use XLA_FLAGS=--xla_gpu_force_compilation_parallelism=1 to bypass it, but expect to get longer compilation time due to the lack of multi-threading.
** (RuntimeError) no kernel image is available for execution on the device
in tensorflow/stream_executor/cuda/cuda_asm_compiler.cc(65): 'status'
    (exla 0.3.0) lib/exla/computation.ex:92: EXLA.Computation.unwrap!/1
    (exla 0.3.0) lib/exla/computation.ex:61: EXLA.Computation.compile/4
    (exla 0.3.0) lib/exla/defn.ex:396: anonymous fn/9 in EXLA.Defn.compile/7
    (exla 0.3.0) lib/exla/defn/locked_cache.ex:36: EXLA.Defn.LockedCache.run/2
    (stdlib 4.1.1) timer.erl:235: :timer.tc/1
    (exla 0.3.0) lib/exla/defn.ex:383: EXLA.Defn.compile/7
    (exla 0.3.0) lib/exla/defn.ex:251: EXLA.Defn.__compile__/4
    iex:2: (file)
iex(2)>
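
The error log above itself suggests setting XLA_FLAGS=--xla_gpu_force_compilation_parallelism=1 to bypass the CUDA linking API. This article does not verify whether that helps on CUDA 11.8, and the final "no kernel image is available" error may still occur. As an untested sketch, the flag would be exported in the shell next to the other variables before starting iex:

```shell
# Untested assumption: the flag is copied verbatim from the [error] log line above.
# Whether it actually avoids the failure on CUDA 11.8 is not confirmed in this article.
export XLA_TARGET=cuda111
export XLA_BUILD=false
export XLA_FLAGS=--xla_gpu_force_compilation_parallelism=1

# Show the value that a subsequently started iex session will inherit.
echo "XLA_FLAGS=$XLA_FLAGS"
```

After this, start iex from the same shell and rerun the Mix.install and Nx.add steps shown above.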

Trying 11.2.0 as well

$ sudo apt-get -y install cuda=11.2.0-1

It worked with 11.2.0.

$ iex
Erlang/OTP 25 [erts-13.1.1] [source] [64-bit] [smp:16:16] [ds:16:16:10] [async-threads:1] [jit:ns]

Interactive Elixir (1.14.0) - press Ctrl+C to exit (type h() ENTER for help)
iex(1)> Mix.install([
...(1)>   {:nx, "~> 0.3.0"},
...(1)>   {:exla, "~> 0.3.0"},
...(1)> ],
...(1)> config: [
...(1)>           nx: [
...(1)>             default_backend: EXLA.Backend,
...(1)>             default_defn_options: [compiler: EXLA],
...(1)>           ]
...(1)>         ]
...(1)> )
Could not find Hex, which is needed to build dependency :nx
Shall I install Hex? (if running non-interactively, use "mix local.hex --force") [Yn] Nx.add(Nx.tensor([1]), Nx.tensor([1]))
** (Mix.Error) Could not find an SCM for dependency :nx from Mix.InstallProject
    (mix 1.14.0) lib/mix.ex:513: Mix.raise/2
    (mix 1.14.0) lib/mix/dep/loader.ex:195: Mix.Dep.Loader.with_scm_and_app/4
    (mix 1.14.0) lib/mix/dep/loader.ex:141: Mix.Dep.Loader.to_dep/3
    (elixir 1.14.0) lib/enum.ex:1658: Enum."-map/2-lists^map/1-0-"/2
    (mix 1.14.0) lib/mix/dep/loader.ex:358: Mix.Dep.Loader.mix_children/2
    (mix 1.14.0) lib/mix/dep/loader.ex:18: Mix.Dep.Loader.children/0
    (mix 1.14.0) lib/mix/dep/converger.ex:80: Mix.Dep.Converger.all/4
    iex:1: (file)
iex(1)>
nil
iex(2)> Mix.install([
...(2)>   {:nx, "~> 0.3.0"},
...(2)>   {:exla, "~> 0.3.0"},
...(2)> ],
...(2)> config: [
...(2)>           nx: [
...(2)>             default_backend: EXLA.Backend,
...(2)>             default_defn_options: [compiler: EXLA],
...(2)>           ]
...(2)>         ]
...(2)> )
Could not find Hex, which is needed to build dependency :nx
Shall I install Hex? (if running non-interactively, use "mix local.hex --force") [Yn] y
* creating /home/masa/.asdf/installs/elixir/1.14.0-otp-25/.mix/archives/hex-1.0.1
Resolving Hex dependencies...
Dependency resolution completed:
New:
  complex 0.4.2
  elixir_make 0.6.3
  exla 0.3.0
  nx 0.3.0
  xla 0.3.0
* Getting nx (Hex package)
* Getting exla (Hex package)
* Getting elixir_make (Hex package)
* Getting xla (Hex package)
* Getting complex (Hex package)
==> complex
Compiling 2 files (.ex)
Generated complex app
==> nx
Compiling 24 files (.ex)
Generated nx app
==> elixir_make
Compiling 1 file (.ex)
Generated elixir_make app
==> xla
Compiling 2 files (.ex)
Generated xla app

20:14:28.008 [info] Found a matching archive (xla_extension-x86_64-linux-cuda111.tar.gz), going to download it
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                                                                                                Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  204M  100  204M    0     0  15.6M      0  0:00:13  0:00:13 --:--:-- 23.1M

20:14:41.085 [info] Successfully downloaded the XLA archive
==> exla
Unpacking /home/masa/.cache/xla/0.3.0/cache/download/xla_extension-x86_64-linux-cuda111.tar.gz into /home/masa/.cache/mix/installs/elixir-1.14.0-erts-13.1.1/c709e1c9414e09e688e98d555e300276/deps/exla/cache
g++ -fPIC -I/home/masa/.asdf/installs/erlang/25.1.1/erts-13.1.1/include -Icache/xla_extension/include -O3 -Wall -Wno-sign-compare -Wno-unused-parameter -Wno-missing-field-initializers -Wno-comment -shared -std=c++14 c_src/exla/exla.cc c_src/exla/exla_nif_util.cc c_src/exla/exla_client.cc -o cache/libexla.so -Lcache/xla_extension/lib -lxla_extension -Wl,-rpath,'$ORIGIN/lib'
Compiling 21 files (.ex)
Generated exla app
:ok
iex(3)> Nx.add(Nx.tensor([1]), Nx.tensor([1]))

20:15:21.915 [info] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.

20:15:21.915 [info] XLA service 0x7fb1c8002b90 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:

20:15:21.915 [info]   StreamExecutor device (0): NVIDIA GeForce RTX 3060, Compute Capability 8.6

20:15:21.915 [info] Using BFC allocator.

20:15:21.915 [info] XLA backend allocating 10641368678 bytes on device 0 for BFCAllocator.

20:15:22.332 [info] Start cannot spawn child process: No such file or directory
#Nx.Tensor<
  s64[1]
  EXLA.Backend<cuda:0, 0.1029217212.3535405091.41223>
  [2]
>
iex(4)>

I then upgraded to 11.3.0-1, and it still worked correctly.

Results by version

11.8.0-1 NG (latest version)
11.7.0-1 NG
11.6.0-1 NG
11.3.0-1 OK
11.2.0-1 OK
11.1.1-1 OK
11.1.0-1 OK
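
The results above can be captured in a small lookup so that a setup script can warn before EXLA is started. This is only a sketch: cuda_exla_status is a hypothetical helper name, not part of EXLA or CUDA, and it encodes nothing beyond the versions tested in this article with the cuda111 archive. Any other version is reported as untested rather than guessed.

```shell
# Hypothetical helper: map a cuda apt package version to the result
# observed in this article (EXLA 0.3.0, XLA_TARGET=cuda111).
cuda_exla_status() {
  case "$1" in
    11.8.0-1|11.7.0-1|11.6.0-1) echo "NG" ;;
    11.3.0-1|11.2.0-1|11.1.1-1|11.1.0-1) echo "OK" ;;
    *) echo "untested" ;;   # versions not covered by the list above
  esac
}

cuda_exla_status 11.8.0-1   # prints NG
cuda_exla_status 11.2.0-1   # prints OK
cuda_exla_status 11.4.0-1   # prints untested
```

On a real machine the argument could come from the installed package, for example via dpkg-query -W -f='${Version}' cuda, assuming CUDA was installed through the cuda apt package as in this article.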

Commands to start XLA (memo)

Mix.install([
  {:nx, "~> 0.3.0"},
  {:exla, "~> 0.3.0"},
],
config: [
          nx: [
            default_backend: EXLA.Backend,
            default_defn_options: [compiler: EXLA],
          ]
        ]
)
Nx.add(Nx.tensor([1]), Nx.tensor([1]))

I tried building against CUDA 11.8 with export XLA_BUILD=true, but the build failed with an error.
I also tried building with the official Docker image, but got the same error.

I have filed an issue about this in elixir-nx/xla.
