Linux (Ubuntu) + GPU Error Message Collection

Posted at 2018-01-30

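This is a running record of error messages I have run into while operating GPUs on Linux.

Feel free to use it, for example, as input for liveness-monitoring rules on GPU servers.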

Information sources

  • dmesg
  • nvidia-smi(NVIDIA GPU)

NVIDIA

When an application reports a CUDA error, dmesg usually shows a corresponding XID error.
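For monitoring, the XID code can be pulled out of dmesg text with a regular expression. The following is a minimal sketch, assuming the `NVRM: Xid (PCI:...): <code>, ...` line format seen in the logs in this article; other driver versions may print extra fields. The names `XidEvent`, `parse_xid`, and `find_xid_events` are hypothetical helpers, not part of any NVIDIA tool.

```python
import re
from typing import List, NamedTuple, Optional

# Matches lines such as:
#   NVRM: Xid (PCI:0000:01:00): 45, Ch 00000000, engmsk 00000100
# The format is assumed from the log samples in this article.
XID_RE = re.compile(r"NVRM: Xid \((?P<pci>[^)]+)\): (?P<code>\d+),\s*(?P<detail>.*)")


class XidEvent(NamedTuple):
    pci: str     # PCI address as printed by the driver
    code: int    # XID number
    detail: str  # remainder of the message


def parse_xid(line: str) -> Optional[XidEvent]:
    """Return the XID event found in one log line, or None."""
    m = XID_RE.search(line)
    if m is None:
        return None
    return XidEvent(m.group("pci"), int(m.group("code")), m.group("detail").strip())


def find_xid_events(log_text: str) -> List[XidEvent]:
    """Collect every XID event in a multi-line dmesg dump."""
    events = []
    for line in log_text.splitlines():
        event = parse_xid(line)
        if event is not None:
            events.append(event)
    return events
```

The log text can come from wherever dmesg output is available (for example the output of `dmesg` captured with `subprocess.run`, which may require root on some systems).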

XID エラーリスト

GPU has fallen off the bus

79, GPU has fallen off the bus

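This is mainly caused by thermal runaway.

In particular, on the RTX 3000 series the memory can become unexpectedly hot even when the GPU temperature is only in the 60s (°C), and the card can go down with error code 79, so be careful.
It seems memory temperature cannot be read on Linux, so with air cooling it appears to help to keep the space behind the backplate as open as possible (away from the CPU fan side) to secure airflow.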
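Since only the GPU core temperature is easy to read, one rough approach is to alert on it with a conservative limit. The sketch below parses the output of `nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader,nounits`. The function names and the threshold are assumptions for illustration, and the reading does not reflect memory temperature, which per the note above can be high even when the core is in the 60s.

```python
from typing import Dict, List


def parse_gpu_temps(csv_text: str) -> Dict[int, int]:
    """Parse 'index, temperature' lines into {gpu_index: temp_celsius}."""
    temps = {}
    for line in csv_text.splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 2:
            continue  # skip blank or unexpected lines
        index, temp = fields
        if not (index.isdigit() and temp.isdigit()):
            continue  # skip non-numeric readings such as "[N/A]"
        temps[int(index)] = int(temp)
    return temps


def gpus_over(temps: Dict[int, int], limit_c: int) -> List[int]:
    """Return the indices of GPUs at or above limit_c, in ascending order."""
    return sorted(i for i, t in temps.items() if t >= limit_c)
```

A monitoring job could feed the command output to `parse_gpu_temps` and raise an alert when `gpus_over` returns a non-empty list. The limit has to be chosen per machine.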

It can also be caused by PCIe-related issues (mainly on Linux).

Turning ASPM off, or (if you run long computations) setting the PowerMizer mode to max performance (adaptive adjustment off), might improve things.
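To check which ASPM policy is currently active, many Linux kernels expose it in `/sys/module/pcie_aspm/parameters/policy`, where the active policy is shown in square brackets. The sketch below assumes that file exists and uses that format; the helper names are hypothetical.

```python
import re
from typing import Optional

# Assumed location of the ASPM policy on typical Linux kernels.
ASPM_POLICY_PATH = "/sys/module/pcie_aspm/parameters/policy"


def active_aspm_policy(policy_text: str) -> Optional[str]:
    """Return the bracketed (active) policy name, or None if none is marked."""
    m = re.search(r"\[(\w+)\]", policy_text)
    return m.group(1) if m else None


def read_aspm_policy(path: str = ASPM_POLICY_PATH) -> Optional[str]:
    """Read the policy file; return None if it is missing or unreadable."""
    try:
        with open(path) as f:
            return active_aspm_policy(f.read())
    except OSError:
        return None
```

If the result is `powersave` or `powersupersave` on a machine that keeps hitting XID 79, that is one hint that ASPM is worth disabling as an experiment.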

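The motherboard chipset and the power supply capacity (voltage fluctuation?) also seem to have an effect.

If XID 79 keeps occurring in a situation that looks PCIe-related rather than thermal, consider replacing the motherboard or switching the OS to Windows (it seems to occur less often on Windows).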

XID 45

Jan 30 16:13:42 mini-titanv kernel: [ 2906.990065] nvidia-modeset: Allocated GPU:0 (GPU-XXXXXXXXXXXXX) @ PCI:0000:01:00.0
Jan 30 16:13:42 mini-titanv kernel: [ 2907.003559] NVRM: GPU at PCI:0000:01:00: GPU-XXXXXXXXXXXXXX
Jan 30 16:13:42 mini-titanv kernel: [ 2907.003561] NVRM: GPU Board Serial Number: XXXXXXXXX
Jan 30 16:13:42 mini-titanv kernel: [ 2907.003563] NVRM: Xid (PCI:0000:01:00): 45, Ch 00000000, engmsk 00000100
....
Jan 30 16:13:42 mini-titanv kernel: [ 2907.014600] nvidia-modeset: ERROR: GPU:0: Unable to allocate DMA memory
Jan 30 16:13:42 mini-titanv kernel: [ 2907.014602] nvidia-modeset: ERROR: GPU:0: Notifier DMA allocation failed
Jan 30 16:13:42 mini-titanv kernel: [ 2907.014604] nvidia-modeset: ERROR: GPU:0: Failed to allocate display engine window channels

Occurred with TITAN V + Z370. It seems to become an error when nvidia-modeset runs at some point (the nvidia-smi report also fails). A reboot fixes it.

Other

XID 31: GPU page fault. Occurred with a mining program. A reboot fixes it.
XID 13: Graphics Engine Exception. This turned out to be a hardware (GPU) failure...
XID 62: Internal micro-controller halt. Seems to occur only very rarely? A reboot fixes it.
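For alerting, the notes in this article can be kept as a small lookup table so that a monitoring message carries a hint along with the raw code. This is a sketch only: the table reflects the experiences recorded here, not NVIDIA's official definitions, and `XID_NOTES` and `describe_xid` are hypothetical names. XID 45 is left out because no name is given for it above.

```python
from typing import Dict, Tuple

# code -> (name as recorded in this article, remark from this article)
XID_NOTES: Dict[int, Tuple[str, str]] = {
    13: ("Graphics Engine Exception", "turned out to be a GPU hardware failure"),
    31: ("GPU page fault", "seen with a mining program; cleared by a reboot"),
    62: ("Internal micro-controller halt", "rare; cleared by a reboot"),
    79: ("GPU has fallen off the bus", "mainly thermal; can also be PCIe-related"),
}


def describe_xid(code: int) -> str:
    """Return a one-line hint for an XID code, based on the notes above."""
    note = XID_NOTES.get(code)
    if note is None:
        return f"XID {code}: no note recorded"
    name, remark = note
    return f"XID {code}: {name} ({remark})"
```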

[585960.901845] Uhhuh. NMI received for unknown reason 31 on CPU 14.
[585960.901850] Do you have a strange power saving mode enabled?
[585960.901850] Dazed and confused, but trying to continue

Possibly caused by the power limit? The system kept working without problems, with no reboot or other action needed.

AMD

Jan 29 23:06:21 mini-titanv kernel: [106695.489023] INFO: task kworker/4:2:1138 blocked for more than 120 seconds.
Jan 29 23:06:21 mini-titanv kernel: [106695.489026]       Tainted: P           OE   4.13.0-32-generic #35~16.04.1-Ubuntu
Jan 29 23:06:21 mini-titanv kernel: [106695.489027] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 29 23:06:21 mini-titanv kernel: [106695.489028] kworker/4:2     D    0  1138      2 0x00000000
Jan 29 23:06:21 mini-titanv kernel: [106695.489038] Workqueue: kfd_process_wq kfd_process_wq_release [amdkfd]
Jan 29 23:06:21 mini-titanv kernel: [106695.489039] Call Trace:
Jan 29 23:06:21 mini-titanv kernel: [106695.489043]  __schedule+0x3c2/0x890
Jan 29 23:06:21 mini-titanv kernel: [106695.489045]  schedule+0x36/0x80
Jan 29 23:06:21 mini-titanv kernel: [106695.489083]  amd_sched_entity_push_job+0xc9/0x110 [amdgpu]
Jan 29 23:06:21 mini-titanv kernel: [106695.489086]  ? wait_woken+0x80/0x80
Jan 29 23:06:21 mini-titanv kernel: [106695.489117]  amdgpu_job_submit+0x6c/0x80 [amdgpu]
Jan 29 23:06:21 mini-titanv kernel: [106695.489141]  amdgpu_vm_bo_update_mapping+0x324/0x3f0 [amdgpu]
Jan 29 23:06:21 mini-titanv kernel: [106695.489143]  ? __slab_free+0x9e/0x2e0
Jan 29 23:06:21 mini-titanv kernel: [106695.489167]  ? amdgpu_vm_free_mapping.isra.20+0x30/0x30 [amdgpu]
Jan 29 23:06:21 mini-titanv kernel: [106695.489189]  amdgpu_vm_clear_freed+0xa6/0x170 [amdgpu]
Jan 29 23:06:21 mini-titanv kernel: [106695.489218]  unmap_bo_from_gpuvm.isra.13+0x68/0xd0 [amdgpu]
Jan 29 23:06:21 mini-titanv kernel: [106695.489246]  amdgpu_amdkfd_gpuvm_unmap_memory_from_gpu+0x1ad/0x2f0 [amdgpu]
Jan 29 23:06:21 mini-titanv kernel: [106695.489249]  ? __radix_tree_delete+0x8d/0xb0
Jan 29 23:06:21 mini-titanv kernel: [106695.489255]  kfd_process_free_outstanding_kfd_bos+0x91/0x100 [amdkfd]
Jan 29 23:06:21 mini-titanv kernel: [106695.489260]  kfd_process_wq_release+0x55/0xf0 [amdkfd]
Jan 29 23:06:21 mini-titanv kernel: [106695.489262]  process_one_work+0x156/0x410
Jan 29 23:06:21 mini-titanv kernel: [106695.489264]  worker_thread+0x4b/0x460
Jan 29 23:06:21 mini-titanv kernel: [106695.489266]  kthread+0x109/0x140
Jan 29 23:06:21 mini-titanv kernel: [106695.489267]  ? process_one_work+0x410/0x410
Jan 29 23:06:21 mini-titanv kernel: [106695.489269]  ? kthread_create_on_node+0x70/0x70
Jan 29 23:06:21 mini-titanv kernel: [106695.489271]  ret_from_fork+0x1f/0x30

Occurred with amdgpu-pro 17.50 + VEGA + Z370 (in a system that also had an NVIDIA GPU). Note that a software reboot does not seem to be possible.
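Because a software reboot may not work in this state, it can be useful to detect the hung-task message early. The sketch below assumes the `INFO: task <comm>:<pid> blocked for more than <N> seconds.` format shown in the log above; `HungTask`, `parse_hung_task`, and `amdgpu_hang_detected` are hypothetical helper names.

```python
import re
from typing import NamedTuple, Optional

# Matches: INFO: task kworker/4:2:1138 blocked for more than 120 seconds.
# The task name may itself contain ':', so the pid is taken as the last
# ':<digits>' before " blocked".
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<comm>.+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)


class HungTask(NamedTuple):
    comm: str
    pid: int
    seconds: int


def parse_hung_task(line: str) -> Optional[HungTask]:
    """Return the hung-task report found in one log line, or None."""
    m = HUNG_TASK_RE.search(line)
    if m is None:
        return None
    return HungTask(m.group("comm"), int(m.group("pid")), int(m.group("secs")))


def amdgpu_hang_detected(log_text: str) -> bool:
    """True if the log has a hung-task report and mentions amdgpu/amdkfd frames."""
    lines = log_text.splitlines()
    has_hang = any(parse_hung_task(line) is not None for line in lines)
    has_amd = any("[amdgpu]" in line or "[amdkfd]" in line for line in lines)
    return has_hang and has_amd
```

The second check is deliberately loose: it only looks for both signals anywhere in the given text, so pass it a recent window of the log rather than the full history.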
