error
GPU

Linux (Ubuntu) + GPU error message collection

This is a running record of error messages encountered while operating GPUs on Linux.

Feel free to put them to use, for example by adding them to the health-check rules for your GPU servers.

Sources

  • dmesg
  • nvidia-smi (NVIDIA GPU)
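For monitoring, both sources can be polled from a script. The following is only a minimal sketch: the helper name `capture` and the 10-second timeout are my own assumptions, not something the tools prescribe. It runs a command and reports failure instead of raising, so a broken `nvidia-smi` can itself be treated as a signal.

```python
import subprocess


def capture(argv, timeout_s=10):
    """Run a command and return (ok, output).

    Hypothetical helper: it never raises for a missing, failing, or
    hanging tool, so the caller can treat every failure as an alert.
    """
    try:
        result = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return False, "timed out after %d s" % timeout_s
    except OSError as exc:
        # Covers "command not found" and permission problems.
        return False, str(exc)
    return result.returncode == 0, result.stdout + result.stderr


if __name__ == "__main__":
    # dmesg may need elevated privileges depending on kernel settings.
    for command in (["dmesg"], ["nvidia-smi"]):
        ok, output = capture(command)
        print(command[0], "OK" if ok else "FAILED")
```

The timeout is there on the assumption that a tool may hang rather than exit when the driver is in a bad state; adjust or drop it to suit your environment.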

NVIDIA

Xid error list

http://docs.nvidia.com/deploy/xid-errors/index.html

dmesg

Jan 30 16:13:42 mini-titanv kernel: [ 2906.990065] nvidia-modeset: Allocated GPU:0 (GPU-XXXXXXXXXXXXX) @ PCI:0000:01:00.0
Jan 30 16:13:42 mini-titanv kernel: [ 2907.003559] NVRM: GPU at PCI:0000:01:00: GPU-XXXXXXXXXXXXXX
Jan 30 16:13:42 mini-titanv kernel: [ 2907.003561] NVRM: GPU Board Serial Number: XXXXXXXXX
Jan 30 16:13:42 mini-titanv kernel: [ 2907.003563] NVRM: Xid (PCI:0000:01:00): 45, Ch 00000000, engmsk 00000100
....
Jan 30 16:13:42 mini-titanv kernel: [ 2907.014600] nvidia-modeset: ERROR: GPU:0: Unable to allocate DMA memory
Jan 30 16:13:42 mini-titanv kernel: [ 2907.014602] nvidia-modeset: ERROR: GPU:0: Notifier DMA allocation failed
Jan 30 16:13:42 mini-titanv kernel: [ 2907.014604] nvidia-modeset: ERROR: GPU:0: Failed to allocate display engine window channels

Occurred on TITAN V + Z370. The error seems to be triggered when nvidia-modeset runs at some point (the nvidia-smi report also fails with an error). A reboot fixes it.
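As a starting point for a monitoring rule, the Xid lines in the log above can be picked out with a regular expression. This is a rough sketch under my own assumptions: the function names `find_xid_events` and `find_modeset_errors` are hypothetical, and the patterns are based only on the log format shown here, which may differ between driver versions.

```python
import re

# Matches kernel lines such as:
#   NVRM: Xid (PCI:0000:01:00): 45, Ch 00000000, engmsk 00000100
XID_PATTERN = re.compile(
    r"NVRM: Xid \((?P<pci>PCI:[0-9a-fA-F:.]+)\): (?P<code>\d+)"
)

# Matches lines such as:
#   nvidia-modeset: ERROR: GPU:0: Unable to allocate DMA memory
MODESET_ERROR_PATTERN = re.compile(r"nvidia-modeset: ERROR: (?P<message>.*)")


def find_xid_events(log_text):
    """Return a list of (pci_address, xid_code) for each Xid line."""
    events = []
    for line in log_text.splitlines():
        match = XID_PATTERN.search(line)
        if match:
            events.append((match.group("pci"), int(match.group("code"))))
    return events


def find_modeset_errors(log_text):
    """Return the message part of each nvidia-modeset ERROR line."""
    messages = []
    for line in log_text.splitlines():
        match = MODESET_ERROR_PATTERN.search(line)
        if match:
            messages.append(match.group("message").strip())
    return messages
```

The numeric code returned by `find_xid_events` can then be looked up in the Xid error list linked above.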

AMD

Jan 29 23:06:21 mini-titanv kernel: [106695.489023] INFO: task kworker/4:2:1138 blocked for more than 120 seconds.
Jan 29 23:06:21 mini-titanv kernel: [106695.489026]       Tainted: P           OE   4.13.0-32-generic #35~16.04.1-Ubuntu
Jan 29 23:06:21 mini-titanv kernel: [106695.489027] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 29 23:06:21 mini-titanv kernel: [106695.489028] kworker/4:2     D    0  1138      2 0x00000000
Jan 29 23:06:21 mini-titanv kernel: [106695.489038] Workqueue: kfd_process_wq kfd_process_wq_release [amdkfd]
Jan 29 23:06:21 mini-titanv kernel: [106695.489039] Call Trace:
Jan 29 23:06:21 mini-titanv kernel: [106695.489043]  __schedule+0x3c2/0x890
Jan 29 23:06:21 mini-titanv kernel: [106695.489045]  schedule+0x36/0x80
Jan 29 23:06:21 mini-titanv kernel: [106695.489083]  amd_sched_entity_push_job+0xc9/0x110 [amdgpu]
Jan 29 23:06:21 mini-titanv kernel: [106695.489086]  ? wait_woken+0x80/0x80
Jan 29 23:06:21 mini-titanv kernel: [106695.489117]  amdgpu_job_submit+0x6c/0x80 [amdgpu]
Jan 29 23:06:21 mini-titanv kernel: [106695.489141]  amdgpu_vm_bo_update_mapping+0x324/0x3f0 [amdgpu]
Jan 29 23:06:21 mini-titanv kernel: [106695.489143]  ? __slab_free+0x9e/0x2e0
Jan 29 23:06:21 mini-titanv kernel: [106695.489167]  ? amdgpu_vm_free_mapping.isra.20+0x30/0x30 [amdgpu]
Jan 29 23:06:21 mini-titanv kernel: [106695.489189]  amdgpu_vm_clear_freed+0xa6/0x170 [amdgpu]
Jan 29 23:06:21 mini-titanv kernel: [106695.489218]  unmap_bo_from_gpuvm.isra.13+0x68/0xd0 [amdgpu]
Jan 29 23:06:21 mini-titanv kernel: [106695.489246]  amdgpu_amdkfd_gpuvm_unmap_memory_from_gpu+0x1ad/0x2f0 [amdgpu]
Jan 29 23:06:21 mini-titanv kernel: [106695.489249]  ? __radix_tree_delete+0x8d/0xb0
Jan 29 23:06:21 mini-titanv kernel: [106695.489255]  kfd_process_free_outstanding_kfd_bos+0x91/0x100 [amdkfd]
Jan 29 23:06:21 mini-titanv kernel: [106695.489260]  kfd_process_wq_release+0x55/0xf0 [amdkfd]
Jan 29 23:06:21 mini-titanv kernel: [106695.489262]  process_one_work+0x156/0x410
Jan 29 23:06:21 mini-titanv kernel: [106695.489264]  worker_thread+0x4b/0x460
Jan 29 23:06:21 mini-titanv kernel: [106695.489266]  kthread+0x109/0x140
Jan 29 23:06:21 mini-titanv kernel: [106695.489267]  ? process_one_work+0x410/0x410
Jan 29 23:06:21 mini-titanv kernel: [106695.489269]  ? kthread_create_on_node+0x70/0x70
Jan 29 23:06:21 mini-titanv kernel: [106695.489271]  ret_from_fork+0x1f/0x30

Occurred with amdgpu-pro 17.50 + VEGA + Z370 (on a machine that also has an NVIDIA GPU installed). Note that a software reboot does not seem to be possible in this state.
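A hung-task report like the one above can also be detected mechanically. The sketch below is an illustration under my own assumptions: the function name `find_amd_hung_tasks` is hypothetical, and it naively attributes every `[amdgpu]` or `[amdkfd]` frame to the most recent "blocked for more than" line, which is good enough for a log shaped like this one but not a general call-trace parser.

```python
import re

# Matches: INFO: task kworker/4:2:1138 blocked for more than 120 seconds.
HUNG_TASK_PATTERN = re.compile(
    r"INFO: task (?P<task>\S+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)

# Kernel module tags that appear at the end of call-trace frames.
MODULE_PATTERN = re.compile(r"\[(amdgpu|amdkfd)\]")


def find_amd_hung_tasks(log_text):
    """Return hung-task reports whose following lines mention AMD modules."""
    reports = []
    current = None
    for line in log_text.splitlines():
        match = HUNG_TASK_PATTERN.search(line)
        if match:
            current = {
                "task": match.group("task"),
                "pid": int(match.group("pid")),
                "seconds": int(match.group("secs")),
                "modules": set(),
            }
            reports.append(current)
            continue
        if current is not None:
            # Collect module names seen in the trace after the report line.
            current["modules"].update(MODULE_PATTERN.findall(line))
    # Keep only reports that involve the AMD GPU driver stack.
    return [report for report in reports if report["modules"]]
```

Since a software reboot does not seem to work in this state, an alert raised from such a rule probably needs to reach a person rather than trigger an automatic reboot.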