Tracking My NVIDIA GPU Issues and Fixes
This post records various GPU-related issues I encountered, along with their diagnoses and resolutions.
Unable to Determine the Device Handle for GPU0000:1E:00.0: Unknown Error
Recently, while training deep neural network models, I encountered a GPU-related issue. Here's a concise log of the problem and how I resolved it:
Issue
Running nvidia-smi resulted in:
Unable to determine the device handle for GPU0000:1E:00.0: Unknown Error
Diagnosis
Check GPU Status: I ran
lspci | grep "RTX 3090"
(all the GPUs on that server are RTX 3090s) and got:
1b:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
1c:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
1d:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
1e:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev ff)
3d:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
44:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
45:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
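46:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
GPUs marked as rev ff are unavailable: a revision of ff means the device's PCI configuration space reads back as all ones, so the card is no longer responding on the bus. Here that indicates an issue with GPU0000:1e:00.0.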
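The rev ff check above can be scripted so it does not rely on reading the listing by eye. This is a minimal sketch: the function name find_rev_ff is my own, and it reads lspci output on stdin so it can also be run against saved output.

```shell
# Print the PCI address of every NVIDIA device whose revision reads "ff".
# find_rev_ff is a hypothetical helper name; it filters `lspci` output on stdin.
find_rev_ff() {
    # $1 is the bus address column, e.g. 1e:00.0
    awk '/NVIDIA/ && /\(rev ff\)/ { print $1 }'
}

# Typical use on a live system (needs lspci from pciutils):
#   lspci | find_rev_ff
```

No output means every NVIDIA device still reports a normal revision.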
Debug with NVIDIA Bug Report Script: I ran:
curl -s https://raw.githubusercontent.com/Pardus-Linux/Packages/master/hardware/graphics/nvidia-xconfig/files/nvidia-bug-report.sh | sudo bash
The output included:
[11638160.349810] NVRM: GPU 0000:1e:00.0: GPU has fallen off the bus.
From this NVIDIA forum thread, I learned that ML workloads can cause power spikes, leading to GPU instability. This pointed to my PSU as a possible culprit.
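The "fallen off the bus" line comes from the kernel log, so it can also be searched for directly without generating a full bug report. A rough sketch, assuming the message keeps the wording shown above; the helper name fallen_gpus is hypothetical, and it reads log text on stdin.

```shell
# List the PCI addresses of GPUs the NVIDIA driver reported as lost.
# fallen_gpus is a hypothetical helper name; it reads kernel log text on stdin.
fallen_gpus() {
    grep 'NVRM: GPU .*fallen off the bus' \
        | sed -E 's/.*NVRM: GPU ([0-9a-fA-F:.]+):.*/\1/' \
        | sort -u   # one line per affected GPU, even if logged repeatedly
}

# Typical use on a live system (reading the kernel log may need root):
#   sudo dmesg | fallen_gpus
```

Matching the address printed here against the lspci listing confirms that both symptoms point at the same card.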
Resolution
Since everything pointed to a hardware-level problem rather than a software one, the immediate fix was straightforward: I rebooted the system. After rebooting, the issue was resolved.
Running nvidia-smi now works normally. Running
lspci | grep "RTX 3090"
gives:
1b:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
1c:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
1d:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
1e:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
3d:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
44:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
45:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
46:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
GPU PCI ID in the format XXXX:YY:ZZ.a, where XXXX = domain, YY = bus, ZZ = device, a = function.
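To make the address layout concrete, here is a small sketch that splits an ID such as 0000:1e:00.0 into domain, bus, device and function. The function name parse_pci_id is my own; note that plain lspci output like 1e:00.0 leaves the domain off, and lspci -D prints it.

```shell
# Split a full PCI address (domain:bus:device.function) into its fields.
# parse_pci_id is a hypothetical helper name; it expects the domain to be present.
parse_pci_id() {
    id=$1
    domain=${id%%:*}    # text before the first colon
    rest=${id#*:}       # bus:device.function
    bus=${rest%%:*}
    devfn=${rest#*:}    # device.function
    printf 'domain=%s bus=%s device=%s function=%s\n' \
        "$domain" "$bus" "${devfn%.*}" "${devfn#*.}"
}

parse_pci_id 0000:1e:00.0
# prints: domain=0000 bus=1e device=00 function=0
```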