Tracking My NVIDIA GPU Issues and Fixes
This post records various GPU-related issues I encountered, along with their diagnoses and resolutions.
Unable to Determine the Device Handle for GPU0000:1E:00.0: Unknown Error
Recently, while training deep neural network models, I encountered a GPU-related issue. Here's a concise log of the problem and how I resolved it:
Issue
Running `nvidia-smi` resulted in:

```
Unable to determine the device handle for GPU0000:1E:00.0: Unknown Error
```
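Since `nvidia-smi` returns a non-zero exit code when it cannot reach the driver or a GPU, job scripts can detect this state before launching training. A minimal sketch — the failing call is simulated with `false` here, because the real check needs a GPU host:

```shell
# nvidia-smi exits non-zero when it cannot reach the driver/GPU,
# so job scripts can bail out early instead of crashing mid-training.
# `false` stands in for the failing call; on a real host use the commented line.
gpu_check() {
    false   # real check: nvidia-smi > /dev/null 2>&1
}

if gpu_check; then
    status="ok"
else
    status="GPU check failed"
fi
echo "$status"
```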
Diagnosis
Check GPU status: Using

```sh
lspci | grep "RTX 3090"
```

(my GPUs on that server are all RTX 3090s), I got:

```
1b:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
1c:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
1d:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
1e:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev ff)
3d:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
44:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
45:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
46:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
```

GPUs marked as `rev ff` are unavailable, indicating an issue with GPU 0000:1e:00.0[^1].

Debug with the NVIDIA bug report script: I ran:
```sh
curl -s https://raw.githubusercontent.com/Pardus-Linux/Packages/master/hardware/graphics/nvidia-xconfig/files/nvidia-bug-report.sh | sudo bash
```

The output included:

```
[11638160.349810] NVRM: GPU 0000:1e:00.0: GPU has fallen off the bus.
```
From this NVIDIA forum thread, I learned that ML workloads can cause power spikes, leading to GPU instability. This pointed to my PSU as a possible culprit.
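The `rev ff` marker is a quick way to spot a GPU that has fallen off the bus without running the full bug-report script. A self-contained sketch of that check, run here against a sample of the `lspci` output above:

```shell
# Sample of the lspci output above; on a real host, pipe lspci directly:
#   lspci | grep "RTX 3090"
lspci_output='1d:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
1e:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev ff)'

# A GPU reporting revision ff has fallen off the bus and is unavailable.
echo "$lspci_output" | grep "rev ff"
```

This prints only the `1e:00.0` line, i.e. the unavailable GPU.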
Resolution
Since this was a hardware-level fault, the solution was straightforward: I rebooted the system, and the issue was resolved.
Running `nvidia-smi` now works normally. Running `lspci | grep "RTX 3090"` gives:

```
1b:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
1c:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
1d:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
1e:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
3d:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
44:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
45:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
46:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
```

All eight GPUs now report `rev a1`.
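The IDs in the `lspci` output are bus/device/function triples; the full form used in the error message (`0000:1e:00.0`) prefixes the PCI domain. A small POSIX-shell sketch splitting such an ID into its parts:

```shell
# Parse a PCI ID of the form domain:bus:device.function,
# e.g. the GPU from the error message above.
pci_id="0000:1e:00.0"

domain=${pci_id%%:*}    # everything before the first ':'  -> 0000
rest=${pci_id#*:}       # 1e:00.0
bus=${rest%%:*}         # before the next ':'              -> 1e
devfn=${rest#*:}        # 00.0
device=${devfn%%.*}     # before the '.'                   -> 00
func=${devfn#*.}        # after the '.'                    -> 0

echo "domain=$domain bus=$bus device=$device function=$func"
```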
Failed to initialize NVML: Unknown Error in Docker container
Due to an unfixed bug in the NVIDIA Container Toolkit, our Docker containers occasionally lose access to the GPUs. The fix is simple; my solution combines several answers from the Stack Overflow post:
1. Run `sudo vim /etc/nvidia-container-runtime/config.toml`, change the setting to `no-cgroups = false`, and save.

2. Set the parameter `"exec-opts": ["native.cgroupdriver=cgroupfs"]` in the `/etc/docker/daemon.json` file. The edited file should look like:

    ```json
    {
        "runtimes": {
            "nvidia": {
                "args": [],
                "path": "nvidia-container-runtime"
            }
        },
        "exec-opts": ["native.cgroupdriver=cgroupfs"]
    }
    ```

3. Restart the Docker daemon with `sudo systemctl restart docker`. You can then test by running `sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi`.
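The `no-cgroups` change can also be made non-interactively instead of via vim. A sketch over an assumed minimal `config.toml` fragment (the real file at `/etc/nvidia-container-runtime/config.toml` contains more sections and keys):

```shell
# Assumed minimal fragment of /etc/nvidia-container-runtime/config.toml;
# the real file contains more sections and keys.
config='[nvidia-container-cli]
no-cgroups = true'

# Flip no-cgroups to false (non-interactive equivalent of the vim edit).
fixed=$(printf '%s\n' "$config" | sed 's/^no-cgroups = true$/no-cgroups = false/')
printf '%s\n' "$fixed"
```

On a real host the same idea would be `sudo sed -i 's/^no-cgroups = true$/no-cgroups = false/' /etc/nvidia-container-runtime/config.toml`.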
[^1]: A GPU PCI ID has the format `XXXX:YY:ZZ.a`, where `XXXX` = domain, `YY` = bus, `ZZ` = device, and `a` = function.