Tracking My NVIDIA GPU Issues and Fixes

This post records various GPU-related issues I encountered, along with their diagnoses and resolutions.

Unable to Determine the Device Handle for GPU0000:1E:00.0: Unknown Error


Recently, while training deep neural network models, I encountered a GPU-related issue. Here's a concise log of the problem and how I resolved it:

Issue

Running nvidia-smi resulted in:

Unable to determine the device handle for GPU0000:1E:00.0: Unknown Error

Diagnosis

  1. Check GPU Status: Using

    lspci | grep "RTX 3090"

    (My GPUs on that server are all RTX 3090s)

    I got:

    1b:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
    1c:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
    1d:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
    1e:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev ff)
    3d:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
    44:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
    45:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
    46:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)

    GPUs marked as rev ff are unavailable, indicating an issue with GPU 0000:1e:00.0.[1]

  2. Debug with NVIDIA Bug Report Script: I ran:

    curl -s https://raw.githubusercontent.com/Pardus-Linux/Packages/master/hardware/graphics/nvidia-xconfig/files/nvidia-bug-report.sh | sudo bash

    The output included:

    [11638160.349810] NVRM: GPU 0000:1e:00.0: GPU has fallen off the bus.

    From this NVIDIA forum thread, I learned that ML workloads can cause power spikes, leading to GPU instability. This pointed to my PSU as a possible culprit. Both checks above are bundled into a small script right after this list.
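
For future incidents, here is a rough sketch of that script. It assumes the NVIDIA PCI vendor ID 10de and needs enough privileges to read the kernel log; adjust it to your setup.

    #!/usr/bin/env bash
    # List NVIDIA GPUs that report "rev ff" (dropped off the PCI bus)
    # and any matching kernel messages. Vendor ID 10de = NVIDIA.
    echo "GPUs reporting rev ff:"
    lspci -d 10de: | grep "(rev ff)" || echo "  none"

    echo "Kernel messages about GPUs falling off the bus:"
    sudo dmesg | grep -i "fallen off the bus" || echo "  none"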

Resolution

Knowing it was a hardware-level problem, the solution was straightforward: I rebooted the system, and the issue was resolved. A quick post-reboot sanity check is sketched after the listing below.

  • Running nvidia-smi now works normally.

  • Running lspci | grep "RTX 3090" gives:

    1b:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
    1c:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
    1d:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
    1e:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
    3d:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
    44:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
    45:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
    46:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
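
Rather than eyeballing the lspci output, a quick count also confirms that every card is back. This is a minimal sketch, assuming the server should expose 8 GPUs; adjust EXPECTED to your machine.

    #!/usr/bin/env bash
    # Post-reboot sanity check: every installed GPU should be enumerated.
    EXPECTED=8   # number of GPUs installed in this server (assumption)
    FOUND=$(nvidia-smi --list-gpus | wc -l)
    if [ "$FOUND" -eq "$EXPECTED" ]; then
        echo "OK: $FOUND/$EXPECTED GPUs visible"
    else
        echo "WARNING: only $FOUND/$EXPECTED GPUs visible"
    fi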

Failed to initialize NVML: Unknown Error in Docker container

Due to an unfixed bug in the NVIDIA Container Toolkit, our Docker containers occasionally lose access to the GPUs: nvidia-smi inside an affected container fails with the NVML error above, even though the GPUs are still healthy on the host.
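
A quick way to confirm a container has hit this state is to run nvidia-smi inside it while the host is fine. A minimal sketch; my_training_container is a placeholder name, not an actual container from my setup.

    #!/usr/bin/env bash
    # Check whether a running container has lost GPU access.
    # "my_training_container" is a placeholder container name.
    if sudo docker exec my_training_container nvidia-smi > /dev/null 2>&1; then
        echo "container still sees the GPUs"
    else
        echo "container lost GPU access (Failed to initialize NVML)"
    fi

Luckily, the fix is simple; my solution combines several answers from the Stack Overflow post: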

  1. Edit the config with sudo vim /etc/nvidia-container-runtime/config.toml, set no-cgroups = false, and save.

  2. Set the parameter "exec-opts": ["native.cgroupdriver=cgroupfs"] in /etc/docker/daemon.json. The edited file should look like:

    {
      "runtimes": {
        "nvidia": {
          "args": [],
          "path": "nvidia-container-runtime"
        }
      },
      "exec-opts": ["native.cgroupdriver=cgroupfs"]
    }
  3. Restart the Docker daemon with sudo systemctl restart docker, then test by running sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi. A combined verification sketch follows the list.
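
After the three steps, the following sketch double-checks each change and reruns the GPU test. It is only a sketch and assumes the default file locations used above.

    #!/usr/bin/env bash
    # Verify the fix: config flag, Docker cgroup driver, and a GPU test run.

    # Step 1 check: cgroups must not be disabled for the container runtime.
    grep -n "no-cgroups" /etc/nvidia-container-runtime/config.toml

    # Step 2 check: Docker should report the cgroupfs driver after the restart.
    sudo docker info --format '{{.CgroupDriver}}'

    # Step 3 check: a throwaway container should see all GPUs again.
    sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi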



  1. GPU PCI IDs are in the format XXXX:YY:ZZ.F, where XXXX = domain, YY = bus, ZZ = device, F = function.