Tracking My NVIDIA GPU Issues and Fixes

This post records various GPU-related issues I encountered, along with their diagnoses and resolutions.

Unable to Determine the Device Handle for GPU0000:1E:00.0: Unknown Error


Recently, while training deep neural network models, I encountered a GPU-related issue. Here's a concise log of the problem and how I resolved it:

Issue

Running nvidia-smi resulted in:

Unable to determine the device handle for GPU0000:1E:00.0: Unknown Error

Diagnosis

  1. Check GPU Status: Using

    lspci | grep "RTX 3090"

    (My GPUs on that server are all RTX 3090s)

    I got:

    1b:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
    1c:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
    1d:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
    1e:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev ff)
    3d:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
    44:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
    45:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
    46:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)

    GPUs marked as rev ff are unavailable, indicating an issue with GPU 0000:1e:00.0¹. (A model-agnostic version of this check is sketched after this list.)

  2. Debug with NVIDIA Bug Report Script: I ran:

    curl -s https://raw.githubusercontent.com/Pardus-Linux/Packages/master/hardware/graphics/nvidia-xconfig/files/nvidia-bug-report.sh | sudo bash

    The output included:

    [11638160.349810] NVRM: GPU 0000:1e:00.0: GPU has fallen off the bus.

    From this NVIDIA forum thread, I learned that ML workloads can cause power spikes, leading to GPU instability. This pointed to my PSU as a possible culprit. (A lighter-weight way to pull the relevant kernel messages is sketched after this list.)
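
For the check in step 1, filtering lspci by NVIDIA's PCI vendor ID (10de) instead of the marketing name makes the command independent of the GPU model. A minimal sketch of this variant (not the exact command I ran):

    # List NVIDIA devices by vendor ID and flag any stuck in the "rev ff" state.
    lspci -d 10de: | grep "(rev ff)" && echo "at least one NVIDIA device has fallen off the bus"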
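
For step 2, the bug-report script collects far more than is needed to spot this particular failure. When I only want to know whether a GPU dropped off the bus, grepping the kernel log for NVRM and Xid messages is a quicker first pass; a sketch, assuming dmesg needs sudo on the machine:

    # Search the kernel log for NVIDIA driver errors such as
    # "GPU has fallen off the bus" or Xid error codes.
    sudo dmesg | grep -iE "NVRM|Xid"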

Resolution

Since this looked like a hardware-level fault rather than a software one, the fix to try first was straightforward: reboot the system. After the reboot, the issue was resolved.

  • Running nvidia-smi now works normally.

  • Running lspci | grep "RTX 3090" gives:

    1b:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
    1c:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
    1d:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
    1e:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
    3d:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
    44:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
    45:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
    46:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
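
As an extra sanity check after the reboot, nvidia-smi's query mode can confirm that every card is enumerated and actively reporting power draw; a sketch using standard query fields:

    # Each of the eight GPUs should appear with its PCI bus ID and a power reading.
    nvidia-smi --query-gpu=index,name,pci.bus_id,power.draw --format=csv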

  1. GPU PCI IDs use the format XXXX:YY:ZZ.F, where XXXX = domain, YY = bus, ZZ = device, and F = function; in 0000:1e:00.0, the domain is 0000, the bus is 1e, the device is 00, and the function is 0.↩︎