Troubleshooting NVIDIA GPU driver issues

Source: Deep Learning on Medium

3. Are there nvidia-smi errors?

The nvidia-smi command can also show some errors, which can help us determine what the problem might be.

“NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver”

This error is usually due to a problem with the installation of the driver. Most commonly we see this error when people have installed the driver but not yet rebooted the system, although it is also seen if other packages have been installed after the NVIDIA driver and before the system has been rebooted.

If a reboot of the system doesn’t correctly pick up the driver, then I would recommend removing all old driver versions, and then reinstalling the latest driver and rebooting the system immediately afterwards. For details of how to do this, there are some good instructions for both RHEL and Ubuntu in the IBM Knowledge Center.
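Before reinstalling anything, it can be worth confirming whether the NVIDIA kernel module is loaded at all. The following is a minimal sketch, not part of the original instructions: the function name nvidia_module_loaded is hypothetical, and it assumes the module appears in the lsmod listing under the name "nvidia".

```shell
# Hypothetical helper: reads `lsmod` output on stdin and succeeds only if
# a module named exactly "nvidia" is listed (nvidia_uvm etc. do not count).
nvidia_module_loaded() {
    awk '$1 == "nvidia" { found = 1 } END { exit !found }'
}

# Usage on a real system (commented out so the sketch has no side effects):
#   if lsmod | nvidia_module_loaded; then
#       echo "nvidia kernel module is loaded"
#   else
#       echo "nvidia kernel module is NOT loaded; try a reboot or reinstall"
#   fi
```

If the module is not listed after a reboot, that points towards the reinstall described above rather than a transient problem.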

“Failed to initialize NVML: Driver/library version mismatch”

This is the most common error we see. It indicates that the NVIDIA driver kernel module has not been built against the same level of the Linux kernel as the one the system is running. The driver installation uses details from the kernel-devel package to build the kernel module for the system. If that package is at a different level from the kernel that is currently running, the module and the kernel will be incompatible.

To check if this is the problem, you can use the Dynamic Kernel Module Support framework (DKMS) to find the state of the NVIDIA driver kernel module:

sudo dkms status

This should list an NVIDIA module, and give the status of “installed”. If the status is listed as “added” or “built”, then you will need to reinstall the kernel module. First, you will need to remove the existing module, then install consistent levels of the kernel and kernel-devel packages. Once that’s done you can rebuild and install the kernel module.
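If you want to check the state from a script rather than by eye, something along these lines may work. It is a sketch under assumptions: the function name nvidia_dkms_state is hypothetical, and it assumes each line of dkms status output starts with the module name and ends in ": <state>", which can vary between DKMS versions.

```shell
# Hypothetical helper: reads `dkms status` output on stdin and prints the
# text after the last ": " on every line that begins with "nvidia".
# Assumed line shape: "nvidia, <driver>, <kernel>, <arch>: installed"
nvidia_dkms_state() {
    awk -F': ' '/^nvidia/ { print $NF }'
}

# Usage on a real system (commented out so the sketch has no side effects):
#   sudo dkms status | nvidia_dkms_state
```

Any output other than "installed" for your running kernel suggests the module needs to be rebuilt as described here.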

Update your kernel and kernel-devel packages to the same latest level:

sudo yum install kernel-devel
sudo yum update kernel kernel-devel
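To confirm that a kernel-devel package matching the running kernel is actually installed, a check like the one below might help. This is only a sketch: the function name has_matching_kernel_devel is hypothetical, and it assumes an RPM-based system where rpm -q kernel-devel prints names of the form kernel-devel-<kernel release> and uname -r prints that same release string.

```shell
# Hypothetical helper: reads installed kernel-devel package names on stdin
# and succeeds if one of them is exactly "kernel-devel-<release>", where
# <release> is the kernel release passed as the first argument.
has_matching_kernel_devel() {
    grep -qxF "kernel-devel-$1"
}

# Usage on a real system (commented out so the sketch has no side effects):
#   if rpm -q kernel-devel | has_matching_kernel_devel "$(uname -r)"; then
#       echo "kernel-devel matches the running kernel"
#   else
#       echo "no kernel-devel for the running kernel; a reboot may be needed"
#   fi
```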

If the kernel is updated to a newer version then you will need to reboot the system to pick up these changes. If not, then you can continue with the next steps. For these, the NVIDIA driver level is the driver release you are trying to install. The kernel level is the release of the Linux kernel running on your system. This can be found from the installed package name, or by running:

uname -a

In our case, the kernel version is currently 3.10.0-957.21.3.el7.ppc64le so you might need to look for a similar looking string.

sudo dkms remove nvidia/<nvidia driver level> --all
sudo dkms build nvidia/<nvidia driver level> -k <kernel level>
sudo dkms install nvidia/<nvidia driver level> -k <kernel level>
dkms status

The final dkms status command should now show you that the driver kernel module is installed for your current kernel version. You should now be able to run the nvidia-smi command and see all of your installed GPUs.
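As a last sanity check, you could count the GPUs that the driver reports and compare that with the number you expect. The sketch below is an illustration only: the function name count_gpus is hypothetical, and it assumes that nvidia-smi -L prints one line per device beginning with "GPU <index>:".

```shell
# Hypothetical helper: reads `nvidia-smi -L` output on stdin and prints the
# number of lines that look like "GPU <index>: ...".
count_gpus() {
    # grep -c exits non-zero when the count is 0, so mask the exit status
    grep -c '^GPU [0-9][0-9]*:' || true
}

# Usage on a real system (commented out so the sketch has no side effects):
#   nvidia-smi -L | count_gpus
```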