
[Solved] CUDA initialization: Unexpected error from cudaGetDeviceCount()

I was recently using a rented server when torch suddenly stopped working and reported the error "CUDA initialization: Unexpected error from cudaGetDeviceCount()".

 

After several setbacks, the cause of the problem was found to be:

The nvidia-fabricmanager package had been updated, for example by an automatic system update or by running apt-get update followed by apt-get upgrade. This package must match the driver version exactly. The A100-80G server is an NVSwitch system, and the GPUs can only be used normally once a matching nvidia-fabricmanager is installed to support NVSwitch.

Originally, both nvidia-fabricmanager and the driver were at version 510.47.03. The update changed the package version, so it no longer matched the driver and CUDA could not be initialized.
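
The mismatch described above can be checked from the command line. The following is a minimal sketch, not the author's method: the helper names are hypothetical, the package name nvidia-fabricmanager-510 is assumed from the 510 driver branch used in this post, and the installed package version is assumed to carry a Debian revision suffix such as "-1".

```shell
#!/bin/sh
# Sketch: compare the NVIDIA driver version with the installed
# nvidia-fabricmanager package version. Helper names are hypothetical.

# Strip a trailing Debian revision, e.g. 510.47.03-1 -> 510.47.03.
strip_revision() {
    printf '%s\n' "${1%-*}"
}

# Driver version as reported by nvidia-smi (first GPU only).
driver_version() {
    nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1
}

# Installed package version without the revision suffix.
# The default package name is an assumption based on the 510 branch.
fabricmanager_version() {
    strip_revision "$(dpkg-query -W -f='${Version}' "${1:-nvidia-fabricmanager-510}")"
}

# Succeeds only when both versions are non-empty and identical.
versions_match() {
    [ -n "$1" ] && [ "$1" = "$2" ]
}

# Example:
#   versions_match "$(driver_version)" "$(fabricmanager_version)" \
#       || echo "driver and nvidia-fabricmanager versions differ"
```

If the two versions differ, reinstalling the matching package as described in the solution is the likely fix.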

 

Solution:

Reinstall the version of nvidia-fabricmanager that matches the driver (510.47.03 in my case), and then disable automatic updates. To make sure the nvidia-fabricmanager service is always running, also set it to start automatically at boot.

The installation and setup operations are as follows:

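# Download the package version that matches the driver (510.47.03 here)
wget https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu1804/x86_64/nvidia-fabricmanager-510_510.47.03-1_amd64.deb
# Remove the mismatched package, then install the downloaded one (run as root)
apt-get remove nvidia-fabricmanager-510
apt-get install ./nvidia-fabricmanager-510_510.47.03-1_amd64.deb
# Start the service at boot, restart it now, and check its state
systemctl enable nvidia-fabricmanager
systemctl restart nvidia-fabricmanager
systemctl status nvidia-fabricmanager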

 

The result is as follows:  
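
To confirm the same result from a script rather than by reading the status output, something like the following sketch could be used. It assumes the unit is named nvidia-fabricmanager, as in the commands above, and that python3 has torch installed; the function names are hypothetical.

```shell
#!/bin/sh
# Sketch: scripted check after reinstalling nvidia-fabricmanager.

# Succeeds when systemd reports the unit as "active".
service_active() {
    [ "$(systemctl is-active "${1:-nvidia-fabricmanager}" 2>/dev/null)" = "active" ]
}

# Prints the number of GPUs torch can see.
# Assumes python3 with torch installed is on the PATH.
torch_gpu_count() {
    python3 -c 'import torch; print(torch.cuda.device_count())'
}

# Example:
#   service_active && torch_gpu_count
```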

 

Lesson learned: on an A100-80G server, do not run apt-get upgrade or turn on Ubuntu automatic updates unless it is really necessary.
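
One way to act on this lesson without giving up updates entirely is to put the package on hold with apt-mark, so that apt-get upgrade skips it. This is a sketch only: the function names are hypothetical, and the "nvidia-fabricmanager-<branch>" naming is an assumption based on the package used in this post.

```shell
#!/bin/sh
# Sketch: pin nvidia-fabricmanager so routine upgrades do not replace it.

# Derive the package name from a driver version,
# e.g. 510.47.03 -> nvidia-fabricmanager-510 (naming pattern is an assumption).
fabricmanager_package() {
    printf 'nvidia-fabricmanager-%s\n' "${1%%.*}"
}

# Mark the package as held (run as root).
hold_fabricmanager() {
    apt-mark hold "$(fabricmanager_package "$1")"
}

# Example:
#   hold_fabricmanager 510.47.03
#   apt-mark showhold    # lists held packages
```

The hold can be lifted later with apt-mark unhold when the driver and nvidia-fabricmanager are upgraded together.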