While I was using a rented server recently, PyTorch suddenly stopped working and reported "CUDA initialization: Unexpected error from cudaGetDeviceCount()", as shown in the following figure.
After several failed attempts, I traced the problem to the following cause:
The nvidia-fabricmanager package had been upgraded, for example by an automatic system update or by running apt-get update followed by apt-get upgrade. This package only works when its version matches the driver version. The A100-80G in this server is the NVSwitch variant, and nvidia-fabricmanager must be installed to support NVSwitch before the GPUs can be used normally.
My nvidia-fabricmanager and driver were both originally at version 510.47.03. The package was then upgraded, so the two versions no longer matched and CUDA could not be initialized.
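One way to confirm this kind of mismatch is to compare the two versions directly. The sketch below is only an illustration: the function names are hypothetical, and it assumes an apt-based system where the package version looks like 510.47.03-1 (driver version plus a packaging revision) and where nvidia-smi can still report the driver version.

```shell
#!/usr/bin/env bash
# Sketch: compare the NVIDIA driver version with the installed Fabric Manager package.

# Drop the Debian revision suffix, e.g. "510.47.03-1" -> "510.47.03".
strip_pkg_revision() {
    printf '%s\n' "${1%%-*}"
}

# Exit status 0 only if the driver version is non-empty and equals the package version.
versions_match() {
    [ -n "$1" ] && [ "$1" = "$(strip_pkg_revision "$2")" ]
}

# Hypothetical helper: query both versions on the live system and report the result.
check_fabricmanager() {
    driver_ver="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
    # Assumption: the first non-empty version among matching packages is the one in use.
    pkg_ver="$(dpkg-query -W -f='${Version}\n' 'nvidia-fabricmanager-*' 2>/dev/null | grep . | head -n 1)"
    echo "driver: ${driver_ver:-unknown}, fabricmanager: ${pkg_ver:-not installed}"
    if versions_match "$driver_ver" "$pkg_ver"; then
        echo "versions match"
    else
        echo "MISMATCH: install the fabricmanager build that matches the driver"
        return 1
    fi
}
```

After sourcing the script, running check_fabricmanager on the affected server should print both versions and return a non-zero status when they differ.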
Solution:
Reinstall the version of nvidia-fabricmanager that matches the driver (in my case, 510.47.03), and then disable automatic updates. In addition, to make sure the nvidia-fabricmanager service is always running, you can set it to start automatically at boot.
The installation and setup operations are as follows:
wget https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu1804/x86_64/nvidia-fabricmanager-510_510.47.03-1_amd64.deb
apt-get remove nvidia-fabricmanager-510
apt-get install ./nvidia-fabricmanager-510_510.47.03-1_amd64.deb
systemctl enable nvidia-fabricmanager
systemctl restart nvidia-fabricmanager
systemctl status nvidia-fabricmanager
The result is as follows:
Lesson learned: on an A100-80G server, avoid running apt-get upgrades casually, and do not turn on automatic Ubuntu system updates.
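If upgrades cannot be avoided entirely, one option is to put the package on hold so that apt leaves it alone. This is a minimal sketch rather than the method used above: apt-mark hold and apt-mark showhold are standard apt commands, the package name is taken from the installation commands, and the two function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: pin the Fabric Manager package so upgrades do not replace it.

PKG="nvidia-fabricmanager-510"   # package name taken from the commands above

# Succeeds if package "$1" appears as a whole line in the hold list read from stdin.
is_held() {
    grep -qxF -- "$1"
}

# Hypothetical helper: place the hold (needs root) and confirm that it took effect.
hold_fabricmanager() {
    apt-mark hold "$PKG" || return 1
    if apt-mark showhold | is_held "$PKG"; then
        echo "$PKG is held"
    else
        echo "$PKG is NOT held" >&2
        return 1
    fi
}
```

Run hold_fabricmanager as root after reinstalling the matching version. The hold can be released later with apt-mark unhold when the driver and the package are upgraded together.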