pg = ProcessGroupNCCL(prefix_store, rank, world_size, pg_options)
RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
At first, this mistake made me wonder if this GPU was useless, – – |, But the little partners in the lab are sure that GPU is OK! Then I started the bug troubleshooting journey
At this time, when viewing the command line, it finally shows its feet. It is estimated that there is a problem with pytorch, which is harmful!
>>> import torch
>>> print(torch.cuda.is_available())
/home/xutianjiao/anaconda3/envs/py36/lib/python3.6/site-packages/torch/cuda/__init__.py:80: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 9020). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:112.)
return torch._C._cuda_getDeviceCount() > 0
False
>>> print(torch.cuda.get_device_name(0))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/xutianjiao/anaconda3/envs/py36/lib/python3.6/site-packages/torch/cuda/__init__.py", line 326, in get_device_name
return get_device_properties(device).name
File "/home/xutianjiao/anaconda3/envs/py36/lib/python3.6/site-packages/torch/cuda/__init__.py", line 356, in get_device_properties
_lazy_init() # will define _get_device_properties
File "/home/xutianjiao/anaconda3/envs/py36/lib/python3.6/site-packages/torch/cuda/__init__.py", line 214, in _lazy_init
torch._C._cuda_init()
RuntimeError: The NVIDIA driver on your system is too old (found version 9020). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver.
After checking this error, it shows that the versions of CUDA and torch do not match.
Check the version of pytorch, 1.10 +. OK, try installing a lower version of torch!
pip install torch==1.7.0
be accomplished!