This problem took me a day to solve.
The training code ran fine on one machine, but after switching to a different machine it reported an error.
At first I suspected CUDA 11: I was worried that the CUDA version did not match the PyTorch version, so I reinstalled it, but that did not solve the problem.
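A quick way to rule out a version mismatch is to print what the installed PyTorch build reports about itself. The following is only a sketch: the helper name `environment_report` is made up for illustration, and on a CPU-only build the CUDA and cuDNN fields are simply `None`.

```python
import torch


def environment_report():
    """Collect version info for the installed PyTorch build (illustrative helper)."""
    has_gpu = torch.cuda.is_available()
    return {
        "torch": torch.__version__,               # PyTorch version string
        "cuda_build": torch.version.cuda,         # CUDA version PyTorch was built with, or None
        "cudnn": torch.backends.cudnn.version(),  # cuDNN version as an int, or None
        "cuda_available": has_gpu,
        "gpu": torch.cuda.get_device_name(0) if has_gpu else None,
    }


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

Comparing this output between the machine that works and the one that fails shows whether the two environments actually differ.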
Symptom:
Traceback (most recent call last):
File "train.py", line 100, in <module>
main(opt)
File "train.py", line 71, in main
……
File "/home/xxxx/.local/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 395, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.
import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([1, 64, 80, 144], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(64, 64, kernel_size=[3, 3], padding=[1, 1], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().float()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()
ConvolutionParams
data_type = CUDNN_DATA_FLOAT
padding = [1, 1, 0]
stride = [1, 1, 0]
dilation = [1, 1, 0]
groups = 1
deterministic = false
allow_tf32 = true
input: TensorDescriptor 0xaa030590
type = CUDNN_DATA_FLOAT
nbDims = 4
dimA = 1, 64, 80, 144,
strideA = 737280, 11520, 144, 1,
output: TensorDescriptor 0xaa0d6560
type = CUDNN_DATA_FLOAT
nbDims = 4
dimA = 1, 64, 80, 144,
strideA = 737280, 11520, 144, 1,
weight: FilterDescriptor 0xaa0d0360
type = CUDNN_DATA_FLOAT
tensor_format = CUDNN_TENSOR_NCHW
nbDims = 4
dimA = 64, 64, 3, 3,
Pointer addresses:
input: 0x567e50000
output: 0x568120000
weight: 0x550a2da00
Solution:
Save the repro snippet suggested in the error message to a file:
import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([1, 64, 80, 144], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(64, 64, kernel_size=[3, 3], padding=[1, 1], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().float()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()
Running this file with Python reproduces the same error. Then change one of the switches at the top, run it again, and see whether the error still occurs.
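The trial-and-error over the switches can also be automated. Below is a minimal sketch, assuming the same convolution shapes as the repro snippet; the function names `run_repro` and `sweep` are hypothetical, and the code falls back to the CPU when no GPU is present so that it still runs (in which case nothing should fail).

```python
import itertools

import torch


def run_repro(benchmark, allow_tf32, deterministic):
    """Run one conv forward/backward pass under the given switches.

    Returns None on success, or the error message if a RuntimeError is raised.
    """
    torch.backends.cudnn.benchmark = benchmark
    torch.backends.cudnn.allow_tf32 = allow_tf32
    torch.backends.cuda.matmul.allow_tf32 = allow_tf32
    torch.backends.cudnn.deterministic = deterministic

    device = "cuda" if torch.cuda.is_available() else "cpu"
    try:
        data = torch.randn(1, 64, 80, 144, device=device, requires_grad=True)
        net = torch.nn.Conv2d(64, 64, kernel_size=3, padding=1).to(device)
        out = net(data)
        out.backward(torch.randn_like(out))
        if device == "cuda":
            torch.cuda.synchronize()  # surface asynchronous CUDA errors here
        return None
    except RuntimeError as exc:
        return str(exc)


def sweep():
    """Try every combination of (benchmark, allow_tf32, deterministic)."""
    results = {}
    for flags in itertools.product([True, False], repeat=3):
        results[flags] = run_repro(*flags)
    return results


if __name__ == "__main__":
    for flags, error in sweep().items():
        print(flags, "OK" if error is None else error.splitlines()[0])
```

One caveat: a CUDA error can leave the process in a bad state, so later combinations in the same run may fail for unrelated reasons. If the results look suspicious, it is safer to rerun a single combination in a fresh Python process.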
For my code, changing the following switch fixed the error:
torch.backends.cudnn.benchmark = False
Then put this line in your own training script, before the code that triggers the error.
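As a rough illustration of where the line goes, the sketch below sets the switch at the top of the script, before any model is built. The `main` function here is only a stand-in for the real `train.py`, whose contents are not shown in this post; it reuses the convolution from the repro snippet and falls back to the CPU if no GPU is available.

```python
import torch

# Disable the cuDNN autotuner before any convolution is created or run.
torch.backends.cudnn.benchmark = False


def main():
    """Stand-in for the real training entry point (hypothetical)."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    net = torch.nn.Conv2d(64, 64, kernel_size=3, padding=1).to(device)
    data = torch.randn(1, 64, 80, 144, device=device, requires_grad=True)
    out = net(data)
    out.backward(torch.randn_like(out))
    if device == "cuda":
        torch.cuda.synchronize()
    return out.shape


if __name__ == "__main__":
    print(main())
```

With `benchmark = False`, cuDNN no longer benchmarks several convolution algorithms to pick the fastest one, so training may be somewhat slower in exchange for avoiding the failing code path.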