This problem took me a whole day to solve….
The training code had been running fine; as soon as I switched to a different machine, it started throwing an error.
At first I suspected CUDA 11 and worried that the CUDA version did not match the PyTorch version, so I reinstalled it, but that did not solve the problem.
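If you also suspect a version mismatch, the quick check below prints the CUDA and cuDNN versions that PyTorch was built against (a minimal sketch; run it in the same environment that runs train.py):

import torch

# Report the versions PyTorch was built with and what it sees at runtime
print("torch:", torch.__version__)
print("built with CUDA:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))

If these values look consistent with the driver reported by nvidia-smi, a version mismatch is unlikely to be the cause; in my case reinstalling did not help, which pointed elsewhere.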
Problem description:
Traceback (most recent call last):
File "train.py", line 100, in <module>
main(opt)
File "train.py", line 71, in main
……
File "/home/xxxx/.local/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 395, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.
import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([1, 64, 80, 144], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(64, 64, kernel_size=[3, 3], padding=[1, 1], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().float()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()
ConvolutionParams
data_type = CUDNN_DATA_FLOAT
padding = [1, 1, 0]
stride = [1, 1, 0]
dilation = [1, 1, 0]
groups = 1
deterministic = false
allow_tf32 = true
input: TensorDescriptor 0xaa030590
type = CUDNN_DATA_FLOAT
nbDims = 4
dimA = 1, 64, 80, 144,
strideA = 737280, 11520, 144, 1,
output: TensorDescriptor 0xaa0d6560
type = CUDNN_DATA_FLOAT
nbDims = 4
dimA = 1, 64, 80, 144,
strideA = 737280, 11520, 144, 1,
weight: FilterDescriptor 0xaa0d0360
type = CUDNN_DATA_FLOAT
tensor_format = CUDNN_TENSOR_NCHW
nbDims = 4
dimA = 64, 64, 3, 3,
Pointer addresses:
input: 0x567e50000
output: 0x568120000
weight: 0x550a2da00
Solution:
Save the repro snippet suggested in the CUDA error message (shown above) into a standalone Python file:
import torch
# The four backend switches below are the ones to toggle one at a time:
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
# Same conv setup and tensor shapes as in the error report above
data = torch.randn([1, 64, 80, 144], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(64, 64, kernel_size=[3, 3], padding=[1, 1], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().float()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()
Run this file with Python; it should reproduce the same error. Then toggle the backend switches at the top of the file one at a time and rerun to see whether the error still occurs.
For my code, the following change worked:
torch.backends.cudnn.benchmark = False
Then put this line before the code that triggers the error.
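For example, in my train.py the fix sits right after the imports, before the model is built or any convolution runs. Below is a minimal sketch; build_model, train_loop and parse_opt are hypothetical placeholders for your own code:

import torch

# Disable cuDNN's autotuner before any CUDA convolution executes;
# this is the switch that made CUDNN_STATUS_INTERNAL_ERROR go away for me
torch.backends.cudnn.benchmark = False

def main(opt):
    model = build_model(opt).cuda()   # hypothetical: your model construction
    train_loop(model, opt)            # hypothetical: your training loop

if __name__ == '__main__':
    main(parse_opt())                 # hypothetical: your argument parsing

With benchmark set to False, cuDNN picks convolution algorithms heuristically instead of timing candidates on the first call, so training may be marginally slower, but on this machine it no longer hits the internal error.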