[Solved] RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR

It took me a day to solve this problem.

The training code ran fine on one machine, but after moving to a different machine it raised an error.

At first I suspected CUDA 11: I was worried that the CUDA version did not match the PyTorch version. I reinstalled it, but that didn't solve the problem.
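
Before reinstalling anything, it can help to print the versions that PyTorch itself reports on the new machine. This is only a minimal sketch, assuming a standard PyTorch install; the `versions` dictionary is an illustrative name, not something from the original script.

```python
import torch

# Collect the versions this PyTorch build reports.
versions = {
    "torch": torch.__version__,
    "cuda": torch.version.cuda,               # None on CPU-only builds
    "cudnn": torch.backends.cudnn.version(),  # None if cuDNN is unavailable
    "gpu_available": torch.cuda.is_available(),
}
if versions["gpu_available"]:
    versions["gpu_name"] = torch.cuda.get_device_name(0)

for key, value in versions.items():
    print(f"{key}: {value}")
```

Comparing this output between the old and the new machine shows quickly whether the two environments really differ.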

Symptom:

Traceback (most recent call last):
  File "train.py", line 100, in <module>
    main(opt)
  File "train.py", line 71, in main

……

  File "/home/xxxx/.local/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 395, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.

import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([1, 64, 80, 144], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(64, 64, kernel_size=[3, 3], padding=[1, 1], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().float()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

ConvolutionParams
    data_type = CUDNN_DATA_FLOAT
    padding = [1, 1, 0]
    stride = [1, 1, 0]
    dilation = [1, 1, 0]
    groups = 1
    deterministic = false
    allow_tf32 = true
input: TensorDescriptor 0xaa030590
    type = CUDNN_DATA_FLOAT
    nbDims = 4
    dimA = 1, 64, 80, 144,
    strideA = 737280, 11520, 144, 1,
output: TensorDescriptor 0xaa0d6560
    type = CUDNN_DATA_FLOAT
    nbDims = 4
    dimA = 1, 64, 80, 144,
    strideA = 737280, 11520, 144, 1,
weight: FilterDescriptor 0xaa0d0360
    type = CUDNN_DATA_FLOAT
    tensor_format = CUDNN_TENSOR_NCHW
    nbDims = 4
    dimA = 64, 64, 3, 3,
Pointer addresses:
    input: 0x567e50000
    output: 0x568120000
    weight: 0x550a2da00

Solution:

Save the repro snippet printed in the error message to a file:

import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([1, 64, 80, 144], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(64, 64, kernel_size=[3, 3], padding=[1, 1], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().float()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

Running it with Python should report the same error. Then change one of the switches at the top of the snippet, run it again, and check whether the error still occurs.
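
The trial-and-error step can also be scripted. The following is a rough sketch rather than a definitive tool: it loops over two of the switches from the snippet (`benchmark` and `allow_tf32`) and records which combinations fail. The helper names `run_repro` and `try_switches` are made up for illustration, and the CPU fallback is only there so the script still runs on a machine without a GPU, where the cuDNN switches have no effect. If a failure leaves the CUDA context in a bad state, later combinations may fail too, so running each combination in a fresh process is likely more reliable.

```python
import itertools
import torch


def run_repro(device):
    """Run the convolution forward/backward pass from the error message once."""
    data = torch.randn([1, 64, 80, 144], dtype=torch.float,
                       device=device, requires_grad=True)
    net = torch.nn.Conv2d(64, 64, kernel_size=[3, 3], padding=[1, 1],
                          stride=[1, 1], dilation=[1, 1], groups=1)
    net = net.to(device).float()
    out = net(data)
    out.backward(torch.randn_like(out))
    if device == "cuda":
        # Wait for the GPU so that asynchronous errors surface here.
        torch.cuda.synchronize()


def try_switches(device):
    """Try every benchmark/allow_tf32 combination and record the outcome."""
    results = {}
    for benchmark, allow_tf32 in itertools.product([True, False], repeat=2):
        torch.backends.cudnn.benchmark = benchmark
        torch.backends.cudnn.allow_tf32 = allow_tf32
        try:
            run_repro(device)
            results[(benchmark, allow_tf32)] = "ok"
        except RuntimeError as err:
            results[(benchmark, allow_tf32)] = f"failed: {err}"
    return results


device = "cuda" if torch.cuda.is_available() else "cpu"
results = try_switches(device)
for (benchmark, allow_tf32), status in results.items():
    print(f"benchmark={benchmark}, allow_tf32={allow_tf32}: {status}")
```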

For my code, changing the following setting worked:

torch.backends.cudnn.benchmark = False

Then put this line before the code that triggers the error.
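
As an illustration of the placement, here is a minimal sketch of a training entry point with the switch set before any convolution runs. With `benchmark = True`, cuDNN tries several convolution algorithms and picks the fastest, so turning it off skips that autotuning step. The `main` function and the single `Conv2d` layer are hypothetical stand-ins, not the original `train.py`.

```python
import torch

# Disable the cuDNN autotuner before any model code runs.
torch.backends.cudnn.benchmark = False


def main():
    # Hypothetical stand-in for the real training code.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    net = torch.nn.Conv2d(64, 64, kernel_size=3, padding=1).to(device)
    data = torch.randn(1, 64, 80, 144, device=device)
    out = net(data)
    return tuple(out.shape)


out_shape = main()
print(out_shape)
```

Setting the flag at the top of the script, right after the imports, makes sure it is in effect for every layer that is created later.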
