PyTorch CUDA error: an illegal memory access was encountered

While debugging some Python code I ran into this error. A closely related one is `CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling cublasSgemm(...)`.

Searching online turns up all kinds of answers: update the driver, pin the CUDA device number, and so on. Although some people report success with these fixes, they feel unreliable.

As the message says, this is a memory access error.

Solution:

Check the code carefully and make sure all tensors live on the same device — everything on the CPU, or everything on the GPU.
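As a minimal sketch of what "unify on one device" means (the model and tensor names here are illustrative, not from the original code):

```python
import torch

# Pick one device up front and move *everything* there: model, inputs, targets.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(10, 4).to(device)  # toy model, for illustration
x = torch.randn(2, 10).to(device)          # input must be on the same device
y = model(x)                               # safe: no CPU/GPU mix

# All three now agree on the device.
assert x.device == y.device == next(model.parameters()).device
```

If `x` stayed on the CPU while `model` sat on the GPU, the forward pass would fail with a device-mismatch or memory-access error instead.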

Checking this by hand is tedious, so to make inspection easier I wrote a small helper function.

def printTensor(t, tag: str):
    """Print a tensor's size, device, and a small sample of its data."""
    p = t
    # Drill down to the innermost dimension.
    for _ in range(t.dim() - 1):
        p = p[0]
    # Show at most the first three elements.
    if len(p) > 3:
        p = p[:3]
    print('\t%s.size' % tag, t.size(), ' dev :', t.device, ": ", p.data)

Calling it as `printTensor(context, 'context')` produces output like:

context.size torch.Size([4, 10, 10])  dev : cuda:0 :  tensor([0, 0, 0], device='cuda:0')

This function does two key things:

    it prints the device, and it prints the data.

    The second point is particularly important. Printing only the device does not necessarily trigger the error; only when you read the data — forcing PyTorch to actually run the queued operations — does the real error surface.
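The reason reading the data matters is that CUDA kernels launch asynchronously, so the Python line that reports the error is often not the line that caused it; accessing tensor contents forces a synchronization. A common complementary trick (not from the original post) is to make all launches synchronous while debugging, via the `CUDA_LAUNCH_BLOCKING` environment variable:

```python
import os

# Must be set before any CUDA work happens. With blocking launches, the
# traceback points at the kernel that actually failed. Debug only: it is slow.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # import (and all CUDA calls) come after setting the variable
# ... run the failing code here; errors are now reported at the faulting line
```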

    Finally, I found that a network built from `nn.*` layers had never been moved with an explicit `.to(device)` call. Custom models that inherit from `nn.Module` need the same check, which is worth remembering for the future.
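Since `nn.Module.to(device)` recursively moves every registered parameter and buffer, one explicit call on the top-level model is enough — provided all submodules are registered as attributes. A hedged sketch (the class and names are invented for illustration):

```python
import torch
import torch.nn as nn

class MyNet(nn.Module):  # hypothetical custom model
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(8, 2)  # registered submodule: moved by .to()

    def forward(self, x):
        return self.fc(x)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
net = MyNet().to(device)  # one call moves all registered parameters
x = torch.randn(1, 8, device=device)
out = net(x)
assert out.device.type == device.type
```

A tensor created inside `forward` with a bare `torch.zeros(...)`, by contrast, is *not* registered and stays on the CPU — a frequent source of exactly this class of error.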
