While debugging some Python code I ran into this error:

CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling cublasSgemm(...)

Searching the web turns up all kinds of answers: driver versions, pinning the CUDA device number, and so on. They may have worked for someone, but they feel unreliable. The message itself looks like a memory access error.
Solution: go through the code carefully and make sure all data lives consistently on the CPU or on the GPU.

Checking this by hand is tedious, so to make it easier I wrote a small helper function.
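As a minimal sketch of the "unify the device" rule (the model and tensor names here are placeholders, not from the original code):

```python
import torch
import torch.nn as nn

# Pick one device up front and route everything through it.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = nn.Linear(10, 10).to(device)  # parameters moved to the chosen device
x = torch.randn(4, 10).to(device)     # input moved to the same device

# Safe: both operands live on `device`, so the matmul inside Linear
# (which goes through cublasSgemm on GPU) sees consistent memory.
y = model(x)
print(y.device)
```

If `model` stayed on the CPU while `x` was on the GPU (or vice versa), the forward pass would fail instead.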
```python
def printTensor(t, tag: str):
    sz = t.size()
    p = t
    # Drill down to the first 1-D slice so the printout stays short.
    for i in range(len(sz) - 1):
        p = p[0]
    if len(p) > 3:
        p = p[:3]
    print('\t%s.size' % tag, t.size(), ' dev :', t.device, ": ", p.data)
```
Call it like `printTensor(context, 'context')`, and the output looks like:

```
	context.size torch.Size([4, 10, 10])  dev : cuda:0 :  tensor([0, 0, 0], device='cuda:0')
```
This function does two key things:

- it prints the tensor's device, and it prints (a slice of) its data.

The second point is particularly important. Printing only the device does not necessarily trigger the error; only when you actually read the data, forcing PyTorch to execute the queued operations, does the real error surface.
In the end it turned out that one of the nn.* layers in the network had never been moved with .to(device) explicitly. The custom models all inherit from nn.Module, so this is something to keep checking for in the future.
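One quick way to audit for this kind of bug is to scan a model's registered parameters and buffers for off-device tensors. The helper below is a sketch (names are mine, not from the post); it also shows a classic way a layer escapes `.to(device)`: storing it in a plain Python list instead of `nn.ModuleList`, so nn.Module never registers it at all.

```python
import torch
import torch.nn as nn

def check_devices(model: nn.Module, expected: torch.device):
    """Return (name, device) pairs for registered tensors not on `expected`."""
    mismatched = []
    for name, p in model.named_parameters():
        if p.device != expected:
            mismatched.append((name, p.device))
    for name, b in model.named_buffers():
        if b.device != expected:
            mismatched.append((name, b.device))
    return mismatched

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.good = nn.Linear(4, 4)   # registered submodule; .to() moves it
        self.bad = [nn.Linear(4, 4)]  # plain list: invisible to .to() and parameters()

    def forward(self, x):
        return self.bad[0](self.good(x))

net = Net()
# Only the registered layer shows up; the hidden one would silently stay
# on the CPU after net.to('cuda') and later blow up inside cublasSgemm.
print([n for n, _ in net.named_parameters()])
```

Note that `check_devices` can only see registered tensors, which is exactly why unregistered submodules are so easy to miss.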