# How to Solve Error: RuntimeError CUDA out of memory

Error: RuntimeError: CUDA out of memory.
Error Messages:

Traceback (most recent call last):
File "xxx.py", line 110, in <module>
loss.backward()
File "/nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/tensor.py", line 185, in backward
File "/nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/autograd/__init__.py", line 127, in backward
allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 132.00 MiB (GPU 0; 15.78 GiB total capacity; 13.69 GiB already allocated; 91.50 MiB free; 14.53 GiB reserved in total by PyTorch)
Exception raised from malloc at /pytorch/c10/cuda/CUDACachingAllocator.cpp:272 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x14c9ce19a1e2 in /nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x1e64b (0x14c9ce3f064b in /nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0x1f464 (0x14c9ce3f1464 in /nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1faa1 (0x14c9ce3f1aa1 in /nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #4: at::native::empty_cuda(c10::ArrayRef<long>, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) + 0x11e (0x14c9d10fc90e in /nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0xf33949 (0x14c9cf536949 in /nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xf4d777 (0x14c9cf550777 in /nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x10e9c7d (0x14ca0a2ecc7d in /nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x10e9f97 (0x14ca0a2ecf97 in /nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #9: at::empty(c10::ArrayRef<long>, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) + 0xfa (0x14ca0a3f7a1a in /nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #10: at::native::mm_cuda(at::Tensor const&, at::Tensor const&) + 0x6c (0x14c9d05ebffc in /nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #11: <unknown function> + 0xf22a20 (0x14c9cf525a20 in /nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #12: <unknown function> + 0xa56530 (0x14ca09c59530 in /nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #13: at::Tensor c10::Dispatcher::call<at::Tensor, at::Tensor const&, at::Tensor const&>(c10::TypedOperatorHandle<at::Tensor (at::Tensor const&, at::Tensor const&)> const&, at::Tensor const&, at::Tensor const&) const + 0xbc (0x14ca0a44181c in /nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #14: at::mm(at::Tensor const&, at::Tensor const&) + 0x4b (0x14ca0a3926ab in /nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #15: <unknown function> + 0x2ed0a2f (0x14ca0c0d3a2f in /nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #16: <unknown function> + 0xa56530 (0x14ca09c59530 in /nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #17: at::Tensor c10::Dispatcher::call<at::Tensor, at::Tensor const&, at::Tensor const&>(c10::TypedOperatorHandle<at::Tensor (at::Tensor const&, at::Tensor const&)> const&, at::Tensor const&, at::Tensor const&) const + 0xbc (0x14ca0a44181c in /nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #18: at::Tensor::mm(at::Tensor const&) const + 0x4b (0x14ca0a527cab in /nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #19: <unknown function> + 0x2d11c34 (0x14ca0bf14c34 in /nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #20: torch::autograd::generated::MmBackward::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0x294 (0x14ca0bf30814 in /nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #21: <unknown function> + 0x3375bb7 (0x14ca0c578bb7 in /nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #26: <unknown function> + 0xbd6df (0x14ca5616b6df in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #27: <unknown function> + 0x76db (0x14ca5a6356db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #28: clone + 0x3f (0x14ca5a35ea3f in /lib/x86_64-linux-gnu/libc.so.6)


Solution:
CUDA_LAUNCH_BLOCKING=1 python xx.py failed
Reduce the batch size to solve the problem.

# [Solved] Pytorch Download CIFAR1 Datas Error: urllib.error.URLError: ＜urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certi

urllib.error.URLError: < urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certi

Solution:

Add the following two lines of code before the code starts:

import ssl
ssl._create_default_https_context = ssl._create_unverified_context


Complete example:

import torch
import torchvision
import torchvision.transforms as transforms
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
#Download the data set and adjust the image, because the output of the torchvision data set is in PILImage format, and the data field is in [0,1]
#We convert it into the tensor format of the standard data field [-1,1]
#transform Data Converter
transform=transforms.Compose([transforms.ToTensor(),transforms.Normalize((0.5,0.5,0.5),(0.5,0.5,0.5))])
# batch_size=4 batch processing

classes=('airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck')


The errors reported by pytoch are as follows:

Pytorch distributed RuntimeError: Address already in use

reason:

The port is occupied during model multi card training. Just change the port.

Solution:

Add a parameter — master before running the command_ For example:

 --master_port 29501

The following parameter 29501 can be set to any other port

be careful:

This parameter should be loaded in front of xxx.py, for example:

CUDA_VISIBLE_DEVICES=2,7 python3 -m torch.distributed.run /
--nproc_per_node 2  --master_port 29501  train.py 

# Python learning notes (5) — cross entropy error runtimeerror: 1D target tensor expected, multi target not supported

When I use cross entropy as the loss function, an error occurs:

RuntimeError: 1D target tensor expected, multi-target not supported


I checked the relevant information, and the statements in it are basically:

the dimension of the input labels should be 1, and the precision cannot be double. It must be replaced by long; Dimensionality reduction of the input label

But it can’t solve my problem, because my tag data has been processed with the following code after processing:

torch.LongTensor(labels)


And I also printed the dimension of my label data:

torch.Size([16, 11])


Here 16 refers to  batch_ Size , so it’s not a dimension problem.

But I was inspired when I read this blog (runtimeerror: multi target not supported at). It says:

When calculating the cross entropy loss function in pytorch, the correct label input cannot be in one hot format. The function will process itself into one hot format. Therefore, you do not need to enter [0 1], just enter 4.

My tag data is a multi tag problem, as follows:

tensor([0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0])


Then, when passing through  loss ,  crossentropyloss  will automatically code it as  one-hot , which will increase it by one dimension to:

tensor([[1., 0.],
[0., 1.],
[1., 0.],
[1., 0.],
[0., 1.],
[1., 0.],
[1., 0.],
[0., 1.],
[0., 1.],
[1., 0.],
[1., 0.]])


Therefore, the solution is to use the loss function of the multi label problem. For example,  multilabelsoftmarginloss , or the most original  mselos .

reference resources

[1] Wang’s technical road. Runtimeerror: multi target not supported at [EB/OL]. (December 10, 2019) [October 27, 2021] https://www.cnblogs.com/blogwangwang/p/12018897.html
[2] Python free. Solution of “one-dimensional target tensor expectation, multi-objective unsupported” in cross entropy loss function, calculation, lossfunction, error report, 1dtargettensorexpected, multitargetnotsupported, Solution [EB/OL] (2020-07-04) [2021-10-27] https://www.pythonf.cn/read/125399

# RuntimeError: cuda runtime error (100) : no CUDA-capable device is detected at /opt/conda/conda-bld/

problem

RuntimeError: cuda runtime error (100) : no CUDA-capable device is detected at /opt/conda/conda-bld/


solve

This problem is likely to be your CUDA number is wrong
for example, the variables you set use GPUs 2 and 3, but in fact you only have two GPUs 0 and 1, which will lead to this error.

The reason is that the SSL certificate needs to be verified, but the certificate verification failed. I tried some methods on the Internet, mainly including canceling certificate verification and installing the latest certificate. I didn’t find a suitable method to install the latest certificate, so I used the method of canceling certificate verification. The specific operation is to add the following code at the location where the file needs to be downloaded:

import ssl
ssl._create_default_https_context = ssl._create_unverified_context

# RuntimeError: Expected hidden[0] size (x, x, x), got(x, x, x)

The above figure shows the problem when training the bilstm network.

Problem Description: define the initial weights H0 and C0 of bilstm network and input them to the network as the initial weight of bilstm, which is realized by the following code

output, (hn, cn) = self.bilstm(input, (h0, c0))

The network structure is as follows:

self.bilstm = nn.LSTM(
input_size=self.input_size,
hidden_size=self.hidden_size,
num_layers=self.num_layers,
bidirectional=True,
bias=True,
dropout=config.drop_out
)

The dimension of initial weight is defined as   H0 and C0 are initialized. The dimension is:

**h_0** of shape (num_layers * num_directions, batch, hidden_size)
**c_0** of shape (num_layers * num_directions, batch, hidden_size)

In bilstm network, the parameters are defined as follows:

num_layers: 2

num_directions: 2

batch: 4

seq_len: 10

input_size: 300

hidden_size: 100 

Then according to the definition in the official documents    H0, C0 dimensions should be: (2 * 2, 4100) = (4, 4100)

However, according to the error screenshot at the beginning of the article, the dimension of the initial weight of the hidden layer should be (4, 10100), which makes me doubt whether the dimension specified in the official document is correct.

Obviously, the official documents cannot be wrong, and the hidden state dimensions when using blstm, RNN and bigru in the past are the same as those specified by the official, so I don’t know where to start.

Therefore, we re examined the network structure and found that an important parameter, batch, was missing_ First, let’s take a look at all the parameters required by bilstm:

Args:
input_size: The number of expected features in the input x
hidden_size: The number of features in the hidden state h
num_layers: Number of recurrent layers. E.g., setting num_layers=2
would mean stacking two LSTMs together to form a stacked LSTM,
with the second LSTM taking in outputs of the first LSTM and
computing the final results. Default: 1
bias: If False, then the layer does not use bias weights b_ih and b_hh.
Default: True
batch_first: If True, then the input and output tensors are provided
as (batch, seq, feature). Default: False
dropout: If non-zero, introduces a Dropout layer on the outputs of each
LSTM layer except the last layer, with dropout probability equal to
:attr:dropout. Default: 0
bidirectional: If True, becomes a bidirectional LSTM. Default: False

batch_ The first parameter can make the dimension batch in the first dimension during training, that is, the input data dimension is

(batch size, SEQ len, embedding dim), if not added   batch_ First = true, the dimension is

（seq len，batch size，embedding dim）

Because there was no break at noon, I vaguely forgot to add this important parameter, resulting in an error: the initial weight dimension is incorrect, and I can add it   batch_ Run smoothly after first = true.

The modified network structure is as follows:

self.bilstm = nn.LSTM(
input_size=self.input_size,
hidden_size=self.hidden_size,
num_layers=self.num_layers,
batch_first=True,
bidirectional=True,
bias=True,
dropout=config.drop_out
)

Extension: when we use RNN and its variant network, if we want to add the initial weight, the dimension must be the officially specified dimension, i.e

(num_layers * num_directions, batch, hidden_size)

At the same time, be sure to set batch_ First = true. The official document does not specify when batch is set_ When first = true, the dimensions of H0, C0, HN and CN are (num_layers * num_directions, batch, hidden_size), so be careful!

At the same time, check whether batch is set when the dimensions of HN and CN are incorrect_ First parameter, RNN and its variant networks are applicable to this method!

# RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

RuntimeError: cuDNN error: CUDNN_ STATUS_ EXECUTION_ FAILED

The error is in   cuda：10.0     Pytorch: 1.2 problems in the training model under the GPU server environment, error prompt   Cudnn status execution failed

The problem with this error is that CUDA’s version does not correspond to pytorch’s version, resulting in CUDA’s failure to speed up model training and execute at the same time.

When downloading pytorch, we need to correctly download the corresponding relationship between pytorch and CUDA version on the official website. In the local training model, my environment is CUDA 10.0 and pytorch 1.9. Therefore, reinstall pytorch version 1.9 in the server and run successfully.

Performance: CUDA’s version does not correspond to pytorch’s version. The most obvious performance is that when running the program, the video memory does not change. When the normally loaded data and model enter the video memory, the video memory will increase significantly, while when the version does not correspond, the video memory does not change significantly. At the same time, the program will be very slow when loading the model, and even the model cannot be loaded into the video memory for 20 minutes.

# Pytorch directly creates a tensor on the GPU error [How to Solve]

Pytoch directly creates tensors on the GPU and reports an error: Legacy constructor expectations device type: cpubut device type: CUDA was passed

General tensor creation method:

torch.Tensor(x)


However, by default, the tensor is placed in the CPU (memory). If we want to use the GPU to train the model, we must also copy the tensor to the GPU, which will obviously save time
I’ve seen other articles before saying that tensors can be created directly on the GPU, so I’ve also made a try:

MyDevice=torch.device('cuda:0')
x = torch.LongTensor(x, device=MyDevice)


An error is reported when running the program:

legacy constructor expects device type: cpu but device type: cuda was passed


According to the error report analysis, the problem is that the device parameter cannot be passed ‘CUDA’?After checking, I found that the official answer given by pytorch is that tensor class is a subclass of tensor and cannot pass parameters to its device. There is no problem using the tensor class to build
⭐ Therefore, the code is changed as follows:

MyDevice = torch.device('cuda:0')
x = torch.tensor(x, device=MyDevice)
x = x.long()


Now, there will be no more errors.

# PyTorch Error: RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling cublasSgemm()

Complete error reporting information

RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when
calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a,
lda, b, ldb, &beta, c, ldc)


Error causes and solutions

Usually linear layer

n

n

.

L

i

n

e

a

r

\rm nn.Linear

NN. When defining linear, the data dimension parameter does not match the actual data dimension, so it needs to be checked and modified.