Tag Archives: Deep learning

[Solved] RuntimeError: CUDA error: out of memory

1. Check whether the appropriate version of torch is used

print(torch.__version__)  # 1.9.1+cu111
print(torch.version.cuda)  # 11.1
print(torch.backends.cudnn.version())  # 8005
print(torch.cuda.current_device())  # 0
print(torch.cuda.is_available())  # TRUE

2. Check whether the video memory is insufficient, try to modify the batch size of the training, and it still cannot be solved when it is modified to the minimum, and then use the following command to monitor the video memory occupation in real time

watch -n 0.5 nvidia-smi

When the program is not called, the display memory is occupied

Therefore, the problem is that the program specifies to use four GPUs. There is no problem when calling the first two resources, but the third block is occupied by the programs of other small partners, so an error is reported.

3. Specify the GPU to use

device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")  # cuda Specifies the GPU device to be used
model = torch.nn.DataParallel(model, device_ids=[0, 1, 3])  # Specify the device number to be used for multi-GPU parallel processing

So you can run happily

[Solved] RuntimeError: Expected object of scalar type Float but got scalar type Double for argument #2 ‘mat1‘

Error Message (Error Codes below):
RuntimeError: Expected object of scalar type Float but got scalar type Double for argument #2 ‘mat1’ in call to _th_addmm

for epoch in range(num_epochs):
    # Convert numpy arrays to torch tensors
    inputs = torch.from_numpy(x_train)
    targets = torch.from_numpy(y_train)
    # Forward pass
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    
    # Backward and optimize
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    print("Epoch [{}/{}] Loss: {:.4f}".format(epoch+1, num_epochs, loss.item()))

Solution:

Method 1. Add

inputs = inputs.float()
targets = targets.float()
model = model.float()

Complete code

for epoch in range(num_epochs):
    # Convert numpy arrays to torch tensors
    inputs = torch.from_numpy(x_train)
    targets = torch.from_numpy(y_train)
    inputs = inputs.float()
    targets = inputs.float()
    model = model.float()
    # Forward pass
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    
    # Backward and optimize
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    print("Epoch [{}/{}] Loss: {:.4f}".format(epoch+1, num_epochs, loss.item()))

Method 2. Add

inputs = inputs.double()
targets = inputs.double()
model = model.double()

Complete code

for epoch in range(num_epochs):
    # Convert numpy arrays to torch tensors
    inputs = torch.from_numpy(x_train)
    targets = torch.from_numpy(y_train)
    inputs = inputs.double()
    targets = inputs.double()
    model = model.double()
    # Forward pass
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    
    # Backward and optimize
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    print("Epoch [{}/{}] Loss: {:.4f}".format(epoch+1, num_epochs, loss.item()))

Pytorch CUDA Error: UserWarning: CUDA initialization: CUDA unknown error…

After CUDA is installed, the following error is reported using pytorch

UserWarning: CUDA initialization: CUDA unknown error – this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_ VISIBLE_ DEVICES after program start.

Solution: after CUDA and pytorch are installed, add the following in. Bashrc

export  PATH=/usr/local/cuda-11.4/bin:$PATH
export  LD_LIBRARY_PATH=/usr/local/cuda-11.4/lib64$LD_LIBRARY_PATH
export CUDA_HOME=/usr/local/cuda-11.4/bin
export CUDA_VISIBLE_DEVICES=0,1

If there is still a problem, use sudo apt-get install NVIDIA modprobe to install it. After the installation, you can use it

Methods of checking CUDA

import torch
flag = torch.cuda.is_available()
print(flag)

Output is: True cuda normal

Error in training yolox: error in importing apex

This error is reported because you did not successfully install apex. Note ~: it is not PIP install apex

ERROR: Command errored out with exit status 1: /usr/bin/python3 -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-6o4wusvf/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-6o4wusvf/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' --cpp_ext --cuda_ext install --record /tmp/pip-record-07hjl8r1/install-record.txt --single-version-externally-managed --compile --install-headers /usr/local/include/python3.7/apex Check the logs for full command output.
Exception information:
Traceback (most recent call last):

  Processing method: Step 1: use this command to check the CUDA version supported by your machine:

nvcc --version

Step 2: use the following command to view the version of CUDA you currently have installed.

pip list

Note: when installing apex, you must ensure that the two versions are consistent. That is, if the version supported by the machine is 11.0, you can install CUDA of the corresponding torch version.

First use the following command to uninstall the original torch of your machine

!pip uninstall -y torch torchvision torchaudio

Then use the following command to install. For example, the machine here supports CUDA version 11.0

!pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html

Next, install apex:

!git clone https://github.com/NVIDIA/apex
%cd apex
!pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

The above work is completed.

[Solved] RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cubla…

Resolve runtime error: CUDA error: cublas_STATUS_EXECUTION_FAILED when calling `cubla…

The running experiment encountered this problem. At the beginning, it was found that some people said it was because the dimensions might be different, but after inspection, this problem did not exist.

Another solution is to add a sentence of code

torch.backends.cudnn.enabled = false , but I haven’t tried yet, because it is found that the CUDA device settings of the main.py file and other files are different (there is not much data, I didn’t set nn.dataparallel, so there will be no problem after the changes are consistent.

Therefore, if you encounter this problem, you can check whether each variable and model are on the same CUDA device.

[Solved] RuntimeError: CUDA error: invalid device ordinal

Error Message:
RuntimeError: CUDA error: invalid device ordinal

Solution:

args.device = torch.device('cuda:' + str(args.gpu_id))

Used the gpu_id exceeds that of the GPU card

My code is written in the above way. Different codes are written in different ways. In the final analysis, just change the number after CUDA: to be appropriate