Tag Archives: Deep learning

[Solved] RuntimeError: CUDA error: out of memory

1. Check whether the appropriate version of torch is used

import torch

print(torch.__version__)  # 1.9.1+cu111
print(torch.version.cuda)  # 11.1
print(torch.backends.cudnn.version())  # 8005
print(torch.cuda.current_device())  # 0
print(torch.cuda.is_available())  # True

2. Check whether GPU memory is insufficient. Try reducing the training batch size. If the error persists even at the minimum batch size, monitor GPU memory usage in real time with the following command:

watch -n 0.5 nvidia-smi

GPU memory is already occupied even when the program is not running.

The cause is that the program is configured to use four GPUs. The first two are available, but the third is occupied by another user's job, so the error is raised.

3. Specify the GPU to use

device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")  # use CUDA if available and not disabled
model = torch.nn.DataParallel(model, device_ids=[0, 1, 3])  # skip the occupied GPU 2 for multi-GPU training

After this change the program runs without the error.
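
Instead of picking the free cards by hand, the list can be read from nvidia-smi before building device_ids. The following is a minimal sketch, not part of the original post: the helper names free_gpu_ids and query_nvidia_smi and the 500 MiB threshold are assumptions, and the sample text mimics the output of nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits.

```python
import subprocess


def free_gpu_ids(csv_text, max_used_mib=500):
    """Return indices of GPUs whose used memory is at most max_used_mib.

    csv_text is expected to hold one "index, memory.used" pair per line.
    """
    ids = []
    for line in csv_text.strip().splitlines():
        index, used = (field.strip() for field in line.split(","))
        if int(used) <= max_used_mib:
            ids.append(int(index))
    return ids


def query_nvidia_smi():
    """Ask nvidia-smi for per-GPU memory usage (requires the NVIDIA driver)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=index,memory.used",
        "--format=csv,noheader,nounits",
    ]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


# Made-up sample: GPU 2 is busy with someone else's job.
sample = "0, 12\n1, 3\n2, 10240\n3, 8\n"
print(free_gpu_ids(sample))  # [0, 1, 3]
```

On a real machine, free_gpu_ids(query_nvidia_smi()) would supply the device_ids argument of torch.nn.DataParallel.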

[Solved] RuntimeError: Expected object of scalar type Float but got scalar type Double for argument #2 'mat1'

Error Message (Error Codes below):
RuntimeError: Expected object of scalar type Float but got scalar type Double for argument #2 ‘mat1’ in call to _th_addmm

for epoch in range(num_epochs):
    # Convert numpy arrays to torch tensors
    inputs = torch.from_numpy(x_train)
    targets = torch.from_numpy(y_train)
    # Forward pass
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    
    # Backward and optimize
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    print("Epoch [{}/{}] Loss: {:.4f}".format(epoch+1, num_epochs, loss.item()))

Solution:

Method 1. Add

inputs = inputs.float()
targets = targets.float()
model = model.float()

Complete code

for epoch in range(num_epochs):
    # Convert numpy arrays to torch tensors
    inputs = torch.from_numpy(x_train)
    targets = torch.from_numpy(y_train)
    inputs = inputs.float()
    targets = targets.float()
    model = model.float()
    # Forward pass
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    
    # Backward and optimize
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    print("Epoch [{}/{}] Loss: {:.4f}".format(epoch+1, num_epochs, loss.item()))

Method 2. Add

inputs = inputs.double()
targets = targets.double()
model = model.double()

Complete code

for epoch in range(num_epochs):
    # Convert numpy arrays to torch tensors
    inputs = torch.from_numpy(x_train)
    targets = torch.from_numpy(y_train)
    inputs = inputs.double()
    targets = targets.double()
    model = model.double()
    # Forward pass
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    
    # Backward and optimize
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    print("Epoch [{}/{}] Loss: {:.4f}".format(epoch+1, num_epochs, loss.item()))
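
The mismatch comes from NumPy: it creates float64 arrays by default, and torch.from_numpy keeps that dtype, while PyTorch layers use float32 by default. As an alternative sketch (not from the original post), the arrays can be cast once before the training loop. The x_train and y_train values here are made up for illustration.

```python
import numpy as np

# Made-up training data; NumPy defaults to float64 for Python floats.
x_train = np.array([[3.3], [4.4], [5.5]])
y_train = np.array([[1.7], [2.76], [2.09]])
print(x_train.dtype)  # float64

# Cast once, before the loop. torch.from_numpy(x_train32) then yields a
# float32 tensor, which matches the default dtype of the model weights.
x_train32 = x_train.astype(np.float32)
y_train32 = y_train.astype(np.float32)
print(x_train32.dtype)  # float32
```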

Pointnet-pytorch Error: ImportError: No module named 'pointnet'

Add the following before the pointnet imports at the top of the script:

import sys
sys.path.append("../")
from pointnet.dataset import ShapeNetDataset
from pointnet.model import PointNetCls
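
Note that sys.path.append("../") is resolved against the current working directory, so it only helps when the script is launched from its own folder. A hedged alternative is to build the path from the script's location. The helper name add_repo_root and the example path below are hypothetical.

```python
import sys
from pathlib import Path


def add_repo_root(script_path):
    """Put the parent of the script's directory at the front of sys.path."""
    root = str(Path(script_path).resolve().parent.parent)
    if root not in sys.path:
        sys.path.insert(0, root)
    return root


# In a real script, pass __file__ instead of this hypothetical path.
root = add_repo_root("/home/user/pointnet.pytorch/utils/train_classification.py")
print(root)
```

After this call, from pointnet.dataset import ShapeNetDataset works regardless of the directory the script is started from, provided the pointnet package sits in that root folder.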

Python Pandas Error: No module named 'openpyxl'

When using pandas to convert a CSV file to an Excel file, the error No module named 'openpyxl' is reported.

pandas needs the additional openpyxl library to write Excel files.

In a conda environment:
conda install openpyxl

In the base environment, pip can also be used:
pip install openpyxl
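
The conversion script can also check for the library up front and fail with a clearer message. This is a minimal sketch, assuming pandas is installed; the function names has_module and csv_to_xlsx are placeholders, not part of any library.

```python
import importlib.util


def has_module(name):
    """Return True if the named top-level module can be imported."""
    return importlib.util.find_spec(name) is not None


def csv_to_xlsx(csv_path, xlsx_path):
    """Convert a CSV file to .xlsx, failing early if openpyxl is missing."""
    if not has_module("openpyxl"):
        raise RuntimeError("openpyxl is missing; run: pip install openpyxl")
    import pandas as pd  # imported here so the check above runs first

    pd.read_csv(csv_path).to_excel(xlsx_path, index=False)


print(has_module("json"))  # True
```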

Error Code 2: Internal Error (Assertion cublasStatus == CUBLAS_STATUS_SUCCES

Error Code 2: Internal Error (Assertion cublasStatus == CUBLAS_STATUS_SUCCESS

This is a bug 🤬. When installing CUDA 10.2, install all three packages listed on the official download page, not only the first one.

Pytorch CUDA Error: UserWarning: CUDA initialization: CUDA unknown error…

After CUDA is installed, the following warning is reported when using PyTorch:

UserWarning: CUDA initialization: CUDA unknown error – this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start.

Solution: after CUDA and PyTorch are installed, add the following to .bashrc:

export PATH=/usr/local/cuda-11.4/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-11.4/lib64:$LD_LIBRARY_PATH
export CUDA_HOME=/usr/local/cuda-11.4
export CUDA_VISIBLE_DEVICES=0,1

If the problem persists, install nvidia-modprobe with sudo apt-get install nvidia-modprobe. After the installation, CUDA should be usable.
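
Entries in PATH and LD_LIBRARY_PATH are separated by colons, so a missing colon silently merges two directories into one invalid path. The sketch below illustrates the effect. The helper name prepend_path and the existing directory value are illustrative assumptions.

```python
def prepend_path(new_dir, current):
    """Prepend new_dir to a colon-separated search path."""
    return new_dir if not current else new_dir + ":" + current


cuda_lib = "/usr/local/cuda-11.4/lib64"
existing = "/usr/lib/x86_64-linux-gnu"

# Without the separator, the two directories fuse into one bogus entry.
broken = cuda_lib + existing
print(broken.split(":"))
# ['/usr/local/cuda-11.4/lib64/usr/lib/x86_64-linux-gnu']

# With the separator, both directories are kept.
fixed = prepend_path(cuda_lib, existing)
print(fixed.split(":"))
# ['/usr/local/cuda-11.4/lib64', '/usr/lib/x86_64-linux-gnu']
```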

Methods of checking CUDA

import torch
flag = torch.cuda.is_available()
print(flag)

If the output is True, CUDA is working normally.

[Solved] YOLOX Run Error: can't find starting number

Error: can't find starting number (in the name of file): video/test.avi in function 'icvExtractPattern'

The error is reported when running YOLOX with video/test.avi as the file name.

Solution:

Add 01 to the file name, so that it contains the starting number the error message asks for.

Error in training YOLOX: error in importing apex

This error is reported because apex was not installed successfully. Note: the fix is not pip install apex.

ERROR: Command errored out with exit status 1: /usr/bin/python3 -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-6o4wusvf/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-6o4wusvf/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' --cpp_ext --cuda_ext install --record /tmp/pip-record-07hjl8r1/install-record.txt --single-version-externally-managed --compile --install-headers /usr/local/include/python3.7/apex Check the logs for full command output.
Exception information:
Traceback (most recent call last):

Fix, step 1: use this command to check the CUDA toolkit version on your machine:

nvcc --version

Step 2: use the following command to list the installed packages and check which CUDA version your torch build targets (the +cu suffix of the torch version):

pip list

Note: before installing apex, make sure the two versions match. For example, if the machine has CUDA 11.0, install the torch build for CUDA 11.0.
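
The comparison of the two versions can also be scripted. This is a rough sketch: the helper names are hypothetical, and the sample strings mimic a typical nvcc --version line and a torch version string as shown by pip list.

```python
import re


def cuda_from_nvcc(nvcc_output):
    """Extract the toolkit version, e.g. '11.0', from `nvcc --version` text."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def cuda_from_torch(torch_version):
    """Extract the CUDA version from a version tag like '1.7.1+cu110'."""
    match = re.search(r"\+cu(\d+)$", torch_version)
    if not match:
        return None  # CPU-only build or no CUDA tag in the version string
    digits = match.group(1)
    # The last digit is the minor version: 'cu110' -> '11.0', 'cu102' -> '10.2'.
    return digits[:-1] + "." + digits[-1]


nvcc_text = "Cuda compilation tools, release 11.0, V11.0.194"
print(cuda_from_nvcc(nvcc_text))       # 11.0
print(cuda_from_torch("1.7.1+cu110"))  # 11.0
print(cuda_from_nvcc(nvcc_text) == cuda_from_torch("1.7.1+cu110"))  # True
```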

First, uninstall the torch packages currently installed on your machine:

!pip uninstall -y torch torchvision torchaudio

Then install the matching build. For example, the machine here has CUDA 11.0:

!pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html

Next, install apex:

!git clone https://github.com/NVIDIA/apex
%cd apex
!pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

This completes the installation.

[Solved] RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cubla…

Resolving RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cubla…

I ran into this error while running an experiment. Some reports attribute it to mismatched tensor dimensions, but after checking, that was not the cause here.

Another suggested fix is to add the line torch.backends.cudnn.enabled = False, but I have not tried it. In my case, the CUDA device set in main.py differed from the one set in the other files (the dataset is small, so I did not use nn.DataParallel). Once the device settings were made consistent, the error disappeared.

Therefore, if you encounter this problem, check that every tensor and the model are on the same CUDA device.

[Solved] RuntimeError: CUDA error: invalid device ordinal

Error Message:
RuntimeError: CUDA error: invalid device ordinal

Solution:

args.device = torch.device('cuda:' + str(args.gpu_id))

The error occurs when the gpu_id used is higher than the highest GPU index on the machine.

My code selects the device as shown above; other code bases do it differently. In every case, the fix is to change the number after cuda: to the index of a GPU that exists.
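
A defensive version of the device selection can validate gpu_id before building the device string. This is a sketch under assumptions: pick_device is a hypothetical helper, and device_count stands in for the value returned by torch.cuda.device_count().

```python
def pick_device(gpu_id, device_count):
    """Return a device string, rejecting ordinals the machine does not have."""
    if device_count == 0:
        return "cpu"  # no GPU visible, fall back to the CPU
    if not 0 <= gpu_id < device_count:
        raise ValueError(
            f"gpu_id {gpu_id} is out of range; valid ids are 0..{device_count - 1}"
        )
    return f"cuda:{gpu_id}"


# With two GPUs, the valid ordinals are 0 and 1.
print(pick_device(1, 2))  # cuda:1
print(pick_device(0, 0))  # cpu
```

The returned string can be passed to torch.device(), so an out-of-range gpu_id fails with a readable message instead of the CUDA error.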

linux/tensorflow: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE

The error is reported during CUDA training.

Solution: set CUDA_VISIBLE_DEVICES when launching the script. The value -1 hides all GPUs, so training runs on the CPU:

CUDA_VISIBLE_DEVICES=-1 python train_single.py
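
The same setting can be applied from inside the script, as long as the variable is set before TensorFlow (or any other CUDA library) is imported. A minimal sketch:

```python
import os

# Must run before the first `import tensorflow`; once CUDA has been
# initialized, changing this variable has no effect.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

print(os.environ["CUDA_VISIBLE_DEVICES"])  # -1
```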

https://www.136.la/tech/show-629533.html

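PyTorch – AttributeError: 'bool' object has no attribute 'sum'

The fix is to change torch.max() to torch.argmax(). torch.max(out, 1) returns a (values, indices) pair rather than a tensor of predictions, so pred == label evaluates to a plain bool, which has no sum() method.

    out = model(img)
    loss = criterion(out, label)
    eval_loss += loss.item() * label.size(0)
    pred = torch.argmax(out, 1)  # index of the highest score per sample
    num_correct = (pred == label).sum()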
