Tag Archives: Deep learning

[Solved] RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation

YOLOv5 error: RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation

Solution:
In the model/yolo.py file, the original code is:

        for mi, s in zip(m.m, m.stride):  # from
            b = mi.bias.view(m.na, -1)  # conv.bias(255) to (3,85)
            b[:, 4] += math.log(8/(640/s) ** 2)  # obj (8 objects per 640 image)
            b[:, 5:] += math.log(0.6/(m.nc - 0.99)) if cf is None else torch.log(cf/cf.sum())  # cls
            mi.bias = torch.nn.Parameter(b.view(-1), requires_grad=True)

Wrap the in-place bias updates in torch.no_grad(), as follows:

        for mi, s in zip(m.m, m.stride):  # from
            b = mi.bias.view(m.na, -1)  # conv.bias(255) to (3,85)
            with torch.no_grad():
                b[:, 4] += math.log(8/(640/s) ** 2)  # obj (8 objects per 640 image)
                b[:, 5:] += math.log(0.6/(m.nc - 0.99)) if cf is None else torch.log(cf/cf.sum())  # cls
            mi.bias = torch.nn.Parameter(b.view(-1), requires_grad=True)

The essence of the fix is the added

with torch.no_grad():

block: the biases are leaf tensors with requires_grad=True, and in-place edits to a view of such a tensor are only allowed while autograd tracking is suspended.
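For reference, the same pattern applies whenever a view of a leaf tensor that requires grad is modified in place; a minimal sketch reproducing the error and the fix (names are illustrative):

import torch

w = torch.zeros(6, requires_grad=True)  # a leaf tensor that requires grad
v = w.view(2, 3)                        # a view of that leaf

# v[:, 0] += 1.0  # raises: RuntimeError: a view of a leaf Variable that
#                 # requires grad is being used in an in-place operation

with torch.no_grad():  # suspend autograd tracking for the update
    v[:, 0] += 1.0     # the in-place edit is now allowed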

[Solved] OMP: Error #15: Initializing libiomp5md.dll, but found libiomp5md.dll already initialized.

Error 1. OMP: Error #15: Initializing libiomp5md.dll, but found libiomp5md.dll already initialized.
Error Message:

OMP: Error #15: Initializing libiomp5md.dll, but found libiomp5md.dll already initialized.
OMP: Hint This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, please see http://www.intel.com/software/products/support/.

Solution 1:

When running the program (for example, debugging in PyCharm), you can add these two statements at the top of the script:

import os
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"

Solution 2:

If Solution 1 does not fix the problem (even a bare import torch in the terminal triggers the error), the actual cause is that two copies of libiomp5md.dll exist in the Anaconda environment. Go to the path of the virtual environment and search for this file; you will find two copies of the DLL in the environment.

The first is under the torch package path, and the second is under the virtual environment itself. Go to the second location and move the file elsewhere as a backup (it is best to note down its original path as well).
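To locate the duplicates, you can search the environment's directory tree; a minimal sketch (the environment path is a hypothetical example, adjust it to yours):

import os

env_root = r"C:\ProgramData\Anaconda3\envs\your_env"  # hypothetical: path to your virtual environment
for root, dirs, files in os.walk(env_root):
    for name in files:
        if name.lower() == "libiomp5md.dll":
            print(os.path.join(root, name))  # expect two hits: one under torch, one under the env itself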

Error 2. ModuleNotFoundError: No module named 'mmcv._ext'

When using the target detection open source tool MMDetection, the following error occurs:

ModuleNotFoundError: No module named 'mmcv._ext'

This is likely because you installed mmcv-full without specifying a version, i.e. directly:

pip install mmcv-full

This installs the latest mmcv-full by default. If it does not match the CUDA and torch versions in your environment, the above error is likely to occur.

Uninstall the original mmcv-full:

pip uninstall mmcv-full

Reinstall the matching version of mmcv-full, where {cu_version} and {torch_version} stand for your CUDA and torch versions respectively:

pip install mmcv-full -f https://download.openmmlab.com/mmcv/dist/{cu_version}/{torch_version}/index.html

For example, I have CUDA 10.2 and PyTorch 1.8.0 installed, so the command is:

pip install mmcv-full==1.2.4 -f https://download.openmmlab.com/mmcv/dist/cu102/torch1.8.0/index.html

Note: I recommend mmcv-full version 1.2.4.
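To find the right values for {cu_version} and {torch_version}, you can check the installed torch build first; for example:

import torch

print(torch.__version__)   # e.g. '1.8.0' -> torch1.8.0
print(torch.version.cuda)  # e.g. '10.2'  -> cu102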

[Solved] RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation

Original code:

	def anim(i):
		# update SMBLD
		cur_beta_idx, cur_step = i // num_steps, i % num_steps
		val = shape_range[cur_step]
		mesh.multi_betas[0, cur_beta_idx] = val  # Update betas
		fig.suptitle(f"{name.title()}\nS{cur_beta_idx} : {val:+.2f}", fontsize=50)  # update text

		return dict(mesh=mesh.get_meshes(), equalize=False)


Modified code

Adding with torch.no_grad(): around the in-place update fixes it:

	def anim(i):
		# update SMBLD
		cur_beta_idx, cur_step = i // num_steps, i % num_steps
		val = shape_range[cur_step]
		#print("\ncur_beta_idx:",cur_beta_idx,mesh.multi_betas[0, cur_beta_idx])
		with torch.no_grad():  # added
			mesh.multi_betas[0, cur_beta_idx] = val  # Update betas
		fig.suptitle(f"{name.title()}\nS{cur_beta_idx} : {val:+.2f}", fontsize=50)  # update text

		return dict(mesh=mesh.get_meshes(), equalize=False)

[Solved] AttributeError: module 'keras.preprocessing.image' has no attribute 'load_img'

Original code:

from keras.preprocessing import image
img         =   image.load_img(img_path,target_size=(224,224))
x           =   image.img_to_array(img)

report errors:

AttributeError: module 'keras.preprocessing.image' has no attribute 'load_img'

Reason: the Keras version has been updated, and these helpers no longer live in keras.preprocessing.image.

Solution:

from keras.utils import image_utils
img         =   image_utils.load_img(img_path,target_size=(224,224))
x           =   image_utils.img_to_array(img)
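In recent TensorFlow releases the same helpers are also exposed under tf.keras.utils; a sketch, assuming a TF 2.9+ install (the file name is a hypothetical example):

import numpy as np
from tensorflow.keras.utils import load_img, img_to_array

img_path = "cat.jpg"                               # hypothetical example image
img = load_img(img_path, target_size=(224, 224))   # PIL image resized to 224x224
x = img_to_array(img)                              # float32 array of shape (224, 224, 3)
x = np.expand_dims(x, axis=0)                      # add a batch dimension for the model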

[Solved] AttributeError: module 'tensorboard.summary._tf.summary' has no attribute 'merge'

The environment is TensorFlow 2.9, but the code to be run (from GitHub) targets TensorFlow 1.x, so many of its APIs cannot be called directly.

Original code

train_summary_op = tf.summary.merge([loss_summary])

First attempt, which still fails:

train_summary_op = tf.compat.v1.summary.merge([loss_summary])

This reports:

TypeError: Tensors in list passed to 'inputs' of 'MergeSummary' Op have types [bool] that do not match expected type string.

Since that does not solve it, change the import instead. Original:

import tensorflow as tf

Modified to:

import tensorflow._api.v2.compat.v1 as tf
tf.disable_v2_behavior()

This finally solves the problem.
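Note that tensorflow._api.v2.compat.v1 is a private module path; the public form of the same shim is tensorflow.compat.v1. A minimal sketch of the v1-style summary flow under that shim:

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

loss = tf.constant(0.25)                             # stand-in for a real loss tensor
loss_summary = tf.summary.scalar("loss", loss)       # v1 summary op (a string tensor)
train_summary_op = tf.summary.merge([loss_summary])  # merge now receives the expected string type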

How to Solve PyTorch eval Stuck Error

Question

Single-card training is very fast, but during eval the process hangs after running one batch, with no error message.

Things tried that did not help:

1. Changing pin_memory of valid_loader to False. When it is True, data is automatically loaded into pinned memory, which speeds up transfer to the GPU.
2. Changing num_workers to 1. Some say too many workers may lead to a multi-process deadlock, so it can be reduced.

Final Solution:

valid_loader:
pin_memory = True  # important: people online suggested changing this to False, but in my experiment it only runs normally with True
num_workers = 4
batch_size = 8

train_loader:
pin_memory = True
num_workers = 4
batch_size = 8

These parameters are the same as for valid_loader.

In short: keep pin_memory=True for valid_loader (data is automatically loaded into pinned memory, which speeds up transfer to the GPU and therefore inference), and reduce num_workers and batch_size for both valid_loader and train_loader. pin_memory for train_loader is also kept True.
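A minimal sketch of the corresponding DataLoader setup (the datasets here are tiny placeholders; substitute your own):

import torch
from torch.utils.data import DataLoader, TensorDataset

# placeholder datasets standing in for the real ones
train_dataset = TensorDataset(torch.randn(64, 3, 32, 32), torch.zeros(64, dtype=torch.long))
valid_dataset = TensorDataset(torch.randn(16, 3, 32, 32), torch.zeros(16, dtype=torch.long))

train_loader = DataLoader(train_dataset, batch_size=8, num_workers=4, pin_memory=True, shuffle=True)
valid_loader = DataLoader(valid_dataset, batch_size=8, num_workers=4, pin_memory=True, shuffle=False)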

[Solved] LeNet Script Train Error: AttributeError: 'DictIterator' object has no attribute 'get_next'

My training environment: Windows 10 64-bit; MindSpore 1.5.0-beta; CPU; Python 3.9.

When training on the MNIST dataset with LeNet, the following error occurs:

AttributeError: 'DictIterator' object has no attribute 'get_next'

How can this be solved?

The version is indeed a bit old; the use case above needs to be modified on newer releases. Judging from the current implementation of the Iterator code, it no longer has a get_next method: https://gitee.com/mindspore/mindspore/blob/master/mindspore/python/mindspore/dataset/engine/iterators.py#L59

But it does have a __next__ method, so the line can be changed accordingly. You can try:

original: data = ds.get_next()
modified: data = next(ds)
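A minimal sketch of the modified usage, assuming a MindSpore dict iterator (the placeholder dataset below just stands in for the real MNIST pipeline):

import numpy as np
import mindspore.dataset as mds

# a tiny placeholder dataset standing in for the real MNIST pipeline
dataset = mds.NumpySlicesDataset({"image": np.zeros((4, 28, 28), np.float32)}, shuffle=False)

ds = dataset.create_dict_iterator()
data = next(ds)  # replaces the removed ds.get_next()
print(data["image"].shape)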

[Solved] Labelimg Open an image Error: Error opening file

The LabelImg program reports an error when opening an image; the interface shows "Error opening file".

Solution: re-save all the images to be annotated with the following script:

import os
from tqdm import tqdm
from PIL import Image

dir_origin_path = "image save address"  # folder containing the original images
dir_save_path = "image resave address"  # folder to write the re-saved images to

img_names = os.listdir(dir_origin_path)
for img_name in tqdm(img_names):
    if img_name.lower().endswith(('.bmp', '.dib', '.png', '.jpg', '.jpeg', '.pbm', '.pgm', '.ppm', '.tif', '.tiff')):
        image_path = os.path.join(dir_origin_path, img_name)
        image = Image.open(image_path)
        image = image.convert('RGB')  # normalize the color mode so LabelImg can open the file

        if not os.path.exists(dir_save_path):
            os.makedirs(dir_save_path)
        image.save(os.path.join(dir_save_path, img_name))

[Solved] RuntimeError: "unfolded2d_copy" not implemented for 'Half'

report errors

RuntimeError: "unfolded2d_copy" not implemented for 'Half'

Reason:

The model was passed use_half=True, i.e. it tries to run CPU inference with fp16 (half precision) to speed things up, but PyTorch's CPU backend does not implement this operation for fp16.

Solution:

  1. Pass use_half=False, or
  2. change half() to float(),

so that the model computes in fp32.
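A minimal sketch of option 2, with a stand-in module and tensor in place of the real model and input:

import torch

model = torch.nn.Conv2d(3, 8, 3).half()   # stand-in for the half-precision model
image = torch.randn(1, 3, 32, 32).half()  # stand-in for the fp16 input

model = model.float()  # convert parameters and buffers back to fp32
image = image.float()  # convert the input tensor back to fp32
output = model(image)  # now runs on the CPU without the 'Half' error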


[Solved] RuntimeError: DefaultCPUAllocator: not enough memory: you tried to allocate 1105920 bytes.

Question

RuntimeError: [enforce fail at ..\c10\core\CPUAllocator.cpp:76] data. DefaultCPUAllocator: not enough memory: you tried to allocate 1105920 bytes.

Today, when running YOLOv7 on my own computer, I used the CPU to run the test model because I don't have a GPU. Predicting a single independent image works fine. However, when predicting a video (i.e. many images in a row), it fails with insufficient memory:

DefaultCPUAllocator: not enough memory: you tried to allocate 1105920 bytes.

Moreover, the error does not appear after the second image; it appears around the 17th, which suggests memory keeps accumulating and is never released.

Analysis

In PyTorch, a tensor has a requires_grad attribute; if it is set to True, the tensor is automatically differentiated during backpropagation. requires_grad defaults to False, but if a leaf variable (a tensor you create yourself) has requires_grad=True, then every node that depends on it also requires grad, even if the other tensors it depends on have requires_grad=False.


Note:

requires_grad is an attribute of PyTorch's generic Tensor data structure that indicates whether gradient information needs to be kept for this quantity during computation. Taking linear regression as an example, the weights w and bias b are the objects being trained; to obtain the best parameter values, we define a loss function and train by gradient backpropagation.

When requires_grad is False, backpropagation does not differentiate through the tensor, which saves RAM (or GPU memory).

The solution follows directly: make the model stop recording gradients during testing, since they are never used there.

Solution:

Use with torch.no_grad() so the model does not store gradients during testing:

with torch.no_grad():
    output, _ = model(image) # Add before the image calculation

This way, no derivatives are computed and no gradients are stored while the model processes each image.

Perfect solution!
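On newer PyTorch versions (1.9+), torch.inference_mode() is a possible alternative that also disables gradient tracking and skips some extra autograd bookkeeping; a sketch, reusing model and image from above:

import torch

with torch.inference_mode():
    output, _ = model(image)  # no gradients are recorded for this forward pass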

[Solved] RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`

[Problem description]

The code previously ran normally. After expanding the dataset, the GPU training run for the deep learning model reports the following error, yet without any CUDA out of memory message:

RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`

[Solution 1]

Switch the program to run on the CPU; it then runs normally, but very slowly.

--device cpu

[Solution 2]

Reduce the batch size used for training; the program then runs normally.

[Solved] Using summary to View network parameters Error: RuntimeError: Input type (torch.cuda.FloatTensor)

Use summary to view network parameters

To view the specific parameters of the network, use summary from torchsummary:

from torchsummary import summary
summary(model, (3, 448, 448))

Output:

        Layer (type)               Output Shape         Param #
================================================================
            Conv2d-1         [-1, 64, 224, 224]           9,408
       BatchNorm2d-2         [-1, 64, 224, 224]             128
              ReLU-3         [-1, 64, 224, 224]               0
         MaxPool2d-4         [-1, 64, 112, 112]               0
            Conv2d-5         [-1, 64, 112, 112]           4,096
       BatchNorm2d-6         [-1, 64, 112, 112]             128
              ReLU-7         [-1, 64, 112, 112]               0
            Conv2d-8         [-1, 64, 112, 112]          36,864
       BatchNorm2d-9         [-1, 64, 112, 112]             128
             ReLU-10         [-1, 64, 112, 112]               0
           Conv2d-11        [-1, 256, 112, 112]          16,384
      BatchNorm2d-12        [-1, 256, 112, 112]             512
           Conv2d-13        [-1, 256, 112, 112]          16,384
      BatchNorm2d-14        [-1, 256, 112, 112]             512
             ReLU-15        [-1, 256, 112, 112]               0
       Bottleneck-16        [-1, 256, 112, 112]               0
           Conv2d-17         [-1, 64, 112, 112]          16,384
      BatchNorm2d-18         [-1, 64, 112, 112]             128
             ReLU-19         [-1, 64, 112, 112]               0
           Conv2d-20         [-1, 64, 112, 112]          36,864
      BatchNorm2d-21         [-1, 64, 112, 112]             128
             ReLU-22         [-1, 64, 112, 112]               0
           Conv2d-23        [-1, 256, 112, 112]          16,384
      BatchNorm2d-24        [-1, 256, 112, 112]             512
             ReLU-25        [-1, 256, 112, 112]               0
       Bottleneck-26        [-1, 256, 112, 112]               0
           Conv2d-27         [-1, 64, 112, 112]          16,384
      BatchNorm2d-28         [-1, 64, 112, 112]             128
             ReLU-29         [-1, 64, 112, 112]               0
           Conv2d-30         [-1, 64, 112, 112]          36,864
      BatchNorm2d-31         [-1, 64, 112, 112]             128
             ReLU-32         [-1, 64, 112, 112]               0
           Conv2d-33        [-1, 256, 112, 112]          16,384
      BatchNorm2d-34        [-1, 256, 112, 112]             512
             ReLU-35        [-1, 256, 112, 112]               0
       Bottleneck-36        [-1, 256, 112, 112]               0
           Conv2d-37        [-1, 128, 112, 112]          32,768
      BatchNorm2d-38        [-1, 128, 112, 112]             256
             ReLU-39        [-1, 128, 112, 112]               0
           Conv2d-40          [-1, 128, 56, 56]         147,456
      BatchNorm2d-41          [-1, 128, 56, 56]             256
             ReLU-42          [-1, 128, 56, 56]               0
           Conv2d-43          [-1, 512, 56, 56]          65,536
      BatchNorm2d-44          [-1, 512, 56, 56]           1,024
           Conv2d-45          [-1, 512, 56, 56]         131,072
      BatchNorm2d-46          [-1, 512, 56, 56]           1,024
             ReLU-47          [-1, 512, 56, 56]               0
       Bottleneck-48          [-1, 512, 56, 56]               0
           Conv2d-49          [-1, 128, 56, 56]          65,536
      BatchNorm2d-50          [-1, 128, 56, 56]             256
             ReLU-51          [-1, 128, 56, 56]               0
           Conv2d-52          [-1, 128, 56, 56]         147,456
      BatchNorm2d-53          [-1, 128, 56, 56]             256
             ReLU-54          [-1, 128, 56, 56]               0
           Conv2d-55          [-1, 512, 56, 56]          65,536
      BatchNorm2d-56          [-1, 512, 56, 56]           1,024
             ReLU-57          [-1, 512, 56, 56]               0
       Bottleneck-58          [-1, 512, 56, 56]               0
           Conv2d-59          [-1, 128, 56, 56]          65,536
      BatchNorm2d-60          [-1, 128, 56, 56]             256
             ReLU-61          [-1, 128, 56, 56]               0
           Conv2d-62          [-1, 128, 56, 56]         147,456
      BatchNorm2d-63          [-1, 128, 56, 56]             256
             ReLU-64          [-1, 128, 56, 56]               0
           Conv2d-65          [-1, 512, 56, 56]          65,536
      BatchNorm2d-66          [-1, 512, 56, 56]           1,024
             ReLU-67          [-1, 512, 56, 56]               0
       Bottleneck-68          [-1, 512, 56, 56]               0
           Conv2d-69          [-1, 128, 56, 56]          65,536
      BatchNorm2d-70          [-1, 128, 56, 56]             256
             ReLU-71          [-1, 128, 56, 56]               0
           Conv2d-72          [-1, 128, 56, 56]         147,456
      BatchNorm2d-73          [-1, 128, 56, 56]             256
             ReLU-74          [-1, 128, 56, 56]               0
           Conv2d-75          [-1, 512, 56, 56]          65,536
      BatchNorm2d-76          [-1, 512, 56, 56]           1,024
             ReLU-77          [-1, 512, 56, 56]               0
       Bottleneck-78          [-1, 512, 56, 56]               0
           Conv2d-79          [-1, 256, 56, 56]         131,072
      BatchNorm2d-80          [-1, 256, 56, 56]             512
             ReLU-81          [-1, 256, 56, 56]               0
           Conv2d-82          [-1, 256, 28, 28]         589,824
      BatchNorm2d-83          [-1, 256, 28, 28]             512
             ReLU-84          [-1, 256, 28, 28]               0
           Conv2d-85         [-1, 1024, 28, 28]         262,144
      BatchNorm2d-86         [-1, 1024, 28, 28]           2,048
           Conv2d-87         [-1, 1024, 28, 28]         524,288
      BatchNorm2d-88         [-1, 1024, 28, 28]           2,048
             ReLU-89         [-1, 1024, 28, 28]               0
       Bottleneck-90         [-1, 1024, 28, 28]               0
           Conv2d-91          [-1, 256, 28, 28]         262,144
      BatchNorm2d-92          [-1, 256, 28, 28]             512
             ReLU-93          [-1, 256, 28, 28]               0
           Conv2d-94          [-1, 256, 28, 28]         589,824
      BatchNorm2d-95          [-1, 256, 28, 28]             512
             ReLU-96          [-1, 256, 28, 28]               0
           Conv2d-97         [-1, 1024, 28, 28]         262,144
      BatchNorm2d-98         [-1, 1024, 28, 28]           2,048
             ReLU-99         [-1, 1024, 28, 28]               0
      Bottleneck-100         [-1, 1024, 28, 28]               0
          Conv2d-101          [-1, 256, 28, 28]         262,144
     BatchNorm2d-102          [-1, 256, 28, 28]             512
            ReLU-103          [-1, 256, 28, 28]               0
          Conv2d-104          [-1, 256, 28, 28]         589,824
     BatchNorm2d-105          [-1, 256, 28, 28]             512
            ReLU-106          [-1, 256, 28, 28]               0
          Conv2d-107         [-1, 1024, 28, 28]         262,144
     BatchNorm2d-108         [-1, 1024, 28, 28]           2,048
            ReLU-109         [-1, 1024, 28, 28]               0
      Bottleneck-110         [-1, 1024, 28, 28]               0
          Conv2d-111          [-1, 256, 28, 28]         262,144
     BatchNorm2d-112          [-1, 256, 28, 28]             512
            ReLU-113          [-1, 256, 28, 28]               0
          Conv2d-114          [-1, 256, 28, 28]         589,824
     BatchNorm2d-115          [-1, 256, 28, 28]             512
            ReLU-116          [-1, 256, 28, 28]               0
          Conv2d-117         [-1, 1024, 28, 28]         262,144
     BatchNorm2d-118         [-1, 1024, 28, 28]           2,048
            ReLU-119         [-1, 1024, 28, 28]               0
      Bottleneck-120         [-1, 1024, 28, 28]               0
          Conv2d-121          [-1, 256, 28, 28]         262,144
     BatchNorm2d-122          [-1, 256, 28, 28]             512
            ReLU-123          [-1, 256, 28, 28]               0
          Conv2d-124          [-1, 256, 28, 28]         589,824
     BatchNorm2d-125          [-1, 256, 28, 28]             512
            ReLU-126          [-1, 256, 28, 28]               0
          Conv2d-127         [-1, 1024, 28, 28]         262,144
     BatchNorm2d-128         [-1, 1024, 28, 28]           2,048
            ReLU-129         [-1, 1024, 28, 28]               0
      Bottleneck-130         [-1, 1024, 28, 28]               0
          Conv2d-131          [-1, 256, 28, 28]         262,144
     BatchNorm2d-132          [-1, 256, 28, 28]             512
            ReLU-133          [-1, 256, 28, 28]               0
          Conv2d-134          [-1, 256, 28, 28]         589,824
     BatchNorm2d-135          [-1, 256, 28, 28]             512
            ReLU-136          [-1, 256, 28, 28]               0
          Conv2d-137         [-1, 1024, 28, 28]         262,144
     BatchNorm2d-138         [-1, 1024, 28, 28]           2,048
            ReLU-139         [-1, 1024, 28, 28]               0
      Bottleneck-140         [-1, 1024, 28, 28]               0
          Conv2d-141          [-1, 512, 28, 28]         524,288
     BatchNorm2d-142          [-1, 512, 28, 28]           1,024
            ReLU-143          [-1, 512, 28, 28]               0
          Conv2d-144          [-1, 512, 14, 14]       2,359,296
     BatchNorm2d-145          [-1, 512, 14, 14]           1,024
            ReLU-146          [-1, 512, 14, 14]               0
          Conv2d-147         [-1, 2048, 14, 14]       1,048,576
     BatchNorm2d-148         [-1, 2048, 14, 14]           4,096
          Conv2d-149         [-1, 2048, 14, 14]       2,097,152
     BatchNorm2d-150         [-1, 2048, 14, 14]           4,096
            ReLU-151         [-1, 2048, 14, 14]               0
      Bottleneck-152         [-1, 2048, 14, 14]               0
          Conv2d-153          [-1, 512, 14, 14]       1,048,576
     BatchNorm2d-154          [-1, 512, 14, 14]           1,024
            ReLU-155          [-1, 512, 14, 14]               0
          Conv2d-156          [-1, 512, 14, 14]       2,359,296
     BatchNorm2d-157          [-1, 512, 14, 14]           1,024
            ReLU-158          [-1, 512, 14, 14]               0
          Conv2d-159         [-1, 2048, 14, 14]       1,048,576
     BatchNorm2d-160         [-1, 2048, 14, 14]           4,096
            ReLU-161         [-1, 2048, 14, 14]               0
      Bottleneck-162         [-1, 2048, 14, 14]               0
          Conv2d-163          [-1, 512, 14, 14]       1,048,576
     BatchNorm2d-164          [-1, 512, 14, 14]           1,024
            ReLU-165          [-1, 512, 14, 14]               0
          Conv2d-166          [-1, 512, 14, 14]       2,359,296
     BatchNorm2d-167          [-1, 512, 14, 14]           1,024
            ReLU-168          [-1, 512, 14, 14]               0
          Conv2d-169         [-1, 2048, 14, 14]       1,048,576
     BatchNorm2d-170         [-1, 2048, 14, 14]           4,096
            ReLU-171         [-1, 2048, 14, 14]               0
      Bottleneck-172         [-1, 2048, 14, 14]               0
          Conv2d-173          [-1, 256, 14, 14]         524,288
     BatchNorm2d-174          [-1, 256, 14, 14]             512
          Conv2d-175          [-1, 256, 14, 14]         589,824
     BatchNorm2d-176          [-1, 256, 14, 14]             512
          Conv2d-177          [-1, 256, 14, 14]          65,536
     BatchNorm2d-178          [-1, 256, 14, 14]             512
          Conv2d-179          [-1, 256, 14, 14]         524,288
     BatchNorm2d-180          [-1, 256, 14, 14]             512
detnet_bottleneck-181          [-1, 256, 14, 14]               0
          Conv2d-182          [-1, 256, 14, 14]          65,536
     BatchNorm2d-183          [-1, 256, 14, 14]             512
          Conv2d-184          [-1, 256, 14, 14]         589,824
     BatchNorm2d-185          [-1, 256, 14, 14]             512
     BatchNorm2d-197           [-1, 30, 14, 14]              60
================================================================

Error reported:

RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same

Solution: run the model on the graphics card by moving it to the GPU before calling summary:

from torchsummary import summary
summary(net.cuda(), (3, 448, 448))
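Alternatively, if the model is to stay on the CPU, torchsummary's summary accepts a device argument (assuming a torchsummary build that exposes it); a sketch:

from torchsummary import summary

summary(net, (3, 448, 448), device="cpu")  # keep both the dummy input and the model on the CPU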