Tag Archives: Deep learning

Windows 10 DOTA_devkit Error: TypeError: '>=' not supported between instances of 'NoneType' and 'str'

1. The system configuration is: Windows 10 + Python 3.6 + Anaconda3 + SWIG 4.0.2. SWIG must be added to the system environment variables (PATH) manually.

2. Following the official README file, run the command [swig -c++ -python polyiou.i]. This step normally causes no problems.

3. When running the command [python setup.py build_ext --inplace], the error [if self.ld_version >= "2.10.90": TypeError: '>=' not supported between instances of 'NoneType' and 'str'] appears. In the Python installation path of the virtual environment, find the file Lib\distutils\distutils.cfg, open it and look for the following content: [build] compiler = mingw32. In my case the content of the file was correct.

4. Solution: after some investigation, it turned out that GCC needs to be configured by installing the mingw-w64 compiler. The command is [conda install libpython m2w64-toolchain -c msys2]. Then run [python setup.py build_ext --inplace] again. It now runs successfully, and the file _polyiou.cp36-win_amd64.pyd is generated.
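Both SWIG and the MinGW-w64 GCC have to be reachable from the command line for the steps above to work, so it can help to confirm that before building. The following is only a minimal sketch: it assumes the tools are invoked as swig and gcc, and the helper name find_build_tools is made up for illustration.

```python
import shutil


def find_build_tools(names=("swig", "gcc")):
    """Map each tool name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


# Report which of the assumed tool names can be found from this shell.
for tool, path in find_build_tools().items():
    print(f"{tool}: {path or 'NOT FOUND on PATH'}")
```

If either tool is reported as not found, the environment variable or the conda environment activation is the first thing to check.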

Deep learning model error + 1: CUDA error: device-side assert triggered

Scenario:
Some time ago, running the Fast R-CNN model on Google Colab caused no problems. Later, when I rented a server through Featurize to run the model, the same code kept reporting the error "CUDA error: device-side assert triggered".
This drove me crazy for two days. There are many blog posts about this error online. Most of them say that a label is out of bounds, and some say that the problem lies in the loss function calculation.
I could only debug step by step, and in the end I solved my own problem.

# When running on the GPU, these lines report "CUDA error: device-side assert triggered"
perm1 = torch.randperm(positive.numel(), device=positive.device)[:num_pos]
perm2 = torch.randperm(negative.numel(), device=negative.device)[:num_neg]

# After modification: change device to "cpu"
perm1 = torch.randperm(positive.numel(), device="cpu")[:num_pos]
perm2 = torch.randperm(negative.numel(), device="cpu")[:num_neg]

I am recording this here in the hope that it helps people in the same situation.
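Since many reports of this error trace back to out-of-range labels, a quick check before the loss call can rule that cause out. The sketch below is an illustration only: check_labels and num_classes are hypothetical names that do not appear in the original code, and NumPy is used so the check runs without a GPU.

```python
import numpy as np


def check_labels(labels, num_classes):
    """Raise ValueError if any label falls outside [0, num_classes)."""
    arr = np.asarray(labels)
    bad = arr[(arr < 0) | (arr >= num_classes)]
    if bad.size:
        raise ValueError(
            f"{bad.size} label(s) outside [0, {num_classes}): "
            f"{np.unique(bad).tolist()}"
        )
    return True


check_labels([0, 1, 2], num_classes=3)  # passes silently
```

Setting the environment variable CUDA_LAUNCH_BLOCKING=1 before starting the program makes CUDA calls synchronous, which usually makes the Python traceback point at the operation that actually failed.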

After installing a new RTX 3060 graphics card and configuring TensorFlow: how to fix tf.test.is_gpu_available() returning False

First of all, install everything in the normal way.
The necessary conditions for success are:
1. The version numbers must be correct: CUDA 11.1 or later must be installed (CUDA support for the 30-series Ampere architecture graphics cards starts with version 11.1).
Link: https://developer.nvidia.com/zh-cn/cuda-downloads
2. cuDNN must also be installed. Link (requires registering and logging in to an NVIDIA account): https://developer.nvidia.com/zh-cn/cudnn
If you have not installed them yet, you can refer to other posts: https://so.csdn.net/so/search/all?q=3060%20tensorflow&t=all&p=1&s=0&tm=0&lv=-1&ft=0&l=&u=
After installation, activate the environment you created and run tf.test.is_gpu_available().
If the computer detects the graphics card, the output lists the number of cores, compute capability and other parameters of each card, but the final answer may still be False.
If the command line reports that cusolver64_10.dll cannot be found, go to the following directory:

C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.1\bin

Rename cusolver64_11.dll to cusolver64_10.dll,
and then run tf.test.is_gpu_available() again.

It works!
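The rename step can also be scripted. The sketch below is a variation rather than the exact step described above: it copies cusolver64_11.dll to cusolver64_10.dll instead of renaming it, so the original file stays in place. The function name ensure_cusolver_alias is hypothetical, and writing to the CUDA directory normally requires administrator rights.

```python
import shutil
from pathlib import Path


def ensure_cusolver_alias(bin_dir, src="cusolver64_11.dll", dst="cusolver64_10.dll"):
    """Copy src to dst inside bin_dir if dst is missing.

    Returns True if a copy was made and False if dst already existed.
    """
    bin_dir = Path(bin_dir)
    source, target = bin_dir / src, bin_dir / dst
    if target.exists():
        return False
    if not source.exists():
        raise FileNotFoundError(f"{source} not found")
    shutil.copy2(source, target)
    return True


# Example call (path taken from the text above; adjust to your CUDA version):
# ensure_cusolver_alias(r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.1\bin")
```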

PyTorch: raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format

When loading a trained model, first find out whether it was trained on multiple GPUs.

If the model was trained normally on a single device, load it with PyTorch as usual:

model.load_state_dict(torch.load(model_path))

If multi-GPU training was used, that is, the model was wrapped like this during training:

model = torch.nn.DataParallel(model, device_ids=range(opt.ngpu))

then the saved parameter names carry a 'module.' prefix, and loading the model into an unwrapped network requires removing it:

model.load_state_dict({k.replace('module.',''):v for k,v in torch.load(model_path).items()})
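One caveat with k.replace('module.', '') is that it also rewrites keys where 'module.' appears in the middle of a name. A slightly more careful variant removes the prefix only at the start of each key. This is a sketch with plain values standing in for tensors, and strip_module_prefix is a hypothetical helper name, not a PyTorch function.

```python
from collections import OrderedDict


def strip_module_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading prefix removed from each key."""
    cleaned = OrderedDict()
    for key, value in state_dict.items():
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        cleaned[new_key] = value
    return cleaned


# Plain integers stand in for tensors; a real checkpoint maps names to tensors.
checkpoint = {"module.conv1.weight": 1, "module.fc.bias": 2, "submodule.weight": 3}
print(list(strip_module_prefix(checkpoint)))
# ['conv1.weight', 'fc.bias', 'submodule.weight']
```

With a real checkpoint the call would look like model.load_state_dict(strip_module_prefix(torch.load(model_path))).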

Libtorch Error: Expected object of type Variable but found type CUDALongType for argument #2 'index'

what(): Expected object of type Variable but found type CUDALongType for argument #2 'index' (checked_cast_variable at …/…/torch/csrc/autograd/VariableTypeManual.cpp:38)

Problem description: calling the libtorch function torch::index_select(detections_class_left, 0, index) reports the error above. The definition of index_select reads: static inline Tensor index_select(const Tensor & self, int64_t dim, const Tensor & index)

Problem analysis:
According to the error message, argument #2 'index' should be of type Variable, but I passed a Tensor (data type CUDALongType), so the call kept failing.

Problem solving:
1. Understand the difference between the Variable type and the Tensor type; for an explanation, see another article (a Zhihu article).

2. Convert the Tensor into a Variable with the function: torch::autograd::make_variable(left_index, false); // Tensor -> Variable

The GPU is still occupied after the program stops

When running deep learning programs, a program is sometimes forcibly terminated while the GPU resources it occupied are not released. For a long time I assumed that someone else was using the GPU, when in fact the GPU resources had been leaked.

You can use this command to view GPU usage on a Linux system:

nvidia-smi

The result is as shown in the figure

At this point, you can manually kill the process that occupies the GPU to release its resources (49461 is the process ID in this example):

kill -9 49461

If you used the screen command and a program running in the background has stopped but still occupies the GPU, you can also close all screen sessions to release the GPU:

killall screen

Of course, killing the process directly also works.
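To find the process IDs without reading the nvidia-smi table by eye, its CSV query mode can be parsed from a script. The sketch below assumes nvidia-smi is on PATH; the function names parse_compute_apps and list_gpu_processes are made up for illustration.

```python
import subprocess

# Ask nvidia-smi for one CSV line per compute process: pid, name, memory.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader",
]


def parse_compute_apps(text):
    """Parse 'pid, name, memory' CSV lines into a list of dicts."""
    rows = []
    for line in text.strip().splitlines():
        if line.count(",") < 2:
            continue  # skip lines that do not have three fields
        pid, rest = line.split(",", 1)
        name, memory = rest.rsplit(",", 1)
        rows.append({"pid": int(pid), "name": name.strip(), "memory": memory.strip()})
    return rows


def list_gpu_processes():
    """Run nvidia-smi and return the processes currently holding GPU memory."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_compute_apps(out)
```

Each returned pid can then be passed to kill, as shown above, once you have confirmed that the process is really yours.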

RuntimeError: Integer division of tensors using div or / is no longer supported, and in a future rel

RuntimeError: Integer division of tensors using div or / is no longer supported, and in a future release div will perform true division as in Python 3. Use true_divide or floor_divide (// in Python) instead.

from torchvision import transforms
import numpy as np

data = np.random.randint(0, 255, size=12)
img = data.reshape(2, 2, 3)
print(img.shape)
img_tensor = transforms.ToTensor()(img)  # Convert to tensor
print(img_tensor)
print(img_tensor.shape)
print("*" * 20)
norm_img = transforms.Normalize((10, 10, 10), (1, 1, 1))(img_tensor)  # Normalize; this line raises the error on PyTorch 1.6.0
print(norm_img)

Operation effect:

Reason:

The code runs on PyTorch 1.5.0, but after upgrading to 1.6.0, dividing an integer tensor with '/' is no longer allowed.
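The error message itself names the alternatives. As a minimal sketch of the three options on a small integer tensor (the variable names are only illustrative):

```python
import torch

counts = torch.tensor([5, 7, 9])           # integer (int64) tensor

floor = torch.floor_divide(counts, 2)      # integer result
true = torch.true_divide(counts, 2)        # floating-point result
as_float = counts.float() / 2              # cast to float first, then divide

print(floor.dtype, true.dtype, as_float.dtype)
```

Casting to float first is the option that matches normalization, where a floating-point result is wanted anyway.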

Solution:

Convert the tensor to a floating-point type before normalizing it.

Example code:

from torchvision import transforms
import numpy as np

data = np.random.randint(0, 255, size=12)
img = data.reshape(2, 2, 3)
print(img.shape)
img_tensor = transforms.ToTensor()(img)  # convert to tensor
print(img_tensor)
print(img_tensor.shape)
print("*" * 20)
img_tensor = img_tensor.float()  # Add this line
norm_img = transforms.Normalize((10, 10, 10), (1, 1, 1))(img_tensor) # Perform normalization
print(norm_img)

Results of operation:

Several solutions to HDF5 error reporting in Python environment

Several solutions to the HDF5 version mismatch error in a Python environment (personally tested).
The error message is as follows:
Warning! HDF5 library version mismatched error
The HDF5 header files used to compile this application do not match
the version used by the HDF5 library to which this application is linked.
Data corruption or segmentation faults may occur if the application continues.
This can happen when an application was compiled by one version of HDF5 but
linked with a different version of static or shared HDF5 library.
You should recompile the application or check your shared library related
settings such as 'LD_LIBRARY_PATH'.
You can, at your own risk, disable this warning by setting the environment
variable 'HDF5_DISABLE_VERSION_CHECK' to a value of '1'.
Setting it to 2 or higher will suppress the warning messages totally.
Headers are 1.10.4, library is 1.10.5

There are three ways to solve this problem.
First of all, this message may indicate a real mismatch between HDF5 library versions, or it may be little more than a warning. I will go through the options below.
The first solution: uninstall HDF5 and then install it again.
The command to run in the terminal is:
conda install hdf5
Many people online report that this method works. In my own test it did not help.
The second solution: check the configured path LD_LIBRARY_PATH.
In my own test: my system is Windows 10, and I could not find LD_LIBRARY_PATH for a long time. I later learned that it is a Linux path setting, so I did not use this method.
The third solution: set HDF5_DISABLE_VERSION_CHECK to a higher value so that the warning is ignored.
Before import tensorflow, add the following code:
import os
os.environ['HDF5_DISABLE_VERSION_CHECK'] = '2'
In my own test, this method really works!

The function of flatten layer in deep learning


The Flatten layer is implemented by the keras.layers.core.Flatten class, which is also importable as keras.layers.Flatten.

Purpose:

The Flatten layer "flattens" its input, that is, it turns a multi-dimensional input into a one-dimensional one. It is commonly used in the transition from convolutional layers to fully connected layers. Flatten does not affect the batch size.

Example:

from keras.models import Sequential
from keras.layers import Conv2D, Flatten
from keras.utils import plot_model


model = Sequential()
# Keras 2 API: Conv2D replaces Convolution2D, and padding replaces border_mode.
# input_shape=(3, 32, 32) is channels-first, so data_format is set explicitly.
model.add(Conv2D(64, (3, 3), padding="same", data_format="channels_first", input_shape=(3, 32, 32)))
# now: model.output_shape == (None, 64, 32, 32)

model.add(Flatten())
# now: model.output_shape == (None, 65536)

# plot_model requires pydot and Graphviz to be installed
plot_model(model, to_file='Flatten.png', show_shapes=True)

To better understand what the Flatten layer does, I visualized this neural network, as shown in the figure below:
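The shape arithmetic behind Flatten can be reproduced with a plain NumPy reshape. This is only a sketch of the idea and not how Keras implements the layer; the batch size of 2 is an arbitrary choice for illustration.

```python
import numpy as np

# A batch of 2 feature maps shaped (channels, height, width) = (64, 32, 32).
batch = np.zeros((2, 64, 32, 32))

# Keep the batch axis and merge all remaining axes into one.
flat = batch.reshape(batch.shape[0], -1)

print(flat.shape)  # (2, 65536)
```

The second dimension is 64 * 32 * 32 = 65536, which matches the output shape in the Keras example, and the batch dimension is left untouched.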

PyTorch: CUDA error: an illegal memory access was encountered

Error in pytorch1.6 training:

RuntimeError: CUDA error: an illegal memory access was encountered

The cause of the error is the same as that of the following error in lower versions of PyTorch (such as version 1.1):

RuntimeError: Expected object of backend CUDA but got backend CPU for argument https://blog.csdn.net/weixin_44414948/article/details/109783988

Cause of error:

The essence of this kind of error is that the model and the input data (input_image, input_label) have not all been moved to the GPU (CUDA).
Tip: when debugging, carefully check that every input variable and the network model have been moved to the GPU. I usually hit this error because I missed one or two of them.

Solution:

Move the model, input_image and input_label to the GPU. The example code is as follows:

model = model.cuda()
input_image = input_image.cuda()
input_label = input_label.cuda()

or

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
input_image = input_image.to(device)
input_label = input_label.to(device)

numpy.random.rand()

numpy.random.randn(d0, d1, …, dn) returns one or more sample values from the standard normal distribution.
numpy.random.rand(d0, d1, …, dn) returns random sample values in [0, 1).

numpy.random.rand(d0, d1, …, dn)

The rand function generates data in [0, 1) for the given dimensions, including 0 and excluding 1. Each dn gives the size of one dimension. The return value is an array of the specified shape.

np.random.rand(4,2)
array([[0.64959905, 0.14584702],
       [0.56862369, 0.5992007 ],
       [0.42512475, 0.83075541],
       [0.75685279, 0.00910825]])

np.random.rand(4,3,2) # shape: 4*3*2
array([[[0.07304796, 0.48810928],
        [0.59523586, 0.83281804],
        [0.47530734, 0.50402275]],

       [[0.63153869, 0.19636159],
        [0.93727986, 0.13564719],
        [0.11122609, 0.59646316]],

       [[0.17276155, 0.66621767],
        [0.81926792, 0.28781293],
        [0.20228714, 0.72412133]],

       [[0.29365696, 0.53956076],
        [0.19105394, 0.47044441],
        [0.85930046, 0.3867359 ]]])
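To make the difference between rand and randn concrete, the following sketch draws samples from both and checks their ranges. The seed and the sample size of 1000 x 3 are arbitrary choices for illustration.

```python
import numpy as np

np.random.seed(0)  # seed only so that repeated runs give the same numbers

u = np.random.rand(1000, 3)    # uniform samples in [0, 1)
n = np.random.randn(1000, 3)   # standard normal samples, can be negative

print(u.shape, u.min() >= 0.0, u.max() < 1.0)
print(n.shape, (n < 0).any())
```

The arguments are dimension sizes in both cases, so the two arrays have the same shape and differ only in the distribution the values are drawn from.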