Tag Archives: neural network

[Solved] RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`

[problem description]

The previous code can run normally. After the data set is expanded, the following errors are reported in the GPU program running the deep learning training model, but CUDA out of memory error is not prompted.

RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`

[solution 1]

Change the program to run on the CPU and find that it can run normally, but the speed will be very slow and it will take a long time.

--device cpu

[solution 2]

Try to reduce the batch size used in the training model, and it can run normally.

[Solved] Mindspore 1.3.0 Compile Error: CMake Error at cmake/utils

[Function Module].
Compile mindspore 1.3.0 with error CMake Error at cmake/utils.cmake:301 (message): Failed patch:

[Steps & Problems]
1. The compilation process after executing bash build.sh -e ascend gives an error, as follows:

-- Build files have been written to: /root/mindspore-v1.3.0/build/mindspore/_deps/icu4c-subbuild
[100%] Built target icu4c-populate
icu4c_SOURCE_DIR : /root/mindspore-v1.3.0/build/mindspore/_deps/icu4c-src
patching /root/mindspore-v1.3.0/build/mindspore/_deps/icu4c-src -p1 < /root/mindspore-v1.3.0/build/mindspore/_ms_patch/icu4c.patch01
patching file icu4c/source/runConfigureICU
Reversed (or previously applied) patch detected!  Assume -R?[n]
Apply anyway?[n]
Skipping patch.
1 out of 1 hunk ignored -- saving rejects to file icu4c/source/runConfigureICU.rej
CMake Error at cmake/utils.cmake:301 (message):
  Failed patch:
  /root/mindspore-v1.3.0/build/mindspore/_ms_patch/icu4c.patch01
Call Stack (most recent call first):
  cmake/external_libs/icu4c.cmake:36 (mindspore_add_pkg)
  cmake/mind_expression.cmake:85 (include)
  CMakeLists.txt:51 (include)

-- Configuring incomplete, errors occurred!
See also "/root/mindspore-v1.3.0/build/mindspore/CMakeFiles/CMakeOutput.log".
See also "/root/mindspore-v1.3.0/build/mindspore/CMakeFiles/CMakeError.log".

——————————————————————————————————————————-
This situation is usually due to the failure of the last compilation on the way, try to compile again, please try to deal with the following.

rm -rf mindspore/build/mindspore/_deps/icu*

Then retry the compilation.

RuntimeError: CUDA error: an illegal memory access was encountered

Question:

When I encountered this problem on the way to write the model, baidu either said it was the pytorch version problem or the category index exceeded, but it was useless, because the error was a very simple assignment operation.

scores[:, 0] = -float("inf") 
#RuntimeError: CUDA error: an illegal memory access was encountered

At the same time, in the process of debugging, it is found that a warning burst after the execution of a network of the model

lm_logits = self.linear(outputs) + self.bias
#warning:Thudacheck FAIL file=/pytorch/aten/c/THC/Thccachinghostallocator cpp Line=278 error=700: an illegal memory access was encountered

At first glance, both places are relatively simple, but they reported strange mistakes.

Solution:

The debug process found an exception

In the data data output by the pytorch network, the variable does not display the specific network output value, but the address information of the data

T:torch.Tensor object at 0x7fb27e7c8f30
data:torch.Tensor object at 0x7fb27e7c8f30

Later, it was found that it was because of self The linear layer is’ CPU ‘, while other networks are on’ CUDA ‘, which is equivalent to the inconsistency caused by the forward propagation of’ CUDA ‘type data to the’ CPU ‘network. Just transfer the network to’ CUDA ‘.

How to Solve Error: RuntimeError: all tensors must be on devices[0]

Problem description

The code running Zheng Zhedong’s aicity2020 reported an error. After searching, it was found that the problem is that the code running is more than GPU, but the specified code is a single GPU

Solution:

In test2020.py, the code comments in lines 126 and 129 are replaced by the following code

# set gpu ids
# if len(gpu_ids)>0:
# torch.cuda.set_device(gpu_ids[0])
# cudnn.benchmark = True
os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_ids[0])

[Solved] Pytorch Download CIFAR1 Datas Error: urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certi

urllib.error.URLError: < urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certi

Solution:

Add the following two lines of code before the code starts:

import ssl
ssl._create_default_https_context = ssl._create_unverified_context

Complete example:

import torch
import torchvision
import torchvision.transforms as transforms
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
#Download the data set and adjust the image, because the output of the torchvision data set is in PILImage format, and the data field is in [0,1]
#We convert it into the tensor format of the standard data field [-1,1]
#transform Data Converter
transform=transforms.Compose([transforms.ToTensor(),transforms.Normalize((0.5,0.5,0.5),(0.5,0.5,0.5))])
trainset=torchvision.datasets.CIFAR10(root='./data',train=True,download=True,transform=transform)
# The downloaded data is placed in the trainset
trainloader=torch.utils.data.DataLoader(trainset,batch_size=4,shuffle=True,num_workers=2)
# DataLoader Data Iterator Encapsulate data into DataLoader
# num_workers: Two threads read data
# batch_size=4 batch processing

testset=torchvision.datasets.CIFAR10(root='./data',train=False,download=True,transform=transform)
testloader=torch.utils.data.DataLoader(testset,batch_size=4,shuffle=False,num_workers=2)
classes=('airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

Download result

Using next (ITER (data. Dataloader()) to report an error stopiteration

The error stopiteration is reported when using next (ITER (data. Dataloader()). This is because when using next() When accessing an iterator that has been iterated, an error will be triggered: stopiteration, that is, after a round of iteration after the dataloader imports the data, it is found that there is no data when importing again, that is, after the Iterable is completed, stopiteration is triggered, and then the loop jumps out

resolvent:

Since there is no data when importing again, we can use a data loader.

Put the in train.py

inps, targets = next(self.batch_iterator)

Change to:

try:
    inps, targets = next(self.batch_iterator)
except StopIteration:
    self.batch_iterator = iter(data.DataLoader(self.train_dataset, self.args.batch_size, shuffle=True, num_workers=self.args.num_workers, collate_fn=detection_collate))
    inps, targets = next(self.batch_iterator)

Problem solving.

Mmdetection reports an error when running its own coco dataset. Does not match the length of \ ` classes \ ` 80) in cocodataset

Mmdetection trains its own data set to report errors ⚠️ :

# AssertionError: The `num_ classes` (3) in Shared2FCBBoxHead of MMDataParallel does not matches the length of `CLASSES` 80) in CocoDataset

This means that the category (3) you specified does not match the category (80) of cocodataset.

You may have modified the following files, but you still report an error:

mmdetection-master\mmdet\core\evaluation\class_ names.py

 def coco_classes():
     return [
         # 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train',
         # 'truck', 'boat', 'traffic_light', 'fire_hydrant', 'stop_sign',
         # 'parking_meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep',
         # 'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella',
         # 'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard',
         # 'sports_ball', 'kite', 'baseball_bat', 'baseball_glove', 'skateboard',
         # 'surfboard', 'tennis_racket', 'bottle', 'wine_glass', 'cup', 'fork',
         # 'knife', 'spoon', 'bowl', 'banana', 'apple', 'sandwich', 'orange',
         # 'broccoli', 'carrot', 'hot_dog', 'pizza', 'donut', 'cake', 'chair',
         # 'couch', 'potted_plant', 'bed', 'dining_table', 'toilet', 'tv',
         # 'laptop', 'mouse', 'remote', 'keyboard', 'cell_phone', 'microwave',
         # 'oven', 'toaster', 'sink', 'refrigerator', 'book', 'clock', 'vase',
         # 'scissors', 'teddy_bear', 'hair_drier', 'toothbrush'
         'lm','ls'
     ]

mmdetection-master\mmdet\datasets\coco.py

 class CocoDataset(CustomDataset):
     # CLASSES = (
     #     'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
     #            'train', 'truck', 'boat', 'traffic light', 'fire hydrant',
     #            'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog',
     #            'horse', 'sheep', 'cow', 'elephant', 'bear', 'zebra', 'giraffe',
     #            'backpack', 'umbrella', 'handbag', 'tie', 'suitcase', 'frisbee',
     #            'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat',
     #            'baseball glove', 'skateboard', 'surfboard', 'tennis racket',
     #            'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl',
     #            'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot',
     #            'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch',
     #            'potted plant', 'bed', 'dining table', 'toilet', 'tv', 'laptop',
     #            'mouse', 'remote', 'keyboard', 'cell phone', 'microwave',
     #            'oven', 'toaster', 'sink', 'refrigerator', 'book', 'clock',
     #            'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush'
     #     )
     CLASSES = ('lm', 'ls')

No more nonsense, go straight to the method. There are several methods:

one ️⃣ If you have two classes, you can replace the first two classes in the above two places with your classes. The method is relatively simple, but there may be hidden dangers.

two ️⃣ The second method is to modify the class_ After names.py and voc.py, the code must be recompiled (run Python setup.py install) and then trained.

I tried, but I still made the same mistake. Maybe my method is wrong.

reference resources:

New mmdetection v2.3.0 training test notes – it610.com

Mmdetectionv2. X version trains its own VOC dataset_ Peach jam Momo blog – CSDN blog

three ️⃣ The third method, which I use, is actually the same as recompilation. The reason for recompilation is that you report an error because the source file in the environment has not been modified. There are only some Python files in the mmdetection master directory. When the program is actually running, it is still the source file in the environment, because we directly modify the source file in the environment.

Suppose my CONDA environment is called CONDA_ env_ Name, so go to the following directory and modify two files respectively:

\anaconda3\envs\conda_ env_ name\lib\python3.7\site-packages\mmdet\core\evaluation\class_ names.py

\anaconda3\envs\conda_ env_ name\lib\python3.7\site-packages\mmdet\datasets\coco.py

Modify the categories in these two files in the CONDA environment.

⭐ In the end, I did my best to solve this bug and write this blog to help you avoid detours.

RuntimeError: Expected hidden[0] size (x, x, x), got(x, x, x)

Start with the above figure:

The above figure shows the problem when training the bilstm network.

Problem Description: define the initial weights H0 and C0 of bilstm network and input them to the network as the initial weight of bilstm, which is realized by the following code

output, (hn, cn) = self.bilstm(input, (h0, c0))

  The network structure is as follows:

self.bilstm = nn.LSTM(
            input_size=self.input_size,
            hidden_size=self.hidden_size,
            num_layers=self.num_layers,
            bidirectional=True,
            bias=True,
            dropout=config.drop_out
        )

The dimension of initial weight is defined as   H0 and C0 are initialized. The dimension is:

**h_0** of shape `(num_layers * num_directions, batch, hidden_size)`
**c_0** of shape `(num_layers * num_directions, batch, hidden_size)`

In bilstm network, the parameters are defined as follows:

num_layers: 2

num_directions: 2

batch: 4

seq_len: 10

input_size: 300

hidden_size: 100 

Then according to the definition in the official documents    H0, C0 dimensions should be: (2 * 2, 4100) = (4, 4100)

However, according to the error screenshot at the beginning of the article, the dimension of the initial weight of the hidden layer should be (4, 10100), which makes me doubt whether the dimension specified in the official document is correct.

Obviously, the official documents cannot be wrong, and the hidden state dimensions when using blstm, RNN and bigru in the past are the same as those specified by the official, so I don’t know where to start.

Therefore, we re examined the network structure and found that an important parameter, batch, was missing_ First, let’s take a look at all the parameters required by bilstm:

Args:
        input_size: The number of expected features in the input `x`
        hidden_size: The number of features in the hidden state `h`
        num_layers: Number of recurrent layers. E.g., setting ``num_layers=2``
            would mean stacking two LSTMs together to form a `stacked LSTM`,
            with the second LSTM taking in outputs of the first LSTM and
            computing the final results. Default: 1
        bias: If ``False``, then the layer does not use bias weights `b_ih` and `b_hh`.
            Default: ``True``
        batch_first: If ``True``, then the input and output tensors are provided
            as (batch, seq, feature). Default: ``False``
        dropout: If non-zero, introduces a `Dropout` layer on the outputs of each
            LSTM layer except the last layer, with dropout probability equal to
            :attr:`dropout`. Default: 0
        bidirectional: If ``True``, becomes a bidirectional LSTM. Default: ``False``

batch_ The first parameter can make the dimension batch in the first dimension during training, that is, the input data dimension is

(batch size, SEQ len, embedding dim), if not added   batch_ First = true, the dimension is

(seq len,batch size,embedding dim)

Because there was no break at noon, I vaguely forgot to add this important parameter, resulting in an error: the initial weight dimension is incorrect, and I can add it   batch_ Run smoothly after first = true.

The modified network structure is as follows:

self.bilstm = nn.LSTM(
            input_size=self.input_size,
            hidden_size=self.hidden_size,
            num_layers=self.num_layers,
            batch_first=True,
            bidirectional=True,
            bias=True,
            dropout=config.drop_out
        )

 

Extension: when we use RNN and its variant network, if we want to add the initial weight, the dimension must be the officially specified dimension, i.e

(num_layers * num_directions, batch, hidden_size)

At the same time, be sure to set batch_ First = true. The official document does not specify when batch is set_ When first = true, the dimensions of H0, C0, HN and CN are (num_layers * num_directions, batch, hidden_size), so be careful!

At the same time, check whether batch is set when the dimensions of HN and CN are incorrect_ First parameter, RNN and its variant networks are applicable to this method!

RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

RuntimeError: cuDNN error: CUDNN_ STATUS_ EXECUTION_ FAILED

The error is in   cuda:10.0     Pytorch: 1.2 problems in the training model under the GPU server environment, error prompt   Cudnn status execution failed

The problem with this error is that CUDA’s version does not correspond to pytorch’s version, resulting in CUDA’s failure to speed up model training and execute at the same time.

When downloading pytorch, we need to correctly download the corresponding relationship between pytorch and CUDA version on the official website. In the local training model, my environment is CUDA 10.0 and pytorch 1.9. Therefore, reinstall pytorch version 1.9 in the server and run successfully.

Performance: CUDA’s version does not correspond to pytorch’s version. The most obvious performance is that when running the program, the video memory does not change. When the normally loaded data and model enter the video memory, the video memory will increase significantly, while when the version does not correspond, the video memory does not change significantly. At the same time, the program will be very slow when loading the model, and even the model cannot be loaded into the video memory for 20 minutes.

[Solved] AttributeError: module ‘tensorflow._api.v2.train‘ has no attribute ‘AdampOptimizer‘

Error Message: AttributeError: module ‘tensorflow._api.v2.train’ has no attribute ‘AdampOptimizer’
Tensorflow 2.6.0:

>>> import tensorflow as tf
>>> tf.__version__
'2.6.0'

Error code snippet:

model.compile(optimizer=tf.train.AdampOptimizer(0.001),
              loss='mse',
              metrics=['accuracy'])

Replace with:

from tensorflow.keras.optimizers import Adam
model.compile(optimizer=Adam(learning_rate=0.001), loss='mse', metrics=['accuracy'])

It can also be changed as follows:

model.compile(optimizer='adam',
              loss='mse',
              metrics=['accuracy'])

Both are operational.