Tag Archives: Deep learning

RuntimeError: CUDNN_STATUS_EXECUTION_FAILED [How to Solve]

The error is as follows:

Traceback (most recent call last):
  File "main.py", line 23, in <module>
    t.train()
  File "c:\Paper Code\RCAN-master-Real\RCAN_TrainCode\code\trainer.py", line 51, in train
    sr = self.model(lr, idx_scale)
  File "C:\Anaconda3\envs\pytorch0.4.0\lib\site-packages\torch\nn\modules\module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "c:\Paper Code\RCAN-master-Real\RCAN_TrainCode\code\model\__init__.py", line 54, in forward
    return self.model(x)
  File "C:\Anaconda3\envs\pytorch0.4.0\lib\site-packages\torch\nn\modules\module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "c:\Paper Code\RCAN-master-Real\RCAN_TrainCode\code\model\rcan.py", line 107, in forward
    x = self.sub_mean(x)
  File "C:\Anaconda3\envs\pytorch0.4.0\lib\site-packages\torch\nn\modules\module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Anaconda3\envs\pytorch0.4.0\lib\site-packages\torch\nn\modules\conv.py", line 301, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: CUDNN_STATUS_EXECUTION_FAILED

Before code modification:

if __name__ == '__main__':
    torch.manual_seed(args.seed)
    checkpoint = utility.checkpoint(args)

    if checkpoint.ok:
        loader = data.Data(args)
        model = model.Model(args, checkpoint)
        loss = loss.Loss(args, checkpoint) if not args.test_only else None
        t = Trainer(args, loader, model, loss, checkpoint)
        while not t.terminate():
            t.train()
            t.test()

        checkpoint.done()

After code modification:

if __name__ == '__main__':
    torch.backends.cudnn.enabled = False
    torch.manual_seed(args.seed)
    checkpoint = utility.checkpoint(args)

    if checkpoint.ok:
        loader = data.Data(args)
        model = model.Model(args, checkpoint)
        loss = loss.Loss(args, checkpoint) if not args.test_only else None
        t = Trainer(args, loader, model, loss, checkpoint)
        while not t.terminate():
            t.train()
            t.test()

        checkpoint.done()

It now runs normally.
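
For reference, a hedged sanity check (standard PyTorch calls, not part of the original post) to confirm which CUDA/cuDNN build is actually in use before falling back to the non-cuDNN kernels:

import torch

print(torch.version.cuda)                    # CUDA version PyTorch was built with
print(torch.backends.cudnn.is_available())   # whether cuDNN can be used at all
print(torch.backends.cudnn.version())        # cuDNN version PyTorch links against

# The workaround above simply disables cuDNN so that native convolution
# kernels are used instead, at some cost in speed.
torch.backends.cudnn.enabled = False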

[Solved] MindSpore Error: RuntimeError: _kernel.cc:88 CheckParam] AddN output shape must be equal to input…

This occurs when rewriting the MindSpore WithLossCell and TrainOneStepCell interfaces.

Error: RuntimeError: _kernel.cc:88 CheckParam] AddN output shape must be equal to input shape. Trace: In file add_impl.py(272)/ return F.addn((x, y))/

Solution: do not return multiple values from the construct method of WithLossCell; otherwise the GradOperation inside the construct of TrainOneStepCell will report this error.
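
A minimal hedged sketch of a custom WithLossCell that satisfies this constraint (the backbone, loss function, and optimizer names are illustrative, not from the original post):

import mindspore.nn as nn

class MyWithLossCell(nn.Cell):
    """Wraps a backbone and a loss function; construct returns a single loss value."""
    def __init__(self, backbone, loss_fn):
        super().__init__(auto_prefix=False)
        self.backbone = backbone
        self.loss_fn = loss_fn

    def construct(self, data, label):
        out = self.backbone(data)
        # Return only the loss. Returning (loss, out) here is what breaks the
        # GradOperation call inside TrainOneStepCell.
        return self.loss_fn(out, label)

# net_with_loss = MyWithLossCell(backbone, loss_fn)
# train_net = nn.TrainOneStepCell(net_with_loss, optimizer)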

Keras: KeyError: 'Failed to format this callback filepath: {val_loss:.4f}.h5. Reason: \'val_loss\''

If you use Keras's ImageDataGenerator for image augmentation as follows:

validataion_generator = validation_datagen.flow_from_directory(validation_dir,
                                                               target_size=target_size,
                                                               batch_size=batch_size,
                                                               class_mode='categorical',
                                                               subset="validation")

In addition, if your training set and validation set are two separate folders, and the filename of the saved model is required to contain the model's loss on the validation set, then the error described in the title is likely to occur:

Keras: KeyError: 'Failed to format this callback filepath: {val_loss:.4f}.h5. Reason: \'val_loss\''
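
For context, this KeyError typically comes from a ModelCheckpoint callback whose filepath template references val_loss; a hedged sketch (the callback filename is illustrative, not from the original post):

from tensorflow.keras.callbacks import ModelCheckpoint

# 'val_loss' only exists in the training logs if validation data is actually
# evaluated each epoch, e.g. model.fit(..., validation_data=validataion_generator).
checkpoint = ModelCheckpoint(
    filepath='model_{epoch:02d}_{val_loss:.4f}.h5',
    monitor='val_loss',
    save_best_only=True)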

At this point you only need to delete the subset parameter. The subset parameter is meant for splitting a single batch of data into training and validation parts, and it only needs to be set when validation_split is used for that split. The specific modification is as follows:

validataion_generator = validation_datagen.flow_from_directory(validation_dir,
                                                               target_size=target_size,
                                                               batch_size=batch_size,
                                                               class_mode='categorical')

Re-run the program: no error is reported and training proceeds normally.

[Solved] RuntimeError: each element in list of batch should be of equal size

RuntimeError: each element in list of batch should be of equal size

Contents: 1. Example code  2. Running result  3. Reason for the error  4. batch_size = 2  5. Analysis of the cause  6. Complete code

1. Example code

"""
Complete the preparation of the dataset
"""
from torch.utils.data import DataLoader, Dataset
import os
import re

def tokenlize(content):
    content = re.sub('<.*?>', ' ', content, flags=re.S)
    filters = ['!', '"', '#', '$', '%', '&', '\(', '\)', '\*', '\+', ',', '-', '\.', '/', ':', ';', '<', '=', '>', '\?',
               '@', '\[', '\\', '\]', '^', '_', '`', '\{', '\|', '\}', '~', '\t', '\n', '\x97', '\x96', '”', '“', ]

    content = re.sub('|'.join(filters), ' ', content)
    tokens = [i.strip().lower() for i in content.split()]

    return tokens


class ImdbDataset(Dataset):
    def __init__(self, train=True):
        self.train_data_path = r'E:\Python资料\视频\Py5.0\00.8-12课件资料V5.0\阶段9-人工智能NLP项目\第四天\代码\data\aclImdb_v1\aclImdb\train'
        self.test_data_path = r'E:\Python资料\视频\Py5.0\00.8-12课件资料V5.0\阶段9-人工智能NLP项目\第四天\代码\data\aclImdb_v1\aclImdb\test'
        data_path = self.train_data_path if train else self.test_data_path

        temp_data_path = [os.path.join(data_path, 'pos'), os.path.join(data_path, 'neg')]
        self.total_file_path = [] 
        for path in temp_data_path:
            file_name_list = os.listdir(path)
            file_path_list = [os.path.join(path, i) for i in file_name_list if i.endswith('.txt')]
            self.total_file_path.extend(file_path_list)

    def __getitem__(self, idx):
        file_path = self.total_file_path[idx]
        # get the label
        label_str = file_path.split('\\')[-2]
        label = 0 if label_str == 'neg' else 1
        # read the file content
        # tokenize
        tokens = tokenlize(open(file_path).read())
        return tokens, label

    def __len__(self):
        return len(self.total_file_path)


def get_dataloader(train=True):
    imdb_dataset = ImdbDataset(train)
    print(imdb_dataset[1])
    data_loader = DataLoader(imdb_dataset, batch_size=2, shuffle=True)
    return data_loader


if __name__ == '__main__':
    for idx, (input, target) in enumerate(get_dataloader()):
        print('idx', idx)
        print('input', input)
        print('target', target)
        break

2. Running result

Running the example code with batch_size=2 raises: RuntimeError: each element in list of batch should be of equal size

3. Reason for the error

dataloader = DataLoader(dataset=dataset, batch_size=2, shuffle=True)

If batch_size=2 is changed to batch_size=1, no error is reported and the code runs normally.

4. batch_size = 2

However, what if you want to keep batch_size=2? How can this be solved?

How to resolve it:

The problem is caused by the collate_fn parameter of the DataLoader.

The default value of collate_fn is torch's built-in default_collate. collate_fn is used to process each batch, and the default default_collate fails to process this data.

Solution:

    1. First convert the data into numeric sequences and check whether the results meet the requirements (no similar error occurred before the DataLoader was used).
    2. Customize a collate_fn and observe the result.

Here, method 2 is used: define a custom collate_fn and then observe the result:

import torch
import config  # project config assumed to provide ws (word-to-index mapping) and max_len

def collate_fn(batch):
    """
    Process one batch of data.
    :param batch: [result of one __getitem__, result of __getitem__, result of __getitem__, ...]
    :return: tuple
    """
    reviews, labels = zip(*batch)
    reviews = torch.LongTensor([config.ws.transform(i, max_len=config.max_len) for i in reviews])
    labels = torch.LongTensor(labels)

    return reviews, labels


5. Analysis of the cause

From the error message you can trace the error to the collate.py source file; it occurs in the default_collate() function. A quick search shows that default_collate is the default batch-processing function of the DataLoader class: if the collate_fn parameter is not specified when defining the DataLoader, the function in the source code below is called by default. If you hit the error above, it is raised in the Sequence branch near the end of this function (the raise RuntimeError line).

Source code:

def default_collate(batch):
    r"""Puts each data field into a tensor with outer dimension batch size"""

    elem = batch[0]
    elem_type = type(elem)
    if isinstance(elem, torch.Tensor):
        out = None
        if torch.utils.data.get_worker_info() is not None:
            # If we're in a background process, concatenate directly into a
            # shared memory tensor to avoid an extra copy
            numel = sum([x.numel() for x in batch])
            storage = elem.storage()._new_shared(numel)
            out = elem.new(storage)
        return torch.stack(batch, 0, out=out)
    elif elem_type.__module__ == 'numpy' and elem_type.__name__ != 'str_' \
            and elem_type.__name__ != 'string_':
        if elem_type.__name__ == 'ndarray' or elem_type.__name__ == 'memmap':
            # array of string classes and object
            if np_str_obj_array_pattern.search(elem.dtype.str) is not None:
                raise TypeError(default_collate_err_msg_format.format(elem.dtype))

            return default_collate([torch.as_tensor(b) for b in batch])
        elif elem.shape == ():  # scalars
            return torch.as_tensor(batch)
    elif isinstance(elem, float):
        return torch.tensor(batch, dtype=torch.float64)
    elif isinstance(elem, int_classes):
        return torch.tensor(batch)
    elif isinstance(elem, string_classes):
        return batch
    elif isinstance(elem, container_abcs.Mapping):
        return {key: default_collate([d[key] for d in batch]) for key in elem}
    elif isinstance(elem, tuple) and hasattr(elem, '_fields'):  # namedtuple
        return elem_type(*(default_collate(samples) for samples in zip(*batch)))
    elif isinstance(elem, container_abcs.Sequence):
        # check to make sure that the elements in batch have consistent size
        it = iter(batch)
        elem_size = len(next(it))
        if not all(len(elem) == elem_size for elem in it):
            raise RuntimeError('each element in list of batch should be of equal size')
        transposed = zip(*batch)
        return [default_collate(samples) for samples in transposed]

    raise TypeError(default_collate_err_msg_format.format(elem_type))

This function receives a tuple of batch data; each element of the tuple is what the __getitem__() method of your Dataset class returns, and the tuple length equals your batch_size. In the iterable that the DataLoader finally returns, however, each field consists of the corresponding fields of the batch_size samples stitched together.

Therefore, when this function is called by default, the first pass reaches the statement near the end, return [default_collate(samples) for samples in transposed]: the zip function turns the batch tuple into an iterable, the samples' matching fields are extracted by iteration and passed recursively back into default_collate(). If the field taken out is one of the types listed above, the dataset content is returned correctly.

If the batch data is processed in this order, the error above does not occur. If, after the second recursion, the element's data is still not one of the listed types, a third recursion is entered; at that point, even if the data can be returned, it usually no longer meets our requirements, and the error typically appears after the third recursion. So to fix this error, carefully check the data types of the fields returned by your Dataset class. You can also print the batch content before and after processing inside default_collate() to see the concrete processing flow and locate the field with the wrong data type.
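
As a hedged minimal reproduction (the import path may differ slightly across torch versions), two samples whose token lists have different lengths are enough to trigger the error:

from torch.utils.data._utils.collate import default_collate

# Each sample is (tokens, label); the token lists have lengths 3 and 1.
batch = [(["a", "great", "movie"], 1), (["terrible"], 0)]
default_collate(batch)
# RuntimeError: each element in list of batch should be of equal size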

Friendly tip: do not modify default_collate() in the source file. Copy the code instead, define your own collate_fn() function, and pass it via the collate_fn argument when instantiating the DataLoader class.

6. Complete code

"""
Complete the preparation of the dataset
"""
from torch.utils.data import DataLoader, Dataset
import os
import re
import torch

def tokenlize(content):
    content = re.sub('<.*?>', ' ', content, flags=re.S)
    # filters = ['!', '"', '#', '$', '%', '&', '\(', '\)', '\*', '\+', ',', '-', '\.', '/', ':', ';', '<', '=', '>', '\?',
    #            '@', '\[', '\\', '\]', '^', '_', '`', '\{', '\|', '\}', '~', '\t', '\n', '\x97', '\x96', '”', '“', ]
    filters = ['\.', '\t', '\n', '\x97', '\x96', '#', '$', '%', '&']
    content = re.sub('|'.join(filters), ' ', content)
    tokens = [i.strip().lower() for i in content.split()]
    return tokens


class ImdbDataset(Dataset):
    def __init__(self, train=True):
        self.train_data_path = r'.\aclImdb\train'
        self.test_data_path = r'.\aclImdb\test'
        data_path = self.train_data_path if train else self.test_data_path

        temp_data_path = [os.path.join(data_path, 'pos'), os.path.join(data_path, 'neg')]
        self.total_file_path = []  
        for path in temp_data_path:
            file_name_list = os.listdir(path)
            file_path_list = [os.path.join(path, i) for i in file_name_list if i.endswith('.txt')]
            self.total_file_path.extend(file_path_list)

    def __getitem__(self, idx):
        file_path = self.total_file_path[idx]
        label_str = file_path.split('\\')[-2]
        label = 0 if label_str == 'neg' else 1
        tokens = tokenlize(open(file_path).read().strip())  
        return label, tokens

    def __len__(self):
        return len(self.total_file_path)

def collate_fn(batch):
    # batch is a list of (label, tokens) pairs; regroup it into (labels, texts)
    batch = list(zip(*batch))
    labels = torch.tensor(batch[0], dtype=torch.int32)
    texts = batch[1]
    del batch
    return labels, texts


def get_dataloader(train=True):
    imdb_dataset = ImdbDataset(train)
    data_loader = DataLoader(imdb_dataset, batch_size=2, shuffle=True, collate_fn=collate_fn)
    return data_loader

if __name__ == '__main__':
    for idx, (input, target) in enumerate(get_dataloader()):
        print('idx', idx)
        print('input', input)
        print('target', target)
        break

Hopefully this helps you fix the bug and get your model running as soon as possible!

ONNX to TensorRT Model Conversion Error [How to Solve]

Error message:

[2021-07-26 07:16:07   ERROR] 2: [ltWrapper.cpp::setupHeuristic::327] Error Code 2: Internal Error (Assertion cublasStatus == CUBLAS_STATUS_SUCCESS failed.)
terminate called after throwing an instance of 'std::runtime_error'
  what():  Failed to create object
Abandoned (core dumped)

Solution:
The error is reported here because CUDA 10.2 is being used.
Download Patch 1 for CUDA 10.2 from the CUDA Toolkit download page and install it.

Python Run Error: TypeError: hog() got an unexpected keyword argument 'visualise'

Running the Python code reports the error "TypeError: hog() got an unexpected keyword argument 'visualise'" for the call:

fd, hog_image = hog(image, orientations=8, pixels_per_cell=(12, 12),
                    cells_per_block=(1, 1), visualise=True)

It runs normally after changing visualise to visualize, i.e. changing the letter s to z:

 fd, hog_image = hog(image, orientations=8, pixels_per_cell=(12, 12),
                    cells_per_block=(1, 1), visualize=True)

[Solved] ModuleNotFoundError: No module named 'torchtext.legacy.data.datasets_utils'

Cause: the pytorch version is too low.
Solution:
Step 1: Find the utils.py file inside the installed package at E:\anaconda\package\envs\pytorch_gpu\Lib\site-packages\d2lzh_pytorch and open it.
Step 2: Change the import torchtext inside it to import torchtext.legacy as torchtext.
Step 3: Close Jupyter and reopen it; the import now succeeds.
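
A hedged sketch of what the edited line in d2lzh_pytorch/utils.py looks like after Step 2 (assuming a torchtext release that still ships the legacy namespace, roughly 0.9–0.11):

# utils.py
import torchtext.legacy as torchtext

# The rest of the file keeps calling torchtext.data / torchtext.vocab as before;
# those names now resolve to the legacy implementations.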

[Solved] ValueError: Connection error, and we cannot find the requested files in the cached path…

Error:

self.tokenizer = CamembertTokenizer.from_pretrained(“camembert-base”)
resolved_vocab_files[file_id] = cached_path(
output_path = get_from_cache(
raise ValueError(
ValueError: Connection error, and we cannot find the requested files in the cached path. Please try again or make sure your Internet connection is on.

Solution:
When running from the command line, type:

TRANSFORMERS_OFFLINE=1  python test.py

In an sh script it is:
TRANSFORMERS_OFFLINE=1 \
python test.py

When running the code, refer to the Hugging Face documentation (quoted below) for the reason the error is reported:

Firewalled environments
Some cloud and intranet setups have their GPU instances firewalled to the outside world, so if your script is trying to download model weights or datasets it will first hang and then timeout with an error message like:
ValueError: Connection error, and we cannot find the requested files in the cached path.
Please try again or make sure your Internet connection is on.
One possible solution in this situation is to use the “offline-mode”.

Solution: Offline mode

It’s possible to run 🤗 Transformers in a firewalled or a no-network environment.
Setting environment variable TRANSFORMERS_OFFLINE=1 will tell 🤗 Transformers to use local files only and will not try to look things up.
Most likely you may want to couple this with HF_DATASETS_OFFLINE=1 that performs the same for 🤗 Datasets if you’re using the latter.
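
The same effect can also be achieved from inside the script; a hedged sketch (the test.py contents are assumed, not from the original post) that sets the variables before transformers is imported:

import os

# Equivalent to exporting the variables in the shell before running test.py.
os.environ["TRANSFORMERS_OFFLINE"] = "1"
os.environ["HF_DATASETS_OFFLINE"] = "1"

from transformers import CamembertTokenizer

# Now from_pretrained only looks in the local cache and does not try the network.
tokenizer = CamembertTokenizer.from_pretrained("camembert-base")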

[Solved] CUDA driver version is insufficient for CUDA runtime version

CUDA driver version is insufficient for CUDA runtime version

Problem:

An error is reported when running the OneFlow code of InsightFace in Docker:

 Failed to get cuda runtime version: CUDA driver version is insufficient for CUDA runtime version

Reason:

1. Check the CUDA runtime version:

cat /usr/local/cuda/version.txt

The CUDA version in my Docker container is 10.0.130:

CUDA Version 10.0.130

2. Each CUDA version has requirements on the graphics driver version; see the following link:
https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html

CUDA Toolkit                                          Linux x86_64 Driver Version    Windows x86_64 Driver Version
CUDA 11.0.3 Update 1
CUDA 11.0.2 GA                                        >= 450.51.05                   >= 451.48
CUDA 11.0.1 RC                                        >= 450.36.06                   >= 451.22
CUDA 10.2.89                                          >= 440.33                      >= 441.22
CUDA 10.1 (10.1.105 general release, and updates)     >= 418.39                      >= 418.96
CUDA 10.0.130                                         >= 410.48                      >= 411.31
CUDA 9.2 (9.2.148 Update 1)                           >= 396.37                      >= 398.26
CUDA 9.2 (9.2.88)                                     >= 396.26                      >= 397.44

cat /proc/driver/nvidia/version shows that the server's graphics driver is 418.67, which corresponds to CUDA 10.1, yet the CUDA I had installed was 10.0.130:

NVRM version: NVIDIA UNIX x86_64 Kernel Module  418.67  Sat Apr  6 03:07:24 CDT 2019
GCC version:  gcc version 7.3.0 (Ubuntu 7.3.0-27ubuntu1~18.04)

Solution:

Install CUDA 10.1.

(1) First, go to https://developer.nvidia.com/cuda-toolkit-archive and, according to the machine environment, download the corresponding CUDA 10.1 installation file. For the installer type I chose runfile (local); its installation steps are simpler.

wget https://developer.download.nvidia.com/compute/cuda/10.1/Prod/local_installers/cuda_10.1.243_418.87.00_linux.run

(2) Installation

sh cuda_10.1.243_418.87.00_linux.run

The same error still occurred and remains unresolved; this post will be updated when a solution is found.

ModuleNotFoundError: No module named 'notebook'

Problem: ModuleNotFoundError: No module named 'notebook'

This problem occurred when running a notebook today; here is how to solve it.

Solution

    Open a terminal: press Win + R, type "cmd", then press Enter.
    Activate the environment in which you run the code: conda activate <your environment name>.
    After entering the environment, type python -m pip install jupyter and press Enter.
    A success message at the end indicates the installation succeeded.
    Then type: ipython notebook
    When the notebook page opens, the problem is solved.

new_lrs[:5] = lr_warm [12] TypeError: can only assign an iterable

new_lrs[:5] = lr_warm
[12] TypeError: can only assign an iterable

Explanation:

In Python, list[0:3] = 'xxx' does not raise an error: the elements at indices 0, 1, 2 are replaced by the characters of 'xxx'. This works because a string in Python is itself a sequence of characters and can be iterated.

list[0:2] = 1, however, raises the error: TypeError: can only assign an iterable.

This is because the integer 1 is a plain value with no iteration capability. If that assignment is what you want, write list[0:2] = (1,).

The right-hand side of such a slice assignment must be an iterable type, not a bare integer; a one-element list like [some_int] is fine.

lr =[0.0001,0.00012,0.00013]
new_lrs = [0.001, 0.0009,0.0008,0.0007,0.0006]
new_lrs[:3] = lr
new_lrs
Out[5]: [0.0001, 0.00012, 0.00013, 0.0007, 0.0006]
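
And a quick hedged sketch of the failing case and the tuple fix described above (the values are illustrative):

new_lrs = [0.001, 0.0009, 0.0008, 0.0007, 0.0006]
# new_lrs[:2] = 1           # TypeError: can only assign an iterable
new_lrs[:2] = (1,)           # OK: the first two elements are replaced by the single value 1
new_lrs[:2] = [0.01, 0.02]   # OK: any iterable of values works
print(new_lrs)               # [0.01, 0.02, 0.0007, 0.0006]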

I encountered this while adding a warmup stage to the learning rate schedule. The complete code is as follows:

import torch
import math
from torch.optim.lr_scheduler import _LRScheduler
from utils.utils import read_cfg

cfg = read_cfg(cfg_file="/yangjiang/CDCN-Face-Anti-Spoofing.pytorch/config/CDCNpp_adam_lr1e-3.yaml")

class CosineAnnealingLR_with_Restart(_LRScheduler):
    """Set the learning rate of each parameter group using a cosine annealing
    schedule, where :math:`\eta_{max}` is set to the initial lr and
    :math:`T_{cur}` is the number of epochs since the last restart in SGDR:

    .. math::

        \eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})(1 +
        \cos(\frac{T_{cur}}{T_{max}}\pi))

    When last_epoch=-1, sets initial lr as lr.

    It has been proposed in
    `SGDR: Stochastic Gradient Descent with Warm Restarts`_. The original pytorch
    implementation only implements the cosine annealing part of SGDR,
    I added my own implementation of the restarts part.

    Args:
        optimizer (Optimizer): Wrapped optimizer.
        T_max (int): Maximum number of iterations.
        T_mult (float): Increase T_max by a factor of T_mult
        eta_min (float): Minimum learning rate. Default: 0.
        last_epoch (int): The index of last epoch. Default: -1.
        model (pytorch model): The model to save.
        out_dir (str): Directory to save snapshots
        take_snapshot (bool): Whether to save snapshots at every restart

    .. _SGDR\: Stochastic Gradient Descent with Warm Restarts:
        https://arxiv.org/abs/1608.03983
    """

    def __init__(self, optimizer, T_max, T_mult, model, out_dir, take_snapshot, eta_min=0, last_epoch=-1):
        self.T_max = T_max
        self.T_mult = T_mult
        self.Te = self.T_max
        self.eta_min = eta_min
        self.current_epoch = last_epoch

        self.model = model
        self.out_dir = out_dir
        self.take_snapshot = take_snapshot

        self.lr_history = []

        super(CosineAnnealingLR_with_Restart, self).__init__(optimizer, last_epoch)

    def get_lr(self):
        if self.current_epoch < 5:
            warm_factor = (cfg['train']['lr']/cfg['train']['warmup_start_lr']) ** (1/cfg['train']['warmup_epochs'])
            lr = cfg['train']['warmup_start_lr'] * warm_factor ** self.current_epoch
            new_lrs = [lr]
        else:
            new_lrs = [self.eta_min + (base_lr - self.eta_min) *
                   (1 + math.cos(math.pi * self.current_epoch/self.Te))/2
                   for base_lr in self.base_lrs]
        
        #new_lrs[:5] = lr_warm
        #self.lr_history.append(new_lrs)
        #print('new_lrs', new_lrs,len(new_lrs))
        return new_lrs

    def step(self, epoch=None):
        if epoch is None:
            epoch = self.last_epoch + 1
        self.last_epoch = epoch
        self.current_epoch += 1

        for param_group, lr in zip(self.optimizer.param_groups, self.get_lr()):
            param_group['lr'] = lr

        ## restart
        if self.current_epoch == self.Te:
            print("restart at epoch {:03d}".format(self.last_epoch + 1))

            if self.take_snapshot:
                torch.save({
                    'epoch': self.T_max,
                    'state_dict': self.model.state_dict()
                }, self.out_dir + "Weight/" + 'snapshot_e_{:03d}.pth.tar'.format(self.T_max))

            ## reset epochs since the last reset
            self.current_epoch = 0

            ## reset the next goal
            self.Te = int(self.Te * self.T_mult)
            self.T_max = self.T_max + self.Te