Category Archives: Python

[Solved] TensorFlow Error: UnknownError (see above for traceback): Failed to get convolution algorithm.

 

Problem: TensorFlow reports the error "UnknownError (see above for traceback): Failed to get convolution algorithm."

Analysis: the graphics card does not have enough free memory. Running on a graphics card with enough free memory solves the problem.

 

Solution:
1. Run gpustat to check the memory usage of each graphics card.
2. Select a graphics card with enough free memory.
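
The two steps above can be sketched in Python. This is a minimal sketch, assuming nvidia-smi is on the PATH; pick_freest_gpu and select_gpu are hypothetical helper names, not part of TensorFlow or gpustat.

```python
import os
import subprocess


def pick_freest_gpu(csv_text):
    """Return the index of the GPU with the most free memory.

    csv_text is the output of:
      nvidia-smi --query-gpu=index,memory.free --format=csv,noheader,nounits
    for example "0, 1024\n1, 10240\n" (free memory in MiB).
    """
    best_index, best_free = None, -1
    for line in csv_text.strip().splitlines():
        index, free_mib = (field.strip() for field in line.split(","))
        if int(free_mib) > best_free:
            best_index, best_free = int(index), int(free_mib)
    return best_index


def select_gpu():
    """Query nvidia-smi and expose only the freest GPU to this process."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=index,memory.free",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    gpu = pick_freest_gpu(out)
    if gpu is None:
        raise RuntimeError("nvidia-smi reported no GPUs")
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu)
    return gpu
```

Call select_gpu() before importing TensorFlow, because CUDA_VISIBLE_DEVICES is read when the framework first initializes CUDA.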

[Solved] CUDA failure 999: unknown error ; GPU=-351697408 ; hostname=4f5e6dff58e6 ; expr=cudaSetDevice(info_.device_id);

How to Solve error: CUDA failure 999: unknown error

1. Error Message:

CUDA failure 999: unknown error ; GPU=-351697408 ; hostname=4f5e6dff58e6 ; expr=cudaSetDevice(info_.device_id);

 

2. Solution:

Reload the nvidia_uvm kernel module with the following commands:

sudo rmmod nvidia_uvm

sudo modprobe nvidia_uvm

[Solved] Pytorch Error: RuntimeError: Error(s) in loading state_dict for Network: size mismatch

Problem background

GitHub open source project: https://github.com/zhang-tao-whu/e2ec

python train_net.py coco_finetune --bs 12 \
--type finetune --checkpoint data/model/model_coco.pth

The error is reported as follows:

loading annotations into memory...
Done (t=0.09s)
creating index...
index created!
load model: data/model/model_coco.pth
Traceback (most recent call last):
  File "train_net.py", line 67, in <module>
    main()
  File "train_net.py", line 64, in main
    train(network, cfg)
  File "train_net.py", line 40, in train
    begin_epoch = load_network(network, model_dir=args.checkpoint, strict=False)
  File "/root/autodl-tmp/e2ec/train/model_utils/utils.py", line 66, in load_network
    net.load_state_dict(net_weight, strict=strict)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for Network:
        size mismatch for dla.ct_hm.2.weight: copying a param with shape torch.Size([80, 256, 1, 1]) from checkpoint, the shape in current model is torch.Size([1, 256, 1, 1]).
        size mismatch for dla.ct_hm.2.bias: copying a param with shape torch.Size([80]) from checkpoint, the shape in current model is torch.Size([1]).

My own dataset has only 1 category, while the COCO dataset has 80, so the shape of the dla.ct_hm.2 parameters in the pretrained model does not match my model. The pretrained weights for these parameters therefore need to be discarded.

Solution:

Modify load_network in e2ec/train/model_utils/utils.py:

import os

import torch
from termcolor import colored


def load_network(net, model_dir, strict=True, map_location=None):
    # Nothing to load: start from epoch 0
    if not os.path.exists(model_dir):
        print(colored('WARNING: NO MODEL LOADED !!!', 'red'))
        return 0

    print('load model: {}'.format(model_dir))
    if map_location is None:
        # By default, map checkpoints saved on GPUs 0-3 to the CPU
        pretrained_model = torch.load(model_dir, map_location={'cuda:0': 'cpu', 'cuda:1': 'cpu',
                                                               'cuda:2': 'cpu', 'cuda:3': 'cpu'})
    else:
        pretrained_model = torch.load(model_dir, map_location=map_location)
    # Resume from the epoch after the saved one, if it was recorded
    if 'epoch' in pretrained_model.keys():
        epoch = pretrained_model['epoch'] + 1
    else:
        epoch = 0
    pretrained_model = pretrained_model['net']

    # Overwrite the current weights with the pretrained ones
    net_weight = net.state_dict()
    for key in net_weight.keys():
        net_weight.update({key: pretrained_model[key]})

    # Discard the parameters whose shapes do not match the current model.
    # The caller must pass strict=False, otherwise the missing keys raise an error.
    net_weight.pop("dla.ct_hm.2.weight")
    net_weight.pop("dla.ct_hm.2.bias")

    net.load_state_dict(net_weight, strict=strict)
    return epoch

Note: strict=False in load_state_dict only tolerates missing or unexpected keys (layers that were added or removed). It does not skip parameters whose shape differs from the checkpoint.
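
Instead of naming the mismatched parameters by hand, the checkpoint can be filtered by shape. The following is a minimal sketch of that idea, not the e2ec project's own code; keep_matching_weights is a hypothetical helper. It only relies on each value having a .shape attribute, which torch tensors do.

```python
def keep_matching_weights(checkpoint_weights, model_weights):
    """Keep checkpoint entries whose key and shape both match the model.

    Returns (kept, skipped): the filtered dict and the list of dropped keys.
    """
    kept, skipped = {}, []
    for key, value in checkpoint_weights.items():
        target = model_weights.get(key)
        if target is not None and tuple(target.shape) == tuple(value.shape):
            kept[key] = value
        else:
            skipped.append(key)
    return kept, skipped


# Intended use with PyTorch (assumes the checkpoint layout shown above):
#   kept, skipped = keep_matching_weights(pretrained_model['net'], net.state_dict())
#   net.load_state_dict(kept, strict=False)  # strict=False tolerates the dropped keys
```

Printing the skipped list is a quick way to confirm that only the expected parameters, such as dla.ct_hm.2.weight and dla.ct_hm.2.bias, were dropped.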

[Solved] PyTorch Load Model Error: Missing key(s) RuntimeError: Error(s) in loading state_dict for

Error condition: loading a pretrained model fails in load_state_dict with missing key(s):

RuntimeError: Error(s) in loading state_dict for :

Missing key(s) in state_dict: “features.0.weight” …

Unexpected key(s) in state_dict: “module.features.0.weight” …

 

Error reason:

The parameter keys of a model wrapped with nn.DataParallel carry an extra "module." prefix compared with the keys of a model that is not wrapped with nn.DataParallel.

 

Solution:

1. Loading a model trained with nn.DataParallel(net) into a plain net.

Remove the "module." prefix:

# original saved file with DataParallel
state_dict = torch.load('model_path')
# create new OrderedDict that does not contain `module.`
from collections import OrderedDict
new_state_dict = OrderedDict()
for k, v in state_dict.items():
    name = k[7:] if k.startswith('module.') else k  # remove `module.` only when present
    new_state_dict[name] = v
# load params
net.load_state_dict(new_state_dict)

Alternatively, rename the keys in place:

checkpoint = torch.load('model_path')
for key in list(checkpoint.keys()):
    if 'module.' in key:
        # store the value under the key without the `module.` prefix
        checkpoint[key.replace('module.', '')] = checkpoint[key]
        del checkpoint[key]

net.load_state_dict(checkpoint)

Or wrap the model with nn.DataParallel before loading the weights:

checkpoint = torch.load('model_path')
net = torch.nn.DataParallel(net)
net.load_state_dict(checkpoint)

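2. Loading a model trained with a plain net into nn.DataParallel(net).

Before saving the weights, access the underlying model through .module

If you save weights with torch.save() from a model wrapped with nn.DataParallel, use model.module.state_dict() so that the saved keys carry no "module." prefix: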

torch.save(net.module.state_dict(), 'model_path')

Load the weights before wrapping the model with nn.DataParallel, and then wrap it:

net.load_state_dict(torch.load('model_path'))
net = nn.DataParallel(net, device_ids=[0, 1]) 

Add the "module." prefix manually

from collections import OrderedDict

import torch
from torch import nn

net = nn.DataParallel(net)
state_dict = torch.load(savepath)  # savepath: path to the pretrained model
new_state_dict = OrderedDict()
for k, v in state_dict.items():
    if 'module' not in k:
        # add "module." manually
        k = 'module.' + k
    else:
        # move "module." in front of "features."
        k = k.replace('features.module.', 'module.features.')
    new_state_dict[k] = v

net.load_state_dict(new_state_dict)
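
Both directions of the key renaming can be wrapped in two small functions. This is a sketch under the assumption that the only difference between the two key sets is a leading "module."; strip_module_prefix and add_module_prefix are hypothetical names, not PyTorch APIs.

```python
from collections import OrderedDict

PREFIX = "module."


def strip_module_prefix(state_dict):
    """Return a copy whose keys have a leading 'module.' removed."""
    return OrderedDict(
        (key[len(PREFIX):] if key.startswith(PREFIX) else key, value)
        for key, value in state_dict.items()
    )


def add_module_prefix(state_dict):
    """Return a copy whose keys all start with 'module.'."""
    return OrderedDict(
        (key if key.startswith(PREFIX) else PREFIX + key, value)
        for key, value in state_dict.items()
    )
```

Because both functions check the prefix before changing a key, applying either one twice leaves the keys unchanged, so they are safe to call on a checkpoint whose origin is unknown.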

[Solved] Tensorflow Error: NameError: name ‘layers‘ is not defined

Code that raises the error:

import tensorflow as tf
net = layers.Dense(10)
net.build((4, 10))
net.kernel

NameError: name ‘layers’ is not defined

Error reason: the layers module was never imported, so the name layers is undefined.
Solution: import layers from tensorflow.keras:

import tensorflow as tf
from tensorflow.keras import datasets, layers, optimizers
net = layers.Dense(10)
net.build((4, 10))
net.kernel

Output:

<tf.Variable 'kernel:0' shape=(10, 10) dtype=float32, numpy=
array([[ 0.22973484,  0.00857711, -0.21515384, -0.5346802 , -0.2584985 ,
         0.03767496,  0.22262502,  0.10832614,  0.12043941,  0.3197981 ],
       [ 0.12034583,  0.01719284, -0.37415868,  0.22801459,  0.49012756,
        -0.01656079, -0.02581853,  0.22888458, -0.3193212 , -0.23586014],
       [-0.50331104, -0.18943703,  0.47028244, -0.33412236,  0.04251152,
        -0.54133296,  0.23136115,  0.02571291, -0.36819634,  0.5134926 ],
       [-0.06907243,  0.33713734,  0.34277046,  0.24761981,  0.50419617,
        -0.20183799, -0.27459818, -0.34057558, -0.23564544,  0.34107167],
       [-0.51874346,  0.30625004,  0.07017416,  0.4792788 , -0.08462432,
         0.1762883 ,  0.47576356, -0.08242992,  0.0560475 ,  0.5385151 ],
       [-0.02134383,  0.02438915, -0.11708987,  0.26330394, -0.4951692 ,
         0.19778156, -0.1931901 , -0.41975048,  0.0376184 ,  0.23603398],
       [-0.20051709, -0.46164495,  0.15974921, -0.05227134,  0.14756906,
         0.12185448, -0.5285519 , -0.5298273 ,  0.14063555,  0.02481627],
       [-0.35953748,  0.30639488, -0.02970898, -0.5232449 , -0.10309196,
        -0.3557127 , -0.19765031,  0.3171267 ,  0.34930962, -0.15071085],
       [ 0.20013565,  0.11569405, -0.46884173, -0.40876222,  0.36319625,
         0.33609563,  0.2721032 , -0.04006624,  0.09699225,  0.20260221],
       [-0.03152204, -0.48894358,  0.3079273 , -0.5283493 , -0.44822672,
        -0.34838638,  0.41896552, -0.34962398, -0.24334553,  0.38500214]],
      dtype=float32)>

Problem solved.

[Solved] librosa Install Error: ImportError: DLL load failed: Could Not Found

After running pip install librosa in CMD, the following error is reported:

from numba.core.typeconv import _typeconv
ImportError: DLL load failed: not found

The cause is a version conflict: the installed versions of numba, llvmlite and resampy are incompatible with each other.

Solution:
reinstall compatible versions of these packages

pip install llvmlite==0.31.0
pip install numba==0.48.0
pip install resampy==0.3.0
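
To confirm which versions actually ended up installed, the package metadata can be queried from Python. This is a small sketch using the standard library's importlib.metadata (Python 3.8+); installed_versions is a hypothetical helper name.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    report = installed_versions(["librosa", "numba", "llvmlite", "resampy"])
    for name, version in report.items():
        print(f"{name}: {version or 'not installed'}")
```

Run it in the same environment that raised the ImportError, since a different interpreter may see a different set of packages.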

The final versions of all libraries are shown in the following figure:

XGBoost Common Errors and Their Solutions

Error 1: 'dict_items' object has no attribute 'copy'
The parameter is passed in the wrong form: it must be a list. Solution: in the source code, find the line

plst = params.items()

and convert the result to a list:

plst = list(params.items())
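
The reason the conversion helps: in Python 3, dict.items() returns a view object rather than a list, and a view has no copy() method. A short illustration with made-up parameter values:

```python
# Hypothetical example parameters, for illustration only
params = {"max_depth": 3, "eta": 0.1}

items_view = params.items()               # a dict_items view
view_has_copy = hasattr(items_view, "copy")

plst = list(params.items())               # a real list of (key, value) pairs
plst_copy = plst.copy()                   # lists do support .copy()
plst_copy.append(("eval_metric", "auc"))  # the original plst is left untouched
```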

Error 2: WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.6.0/src/learner.cc:627: Parameter…
Solution: suppress the warning by adding the following line (xgb is the imported xgboost module):

xgb.set_config(verbosity=0)

Python Post kafka Messages Error: [Error 10] MessageSizeTooLargeError

Error Message:

[Error 10] MessageSizeTooLargeError: The message is 1177421 bytes when serialized which is larger than the maximum request size you have configured with the max_request_size configuration

Solution: pass the max_request_size configuration when instantiating the KafkaProducer class to raise the limit above the default size.
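
A minimal sketch of that configuration, assuming the kafka-python package. The broker address and the 4 MiB limit are placeholder values, and make_producer is a hypothetical helper.

```python
# Assumption: kafka-python is the client library in use.
MAX_REQUEST_SIZE = 4 * 1024 * 1024  # 4 MiB, larger than the 1177421-byte message

producer_config = {
    "bootstrap_servers": "localhost:9092",  # hypothetical broker address
    "max_request_size": MAX_REQUEST_SIZE,   # kafka-python defaults to 1048576 bytes
}


def make_producer(config=producer_config):
    """Create a KafkaProducer with the enlarged request size."""
    from kafka import KafkaProducer  # imported lazily so the config can be inspected without kafka-python

    return KafkaProducer(**config)
```

The broker enforces its own limit (message.max.bytes, or max.message.bytes per topic), so it may also need raising before messages of this size are accepted.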

[Solved] AttributeError: module ‘PIL.Image‘ has no attribute ‘open‘

AttributeError: module 'PIL.Image' has no attribute 'open' means that the PIL.Image module has no open function. I searched for many solutions online, but none of them worked. Finally, I happened to look at the location of image.py (c:\users\lenovo\pycharmprojects\kk\venv\lib\site packages\pil\image.py), and that showed me the cause of the error.

from PIL import Image
import os
import csv
import time

Reason: the image.py file in the PIL package had been accidentally emptied, so Image.open() was no longer defined and the following call failed.

temp_img_now = Image.open(temp_file)

Solution: uninstall pillow and pillow-PIL, and then reinstall them.
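
Before reinstalling, it can help to check which file Python would load for PIL.Image and whether that file is empty. The sketch below uses only the standard library; module_file_info is a hypothetical helper name.

```python
import importlib.util
import os


def module_file_info(module_name):
    """Return (path, size_in_bytes) of the file a module would be loaded from.

    Returns (None, 0) when the module or its parent package cannot be found.
    """
    try:
        spec = importlib.util.find_spec(module_name)
    except ModuleNotFoundError:
        return None, 0
    if spec is None or spec.origin is None or not os.path.isfile(spec.origin):
        return None, 0
    return spec.origin, os.path.getsize(spec.origin)


if __name__ == "__main__":
    path, size = module_file_info("PIL.Image")
    print(path, size)
```

A size of 0 bytes, or a path that points into the project directory instead of site-packages, suggests that the file was emptied or shadowed rather than that Pillow itself is broken.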

[Solved] yolov5-6.0 ERROR: AttributeError: ‘Upsample‘ object has no attribute ‘recompute_scale_factor‘

Preface: while using yolov5-6.0 to run detection on several pictures, I ran into the error in the title. The error shows that something is wrong with the upsampling layer. This post records the solution.

Version: yolov5-6.0, python3.8, pytorch1.11.0

1. Reproducing the problem

2. Official solution

This problem first appeared in yolov5 and is related to PyTorch 1.11.0.

In other words, it can occur in both train and detect. One solution is to downgrade PyTorch to 1.10 or earlier.

The author later released a fix for PyTorch 1.11.0.

However, that did not seem to solve it for me: my torch version is 1.11.0, and the problem still occurred.

Another solution, suggested by other users, is to modify the upsampling function.

Just comment out the relevant part of that function.

With that change the problem was really solved.
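
An alternative that avoids editing the installed PyTorch source is to add the missing attribute to every Upsample layer after the model is loaded. This is a sketch of that idea and not necessarily the change the post applied. patch_upsample is a hypothetical helper, and it matches layers by class name so that it runs without importing torch; with torch available, isinstance(module, torch.nn.Upsample) is the more usual check.

```python
def patch_upsample(model):
    """Set recompute_scale_factor=None on Upsample layers that lack it.

    `model` is anything with a .modules() iterator, such as a torch.nn.Module.
    Returns the number of layers that were patched.
    """
    patched = 0
    for module in model.modules():
        is_upsample = type(module).__name__ == "Upsample"
        if is_upsample and not hasattr(module, "recompute_scale_factor"):
            module.recompute_scale_factor = None
            patched += 1
    return patched
```

Call it once on the loaded model, before running train or detect.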

[Solved] with ERRTYPE = cudaError CUDA failure 999 unknown error

Project scenario: [with ERRTYPE = cudaError; bool THRW = true] CUDA failure 999: unknown error ; GPU=24

An old program needed to be upgraded. The previous CUDA version was 10.2.


Problem Description:

Environment:

CUDA 11.2 (previously 10.2)

onnxruntime-gpu 1.10

python 3.9.7

The following error is raised when the program starts:

Traceback (most recent call last):
  File "/home/aiuser/cover/liheng-foggun/app.py", line 15, in <module>
    model = DetectMultiBackend(weights=config.paddle.model_file)
  File "/home/aiuser/miniconda3/envs/cover/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/aiuser/cover/liheng-foggun/models/yolo.py", line 37, in __init__
    self.session = onnxruntime.InferenceSession(weights, providers=['CUDAExecutionProvider'])
  File "/home/aiuser/miniconda3/envs/cover/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 335, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "/home/aiuser/miniconda3/envs/cover/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 379, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
RuntimeError: /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:122 bool onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*) [with ERRTYPE = cudaError; bool THRW = true] /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:116 bool onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*) [with ERRTYPE = cudaError; bool THRW = true] CUDA failure 999: unknown error ; GPU=24 ; hostname=aiserver-sl-01 ; expr=cudaSetDevice(info_.device_id);

Cause analysis:

1. At first, I suspected the onnxruntime-gpu version, but after upgrading to 1.12 the error was still reported.

2. It is also said to be caused by incompatible versions.

3. I then tried to reinstall the driver. After 11.2 was uninstalled, nvidia-smi showed that the previous 10.2 driver was still present.

4. The cause was that the previous driver had not been uninstalled completely.
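
To see which driver is actually active, the header printed by nvidia-smi can be parsed. This is a minimal sketch; parse_nvidia_smi_header is a hypothetical helper, and the sample line in the comment only illustrates the usual header layout.

```python
import re


def parse_nvidia_smi_header(text):
    """Extract (driver_version, cuda_version) from nvidia-smi output.

    Expects a header line such as:
      | NVIDIA-SMI 515.57    Driver Version: 515.57    CUDA Version: 11.7 |
    Either value is None when it is not found.
    """
    driver = re.search(r"Driver Version:\s*([\d.]+)", text)
    cuda = re.search(r"CUDA Version:\s*([\d.]+)", text)
    return (
        driver.group(1) if driver else None,
        cuda.group(1) if cuda else None,
    )


# Intended use (assumes nvidia-smi is on the PATH):
#   import subprocess
#   print(parse_nvidia_smi_header(subprocess.check_output(["nvidia-smi"], text=True)))
```

The CUDA version shown by nvidia-smi is the highest version the driver supports, not the installed toolkit version, so compare it against the CUDA version that onnxruntime-gpu was built for.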


Solution:

1. Uninstall CUDA 10.2

sudo /usr/local/cuda-10.2/bin/cuda-uninstaller

2. Install the new driver

# install 515.57 offline
sudo ./NVIDIA-Linux-x86_64-515.57.run -no-x-check -no-nouveau-check
