Category Archives: Python

[Solved] Pytorch Error: RuntimeError: Error(s) in loading state_dict for Network: size mismatch

Problem background

GitHub open source project: https://github.com/zhang-tao-whu/e2ec

python train_net.py coco_finetune --bs 12 \
--type finetune --checkpoint data/model/model_coco.pth

The error is reported as follows:

loading annotations into memory...
Done (t=0.09s)
creating index...
index created!
load model: data/model/model_coco.pth
Traceback (most recent call last):
  File "train_net.py", line 67, in <module>
    main()
  File "train_net.py", line 64, in main
    train(network, cfg)
  File "train_net.py", line 40, in train
    begin_epoch = load_network(network, model_dir=args.checkpoint, strict=False)
  File "/root/autodl-tmp/e2ec/train/model_utils/utils.py", line 66, in load_network
    net.load_state_dict(net_weight, strict=strict)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for Network:
        size mismatch for dla.ct_hm.2.weight: copying a param with shape torch.Size([80, 256, 1, 1]) from checkpoint, the shape in current model is torch.Size([1, 256, 1, 1]).
        size mismatch for dla.ct_hm.2.bias: copying a param with shape torch.Size([80]) from checkpoint, the shape in current model is torch.Size([1]).

My own dataset has only 1 category, while the COCO dataset has 80, so the shape of the dla.ct_hm.2 parameter in the pretrained model does not match my model's. The weights for this parameter therefore need to be discarded before loading.

Solution:

Modify e2ec/train/model_utils/utils.py:

import os
import torch
from termcolor import colored


def load_network(net, model_dir, strict=True, map_location=None):

    if not os.path.exists(model_dir):
        print(colored('WARNING: NO MODEL LOADED !!!', 'red'))
        return 0

    print('load model: {}'.format(model_dir))
    if map_location is None:
        pretrained_model = torch.load(model_dir, map_location={'cuda:0': 'cpu', 'cuda:1': 'cpu',
                                                               'cuda:2': 'cpu', 'cuda:3': 'cpu'})
    else:
        pretrained_model = torch.load(model_dir, map_location=map_location)
    if 'epoch' in pretrained_model.keys():
        epoch = pretrained_model['epoch'] + 1
    else:
        epoch = 0
    pretrained_model = pretrained_model['net']

    net_weight = net.state_dict()
    for key in net_weight.keys():
        if key in pretrained_model:
            net_weight.update({key: pretrained_model[key]})
    # Discard the parameters whose shapes do not match the current model:
    # the checkpoint's head was trained on 80 COCO classes, ours has 1.
    net_weight.pop("dla.ct_hm.2.weight")
    net_weight.pop("dla.ct_hm.2.bias")

    net.load_state_dict(net_weight, strict=strict)
    return epoch

Note: setting strict=False in load_state_dict only tolerates missing or unexpected keys; it does not allow loading a parameter whose shape differs from the one in the current model.
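For reference, a more general variant (my own sketch, not part of the e2ec repo) drops every checkpoint parameter whose shape disagrees with the current model, instead of hard-coding the key names:

import torch

def filter_mismatched(net, pretrained_state):
    """Keep only the checkpoint parameters whose shapes match the current model."""
    net_state = net.state_dict()
    filtered = {k: v for k, v in pretrained_state.items()
                if k in net_state and v.shape == net_state[k].shape}
    dropped = sorted(set(pretrained_state) - set(filtered))
    if dropped:
        print('dropped mismatched/unknown keys:', dropped)
    return filtered

# usage: net.load_state_dict(filter_mismatched(net, pretrained_model), strict=False)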

[Solved] PyTorch Load Model Error: Missing key(s) RuntimeError: Error(s) in loading state_dict for

torch.load() reports missing key(s) when loading a pretrained model.

Error condition: the error occurs while loading the pretrained model:

RuntimeError: Error(s) in loading state_dict for :

Missing key(s) in state_dict: "features.0.weight" …

Unexpected key(s) in state_dict: "module.features.0.weight" …

Error reason:

Model parameters saved from a model wrapped in nn.DataParallel carry an extra "module." prefix on every key, compared with the keys of an unwrapped model.

Solution:

1. Loading a model trained with nn.DataParallel(net) into a plain net.

Delete the "module." prefix:

import torch
from collections import OrderedDict

# the original file was saved from a model wrapped in DataParallel
state_dict = torch.load('model_path')
# create a new OrderedDict whose keys do not contain the `module.` prefix
new_state_dict = OrderedDict()
for k, v in state_dict.items():
    name = k[7:]  # remove `module.` (7 characters)
    new_state_dict[name] = v
# load params
net.load_state_dict(new_state_dict)

An equivalent variant that strips the prefix in place:

checkpoint = torch.load('model_path')
for key in list(checkpoint.keys()):
    if key.startswith('module.'):
        checkpoint[key.replace('module.', '', 1)] = checkpoint[key]
        del checkpoint[key]

net.load_state_dict(checkpoint)

Alternatively, wrap the model in nn.DataParallel when loading:

checkpoint = torch.load('model_path')
net = torch.nn.DataParallel(net)
net.load_state_dict(checkpoint)

2. Loading a model trained on a plain net with nn.DataParallel(net).

Avoid the prefix at save time: if you save with torch.save() from a DataParallel-wrapped model, use net.module.state_dict() so the stored keys carry no "module." prefix:

torch.save(net.module.state_dict(), 'model_path')

Or load the weights into the plain model first, and only then wrap it with nn.DataParallel:

net.load_state_dict(torch.load('model_path'))
net = nn.DataParallel(net, device_ids=[0, 1]) 

Or add the "module." prefix manually:

import torch
import torch.nn as nn
from collections import OrderedDict

net = nn.DataParallel(net)
state_dict = torch.load(savepath)  # savepath: pre-trained model path
new_state_dict = OrderedDict()
for k, v in state_dict.items():
    # add the `module.` prefix manually
    if 'module' not in k:
        k = 'module.' + k
    else:
        # swap the positions of `module` and `features`
        k = k.replace('features.module.', 'module.features.')
    new_state_dict[k] = v

net.load_state_dict(new_state_dict)

[Solved] TensorFlow Error: NameError: name 'layers' is not defined

Error code:

import tensorflow as tf
net = layers.Dense(10)
net.build((4, 10))
net.kernel

NameError: name 'layers' is not defined

Error reason: layers was never imported; importing tensorflow alone does not bring tensorflow.keras.layers into scope.
Solution:

import tensorflow as tf
from tensorflow.keras import datasets, layers, optimizers
net = layers.Dense(10)
net.build((4, 10))
net.kernel

Output:

<tf.Variable 'kernel:0' shape=(10, 10) dtype=float32, numpy=
array([[ 0.22973484,  0.00857711, -0.21515384, -0.5346802 , -0.2584985 ,
         0.03767496,  0.22262502,  0.10832614,  0.12043941,  0.3197981 ],
       [ 0.12034583,  0.01719284, -0.37415868,  0.22801459,  0.49012756,
        -0.01656079, -0.02581853,  0.22888458, -0.3193212 , -0.23586014],
       [-0.50331104, -0.18943703,  0.47028244, -0.33412236,  0.04251152,
        -0.54133296,  0.23136115,  0.02571291, -0.36819634,  0.5134926 ],
       [-0.06907243,  0.33713734,  0.34277046,  0.24761981,  0.50419617,
        -0.20183799, -0.27459818, -0.34057558, -0.23564544,  0.34107167],
       [-0.51874346,  0.30625004,  0.07017416,  0.4792788 , -0.08462432,
         0.1762883 ,  0.47576356, -0.08242992,  0.0560475 ,  0.5385151 ],
       [-0.02134383,  0.02438915, -0.11708987,  0.26330394, -0.4951692 ,
         0.19778156, -0.1931901 , -0.41975048,  0.0376184 ,  0.23603398],
       [-0.20051709, -0.46164495,  0.15974921, -0.05227134,  0.14756906,
         0.12185448, -0.5285519 , -0.5298273 ,  0.14063555,  0.02481627],
       [-0.35953748,  0.30639488, -0.02970898, -0.5232449 , -0.10309196,
        -0.3557127 , -0.19765031,  0.3171267 ,  0.34930962, -0.15071085],
       [ 0.20013565,  0.11569405, -0.46884173, -0.40876222,  0.36319625,
         0.33609563,  0.2721032 , -0.04006624,  0.09699225,  0.20260221],
       [-0.03152204, -0.48894358,  0.3079273 , -0.5283493 , -0.44822672,
        -0.34838638,  0.41896552, -0.34962398, -0.24334553,  0.38500214]],
      dtype=float32)>

Problem solved.

[Solved] librosa Install Error: ImportError: DLL load failed: The specified module could not be found

CMD reports an error after executing the command pip install librosa:

from numba.core.typeconv import _typeconv
ImportError: DLL load failed: not found

The cause is that the installed versions of numba, llvmlite, and resampy conflict with each other.

Solution:
Reinstall mutually compatible versions:

pip install llvmlite==0.31.0
pip install numba==0.48.0
pip install resampy==0.3.0


XGBoost Common Errors and Their Solutions


Error 1: 'dict_items' object has no attribute 'copy'
The parameters are passed in the wrong form: xgb.train expects a list of (key, value) pairs. In the original code:

plst = params.items()

convert it to a list:

plst = list(params.items())
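For context, a minimal sketch of how the converted list is passed to xgb.train (the toy data and hyperparameters are my own assumptions):

import numpy as np
import xgboost as xgb

# hypothetical toy data
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, 100)
dtrain = xgb.DMatrix(X, label=y)

params = {'max_depth': 3, 'eta': 0.1, 'objective': 'binary:logistic'}
plst = list(params.items())  # a list of (key, value) pairs, not dict_items
bst = xgb.train(plst, dtrain, num_boost_round=10)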

Error 2: WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.6.0/src/learner.cc:627: Parameter…
Solution: add the following code to silence the warning:

xgb.set_config(verbosity=0)

Python Post Kafka Messages Error: [Error 10] MessageSizeTooLargeError

Error Message:

[Error 10] MessageSizeTooLargeError: The message is 1177421 bytes when serialized which is larger than the maximum request size you have configured with the max_request_size configuration

Solution: add the max_request_size configuration when instantiating the KafkaProducer class to raise the default limit:
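A minimal sketch, assuming the kafka-python client (the broker address and topic name are placeholders):

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers='localhost:9092',  # placeholder broker address
    max_request_size=5 * 1024 * 1024,    # 5 MB; the default is 1048576 bytes
)
producer.send('my_topic', b'...large payload...')
producer.flush()

Note that the broker and topic have their own limits (message.max.bytes and max.message.bytes); if the message also exceeds those, they must be raised on the server side as well.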

[Solved] AttributeError: module 'PIL.Image' has no attribute 'open'

AttributeError: module 'PIL.Image' has no attribute 'open' means that PIL.Image has no open method. I searched many solutions online, but none of them worked. Then I happened to notice the path of Image.py (c:\users\lenovo\pycharmprojects\kk\venv\lib\site-packages\pil\image.py) and understood the cause of the error.

from PIL import Image
import os
import csv
import time

Reason: the Image.py file under the PIL package had accidentally been emptied, so Image.open() no longer exists.

temp_img_now = Image.open(temp_file)

Solution: uninstall Pillow and pillow-PIL, then reinstall them.
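A sketch of the commands, assuming pip manages the environment (package names as in the original post):

pip uninstall pillow pillow-PIL
pip install pillow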

[Solved] yolov5-6.0 ERROR: AttributeError: 'Upsample' object has no attribute 'recompute_scale_factor'

Preface: while using yolov5-6.0 to run detection on a few images, the error in the title appeared; the upsampling function is the culprit. The solution is recorded below.

Version: yolov5-6.0, Python 3.8, PyTorch 1.11.0

1. Problem reproduction

2. Official solution

This problem first appeared in yolov5 and is related to PyTorch 1.11.0. In other words, it may be encountered in both train and detect. The suggestion given there is to downgrade PyTorch below 1.11.

The yolov5 author then made a fix for PyTorch 1.11.0, but it does not seem to solve the problem: my torch version is 1.11.0 and the error still occurs.

What works is the netizens' solution of modifying the upsampling function: just comment out the part shown below.

It really is solved.
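For reference, a sketch of the commonly circulated edit (applied to torch/nn/modules/upsampling.py inside the installed PyTorch 1.11.0, in the Upsample class) that stops passing recompute_scale_factor:

# torch/nn/modules/upsampling.py, class Upsample (PyTorch 1.11.0)
def forward(self, input: Tensor) -> Tensor:
    return F.interpolate(input, self.size, self.scale_factor, self.mode, self.align_corners)
    # the original line, commented out:
    # return F.interpolate(input, self.size, self.scale_factor, self.mode, self.align_corners,
    #                      recompute_scale_factor=self.recompute_scale_factor)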

[Solved] with ERRTYPE = cudaError CUDA failure 999 unknown error

Project scenario: [with ERRTYPE = cudaError; bool THRW = true] CUDA failure 999: unknown error; GPU=24.

The old program needed to be upgraded; the previous CUDA version was 10.2.


Problem Description:

Environment:

CUDA 11.2 (previously 10.2)

onnxruntime-gpu 1.10

Python 3.9.7

When starting the program:

Traceback (most recent call last):
  File "/home/aiuser/cover/liheng-foggun/app.py", line 15, in <module>
    model = DetectMultiBackend(weights=config.paddle.model_file)
  File "/home/aiuser/miniconda3/envs/cover/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/aiuser/cover/liheng-foggun/models/yolo.py", line 37, in __init__
    self.session = onnxruntime.InferenceSession(weights, providers=['CUDAExecutionProvider'])
  File "/home/aiuser/miniconda3/envs/cover/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 335, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "/home/aiuser/miniconda3/envs/cover/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 379, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
RuntimeError: /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:122 bool onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*) [with ERRTYPE =
 cudaError; bool THRW = true] /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:116 bool onnxruntime::CudaCall(ERRTYPE, const char*, const char*
, ERRTYPE, const char*) [with ERRTYPE = cudaError; bool THRW = true] CUDA failure 999: unknown error ; GPU=24 ; hostname=aiserver-sl-01 ; expr=cudaSetDevice(info_.device_id);

Cause analysis:

1. At first I suspected the onnxruntime-gpu version; after upgrading to 1.12 the error persisted.

2. It was said to be a version incompatibility.

3. I tried reinstalling the driver. After uninstalling 11.2, nvidia-smi showed that the old 10.2 driver was still present.

4. The root cause: the previous driver had not been uninstalled completely.


Solution:

1. Uninstall 10.2

sudo /usr/local/cuda-10.2/bin/cuda-uninstaller

2. Install the new driver

#install 515.57 offline
sudo ./NVIDIA-Linux-x86_64-515.57.run -no-x-check -no-nouveau-check


[Solved] RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

Background:

Reproducing r2c on an Ubuntu 18.04 system with a GeForce RTX 3090 graphics card.


Problem:

RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

Cause analysis:

The GeForce RTX 3090 only supports CUDA 11 and above.


Solution:

Update the PyTorch and CUDA versions:

conda install pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=11.3 -c pytorch

[Solved] OperatorNotAllowedInGraphError & AttributeError: 'Tensor' object has no attribute 'numpy'

The reason for the errors above when compiling custom functions is that TF 2.x's model.compile does not support extracting concrete tensor values by default.

Problem

When using the wrapping method to customize the loss function of a Keras model and needing to calculate accuracy metrics such as precision or recall, or to extract the concrete values of the inputs y_true and y_pred (operations such as y_true.numpy()), an error message appears:

OperatorNotAllowedInGraphError: using a `tf.Tensor` as a Python `bool` is not allowed: AutoGraph did convert this function. This might indicate you are trying to use an unsupported feature.

Or

 AttributeError: 'Tensor' object has no attribute 'numpy'


Solution:

Pass this argument to the compile function:

run_eagerly=True
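A minimal sketch (the model and metric are hypothetical) showing where the flag goes:

import tensorflow as tf
from sklearn.metrics import accuracy_score

# hypothetical custom metric that needs concrete values from the tensors
def np_accuracy(y_true, y_pred):
    y_true = y_true.numpy().ravel().astype(int)          # only works in eager mode
    y_pred = (y_pred.numpy().ravel() > 0.5).astype(int)
    return accuracy_score(y_true, y_pred)

model = tf.keras.Sequential([tf.keras.layers.Dense(1, activation='sigmoid')])
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=[np_accuracy],
              run_eagerly=True)  # run eagerly so .numpy() is available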


Reason:

TF 2.x enables eager execution (dynamic computation graphs) by default. Compared with the static graphs of TF 1.x, eager mode is easier to debug: tensor values can be printed and evaluated directly, and tensors interoperate well with NumPy, converting to and from ndarrays conveniently. The tradeoff is that it runs noticeably slower; once a static graph has been defined, it executes almost entirely as C++ code in the TensorFlow core, so it is more efficient and faster.

Even so, run_eagerly defaults to False in model.compile, which means the model's logic is wrapped in tf.function for faster execution (the AutoGraph mechanism converts the dynamic graph into a static one via the @tf.function wrapper). But @tf.function requires the wrapped code to use native TF operations, not arbitrary Python operations or functions from other packages. The first error therefore occurs when calling functions such as sklearn.metrics' accuracy_score or imblearn.metrics' geometric_mean_score; the second occurs when calling y_true.numpy(). The root cause is that after @tf.function converts the code to a static graph, model.compile no longer supports these operations, even though TF 2.x defaults to dynamic graphs elsewhere.

After passing run_eagerly=True to model.compile, the model runs with a dynamic graph and the operations above work normally. The downside is the lower execution efficiency of dynamic graphs.