Tag Archives: Deep learning

MindSpore Error: [ERROR] MD:unexpected error.Not a valid index

ERROR:[MD]:unexpected error. Not a valid index

problem phenomenon: a single card does not report an error, and the training process can be correctly executed. However, when switching to distributed training, an error in the diagram is reported. After troubleshooting, the cause of the error is found to be the wrong use of distributed sampling
the error reporting method is as follows:

the order of distributed sampling and random sampling needs to be changed. The correct way is to perform random sampling first and then distributed sampling

The correct modification is as follows:

distributed training can be performed correctly after modification

[Solved] Tensorflow Error: NameError: name ‘layers‘ is not defined

Error code:

 import tensorflow as tf
 net = layers.Dense(10)
 net.build((4, 10))

NameError: name ‘layers’ is not defined

Error reason: TensorFlow does not load layers

import tensorflow as tf
from tensorflow.keras import datasets, layers,optimizers
net = layers.Dense(10)
net.build((4, 10))

Operation results:

<tf.Variable 'kernel:0' shape=(10, 10) dtype=float32, numpy=
array([[ 0.22973484,  0.00857711, -0.21515384, -0.5346802 , -0.2584985 ,
         0.03767496,  0.22262502,  0.10832614,  0.12043941,  0.3197981 ],
       [ 0.12034583,  0.01719284, -0.37415868,  0.22801459,  0.49012756,
        -0.01656079, -0.02581853,  0.22888458, -0.3193212 , -0.23586014],
       [-0.50331104, -0.18943703,  0.47028244, -0.33412236,  0.04251152,
        -0.54133296,  0.23136115,  0.02571291, -0.36819634,  0.5134926 ],
       [-0.06907243,  0.33713734,  0.34277046,  0.24761981,  0.50419617,
        -0.20183799, -0.27459818, -0.34057558, -0.23564544,  0.34107167],
       [-0.51874346,  0.30625004,  0.07017416,  0.4792788 , -0.08462432,
         0.1762883 ,  0.47576356, -0.08242992,  0.0560475 ,  0.5385151 ],
       [-0.02134383,  0.02438915, -0.11708987,  0.26330394, -0.4951692 ,
         0.19778156, -0.1931901 , -0.41975048,  0.0376184 ,  0.23603398],
       [-0.20051709, -0.46164495,  0.15974921, -0.05227134,  0.14756906,
         0.12185448, -0.5285519 , -0.5298273 ,  0.14063555,  0.02481627],
       [-0.35953748,  0.30639488, -0.02970898, -0.5232449 , -0.10309196,
        -0.3557127 , -0.19765031,  0.3171267 ,  0.34930962, -0.15071085],
       [ 0.20013565,  0.11569405, -0.46884173, -0.40876222,  0.36319625,
         0.33609563,  0.2721032 , -0.04006624,  0.09699225,  0.20260221],
       [-0.03152204, -0.48894358,  0.3079273 , -0.5283493 , -0.44822672,
        -0.34838638,  0.41896552, -0.34962398, -0.24334553,  0.38500214]],

Problem solved.

How to Solve Yolov5 Common Error (Three Errors)

1. detect.py run error: IndexError: list index out of range Reason: Path error
Solution: Add r in front of the path
before the path

2.subprocess.CalledProcessError: Command ‘git tag‘ returned non-zero exit status

Solution: open the python environment and find the subprocess.py in Lib, change the check value to False in line 415.

My path is D:\Miniconda3\envs\pytorch\Lib

3.UnicodeDecodeError: ‘utf-8‘ codec can‘t decode byte 0xb2 in position 6:invalidstartbyte

Solution: in line 58 of the torch_utils.py file, modify the original decode() to decode(encoding = ‘gbk’)

[Solved] with ERRTYPE = cudaError CUDA failure 999 unknown error

Project scenario [with errtype = cudaerror; bool thrw = true] CUDA failure 999: unknown error; GPU=24 :

The old program needs to be upgraded. The previous CUDA is 10.2

Problem Description:


CUDA 11.2 (previously 10.2)

onnxruntime-gpu 1.10

python 3.9.7

When starting the program

Traceback (most recent call last):
  File "/home/aiuser/cover/liheng-foggun/app.py", line 15, in <module>
    model = DetectMultiBackend(weights=config.paddle.model_file)
  File "/home/aiuser/miniconda3/envs/cover/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/aiuser/cover/liheng-foggun/models/yolo.py", line 37, in __init__
    self.session = onnxruntime.InferenceSession(weights, providers=['CUDAExecutionProvider'])
  File "/home/aiuser/miniconda3/envs/cover/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 335, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "/home/aiuser/miniconda3/envs/cover/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 379, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
RuntimeError: /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:122 bool onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*) [with ERRTYPE =
 cudaError; bool THRW = true] /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:116 bool onnxruntime::CudaCall(ERRTYPE, const char*, const char*
, ERRTYPE, const char*) [with ERRTYPE = cudaError; bool THRW = true] CUDA failure 999: unknown error ; GPU=24 ; hostname=aiserver-sl-01 ; expr=cudaSetDevice(info_.device_id);

Cause analysis:

1. At first, I thought it was the onnxruntime GPU version problem, upgraded to 1.12 it still reports an error.

2. It is said that it is incompatible.

3. Try to reinstall the driver. When 11.2 is uninstalled, nvidia-smi finds that the previous 10.2 driver still exists.

4. The reason is that the previous drive was not unloaded completely


1. Uninstall 10.2

sudo /usr/local/cuda-10.2/bin/cuda-uninstaller

2. Install a new drive

#install 515.57 offline
sudo ./NVIDIA-Linux-x86_64-515.57.run -no-x-check -no-nouveau-check

VIDIA-Linux-x86_64-515.57.run -no-x-check -no-nouveau-check

[Solved] Operator Not Allowed In Graph Error & Attribute Error Tensor object has no attribute numpy

The reason for the above error when compiling custom functions is that tf2.x’s keras.compile does not support specific values by default


When using the wrapping method to customize the loss function of the keras model and need to calculate accuracy metrics such as precision or recall, or need to extract the specific values of the inputs y_true and y_prd (operations such as y_true.numpy()), an error message appears:

OperatorNotAllowedInGraphError: using a `tf.Tensor` as a Python `bool` is not allowed: AutoGraph did convert this function. This might indicate you are trying to use an unsupported feature.


 AttributeError: 'Tensor' object has no attribute 'numpy'



Pass in parameters in the compile function:




Tf2.x enables eager mode by default, namely eager execution, that is, dynamic calculation graph. Compared with the static calculation graph of tf1.x, the advantage of eager mode is that it is convenient for debugging, which can easily print tensor values ​​and evaluate results; and Numpy interacts well, and the conversion between tensor and ndarray is convenient and even universal. The tradeoff is that it runs significantly slower. After the static calculation graph is defined, it is almost always executed with C++ code on the tensorflow core, so the calculation efficiency is higher and the speed is faster.

Even so, run_eagerly defaults to False in the model.compile method, which means that the logic of the model is encapsulated in tf.function, which achieves faster computational efficiency (the autograph mechanism converts the dynamic computational graph through the @tf.function wrapper). is a static computation graph). But the @tf.function wrapper requires the function to use basic tf operations, not other operations in python or even functions from other packages, so the first error occurs when calling functions such as sklearn.metrics’ accuracy_score or imblearn.metrcis’ geometric_mean_score function. The second error occurs when using the y_true.numpy() method. The fundamental reason is that the model.compile method does not support the above operations after the static calculation graph converted by the @tf.function wrapper, although tf2.x enables the use of dynamic calculation graphs by default.

After passing run_eagerly=True to the model.compile method, the dynamic calculation graph is used to run, and the above operations can be performed normally. The disadvantage is that the dynamic calculation graph has the disadvantage of low operation efficiency.

[Solved] ProxyError: Conda cannot proceed due to an error in your proxy configuration.

ProxyError: Conda cannot proceed due to an error in your proxy configuration.

0. Problem reporting error

conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch

The following errors are reported when installing pytorch:

ProxyError: Conda cannot proceed due to an error in your proxy configuration.
Check for typos and other configuration errors in any '.netrc' file in your home directory,
any environment variables ending in '_PROXY', and any other system-wide proxy
configuration settings.

The problem lies in agency

1. Solutions

(1) View current terminal agent

env | grep -i "_PROXY"

(2) Delete agents in turn

unset https_proxy
unset http_proxy
unset no_proxy
unset NO_PROXY

3. Whether the verification is successful

Enter again

env | grep -i "_PROXY"

then enter the following:
env. grep -i "PROXY"

[Solved] Yolov5 Deep Learning Error: RuntimeError: DataLoader worker (pid(s) 2516, 1768) exited unexpectedly

Project scenario:

There is a problem when using yolov5 for deep learning. I use GPU for learning.

Problem description

An error is reported at the beginning of learning, RuntimeError: DataLoader worker (pid(s) 2516, 1768) exited unexpectedly.

Cause analysis:

Because I use GPU to learn, Anaconda’s virtual memory is also allocated enough, so the problem should be the setting of the number of CPU threads. Before that, I tried to adjust the batch size, but it didn’t work.


In train There is a parameter of --workers in the file of py.

There is a parameter named --workers in the train.py file. Set it to 0.

the following is my setting, you can refer to it~~~

def parse_opt(known=False):
    parser = argparse.ArgumentParser()
    parser.add_argument('--weights', type=str, default=ROOT/'yolov5x.pt', help='initial weights path') #初始权重值
    parser.add_argument('--cfg', type=str, default='yolov5_Scan_FDDI/PLC_model.yaml', help='model.yaml path') #训练模型文件
    parser.add_argument('--data', type=str, default=ROOT/'yolov5_Scan_FDDI/PLC_parameter.yaml', help='dataset.yaml path') #数据集参数文件
    parser.add_argument('--hyp', type=str, default=ROOT/'data/hyps/hyp.scratch-low.yaml', help='hyperparameters path') #超参数设置
    parser.add_argument('--epochs', type=int, default=100) #训练轮数
    parser.add_argument('--batch-size', type=int, default=4, help='total batch size for all GPUs, -1 for autobatch') #batch size
    parser.add_argument('--imgsz', '--img', '--img-size', type=int, default=320, help='train, val image size (pixels)') #图片大小
    parser.add_argument('--rect', action='store_true', help='rectangular training')
    parser.add_argument('--resume', nargs='?', const=True, default=False, help='resume most recent training') #断续训练
    parser.add_argument('--nosave', action='store_true', help='only save final checkpoint')
    parser.add_argument('--noval', action='store_true', help='only validate final epoch')
    parser.add_argument('--noautoanchor', action='store_true', help='disable AutoAnchor')
    parser.add_argument('--noplots', action='store_true', help='save no plot files')
    parser.add_argument('--evolve', type=int, nargs='?', const=300, help='evolve hyperparameters for x generations')
    parser.add_argument('--bucket', type=str, default='', help='gsutil bucket')
    parser.add_argument('--cache', type=str, nargs='?', const='ram', help='--cache images in "ram" (default) or "disk"')
    parser.add_argument('--image-weights', action='store_true', help='use weighted image selection for training')
    parser.add_argument('--device', default='0', help='cuda device, i.e. 0 or 0,1,2,3 or cpu') #GPU
    parser.add_argument('--multi-scale', action='store_true', help='vary img-size +/- 50%%')
    parser.add_argument('--single-cls', action='store_true', help='train multi-class data as single-class')
    parser.add_argument('--optimizer', type=str, choices=['SGD', 'Adam', 'AdamW'], default='SGD', help='optimizer')
    parser.add_argument('--sync-bn', action='store_true', help='use SyncBatchNorm, only available in DDP mode')
    parser.add_argument('--workers', type=int, default=0, help='max dataloader workers (per RANK in DDP mode)') #CPU线程数设置
    parser.add_argument('--project', default=ROOT/'runs/train', help='save to project/name')
    parser.add_argument('--name', default='exp', help='save to project/name')
    parser.add_argument('--exist-ok', action='store_true', help='existing project/name ok, do not increment')
    parser.add_argument('--quad', action='store_true', help='quad dataloader')
    parser.add_argument('--cos-lr', action='store_true', help='cosine LR scheduler')
    parser.add_argument('--label-smoothing', type=float, default=0.0, help='Label smoothing epsilon')
    parser.add_argument('--patience', type=int, default=100, help='EarlyStopping patience (epochs without improvement)')
    parser.add_argument('--freeze', nargs='+', type=int, default=[0], help='Freeze layers: backbone=10, first3=0 1 2')
    parser.add_argument('--save-period', type=int, default=-1, help='Save checkpoint every x epochs (disabled if < 1)')
    parser.add_argument('--seed', type=int, default=0, help='Global training seed')
    parser.add_argument('--local_rank', type=int, default=-1, help='Automatic DDP Multi-GPU argument, do not modify')

    # Weights & Biases arguments
    parser.add_argument('--entity', default=None, help='W&B: Entity')
    parser.add_argument('--upload_dataset', nargs='?', const=True, default=False, help='W&B: Upload data, "val" option')
    parser.add_argument('--bbox_interval', type=int, default=-1, help='W&B: Set bounding-box image logging interval')
    parser.add_argument('--artifact_alias', type=str, default='latest', help='W&B: Version of dataset artifact to use')

    opt = parser.parse_known_args()[0] if known else parser.parse_args()
    return opt

[Solved] caffe Error: Check failed: cv_img.data Could not load

The following problems were encountered when running Caffe Code:

Problem Description:

E0727 11:22:09.213124  6200 io.cpp:89] Could not open or find file D:/.../
F0727 11:22:09.213124  6200 image_data_layer.cpp:129] Check failed: cv_img.data Could not load D:/.../
*** Check failure stack trace: ***


Check the train or test file for the presence of empty lines (usually the last line) and delete them.

[Solved] Pytorch Error: PytorchStreamReader failed reading zip archive failed finding central directory

Pytoch reports an error:

PytorchStreamReader failed reading zip archive: failed finding central directory

Error reporting position

An error is reported if the pre training model is not downloaded

resnet101 = torchvision.models.resnet101(pretrained=True)


Delete the file C:\Users\Username/.cache\torch\hub\checkpoints.pth

[Solved] Pytorch Error: PytorchStreamReader failed reading zip archive failed finding central directory

Pytoch reports an error: pytochstreamreader failed reading zip archive: failed finding central directory

Error reporting position

An error is reported if the pre training model is not downloaded

resnet101 = torchvision.models.resnet101(pretrained=True)


Download the files from the above URL and put them in the location of the path behind to replace the weights that have not been downloaded

[Solved] pytorch loss.backward() Error: RuntimeError: Function AddBackward0 returned an invalid gradient at index 1…

How to Solve pytorch loss.backward() Error

The specific errors are as follows:

In fact, the error message makes it clear that the error is caused by device inconsistency in the backpropagation process.
In the code, we first load all input, target, model, and loss to the GPU via .to(device). But when calculating the loss, we initialize a loss ourselves
loss_1 = torch.tensor(0.0)
This way by default loss_1 is loaded on the CPU.
The final execution loss_seg = loss + loss_1
results in the above error for loss_seg.backward().

Modification method:
loss_1 = torch.tensor(0.0)
loss_1 = torch.tensor(0.0).to(device)
loss_1 = torch.tensor(0.0).cuda()

tensorflow2.3 InvalidArgumentError: jpeg::Uncompress failed [How to Solve]

When training your own dataset, you often report errors:

tensorflow2.3 InvalidArgumentError: jpeg::Uncompress failed
[[{{node decode_image/DecodeImage}}]] [Op:IteratorGetNext]


check whether the picture is damaged before training:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import os

num_skipped = 0
for folder_name in ("Fruit apples", "Fruit bananas", "Fruit oranges"):
    folder_path = os.path.join(".\data\image_data", folder_name)
    for fname in os.listdir(folder_path):

        fpath = os.path.join(folder_path, fname)

            fobj = open(fpath, mode="rb")
            is_jfif = tf.compat.as_bytes("JFIF") in fobj.peek(10)

        if not is_jfif:
            num_skipped += 1
            # Delete corrupted image

print("Deleted %d images" % num_skipped)

Delete the damaged picture and train again to solve the problem
if an error is prompted again, use:

# Determine if an image is corrupt from local
def is_valid_image(path):
    Check if the file is corrupt
        bValid = True
        fileObj = open(path, 'rb')  # Open in binary form
        buf = fileObj.read()
        if not buf.startswith(b'\xff\xd8'): # whether to start with \xff\xd8
            bValid = False
        elif buf[6:10] in (b'JFIF', b'Exif'): # ASCII code of "JFIF"
            if not buf.rstrip(b'\0\r\n').endswith(b'\xff\xd9'): # whether it ends with \xff\xd9
                bValid = False
            except Exception as e:
                bValid = False
    except Exception as e:
        return False
    return bValid
 num_skipped = 0
for folder_name in ("fruit-apple", "fruit-banana", "fruit-orange"):
    #os.path.join() joins two or more pathname components
    folder_path = os.path.join(". \data\image_data", folder_name)
    # os.listdir(path) lists the subdirectories under this directory
    for fname in os.listdir(folder_path):
        fpath = os.path.join(folder_path, fname)
        flag1 = is_valid_image(fpath)
        if not flag1:
            print(fpath)#Print the path and name of the error file

Adjust the error file and train again to solve the problem.