Tag Archives: Deep learning

[ncclUnhandledCudaError] unhandled cuda error, NCCL version xx.x.x

Problem description

Problems encountered during distributed training

RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:47, unhandled cuda error, NCCL version 21.0.3
ncclUnhandledCudaError: Call to CUDA function failed.


Problem-solving

From the error message, the failure happens while distributed training is being initialized, not during training itself, so the problem lies in the initialization of the distributed setup.

Run the following command to list the GPUs on the current server:

nvidia-smi -L

The listing below shows a mixed setup: GPU 1 is an RTX 3070, while the other cards are RTX 2080 Ti.

GPU 0: NVIDIA GeForce RTX 2080 Ti (UUID: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx)
GPU 1: NVIDIA GeForce RTX 3070 (UUID: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx)
GPU 2: NVIDIA GeForce RTX 2080 Ti (UUID: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx)
GPU 3: NVIDIA GeForce RTX 2080 Ti (UUID: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx)
GPU 4: NVIDIA GeForce RTX 2080 Ti (UUID: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx)
GPU 5: NVIDIA GeForce RTX 2080 Ti (UUID: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx)
GPU 6: NVIDIA GeForce RTX 2080 Ti (UUID: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx)
GPU 7: NVIDIA GeForce RTX 2080 Ti (UUID: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx)

Since the mismatched RTX 3070 is the likely culprit, I directly try training on cards 2-7 only (all RTX 2080 Ti).
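One way to do this is to hide the mismatched card via CUDA_VISIBLE_DEVICES; a minimal sketch (the variable must be set before CUDA is initialized, and the indices are the ones from the listing above):

import os

# Hide GPU 1 (the RTX 3070) so NCCL only sees the homogeneous 2080 Ti cards.
# Must be set before torch/CUDA is initialized.
os.environ["CUDA_VISIBLE_DEVICES"] = "2,3,4,5,6,7"

import torch  # imported after the variable is set
print(torch.cuda.device_count())  # should now report 6 devices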

Correct solution!

[Solved] error indicates that your module has parameters that were not used in producing loss

Error Messages:
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel, and by
making sure all forward function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn’t able to locate the output tensors in the return value of your module’s forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 0: 160 161 182 183 204 205 230 231 252 253 274 275 330 331 414 415 438 439 462 463 486 487 512 513 536 537 560 561 584 585
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
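Before changing any code, it can help to actually set the environment variable mentioned above so the error lists the offending parameters by name; a minimal sketch, assuming it is placed at the top of the training script before the process group is initialized:

import os

# Make DDP report which parameters did not receive gradients on this rank.
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"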

Solution:

Original Code

class AttentionBlock(nn.Module):
    def __init__(
        self,
    ):
        super().__init__()
        self.encoder_kv = conv_nd(1, 512, channels * 2, 1)  # this line is NOT commented out
        self.encoder_qkv = conv_nd(1, 512, channels * 3, 1)
        self.trans = nn.Linear(resolution*resolution*9+128, resolution*resolution*9)
    def forward(self, x, encoder_out=None):
        b, c, *spatial = x.shape
        x = x.reshape(b, c, -1)
        qkv = self.qkv(self.norm(x))
        if encoder_out is not None:
            # encoder_out = self.encoder_kv(encoder_out)  # this call is commented out, so self.encoder_kv is never used
            encoder_out = self.encoder_qkv(encoder_out)
        return encoder_out

Error reason:

self.encoder_kv is defined in __init__ but never used in forward, so its parameters receive no gradient and torch.nn.parallel.DistributedDataParallel raises the error above.

Modified code

class AttentionBlock(nn.Module):
    def __init__(
        self,
    ):
        super().__init__()
        # self.encoder_kv = conv_nd(1, 512, channels * 2, 1)  # commented out because it is not used in forward
        self.encoder_qkv = conv_nd(1, 512, channels * 3, 1)
        self.trans = nn.Linear(resolution*resolution*9+128, resolution*resolution*9)
    def forward(self, x, encoder_out=None):
        b, c, *spatial = x.shape
        x = x.reshape(b, c, -1)
        qkv = self.qkv(self.norm(x))
        if encoder_out is not None:
            # encoder_out = self.encoder_kv(encoder_out)
            encoder_out = self.encoder_qkv(encoder_out)
        return encoder_out

Comment out the line self.encoder_kv = conv_nd(1, 512, channels * 2, 1), which is not used in forward, and the program runs normally.
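If the unused submodule cannot simply be removed, another option the error message itself suggests is to let DDP tolerate unused parameters. A minimal sketch, assuming model and local_rank are already defined:

import torch

model = torch.nn.parallel.DistributedDataParallel(
    model,
    device_ids=[local_rank],
    find_unused_parameters=True,  # tolerate parameters that receive no gradient (small per-step overhead)
)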

[Solved] Mindspore 1.3.0 Compile Error: CMake Error at cmake/utils

[Function Module]
Compiling MindSpore 1.3.0 reports: CMake Error at cmake/utils.cmake:301 (message): Failed patch:

[Steps & Problems]
1. Executing bash build.sh -e ascend to compile gives the following error:

-- Build files have been written to: /root/mindspore-v1.3.0/build/mindspore/_deps/icu4c-subbuild
[100%] Built target icu4c-populate
icu4c_SOURCE_DIR : /root/mindspore-v1.3.0/build/mindspore/_deps/icu4c-src
patching /root/mindspore-v1.3.0/build/mindspore/_deps/icu4c-src -p1 < /root/mindspore-v1.3.0/build/mindspore/_ms_patch/icu4c.patch01
patching file icu4c/source/runConfigureICU
Reversed (or previously applied) patch detected!  Assume -R?[n]
Apply anyway?[n]
Skipping patch.
1 out of 1 hunk ignored -- saving rejects to file icu4c/source/runConfigureICU.rej
CMake Error at cmake/utils.cmake:301 (message):
  Failed patch:
  /root/mindspore-v1.3.0/build/mindspore/_ms_patch/icu4c.patch01
Call Stack (most recent call first):
  cmake/external_libs/icu4c.cmake:36 (mindspore_add_pkg)
  cmake/mind_expression.cmake:85 (include)
  CMakeLists.txt:51 (include)

-- Configuring incomplete, errors occurred!
See also "/root/mindspore-v1.3.0/build/mindspore/CMakeFiles/CMakeOutput.log".
See also "/root/mindspore-v1.3.0/build/mindspore/CMakeFiles/CMakeError.log".

This is usually caused by a previous compilation that failed partway through. Before compiling again, remove the partially patched icu4c sources:

rm -rf mindspore/build/mindspore/_deps/icu*

Then retry the compilation.

RuntimeError: stack expects each tensor to be equal size, but got [x] at entry 0 and [x] at entry 1

RuntimeError: stack expects each tensor to be equal size, but got [x] at entry 0 and [x] at entry 1

Problem description: when generating the dataloaders, the training set runs fine, but the test set raises this error: RuntimeError: stack expects each tensor to be equal size, but got [200] at entry 0 and [116] at entry 1.

How to solve: to build the dataloader I first need to build a dataset, and the error occurred because some samples in a minibatch had a different size from the others, so the default collate function could not stack them. I went into the custom dataset class and, through print debugging, found that the problem was with the dataset labels.

Solution: go into the custom dataset, print the shape of each sample it returns, and fix the field (here, the labels) whose size varies.
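A small debugging sketch along those lines, assuming dataset is the custom test Dataset and each sample is a tuple of tensors:

import torch

# Print the shape of every sample so the entry with a different size stands out.
for i in range(len(dataset)):
    sample = dataset[i]
    print(i, [tuple(v.shape) for v in sample if torch.is_tensor(v)])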

MindSpore Error: [ERROR] MD:unexpected error.Not a valid index

ERROR:[MD]:unexpected error. Not a valid index


Problem phenomenon: training on a single card runs correctly without any error, but after switching to distributed training the error above is reported. After troubleshooting, the cause turned out to be an incorrect use of distributed sampling: the distributed sampler was applied before the random sampler.

The order of distributed sampling and random sampling needs to be swapped; the correct way is to perform random sampling first and then distributed sampling. After this modification, distributed training runs correctly.

[Solved] Tensorflow Error: NameError: name ‘layers‘ is not defined

Error code:

 import tensorflow as tf
 net = layers.Dense(10)
 net.build((4, 10))
 net.kernel

NameError: name 'layers' is not defined

Error reason: layers has not been imported; import tensorflow as tf alone does not bring layers into the namespace.
Solution:

import tensorflow as tf
from tensorflow.keras import datasets, layers,optimizers
net = layers.Dense(10)
net.build((4, 10))
net.kernel

Operation results:

<tf.Variable 'kernel:0' shape=(10, 10) dtype=float32, numpy=
array([[ 0.22973484,  0.00857711, -0.21515384, -0.5346802 , -0.2584985 ,
         0.03767496,  0.22262502,  0.10832614,  0.12043941,  0.3197981 ],
       [ 0.12034583,  0.01719284, -0.37415868,  0.22801459,  0.49012756,
        -0.01656079, -0.02581853,  0.22888458, -0.3193212 , -0.23586014],
       [-0.50331104, -0.18943703,  0.47028244, -0.33412236,  0.04251152,
        -0.54133296,  0.23136115,  0.02571291, -0.36819634,  0.5134926 ],
       [-0.06907243,  0.33713734,  0.34277046,  0.24761981,  0.50419617,
        -0.20183799, -0.27459818, -0.34057558, -0.23564544,  0.34107167],
       [-0.51874346,  0.30625004,  0.07017416,  0.4792788 , -0.08462432,
         0.1762883 ,  0.47576356, -0.08242992,  0.0560475 ,  0.5385151 ],
       [-0.02134383,  0.02438915, -0.11708987,  0.26330394, -0.4951692 ,
         0.19778156, -0.1931901 , -0.41975048,  0.0376184 ,  0.23603398],
       [-0.20051709, -0.46164495,  0.15974921, -0.05227134,  0.14756906,
         0.12185448, -0.5285519 , -0.5298273 ,  0.14063555,  0.02481627],
       [-0.35953748,  0.30639488, -0.02970898, -0.5232449 , -0.10309196,
        -0.3557127 , -0.19765031,  0.3171267 ,  0.34930962, -0.15071085],
       [ 0.20013565,  0.11569405, -0.46884173, -0.40876222,  0.36319625,
         0.33609563,  0.2721032 , -0.04006624,  0.09699225,  0.20260221],
       [-0.03152204, -0.48894358,  0.3079273 , -0.5283493 , -0.44822672,
        -0.34838638,  0.41896552, -0.34962398, -0.24334553,  0.38500214]],
      dtype=float32)>

Problem solved.
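Alternatively, the layer can be referenced through the tf.keras namespace without the extra import; a minimal equivalent sketch:

import tensorflow as tf

net = tf.keras.layers.Dense(10)
net.build((4, 10))
print(net.kernel)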

How to Solve Yolov5 Common Error (Three Errors)

1. detect.py run error: IndexError: list index out of range. Reason: path error.
Solution: add an r prefix in front of the path string so the Windows path is treated as a raw string (see the example below).
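A minimal sketch (the path shown is hypothetical):

# The r prefix keeps backslashes in a Windows path from being read as escape sequences.
source = r"D:\datasets\images"   # instead of "D:\datasets\images"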

2. subprocess.CalledProcessError: Command 'git tag' returned non-zero exit status

Solution: open the Python environment, find subprocess.py under Lib, and change the check value to False on line 415.

My path is D:\Miniconda3\envs\pytorch\Lib

3. UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb2 in position 6: invalid start byte

Solution: in line 58 of the torch_utils.py file, change the original decode() to decode(encoding='gbk').

[Solved] with ERRTYPE = cudaError CUDA failure 999 unknown error

Project scenario: [with ERRTYPE = cudaError; bool THRW = true] CUDA failure 999: unknown error; GPU=24

An old program needed to be upgraded; it was previously running on CUDA 10.2.


Problem Description:

environment

CUDA 11.2 (previously 10.2)

onnxruntime-gpu 1.10

python 3.9.7

When starting the program, the following traceback appears:

Traceback (most recent call last):
  File "/home/aiuser/cover/liheng-foggun/app.py", line 15, in <module>
    model = DetectMultiBackend(weights=config.paddle.model_file)
  File "/home/aiuser/miniconda3/envs/cover/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/aiuser/cover/liheng-foggun/models/yolo.py", line 37, in __init__
    self.session = onnxruntime.InferenceSession(weights, providers=['CUDAExecutionProvider'])
  File "/home/aiuser/miniconda3/envs/cover/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 335, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "/home/aiuser/miniconda3/envs/cover/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 379, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
RuntimeError: /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:122 bool onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*) [with ERRTYPE =
 cudaError; bool THRW = true] /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:116 bool onnxruntime::CudaCall(ERRTYPE, const char*, const char*
, ERRTYPE, const char*) [with ERRTYPE = cudaError; bool THRW = true] CUDA failure 999: unknown error ; GPU=24 ; hostname=aiserver-sl-01 ; expr=cudaSetDevice(info_.device_id);

Cause analysis:

1. At first I thought it was an onnxruntime-gpu version problem; upgrading to 1.12 still reported the error.

2. It was also said to be a version incompatibility.

3. I tried to reinstall the driver. After uninstalling CUDA 11.2, nvidia-smi showed that the old 10.2-era driver was still present.

4. The root cause: the previous driver had not been uninstalled completely.


Solution:

1. Uninstall CUDA 10.2

sudo /usr/local/cuda-10.2/bin/cuda-uninstaller

2. Install the new driver

#install 515.57 offline
sudo ./NVIDIA-Linux-x86_64-515.57.run -no-x-check -no-nouveau-check

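After the reinstall, a quick way to confirm that onnxruntime can see the CUDA device again (assuming onnxruntime-gpu is installed) is:

import onnxruntime as ort

# 'CUDAExecutionProvider' should appear in this list if the driver/CUDA setup is healthy.
print(ort.get_available_providers())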

[Solved] Operator Not Allowed In Graph Error & Attribute Error Tensor object has no attribute numpy

These errors appear when compiling custom functions because, by default, tf2.x's Keras model.compile traces the training step into a static graph, in which concrete tensor values are not available.

Problem

When using the wrapping method to customize the loss function of a Keras model and needing to compute accuracy metrics such as precision or recall, or to extract the concrete values of the inputs y_true and y_pred (operations such as y_true.numpy()), an error message appears:

OperatorNotAllowedInGraphError: using a `tf.Tensor` as a Python `bool` is not allowed: AutoGraph did convert this function. This might indicate you are trying to use an unsupported feature.

Or

 AttributeError: 'Tensor' object has no attribute 'numpy'

 

Solution:

Pass the following argument to the compile function:

run_eagerly=True
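A minimal sketch, assuming model, custom_loss and custom_metric are defined elsewhere:

model.compile(
    optimizer="adam",
    loss=custom_loss,
    metrics=[custom_metric],
    run_eagerly=True,  # run the train step eagerly so y_true.numpy() and sklearn metrics work
)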

 

Reason:

Tf2.x enables eager mode (eager execution, i.e. a dynamic computation graph) by default. Compared with the static computation graph of tf1.x, eager mode is convenient for debugging: tensor values can easily be printed and evaluated, and it interoperates well with NumPy, so conversion between tensors and ndarrays is straightforward. The tradeoff is that it runs significantly slower: once a static computation graph is defined, execution happens almost entirely in C++ code on the TensorFlow core, so computation is more efficient and faster.

Even so, run_eagerly defaults to False in the model.compile method, which means the model's logic is wrapped in tf.function to achieve faster computation (the autograph mechanism converts the dynamic computation graph into a static one via the @tf.function wrapper). But the @tf.function wrapper requires the function to use basic tf operations rather than arbitrary Python operations or functions from other packages, so the first error occurs when calling functions such as sklearn.metrics' accuracy_score or imblearn.metrics' geometric_mean_score, and the second error occurs when using the y_true.numpy() method. The root cause is that, although tf2.x enables dynamic computation graphs by default, model.compile builds a static graph via @tf.function, which does not support the operations above.

After passing run_eagerly=True to model.compile, the model runs with the dynamic computation graph and the above operations work normally; the drawback is the lower execution efficiency of the dynamic graph.

[Solved] ProxyError: Conda cannot proceed due to an error in your proxy configuration.

ProxyError: Conda cannot proceed due to an error in your proxy configuration.

0. Error report

conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch

The following errors are reported when installing pytorch:

ProxyError: Conda cannot proceed due to an error in your proxy configuration.
Check for typos and other configuration errors in any '.netrc' file in your home directory,
any environment variables ending in '_PROXY', and any other system-wide proxy
configuration settings.

The problem lies in the proxy configuration.

1. Solutions

(1) View the proxy variables in the current terminal

env | grep -i "_PROXY"

(2) Unset the proxy variables one by one

unset HTTP_PROXY
unset https_proxy
unset http_proxy
unset no_proxy
unset NO_PROXY

(3) Verify that the proxy variables are gone

Enter the command again; if it prints nothing, the proxy variables have been removed:

env | grep -i "_PROXY"


[Solved] Yolov5 Deep Learning Error: RuntimeError: DataLoader worker (pid(s) 2516, 1768) exited unexpectedly

Project scenario:

A problem occurred when training with YOLOv5; I am using the GPU for training.


Problem description

An error is reported at the beginning of training: RuntimeError: DataLoader worker (pid(s) 2516, 1768) exited unexpectedly.


Cause analysis:

Because I train on the GPU and Anaconda's virtual memory allocation is sufficient, the problem should be the number of CPU worker threads. I had already tried adjusting the batch size, but it did not help.


Solution:

There is a parameter named --workers in the train.py file. Set it to 0.

The following is my setting for reference:

def parse_opt(known=False):
    parser = argparse.ArgumentParser()
    parser.add_argument('--weights', type=str, default=ROOT/'yolov5x.pt', help='initial weights path') # initial weights
    parser.add_argument('--cfg', type=str, default='yolov5_Scan_FDDI/PLC_model.yaml', help='model.yaml path') # model config file
    parser.add_argument('--data', type=str, default=ROOT/'yolov5_Scan_FDDI/PLC_parameter.yaml', help='dataset.yaml path') # dataset config file
    parser.add_argument('--hyp', type=str, default=ROOT/'data/hyps/hyp.scratch-low.yaml', help='hyperparameters path') # hyperparameter settings
    parser.add_argument('--epochs', type=int, default=100) # number of training epochs
    parser.add_argument('--batch-size', type=int, default=4, help='total batch size for all GPUs, -1 for autobatch') # batch size
    parser.add_argument('--imgsz', '--img', '--img-size', type=int, default=320, help='train, val image size (pixels)') # image size
    parser.add_argument('--rect', action='store_true', help='rectangular training')
    parser.add_argument('--resume', nargs='?', const=True, default=False, help='resume most recent training') # resume interrupted training
    parser.add_argument('--nosave', action='store_true', help='only save final checkpoint')
    parser.add_argument('--noval', action='store_true', help='only validate final epoch')
    parser.add_argument('--noautoanchor', action='store_true', help='disable AutoAnchor')
    parser.add_argument('--noplots', action='store_true', help='save no plot files')
    parser.add_argument('--evolve', type=int, nargs='?', const=300, help='evolve hyperparameters for x generations')
    parser.add_argument('--bucket', type=str, default='', help='gsutil bucket')
    parser.add_argument('--cache', type=str, nargs='?', const='ram', help='--cache images in "ram" (default) or "disk"')
    parser.add_argument('--image-weights', action='store_true', help='use weighted image selection for training')
    parser.add_argument('--device', default='0', help='cuda device, i.e. 0 or 0,1,2,3 or cpu') #GPU
    parser.add_argument('--multi-scale', action='store_true', help='vary img-size +/- 50%%')
    parser.add_argument('--single-cls', action='store_true', help='train multi-class data as single-class')
    parser.add_argument('--optimizer', type=str, choices=['SGD', 'Adam', 'AdamW'], default='SGD', help='optimizer')
    parser.add_argument('--sync-bn', action='store_true', help='use SyncBatchNorm, only available in DDP mode')
    parser.add_argument('--workers', type=int, default=0, help='max dataloader workers (per RANK in DDP mode)') # number of CPU dataloader workers
    parser.add_argument('--project', default=ROOT/'runs/train', help='save to project/name')
    parser.add_argument('--name', default='exp', help='save to project/name')
    parser.add_argument('--exist-ok', action='store_true', help='existing project/name ok, do not increment')
    parser.add_argument('--quad', action='store_true', help='quad dataloader')
    parser.add_argument('--cos-lr', action='store_true', help='cosine LR scheduler')
    parser.add_argument('--label-smoothing', type=float, default=0.0, help='Label smoothing epsilon')
    parser.add_argument('--patience', type=int, default=100, help='EarlyStopping patience (epochs without improvement)')
    parser.add_argument('--freeze', nargs='+', type=int, default=[0], help='Freeze layers: backbone=10, first3=0 1 2')
    parser.add_argument('--save-period', type=int, default=-1, help='Save checkpoint every x epochs (disabled if < 1)')
    parser.add_argument('--seed', type=int, default=0, help='Global training seed')
    parser.add_argument('--local_rank', type=int, default=-1, help='Automatic DDP Multi-GPU argument, do not modify')

    # Weights & Biases arguments
    parser.add_argument('--entity', default=None, help='W&B: Entity')
    parser.add_argument('--upload_dataset', nargs='?', const=True, default=False, help='W&B: Upload data, "val" option')
    parser.add_argument('--bbox_interval', type=int, default=-1, help='W&B: Set bounding-box image logging interval')
    parser.add_argument('--artifact_alias', type=str, default='latest', help='W&B: Version of dataset artifact to use')

    opt = parser.parse_known_args()[0] if known else parser.parse_args()
    return opt
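For reference, the same idea applies to any PyTorch DataLoader: num_workers=0 loads batches in the main process, so no worker subprocess can exit unexpectedly, at the cost of slower data loading. A minimal sketch, where train_dataset is hypothetical:

from torch.utils.data import DataLoader

loader = DataLoader(train_dataset, batch_size=4, shuffle=True, num_workers=0)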

[Solved] caffe Error: Check failed: cv_img.data Could not load

The following problem was encountered when running Caffe code:

Problem Description:

E0727 11:22:09.213124  6200 io.cpp:89] Could not open or find file D:/.../
F0727 11:22:09.213124  6200 image_data_layer.cpp:129] Check failed: cv_img.data Could not load D:/.../
*** Check failure stack trace: ***

Solution:

Check the train/test image list file for empty lines (usually the last line) and delete them.
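A small sketch that strips such blank lines, assuming "train.txt" is the image list read by the ImageData layer:

# Rewrite the list file without empty lines.
with open("train.txt") as f:
    lines = [line for line in f if line.strip()]
with open("train.txt", "w") as f:
    f.writelines(lines)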