Tag Archives: artificial intelligence

[Solved] PyTorch Lightning Error: KeyError: 'hidden_states'

How to Solve PyTorch Lightning Error KeyError: 'hidden_states'

Problem description: PyTorch Lightning reports KeyError: 'hidden_states' when the model is loaded like this:

model = BertModel.from_pretrained('bert-base-uncased')

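Solution: by default the model does not return its hidden states, so the key is missing from the output. Pass a config with output_hidden_states=True when loading the model, i.e. add config=BertConfig.from_pretrained('bert-base-uncased', output_hidden_states=True) to the call above:

from transformers import BertConfig, BertModel
model = BertModel.from_pretrained('bert-base-uncased', config=BertConfig.from_pretrained('bert-base-uncased', output_hidden_states=True))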

[Solved] ONNXImporter::handleNode DNN/ONNX and Can't create layer "onnx::Gather_384" of type "NonMaxSuppression"

Today I ran into a series of OpenCV model-loading errors while debugging YOLOv7 model conversion and loading. The title length limit does not allow the full message, so I am posting it here in its entirety.

[ERROR:0] global D:\opencv-python\opencv\modules\dnn\src\onnx\onnx_importer.cpp (720) cv::dnn::dnn4_v20211004::ONNXImporter::handleNode DNN/ONNX: ERROR during processing node with 5 inputs and 1 outputs: [NonMaxSuppression]:(onnx::Gather_384)
cv2.error: OpenCV(4.5.4) D:\opencv-python\opencv\modules\dnn\src\onnx\onnx_importer.cpp:739: error: (-2:Unspecified error) in function 'cv::dnn::dnn4_v20211004::ONNXImporter::handleNode'
> Node [NonMaxSuppression]:(onnx::Gather_384) parse error: OpenCV(4.5.4) D:\opencv-python\opencv\modules\dnn\src\dnn.cpp:615: error: (-2:Unspecified error) Can't create layer "onnx::Gather_384" of type "NonMaxSuppression" in function 'cv::dnn::dnn4_v20211004::LayerData::getLayerInstance'

At this point I decided to compare my own model with the official model node by node, and I finally found the problem at the very end of the graph.

[Official Model]

[My own model]

Seeing this, I wondered why the difference was so big. It should not be, since both models are built from the same code. So I started tracing the cause, and sure enough, I found the problem.

The official model ends at the position marked by the red frame, while my model continues with a long string of extra nodes after it. After printing the tensor shapes of both models for debugging, I guessed that the parameter settings used during model export were at fault. I then checked nearly every parameter I was unsure about and found the one responsible.

To make this easier to follow, here is my original conversion command:

python export.py --weights best.pt --grid --end2end --simplify --topk-all 100 --iou-thres 0.65 --conf-thres 0.35 --img-size 640 640 --max-wh 640 

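This is the command after the fix:

python38 export.py --weights best.pt --grid --simplify --topk-all 100 --iou-thres 0.65 --conf-thres 0.35 --img-size 640 640 --max-wh 640

The difference that matters is the --end2end flag, which caused the error. The log above shows that OpenCV cannot create the NonMaxSuppression layer, which is present in the graph exported with this flag. After removing it, my model looks like this: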

Because my model detects a single category, the final output is 1x25200x6, whereas the official model outputs 1x25200x85.
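The two output shapes can be checked with a little arithmetic. The sketch below is not taken from export.py. It assumes the usual YOLO layout of three detection scales with strides 8, 16 and 32, three anchors per grid cell, and 5 values (4 box coordinates plus 1 objectness score) in front of the class scores. The function name and its parameters are made up for illustration.

```python
def yolo_output_shape(img_size=640, num_classes=80,
                      strides=(8, 16, 32), anchors_per_cell=3):
    """Return (batch, predictions, values_per_prediction) for a square input.

    Assumed layout: one grid per stride, a fixed number of anchors per
    grid cell, and 5 + num_classes values per prediction.
    """
    # Number of grid cells summed over all detection scales
    cells = sum((img_size // stride) ** 2 for stride in strides)
    return (1, cells * anchors_per_cell, 5 + num_classes)


# Official model trained on 80 classes
print(yolo_output_shape(num_classes=80))  # (1, 25200, 85)
# Model trained on a single class
print(yolo_output_shape(num_classes=1))   # (1, 25200, 6)
```

For a 640x640 input the grids are 80x80, 40x40 and 20x20, i.e. 8400 cells and 25200 predictions, so only the last dimension changes with the number of classes.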

[Solved] Labelimg Open an image Error: Error opening file

LabelImg reports "Error opening file" in its interface when an image is opened.

Solution: re-save all the images to be annotated with the following script:

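import os
from tqdm import tqdm
from PIL import Image

dir_origin_path = "image save address"    # folder that contains the original images
dir_save_path = "image resave address"    # folder that receives the re-saved images

# Create the output folder once, before the loop
os.makedirs(dir_save_path, exist_ok=True)

img_names = os.listdir(dir_origin_path)
for img_name in tqdm(img_names):
    # Only process files with a known image extension
    if img_name.lower().endswith(('.bmp', '.dib', '.png', '.jpg', '.jpeg', '.pbm', '.pgm', '.ppm', '.tif', '.tiff')):
        image_path = os.path.join(dir_origin_path, img_name)
        # Decode the image and convert it to 3-channel RGB
        image = Image.open(image_path)
        image = image.convert('RGB')
        # Write it back under the same file name in the output folder
        image.save(os.path.join(dir_save_path, img_name))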

[Solved] RuntimeError: DefaultCPUAllocator: not enough memory: you tried to allocate 1105920 bytes.

Question

RuntimeError: [enforce fail at ..\c10\core\CPUAllocator.cpp:76] data. DefaultCPUAllocator: not enough memory: you tried to allocate 1105920 bytes.

Today I ran YOLOv7 on my own computer. I have no GPU, so I ran the test model on the CPU. Predicting a single image works without any problem. However, when I predict a video (multiple images), it reports that memory allocation failed:

DefaultCPUAllocator: not enough memory: you tried to allocate 1105920 bytes.

Moreover, the error does not appear right after the second image. It appears when the 17th image is processed, which suggests that memory is not being released from one image to the next.

Analysis

In PyTorch, a tensor has a requires_grad attribute. If it is set to True, gradients are computed for the tensor automatically during backpropagation. requires_grad defaults to False. If a leaf node (a tensor created directly by the user) has requires_grad set to True, then every node that depends on it also has requires_grad set to True, even if the other tensors it depends on have requires_grad = False.


Note:

requires_grad is an attribute of PyTorch's general data structure, the Tensor. It indicates whether gradient information must be kept for this quantity during computation. Take linear regression as an example: the weights w and the bias b are the objects to be trained. To find the most suitable parameter values, we define a loss function and train by gradient backpropagation.

When requires_grad is False, no gradient is computed for the tensor during backpropagation, which saves memory (or GPU memory).

The solution follows directly: do not let the model record gradients during testing, because they are never used.

 

Solution:

Wrap inference in with torch.no_grad() so that the model does not record gradients during testing:

with torch.no_grad():
    output, _ = model(image)  # run the forward pass inside the no_grad block

This way, no computation graph is built and no gradient is stored when the model processes each image.

Perfect solution!

[Solved] RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`

[problem description]

The code previously ran normally. After the dataset was expanded, the GPU program that trains the deep learning model reported the following error, although no CUDA out of memory error was shown.

RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`

[solution 1]

Run the program on the CPU instead. It then runs normally, but training is very slow and takes a long time.

--device cpu

[solution 2]

Reduce the batch size used for training. With a smaller batch size the program runs normally on the GPU.
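One way to find a batch size that fits is to halve it until a training step succeeds. The sketch below is framework-independent and only illustrates that search. find_working_batch_size and fake_step are hypothetical names, and the check on the error text is an assumption about how the failure is reported. In a real training script it may be safer to restart the process with the smaller value than to retry in place.

```python
def find_working_batch_size(run_step, start, minimum=1):
    """Halve the batch size until run_step(batch_size) stops failing.

    run_step is any callable that runs one training step and raises
    RuntimeError when memory allocation fails.
    """
    batch_size = start
    while batch_size >= minimum:
        try:
            run_step(batch_size)
            return batch_size
        except RuntimeError as err:
            message = str(err)
            # Retry only on allocation failures; re-raise anything else
            if "ALLOC_FAILED" not in message and "out of memory" not in message:
                raise
            batch_size //= 2
    raise RuntimeError(f"no batch size >= {minimum} fits in memory")


def fake_step(batch_size):
    """Stand-in for a real training step: pretends only 16 samples fit."""
    if batch_size > 16:
        raise RuntimeError("CUDA error: CUBLAS_STATUS_ALLOC_FAILED")


print(find_working_batch_size(fake_step, start=64))  # 16
```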

[Solved] Camera Calibration Error: ErrorMessage: Image size does not match the measurement in camera parameters

1. problem description

ErrorMessage: Image size does not match the measurement in camera parameters

During HALCON camera calibration, the above error is reported when find_calib_object is called to locate the calibration plate after the initial camera parameters have been generated.

2. Solutions and error causes

The initial camera parameters (startCamera) were created with the size of the original image, but the image passed to find_calib_object had been rotated about its center. Its size therefore no longer matched the size stored in the camera parameters.
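The mismatch is easy to check before calibration. The sketch below is plain Python, not HALCON code. It assumes the rotation is a multiple of 90 degrees, and the function names and the 1280x1024 image size are made up for illustration.

```python
def size_after_rotation(width, height, angle_deg):
    """Size of an image after rotating it by a multiple of 90 degrees."""
    if angle_deg % 90 != 0:
        raise ValueError("this sketch only handles multiples of 90 degrees")
    # An odd number of quarter turns swaps width and height
    if (angle_deg // 90) % 2:
        return (height, width)
    return (width, height)


def matches_camera(image_size, camera_image_size):
    """True if the image has the size stored in the camera parameters."""
    return tuple(image_size) == tuple(camera_image_size)


camera_size = (1280, 1024)  # hypothetical size used for the initial parameters
rotated = size_after_rotation(1280, 1024, 90)

print(rotated)                               # (1024, 1280)
print(matches_camera(rotated, camera_size))  # False
```

Either create the initial camera parameters with the size of the rotated image, or search for the calibration plate in the unrotated image.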

[Solved] Using summary to View network parameters Error: RuntimeError: Input type (torch.cuda.FloatTensor)

Use summary to view network parameters

To view the detailed parameters of a network, use summary from torchsummary:

from torchsummary import summary
summary(model, (3, 448, 448))

Show results

        Layer (type)               Output Shape         Param #
================================================================
            Conv2d-1         [-1, 64, 224, 224]           9,408
       BatchNorm2d-2         [-1, 64, 224, 224]             128
              ReLU-3         [-1, 64, 224, 224]               0
         MaxPool2d-4         [-1, 64, 112, 112]               0
            Conv2d-5         [-1, 64, 112, 112]           4,096
       BatchNorm2d-6         [-1, 64, 112, 112]             128
              ReLU-7         [-1, 64, 112, 112]               0
            Conv2d-8         [-1, 64, 112, 112]          36,864
       BatchNorm2d-9         [-1, 64, 112, 112]             128
             ReLU-10         [-1, 64, 112, 112]               0
           Conv2d-11        [-1, 256, 112, 112]          16,384
      BatchNorm2d-12        [-1, 256, 112, 112]             512
           Conv2d-13        [-1, 256, 112, 112]          16,384
      BatchNorm2d-14        [-1, 256, 112, 112]             512
             ReLU-15        [-1, 256, 112, 112]               0
       Bottleneck-16        [-1, 256, 112, 112]               0
           Conv2d-17         [-1, 64, 112, 112]          16,384
      BatchNorm2d-18         [-1, 64, 112, 112]             128
             ReLU-19         [-1, 64, 112, 112]               0
           Conv2d-20         [-1, 64, 112, 112]          36,864
      BatchNorm2d-21         [-1, 64, 112, 112]             128
             ReLU-22         [-1, 64, 112, 112]               0
           Conv2d-23        [-1, 256, 112, 112]          16,384
      BatchNorm2d-24        [-1, 256, 112, 112]             512
             ReLU-25        [-1, 256, 112, 112]               0
       Bottleneck-26        [-1, 256, 112, 112]               0
           Conv2d-27         [-1, 64, 112, 112]          16,384
      BatchNorm2d-28         [-1, 64, 112, 112]             128
             ReLU-29         [-1, 64, 112, 112]               0
           Conv2d-30         [-1, 64, 112, 112]          36,864
      BatchNorm2d-31         [-1, 64, 112, 112]             128
             ReLU-32         [-1, 64, 112, 112]               0
           Conv2d-33        [-1, 256, 112, 112]          16,384
      BatchNorm2d-34        [-1, 256, 112, 112]             512
             ReLU-35        [-1, 256, 112, 112]               0
       Bottleneck-36        [-1, 256, 112, 112]               0
           Conv2d-37        [-1, 128, 112, 112]          32,768
      BatchNorm2d-38        [-1, 128, 112, 112]             256
             ReLU-39        [-1, 128, 112, 112]               0
           Conv2d-40          [-1, 128, 56, 56]         147,456
      BatchNorm2d-41          [-1, 128, 56, 56]             256
             ReLU-42          [-1, 128, 56, 56]               0
           Conv2d-43          [-1, 512, 56, 56]          65,536
      BatchNorm2d-44          [-1, 512, 56, 56]           1,024
           Conv2d-45          [-1, 512, 56, 56]         131,072
      BatchNorm2d-46          [-1, 512, 56, 56]           1,024
             ReLU-47          [-1, 512, 56, 56]               0
       Bottleneck-48          [-1, 512, 56, 56]               0
           Conv2d-49          [-1, 128, 56, 56]          65,536
      BatchNorm2d-50          [-1, 128, 56, 56]             256
             ReLU-51          [-1, 128, 56, 56]               0
           Conv2d-52          [-1, 128, 56, 56]         147,456
      BatchNorm2d-53          [-1, 128, 56, 56]             256
             ReLU-54          [-1, 128, 56, 56]               0
           Conv2d-55          [-1, 512, 56, 56]          65,536
      BatchNorm2d-56          [-1, 512, 56, 56]           1,024
             ReLU-57          [-1, 512, 56, 56]               0
       Bottleneck-58          [-1, 512, 56, 56]               0
           Conv2d-59          [-1, 128, 56, 56]          65,536
      BatchNorm2d-60          [-1, 128, 56, 56]             256
             ReLU-61          [-1, 128, 56, 56]               0
           Conv2d-62          [-1, 128, 56, 56]         147,456
      BatchNorm2d-63          [-1, 128, 56, 56]             256
             ReLU-64          [-1, 128, 56, 56]               0
           Conv2d-65          [-1, 512, 56, 56]          65,536
      BatchNorm2d-66          [-1, 512, 56, 56]           1,024
             ReLU-67          [-1, 512, 56, 56]               0
       Bottleneck-68          [-1, 512, 56, 56]               0
           Conv2d-69          [-1, 128, 56, 56]          65,536
      BatchNorm2d-70          [-1, 128, 56, 56]             256
             ReLU-71          [-1, 128, 56, 56]               0
           Conv2d-72          [-1, 128, 56, 56]         147,456
      BatchNorm2d-73          [-1, 128, 56, 56]             256
             ReLU-74          [-1, 128, 56, 56]               0
           Conv2d-75          [-1, 512, 56, 56]          65,536
      BatchNorm2d-76          [-1, 512, 56, 56]           1,024
             ReLU-77          [-1, 512, 56, 56]               0
       Bottleneck-78          [-1, 512, 56, 56]               0
           Conv2d-79          [-1, 256, 56, 56]         131,072
      BatchNorm2d-80          [-1, 256, 56, 56]             512
             ReLU-81          [-1, 256, 56, 56]               0
           Conv2d-82          [-1, 256, 28, 28]         589,824
      BatchNorm2d-83          [-1, 256, 28, 28]             512
             ReLU-84          [-1, 256, 28, 28]               0
           Conv2d-85         [-1, 1024, 28, 28]         262,144
      BatchNorm2d-86         [-1, 1024, 28, 28]           2,048
           Conv2d-87         [-1, 1024, 28, 28]         524,288
      BatchNorm2d-88         [-1, 1024, 28, 28]           2,048
             ReLU-89         [-1, 1024, 28, 28]               0
       Bottleneck-90         [-1, 1024, 28, 28]               0
           Conv2d-91          [-1, 256, 28, 28]         262,144
      BatchNorm2d-92          [-1, 256, 28, 28]             512
             ReLU-93          [-1, 256, 28, 28]               0
           Conv2d-94          [-1, 256, 28, 28]         589,824
      BatchNorm2d-95          [-1, 256, 28, 28]             512
             ReLU-96          [-1, 256, 28, 28]               0
           Conv2d-97         [-1, 1024, 28, 28]         262,144
      BatchNorm2d-98         [-1, 1024, 28, 28]           2,048
             ReLU-99         [-1, 1024, 28, 28]               0
      Bottleneck-100         [-1, 1024, 28, 28]               0
          Conv2d-101          [-1, 256, 28, 28]         262,144
     BatchNorm2d-102          [-1, 256, 28, 28]             512
            ReLU-103          [-1, 256, 28, 28]               0
          Conv2d-104          [-1, 256, 28, 28]         589,824
     BatchNorm2d-105          [-1, 256, 28, 28]             512
            ReLU-106          [-1, 256, 28, 28]               0
          Conv2d-107         [-1, 1024, 28, 28]         262,144
     BatchNorm2d-108         [-1, 1024, 28, 28]           2,048
            ReLU-109         [-1, 1024, 28, 28]               0
      Bottleneck-110         [-1, 1024, 28, 28]               0
          Conv2d-111          [-1, 256, 28, 28]         262,144
     BatchNorm2d-112          [-1, 256, 28, 28]             512
            ReLU-113          [-1, 256, 28, 28]               0
          Conv2d-114          [-1, 256, 28, 28]         589,824
     BatchNorm2d-115          [-1, 256, 28, 28]             512
            ReLU-116          [-1, 256, 28, 28]               0
          Conv2d-117         [-1, 1024, 28, 28]         262,144
     BatchNorm2d-118         [-1, 1024, 28, 28]           2,048
            ReLU-119         [-1, 1024, 28, 28]               0
      Bottleneck-120         [-1, 1024, 28, 28]               0
          Conv2d-121          [-1, 256, 28, 28]         262,144
     BatchNorm2d-122          [-1, 256, 28, 28]             512
            ReLU-123          [-1, 256, 28, 28]               0
          Conv2d-124          [-1, 256, 28, 28]         589,824
     BatchNorm2d-125          [-1, 256, 28, 28]             512
            ReLU-126          [-1, 256, 28, 28]               0
          Conv2d-127         [-1, 1024, 28, 28]         262,144
     BatchNorm2d-128         [-1, 1024, 28, 28]           2,048
            ReLU-129         [-1, 1024, 28, 28]               0
      Bottleneck-130         [-1, 1024, 28, 28]               0
          Conv2d-131          [-1, 256, 28, 28]         262,144
     BatchNorm2d-132          [-1, 256, 28, 28]             512
            ReLU-133          [-1, 256, 28, 28]               0
          Conv2d-134          [-1, 256, 28, 28]         589,824
     BatchNorm2d-135          [-1, 256, 28, 28]             512
            ReLU-136          [-1, 256, 28, 28]               0
          Conv2d-137         [-1, 1024, 28, 28]         262,144
     BatchNorm2d-138         [-1, 1024, 28, 28]           2,048
            ReLU-139         [-1, 1024, 28, 28]               0
      Bottleneck-140         [-1, 1024, 28, 28]               0
          Conv2d-141          [-1, 512, 28, 28]         524,288
     BatchNorm2d-142          [-1, 512, 28, 28]           1,024
            ReLU-143          [-1, 512, 28, 28]               0
          Conv2d-144          [-1, 512, 14, 14]       2,359,296
     BatchNorm2d-145          [-1, 512, 14, 14]           1,024
            ReLU-146          [-1, 512, 14, 14]               0
          Conv2d-147         [-1, 2048, 14, 14]       1,048,576
     BatchNorm2d-148         [-1, 2048, 14, 14]           4,096
          Conv2d-149         [-1, 2048, 14, 14]       2,097,152
     BatchNorm2d-150         [-1, 2048, 14, 14]           4,096
            ReLU-151         [-1, 2048, 14, 14]               0
      Bottleneck-152         [-1, 2048, 14, 14]               0
          Conv2d-153          [-1, 512, 14, 14]       1,048,576
     BatchNorm2d-154          [-1, 512, 14, 14]           1,024
            ReLU-155          [-1, 512, 14, 14]               0
          Conv2d-156          [-1, 512, 14, 14]       2,359,296
     BatchNorm2d-157          [-1, 512, 14, 14]           1,024
            ReLU-158          [-1, 512, 14, 14]               0
          Conv2d-159         [-1, 2048, 14, 14]       1,048,576
     BatchNorm2d-160         [-1, 2048, 14, 14]           4,096
            ReLU-161         [-1, 2048, 14, 14]               0
      Bottleneck-162         [-1, 2048, 14, 14]               0
          Conv2d-163          [-1, 512, 14, 14]       1,048,576
     BatchNorm2d-164          [-1, 512, 14, 14]           1,024
            ReLU-165          [-1, 512, 14, 14]               0
          Conv2d-166          [-1, 512, 14, 14]       2,359,296
     BatchNorm2d-167          [-1, 512, 14, 14]           1,024
            ReLU-168          [-1, 512, 14, 14]               0
          Conv2d-169         [-1, 2048, 14, 14]       1,048,576
     BatchNorm2d-170         [-1, 2048, 14, 14]           4,096
            ReLU-171         [-1, 2048, 14, 14]               0
      Bottleneck-172         [-1, 2048, 14, 14]               0
          Conv2d-173          [-1, 256, 14, 14]         524,288
     BatchNorm2d-174          [-1, 256, 14, 14]             512
          Conv2d-175          [-1, 256, 14, 14]         589,824
     BatchNorm2d-176          [-1, 256, 14, 14]             512
          Conv2d-177          [-1, 256, 14, 14]          65,536
     BatchNorm2d-178          [-1, 256, 14, 14]             512
          Conv2d-179          [-1, 256, 14, 14]         524,288
     BatchNorm2d-180          [-1, 256, 14, 14]             512
detnet_bottleneck-181          [-1, 256, 14, 14]               0
          Conv2d-182          [-1, 256, 14, 14]          65,536
     BatchNorm2d-183          [-1, 256, 14, 14]             512
          Conv2d-184          [-1, 256, 14, 14]         589,824
     BatchNorm2d-185          [-1, 256, 14, 14]             512
     BatchNorm2d-197           [-1, 30, 14, 14]              60
================================================================

Error reported:

RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same

The input is a CUDA tensor while the model weights are still on the CPU. Solution: move the model to the GPU before calling summary:

from torchsummary import summary
summary(model.cuda(), (3, 448, 448))

[Solved] Error in Summary.factor ‘sum’ not meaningful for factors

 

Question:

The root cause is a wrong data type: sum() is not defined for the factor type.

#create a vector of class factor
factor_vector <- as.factor(c(1, 7, 12, 14, 15))

#attempt to sum the values in the vector
sum(factor_vector)

Error in Summary.factor(1:5, na.rm = FALSE) : 
  ‘sum’ not meaningful for factors

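Solution:

Convert the factor to numeric values. Go through as.character first, because calling as.numeric directly on a factor returns the internal level codes rather than the original values.

mydata$value <- as.numeric(as.character(mydata$value))
is.numeric(mydata$value)

#convert factor vector to numeric vector and sum the values
new_vector <- as.numeric(as.character(factor_vector))
sum(new_vector)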

#[1] 49
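Why the conversion has to go through as.character can be seen by imitating how a factor is stored. The sketch below is written in Python rather than R and is only an analogy: levels and codes are plain lists standing in for a factor's level labels and its 1-based integer codes.

```python
values = [1, 7, 12, 14, 15]

# A factor keeps the distinct values as text labels (its levels) ...
levels = [str(v) for v in sorted(set(values))]
# ... and stores each element as a 1-based code pointing into the levels
codes = [levels.index(str(v)) + 1 for v in values]

# Analogue of as.numeric(factor_vector): reads the codes
from_codes = sum(codes)
# Analogue of as.numeric(as.character(factor_vector)): reads the labels
from_labels = sum(float(levels[code - 1]) for code in codes)

print(codes)        # [1, 2, 3, 4, 5]
print(from_codes)   # 15
print(from_labels)  # 49.0
```

Summing the codes gives 15, while summing the labels gives the intended 49.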


Other (the maximum can be computed for numeric, character and date types)

max() works on numeric, character and date vectors. The minimum can be obtained in the same way with min().

numeric_vector <- c(1, 2, 12, 14)
max(numeric_vector)

#[1] 14

character_vector <- c("a", "b", "f")
max(character_vector)

#[1] "f"

date_vector <- as.Date(c("2019-01-01", "2019-03-05", "2019-03-04"))
max(date_vector)

#[1] "2019-03-05"

The R language is named R partly after the first names of its two authors (Robert Gentleman and Ross Ihaka) and partly as a play on the name of the S language from Bell Labs (R is often described as a dialect of S).

R is a programming language designed for mathematical and statistical work. It is mainly used for statistical analysis, plotting and data mining.

If you are new to programming and want to learn general-purpose programming, R is not an ideal choice. You can choose Python, C or Java instead.

R descends from the S language, which, like C, was developed at Bell Laboratories, but the two languages have different areas of emphasis. R is an interpreted language aimed at researchers in mathematics and statistics, while C is designed for software engineers.

R is interpreted (unlike C, which is compiled). It generally executes much more slowly than C and is harder to optimize. However, it provides richer data-structure operations at the syntax level and can easily output text and graphics, so it is widely used in mathematics, especially in statistics.

[Solved] Error in Summary.factor ‘max’ not meaningful for factors

Question:

The root cause is a wrong data type: max() is not defined for the factor type.

#create a vector of class factor
factor_vector <- as.factor(c(1, 7, 12, 14, 15))

#attempt to find max value in the vector
max(factor_vector)

#Error in Summary.factor(1:5, na.rm = FALSE) : 
#  'max' not meaningful for factors

Solution:

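Convert the factor to a numeric value or a string. Here it is converted to a numeric value. Go through as.character first, because calling as.numeric directly on a factor returns the internal level codes rather than the original values.

mydata$value <- as.numeric(as.character(mydata$value))
is.numeric(mydata$value)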

#convert factor vector to numeric vector and find the max value
new_vector <- as.numeric(as.character(factor_vector))
max(new_vector)

#[1] 15


Other (the maximum can be computed for numeric, character and date types)

max() works on numeric, character and date vectors. The minimum can be obtained in the same way with min().

numeric_vector <- c(1, 2, 12, 14)
max(numeric_vector)

#[1] 14

character_vector <- c("a", "b", "f")
max(character_vector)

#[1] "f"

date_vector <- as.Date(c("2019-01-01", "2019-03-05", "2019-03-04"))
max(date_vector)

#[1] "2019-03-05"

[ncclUnhandledCudaError] unhandled cuda error, NCCL version xx.x.x

Problem description

The following error occurred during distributed training:

RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:47, unhandled cuda error, NCCL version 21.0.3
ncclUnhandledCudaError: Call to CUDA function failed.


Problem-solving

The error message shows that the failure happens while distributed training is being initialized, not during training itself. The problem therefore lies in the initialization of distributed training.

Enter the following command to list the GPUs on the current server:

nvidia-smi -L

The output shows that GPU 1 is an RTX 3070, while all the other cards are RTX 2080 Ti:

GPU 0: NVIDIA GeForce RTX 2080 Ti (UUID: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx)
GPU 1: NVIDIA GeForce RTX 3070 (UUID: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx)
GPU 2: NVIDIA GeForce RTX 2080 Ti (UUID: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx)
GPU 3: NVIDIA GeForce RTX 2080 Ti (UUID: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx)
GPU 4: NVIDIA GeForce RTX 2080 Ti (UUID: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx)
GPU 5: NVIDIA GeForce RTX 2080 Ti (UUID: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx)
GPU 6: NVIDIA GeForce RTX 2080 Ti (UUID: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx)
GPU 7: NVIDIA GeForce RTX 2080 Ti (UUID: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx)

Therefore, I simply used cards 2-7 for training, which leaves the RTX 3070 out of the process group.
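The post does not say how cards 2-7 were selected. One common way, shown here as an assumption, is to set the CUDA_VISIBLE_DEVICES environment variable before the training framework initializes CUDA. Setting CUDA_DEVICE_ORDER to PCI_BUS_ID makes CUDA number the cards in the same order as nvidia-smi. The helper visible_devices is a made-up name.

```python
import os


def visible_devices(first, last):
    """Build a CUDA_VISIBLE_DEVICES value for the inclusive range first..last."""
    return ",".join(str(index) for index in range(first, last + 1))


# Both variables must be set before any CUDA context is created,
# i.e. before the training framework touches the GPU.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"           # same numbering as nvidia-smi
os.environ["CUDA_VISIBLE_DEVICES"] = visible_devices(2, 7)

print(os.environ["CUDA_VISIBLE_DEVICES"])  # 2,3,4,5,6,7
```

Inside the process the six selected cards are then renumbered from 0 to 5.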

Correct solution!

[Solved] CUDA failure 999: unknown error ; GPU=-351697408 ; hostname=4f5e6dff58e6 ; expr=cudaSetDevice(info_.device_id);

How to Solve error: CUDA failure 999: unknown error

1. Error Message:

CUDA failure 999: unknown error ; GPU=-351697408 ; hostname=4f5e6dff58e6 ; expr=cudaSetDevice(info_.device_id);

 

2. Solution:

Reload the nvidia_uvm kernel module with the following commands:

sudo rmmod nvidia_uvm

sudo modprobe nvidia_uvm

MindSpore Error: [ERROR] MD:unexpected error.Not a valid index

ERROR:[MD]:unexpected error. Not a valid index


Problem: with a single card no error is reported and training runs correctly. After switching to distributed training, the error above is reported. Troubleshooting showed that the cause was incorrect use of distributed sampling.
The code that triggered the error was as follows:

The order of distributed sampling and random sampling needs to be swapped. The correct way is to perform random sampling first and then distributed sampling.

The corrected code is as follows:

After this change, distributed training runs correctly.
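The original code is not reproduced here, so the following is only a plain-Python sketch of the ordering "random sampling first, distributed sampling second". It does not use the MindSpore sampler API, and shuffle_then_shard and its parameters are hypothetical names. Every rank shuffles the full index range with the same seed and then keeps its own slice, so each rank receives valid indices and no index is used twice.

```python
import random


def shuffle_then_shard(num_samples, num_shards, shard_id, seed=0):
    """Shuffle all indices first, then keep the slice owned by one shard."""
    indices = list(range(num_samples))
    # Same seed on every rank, so every rank sees the same permutation
    random.Random(seed).shuffle(indices)
    # Distributed step: shard k takes every num_shards-th index, starting at k
    return indices[shard_id::num_shards]


shards = [shuffle_then_shard(10, num_shards=2, shard_id=rank) for rank in range(2)]

# Together the shards cover every sample exactly once
print(sorted(shards[0] + shards[1]) == list(range(10)))  # True
```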