MindSpore Error: [ERROR] MD: unexpected error. Not a valid index

ERROR: [MD]: unexpected error. Not a valid index


Problem: training on a single card completes without error, but switching to distributed training raises the error above. Troubleshooting traced the cause to incorrect use of distributed sampling.

The erroneous usage was as follows:
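Since the original screenshot is unavailable, the snippet below is a hypothetical reconstruction using chained mindspore.dataset samplers (in a sampler chain, the child sampler runs first and feeds its indices to the parent); the dataset path and shard settings are placeholders:

import mindspore.dataset as ds

rank_id, rank_size = 0, 8  # placeholder shard settings

# Wrong order: the DistributedSampler is attached as the child, so sharding
# happens before the random sampling -- the misuse described above, which
# can yield invalid indices in distributed runs.
sampler = ds.RandomSampler()
sampler.add_child(ds.DistributedSampler(num_shards=rank_size, shard_id=rank_id))
dataset = ds.ImageFolderDataset("path/to/dataset", sampler=sampler)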

The order of the two samplers must be swapped: random sampling should be performed first, and distributed sampling afterwards.

The correct modification is as follows:
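(Again a hypothetical sketch under the same assumptions: attaching the RandomSampler as the child makes the random sampling happen first, and the DistributedSampler then shards the shuffled indices.)

import mindspore.dataset as ds

rank_id, rank_size = 0, 8  # placeholder shard settings

# Correct order: random sampling first (child), distributed sharding second (parent).
sampler = ds.DistributedSampler(num_shards=rank_size, shard_id=rank_id, shuffle=False)
sampler.add_child(ds.RandomSampler())
dataset = ds.ImageFolderDataset("path/to/dataset", sampler=sampler)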

After this modification, distributed training runs correctly.

[Solved] MindSpore network custom backward error: TypeError: The params of function 'bprop' of Primitive or Cell requires the forward inputs as well as the 'out' and 'dout'

1. Error description

1.1 System Environment

Hardware Environment(Ascend/GPU/CPU): GPU
Software Environment:

  • MindSpore version (source or binary): 1.7.0
  • Python version (e.g., Python 3.7.5): 3.7.5
  • OS platform and distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04.4 LTS
  • GCC/Compiler version (if compiled from source): 7.5.0

1.2 Basic information

1.2.1 Source code

import mindspore as ms
import mindspore.nn as nn
from mindspore.common.tensor import Tensor
from mindspore.ops import composite as C

grad_all = C.GradOperation(get_all=True)

class MulAdd(nn.Cell):
    def construct(self, x, y):
        return 2 * x + y

    def bprop(self, x, y, out):
        return 2 * x, 2 * y
mul_add = MulAdd()
x = Tensor(1, dtype=ms.int32)
y = Tensor(2, dtype=ms.int32)
output = grad_all(mul_add)(x, y)

1.2.2 Error reporting

TypeError: The params of function 'bprop' of Primitive or Cell requires the forward inputs as well as the 'out' and 'dout'

Traceback (most recent call last):
  File "test_grad.py", line 20, in <module>
    output = grad_all(mul_add)(x, y)
  File "/home/liangzhibo/mindspore/build/package/mindspore/common/api.py", line 522, in staging_specialize
    out = _MindsporeFunctionExecutor(func, hash_obj, input_signature, process_obj)(*args)
  File "/home/liangzhibo/mindspore/build/package/mindspore/common/api.py", line 93, in wrapper
    results = fn(*arg, **kwargs)
  File "/home/liangzhibo/mindspore/build/package/mindspore/common/api.py", line 353, in __call__
    phase = self.compile(args_list, self.fn.__name__)
  File "/home/liangzhibo/mindspore/build/package/mindspore/common/api.py", line 321, in compile
    is_compile = self._graph_executor.compile(self.fn, compile_args, phase, True)
TypeError: The params of function 'bprop' of Primitive or Cell requires the forward inputs as well as the 'out' and 'dout'.
In file test_grad.py(13)
    def bprop(self, x, y, out):
    ^

----------------------------------------------------
- The Traceback of Net Construct Code:
----------------------------------------------------

# In file test_grad.py(13)
    def bprop(self, x, y, out):
    ^

----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/frontend/optimizer/ad/kprim.cc:651 BuildOutput

2. Cause analysis and solution

In this use case we define a custom backward rule (bprop) for a Cell, and the error message points at the parameter list of that rule:

def bprop(self, x, y, out):

This signature is incomplete. A Cell's custom bprop must accept three kinds of inputs: the forward inputs of the Cell (x and y in this use case), the forward output of the Cell (out), and the gradient accumulated so far by the backward pass (dout). The run fails because the dout parameter is missing, so the fix is simply:

def bprop(self, x, y, out, dout):
    return 2 * x, 2 * y

With this change, the program runs normally.

dout is the gradient produced by the previous node in the backward graph; the bprop function needs this input so it can chain the already-computed gradient through the custom rule. (The original post illustrates the three kinds of inputs with a diagram, omitted here.)

In addition, all three kinds of inputs of bprop are required when the graph is built, so even inputs that the bprop body does not use must still appear in its parameter list.
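For completeness, here is a hedged sketch (my variant, not from the original post) of a bprop that does consume dout, multiplying the incoming gradient through as the chain rule requires:

import mindspore as ms
import mindspore.nn as nn
from mindspore.common.tensor import Tensor
from mindspore.ops import composite as C

grad_all = C.GradOperation(get_all=True)

class MulAddChain(nn.Cell):
    def construct(self, x, y):
        return 2 * x + y

    def bprop(self, x, y, out, dout):
        # d(2x + y)/dx = 2, d(2x + y)/dy = 1; scale both by the incoming dout.
        return 2 * dout, dout

x = Tensor(1, dtype=ms.int32)
y = Tensor(2, dtype=ms.int32)
print(grad_all(MulAddChain())(x, y))  # (2, 1) when the top-level dout is 1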

[Solved] MindSpore Error: ValueError: invalid literal for int() with base 10: 'the'

1 Error description

1.1 System Environment

Hardware Environment(Ascend/GPU/CPU): GPU
Software Environment:
– MindSpore version (source or binary): 1.2.0
– Python version (eg, Python 3.7.5): 3.7.5
– OS platform and distribution (eg, Linux Ubuntu 16.04): Ubuntu 4.15.0-74-generic
– GCC/Compiler version (if compiled from source):

1.2 Basic information

1.2.1 Source code

https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/nlp/lstm

1.2.2 Error reporting

Error message: ValueError: invalid literal for int() with base 10: 'the'

2 Reason analysis

The dataset was not preprocessed as the tutorial requires. The README states that a header line must be added to the word-vector file, declaring that it contains 400,000 words, each represented by a 300-dimensional word vector.

3 Solutions

Insert a line before the first line of the glove.6B.300d.txt file: 400000 300
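If editing the file by hand is inconvenient, a small script can prepend the header; the path below assumes the GloVe file sits in the current directory:

glove_path = "glove.6B.300d.txt"  # assumed location of the GloVe file

with open(glove_path, "r", encoding="utf-8") as f:
    body = f.read()

# Prepend the "<vocab_size> <embedding_dim>" header the loader expects.
with open(glove_path, "w", encoding="utf-8") as f:
    f.write("400000 300\n" + body)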

4 Summary

When running a model from the official tutorials, follow the steps in the README closely; skipping a data-preprocessing step can lead to errors like this one.

[Solved] MindSpore Error: `half_pixel_centers`=True only support in Ascend

1 Error description

1.1 System Environment

Hardware Environment(Ascend/GPU/CPU): CPU
Software Environment:
– MindSpore version (source or binary): 1.8.0
– Python version (e.g., Python 3.7.5): 3.7.6
– OS platform and distribution (e.g., Linux Ubuntu 16.04): Ubuntu 4.15.0-74-generic
– GCC/Compiler version (if compiled from source):

1.2 Basic information

1.2.1 Script

The script calls the ResizeBilinear operator to resize the input Tensor to a specified size using bilinear interpolation. The script is as follows:

import mindspore
from mindspore import context, ops, Tensor

context.set_context(device_target='CPU')
x = Tensor([[[[1, 2, 3, 4, 5], [1, 2, 3, 4, 5]]]], mindspore.float32)
resize_bilinear = ops.ResizeBilinear((5, 5), half_pixel_centers=True)
output = resize_bilinear(x)
print(output)

1.2.2 Error reporting

The error message here is as follows:

Traceback (most recent call last):
  File "C:/Users/l30026544/PycharmProjects/q2_map/new/ResizeBilinear.py", line 7, in <module>
    resize_bilinear = ops.ResizeBilinear((5, 5), half_pixel_centers=True)
  File "C:\Users\l30026544\PycharmProjects\q2_map\lib\site-packages\mindspore\ops\primitive.py", line 687, in deco
    fn(self, *args, **kwargs)
  File "C:\Users\l30026544\PycharmProjects\q2_map\lib\site-packages\mindspore\ops\operations\nn_ops.py", line 3263, in __init__
    raise ValueError(f"Currently `half_pixel_centers`=True only support in Ascend device_target, "
ValueError: Currently `half_pixel_centers`=True only support in Ascend device_target, but got CPU

Cause Analysis

Looking at the error message, the ValueError says Currently `half_pixel_centers`=True only support in Ascend device_target, but got CPU, which means the half_pixel_centers attribute may only be set to True in an Ascend environment. The official API documentation notes the same restriction.
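If no Ascend device is available, one possible workaround (my suggestion, not from the original post) is to stay on CPU and leave half_pixel_centers at its default of False, accepting the slightly different pixel-alignment behavior:

import mindspore
from mindspore import context, ops, Tensor

context.set_context(device_target='CPU')
x = Tensor([[[[1, 2, 3, 4, 5], [1, 2, 3, 4, 5]]]], mindspore.float32)
# half_pixel_centers defaults to False, which is supported on CPU.
resize_bilinear = ops.ResizeBilinear((5, 5))
output = resize_bilinear(x)
print(output)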

2 Solutions

For the reasons known above, it is easy to make the following modifications:

import mindspore
from mindspore import context, ops, Tensor

context.set_context(device_target='Ascend')
x = Tensor([[[[1, 2, 3, 4, 5], [1, 2, 3, 4, 5]]]], mindspore.float32)
resize_bilinear = ops.ResizeBilinear((5, 5), half_pixel_centers=True)
output = resize_bilinear(x)
print(output)

At this point, the execution is successful, and the output is as follows:

[[[[1. 2. 3. 4. 5.]
[1. 2. 3. 4. 5.]
[1. 2. 3. 4. 5.]
[1. 2. 3. 4. 5.]
[1. 2. 3. 4. 5.]]]]

3 Summary

Steps to locate the error report:

1. Find the line of user code that reports the error: resize_bilinear = ops.ResizeBilinear((5, 5), half_pixel_centers=True);

2. According to the keywords in the log error message, narrow down the scope of the analysis: Currently `half_pixel_centers`=True only support in Ascend device_target, but got CPU.

[Solved] MindSpore Error: StridedSlice operator does not support input of uint8 data type on CPU hardware

1 Error description

1.1 System Environment

Hardware Environment(Ascend/GPU/CPU): CPU
Software Environment:
– MindSpore version (source or binary): 1.8.0
– Python version (eg, Python 3.7.5): 3.7.6
– OS platform and distribution (eg, Linux Ubuntu 16.04): Ubuntu 4.15.0-74-generic
– GCC/Compiler version (if compiled from source):

1.2 Basic information

1.2.1 Script

The training script constructs a single-operator StridedSlice network that extracts a slice from the input Tensor according to the given begin/end indices and strides. The script is as follows:

import numpy as np
import mindspore
import mindspore.nn as nn
from mindspore import ops, Tensor

class Net(nn.Cell):
    def __init__(self):
        super(Net, self).__init__()
        self.strided_slice = ops.StridedSlice()

    def construct(self, x, begin, end, strides):
        out = self.strided_slice(x, begin, end, strides)
        return out

input_x = Tensor([[[1, 1, 1], [2, 2, 2]], [[3, 3, 3], [4, 4, 4]],
                  [[5, 5, 5], [6, 6, 6]]], mindspore.uint8)
out = Net()(input_x, (1, 0, 2), (3, 1, 3), (1, 1, 1))
print(out)

1.2.2 Error reporting

The error message here is as follows:

Traceback (most recent call last):
  File "160945-strided_slice.py", line 19, in <module>
    out = Net()(input_x,  (1, 0, 2), (3, 1, 3), (1, 1, 1))
  File "/root/miniconda3/envs/high_llj/lib/python3.7/site-packages/mindspore/nn/cell.py", line 574, in __call__
    out = self.compile_and_run(*args)
  File "/root/miniconda3/envs/high_llj/lib/python3.7/site-packages/mindspore/nn/cell.py", line 975, in compile_and_run
    self.compile(*inputs)
  File "/root/miniconda3/envs/high_llj/lib/python3.7/site-packages/mindspore/nn/cell.py", line 948, in compile
    jit_config_dict=self._jit_config_dict)
  File "/root/miniconda3/envs/high_llj/lib/python3.7/site-packages/mindspore/common/api.py", line 1092, in compile
    result = self._graph_executor.compile(obj, args_list, phase, self._use_vm_mode())
TypeError: Operator[StridedSlice]  input(UInt8) output(UInt8) is not supported. This error means the current input type is not supported, please refer to the MindSpore doc for supported types.


Cause Analysis

Looking at the error message, the TypeError says Operator[StridedSlice] input(UInt8) output(UInt8) is not supported. This means the current input type is not supported; refer to the MindSpore documentation for the supported types. Concretely, on CPU the StridedSlice operator does not currently support uint8 inputs and outputs. One solution is to run on the Ascend or GPU platform instead.
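Alternatively (my suggestion, not from the original post), if the data does not have to remain uint8, casting to a supported dtype lets the script keep running on CPU:

import numpy as np
import mindspore
from mindspore import ops, Tensor

input_x = Tensor([[[1, 1, 1], [2, 2, 2]], [[3, 3, 3], [4, 4, 4]],
                  [[5, 5, 5], [6, 6, 6]]], mindspore.uint8)
# Cast to float32 (supported by StridedSlice on CPU) before slicing.
x_f32 = ops.Cast()(input_x, mindspore.float32)
out = ops.StridedSlice()(x_f32, (1, 0, 2), (3, 1, 3), (1, 1, 1))
print(out)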

2 Solutions

For the reasons known above, it is easy to make the following modifications:

import numpy as np
import mindspore
import mindspore.nn as nn
from mindspore import context, ops, Tensor

context.set_context(device_target='Ascend')

class Net(nn.Cell):
    def __init__(self):
        super(Net, self).__init__()
        self.strided_slice = ops.StridedSlice()

    def construct(self, x, begin, end, strides):
        out = self.strided_slice(x, begin, end, strides)
        return out

input_x = Tensor([[[1, 1, 1], [2, 2, 2]], [[3, 3, 3], [4, 4, 4]],
                  [[5, 5, 5], [6, 6, 6]]], mindspore.uint8)
out = Net()(input_x, (1, 0, 2), (3, 1, 3), (1, 1, 1))
print(out)

At this point, the execution is successful, and the output is as follows:

[[[3]]

 [[5]]]

 

3 Summary

Steps to locate the error report:

1. Find the line of user code that reported the error: out = Net()(input_x, (1, 0, 2), (3, 1, 3), (1, 1, 1));

2. According to the keywords in the log error message, narrow the scope of the analysis: Operator[StridedSlice] input(UInt8) output(UInt8) is not supported;

3. It is necessary to focus on the correctness of variable definition and initialization.

[Solved] MindSpore infer error when passing in a sens value for differentiation: For 'MatMul', the input dimensions must be equal

1 Error description

1.1 System Environment

Hardware Environment(Ascend/GPU/CPU): GPU
Software Environment:

  • MindSpore version (source or binary): 1.7.0
  • Python version (e.g., Python 3.7.5): 3.7.5
  • OS platform and distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04.4 LTS
  • GCC/Compiler version (if compiled from source): 7.5.0

1.2 Basic information

1.2.1 Source code

import numpy as np
import mindspore.nn as nn
import mindspore.ops as ops
from mindspore import Tensor
from mindspore import ParameterTuple, Parameter
from mindspore import dtype as mstype

x = Tensor([[0.8, 0.6, 0.2], [1.8, 1.3, 1.1]], dtype=mstype.float32)
y = Tensor([[0.11, 3.3, 1.1], [1.1, 0.2, 1.4], [1.1, 2.2, 0.3]], dtype=mstype.float32)

class Net(nn.Cell):
    def __init__(self):
        super(Net, self).__init__()
        self.matmul = ops.MatMul()
        self.z = Parameter(Tensor(np.array([1.0], np.float32)), name='z')

    def construct(self, x, y):
        x = x * self.z
        out = self.matmul(x, y)
        return out


class GradNetWrtN(nn.Cell):
    def __init__(self, net):
        super(GradNetWrtN, self).__init__()
        self.net = net
        self.grad_op = ops.GradOperation(sens_param=True)
        self.grad_wrt_output = Tensor([[0.1, 0.6, 0.2]], dtype=mstype.float32)

    def construct(self, x, y):
        gradient_function = self.grad_op(self.net)
        return gradient_function(x, y, self.grad_wrt_output)


output = GradNetWrtN(Net())(x, y)
print(output)

1.2.2 Error reporting

Error message: ValueError: For 'MatMul', the input dimensions must be equal, but got 'x1_col': 2 and 'x2_row': 1. And 'x' shape [2, 3], 'y' shape [1, 3].

2 Reason analysis

1. According to the error message, the MatMul operator fails its shape check during infer shape: the number of columns of x1 does not equal the number of rows of x2.

2. Open the debug file mentioned in the error report, /root/gitee/mindspore/rank_0/om/analyze_fail.dat. Following the analyze_fail.dat analysis guide, the failing node is a MatMul whose infer shape raises the error: its first input has shape (2, 3) and its second input has shape (1, 3), consistent with the error message (note that the transpose_a attribute of this MatMul is True). The file also shows that this MatMul is called from line 253 of grad_math_ops.py, i.e., it is generated by the back-propagation rule of the MatMul operator. Of its two inputs, x and dout, x is confirmed to have the correct shape, so the shape of dout must be wrong.

3. From the mechanism of reverse-mode automatic differentiation, the first operator of the backward part is generated from the back-propagation rule of the last operator of the forward part. The forward network contains only one MatMul, and it is the last operator, so the backward MatMul that fails infer shape is generated from this forward MatMul and is the first operator of the backward graph. (This use case is simple; in general, combine an operator's inputs and outputs to infer which forward operator's back-propagation rule produced it.) Its dout input can therefore only come from outside the network, i.e., the self.grad_wrt_output passed in the use case. In other words, the shape of self.grad_wrt_output is wrong.

3 Solutions

The sens value passed to GradOperation is the gradient of the forward network's output, supplied from outside the script; it can be used to scale gradient values. Because it represents the gradient of the forward output, the shape of sens must match the output shape of the forward network (which can be obtained by calling the forward network and printing the shape of its output). Change the value of self.grad_wrt_output in the use case as follows:

self.grad_wrt_output = Tensor([[0.1, 0.6, 0.2], [0.8, 1.3, 1.1]], dtype=mstype.float32)

With this change, the problem is solved.
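As a quick way to determine the required sens shape (a sketch reusing the Net, x and y defined in section 1.2.1):

# The forward output shape is exactly the shape that sens must have.
net = Net()
print(net(x, y).shape)  # (2, 3) for this use case, so sens must be (2, 3)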

[Solved] MindSpore Error: For primitive[TensorSummary], the v rank must be greater than or equal to 1, but got 0

1 Error description

1.1 System Environment

Hardware Environment(Ascend/GPU/CPU): Ascend
Software Environment:
– MindSpore version (source or binary): 1.8.0
– Python version (eg, Python 3.7.5): 3.7.6
– OS platform and distribution (eg, Linux Ubuntu 16.04): Ubuntu 4.15.0-74-generic
– GCC/Compiler version (if compiled from source):

1.2 Basic information

1.2.1 Script

The training script constructs a simple network that performs Add on the two input tensors and then calls TensorSummary. The script is as follows:

01 class SummaryNet(nn.Cell):
02     def __init__(self,):
03         super(SummaryNet, self).__init__()
04         self.summary = ops.TensorSummary()
05         self.add = ops.Add()
06 
07     def construct(self, x, y):
08         x = self.add(x, y)
09         name = "x"
10         self.summary(name, x.sum())
11         return x
12         
13 x = Tensor(np.array([1, 2, 3]).astype(np.float32))
14 y = Tensor(np.array([4, 5, 6]).astype(np.float32))
15 summary_net = SummaryNet()(x, y)
16 print("out: ", summary_net)

1.2.2 Error reporting

The error message here is as follows:

Traceback (most recent call last):
  File "C:/Users/l30026544/PycharmProjects/q2_map/new/173735.py", line 22, in <module>
    summary_net = SummaryNet()(x, y)
  File "C:\Users\l30026544\PycharmProjects\q2_map\lib\site-packages\mindspore\nn\cell.py", line 586, in __call__
    out = self.compile_and_run(*args)
  File "C:\Users\l30026544\PycharmProjects\q2_map\lib\site-packages\mindspore\nn\cell.py", line 964, in compile_and_run
    self.compile(*inputs)
  File "C:\Users\l30026544\PycharmProjects\q2_map\lib\site-packages\mindspore\nn\cell.py", line 937, in compile
    _cell_graph_executor.compile(self, *inputs, phase=self.phase, auto_parallel_mode=self._auto_parallel_mode)
  File "C:\Users\l30026544\PycharmProjects\q2_map\lib\site-packages\mindspore\common\api.py", line 1006, in compile
    result = self._graph_executor.compile(obj, args_list, phase, self._use_vm_mode())
ValueError: mindspore\core\utils\check_convert_utils.cc:397 CheckInteger] For primitive[TensorSummary], the v rank must be greater than or equal to 1, but got 0.
WARNING: Logging before InitGoogleLogging() is written to STDERR
[CRITICAL] CORE(6472,1,?):2022-6-17 15:47:53 [mindspore\core\utils\check_convert_utils.cc:397] CheckInteger] For primitive[TensorSummary], the v rank must be greater than or equal to 1, but got 0.

Cause Analysis

Looking at the error message, the ValueError says For primitive[TensorSummary], the v rank must be greater than or equal to 1, but got 0, which means that for TensorSummary the rank of the parameter v must be at least 1, but a rank-0 value was passed. So check whether the rank of the value passed into TensorSummary meets the requirement. Line 10 of the script passes x.sum(), whose result is a scalar (rank 0), hence the error. The official documentation states this input restriction for TensorSummary: the rank of the input Tensor must be greater than or equal to 1. To collect scalar data, use the ScalarSummary operator instead.
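Alternatively (my variant, not from the original post), if the goal is to record the whole tensor rather than its scalar sum, TensorSummary can be kept by passing x itself, which has rank 1:

import numpy as np
import mindspore.nn as nn
from mindspore import ops, Tensor

class SummaryNet(nn.Cell):
    def __init__(self):
        super(SummaryNet, self).__init__()
        self.summary = ops.TensorSummary()
        self.add = ops.Add()

    def construct(self, x, y):
        x = self.add(x, y)
        self.summary("x", x)  # x has rank 1, which TensorSummary accepts
        return x

x = Tensor(np.array([1, 2, 3]).astype(np.float32))
y = Tensor(np.array([4, 5, 6]).astype(np.float32))
print("out: ", SummaryNet()(x, y))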

2 Solutions

For the reasons known above, it is easy to make the following modifications:

01 class SummaryNet(nn.Cell):
02     def __init__(self,):
03         super(SummaryNet, self).__init__()
04         self.summary = ops.ScalarSummary()
05         self.add = ops.Add()
06 
07     def construct(self, x, y):
08         x = self.add(x, y)
09         name = "x"
10         self.summary(name, x.sum())
11         return x
12         
13 x = Tensor(np.array([1, 2, 3]).astype(np.float32))
14 y = Tensor(np.array([4, 5, 6]).astype(np.float32))
15 summary_net = SummaryNet()(x, y)
16 print("out: ", summary_net)

At this point, the execution is successful, and the output is as follows:

out: [5. 7. 9.]

3 Summary

Steps to locate the error report:

1. Find the line of user code that reports the error: summary_net = SummaryNet()(x, y);

2. According to the keywords in the log error message, narrow the scope of the analysis: For primitive[TensorSummary], the v rank must be greater than or equal to 1, but got 0;

3. It is necessary to focus on the correctness of variable definition and initialization.

[Solved] MindSpore Error: ValueError: For ‘AvgPool’ every dimension of the output shape must be greater than zero

1 Error description

1.1 System Environment

Hardware Environment(Ascend/GPU/CPU): Ascend
Software Environment:
– MindSpore version (source or binary): 1.8.0
– Python version (eg, Python 3.7.5): 3.7.6
– OS platform and distribution (eg, Linux Ubuntu 16.04): Ubuntu 4.15.0-74-generic
– GCC/Compiler version (if compiled from source):

1.2 Basic information

1.2.1 Script

The training script constructs a single-operator AvgPool network that performs two-dimensional average pooling on the input data. The script is as follows:


import numpy as np
import mindspore
import mindspore.nn as nn
from mindspore import ops, Tensor

class Net(nn.Cell):
    def __init__(self):
        super(Net, self).__init__()
        self.avgpool_op = ops.AvgPool(pad_mode="VALID", kernel_size=32, strides=1)

    def construct(self, x):
        result = self.avgpool_op(x)
        return result

x = Tensor(np.arange(128 * 20 * 32 * 65).reshape(65, 32, 20, 128), mindspore.float32)
net = Net()
output = net(x)
print(output)

1.2.2 Error reporting

The error message here is as follows:


Traceback (most recent call last):
  File "avgpool.py", line 17, in <module>
    output = net(x)
  File "/root/archiconda3/envs/lilinjie_high/lib/python3.7/site-packages/mindspore/nn/cell.py", line 573, in __call__
    out = self.compile_and_run(*args)
  File "/root/archiconda3/envs/lilinjie_high/lib/python3.7/site-packages/mindspore/nn/cell.py", line 956, in compile_and_run
    self.compile(*inputs)
  File "/root/archiconda3/envs/lilinjie_high/lib/python3.7/site-packages/mindspore/nn/cell.py", line 929, in compile
    _cell_graph_executor.compile(self, *inputs, phase=self.phase, auto_parallel_mode=self._auto_parallel_mode)
  File "/root/archiconda3/envs/lilinjie_high/lib/python3.7/site-packages/mindspore/common/api.py", line 1063, in compile
    result = self._graph_executor.compile(obj, args_list, phase, self._use_vm_mode())
  File "/root/archiconda3/envs/lilinjie_high/lib/python3.7/site-packages/mindspore/ops/primitive.py", line 575, in __infer__
    out[track] = fn(*(x[track] for x in args))
  File "/root/archiconda3/envs/lilinjie_high/lib/python3.7/site-packages/mindspore/ops/operations/nn_ops.py", line 1572, in infer_shape
    raise ValueError(f"For '{self.name}', the each element of the output shape must be larger than 0, "
ValueError: For 'AvgPool', the each element of the output shape must be larger than 0, but got output shape: [65, 32, -3, 105]. The input shape: [65, 32, 20, 128], kernel size: (1, 1, 24, 24), strides: (1, 1, 1, 1).Please check the official api documents for more information about the output.

 

Cause Analysis

Looking at the error message, the computed output shape [65, 32, -3, 105] contains a negative dimension. With pad_mode="VALID", each spatial output dimension is floor((input_dim - kernel_size) / stride) + 1, so the kernel must fit inside the corresponding input dimension. Here the input height is only 20, which is smaller than the kernel, so the output height becomes negative and the shape check fails.

2 Solutions

For the reasons known above, it is easy to make the following modifications: choose a kernel size no larger than the smallest spatial dimension of the input (20 here):


import numpy as np
import mindspore
import mindspore.nn as nn
from mindspore import ops, Tensor

class Net(nn.Cell):
    def __init__(self):
        super(Net, self).__init__()
        self.avgpool_op = ops.AvgPool(pad_mode="VALID", kernel_size=15, strides=1)

    def construct(self, x):
        result = self.avgpool_op(x)
        return result

x = Tensor(np.arange(128 * 20 * 32 * 65).reshape(65, 32, 20, 128), mindspore.float32)
net = Net()
output = net(x)
print(output.shape)

At this point, the execution is successful, and the printed output shape is as follows:

(65, 32, 6, 114)
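As a quick sanity check (a sketch, not part of the original post) of the VALID-mode output formula against this result:

# Output size per spatial dim for pad_mode="VALID":
def valid_out(dim, kernel, stride=1):
    return (dim - kernel) // stride + 1

print(valid_out(20, 15), valid_out(128, 15))  # 6 114 -> shape (65, 32, 6, 114)
print(valid_out(20, 32))                      # negative, which triggers the error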

3 Summary

Steps to locate the error report:

1. Find the line of user code that reports the error: output = net(x) ;

2. According to the keywords in the log error message, narrow the scope of the analysis: the each element of the output shape must be larger than 0;

3. It is necessary to focus on the correctness of variable definition and initialization.

[Solved] MindSpore Error: task_fail_info or current_graph_ is nullptr

1 Error description

1.1 System Environment

Hardware Environment(Ascend/GPU/CPU): Ascend
Software Environment:
– MindSpore version (source or binary): 1.8.0
– Python version (eg, Python 3.7.5): 3.7.6
– OS platform and distribution (eg, Linux Ubuntu 16.04): Ubuntu 4.15.0-74-generic
– GCC/Compiler version (if compiled from source):

1.2 Basic information

1.2.1 Script

The training script performs greedy (best-path) decoding on the given input logits by constructing a single-operator CTCGreedyDecoder network. The script is as follows:

 01 class Net(nn.Cell):
 02     def __init__(self):
 03         super(Net, self).__init__()
 04         self.ctc_greedyDecoder = ops.CTCGreedyDecoder()
 05 
 06     def construct(self, input_x, sequence_length):
 07         return self.ctc_greedyDecoder(input_x, sequence_length)
 08 net = Net()
 09 
 10 
 11 inputs = Tensor(np.array([[[0.6, 0.4, 0.2], [0.8, 0.6, 0.3]],
 12                           [[0.0, 0.6, 0.0], [0.5, 0.4, 0.5]]]), mindspore.float32)
 13 sequence_length = Tensor(np.array([4, 2]), mindspore.int32)
 14 
 15 decoded_indices, decoded_values, decoded_shape, log_probability = net(inputs, sequence_length)
 16 print(decoded_indices, decoded_values, decoded_shape, log_probability)

1.2.2 Error reporting

The error message here is as follows:

[ERROR] DEVICE(172230,fffeae7fc160,python):2022-06-28-07:02:12.636.101 [mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_kernel_runtime.cc:603] TaskFailCallback] Execute TaskFailCallback failed. task_fail_info or current_graph_ is nullptr
Traceback (most recent call last):
  File "CTCGreedyDecoder.py", line 26, in <module>
    decoded_indices, decoded_values, decoded_shape, log_probability = net(inputs, sequence_length)
  File "/root/archiconda3/envs/lilinjie_high/lib/python3.7/site-packages/mindspore/nn/cell.py", line 573, in __call__
    out = self.compile_and_run(*args)
  File "/root/archiconda3/envs/lilinjie_high/lib/python3.7/site-packages/mindspore/nn/cell.py", line 979, in compile_and_run
    return _cell_graph_executor(self, *new_inputs, phase=self.phase)
  File "/root/archiconda3/envs/lilinjie_high/lib/python3.7/site-packages/mindspore/common/api.py", line 1128, in __call__
    return self.run(obj, *args, phase=phase)
  File "/root/archiconda3/envs/lilinjie_high/lib/python3.7/site-packages/mindspore/common/api.py", line 1165, in run
    return self._exec_pip(obj, *args, phase=phase_real)
  File "/root/archiconda3/envs/lilinjie_high/lib/python3.7/site-packages/mindspore/common/api.py", line 94, in wrapper
    results = fn(*arg, **kwargs)
  File "/root/archiconda3/envs/lilinjie_high/lib/python3.7/site-packages/mindspore/common/api.py", line 1147, in _exec_pip
    return self._graph_executor(args, phase)
RuntimeError: Call runtime rtStreamSynchronize failed. Op name: Default/CTCGreedyDecoder-op2

Cause Analysis

Looking at the error log, it says Execute TaskFailCallback failed. task_fail_info or current_graph_ is nullptr. This message does not state directly where the problem is, but its keywords support a guess-and-verify approach: a nullptr often hints at an out-of-bounds access. Carefully checking the parameter descriptions on the official website shows that every element of sequence_length must be less than or equal to the max_time dimension of input_x (2 here).

Combined with line 13 of the script, sequence_length is [4, 2] and 4 > 2, so this condition is violated and the error is reported.
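As a defensive measure (my addition, reusing the inputs and sequence_length variables from the script above), the constraint can be checked before calling the operator:

# inputs has shape (max_time, batch_size, num_classes);
# every entry of sequence_length must be <= max_time.
max_time = inputs.shape[0]
assert all(int(v) <= max_time for v in sequence_length.asnumpy()), \
    "sequence_length entries must not exceed max_time"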

2 Solutions

For the reasons known above, it is easy to make the following modifications:

 01 class Net(nn.Cell):
 02     def __init__(self):
 03         super(Net, self).__init__()
 04         self.ctc_greedyDecoder = ops.CTCGreedyDecoder()
 05 
 06     def construct(self, input_x, sequence_length):
 07         return self.ctc_greedyDecoder(input_x, sequence_length)
 08 net = Net()
 09 
 10 
 11 inputs = Tensor(np.array([[[0.6, 0.4, 0.2], [0.8, 0.6, 0.3]],
 12                           [[0.0, 0.6, 0.0], [0.5, 0.4, 0.5]]]), mindspore.float32)
 13 sequence_length = Tensor(np.array([2, 2]), mindspore.int32)
 14 
 15 decoded_indices, decoded_values, decoded_shape, log_probability = net(inputs, sequence_length)
 16 print(decoded_indices, decoded_values, decoded_shape, log_probability)

At this point, the execution is successful, and the output is as follows:

[[0 0]
 [0 1]
 [1 0]] [0 1 0] [2 2] [[-1.2]
 [-1.3]]

3 Summary

Steps to locate the error report:

1. Find the line of user code that reports the error: 15 decoded_indices, decoded_values, decoded_shape, log_probability = net(inputs, sequence_length);

2. According to the keywords in the log error message, narrow down the scope of the analysis: Execute TaskFailCallback failed. task_fail_info or current_graph_ is nullptr.

[Solved] MindSpore Error: TypeError: For ‘TopK’, the type of ‘x’ should be…

1 Error description

1.1 System Environment

Hardware Environment(Ascend/GPU/CPU): CPU
Software Environment:
– MindSpore version (source or binary): 1.8.0
– Python version (e.g., Python 3.7.5): 3.7.6
– OS platform and distribution (e.g., Linux Ubuntu 16.04): Ubuntu 4.15.0-74-generic
– GCC/Compiler version (if compiled from source):

1.2 Basic information

1.2.1 Script

The training script selects the k largest entries along the last dimension of the input by constructing a single-operator TopK network. The script is as follows:

 01 class Net(nn.Cell):
 02     def __init__(self):
 03         super(Net, self).__init__()
 04         self.topk = ops.TopK(sorted=False)
 05 
 06     def construct(self, x, k):
 07         output = self.topk(x, k)
 08         return output
 09
 10 net = Net()
 11 x = Tensor(([[5, 2, 3, 3, 5], [5, 2, 9, 3, 5]]), mindspore.double)
 12 k = 5
 13 values, indices = net(x, k)
 14 print(values, indices)

1.2.2 Error reporting

The error message here is as follows:

Traceback (most recent call last):
  File "C:/Users/l30026544/PycharmProjects/q2_map/new/I4H30H.py", line 21, in <module>
    values, indices = net(x, k)
  File "C:\Users\l30026544\PycharmProjects\q2_map\lib\site-packages\mindspore\nn\cell.py", line 586, in __call__
    out = self.compile_and_run(*args)
  File "C:\Users\l30026544\PycharmProjects\q2_map\lib\site-packages\mindspore\nn\cell.py", line 964, in compile_and_run
    self.compile(*inputs)
  File "C:\Users\l30026544\PycharmProjects\q2_map\lib\site-packages\mindspore\nn\cell.py", line 937, in compile
    _cell_graph_executor.compile(self, *inputs, phase=self.phase, auto_parallel_mode=self._auto_parallel_mode)
  File "C:\Users\l30026544\PycharmProjects\q2_map\lib\site-packages\mindspore\common\api.py", line 1006, in compile
    result = self._graph_executor.compile(obj, args_list, phase, self._use_vm_mode())
  File "C:\Users\l30026544\PycharmProjects\q2_map\lib\site-packages\mindspore\ops\operations\nn_ops.py", line 2178, in __infer__
    validator.check_tensor_dtype_valid('x', x_dtype, valid_dtypes, self.name)
  File "C:\Users\l30026544\PycharmProjects\q2_map\lib\site-packages\mindspore\_checkparam.py", line 541, in check_tensor_dtype_valid
    Validator.check_subclass(arg_name, arg_type, tensor_types, prim_name)
  File "C:\Users\l30026544\PycharmProjects\q2_map\lib\site-packages\mindspore\_checkparam.py", line 493, in check_subclass
    raise TypeError(f"For '{prim_name}', the type of '{arg_name}'"
TypeError: For 'TopK', the type of 'x' should be one of Tensor[Int32], Tensor[Float16], Tensor[Float32], but got Tensor[Float64] . The supported data types depend on the hardware that executes the operator, please refer the official api document to get more information about the data type.
WARNING: Logging before InitGoogleLogging() is written to STDERR
[WARNING] UTILS(11576,1,?):2022-6-25 8:31:24 [mindspore\ccsrc\utils\comm_manager.cc:78] GetInstance] CommManager instance for CPU not found, return default instance.
[ERROR] ANALYZER(11576,1,?):2022-6-25 8:31:24 [mindspore\ccsrc\pipeline\jit\static_analysis\async_eval_result.cc:66] HandleException] Exception happened, check the information as below.

The function call stack (See file 'C:\Users\l30026544\PycharmProjects\q2_map\new\rank_0\om/analyze_fail.dat' for more details):
# 0 In file C:/Users/l30026544/PycharmProjects/q2_map/new/I4H30H.py(15)
        output = self.topk(x, k)
                 ^

Cause Analysis

Looking at the error message, the TypeError says For 'TopK', the type of 'x' should be one of Tensor[Int32], Tensor[Float16], Tensor[Float32], but got Tensor[Float64], which means that for TopK the input type must be int32, float16, or float32, while the actual input is float64. Line 11 of the script, where x is defined, confirms that its data type is float64 (mindspore.double). The solution is to reduce the data precision to a supported type.

2 Solutions

For the reasons known above, it is easy to make the following modifications:

 01 class Net(nn.Cell):
 02     def __init__(self):
 03         super(Net, self).__init__()
 04         self.topk = ops.TopK(sorted=False)
 05 
 06     def construct(self, x, k):
 07         output = self.topk(x, k)
 08         return output
 09
 10 net = Net()
 11 x = Tensor(([[5, 2, 3, 3, 5], [5, 2, 9, 3, 5]]), mindspore.float32)
 12 k = 5
 13 values, indices = net(x, k)
 14 print(values, indices)

At this point, the execution is successful, and the output is as follows:

[[5. 2. 3. 3. 5.]
[5. 2. 9. 3. 5.]] [[0 1 2 3 4]
[0 1 2 3 4]]

3 Summary

Steps to locate the error report:

1. Find the line of user code that reports the error: output = self.topk(x, k) ;

2. According to the keywords in the log error message, narrow the scope of the analysis: For 'TopK', the type of 'x' should be one of Tensor[Int32], Tensor[Float16], Tensor[Float32];

3. It is necessary to focus on the correctness of variable definition and initialization.

[Solved] MindSpore Error: Select GPU kernel op * fail! Incompatible data type

1 Error description

1.1 System Environment

Hardware Environment(Ascend/GPU/CPU): GPU
Software Environment:
– MindSpore version (source or binary): 1.5.2
– Python version (e.g., Python 3.7.5): 3.7.6
– OS platform and distribution (e.g., Linux Ubuntu 16.04): Ubuntu 4.15.0-74-generic
– GCC/Compiler version (if compiled from source):

1.2 Basic information

1.2.1 Script

The training script normalizes the Tensor by constructing a BatchNorm single-operator network. The script is as follows:

 01 class Net(nn.Cell):
 02     def __init__(self):
 03         super(Net, self).__init__()
 04         self.batch_norm = ops.BatchNorm()
 05     def construct(self, input_x, scale, bias, mean, variance):
 06         output = self.batch_norm(input_x, scale, bias, mean, variance)
 07         return output
 08
 09 net = Net()
 10 input_x = Tensor(np.ones([2, 2]), mindspore.float16)
 11 scale = Tensor(np.ones([2]), mindspore.float16)
 12 bias = Tensor(np.ones([2]), mindspore.float16)
 13 mean = Tensor(np.ones([2]), mindspore.float16)
 14 variance = Tensor(np.ones([2]), mindspore.float16)
 15
 16 output = net(input_x, scale, bias, mean, variance)
 17 print(output)

1.2.2 Error reporting

The error message here is as follows:

Traceback (most recent call last):
  File "116945.py", line 22, in <module>
    output = net(input_x, scale, bias, mean, variance)
  File "/data2/llj/mindspores/r1.5/build/package/mindspore/nn/cell.py", line 407, in __call__
    out = self.compile_and_run(*inputs)
  File "/data2/llj/mindspores/r1.5/build/package/mindspore/nn/cell.py", line 734, in compile_and_run
    self.compile(*inputs)
  File "/data2/llj/mindspores/r1.5/build/package/mindspore/nn/cell.py", line 721, in compile
    _cell_graph_executor.compile(self, *inputs, phase=self.phase, auto_parallel_mode=self._auto_parallel_mode)
  File "/data2/llj/mindspores/r1.5/build/package/mindspore/common/api.py", line 551, in compile
    result = self._graph_executor.compile(obj, args_list, phase, use_vm, self.queue_name)
TypeError: mindspore/ccsrc/runtime/device/gpu/kernel_info_setter.cc:355 PrintUnsupportedTypeException] Select GPU kernel op[BatchNorm] fail! Incompatible data type!
The supported data types are in[float32 float32 float32 float32 float32], out[float32 float32 float32 float32 float32]; in[float16 float32 float32 float32 float32], out[float16 float32 float32 float32 float32]; , but get in [float16 float16 float16 float16 float16 ] out [float16 float16 float16 float16 float16 ]

Cause Analysis

Looking at the error message, the TypeError says Select GPU kernel op[BatchNorm] fail! Incompatible data type! and then lists the supported combinations: either all five inputs are float32, or input_x is float16 and the remaining four inputs are float32. Checking the script's inputs shows that all of them are float16, which matches neither supported combination, so the error is reported.

2 Solutions

For the reasons known above, it is easy to make the following modifications:

 01 class Net(nn.Cell):
 02     def __init__(self):
 03         super(Net, self).__init__()
 04         self.batch_norm = ops.BatchNorm()
 05     def construct(self,input_x, scale, bias, mean, variance):
 06         output = self.batch_norm(input_x, scale, bias, mean, variance)
 07         return output
 08 
 09 net = Net()
 10 input_x = Tensor(np.ones([2, 2]), mindspore.float16)
 11 scale = Tensor(np.ones([2]), mindspore.float32)
 12 bias = Tensor(np.ones([2]), mindspore.float32)
 13 mean = Tensor(np.ones([2]), mindspore.float32)
 14 variance = Tensor(np.ones([2]), mindspore.float32)
 15 
 16 output = net(input_x, scale, bias, mean, variance)
 17 print(output)

At this point, the execution is successful, and the output is as follows:

output: (Tensor(shape=[2, 2], dtype=Float16, value=
[[ 1.0000e+00,  1.0000e+00],
 [ 1.0000e+00,  1.0000e+00]]), Tensor(shape=[2], dtype=Float32, value= [ 0.00000000e+00,  0.00000000e+00]), Tensor(shape=[2], dtype=Float32, value= [ 0.00000000e+00,  0.00000000e+00]), Tensor(shape=[2], dtype=Float32, value= [ 0.00000000e+00,  0.00000000e+00]), Tensor(shape=[2], dtype=Float32, value= [ 0.00000000e+00,  0.00000000e+00]))

3 Summary

Steps to locate the error report:

1. Find the line of user code that reports the error: 16 output = net(input_x, scale, bias, mean, variance);

2. According to the keywords in the log error message, narrow the scope of the analysis problem: The supported data types are in[float32 float32 float32 float32 float32], out[float32 float32 float32 float32 float32]; in[float16 float32 float32 float32 float32], out [float16 float32 float32 float32 float32]; , but get in [float16 float16 float16 float16 float16 ] out [float16 float16 float16 float16 float16 ]

3. It is necessary to focus on the correctness of variable definition and initialization.

[Solved] MindSpore Error: ReduceMean in the Ascend environment does not support inputs of 8 or more dimensions

1 Error description

1.1 System Environment

Hardware Environment(Ascend/GPU/CPU): Ascend
Software Environment:
– MindSpore version (source or binary): 1.8.0
– Python version (eg, Python 3.7.5): 3.7.6
– OS platform and distribution (eg, Linux Ubuntu 16.04): Ubuntu 4.15.0-74-generic
– GCC/Compiler version (if compiled from source):

1.2 Basic information

1.2.1 Script

The training script averages over axis 1 (reducing that dimension) by constructing a single-operator ReduceMean network. The script is as follows:

import numpy as np
import mindspore
import mindspore.nn as nn
from mindspore import ops, Tensor

class Net(nn.Cell):
    def __init__(self, axis, keep_dims):
        super().__init__()
        self.reducemean = ops.ReduceMean(keep_dims=keep_dims)
        self.axis = axis

    def construct(self, input_x):
        return self.reducemean(input_x, self.axis)

net = Net(axis=(1,), keep_dims=True)
x = Tensor(np.random.randn(1, 2, 3, 4, 5, 6, 7, 8, 9), mindspore.float32)
out = net(x)
print("out shape: ", out.shape)

1.2.2 Error reporting

The error message here is as follows:

Traceback (most recent call last):
  File "test.py", line 18, in <module>
    out = net(x)
  File "/root/archiconda3/envs/lh37_ascend/lib/python3.7/site-packages/mindspore/nn/cell.py", line 574, in __call__
    out = self.compile_and_run(*args)
  File "/root/archiconda3/envs/lh37_ascend/lib/python3.7/site-packages/mindspore/nn/cell.py", line 975, in compile_and_run
    self.compile(*inputs)
  File "/root/archiconda3/envs/lh37_ascend/lib/python3.7/site-packages/mindspore/nn/cell.py", line 948, in compile
    jit_config_dict=self._jit_config_dict)
  File "/root/archiconda3/envs/lh37_ascend/lib/python3.7/site-packages/mindspore/common/api.py", line 1092, in compile
    result = self._graph_executor.compile(obj, args_list, phase, self._use_vm_mode())
RuntimeError: Single op compile failed, op: reduce_mean_d_1629966128061146056_6
 except_msg: 2022-07-15 01:36:29.720449: Query except_msg:Traceback (most recent call last):
  File "/root/archiconda3/envs/lh37_ascend/lib/python3.7/site-packages/te_fusion/parallel_compilation.py", line 1469, in run
    relation_param=self._relation_param)
  File "/root/archiconda3/envs/lh37_ascend/lib/python3.7/site-packages/te_fusion/fusion_manager.py", line 1283, in build_single_op
    compile_info = call_op()
  File "/root/archiconda3/envs/lh37_ascend/lib/python3.7/site-packages/te_fusion/fusion_manager.py", line 1270, in call_op
    opfunc(*inputs, *outputs, *new_attrs, **kwargs)
  File "/root/archiconda3/envs/lh37_ascend/lib/python3.7/site-packages/tbe/common/utils/para_check.py", line 537, in _in_wrapper
    formal_parameter_list[i][1], op_name)
  File "/root/archiconda3/envs/lh37_ascend/lib/python3.7/site-packages/tbe/common/utils/para_check.py", line 516, in _check_one_op_param
    _check_input(op_param, param_name, param_type, op_name)
  File "/root/archiconda3/envs/lh37_ascend/lib/python3.7/site-packages/tbe/common/utils/para_check.py", line 299, in _check_input
    _check_input_output_dict(op_param, param_name, op_name)
  File "/root/archiconda3/envs/lh37_ascend/lib/python3.7/site-packages/tbe/common/utils/para_check.py", line 223, in _check_input_output_dict
    param_name=param_name)
  File "/root/archiconda3/envs/lh37_ascend/lib/python3.7/site-packages/tbe/common/utils/para_check.py", line 689, in check_shape
    _check_shape_range(max_rank, min_rank, param_name, shape)
  File "/root/archiconda3/envs/lh37_ascend/lib/python3.7/site-packages/tbe/common/utils/para_check.py", line 727, in _check_shape_range
    % (error_info['param_name'], min_rank, max_rank, len(shape)))
RuntimeError: ({'errCode': 'E80012', 'op_name': 'reduce_mean_d', 'param_name': 'input_x', 'min_value': 0, 'max_value': 8, 'real_value': 9}, 'In op, the num of dimensions of input/output[input_x] should be inthe range of [0, 8], but actually is [9].')

Cause Analysis

Looking at the error message, the RuntimeError says 'In op, the num of dimensions of input/output[input_x] should be in the range of [0, 8], but actually is [9].', which means the input of ReduceMean must have between 0 and 8 dimensions, but the actual input has 9, exceeding what the operator supports in the Ascend environment. The official documentation notes the same input-dimension limit (it is described for ReduceSum as well).
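If the 9-D input itself cannot be changed, a possible workaround (my sketch, not from the original post) is to temporarily fold two adjacent axes that are not being reduced, so the operator sees at most 8 dimensions, then restore the shape afterwards:

import numpy as np
import mindspore
from mindspore import ops, Tensor

reshape = ops.Reshape()
reducemean = ops.ReduceMean(keep_dims=True)

x = Tensor(np.random.randn(1, 2, 3, 4, 5, 6, 7, 8, 9), mindspore.float32)
x8 = reshape(x, (1, 2, 3, 4, 5, 6, 7, 72))        # fold the last two axes (8*9=72)
out = reducemean(x8, (1,))                        # reduce axis 1 as before
out = reshape(out, (1, 1, 3, 4, 5, 6, 7, 8, 9))   # unfold back to 9-D
print("out shape: ", out.shape)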

2 Solutions

For the reasons known above, it is easy to make the following modifications:

import numpy as np
import mindspore
import mindspore.nn as nn
from mindspore import ops, Tensor

class Net(nn.Cell):
    def __init__(self, axis, keep_dims):
        super().__init__()
        self.reducemean = ops.ReduceMean(keep_dims=keep_dims)
        self.axis = axis

    def construct(self, input_x):
        return self.reducemean(input_x, self.axis)

net = Net(axis=(1,), keep_dims=True)
x = Tensor(np.random.randn(2, 3, 4, 5, 6, 7, 8, 9), mindspore.float32)
out = net(x)
print("out shape: ", out.shape)

At this point, the execution is successful, and the output is as follows:

out shape: (2, 1, 4, 5, 6, 7, 8, 9)

3 Summary

Steps to locate the error report:
1. Find the user code line that reports the error: out = net(x);
2. According to the keywords in the log error message, narrow the scope of the analysis: should be in the range of [0, 8], but actually is [9];
3. Focus on the correctness of variable definition and initialization.