
[Solved] mindinsight modelart Error: RuntimeError: An attempt has been made to start a new process before…

 

Question:

MindInsight reports an error when used on ModelArts.

After adding the summary collection, training runs normally for a few epochs and then fails with the RuntimeError shown in the title.

Solution:

When using SummaryCollector, the training code needs to be placed inside an if __name__ == '__main__': guard. The error itself is Python's standard multiprocessing complaint: a new process is started before the current process has finished its bootstrapping phase, which happens when process-spawning code runs at module import time instead of under the main guard.

The official MindSpore tutorial has been updated; you can follow the writing style of the latest tutorial: Collect Summary Data – MindSpore master documentation.

The code should look like this:

from mindspore.train.callback import SummaryCollector

def train():
    summary_collector = SummaryCollector(summary_dir='./summary_dir')

    ...

    # pass the collector through the callbacks argument of Model.train
    model.train(..., callbacks=[summary_collector])

if __name__ == '__main__':
    train()

 

[Solved] mmdetection Error: ImportError: /home/user/repos/mmdetection/mmdet/ops/dcn/deform_conv_cuda.cpython-37m-x

Environment: torch 1.11.0 + CUDA 11.3 (the latest versions at the time)

Running inference with mmdetection:

from mmdet.apis import init_detector, inference_detector

Errors are reported as follows:

ImportError: /home/user/repos/mmdetection/mmdet/ops/dcn/deform_conv_cuda.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _ZN6caffe26detail37_typeMetaDataInstance_preallocated_32E

 

The cause of the error is that the PyTorch version is too new: although OpenMMLab supports the latest releases, the compiled deform_conv_cuda extension still fails to load with this undefined-symbol error.

Solution:

Downgrade PyTorch to torch 1.6.0 + cu102 (check OpenMMLab's official GitHub for the matching version requirements), uninstall and reinstall mmcv 1.3.9, then rerun the mmdetection code; the error is resolved.
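A quick way to confirm the downgraded environment is actually the one in use (a minimal sketch; the expected values are simply the versions mentioned above):

import torch
import mmcv

print(torch.__version__)    # expect 1.6.0
print(torch.version.cuda)   # expect 10.2
print(mmcv.__version__)     # expect 1.3.9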

Vitis-AI Generate a Quantized Model: NotImplementedError

Vitis AI reports an error when generating a quantized model:

Traceback (most recent call last):
  File "generate_model.py", line 191, in <module>
    run_main()
  File "generate_model.py", line 185, in run_main
    quantize(args.build_dir,args.quant_mode,args.batchsize)
  File "generate_model.py", line 160, in quantize
    quantizer = torch_quantizer(quant_mode, new_model, (rand_in), output_dir=quant_model)
  File "/opt/vitis_ai/conda/envs/vitis-ai-pytorch/lib/python3.6/site-packages/pytorch_nndct/apis.py", line 77, in __init__
    custom_quant_ops = custom_quant_ops)
  File "/opt/vitis_ai/conda/envs/vitis-ai-pytorch/lib/python3.6/site-packages/pytorch_nndct/qproc/base.py", line 122, in __init__
    device=device)
  File "/opt/vitis_ai/conda/envs/vitis-ai-pytorch/lib/python3.6/site-packages/pytorch_nndct/qproc/utils.py", line 175, in prepare_quantizable_module
    graph = parse_module(module, input_args)
  File "/opt/vitis_ai/conda/envs/vitis-ai-pytorch/lib/python3.6/site-packages/pytorch_nndct/qproc/utils.py", line 78, in parse_module
    module, input_args)
  File "/opt/vitis_ai/conda/envs/vitis-ai-pytorch/lib/python3.6/site-packages/pytorch_nndct/parse/parser.py", line 68, in __call__
    raw_graph, raw_params = graph_handler.build_torch_graph(graph_name, module, input_args)
  File "/opt/vitis_ai/conda/envs/vitis-ai-pytorch/lib/python3.6/site-packages/pytorch_nndct/parse/trace_helper.py", line 37, in build_torch_graph
    fw_graph, params = self._trace_graph_from_model(input_args, train)
  File "/opt/vitis_ai/conda/envs/vitis-ai-pytorch/lib/python3.6/site-packages/pytorch_nndct/parse/trace_helper.py", line 61, in _trace_graph_from_model
    train)
  File "/opt/vitis_ai/conda/envs/vitis-ai-pytorch/lib/python3.6/site-packages/pytorch_nndct/utils/jit_utils.py", line 235, in trace_and_get_graph_from_model
    graph, torch_out = _get_trace_graph()(model, args)
  File "/opt/vitis_ai/conda/envs/vitis-ai-pytorch/lib/python3.6/site-packages/torch/jit/__init__.py", line 277, in _get_trace_graph
    outs = ONNXTracedModule(f, _force_outplace, return_inputs, _return_inputs_states)(*args, **kwargs)
  File "/opt/vitis_ai/conda/envs/vitis-ai-pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/vitis_ai/conda/envs/vitis-ai-pytorch/lib/python3.6/site-packages/torch/jit/__init__.py", line 360, in forward
    self._force_outplace,
  File "/opt/vitis_ai/conda/envs/vitis-ai-pytorch/lib/python3.6/site-packages/torch/jit/__init__.py", line 347, in wrapper
    outs.append(self.inner(*trace_inputs))
  File "/opt/vitis_ai/conda/envs/vitis-ai-pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 530, in __call__
    result = self._slow_forward(*input, **kwargs)
  File "/opt/vitis_ai/conda/envs/vitis-ai-pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 516, in _slow_forward
    result = self.forward(*input, **kwargs)
  File "/opt/vitis_ai/conda/envs/vitis-ai-pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 96, in forward
    raise NotImplementedError
NotImplementedError

The likely cause of the above problem is that the model (or one of its submodules) does not implement forward(), so the call falls through to the base nn.Module.forward, which raises NotImplementedError.
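A minimal sketch of the failure mode (the class names here are hypothetical): the Vitis AI quantizer traces the model with TorchScript, and tracing has to call forward(); a module that never defines forward() falls back to the base nn.Module.forward and raises NotImplementedError, while a model that defines forward() traces cleanly.

import torch
import torch.nn as nn

class BrokenNet(nn.Module):            # hypothetical: __init__ only, no forward()
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU())
    # calling BrokenNet()(x) raises NotImplementedError

class FixedNet(nn.Module):             # hypothetical: forward() implemented
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU())

    def forward(self, x):
        return self.body(x)

rand_in = torch.randn(1, 3, 224, 224)
torch.jit.trace(FixedNet().eval(), rand_in)      # traces successfully
# torch.jit.trace(BrokenNet().eval(), rand_in)   # would raise NotImplementedError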

How to Solve Yolox Training C Disk Full Issue

0. Problem description: training on the COCO dataset is suddenly interrupted halfway through; the C drive shows red because it is almost full (training generates too many temporary files in AppData/Temp).
As the epochs increase, these temporary files keep growing (this was observed with yolox-tiny); with yolox-x the C drive fills up completely!

1. Problem cause
Around line 203 of YOLOX-main/yolox/evaluators/coco_evaluator.py, **tempfile.mkstemp()** creates a temporary file, but no close() or remove() is ever called on it afterwards.

2. Solutions
(1) Method 1
After the mkstemp() call, add the two lines os.close(_) and os.remove(tmp) so the file is deleted right after it has been used (remember to import os at the top of the file); see the sketch after this list.
(2) Method 2
Since the cause is known, the temporary file can instead be created with a with ... as ... block, which closes and deletes it automatically.
(3) Method 3
If you want to keep every temporary file and still avoid filling up the C drive, change the save location to a custom path. Code location: Anaconda/envs/<your environment>/Lib/tempfile.py, around lines 159-185: point the dirlist used there at a user-defined folder instead.
(4) Method 4
Manually clean up the files in Temp at regular intervals.
Note: training with a VOC-format dataset does not generate these temporary files, because it creates them with the with ... as ... pattern; see the end of voc_evaluator.py for details.
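A minimal sketch of Methods 1 and 2 (the surrounding evaluator code is paraphrased here, not copied from YOLOX):

import os
import json
import tempfile

data_dict = {"images": [], "annotations": []}   # stand-in for the evaluator's result dict

# Method 1: close and remove the file returned by mkstemp() as soon as it has been used.
_, tmp = tempfile.mkstemp(suffix=".json")
with open(tmp, "w") as f:
    json.dump(data_dict, f)
# ... use tmp here ...
os.close(_)      # close the OS-level file descriptor returned by mkstemp()
os.remove(tmp)   # delete the temporary file so it stops piling up in AppData/Temp

# Method 2: create the file inside a with-block so it is cleaned up automatically.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "results.json")
    with open(path, "w") as f:
        json.dump(data_dict, f)
    # ... use path here ...
# tmpdir and everything inside it are removed when the block exits

# Alternative to Method 3: instead of editing Lib/tempfile.py, the default location can be
# redirected with tempfile.tempdir = r"D:\yolox_tmp" (or by setting the TMP/TEMP env variables).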

[Solved] mmdetection benchmark.py Error: RuntimeError: Distributed package doesn't have NCCL built in

Cause:
An error occurs when using mmdetection's tools/benchmark.py to calculate FPS.
The error is as follows:

Traceback (most recent call last):
  File "tools/analysis_tools/benchmark.py", line 191, in <module>
    main()
  File "tools/analysis_tools/benchmark.py", line 183, in main
    init_dist(args.launcher, **cfg.dist_params)
  File "D:\Anaconda\envs\eagermot\lib\site-packages\mmcv\runner\dist_utils.py", line 18, in init_dist
    _init_dist_pytorch(backend, **kwargs)
  File "D:\Anaconda\envs\eagermot\lib\site-packages\mmcv\runner\dist_utils.py", line 32, in _init_dist_pytorch
    dist.init_process_group(backend=backend, **kwargs)
  File "D:\Anaconda\envs\eagermot\lib\site-packages\torch\distributed\distributed_c10d.py", line 510, in init_process_group
    timeout=timeout))
  File "D:\Anaconda\envs\eagermot\lib\site-packages\torch\distributed\distributed_c10d.py", line 597, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL "
RuntimeError: Distributed package doesn't have NCCL built in

Cause analysis:
Windows does not support the NCCL backend.

Solution:
1. Locate the code position shown in the traceback:

File "D:\Anaconda\envs\eagermot\lib\site-packages\mmcv\runner\dist_utils.py", line 32, in _init_dist_pytorch

2. Add the following line just before line 32 (i.e., before dist.init_process_group is called), forcing the gloo backend instead of NCCL; a sketch of the patched function follows below:

backend = 'gloo'
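A minimal sketch of what the patched function ends up doing (paraphrased, not a verbatim copy of the mmcv source):

import os
import torch
import torch.distributed as dist

def _init_dist_pytorch(backend, **kwargs):
    # the real mmcv function also derives the rank from the environment and binds a GPU
    rank = int(os.environ['RANK'])
    num_gpus = torch.cuda.device_count()
    torch.cuda.set_device(rank % num_gpus)
    backend = 'gloo'  # added line: NCCL is not built into the Windows PyTorch packages
    dist.init_process_group(backend=backend, **kwargs)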

[Solved] torch Target Detection Error: RuntimeError: CUDA error: device-side assert triggered

When training torchvision's Mask R-CNN with your own data, the following error is reported:

Traceback (most recent call last):
  File "main_train_detection.py", line 232, in <module>
    main(params)
  File "main_train_detection.py", line 201, in main
    train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq=10)
  File "/raid/huaqing/tyler/suzhou/code/utils/engine.py", line 37, in train_one_epoch
    loss_dict = model(images, targets)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torchvision/models/detection/generalized_rcnn.py", line 97, in forward
    detections, detector_losses = self.roi_heads(features, proposals, images.image_sizes, targets)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torchvision/models/detection/roi_heads.py", line 760, in forward
    loss_classifier, loss_box_reg = fastrcnn_loss(
  File "/usr/local/lib/python3.8/dist-packages/torchvision/models/detection/roi_heads.py", line 40, in fastrcnn_loss
    sampled_pos_inds_subset = torch.where(labels > 0)[0]
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

The root cause is that the category labels are not numbered from 0:
there are actually three target categories to identify, so the total number of classes was set to 3, and the mapping from class name to label id was written as:

cls_dict = {'holes':1, 'marker':2, 'band':3}

However, labels are numbered from 0: with 3 classes in total, the valid label ids are 0, 1 and 2, so there is no class with label == 3. The cls_dict above therefore pushes the 'band' class out of the valid range and triggers the device-side assert. It should be corrected to:

cls_dict = {'holes':0, 'marker':1, 'band':2}
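A small sanity check along these lines (a hypothetical helper, not part of the original training script) catches an out-of-range label on the CPU before it turns into an opaque device-side assert on the GPU:

num_classes = 3
cls_dict = {'holes': 0, 'marker': 1, 'band': 2}

for name, label in cls_dict.items():
    assert 0 <= label < num_classes, f"label {label} for class '{name}' is out of range"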