Tag Archives: pytorch

[Solved] ROS fatal error: NvInferRuntimeCommon.h: No such file or directory

The TensorRT header file could not be found during compilation.

Solution:

Add the TensorRT include path to CMakeLists.txt.

First, find where the header is installed:

locate   NvInferRuntimeCommon.h

Then add that directory to CMakeLists.txt. Here I used the absolute path:

include_directories("/home/b502/tensorrt/TensorRT-7.2.1.6/include")
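
If locate is not available, the same search can be sketched in Python. This is only an illustration: find_header_dirs and cmake_include_lines are hypothetical helper names, and the search root is whatever directory you suspect TensorRT was unpacked under.

```python
import os


def find_header_dirs(header_name, search_root):
    """Return every directory under search_root that contains header_name."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(search_root):
        if header_name in filenames:
            matches.append(dirpath)
    return sorted(matches)


def cmake_include_lines(include_dirs):
    """Format each directory as an include_directories() line for CMakeLists.txt."""
    return [f'include_directories("{d}")' for d in include_dirs]
```

For example, cmake_include_lines(find_header_dirs("NvInferRuntimeCommon.h", "/home")) returns one candidate line per directory found; pick the one that matches the TensorRT version you build against.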

[Solved] RuntimeError: unexpected EOF, expected 73963 more bytes. The file might be corrupted.

RuntimeError: unexpected EOF, expected 73963 more bytes. The file might be corrupted.

Problem Description:

When a Python script downloads pretrained PyTorch model weights and the download is interrupted (for example, by an unstable network), loading the incomplete file raises: RuntimeError: unexpected EOF, expected xxxxx more bytes. The file might be corrupted.

Cause analysis:

The error means the downloaded weight file is incomplete or damaged. Delete the damaged file and run the code again so that it is downloaded afresh.

Solution:

First find where the downloaded weight file was saved. This post covers three cases:

1. Windows System & Anaconda Environment

In my case the file was downloaded to D:\Anaconda3\anaconda\envs\yypy36\Lib\site-packages\facexlib\weightsdetection_Resnet50_Final.pt, so go to that folder and delete the weight file.

2. Windows system & Python environment:

The code automatically downloads the model weights and saves them to the C:\Users\username\.cache\torch\checkpoints folder. Note that .cache may be a hidden folder, so enable viewing of hidden items to see it. Then just delete the weight file.

3. Linux systems:

On Linux, weight files are usually saved under /home/username/.cache/torch. Note that .cache is a hidden folder: in WinSCP it is shown only after pressing Ctrl+Alt+H, and in a terminal you can list it from the home directory with ls -a. When running as root, the default download path is /root/.cache/torch/checkpoints. Just delete the weight file.

In all three cases, after deleting the weight file, run the code again so that the weights are downloaded afresh.
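
To see which weight files are sitting in a cache folder before deleting anything, a small listing helper can be useful. The sketch below is an assumption-laden illustration: find_weight_files and delete_weight_file are hypothetical names, and the .pth/.pt suffixes are simply the ones that appear in this post. Files are listed smallest first because an interrupted download is often, though not always, suspiciously small.

```python
from pathlib import Path


def find_weight_files(cache_dir, suffixes=(".pth", ".pt")):
    """List weight files under cache_dir as (path, size_in_bytes), smallest first."""
    cache_dir = Path(cache_dir).expanduser()
    found = [
        (path, path.stat().st_size)
        for path in cache_dir.rglob("*")
        if path.is_file() and path.suffix in suffixes
    ]
    return sorted(found, key=lambda item: item[1])


def delete_weight_file(path):
    """Delete one weight file so that the next run downloads it again."""
    Path(path).unlink()
```

For example, find_weight_files("~/.cache/torch") lists the candidates on Linux; pass the chosen path to delete_weight_file and rerun the program.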

Additional:

If the program downloads the weights too slowly or the network is unstable, you can download the file manually from the site that hosts it and put it in the expected location. On Linux you can use wget, where LOCAL_DIR is the local path where the weights are saved and WEIGHTS_URL is the address of the weights:

wget -P LOCAL_DIR WEIGHTS_URL

If the download is interrupted, wget can resume from where it stopped. Add the -c option:

wget -P LOCAL_DIR -c WEIGHTS_URL

For example:

wget -P /home/20220222Proj/pretrained_models -c https://github.com/xinntao/Real-ESRGAN/releases/download/v0.2.1/RealESRGAN_x2plus.pth
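
Where wget is not available, resuming can be approximated in Python with an HTTP Range request. This is a minimal sketch, not a replacement for wget: build_resume_request and download_with_resume are hypothetical helper names, and the approach assumes the server supports range requests (it falls back to a full download when it does not).

```python
import os
import urllib.request


def build_resume_request(url, dest_path):
    """Return a Request that asks only for the bytes not yet on disk."""
    already = os.path.getsize(dest_path) if os.path.exists(dest_path) else 0
    request = urllib.request.Request(url)
    if already > 0:
        request.add_header("Range", f"bytes={already}-")
    return request


def download_with_resume(url, dest_path, chunk_size=1 << 20):
    """Download url to dest_path, appending to a partial file when possible."""
    request = build_resume_request(url, dest_path)
    with urllib.request.urlopen(request) as response:
        # 206 means the server honoured the Range header; otherwise start over.
        mode = "ab" if response.status == 206 else "wb"
        with open(dest_path, mode) as out:
            while True:
                chunk = response.read(chunk_size)
                if not chunk:
                    break
                out.write(chunk)
```

Note that if the file on disk is already complete, a server will typically answer the range request with an error (HTTP 416), which urlopen raises as an HTTPError.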

[Solved] NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver.

1. Error
Running nvidia-smi reports:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
In Python, torch.cuda.is_available() returns False.

2. Solution

Run the following two commands, where 470.94 is the driver version number:

sudo apt-get install dkms
sudo dkms install -m nvidia -v 470.94

To find your own version number, run ll /usr/src/ and look for a folder such as nvidia-470.94/. The version differs from machine to machine.
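
Reading the version out of /usr/src can also be scripted. The sketch below assumes the driver sources live in a folder named nvidia-VERSION, as in the listing described above; installed_nvidia_versions and dkms_command are hypothetical helper names, and the script only builds the command string without running it.

```python
import re
from pathlib import Path


def installed_nvidia_versions(src_root="/usr/src"):
    """Return the version part of every nvidia-<version> folder under src_root."""
    pattern = re.compile(r"^nvidia-(\d+(?:\.\d+)*)$")
    versions = []
    root = Path(src_root)
    if not root.is_dir():
        return versions
    for entry in root.iterdir():
        match = pattern.match(entry.name)
        if entry.is_dir() and match:
            versions.append(match.group(1))
    return sorted(versions)


def dkms_command(version):
    """Build the dkms command from this post for a given driver version."""
    return f"sudo dkms install -m nvidia -v {version}"
```

Printing dkms_command(v) for each v in installed_nvidia_versions() gives the command to copy into a terminal.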

[Solved] AttributeError: 'NoneType' object has no attribute 'astype'

Problem description

Running the code raises AttributeError: 'NoneType' object has no attribute 'astype', as shown in the following traceback:

Traceback (most recent call last):
  File "work/person_search-master/tools/demo.py", line 82, in <module>
    query_feat = net.inference(query_img, query_roi).view(-1, 1)
  File "/home/featurize/work/person_search-master/tools/../lib/models/network.py", line 178, in inference
    processed_img, scale = img_preprocessing(img)
  File "/home/featurize/work/person_search-master/tools/../lib/datasets/data_processing.py", line 49, in img_preprocessing
    processed_img = img.astype(np.float32)
AttributeError: 'NoneType' object has no attribute 'astype'

Solution:

According to the error message, img is a 'NoneType' object (that is, None), so it has no astype method.

In general, this happens when the image could not be read, typically because the file does not exist at the given path, so:

  • Make sure the image exists at the path used in the code.
  • Run python XXX.py from the correct directory, so that any relative image path in XXX.py resolves to the right location.
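
Both checks above can be made explicit in code, so that a missing file fails with a clear message instead of a later AttributeError. This is a minimal sketch: read_image_checked is a hypothetical helper, and reader stands for whatever function the project uses to load images. The traceback does not show which one, so the assumption here is a reader that, like cv2.imread, returns None instead of raising when it cannot load the file.

```python
import os


def read_image_checked(path, reader):
    """Call reader(path) and fail loudly, with the resolved path, if it yields None."""
    absolute = os.path.abspath(path)
    if not os.path.isfile(absolute):
        raise FileNotFoundError(
            f"Image not found: {absolute} (current directory: {os.getcwd()})"
        )
    img = reader(absolute)
    if img is None:
        raise ValueError(f"File exists but could not be read as an image: {absolute}")
    return img
```

Printing the absolute path and the current directory in the error makes it easy to spot a script that was started from the wrong folder.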

[Solved] torchvision Error: UserWarning: Failed to load image Python extension: Could not find module

torchvision error: UserWarning: Failed to load image Python extension: Could not find module

One possible cause is that the torchvision version is too new; the problem appears to lie in the newer torchvision release itself. I first installed the versions given on the official website:

pip3 install torch==1.10.1+cu102 torchvision==0.11.2+cu102 torchaudio===0.10.1+cu102 -f https://download.pytorch.org/whl/cu102/torch_stable.html

After that, importing torchvision raised the image extension warning above. The fix is to downgrade torchvision. For example, in the Anaconda Prompt:

conda activate ltorch # ltorch is the name of the virtual environment I created
pip install torchvision==0.10.1+cu102 -f https://download.pytorch.org/whl/cu102/torch_stable.html

I had 0.11.2 and downgraded to 0.10.1, after which the import works without the warning.

You can of course choose a different version from the following index:

https://download.pytorch.org/whl/cu102/torch_stable.html

[Solved] RuntimeError: scatter(): Expected dtype int64 for index


RuntimeError: scatter(): Expected dtype int64 for index

1. Cause:

scatter() requires the index tensor to be of type int64. I had created it with torch.Tensor(x), which produces float32; it should be torch.LongTensor(x), which is int64.

2. Solution

Find where the original data is defined and change its dtype there. Typically this means dtype=np.int64 for integers or dtype=np.float32 for floats (most constructors accept a dtype argument). It is also better to keep the int and float types at the same bit width where possible.

import numpy as np
a = np.random.randint(100, size=(10**6), dtype="int64")
print(a)
print(type(a[0]))
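
On the PyTorch side, the same fix looks roughly like this. The tensors below are made-up toy values chosen only to show the dtype conversion; they are not from the original project.

```python
import torch

# A float index tensor, as produced by torch.Tensor(...); scatter() rejects it.
float_index = torch.Tensor([[0, 1, 2], [2, 0, 1]])

# Convert the index to int64 before calling scatter().
index = float_index.long()

src = torch.ones(2, 3)
# Along dim 0: out[index[i][j]][j] = src[i][j]
out = torch.zeros(3, 3).scatter(0, index, src)
print(index.dtype)
print(out)
```

Calling .long() on an existing tensor is equivalent to .to(torch.int64), so it also works when the index comes from elsewhere, for example from a NumPy array that was created with a float dtype.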

[Solved] AttributeError: module 'setuptools._distutils' has no attribute 'version'

PyTorch TensorBoard error: AttributeError: module 'setuptools._distutils' has no attribute 'version'. It is raised by code such as:

from torch.utils.tensorboard import SummaryWriter

# Write event files to the "logs" directory.
writer = SummaryWriter("logs")

# Log the line y = x for x in 0..99.
for i in range(100):
    writer.add_scalar("y=x", i, i)

writer.close()

Cause of problem:

The installed setuptools version is too new.

Solution:

Install an older version of setuptools (it needs to be lower than the version you had):

pip uninstall setuptools
pip install setuptools==59.5.0
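
Before downgrading, it can help to confirm which setuptools version is actually installed in the active environment. The sketch below treats 59.5.0, the version used in this post, as the known-good reference; that threshold is an assumption taken from the post, and the function names are hypothetical.

```python
from importlib import metadata


def parse_version(text):
    """Turn '59.5.0' into (59, 5, 0), stopping at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def is_newer_than(installed, known_good="59.5.0"):
    """Return True when the installed version is above the known-good one."""
    return parse_version(installed) > parse_version(known_good)


def installed_setuptools_version():
    """Return the installed setuptools version string, or None if it is absent."""
    try:
        return metadata.version("setuptools")
    except metadata.PackageNotFoundError:
        return None
```

If is_newer_than(installed_setuptools_version()) is True in the environment that raises the error, the downgrade above is worth trying.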

[Solved] ERROR: No matching distribution found for torch-cluster==x.x.x

Following someone else's setup, I created a py36 conda virtual environment:

conda create -n py36 python=3.6

This gave me Python 3.6.0 by default. pytorch=1.8.0 and cudatoolkit=11.1.1 then installed successfully, and I used pip to install:
– torch-cluster==1.5.9
– torch-scatter==2.0.6
– torch-sparse==0.6.9
– torch-spline-conv==1.2.1

The installation failed with:

ERROR: No matching distribution found for torch-cluster==1.5.9

I tried various fixes found online, but none worked; the error persisted even without the version pin. I then rechecked the environment I was copying and found that my Python version was wrong: it should be Python 3.6.10. So, in this virtual environment, run:

conda install python=3.6.10=hcf32534_1

Then run:

pip install torch-xxxx==x.x.x

The packages now install successfully.

[Solved] mmdetection benchmark.py Error: RuntimeError: Distributed package doesn't have NCCL built in

Problem:
An error occurs when using mmdetection's benchmark.py to calculate FPS.
The error is as follows:

Traceback (most recent call last):
  File "tools/analysis_tools/benchmark.py", line 191, in <module>
    main()
  File "tools/analysis_tools/benchmark.py", line 183, in main
    init_dist(args.launcher, **cfg.dist_params)
  File "D:\Anaconda\envs\eagermot\lib\site-packages\mmcv\runner\dist_utils.py", line 18, in init_dist
    _init_dist_pytorch(backend, **kwargs)
  File "D:\Anaconda\envs\eagermot\lib\site-packages\mmcv\runner\dist_utils.py", line 32, in _init_dist_pytorch
    dist.init_process_group(backend=backend, **kwargs)
  File "D:\Anaconda\envs\eagermot\lib\site-packages\torch\distributed\distributed_c10d.py", line 510, in init_process_group
    timeout=timeout))
  File "D:\Anaconda\envs\eagermot\lib\site-packages\torch\distributed\distributed_c10d.py", line 597, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL "
RuntimeError: Distributed package doesn't have NCCL built in

Cause analysis:
Windows does not support the NCCL backend.

Solution:
1. Locate the following code:

File "D:\Anaconda\envs\eagermot\lib\site-packages\mmcv\runner\dist_utils.py", line 32, in _init_dist_pytorch

2. Immediately before that call (line 32), add the following line so that the gloo backend is used instead:

backend='gloo'
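
Hard-coding backend='gloo' inside an installed package also changes the behaviour on machines where NCCL works. A less invasive variant, sketched here under the assumption that NCCL is the desired backend everywhere except Windows, is to pick the backend from the platform. pick_backend is a hypothetical helper, not part of mmcv.

```python
import sys


def pick_backend(platform=sys.platform):
    """Return 'gloo' on Windows, where NCCL is unavailable, and 'nccl' elsewhere."""
    return "gloo" if platform.startswith("win") else "nccl"
```

The patched line would then read backend = pick_backend(). PyTorch also provides torch.distributed.is_nccl_available(), which could be used for the same decision when torch is already imported.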

[Solved] CUDA unknown error – this may be due to an incorrectly set up environment

Preface

Today a PyTorch project running on our server suddenly started failing after an upgrade. The full error message is too long for the title, so here it is:

builtins.RuntimeError: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero.

I then looked into it; below are the approaches I tried.

Solution:

Method 1: add environment variables

Since the project runs in a Docker container, I installed vim inside the container and appended the following line to the end of ~/.bashrc:

export CUDA_VISIBLE_DEVICES=0

The GPU selected when the container was built is number 0, which is why I set the value to 0.

After restarting the container, checking $CUDA_VISIBLE_DEVICES showed the expected value, but the problem was not solved and the error was still raised.

Method 2: add environment variables to the code

Add the following code at the start of the program, before CUDA is initialized.

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

This still did not solve the problem.

Method 3: restart the server

Some articles mention that upgrading the graphics driver without rebooting the system can cause this same error.

So I restarted the server, and that solved the problem.

How to Solve Error: RuntimeError CUDA out of memory

Error: RuntimeError: CUDA out of memory.
Error Messages:

Traceback (most recent call last):
  File "xxx.py", line 110, in <module>
    loss.backward()
  File "/nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/tensor.py", line 185, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/autograd/__init__.py", line 127, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 132.00 MiB (GPU 0; 15.78 GiB total capacity; 13.69 GiB already allocated; 91.50 MiB free; 14.53 GiB reserved in total by PyTorch)
Exception raised from malloc at /pytorch/c10/cuda/CUDACachingAllocator.cpp:272 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x14c9ce19a1e2 in /nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x1e64b (0x14c9ce3f064b in /nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0x1f464 (0x14c9ce3f1464 in /nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1faa1 (0x14c9ce3f1aa1 in /nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #4: at::native::empty_cuda(c10::ArrayRef<long>, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) + 0x11e (0x14c9d10fc90e in /nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0xf33949 (0x14c9cf536949 in /nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xf4d777 (0x14c9cf550777 in /nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x10e9c7d (0x14ca0a2ecc7d in /nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x10e9f97 (0x14ca0a2ecf97 in /nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #9: at::empty(c10::ArrayRef<long>, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) + 0xfa (0x14ca0a3f7a1a in /nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #10: at::native::mm_cuda(at::Tensor const&, at::Tensor const&) + 0x6c (0x14c9d05ebffc in /nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #11: <unknown function> + 0xf22a20 (0x14c9cf525a20 in /nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #12: <unknown function> + 0xa56530 (0x14ca09c59530 in /nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #13: at::Tensor c10::Dispatcher::call<at::Tensor, at::Tensor const&, at::Tensor const&>(c10::TypedOperatorHandle<at::Tensor (at::Tensor const&, at::Tensor const&)> const&, at::Tensor const&, at::Tensor const&) const + 0xbc (0x14ca0a44181c in /nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #14: at::mm(at::Tensor const&, at::Tensor const&) + 0x4b (0x14ca0a3926ab in /nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #15: <unknown function> + 0x2ed0a2f (0x14ca0c0d3a2f in /nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #16: <unknown function> + 0xa56530 (0x14ca09c59530 in /nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #17: at::Tensor c10::Dispatcher::call<at::Tensor, at::Tensor const&, at::Tensor const&>(c10::TypedOperatorHandle<at::Tensor (at::Tensor const&, at::Tensor const&)> const&, at::Tensor const&, at::Tensor const&) const + 0xbc (0x14ca0a44181c in /nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #18: at::Tensor::mm(at::Tensor const&) const + 0x4b (0x14ca0a527cab in /nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #19: <unknown function> + 0x2d11c34 (0x14ca0bf14c34 in /nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #20: torch::autograd::generated::MmBackward::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0x294 (0x14ca0bf30814 in /nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #21: <unknown function> + 0x3375bb7 (0x14ca0c578bb7 in /nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #22: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&) + 0x1400 (0x14ca0c574400 in /nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #23: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) + 0x451 (0x14ca0c574fa1 in /nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #24: torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x89 (0x14ca0c56d119 in /nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #25: torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x4a (0x14ca19d0ddea in /nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #26: <unknown function> + 0xbd6df (0x14ca5616b6df in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #27: <unknown function> + 0x76db (0x14ca5a6356db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #28: clone + 0x3f (0x14ca5a35ea3f in /lib/x86_64-linux-gnu/libc.so.6)

Solution:
Running with CUDA_LAUNCH_BLOCKING=1 python xx.py did not help.
Reducing the batch size solved the problem.
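
Finding a batch size that fits can be automated by retrying with smaller values. The helper below is a rough sketch rather than part of the original fix: run_with_smaller_batches is a hypothetical name, and run_step stands for any function of yours that runs one training or inference step for a given batch size.

```python
def run_with_smaller_batches(run_step, batch_size, min_batch_size=1):
    """Call run_step(batch_size), halving batch_size after each out-of-memory error."""
    while True:
        try:
            return batch_size, run_step(batch_size)
        except RuntimeError as error:
            # PyTorch reports GPU memory exhaustion as a RuntimeError whose
            # message contains "out of memory"; re-raise anything else.
            if "out of memory" not in str(error) or batch_size <= min_batch_size:
                raise
            batch_size = max(min_batch_size, batch_size // 2)
```

The function returns the batch size that finally worked together with the step's result. In a real training loop you would probably also want to release cached GPU memory between attempts, for example with torch.cuda.empty_cache().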

[Solved] PyG dataset loading error: AttributeError [PyTorch Geometric]

AttributeError: 'GlobalStorage' object has no attribute 'train_mask'

The error is raised by the following function:

def create_masks(data):
    """
    Splits data into training, validation, and test splits in a stratified manner if
    it is not already splitted. Each split is associated with a mask vector, which
    specifies the indices for that split. The data will be modified in-place
    :param data: Data object
    :return: The modified data
    """
    if not hasattr(data, "val_mask"):

        data.train_mask = data.dev_mask = data.test_mask = None

        for i in range(20):
            labels = data.y.numpy()
            dev_size = int(labels.shape[0] * 0.1)
            test_size = int(labels.shape[0] * 0.8)

            perm = np.random.permutation(labels.shape[0])
            test_index = perm[:test_size]
            dev_index = perm[test_size:test_size + dev_size]

            data_index = np.arange(labels.shape[0])
            test_mask = torch.tensor(np.in1d(data_index, test_index), dtype=torch.bool)
            dev_mask = torch.tensor(np.in1d(data_index, dev_index), dtype=torch.bool)
            train_mask = ~(dev_mask + test_mask)

            test_mask = test_mask.reshape(1, -1)
            dev_mask = dev_mask.reshape(1, -1)
            train_mask = train_mask.reshape(1, -1)


            if data.train_mask is None:
                data.train_mask = train_mask
                data.val_mask = dev_mask
                data.test_mask = test_mask
            else:

                data.train_mask = torch.cat((data.train_mask, train_mask), dim=0)
                data.val_mask = torch.cat((data.val_mask, dev_mask), dim=0)
                data.test_mask = torch.cat((data.test_mask, test_mask), dim=0)

    else:  # in the case of WikiCS
        data.train_mask = data.train_mask.T
        data.val_mask = data.val_mask.T

    return data

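To fix AttributeError: 'GlobalStorage' object has no attribute 'train_mask', change line 33 of the function from

if data.train_mask is None:

to

if 'train_mask' not in data:

so that the check tests whether the attribute is present instead of reading it. A likely explanation, which is an assumption about recent PyTorch Geometric versions, is that assigning None to an attribute does not store it, so reading data.train_mask afterwards fails.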