Tag Archives: pytorch

RuntimeError: Default process group has not been initialized, please make sure to call init_process_

The following error occurred when using the mmsegmentation framework:

 File "C:\software\Anaconda3\envs\python36\lib\site-packages\torch\distributed\distributed_c10d.py", line 347, in _get_default_group
    raise RuntimeError("Default process group has not been initialized, "
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

Debugging traced the problem to the normalization setting of a convolution module:

        self.linear_fuse = ConvModule(
            in_channels=embedding_dim*4,
            out_channels=embedding_dim,
            kernel_size=1,
            norm_cfg=dict(type='SyncBN', requires_grad=True),
        )

In norm_cfg here, use 'SyncBN' for multi-GPU training; for single-GPU training, change the type to 'BN'.
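One way to avoid editing the config by hand is to choose the norm type at runtime. The sketch below is only an illustration: `select_norm_cfg` is a hypothetical helper, not part of mmsegmentation, and it assumes the caller already knows whether a process group has been initialized (for example from `torch.distributed.is_available() and torch.distributed.is_initialized()`).

```python
def select_norm_cfg(distributed: bool) -> dict:
    """Return a norm_cfg dict: SyncBN for multi-GPU runs, plain BN otherwise.

    Hypothetical helper for illustration; not part of mmsegmentation.
    """
    norm_type = 'SyncBN' if distributed else 'BN'
    return dict(type=norm_type, requires_grad=True)


# Single-GPU run: no process group has been initialized, so fall back to BN.
single_gpu_cfg = select_norm_cfg(distributed=False)
# Multi-GPU (distributed) run: SyncBN needs the initialized process group.
multi_gpu_cfg = select_norm_cfg(distributed=True)
print(single_gpu_cfg)
print(multi_gpu_cfg)
```

The returned dict could then be passed as norm_cfg to ConvModule in place of the hard-coded value.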

[Solved] RuntimeError: CUDA error: out of memory

1. Check whether the appropriate version of torch is used

import torch

print(torch.__version__)  # 1.9.1+cu111
print(torch.version.cuda)  # 11.1
print(torch.backends.cudnn.version())  # 8005
print(torch.cuda.current_device())  # 0
print(torch.cuda.is_available())  # True

2. Check whether GPU memory is insufficient. Try reducing the training batch size. If the error persists even at the smallest batch size, monitor GPU memory usage in real time with the following command:

watch -n 0.5 nvidia-smi

Even when the program is not running, GPU memory is already occupied.

So the cause is that the program was set to use four GPUs. The first two were free, but the third was occupied by colleagues' programs, so the error was raised.

3. Specify the GPU to use

device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")  # Use CUDA when available and not disabled, otherwise the CPU
model = torch.nn.DataParallel(model, device_ids=[0, 1, 3])  # Run in parallel on GPUs 0, 1 and 3, skipping the occupied GPU 2

After that, the program runs normally.
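The list of free GPUs can also be worked out by script instead of by reading the nvidia-smi table. The following is a minimal sketch, not a definitive tool: the helper names and the 500 MiB threshold are assumptions made for illustration, and the sample text stands in for real nvidia-smi output.

```python
import subprocess


def parse_free_gpus(csv_text: str, max_used_mib: int = 500) -> list:
    """Return indices of GPUs whose used memory is at or below max_used_mib.

    csv_text holds one "index, memory used in MiB" pair per line.
    """
    free = []
    for line in csv_text.strip().splitlines():
        index_str, used_str = [field.strip() for field in line.split(',')]
        if int(used_str) <= max_used_mib:
            free.append(int(index_str))
    return free


def query_free_gpus(max_used_mib: int = 500) -> list:
    """Ask nvidia-smi for per-GPU memory usage and return the idle GPU indices."""
    output = subprocess.check_output(
        ['nvidia-smi', '--query-gpu=index,memory.used',
         '--format=csv,noheader,nounits'],
        text=True,
    )
    return parse_free_gpus(output, max_used_mib)


# Sample output in which GPU 2 is busy with another user's job.
sample = "0, 12\n1, 8\n2, 10240\n3, 15\n"
print(parse_free_gpus(sample))  # [0, 1, 3]
```

On a machine with the NVIDIA driver installed, the result of query_free_gpus() could be passed as device_ids to torch.nn.DataParallel.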

[Solved] RuntimeError: Expected object of scalar type Float but got scalar type Double for argument #2 'mat1'

Error message (the code that triggers it is shown below):
RuntimeError: Expected object of scalar type Float but got scalar type Double for argument #2 ‘mat1’ in call to _th_addmm

for epoch in range(num_epochs):
    # Convert numpy arrays to torch tensors
    inputs = torch.from_numpy(x_train)
    targets = torch.from_numpy(y_train)
    # Forward pass
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    
    # Backward and optimize
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    print("Epoch [{}/{}] Loss: {:.4f}".format(epoch+1, num_epochs, loss.item()))

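Solution: make the input tensors and the model use the same dtype. torch.from_numpy keeps the NumPy dtype (float64 by default, i.e. Double), while model parameters default to float32 (Float).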

Method 1. Add

inputs = inputs.float()
targets = targets.float()
model = model.float()

Complete code

for epoch in range(num_epochs):
    # Convert numpy arrays to torch tensors
    inputs = torch.from_numpy(x_train)
    targets = torch.from_numpy(y_train)
    inputs = inputs.float()
    targets = targets.float()
    model = model.float()
    # Forward pass
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    
    # Backward and optimize
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    print("Epoch [{}/{}] Loss: {:.4f}".format(epoch+1, num_epochs, loss.item()))

Method 2. Add

inputs = inputs.double()
targets = targets.double()
model = model.double()

Complete code

for epoch in range(num_epochs):
    # Convert numpy arrays to torch tensors
    inputs = torch.from_numpy(x_train)
    targets = torch.from_numpy(y_train)
    inputs = inputs.double()
    targets = targets.double()
    model = model.double()
    # Forward pass
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    
    # Backward and optimize
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    print("Epoch [{}/{}] Loss: {:.4f}".format(epoch+1, num_epochs, loss.item()))

Pointnet-pytorch Error: ImportError: No module named 'pointnet'

Add the following before the pointnet imports at the top of the file:

import sys
sys.path.append("../")
from pointnet.dataset import ShapeNetDataset
from pointnet.model import PointNetCls
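Note that "../" is resolved against the current working directory, so the fix above depends on where the script is launched from. A sketch of a variant that resolves the path from the script's own location follows. It assumes the script sits one directory below the repository root, and `add_repo_root_to_path` is a hypothetical helper name.

```python
import sys
from pathlib import Path


def add_repo_root_to_path(script_file: str) -> str:
    """Put the directory above the script's folder at the front of sys.path.

    Same intent as sys.path.append("../"), but resolved from the script's
    own location rather than from the current working directory.
    """
    repo_root = str(Path(script_file).resolve().parent.parent)
    if repo_root not in sys.path:
        sys.path.insert(0, repo_root)
    return repo_root


# In a script one directory below the repository root, call this first:
#     add_repo_root_to_path(__file__)
# and only then run "from pointnet.dataset import ShapeNetDataset".
```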

How to Solve PyCharm configuration import torch error

Error content

Traceback (most recent call last):
  File "", line 1, in
  File "D:\PyCharm Community Edition 2021.2.2\plugins\python-ce\helpers\pydev\_pydev_bundle\pydev_import_hook.py", line 21, in do_import
    module = self._system_import(name, *args, **kwargs)
ModuleNotFoundError: No module named 'torch'

Solution:

1. Open File -> Settings.
2. Select Python Interpreter.
3. Click the '+' button.
4. Search for torch.
5. Click Install Package.

Importing torch again in the Python Console now works and shows True.

Done!
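If the error persists, it may help to confirm which interpreter the console is actually using, since a package installed into a different environment is not visible to it. A small sketch using only the standard library is shown below; `module_status` is a hypothetical helper name.

```python
import importlib.util
import sys


def module_status(name: str) -> tuple:
    """Return (interpreter path, whether that interpreter can find the module)."""
    found = importlib.util.find_spec(name) is not None
    return sys.executable, found


interpreter, has_torch = module_status("torch")
print("Interpreter:", interpreter)
print("torch importable:", has_torch)
```

Running this in the PyCharm Python Console shows the interpreter path to compare against the one selected under Python Interpreter.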

How to Solve Python libSM.so.6 error

Error: ImportError: libSM.so.6: cannot open shared object file: No such file or directory

Solution: install opencv-python-headless.

Python Pandas Error: No module named 'openpyxl'

When using pandas to convert a CSV file to an Excel file, the error No module named 'openpyxl' is reported.

An additional Excel library is required.

In a conda environment:
conda install openpyxl

Or with pip (this also works in the base environment):
pip install openpyxl

Pytorch CUDA Error: UserWarning: CUDA initialization: CUDA unknown error…

After installing CUDA, PyTorch reports the following warning:

UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start.

Solution: after CUDA and PyTorch are installed, add the following to ~/.bashrc:

export PATH=/usr/local/cuda-11.4/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-11.4/lib64:$LD_LIBRARY_PATH
# CUDA_HOME is the toolkit root, not its bin directory
export CUDA_HOME=/usr/local/cuda-11.4
export CUDA_VISIBLE_DEVICES=0,1

If the problem persists, install nvidia-modprobe with sudo apt-get install nvidia-modprobe. CUDA should be usable once the installation finishes.

How to check CUDA

import torch
flag = torch.cuda.is_available()
print(flag)

If the output is True, CUDA is working normally.
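Before testing from PyTorch, the path variables themselves can be sanity-checked. The sketch below assumes the /usr/local/cuda-11.4 install prefix used above and a Linux shell where ':' separates path entries. `check_cuda_env` is a hypothetical helper name.

```python
import os


def check_cuda_env(env: dict, cuda_root: str) -> dict:
    """Report whether PATH and LD_LIBRARY_PATH contain the CUDA directories."""
    path_entries = env.get('PATH', '').split(':')
    lib_entries = env.get('LD_LIBRARY_PATH', '').split(':')
    return {
        'PATH has bin': cuda_root + '/bin' in path_entries,
        'LD_LIBRARY_PATH has lib64': cuda_root + '/lib64' in lib_entries,
    }


# Check the current environment against the assumed install prefix.
report = check_cuda_env(dict(os.environ), '/usr/local/cuda-11.4')
for check, ok in report.items():
    print(check, '->', 'ok' if ok else 'missing')
```

Because the entries are compared exactly, a missing ':' separator in an export line shows up as 'missing'.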

Error in training yolox: error in importing apex

This error occurs because apex was not installed successfully. Note: the correct way to install it is not pip install apex.

ERROR: Command errored out with exit status 1: /usr/bin/python3 -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-6o4wusvf/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-6o4wusvf/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' --cpp_ext --cuda_ext install --record /tmp/pip-record-07hjl8r1/install-record.txt --single-version-externally-managed --compile --install-headers /usr/local/include/python3.7/apex Check the logs for full command output.
Exception information:
Traceback (most recent call last):

Solution:

Step 1: use this command to check the CUDA toolkit version installed on your machine:

nvcc --version

Step 2: use the following command to view the installed torch version and the CUDA version it was built for (shown as a suffix such as +cu110):

pip list

Note: when installing apex, the two versions must match. That is, if the CUDA version on the machine is 11.0, install the torch build for CUDA 11.0.

First, uninstall the torch currently installed on the machine:

!pip uninstall -y torch torchvision torchaudio

Then install the matching build. For example, the machine here has CUDA 11.0:

!pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html

Next, install apex:

!git clone https://github.com/NVIDIA/apex
%cd apex
!pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

This completes the installation.
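The version comparison in steps 1 and 2 can be sketched in a few lines. The code below rests on two assumptions: that the nvcc output contains a "release X.Y" fragment like the sample line, and that the torch version suffix follows the +cuXYZ pattern with the last digit as the minor version. Both helper names are hypothetical.

```python
import re


def cuda_from_nvcc(nvcc_output: str):
    """Extract the 'major.minor' release from `nvcc --version` output."""
    match = re.search(r'release (\d+)\.(\d+)', nvcc_output)
    return '{}.{}'.format(*match.groups()) if match else None


def cuda_from_torch_version(torch_version: str):
    """Extract 'major.minor' from a torch version tag such as '1.7.1+cu110'."""
    match = re.search(r'\+cu(\d+)(\d)$', torch_version)
    return '{}.{}'.format(*match.groups()) if match else None


# Sample line of nvcc output and the torch version installed above.
nvcc_line = 'Cuda compilation tools, release 11.0, V11.0.194'
toolkit = cuda_from_nvcc(nvcc_line)
built_for = cuda_from_torch_version('1.7.1+cu110')
print(toolkit, built_for, toolkit == built_for)  # 11.0 11.0 True
```

When torch is importable, torch.version.cuda reports the same build information directly.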

Error in pip install bs4 under Linux

1. The error information is as follows:

$ sudo pip install bs4==0.0.1
Collecting bs4==0.0.1
  Downloading http://pip.pgw.getui.com/packages/10/ed/7e8b97591f6f456174139ec089c769f89a94a1a4025fe967691de971f314/bs4-0.0.1.tar.gz
Collecting beautifulsoup4 (from bs4==0.0.1)
  Downloading http://pip.pgw.getui.com/packages/a1/69/daeee6d8f22c997e522cdbeb59641c4d31ab120aba0f2c799500f7456b7e/beautifulsoup4-4.10.0.tar.gz (399kB)
    100% |████████████████████████████████| 409kB 11.7MB/s
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-build-USqeae/beautifulsoup4/setup.py", line 7, in <module>
        from bs4 import __version__
      File "bs4/__init__.py", line 36, in <module>
        raise ImportError('You are trying to use a Python 3-specific version of Beautiful Soup under Python 2. This will not work. The final version of Beautiful Soup to support Python 2 was 4.9.3.')
    ImportError: You are trying to use a Python 3-specific version of Beautiful Soup under Python 2. This will not work. The final version of Beautiful Soup to support Python 2 was 4.9.3.

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-USqeae/beautifulsoup4/

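2. Solution

The error shows that beautifulsoup4 4.10.0 supports only Python 3, while pip here runs under Python 2. Install a beautifulsoup4 version no later than 4.9.3 first: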

$ sudo  pip install beautifulsoup4==4.9.0
Collecting beautifulsoup4==4.9.0
  Downloading http://pip.pgw.getui.com/packages/2d/3e/8b2fc5d3c31c84d7209313f4059858f502f2e4a9d986693eca03fe325565/beautifulsoup4-4.9.0-py2-none-any.whl (109kB)
    100% |████████████████████████████████| 112kB 872kB/s
Collecting soupsieve<2.0 (from beautifulsoup4==4.9.0)
  Downloading http://pip.pgw.getui.com/packages/39/36/f35056eb9978a622bbcedc554993d10777e3c6ff1ca24cde53f4be9c5fc4/soupsieve-1.9.6-py2.py3-none-any.whl
Collecting backports.functools-lru-cache; python_version < "3" (from soupsieve<2.0->beautifulsoup4==4.9.0)
  Downloading http://pip.pgw.getui.com/packages/e5/c1/1a48a4bb9b515480d6c666977eeca9243be9fa9e6fb5a34be0ad9627f737/backports.functools_lru_cache-1.6.4-py2.py3-none-any.whl
Installing collected packages: backports.functools-lru-cache, soupsieve, beautifulsoup4
Successfully installed backports.functools-lru-cache-1.6.4 beautifulsoup4-4.9.0 soupsieve-1.9.6

Then install bs4:

$ sudo  pip install  bs4==0.0.1
Collecting bs4==0.0.1
  Downloading http://pip.pgw.getui.com/packages/10/ed/7e8b97591f6f456174139ec089c769f89a94a1a4025fe967691de971f314/bs4-0.0.1.tar.gz
Requirement already satisfied (use --upgrade to upgrade): beautifulsoup4 in /usr/lib/python2.7/site-packages (from bs4==0.0.1)
Requirement already satisfied (use --upgrade to upgrade): soupsieve<2.0 in /usr/lib/python2.7/site-packages (from beautifulsoup4->bs4==0.0.1)
Requirement already satisfied (use --upgrade to upgrade): backports.functools-lru-cache; python_version < "3" in /usr/lib/python2.7/site-packages (from soupsieve<2.0->beautifulsoup4->bs4==0.0.1)
Installing collected packages: bs4
  Running setup.py install for bs4 ... done
Successfully installed bs4-0.0.1

[Solved] RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cubla…

Resolving RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cubla…

I ran into this problem while running an experiment. Some reports attribute it to mismatched dimensions, but on inspection that was not the case here.

Another suggested fix is to add one line of code:

torch.backends.cudnn.enabled = False

I have not tried it, because I found that the CUDA device settings in main.py differed from those in the other files (the dataset is small, so I had not set up nn.DataParallel). Once the settings were made consistent, the problem went away.

Therefore, if you encounter this problem, check whether every variable and the model are on the same CUDA device.
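A check like this can be scripted. The sketch below is a hypothetical helper, not a PyTorch API: it only assumes each object exposes a .device attribute, as torch tensors do, and uses SimpleNamespace stand-ins so that it runs without a GPU.

```python
from types import SimpleNamespace


def find_device_mismatches(named_objects: dict, expected: str) -> list:
    """Return the names whose .device does not match the expected device string."""
    return [name for name, obj in named_objects.items()
            if str(obj.device) != expected]


# Stand-ins for tensors; real torch tensors expose the same .device attribute.
batch = {
    'inputs': SimpleNamespace(device='cuda:0'),
    'labels': SimpleNamespace(device='cuda:1'),
}
print(find_device_mismatches(batch, 'cuda:0'))  # ['labels']
```

For a model, the device of next(model.parameters()) can be compared in the same way.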

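PyTorch - AttributeError: 'bool' object has no attribute 'sum'

The fix is to change torch.max() to torch.argmax(). torch.max(out, 1) returns a (values, indices) pair rather than a tensor of predicted classes, so the comparison pred == label can end up as a plain bool, which has no sum() method.

    out = model(img)
    loss = criterion(out, label)
    eval_loss += loss.data.item()*label.size(0)
    pred = torch.argmax(out, 1)
    num_correct = (pred == label).sum()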

ProgrammerAH

Programmer Guide, Tips and Tutorial