Tag Archives: pytorch

PyTorch fails to run on the specified GPU: solution

Recently I had been running PyTorch training code on an 8-GPU server without any problem. After CUDA was reinstalled, however, I could no longer specify which GPUs to run on: the program always used the GPUs in order, starting from device 0. After some searching, I solved the problem.

1. To specify which GPUs a Python program runs on, the following method is usually used:

import os
import torch

os.environ["CUDA_VISIBLE_DEVICES"] = "4,5,6,7"

Or execute the following commands directly from the command line (not recommended):

export CUDA_VISIBLE_DEVICES=4,5,6,7

2. Written as above, the code suddenly stopped working: no matter how I changed the visible GPU numbers, the program still used the GPUs in order, starting from device 0. The problem lies in where the line that selects the GPUs is placed. Move os.environ["CUDA_VISIBLE_DEVICES"] = "4,5,6,7" so that it comes before import torch and any other imports, directly after import os, like this:

import os

os.environ["CUDA_VISIBLE_DEVICES"] = "4,5,6,7"

import torch
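The ordering can be checked with a short sketch like the one below. It assumes a machine where physical GPUs 4 to 7 exist. On such a machine PyTorch renumbers the visible devices, so physical GPU 4 appears as cuda:0. On a machine without those GPUs, the CUDA branch is simply skipped.

```python
import os

# This must run before `import torch` (or any module that imports torch),
# otherwise the setting may be ignored.
os.environ["CUDA_VISIBLE_DEVICES"] = "4,5,6,7"

import torch  # noqa: E402  (deliberately imported after setting the variable)

visible = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
print("Requested physical GPUs:", visible)

if torch.cuda.is_available():
    # Visible devices are renumbered from 0 inside the process.
    print("Devices seen by PyTorch:", torch.cuda.device_count())
    print("cuda:0 is:", torch.cuda.get_device_name(0))
else:
    print("No CUDA device is visible with this setting.")
```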

3. Some common calls for querying GPU information are listed here for later reference:

import torch

torch.cuda.is_available()  # Check if cuda is available

torch.cuda.device_count() # Returns the number of GPUs

torch.cuda.get_device_name(0) # Return the GPU name, the device index starts from 0 by default

torch.cuda.current_device() # Returns the current device index

Libtorch Error: Expected object of type Variable but found type CUDALongType for argument #2 ‘index’

what(): Expected object of type Variable but found type CUDALongType for argument #2 ‘index’ (checked_cast_variable at …/…/torch/csrc/autograd/VariableTypeManual.cpp:38)

Problem description: calling the libtorch function torch::index_select(detections_class_left, 0, index) reports the error above. The definition of index_select is: static inline Tensor index_select(const Tensor & self, int64_t dim, const Tensor & index)

Problem analysis:
According to the error message, argument #2 ‘index’ should be of type Variable, but I passed a Tensor (with data type CUDALongType), so the error kept occurring.

Problem solving:
1. Be clear about the difference between the Variable type and Tensor; see the explanation in another article (on Zhihu);

2. Convert the Tensor into a Variable with the function: torch::autograd::make_variable(left_index, false);  // Tensor -> Variable

RuntimeError: Integer division of tensors using div or / is no longer supported, and in a future rel

RuntimeError: Integer division of tensors using div or / is no longer supported, and in a future release div will perform true division as in Python 3. Use true_divide or floor_divide (// in Python) instead.

from torchvision import transforms
import numpy as np

data = np.random.randint(0, 255, size=12)
img = data.reshape(2, 2, 3)
print(img.shape)
img_tensor = transforms.ToTensor()(img)  # Convert to tensor
print(img_tensor)
print(img_tensor.shape)
print("*" * 20)
norm_img = transforms.Normalize((10, 10, 10), (1, 1, 1))(img_tensor)  # Perform normalization
print(norm_img)

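Result: running this code raises the error above.

Reason:

The code works in PyTorch 1.5.0, but after upgrading to 1.6.0 an integer tensor can no longer be divided directly with ‘/’. Here the NumPy array holds integers, so the tensor passed to Normalize is an integer tensor.

Solution:

Convert the tensor to a floating-point type before normalizing.

Example code: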

from torchvision import transforms
import numpy as np

data = np.random.randint(0, 255, size=12)
img = data.reshape(2, 2, 3)
print(img.shape)
img_tensor = transforms.ToTensor()(img)  # convert to tensor
print(img_tensor)
print(img_tensor.shape)
print("*" * 20)
img_tensor = img_tensor.float()  # Add this line
norm_img = transforms.Normalize((10, 10, 10), (1, 1, 1))(img_tensor) # Perform normalization
print(norm_img)

Result: with this change the code runs without the error.
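The error message itself names two replacements for dividing integer tensors directly. The sketch below uses made-up values to show both, and assumes a PyTorch version that provides torch.true_divide and torch.floor_divide.

```python
import torch

a = torch.tensor([7, 8, 9])  # integer (int64) tensor

# True division: the result is a floating-point tensor.
true_q = torch.true_divide(a, 2)

# Floor division: the result stays an integer tensor.
floor_q = torch.floor_divide(a, 2)

# Converting to float first, as in the fix above, also allows `/`.
float_q = a.float() / 2

print(true_q, floor_q, float_q)
```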

PyTorch: CUDA error: an illegal memory access was encountered

Error when training with PyTorch 1.6:

RuntimeError: CUDA error: an illegal memory access was encountered

The cause of this error is the same as for the following error in lower versions of PyTorch (such as version 1.1):

RuntimeError: Expected object of backend CUDA but got backend CPU for argument (see https://blog.csdn.net/weixin_44414948/article/details/109783988)

Cause of error:

The root cause of this kind of error is that the model and the input data (input_image, input_label) have not all been moved to the GPU (CUDA).
Tip: when debugging, carefully check that every input variable and the network model have been moved to the GPU. My errors usually came from missing one or two of them.

Solution:

Move the model, input_image and input_label all to the GPU. Either of the two ways below works. Example code:

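# Option 1: move everything to the GPU directly
model = model.cuda()
input_image = input_image.cuda()
input_label = input_label.cuda()

# Option 2: choose the device once and fall back to the CPU if CUDA is unavailable
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
input_image = input_image.to(device)
input_label = input_label.to(device)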
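To catch a forgotten .to(device) while debugging, a small check like the following may help. It is only a sketch: the helper devices_in_use and the toy nn.Linear model are made up for illustration and are not part of PyTorch or of the original training code.

```python
import torch
from torch import nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Toy stand-ins for the real model and inputs.
model = nn.Linear(4, 2).to(device)
input_image = torch.randn(3, 4).to(device)
input_label = torch.randint(0, 2, (3,)).to(device)


def devices_in_use(model, *tensors):
    """Return the set of devices holding the model parameters and the given tensors."""
    found = {p.device for p in model.parameters()}
    found.update(t.device for t in tensors)
    return found


found = devices_in_use(model, input_image, input_label)
if len(found) != 1:
    raise RuntimeError(f"Model and inputs are on different devices: {found}")

output = model(input_image)
print(output.shape)
```

If the set contains more than one device, something was left behind on the CPU.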

Cannot find command ‘git’

On Windows 10, the message cannot find command ‘git’ appears because Git is not installed.

Solution:

In a command prompt where conda is available, enter: conda install git

ValueError: Expected more than 1 value per channel when training, got input size torch.Size([1, 256,

When testing the trained network, PyTorch raised the error above. Adding model.eval() after loading the network solved the problem.

model = nn.DataParallel(model).cpu()
model.load_state_dict(torch.load(path, map_location=torch.device('cpu')), False)
model.eval()

There are three things to note about this code snippet:

 nn.DataParallel(model)

If the network was wrapped with this function during training, it also needs to be wrapped with it when testing.

model.load_state_dict(torch.load(path, map_location=torch.device('cpu')), False)

If only the parameters were saved, and not the network structure, the network can be loaded in the way shown above. If the network was trained on a GPU but is used on a CPU, add the argument map_location=torch.device('cpu').

model.eval()

The third point is the subject of this post. After loading the network, call the function above to solve the problem.
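The reason model.eval() helps is that batch normalization in training mode computes statistics over the current batch, which is not possible with a single sample, while in evaluation mode it uses the stored running statistics. The sketch below reproduces this with a made-up two-layer model, not the network from this post.

```python
import torch
from torch import nn

# Toy model containing a batch-norm layer.
model = nn.Sequential(nn.Linear(8, 256), nn.BatchNorm1d(256))
single = torch.randn(1, 8)  # a batch holding one sample

# Training mode: batch statistics cannot be computed from one sample.
model.train()
try:
    model(single)
    train_error = None
except ValueError as exc:
    train_error = str(exc)
print("train mode:", train_error)

# Evaluation mode: running statistics are used, so one sample is fine.
model.eval()
with torch.no_grad():
    output = model(single)
print("eval mode output shape:", output.shape)
```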

Fixing no image being displayed after calling plt.imshow()

After calling plt.imshow() in the code, no image is displayed:

import matplotlib.pyplot as plt
plt.imshow(img)

The solution is as follows:

First, add the following import at the top of the file:

import pylab

Then, after plt.imshow(img) in the original code, add the following:

pylab.show()

The image is then displayed normally.

Build your own ResNet18 network and load torchvision’s own weights

import torch
import torchvision
import cv2 as cv
from utils.utils import letter_box
from model.backbone import ResNet18


model1 = ResNet18(1)
model2 = torchvision.models.resnet18(progress=False)
fc = model2.fc
model2.fc = torch.nn.Linear(512, 1)
# print(model)
model_dict1 = model1.state_dict()
model_dict2 = torch.load('resnet18.pth')
model_list1 = list(model_dict1.keys())
model_list2 = list(model_dict2.keys())
len1 = len(model_list1)
len2 = len(model_list2)
minlen = min(len1, len2)
for n in range(minlen):
    if model_dict1[model_list1[n]].shape != model_dict2[model_list2[n]].shape:
        continue
    model_dict1[model_list1[n]] = model_dict2[model_list2[n]]

model1.load_state_dict(model_dict1)
missing, unexpected = model2.load_state_dict(model_dict2)
image = cv.imread('zhn1.jpg')
image = letter_box(image, 224)
image = image[:, :, ::-1].transpose(2, 0, 1)
print('Network loading complete.')
model1.eval()
model2.eval()
with torch.no_grad():
    image = torch.tensor(image/256, dtype=torch.float32).unsqueeze(0)
    predict1 = model1(image)
    predict2 = model2(image)
print('finished')
# torch.save(model.state_dict(), 'resnet18.pth')
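The core of the script above is copying weights position by position and skipping any entry whose shape differs. The same idea is sketched below on two small made-up models, so it runs without resnet18.pth or the custom ResNet18 class. The helper name copy_matching_weights is hypothetical.

```python
import torch
from torch import nn


def copy_matching_weights(target: nn.Module, source_state: dict) -> int:
    """Copy tensors from source_state into target by position, skipping shape mismatches."""
    target_state = target.state_dict()
    pairs = zip(list(target_state.items()), list(source_state.items()))
    copied = 0
    for (t_key, t_val), (_s_key, s_val) in pairs:
        if t_val.shape != s_val.shape:
            continue  # e.g. a final layer with a different number of outputs
        target_state[t_key] = s_val.clone()
        copied += 1
    target.load_state_dict(target_state)
    return copied


# The two models differ only in the size of the last layer.
source = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 3))
target = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))

n_copied = copy_matching_weights(target, source.state_dict())
print("tensors copied:", n_copied)
```

Positional matching only makes sense when both networks define their layers in the same order, which is the assumption the original script makes as well.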

RuntimeError: log_vml_cpu not implemented for ‘Long’

Problem description
Running torch.log(torch.from_numpy(np.array([1,2,2]))) raises the error: log_vml_cpu not implemented for ‘Long’
Cause
Long data does not support the log operation. Why is the tensor a Long? The NumPy array is created without specifying a dtype, so int64 is used by default, and when the NumPy array is converted to a torch tensor the data type becomes Long.
Solution
Change it to torch.log(torch.from_numpy(np.array([1, 2, 2], dtype=np.float64)))
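A small sketch of the dtype behaviour follows. The integer dtype is set explicitly to int64 here because NumPy's default integer type can vary by platform. Newer PyTorch releases may promote integer inputs automatically, so the original error may not reproduce there, but the explicit conversion works either way.

```python
import math

import numpy as np
import torch

ints = torch.from_numpy(np.array([1, 2, 2], dtype=np.int64))
print(ints.dtype)  # torch.int64, i.e. Long

# Fix 1: create the NumPy array as floating point from the start.
floats = torch.from_numpy(np.array([1, 2, 2], dtype=np.float64))
log_from_numpy = torch.log(floats)

# Fix 2: convert an existing Long tensor before taking the log.
log_from_cast = torch.log(ints.float())

print(log_from_numpy, log_from_cast)
```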

PyTorch custom convolution kernel weight parameters

In PyTorch, convolution layers are generally built with nn.Conv2d. In some cases we need to supply our own convolution kernel weights, and nn.Conv2d does not take the weights as an argument. In that case torch.nn.functional.conv2d, referred to as F.conv2d, can be used:

torch.nn.functional.conv2d(input, weight, bias=None, stride=1, padding=0, dilation=1, groups=1)

F.conv2d requires the convolution weight to be passed in explicitly, together with an optional bias. So build the desired kernel parameters first and then pass them to F.conv2d. Here is an example of using F.conv2d to build a convolution layer, wrapped in a class for the network model:

import torch
import torch.nn as nn
import torch.nn.functional as F


class CNN(nn.Module):
    def __init__(self):
        super(CNN, self).__init__()
        # Shape: (out_channels=16, in_channels=1, kernel_height=5, kernel_width=5)
        self.weight = nn.Parameter(torch.randn(16, 1, 5, 5))  # Customized weights
        self.bias = nn.Parameter(torch.randn(16))    # Customized bias

    def forward(self, x):
        # x must keep its (batch, channels, height, width) shape for F.conv2d
        out = F.conv2d(x, self.weight, self.bias, stride=1, padding=0)
        return out

It is worth noting that the weights to be trained in each layer in PyTorch should be of type nn.Parameter rather than a plain Tensor or Variable. A Parameter’s requires_grad defaults to True, while a Variable’s defaults to False.
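As an illustration of passing a hand-built kernel to F.conv2d, the sketch below applies a fixed 3x3 horizontal Sobel filter to a small made-up image. The kernel and the image values are chosen only for this example.

```python
import torch
import torch.nn.functional as F

# Hand-built kernel with shape (out_channels=1, in_channels=1, 3, 3).
sobel_x = torch.tensor([[-1.0, 0.0, 1.0],
                        [-2.0, 0.0, 2.0],
                        [-1.0, 0.0, 1.0]]).view(1, 1, 3, 3)

# A 5x5 image whose values increase by 1 per column: shape (batch=1, channels=1, 5, 5).
image = torch.arange(25, dtype=torch.float32).view(1, 1, 5, 5)

edges = F.conv2d(image, sobel_x, bias=None, stride=1, padding=0)
print(edges.shape)
print(edges)
```

Because the image increases by 1 per column, every output position has the same horizontal gradient response.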

Solving visdom startup failure on Windows 10

Task description
I recently collected a batch of data and wanted to run CycleGAN on it to see the effect of domain transfer. I found open-source CycleGAN code online and it ran normally, but the call to visdom always showed an error: HTTP Error. This post records how I solved it.

Starting visdom

python -m visdom.server

Starting visdom.server from CMD this way gets stuck at the step where it downloads scripts.

Fixing the hang
The cause is that the files are hard to download. Here is how to solve it:
Find the location of the visdom package in the current environment, roughly: ~\Lib\site-packages\visdom
Open server.py, look for the call to download_scripts, and comment out that line so that download_scripts() is not executed.
After this change visdom starts smoothly and the model runs without throwing an exception. But there is another problem: the page opens as a blank blue screen.

Fixing the blue screen
The blue screen appears because the files were not downloaded properly. The solution here follows two articles, both listed in the references below.
Go to the static folder of the local visdom package, where there is an index.html file, and back it up. Download the index.html from reference (2) and use it to replace the current one, then restart visdom and open the page. At this point the problem remained. As the next step, replace the current index.html with the backed-up original and restart visdom. This solved the problem.

References
(1) https://blog.csdn.net/AnthongDai/article/details/79117472
(2) https://github.com/chenyuntc/pytorch-book/blob/2c8366137b691aaa8fbeeea478cc1611c09e15f5/README.md#visdom%E6%89%93%E4%B8%8D%E5%BC%80%E5%8F%8A%E5%85%B6%E8%A7%A3%E5%86%B3%E6%96%B9%E6%A1%88

This is the author’s original article; please credit the source when reposting.

PyCharm / PyTorch: the paging file is too small for this operation to complete

Possible reasons
If you search online (for example on Baidu), most answers say that virtual memory is insufficient and tell you to increase it. On Windows, PyTorch may also have a problem with the DataLoader’s num_workers, which then needs to be set to 0.
The solution
A lack of virtual memory may not be a configuration problem at all. It may simply be a lack of free disk space. I cleaned up the disk and the problem was solved.
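For the num_workers suggestion above, a minimal sketch looks like this. The random TensorDataset is a stand-in for real data. With num_workers=0 the data is loaded in the main process, so no extra worker processes are started.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Made-up dataset: 8 samples with 3 features each, plus integer labels.
dataset = TensorDataset(torch.randn(8, 3), torch.arange(8))

# num_workers=0 loads batches in the main process.
loader = DataLoader(dataset, batch_size=4, shuffle=False, num_workers=0)

batches = list(loader)
for features, labels in batches:
    print(features.shape, labels.tolist())
```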
