Tag Archives: pytorch

StopIteration error when using next(iter(data.DataLoader()))

A StopIteration error is reported when using next(iter(data.DataLoader())). This happens because calling next() on an iterator that has already been exhausted raises StopIteration: after the DataLoader has finished one full pass over the data, there is nothing left to fetch, so the next call to next() triggers StopIteration and the loop exits.
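As a minimal illustration of my own (not from the original post), any exhausted Python iterator behaves this way, not only a DataLoader iterator:

it = iter([1, 2, 3])
print(next(it), next(it), next(it))  # 1 2 3
next(it)  # raises StopIteration: the iterator has no items left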

Solution:

Since there is no data left to fetch, we can simply rebuild the iterator from the DataLoader whenever StopIteration is raised.

Take the following line in train.py:

inps, targets = next(self.batch_iterator)

and change it to:

try:
    inps, targets = next(self.batch_iterator)
except StopIteration:
    self.batch_iterator = iter(data.DataLoader(self.train_dataset, self.args.batch_size, shuffle=True, num_workers=self.args.num_workers, collate_fn=detection_collate))
    inps, targets = next(self.batch_iterator)

Problem solved.

An error occurred when installing the PyTorch 1.7 GPU version

Error when installing the 1.7 GPU version of PyTorch: torch has an invalid wheel, .dist-info directory not found

The reason is that the installed CUDA version is inconsistent with the one the wheel was built for.
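A quick sanity check (a small sketch of my own, assuming torch can already be imported) is to compare the CUDA version the PyTorch wheel was built against with what the machine actually provides:

import torch
print(torch.__version__)          # installed PyTorch version, e.g. 1.7.0
print(torch.version.cuda)         # CUDA version the wheel was built with
print(torch.cuda.is_available())  # False when the driver/toolkit does not match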

Solution 1:

Install the CPU version of torch:

pip install torch==1.7.0+cpu torchvision==0.8.1+cpu torchaudio===0.7.0 -f https://download.pytorch.org/whl/torch_stable.html

Solution 2:

conda install pytorch==1.7.0 torchvision==0.8.0 torchaudio==0.7.0 cudatoolkit=11.0 -c pytorch

[Solved] Apex Install Error: ERROR: Command errored out with exit status 1

First of all, the most important thing is the version correspondence problem:
My environment:
Linux: Ubuntu 16.04.7 LTS
Python = 3.7
PyTorch = 1.4
CUDA = 10.1
The versions of these four components must correspond to each other; the compatible combinations can be found online.

This is the official installation procedure:

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

If that does not work, replace the install command with the following:

python setup.py install --cuda_ext --cpp_ext

Errors may be reported here:
FileNotFoundError: [Errno 2] No such file or directory: ':/usr/local/cuda-10.1/bin/nvcc': ':/usr/local/cuda-10.1/bin/nvcc'

The problem here lies in the leading ":" in the path, so we just need to set CUDA_HOME explicitly to remove the redundant ":":

export CUDA_HOME=/usr/local/cuda-10.1

Then run the installation again. Some warnings will be printed; don't worry about them.

Python multiprocessing.Pool NameError: name is not defined

Question

The code from a paper ran into a problem similar to the following; after some effort I managed to solve it.

# import torch
# # print(torch.__version__)
# # print(torch.cuda.is_available())
import multiprocessing
cores = multiprocessing.cpu_count() // 2
def b(xs):
    t = temp_num[xs[0]] * 2
    return t

def a():
    global temp_num
    temp_num = [4, 5, 6]
    pool = multiprocessing.Pool(cores)
    xs = range(3)
    batch_result = pool.map(b, xs)
    print(batch_result)
def main():
    a()
if __name__ == '__main__':
    main()

Error: NameError: name 'temp_num' is not defined (raised in the worker processes)

Solution: the global temp_num assigned inside a() is not visible in the worker processes (with the spawn start method, e.g. on Windows, each worker re-imports the module and never runs that assignment), so pass temp_num to each worker explicitly.

import multiprocessing
cores = multiprocessing.cpu_count() // 2
def b(x):
    xs = x[0]
    temp_num = x[1]
    t = temp_num[xs] * 2
    return t
def a():
    global temp_num
    temp_num = [4, 5, 6]
    pool = multiprocessing.Pool(cores)
    xs = range(3)
    z = zip(xs, [temp_num] * 3)
    result = pool.map(b, z)
    print(result)

def main():
    a()

if __name__ == '__main__':
    main()
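An alternative sketch (my own addition, not from the original post): instead of zipping temp_num into every argument, hand it to each worker once through the Pool initializer.

import multiprocessing

def _init_worker(nums):
    # runs once in every worker process and stores the shared list as a module-level global
    global temp_num
    temp_num = nums

def b(i):
    return temp_num[i] * 2

def main():
    cores = max(1, multiprocessing.cpu_count() // 2)
    with multiprocessing.Pool(cores, initializer=_init_worker, initargs=([4, 5, 6],)) as pool:
        print(pool.map(b, range(3)))  # [8, 10, 12]

if __name__ == '__main__':
    main()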

[Solved] RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR

This problem took me a whole day to solve…

The code trained fine, but after switching to another machine it started throwing this error.

I first suspected CUDA 11, worrying that the CUDA version did not match the PyTorch version, and reinstalled it, but that did not solve the problem.

Problem phenomenon:

Traceback (most recent call last):
  File "train.py", line 100, in <module>
    main(opt)
  File "train.py", line 71, in main

……

  File "/home/xxxx/.local/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 395, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.

import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([1, 64, 80, 144], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(64, 64, kernel_size=[3, 3], padding=[1, 1], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().float()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

ConvolutionParams
    data_type = CUDNN_DATA_FLOAT
    padding = [1, 1, 0]
    stride = [1, 1, 0]
    dilation = [1, 1, 0]
    groups = 1
    deterministic = false
    allow_tf32 = true
input: TensorDescriptor 0xaa030590
    type = CUDNN_DATA_FLOAT
    nbDims = 4
    dimA = 1, 64, 80, 144,
    strideA = 737280, 11520, 144, 1,
output: TensorDescriptor 0xaa0d6560
    type = CUDNN_DATA_FLOAT
    nbDims = 4
    dimA = 1, 64, 80, 144,
    strideA = 737280, 11520, 144, 1,
weight: FilterDescriptor 0xaa0d0360
    type = CUDNN_DATA_FLOAT
    tensor_format = CUDNN_TENSOR_NCHW
    nbDims = 4
    dimA = 64, 64, 3, 3,
Pointer addresses:
    input: 0x567e50000
    output: 0x568120000
    weight: 0x550a2da00

Solution:

Save the repro snippet suggested in the error message to a file:

import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([1, 64, 80, 144], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(64, 64, kernel_size=[3, 3], padding=[1, 1], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().float()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

Run it with Python and it will reproduce the same error; then toggle the backend switches at the top one by one and rerun to see whether the error is still reported.

For my code, the following modification worked:

torch.backends.cudnn.benchmark = False

Then put this line in front of the problem code.
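A minimal placement sketch (train.py is the file named in the traceback; everything else is my assumption): set the switch before any convolution runs.

import torch
torch.backends.cudnn.benchmark = False  # disable the cuDNN autotuning that triggered the internal error

# ... build the model, move it to CUDA, and start training as usual ...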

A download error occurred while downloading data with PyTorch: urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED]>

The reason is that the SSL certificate needs to be verified but the verification fails. I tried several methods found online, which mainly come down to either skipping certificate verification or installing up-to-date certificates. Since I did not find a convenient way to install new certificates, I went with skipping verification: add the following code where the file is downloaded:

import ssl
ssl._create_default_https_context = ssl._create_unverified_context
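A sketch of where this goes (downloading torchvision pretrained weights is my assumed use case, not stated in the original post):

import ssl
ssl._create_default_https_context = ssl._create_unverified_context  # skip certificate verification

import torchvision
model = torchvision.models.resnet18(pretrained=True)  # the download should now succeed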

RuntimeError: Expected hidden[0] size (x, x, x), got (x, x, x)

The error in the title appeared while training a BiLSTM network.

Problem description: the initial hidden and cell states h0 and c0 of the BiLSTM are defined and passed to the network as its initial state, implemented by the following code:

output, (hn, cn) = self.bilstm(input, (h0, c0))

The network structure is as follows:

self.bilstm = nn.LSTM(
            input_size=self.input_size,
            hidden_size=self.hidden_size,
            num_layers=self.num_layers,
            bidirectional=True,
            bias=True,
            dropout=config.drop_out
        )

h0 and c0 are initialized with the dimensions given in the official documentation:

**h_0** of shape `(num_layers * num_directions, batch, hidden_size)`
**c_0** of shape `(num_layers * num_directions, batch, hidden_size)`

In my BiLSTM network, the parameters are defined as follows:

num_layers: 2

num_directions: 2

batch: 4

seq_len: 10

input_size: 300

hidden_size: 100 

Then, according to the official documentation, the h0/c0 dimensions should be: (2 * 2, 4, 100) = (4, 4, 100).

However, according to the error screenshot at the beginning of the article, the expected dimension of the initial hidden state is (4, 10, 100), which made me doubt whether the dimensions specified in the official documentation are correct.

Obviously the official documentation cannot be wrong, and the hidden state dimensions I had used with BiLSTM, RNN and BiGRU in the past matched the documentation, so at first I did not know where to start.

So I re-examined the network structure and found that an important parameter, batch_first, was missing. First, let's look at all the parameters accepted by nn.LSTM:

Args:
        input_size: The number of expected features in the input `x`
        hidden_size: The number of features in the hidden state `h`
        num_layers: Number of recurrent layers. E.g., setting ``num_layers=2``
            would mean stacking two LSTMs together to form a `stacked LSTM`,
            with the second LSTM taking in outputs of the first LSTM and
            computing the final results. Default: 1
        bias: If ``False``, then the layer does not use bias weights `b_ih` and `b_hh`.
            Default: ``True``
        batch_first: If ``True``, then the input and output tensors are provided
            as (batch, seq, feature). Default: ``False``
        dropout: If non-zero, introduces a `Dropout` layer on the outputs of each
            LSTM layer except the last layer, with dropout probability equal to
            :attr:`dropout`. Default: 0
        bidirectional: If ``True``, becomes a bidirectional LSTM. Default: ``False``

The batch_first parameter puts the batch dimension first during training, i.e. the input shape becomes (batch_size, seq_len, embedding_dim); without batch_first=True, the expected shape is (seq_len, batch_size, embedding_dim).

Because I had skipped my midday break, I simply forgot to add this important parameter, which produced the error about the initial hidden state dimensions; after adding batch_first=True everything runs smoothly.

The modified network structure is as follows:

self.bilstm = nn.LSTM(
            input_size=self.input_size,
            hidden_size=self.hidden_size,
            num_layers=self.num_layers,
            batch_first=True,
            bidirectional=True,
            bias=True,
            dropout=config.drop_out
        )

 

Extension: when using RNNs and their variants, if we want to supply an initial hidden state, its dimensions must be exactly the officially specified ones, i.e.

(num_layers * num_directions, batch, hidden_size)

At the same time, be sure to set batch_first=True where needed. Note that even when batch_first=True is set, the dimensions of h0, c0, hn and cn remain (num_layers * num_directions, batch, hidden_size); the official documentation does not call this out explicitly, so be careful!

Likewise, when the dimensions of hn and cn come out wrong, check whether the batch_first parameter is set; this check applies to RNNs and all their variants!
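A minimal sketch of my own, using the shapes from this post, to show that the hidden/cell state shape stays (num_layers * num_directions, batch, hidden_size) even with batch_first=True, while the input and output use (batch, seq, feature):

import torch
import torch.nn as nn

num_layers, num_directions = 2, 2
batch, seq_len, input_size, hidden_size = 4, 10, 300, 100

bilstm = nn.LSTM(input_size=input_size, hidden_size=hidden_size,
                 num_layers=num_layers, bidirectional=True, batch_first=True)

x = torch.randn(batch, seq_len, input_size)  # (batch, seq, feature) because batch_first=True
h0 = torch.zeros(num_layers * num_directions, batch, hidden_size)
c0 = torch.zeros(num_layers * num_directions, batch, hidden_size)

output, (hn, cn) = bilstm(x, (h0, c0))
print(output.shape)  # torch.Size([4, 10, 200])
print(hn.shape)      # torch.Size([4, 4, 100])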

RuntimeError: Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) should be the same

RuntimeError: Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) should be the same

This problem occurs because the input tensor is on the CPU while the model has been moved to CUDA.

Solution: move the input tensor to CUDA, or move the model to the CPU.

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = model.to(device)
img = img.to(device)

output = model(img)

Or:

model = model.cuda()
img = img.cuda()

RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

The error occurred when training a model on a GPU server with CUDA 10.0 and PyTorch 1.2; the error prompt was CUDNN_STATUS_EXECUTION_FAILED.

The cause is that the CUDA version does not correspond to the PyTorch version, so CUDA cannot accelerate model training and execution fails.

When installing PyTorch, we need to pick the PyTorch/CUDA version pairing from the official website. In my local environment the model trained fine with CUDA 10.0 and PyTorch 1.9, so I reinstalled PyTorch 1.9 on the server and it ran successfully.

Symptom: when the CUDA version does not correspond to the PyTorch version, the most obvious sign is that GPU memory usage does not change while the program runs. Normally, when the data and model are loaded onto the GPU, memory usage increases significantly; when the versions do not correspond, it barely changes. The program also becomes very slow when loading the model, sometimes failing to load it into GPU memory even after 20 minutes.
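A small check of my own for this symptom: with a matching CUDA/PyTorch pair, allocated GPU memory grows as soon as a tensor is placed on the device (nvidia-smi shows the same increase).

import torch
print(torch.__version__, torch.version.cuda)        # installed PyTorch and the CUDA it was built with
x = torch.randn(1024, 1024, device='cuda')          # put something on the GPU
print(torch.cuda.memory_allocated() / 1024, 'KiB')  # should be clearly non-zero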

Pytorch: error message with chunks of 0 [How to Solve]

File "D:/Codes/code/Python Project/group_reid-master/group_reid-master/main_group_gcn_siamese_part_half_fulltest_sink.py", line 348, in train_gcn
    loss.backward()
  File "D:\Codes\Anaconda3\envs\pytorch_gpu\lib\site-packages\torch\tensor.py", line 185, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "D:\Codes\Anaconda3\envs\pytorch_gpu\lib\site-packages\torch\autograd\__init__.py", line 127, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: chunk expects `chunks` to be greater than 0, got: 0
Exception raised from chunk at ..\aten\src\ATen\native\TensorShape.cpp:496 (most recent call first):

As shown above, I kept getting the error "chunk expects chunks to be greater than 0". At first I was puzzled, and I could not find anyone online with a similar situation. Tracing the error, I found it was raised during the backward pass of the loss (below). Since loss.backward() is just a library call, it seemed impossible for it to raise such an error, so I debugged directly on the server. The error message differs between PyTorch versions; we finally located the cause under PyTorch 1.1.

            loss.backward()

It was caused by a dimension mismatch when building the repeated tensors:

env11_junk1 = env11.squeeze().unsqueeze(0).unsqueeze(0).repeat((5-x1_valid.shape[0]), parts, 1)
env22_junk2 = env22.squeeze().unsqueeze(0).unsqueeze(0).repeat((5-x2_valid.shape[0]), parts, 1)
env11 = env11.squeeze().unsqueeze(0).unsqueeze(0).repeat(x1_valid.shape[0], parts, 1)
env22 = env22.squeeze().unsqueeze(0).unsqueeze(0).repeat(x2_valid.shape[0], parts, 1)

  # calculate within graph and inter graph message
h_k1 = torch.cat((self.W_x(x1[i, :sample_size1, :]), self.W_neib(x_neib1), self.W_relative(mu1), self.W_env(env11)), 2).unsqueeze(0)  
h_k_junk1 = torch.cat((self.W_x(x1[i, sample_size1:, :]), self.W_x(x1[i, sample_size1:, :]), self.W_x(x1[i, sample_size1:, :]),self.W_env(env11_junk1)), 2).unsqueeze(0)

h_k2 = torch.cat((self.W_x(x2[i, :sample_size2, :]), self.W_neib(x_neib2), self.W_relative(mu2), self.W_env(env22)), 2).unsqueeze(0)
h_k_junk2 = torch.cat((self.W_x(x2[i, sample_size2:, :]), self.W_x(x2[i, sample_size2:, :]), self.W_x(x2[i, sample_size2:, :]),self.W_env(env22_junk2)), 2).unsqueeze(0)                       

In my code the squeeze and unsqueeze calls are redundant (a permute would do). I intended to repeat env11 along the first dimension as many times as the first dimension of x1, but in the actual dataset the first dimension of x1 can be 0. So when a repeat() count is 0, an error ends up being reported (and the number of repeat counts must not be fewer than the tensor's number of dimensions). Since the reported error gave no hint of this, I wasted half a day on the problem, so I am recording it here.
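A hedged guard sketch (my own addition; env11, parts and x1_valid mirror the names above, but the values are made up): handle the empty case explicitly instead of calling repeat() with a count of 0.

import torch

env11 = torch.randn(7)   # stand-in for the env11 feature vector
parts = 6                # stand-in for the number of parts
n = 0                    # x1_valid.shape[0] can be 0 in the real dataset

if n > 0:
    env11_rep = env11.unsqueeze(0).unsqueeze(0).repeat(n, parts, 1)
else:
    # build an explicitly empty tensor (or skip the sample) instead of repeating 0 times
    env11_rep = env11.new_zeros(0, parts, env11.shape[-1])
print(env11_rep.shape)   # torch.Size([0, 6, 7]) in the empty case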