Tag Archives: pytorch

SSLCertVerificationError when downloading using Python: [SSL: CERTIFICATE_VERIFY_FAILED] error

When downloading a dataset from the PyTorch official website, an SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] error is reported. The following is my solution.

Error reason: when urllib opens an HTTPS link, it verifies the SSL certificate. When the target website uses a self-signed certificate, urllib raises a URLError.

Solution: cancel certificate validation globally:

import ssl
ssl._create_default_https_context = ssl._create_unverified_context
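
For example, a minimal sketch (the MNIST download is illustrative, not from the original post) placing the workaround before a torchvision download:

import ssl
# Globally disable certificate verification (only for downloads you trust)
ssl._create_default_https_context = ssl._create_unverified_context

from torchvision import datasets

# The download that previously raised CERTIFICATE_VERIFY_FAILED should now succeed
train_set = datasets.MNIST(root="./data", train=True, download=True)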

Reference article link: https://blog.csdn.net/yixieling4397/article/details/79861379

Normalize error: TypeError: Input tensor should be a float tensor…

The following code reports this error when normalizing a tensor:

from torchvision import transforms
import numpy as np
import torchvision
import torch

data = np.random.randint(0, 255, size=12)  # randint returns an integer (int64) array
img = data.reshape(2, 2, 3)


print(img)
print("*"*100)
transform1 = transforms.Compose([
    transforms.ToTensor(), # range [0, 255] -> [0.0,1.0]
    transforms.Normalize(mean = (10,10,10), std = (1,1,1)),
    ]
)
# img = img.astype('float')
norm_img = transform1(img) 
print(norm_img)

The fix is to uncomment the img = img.astype('float') line above. np.random.randint returns an integer array, so ToTensor() produces a non-float tensor, which Normalize rejects; setting the element type to float resolves the error.
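
Alternatively (a sketch of my own, not from the original post), cast the array to uint8 so that ToTensor performs its [0, 255] -> [0.0, 1.0] scaling and already returns a float tensor:

img_uint8 = img.astype(np.uint8)  # ToTensor scales uint8 HWC arrays to float [0.0, 1.0]
norm_img = transform1(img_uint8)
print(norm_img)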

Using next(iter(data.DataLoader())) reports a StopIteration error

A StopIteration error is reported when using next(iter(data.DataLoader())). This happens when next() is called on an iterator that has already been exhausted: once the DataLoader has served all of its batches, the iterable is complete, so the next call raises StopIteration and the loop exits.

Solution:

Since the iterator has no data left, we can catch the exception and rebuild the DataLoader iterator.

In train.py, change this line:

inps, targets = next(self.batch_iterator)

to:

try:
    inps, targets = next(self.batch_iterator)
except StopIteration:
    self.batch_iterator = iter(data.DataLoader(self.train_dataset, self.args.batch_size, shuffle=True, num_workers=self.args.num_workers, collate_fn=detection_collate))
    inps, targets = next(self.batch_iterator)

Problem solved.
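
A self-contained sketch of the same pattern with a toy dataset (the dataset and loader are illustrative, not the original train.py):

import torch
from torch.utils import data

dataset = data.TensorDataset(torch.arange(10).float(), torch.arange(10))
loader = data.DataLoader(dataset, batch_size=4, shuffle=True)
batch_iterator = iter(loader)

for step in range(5):  # deliberately more steps than one epoch contains
    try:
        inps, targets = next(batch_iterator)
    except StopIteration:
        batch_iterator = iter(loader)  # the iterator is exhausted; start a fresh epoch
        inps, targets = next(batch_iterator)
    print(step, inps.shape)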

An error occurred when installing the PyTorch 1.7 GPU version

Error when installing the 1.7 GPU version of PyTorch: torch has an invalid wheel, .dist-info directory not found

The reason is that the CUDA versions are inconsistent.

Solution 1:

Install the CPU version of torch:

pip install torch==1.7.0+cpu torchvision==0.8.1+cpu torchaudio===0.7.0 -f https://download.pytorch.org/whl/torch_stable.html

Solution 2:

conda install pytorch==1.7.0 torchvision==0.8.0 torchaudio==0.7.0 cudatoolkit=11.0 -c pytorch
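
Either way, a quick sanity check after installation (a minimal sketch):

import torch

print(torch.__version__)          # expect 1.7.0
print(torch.cuda.is_available())  # True only if the CUDA build matches your driver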

[Solved] Apex Install Error: ERROR: Command errored out with exit status 1

First of all, the most important thing is the version correspondence problem.
My environment:
Linux Ubuntu 16.04.7 LTS
python = 3.7
pytorch = 1.4
CUDA = 10.1
The versions of the above four must correspond to each other; the compatibility tables can be found online.

This is the official installation procedure:

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

If that does not work, replace the installation statement with the following:

python setup.py install --cuda_ext --cpp_ext

Errors may be reported here:

FileNotFoundError: [Errno 2] No such file or directory: ':/usr/local/cuda-10.1/bin/nvcc': ':/usr/local/cuda-10.1/bin/nvcc'

The problem here lies in the leading ":", so we just need to set CUDA_HOME explicitly:

export CUDA_HOME=/usr/local/cuda-10.1

This removes the redundant ":". Run the installation again; it will print some warnings, but don't worry about them.

Python multiprocessing.Pool NameError: name is not defined

Problem

The code for a paper ran into a problem similar to the following. After some effort, it can be considered solved.

# import torch
# # print(torch.__version__)
# # print(torch.cuda.is_available())
import multiprocessing

cores = multiprocessing.cpu_count() // 2

def b(xs):
    # temp_num is a global that only exists in the parent process after a()
    # runs; spawned worker processes re-import the module and never see it,
    # so calling b there raises NameError
    t = temp_num[xs[0]] * 2
    return t

def a():
    global temp_num
    temp_num = [4, 5, 6]
    pool = multiprocessing.Pool(cores)
    xs = range(3)
    batch_result = pool.map(b, xs)
    print(batch_result)

def main():
    a()

if __name__ == '__main__':
    main()

Error: NameError: name 'temp_num' is not defined (raised inside the worker processes).

Solution

import multiprocessing

cores = multiprocessing.cpu_count() // 2

def b(x):
    # Unpack the index and the list, which are now passed in explicitly
    xs = x[0]
    temp_num = x[1]
    t = temp_num[xs] * 2
    return t

def a():
    temp_num = [4, 5, 6]
    pool = multiprocessing.Pool(cores)
    xs = range(3)
    # Pair each index with the list so every worker receives the data it needs
    z = zip(xs, [temp_num] * 3)
    result = pool.map(b, z)
    print(result)

def main():
    a()

if __name__ == '__main__':
    main()
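
An alternative sketch (my own, not from the original post): pass the data through a Pool initializer so that every worker process defines its own temp_num at startup:

import multiprocessing

def init_worker(nums):
    # Runs once in each worker process, so the global exists there
    global temp_num
    temp_num = nums

def b(xs):
    return temp_num[xs] * 2

def main():
    nums = [4, 5, 6]
    with multiprocessing.Pool(2, initializer=init_worker, initargs=(nums,)) as pool:
        print(pool.map(b, range(3)))  # [8, 10, 12]

if __name__ == '__main__':
    main()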

[Solved] RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR

This problem took a whole day to solve….

The code trained fine, but after switching to another machine it reported this error.

I suspected CUDA 11 and worried that the CUDA version did not match the PyTorch version, so I reinstalled it, but that did not solve the problem.

Error message:

Traceback (most recent call last):
  File "train.py", line 100, in <module>
    main(opt)
  File "train.py", line 71, in main

……

  File "/home/xxxx/.local/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 395, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.

import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([1, 64, 80, 144], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(64, 64, kernel_size=[3, 3], padding=[1, 1], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().float()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

ConvolutionParams
    data_type = CUDNN_DATA_FLOAT
    padding = [1, 1, 0]
    stride = [1, 1, 0]
    dilation = [1, 1, 0]
    groups = 1
    deterministic = false
    allow_tf32 = true
input: TensorDescriptor 0xaa030590
    type = CUDNN_DATA_FLOAT
    nbDims = 4
    dimA = 1, 64, 80, 144,
    strideA = 737280, 11520, 144, 1,
output: TensorDescriptor 0xaa0d6560
    type = CUDNN_DATA_FLOAT
    nbDims = 4
    dimA = 1, 64, 80, 144,
    strideA = 737280, 11520, 144, 1,
weight: FilterDescriptor 0xaa0d0360
    type = CUDNN_DATA_FLOAT
    tensor_format = CUDNN_TENSOR_NCHW
    nbDims = 4
    dimA = 64, 64, 3, 3,
Pointer addresses:
    input: 0x567e50000
    output: 0x568120000
    weight: 0x550a2da00

Solution:

Save the repro snippet from CUDA's error message to a file:

import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([1, 64, 80, 144], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(64, 64, kernel_size=[3, 3], padding=[1, 1], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().float()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

Running the file with Python reproduces the same error. Then toggle the backend switches one at a time and rerun to check whether it still reports an error.

For my code, the following modification works:

torch.backends.cudnn.benchmark = False

Then put this line before the problem code.
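
A minimal sketch of the placement, reusing the repro snippet above with the switch flipped:

import torch

# Disable cuDNN's benchmark autotuner before the failing convolution runs
torch.backends.cudnn.benchmark = False

data = torch.randn([1, 64, 80, 144], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(64, 64, kernel_size=3, padding=1).cuda().float()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()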

A download error occurred while downloading data with PyTorch: urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED]>

The reason is that the SSL certificate needs to be verified, but the verification failed. I tried several methods from the Internet, mainly cancelling certificate verification or installing the latest certificate. Since I did not find a suitable way to install the latest certificate, I used the method of cancelling certificate verification. The specific operation is to add the following code where the file is downloaded:

import ssl
ssl._create_default_https_context = ssl._create_unverified_context

RuntimeError: Expected hidden[0] size (x, x, x), got (x, x, x)

This error appeared when training a BiLSTM network.

Problem description: the initial hidden states h0 and c0 of the BiLSTM are defined and passed to the network as its initial state, implemented by the following code:

output, (hn, cn) = self.bilstm(input, (h0, c0))

The network structure is as follows:

self.bilstm = nn.LSTM(
            input_size=self.input_size,
            hidden_size=self.hidden_size,
            num_layers=self.num_layers,
            bidirectional=True,
            bias=True,
            dropout=config.drop_out
        )

h0 and c0 are initialized with the dimensions defined in the official documentation:

**h_0** of shape `(num_layers * num_directions, batch, hidden_size)`
**c_0** of shape `(num_layers * num_directions, batch, hidden_size)`

In this BiLSTM network, the parameters are defined as follows:

num_layers: 2

num_directions: 2

batch: 4

seq_len: 10

input_size: 300

hidden_size: 100 

Then, according to the definition in the official documentation, the h0 and c0 dimensions should be: (2 * 2, 4, 100) = (4, 4, 100).

However, according to the error at the beginning of the article, the expected hidden-state dimension is (4, 10, 100), which made me doubt whether the dimension specified in the official documentation is correct.

Obviously, the official documentation cannot be wrong, and the hidden-state dimensions I had used with BiLSTM, RNN, and BiGRU in the past all matched the official spec, so I did not know where to start.

Therefore, I re-examined the network structure and found that an important parameter, batch_first, was missing. First, let's look at all the parameters accepted by nn.LSTM:

Args:
        input_size: The number of expected features in the input `x`
        hidden_size: The number of features in the hidden state `h`
        num_layers: Number of recurrent layers. E.g., setting ``num_layers=2``
            would mean stacking two LSTMs together to form a `stacked LSTM`,
            with the second LSTM taking in outputs of the first LSTM and
            computing the final results. Default: 1
        bias: If ``False``, then the layer does not use bias weights `b_ih` and `b_hh`.
            Default: ``True``
        batch_first: If ``True``, then the input and output tensors are provided
            as (batch, seq, feature). Default: ``False``
        dropout: If non-zero, introduces a `Dropout` layer on the outputs of each
            LSTM layer except the last layer, with dropout probability equal to
            :attr:`dropout`. Default: 0
        bidirectional: If ``True``, becomes a bidirectional LSTM. Default: ``False``

The batch_first parameter puts the batch in the first dimension during training, i.e., the input data dimension is (batch_size, seq_len, embedding_dim). Without batch_first=True, the expected input dimension is (seq_len, batch_size, embedding_dim).

Because I had skipped my midday break, I absent-mindedly forgot to add this important parameter. As a result, the input of shape (4, 10, 300) was interpreted as seq_len=4, batch=10, so the network expected a hidden state of size (4, 10, 100), hence the error. After adding batch_first=True, the code runs smoothly.

The modified network structure is as follows:

self.bilstm = nn.LSTM(
            input_size=self.input_size,
            hidden_size=self.hidden_size,
            num_layers=self.num_layers,
            batch_first=True,
            bidirectional=True,
            bias=True,
            dropout=config.drop_out
        )

 

Extension: when using an RNN or any of its variants, if we want to supply an initial hidden state, its dimension must be the officially specified one, i.e.

(num_layers * num_directions, batch, hidden_size)

At the same time, be sure to set batch_first=True when your input is batch-first. The official documentation does not emphasize that even when batch_first=True, the dimensions of h0, c0, hn, and cn remain (num_layers * num_directions, batch, hidden_size), so be careful!

Likewise, when the dimensions of hn and cn are wrong, check whether the batch_first parameter is set. This method applies to RNNs and all their variant networks!
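
A minimal sketch (using the parameter values from this post) that verifies the shapes:

import torch
import torch.nn as nn

num_layers, num_directions = 2, 2
batch, seq_len = 4, 10
input_size, hidden_size = 300, 100

bilstm = nn.LSTM(input_size=input_size, hidden_size=hidden_size,
                 num_layers=num_layers, batch_first=True, bidirectional=True)

x = torch.randn(batch, seq_len, input_size)  # batch-first input
h0 = torch.zeros(num_layers * num_directions, batch, hidden_size)
c0 = torch.zeros(num_layers * num_directions, batch, hidden_size)

output, (hn, cn) = bilstm(x, (h0, c0))
print(output.shape)  # torch.Size([4, 10, 200])
print(hn.shape)      # torch.Size([4, 4, 100]), still (num_layers * num_directions, batch, hidden_size)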