Tag Archives: Deep learning

python multiprocessing.pool NameError: name is not defined

Question

While running the code for a paper I hit a problem similar to the following; after some effort it can be considered solved.

# import torch
# # print(torch.__version__)
# # print(torch.cuda.is_available())
import multiprocessing
cores = multiprocessing.cpu_count() // 2
def b(xs):

    t = temp_num[xs[0]] * 2  # temp_num is a global created in a(); worker processes started with "spawn" never see it, hence the NameError
    return t
def a():
    global temp_num
    temp_num = [4, 5, 6]
    pool = multiprocessing.Pool(cores)
    xs = range(3)
    batch_result = pool.map(b,xs)
    print(batch_result)
def main():
    a()
if __name__ == '__main__':
    main()

Error: NameError: name 'temp_num' is not defined

Solution:

The global temp_num is created inside a() in the parent process, but worker processes started with the "spawn" method (the default on Windows and on newer macOS) re-import the module and never run a(), so the name does not exist for them. The fix is to pass the data to each task explicitly:

import multiprocessing

cores = multiprocessing.cpu_count() // 2

def b(x):
    # each task is now a (index, temp_num) tuple, so the worker no longer
    # depends on a global variable existing in its own process
    xs, temp_num = x
    t = temp_num[xs] * 2
    return t

def a():
    temp_num = [4, 5, 6]
    pool = multiprocessing.Pool(cores)
    xs = range(3)
    # pair every index with the list it needs
    z = zip(xs, [temp_num] * 3)
    result = pool.map(b, z)
    print(result)

def main():
    a()

if __name__ == '__main__':
    main()
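
Another way to fix the same NameError, without packing the data into every task, is to give the Pool an initializer that recreates the global in each worker process. This is only a minimal sketch under the assumptions above (the init_worker helper is my own name, not from the original post):

import multiprocessing

cores = multiprocessing.cpu_count() // 2

def init_worker(shared_list):
    # runs once in every worker process and recreates the global there
    global temp_num
    temp_num = shared_list

def b(i):
    return temp_num[i] * 2

def a():
    temp_num = [4, 5, 6]
    # initializer/initargs make temp_num available inside every worker,
    # even when the processes are started with the "spawn" method
    with multiprocessing.Pool(cores, initializer=init_worker, initargs=(temp_num,)) as pool:
        print(pool.map(b, range(3)))

if __name__ == '__main__':
    a()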

[Solved] RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR

This problem took a whole day to solve....

The code trained fine on one machine; after switching to a different machine it started throwing this error.

I suspected CUDA 11 and worried that the CUDA version did not match the PyTorch version, so I reinstalled it, but that did not solve the problem.

The error:

Traceback (most recent call last):
  File "train.py", line 100, in <module>
    main(opt)
  File "train.py", line 71, in main

……

  File "/home/xxxx/.local/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 395, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.

import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([1, 64, 80, 144], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(64, 64, kernel_size=[3, 3], padding=[1, 1], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().float()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

ConvolutionParams
    data_type = CUDNN_DATA_FLOAT
    padding = [1, 1, 0]
    stride = [1, 1, 0]
    dilation = [1, 1, 0]
    groups = 1
    deterministic = false
    allow_tf32 = true
input: TensorDescriptor 0xaa030590
    type = CUDNN_DATA_FLOAT
    nbDims = 4
    dimA = 1, 64, 80, 144,
    strideA = 737280, 11520, 144, 1,
output: TensorDescriptor 0xaa0d6560
    type = CUDNN_DATA_FLOAT
    nbDims = 4
    dimA = 1, 64, 80, 144,
    strideA = 737280, 11520, 144, 1,
weight: FilterDescriptor 0xaa0d0360
    type = CUDNN_DATA_FLOAT
    tensor_format = CUDNN_TENSOR_NCHW
    nbDims = 4
    dimA = 64, 64, 3, 3,
Pointer addresses:
    input: 0x567e50000
    output: 0x568120000
    weight: 0x550a2da00

Solution:

Save the repro snippet that the error message prints (the import torch ... torch.cuda.synchronize() block above) into a standalone .py file.


Run that file with Python; it should report the same error. Then toggle the backend switches at the top of the snippet one at a time and rerun to see whether the error still appears.

For my code, the following change did the trick:

torch.backends.cudnn.benchmark = False

Then put this line in front of the code that triggers the error.
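
Concretely, this just means disabling cuDNN's autotuner before any convolution is executed, for example near the top of the training script. A minimal sketch (the surrounding model and data are placeholders, not from the original project):

import torch

# disable cuDNN autotuning before building/running the model;
# this is the switch that made the INTERNAL_ERROR disappear here
torch.backends.cudnn.benchmark = False

model = torch.nn.Conv2d(64, 64, kernel_size=3, padding=1).cuda()
x = torch.randn(1, 64, 80, 144, device='cuda')
out = model(x)
out.mean().backward()
torch.cuda.synchronize()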

NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver

Problem: the graphics card driver was installed normally on Ubuntu 18.04, but after the system is rebooted the driver can no longer be used; running nvidia-smi gives the error above.

Solution:

Just execute two commands:

sudo apt-get install dkms
sudo dkms install -m nvidia -v 440.44    (440.44 is the driver version number)

Use the command ll /usr/src/ to find the nvidia-440.44/ folder there; the version number varies from machine to machine.

Then restart the computer and the driver works again.

“The CC version check failed” is reported when Ubuntu installs NVIDIA graphics card driver

I consulted the answers of many bloggers, but they had various problems or were cumbersome. After working it out myself, I recommend this simpler approach:

The cause of this problem is that the driver is relatively new, and the GCC version used to build the system kernel does not match the compiler's default GCC version. First run gcc -v to check the current default GCC version. If it is relatively old and you have not updated GCC yet, upgrade GCC and g++: sudo apt-get install gcc-9 g++-9 -y

Then list the installed GCC versions: ls /usr/bin/gcc*

Generally, GCC has now been upgraded. If there are multiple versions in this directory, point the gcc and g++ symlinks at the new version:

cd /usr/bin
sudo ln -sf gcc-9 gcc
sudo ln -sf g++-9 g++

Finally, refresh the environment:
source /etc/environment

Now run gcc -v again to check the compiler's GCC version. If it matches the version the kernel expects, you can install the graphics card driver following the normal steps.

[Solved] Keil5 Error: error 65: access violation at 0x08040000 : no ‘execute/read‘ permission

1. Situation

When using Keil5, the project builds without problems, but simulation debugging gets stuck at the SystemInit() function: it is not stuck inside a loop, it simply cannot step into the function.

2. Solution

It may be a compiler version mismatch. Click the magic wand to open the Target options and select Version 5 under ARM Compiler; this may solve the problem.

Tensorflow Run Error, Stuck Interface, or Error: Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR

When running a more complex TensorFlow deep learning model, the interface can easily freeze.
If the machine has only a single (discrete) graphics card (for example, the CPU model ends with an "F", so there is no integrated GPU), installing OpenGL and rebooting before running the code can help.
If "Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR" is reported, it is caused by insufficient video memory; add

#Used to limit the use of video memory
config = tf.compat.v1.ConfigProto(gpu_options=tf.compat.v1.GPUOptions(allow_growth=True))
sess = tf.compat.v1.Session(config=config)
###############

This solves the problem.
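
On TensorFlow 2.x there is also a native way to request on-demand GPU memory allocation instead of going through a compat.v1 session; a small sketch, assuming TensorFlow 2.x is installed:

import tensorflow as tf

# must run at program start, before any GPU has been initialized
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)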

RuntimeError: Expected hidden[0] size (x, x, x), got (x, x, x)

Let's start with the error itself (the original post showed a screenshot of it here): the RuntimeError in the title appeared while training a BiLSTM network.

Problem description: the initial hidden states h0 and c0 of the BiLSTM are defined and passed to the network as its initial state, implemented with the following code:

output, (hn, cn) = self.bilstm(input, (h0, c0))

  The network structure is as follows:

self.bilstm = nn.LSTM(
            input_size=self.input_size,
            hidden_size=self.hidden_size,
            num_layers=self.num_layers,
            bidirectional=True,
            bias=True,
            dropout=config.drop_out
        )

h0 and c0 are initialized with the shapes specified in the official documentation:

**h_0** of shape `(num_layers * num_directions, batch, hidden_size)`
**c_0** of shape `(num_layers * num_directions, batch, hidden_size)`

In bilstm network, the parameters are defined as follows:

num_layers: 2

num_directions: 2

batch: 4

seq_len: 10

input_size: 300

hidden_size: 100 

Then according to the official definition, the h0 and c0 shapes should be: (2 * 2, 4, 100) = (4, 4, 100).

However, according to the error screenshot at the beginning of the article, the expected hidden-state shape is (4, 10, 100), which made me doubt whether the shape specified in the official documentation is correct.

Obviously the official documentation cannot be wrong, and the hidden-state shapes I had used before with BiLSTM, RNN and BiGRU all matched the official specification, so I did not know where to start.

Re-examining the network structure, I found that an important parameter, batch_first, was missing. First, let's look at all the parameters accepted by nn.LSTM:

Args:
        input_size: The number of expected features in the input `x`
        hidden_size: The number of features in the hidden state `h`
        num_layers: Number of recurrent layers. E.g., setting ``num_layers=2``
            would mean stacking two LSTMs together to form a `stacked LSTM`,
            with the second LSTM taking in outputs of the first LSTM and
            computing the final results. Default: 1
        bias: If ``False``, then the layer does not use bias weights `b_ih` and `b_hh`.
            Default: ``True``
        batch_first: If ``True``, then the input and output tensors are provided
            as (batch, seq, feature). Default: ``False``
        dropout: If non-zero, introduces a `Dropout` layer on the outputs of each
            LSTM layer except the last layer, with dropout probability equal to
            :attr:`dropout`. Default: 0
        bidirectional: If ``True``, becomes a bidirectional LSTM. Default: ``False``

The batch_first parameter puts the batch dimension first during training, i.e. the input shape becomes (batch_size, seq_len, embedding_dim); without batch_first=True, the expected input shape is (seq_len, batch_size, embedding_dim).

Because I had skipped my midday break, I was groggy and forgot to add this important parameter, which is why the initial hidden-state dimensions did not match; after adding batch_first=True the code ran smoothly.

The modified network structure is as follows:

self.bilstm = nn.LSTM(
            input_size=self.input_size,
            hidden_size=self.hidden_size,
            num_layers=self.num_layers,
            batch_first=True,
            bidirectional=True,
            bias=True,
            dropout=config.drop_out
        )

 

Extension: when using an RNN or any of its variants, if you want to supply an initial hidden state, its shape must be exactly the officially specified one, i.e.

(num_layers * num_directions, batch, hidden_size)

and, when your data is batch-first, be sure to also set batch_first=True. Note that even with batch_first=True, the shapes of h0, c0, hn and cn remain (num_layers * num_directions, batch, hidden_size); batch_first only changes the input and output layout, so be careful!

Likewise, when the shapes of hn and cn come out wrong, check whether the batch_first parameter is set; this check applies to RNNs and all their variants.
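
To make the shapes concrete, here is a minimal sketch using the parameter values listed above (num_layers=2, bidirectional, batch=4, seq_len=10, input_size=300, hidden_size=100); it is illustrative only, not the original training code:

import torch
import torch.nn as nn

num_layers, num_directions = 2, 2
batch, seq_len = 4, 10
input_size, hidden_size = 300, 100

bilstm = nn.LSTM(input_size=input_size, hidden_size=hidden_size,
                 num_layers=num_layers, batch_first=True, bidirectional=True)

x = torch.randn(batch, seq_len, input_size)   # (batch, seq_len, input_size) because batch_first=True
# the hidden-state shapes are NOT affected by batch_first
h0 = torch.zeros(num_layers * num_directions, batch, hidden_size)
c0 = torch.zeros(num_layers * num_directions, batch, hidden_size)

output, (hn, cn) = bilstm(x, (h0, c0))
print(output.shape)   # torch.Size([4, 10, 200]) = (batch, seq_len, num_directions * hidden_size)
print(hn.shape)       # torch.Size([4, 4, 100])  = (num_layers * num_directions, batch, hidden_size)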

RuntimeError: Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) should be the same

RuntimeError: Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) should be the same

This error occurs because the input tensor is on the CPU while the model has been moved to CUDA.

Solution: move the input tensor to CUDA, or move the model to the CPU.

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = model.to(device)
img = img.to(device)

output = model(img)

Or:

model = model.cuda()
img = img.cuda()

RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

The error occurred while training a model on a GPU server with CUDA 10.0 and PyTorch 1.2; the error message was CUDNN_STATUS_EXECUTION_FAILED.

The cause is that the CUDA version does not match the PyTorch version, so CUDA cannot actually accelerate model training and execution fails.

When installing PyTorch, we need to pick the PyTorch build that corresponds to our CUDA version from the official website. My local training environment is CUDA 10.0 with PyTorch 1.9, so I reinstalled PyTorch 1.9 on the server and it ran successfully.

How a CUDA/PyTorch version mismatch shows up: the most obvious symptom is that GPU memory usage does not change while the program runs. When data and the model are loaded normally, GPU memory usage increases noticeably; when the versions do not match, it barely changes. Loading the model also becomes very slow, and the model may still not be loaded into GPU memory after 20 minutes.
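
A quick way to check which CUDA/cuDNN build the installed PyTorch actually expects (a small sanity-check sketch, not from the original post):

import torch

print(torch.__version__)               # installed PyTorch version
print(torch.version.cuda)              # CUDA version this PyTorch build was compiled for
print(torch.cuda.is_available())       # whether the local CUDA driver/runtime can be used
print(torch.backends.cudnn.version())  # cuDNN version bundled with this build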

[Solved] AttributeError: module ‘tensorflow._api.v2.train‘ has no attribute ‘AdampOptimizer‘

Error Message: AttributeError: module ‘tensorflow._api.v2.train’ has no attribute ‘AdampOptimizer’
Tensorflow 2.6.0:

>>> import tensorflow as tf
>>> tf.__version__
'2.6.0'

Error code snippet:

model.compile(optimizer=tf.train.AdampOptimizer(0.001),
              loss='mse',
              metrics=['accuracy'])

Replace with:

from tensorflow.keras.optimizers import Adam
model.compile(optimizer=Adam(learning_rate=0.001), loss='mse', metrics=['accuracy'])

It can also be changed as follows:

model.compile(optimizer='adam',
              loss='mse',
              metrics=['accuracy'])

Both approaches work.

RuntimeError: Default process group has not been initialized, please make sure to call init_process_

A problem encountered when using the mmsegmentation framework:

 File "C:\software\Anaconda3\envs\python36\lib\site-packages\torch\distributed\distributed_c10d.py", line 347, in _get_default_group
    raise RuntimeError("Default process group has not been initialized, "
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

After debugging and locating the issue, the problem turned out to be the normalization config of a ConvModule:

        self.linear_fuse = ConvModule(
            in_channels=embedding_dim*4,
            out_channels=embedding_dim,
            kernel_size=1,
            norm_cfg=dict(type='SyncBN', requires_grad=True)
        )

In the norm_cfg here, use 'SyncBN' for multi-GPU (distributed) training; for single-GPU training, change the type to 'BN'.
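
For single-GPU training the same module could look like this (a minimal sketch based on the snippet above; ConvModule comes from mmcv.cnn and embedding_dim is assumed to be defined by the surrounding class):

# inside the model's __init__ (ConvModule is imported from mmcv.cnn)
self.linear_fuse = ConvModule(
    in_channels=embedding_dim*4,
    out_channels=embedding_dim,
    kernel_size=1,
    norm_cfg=dict(type='BN', requires_grad=True)  # plain BN: no distributed process group needed
)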