Tag Archives: pytorch

[Solved] AttributeError: 'DataParallel' object has no attribute 'save'

Error message:

trainer.model.save(self.dir, epoch, is_best=is_best)
AttributeError: 'DataParallel' object has no attribute 'save'

Source code analysis:

 trainer.model.save(self.dir, epoch, is_best=is_best)

The line above worked before switching to single-machine multi-GPU parallelism. My parallel code is implemented as follows:

os.environ["CUDA_VISIBLE_DEVICES"] = "3,2,1"  # expose physical GPUs 3, 2 and 1
model = torch.nn.DataParallel(model, device_ids=[0, 1]).cuda()  # ids index the visible list (0 -> physical 3, 1 -> physical 2)

Cause analysis: AttributeError: 'DataParallel' object has no attribute 'save'

Under multi-GPU training in torch, what gets wrapped is the whole model (not just its state_dict()), and DataParallel stores the original model in its .module attribute; custom methods such as save() therefore have to be called through model.module. After this modification, the code is as follows:

 trainer.model.module.save(self.dir, epoch, is_best=is_best)
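
For context, here is a minimal sketch of the pattern (the model class and its save() method are illustrative, not from the original code):

import torch
import torch.nn as nn

class MyModel(nn.Module):  # hypothetical model with a custom save() method
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 2)

    def forward(self, x):
        return self.fc(x)

    def save(self, path):
        torch.save(self.state_dict(), path)

model = nn.DataParallel(MyModel())  # wrapping alone is enough to reproduce the issue
# model.save("model.pt")            # AttributeError: 'DataParallel' object has no attribute 'save'
model.module.save("model.pt")       # the wrapped model still exposes the custom method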

RuntimeError: CUDA error: an illegal memory access was encountered

Question:

When I ran into this problem while writing a model, the answers I found on Baidu blamed either the PyTorch version or an out-of-range class index, but none of that helped, because the error came from a very simple assignment:

scores[:, 0] = -float("inf") 
#RuntimeError: CUDA error: an illegal memory access was encountered

At the same time, while debugging, I found that a warning appeared right after one of the model's layers executed:

lm_logits = self.linear(outputs) + self.bias
# warning: THCudaCheck FAIL file=/pytorch/aten/src/THC/THCCachingHostAllocator.cpp line=278 error=700 : an illegal memory access was encountered

At first glance both lines are trivial, yet they produced this strange error.

Solution:

While debugging, I noticed an anomaly:

When inspecting the data output by the network, the variable did not display the actual output values, only the address of the data:

T:torch.Tensor object at 0x7fb27e7c8f30
data:torch.Tensor object at 0x7fb27e7c8f30

Later I found the cause: the self.linear layer was on the 'cpu' while the rest of the network was on 'cuda', so 'cuda' tensors were being forward-propagated into a 'cpu' layer. Moving that layer to 'cuda' fixed the error.
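
A minimal sketch of the mismatch and the fix (tensor shapes and variable names are illustrative):

import torch
import torch.nn as nn

linear = nn.Linear(768, 10)                   # accidentally left on the CPU
bias = torch.zeros(10, device="cuda")
outputs = torch.randn(4, 768, device="cuda")

# lm_logits = linear(outputs) + bias          # device mismatch: illegal memory access
#                                             # (newer PyTorch raises a clearer RuntimeError)

linear = linear.cuda()                        # move the stray layer onto the GPU
lm_logits = linear(outputs) + bias            # now every operand lives on 'cuda'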

[Solved] RuntimeError: expected scalar type Float but found Double

Error: RuntimeError: expected scalar type Float but found Double

w_true = torch.tensor([2, -3.4]).T   # float32 by default
b_true = 4.2
feature = torch.from_numpy(np.random.normal(0, 1, (num_input, num_example)))  # float64
#feature = torch.float32(feature)    # failed attempt: torch.float32 is a dtype, not a cast
labels = torch.matmul(w_true.T, feature) + b_true  # RuntimeError: expected scalar type Float but found Double

Problem: RuntimeError: expected scalar type Float but found Double
Cause: numpy's random generators (np.random.normal() here) return float64 data, while torch tensors default to float32, so the matmul mixes dtypes.
Solution:

feature = torch.from_numpy(np.float32(np.random.normal(0,1,(num_input,num_example))))
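
An equivalent fix, for reference, is to cast after converting:

feature = torch.from_numpy(np.random.normal(0, 1, (num_input, num_example))).float()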

[Solved] KeyError: 'Transformer/encoderblock_0/MultiHeadDotProductAttention_1/query\kernel is not a file in the archive'

Recently I have been working on applying Transformers to fine-grained image recognition. While running the ViT source code, I hit this problem:
 
KeyError: 'Transformer/encoderblock_0/MultiHeadDotProductAttention_1/query\kernel is not a file in the archive'
 
The problem arises when the weight names are merged with os.path.join: on Windows the join inserts a backslash separator, so the resulting key ('...query\kernel') does not match any name stored in the .npz checkpoint.
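
A quick illustration of that behavior: os.path.join only inserts the platform separator when the first part does not already end with one.

import os

# On Windows:
os.path.join("MultiHeadDotProductAttention_1/query", "kernel")
# -> 'MultiHeadDotProductAttention_1/query\kernel'   (not a key in the archive)
os.path.join("MultiHeadDotProductAttention_1/query/", "kernel")
# -> 'MultiHeadDotProductAttention_1/query/kernel'   (matches the key)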
 
Solution:
 
1. In the modeling.py file

Add a trailing '/' to the following paths:
ATTENTION_Q = "MultiHeadDotProductAttention_1/query/"
ATTENTION_K = "MultiHeadDotProductAttention_1/key/"
ATTENTION_V = "MultiHeadDotProductAttention_1/value/"
ATTENTION_OUT = "MultiHeadDotProductAttention_1/out/"
FC_0 = "MlpBlock_3/Dense_0/"
FC_1 = "MlpBlock_3/Dense_1/"
ATTENTION_NORM = "LayerNorm_0/"
MLP_NORM = "LayerNorm_2/"
 
2. In the vit_modeling_resnet.py file

In the ResNetV2 class, add '/' after each 'block' and 'unit' name:
 
self.body = nn.Sequential(OrderedDict([
    ('block1/', nn.Sequential(OrderedDict(
        [('unit1/', PreActBottleneck(cin=width, cout=width*4, cmid=width))] +
        [(f'unit{i:d}/', PreActBottleneck(cin=width*4, cout=width*4, cmid=width)) for i in range(2, block_units[0] + 1)],
        ))),
    ('block2/', nn.Sequential(OrderedDict(
        [('unit1/', PreActBottleneck(cin=width*4, cout=width*8, cmid=width*2, stride=2))] +
        [(f'unit{i:d}/', PreActBottleneck(cin=width*8, cout=width*8, cmid=width*2)) for i in range(2, block_units[1] + 1)],
        ))),
    ('block3/', nn.Sequential(OrderedDict(
        [('unit1/', PreActBottleneck(cin=width*8, cout=width*16, cmid=width*4, stride=2))] +
        [(f'unit{i:d}/', PreActBottleneck(cin=width*16, cout=width*16, cmid=width*4)) for i in range(2, block_units[2] + 1)],
        ))),
]))

RuntimeError: CUDA out of memory. Tried to allocate 600.00 MiB (GPU 0; 23.69 GiB total capacity)

RuntimeError: CUDA out of memory. Tried to allocate 600.00 MiB (GPU 0; 23.69 GiB total capacity; 21.82 GiB already allocated; 115.25 MiB free; 21.87 GiB reserved in total by PyTorch)

Reason:

This error means the GPU does not have enough free memory for the requested allocation.

Solution 1: free the GPU memory

First run fuser -v /dev/nvidia* (or sudo fuser -v /dev/nvidia*) to see which processes have recently been running on the GPU, then sudo kill the relevant process IDs.

Where:

fuser: shows which process is currently using a given file, mount point, or even network port, with details of each process
-v: verbose mode
/dev/nvidia*: all NVIDIA-related device files (i.e. the GPUs)

Solution 2: reduce the batch size

If there is still not enough GPU memory after freeing some of it, reduce the batch size.
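
Two other mitigations that often help (general PyTorch practice, not from the original post): run evaluation without gradient tracking, and release cached blocks afterwards. The model and batch below are illustrative.

import torch
import torch.nn as nn

model = nn.Linear(1024, 10).cuda()            # illustrative model
batch = torch.randn(64, 1024, device="cuda")  # illustrative batch

with torch.no_grad():                         # evaluation needs no gradients or saved activations
    output = model(batch)

torch.cuda.empty_cache()                      # return unused cached memory to the driver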


Reference:
GPU memory still not released after killing the process on an NVIDIA GPU

[Solved] Sklearn Call Error: DLL load failed while importing _arpack

Calling sklearn fails with: DLL load failed while importing _arpack

Reinstalling sklearn and numpy with conda had no effect.
Some answers suggested deleting certain DLL files under Win32 to suppress the error; I did not try that.

Suspecting that conda itself might be the problem (e.g. how its DLL files are configured), I reinstalled scipy (which provides the _arpack module) with pip:

pip uninstall scipy
pip install scipy

The problem was solved (of course, a package mirror should be configured in advance if downloads are slow).

PS: The error still appeared at first when I used a Jupyter notebook; I found I had to restart the kernel.

Reference Resources:
https://stackoverflow.com/questions/55201924/scikit-learn-dll-load-failed-in-anaconda

[CUDA Environment] Python PyTorch Error: cudaSetupArgument

This problem is most likely caused by the CUDA version used for compilation being inconsistent with the CUDA version used at run time.
First check the system's CUDA version (i.e. the version used for compilation):

nvcc -V

In my pytorch + conda environment, conda list shows the cudatoolkit version inside the virtual environment. Initially my system CUDA version was 9.0 while the cudatoolkit version was 10.2; because the versions were inconsistent, the error in the title appeared. After I switched the system's CUDA version, the problem was solved.
A brief description of how to switch versions:
Run echo $PATH to check the CUDA path information, then link the CUDA 10.2 installation to /usr/local/cuda (remove the old link first if /usr/local/cuda already exists). The command is:

ln -s /usr/local/cuda10.2 /usr/local/cuda

Then modify the system path as follows:

vim ~/.bashrc

Add code at the end

export PATH=/usr/local/cuda:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

Press Esc, type :wq, and press Enter to save and quit, then run the following on the command line:

source ~/.bashrc

This updates the path information. Now run:

nvcc -V

to check the CUDA version after the switch.
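
As a quick consistency check (a minimal sketch, not from the original post), you can also compare the CUDA version PyTorch was compiled against with what the runtime sees:

import torch

print(torch.version.cuda)          # CUDA version PyTorch was compiled with
print(torch.cuda.is_available())   # whether the runtime can actually use the GPU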

[Solved] PyTorch loads a model onto the wrong GPU card number or fails to load onto the specified one

According to the PyTorch documentation, when loading a model you can map its tensors onto a specific target GPU.
The loading methods are:

>>> torch.load('tensors.pt')  # default: load tensors onto the device they were saved from
# 1. Load all tensors onto GPU 0
>>> torch.load('tensors.pt', map_location=torch.device('cuda:0'))
# 2. Load all tensors onto GPU 1
>>> torch.load('tensors.pt', map_location=lambda storage, loc: storage.cuda(1))
# 3. Map tensors from GPU 1 to GPU 0
>>> torch.load('tensors.pt', map_location={'cuda:1': 'cuda:0'})

Actual testing shows that:
Method 1 does not load onto the target card at all; the tensors are still loaded onto the original card the model was trained on, so the assignment fails.
Method 3 errors inside torch's own code at location.startswith('cuda'): AttributeError: 'NoneType' object has no attribute 'startswith'. After analyzing the code, this turns out to be a bug in torch itself. A real pitfall.
Method 2 loads the tensors onto cuda:1 normally.
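
Based on these results, a minimal sketch of the approach that worked (method 2), parameterized by the target card; gpu_id is illustrative:

import torch

gpu_id = 1  # target card number (illustrative)
state = torch.load('tensors.pt',
                   map_location=lambda storage, loc: storage.cuda(gpu_id))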