Tag Archives: pytorch

CONDA install torch error

If conda can't find the PyTorch version you need, add the PyTorch channel as an extra source:

conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch/

The complete set of commands for switching conda to the Tsinghua mirror:

conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge 
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/msys2/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch/
conda config --set show_channel_urls yes

Installation of Python on Mac

1. Install Anaconda
Installation environment:

Although my Mac ships with Python, I installed Anaconda first. It bundles many third-party Python libraries and makes it easy to manage and switch between different Python versions. Anaconda is also a scientific computing environment: once it is installed, you have Python plus a set of common libraries.

I installed the Python 2.7 edition of Anaconda. After installation, Python and some common libraries are already in place, and Spyder is installed automatically as well.
2. Create and activate an environment, then install PyTorch
Open the terminal and type:

conda create -n [name] python=3.5

Replace [name] with the environment name you want, without typing the brackets. You can choose a different Python version depending on your needs: just change 3.5 to 3.6 or 2.7.
After that command finishes, run:

source activate [name]

At this point, the environment is activated.
Then run pip install torch torchvision to install PyTorch.
When that finishes, the installation is complete.
If you need the GPU version, you have to build it from source, or download a PyTorch GPU build that someone else has already compiled from https://github.com/TomHeaven/pytorch-osx-build
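
To confirm that the installation worked, one option is a quick check from inside the activated environment. The snippet below is only a minimal sketch and not part of the original post: the helper name describe_torch_install is made up for illustration, and it relies only on torch.__version__ and torch.cuda.is_available().

```python
import importlib.util


def describe_torch_install():
    """Return a one-line summary of the PyTorch install in this environment."""
    # find_spec returns None when the package cannot be found from here
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed in this environment"
    import torch
    return "torch {} (CUDA available: {})".format(
        torch.__version__, torch.cuda.is_available())


print(describe_torch_install())
```

If the summary says torch is not installed, the most likely explanation is that pip ran in a different environment from the one that is currently activated.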

CUDA error: out of memory

Today my program kept failing with this error, which says that CUDA is out of memory. After a long time debugging, the cause turned out to be which GPU the process was actually using.

At first I suspected that the graphics cards on the server were in use, but nvidia-smi showed that none of the three GPUs was busy, so that explanation was clearly ruled out. So what was going on?

Others say the error comes from a conflict between TensorFlow and PyTorch versions, but I don't even have TensorFlow installed.

In the end I found this post: http://www.cnblogs.com/jisongxie/p/10276742.html

Like that blogger, I was also using GPU 0, but for some reason my PyTorch process seemed to see only physical GPU 2, and this server has no GPU 3. Is that what went wrong?

So I changed the code to state explicitly which GPU on the server PyTorch should see:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

After that, the program ran happily on physical GPU 0.
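
The setting above only works if it is applied before CUDA is initialised. The following is a minimal sketch of that ordering, not the code from the post: the helper restrict_visible_gpus is a hypothetical name, and setting CUDA_DEVICE_ORDER is an extra assumption added here so that the indices match the numbering shown by nvidia-smi.

```python
import os


def restrict_visible_gpus(gpu_ids):
    """Expose only the given physical GPU indices to CUDA in this process."""
    # Number the devices by PCI bus id, the same order nvidia-smi uses
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    value = ",".join(str(i) for i in gpu_ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Call this before the first CUDA call (simplest: before `import torch`);
# once CUDA has been initialised, changing the variable has no effect.
restrict_visible_gpus([0])
print(os.environ["CUDA_VISIBLE_DEVICES"])

# From here on, the process sees a single device, and "cuda:0" inside
# the process refers to physical GPU 0.
```

The same effect can be had without touching the code by setting the variable on the command line when launching the script.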
 
 
 

The LeNet model trained with PyTorch fails to recognize my own handwritten digit images

LeNet was trained on the MNIST training set; the training code is not shown here.
The saved model is loaded directly:

lenet = torch.load('resourses/trained_model/LeNet_trained.pkl')

The test code is attached below:

import torch
from PIL import Image
from torchvision import transforms

print("Testing")
# Define the preprocessing: resize, convert to one grayscale channel,
# convert to a tensor, and normalize with mean 0.5 and std 0.5
img_to_tensor = transforms.Compose([
    transforms.Resize(32),
    transforms.Grayscale(num_output_channels=1),
    transforms.ToTensor(),
    transforms.Normalize([0.5], [0.5])])
# Read in the test image and add a batch dimension before passing it to the model
test_images = Image.open('resourses/LeNet_test/0.png')
input_images = img_to_tensor(test_images).unsqueeze(0)
# Move the model and data to CUDA for computation if CUDA is available
USE_CUDA = torch.cuda.is_available()
if USE_CUDA:
    input_images = input_images.cuda()
    lenet = lenet.cuda()
output_data = lenet(input_images)
# Print the raw outputs and the predicted label (index of the largest output)
test_labels = torch.max(output_data, 1)[1].data.cpu().numpy().squeeze(0)
print(output_data)
print(test_labels)

At first, my own images were almost never classified correctly and I could not find the reason; the model output 8 very often.
Later I looked up relevant information and found the following reasons:

    If you parse the MNIST dataset, you will find that its images show white digits on a black background, whereas our own test images are generally black digits on a white background. So I inverted the pixels of my test images and tested again.
    The pixel inversion code is as follows:
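from PIL import Image, ImageOps
# Convert to 8-bit grayscale first; ImageOps.invert does not accept every mode (e.g. RGBA)
image = Image.open('resourses/LeNet_test/0.png').convert('L')
image_invert = ImageOps.invert(image)
image_invert.show()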

After pixel inversion, the test accuracy reaches 50-60 percent, which is still not ideal. A possible further reason:

    The MNIST dataset contains handwriting from people abroad, whose writing style and habits differ slightly from those of Chinese writers. This may also be a major factor affecting test accuracy, but I have not yet tested the accuracy after adjusting the handwriting style.
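
Whether an image needs inverting at all can be decided automatically. The sketch below is a rough heuristic and not the author's code: it assumes the image is already a flat list of 0-255 grayscale values, that most pixels belong to the background, and the function name and the threshold of 127 are made up for illustration.

```python
def invert_if_light_background(pixels, threshold=127):
    """Return pixels in MNIST polarity: bright digit on a dark background."""
    if not pixels:
        return []
    # Most pixels are background, so a high mean suggests a light background
    mean = sum(pixels) / len(pixels)
    if mean > threshold:
        return [255 - p for p in pixels]
    return list(pixels)


dark_on_light = [255, 255, 0, 255]   # black stroke on white paper
light_on_dark = [0, 0, 255, 0]       # already in MNIST polarity
print(invert_if_light_background(dark_on_light))
print(invert_if_light_background(light_on_dark))
```

Both calls end up with the same polarity, so the test images can be fed through this step regardless of how they were drawn.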

RuntimeError: CUDA error: out of memory solution (personally tested and working)

As mentioned earlier, I got this error while testing the model, not while training it. If you hit it during training, see my other post: RuntimeError: CUDA out of memory. Tried to allocate 1.17 GB
The solution is actually very simple. My program originally specified GPU 3, and running the test code raised the out-of-memory error from the title.

I then specified GPUs 2 and 3, and the code ran without error.

pytorch: RuntimeError CUDA error device-side assert triggered

Error reported while training the network: RuntimeError: cuda runtime error (710) : device-side assert triggered at /pytorch/aten/src/THC/generic/THCTensorScatterGather.cu:380
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered (insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:569)
Cause: a label is out of range.
How to locate it: run

CUDA_LAUNCH_BLOCKING=1 python train.py

The error then comes with more specific information:

/pytorch/aten/src/THC/THCTensorScatterGather.cu:188: void THCudaTensor_scatterFillKernel(TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, Real, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = -1]: block: [72,0,0], thread: [32,0,0] Assertion `indexValue >= 0 && indexValue < tensor.sizes[dim]` failed.

The failed assertion is `indexValue >= 0 && indexValue < tensor.sizes[dim]`, which means that a label is either below zero or not smaller than the number of classes, i.e. out of range. After debugging, I found that one label had been set to a value larger than the preset number of categories. After I corrected this label, the problem was solved.
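
A check like this can also be done on the CPU before training starts, which avoids the opaque device-side assert entirely. The following is a minimal sketch under the assumption that the labels are available as a plain list of integers; the function name and the sample values are hypothetical.

```python
def find_out_of_range_labels(labels, num_classes):
    """Return (position, value) pairs for labels outside [0, num_classes)."""
    return [(i, y) for i, y in enumerate(labels)
            if not 0 <= y < num_classes]


labels = [0, 3, 9, 10, -1, 4]   # made-up labels for a 10-class problem
bad = find_out_of_range_labels(labels, num_classes=10)
print(bad)  # [(3, 10), (4, -1)]
```

With 10 classes the valid labels are 0 through 9, so a label equal to the number of classes is already out of range.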

Pytorch RuntimeError CuDNN error CUDNN_STATUS_SUCCESS (How to Fix)

When I used an RNN in PyTorch and moved it to the GPU for computation, I got:
RuntimeError: CuDNN error: CUDNN_STATUS_SUCCESS
When the error appeared, my environment was:
OS: Ubuntu 16.04
GPU: NVIDIA GeForce RTX 2080 (GPU and driver information can be viewed with the command nvidia-smi)
CUDA: 9.0
cuDNN: 7.x for CUDA 9.0
According to the solutions I found online, fixing this problem is actually simple:
1. Update CUDA and cuDNN
Change CUDA and cuDNN to 9.2 or 10.0, or at least to the lowest CUDA version that supports your GPU.
2. Update the PyTorch and torchvision versions
After updating CUDA and cuDNN, remember to update PyTorch to a build that matches your CUDA version, such as torch 0.4.1 for CUDA 9.2.
After these two steps, the problem should basically be solved. The reference is:
https://discuss.pytorch.org/t/runtimeerror-cudnn-error-cudnn-status-success/28045

(Solved) pytorch error: RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED (install cuda)

RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
Cause: the PyTorch and CUDA versions do not match.
(It is also possible that there is not enough memory; in that case you can try increasing the virtual memory size.)
To uninstall PyTorch, run conda uninstall pytorch. If you then reinstall it with a CUDA version specified, the CUDA version is overwritten automatically.
To check the installed versions, open CMD, start the Python interpreter, and type:

import torch
print(torch.__version__)
print(torch.version.cuda) 


Similar errors occur if the installed CUDA version does not match the version the torch build expects.

Here’s how to install CUDA:
1. Open the NVIDIA control panel to view the CUDA version supported by the current video card driver:


2. Download CUDA from
https://developer.nvidia.com/cuda-toolkit-archive
Alternatively, download the required offline .tar.bz2 package from https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch/linux-64/
and then run conda install XXXX.tar.bz2 in the terminal.
Once that has finished, install the remaining packages.

First, switch Anaconda's conda to the domestic (Tsinghua) mirror:

conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge 
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/msys2/

conda config --set show_channel_urls yes

conda install pytorch torchvision cudatoolkit=10.0
Install any other packages you need.
See also the PyTorch official website.

Download according to your actual setup:

3. After the download succeeds, double-click the exe file to install.
To verify that the installation succeeded, enter nvcc -V in CMD.

If the installation was successful, you can also see the CUDA entries in the system environment variables:


Or you can find nvcc.exe under the installation path.

Solving RuntimeError: reduce failed to synchronize: device-side assert triggered

First, here is the error output:

/pytorch/aten/src/THC/THCTensorIndex.cu:362: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [35,0,0], thread: [96,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:362: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [35,0,0], thread: [97,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:362: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [35,0,0], thread: [98,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:362: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [35,0,0], thread: [99,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:362: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [35,0,0], thread: [100,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:362: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [35,0,0], thread: [101,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:362: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [35,0,0], thread: [102,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:362: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [35,0,0], thread: [103,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:362: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, 
......
......
......
/pytorch/aten/src/THC/THCTensorIndex.cu:362: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [35,0,0], thread: [95,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Traceback (most recent call last):
  File "../paragrah_selector/para_sigmoid_train.py", line 533, in <module>
    main()
  File "../paragrah_selector/para_sigmoid_train.py", line 463, in main
    eval_loss = eval_model(model, eval_data, device)
  File "../paragrah_selector/para_sigmoid_train.py", line 419, in eval_model
    loss, logits = model(input_ids, segment_ids, input_mask, labels=label_ids)
  File "/home/lisen/.conda/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lisen/.conda/envs/py36/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 143, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/lisen/.conda/envs/py36/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 153, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/lisen/.conda/envs/py36/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
    raise output
  File "/home/lisen/.conda/envs/py36/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
    output = module(*input, **kwargs)
  File "/home/lisen/.conda/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lisen/caiyun_projects/generative_mrc/paragrah_selector/modeling.py", line 1001, in forward
    loss = loss_fn(logits, labels)
  File "/home/lisen/.conda/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lisen/.conda/envs/py36/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 504, in forward
    return F.binary_cross_entropy(input, target, weight=self.weight, reduction=self.reduction)
  File "/home/lisen/.conda/envs/py36/lib/python3.6/site-packages/torch/nn/functional.py", line 2027, in binary_cross_entropy
    input, target, weight, reduction_enum)
RuntimeError: reduce failed to synchronize: device-side assert triggered
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: device-side assert triggered (insert_events at /pytorch/aten/src/THC/THCCachingAllocator.cpp:470)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7f0e52afc021 in /home/lisen/.conda/envs/py36/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7f0e52afb8ea in /home/lisen/.conda/envs/py36/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x13dbd92 (0x7f0e5e065d92 in /home/lisen/.conda/envs/py36/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)
frame #3: at::TensorImpl::release_resources() + 0x50 (0x7f0e534c6440 in /home/lisen/.conda/envs/py36/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
frame #4: <unknown function> + 0x2af03b (0x7f0e51bb703b in /home/lisen/.conda/envs/py36/lib/python3.6/site-packages/torch/lib/libtorch.so.1)
frame #5: torch::autograd::Variable::Impl::release_resources() + 0x17 (0x7f0e51e29d27 in /home/lisen/.conda/envs/py36/lib/python3.6/site-packages/torch/lib/libtorch.so.1)
frame #6: <unknown function> + 0x124cfb (0x7f0e8ce4ccfb in /home/lisen/.conda/envs/py36/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x3204af (0x7f0e8d0484af in /home/lisen/.conda/envs/py36/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0x3204f1 (0x7f0e8d0484f1 in /home/lisen/.conda/envs/py36/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #25: __libc_start_main + 0xf0 (0x7f0ecf782830 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)
(py36) lisen@octa:~/caiyun_projects/generative_mrc/script$ sh para_sigmoid_train.sh

This error usually has two main causes:
1. There is something wrong with your labels, so check your labels carefully. 2. There is something wrong with your word vectors, for example the position indices exceed the model's preset maximum length, or the token indices exceed the vocabulary size.

Now for the point of this article: knowing these two causes alone may not make it easy to locate the problem. Here is a simple debugging method that shows where the problem is: put the model on the CPU and run it there. If it does not fit in memory, just turn down the batch size. For example, after I made that change, I got the following error:

File "../paragrah_selector/para_sigmoid_train.py", line 533, in <module>
    main()
  File "../paragrah_selector/para_sigmoid_train.py", line 463, in main
    eval_loss = eval_model(model, eval_data, device)
  File "../paragrah_selector/para_sigmoid_train.py", line 419, in eval_model
    loss, logits = model(input_ids, segment_ids, input_mask, labels=label_ids)
  File "/home/lisen/.conda/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lisen/caiyun_projects/generative_mrc/paragrah_selector/modeling.py", line 987, in forward
    _, pooled_output = self.bert(input_ids, token_type_ids, attention_mask, output_all_encoded_layers=False)
  File "/home/lisen/.conda/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lisen/caiyun_projects/generative_mrc/paragrah_selector/modeling.py", line 705, in forward
    embedding_output = self.embeddings(input_ids, token_type_ids)
  File "/home/lisen/.conda/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lisen/caiyun_projects/generative_mrc/paragrah_selector/modeling.py", line 281, in forward
    position_embeddings = self.position_embeddings(position_ids)
  File "/home/lisen/.conda/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lisen/.conda/envs/py36/lib/python3.6/site-packages/torch/nn/modules/sparse.py", line 118, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "/home/lisen/.conda/envs/py36/lib/python3.6/site-packages/torch/nn/functional.py", line 1454, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: index out of range at /pytorch/aten/src/TH/generic/THTensorEvenMoreMath.cpp:191

The traceback clearly points to File "/home/lisen/caiyun_projects/generative_mrc/paragrah_selector/modeling.py", line 281, in forward:
position_embeddings = self.position_embeddings(position_ids)
In other words, the position indices exceeded the model's preset maximum length. I went back to check and found that the longer texts had indeed not been truncated to that length, which caused this problem.
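
Both causes of an embedding index error can be checked on plain Python lists before the batch ever reaches the model. The sketch below is illustrative only: the function names are hypothetical, and the values 512 and 1000 are made-up stand-ins for the model's maximum position count and vocabulary size.

```python
def check_embedding_inputs(token_ids, vocab_size, max_positions):
    """Return a list of human-readable problems; empty means the input is safe."""
    problems = []
    if len(token_ids) > max_positions:
        problems.append("sequence length {} exceeds max_positions {}".format(
            len(token_ids), max_positions))
    bad = [t for t in token_ids if not 0 <= t < vocab_size]
    if bad:
        problems.append("{} token id(s) outside [0, {})".format(
            len(bad), vocab_size))
    return problems


def truncate_ids(token_ids, max_positions):
    """Keep only as many tokens as the position embedding can index."""
    return list(token_ids[:max_positions])


long_text = list(range(600))   # 600 tokens, longer than the assumed limit of 512
print(check_embedding_inputs(long_text, vocab_size=1000, max_positions=512))
print(check_embedding_inputs(truncate_ids(long_text, 512),
                             vocab_size=1000, max_positions=512))
```

The first call reports the length problem and the second, after truncation, reports nothing.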

PyTorch element-wise multiplication and matrix multiplication

1. Element-wise multiplication, x.mul(y): corresponding entries are multiplied with no summation. This is also known as the Hadamard product. Multiplying element-wise and then summing is the operation a convolution performs at each position.

>>> a = torch.Tensor([[1,2], [3,4], [5, 6]])
>>> a
tensor([[1., 2.],
        [3., 4.],
        [5., 6.]])
>>> a.mul(a)
tensor([[ 1.,  4.],
        [ 9., 16.],
        [25., 36.]])

# a * a is equivalent to a.mul(a)

2. Matrix multiplication, x.mm(y): the shapes must satisfy (i, n) x (n, j), and the result has shape (i, j).

>>> a
tensor([[1., 2.],
        [3., 4.],
        [5., 6.]])
>>> b = a.t()  # transpose
>>> b
tensor([[1., 3., 5.],
        [2., 4., 6.]])

>>> a.mm(b)
tensor([[ 5., 11., 17.],
        [11., 25., 39.],
        [17., 39., 61.]])
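
To make the difference between the two operations concrete, here is a plain-Python sketch of their definitions on nested lists. It is not how PyTorch implements them; the function names are made up for illustration, and the inputs reuse the matrix a from the examples above.

```python
def hadamard(x, y):
    """Element-wise product of two equally shaped matrices."""
    return [[p * q for p, q in zip(row_x, row_y)]
            for row_x, row_y in zip(x, y)]


def transpose(x):
    """Swap rows and columns."""
    return [list(col) for col in zip(*x)]


def matmul(x, y):
    """Matrix product of an (i, n) matrix and an (n, j) matrix."""
    if len(x[0]) != len(y):
        raise ValueError("inner dimensions do not match")
    cols = transpose(y)
    return [[sum(p * q for p, q in zip(row, col)) for col in cols]
            for row in x]


a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(hadamard(a, a))            # same numbers as a.mul(a)
print(matmul(a, transpose(a)))   # same numbers as a.mm(a.t())
```

The element-wise product keeps the (3, 2) shape, while the matrix product of (3, 2) and (2, 3) gives a (3, 3) result.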


PyTorch torchvision.datasets.ImageFolder: Found 0 files in subfolders error

Recently I was working through the cats-vs-dogs example in a book on hands-on computer vision with PyTorch. Following the original code, I ran into an error when loading the data with ImageFolder.

ImageFolder expects the images to be inside class subfolders of the directory it is given. Creating a new subfolder inside the valid directory resolved the error.
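
A quick way to see whether a directory has the layout ImageFolder needs is to count the images per class subfolder. The following is a minimal sketch using only the standard library; the function name, the extension list, and the file names in the demo are assumptions for illustration, and the extension list is only a subset of what torchvision accepts.

```python
import tempfile
from pathlib import Path

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp"}  # assumed subset


def count_images_per_class(root):
    """Map each class subfolder of root to the number of image files in it."""
    counts = {}
    for class_dir in sorted(Path(root).iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(
                1 for f in class_dir.rglob("*")
                if f.is_file() and f.suffix.lower() in IMAGE_EXTENSIONS)
    return counts


# Demo on a throwaway directory: one image in a class folder, one loose image
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "valid"
    (root / "cat").mkdir(parents=True)
    (root / "cat" / "cat.0.jpg").touch()
    (root / "dog.0.jpg").touch()   # directly under root, so in no class folder
    demo_counts = count_images_per_class(root)

print(demo_counts)  # {'cat': 1}
```

An empty result, or classes with a count of zero, means the images sit directly under the root or in the wrong place, which matches the situation described above.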