
[Solved] CUDA unknown error – this may be due to an incorrectly set up environment

Preface

Today a PyTorch project on our server suddenly started failing after an upgrade. The full error message is too long for the title, so here it is:

builtins.RuntimeError: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero.

I then looked up the error. These are the fixes I tried.

Solution:

Method 1: add environment variables

Because the project runs in a Docker container, I entered the container, installed vim, and appended the following line to the end of ~/.bashrc:

export CUDA_VISIBLE_DEVICES=0

The container was created with GPU 0 selected, so I set the value to 0.

After restarting the container, echo $CUDA_VISIBLE_DEVICES printed the expected value, but the problem was not solved and the error was still reported.

Method 2: add environment variables to the code

Add the following code before the place where CUDA is initialized.

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

It still hasn’t solved the problem.
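The ordering is the important part of Method 2: the variable has to be set before anything initializes CUDA, which in practice means before importing the framework. A minimal sketch of that ordering, where select_gpus is a hypothetical helper name and not part of PyTorch:

```python
import os


def select_gpus(gpu_ids):
    """Expose only the given GPU indices to CUDA for this process.

    select_gpus is a hypothetical helper. It should run before anything
    initializes CUDA, because later changes may be ignored.
    """
    value = ",".join(str(i) for i in gpu_ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Set the variable first, then import the framework that uses CUDA.
select_gpus([0])
# import torch  # import only after the variable is set
```

In my case this was not the cause, but it rules out the scenario that the error message itself describes.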

Method 3: restart the server

Some articles mention that the same error appears when the system upgrades the graphics card driver and is not restarted afterwards.

So I restarted the server and solved the problem.

pytools.prefork.ExecError: error invoking 'nvcc --version': [Errno 2] No such file or directory

Problem description: PyCUDA's sample code runs without any problems locally on Linux. However, when I debug the code remotely from PyCharm, the error above occurs.

The fix has two steps. If the first step solves the problem, the second step is not needed.

Step 1:

export PATH="/usr/local/cuda/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda/lib64:$LD_LIBRARY_PATH"

Steps
1. Find your ~/.bashrc file.
2. Add the two lines above to it.
3. Run source ~/.bashrc
4. To test, run the command nvcc --version
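A remote debugger may start Python without reading ~/.bashrc (this is my assumption about the cause, the error message does not confirm it), so it can help to check from inside the script whether nvcc is visible. A small sketch, where prepend_to_path is a hypothetical helper and /usr/local/cuda/bin is the directory used in this post:

```python
import os
import shutil


def prepend_to_path(directory, path_value):
    """Return path_value with directory placed first, without duplicates."""
    parts = [p for p in path_value.split(os.pathsep) if p and p != directory]
    return os.pathsep.join([directory] + parts)


# /usr/local/cuda/bin is the location used in this post; adjust if yours differs.
cuda_bin = "/usr/local/cuda/bin"
os.environ["PATH"] = prepend_to_path(cuda_bin, os.environ.get("PATH", ""))

# shutil.which returns None when nvcc cannot be found on PATH.
print("nvcc found at:", shutil.which("nvcc"))
```

If this prints None even after the change, nvcc is not installed at that location and the PATH is not the problem.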

Some people write cuda-10.1 (the directory with the version number) in these paths. I write cuda because on my machine cuda is a symbolic link (similar to a shortcut) to cuda-10.1. In the ls -l output, the leading "l" in "lrwxrwxrwx" marks a symbolic link. Either form of the path works.

Step 2:

Open compiler.py and add the following line:

nvcc = '/usr/local/cuda/bin/' + nvcc

As follows:

    def compile_plain(source, options, keep, nvcc, cache_dir, target="cubin"):
        from os.path import join
    
        assert target in ["cubin", "ptx", "fatbin"]
        nvcc = '/usr/local/cuda/bin/' + nvcc # --> here is the new line
        
        if cache_dir:
            checksum = _new_md5()
            ...

File location of compiler.py:

I use conda environments (envs), so I found the file under one of them. You can use the locate or find command to track it down.

anaconda3/envs/torch19/lib/python3.7/site-packages/pycuda

Command:

 find ./lib/python3.7/site-packages -name compiler.py
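The same search can be done from Python if find or locate is not available. This is only a sketch that mirrors the find command above, and find_files is a hypothetical helper name:

```python
from pathlib import Path


def find_files(root, filename):
    """Return sorted paths of every file called filename below root."""
    return sorted(str(p) for p in Path(root).rglob(filename) if p.is_file())


# "." stands for the environment directory; replace it with your own path.
for path in find_files(".", "compiler.py"):
    print(path)
```

Several packages ship a compiler.py, so pick the one whose path ends in pycuda/compiler.py.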

[Solved] NVML official example compilation error: collect2: error: ld returned 1 exit status

Question

When compiling the NVML example, an error is reported:

/tmp/tmpxft_00001d31_00000000-4_gpu_monitor.o: In function ‘main’:
gpu_monitor.c:(.text+0x77): undefined reference to ‘nvmlInit_v2’
gpu_monitor.c:(.text+0x93): undefined reference to ‘nvmlErrorString’
gpu_monitor.c:(.text+0xcd): undefined reference to ‘nvmlDeviceGetCount_v2’
gpu_monitor.c:(.text+0xe9): undefined reference to ‘nvmlErrorString’
gpu_monitor.c:(.text+0x15e): undefined reference to ‘nvmlDeviceGetHandleByIndex_v2’
gpu_monitor.c:(.text+0x17a): undefined reference to ‘nvmlErrorString’
gpu_monitor.c:(.text+0x1b4): undefined reference to ‘nvmlDeviceGetName’
gpu_monitor.c:(.text+0x1d0): undefined reference to ‘nvmlErrorString’
gpu_monitor.c:(.text+0x208): undefined reference to ‘nvmlDeviceGetPciInfo_v3’
gpu_monitor.c:(.text+0x224): undefined reference to ‘nvmlErrorString’
gpu_monitor.c:(.text+0x282): undefined reference to ‘nvmlDeviceGetComputeMode’
gpu_monitor.c:(.text+0x2b6): undefined reference to ‘nvmlErrorString’
gpu_monitor.c:(.text+0x318): undefined reference to ‘nvmlDeviceSetComputeMode’
gpu_monitor.c:(.text+0x334): undefined reference to ‘nvmlErrorString’
gpu_monitor.c:(.text+0x379): undefined reference to ‘nvmlErrorString’
gpu_monitor.c:(.text+0x3ce): undefined reference to ‘nvmlDeviceSetComputeMode’
gpu_monitor.c:(.text+0x3ea): undefined reference to ‘nvmlErrorString’
gpu_monitor.c:(.text+0x423): undefined reference to ‘nvmlDeviceGetPowerUsage’
gpu_monitor.c:(.text+0x43f): undefined reference to ‘nvmlErrorString’
gpu_monitor.c:(.text+0x4a4): undefined reference to ‘nvmlShutdown’
gpu_monitor.c:(.text+0x4c0): undefined reference to ‘nvmlErrorString’
collect2: error: ld returned 1 exit status

Solution:

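At first I compiled with the command nvcc example.c -o example, which produced the error above.
It compiles successfully with the command nvcc -I -L/usr/lib -lnvidia-ml example.c -o example. The essential part is -lnvidia-ml, which links the NVML library that defines the missing nvml* symbols.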

[Solved] NVIDIA driver error: NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver…

NVIDIA driver error reporting solution

Enter on the command line:

nvidia-smi

Error reported:

NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running

Solution 1:

Don’t rush to reinstall the NVIDIA driver. First check whether Secure Boot is disabled in the BIOS. If it is not, enter the BIOS and disable Secure Boot.

Solution 2:

Once you have confirmed that Secure Boot is disabled, try the fix commonly suggested online:

Open the Ubuntu advanced options at boot and select the previous kernel version. If that doesn’t work, reinstall the NVIDIA driver.

VS + CUDA compilation error: MSB3721, return code 255

This is the second time I have run into this error, and I spent a long time searching for the fix again. I clearly don’t remember things well, so I am recording it here.

Problem description: compilation error MSB3721, return code 255
Solution:
In the project properties, under CUDA C/C++ -> Generate Relocatable Device Code, select Yes (-rdc=true)

reference: https://www.cnblogs.com/qpswwww/p/11646593.html

RuntimeError: cuda runtime error (100) : no CUDA-capable device is detected at /opt/conda/conda-bld/

Problem

RuntimeError: cuda runtime error (100) : no CUDA-capable device is detected at /opt/conda/conda-bld/

Solution

The most likely cause is a wrong CUDA device number.
For example, your settings select GPUs 2 and 3, but the machine only has two GPUs, numbered 0 and 1. That leads to this error.
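The check described above can be sketched in a few lines. invalid_device_ids is a hypothetical helper, and it assumes the devices are given as integer indices (CUDA_VISIBLE_DEVICES can also contain device UUIDs, which this sketch does not handle):

```python
def invalid_device_ids(visible_devices, device_count):
    """Return the requested GPU indices that do not exist on the machine.

    visible_devices is a CUDA_VISIBLE_DEVICES-style string such as "2,3";
    device_count is the number of GPUs physically present.
    """
    requested = [int(x) for x in visible_devices.split(",") if x.strip()]
    return [i for i in requested if not 0 <= i < device_count]


# The case from the text: GPUs 2 and 3 requested, only GPUs 0 and 1 exist.
print(invalid_device_ids("2,3", 2))  # → [2, 3]
print(invalid_device_ids("0,1", 2))  # → []
```

An empty list means every requested index exists.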

[Solved] RuntimeError: CUDA error: out of memory

1. Check whether the appropriate version of torch is used

import torch

print(torch.__version__)  # 1.9.1+cu111
print(torch.version.cuda)  # 11.1
print(torch.backends.cudnn.version())  # 8005
print(torch.cuda.current_device())  # 0
print(torch.cuda.is_available())  # True

2. Check whether GPU memory is insufficient. Try reducing the training batch size. If the error persists even at the minimum batch size, monitor GPU memory usage in real time with the following command:

watch -n 0.5 nvidia-smi

GPU memory was already occupied even when my program was not running.

So the real problem was that my program was configured to use four GPUs. The first two could be used without trouble, but the third was occupied by a colleague's job, so an error was reported.

3. Specify the GPU to use

device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")  # use the GPU if available and not disabled, otherwise the CPU
model = torch.nn.DataParallel(model, device_ids=[0, 1, 3])  # GPU indices to use for multi-GPU parallel processing (skips the busy GPU 2)

After that, the program runs without the error.
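Picking the idle GPUs by eye from nvidia-smi works, but the same information can be read programmatically. A rough sketch, assuming the nvidia-smi query options --query-gpu and --format=csv,noheader,nounits; the helper names, the 100 MiB threshold, and the sample text are my own illustrative choices, not measured values:

```python
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=index,memory.used",
         "--format=csv,noheader,nounits"]


def free_gpu_ids(csv_text, max_used_mib=100):
    """Return indices of GPUs whose used memory (MiB) is at most max_used_mib."""
    free = []
    for line in csv_text.strip().splitlines():
        index, used = (field.strip() for field in line.split(","))
        if int(used) <= max_used_mib:
            free.append(int(index))
    return free


def query_free_gpus(max_used_mib=100):
    """Run nvidia-smi and return the free GPU indices (needs an NVIDIA driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return free_gpu_ids(out, max_used_mib)


# Illustrative output only (not measured): GPU 2 is busy, the others are idle.
sample = "0, 3\n1, 5\n2, 10240\n3, 4\n"
print(free_gpu_ids(sample))  # → [0, 1, 3]
```

The resulting list can be passed as device_ids to torch.nn.DataParallel instead of a hard-coded list.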

Error about XX error 1 querying major version

Error Code 1: Myelin (cuBLASLt error 1 querying major version.)

This error occurs because a library with the expected version suffix cannot be found. Some libraries look up the shared libraries they depend on by the version suffix after .so, and report an error if no file with that name exists.

Solution:

Make sure every versioned .so name that is looked up exists on the library search path.

For example, use the command ln -s libcublasLt.so.11 libcublasLt.so.11.5.2.43 to create the corresponding links:

 libcublasLt.so -> libcublasLt.so.11
 libcublasLt.so.11
 libcublasLt.so.11.5.2.43 -> libcublasLt.so.11
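The link layout above can also be created from Python. The sketch below works in a temporary directory with an empty stand-in file, so it does not touch a real library directory; link_versioned_name is a hypothetical helper name:

```python
import os
import tempfile


def link_versioned_name(lib_dir, existing, missing):
    """Create lib_dir/missing as a symlink to existing (like `ln -s existing missing`)."""
    link_path = os.path.join(lib_dir, missing)
    if not os.path.lexists(link_path):
        os.symlink(existing, link_path)
    return link_path


# Demonstration in a throwaway directory with an empty stand-in file,
# so no real library directory is touched.
demo_dir = tempfile.mkdtemp()
open(os.path.join(demo_dir, "libcublasLt.so.11"), "w").close()
link_versioned_name(demo_dir, "libcublasLt.so.11", "libcublasLt.so.11.5.2.43")
link_versioned_name(demo_dir, "libcublasLt.so.11", "libcublasLt.so")
print(sorted(os.listdir(demo_dir)))
```

On a real system, lib_dir would be the directory that holds libcublasLt.so.11, and writing there usually needs root permissions.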

Pytorch CUDA Error: UserWarning: CUDA initialization: CUDA unknown error…

After installing CUDA, PyTorch reports the following error:

UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start.

Solution: after CUDA and PyTorch are installed, add the following to ~/.bashrc

export PATH=/usr/local/cuda-11.4/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-11.4/lib64:$LD_LIBRARY_PATH
export CUDA_HOME=/usr/local/cuda-11.4
export CUDA_VISIBLE_DEVICES=0,1

If there is still a problem, install nvidia-modprobe with sudo apt-get install nvidia-modprobe. After the installation, CUDA should be usable.

How to check that CUDA works

import torch
flag = torch.cuda.is_available()
print(flag)

If the output is True, CUDA is working normally.

[Solved] RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cubla…

Resolving RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cubla…

I ran into this problem while running an experiment. Some people say it is caused by mismatched tensor dimensions, but after checking, that was not the case here.

Another suggested fix is to add one line of code:

torch.backends.cudnn.enabled = False. I have not tried it, because I found that main.py and the other files set different CUDA devices (my dataset is small, so I had not set up nn.DataParallel). Once I made the device settings consistent, the error went away.

Therefore, if you encounter this problem, you can check whether each variable and model are on the same CUDA device.

TensorFlow GPU errors (4 types of error and their solution)

I recently switched to a new laptop, and training models with tensorflow-gpu kept failing, so I am writing down my solution here. While trying to get the GPU to work I swapped the CUDA version, the cuDNN version, and the tensorflow-gpu and Keras versions, so the errors I collected are a mixed bag.

Error 1:
2021-08-09 21:04:53.637764: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_100.dll
2021-08-09 21:04:58.598447: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2021-08-09 21:17:47.603456: W tensorflow/stream_executor/cuda/redzone_allocator.cc:312] Internal: Invoking ptxas not supported on Windows
Relying on driver to perform ptx compilation. This message will be only logged once.
2021-08-09 21:17:47.675868: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED
2021-08-09 21:17:47.676730: I tensorflow/stream_executor/stream.cc:4963] [stream=000001774007A1F0,impl=00000177393F7250] did not memzero GPU location; source: 000000726209DF28
2021-08-09 21:17:47.676867: I tensorflow/stream_executor/stream.cc:316] did not allocate timer: 000000726209DED0
2021-08-09 21:17:47.676954: I tensorflow/stream_executor/stream.cc:1964] [stream=000001774007A1F0,impl=00000177393F7250] did not enqueue ‘start timer’: 000000726209DED0
2021-08-09 21:17:47.677084: I tensorflow/stream_executor/stream.cc:1976] [stream=000001774007A1F0,impl=00000177393F7250] did not enqueue ‘stop timer’: 000000726209DED0
2021-08-09 21:17:47.677201: F tensorflow/stream_executor/gpu/gpu_timer.cc:65] Check failed: start_event_ != nullptr && stop_event_ != nullptr

Error 2:

Error 3:

tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.

Error 4:
CuDNN library: 7.4.1 but source was compiled with: 7.6.0. CuDNN library major and minor version needs to match or have higher minor version in case of CuDNN 7.0 or later version. If using a binary install, upgrade your CuDNN library. If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration
In fact, all of these errors had a single cause: my laptop's graphics card is a 3050 Ti, which belongs to the 30 series and only works with CUDA 11 or higher. So I reinstalled CUDA 11.3.1 and the corresponding cuDNN 8.2.0 (cuDNN 8.2.1 reported an error).
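The rule quoted in the Error 4 message can be written down directly. This sketch only encodes what the message says for cuDNN 7.0 or later (same major version, equal or higher minor version); cudnn_compatible is a hypothetical helper name:

```python
def cudnn_compatible(loaded, compiled):
    """Apply the rule quoted in Error 4 for cuDNN 7.0 or later.

    The loaded library must have the same major version as the one the
    framework was compiled with, and an equal or higher minor version.
    """
    loaded_major, loaded_minor = (int(x) for x in loaded.split(".")[:2])
    built_major, built_minor = (int(x) for x in compiled.split(".")[:2])
    return loaded_major == built_major and loaded_minor >= built_minor


# Versions from the error message: library 7.4.1, compiled against 7.6.0.
print(cudnn_compatible("7.4.1", "7.6.0"))  # → False
print(cudnn_compatible("7.6.5", "7.6.0"))  # → True
```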

[Solved] CUDA driver version is insufficient for CUDA runtime version

Question:

An error is reported when running the OneFlow code of InsightFace in Docker:

 Failed to get cuda runtime version: CUDA driver version is insufficient for CUDA runtime version

Cause:

1. View CUDA runtime version

cat /usr/local/cuda/version.txt

The CUDA version in my docker is 10.0.130

CUDA Version 10.0.130

2. The CUDA version has requirements for the graphics card driver version, see the following link.
https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html

CUDA Toolkit	Linux x86_64 Driver Version	Windows x86_64 Driver Version
CUDA 11.0.3 Update 1
CUDA 11.0.2 GA	>= 450.51.05	>= 451.48
CUDA 11.0.1 RC	>= 450.36.06	>= 451.22
CUDA 10.2.89	>= 440.33	>= 441.22
CUDA 10.1 (10.1.105 general release, and updates)	>= 418.39	>= 418.96
CUDA 10.0.130	>= 410.48	>= 411.31
CUDA 9.2 (9.2.148 Update 1)	>= 396.37	>= 398.26
CUDA 9.2 (9.2.88)	>= 396.26	>= 397.44

Running cat /proc/driver/nvidia/version showed that the server's graphics card driver is 418.67. I concluded that CUDA 10.1 should be installed, whereas I had installed CUDA 10.0.130.

NVRM version: NVIDIA UNIX x86_64 Kernel Module  418.67  Sat Apr  6 03:07:24 CDT 2019
GCC version:  gcc version 7.3.0 (Ubuntu 7.3.0-27ubuntu1~18.04)
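The comparison against the table can be automated. The sketch below copies a few rows of the Linux column from the table above and parses the NVRM line shown here; the helper names are hypothetical, and the regular expression assumes the line has exactly this "Kernel Module  <version>" layout:

```python
import re

# Minimum Linux driver versions copied from the table above (subset).
MIN_LINUX_DRIVER = {
    "11.0.2": "450.51.05",
    "10.2.89": "440.33",
    "10.1": "418.39",
    "10.0.130": "410.48",
}


def parse_driver_version(text):
    """Extract the driver version from /proc/driver/nvidia/version text."""
    match = re.search(r"Kernel Module\s+(\d+(?:\.\d+)+)", text)
    return match.group(1) if match else None


def version_tuple(version):
    """Turn "418.67" into (418, 67) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def driver_meets_minimum(driver, cuda):
    """True if the driver is at least the table's minimum for that CUDA release."""
    return version_tuple(driver) >= version_tuple(MIN_LINUX_DRIVER[cuda])


line = "NVRM version: NVIDIA UNIX x86_64 Kernel Module  418.67  Sat Apr  6 03:07:24 CDT 2019"
driver = parse_driver_version(line)
print(driver)                                    # → 418.67
print(driver_meets_minimum(driver, "10.1"))      # → True
print(driver_meets_minimum(driver, "10.2.89"))   # → False
```

Going only by the rows quoted in the table, 418.67 also meets the minimum listed for CUDA 10.0.130 (410.48), so the version numbers alone may not explain the error.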

Solution:

Installing CUDA 10.1

(1) First, go to https://developer.nvidia.com/cuda-toolkit-archive and download the CUDA 10.1 installation file that matches your machine environment. For the installer type, I chose runfile (local), which makes the installation steps simpler.

wget https://developer.download.nvidia.com/compute/cuda/10.1/Prod/local_installers/cuda_10.1.243_418.87.00_linux.run

(2) Installation

sudo sh cuda_10.1.243_418.87.00_linux.run

The same error still occurred, so the problem remains unresolved.
I will update this post when I find a solution.
