Tag Archives: pytorch

[Solved] jetson Compile pytorch Error: internal compiler error: Segmentation fault

The following error occurred when compiling PyTorch on a Jetson:

/usr/include/c++/7/cmath: In static member function ‘static scalar_t at::native::div_floor_kernel_cuda(at::TensorIterator&)::<lambda()>::<lambda()>::<lambda(scalar_t, scalar_t)>::_FUN(scalar_t, scalar_t)’:
/usr/include/c++/7/cmath:1302:38: internal compiler error: Segmentation fault
   { return __builtin_copysignf(__x, __y); }

This happens because the arm64 GCC on the Jetson platform fails with an internal error while compiling std::copysign() in this kernel.

There are two solutions:
1. Upgrade PyTorch to version 1.9, which uses c10::cuda::compat::copysign() instead of std::copysign().
2. Patch the source according to this change:
Workaround arm64 gcc error in std::copysign on Jetson platforms

[Solved] Error(s) in loading state_dict for GeneratorResNet

Cause of the problem: check whether nn.DataParallel was used for multi-GPU training. A model wrapped this way saves every key in its state_dict with a "module." prefix.
Looking at the error message, you can see that the keys in the saved model carry this extra "module." prefix, so they do not match the keys the unwrapped model expects.

Solution:
Strip the "module." prefix from the keys before loading:

    from collections import OrderedDict

    import torch

    gentmps = torch.load("./saved_models/generator_%d.pth" % opt.epoch)
    distmps = torch.load("./saved_models/discriminator_%d.pth" % opt.epoch)
    new_gens = OrderedDict()
    new_diss = OrderedDict()
    for k, v in gentmps.items():
        name = k.replace('module.', '')  # remove the 'module.' prefix added by DataParallel
        new_gens[name] = v  # keep the same tensor under the renamed key
    for k, v in distmps.items():
        name = k.replace('module.', '')  # remove the 'module.' prefix added by DataParallel
        new_diss[name] = v  # keep the same tensor under the renamed key
    generator.load_state_dict(new_gens)
    discriminator.load_state_dict(new_diss)
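
The two loops above do the same renaming, so they can be folded into one helper. The following is a minimal sketch, not part of the original fix: strip_module_prefix is a hypothetical name, and plain integers stand in for tensors so the snippet runs without PyTorch. Unlike str.replace, it only removes a leading "module." and leaves keys that contain "module." elsewhere untouched.

```python
from collections import OrderedDict


def strip_module_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading prefix removed from each key."""
    cleaned = OrderedDict()
    for key, value in state_dict.items():
        # Only strip a *leading* prefix, so a key such as
        # "encoder.module.weight" is left untouched.
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        cleaned[new_key] = value
    return cleaned


# Plain values stand in for tensors; a real state_dict maps names to tensors.
saved = OrderedDict([("module.conv1.weight", 1), ("module.conv1.bias", 2)])
print(list(strip_module_prefix(saved)))
```

With a real checkpoint, the result of strip_module_prefix(torch.load(path)) would be passed to load_state_dict in the same way as new_gens and new_diss.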

[Solved] ERROR: Command errored out with exit status 128: git clone -q

This error occurred when installing a package with pip directly from a GitHub repository (git+https). The full output is:

Collecting git+https://github.com/pytorch/tnt.git@master
  Cloning https://github.com/pytorch/tnt.git (to revision master) to c:\users\lee\appdata\local\temp\pip-req-build-lbtbze6v
  Running command git clone -q https://github.com/pytorch/tnt.git 'C:\Users\lee\AppData\Local\Temp\pip-req-build-lbtbze6v'
  fatal: unable to access 'https://github.com/pytorch/tnt.git/': OpenSSL SSL_read: Connection was reset, errno 10054
WARNING: Discarding git+https://github.com/pytorch/tnt.git@master. Command errored out with exit status 128: git clone -q https://github.com/pytorch/tnt.git 'C:\Users\lee\AppData\Local\Temp\pip-req-build-lbtbze6v' Check the logs for full command output.
ERROR: Command errored out with exit status 128: git clone -q https://github.com/pytorch/tnt.git 'C:\Users\lee\AppData\Local\Temp\pip-req-build-lbtbze6v' Check the logs for full command output.

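Solution: the clone failed because the HTTPS connection to GitHub was reset. Switching the URL scheme from git+https:// to git+git:// worked (shown here with a different package). Note that GitHub has since turned off the unauthenticated git:// protocol, so this workaround may no longer apply; in that case retry the git+https:// URL on a stable connection.

 pip install git+git://github.com/ildoonet/pytorch-gradual-warmup-lr.git

The installation then succeeded: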

Collecting git+git://github.com/ildoonet/pytorch-gradual-warmup-lr.git
  Cloning git://github.com/ildoonet/pytorch-gradual-warmup-lr.git to c:\users\lee\appdata\local\temp\pip-req-build-ncz3svwe
  Running command git clone -q git://github.com/ildoonet/pytorch-gradual-warmup-lr.git 'C:\Users\lee\AppData\Local\Temp\pip-req-build-ncz3svwe'
  Resolved git://github.com/ildoonet/pytorch-gradual-warmup-lr.git to commit 6b5e8953a80aef5b324104dc0c2e9b8c34d622bd
Building wheels for collected packages: warmup-scheduler
  Building wheel for warmup-scheduler (setup.py) ... done
  Created wheel for warmup-scheduler: filename=warmup_scheduler-0.3.2-py3-none-any.whl size=3917 sha256=4cef133c28685f1e5a70364fd6546a3d4d995fe49584781b1024d3707f9f222f
  Stored in directory: C:\Users\lee\AppData\Local\Temp\pip-ephem-wheel-cache-m6_t0cfl\wheels\d2\57\4d\3adb5d109151933485f2b4387f61a90ff8e669f50fc8f1fa14
Successfully built warmup-scheduler
Installing collected packages: warmup-scheduler
Successfully installed warmup-scheduler-0.3.2

[Solved] Failed environment install leads to “unable to create process using“ error

This may happen when many environments are installed in conda and some of them were copied by hand. At some point conda suddenly stops working.

Every time you run conda activate, it reports this error:

Fatal error in launcher: Unable to create process using '"d:\project\gae_gan\ddgk\venv\scripts\python.exe" "D:\Programs\Anaconda3\Scripts\conda.exe" '

I finally found a very simple solution: run the activate script directly in the console:

***\Anaconda3\Scripts\activate

Then restart the console and conda works again.
Reference:
https://stackoverflow.com/questions/59721699/anaconda-is-unable-to-create-process-on-windows10

[Solved] RuntimeError: version_ <= kMaxSupportedFileFormatVersion INTERNAL ASSERT FAILED at /pytorch/caffe2/s

Error Messages:

RuntimeError: version_ <= kMaxSupportedFileFormatVersion INTERNAL ASSERT FAILED at /pytorch/caffe2/serialize/inline_container.cc:132, please report a bug to PyTorch. Attempted to read a PyTorch file with version 3, but the maximum supported version for reading is 2. Your PyTorch installation may be too old. (init at /pytorch/caffe2/serialize/inline_container.cc:132)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7fdcd0189193 in /home/a430/intel/intelpython3/envs/zayn/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: caffe2::serialize::PyTorchStreamReader::init() + 0x1f5b (0x7fdcd33119eb in /home/a430/intel/intelpython3/envs/zayn/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #2: caffe2::serialize::PyTorchStreamReader::PyTorchStreamReader(std::string const&) + 0x64 (0x7fdcd3312c04 in /home/a430/intel/intelpython3/envs/zayn/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x6c53a6 (0x7fdd1b2423a6 in /home/a430/intel/intelpython3/envs/zayn/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x2961c4 (0x7fdd1ae131c4 in /home/a430/intel/intelpython3/envs/zayn/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>

This is reported to be a torch version problem: as the message says, the file was written by a newer PyTorch than the one installed. My torch version was 1.4.0. After upgrading torch to 1.6.0 and torchvision to 0.7.0, a different error is reported:

RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.

Follow the hint and pass map_location=torch.device('cpu') when loading the model, as follows:
best_model = torch.load("../weights/September01-Unet-se_resnext50_32x4d/checkpoint.pth.tar", map_location=torch.device('cpu'))

[Solved] Pytorch error: RuntimeError: one of the variables needed for gradient computation

Error:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation

Analysis: newer versions of PyTorch merge Variable and Tensor into a single Tensor type. In-place operations that used to work on a Variable now raise this error when they overwrite a tensor that autograd still needs for the backward pass.

Check whether the code contains an in-place expression such as

x += res  # error

and change it to the out-of-place form

x = x + res

after which no error is reported.

pytorch model.load_state_dict Error [How to Solve]

When a model creates some layers only under a condition (for example a training-mode flag), those layers are still saved in the checkpoint. If the model is later constructed with the condition false, load_state_dict reports an error because the checkpoint contains keys for layers that the model does not have. For example:

if backbone == "mobilenet":
    self.backbone = mobilenet()
    flat_shape = 1024
elif backbone == "inception_resnetv1":
    self.backbone = inception_resnet()
else:
    raise ValueError('Unsupported backbone - `{}`, Use mobilenet, inception_resnetv1.'.format(backbone))
self.avg = nn.AdaptiveAvgPool2d((1,1))
self.Bottleneck = nn.Linear(flat_shape, embedding_size,bias=False)
self.last_bn = nn.BatchNorm1d(embedding_size, eps=0.001, momentum=0.1, affine=True)
if mode == "train":  # the classifier layer is created only in training mode, not for testing
    self.classifier = nn.Linear(embedding_size, num_classes)

Pass strict=False so that load_state_dict ignores the keys that do not match between the checkpoint and the model:

model2.load_state_dict(state_dict2, strict=False)
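
Because strict=False skips mismatched keys silently, it can help to list them before loading. The following is a minimal sketch in plain Python, not part of the original post: compare_keys is a hypothetical helper, and the key names are illustrative values modelled on the layers in the example above.

```python
def compare_keys(model_keys, checkpoint_keys):
    """Return (missing, unexpected) key lists for a model/checkpoint pair.

    missing:    keys the model has but the checkpoint lacks
    unexpected: keys the checkpoint has but the model lacks
    """
    model_keys = list(model_keys)
    checkpoint_keys = list(checkpoint_keys)
    missing = [k for k in model_keys if k not in checkpoint_keys]
    unexpected = [k for k in checkpoint_keys if k not in model_keys]
    return missing, unexpected


# Hypothetical keys: the checkpoint was saved in "train" mode (with a
# classifier), the model is built in test mode (without one).
model_keys = ["Bottleneck.weight", "last_bn.weight", "last_bn.bias"]
checkpoint_keys = model_keys + ["classifier.weight", "classifier.bias"]

missing, unexpected = compare_keys(model_keys, checkpoint_keys)
print("missing:", missing)
print("unexpected:", unexpected)
```

With a real model the two lists would come from model.state_dict().keys() and the keys of the loaded checkpoint. load_state_dict itself also returns an object with missing_keys and unexpected_keys fields, which is worth printing when strict=False is used, since any layer listed as missing keeps its freshly initialized weights.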

[Solved] apex Xavier install torchvision error: illegal instruction

Description: I installed torchvision 0.8 (tag v0.8.1) from source:

git clone -b v0.8.1 https://github.com/pytorch/vision.git vision-0.8.1

Installation:

cd vision-0.8.1
sudo /home/nvidia/mambaforge/envs/ultra-fast-lane/bin/python3.6 setup.py install

Note: be sure to run the command with sudo and the full path to the environment's Python interpreter.

Error: illegal instruction

Solution: downgrade numpy. I used numpy 1.19.3 and it worked.

visdom Install and Run Error: raise ConnectionError [How to Solve]

Install visdom

Switch to the target conda environment. Running conda install visdom reported an error and the installation failed; installing with pip succeeded instead. The command is as follows:

pip install visdom

Run visdom

If you want to use visdom in Python code, you must first start the visdom server in the conda environment where visdom is installed:

python -m visdom.server

After the service is started, the following prompt will be given:

39: DeprecationWarning: zmq.eventloop.ioloop is deprecated in pyzmq 17. pyzmq now works with default tornado and asyncio eventloops.
  ioloop.install()  # Needs to happen before any tornado imports!
Checking for scripts.
Downloading scripts, this may take a little while
It's Alive!
INFO:root:Application Started
You can navigate to http://localhost:8097

Then open http://localhost:8097 in a browser to access the visualizations.

If the server has not been started with the above command, visdom calls from Python fail with the following error:

Traceback (most recent call last):
  File "D:\program\conda\envs\python36_gan\lib\site-packages\visdom\__init__.py", line 711, in _send
    data=json.dumps(msg),
  File "D:\program\conda\envs\python36_gan\lib\site-packages\visdom\__init__.py", line 677, in _handle_post
    r = self.session.post(url, data=data)
  File "D:\program\conda\envs\python36_gan\lib\site-packages\requests\sessions.py", line 581, in post
    return self.request('POST', url, data=data, json=json, **kwargs)
  File "D:\program\conda\envs\python36_gan\lib\site-packages\requests\sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "D:\program\conda\envs\python36_gan\lib\site-packages\requests\sessions.py", line 646, in send
    r = adapter.send(request, **kwargs)
  File "D:\program\conda\envs\python36_gan\lib\site-packages\requests\adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=8097): Max retries exceeded with url: /env/test1 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000025321093898>: Failed to establish a new connection: [WinError 10061] Unable to connect because the target computer actively refused.',))
[WinError 10061] Unable to connect because the target computer actively refused.
Exception in user code:
------------------------------------------------------------
Traceback (most recent call last):
  File "D:\program\conda\envs\python36_gan\lib\site-packages\urllib3\connection.py", line 157, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw
  File "D:\program\conda\envs\python36_gan\lib\site-packages\urllib3\util\connection.py", line 84, in create_connection
    raise err
  File "D:\program\conda\envs\python36_gan\lib\site-packages\urllib3\util\connection.py", line 74, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [WinError 10061] Unable to connect because the target computer actively refused.

During handling of the above exception, another exception occurred:
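
The traceback boils down to a refused TCP connection on localhost:8097, so a script can check for that before it touches visdom. The following is a minimal sketch using only the standard library; server_is_up is a hypothetical helper name, and the host and port are the defaults shown in the server output above.

```python
import socket


def server_is_up(host="localhost", port=8097, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and name resolution failures.
        return False


if not server_is_up():
    print("visdom server not reachable; start it with: python -m visdom.server")
```

This only confirms that some process is listening on the port, not that it is visdom, but it turns the long traceback into a one-line hint.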

[Solved] DDP/DistributedDataParallel Error: RuntimeError: Address already in use

An error is reported when testing PyTorch on multiple GPUs:
store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Address already in use

After investigation, it turned out that another DDP job was already running, so the master port was taken.

Solution:
manually specify an idle port (a valid port number is at most 65535)

python -m torch.distributed.launch --master_port 29501

To view port occupancy, run in a terminal:
netstat -nultp
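
Instead of picking a port by hand, the operating system can be asked for a free one. The following is a minimal sketch using only the standard library; find_free_port is a hypothetical helper name, and the printed command simply reuses the --master_port option from above.

```python
import socket


def find_free_port():
    """Ask the OS for a currently unused TCP port and return its number."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))  # port 0 lets the OS choose a free port
        return s.getsockname()[1]


port = find_free_port()
print(f"python -m torch.distributed.launch --master_port {port}")
```

The port is released again before the launcher starts, so another process could in principle grab it in between; for a single machine running a few jobs this is usually acceptable.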