(cp3) zhangrui@test:~/project/cp3/CenterPoint$ python -m torch.distributed.launch --nproc_per_node=4 ./tools/train.py ~/project/cp3/CenterPoint/work_dirs/voxsc_centerpoint_voxelnet_0075voxel_fix_bn_z.py
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
No Tensorflow
No Tensorflow
Traceback (most recent call last):
  File "./tools/train.py", line 137, in <module>
    main()
  File "./tools/train.py", line 86, in main
    torch.distributed.init_process_group(backend="nccl", init_method="env://")
  File "/home/zhangrui/anaconda3/envs/cp3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 436, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/zhangrui/anaconda3/envs/cp3/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 179, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Address already in use
The TCP port is already occupied. This happens when several jobs are started on the same machine: each job needs its own port (29500 by default) to avoid communication conflicts.
The solution is to specify the port when launching the program. Pass --master_port with any unused port number (valid TCP ports go up to 65535) before the .py file to be executed:
python -m torch.distributed.launch --nproc_per_node=1 --master_port 29501 ./tools/train.py ~/project/cp3/CenterPoint/work_dirs/voxel_config/nusc_centerpoint_voxelnet_0075voxel_fix_bn_z.py
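Rather than guessing a port number, you can ask the operating system for one that is currently free. The following is only a minimal sketch using the Python standard library; the helper name find_free_port is hypothetical and not part of PyTorch or CenterPoint. Note that the port is released again before the training job starts, so another process could in principle take it in between.

```python
import socket


def find_free_port(host="127.0.0.1"):
    """Return a TCP port that is unused at the moment of the call.

    Hypothetical helper for illustration; not a PyTorch API.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # Binding to port 0 tells the kernel to choose any free port.
        s.bind((host, 0))
        # getsockname() returns (address, port) for an IPv4 socket.
        return s.getsockname()[1]
```

The returned number can then be passed as the value of --master_port in the launch command.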
Another way is to find the occupied port number (for example, by printing it from the program), look up the PID of the process holding that port with netstat -nltp, and then free the port with kill -9 PID.
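Before killing anything, it can help to confirm that the port really is taken. The sketch below assumes the job uses init_method="env://", in which case the port comes from the MASTER_PORT environment variable (29500 when the launcher is left at its default). The function name port_in_use is hypothetical, and the check treats any failure to bind as "in use", which also covers ports the current user is not allowed to bind.

```python
import os
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if binding to (host, port) fails.

    Hypothetical helper for illustration; any OSError from bind()
    is treated as "port not available".
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            return True
        return False


# With env:// initialization the port is read from MASTER_PORT;
# fall back to the launcher default of 29500 if it is not set.
master_port = int(os.environ.get("MASTER_PORT", "29500"))
print(f"port {master_port} in use: {port_in_use(master_port)}")
```

If the check reports the port as taken, netstat -nltp shows which PID owns it.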