RuntimeError: Address already in use [How to Solve]

(cp3) zhangrui@test:~/project/cp3/CenterPoint$ python -m torch.distributed.launch --nproc_per_node=4 ./tools/train.py ~/project/cp3/CenterPoint/work_dirs/voxsc_centerpoint_voxelnet_0075voxel_fix_bn_z.py
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
No Tensorflow
No Tensorflow
Traceback (most recent call last):
  File "./tools/train.py", line 137, in <module>
    main()
  File "./tools/train.py", line 86, in main
    torch.distributed.init_process_group(backend="nccl", init_method="env://")
  File "/home/zhangrui/anaconda3/envs/cp3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 436, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/zhangrui/anaconda3/envs/cp3/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 179, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Address already in use

The TCP port is occupied. When you start multiple jobs on one machine, you need to specify a different port for each job (29500 by default) to avoid a communication conflict.
The solution is to specify the port when running the program: pass an arbitrary free port (any unused value below 65536) via --master_port before the .py file to be executed:

python -m torch.distributed.launch --nproc_per_node=1 --master_port 29501  ./tools/train.py ~/project/cp3/CenterPoint/work_dirs/voxel_config/nusc_centerpoint_voxelnet_0075voxel_fix_bn_z.py
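
If, for example, two training jobs must run on the same machine at the same time, give each its own free port (the config names below are only placeholders, not files from the repository):

python -m torch.distributed.launch --nproc_per_node=4 --master_port 29501 ./tools/train.py CONFIG_A.py
python -m torch.distributed.launch --nproc_per_node=4 --master_port 29502 ./tools/train.py CONFIG_B.py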

Another way is to find the occupied port number (for example, by printing it in the program as sketched below), then find the PID that owns that port with netstat -nltp, and finally release the port with kill -9 PID.
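
As a minimal sketch of the "print the port" idea (assuming a training script that, like the one in the traceback above, calls init_process_group with init_method="env://"): torch.distributed.launch passes the chosen port to every worker through the MASTER_PORT environment variable, so printing it just before the call shows which port must be free:

import os
import torch.distributed as dist

# Run this under torch.distributed.launch. The env:// rendezvous reads
# MASTER_ADDR / MASTER_PORT, which the launcher sets for each worker from
# --master_addr / --master_port (29500 if not given).
print("MASTER_ADDR:", os.environ.get("MASTER_ADDR"))
print("MASTER_PORT:", os.environ.get("MASTER_PORT"))
dist.init_process_group(backend="nccl", init_method="env://")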
