The error reported by PyTorch is as follows:
PyTorch distributed RuntimeError: Address already in use
Reason:
During multi-GPU training, the master port that torch.distributed wants to bind is already occupied by another process. Switching to a free port resolves the error.
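To confirm this before launching, you can probe whether the port is already taken. A minimal Python sketch, assuming the launcher's usual default port 29500 on localhost:

import socket

def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    # connect_ex returns 0 when the connection succeeds, i.e. when
    # another process is already listening on host:port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.connect_ex((host, port)) == 0

print("port 29500 busy:", port_in_use(29500))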
Solution:
Add the --master_port parameter to the launch command, for example:
--master_port 29501
The value 29501 here can be replaced with any other free port.
Note:
This parameter must be placed before the training script (xxx.py) so that the launcher, not the script, receives it. For example:
CUDA_VISIBLE_DEVICES=2,7 python3 -m torch.distributed.run \
    --nproc_per_node 2 --master_port 29501 train.py
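If you would rather not pick a port by hand, you can ask the operating system for an unused one and pass it to --master_port. A minimal sketch; the helper name free_port is my own, not part of PyTorch:

import socket

def free_port() -> int:
    # Binding to port 0 asks the OS to assign an unused ephemeral port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]

print(free_port())  # e.g. 45123; pass the printed value to --master_port

There is a small race: the port is released when this script exits and could in principle be taken by another process before the launcher binds it, but in practice this is rarely a problem.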