The DDP-based PyTorch training program runs without problems on the host machine,
but when the same program is run inside a Docker container it fails with the error “unhandled system error, NCCL version 2.7.8”.
Solution:
Set NCCL_DEBUG=INFO
in front of the python -m torch.distributed.launch --nproc_per_node=4
command so that NCCL prints detailed debug logs. The output then contains the real cause:
s215:623:649 [3] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-send-404da1ec128dc62d-0-3-2 (size 4104)
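The debug run described above can also be scripted. The following is only a minimal sketch: the helper names `nccl_debug_env` and `launch_command` and the script name `train.py` are hypothetical and do not come from the original setup. It builds the same launcher command and an environment with NCCL_DEBUG=INFO set.

```python
import os
import sys


def nccl_debug_env(base_env=None):
    """Return a copy of the environment with NCCL debug logging enabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = "INFO"  # same setting as on the command line
    return env


def launch_command(script, nproc_per_node=4):
    """Build the torch.distributed.launch command for a training script."""
    return [
        sys.executable, "-m", "torch.distributed.launch",
        f"--nproc_per_node={nproc_per_node}",
        script,  # hypothetical script name supplied by the caller
    ]


# Usage (train.py is a hypothetical script name):
#   import subprocess
#   subprocess.run(launch_command("train.py"), env=nccl_debug_env(), check=True)
```

Passing the environment explicitly leaves the parent shell untouched, so the extra logging only applies to this one run.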
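NCCL failed to create its shared memory segment inside the container. When starting the container (docker run), add --ipc=host so that the container shares the host's IPC namespace, including its shared memory.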
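To check whether shared memory is the limiting factor, it may help to look at the size of /dev/shm from inside the container. The sketch below assumes that the failure comes from the container's small default /dev/shm (Docker's default is 64 MB), which the log line suggests but does not prove. The helper names are hypothetical.

```python
import shutil


def shm_usage_mib(path="/dev/shm"):
    """Return (total, free) capacity of the shared-memory mount in MiB."""
    usage = shutil.disk_usage(path)
    mib = 1024 * 1024
    return usage.total / mib, usage.free / mib


def looks_like_docker_default(total_mib, default_mib=64):
    """True if the mount is no larger than Docker's default 64 MiB /dev/shm."""
    return total_mib <= default_mib


# Usage inside the container:
#   total, free = shm_usage_mib()
#   print(f"/dev/shm: {total:.0f} MiB total, {free:.0f} MiB free")
#   if looks_like_docker_default(total):
#       print("Small /dev/shm; the container was probably started without --ipc=host")
```

Running it once without and once with --ipc=host should show whether the container now sees the host's shared memory.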