unhandled system error, NCCL version 2.7.8 [How to Solve]

A DDP-based PyTorch training program runs without problems on the host machine, but running the same program inside a Docker container fails with the error "unhandled system error, NCCL version 2.7.8".

Solution:

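Prefix the launch command with NCCL_DEBUG=INFO, that is, put it directly before python -m torch.distributed.launch --nproc_per_node=4, so that NCCL logs the underlying cause of the failure.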
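As a variation on this step, the variable can also be exported once so that it applies to every command started from the same shell. This is only a sketch: train.py is a placeholder name, since the original command does not show the training script.

```shell
# Enable verbose NCCL logging for all commands started from this shell.
export NCCL_DEBUG=INFO

# Then start training as usual. train.py is a placeholder script name:
#   python -m torch.distributed.launch --nproc_per_node=4 train.py
```

Exporting is convenient when the launch command is run several times while debugging. Run unset NCCL_DEBUG afterwards to return to quiet logs.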

The log then contains a warning like this:

s215:623:649 [3] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-send-404da1ec128dc62d-0-3-2 (size 4104)

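The warning shows that NCCL could not create a shared memory segment inside the container. To fix this, add --ipc=host to the docker run command when starting the container, so that the container shares the host's IPC namespace and its shared memory.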
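The fix can be sketched as two small shell helpers. Both function names are made up for illustration, the image name is supplied by the caller, and the sketch assumes the image provides bash. The second helper is an optional check that reports how much shared memory is visible from wherever it is run.

```shell
# Start an interactive container that shares the host IPC namespace,
# and with it the host's /dev/shm. "$1" is the image name (caller-supplied).
start_training_container() {
    docker run -it --ipc=host "$1" bash
}

# Print the total size of /dev/shm in KiB.
# Run this inside the container to compare against the host.
shm_size_kb() {
    df -Pk /dev/shm | awk 'NR==2 {print $2}'
}
```

For example, start_training_container my-image opens a shell in a hypothetical image called my-image. Without --ipc=host, Docker gives a container its own /dev/shm of 64 MB by default, so shm_size_kb should report a much larger value once the flag is in place. If sharing the host IPC namespace is not acceptable, passing --shm-size to docker run to enlarge the container's own /dev/shm is an alternative.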