The DDP-based PyTorch training program runs without problems on the host machine,
but when the same program is run inside a Docker container it fails with the error "unhandled system error, NCCL version 2.7.8".
Solution:
Put NCCL_DEBUG=INFO
in front of the python -m torch.distributed.launch --nproc_per_node=4 command so that NCCL prints detailed logs.
The output then shows the underlying failure:
s215:623:649 [3] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-send-404da1ec128dc62d-0-3-2 (size 4104)
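With NCCL_DEBUG=INFO the output is verbose, so it can help to filter it down to the warning lines. The following is a minimal sketch, not part of NCCL or PyTorch: the helper name nccl_warnings is hypothetical, and it only assumes that warnings contain the text "NCCL WARN", as in the line above.

```python
def nccl_warnings(log_text):
    """Return the message part of every log line that contains 'NCCL WARN'.

    Hypothetical helper for illustration; it assumes warnings are marked
    with the literal text 'NCCL WARN', as in the log line shown above.
    """
    marker = "NCCL WARN"
    messages = []
    for line in log_text.splitlines():
        if marker in line:
            # Keep only the text that follows the marker.
            messages.append(line.split(marker, 1)[1].strip())
    return messages


# The log line from this post, used as sample input.
sample = (
    "s215:623:649 [3] include/shm.h:48 NCCL WARN Error while creating "
    "shared memory segment nccl-shm-send-404da1ec128dc62d-0-3-2 (size 4104)"
)
print(nccl_warnings(sample))
```

In practice the training output could be saved to a file and its contents passed to the function in place of the sample string.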
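NCCL could not create its shared memory segment, which usually means the container's shared memory is too restricted. When starting the container, add --ipc=host to the docker run command so that the container uses the host's shared memory.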
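To check whether the fix took effect, the shared memory available inside the container can be inspected. This is a rough sketch under a few assumptions: the function name shm_report is hypothetical, shared memory is assumed to be mounted at /dev/shm (the usual location on Linux), and a container started without --ipc=host typically gets a small default /dev/shm (64 MB by default in Docker), whereas with --ipc=host it sees the host's.

```python
import os
import shutil


def shm_report(path="/dev/shm"):
    """Return total and free space at `path` in MiB, or None if it is missing.

    Hypothetical helper; /dev/shm is assumed to be the shared memory mount.
    """
    if not os.path.isdir(path):
        return None
    usage = shutil.disk_usage(path)
    mib = 1024 * 1024
    return {"total_mib": usage.total / mib, "free_mib": usage.free / mib}


# Run inside the container before and after adding --ipc=host and compare.
print(shm_report())
```

If the reported total is only a few dozen MiB, the container is probably still using its own small shared memory segment rather than the host's.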