How to Fix distributed training report terminate called after throwing an instance of’std::length_error’

In conducting the training in a distributed fashion.
INFO: sensorflow:Reduce to /replica:0/task:0/device:CPU:0 then broadcast to (‘/replica:0/task:0/device:CPU:0’,)
I0408 04:01 41.507015 140706188736256 cross_device_ops. reduce to /replica:0/task:0/device:CPU:0 then broadcast to (‘/replica:0/task:0/device:CPU:0’ ,).
INFO: tensorflow:Create CheckpointSaverHook.
I0408 04:01 44.424420 140706188736256 basic_session_run_hooks. py: 541] to create CheckpointSaverHook.
Call termination after throwing the instance ‘std::length_error’
what(): basic_string::append
Fatal Python Error: Abort
I’ve spared a lot of troubleshooting, and by reducing the number of GPUs, I can run it normally!


Read More: