I have image-processing Python code running on a cluster. I'm using MS-MPI with mpi4py for inter-process communication. Sometimes one of the Python processes is randomly terminated with exit code "0xc0000005" (an access violation; a null pointer dereference, I guess).
job aborted:
[ranks] message
[0] terminated
[1] process exited without calling finalize
[2-35] terminated
---- error analysis -----
[1] on clusternode-02
python ended prematurely and may have crashed. exit code 0xc0000005
I'm pretty sure it happens within the OpenCV code I'm using, but it happens completely at random: when I restart all the jobs, the same host processes the same job just fine. So, to work around this issue without debugging Python and OpenCV, I would like to decrease the number of available processes for the currently running task, reschedule the failed job, and continue. So the question is: is there a way to let all the other jobs continue when one job is terminated, without mpiexec halting completely?
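To make the behaviour I'm after concrete, here is a minimal sketch of the "reschedule and continue" logic. It uses plain child processes as a stand-in for MPI ranks, so it is not how mpiexec works; the names `run_job`, `run_all` and `max_attempts` are made up for illustration, and each job is assumed to be a snippet of Python code that exits with 0 on success.

```python
import subprocess
import sys
from collections import deque


def run_job(job_code):
    """Run one job in a separate Python process and return its exit code.

    If the child crashes (for example with an access violation), only the
    child dies; the parent just sees a nonzero exit code.
    """
    return subprocess.run([sys.executable, "-c", job_code]).returncode


def run_all(jobs, max_attempts=3):
    """Run every job, putting failed jobs back on the queue.

    jobs maps a job name to the Python code to execute (hypothetical format).
    Returns (done, failed): names that succeeded, and names that still
    failed after max_attempts tries.
    """
    queue = deque((name, code, 1) for name, code in jobs.items())
    done, failed = [], []
    while queue:
        name, code, attempt = queue.popleft()
        if run_job(code) == 0:
            done.append(name)
        elif attempt < max_attempts:
            # Reschedule the failed job behind the remaining ones.
            queue.append((name, code, attempt + 1))
        else:
            failed.append(name)
    return done, failed
```

Called as `run_all({"a": "pass", "b": "import sys; sys.exit(5)"})`, this keeps processing the remaining jobs while the failing one is retried, which is what I would like mpiexec to allow instead of aborting all ranks.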
Thank you