How to continue mpiexec execution when one of the jobs is terminated

I have image-processing Python code running on a cluster. I'm using MS-MPI with mpi4py for inter-process communication. Sometimes one of the Python processes is randomly terminated with exit code 0xc0000005 (an access violation; a null pointer dereference, I guess).

job aborted:
[ranks] message

[0] terminated
[1] process exited without calling finalize
[2-35] terminated

---- error analysis -----

[1] on clusternode-02
python ended prematurely and may have crashed. exit code 0xc0000005

I'm pretty sure it happens within the OpenCV code I'm using, but it happens completely randomly: when I restart all the jobs, the same host processes the same job just fine. So, to work around this without debugging Python and OpenCV, I would like to decrease the number of available processes for the currently running task, reschedule the failed job, and continue. So the question is: is there a way to let all the other jobs continue when one job is terminated, without mpiexec halting completely?
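To make the "reschedule the failed job and continue" idea concrete, here is a minimal sketch of it outside MPI. It assumes each job can be run in a separate child interpreter, so that a crash kills only that child and the parent (which would be the MPI rank) survives to retry. That assumption is not part of my current setup. The names `CHILD_CODE`, `run_job`, `run_all`, and the `img_*` job names are hypothetical, and the simulated crash stands in for the real OpenCV failure.

```python
import subprocess
import sys

# Hypothetical stand-in for the real OpenCV job. It exits abruptly for one
# input on its first attempt, to imitate the random crash.
CHILD_CODE = """
import os, sys
job, attempt = sys.argv[1], int(sys.argv[2])
if job == "img_2" and attempt == 0:
    os._exit(5)  # simulated crash: the process dies without any cleanup
print("processed " + job)
"""


def run_job(job, attempt):
    """Run one job in its own interpreter and return (ok, output)."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD_CODE, job, str(attempt)],
        capture_output=True,
        text=True,
    )
    # A non-zero return code means the child failed or crashed; on Windows an
    # access violation would surface here as a non-zero code as well.
    return proc.returncode == 0, proc.stdout.strip()


def run_all(jobs, max_attempts=3):
    """Process every job, putting failed ones back on the queue for a retry."""
    results, failed = {}, []
    queue = [(job, 0) for job in jobs]
    while queue:
        job, attempt = queue.pop(0)
        ok, out = run_job(job, attempt)
        if ok:
            results[job] = out
        elif attempt + 1 < max_attempts:
            queue.append((job, attempt + 1))  # reschedule the failed job
        else:
            failed.append(job)  # give up after max_attempts tries
    return results, failed


if __name__ == "__main__":
    results, failed = run_all(["img_1", "img_2", "img_3"])
    print(results)
    print(failed)
```

In the cluster setting, each rank would call something like `run_job` instead of calling OpenCV directly, and report failures to whichever rank hands out the work. The cost is one extra process start per job.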

Thank you

python
mpi4py
mpiexec
ms-mpi



User contributions licensed under CC BY-SA 3.0