How to continue mpiexec execution when one of the jobs is terminated

Question


I have image-processing Python code running on a cluster. I'm using MS-MPI with mpi4py for inter-process communication. Sometimes one of the Python processes is randomly terminated with exit code "0xc0000005" (an access violation, probably a null pointer dereference, I guess).

job aborted:                                                        
[ranks] message                                                     

[0] terminated                                                      

[1] process exited without calling finalize                         

[2-35] terminated                                                   

---- error analysis -----                                           

[1] on clusternode-02                                               
python ended prematurely and may have crashed. exit code 0xc0000005

I'm pretty sure the crash happens within OpenCV, which I'm using, but it occurs completely at random: when I restart all the jobs, the same host processes the same job just fine. So, to work around the issue without debugging Python and OpenCV, I would like to reduce the number of processes available to the currently running task, reschedule the failed job, and continue. So the question is: is there a way to keep all the other jobs running when one of them is terminated, without mpiexec halting completely?
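To make the "reschedule the failed job and continue" idea concrete, here is a minimal sketch. It assumes each job can be run in its own child Python process, so that a crash there surfaces as a non-zero exit code instead of ending the MPI rank. It uses only the standard library and is not an MS-MPI or mpi4py feature; `run_job_isolated`, `process_all`, and the stand-in worker snippet are hypothetical names made up for illustration.

```python
import subprocess
import sys
from collections import deque

# Hypothetical stand-in for the real OpenCV work: the child exits with a
# non-zero code for job "bad" on its first attempt, simulating a random crash.
WORKER_SNIPPET = (
    "import sys\n"
    "job, attempt = sys.argv[1], int(sys.argv[2])\n"
    "sys.exit(1 if (job == 'bad' and attempt == 0) else 0)\n"
)


def run_job_isolated(job, attempt):
    """Run one job in a separate interpreter and return True on success."""
    result = subprocess.run(
        [sys.executable, "-c", WORKER_SNIPPET, job, str(attempt)]
    )
    return result.returncode == 0


def process_all(jobs, max_attempts=3):
    """Process every job, re-queueing any whose child process failed."""
    queue = deque((job, 0) for job in jobs)
    done, failed = [], []
    while queue:
        job, attempt = queue.popleft()
        if run_job_isolated(job, attempt):
            done.append(job)
        elif attempt + 1 < max_attempts:
            queue.append((job, attempt + 1))  # reschedule and continue
        else:
            failed.append(job)  # give up after max_attempts tries
    return done, failed


if __name__ == "__main__":
    done, failed = process_all(["a", "bad", "b"])
    print("done:", done, "failed:", failed)
```

In an mpi4py program, each rank could call something like `process_all` on its own share of the jobs. Under the assumption above, a crash inside the child would be seen by the rank only as a failed return code, so the rank itself keeps running and still reaches finalize.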

Thank you

python

mpi4py

mpiexec

ms-mpi

asked on Stack Overflow Jul 8, 2019 by Mikhail Kostousov

0 Answers

Nobody has answered this question yet.

User contributions licensed under CC BY-SA 3.0