Keras exit code -1073741819 (0xC0000005) after training 2 models


I use PyCharm to run my script. The script loops; each iteration: 1. Selects a dataset. 2. Trains a new Keras model. 3. Evaluates that model.

The code worked perfectly for two weeks, but after I installed a new Anaconda environment it suddenly fails after two iterations of that loop.

Two Siamese neural network models train perfectly fine, and right before the third loop it crashes with Process finished with exit code -1073741819 (0xC0000005).

 1/32 [..............................] - ETA: 0s - loss: 0.5075
12/32 [==========>...................] - ETA: 0s - loss: 0.5112
27/32 [========================>.....] - ETA: 0s - loss: 0.4700
32/32 [==============================] - 0s 4ms/step - loss: 0.4805
eval run time : 0.046851396560668945

For LOOCV run 2 out of 32. Model is SNN. Time taken for instance = 6.077638149261475
Post-training results: 
acc = 1.0 , ce = 0.6019332906978302 , f1 score = 1.0 , mcc = 0.0
cm = 
[[1]]
####################################################################################################

Process finished with exit code -1073741819 (0xC0000005)

The strange thing is that the code used to work perfectly fine, and even when I switch back from the new Anaconda environment to the one I used before, it still exits with the same exit code.

When I use another type of model (a dense neural network), it also crashes, but after 4 iterations. Is it something to do with running out of memory? Here is an example of the loop. The exact model does not matter; it always crashes after a certain number of loops, at the train-model line (between Point 2 and Point 3):

 # Run k model instance to perform skf
    predicted_labels_store = []
    acc_store = []
    ce_store = []
    f1s_store = []
    mcc_store = []
    folds = []
    val_features_c = []
    val_labels = []
    for fold, fl_tuple in enumerate(fl_store):
        instance_start = time.time()
        (ss_fl, i_ss_fl) = fl_tuple  # ss_fl is training fl, i_ss_fl is validation fl
        if model_mode == 'SNN':
            # Run SNN
            model = SNN(hparams, ss_fl.features_c_dim)
            loader = Siamese_loader(model.siamese_net, ss_fl, hparams)
            loader.train(loader.hparams.get('epochs', 100), loader.hparams.get('batch_size', 32),
                         verbose=loader.hparams.get('verbose', 1))
            predicted_labels, acc, ce, cm, f1s, mcc = loader.eval(i_ss_fl)
            predicted_labels_store.extend(predicted_labels)
            acc_store.append(acc)
            ce_store.append(ce)
            f1s_store.append(f1s)
            mcc_store.append(mcc)
        elif model_mode == 'cDNN':
            # Run DNN
            print('Point 1')
            model = DNN_classifer(hparams, ss_fl)
            print('Point 2')
            model.train_model(ss_fl)
            print('Point 3')
            predicted_labels, acc, ce, cm, f1s, mcc = model.eval(i_ss_fl)
            predicted_labels_store.extend(predicted_labels)
            acc_store.append(acc)
            ce_store.append(ce)
            f1s_store.append(f1s)
            mcc_store.append(mcc)
        del model
        K.clear_session()
        instance_end = time.time()
        if cv_mode == 'skf':
            print('\nFor k-fold run {} out of {}. Model is {}. Time taken for instance = {}\n'
                  'Post-training results: \nacc = {} , ce = {} , f1 score = {} , mcc = {}\ncm = \n{}\n'
                  '####################################################################################################'
                  .format(fold + 1, k_folds, model_mode, instance_end - instance_start, acc, ce, f1s, mcc, cm))
        else:
            print('\nFor LOOCV run {} out of {}. Model is {}. Time taken for instance = {}\n'
                  'Post-training results: \nacc = {} , ce = {} , f1 score = {} , mcc = {}\ncm = \n{}\n'
                  '####################################################################################################'
                  .format(fold + 1, fl.count, model_mode, instance_end - instance_start, acc, ce, f1s, mcc, cm))
        # Preparing output dataframe that consists of all the validation dataset and its predicted labels
        folds.extend([fold] * i_ss_fl.count)  # Make a col that contains the fold number for each example
        # Use len() rather than comparing to [], which is ambiguous once this is a NumPy array
        val_features_c = np.concatenate((val_features_c, i_ss_fl.features_c_a),
                                        axis=0) if len(val_features_c) else i_ss_fl.features_c_a
        val_labels.extend(i_ss_fl.labels)
        K.clear_session()

And the output before the crash for the dense neural network:

For LOOCV run 4 out of 32. Model is cDNN. Time taken for instance = 0.7919328212738037
Post-training results: 
acc = 0.0 , ce = 0.7419472336769104 , f1 score = 0.0 , mcc = 0.0
cm = 
[[0 1]
 [0 0]]
####################################################################################################
Point 1
Point 2

Process finished with exit code -1073741819 (0xC0000005)

Any help is greatly appreciated, thank you!

python
keras
asked on Stack Overflow Jul 20, 2018 by Lim Kaizhuo • edited Jul 20, 2018 by Kumar

1 Answer


Below is an explanation of the things I suggested in the comments that worked, in case anyone faces the same issue.

Manually set the session for Keras at the start of each loop, rather than relying on the default one:

import tensorflow as tf
from keras import backend as K

sess = tf.Session()   # create a fresh session for this iteration
K.set_session(sess)   # make Keras use it instead of the default session
# ..... train your model
K.clear_session()     # release all resources held by the session

Also delete the loader variable: since you call train() on it, this object must be holding a reference to the original model, which keeps the model alive even after del model.
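To see why deleting only the model is not enough, here is a minimal pure-Python sketch. The Model and Loader classes are hypothetical stand-ins (Loader mimics how Siamese_loader keeps a reference to the model): an object stays alive as long as any strong reference to it exists, so the loader must be deleted too before gc.collect() can reclaim anything.

```python
import gc
import weakref

class Model:
    """Hypothetical stand-in for a Keras model."""
    pass

class Loader:
    """Hypothetical stand-in for Siamese_loader: it keeps a reference to the model."""
    def __init__(self, model):
        self.model = model

model = Model()
loader = Loader(model)
ref = weakref.ref(model)  # weak reference: does not keep the model alive

del model                 # the loader still holds a reference...
gc.collect()
assert ref() is not None  # ...so the model object is still alive

del loader                # drop the last strong reference
gc.collect()              # collect any leftover reference cycles
assert ref() is None      # now the model is actually gone
```

The same applies to the loop in the question: adding del loader next to del model before K.clear_session() lets gc.collect() actually release the model's memory.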

Explicitly collect the memory released by deleting these variables, using gc.collect() after each loop, so that there is enough memory for building the next model.

So the gist is: when running multiple independent models in a loop like this, explicitly set the TensorFlow session so that you can clear it when each iteration finishes, releasing all the resources that session used. Delete all references that might be tied to TensorFlow objects inside the loop, then collect the freed memory.

answered on Stack Overflow Jul 20, 2018 by Kumar • edited Jul 20, 2018 by Kumar

User contributions licensed under CC BY-SA 3.0