Tensorflow2.0: GPU runs out of memory during hyperparameter tuning loop

3

I am trying to perform some hyperparameter tuning of a convolutional neural network written in TensorFlow 2.0 with GPU support.

My systems settings are:

  • Windows 10 64bit
  • GeForce RTX2070, 8GB
  • Tensorflow 2.0-beta
  • CUDA 10.0 properly installed (I hope so; deviceQuery.exe and bandwidthTest.exe both pass)

My neural network has 75,572,574 parameters and I am training it on 3,777 samples. In a single run, I have no problems training the CNN.

As a next step, I wanted to tune two hyperparameters of the CNN. To this end, I created a for loop (iterating over about 20 runs) in which I build and compile a new model every time, changing the hyperparameters at each iteration. The gist of the code (this is not an MWE) is the following:

import numpy as np
import tensorflow as tf
from tensorflow import keras

# stand-in for the RMSE metric referenced in compile() below
# (in my actual code RMSE is defined elsewhere)
RMSE = keras.metrics.RootMeanSquaredError()

def build_model(input_shape, output_shape, lr=0.01, dropout=0, summary=True):
    model = keras.models.Sequential(name="CNN")
    model.add(keras.layers.Conv2D(32, (7, 7), activation='relu', input_shape=input_shape, padding="same"))
    model.add(keras.layers.BatchNormalization())
    model.add(keras.layers.MaxPooling2D((2, 2)))
    model.add(keras.layers.Dropout(dropout))
    model.add(keras.layers.Conv2D(128, (3, 3), activation='relu', padding="same"))
    model.add(keras.layers.BatchNormalization())
    model.add(keras.layers.MaxPooling2D((2, 2)))
    model.add(keras.layers.Dropout(dropout))
    model.add(keras.layers.Flatten())
    model.add(keras.layers.Dense(1024, activation='relu'))
    model.add(keras.layers.BatchNormalization())
    model.add(keras.layers.Dense(output_shape, activation='linear'))
    model.build()
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=lr),
                  loss="mse",
                  metrics=[RMSE])
    if summary:
        print(model.summary())
    return model

...

    for run_id in range(25):
        # sample both hyperparameters uniformly at random within their ranges
        lr = learning_rate.max_value + (learning_rate.min_value - learning_rate.max_value) * np.random.rand(1)
        # use a separate name so the dropout range object is not overwritten
        dropout_rate = dropout.min_value + (dropout.max_value - dropout.min_value) * np.random.rand(1)
        print("%=== Run #{0}".format(run_id))
        run_dir = hparamdir + "\\run{0}".format(run_id)
        model0 = build_model(IMG_SHAPE, Ytrain.shape[1], lr=lr, dropout=dropout_rate)
        model0_history = model0.fit(Xtrain,
                                    Ytrain,
                                    validation_split=0.3,
                                    epochs=2,
                                    verbose=2)

The problem I encountered is that, after a few (6) iterations, the program halts and returns the error

tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[73728,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:Add] name: dense_12/kernel/Initializer/random_uniform/

Process finished with exit code 1.

I believe the problem is that the GPU does not release its memory between iterations of the for loop and, after a while, it saturates and the program crashes.
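To check this, something along these lines could log the memory the GPU reports as in use at the end of every iteration (a sketch; it assumes nvidia-smi is available on the PATH, which on Windows may require adding the NVIDIA driver folder to it):

import subprocess

def gpu_memory_used_mib(gpu_id=0):
    # ask the NVIDIA driver how much memory is currently in use on the given GPU
    out = subprocess.check_output([
        "nvidia-smi",
        "--query-gpu=memory.used",
        "--format=csv,noheader,nounits",
        "--id={0}".format(gpu_id),
    ])
    return int(out.decode().strip())

# e.g. at the end of each loop iteration:
# print("GPU memory in use: {0} MiB".format(gpu_memory_used_mib()))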

I have dug around and tried different solutions, as suggested in similar posts (post1, post2):

  1. Releasing the memory with the Keras backend at the end of every iteration of the for loop, using
from keras import backend as K
K.clear_session()
  2. Clearing the GPU with Numba and CUDA, using
from numba import cuda
cuda.select_device(0)
cuda.close()
  3. Deleting the graph with del model0, but that did not work either (a sketch combining this with attempt 1 is shown after this list).

  4. I couldn't try tf.reset_default_graph(), since the programming style of TF 2.0 doesn't have a default graph anymore (AFAIK), and thus I have not found a way to kill/delete a graph at runtime.
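For completeness, this is roughly how attempts 1 and 3 sit inside the loop (a sketch, not my exact code; the explicit gc.collect() call is an extra line added here for illustration):

import gc
from tensorflow.keras import backend as K

for run_id in range(25):
    model0 = build_model(IMG_SHAPE, Ytrain.shape[1])
    model0.fit(Xtrain, Ytrain, validation_split=0.3, epochs=2, verbose=2)
    # cleanup at the end of every iteration
    del model0         # attempt 3: drop the Python reference to the model
    K.clear_session()  # attempt 1: clear the Keras/TF session state
    gc.collect()       # extra: force Python garbage collection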

Solutions 1 and 3 returned the same out-of-memory error, while solution 2 returned the following error during the second iteration of the for loop, while building the model in the build_model() call:

2019-07-24 19:51:31.909377: F .\tensorflow/core/kernels/random_op_gpu.h:227] Non-OK-status: GpuLaunchKernel(FillPhiloxRandomKernelLaunch<Distribution>, num_blocks, block_size, 0, d.stream(), gen, data, size, dist) status: Internal: invalid resource handle

Process finished with exit code -1073740791 (0xC0000409)

I tried to look around and I don't really understand this last error; I would guess that the GPU has not been closed properly, is still occupied, or can no longer be seen by Python.

In any case, I could not find any solution to this issue, except running the training by hand for every hyperparameter combination to be tested.

Does anybody have an idea how to solve this problem, or a workaround for the hyperparameter tuning? Should I open an issue on the TF 2.0 GitHub issue tracker? (It does not appear to be a TensorFlow bug per se, since they state that they deliberately do not free GPU memory, in order to avoid memory fragmentation problems.)

python
keras
out-of-memory
tensorflow2.0
asked on Stack Overflow Jul 24, 2019 by DiracRules • edited Jul 24, 2019 by talonmies

2 Answers

4

This is due to how TF handles memory.

If you monitor your system while iteratively training TF models, you will observe a linear increase in memory consumption. Additionally, if you run watch -n 0.1 nvidia-smi you will notice that the PID of the process remains constant across iterations. TF does not fully release the memory it has allocated until the process owning it is killed. Also, the Numba documentation notes that cuda.close() is not useful if you want to reset the GPU (though I definitely spent a while trying to make it work when I discovered it!).

The easiest solution is to iterate using the Ray python package and something like the following:

import ray

@ray.remote(
    num_gpus=1  # or however many GPUs you want each actor to use (e.g., 0.5, 1, 2)
)
class RayNetWrapper:

    def __init__(self, net):
        self.net = net

    def train(self):
        return self.net.train()

ray.init()
# `model` stands for your own model/training object exposing a train() method
actors = [RayNetWrapper.remote(model) for _ in range(25)]
results = ray.get([actor.train.remote() for actor in actors])

You should then notice that GPU processes cycle on and off with a new PID each time, and that your system memory no longer keeps increasing. Alternatively, you can put your model training code in a separate Python script and call it repeatedly using Python's subprocess module, as sketched below. You will also notice some latency now when one model shuts down and the next boots up, but this is expected because the GPU is being reinitialized. Ray also has an experimental asynchronous framework that I've had some success with; it enables fractional sharing of GPUs (model size permitting).
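A minimal sketch of that subprocess-based alternative might look like this (the script name train_one.py and its command-line argument are hypothetical; the point is that each trial runs in a fresh process, so the GPU memory is released when that process exits):

import subprocess
import sys

for run_id in range(25):
    # run one hyperparameter trial in its own Python process;
    # the GPU memory is freed when the child process exits
    subprocess.run(
        [sys.executable, "train_one.py", "--run-id", str(run_id)],
        check=True,
    )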

answered on Stack Overflow Sep 13, 2019 by prince.e
-1

You can place the following lines at the top of your code.

import tensorflow as tf

tf.compat.v1.disable_v2_behavior()
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # enable memory growth so TF allocates GPU memory on demand
        # instead of reserving (almost) all of it up front
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
        # memory growth must be set before the GPUs have been initialized
        print(e)

That works for me.

answered on Stack Overflow Nov 22, 2020 by s.sarbaziAzad • edited Nov 25, 2020 by s.sarbaziAzad

User contributions licensed under CC BY-SA 3.0