Why does adding convolution/pool layer crash Keras/Tensorflow model while running on RTX 3070/cudnn8/CUDA11.1?

Question

System Info

OS: Windows 10
cuDNN: 8.0
CUDA toolkit: 11.1, installed over top of 10.2
GPU: Nvidia RTX 3070
CPU: Intel i7-10700F
TensorFlow: tf.__version__==2.4.0rc-0 (have also tried with tf-nightly-gpu as late as Dec 7, 2020)
CUDA, cuDNN compiled manually from source

Test Code

The code below builds and compiles the model successfully, but the process crashes when model.fit(...) is called.


import tensorflow as tf  # required for tf.keras.losses used in model.compile below
from tensorflow.keras import datasets, layers, models
import matplotlib.pyplot as plt

# Load CIFAR-10 and scale pixel values to the range [0, 1]
(train_images, train_labels), (test_images, test_labels) = datasets.cifar10.load_data()

train_images, test_images = train_images / 255.0, test_images / 255.0

# Small CNN: three convolution layers with max pooling, then two dense layers
model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10))

model.compile(optimizer='Adam', loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

# The crash happens here, once training starts
history = model.fit(train_images, train_labels, batch_size=10, epochs=100)

If I remove the convolutional and max-pooling layers and simply flatten the input tensors, the model trains without crashing (the output of that model is obviously useless, but it does train).
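For comparison, here is a minimal sketch of what the flatten-only variant might look like. The exact code for that variant was not posted, so the function name build_flatten_only_model and the layer sizes (carried over from the dense part of the test code) are assumptions. It contains no convolution or pooling layers, which is presumably why it does not run into the same crash.

```python
def build_flatten_only_model():
    """Build a dense-only CIFAR-10 model (assumed reconstruction, not the asker's exact code)."""
    # Imported lazily so that defining this function does not itself load TensorFlow.
    from tensorflow.keras import layers, models

    model = models.Sequential()
    # Flatten the 32x32x3 input directly instead of passing it through Conv2D/MaxPooling2D.
    model.add(layers.Flatten(input_shape=(32, 32, 3)))
    model.add(layers.Dense(64, activation='relu'))
    model.add(layers.Dense(10))
    return model
```

The returned model can be compiled and fitted with the same model.compile(...) and model.fit(...) calls as in the test code.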

When the program crashes, the only output is: Process finished with exit code -1073740791 (0xC0000409)
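The two numbers in that message are the same value: the negative decimal is the signed 32-bit reading of the hexadecimal status code. A small sketch of the conversion (the helper name to_ntstatus is made up for illustration):

```python
def to_ntstatus(exit_code: int) -> str:
    """Reinterpret a signed 32-bit exit code as an unsigned 32-bit hex string."""
    # Masking with 0xFFFFFFFF maps negative values to their two's-complement form.
    return f"0x{exit_code & 0xFFFFFFFF:08X}"


print(to_ntstatus(-1073740791))  # prints 0xC0000409
```

On Windows, 0xC0000409 is the NTSTATUS value STATUS_STACK_BUFFER_OVERRUN, which signals that the process was terminated abnormally. On its own it does not say which component failed.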

Additionally, TensorFlow is able to open the CUDA libraries, find the GPU, and report the GPU as available when tf.config.list_physical_devices('GPU') is called.

UPDATE: I opened an issue on the TensorFlow GitHub page, which you can find here.

python

tensorflow

keras

gpu

asked on Stack Overflow Dec 9, 2020 by Taylr Cawte • edited Jan 8, 2021 by Taylr Cawte

1 Answer

For whatever reason, when the script was run in the IDE terminal the actual error message was suppressed, and only Process finished with exit code -1073740791 (0xC0000409) was shown.

When the script was run from the command line, the following error messages were displayed instead of just the exit code:

Could not load library cudnn_ops_infer64_8.dll. Error code 126
Please make sure cudnn_ops_infer64_8.dll is in your library path!
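If running outside the IDE is inconvenient, one way to see the same messages is to launch the script as a child process and capture its stderr. This is only a sketch: run_and_capture is a hypothetical helper, and train_cifar.py in the comment is a placeholder name for the test script from the question.

```python
import subprocess
import sys


def run_and_capture(args):
    """Run a command and return (exit code, everything it wrote to stderr)."""
    # capture_output=True pipes stdout/stderr back to us; text=True decodes them to str.
    result = subprocess.run(args, capture_output=True, text=True)
    return result.returncode, result.stderr


# Example usage (placeholder script name):
#   code, err = run_and_capture([sys.executable, "train_cifar.py"])
#   print(code)
#   print(err)
```

Because the pipe is attached at the process level, output written by native libraries should be captured as well as Python-level output.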

I recognized this as a file included in the cuDNN library and copied it from the bin folder of cuDNN to NVIDIA GPU Computing Toolkit > CUDA > V11.0 > bin. I repeated this for each of the DLLs listed below, and the issue was resolved.

cudnn_adv_infer64_8.dll
cudnn_adv_train64_8.dll
cudnn_cnn_infer64_8.dll
cudnn_cnn_train64_8.dll
cudnn_ops_infer64_8.dll
cudnn_ops_train64_8.dll
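The manual copy can also be scripted. The sketch below checks which of the six DLLs are missing from a target folder and copies them from a source folder. The function names are made up for illustration, and both folder paths are parameters you supply yourself (the cuDNN bin folder and the CUDA bin folder mentioned above).

```python
import shutil
from pathlib import Path

# The six cuDNN 8 DLLs listed in the answer.
CUDNN_DLLS = [
    "cudnn_adv_infer64_8.dll",
    "cudnn_adv_train64_8.dll",
    "cudnn_cnn_infer64_8.dll",
    "cudnn_cnn_train64_8.dll",
    "cudnn_ops_infer64_8.dll",
    "cudnn_ops_train64_8.dll",
]


def missing_dlls(cuda_bin, names=CUDNN_DLLS):
    """Return the DLL names that are not present in cuda_bin."""
    cuda_bin = Path(cuda_bin)
    return [name for name in names if not (cuda_bin / name).is_file()]


def copy_missing(cudnn_bin, cuda_bin, names=CUDNN_DLLS):
    """Copy DLLs missing from cuda_bin out of cudnn_bin; return the names copied."""
    copied = []
    for name in missing_dlls(cuda_bin, names):
        source = Path(cudnn_bin) / name
        if source.is_file():  # skip anything the cuDNN folder does not contain
            shutil.copy2(source, Path(cuda_bin) / name)
            copied.append(name)
    return copied
```

Calling missing_dlls on the CUDA bin folder first shows what would be copied without changing anything. Writing into a folder under Program Files may require an elevated prompt.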

answered on Stack Overflow Jan 10, 2021 by Taylr Cawte

User contributions licensed under CC BY-SA 3.0