PyTorch 1.8 intermittently hangs when calling loss.backward()


While training an LSTM in PyTorch, the training process intermittently hangs and cannot be terminated with Ctrl+C. I used faulthandler to locate where it gets stuck. The training parameters, environment, and faulthandler traceback output are listed below. It appears to be a problem with the C++ backend, with CUDA, or possibly with my graphics card, but I am not sure.
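Roughly how I capture the tracebacks below while the process is hung (a minimal sketch using the standard faulthandler API; the 60-second interval is just an example, not the exact value I used):

    import faulthandler
    import sys

    # Dump the stack of every thread to stderr every 60 seconds while the
    # process is alive, so a hang inside loss.backward() still leaves a trace
    # even though Ctrl+C no longer works.
    faulthandler.dump_traceback_later(timeout=60, repeat=True, file=sys.stderr)

    # ... training script runs here ...

    # Cancel the periodic dump if training finishes cleanly.
    faulthandler.cancel_dump_traceback_later()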

Traceback output for batch_size = 64/32/16, num_workers = 2, CUDA 11.1, PyTorch 1.8.0, cuDNN 8.0.5/8.0.1:

Thread 0x000036e8 (most recent call first):
  File "C:\Users\myUserName\anaconda3\lib\threading.py", line 302 in wait
  File "C:\Users\myUserName\anaconda3\lib\multiprocessing\queues.py", line 227 in _feed
  File "C:\Users\myUserName\anaconda3\lib\threading.py", line 870 in run
  File "C:\Users\myUserName\anaconda3\lib\threading.py", line 932 in _bootstrap_inner
  File "C:\Users\myUserName\anaconda3\lib\threading.py", line 890 in _bootstrap

Thread 0x00004644 (most recent call first):
  File "C:\Users\myUserName\anaconda3\lib\threading.py", line 302 in wait
  File "C:\Users\myUserName\anaconda3\lib\multiprocessing\queues.py", line 227 in _feed
  File "C:\Users\myUserName\anaconda3\lib\threading.py", line 870 in run
  File "C:\Users\myUserName\anaconda3\lib\threading.py", line 932 in _bootstrap_inner
  File "C:\Users\myUserName\anaconda3\lib\threading.py", line 890 in _bootstrap

Thread 0x00000efc (most recent call first):

Thread 0x00000138 (most recent call first):
  File "C:\Users\myUserName\anaconda3\lib\threading.py", line 306 in wait
  File "C:\Users\myUserName\anaconda3\lib\threading.py", line 558 in wait
  File "C:\Users\myUserName\anaconda3\lib\site-packages\tqdm\_monitor.py", line 59 in run
  File "C:\Users\myUserName\anaconda3\lib\threading.py", line 932 in _bootstrap_inner
  File "C:\Users\myUserName\anaconda3\lib\threading.py", line 890 in _bootstrap

Thread 0x00001644 (most recent call first):
  File "C:\Users\myUserName\anaconda3\lib\threading.py", line 306 in wait
  File "C:\Users\myUserName\anaconda3\lib\queue.py", line 179 in get
  File "C:\Users\myUserName\anaconda3\lib\site-packages\tensorboard\summary\writer\event_file_writer.py", line 232 in run
  File "C:\Users\myUserName\anaconda3\lib\threading.py", line 932 in _bootstrap_inner
  File "C:\Users\myUserName\anaconda3\lib\threading.py", line 890 in _bootstrap

Thread 0x0000443c (most recent call first):
  File "C:\Users\myUserName\anaconda3\lib\site-packages\torch\autograd\__init__.py", line 145 in backward
  File "C:\Users\myUserName\anaconda3\lib\site-packages\torch\tensor.py", line 245 in backward
  File "train.py", line 129 in main
  File "train.py", line 246 in <module>

Traceback output for batch_size = 64, num_workers = 0, CUDA 11.1, PyTorch 1.8.0, cuDNN 8.0.5:

Thread 0x00003650 (most recent call first):

Thread 0x000043b4 (most recent call first):
  File "C:\Users\myUserName\anaconda3\lib\threading.py", line 306 in wait
  File "C:\Users\myUserName\anaconda3\lib\threading.py", line 558 in wait
  File "C:\Users\myUserName\anaconda3\lib\site-packages\tqdm\_monitor.py", line 59 in run
  File "C:\Users\myUserName\anaconda3\lib\threading.py", line 932 in _bootstrap_inner
  File "C:\Users\myUserName\anaconda3\lib\threading.py", line 890 in _bootstrap

Thread 0x000017c4 (most recent call first):
  File "C:\Users\myUserName\anaconda3\lib\threading.py", line 306 in wait
  File "C:\Users\myUserName\anaconda3\lib\queue.py", line 179 in get
  File "C:\Users\myUserName\anaconda3\lib\site-packages\tensorboard\summary\writer\event_file_writer.py", line 232 in run
  File "C:\Users\myUserName\anaconda3\lib\threading.py", line 932 in _bootstrap_inner
  File "C:\Users\myUserName\anaconda3\lib\threading.py", line 890 in _bootstrap

Thread 0x00001458 (most recent call first):
  File "C:\Users\myUserName\anaconda3\lib\site-packages\torch\autograd\__init__.py", line 145 in backward
  File "C:\Users\myUserName\anaconda3\lib\site-packages\torch\tensor.py", line 245 in backward
  File "train.py", line 129 in main
  File "train.py", line 246 in <module>

When num_workers = 0, the output is the same except that it lacks the two threads below, which I think belong to the DataLoader workers.

Thread 0x000036e8 (most recent call first):
  File "C:\Users\myUserName\anaconda3\lib\threading.py", line 302 in wait
  File "C:\Users\myUserName\anaconda3\lib\multiprocessing\queues.py", line 227 in _feed
  File "C:\Users\myUserName\anaconda3\lib\threading.py", line 870 in run
  File "C:\Users\myUserName\anaconda3\lib\threading.py", line 932 in _bootstrap_inner
  File "C:\Users\myUserName\anaconda3\lib\threading.py", line 890 in _bootstrap

Thread 0x00004644 (most recent call first):
  File "C:\Users\myUserName\anaconda3\lib\threading.py", line 302 in wait
  File "C:\Users\myUserName\anaconda3\lib\multiprocessing\queues.py", line 227 in _feed
  File "C:\Users\myUserName\anaconda3\lib\threading.py", line 870 in run
  File "C:\Users\myUserName\anaconda3\lib\threading.py", line 932 in _bootstrap_inner
  File "C:\Users\myUserName\anaconda3\lib\threading.py", line 890 in _bootstrap
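For reference, a stripped-down sketch of the data-loading and training-loop structure (the dataset and model here are hypothetical placeholders, not the actual code in train.py; only batch_size and num_workers mirror the settings above):

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset


    class SmallLSTM(nn.Module):
        """Hypothetical stand-in for the real model in train.py."""
        def __init__(self):
            super().__init__()
            self.lstm = nn.LSTM(input_size=10, hidden_size=64, batch_first=True)
            self.head = nn.Linear(64, 1)

        def forward(self, x):
            out, _ = self.lstm(x)          # (batch, seq, hidden)
            return self.head(out[:, -1])   # predict from the last time step


    def main():
        # Hypothetical random data standing in for the real dataset.
        dataset = TensorDataset(torch.randn(1024, 50, 10), torch.randn(1024, 1))
        # num_workers is the setting varied above (2 vs. 0); the hang happens
        # either way, somewhere inside loss.backward().
        loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=2)

        model = SmallLSTM().cuda()
        criterion = nn.MSELoss()
        optimizer = torch.optim.Adam(model.parameters())

        for x, y in loader:
            x, y = x.cuda(), y.cuda()
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()   # the call that intermittently never returns
            optimizer.step()


    if __name__ == "__main__":  # guard required on Windows when num_workers > 0
        main()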

Resource usage is moderate: around 20% CPU, 16 of 32 GB system memory, and 3.8 of 8 GB GPU memory. GPU utilization is low while training RNNs. The script runs on Windows 10 with an RTX 3070 on driver version 461.09.

More Information

When I first started debugging this code I was using mismatched versions: CUDA 11.2 with PyTorch 1.7.1 and cuDNN 8.1.0. At that time I occasionally ran into CUDA exceptions, with errors like "kernel launch failed" or "failed to synchronize". After I changed my CUDA version to match, the errors went away, but now things just hang with no error at all.
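For completeness, a small snippet that prints the CUDA and cuDNN versions PyTorch was actually built against, which is how a mismatch like the one above can be checked (standard PyTorch API, not taken from my train.py):

    import torch

    # Versions PyTorch itself was built with and is running against.
    print("torch:", torch.__version__)
    print("CUDA (as seen by torch):", torch.version.cuda)
    print("cuDNN:", torch.backends.cudnn.version())

    # Basic GPU availability check.
    print("CUDA available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("device:", torch.cuda.get_device_name(0))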

pytorch
asked on Stack Overflow Mar 10, 2021 by LonelyQuantum • edited Mar 11, 2021 by LonelyQuantum

0 Answers

Nobody has answered this question yet.


User contributions licensed under CC BY-SA 3.0