I have a program that is just a TCP data forwarder. Data packets come in from one source and are buffered intelligently for redistribution to multiple clients. There are a few threads, and there's some mutex'ing going on. There's also a little logging to disk. But mostly it's just TCP socket data (no UDP despite what the dmesg seems to imply). Also, there's no blocking. It's all select on non-blocking sockets and some polling.
This process is hung up. Although it responds to SNMP requests (there are dedicated threads for that) and will accept client connections (there's a dedicated listener thread), no data will flow. Note that this program will run for a month with no problems before it finally hangs up, so it's really hard to probe with a debug version. I'm really hoping someone can help me interpret the dmesg reports I have, because this is all I have go go on.
The main thread that receives packets has this stack frame:
#0 0x00289424 in __kernel_vsyscall ()
#1 0x00a93656 in nanosleep () from /lib/libc.so.6
#2 0x00acc34c in usleep () from /lib/libc.so.6
#3 0x0805a140 in main ()
The sender thread has this stack frame:
#0 0x00289424 in __kernel_vsyscall ()
#1 0x00a93656 in nanosleep () from /lib/libc.so.6
#2 0x00acc34c in usleep () from /lib/libc.so.6
#3 0x08066efc in TSITcpip::TcpipThread(void*) ()
#4 0x08067026 in tcpip_callback(void*) ()
#5 0x00b95a09 in start_thread () from /lib/libpthread.so.0
#6 0x00ad343e in clone () from /lib/libc.so.6
There are 6 more threads. One is the listener thread, and the rest are for SNMP. I don't know how to get more stack frame info to tell what the duration of the sleep is, but there are limits placed on it. In one place, the longest sleep is capped at 5 seconds. Others are hard-coded to 1 second and 5 milliseconds. There are no blocking calls to access sockets. Mutexes are involved, but this one system ran for like a month with no clients attached, which means that blocking on locks would be very rare.
Below is an excerpt from dmesg. Everything before it is related to boot-up, and there's nothing after except for a segfault for an unrelated process. This is my only clue about what's going on, and I really hope someone can help me make sense of it. (I added some blank lines for readability.)
INFO: task str2many:1624 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
str2many D efa23b80 0 1624 1593 0x00000080
efd8c570 00200082 f2987140 efa23b80 00200246 c07d9c2b 00000031 0000bc43
00000000 eff27040 00010ced 00000001 00010ced c0b110c0 c0b110c0 efd8c818
c0b110c0 c0b0caa4 c0b110c0 efd8c818 efac6000 c07d3aa3 efe6d01c f7032780
Call Trace:
[<c07d9c2b>] ? udp_recvmsg+0x14b/0x2c0
[<c07d3aa3>] ? tcp_v4_rcv+0x473/0x770
[<c048792d>] ? exit_robust_list+0x7d/0x140
[<c04529ca>] ? mm_release+0xda/0xf0
[<c0458d9d>] ? exit_mm+0x6d/0x140
[<c0459136>] ? do_exit+0x126/0x740
[<c0788b91>] ? __netif_receive_skb+0x401/0x5f0
[<c046bd60>] ? dequeue_signal+0x30/0x190
[<c053da0c>] ? core_sys_select+0x16c/0x270
[<c045978c>] ? do_group_exit+0x3c/0xa0
[<c046c2fe>] ? get_signal_to_deliver+0x1ce/0x460
[<f8a4df94>] ? e1000_clean_rx_irq+0x274/0x450 [e1000e]
[<c0408fa3>] ? do_signal+0x93/0xa50
[<c0489395>] ? do_futex+0xe5/0xa40
[<c078cd79>] ? net_rx_action+0x199/0x280
[<c0480be0>] ? ktime_get_ts+0xd0/0x100
[<c0600623>] ? copy_to_user+0x33/0x110
[<c053c941>] ? poll_select_copy_remaining+0xe1/0x130
[<c04afed0>] ? __audit_syscall_exit+0x220/0x250
[<c04099d7>] ? do_notify_resume+0x77/0xa0
[<c083135c>] ? work_notifysig+0x13/0x1b
[<c0830000>] ? __mutex_lock_slowpath+0xb0/0x140
INFO: task str2many:1635 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
str2many D c2d03ac0 0 1635 1593 0x00000080
efa75570 00200082 00000200 c2d03ac0 00000002 efac5df4 efac5df0 00000000
00000000 eff27040 00010ced cc11e7f9 00010ced c0b110c0 c0b110c0 efa75818
c0b110c0 c0b0caa4 c0b110c0 efa75818 f6272000 00000000 000005ff efa75570
Call Trace:
[<c048792d>] ? exit_robust_list+0x7d/0x140
[<c04529ca>] ? mm_release+0xda/0xf0
[<c0458d9d>] ? exit_mm+0x6d/0x140
[<c0459136>] ? do_exit+0x126/0x740
[<c05f906a>] ? plist_del+0x2a/0x60
[<c046bd60>] ? dequeue_signal+0x30/0x190
[<c045978c>] ? do_group_exit+0x3c/0xa0
[<c046c2fe>] ? get_signal_to_deliver+0x1ce/0x460
[<c0408fa3>] ? do_signal+0x93/0xa50
[<c04893c5>] ? do_futex+0x115/0xa40
[<c04b78d1>] ? move_native_irq+0x11/0x50
[<c04afed0>] ? __audit_syscall_exit+0x220/0x250
[<c04099d7>] ? do_notify_resume+0x77/0xa0
[<c083135c>] ? work_notifysig+0x13/0x1b
INFO: task str2many:1636 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
str2many D efa75030 0 1636 1593 0x00000080
efa75030 00200082 eff27040 efa75030 c082ecb0 00000000 00000000 0000d164
00000000 eff27040 00010ced 00000001 00010ced c0b110c0 c0b110c0 efa752d8
c0b110c0 c0b0caa4 c0b110c0 efa752d8 efaca000 c0b110c0 efa752d8 efaba000
Call Trace:
[<c082ecb0>] ? schedule+0x3c0/0xae0
[<c07e03ab>] ? inet_sendmsg+0x4b/0xb0
[<c048792d>] ? exit_robust_list+0x7d/0x140
[<c04529ca>] ? mm_release+0xda/0xf0
[<c0458d9d>] ? exit_mm+0x6d/0x140
[<c0459136>] ? do_exit+0x126/0x740
[<c05f906a>] ? plist_del+0x2a/0x60
[<c046bd60>] ? dequeue_signal+0x30/0x190
[<c045978c>] ? do_group_exit+0x3c/0xa0
[<c046c2fe>] ? get_signal_to_deliver+0x1ce/0x460
[<c0408fa3>] ? do_signal+0x93/0xa50
[<c04893c5>] ? do_futex+0x115/0xa40
[<c04b8874>] ? __rcu_process_callbacks+0x44/0x2d0
[<c04b8b35>] ? rcu_process_callbacks+0x35/0x40
[<c045c18e>] ? __do_softirq+0xae/0x1a0
[<c04afed0>] ? __audit_syscall_exit+0x220/0x250
[<c04099d7>] ? do_notify_resume+0x77/0xa0
[<c083135c>] ? work_notifysig+0x13/0x1b
INFO: task str2many:1637 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
str2many D 00000070 0 1637 1593 0x00000080
efa9bab0 00200082 00000000 00000070 f63c559c efaa7030 c2d080fc 000170ed
00000000 eff27040 00010ced cdcefd93 00010ced c0b110c0 c0b110c0 efa9bd58
c0b110c0 c0b0caa4 c0b110c0 efa9bd58 f70fc000 c2d080c0 00200082 c044d665
Call Trace:
[<c044d665>] ? try_to_wake_up+0x205/0x3a0
[<c043b2b7>] ? __wake_up_common+0x47/0x70
[<c043f8f0>] ? complete+0x40/0x60
[<c0458d9d>] ? exit_mm+0x6d/0x140
[<c0459136>] ? do_exit+0x126/0x740
[<c05f906a>] ? plist_del+0x2a/0x60
[<c046bd60>] ? dequeue_signal+0x30/0x190
[<c045978c>] ? do_group_exit+0x3c/0xa0
[<c046c2fe>] ? get_signal_to_deliver+0x1ce/0x460
[<c0408fa3>] ? do_signal+0x93/0xa50
[<c04893c5>] ? do_futex+0x115/0xa40
[<c04b8874>] ? __rcu_process_callbacks+0x44/0x2d0
[<c04b8b35>] ? rcu_process_callbacks+0x35/0x40
[<c045c18e>] ? __do_softirq+0xae/0x1a0
[<c04afed0>] ? __audit_syscall_exit+0x220/0x250
[<c04099d7>] ? do_notify_resume+0x77/0xa0
[<c083135c>] ? work_notifysig+0x13/0x1b
INFO: task str2many:1638 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
str2many D b4dfcbe0 0 1638 1593 0x00000080
efa9b570 00200082 00000000 b4dfcbe0 ef991b64 c0833efa c0439bf0 0001a113
00000000 eff27040 00010ced cd0bd748 00010ced c0b110c0 c0b110c0 efa9b818
c0b110c0 c0b0caa4 c0b110c0 efa9b818 f70fc000 c0b100d8 000000e0 ffffffff
Call Trace:
[<c0833efa>] ? do_page_fault+0x2a/0x90
[<c0439bf0>] ? kmap_atomic_prot+0x120/0x150
[<c05ff1e1>] ? __get_user_4+0x11/0x17
[<c048792d>] ? exit_robust_list+0x7d/0x140
[<c04529ca>] ? mm_release+0xda/0xf0
[<c0458d9d>] ? exit_mm+0x6d/0x140
[<c0459136>] ? do_exit+0x126/0x740
[<c05f906a>] ? plist_del+0x2a/0x60
[<c046bd60>] ? dequeue_signal+0x30/0x190
[<c045978c>] ? do_group_exit+0x3c/0xa0
[<c046c2fe>] ? get_signal_to_deliver+0x1ce/0x460
[<c0439bf0>] ? kmap_atomic_prot+0x120/0x150
[<c0408fa3>] ? do_signal+0x93/0xa50
[<c04893c5>] ? do_futex+0x115/0xa40
[<c0433a3a>] ? __do_page_fault+0x1aa/0x420
[<c04afed0>] ? __audit_syscall_exit+0x220/0x250
[<c04099d7>] ? do_notify_resume+0x77/0xa0
[<c083135c>] ? work_notifysig+0x13/0x1b
INFO: task str2many:1716 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
str2many D eb09ae40 0 1716 1593 0x00000080
efaa7ab0 00200082 ef53e040 eb09ae40 eb09ae60 000005a8 c077c7fc 000bb500
00000000 eff27040 00010ced 00000001 00010ced c0b110c0 c0b110c0 efaa7d58
c0b110c0 c0b0caa4 c0b110c0 efaa7d58 efba2000 00000001 9b0bb0f7 00000000
Call Trace:
[<c077c7fc>] ? sk_reset_timer+0xc/0x20
[<c048792d>] ? exit_robust_list+0x7d/0x140
[<c07cf241>] ? __tcp_push_pending_frames+0x31/0xe0
[<c04529ca>] ? mm_release+0xda/0xf0
[<c0458d9d>] ? exit_mm+0x6d/0x140
[<c0459136>] ? do_exit+0x126/0x740
[<c046bd60>] ? dequeue_signal+0x30/0x190
[<c053da0c>] ? core_sys_select+0x16c/0x270
[<c045978c>] ? do_group_exit+0x3c/0xa0
[<c046c2fe>] ? get_signal_to_deliver+0x1ce/0x460
[<c0408fa3>] ? do_signal+0x93/0xa50
[<c0476450>] ? autoremove_wake_function+0x0/0x40
[<c0480be0>] ? ktime_get_ts+0xd0/0x100
[<c0600623>] ? copy_to_user+0x33/0x110
[<c053c941>] ? poll_select_copy_remaining+0xe1/0x130
[<c04afed0>] ? __audit_syscall_exit+0x220/0x250
[<c04099d7>] ? do_notify_resume+0x77/0xa0
[<c083135c>] ? work_notifysig+0x13/0x1b
INFO: task str2many:1717 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
str2many D c2d03ac0 0 1717 1593 0x00000080
efb11570 00200082 00000200 c2d03ac0 00000002 efba3df4 efba3df0 00000000
00000000 eff27040 00010ced 00000001 00010ced c0b110c0 c0b110c0 efb11818
c0b110c0 c0b0caa4 c0b110c0 efb11818 efac8000 00000000 000009ff efb11570
Call Trace:
[<c0784701>] ? sk_stream_wait_memory+0x1b1/0x210
[<c048792d>] ? exit_robust_list+0x7d/0x140
[<c04529ca>] ? mm_release+0xda/0xf0
[<c0458d9d>] ? exit_mm+0x6d/0x140
[<c0459136>] ? do_exit+0x126/0x740
[<c046bd60>] ? dequeue_signal+0x30/0x190
[<c045978c>] ? do_group_exit+0x3c/0xa0
[<c046c2fe>] ? get_signal_to_deliver+0x1ce/0x460
[<c0408fa3>] ? do_signal+0x93/0xa50
[<c0476450>] ? autoremove_wake_function+0x0/0x40
[<c05a292c>] ? security_file_permission+0xc/0x10
[<c052ab96>] ? rw_verify_area+0x66/0xe0
[<c045c18e>] ? __do_softirq+0xae/0x1a0
[<c04afed0>] ? __audit_syscall_exit+0x220/0x250
[<c04099d7>] ? do_notify_resume+0x77/0xa0
[<c083135c>] ? work_notifysig+0x13/0x1b
CE: hpet increasing min_delta_ns to 40226 nsec
INFO: task str2many:9401 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
str2many D f0238040 0 9401 1593 0x00000080
f0132ab0 00200082 ef802500 f0238040 00200246 c07d9c2b 00000031 00004b07
00000000 f619ec80 00031c6c 00000001 00031c6c c0b110c0 c0b110c0 f0132d58
c0b110c0 c0b0caa4 c0b110c0 f0132d58 f2294000 00000000 efe6d01c f7032780
Call Trace:
[<c07d9c2b>] ? udp_recvmsg+0x14b/0x2c0
[<c0830c65>] ? rwsem_down_failed_common+0x75/0x1a0
[<c0830dda>] ? rwsem_down_read_failed+0x1a/0x24
[<c0830e17>] ? call_rwsem_down_read_failed+0x7/0xc
[<c0458d9d>] ? exit_mm+0x6d/0x140
[<c0459136>] ? do_exit+0x126/0x740
[<c046bd60>] ? dequeue_signal+0x30/0x190
[<c053da0c>] ? core_sys_select+0x16c/0x270
[<c045978c>] ? do_group_exit+0x3c/0xa0
[<c046c2fe>] ? get_signal_to_deliver+0x1ce/0x460
[<c044126b>] ? check_preempt_wakeup+0x16b/0x220
[<c0408fa3>] ? do_signal+0x93/0xa50
[<c04870ee>] ? wake_futex+0x2e/0x60
[<c0489860>] ? do_futex+0x5b0/0xa40
[<c0480be0>] ? ktime_get_ts+0xd0/0x100
[<c0600623>] ? copy_to_user+0x33/0x110
[<c053c941>] ? poll_select_copy_remaining+0xe1/0x130
[<c04afed0>] ? __audit_syscall_exit+0x220/0x250
[<c04099d7>] ? do_notify_resume+0x77/0xa0
[<c083135c>] ? work_notifysig+0x13/0x1b
[<c0830000>] ? __mutex_lock_slowpath+0xb0/0x140
INFO: task str2many:9407 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
str2many D f01a5570 0 9407 1593 0x00000080
f01a5570 00200082 f619ec80 f01a5570 c082ecb0 00000400 c2c03ac0 00018f12
00000000 f619ec80 00031c6c 00000000 00031c6c c0b110c0 c0b110c0 f01a5818
c0b110c0 c0b0caa4 c0b110c0 f01a5818 f0690000 c0b110c0 f01a5818 342d2c7f
Call Trace:
[<c082ecb0>] ? schedule+0x3c0/0xae0
[<c048792d>] ? exit_robust_list+0x7d/0x140
[<c04529ca>] ? mm_release+0xda/0xf0
[<c0458d9d>] ? exit_mm+0x6d/0x140
[<c0459136>] ? do_exit+0x126/0x740
[<c05f906a>] ? plist_del+0x2a/0x60
[<c046bd60>] ? dequeue_signal+0x30/0x190
[<c045978c>] ? do_group_exit+0x3c/0xa0
[<c046c2fe>] ? get_signal_to_deliver+0x1ce/0x460
[<c0408fa3>] ? do_signal+0x93/0xa50
[<c04893c5>] ? do_futex+0x115/0xa40
[<c04b78d1>] ? move_native_irq+0x11/0x50
[<c04afed0>] ? __audit_syscall_exit+0x220/0x250
[<c04099d7>] ? do_notify_resume+0x77/0xa0
[<c083135c>] ? work_notifysig+0x13/0x1b
INFO: task str2many:9408 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
str2many D efa5b570 0 9408 1593 0x00000080
efa5b570 00200082 f619ec80 efa5b570 c082ecb0 00000000 00000000 00018f12
00000000 f619ec80 00031c6c 00000000 00031c6c c0b110c0 c0b110c0 efa5b818
c0b110c0 c0b0caa4 c0b110c0 efa5b818 eb01e000 c0b110c0 efa5b818 efb42000
Call Trace:
[<c082ecb0>] ? schedule+0x3c0/0xae0
[<c07e03ab>] ? inet_sendmsg+0x4b/0xb0
[<c048792d>] ? exit_robust_list+0x7d/0x140
[<c04529ca>] ? mm_release+0xda/0xf0
[<c0458d9d>] ? exit_mm+0x6d/0x140
[<c0459136>] ? do_exit+0x126/0x740
[<c05f906a>] ? plist_del+0x2a/0x60
[<c046bd60>] ? dequeue_signal+0x30/0x190
[<c045978c>] ? do_group_exit+0x3c/0xa0
[<c046c2fe>] ? get_signal_to_deliver+0x1ce/0x460
[<c0408fa3>] ? do_signal+0x93/0xa50
[<c04893c5>] ? do_futex+0x115/0xa40
[<c04b8b35>] ? rcu_process_callbacks+0x35/0x40
[<c045c18e>] ? __do_softirq+0xae/0x1a0
[<c04afed0>] ? __audit_syscall_exit+0x220/0x250
[<c04099d7>] ? do_notify_resume+0x77/0xa0
[<c083135c>] ? work_notifysig+0x13/0x1b
User contributions licensed under CC BY-SA 3.0