Kernel hanging the tty subsystem

1

I am having some issues with the tty subsystem on a RHEL machine. From what I see in the logs, some kernel oopses are generated each time a new console (be it pts or tty) is spawned. To me it seems that some kind of race condition is occurring there. Here is the stack trace:

kernel:  INFO: task sshd:6338 blocked for more than 120 seconds.
kernel:       Tainted: P           ---------------    2.6.32-504.el6.x86_64 #1
kernel:  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel:  sshd          D 0000000000000000     0  6338   6195 0x00000080
kernel:  ffff88035be8d728 0000000000000082 0000000000000000 0000000000000000
kernel:  ffff88035be8d7f8 ffffffff8105ca34 00009488ef033e83 ffff88035be8d708
kernel:  ffff88035be8d880 0000000109b91c98 ffff881eea341098 ffff88035be8dfd8
kernel:  Call Trace:
kernel:  [<ffffffff8105ca34>] ? find_busiest_group+0x244/0x9e0
kernel:  [<ffffffff8152a8c5>] schedule_timeout+0x215/0x2e0
kernel:  [<ffffffff8152a543>] wait_for_common+0x123/0x180
kernel:  [<ffffffff81064b90>] ? default_wake_function+0x0/0x20
kernel:  [<ffffffff8152a65d>] wait_for_completion+0x1d/0x20
kernel:  [<ffffffff81098bf7>] flush_work+0x77/0xc0
kernel:  [<ffffffff81098460>] ? wq_barrier_func+0x0/0x20
kernel:  [<ffffffff81098e14>] flush_delayed_work+0x54/0x70
kernel:  [<ffffffff813392f5>] tty_flush_to_ldisc+0x15/0x20
kernel:  [<ffffffff81333cc7>] n_tty_poll+0x67/0x1d0
kernel:  [<ffffffff8132f80a>] tty_poll+0x8a/0xa0
kernel:  [<ffffffff811a6895>] do_select+0x3c5/0x7c0
kernel:  [<ffffffff8149cf18>] ? ip_finish_output+0x148/0x310
kernel:  [<ffffffff811a59f0>] ? __pollwait+0x0/0xf0
kernel:  [<ffffffff811a5ae0>] ? pollwake+0x0/0x60
kernel:  [<ffffffff811a5ae0>] ? pollwake+0x0/0x60
kernel:  [<ffffffff811a5ae0>] ? pollwake+0x0/0x60
kernel:  [<ffffffff811a5ae0>] ? pollwake+0x0/0x60
kernel:  [<ffffffff8152d04b>] ? _spin_unlock_bh+0x1b/0x20
kernel:  [<ffffffff8144b835>] ? release_sock+0xe5/0x110
kernel:  [<ffffffff814a52cc>] ? tcp_sendmsg+0x73c/0xa20
kernel:  [<ffffffff8144a72b>] ? sock_aio_write+0x19b/0x1c0
kernel:  [<ffffffff8133158d>] ? tty_wakeup+0x3d/0x80
kernel:  [<ffffffff811a6e1a>] core_sys_select+0x18a/0x2c0
kernel:  [<ffffffff8109eb00>] ? autoremove_wake_function+0x0/0x40
kernel:  [<ffffffff811a71a7>] sys_select+0x47/0x110
kernel:  [<ffffffff810e5c87>] ? audit_syscall_entry+0x1d7/0x200
kernel:  [<ffffffff810e5a7e>] ? __audit_syscall_exit+0x25e/0x290
kernel:  [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

So, looking at the last two function calls, it seems that the task is put to sleep for some time via schedule_timeout(), and after that find_busiest_group() tries to balance the load generated by that task. Is that correct, or am I missing something here?

Thanks.

linux
linux-kernel
trace
asked on Stack Overflow Oct 29, 2014 by Alex

2 Answers

1

In case anyone is interested, I opened a case with Red Hat, and it seems the issue is related to an HP firmware bug in the array controller (hpsa). More details: https://access.redhat.com/solutions/1179703

> I have reviewed the bugzilla and see that the bug may be tied to the Smart Array firmware revision.
>
> The bug was reproduced on a system with the following combination of controller and hpsa module versions (new hpsa module version and old firmware version):
>
>   • kmod-hpsa: 3.4.4-1-RH1
>   • SA Firmware: 3.22
>   • Controller: P220i (103c:323b 103c:3355 rev 01)
>
> The system running the new hpsa module version and the new firmware version did not reproduce this bug:
>
>   • kmod-hpsa: 3.4.4-1-RH1
>   • SA Firmware: 3.42
>   • Controller: P220i (103c:323b 103c:3355 rev 01)
>
> The system running the old hpsa module version and the old firmware version also did not reproduce this bug:
>
>   • kmod-hpsa: 3.4.0-1-RH1
>   • SA Firmware: 3.22
>   • Controller: P220i (103c:323b 103c:3355 rev 01)
>
> In our case, the controller firmware version is 3.22 and we were using the new hpsa module version, 3.4.4-1-RH2.
>
> $ cat proc/scsi/scsi | grep -A 5 P220i
> Vendor: HP Model: P220i Rev: 3.22
> Type: RAID ANSI SCSI revision: 05
>
> Now I see that with the old kernel we are using the old version of the hpsa module (3.4.0-1-RH1). With this, the system should not encounter this bug.
>
> $ modinfo hpsa
> filename: /lib/modules/2.6.32-431.23.3.el6.x86_64/kernel/drivers/scsi/hpsa.ko
> license: GPL
> version: 3.4.0-1-RH1
> description: Driver for HP Smart Array Controller version 3.4.0-1-RH1
> author: Hewlett-Packard Company

This was the statement from a Red Hat engineer.
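
The three tested combinations quoted above can be summarised as a small lookup. The following Python sketch is only an illustration: the table and function name are hypothetical, the table holds just the three combinations from the quoted statement, and any other pair is reported as untested rather than guessed.

```python
# Driver/firmware combinations taken from the quoted Red Hat statement.
# True means the hang was reproduced, False means it was not.
KNOWN_COMBINATIONS = {
    ("3.4.4-1-RH1", "3.22"): True,   # new driver, old firmware
    ("3.4.4-1-RH1", "3.42"): False,  # new driver, new firmware
    ("3.4.0-1-RH1", "3.22"): False,  # old driver, old firmware
}


def hang_reproduced(hpsa_version, firmware_version):
    """Return True/False for a tested combination, or None if untested."""
    return KNOWN_COMBINATIONS.get((hpsa_version, firmware_version))


if __name__ == "__main__":
    for (driver, firmware), affected in sorted(KNOWN_COMBINATIONS.items()):
        status = "reproduced" if affected else "not reproduced"
        print(f"hpsa {driver} + firmware {firmware}: {status}")
```

Note that the 3.4.4-1-RH2 build mentioned for our system is not one of the three tested combinations, so the lookup returns None for it.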

answered on Stack Overflow Nov 11, 2014 by Alex
0

From the stack trace, it looks like the sshd process is entering the D state (uninterruptible sleep) and staying blocked for more than 120 seconds, which is why the messages show up in the syslog. It is possible that the process is waiting for I/O or some other resource while it is blocked.
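
When reading a trace like the one in the question, it helps to know that entries prefixed with `?` are addresses the kernel found on the stack but could not confirm as part of the current call chain, so they may be leftovers from earlier calls. Below is a minimal Python sketch that separates the two kinds of entries. The function name and the regular expression are illustrative only, and it assumes the line format shown in the question.

```python
import re

# Matches lines such as:
#   kernel:  [<ffffffff8152a8c5>] schedule_timeout+0x215/0x2e0
#   kernel:  [<ffffffff8105ca34>] ? find_busiest_group+0x244/0x9e0
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+(?P<unreliable>\?\s+)?(?P<symbol>[\w.]+)\+"
)


def split_frames(trace_text):
    """Return (reliable, unreliable) lists of symbol names, in trace order."""
    reliable, unreliable = [], []
    for line in trace_text.splitlines():
        match = FRAME_RE.search(line)
        if match is None:
            continue  # not a stack frame line
        target = unreliable if match.group("unreliable") else reliable
        target.append(match.group("symbol"))
    return reliable, unreliable


# First six frames copied from the trace in the question.
SAMPLE = """\
kernel:  [<ffffffff8105ca34>] ? find_busiest_group+0x244/0x9e0
kernel:  [<ffffffff8152a8c5>] schedule_timeout+0x215/0x2e0
kernel:  [<ffffffff8152a543>] wait_for_common+0x123/0x180
kernel:  [<ffffffff81064b90>] ? default_wake_function+0x0/0x20
kernel:  [<ffffffff8152a65d>] wait_for_completion+0x1d/0x20
kernel:  [<ffffffff81098bf7>] flush_work+0x77/0xc0
"""

if __name__ == "__main__":
    reliable, unreliable = split_frames(SAMPLE)
    print("call chain (innermost first):", " <- ".join(reliable))
    print("unconfirmed entries:", ", ".join(unreliable))
```

On the sample frames, the confirmed chain is schedule_timeout, wait_for_common, wait_for_completion, flush_work, while find_busiest_group only appears as an unconfirmed entry. That suggests sshd is waiting for the tty flush work to complete rather than being load-balanced.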

Although you are running RHEL 6, the kernel version (2.6.32-504.el6.x86_64) is quite old. I would recommend first updating the kernel with yum upgrade. If you still face the same issue after updating the kernel, I would recommend configuring kdump and obtaining a vmcore when the issue reproduces. Then, if you are comfortable with Linux internals, use the crash tool for further analysis of the core, or contact the OS vendor for support.
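
To notice when the problem is occurring before collecting a vmcore, one option is to look for processes in uninterruptible sleep. Below is a minimal Python sketch that reads the state field from /proc/<pid>/stat. The helper names are hypothetical, and it assumes the standard Linux procfs layout.

```python
import os


def parse_stat(stat_line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' rather than on whitespace.
    left, _, right = stat_line.rpartition(")")
    pid_text, _, comm = left.partition("(")
    return int(pid_text), comm, right.split()[0]


def tasks_in_d_state(proc_root="/proc"):
    """Return [(pid, comm)] for processes currently in state D."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "sys" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, comm, state = parse_stat(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or entry is unreadable
        if state == "D":
            blocked.append((pid, comm))
    return blocked


if __name__ == "__main__":
    for pid, comm in tasks_in_d_state():
        print(f"{pid}\t{comm}")
```

A process that shows up here only briefly is normal. It is the ones that stay in D state for a long time, like sshd in the question, that are worth capturing in a vmcore.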

answered on Stack Overflow Nov 10, 2014 by askb

User contributions licensed under CC BY-SA 3.0