I'm dealing with an intermittent lockup issue on an embedded Yocto system (kernel 4.1.15-2.0.0).
Over the course of a few weeks we have managed to reproduce it several times across several units, but not consistently; when a lockup does occur it usually takes several days. The system becomes unresponsive over the serial port and Ethernet, stops responding to pings, and a service we created just to blink an LED dies as well.
I'm figuring it's a kernel/driver lock of some sort, and we were finally able to pull a dmesg log from a unit over ssh, up to the point where the ssh session died:
I've included the entire dmesg log since it captures the boot as well, but the interesting part starts at timestamp 307362, after the audit messages. We're seeing:
[307362.408117] INFO: rcu_preempt detected stalls on CPUs/tasks:
[307362.412514] (detected by 1, t=2102 jiffies, g=12684671, c=12684670, q=711)
[307362.418223] All QSes seen, last rcu_preempt kthread activity 2101 (30706237-30704136), jiffies_till_next_fqs=1, root ->qsmask 0x0
[307362.428582] cfinteractive R running 0 38 2 0x00000000
followed by a backtrace. This repeats a few times with an extra message:
[307362.430710] rcu_preempt kthread starved for 2101 jiffies!
before the connection eventually dies and the system presumably locks.
What I'm taking from these messages is that a kernel thread is being starved by something... priority inversion, perhaps?
How do I go about chasing down the addresses listed in the backtrace? And does anyone have a suggested direction for tackling this issue?
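For context, here is the first step I've tried for the addresses: pull the raw PCs out of the backtrace lines, then feed them to the cross-toolchain addr2line against the vmlinux from the matching build. The sample addresses and toolchain prefix below are illustrative, not from my real log, and this assumes the kernel was built with debug info:

```shell
# Sample backtrace lines (made up for illustration); real ones come
# straight out of the captured dmesg.
excerpt='[<8012abcd>] (rcu_check_callbacks) from [<8013ef01>] (update_process_times+0x38/0x64)
[<80145678>] (tick_sched_timer+0x44/0x74) from [<80156789>] (__hrtimer_run_queues+0x110/0x2a4)'

# Extract the 8-hex-digit PCs from the [<...>] markers.
addrs=$(printf '%s\n' "$excerpt" | grep -o '\[<[0-9a-f]\{8\}>\]' | tr -d '[]<>')
printf '%s\n' "$addrs"

# Then, on the build host, resolve them against the matching vmlinux
# (requires CONFIG_DEBUG_INFO; toolchain prefix is from my SDK):
#   arm-poky-linux-gnueabi-addr2line -f -i -e .../vmlinux $addrs
```

That gives me file:line for each frame, but I'm not sure the backtrace alone will explain *why* the kthread is starved, hence the question.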