I'm trying to figure out whether the VPS itself is the root cause of crashes that occur every 3-7 days at night, between 03:00 and 04:00 (a kernel bug, or something else), or whether it is the node on which the virtual server is hosted (a back-end issue).
Details: a KVM-based VPS running CentOS 7 on XFS, hosted at a VPS provider that runs its own back-end and storage back-end infrastructure.
It usually happens the following way: the running kthreadd process suddenly goes into D state (uninterruptible sleep), and then we get "blocked for more than 120 seconds" messages and a high load average:
May 21 03:08:01 vps root: root 2 0.0 0.0 0 0 ? S May18 0:00 [kthreadd]
May 21 03:10:01 vps root: root 2 0.0 0.0 0 0 ? S May18 0:00 [kthreadd]
May 21 03:12:01 vps root: root 2 0.0 0.0 0 0 ? S May18 0:00 [kthreadd]
May 21 03:14:01 vps root: root 2 0.0 0.0 0 0 ? D May18 0:00 [kthreadd]
May 21 03:15:16 vps kernel: INFO: task kthreadd:2 blocked for more than 120 seconds.
May 21 03:15:16 vps kernel: kthreadd D ffffffffffffffff 0 2 0 0x00000000
May 21 03:15:16 vps kernel: [<ffffffff810a65f2>] kthreadd+0x2b2/0x2f0
May 21 03:16:01 vps root: root 2 0.0 0.0 0 0 ? D May18 0:00 [kthreadd]
May 21 03:18:01 vps root: root 2 0.0 0.0 0 0 ? D May18 0:00 [kthreadd]
May 21 03:20:02 vps root: root 2 0.0 0.0 0 0 ? D May18 0:00 [kthreadd]
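The "vps root:" lines above come from a periodic check that writes the kthreadd process state to syslog. The exact job is not important; a minimal sketch of that kind of check (assuming cron plus logger(1)) would be:

# /etc/cron.d/kthreadd-watch -- illustrative only: log kthreadd's ps line every 2 minutes
# logger(1) without a tag uses the caller's login name, which is where the "root:" prefix comes from
*/2 * * * * root ps aux | awk '$11 == "[kthreadd]"' | logger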
Here we have a call trace:
May 18 04:14:37 vps kernel: INFO: task kthreadd:2 blocked for more than 120 seconds.
May 18 04:14:37 vps kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 18 04:14:37 vps kernel: kthreadd D ffffffffffffffff 0 2 0 0x00000000
May 18 04:14:37 vps kernel: ffff88023413b4e0 0000000000000046 ffff880234120b80 ffff88023413bfd8
May 18 04:14:37 vps kernel: ffff88023413bfd8 ffff88023413bfd8 ffff880234120b80 ffff88023413b628
May 18 04:14:37 vps kernel: ffff88023413b630 7fffffffffffffff ffff880234120b80 ffffffffffffffff
May 18 04:14:37 vps kernel: Call Trace:
May 18 04:14:37 vps kernel: [<ffffffff8163ae49>] schedule+0x29/0x70
May 18 04:14:37 vps kernel: [<ffffffff81638b39>] schedule_timeout+0x209/0x2d0
May 18 04:14:37 vps kernel: [<ffffffff8104fac3>] ? x2apic_send_IPI_mask+0x13/0x20
May 18 04:14:37 vps kernel: [<ffffffff810b8a86>] ? try_to_wake_up+0x1b6/0x300
May 18 04:14:37 vps kernel: [<ffffffff8163b216>] wait_for_completion+0x116/0x170
May 18 04:14:37 vps kernel: [<ffffffff810b8c30>] ? wake_up_state+0x20/0x20
May 18 04:14:37 vps kernel: [<ffffffff8109e7ac>] flush_work+0xfc/0x1c0
May 18 04:14:37 vps kernel: [<ffffffff8109a7e0>] ? move_linked_works+0x90/0x90
May 18 04:14:37 vps kernel: [<ffffffffa021143a>] xlog_cil_force_lsn+0x8a/0x210 [xfs]
May 18 04:14:37 vps kernel: [<ffffffffa020fa7e>] _xfs_log_force_lsn+0x6e/0x2f0 [xfs]
May 18 04:14:37 vps kernel: [<ffffffff81632005>] ? __slab_free+0x10e/0x277
May 18 04:14:37 vps kernel: [<ffffffffa020fd2e>] xfs_log_force_lsn+0x2e/0x90 [xfs]
May 18 04:14:37 vps kernel: [<ffffffffa0201fc9>] ? xfs_iunpin_wait+0x19/0x20 [xfs]
May 18 04:14:37 vps kernel: [<ffffffffa01fe4b7>] __xfs_iunpin_wait+0xa7/0x150 [xfs]
May 18 04:14:37 vps kernel: [<ffffffff810a6b60>] ? wake_atomic_t_function+0x40/0x40
May 18 04:14:37 vps kernel: [<ffffffffa0201fc9>] xfs_iunpin_wait+0x19/0x20 [xfs]
May 18 04:14:37 vps kernel: [<ffffffffa01f684c>] xfs_reclaim_inode+0x8c/0x350 [xfs]
May 18 04:14:37 vps kernel: [<ffffffffa01f6d77>] xfs_reclaim_inodes_ag+0x267/0x390 [xfs]
May 18 04:14:37 vps kernel: [<ffffffffa01f7923>] xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
May 18 04:14:37 vps kernel: [<ffffffffa0206895>] xfs_fs_free_cached_objects+0x15/0x20 [xfs]
May 18 04:14:37 vps kernel: [<ffffffff811e0cd8>] prune_super+0xe8/0x170
May 18 04:14:37 vps kernel: [<ffffffff8117c5c5>] shrink_slab+0x165/0x300
May 18 04:14:37 vps kernel: [<ffffffff811d5f01>] ? vmpressure+0x21/0x90
May 18 04:14:37 vps kernel: [<ffffffff8117f742>] do_try_to_free_pages+0x3c2/0x4e0
May 18 04:14:37 vps kernel: [<ffffffff8117f95c>] try_to_free_pages+0xfc/0x180
May 18 04:14:37 vps kernel: [<ffffffff8117365d>] __alloc_pages_nodemask+0x7fd/0xb90
May 18 04:14:37 vps kernel: [<ffffffff81078d73>] copy_process.part.25+0x163/0x1610
May 18 04:14:37 vps kernel: [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
May 18 04:14:37 vps kernel: [<ffffffff8107a401>] do_fork+0xe1/0x320
May 18 04:14:37 vps kernel: [<ffffffff8107a666>] kernel_thread+0x26/0x30
May 18 04:14:37 vps kernel: [<ffffffff810a65f2>] kthreadd+0x2b2/0x2f0
May 18 04:14:37 vps kernel: [<ffffffff810a6340>] ? kthread_create_on_cpu+0x60/0x60
May 18 04:14:37 vps kernel: [<ffffffff81645e18>] ret_from_fork+0x58/0x90
May 18 04:14:37 vps kernel: [<ffffffff810a6340>] ? kthread_create_on_cpu+0x60/0x60
The usual trick with dirty pages did not help.
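By the dirty-pages trick I mean writeback tuning roughly along these lines; the exact sysctls and values below are only an illustration, not a recommendation:

# Illustrative vm.dirty_* writeback tuning (values are examples)
sysctl -w vm.dirty_background_ratio=5     # start background writeback earlier
sysctl -w vm.dirty_ratio=10               # throttle writers sooner
sysctl -w vm.dirty_expire_centisecs=1000  # consider dirty data old after 10 s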
Only a hard reset brings the server back to a working state.
Could you help me understand whether this is an issue on the VPS's side or on the node's side?
Regards, Alex.
It's probably a backup process or something else storage-impacting happening at the host level. This is outside of your control, and you should push the VPS provider for a solution.
If they can't resolve it, consider going elsewhere.
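If you want evidence to take to them, capturing disk latency and CPU steal during the 03:00-04:00 window is usually enough. A minimal sketch, assuming the sysstat package is installed (the schedule and log path are just examples):

# /etc/cron.d/io-evidence -- sample disk latency and CPU steal once a minute during the problem window
* 3-4 * * * root (date; iostat -x 1 5; vmstat 1 5) >> /var/log/io-evidence.log 2>&1

High await on the virtual disk, or a high st (steal) column in vmstat while your own workload is idle, points at the host rather than the guest.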
This is because you are running Red Hat/CentOS 7.2 with XFS. The kernel is not as stable as it was with 7.1. The current workaround is to migrate to ext4 if you want to stay on CentOS 7.2.
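To confirm what you are actually running before deciding, something like the following is enough. Note that an existing XFS filesystem cannot be converted to ext4 in place, so migrating means reinstalling or restoring onto a freshly created ext4 filesystem (the device name below is only an example):

# Check the kernel and OS release, and which mounts are XFS
uname -r
cat /etc/redhat-release
findmnt -t xfs
# Create an ext4 filesystem on a new/spare volume (destroys its contents; device is an example)
mkfs.ext4 /dev/vdb1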