KVM-based VPS crashes every 3-7 days. Is it an issue on the VPS side or the node's?

2

I'm wondering whether the VPS itself could be the root cause of the crashes, which occur every 3-7 days at night between 03:00 and 04:00 (a kernel bug, or something else), or whether it's the node on which the virtual server is hosted (an issue with the provider's back-end).

Details: a KVM-based VPS with CentOS 7 on XFS, hosted at a VPS provider that runs its own compute back-end and storage back-end infrastructure.

It usually happens the following way: all at once the running kthreadd process goes into D state (uninterruptible sleep), and then we get "blocked for more than 120 seconds" messages and a high load average:

May 21 03:08:01 vps root: root 2 0.0 0.0 0 0 ? S May18 0:00 [kthreadd]
May 21 03:10:01 vps root: root 2 0.0 0.0 0 0 ? S May18 0:00 [kthreadd]
May 21 03:12:01 vps root: root 2 0.0 0.0 0 0 ? S May18 0:00 [kthreadd]
May 21 03:14:01 vps root: root 2 0.0 0.0 0 0 ? D May18 0:00 [kthreadd]
May 21 03:15:16 vps kernel: INFO: task kthreadd:2 blocked for more than 120 seconds.
May 21 03:15:16 vps kernel: kthreadd D ffffffffffffffff 0 2 0 0x00000000
May 21 03:15:16 vps kernel: [<ffffffff810a65f2>] kthreadd+0x2b2/0x2f0
May 21 03:16:01 vps root: root 2 0.0 0.0 0 0 ? D May18 0:00 [kthreadd]
May 21 03:18:01 vps root: root 2 0.0 0.0 0 0 ? D May18 0:00 [kthreadd]
May 21 03:20:02 vps root: root 2 0.0 0.0 0 0 ? D May18 0:00 [kthreadd]
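The ps samples above are apparently written to syslog by a simple cron job. A minimal sketch of such a D-state watcher (script path, logger tag and interval are illustrative, not what actually runs on the box):

    #!/bin/bash
    # dstate-watch.sh -- log every process in uninterruptible sleep (D state).
    # Run from cron, e.g.: */2 * * * * /usr/local/bin/dstate-watch.sh
    # Column 8 of "ps aux" is STAT; D or D+ etc. means uninterruptible sleep.
    ps aux --no-headers | awk '$8 ~ /^D/' | while read -r line; do
        logger -t dstate "$line"
    done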

Here is the call trace:

May 18 04:14:37 vps kernel: INFO: task kthreadd:2 blocked for more than 120 seconds.
May 18 04:14:37 vps kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 18 04:14:37 vps kernel: kthreadd D ffffffffffffffff 0 2 0 0x00000000
May 18 04:14:37 vps kernel: ffff88023413b4e0 0000000000000046 ffff880234120b80 ffff88023413bfd8
May 18 04:14:37 vps kernel: ffff88023413bfd8 ffff88023413bfd8 ffff880234120b80 ffff88023413b628
May 18 04:14:37 vps kernel: ffff88023413b630 7fffffffffffffff ffff880234120b80 ffffffffffffffff
May 18 04:14:37 vps kernel: Call Trace:
May 18 04:14:37 vps kernel: [<ffffffff8163ae49>] schedule+0x29/0x70
May 18 04:14:37 vps kernel: [<ffffffff81638b39>] schedule_timeout+0x209/0x2d0
May 18 04:14:37 vps kernel: [<ffffffff8104fac3>] ? x2apic_send_IPI_mask+0x13/0x20
May 18 04:14:37 vps kernel: [<ffffffff810b8a86>] ? try_to_wake_up+0x1b6/0x300
May 18 04:14:37 vps kernel: [<ffffffff8163b216>] wait_for_completion+0x116/0x170
May 18 04:14:37 vps kernel: [<ffffffff810b8c30>] ? wake_up_state+0x20/0x20
May 18 04:14:37 vps kernel: [<ffffffff8109e7ac>] flush_work+0xfc/0x1c0
May 18 04:14:37 vps kernel: [<ffffffff8109a7e0>] ? move_linked_works+0x90/0x90
May 18 04:14:37 vps kernel: [<ffffffffa021143a>] xlog_cil_force_lsn+0x8a/0x210 [xfs]
May 18 04:14:37 vps kernel: [<ffffffffa020fa7e>] _xfs_log_force_lsn+0x6e/0x2f0 [xfs]
May 18 04:14:37 vps kernel: [<ffffffff81632005>] ? __slab_free+0x10e/0x277
May 18 04:14:37 vps kernel: [<ffffffffa020fd2e>] xfs_log_force_lsn+0x2e/0x90 [xfs]
May 18 04:14:37 vps kernel: [<ffffffffa0201fc9>] ? xfs_iunpin_wait+0x19/0x20 [xfs]
May 18 04:14:37 vps kernel: [<ffffffffa01fe4b7>] __xfs_iunpin_wait+0xa7/0x150 [xfs]
May 18 04:14:37 vps kernel: [<ffffffff810a6b60>] ? wake_atomic_t_function+0x40/0x40
May 18 04:14:37 vps kernel: [<ffffffffa0201fc9>] xfs_iunpin_wait+0x19/0x20 [xfs]
May 18 04:14:37 vps kernel: [<ffffffffa01f684c>] xfs_reclaim_inode+0x8c/0x350 [xfs]
May 18 04:14:37 vps kernel: [<ffffffffa01f6d77>] xfs_reclaim_inodes_ag+0x267/0x390 [xfs]
May 18 04:14:37 vps kernel: [<ffffffffa01f7923>] xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
May 18 04:14:37 vps kernel: [<ffffffffa0206895>] xfs_fs_free_cached_objects+0x15/0x20 [xfs]
May 18 04:14:37 vps kernel: [<ffffffff811e0cd8>] prune_super+0xe8/0x170
May 18 04:14:37 vps kernel: [<ffffffff8117c5c5>] shrink_slab+0x165/0x300
May 18 04:14:37 vps kernel: [<ffffffff811d5f01>] ? vmpressure+0x21/0x90
May 18 04:14:37 vps kernel: [<ffffffff8117f742>] do_try_to_free_pages+0x3c2/0x4e0
May 18 04:14:37 vps kernel: [<ffffffff8117f95c>] try_to_free_pages+0xfc/0x180
May 18 04:14:37 vps kernel: [<ffffffff8117365d>] __alloc_pages_nodemask+0x7fd/0xb90
May 18 04:14:37 vps kernel: [<ffffffff81078d73>] copy_process.part.25+0x163/0x1610
May 18 04:14:37 vps kernel: [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
May 18 04:14:37 vps kernel: [<ffffffff8107a401>] do_fork+0xe1/0x320
May 18 04:14:37 vps kernel: [<ffffffff8107a666>] kernel_thread+0x26/0x30
May 18 04:14:37 vps kernel: [<ffffffff810a65f2>] kthreadd+0x2b2/0x2f0
May 18 04:14:37 vps kernel: [<ffffffff810a6340>] ? kthread_create_on_cpu+0x60/0x60
May 18 04:14:37 vps kernel: [<ffffffff81645e18>] ret_from_fork+0x58/0x90
May 18 04:14:37 vps kernel: [<ffffffff810a6340>] ? kthread_create_on_cpu+0x60/0x60

Reading the trace bottom-up: kthreadd tries to fork a new kernel thread, the page allocation falls into direct reclaim, reclaim shrinks the XFS inode cache, and that ends up waiting indefinitely on an XFS log flush, i.e. on storage I/O that never completes.

Tuning the dirty-page writeback settings (the usual trick with dirty pages) did not help.
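For reference, that tuning was along these lines (a sketch; the exact values are illustrative, not necessarily the ones we used):

    # Start background writeback earlier and cap dirty memory lower,
    # hoping to shorten stalls on the virtual disk (illustrative values):
    sysctl -w vm.dirty_background_ratio=5
    sysctl -w vm.dirty_ratio=10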

Only a hard reset brings the server back into an operating state.

Could you help me understand whether this issue is caused on the VPS's side or the node's?

Regards, Alex.

kvm-virtualization
centos7
server-crashes
asked on Server Fault May 24, 2016 by Alex

2 Answers

5

It's probably a backup process or something storage-impacting happening at the host level. This is outside of your control and you should push the VPS provider for a solution.

If they can't resolve it, consider going elsewhere.
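To gather evidence for that conversation, sample disk latency and CPU steal from inside the guest around the 03:00-04:00 window (a rough sketch; the device name is an assumption, adjust to your VPS):

    # await rising and %steal/%iowait spiking while your own workload is
    # idle points at the host, not the guest. vda is an assumed device name.
    nohup iostat -dx vda 60 >> /var/log/iostat.log &
    nohup sar -u 60 >> /var/log/cpu.log &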

answered on Server Fault May 24, 2016 by ewwhite
-2

This is because you use Red Hat/CentOS 7.2 with XFS. The kernel is not as stable as it was in 7.1. The current workaround is to migrate to ext4 if you want to stay on CentOS 7.2.
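Note there is no in-place conversion from XFS to ext4; migrating means creating a fresh ext4 filesystem and copying the data over (a rough sketch, assuming a spare volume /dev/vdb1 and an illustrative data path; the root filesystem would have to be redone from a rescue environment):

    # Create an ext4 filesystem on a spare volume (device name is an assumption)
    mkfs.ext4 /dev/vdb1
    mount /dev/vdb1 /mnt

    # Copy data, preserving permissions, ACLs, xattrs and hard links
    rsync -aAXH /srv/data/ /mnt/

    # Then point the relevant /etc/fstab entry at the new filesystem and remount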

answered on Server Fault Nov 25, 2016 by crashedagain

User contributions licensed under CC BY-SA 3.0