How should we understand this 'locked for more than 120 seconds' call trace?

0

Yesterday a Java service in our Docker Swarm cluster suddenly stopped responding. The task (container) for this service just hung, did not answer any requests, and could not be killed.

Looking at /var/log/syslog, we see the following at the time when the service stopped. We're just a little bit in doubt about how to understand these log events.

Our understanding is that this issue relates to swapped data that needs to be brought back into memory.

The call trace is written to /var/log/syslog excactly 120 seconds after the last log entry from the Java service.

We're running the cluster on top of Hyper-V, and we see that there is a mention of hv_netvsc (Hyper-V Network Virtual Service Consumer). Could this be an issue caused by the hypervisor?

The Java process is fairly resource hungry, and usually requires about 15 - 20GB RAM.

How should we understand this call trace? Could this call trace describe the root of the issue, or is it more likely to be a consequence of the issue that also killed the service?

Dec  9 20:45:36 prodswarm1 kernel: [216481.008119] INFO: task java:8748 blocked for more than 120 seconds.
Dec  9 20:45:36 prodswarm1 kernel: [216481.008150]       Not tainted 4.4.0-101-generic #124-Ubuntu
Dec  9 20:45:36 prodswarm1 kernel: [216481.008169] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec  9 20:45:36 prodswarm1 kernel: [216481.008194] java            D ffff881fdf63f6b8     0  8748   7863 0x00000100
Dec  9 20:45:36 prodswarm1 kernel: [216481.008198]  ffff881fdf63f6b8 ffff881fdf63f690 ffff881fd93af000 ffff881fd93a8e00
Dec  9 20:45:36 prodswarm1 kernel: [216481.008200]  ffff881fdf640000 ffff8810278d6e00 7fffffffffffffff ffffffff81840ed0
Dec  9 20:45:36 cprodswarm1phprodswarm1 kernel: [216481.008202]  ffff881fdf63f810 ffff881fdf63f6d0 ffffffff818406d5 0000000000000000
Dec  9 20:45:36 prodswarm1 kernel: [216481.008204] Call Trace:
Dec  9 20:45:36 prodswarm1 kernel: [216481.008211]  [<ffffffff81840ed0>] ? bit_wait+0x60/0x60
Dec  9 20:45:36 prodswarm1 kernel: [216481.008214]  [<ffffffff818406d5>] schedule+0x35/0x80
Dec  9 20:45:36 prodswarm1 kernel: [216481.008216]  [<ffffffff81843825>] schedule_timeout+0x1b5/0x270
Dec  9 20:45:36 prodswarm1 kernel: [216481.008222]  [<ffffffff810b2d0c>] ? __enqueue_entity+0x6c/0x70
Dec  9 20:45:36 prodswarm1 kernel: [216481.008224]  [<ffffffff810b9b87>] ? put_prev_entity+0x97/0x7d0
Dec  9 20:45:36 prodswarm1 kernel: [216481.008228]  [<ffffffff810f635b>] ? ktime_get+0x3b/0xb0
Dec  9 20:45:36 prodswarm1 kernel: [216481.008229]  [<ffffffff81840ed0>] ? bit_wait+0x60/0x60
Dec  9 20:45:36 prodswarm1 kernel: [216481.008231]  [<ffffffff8183fc04>] io_schedule_timeout+0xa4/0x110
Dec  9 20:45:36 prodswarm1 kernel: [216481.008233]  [<ffffffff81840eeb>] bit_wait_io+0x1b/0x70
Dec  9 20:45:36 prodswarm1 kernel: [216481.008234]  [<ffffffff81840a7d>] __wait_on_bit+0x5d/0x90
Dec  9 20:45:36 prodswarm1 kernel: [216481.008239]  [<ffffffff8118efdb>] wait_on_page_bit+0xcb/0xf0
Dec  9 20:45:36 prodswarm1 kernel: [216481.008242]  [<ffffffff810c4460>] ? autoremove_wake_function+0x40/0x40
Dec  9 20:45:36 prodswarm1 kernel: [216481.008246]  [<ffffffff811a5faa>] shrink_page_list+0x78a/0x7a0
Dec  9 20:45:36 prodswarm1 kernel: [216481.008248]  [<ffffffff811a6659>] shrink_inactive_list+0x209/0x520
Dec  9 20:45:36 prodswarm1 kernel: [216481.008251]  [<ffffffff811a72e3>] shrink_lruvec+0x583/0x740
Dec  9 20:45:36 prodswarm1 kernel: [216481.008254]  [<ffffffff81098601>] ? insert_work+0x11/0xc0
Dec  9 20:45:36 prodswarm1 kernel: [216481.008256]  [<ffffffff811a758f>] shrink_zone+0xef/0x2e0
Dec  9 20:45:36 prodswarm1 kernel: [216481.008258]  [<ffffffff811a78db>] do_try_to_free_pages+0x15b/0x3b0
Dec  9 20:45:36 prodswarm1 kernel: [216481.008260]  [<ffffffff811a7d6a>] try_to_free_mem_cgroup_pages+0xba/0x1a0
Dec  9 20:45:36 prodswarm1 kernel: [216481.008264]  [<ffffffff811ffc64>] try_charge+0x194/0x660
Dec  9 20:45:36 prodswarm1 kernel: [216481.008267]  [<ffffffff81203f7b>] mem_cgroup_try_charge+0x6b/0x1e0
Dec  9 20:45:36 prodswarm1 kernel: [216481.008269]  [<ffffffff811fb71a>] do_huge_pmd_anonymous_page+0x16a/0x580
Dec  9 20:45:36 prodswarm1 kernel: [216481.008271]  [<ffffffff811c30b4>] handle_mm_fault+0x15d4/0x1820
Dec  9 20:45:36 prodswarm1 kernel: [216481.008275]  [<ffffffff81103ba7>] ? do_futex+0x307/0x540
Dec  9 20:45:36 prodswarm1 kernel: [216481.008279]  [<ffffffff8106b577>] __do_page_fault+0x197/0x400
Dec  9 20:45:36 prodswarm1 kernel: [216481.008281]  [<ffffffff8106b802>] do_page_fault+0x22/0x30
Dec  9 20:45:36 prodswarm1 kernel: [216481.008283]  [<ffffffff81846978>] page_fault+0x28/0x30
Dec  9 20:45:36 prodswarm1 kernel: [216481.008293] INFO: task java:14024 blocked for more than 120 seconds.
Dec  9 20:45:36 prodswarm1 kernel: [216481.008313]       Not tainted 4.4.0-101-generic #124-Ubuntu
Dec  9 20:45:36 prodswarm1 kernel: [216481.008331] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec  9 20:45:36 prodswarm1 kernel: [216481.008356] java            D ffff8800f2c5b158     0 14024   7863 0x00000100
Dec  9 20:45:36 prodswarm1 kernel: [216481.008358]  ffff8800f2c5b158 0000000000000000 ffff8810208aaa00 ffff8800f2ac0000
Dec  9 20:45:36 prodswarm1 kernel: [216481.008360]  ffff8800f2c5c000 ffff881027956e00 7fffffffffffffff ffffffff81840ed0
Dec  9 20:45:36 prodswarm1 kernel: [216481.008362]  ffff8800f2c5b2b0 ffff8800f2c5b170 ffffffff818406d5 0000000000000000
Dec  9 20:45:36 prodswarm1 kernel: [216481.008363] Call Trace:
Dec  9 20:45:36 prodswarm1 kernel: [216481.008365]  [<ffffffff81840ed0>] ? bit_wait+0x60/0x60
Dec  9 20:45:36 prodswarm1 kernel: [216481.008367]  [<ffffffff818406d5>] schedule+0x35/0x80
Dec  9 20:45:36 prodswarm1 kernel: [216481.008369]  [<ffffffff81843825>] schedule_timeout+0x1b5/0x270
Dec  9 20:45:36 prodswarm1 kernel: [216481.008374]  [<ffffffffc01f095b>] ? netvsc_start_xmit+0x3ab/0x7b0 [hv_netvsc]
Dec  9 20:45:36 prodswarm1 kernel: [216481.008376]  [<ffffffff810f635b>] ? ktime_get+0x3b/0xb0
Dec  9 20:45:36 prodswarm1 kernel: [216481.008377]  [<ffffffff81840ed0>] ? bit_wait+0x60/0x60
Dec  9 20:45:36 prodswarm1 kernel: [216481.008379]  [<ffffffff8183fc04>] io_schedule_timeout+0xa4/0x110
Dec  9 20:45:36 prodswarm1 kernel: [216481.008381]  [<ffffffff81840eeb>] bit_wait_io+0x1b/0x70
Dec  9 20:45:36 prodswarm1 kernel: [216481.008382]  [<ffffffff81840a7d>] __wait_on_bit+0x5d/0x90
Dec  9 20:45:36 prodswarm1 kernel: [216481.008385]  [<ffffffff8118efdb>] wait_on_page_bit+0xcb/0xf0
Dec  9 20:45:36 prodswarm1 kernel: [216481.008387]  [<ffffffff810c4460>] ? autoremove_wake_function+0x40/0x40
Dec  9 20:45:36 prodswarm1 kernel: [216481.008388]  [<ffffffff811a5faa>] shrink_page_list+0x78a/0x7a0
Dec  9 20:45:36 prodswarm1 kernel: [216481.008391]  [<ffffffff811a6659>] shrink_inactive_list+0x209/0x520
Dec  9 20:45:36 prodswarm1 kernel: [216481.008392]  [<ffffffff811a72e3>] shrink_lruvec+0x583/0x740
Dec  9 20:45:36 prodswarm1 kernel: [216481.008394]  [<ffffffff811a758f>] shrink_zone+0xef/0x2e0
Dec  9 20:45:36 prodswarm1 kernel: [216481.008396]  [<ffffffff811a78db>] do_try_to_free_pages+0x15b/0x3b0
Dec  9 20:45:36 prodswarm1 kernel: [216481.008398]  [<ffffffff811a7d6a>] try_to_free_mem_cgroup_pages+0xba/0x1a0
Dec  9 20:45:36 prodswarm1 kernel: [216481.008400]  [<ffffffff811ffc64>] try_charge+0x194/0x660
Dec  9 20:45:36 prodswarm1 kernel: [216481.008402]  [<ffffffff81203836>] __memcg_kmem_charge_memcg+0x96/0xc0
Dec  9 20:45:36 prodswarm1 kernel: [216481.008406]  [<ffffffff811ed611>] new_slab+0x461/0x490
Dec  9 20:45:36 prodswarm1 kernel: [216481.008408]  [<ffffffff811ee48b>] ___slab_alloc+0x22b/0x470
Dec  9 20:45:36 prodswarm1 kernel: [216481.008429]  [<ffffffffc045597b>] ? nfs_generic_pgio+0x6b/0x2a0 [nfs]
Dec  9 20:45:36 prodswarm1 kernel: [216481.008431]  [<ffffffff81203659>] ? __memcg_kmem_get_cache+0x39/0x130
Dec  9 20:45:36 prodswarm1 kernel: [216481.008437]  [<ffffffffc045597b>] ? nfs_generic_pgio+0x6b/0x2a0 [nfs]
Dec  9 20:45:36 prodswarm1 kernel: [216481.008439]  [<ffffffff811ee6f0>] __slab_alloc+0x20/0x40
Dec  9 20:45:36 prodswarm1 kernel: [216481.008441]  [<ffffffff811f0085>] __kmalloc+0x1d5/0x250
Dec  9 20:45:36 prodswarm1 kernel: [216481.008448]  [<ffffffffc045597b>] nfs_generic_pgio+0x6b/0x2a0 [nfs]
Dec  9 20:45:36 prodswarm1 kernel: [216481.008456]  [<ffffffffc0455c18>] nfs_generic_pg_pgios+0x68/0xd0 [nfs]
Dec  9 20:45:36 prodswarm1 kernel: [216481.008463]  [<ffffffffc0455857>] nfs_pageio_doio+0x27/0x60 [nfs]
Dec  9 20:45:36 prodswarm1 kernel: [216481.008470]  [<ffffffffc045631e>] __nfs_pageio_add_request+0xae/0x480 [nfs]
Dec  9 20:45:36 prodswarm1 kernel: [216481.008477]  [<ffffffffc0456da3>] nfs_pageio_add_request+0xa3/0x1b0 [nfs]
Dec  9 20:45:36 prodswarm1 kernel: [216481.008487]  [<ffffffffc045a76a>] nfs_do_writepage+0xca/0x1c0 [nfs]
Dec  9 20:45:36 prodswarm1 kernel: [216481.008496]  [<ffffffffc045a874>] nfs_writepages_callback+0x14/0x30 [nfs]
Dec  9 20:45:36 prodswarm1 kernel: [216481.008499]  [<ffffffff8119ba4e>] write_cache_pages+0x1ee/0x510
Dec  9 20:45:36 prodswarm1 kernel: [216481.008502]  [<ffffffff810c4175>] ? wake_up_bit+0x25/0x30
Dec  9 20:45:36 prodswarm1 kernel: [216481.008511]  [<ffffffffc045a860>] ? nfs_do_writepage+0x1c0/0x1c0 [nfs]
Dec  9 20:45:36 prodswarm1 kernel: [216481.008520]  [<ffffffffc045ac95>] nfs_writepages+0xa5/0x150 [nfs]
Dec  9 20:45:36 prodswarm1 kernel: [216481.008528]  [<ffffffffc044bdeb>] ? nfs_write_end+0x16b/0x4b0 [nfs]
Dec  9 20:45:36 prodswarm1 kernel: [216481.008530]  [<ffffffff8119e25e>] do_writepages+0x1e/0x30
Dec  9 20:45:36 prodswarm1 kernel: [216481.008532]  [<ffffffff81190f86>] __filemap_fdatawrite_range+0xc6/0x100
Dec  9 20:45:36 prodswarm1 kernel: [216481.008534]  [<ffffffff811910ca>] filemap_write_and_wait_range+0x2a/0x70
Dec  9 20:45:36 prodswarm1 kernel: [216481.008539]  [<ffffffffc044b1d3>] nfs_file_fsync+0x83/0x130 [nfs]
Dec  9 20:45:36 prodswarm1 kernel: [216481.008542]  [<ffffffff8124509b>] vfs_fsync_range+0x4b/0xb0
Dec  9 20:45:36 prodswarm1 kernel: [216481.008544]  [<ffffffff8124511c>] vfs_fsync+0x1c/0x20
Dec  9 20:45:36 prodswarm1 kernel: [216481.008549]  [<ffffffffc044b6a6>] nfs_file_flush+0x46/0x60 [nfs]
Dec  9 20:45:36 prodswarm1 kernel: [216481.008552]  [<ffffffff8120d7df>] filp_close+0x2f/0x70
Dec  9 20:45:36 prodswarm1 kernel: [216481.008554]  [<ffffffff8122f193>] __close_fd+0xa3/0xd0
Dec  9 20:45:36 prodswarm1 kernel: [216481.008556]  [<ffffffff8120d843>] SyS_close+0x23/0x50
Dec  9 20:45:36 prodswarm1 kernel: [216481.008558]  [<ffffffff818447f2>] entry_SYSCALL_64_fastpath+0x16/0x71
linux-kernel
asked on Server Fault Dec 10, 2017 by sbrattla

1 Answer

1

To me it looks like it is some kind of deadlock during swapping. You need to monitor the memory usage of the application and swap usage, and see how it behaves when this condition occurs. This way you can see if the issue is too little memory allocated for the container.

answered on Server Fault Dec 10, 2017 by Tero Kilkanen

User contributions licensed under CC BY-SA 3.0