The issue we are seeing is that some of our NFS-mounted directories are disappearing. Literally just disappearing. They continue to exist at the server end.
At the client end (CentOS 7.4, elrepo's kernel-ml at 4.16.6, nfs-utils-1:1.3.0-0.48.el7) the drive is unresponsive: calls to ls /mountpoint just hang. The OS still thinks the drive is mounted, so rebooting a server requires a hard reset because the shutdown process hangs on unmounting the drive.
In the logs we are seeing very little; at the server end we see nothing at all. Sometimes, but not always, we see something like this in /var/log/messages on the client:
May 29 16:55:22 papr-res-compute06 kernel: INFO: task STAR:1370 blocked for more than 120 seconds.
May 29 16:55:22 papr-res-compute06 kernel: Not tainted 4.16.6-1.el7.elrepo.x86_64 #1
May 29 16:55:22 papr-res-compute06 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 29 16:55:22 papr-res-compute06 kernel: STAR D 0 1370 1351 0x00000080
May 29 16:55:22 papr-res-compute06 kernel: Call Trace:
May 29 16:55:22 papr-res-compute06 kernel: __schedule+0x290/0x890
May 29 16:55:22 papr-res-compute06 kernel: ? out_of_line_wait_on_atomic_t+0x110/0x110
May 29 16:55:22 papr-res-compute06 kernel: schedule+0x36/0x80
May 29 16:55:22 papr-res-compute06 kernel: io_schedule+0x16/0x40
May 29 16:55:22 papr-res-compute06 kernel: bit_wait_io+0x11/0x50
May 29 16:55:22 papr-res-compute06 kernel: __wait_on_bit+0x66/0x90
May 29 16:55:22 papr-res-compute06 kernel: out_of_line_wait_on_bit+0x91/0xb0
May 29 16:55:22 papr-res-compute06 kernel: ? bit_waitqueue+0x40/0x40
May 29 16:55:22 papr-res-compute06 kernel: nfs_wait_on_request+0x4b/0x60 [nfs]
May 29 16:55:22 papr-res-compute06 kernel: nfs_lock_and_join_requests+0x12a/0x540 [nfs]
May 29 16:55:22 papr-res-compute06 kernel: ? radix_tree_lookup_slot+0x22/0x50
May 29 16:55:22 papr-res-compute06 kernel: nfs_updatepage+0x120/0x9f0 [nfs]
May 29 16:55:22 papr-res-compute06 kernel: ? nfs_flush_incompatible+0xc5/0x1c0 [nfs]
May 29 16:55:22 papr-res-compute06 kernel: nfs_write_end+0xe2/0x3c0 [nfs]
May 29 16:55:22 papr-res-compute06 kernel: generic_perform_write+0x10b/0x1c0
May 29 16:55:22 papr-res-compute06 kernel: ? _cond_resched+0x19/0x30
May 29 16:55:22 papr-res-compute06 kernel: ? _cond_resched+0x19/0x30
May 29 16:55:22 papr-res-compute06 kernel: nfs_file_write+0xd4/0x250 [nfs]
May 29 16:55:22 papr-res-compute06 kernel: do_iter_readv_writev+0x109/0x170
May 29 16:55:22 papr-res-compute06 kernel: do_iter_write+0x7f/0x190
May 29 16:55:22 papr-res-compute06 kernel: vfs_writev+0x84/0xf0
May 29 16:55:22 papr-res-compute06 kernel: ? handle_mm_fault+0x102/0x220
May 29 16:55:22 papr-res-compute06 kernel: ? _cond_resched+0x19/0x30
May 29 16:55:22 papr-res-compute06 kernel: do_writev+0x61/0xf0
May 29 16:55:22 papr-res-compute06 kernel: SyS_writev+0x10/0x20
May 29 16:55:22 papr-res-compute06 kernel: do_syscall_64+0x79/0x1b0
May 29 16:55:22 papr-res-compute06 kernel: entry_SYSCALL_64_after_hwframe+0x3d/0xa2
May 29 16:55:22 papr-res-compute06 kernel: RIP: 0033:0x7f0cc1563230
May 29 16:55:22 papr-res-compute06 kernel: RSP: 002b:00007fff6f75f2c0 EFLAGS: 00000293 ORIG_RAX: 0000000000000014
May 29 16:55:22 papr-res-compute06 kernel: RAX: ffffffffffffffda RBX: 0000000000002025 RCX: 00007f0cc1563230
May 29 16:55:22 papr-res-compute06 kernel: RDX: 0000000000000002 RSI: 00007fff6f75f300 RDI: 000000000000000a
May 29 16:55:22 papr-res-compute06 kernel: RBP: 000000001a7b0860 R08: 0000000000000000 R09: 00007f0cc15b8a00
May 29 16:55:22 papr-res-compute06 kernel: R10: cccccccccccccccd R11: 0000000000000293 R12: 000000000000000a
May 29 16:55:22 papr-res-compute06 kernel: R13: 00007fff6f75f300 R14: 0000000000000028 R15: 0000000000001ffd
We cannot reproduce this error reliably. There is very little commonality between instances of the drives dropping out - this is on an HPC cluster with around 40 servers and 100 users. It usually affects only a single server, but it has happened across hardware (Cisco UCSB-B200-M4 with 368GB and UCSME-142-M4 with 32GB). The one thing that is common is that the drives in question hold biological data, which can mean very large files (sometimes over half a TB).
So, we would like to monitor whether the NFS drives are up, because SLURM keeps assigning jobs to servers that have this problem, and those jobs will never complete.
My colleague has whipped up a couple of shell scripts that ping this and ls that, write to a file, and email when necessary. It's clever and would have been fun to write (awk! cut!), but to my mind it is absolutely the wrong way to do it.
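For what it's worth, a minimal sketch of what such a check boils down to (the mount list, timeout and alert address below are placeholders, not our actual setup): wrap the probe in timeout so the check itself cannot hang on a dead mount.

    #!/bin/bash
    # Hypothetical probe for hung NFS mounts: stat returns quickly on a
    # healthy mount but blocks indefinitely on a hung one, so give it a deadline.
    MOUNTS="/mnt/data1 /mnt/data2"        # placeholder mount points
    ADMIN="hpc-admin@example.org"         # placeholder alert address
    for m in $MOUNTS; do
        if ! timeout 10 stat -t "$m" >/dev/null 2>&1; then
            echo "$(date): NFS mount $m unresponsive on $(hostname)" \
                | mail -s "NFS mount check failed: $m" "$ADMIN"
        fi
    done

In principle the same probe could be hooked into SLURM's HealthCheckProgram so that affected nodes get drained rather than handed jobs that will never finish.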
A quick google shows that nfsiostat might be useful, but there is little in the way of good instruction, and what there is doesn't seem to map to our OS (in particular, "count" is necessary despite what the man page says). Most importantly, we don't want stats on usage - we want existence information. Literally "are you there, nfsdrive? It's me, root." I admit I need to read more on nfsiostat.
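For the record, the invocation that seems to behave here takes an interval, a count and then the mount point (the mount point below is a placeholder); without the count it does not do what the man page led us to expect:

    # two samples, five seconds apart, for a single (placeholder) mount point
    nfsiostat 5 2 /mnt/data1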
Another solution I've seen is to move the NFS mounts from fstab to autofs. The CentOS docs stop two versions back, but they do state:
This is not a problem with one or two mounts, but when the system is maintaining mounts to many systems at one time, overall system performance can be affected.
Given that we have around 10-12 mount points, does that count as "many"? My colleague suggested "many" is more in the realm of hundreds. Regardless, this information is now old; looking at the Red Hat docs, we see it is a copy/paste of the same text.
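If we did move to autofs, the configuration would presumably look something like the following (paths, server name, map key and timeout are placeholders, not our actual layout): a master map entry hands a directory over to autofs, and the referenced map file lists one NFS export per key, mounted on first access and unmounted after the idle timeout.

    # /etc/auto.master - placeholder entry, unmount after 300s idle
    /data    /etc/auto.data    --timeout=300

    # /etc/auto.data - placeholder map: key, mount options, NFS export
    genomics    -rw,hard    nfsserver.example.org:/export/genomics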
I've seen these two errata/bugs, but we are unable to upgrade immediately because the servers are in production in a clinical setting. Besides, there is no indication that this is explicitly our problem, and since we can't reproduce it at will, we can't test a fix on our dev machines - although I have been trying to reproduce it there. There are also more recent elrepo kernel-ml versions, but the same problem applies: we can't just update overnight, because these servers are often at 100% load 24/7.
So, I guess I have two questions: