How to interrupt a syscall

1

I have a Go service that does heavy reads from an NFS / GPFS volume. At scale, I occasionally hit a situation where the underlying mount stops answering a particular syscall, and the kernel ends up taking the entire service down:

[98549.941930]       Tainted: G           O    4.14.13-1.el7.elrepo.x86_64 #1
[98549.942454] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[98549.943422] ls              D    0 14884      1 0x00000084
[98549.943968] Call Trace:
[98549.944498]  __schedule+0x28d/0x880
[98549.945033]  schedule+0x36/0x80
[98549.945552]  schedule_preempt_disabled+0xe/0x10
[98549.946095]  __mutex_lock.isra.5+0x269/0x500
[98549.946611]  __mutex_lock_slowpath+0x13/0x20
[98549.947153]  mutex_lock+0x2f/0x40
[98549.947695]  fuse_lock_inode+0x2a/0x30 [fuse]
[98549.948248]  fuse_readdir+0x113/0x7e0 [fuse]
[98549.948795]  iterate_dir+0x16e/0x190
[98549.949323]  ? __audit_syscall_entry+0xaf/0x100
[98549.949847]  SyS_getdents+0x98/0x120
[98549.950358]  ? iterate_dir+0x190/0x190
[98549.950898]  do_syscall_64+0x67/0x1b0
[98549.951410]  entry_SYSCALL64_slow_path+0x25/0x25
[98549.951948] RIP: 0033:0x7ffff749dcb5
[98549.952454] RSP: 002b:00007fffffffd160 EFLAGS: 00000246 ORIG_RAX: 000000000000004e
[98549.953423] RAX: ffffffffffffffda RBX: 00000000006260a0 RCX: 00007ffff749dcb5
[98549.953985] RDX: 0000000000008000 RSI: 00000000006260a0 RDI: 0000000000000005
[98549.954518] RBP: 00000000006260a0 R08: 0000000000000080 R09: 0000000000008030
[98549.955131] R10: 00007fffffffced0 R11: 0000000000000246 R12: fffffffffffffe90
[98549.955655] R13: 0000000000000000 R14: 0000000000626030 R15: 0000000000626000

I am looking for a way to add a timeout so that no single failed syscall can take the whole service down, but I cannot find a good way to do it in Go.

One common design I found is to run each syscall on a dedicated OS thread and kill that thread on timeout, but that does not seem possible in Go because of the lack of control over the underlying system threads. The service typically executes a large number of syscalls in parallel (possibly hundreds).

go
asked on Stack Overflow Jul 2, 2020 by Pierre • edited Jul 2, 2020 by Pierre

1 Answer

0

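You can get the PID of the process using the syscall package:

pid := syscall.Getpid()

and later kill the process by PID. Note that this terminates the whole process, not just the blocked call:

select {
case <-endSignal:
    // endSignal is a channel the worker signals on when it finishes in time.
    fmt.Println("The end!")
case <-time.After(5 * time.Second):
    // Timed out: look up our own process and kill it.
    if proc, err := os.FindProcess(pid); err == nil {
        proc.Kill()
    }
}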
answered on Stack Overflow Jul 2, 2020 by ganapathydselva • edited Jul 2, 2020 by ganapathydselva

User contributions licensed under CC BY-SA 3.0