"0x0000009E" Stop error when cluster node crashes in Windows Server 2012 R2

Question

Full Memory Dump: https://pastebin.com/spkLeVYL

Crash message is:

USER_MODE_HEALTH_MONITOR (9e)

One or more critical user mode components failed to satisfy a health check.
Hardware mechanisms such as watchdog timers can detect that basic kernel
services are not executing. However, resource starvation issues, including
memory leaks, lock contention, and scheduling priority misconfiguration,
may block critical user mode components without blocking DPCs or
draining the nonpaged pool.

Kernel components can extend watchdog timer functionality to user mode
by periodically monitoring critical applications. This bugcheck indicates
that a user mode health check failed in a manner such that graceful
shutdown is unlikely to succeed. It restores critical services by
rebooting and/or allowing application failover to other servers.

Arguments:

Arg1: ffffe00026e00780, Process that failed to satisfy a health check within the configured timeout

Arg2: 000000000000003c, Health monitoring timeout (seconds)

Arg3: 000000000000000a, WatchdogSourceClussvcIsAlive
    Cluster service sends heartbeat to netft every 500 milliseconds.
    By default netft expects at least 1 heartbeat per second.
    If this watchdog was triggered that means clussvc is not getting
    CPU to send heartbeats.
Arg4: 0000000000000000
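
For readability, the hexadecimal arguments above can be converted to decimal. The following is only a small sketch: the dictionary and the helper name `decode_hex` are made up for illustration, and the values are copied from the output above.

```python
# Raw argument values copied from the bugcheck output above.
args = {
    "Arg1": "ffffe00026e00780",  # address of the process that failed the health check
    "Arg2": "000000000000003c",  # health monitoring timeout (seconds)
    "Arg3": "000000000000000a",  # watchdog source (WatchdogSourceClussvcIsAlive)
    "Arg4": "0000000000000000",
}


def decode_hex(value: str) -> int:
    """Convert a zero-padded hexadecimal string from the dump into an integer."""
    return int(value, 16)


timeout_seconds = decode_hex(args["Arg2"])
watchdog_source = decode_hex(args["Arg3"])

print(f"Health monitoring timeout: {timeout_seconds} s")
print(f"Watchdog source code: {watchdog_source}")
```

So Arg2 (0x3c) is a 60-second timeout, and Arg3 (0x0a) is the value the debugger labels WatchdogSourceClussvcIsAlive.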

Something in user mode caused the Failover Clustering service to become unresponsive, so the problem calls for user-mode process and general hang debugging. Clustering has health detection between the user-mode service and the kernel-mode NetFT driver. If user mode goes unresponsive, clustering bugchecks the box in an effort to force a failover. A STOP 0x9e is therefore expected cluster behavior: a STOP 0x9e for netft.sys is an intentional bugcheck caused by the cluster service when a deadlock condition is identified.
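
To illustrate the idea of that health detection, here is a rough simulation of a heartbeat watchdog. This is not how NetFT is implemented; it is a sketch under assumptions. The function name `first_missed_deadline` and the timestamp-list input are hypothetical. The 500 ms interval and the 60-second timeout are taken from the bugcheck output (Arg2 = 0x3c).

```python
from typing import Iterable, Optional

HEARTBEAT_INTERVAL_S = 0.5  # from the dump text: one heartbeat every 500 ms
TIMEOUT_S = 60.0            # Arg2 in the dump: 0x3c seconds


def first_missed_deadline(
    heartbeats: Iterable[float],
    now: float,
    timeout: float = TIMEOUT_S,
) -> Optional[float]:
    """Return the time at which a simulated watchdog would fire, or None.

    heartbeats: ascending timestamps (in seconds) at which heartbeats arrived.
    now: the current time, used to check the gap after the last heartbeat.
    Monitoring is assumed to start at the first heartbeat, so an empty
    sequence never triggers the watchdog in this sketch.
    """
    last = None
    for t in heartbeats:
        # A gap longer than the timeout between two heartbeats trips the watchdog.
        if last is not None and t - last > timeout:
            return last + timeout
        last = t
    # Also check the silence between the last heartbeat and the current time.
    if last is not None and now - last > timeout:
        return last + timeout
    return None


# A service that keeps sending heartbeats every 500 ms stays healthy.
healthy = [i * HEARTBEAT_INTERVAL_S for i in range(10)]
print(first_missed_deadline(healthy, now=5.0))

# A service that stops after 1.0 s trips the watchdog 60 s later.
stalled = [0.0, 0.5, 1.0]
print(first_missed_deadline(stalled, now=70.0))
```

In the real cluster the action taken when the deadline is missed depends on the recovery setting; in this case it was a bugcheck.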

I found the following in an article, and I was wondering whether I should change the recovery action, HangRecoveryAction:

This property controls the action to take if the user-mode processes have stopped responding. HangRecoveryAction has 4 different settings, with 3 being the default.

0 = Disables the heartbeat and monitoring mechanism.
1 = Logs an event in the system log of the Event Viewer.
2 = Terminates the Cluster Service.
3 = Causes a Stop error (Bugcheck) on the cluster node. <-- default for 2008
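
The four settings quoted above can be written as a small lookup, which may help when reading a numeric value back from a cluster. This is only an illustrative sketch: the enum member names and the `describe` helper are hypothetical, and the descriptions are taken from the list above.

```python
from enum import IntEnum


class HangRecoveryAction(IntEnum):
    # Member names are invented for this sketch; the values come from the article.
    DISABLE_MONITORING = 0
    LOG_EVENT = 1
    TERMINATE_CLUSTER_SERVICE = 2
    BUGCHECK_NODE = 3


DESCRIPTIONS = {
    HangRecoveryAction.DISABLE_MONITORING: "Disables the heartbeat and monitoring mechanism.",
    HangRecoveryAction.LOG_EVENT: "Logs an event in the system log of the Event Viewer.",
    HangRecoveryAction.TERMINATE_CLUSTER_SERVICE: "Terminates the Cluster Service.",
    HangRecoveryAction.BUGCHECK_NODE: "Causes a Stop error (Bugcheck) on the cluster node.",
}


def describe(value: int) -> str:
    """Return 'N = description' for a setting; raises ValueError if N is not 0-3."""
    action = HangRecoveryAction(value)
    return f"{int(action)} = {DESCRIPTIONS[action]}"


for setting in HangRecoveryAction:
    print(describe(setting))
```

With the default of 3, a hung cluster service produces exactly the Stop error shown in the dump.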

The server is running Windows Server 2012 R2.

windows-server-2012-r2

hyper-v

failovercluster

asked on Server Fault Jan 9, 2020 by Kali • edited Jan 10, 2020 by LTPCGO

0 Answers

Nobody has answered this question yet.

User contributions licensed under CC BY-SA 3.0