On 13th May 2020, our Windows server downloaded and began to install two updates: KB4551853 and KB4556441. It restarted to finish the installation at 1:05am on Thursday morning. However, when people arrived at work later that morning (whether physically or virtually), the server was unresponsive.
First of all, I forced a reboot by cutting the power and starting the server again. Once it was back up, however, it showed the symptoms of a memory leak: it ran well for an hour or so, then became increasingly unresponsive until it had to be restarted again.
I tried uninstalling the updates, which was only possible for KB4551853. However, this had no effect on the problem. I looked into going back to a restore point, but the “System Protection” tab is missing. Perhaps we did not install the relevant role.
Having noticed an error message about a buffer overrun in svchost.exe, I ran the Windows memory test, but it found no issues. Neither did the BIOS RAM test. The exact message was:
svchost.exe – System Error.
The system detected an overrun of a stack-based buffer in this application. This overrun could potentially allow a malicious user to gain control of the application.
Task Manager shows the RAM slowly filling up, yet no single process is using much of it. The server has 16 GB of RAM, but memory usage is reported as 99% while the process using the most RAM has only 17.1 MB! The Windows Start Menu and Settings seem to be the first things affected by the memory leak; they are the first programs I want to use that stop working.
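To put a number on that gap, one option is to add up the per-process figures and compare the total with the memory in use. The following is a minimal sketch, assuming the per-process data comes from `tasklist /fo csv /nh` (whose last column is the "Mem Usage" figure, e.g. `17,512 K`). The function names and the sample rows are hypothetical and only illustrate the arithmetic.

```python
import csv
import io


def sum_tasklist_memory_kb(tasklist_csv: str) -> int:
    """Sum the 'Mem Usage' column (5th field) of `tasklist /fo csv /nh` output, in KB."""
    total_kb = 0
    for row in csv.reader(io.StringIO(tasklist_csv)):
        if len(row) < 5:
            continue  # skip blank or malformed lines
        # The field looks like "17,512 K"; keep only the digits so that the
        # thousands separator of any locale is ignored.
        digits = "".join(ch for ch in row[4] if ch.isdigit())
        if digits:
            total_kb += int(digits)
    return total_kb


def unaccounted_kb(total_kb: int, free_kb: int, process_kb: int) -> int:
    """Memory in use that is not explained by the per-process figures."""
    return (total_kb - free_kb) - process_kb


# Hypothetical sample rows, not real output from the affected server.
SAMPLE = '''"svchost.exe","1234","Services","0","17,512 K"
"explorer.exe","4321","Console","1","12,288 K"
'''

if __name__ == "__main__":
    per_process = sum_tasklist_memory_kb(SAMPLE)
    # Assumed figures: 16 GB installed, about 160 MB still free.
    print(per_process, unaccounted_kb(16 * 1024 * 1024, 160_000, per_process))
```

A large unaccounted figure suggests the memory is not sitting in any process's working set, which would be consistent with what Task Manager shows here.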
I did take the server apart and clean it and reseat the power connectors and RAM. It made no difference. Having researched saving the Active Directory settings, I converted the OS filesystem from FAT32 to NTFS using the Windows Server install disk, and then tried to do a “System state” backup. However, the system became unresponsive before it completed.
I tried to reinstall Windows Server 2019 over itself, keeping its settings. However, the reinstall failed with error 0x800706BA – 0x20003, in the SAFE_OS phase with an error during the INSTALL_UPDATES operation. Suggested fixes included making sure the Windows Update system is functional, so I ran the Windows Update Troubleshooter, which found no errors. The suggestions also included uninstalling non-Microsoft antivirus software, so I uninstalled our copy of ESET from the server.
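For anyone reading the error code, the two halves can be unpacked with a little bit arithmetic. This is a minimal sketch with hypothetical function names: the first value is treated as a standard HRESULT (severity bit, facility, code), and the second as an extend code whose high word is the phase and low word is the operation, matching the SAFE_OS / INSTALL_UPDATES reading above.

```python
def decode_hresult(hresult: int) -> dict:
    """Split a 32-bit HRESULT into its severity, facility and code fields."""
    return {
        "failure": bool((hresult >> 31) & 1),  # bit 31: 1 means failure
        "facility": (hresult >> 16) & 0x7FF,   # bits 16-26
        "code": hresult & 0xFFFF,              # low 16 bits
    }


def split_extend_code(extend: int) -> tuple:
    """Return (phase, operation) from a setup extend code."""
    return (extend >> 16) & 0xFFFF, extend & 0xFFFF


if __name__ == "__main__":
    print(decode_hresult(0x800706BA))  # {'failure': True, 'facility': 7, 'code': 1722}
    print(split_extend_code(0x20003))  # (2, 3)
```

Facility 7 is the Win32 facility, and Win32 error 1722 is RPC_S_SERVER_UNAVAILABLE ("The RPC server is unavailable"), so the setup appears to have failed because a service it needed to talk to did not respond.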
I have also run
DISM /Online /Cleanup-image /Restorehealth followed by
sfc /scannow, and
chkdsk c:, but they have all found no errors and made no difference.
First, I want to thank everyone who posted because all your posts were helpful.
Second, the method I used to solve the problem was suggested to me by Andrew Mills of datamills: use the diagnostic startup option in Windows Server. You can access it through Control Panel > Administrative Tools > System Configuration (msconfig) > General tab. It loads only basic devices and services during boot; you can then enable the other services one by one to see which one causes the problem. This is how I implemented the action listed at the bottom of my original question.
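The one-by-one procedure is a simple elimination loop. Here is a rough sketch of the logic only, not a tool that changes any services: `find_culprit` and `is_healthy` are hypothetical names, and in practice the health check means enabling the service in System Configuration, rebooting, and watching memory use for an hour or so.

```python
from typing import Callable, Iterable, Optional, Set


def find_culprit(
    services: Iterable[str],
    is_healthy: Callable[[Set[str]], bool],
) -> Optional[str]:
    """Enable services one at a time on top of a clean diagnostic startup.

    Returns the first service whose addition makes the system unhealthy,
    or None if every service can be enabled without trouble.
    """
    enabled: Set[str] = set()
    for name in services:
        enabled.add(name)
        # Stand-in for: reboot with this set enabled and observe the server.
        if not is_healthy(enabled):
            return name
    return None


if __name__ == "__main__":
    # Illustrative stand-in check: the system misbehaves once "C" is enabled.
    print(find_culprit(["A", "B", "C", "D"], lambda on: "C" not in on))  # C
```

With many services, enabling them in batches and halving the suspect batch each time would need fewer reboots, at the cost of a little more bookkeeping.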
Using this method, I discovered to my surprise that the culprit was not Dell’s SupportAssist, or anything to do with ESET or the Windows updates, but the installation of Tenable Nessus Essentials. I had installed it for our upcoming Cyber Essentials Plus audit and, being careful about such things, when I noticed that it came bundled with the unsupported WinPcap library, I replaced it with the supported Npcap library in compatibility mode. However, I suspect that Nessus doesn’t play nicely with Npcap, and that this is what caused the Nessus service to consume all the RAM. Whether I’m right or wrong about that, the server has functioned properly since I removed Nessus. I installed Nessus for the audit on a different computer and left it with the unsupported WinPcap library, and it didn’t crash that computer.