We are currently utilizing Nagios to monitor our production servers at my workplace. Our Nagios instance is configured on a Linux server where we monitor both Linux and Windows machines.
I have been experiencing the following Nagios issue regarding NTP Time for quite some time now on several of our Windows Servers:
This is the command that is giving me trouble:
check_windows_time!us.pool.ntp.org!3000!6000
It appears that $ARG1$ is 'us.pool.ntp.org'. What does 'Lookup Failure for host $ARG1$' mean? Are these servers having trouble resolving the NTP host (us.pool.ntp.org)? If so, why would some servers fail to resolve this host while others do not? I am using the same command on many other servers without issue.
Just to note, all other monitoring statements are working fine on the servers experiencing this issue (Disk Space, CPU Usage, RAM usage, etc). It just seems to be the NTP command that is giving me trouble.
I have NTP configured the same way on many other servers that do not show this problem, so I am at a loss as to what could be causing it.
Has anyone experienced a similar error before?
Please let me know if you require any additional information and I will be happy to clarify.
Thank you!
EDIT 1: If it helps at all, I can nslookup 'us.pool.ntp.org' from the affected servers. So the servers that are having issues are able to resolve that DNS name.
EDIT 2: NSC.ini 'check_windows_time' configuration:
check_windows_time=check_windows_time.bat $ARG1$ $ARG2$ $ARG3$
check_windows_time.bat:
@echo off
SETLOCAL
rem ***************************************************
rem Check_Windows_Time.bat
rem
rem Author: Michael van den Berg
rem Copyright 2012 - PCS-IT Services B.V. (www.pcs-it.nl)
rem
rem This Nagios plugin will check the time offset
rem against a specified time server.
rem ***************************************************
if [%1]==[] (goto usage) else (set time_server=%1)
if [%1]==[/?] (goto usage) else (set time_server=%1)
if [%2]==[] (set warn_offset=nul) else (set warn_offset=%2)
if [%2]==[$ARG2$] set warn_offset=nul
if [%3]==[] (set crit_offset=nul) else (set crit_offset=%3)
if [%3]==[$ARG3$] set crit_offset=nul
for /f "tokens=*" %%t in ('w32tm /stripchart /computer:%time_server% /samples:1 /dataonly') do set output=%%t
if not "x%output:0x80072af9=%"=="x%output%" goto host_error
if not "x%output:0x800705B4=%"=="x%output%" goto comm_error
if not "x%output:error=%"=="x%output%" goto unknown_error
if not "x%output:)=%"=="x%output%" goto unknown_error
set time_org=%output:*, =%
set time=%time_org:~1,-9%
if %warn_offset% == nul (set warn_perf=0) else (set warn_perf=%warn_offset%)
if %crit_offset% == nul (set crit_perf=0) else (set crit_perf=%crit_offset%)
set perf_data='Offset'=%time%s;%warn_perf%;%crit_perf%;0
if %time% geq %crit_offset% goto threshold_crit
if %time% geq %warn_offset% goto threshold_warn
if %time% lss %warn_offset% goto okay
goto unknown_error
:usage
echo %0 - Nagios plugin that checks time offset against a specified ntp server.
echo.
echo Usage: %0 ^<timeserver^> ^<warning threshold in seconds^> ^<critical threshold in seconds^>
echo Examples: %0 us.pool.ntp.org 120 300
echo %0 my-domain-controller.local 120 300
exit /b 3
:host_error
echo UNKNOWN: Lookup failure for host %time_server%
exit /b 3
:comm_error
echo UNKNOWN: Unable to query NTP service at %time_server% (Port 123 blocked/closed)
exit /b 3
:threshold_crit
echo CRITICAL: Time is %time_org% from %time_server%^|%perf_data%
exit /b 2
:threshold_warn
echo WARNING: Time is %time_org% from %time_server%^|%perf_data%
exit /b 1
:okay
echo OK: Time is %time_org% from %time_server%^|%perf_data%
exit /b 0
:unknown_error
echo UNKNOWN: Unable to check time (command error)
exit /b 3
EDIT 3: The error message that I am receiving looks to be a result of the following condition being met:
if not "x%output:0x80072af9=%"=="x%output%" goto host_error
Does anyone have any ideas what this means or how I can resolve this?
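For anyone tracing the batch logic, here is a rough Python sketch of what that condition and its neighbours test. It is not part of the plugin: `classify`, `whole_seconds` and the sample lines are hypothetical names and inputs chosen for illustration. The two hex values are HRESULTs: 0x80072AF9 wraps Winsock error 11001 (WSAHOST_NOT_FOUND, "no such host is known") and 0x800705B4 wraps Win32 error 1460 (ERROR_TIMEOUT). Also note that the script echoes %time_server% in its message, so if the text literally reads `$ARG1$`, the script was handed the unexpanded placeholder rather than us.pool.ntp.org.

```python
# Illustrative re-statement of the plugin's substring checks (not the plugin itself).
HOST_NOT_FOUND = "0x80072af9"  # HRESULT for Winsock 11001, WSAHOST_NOT_FOUND
TIMED_OUT = "0x800705b4"       # HRESULT for Win32 1460, ERROR_TIMEOUT


def classify(last_line):
    """Return the batch label the script would jump to for w32tm's last output line."""
    text = last_line.lower()  # batch %var:old=new% matching is case-insensitive
    if HOST_NOT_FOUND in text:
        return "host_error"      # "Lookup failure for host ..."
    if TIMED_OUT in text:
        return "comm_error"      # "Unable to query NTP service ..."
    if "error" in text or ")" in text:
        return "unknown_error"
    return "parse_offset"        # falls through to the threshold comparison


def whole_seconds(last_line):
    """Mimic 'set time_org=%output:*, =%' followed by 'set time=%time_org:~1,-9%'."""
    time_org = last_line.split(", ", 1)[1]  # drop everything up to the first ", "
    return int(time_org[1:-9])              # strip the sign and the fractional part
```

Read this way, the script compares only the whole-second part of the offset, with the sign dropped, against the warning and critical thresholds.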
I was finally able to get these NTP errors to disappear.
First, since we have Windows Firewall enabled, I allowed outbound connections on port 123, which is needed to query NTP time. I found this problem by running my 'check_windows_time.bat' file from the command line, which returned an error.
Shout-out to user 'Sorcha' from the comments above for suggesting I perform this testing.
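If you want a second opinion that is independent of w32tm, the following is a minimal sketch of an SNTP probe over UDP port 123. It assumes Python is available on the affected server, which was not part of my setup; `build_sntp_request`, `parse_transmit_time` and `probe` are hypothetical helper names. A `socket.gaierror` would point at name resolution, while a timeout would be consistent with the port being blocked.

```python
import socket
import struct

NTP_EPOCH_OFFSET = 2208988800  # seconds from 1900-01-01 (NTP epoch) to 1970-01-01


def build_sntp_request():
    # First byte 0x1B = leap indicator 0, version 3, mode 3 (client); the rest is zero.
    return b"\x1b" + 47 * b"\x00"


def parse_transmit_time(packet):
    # Transmit timestamp sits in bytes 40-47: 32-bit seconds plus 32-bit fraction.
    seconds, fraction = struct.unpack("!II", packet[40:48])
    return seconds - NTP_EPOCH_OFFSET + fraction / 2 ** 32


def probe(server, timeout=5.0):
    """Send one SNTP request and return the server's time as a Unix timestamp."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_sntp_request(), (server, 123))
        packet, _ = sock.recvfrom(512)
    return parse_transmit_time(packet)

# Example call (needs network access):
#   probe("us.pool.ntp.org")
```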
I then compared the troubled NSC.ini instance to a version that I knew was working properly. There were a few differences between the working .ini file and the servers that were experiencing issues. I modified the troubled .ini files to match the working file and restarted the NSClient++ service.
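A comparison like this can be scripted. The sketch below is only an illustration: `diff_ini` is a hypothetical helper, and the sample settings in the usage note are made up rather than the actual differences I found. It ignores blank lines and `;` comments so that only effective settings show up in the diff.

```python
import difflib


def diff_ini(working_text, troubled_text):
    """Return unified-diff lines between two NSC.ini contents, ignoring comments."""
    def effective_lines(text):
        stripped = (line.strip() for line in text.splitlines())
        return [line for line in stripped if line and not line.startswith(";")]

    return list(difflib.unified_diff(
        effective_lines(working_text),
        effective_lines(troubled_text),
        "working", "troubled",
        n=0, lineterm="",
    ))
```

For example, feeding it a working file containing `allow_arguments=1` and a troubled file containing `allow_arguments=0` (hypothetical values) yields lines `-allow_arguments=1` and `+allow_arguments=0`.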
I also restarted Nagios. After some time my errors cleared!
Thank you for your help.