How do I get Nagios to check whether the time is set correctly on a remote machine? (Where "correctly" is defined as "within a few seconds of the time on the Nagios server".)
The problem I keep running into is that for whatever reason, the time becomes wrong on one of my servers (because it rebooted while the ntp server was unreachable, or who knows what). Generally I only discover this days later when something has gone annoyingly wrong, like tickets are still on sale after doors.
I don't see an easy way to write such a Nagios plugin, though. Does such a thing already exist?
(Just checking whether an ntpd process is running is not enough. I want to do something like: string-compare the textual output of a pair of "date" commands.)
Update: Ok, I guess 'check_ntp -H ntp-server' is pretty close to what I want... though it won't detect the timezone going wonky.
The check_ntp plugin appears to come with Nagios, or at least it comes with the Nagios packages in Debian, and does exactly what you want.
Edited to add: Ah, I see, you're supposed to download the nagios-plugins distribution from nagiosplugins.org to get the standard plugins, and that's what Debian is packaging.
Uh, no, that checks whether an ntp server is speaking ntp. That's not even close to what I'm trying to do.
No, that's not all it does. The -w and -c flags will let you set how far from the local time you'll permit.
We use it for exactly this purpose.
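For what it's worth, the -w and -c thresholds boil down to comparing the absolute offset against two limits. A minimal Python sketch of that logic, not check_ntp's actual source; the function name and the default limits are made up for illustration:

```python
def classify_offset(offset_seconds, warn=0.5, crit=2.0):
    """Map a clock offset in seconds to a Nagios-style (exit_code, label) pair.

    The warn/crit defaults are illustrative assumptions, not plugin defaults.
    """
    magnitude = abs(offset_seconds)  # the remote clock can be fast or slow
    if magnitude >= crit:
        return 2, "CRITICAL"
    if magnitude >= warn:
        return 1, "WARNING"
    return 0, "OK"
```

A clock that is three seconds slow (offset -3) lands in the critical bucket just like one that is three seconds fast.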
Huh. I didn't know it did that. I'll have to put that on our KDC as well!
check_ntp polls an ntp server to see if it's working, not if the clock is accurate.
Also, check_ntp is deprecated. Use check_ntp_peer or check_ntp_time instead.
check_ntp_time works, because it checks offset, but it only checks offset if the host has successfully synchronized with the server. Probably not what jwz wants here.
Nagios plugins report their result through the exit code: 2, 1, or 0, which correspond to "Critical", "Warning" and "Ok" (there is also 3, for "Unknown"). You exit the check script with one of those exit codes and let nagios do the rest.
Plugins are super easy to write, and I've made tons of them at work like this.
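To make the exit-code convention concrete, here is a rough Python skeleton. It is only a sketch: run_check and example_check are hypothetical names, and treating any exception as "Unknown" (exit code 3) is my assumption about sensible behaviour, not something Nagios requires.

```python
import sys

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def run_check(check):
    """Call check(), which returns (code, message); return (code, status line)."""
    try:
        code, message = check()
    except Exception as exc:  # a broken check should not look like a pass
        code, message = UNKNOWN, str(exc)
    return code, "%s: %s" % (LABELS[code], message)


def example_check():
    # Hypothetical placeholder: a real check would measure something here.
    return OK, "clock within 2 seconds"


def main():
    code, line = run_check(example_check)
    print(line)   # Nagios shows the first line of output as the status text
    return code   # ...and uses the process exit code as the state
```

A real script would finish with sys.exit(main()) so that the return value becomes the exit code.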
For the plugin that you want to write I have two options for you.
1) Use the Unix daytime service to compare dates (probably disabled on most systems)
2) Use SSH (with a shared key) to go get the time on the external server.
One solution is roughly:
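#!/usr/bin/perl
use strict;
use warnings;
use Date::Parse;

# Ask the target host for its current time (needs key-based ssh).
my $mytime = `ssh targethost /bin/date`;
my $parsedtime = str2time($mytime);
if (!defined $parsedtime) {
    print "UNKNOWN: could not read the time from targethost\n";
    exit 3;
}

# permitted clock skew, in seconds (time() and str2time() both return seconds)
my $SLEW = 2000;
# abs() so that a remote clock running fast is caught as well as a slow one
my $delta = abs(time() - $parsedtime);
if ($delta > $SLEW) {
    print "CRITICAL: Time not within slew\n";
    exit 2;
} else {
    print "OK: Time within valid slew\n";
    exit 0;
}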
Oh, and if it's not already obvious, add this check script to checkcommands.cfg and call it as a service in your services.cfg file.
I thought it should be obvious that I meant "without requiring passwordless ssh from cron to work on the target machine." That's the whole point of nrpe, after all.
Use the solution below using daytime, or if you don't want to turn on the daytime port, move my script to the NRPE port. Enable NRPE Argument processing, and call the script with:
check_nrpe!name_of_my_script!$DATETIME$
Also change the script so that you're doing:
my $mytime=$ARGV[0];
What about NSCA for a scripting route? Or is that too heavyweight for something simple?
NagiosExchange has lots of useful plugins, including the one that you want:
http://www.nagiosexchange.org/cgi-bin/page.cgi?g=Detailed%2F1476.html;d=1
You'll either have to run a local time server so the nagios server can check the time (lame) or run the plugin locally via NRPE comparing to a single time server.
Does it have to be nagios? How about enabling the daytime service (13/tcp or 13/udp) via inetd?
$ telnet unclejesse 13
Sat May 24 23:55:41 2008
Sounds like exactly what you want.
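If you did want to script against that, a rough Python sketch could look like the following. Big assumptions: the server answers in exactly the format shown above (RFC 867 does not mandate one), the reply is in the same timezone as the machine doing the checking, and the function names are invented for illustration.

```python
import socket
from datetime import datetime

# Assumed reply format, matching "Sat May 24 23:55:41 2008".
DAYTIME_FORMAT = "%a %b %d %H:%M:%S %Y"


def parse_daytime(line):
    """Parse one daytime reply into a naive datetime (no timezone info)."""
    return datetime.strptime(line.strip(), DAYTIME_FORMAT)


def fetch_daytime(host, port=13, timeout=5.0):
    """Connect to the daytime port and return the raw reply text."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        return sock.recv(256).decode("ascii", "replace")


def skew_seconds(remote, local):
    """Positive when the remote clock is ahead of the local one."""
    return (remote - local).total_seconds()
```

Usage would be along the lines of skew_seconds(parse_daytime(fetch_daytime("unclejesse")), datetime.now()), with the result compared against whatever limit you care about.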
If the server is running SSL, enabling the daytime protocol can weaken the strength of SSL by revealing the system clock.
DNA's got a decent firewall, so this shouldn't be an issue if jwz wants to do this.
Oh whatever... the web server already sends back a "Date:" header as part of every HTTP/1.1 response.
$ wget -O /dev/null -S http://www.livejournal.com/ 2>&1 | grep '^ Date: '
Date: Sun, 25 May 2008 04:14:24 GMT
Someone industrious could even use that to make a Nagios plugin for checking for correct time on remote webservers.
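A rough sketch of what that might look like in Python, using only the standard library. The function names are invented; note that the Date header has one-second resolution and is always in GMT, so this would catch a wrong clock but not a wrong timezone setting.

```python
import urllib.request
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def header_skew(date_header, now):
    """Seconds by which the server's Date header is ahead of `now` (aware UTC)."""
    remote = parsedate_to_datetime(date_header)
    return (remote - now).total_seconds()


def fetch_date_header(url, timeout=10.0):
    """Issue a HEAD request and return the raw Date header, if any."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers["Date"]


def check_url(url):
    """Return the skew for a live server, measured against the local clock."""
    return header_skew(fetch_date_header(url), datetime.now(timezone.utc))
```

The network round trip adds some slop of its own, so thresholds tighter than a couple of seconds probably don't make sense with this approach.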
I guess I was thinking of the old 1996 Netscape SSL exploit.
Entropy in generated keys is much greater now.
Yes there's at least 32k possible hashes these days!!11 (couldn't resist, I'm still burning from that.)
Can't you run ntpd on your servers, do an ntpdate -q, and compare the results with your Nagios system? Your plugin should be simple enough after that.
Resident Windows IT jeenius has been unable to set the time correctly on the Windows domain controller for months now, resulting in, among other things, half the people showing up several minutes "late" for some meetings, and half of the people showing up several minutes "early" for some meetings, depending on the platform-of-choice of the meeting organizer. I actually had to write a script to dump the time on 20 different machines to prove to this guy that there even was a problem; he's perhaps the most evidence-proof person working with computers I've ever known. Anyway, after finally acknowledging that there might be an issue, Mr. Wizard said he couldn't fix it, citing http://support.microsoft.com/kb/875424 (you've got to be kidding me), http://support.microsoft.com/kb/940742, his busy busy schedule, and wah wah NTP is complicated wah wah as justification.
Solution to NTP problem: Got someone clueful with admin privs to fix the damn NTP config.
Solution to Windows problem: 50% of our developers use Macs. Percentage is growing.
Solution to Sysadmin problem: call Angus...
Holy crap is that sad. NTP is like the easiest thing ever!
If none of the above suit, what we used to do for this sort of outside-of-normal nagios thing was have the snmpd running on the target machine run command line processes (eg, df, or in your case, date) and extend nagios' standard snmp plugins to query that.
I wrote a simple perl script that checked the iso.3.6.1.2.1.25.1.2.0 snmp oid against the local clock and made sure it was within a particular limit. I coded it such that it could detect timezone mismatches too, which is useful for knowing if a server isn't following DST. It's a bit too long (and badly written) to post here, but shout and I'll upload it somewhere.
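For anyone curious about the decoding step, that OID (hrSystemDate) returns an SNMP DateAndTime value, which RFC 2579 defines as 8 octets, or 11 when the UTC offset is included. Below is a Python sketch of the decode, not the Perl script mentioned above; it assumes you already have the raw octets from whatever SNMP library or command you use, and the function name is invented.

```python
import struct
from datetime import datetime, timedelta, timezone


def decode_date_and_time(raw):
    """Decode an SNMP DateAndTime octet string (RFC 2579) into a datetime.

    The 11-octet form carries a UTC offset and yields an aware datetime;
    the 8-octet form has no offset and yields a naive one.
    """
    if len(raw) not in (8, 11):
        raise ValueError("DateAndTime must be 8 or 11 octets, got %d" % len(raw))
    # year (2 octets, big-endian), month, day, hour, minute, second, deci-seconds
    year, month, day, hour, minute, second, deci = struct.unpack(">HBBBBBB", raw[:8])
    tz = None
    if len(raw) == 11:
        sign = -1 if raw[8:9] == b"-" else 1
        tz = timezone(sign * timedelta(hours=raw[9], minutes=raw[10]))
    return datetime(year, month, day, hour, minute, second, deci * 100000, tzinfo=tz)
```

With the 11-octet form, comparing the decoded value's utcoffset() against the offset you expect is one way to spot a timezone or DST mismatch, separately from the clock itself being wrong.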
Use nagios to keep tabs on connectivity, restart (or SIGHUP) the ntpd every hour, and replace the clock backup battery on the motherboard.
This is not what you asked for, but because my experience has been that computer clocks drift significantly even over relatively short time spans, I recommend doing what I ended up doing, which is running openntpd on the clients. That it's a dæmon is mainly so it can hang around to correct the clock periodically; primarily it's an NTP client, so needs minimal configuration.
So does NTP start and stay running if it can't connect to the server? If you have NTP running correctly for a while (couple of days) syncing to a server, and then that server drops off, NTP will still keep *extremely* accurate time based on the known drift of your PC.
cat /var/lib/ntp.drift
If it's not starting or crashing that's a different story.
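To put a number on "known drift": the drift file holds a single frequency offset in parts per million, and converting that to seconds per day is simple arithmetic. A small Python sketch; the function names are invented, and the default path is just the one from the cat command above (it varies between distributions).

```python
def drift_seconds_per_day(ppm):
    """Seconds an uncorrected clock would gain or lose per day at `ppm`."""
    return ppm * 86400 / 1_000_000


def read_drift_ppm(path="/var/lib/ntp.drift"):
    """Read the frequency offset (ppm) that ntpd stored in its drift file."""
    with open(path) as handle:
        return float(handle.read().split()[0])
```

So a drift value of 10 ppm means the raw clock would wander a bit under a second per day if nothing corrected it; ntpd keeps applying that correction even while its servers are unreachable.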
If the clock's backup battery dies, drift will change a lot.
Typically what happens is that A) some upstream shithead has renamed or renumbered their NTP server and I can't reach it any more (this has happened to me like five times) or B) the server isn't reachable at boot time, so ntpdate can't run, so ntpd refuses to correct the eight-hours-off time that the machine booted with; or C) oh look they moved when DST begins and now I have to go through a fucking fire drill updating tzdata in a dozen places, and I missed some; or D) some god damned other thing I can't remember.
Yes, there are solutions to all of these problems. The problem I'm trying to solve is noticing that there is a problem.
I know this isn't the ultimate solution to your problem, looks like allbery has it above, but you might want to disable that initial ntpdate check. I know newer versions of Fedora don't do ntpdate first for the exact reason you're explaining.
NTPd should start and just sit there polling the servers every X seconds even if they're down. And with NTPd running it'll at least attempt to keep your clock as accurate as whatever your drift is. If/when those servers come back online (your network connection comes back, whatever) it'll just pick right up.
You probably know this but if the drift between you and the ntp server is too great ntpd won't set the system clock at all (why it does this is beyond me). You have to run ntpdate first and then run ntpd.
The redhat startup scripts are supposed to deal with this case, but they never seem to.
...some upstream shithead has renamed or renumbered their NTP server and I can't reach it any more (this has happened to me like five times)...
Try using servers like
0.us.pool.ntp.org
1.us.pool.ntp.org
2.us.pool.ntp.org
You will never have to worry about that problem again.
ntpd refuses to correct the eight-hours-off time that the machine booted with
Start ntpd with --panicgate AKA -g
...oh look they moved when DST begins...
Another freaking Peak Oil thing. :)
noticing that there is a problem.
The core of a plugin to do the checking would be:
ntpdate -d client | tail -1 | cut -d' ' -f10 | sed -e 's/\..*$//'
which spits out the number of seconds apart the clocks on the nagios server and the client being tested. Any takers on writing one?
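In case nobody bites, here is a rough Python sketch of the same measurement done directly with a bare SNTP query instead of parsing ntpdate output. Assumptions: the client being tested answers NTP queries on UDP port 123 (as the ntpdate command above also requires), and all function names are invented. The offset formula is the standard NTP one.

```python
import socket
import struct
import time

NTP_EPOCH_DELTA = 2208988800  # seconds between 1900-01-01 and 1970-01-01


def ntp_to_unix(seconds, fraction):
    """Convert a 64-bit NTP timestamp (two 32-bit words) to Unix time."""
    return seconds - NTP_EPOCH_DELTA + fraction / 2**32


def clock_offset(t0, t1, t2, t3):
    """Standard NTP offset: positive when the remote clock is ahead.

    t0 = local send time, t1 = remote receive, t2 = remote transmit,
    t3 = local receive time.
    """
    return ((t1 - t0) + (t2 - t3)) / 2.0


def query_offset(host, port=123, timeout=5.0):
    """Send one SNTP client packet and return the measured offset in seconds."""
    request = b"\x1b" + 47 * b"\0"  # LI=0, version 3, mode 3 (client)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        t0 = time.time()
        sock.sendto(request, (host, port))
        reply, _ = sock.recvfrom(512)
        t3 = time.time()
    words = struct.unpack("!12I", reply[:48])
    t1 = ntp_to_unix(words[8], words[9])    # receive timestamp
    t2 = ntp_to_unix(words[10], words[11])  # transmit timestamp
    return clock_offset(t0, t1, t2, t3)
```

Feeding the result of query_offset("client") into a warning/critical comparison and exiting with the matching code would make it a plugin. Like the ntpdate pipeline, this only sees the clock, not the timezone.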
You realize that pool.ntp.org is just random people across the internet, right? You have basically no guarantee that any of those times are correct.
It makes a lot more sense to set up a local NTP server that has a variety of upstream peers and then have all your local machines sync against it. Even better, have two local ones.
Unfortunately, on most unix-like systems (Linux included) the timezone info is cached after the first gettimeofday() or equiv. You'd have to have some process which gets respawned or restarts periodically - although I suppose having a non-builtin inetd service that spews `date` to stdout would suffice for that.
This has already been beaten to death, but there is also "check_time". If for some reason you don't/can't run ntp on your machines, just enable the time service and use check_time. We just went through this on our virtual machines, as you can't run ntp on them.