When my NRPE host goes down, I get email notifications for dozens of services. All of those services have dependencies defined, so I expected to get just one email. What am I doing wrong?
For example, the "ntp" service on host "A4:cerebellum" depends on the "ping" service on the same host:
define service{
    register 0
    name generic-service
    ...
    notification_options w,c,f,r ; notify on warn,crit,flap,recover
    check_interval 5             ; check every N minutes
    retry_interval 0.5           ; check every N min when not "OK"
    max_check_attempts 10        ; notify after 5 min
                                 ; (later than host checks)
}
define service{
    register 0
    name parent-service
    max_check_attempts 8         ; notify earlier (faster than
                                 ; generic, later than host)
    use generic-service
}
...
define service{
    host_name A4:cerebellum
    service_description ping
    check_command check_nrpe_membrane!ping_cerebellum
    use parent-service
}
define service{
    host_name A4:cerebellum
    service_description ntp
    check_command check_nrpe_membrane!check_ntp
    use generic-service
}
define servicedependency {
    dependent_host_name A4:cerebellum
    dependent_service_description ntp
    host_name A4:cerebellum
    service_description ping
    execution_failure_criteria n
    notification_failure_criteria w,u,c
    inherits_parent 1
}
("ping" is a service rather than a host check because the Nagios host has to tunnel through another gateway to get there, rather than pinging directly.)
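For reference, the timing I expect from the two templates can be sketched as a small calculation. This is only a rough model, not how Nagios actually schedules checks. It assumes the default interval_length of 60 seconds, that the first failed check counts as attempt 1, and that each retry runs exactly one retry_interval later with no latency or timeout delay. The helper name minutes_to_hard_state is made up for illustration.

```python
def minutes_to_hard_state(max_check_attempts, retry_interval):
    """Approximate minutes from the first failed check to the hard state.

    Hypothetical helper, not part of Nagios. Assumes the first failure is
    attempt 1 and every later attempt follows one retry_interval after the
    previous one (no check latency, no timeouts, interval_length = 60s).
    """
    return (max_check_attempts - 1) * retry_interval


# Values taken from the templates in this post.
generic_minutes = minutes_to_hard_state(max_check_attempts=10, retry_interval=0.5)
parent_minutes = minutes_to_hard_state(max_check_attempts=8, retry_interval=0.5)

print("generic-service (ntp):", generic_minutes, "min")
print("parent-service (ping):", parent_minutes, "min")
print("expected head start for ping:", generic_minutes - parent_minutes, "min")
```

Under those assumptions, "ping" should reach a hard CRITICAL state about a minute before "ntp" does, which is what I intended with the lower max_check_attempts on parent-service. The notification times listed below show "ntp" alerting first, though.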
I got these email notifications:
- 8:18:25 ntp critical, service check timed out
- 8:22:27 ping critical, service check timed out
- 8:27:45 ping ok
- 8:29:40 ntp ok
...and the same for dozens of other services. Then they went critical again, but this time I got no email about it. The following entries come from the "View Availability Report" link rather than the "View Notifications" link:
- 10:20:31 ping critical, service check timed out
- 10:21:10 ntp critical, service check timed out
PS: Thanks, PG&E, for the massive power outage today taking down most of SOMA. You are so good at your job.