Nagios

Dear Lazyweb,

When my NRPE host goes down, I get email notifications for dozens of services. But all of my services have dependencies, so I should be getting one email. What am I doing wrong?

For example: the "ntp" service on host "cerebellum" depends on the "ping" service on host "cerebellum":

define service{
        register              0
        name                  generic-service
        ...
        notification_options  w,c,f,r   ; notify on warn,crit,flap,recover
        check_interval        5         ; check every N minutes
        retry_interval        0.5       ; check every N min when not "OK"
        max_check_attempts    10        ; notify after 5 min
                                        ; (later than host checks)
}

define service{
        register              0
        name                  parent-service
        max_check_attempts    8         ; notify earlier (faster than
                                        ; generic, later than host)
        use                   generic-service
}

...

define service{
        host_name             A4:cerebellum
        service_description   ping
        check_command         check_nrpe_membrane!ping_cerebellum
        use                   parent-service
}

define service{
        host_name             A4:cerebellum
        service_description   ntp
        check_command         check_nrpe_membrane!check_ntp
        use                   generic-service
}

define servicedependency {
        dependent_host_name             A4:cerebellum
        dependent_service_description   ntp
        host_name                       A4:cerebellum
        service_description             ping
        execution_failure_criteria      n
        notification_failure_criteria   w,u,c
        inherits_parent                 1
}

("ping" is a service rather than a host check because the Nagios host has to tunnel through another gateway to get there, rather than pinging directly.)

I got these email notifications:

  • 8:18:25 ntp critical, service check timed out
  • 8:22:27 ping critical, service check timed out
  • 8:27:45 ping ok
  • 8:29:40 ntp ok

...and repeat for dozens of other services. Then they went critical again, but this time I didn't get email about it. These entries are from the "View Availability Report" link rather than the "View Notifications" link:

  • 10:20:31 ping critical, service check timed out
  • 10:21:10 ntp critical, service check timed out

PS: Thanks, PG&E, for the massive power outage today taking down most of SOMA. You are so good at your job.


21 Responses:

  1. koala says:

    I know you don't like accurate answers... but I'm nearly sure a host check can do everything a service check does, so the ping check can be the host check even though you have to do special stuff.

    And the service check/host check parent relationship works OOB, IIRC, so then your problem would be solved.

    Otherwise, your servicedependency looks like the ones I have [which I think work correctly, but you're making me doubt things], except for execution_failure_criteria, which I don't set explicitly...

    • jwz says:

      I'll look into the host check thing, but even if that can be re-arranged, I think that what I am seeing here is still, "service dependencies don't work."

    • jwz says:

      So, maaaaaybe changing my ping services to ping host-checks will help, because it sounds like, when a service goes bad, Nagios then does an out-of-sequence host check on it. So this might mean that hosts will signal, then the services will be silent.
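
      (Something like this, I guess; untested, and just carrying the timing numbers over from the service templates above:)

      define host{
              host_name             A4:cerebellum
              ; run the tunneled ping as the host check instead of as a service:
              check_command         check_nrpe_membrane!ping_cerebellum
              check_interval        5
              retry_interval        0.5
              max_check_attempts    8
              ...
      }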

      • jwz says:

        I think this just made the problem be differently bad. Instead of getting a slew of service notifications based on when they happened to enter the failed state rather than on the service dependency hierarchy, now I get a slew of host notifications based on when they happened to enter the failed state rather than on the host dependency hierarchy.

        • koala says:

          Damn, sorry. Something's wonky there, because host dependencies for sure work for me (by setting parent host relationships, not hostdependencies).
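
          (i.e. roughly this, with "A4:membrane" being my guess at what the gateway host is called in your config:)

          define host{
                  host_name    A4:cerebellum
                  parents      A4:membrane   ; guess: whatever host the NRPE checks tunnel through
                  ...
          }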

        • Phil! Gold says:

          In my experience, the only time Nagios uses dependency information to do out-of-schedule checks is when running a host check on the containing host when a service starts to fail.  So if, say, Nagios is configured to know that host B is behind host A and host B fails its check, Nagios won't set B to unreachable status until its previously-scheduled check on host A executes and fails.

          I deal with this by setting my notification thresholds such that there's always enough time to check every single host in my environment between a given host's first failure and its subsequent notification-triggering failure.  (Also, your host definitions need to not have "u" in their "notification_options"; the default setting includes "u", so unless you remove it they'll send notifications on unreachable status anyway.)

          It looks like you're accounting for this with your service definitions, so you might just need to tune the host check intervals a bit in a similar way.
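
          (For reference, the notification_options change I mean is just this on the host template; adjust the name to whatever your host template is actually called:)

          define host{
                  name                  generic-host
                  register              0
                  ...
                  notification_options  d,r     ; down + recovery only, no "u" (unreachable)
          }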

          • jwz says:

            So how do I tell what the ratio between parent and child host timings needs to be? Is it based on the total number of hosts or what?

            • Phil! Gold says:

              It should just be that the time in a soft failure state for a child should be greater than the time between regular checks of the parent.

              The soft failure time for a host is given by multiplying its retry_interval setting by one less than its max_check_attempts setting.  The time between checks for a host is its check_interval setting.  Neither of those depends on the total number of hosts you're monitoring; Nagios tries to hit its scheduled intervals independently from the number of simultaneous checks it needs to do to get there.

              Oh!  So maybe your current configuration would work better if you set max_check_attempts to 12 for your services.  That should give an interval of 5.5 minutes between the first service failure and the point at which Nagios would send a notification.  This is theoretical, though; my comment below about using "execution_failure_criteria  w,u,c" is something that actually works for me in production.
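
              (In config terms that would just be this change to the template from your post, with the arithmetic spelled out; a sketch only, I haven't tried it against your setup:)

              ; child's soft-failure window = retry_interval * (max_check_attempts - 1)
              ;                             = 0.5 min * (12 - 1) = 5.5 min
              ; ...which is now greater than the parent ping's check_interval of 5 min.
              define service{
                      name                  generic-service
                      ...
                      max_check_attempts    12        ; was 10
              }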

              • jwz says:

                I think that what you said can be simplified to:

                • parent-service and child-service have the same check_interval and retry_interval;
                • child-service max_check_attempts = parent-service max_check_attempts + 1

                Does that sound right?

  2. jwm says:

    Where Nagios is concerned, always assume that the checks are not in lockstep with one another, and look to the worst case, e.g.:

    00:00:01 Ping checks ok
    00:00:02 Host outage occurs
    00:00:03 ntp check soft critical 1
    00:00:33 ntp check soft critical 2
    ...
    00:04:33 ntp checks hard critical, notifies
    00:05:01 Ping checks soft critical 1

    So for the ping check, I'd shift its check_interval to 4 minutes minimum, but probably 1 minute if it's feasible.
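
    (Concretely, something along these lines on the template the ping check uses; untested, numbers per the above:)

    define service{
            name                  parent-service
            use                   generic-service
            check_interval        1         ; re-check the "parent" ping every minute
            max_check_attempts    8
    }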

    Footnote 1:

    Your mileage may vary, but we have historically set special settings on our ntp check because the damn ntpd daemon takes ages to stabilize after start-up. Eventually gave up on it and are switching to chrony across the fleet.

    Footnote 2:

    There are people who might ask why you're still using NRPE and haven't upgraded to e.g. Icinga 2. Speaking as someone who has failed to do that for the last three years, pissing off my colleagues, because I don't have time to boil that particular ocean, you can tell them to piss off from me, too.

    • jwz says:

      Well, I tried to do that with the different max_check_attempts on "generic-service" versus "parent-service". So are you saying that service dependencies are basically meaningless, as all that matters is which one happens to trigger first?

      • jwm says:

        Dependencies are meaningful, but AFAIK, only as far as: “if parent check is hard critical, don't send notifications for child checks”. So I'd take greater pains to make sure that the ping test will always reach hard critical before the ntp test, first, then gaze into the abyss of the dependency docs.

        The hard critical part is the key. I believe that in even the best case where the ping check goes soft critical 1 second after the first ntp check goes soft critical, the ping check will reach hard critical one minute before the ntp check. That leaves 4 minutes worth of cases where ntp reaches hard critical before ping does.

        (I believe there is a tacit off-by-one in max_check_attempts, as soft critical 1 happens at zero seconds, and soft critical 10 is really hard critical 1.)

    • NB says:

      I did upgrade to Icinga2 and still sometimes get trailing "IPv6 ping is broken! Oh no wait it works again" notifications after a host recovery that was triggered by IPv4 ping.

  3. Chris says:

    One other thing you can play with is to configure your NRPE checks with "-u = Make socket timeouts return an UNKNOWN state instead of CRITICAL".
    Then you can respond to the UNKNOWN state differently, such as not sending email/pages for it.
    https://chriscarey.com/blog/2012/06/09/how-to-prevent-multple-check_nrpe-socket-timeout-after-10-seconds-alerts/

    • Chris says:

      With this approach, in your case, you already have your generic-service set to not alert on UNKNOWN. So that's one step down.

      You would need to:

      - Copy your command check_nrpe_membrane to a new command check_nrpe_membrane_unknown (or some other name to signify UNKNOWN).
      - Modify your check_nrpe_membrane_unknown command, adding the -u option to the command line.
      - Change the check_command on all *dependent* services to use the new command.

      Done.

      Now you will not get extra alarms on the dependent services but you will on the services that are depended on.
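
      (A hypothetical sketch of those two commands; I don't know what your check_nrpe_membrane actually looks like, so this assumes it's more or less a stock check_nrpe wrapper:)

      ; assuming the existing command is something like this...
      define command{
              command_name    check_nrpe_membrane
              command_line    $USER1$/check_nrpe -H membrane -c $ARG1$
      }

      ; ...the UNKNOWN-on-timeout variant just adds -u:
      define command{
              command_name    check_nrpe_membrane_unknown
              command_line    $USER1$/check_nrpe -u -H membrane -c $ARG1$
      }

      Then the dependent services (like ntp) switch their check_command to check_nrpe_membrane_unknown!check_ntp, while ping stays on the original command.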

  4. MattyJ says:

    At the risk of getting flamed (I'm old and don't give AF any more), I'll suppress my musings on my hatred of Nagios, but if you're forced (or forcing yourself) to use Nagios in 2022, consider using check_mk. Nagios core but with much-needed enhancements. Open source version is highly capable.

    In a recent past life I ran into a similar thing and there were at least two ways to solve this with check_mk that I remember, the easiest being 'bulk notifications' where you can hold off on notifications for, say, a minute then get one alert bundled together with all the 'problems' on a host in one go.

    It's also easy to set up alerts with 'when not already CRIT' provisions so only your first alert will come, and others will be suppressed if the host is already in a CRIT state. There are alert provisions for 'when state changes' or specific state changes you can choose to ignore or alert on (CRIT->OK, OK->WARN, etc.)

    What I'm saying is that check_mk has many more ways to alert than just '-w' and '-c', or using service dependencies that apparently don't work anyway.

    It's worth being chastised by jwz if I can save just one admin from murdering their colleagues because they have to admin a vanilla Nagios installation.

  5.

    I have nothing to contribute to this one, other than "it is nice to see a blog comments section filled with thoughtful comments by people who know what they are talking about and are trying to be responsive to the post."

  6. Phil! Gold says:

    I use service dependencies extensively and they work for me as you expect yours to work.  The main difference I see between my configuration and yours is that I have

    define servicedependency {
        ...
        execution_failure_criteria     w,u,c
    }

    whereas you're using (the documentation-recommended) "execution_failure_criteria n".

    According to the documentation, your configuration should work, but perhaps the documentation is wrong.  Mine definitely works for me, so hopefully it'll help you, too.

    I'll echo the recommendation to use host checks instead of the ping service, though.  Host checks are intended to cover the "entire host is unavailable" case.  I'll reply to your other comment with my thoughts on your issues with them.
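
    (Adapted to your ntp/ping pair, the dependency would then look roughly like this; the only change from your version is the execution_failure_criteria line:)

    define servicedependency {
            dependent_host_name             A4:cerebellum
            dependent_service_description   ntp
            host_name                       A4:cerebellum
            service_description             ping
            execution_failure_criteria      w,u,c   ; also skip checking ntp while ping is failing
            notification_failure_criteria   w,u,c
            inherits_parent                 1
    }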
