Versions Compared

Comment: Update reasons for high Fetcher usage.

...

  1. fetcher helper usage is permanently above 96% and the fetcher count is already high (i.e., more than 50, or even 100 or more), and
  2. the service "Check_MK" is constantly CRIT with fetcher timeouts.
    1. You can also run this command as the site user to narrow things down and find slow-running active checks:

      lq "GET services\nColumns: execution_time host_name display_name" | awk -F';' '{ printf("%.2f %s %s\n", $1, $2, $3)}' | sort -rn | head
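
If the top-ten list is not enough, the same output can be reduced to checks above a chosen runtime. The following is only a minimal sketch: `slow_checks` is a hypothetical helper name, and the 10-second default is an arbitrary example value, not a Checkmk recommendation. It assumes the semicolon-separated output produced by the query above.

```shell
# Hypothetical helper: keep only rows whose execution time (field 1, in
# seconds) exceeds a threshold, slowest first.
# Input on stdin: "execution_time;host_name;display_name" rows.
slow_checks() {
    # $1: threshold in seconds (assumed example default: 10)
    awk -F';' -v limit="${1:-10}" '
        $1 + 0 > limit + 0 { printf("%.2f %s %s\n", $1, $2, $3) }
    ' | sort -rn
}

# Possible usage as site user (requires a running site, so left as a comment):
#   lq "GET services\nColumns: execution_time host_name display_name" | slow_checks 10
```

Services that show up here again and again are candidates for a closer look before the number of fetchers is increased any further.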


This can have several reasons:

  1. Firewalls are dropping traffic from Checkmk to the monitored systems. If the packets are silently dropped rather than actively rejected, Checkmk has to wait for a timeout instead of terminating the fetching process immediately.
  2. You might have too many DOWN hosts that are still being checked. Checkmk still tries to query those hosts, and the fetchers have to wait for a timeout every time. This can tie up many fetcher helpers, which are blocked for that time. Remove hosts that have been in a DOWN state for some time (due to scrapping or similar) from your monitoring, either permanently or by setting their Criticality to "Do not monitor this host".
  3. For classical operating systems (Linux, Windows, etc.), high fetcher usage is a strong indicator that you have plugins or local checks (primarily on Windows) with a long runtime. Increasing the number of fetchers further is not constructive here. Instead, identify the long-running plugins/local checks and set them to asynchronous execution and/or define (generous) cache settings or even timeouts specifically for them.
  4. You might have poorly performing SNMP devices. To troubleshoot those, have a look at this blog post.
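
To find DOWN hosts that are candidates for removal, the host list can be filtered by how long each host has been DOWN. This is a sketch under assumptions: `long_down_hosts` is a hypothetical helper, the 7-day limit is an arbitrary example, and the input is assumed to come from a Livestatus query on the `hosts` table with the columns `name` and `last_state_change`, filtered to state 1 (DOWN).

```shell
# Hypothetical helper: print hosts that have been DOWN for more than a
# given number of days, together with the number of days.
# Input on stdin: "name;last_state_change" rows (Unix timestamp).
long_down_hosts() {
    # $1: minimum number of days (assumed example default: 7)
    # $2: reference time as Unix timestamp (defaults to the current time)
    awk -F';' -v min_days="${1:-7}" -v now="${2:-$(date +%s)}" '
        { days = (now - $2) / 86400 }
        days > min_days { printf("%s %.1f\n", $1, days) }
    '
}

# Possible usage as site user (requires a running site, so left as a comment):
#   lq "GET hosts\nColumns: name last_state_change\nFilter: state = 1" | long_down_hosts 7
```

The second argument exists only so the calculation can be reproduced with a fixed reference time. In normal use it can be omitted.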


...