...
If you face the following problems:
- Fetcher helper usage is permanently above 96%, and the fetcher count is already high (i.e., more than 50, or even 100 or more), and
- the service "Check_MK" is constantly "CRIT" with fetcher timeouts
As the site user, you can also use this command to narrow down and find slow-running active checks:
    lq "GET services\nColumns: execution_time host_name display_name" | awk -F';' '{ printf("%.2f %s %s\n", $1, $2, $3)}' | sort -rn | head
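To look only at checks above a certain runtime, the output of such a query can be passed through a small filter. The following is a minimal sketch, not part of Checkmk: the function name `slow_services` and the default threshold of 10 seconds are assumptions chosen for illustration. It expects the same semicolon-separated input (`execution_time;host_name;display_name`) that the query above produces.

```shell
# Hypothetical helper (not shipped with Checkmk): keep only rows whose
# execution time (first field, in seconds) is above a threshold.
# Expected input on stdin: execution_time;host_name;display_name
slow_services() {
    threshold="${1:-10}"   # assumed default: 10 seconds
    awk -F';' -v t="$threshold" \
        '$1 + 0 > t { printf("%.2f %s %s\n", $1, $2, $3) }' | sort -rn
}

# Usage as site user (needs a running site, so it is only shown as a comment):
# lq "GET services\nColumns: execution_time host_name display_name" | slow_services 10
```

Services that show up here with runtimes close to the configured timeouts are good candidates for a closer look.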
This can have several reasons:
- Firewalls are dropping traffic from Checkmk to the monitored systems. If the packets are silently dropped rather than rejected, Checkmk must wait for a timeout instead of terminating the fetching process immediately.
- You might have too many DOWN hosts that are still being checked. Checkmk still tries to query those hosts, and the fetchers have to wait for a timeout every time. This can tie up a lot of fetcher helpers, which are blocked for that time. Remove hosts that are in a DOWN state from your monitoring, either permanently or by setting their Criticality to "Do not monitor this host".
- For classical operating systems (Linux/Windows/etc.), this indicates that you might have plugins/local checks with a long runtime. Increasing the number of fetchers further does not help here. Instead, identify the long-running plugins/local checks and set them to asynchronous execution and/or define (generous) cache settings or even timeouts specifically for them.
- For SNMP devices, some of the devices might respond slowly. To troubleshoot those, have a look at this blog post.
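To check whether DOWN hosts are a likely cause, you can list them with a Livestatus query. The following is a minimal sketch: the wrapper name `list_down_hosts` is an assumption for illustration, and it relies on the `lq` command available to the site user. In Livestatus, a host state of 1 means DOWN.

```shell
# Hypothetical wrapper (name chosen for illustration): list the names of
# all hosts that are currently DOWN (Livestatus host state 1).
list_down_hosts() {
    lq "GET hosts\nColumns: name\nFilter: state = 1"
}

# Usage as site user (needs a running site, so it is only shown as a comment):
# list_down_hosts | wc -l    # number of DOWN hosts
# list_down_hosts            # their names, one per line
```

If this list is long, those hosts are worth reviewing before you add more fetchers.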
...