How to adjust Checkmk performance
The following article explains how to monitor and adjust performance in Checkmk.
LAST TESTED ON CHECKMK 2.3.0p1
Overview
By default, Checkmk polls all monitored hosts at a one-minute interval. If a service enters a non-OK state, it is rechecked using the retry interval, which also defaults to one minute.
When a host is unreachable, Checkmk attempts to contact it and waits until the configured agent timeout is reached. The default timeout is ten seconds. If the host does not respond within this period, the current connection attempt is aborted. However, monitoring does not stop. The host will be checked again at the next regular check interval.
During each connection attempt, a fetcher process remains occupied until the timeout expires or a response is received. If many hosts are unreachable at the same time, multiple fetcher processes may be tied up waiting for timeouts. This can reduce overall monitoring throughput, even though checks continue to run according to their configured intervals.
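As a rough illustration of this effect, you can estimate how much fetcher capacity unreachable hosts consume. All numbers below are assumed examples, not values read from your site; substitute your own host count and the fetcher count from your Global settings:

```shell
# Back-of-the-envelope estimate of fetcher capacity lost to unreachable hosts.
# All values are assumed examples; substitute your own.
unreachable_hosts=40   # hosts that currently do not respond (example)
agent_timeout=10       # seconds until a fetch attempt is aborted (default: 10)
fetchers=13            # configured fetcher processes (example)

# Each unreachable host blocks one fetcher for the full timeout per attempt.
lost=$((unreachable_hosts * agent_timeout))
# With a one-minute check interval, this is the fetcher-second budget per cycle.
budget=$((fetchers * 60))

echo "fetcher-seconds lost per cycle: $lost"
echo "fetcher-seconds available per cycle: $budget"
```

If the lost fetcher-seconds approach the available budget, the fetchers spend most of their time waiting for timeouts and regular checks start to lag.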
It is important to understand that monitoring performance issues are not always reflected by high CPU or memory usage. A system can appear lightly loaded while still experiencing degraded monitoring performance due to:
Blocked fetcher processes
High numbers of active checks
Large volumes of SNMP based hosts
Inefficient custom checks or local extensions
Environments with many SNMP based hosts typically require more CPU resources compared to agent based monitoring due to protocol overhead and bulk data processing.
Configuration of Fetcher and Checker settings
Check Core Performance Metrics
First, verify whether the monitoring core is under pressure.
Review these metrics:
CPU utilization
Memory utilization
Disk IO
OMD site performance
Core statistics sidebar snap-in
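If you prefer the command line over the sidebar snap-in, the helper usage can also be read via Livestatus. The column names below are taken from the CMC status table; verify that they exist on your version with `lq "GET status"`. The snippet parses a sample reply so it is self-contained:

```shell
# Read fetcher/checker helper usage from Livestatus (run as site user):
#   lq "GET status\nColumns: helper_usage_fetcher helper_usage_checker"
# The reply is a semicolon-separated line of fractions, e.g. "0.42;0.15".
# A sample reply is parsed here so the snippet runs standalone:
reply="0.42;0.15"
fetcher_pct=$(echo "$reply" | awk -F';' '{ printf("%d", $1 * 100) }')
checker_pct=$(echo "$reply" | awk -F';' '{ printf("%d", $2 * 100) }')
echo "fetcher helpers: ${fetcher_pct}% busy, checker helpers: ${checker_pct}% busy"
```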
Review Fetcher and Checker Load
Use this section to determine whether performance degradation is caused by blocked fetcher processes or by an overly aggressive check configuration.
What to review and where to find it:
Number of unreachable hosts
Where to check
In the GUI, filter your host list for problem states (for example: DOWN, UNREACH) and note how many endpoints are currently unreachable.
Why it matters
A high number of unreachable hosts increases the number of agent fetch attempts running in parallel.
Agent timeouts
Where to check
In the GUI, review the host or service settings for the agent data source and locate the configured agent timeout value.
Why it matters
Each fetch attempt ties up a fetcher process until the timeout is reached. If many endpoints are unreachable, multiple fetchers are blocked at the same time.
Retry interval behavior
Where to check
In the GUI, review the host and service check settings for the configured retry interval and max check attempts.
Why it matters
Retry checks increase the number of checks executed when many services are not OK, which can overload checkers even if fetchers are fine.
Fetcher and checker process configuration
Where to check
In the GUI, review the site’s monitoring core or performance-related settings where the counts for fetcher processes and checker processes are configured.
Why it matters
If these values are too low for your environment, the site queues work and check execution slows down.
A large number of DOWN or unreachable hosts can block fetcher processes for up to the configured timeout per attempt, which reduces monitoring throughput even when CPU and memory usage look normal.
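The count of problem hosts can also be taken from the command line. This is a sketch using Livestatus host states (0 = UP, 1 = DOWN, 2 = UNREACHABLE); the query is run as site user, and a sample reply is parsed here so the snippet is self-contained:

```shell
# Count DOWN and UNREACHABLE hosts via Livestatus (run as site user):
#   lq "GET hosts\nStats: state = 1\nStats: state = 2"
# The reply has the form "DOWN;UNREACH", e.g. "7;3":
reply="7;3"
down=${reply%;*}
unreach=${reply#*;}
echo "$((down + unreach)) hosts are currently not responding"
```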
Validate Services and Local Customization
Monitoring performance issues are often not caused by “too much monitoring” in general, but by inefficient checks or custom extensions. The goal is not to reduce visibility, but to ensure that checks are implemented efficiently and behave as expected.
Review the following areas carefully:
Active checks with short intervals
Very short check intervals or a high number of active checks can increase scheduling pressure. Validate that intervals are appropriate for the use case.
SNMP based monitoring
SNMP requires more CPU overhead than the Checkmk agent due to protocol handling and data parsing. Large numbers of SNMP hosts or complex SNMP walks can significantly increase load. Ensure bulk walking and rule configurations are optimized.
Local checks and custom agent plugins
Inefficient scripts, slow external commands, or poorly written plugins can delay agent responses and indirectly impact fetcher availability. Review execution time and external dependencies of custom scripts.
To inspect local customizations, run as the site user:
OMD[mysite]:~$ find -L ~/local > local.txt
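Slow local checks can also be spotted directly on the monitored host by timing each script. This is a sketch: the path in the comment is the default local-check directory of the Linux agent, and the demo run below uses a throwaway dummy check in a temporary directory:

```shell
# Time every executable local check in a directory to spot slow ones.
# On Linux agents the default local-check directory is
# /usr/lib/check_mk_agent/local; pass your own path if it differs.
time_local_checks() {
    dir="$1"
    for script in "$dir"/*; do
        [ -x "$script" ] || continue
        start=$(date +%s%N)
        "$script" > /dev/null 2>&1
        end=$(date +%s%N)
        echo "$(( (end - start) / 1000000 )) ms  $script"
    done | sort -rn
}

# Demo against a temporary directory with one dummy local check:
demo=$(mktemp -d)
printf '#!/bin/sh\nsleep 0.1\necho "0 demo_check - OK"\n' > "$demo/demo_check"
chmod +x "$demo/demo_check"
time_local_checks "$demo"
```

Scripts that show up at the top with runtimes of several seconds are candidates for asynchronous execution or caching.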
Here is an example:
With the Core statistics sidebar snap-in, you can check the load of the fetcher and checker helpers.
At a sustained usage of around 70%, we recommend increasing these values in the Global settings. Keep in mind that CPU load and memory consumption will grow as you increase these values.
That's why we also recommend checking these graphs:
You will find more information about the fetcher and checker architecture here:
Adjust the helper settings
In a distributed monitoring setup, having different values for the remote sites may be helpful. You will find guidance on how to do that here: Site-specific global settings
If you decide to adjust the helper settings, please be aware of these settings:
Setup → General → Global Settings → Monitoring Core →
Maximum concurrent active checks
The usage should stay under 80% on average.
Maximum concurrent Checkmk fetchers
Increasing the number of fetchers raises RAM usage, so adjust this setting carefully and keep an eye on the memory consumption of your server.
The usage should stay under 80% on average.
Maximum concurrent Checkmk checkers
The number of checkers should not be higher than your CPU core count!
If you have more than two cores, the general rule is: maximum checkers = number of cores - 1.
The usage should stay under 80% on average.
Maximum concurrent Livestatus connections
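The cores-minus-one rule for checkers can be computed directly on the server; a small sketch:

```shell
# Suggest a checker count from the CPU core count (rule of thumb from above:
# with more than two cores, use number of cores - 1).
cores=$(nproc)
if [ "$cores" -gt 2 ]; then
    checkers=$((cores - 1))
else
    checkers=$cores
fi
echo "CPU cores: $cores -> suggested 'Maximum concurrent Checkmk checkers': $checkers"
```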
Check the Livestatus Performance
If you face issues like this:
Please see this manual to check the Livestatus Performance
Required log files
Please see this manual to enable the debug log of the helpers. The required settings are:
Core
Debugging of Checkmk helpers
High fetcher usage although the fetcher count is already high
Also, please check out our article on Troubleshooting high CPU usage of the Checkmk micro core (cmc)
If you face the following problems:
Fetcher helper usage is permanently above 96%, and the fetcher count is already high (e.g., 50 to 100 or more), and
the service "Check_MK" runs into constant CRIT states with fetcher timeouts.
You can also use the following command as site user to narrow down slow-running active checks:
lq "GET services\nColumns: execution_time host_name display_name" | awk -F';' '{ printf("%.2f %s %s\n", $1, $2, $3)}' | sort -rn | head
This can have several reasons:
Firewalls are dropping traffic from Checkmk to the monitored systems. If packets are silently dropped rather than rejected, Checkmk must wait for the timeout instead of terminating the fetch attempt immediately.
You might have too many DOWN hosts that are still being checked. Checkmk keeps trying to query those hosts, and a fetcher has to wait for the timeout every time, which can tie up many fetcher helpers. Remove hosts that stay DOWN from your monitoring, either permanently or by setting their Criticality to "Do not monitor this host".
For classical operating systems (Linux, Windows, etc.), this indicates that you might have agent plugins or local checks with quite a long runtime. Increasing the number of fetchers further is not constructive here. Instead, identify the long-running plugins and local checks and switch them to asynchronous execution and/or define (generous) cache settings or even timeouts for them.
For SNMP devices, the devices themselves may simply respond poorly. To troubleshoot those, take a look at this blog post.
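To distinguish the first case (silently dropped traffic) from actively rejected traffic, you can probe the agent port and observe whether the connection fails immediately or hangs until a timeout. This is a sketch: `HOST` is a placeholder (a TEST-NET address here), and 6556 is the default Checkmk agent port:

```shell
# Probe the Checkmk agent port (default 6556) with a short timeout.
# A rejected connection fails at once; dropped packets hang until the timeout.
HOST=192.0.2.10   # placeholder (unroutable TEST-NET address); use a real host
if timeout 3 bash -c "exec 3<>/dev/tcp/$HOST/6556" 2>/dev/null; then
    echo "agent port answered"
else
    echo "no answer within 3s (rejected, filtered, or host down)"
fi
```

If many hosts fall into the "no answer" case, check the firewall rules on the path before increasing the fetcher count.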