How to adjust Checkmk performance

The following article explains how to monitor and adjust the performance in Checkmk.

LAST TESTED ON CHECKMK 2.3.0P1

Overview

Checkmk polls all monitored hosts at the configured normal check interval (once a minute by default).
If a service then enters a non-OK state, Checkmk re-checks it at the retry interval (also once a minute by default).
If the endpoint is DOWN, Checkmk still tries to poll the agent regularly and has to wait for the timeout (10 seconds by default) before it aborts the attempt.
This ties up a fetcher process for that time and therefore reduces your monitoring performance.
The problem multiplies with the number of monitored endpoints.
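
To get a feel for how quickly unreachable hosts tie up fetchers, the arithmetic can be sketched in shell. This is only a rough estimate, assuming every DOWN host is polled once per check interval and each attempt blocks one fetcher for the full timeout. The function name `busy_fetchers` and the figure of 30 DOWN hosts are made up for illustration.

```shell
# Rough estimate of how many fetchers are occupied by DOWN hosts on average.
# Assumption: each DOWN host is polled once per interval and blocks one
# fetcher for the full timeout. busy_fetchers is a hypothetical helper name.
busy_fetchers() {
    down_hosts=$1   # number of hosts currently DOWN
    timeout=$2      # agent timeout in seconds
    interval=$3     # check interval in seconds
    # Round up: a partially occupied fetcher is still unavailable part of the time.
    echo $(( (down_hosts * timeout + interval - 1) / interval ))
}

# 30 DOWN hosts, 10 s timeout, 60 s check interval
busy_fetchers 30 10 60
```

With these example numbers the sketch prints 5, meaning roughly five fetchers would be waiting on timeouts at any given moment.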

Monitoring performance can be independent of hardware utilization: even if your Checkmk server is not using 100% of its CPU or memory, monitoring performance can still be poor.
The resources needed also depend on the number of services, active checks, and types of hosts. For example, if you have a lot of SNMP hosts, you will need more CPU performance for the protocol overhead than with the Checkmk agent.
Let's look at the possible causes and how to diagnose potential issues.

Configuration of Fetcher and Checker settings

Hands-On

Required services to monitor

To configure the right resources, we recommend checking the following graphs:

PDF report with graphs of
- CPU
- Memory
- OMD <SITENAME> Performance
- activate the "Core statistics" snap-in
- Check_MK
- Disk I/O Summary
The local structure
- find -L ~/local > local.txt (as site user)

Let's give you an example:

Screenshot of core statistics with Fetcher and Checker helper usages highlighted.

With the Core statistics snap-in, you can check the usage of the fetcher and checker helpers. At 70% usage, we recommend increasing the number of helpers in the Global Settings. CPU load and memory consumption will grow as you increase these values.

That's why we also recommend checking these graphs:

Screenshot of a service search that includes cpu, load and memory.

You will find more information about the fetcher and checker architecture here:

Important information about the Checkers: The checkers should not exceed your CPU core count!
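
The 70% mark mentioned above for the Core statistics snap-in can be turned into a small check. This is a minimal sketch, assuming you read the usage percentages from the snap-in yourself. The function name `check_helper_usage` and the example values are hypothetical.

```shell
# Flag a helper whose usage has reached the 70% mark mentioned above.
# check_helper_usage is a hypothetical helper name; usage is given in percent.
check_helper_usage() {
    name=$1
    usage=$2             # helper usage in percent, e.g. 83.5
    threshold=${3:-70}   # defaults to the 70% recommendation
    awk -v n="$name" -v u="$usage" -v t="$threshold" 'BEGIN {
        if (u + 0 >= t + 0) printf("%s: %.1f%% - consider increasing\n", n, u)
        else                printf("%s: %.1f%% - ok\n", n, u)
    }'
}

# Example values as they might be read from the snap-in
check_helper_usage fetcher 83.5
check_helper_usage checker 41
```

Keep in mind that raising the helper counts also raises CPU load and memory consumption, so the result is a prompt to look at those graphs, not a decision in itself.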

Adjust the helper settings

If you decide to adjust the helper settings, please be aware of these settings:

Setup → General → Global Settings → Monitoring Core →

Maximum concurrent active checks
- The usage should stay under 80% on average.

Maximum concurrent Checkmk fetchers
- Increasing the number of fetchers raises RAM usage, so adjust this setting carefully and keep an eye on your server's memory consumption.
- The usage should stay under 80% on average.
Maximum concurrent Checkmk checkers
- The number of checkers should not be higher than your CPU core count! If you have more than two cores, the general rule is: Maximum checkers = number of cores - 1.
- The usage should stay under 80% on average.
Maximum concurrent Livestatus connections
- In a distributed monitoring setup, having different values for the remote sites may be helpful. You will find the guidance on how to do that here!
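
The checker rule above can be expressed as a short shell function. This is a sketch only: `max_checkers` is a hypothetical helper name, and returning the plain core count for systems with one or two cores is an assumption based on the "not higher than your CPU core count" rule.

```shell
# Suggest an upper bound for "Maximum concurrent Checkmk checkers".
# Rule from above: with more than two cores, use cores - 1.
# Assumption: with one or two cores, fall back to the core count itself.
# max_checkers is a hypothetical helper name.
max_checkers() {
    cores=$1
    if [ "$cores" -gt 2 ]; then
        echo $(( cores - 1 ))
    else
        echo "$cores"
    fi
}

# Example: a server with 8 cores
max_checkers 8
```

For the 8-core example this prints 7. On the Checkmk server itself, `max_checkers "$(nproc)"` would use the actual core count, where `nproc` is the GNU coreutils command.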

Check the Livestatus Performance

If you face issues like this:

Screenshot of a livestatus error. Unhandled exception 400. Timeout while waiting for free Livestatus channel.

Please see this manual to check the Livestatus Performance

Required log files

Please see this manual to enable debug log of the helpers. The required settings are:

Core
Debugging of Checkmk helpers

High fetcher usage although the fetcher helper count is already high

Also, please check out our article on Troubleshooting high CPU usage of the Checkmk micro core (cmc)

If you face the following problems:

Fetcher helper usage is permanently above 96% although the fetcher count is already high (e.g., 50, 100, or more), and
the service "Check_MK" constantly runs into "CRIT" with fetcher timeouts.
- You can also use this command as site user to find slow-running active checks:
```
lq "GET services\nColumns: execution_time host_name display_name" | awk -F';' '{ printf("%.2f %s %s\n", $1, $2, $3)}' | sort -rn | head
```

This can have several reasons:

- Firewalls are dropping traffic from Checkmk to the monitored systems. If packets are silently dropped rather than actively rejected, Checkmk must wait for a timeout instead of terminating the fetching process immediately.
- You might have too many DOWN hosts that are still being checked. Checkmk still tries to query those hosts, and the fetchers need to wait for a timeout every time. This can tie up a lot of fetcher helpers, which are blocked for that time. Remove hosts that are in a DOWN state from your monitoring, either permanently or by setting their Criticality to "Do not monitor this host".
- For classical operating systems (Linux/Windows/etc.), this indicates that you might have plugins/local checks with a long runtime. Increasing the number of fetchers further does not help here. Instead, you must identify the long-running plugins/local checks and set them to asynchronous execution and/or define (generous) cache settings or even timeouts specifically for them.
- For SNMP devices, you might have poorly performing SNMP devices. To troubleshoot those, take a look at this blog post.
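
To narrow the search for long-running checks, the output of the lq query shown earlier can be filtered by a runtime limit. The following is a minimal sketch: `slow_services` is a hypothetical helper name, the 5-second limit is an arbitrary example, and the sample lines are made-up data in the `execution_time;host_name;display_name` format.

```shell
# Keep only services whose execution time exceeds a limit (in seconds).
# Input: lines in the "execution_time;host_name;display_name" format.
# slow_services is a hypothetical helper name.
slow_services() {
    limit=$1
    awk -F';' -v limit="$limit" \
        '$1 + 0 > limit + 0 { printf("%.2f %s %s\n", $1, $2, $3) }' | sort -rn
}

# Made-up sample data for illustration only
printf '%s\n' \
    '0.35;web01;Check_MK' \
    '12.80;db01;Check_MK' \
    '7.10;sw-core;Check_MK Discovery' | slow_services 5
```

On a real site, you would pipe the query in as site user instead of the sample data: `lq "GET services\nColumns: execution_time host_name display_name" | slow_services 5`.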
