How to adjust Checkmk performance

The following article explains how to monitor and adjust performance in Checkmk.

LAST TESTED ON CHECKMK 2.3.0P1

Overview

By default, Checkmk polls all monitored hosts at a one-minute interval. If a service enters a non-OK state, it is rechecked at the retry interval, which is also one minute by default.

When a host is unreachable, Checkmk attempts to contact it and waits until the configured agent timeout is reached. The default timeout is ten seconds. If the host does not respond within this period, the current connection attempt is aborted. However, monitoring does not stop. The host will be checked again at the next regular check interval.

During each connection attempt, a fetcher process remains occupied until the timeout expires or a response is received. If many hosts are unreachable at the same time, multiple fetcher processes may be tied up waiting for timeouts. This can reduce overall monitoring throughput, even though checks continue to run according to their configured intervals.
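
As a back-of-the-envelope sketch, you can estimate how many fetcher slots a given number of unreachable hosts will occupy on average. All numbers below are assumptions for illustration; substitute your site's actual settings:

```shell
# Rough estimate of fetcher slots tied up by unreachable hosts.
# All values are assumptions; substitute your site's configuration.
TIMEOUT=10      # agent connect timeout in seconds (Checkmk default)
INTERVAL=60     # normal check interval in seconds (Checkmk default)
DOWN_HOSTS=50   # hosts that currently time out on every attempt

# Each unreachable host blocks one fetcher for TIMEOUT seconds per
# INTERVAL, i.e. TIMEOUT/INTERVAL of one slot on average.
awk -v h="$DOWN_HOSTS" -v t="$TIMEOUT" -v i="$INTERVAL" \
    'BEGIN { printf "%.1f fetcher slots occupied on average\n", h * t / i }'
```

With 50 timing-out hosts and the default 10-second timeout, roughly 8.3 fetcher slots are permanently occupied; compare that number against your configured fetcher count.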

It is important to understand that monitoring performance issues are not always reflected by high CPU or memory usage. A system can appear lightly loaded while still experiencing degraded monitoring performance due to:

  • Blocked fetcher processes

  • High numbers of active checks

  • Large volumes of SNMP-based hosts

  • Inefficient custom checks or local extensions

Environments with many SNMP-based hosts typically require more CPU resources than agent-based monitoring due to protocol overhead and bulk data processing.

 

Configuration of Fetcher and Checker settings

Check Core Performance Metrics

First, verify whether the monitoring core is under pressure.

Review these metrics:

  • CPU utilization

  • Memory utilization

  • Disk IO

  • OMD site performance

  • Core statistics sidebar snap-in
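
If you prefer the command line, the basics can also be checked directly on the Checkmk server with standard Linux tools (these are generic OS commands, not Checkmk-specific):

```shell
# Quick resource check on the Checkmk server with standard Linux tools.
uptime     # load averages over 1, 5, and 15 minutes
free -h    # memory and swap usage, human-readable
df -h /    # disk usage; check the filesystem that holds /omd on your server
```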


Review Fetcher and Checker Load

Use this section to determine whether performance degradation is caused by blocked fetcher processes or by an overly aggressive check configuration.


What to review and where to find it:

Number of unreachable hosts

  • Where to check
    In the GUI, filter your host list for problem states (for example: DOWN, UNREACH) and note how many endpoints are currently unreachable.

  • Why it matters
    A high number of unreachable hosts increases the number of agent fetch attempts running in parallel.
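
As the site user, you can also count these hosts on the command line via Livestatus. The `lq` alias is available in every OMD site shell; the parameter expansion below is just one way to split the `<down>;<unreach>` output:

```shell
# Count DOWN (state 1) and UNREACH (state 2) hosts via Livestatus.
# "lq" exists only inside an OMD site shell; run this as the site user.
stats=$(lq "GET hosts\nStats: state = 1\nStats: state = 2" 2>/dev/null || true)
echo "DOWN: ${stats%%;*}  UNREACH: ${stats##*;}"
```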

 

Agent timeouts

  • Where to check
    In the GUI, review the host or service settings for the agent data source and locate the configured agent timeout value.

  • Why it matters
    Each fetch attempt ties up a fetcher process until the timeout is reached. If many endpoints are unreachable, multiple fetchers are blocked at the same time.

Retry interval behavior

  • Where to check
    In the GUI, review the host and service check settings for the configured retry interval and max check attempts.

  • Why it matters
    Retry checks increase the number of checks executed when many services are not OK, which can overload checkers even if fetchers are fine.
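
A quick sketch of the extra load retries can generate. The numbers are assumptions for illustration (a retry interval of 20 seconds against a normal one-minute interval):

```shell
# Extra check executions per minute caused by retries.
# Assumed values; substitute your own service counts and settings.
NON_OK=200          # services currently in a non-OK soft state
CHECK_INTERVAL=60   # normal check interval in seconds
RETRY_INTERVAL=20   # retry interval in seconds (assumed, shorter than normal)

# Extra executions per minute compared to the normal interval:
echo "$(( NON_OK * (60 / RETRY_INTERVAL - 60 / CHECK_INTERVAL) )) extra checks per minute"
```

Note that with the defaults (check interval and retry interval both one minute), retries add no extra executions; the load only grows when the retry interval is shorter than the check interval.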

Fetcher and checker process configuration

  • Where to check
    In the GUI, review the site’s monitoring core or performance-related settings where the counts for fetcher processes and checker processes are configured.

  • Why it matters
    If these values are too low for your environment, the site queues work and check execution slows down.

A large number of DOWN or unreachable hosts can block fetcher processes for up to the configured timeout per attempt, which reduces monitoring throughput even when CPU and memory usage look normal. 

  

Validate Services and Local Customization

Monitoring performance issues are often not caused by “too much monitoring” in general, but by inefficient checks or custom extensions. The goal is not to reduce visibility, but to ensure that checks are implemented efficiently and behave as expected.

Review the following areas carefully:

  • Active checks with short intervals
    Very short check intervals or a high number of active checks can increase scheduling pressure. Validate that intervals are appropriate for the use case.

  • SNMP-based monitoring
    SNMP requires more CPU overhead than the Checkmk agent due to protocol handling and data parsing. Large numbers of SNMP hosts or complex SNMP walks can significantly increase load. Ensure bulk walking and rule configurations are optimized.

  • Local checks and custom agent plugins
    Inefficient scripts, slow external commands, or poorly written plugins can delay agent responses and indirectly impact fetcher availability. Review execution time and external dependencies of custom scripts.

 

To inspect local customizations, run as the site user:

OMD[mysite]:~$ find -L ~/local > local.txt
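
Beyond listing the files, you can measure how long each local check and agent plugin actually takes on a monitored Linux host. The paths below are the defaults for the Linux agent and may differ in your setup:

```shell
# Time every executable local check and agent plugin.
# /usr/lib/check_mk_agent is the default Linux agent directory; adjust if needed.
AGENT_DIR=${AGENT_DIR:-/usr/lib/check_mk_agent}
for f in "$AGENT_DIR"/local/* "$AGENT_DIR"/plugins/*; do
    [ -x "$f" ] || continue
    start=$(date +%s%N)
    "$f" > /dev/null 2>&1
    end=$(date +%s%N)
    printf '%6d ms  %s\n' $(( (end - start) / 1000000 )) "$f"
done
```

Scripts that take more than a second or two are good candidates for asynchronous execution or caching.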

 

Here is an example:

Screenshot of core statistics with Fetcher and Checker helper usages highlighted.

With the Core statistics sidebar snap-in, you can check the usage of the fetcher and checker helpers.

At around 70% usage, we recommend increasing these values in the Global settings. Note that CPU load and memory consumption will grow as you increase these values.

That's why we also recommend checking these graphs:

Screenshot of a service search that includes cpu, load and memory.

You will find more information about the fetcher and checker architecture here:

 

Adjust the helper settings

In a distributed monitoring setup, it may be helpful to use different values for the remote sites. You will find guidance on how to do that here: Site-specific global settings

If you decide to adjust the helper settings, please be aware of these settings:

Setup → General → Global Settings → Monitoring Core →

Maximum concurrent active checks

  • The usage should stay under 80% on average.

 

Maximum concurrent Checkmk fetchers

  • Increasing the number of fetchers will raise your RAM usage, so make sure to adjust this setting carefully and keep an eye on the memory consumption of your server.

  • The usage should stay under 80% on average.

 

Maximum concurrent Checkmk checkers

  • The number of checkers should not be higher than your CPU core count!
    If you have more than two cores, the general rule is: maximum checkers = number of cores - 1.

  • The usage should stay under 80% on average.
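
The rule above can be computed directly on the server. This is a sketch using `nproc`, a standard Linux tool:

```shell
# Suggest a checker count from the CPU core count: cores - 1 when
# more than two cores are available, otherwise the core count itself.
cores=$(nproc)
checkers=$(( cores > 2 ? cores - 1 : cores ))
echo "Suggested maximum Checkmk checkers: $checkers"
```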

 

Maximum concurrent Livestatus connections

 

Check the Livestatus Performance

If you face issues like this:

Screenshot of a livestatus error. Unhandled exception 400. Timeout while waiting for free Livestatus channel.

Please see this manual to check the Livestatus performance.

 

Required log files

Please see this manual to enable the debug log of the helpers. The required settings are:

  • Core

  • Debugging of Checkmk helpers

 

High fetcher usage although the fetcher helper count is already high

If you face the following problems: 

  • Fetcher helper usage is permanently above 96%, although the fetcher count is already high (e.g., 50–100 or more), and

  • the service "Check_MK" runs into constant CRIT states with fetcher timeouts.

    • You can use this command as the site user to narrow down and find slow-running checks:

      lq "GET services\nColumns: execution_time host_name display_name" | awk -F';' '{ printf("%.2f %s %s\n", $1, $2, $3)}' | sort -rn | head


This can have several reasons:

  • Firewalls are dropping traffic from Checkmk to the monitored systems. If the packets are silently dropped rather than rejected, Checkmk must wait for a timeout instead of immediately terminating the fetch attempt.

  • You might have too many DOWN hosts that are still being checked. Checkmk still tries to query those hosts, and the fetchers must wait for a timeout every time, which can bind a lot of fetcher helpers for that period. Remove hosts that are in a DOWN state from your monitoring, either permanently or by setting their Criticality to "Do not monitor this host".

  • For classical operating systems (Linux/Windows/etc.), this indicates that you might have plugins/local checks with quite a long runtime. Increasing the number of fetchers further here is not constructive. Instead, you must identify the long-running plugins/local checks and set them to asynchronous execution and/or define (generous) cache settings or even timeouts, especially for them.

  • For SNMP devices, you might have poorly performing SNMP devices. To troubleshoot those, take a look at this blog post.
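
For the firewall case above, a quick TCP probe against the agent port can help distinguish silently dropped traffic from rejected traffic. HOST below is a placeholder address; 6556 is the default Checkmk agent port:

```shell
# Probe TCP port 6556 on HOST. An immediate "Connection refused"
# suggests REJECT (or no listener); hanging until the timeout
# suggests the packets are silently DROPped by a firewall.
HOST=192.0.2.10   # placeholder address; replace with a real host
if timeout 3 bash -c "cat < /dev/null > /dev/tcp/$HOST/6556" 2>/dev/null; then
    echo "port reachable"
else
    echo "no connection: refused immediately (reject) or timed out (drop)"
fi
```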

 

Related articles