Activate Changes: Advanced Troubleshooting and Debugging

This article explains how to debug long-running or hanging Activate Changes operations.

LAST TESTED ON CHECKMK 2.4.0P1

Problem

Activating changes (in particular a CMC reload) takes an unusually long time, does not complete, or appears to hang.

This article describes advanced troubleshooting and debugging techniques for Activate Changes and is intended for cases where standard checks do not identify the cause.


Solution

Debug long-running activate changes on a single site

Long activation times are often caused by a large number of files that must be processed during activation.

Check the number of files in the relevant directories:

OMD[mysite]:~$ find ~/var/check_mk/web -follow -type f | wc -l
OMD[mysite]:~$ find ~/local -follow -type f | wc -l

A very large number of files in ~/var/check_mk/web can significantly increase activation time.

A large ~/local directory can also affect performance, as Checkmk creates a .tar archive from this directory during each activation.
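If one of the totals is high, it can help to see which subdirectory holds most of the files. The following is a minimal sketch, not part of Checkmk: the function name count_files_by_subdir is hypothetical, and it simply repeats the find command above for each top-level subdirectory of the directory it is given.

```shell
# Hypothetical helper (not shipped with Checkmk): count the files below each
# top-level subdirectory of the given directory and print the largest first.
count_files_by_subdir() {
    base="$1"
    for dir in "$base"/*/; do
        # Skip the unexpanded pattern when there are no subdirectories.
        [ -d "$dir" ] || continue
        # tr removes the padding that some wc implementations add.
        count=$(find "$dir" -follow -type f | wc -l | tr -d ' ')
        printf '%s\t%s\n' "$count" "${dir%/}"
    done | sort -rn
}
```

As the site user, this could be called as count_files_by_subdir ~/var/check_mk/web or count_files_by_subdir ~/local. Files that sit directly in the given directory are not counted by this sketch.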

 

Debug long-running activate changes in a distributed setup

In distributed environments, discrepancies between the central site and remote sites can cause long or incomplete activations.

Run the following command on the central site and all remote sites and compare the results:

OMD[mysite]:~$ find ~/local -follow -type f | wc -l
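When the counts differ, a file-by-file comparison shows which files are missing or extra on a site. The sketch below is an assumption about how such a comparison could be done, not an official Checkmk procedure. The function name list_local_files and the output file names used afterwards are hypothetical.

```shell
# Hypothetical helper: print a sorted list of all files below a directory,
# relative to that directory, so that the lists from two sites can be diffed.
list_local_files() {
    # LC_ALL=C gives the same sort order on every site regardless of locale.
    (cd "$1" && find . -follow -type f | LC_ALL=C sort)
}
```

For example, run list_local_files ~/local > local-files-central.txt on the central site and list_local_files ~/local > local-files-remote.txt on a remote site, copy both files to one host, and compare them with diff local-files-central.txt local-files-remote.txt.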

 

Additionally, check for non-ASCII filenames or file contents, which may lead to unexpected behavior:

OMD[mysite]:~/local$ find . -type f | xargs ls -ltr | grep --color='auto' -P -n "[^\x00-\x7F]"
OMD[mysite]:~/local$ find . -type f | grep --color='auto' -P -n "[^\x00-\x7F]"
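The find-based commands inspect file names only. To also check file contents, a recursive grep with the same pattern can be used. This is a minimal sketch that assumes GNU grep with PCRE support (-P); the function name find_non_ascii_content is hypothetical.

```shell
# Hypothetical helper: list the files below a directory whose contents include
# at least one byte outside the ASCII range (0x00-0x7F).
# Assumes GNU grep built with PCRE support (-P).
find_non_ascii_content() {
    # LC_ALL=C makes grep compare raw bytes instead of decoded characters.
    LC_ALL=C grep -rlP '[^\x00-\x7F]' "$1"
}
```

Called as find_non_ascii_content ~/local, it prints one path per matching file. Binary files usually match as well, so the output needs a manual review.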

 

 

Known Issue: Cortex XDR interference

Background

In some environments, Activate Changes may hang or fail after an upgrade due to interference from endpoint security software running on the host.

 

Problem

During an upgrade, an issue was observed where Activate Changes would not complete reliably.


 

The upgrade itself appeared to work as expected. All sites were stopped, the update process completed with only minor warnings about outdated rules, and the sites came back online without errors. After reconnecting distributed sites, Checkmk correctly showed pending changes indicating that the upgrade needed to be activated.

 

When attempting to activate changes, the system would enter a blocked state. The background job would show as STOPPED, and a process would remain running but effectively hung. Inspecting it with strace showed no meaningful activity in the child processes, while the parent process would eventually report a timeout.


In some cases, killing the stuck process allowed the activation to complete on the next attempt. However, the issue would return again later, making the behavior inconsistent but repeatable. The same situation could also be reproduced in a test environment.


Root Cause

After further testing, the issue was traced back to Cortex XDR running on the system.

Checkmk uses standard system tools during activation, including creating .tar archives from directories such as ~/local. Cortex XDR incorrectly identified this behavior as suspicious and interfered with the process.

As a result, the activation workflow would stall. From the outside, this looked like a hung process, stopped background job, and eventual timeout, but the actual cause was the security software blocking or interrupting normal execution.

 

Solution

To confirm the cause, Cortex XDR was temporarily disabled and removed, followed by a reboot. After this, Activate Changes worked normally without any delays or failures.

Once Cortex XDR was reinstalled, the issue immediately returned, which clearly confirmed the source of the problem.

 

Permanent Fix

The long-term fix is to adjust Cortex XDR so it does not interfere with Checkmk.

At a minimum, you should exclude the Apache process used by Checkmk. It is also recommended to ensure that the following are not inspected or blocked:

  • The tar command during execution

  • Processes running under the Checkmk site user

  • Checkmk site directories such as ~/local and ~/var

These exclusions prevent Cortex from interrupting normal activation behavior.


 

Advanced debugging

If the checks above do not identify the root cause, advanced debugging techniques are required.

This section covers profiling and strace-based debugging, which are useful when Activate Changes:

  • Takes a very long time

  • Appears to hang

  • Is blocked by long-running background processes (for example, CMC or Apache)

 

Profiling Activate Changes

Before continuing, disable parallel core configuration generation:

  1. Go to Setup → General → Global settings

  2. Edit the global settings

  3. Clear the checkbox for “Generate core config parallelized”

This step is required to produce a usable and consistent profile.


 

Create a profiling output

Run the following command on the central site and all remote sites:

OMD[mysite]:~$ cmk -O --profile --debug -vv &>activation_debug.log

The cmk -O command performs a configuration reload, which is part of the Activate Changes process.

The following files are generated:

  • show_profile.py

  • profile.out

  • activation_debug.log

To analyze the profile data manually, follow the GUI profiling documentation.

Otherwise, open a support case and include all three files.
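Before attaching the files, it can be convenient to confirm that all three exist and to pack them into one archive. The following is a sketch under assumptions: the function name bundle_profile_files and the default archive name are hypothetical, and the files are expected in the directory where the cmk command was run.

```shell
# Hypothetical helper: check that the three profiling files exist in a
# directory and pack them into a single gzip-compressed tar archive.
bundle_profile_files() {
    dir="${1:-.}"
    archive="${2:-activation_profile.tar.gz}"
    for f in show_profile.py profile.out activation_debug.log; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f" >&2
            return 1
        fi
    done
    # -C switches into the directory so the archive holds bare file names.
    tar -czf "$archive" -C "$dir" show_profile.py profile.out activation_debug.log
}
```

Run it once per site, for example bundle_profile_files . activation_profile_central.tar.gz, so that the files from the central site and the remote sites stay separate.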

Any Checkmk command can be profiled. The full syntax is described in the Profiling via CLI documentation.

 

Low-level debugging with strace

If profiling does not reveal the cause, strace can be used to trace filesystem access during Activate Changes.

 

This approach is particularly useful for scenarios such as:

  • Another activation process is currently in progress or locked

  • CMC reloads that do not complete

  • Apache-related delays during activation

 

To improve trace readability, the affected binary is started directly under strace. This method:

  • Follows child processes automatically

  • Creates separate trace files per process when -ff is used (as in the Apache example)

  • Produces clearer output than attaching to a running process

All commands must be executed as the site user.

 

Tracing the CMC process

omd stop cmc
strace \
    --output=cmc-strace.log \
    --string-limit=9999 \
    --absolute-timestamps=precision:us \
    --follow-forks \
    --trace=file \
    ~/bin/cmc

 

Tracing the Apache process

Tracing Apache can help identify web server–related delays during Activate Changes.

omd stop apache
strace \
    --output=apache-strace.log \
    --string-limit=9999 \
    --absolute-timestamps=precision:us \
    -ff \
    --trace=file \
    /usr/sbin/apache2 -f ~/etc/apache/apache.conf -DFOREGROUND

 

The resulting trace files can help identify:

  • Files or directories repeatedly accessed

  • Lock files or sockets blocking progress

  • Unexpected filesystem locations involved in activation
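Trace files can become large, so a short summary is a useful starting point. The following sketch assumes the usual strace line layout, in which the path is the first double-quoted string on each line and failed calls end in "= -1 E...". The function names top_traced_paths and failed_calls are hypothetical.

```shell
# Hypothetical helper: print the N most frequently accessed paths in a strace
# log (default 20), most frequent first.
top_traced_paths() {
    # Take the first double-quoted string on each line as the path argument.
    sed -n 's/^[^"]*"\([^"]*\)".*/\1/p' "$1" \
        | sort | uniq -c | sort -rn | head -n "${2:-20}"
}

# Hypothetical helper: print the system calls that returned an error.
failed_calls() {
    grep -E ' = -1 E[A-Z]+' "$1"
}
```

For example, top_traced_paths cmc-strace.log 10 lists the ten most frequently accessed paths, and failed_calls cmc-strace.log shows calls that failed, such as files that could not be found. With -ff, strace writes one file per process and appends the process ID to the output name, so run the helpers once per file.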

 
