Activate Changes: Advanced Troubleshooting and Debugging
This article explains how to debug long-running or hanging Activate Changes operations.
LAST TESTED ON CHECKMK 2.4.0P1
Problem
Activating changes (in particular a CMC reload) takes an unusually long time, does not complete, or appears to hang.
This article describes advanced troubleshooting and debugging techniques for Activate Changes and is intended for cases where standard checks do not identify the cause.
Solution
Debug long-running activate changes on a single site
Long activation times are often caused by a large number of files that must be processed during activation.
Check the number of files in the relevant directories:
OMD[mysite]:~$ find ~/var/check_mk/web -follow -type f | wc -l
OMD[mysite]:~$ find ~/local -follow -type f | wc -l
A very large number of files in ~/var/check_mk/web can significantly increase activation time.
A large ~/local directory can also affect performance, as Checkmk creates a .tar archive from this directory during each activation.
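If the totals above are high, a per-directory breakdown can show where the files accumulate. The following is a minimal sketch, not a Checkmk tool; the function name count_files_per_subdir is hypothetical and only uses find, wc, and sort.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Checkmk): count regular files below each
# first-level subdirectory of the given directory, largest count first.
count_files_per_subdir() {
    local base="$1" dir count
    for dir in "$base"/*/; do
        # Skip the unexpanded glob when the directory has no subdirectories.
        [ -d "$dir" ] || continue
        count=$(find "$dir" -follow -type f 2>/dev/null | wc -l)
        printf '%d\t%s\n' "$count" "${dir%/}"
    done | sort -rn
}

# Example usage as the site user:
#   count_files_per_subdir ~/var/check_mk/web | head -n 10
#   count_files_per_subdir ~/local
```

Files located directly in the given directory are not counted by this sketch; only its subdirectories are listed.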
Debug long-running activate changes in a distributed setup
In distributed environments, discrepancies between the central site and remote sites can cause long or incomplete activations.
Run the following command on the central site and all remote sites and compare the results:
OMD[mysite]:~$ find ~/local -follow -type f | wc -l
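Matching file counts do not guarantee matching contents. As a rough sketch, a sorted checksum listing can be created on each site and compared with diff. The helper name make_manifest and the /tmp file names are assumptions for illustration; cksum is used only because it is available on most systems.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Checkmk): print one line per file below the
# given directory in the form "checksum size ./relative/path", sorted by path.
make_manifest() {
    (
        cd "$1" || exit 1
        # LC_ALL=C keeps the sort order identical across sites with different locales.
        find . -follow -type f -exec cksum {} + | LC_ALL=C sort -k 3
    )
}

# Example usage as the site user, once per site:
#   make_manifest ~/local > /tmp/local-manifest-$(hostname).txt
# After copying the manifests to one host:
#   diff /tmp/local-manifest-central.txt /tmp/local-manifest-remote.txt
```

Lines that appear only on one side of the diff point to files that are missing or differ between the sites.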
Additionally, check for non-ASCII filenames or file contents, which may lead to unexpected behavior:
OMD[mysite]:~/local$ find . -type f | xargs ls -ltr | grep --color='auto' -P -n "[^\x00-\x7F]"
OMD[mysite]:~/local$ find . -type f | grep --color='auto' -P -n "[^\x00-\x7F]"
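The commands above inspect file names only. To look at file contents as well, a sketch along the following lines may help. It assumes GNU grep with -P support, as the commands above already do; the function name list_non_ascii_files is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Checkmk): list files below the given
# directory whose contents include at least one byte outside the ASCII range.
list_non_ascii_files() {
    # LC_ALL=C makes grep compare raw bytes instead of decoded characters.
    LC_ALL=C grep -rlP '[^\x00-\x7F]' "$1"
}

# Example usage as the site user:
#   list_non_ascii_files ~/local
```

Binary files such as compiled modules or images are listed as well, so the output usually needs a manual review.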
Known Issue: Cortex XDR interference
Background
In some environments, Activate Changes may hang or fail after an upgrade due to interference from endpoint security software running on the host.
Problem
During an upgrade, an issue was observed where Activate Changes would not complete reliably.
The upgrade itself appeared to work as expected. All sites were stopped, the update process completed with only minor warnings about outdated rules, and the sites came back online without errors. After reconnecting distributed sites, Checkmk correctly showed pending changes indicating that the upgrade needed to be activated.
When attempting to activate changes, the system would enter a blocked state. The background job would show as STOPPED, and a process would remain running but effectively hung. Looking at it with strace showed no meaningful activity in the child processes, while the parent process would eventually report a timeout.
In some cases, killing the stuck process allowed the activation to complete on the next attempt. However, the issue would return later, making the behavior intermittent but reproducible. The same situation could also be reproduced in a test environment.
Root Cause
After further testing, the issue was traced back to Cortex XDR running on the system.
Checkmk uses standard system tools during activation, including creating .tar archives from directories such as ~/local. Cortex XDR incorrectly identified this behavior as suspicious and interfered with the process.
As a result, the activation workflow would stall. From the outside, this looked like a hung process, a stopped background job, and an eventual timeout, but the actual cause was the security software blocking or interrupting normal execution.
Solution
To confirm the cause, Cortex XDR was temporarily disabled and removed, followed by a reboot. After this, Activate Changes worked normally without any delays or failures.
Once Cortex XDR was reinstalled, the issue immediately returned, which clearly confirmed the source of the problem.
Permanent Fix
The long-term fix is to adjust Cortex XDR so it does not interfere with Checkmk.
At a minimum, you should exclude the Apache process used by Checkmk. It is also recommended to ensure that the following are not inspected or blocked:
The tar command during execution
Processes running under the Checkmk site user
Checkmk site directories such as ~/local and ~/var
These exclusions prevent Cortex from interrupting normal activation behavior.
Advanced debugging
If the checks above do not identify the root cause, advanced debugging techniques are required.
This section covers profiling and strace-based debugging, which are useful when Activate Changes:
Takes a very long time
Appears to hang
Is blocked by long-running background processes (for example, CMC or Apache)
Profiling Activate Changes
Before continuing, disable parallel core configuration generation:
Go to Setup → General → Global settings
Edit the global settings
Clear the checkbox for “Generate core config parallelized”
This step is required to produce a usable and consistent profile.
Create a profiling output
Run the following command on the central site and all remote sites:
OMD[mysite]:~$ cmk -O --profile --debug -vv &>activation_debug.log
The cmk -O command performs a configuration reload, which is part of the Activate Changes process.
The following files are generated:
show_profile.py
profile.out
activation_debug.log
To analyze the profile data manually, follow the GUI profiling documentation.
Otherwise, open a support case and include all three files.
Any Checkmk command can be profiled. The full syntax is described in the Profiling via CLI documentation.
Low-level debugging with strace
If profiling does not reveal the cause, strace can be used to trace filesystem access during Activate Changes.
This approach is particularly useful for scenarios such as:
Another activation process is currently in progress or locked
CMC reloads that do not complete
Apache-related delays during activation
To improve trace readability, the affected binary is started directly under strace. This method:
Follows child processes automatically
Creates separate trace files per process
Produces clearer output than attaching to a running process
All commands must be executed as the site user.
Tracing the CMC process
omd stop cmc
strace \
--output=cmc-strace.log \
--string-limit=9999 \
--absolute-timestamps=precision:us \
--follow-forks \
--trace=file \
~/bin/cmc
Tracing the Apache process
Tracing Apache can help identify web server–related delays during Activate Changes.
omd stop apache
strace \
--output=apache-strace.log \
--string-limit=9999 \
--absolute-timestamps=precision:us \
-ff \
--trace=file \
/usr/sbin/apache2 -f ~/etc/apache/apache.conf -DFOREGROUND
The resulting trace files can help identify:
Files or directories repeatedly accessed
Lock files or sockets blocking progress
Unexpected filesystem locations involved in activation
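To get a first overview of a large trace, the paths can be counted by how often they appear. The following is a minimal sketch that assumes the usual strace line layout, in which the path is the first double-quoted argument of a call; the function name top_traced_paths is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Checkmk or strace): print the 20 most
# frequent first quoted arguments (usually file paths) found in strace logs.
top_traced_paths() {
    # Keep only lines that contain a quoted string and extract the first one.
    sed -n 's/^[^"]*"\([^"]*\)".*/\1/p' "$@" \
        | sort | uniq -c | sort -rn | head -n 20
}

# Example usage as the site user:
#   top_traced_paths cmc-strace.log
#   top_traced_paths apache-strace.log.*
```

Paths near the top of the list that are unrelated to the site, or that belong to lock files, are good candidates for closer inspection in the full trace.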