Troubleshooting high CPU usage of the Checkmk Micro Core (CMC)

This step-by-step guide shows how to deal with high CPU usage of the CMC.

LAST TESTED ON CHECKMK 2.2.0P1

Context

A process monitor such as htop shows the CMC process using 100% of one CPU core. Its command line should look similar to the following:

/omd/sites/mysite/bin/cmc /omd/sites/mysite/var/check_mk/core/config.pb

Step-by-step guide

  1. Verify that the CMC is consuming 100% of one or more CPU cores

    1. Install a process monitor like htop
    2. Run the process monitor as a site user
    3. Filter for the CMC process (e.g., in htop, press F4 and enter the string cmc)

Screenshot of Htop with all four CPUs at 100 percent utilization.
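If htop is not available, a rough command-line check along the following lines may help. This is only a sketch: busy_cmc is a hypothetical helper name, the 90% default threshold is an arbitrary assumption, and the %CPU value reported by ps is an average over the lifetime of the process rather than the momentary value that htop shows.

```shell
# Print "PID %CPU" for every cmc process at or above a CPU threshold.
# Reads "PID %CPU COMMAND..." lines from stdin, so it can be fed by ps.
# The threshold defaults to 90 (an assumed value, adjust as needed).
busy_cmc() {
    awk -v limit="${1:-90}" '$3 ~ /\/bin\/cmc$/ && $2 + 0 >= limit { print $1, $2 }'
}

# Usage on the monitoring server:
#   ps -eo pid=,pcpu=,args= | busy_cmc 90
```

An empty result means that no cmc process reached the threshold; because of the lifetime averaging, a core that has only recently started spinning may not show up yet.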


Debugging

  1. Go to 'Master Control' within your sidebar.
  2. Disable both Host Checks and Service Checks, then restart the CMC:
    Screenshot of the right-hand side Master control with services and host checks set to off.

    OMD[mysite]~# omd restart cmc


  3. Re-enable Host Checks and wait for at least 5 minutes
    Screenshot of the right-hand side Master control with services checks set to off.

    If the behavior reoccurs, disable Host Checks and restart CMC.

  4. Re-enable Service Checks and wait for at least 5 minutes
    Screenshot of the right-hand side Master control with host checks set to off.

    If the behavior reoccurs, disable Service Checks and restart CMC.

  5. Re-enable both Host Checks and Service Checks
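The Master Control switches used in the steps above can also be read back on the command line, which may be handy for confirming the state before each 5-minute wait. A minimal sketch, assuming the lq Livestatus helper is available in the site user's shell and that the Livestatus status table provides the execute_host_checks and execute_service_checks columns; describe_checks is a hypothetical helper name.

```shell
# Turn a Livestatus status line such as "1;0" into a readable summary.
# Field 1: execute_host_checks, field 2: execute_service_checks (1 = on, 0 = off).
describe_checks() {
    IFS=';' read -r host_checks service_checks
    [ "$host_checks" = 1 ] && host_state=on || host_state=off
    [ "$service_checks" = 1 ] && service_state=on || service_state=off
    echo "host checks: $host_state, service checks: $service_state"
}

# Usage as the site user (requires a running core):
#   lq "GET status\nColumns: execute_host_checks execute_service_checks" | describe_checks
```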

Now, we need to understand which hosts might be causing this behavior.

  1. Start with the top-level folder of the affected site in Setup > Hosts and set the "Criticality" of the folder to "Do not monitor this host".
    Screenshot of a host folder properties. Criticality is enabled and set to Do not monitor host.


    The subfolders will inherit this property.
    Screenshot of a host folder properties. Criticality is enabled and set to Do not monitor this host.


  2. Activate changes and run omd restart on that site as the site user.

    OMD[mysite]~# omd restart


  3. Now enable one of the subfolders and activate changes.
    Screenshot of a host folder properties. Criticality is enabled and set to Productive System.


  4. Run omd restart again and wait at least 5 minutes before checking htop

    OMD[mysite]~# omd restart


  5. If the CPU usage does not go back to 100%, repeat steps 3 and 4 with the next subfolder until it does. Make sure to wait at least 5 minutes after each omd restart. Once the CPU usage is back at 100%, the subfolder you enabled last contains the culprit.

  6. Now, we can move forward to see what is causing the issue. What kind of host is it: agent-based, SNMP, or special agent?

    • If it is an agent-based host:
      • Does it use any local plugins?
      • Does it have any special configuration?

  7. Run strace as root to trace the system calls of the cmc process while the problem occurs:

    root@mylinuxhost~# strace -o cmc-strace.log -p $(cat ~mysite/tmp/run/cmc.pid)

    Further information can be found in Debugging the Checkmk Micro Core (CMC), section strace.


  8. With gdb, you can analyze a core dump if Checkmk has created one. Note: Checkmk only creates core dumps if you enable this in the global settings.

    gdb /omd/sites/mysite/bin/cmc --core=/home/mylinuxuser/Downloads/core.python3.989.4b7ee3adffd14e31a0188aac0c215161.804036.1640164046000000
    GNU gdb (Ubuntu 9.2-0ubuntu1~20.04) 9.2
    Copyright (C) 2020 Free Software Foundation, Inc.
    License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
    This is free software: you are free to change and redistribute it.
    There is NO WARRANTY, to the extent permitted by law.
    Type "show copying" and "show warranty" for details.
    This GDB was configured as "x86_64-linux-gnu".
    Type "show configuration" for configuration details.
    For bug reporting instructions, please see:
    <http://www.gnu.org/software/gdb/bugs/>.
    Find the GDB manual and other documentation resources online at:
        <http://www.gnu.org/software/gdb/documentation/>.

    For help, type "help".
    Type "apropos word" to search for commands related to "word"...
    Reading symbols from /omd/sites/mysite/bin/cmc...

    warning: core file may not match specified executable file.
    [New LWP 804036]
    Core was generated by `python3 /omd/sites/mysite/bin/cmk --discover-marked-hosts'.
    Program terminated with signal SIGSEGV, Segmentation fault.
    #0  0x00007f2b661be1fd in ?? ()
    (gdb) where
    #0  0x00007f2b661be1fd in ?? ()
    #1  0x00007ffed8a75060 in ?? ()
    #2  0x0000000000000000 in ?? ()
     
     
    # Run it (if it's still crashing, you'll see it crash)
    r
    # View the backtrace (call stack)
    bt 
    # Quit when done
    q
    # Memory mappings
    i proc m
     
    # Listing all threads. This is really useful!
    thread apply all bt

    Further information can be found in Debugging the Checkmk Micro Core (CMC), section gdb.

  9. If your investigation is not successful, please open a ticket and send us the following data to help us reproduce the issue:

     * Log in as the site user:

         su - $MYSITE

     * Create an archive with the following command:

         tar czf ~/corefiles.tgz ~/var/check_mk/core/ ~/var/log/
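Before opening a ticket, it can be worth summarizing the strace log written in step 7, since a process spinning at 100% CPU often repeats the same few system calls. The following is only a sketch under stated assumptions: top_syscalls is a hypothetical helper name, and the log is assumed to use the usual "name(args) = result" line format, optionally prefixed with a process ID as strace writes it when following threads with -f.

```shell
# Count how often each system call appears in an strace log, most frequent first.
# $1: path to the log file, $2: number of entries to show (default 10).
# Signal lines ("--- SIG... ---") and exit markers ("+++ ... +++") are skipped.
top_syscalls() {
    awk -F'(' '
        { sub(/^[0-9]+ +/, "") }                 # drop an optional PID prefix
        /^[a-z_0-9]+\(/ { count[$1]++ }          # tally the syscall name
        END { for (name in count) print count[name], name }
    ' "$1" | sort -rn | head -n "${2:-10}"
}

# Usage:
#   top_syscalls cmc-strace.log 5
```

A single call dominating the list by a wide margin is a useful detail to mention in the ticket; an almost empty log, in contrast, suggests that the traced thread is busy in user space and that the gdb backtraces from step 8 are the better source.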