Debug Livestatus performance

This article details how to debug performance issues involving Livestatus.

APPLICABLE TO ALL CHECKMK VERSIONS

Problem

You want to evaluate the performance of Livestatus and determine whether it is causing problems.

Solution

Debugging via Livestatus

If the OMD $OMD_SITE Performance service reports "CRIT - Site currently not running", or Livestatus appears to be permanently dead, work through this guide to find the bottleneck.

Execute this command as the site user to see how long Livestatus takes to respond. It should return a response immediately. If it does not, Livestatus is busy.

OMD[mysite]~# time lq "GET status\nColumns: program_start"
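
A single measurement can be misleading if Livestatus is only busy from time to time. The following is a minimal sketch, not part of Checkmk, that runs a command several times and prints each wall-clock duration in milliseconds. The function name sample_latency is made up for illustration, and the sketch assumes GNU date (for the %N format).

```shell
# Hypothetical helper: run a command COUNT times and print each duration in ms.
# usage: sample_latency COUNT COMMAND [ARGS...]
sample_latency() {
    n=$1
    shift
    i=0
    while [ "$i" -lt "$n" ]; do
        start=$(date +%s%N)            # nanoseconds since the epoch (GNU date)
        "$@" >/dev/null 2>&1           # output is discarded, only the time matters
        end=$(date +%s%N)
        echo $(( (end - start) / 1000000 ))
        i=$((i + 1))
    done
}
```

As site user, it could be called like this: sample_latency 5 lq "GET status\nColumns: program_start"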

The "OMD $OMD_SITE Performance Service" uses this command:

OMD[mysite]~# time echo -e "GET status" | waitmax 5 "/omd/sites/$OMD_SITE/bin/unixcat" "/omd/sites/$OMD_SITE/tmp/run/live"


Now we need to find out why Livestatus is busy. This command shows how the request processing times are distributed, rounded down to 100 ms steps:

OMD[mysite]~# grep "processed request in" "$OMD_ROOT/var/log/cmc.log" | cut -d" " -f9 | sed 's,$, 100 / 100 * p,' | dc | sort -n | uniq -c
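
The dc expression in the pipeline above divides each time by 100 and multiplies it by 100 again, which rounds it down to a multiple of 100 ms. The same idea can be sketched with sed and awk. The function name bucket_request_times is made up for illustration, and the sketch assumes that the duration is the number directly before "ms" in the "processed request in" lines.

```shell
# Hypothetical helper: read cmc.log lines on stdin and print
# "<count> <bucket start in ms>" for every 100 ms bucket that occurs.
bucket_request_times() {
    # Extract the duration from "processed request in N ms" lines only.
    sed -n 's/.*processed request in \([0-9][0-9]*\) ms.*/\1/p' |
        awk '{ b = int($1 / 100) * 100; n[b]++ }
             END { for (b in n) print n[b], b }' |
        sort -k2,2n
}
```

It could be called like this: bucket_request_times < "$OMD_ROOT/var/log/cmc.log". Many requests in the high buckets point to long-running queries.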

With this command, you get the processing time of every request:

OMD[mysite]~# grep -Po "client.*processed request in [\d]* ms" "$OMD_ROOT/var/log/cmc.log" | sort -n | uniq -c
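
To go straight to the worst offenders, the output can also be sorted by duration. This is a minimal sketch under the assumption that each such line contains "[client N]" followed by "processed request in M ms", as the command above expects. The function name slowest_requests is made up for illustration.

```shell
# Hypothetical helper: print the N slowest requests (default 10) as
# "<ms> ms client <id>", slowest first.
# usage: slowest_requests [N] < cmc.log
slowest_requests() {
    sed -n 's/.*\[client \([0-9][0-9]*\)\].*processed request in \([0-9][0-9]*\) ms.*/\2 ms client \1/p' |
        sort -rn |
        head -n "${1:-10}"
}
```

It could be called like this: slowest_requests 5 < "$OMD_ROOT/var/log/cmc.log". The client numbers in the output are the ones to look for in the log.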

With this command, you can see an overview of all Livestatus query types:

OMD[mysite]~# grep -Eo "request: GET \w+" "$OMD_ROOT/var/log/cmc.log" | sort -u
request: GET eventconsoleevents
request: GET hosts
request: GET log
request: GET services
request: GET status
request: GET timeperiods
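
The list above shows which tables are queried, but not how often. A small variation, sketched here with a made-up function name count_queries_by_table, counts the requests per table so that the most heavily used one stands out.

```shell
# Hypothetical helper: read cmc.log lines on stdin and print the number of
# requests per Livestatus table, most frequent first.
count_queries_by_table() {
    grep -Eo 'request: GET [[:alnum:]_]+' |
        awk '{ print $3 }' |     # third word is the table name
        sort |
        uniq -c |
        sort -rn
}
```

It could be called like this: count_queries_by_table < "$OMD_ROOT/var/log/cmc.log"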


Now you can search for the client that issued the long-running query. In this example, it is "client 22":

OMD[mysite]~# grep "client 22" "$OMD_ROOT/var/log/cmc.log" | grep "request"

Copy this GET query and execute it with time. What kind of GET query is it?

time lq "GET ........... "

Usually, these are mostly queries like "GET log Columns...".

In this case, you should check the following:

  • Do you have a script accessing data via Livestatus?
  • Your views and dashboards
    • Maybe there is a view causing this long-running command
  • Is the core history corrupted?
    • A symptom of this would be very high CPU utilization of the CMC process itself.
    • You can troubleshoot this by stopping the core, moving "$OMD_ROOT/var/check_mk/core/archive/" away, and starting the core again. Make sure you have tested backups of your site!
    • If the issues are gone once the history archive has been moved away, you need to find the corrupted history file. To do this, move the history files back one by one and restart the core each time.
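
The "one by one" search for a corrupted history file can be sketched as a small loop. This is only an illustration under stated assumptions, not an official procedure: the function name restore_one_by_one is made up, and the third argument is a command you have to provide yourself that restarts the core and returns a non-zero exit code if the problem is back (for example, because a Livestatus query takes too long again).

```shell
# Hypothetical helper: move history files back one at a time and stop at the
# first file after which the check command fails.
# usage: restore_one_by_one ASIDE_DIR ARCHIVE_DIR CHECK_COMMAND
restore_one_by_one() {
    aside=$1      # directory the archive files were moved away to
    archive=$2    # archive directory used by the core
    check=$3      # restarts the core, fails if the problem is back (you provide it)
    for f in "$aside"/*; do
        [ -e "$f" ] || continue              # nothing to do for an empty directory
        mv "$f" "$archive"/ || return 2
        if ! "$check"; then
            echo "suspect file: $(basename "$f")"
            return 1
        fi
    done
    echo "no suspect file found"
}
```

The loop stops at the first suspect file and leaves the remaining files where they are, so the file can be inspected before continuing.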

Debugging long-running GET log commands

The "GET log" query fetches all kinds of log data for the Checkmk Views, e.g., Events of the last four hours.

Depending on how many such views you have and how big your history is, this can take a long time and cause Livestatus to fail:

Screenshot of Livestatus error. Unhandled exception 400. Timeout while waiting for free Livestatus channel.

The limiting factor in log parsing is, of course, the storage of the Checkmk server.


How big is the history and archive of Checkmk? Do you really need all data?

These files store the state changes of hosts and services:

OMD[mysite]:~/var/check_mk/core$ du -sh history archive/
688K	history
113M	archive/

One quick and dirty solution could be to remove old history files to speed things up.
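
Before removing anything, it helps to see which files would be affected. The following minimal sketch only lists files older than a given number of days together with their size and deletes nothing. The function name list_old_history_files is made up for illustration, and the age threshold is your own choice.

```shell
# Hypothetical helper: list regular files in DIR that were last modified more
# than DAYS days ago, with their size. Nothing is deleted.
# usage: list_old_history_files DIR DAYS
list_old_history_files() {
    find "$1" -type f -mtime +"$2" -exec du -h {} +
}
```

It could be called like this: list_old_history_files "$OMD_ROOT/var/check_mk/core/archive" 365. Make sure to have a tested backup before you remove any of the listed files.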

Settings to improve the log parsing

In Setup → General → Global settings we have several settings to improve the log parsing.

Screenshot of global settings with max concurrent livestatus connections set to 20. Max number of cached log messages set to 500000. History log rotation size limit set to 50 MiB.  Max number of parsed lines per log file set to 1000000.




Maximum concurrent Livestatus connections

Typically, the default value should be acceptable. If you have a larger number of users, views, or distributed monitoring, you can increase this value step-by-step (50 - 100 - 150).

Maximum number of cached log messages

To speed up queries for historical data, the core keeps an in-memory cache of log file messages. This number can be configured here. A larger number requires more RAM. Note: even if you set this to 0, there might be some cases where messages need to be cached anyway.

If you have enough memory, you can set this to one million.

History log rotation: Rotate by size (Limit of the size)

A log file is rotated whenever its size exceeds this limit. In a large environment, you can increase this value to, e.g., 200 MB. Checkmk then has to parse the same amount of data, but from fewer files.

Maximum number of parsed lines per log file

To avoid long timeouts caused by oversized history log files, the core limits the number of lines read from history log files. The limit applies per file and can be configured here. Lines beyond the limit are ignored, and an error is logged in the CMC daemon log file.


For example:

2021-12-03 09:30:30 [3] [client 1] more than 500000 lines in "/omd/sites/$OMD_SITE/var/check_mk/core/history", ignoring the rest! 
2021-12-03 09:30:37 [3] [client 1] more than 500000 lines in "/omd/sites/$OMD_SITE/var/check_mk/core/history", ignoring the rest! 
2021-12-03 09:30:42 [3] [client 2] more than 500000 lines in "/omd/sites/$OMD_SITE/var/check_mk/core/history", ignoring the rest! 
2021-12-03 09:30:42 [3] [client 1] more than 500000 lines in "/omd/sites/$OMD_SITE/var/check_mk/core/history", ignoring the rest! 
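
To check whether your site hits this limit, you can search the CMC log for these messages. This is a minimal sketch based only on the message format shown in the example above. The function name truncated_history_files is made up for illustration.

```shell
# Hypothetical helper: read cmc.log lines on stdin and count, per limit and
# file, how often the core stopped reading a history file.
truncated_history_files() {
    grep -o 'more than [0-9]* lines in "[^"]*"' |
        sort |
        uniq -c
}
```

It could be called like this: truncated_history_files < "$OMD_ROOT/var/log/cmc.log". If it prints nothing, no history file has exceeded the limit in the period covered by the log.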

Settings in a distributed setup

In a distributed setup, you are using the Livestatus Proxy Daemon. It can be tuned for the central site and all remote sites here:

Setup → General → Global settings → Livestatus Proxy → Livestatus Proxy default connection parameters

Specific settings for each remote site can be done here: Setup → Distributed monitoring →  → Livestatus Proxy → Livestatus Proxy default connection parameters


Change the "Number of channels to keep open" from 5 to the number you actually need. You can derive this number from the state file on the central site, ~/var/log/liveproxyd.state, which shows the state of all remote sites as well as all pending or waiting connections.