Debug Livestatus performance
This article details how to debug performance issues involving Livestatus.
APPLICABLE TO ALL CHECKMK VERSIONS
Problem
You want to evaluate the performance of Livestatus and understand if there are problems with the performance.
Solution
Debugging via Livestatus
If the "OMD $OMD_SITE Performance Service" reports "CRIT - Site currently not running", or Livestatus appears to be dead all the time, work through the following steps to find the bottleneck.
Run the following command as the site user to see how long Livestatus takes to respond. It should return immediately. If it does not, Livestatus is busy.
OMD[mysite]~# time lq "GET status\nColumns: program_start"
The "OMD $OMD_SITE Performance Service" uses this command:
OMD[mysite]~# time echo -e "GET status" | waitmax 5 "/omd/sites/$OMD_SITE/bin/unixcat" "/omd/sites/$OMD_SITE/tmp/run/live"
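A single measurement can be misleading if Livestatus is only busy from time to time. The following is a minimal sketch for sampling the response time several times in a row. The function name `sample_ms` is hypothetical and not part of Checkmk, and the sketch assumes GNU `date` (for the `%N` nanosecond format).

```shell
# Run a command N times and print the wall-clock duration of each run in ms.
# Assumption: GNU date is available (the %N format is not portable).
sample_ms() {
    local runs="$1" start end i
    shift
    i=0
    while [ "$i" -lt "$runs" ]; do
        start=$(date +%s%N)
        "$@" > /dev/null 2>&1      # output is discarded, only the time matters
        end=$(date +%s%N)
        echo $(( (end - start) / 1000000 ))
        i=$(( i + 1 ))
    done
}
```

As site user, this could be called as `sample_ms 10 lq "GET status\nColumns: program_start"`. Consistently high values point to a permanently busy Livestatus, while single spikes point to occasional long-running queries.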
Now we need to find out why Livestatus is busy. The following command counts the logged requests by processing time, rounded down to steps of 100 ms:
OMD[mysite]~# grep "processed request in" "$OMD_ROOT/var/log/cmc.log" | cut -d" " -f9 | sed 's,$, 100 / 100 * p,' | dc | sort -n | uniq -c
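The same histogram can be sketched with `awk` instead of `cut`, `sed` and `dc`, which makes the bucket width adjustable. This is only an illustration: the function name `bucket_request_times` is hypothetical, and the sketch assumes that the duration is the number directly in front of the word "ms" in each "processed request in" line.

```shell
# Count "processed request in N ms" lines per duration bucket.
# Output: "<bucket start in ms> <number of requests>", sorted by bucket.
# Assumption: the duration is the integer directly before the token "ms".
bucket_request_times() {
    local logfile="$1" bucket_ms="${2:-100}"
    grep "processed request in" "$logfile" |
        awk -v w="$bucket_ms" '
            {
                for (i = 2; i <= NF; i++) {
                    if ($i ~ /^ms/ && $(i - 1) ~ /^[0-9]+$/) {
                        b = int($(i - 1) / w) * w   # round down to bucket start
                        count[b]++
                        break
                    }
                }
            }
            END { for (b in count) print b, count[b] }
        ' | sort -n
}
```

For example, `bucket_request_times "$OMD_ROOT/var/log/cmc.log" 500` would group the requests into steps of 500 ms. A healthy site should show nearly all requests in the lowest bucket.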
The following command lists the processing time of each request together with the client that sent it:
OMD[mysite]~# grep -Po "client.*processed request in [\d]* ms" "$OMD_ROOT/var/log/cmc.log" | sort -n | uniq -c
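On a busy site this list gets long. A minimal sketch that only prints requests above a threshold, slowest first, could look like this. The function name `slow_requests` is hypothetical, and the sketch assumes log lines of the form `[client N] processed request in X ms`.

```shell
# Print "<ms> client <id>" for every request at or above a threshold.
# Assumption: lines contain "[client N]" and "processed request in X ms".
slow_requests() {
    local logfile="$1" min_ms="${2:-1000}"
    awk -v min="$min_ms" '
        /processed request in/ {
            client = ""; ms = ""
            for (i = 1; i < NF; i++) {
                if ($i == "[client") { client = $(i + 1); sub(/\]$/, "", client) }
                if ($i == "in" && $(i + 1) ~ /^[0-9]+$/) { ms = $(i + 1) }
            }
            if (ms != "" && ms + 0 >= min) print ms, "client", client
        }
    ' "$logfile" | sort -rn
}
```

Calling `slow_requests "$OMD_ROOT/var/log/cmc.log" 2000` would list all requests that took two seconds or longer, which gives you the client numbers to look at next.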
The following command gives an overview of all Livestatus query types:
OMD[mysite]~# grep -Eo "request: GET \w+" $OMD_ROOT/var/log/cmc.log | uniq | sort -u
request: GET eventconsoleevents
request: GET hosts
request: GET log
request: GET services
request: GET status
request: GET timeperiods
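The overview above shows which tables are queried, but not how often. As a rough sketch, the queries can also be counted per table. The function name `count_query_tables` is hypothetical, and the sketch assumes the same `request: GET <table>` log format that the command above relies on.

```shell
# Count logged Livestatus queries per table, most frequent first.
# Output: "<count> <table>"
count_query_tables() {
    grep -Eo "request: GET [[:alnum:]_]+" "$1" |
        awk '{ n[$3]++ } END { for (t in n) print n[t], t }' |
        sort -k1,1nr -k2,2
}
```

Run as `count_query_tables "$OMD_ROOT/var/log/cmc.log"`. A very high share of `log` queries is a hint that views on historical data are the main source of load.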
Now you can search for the client with the long-running command. In my case, it is "client 22":
OMD[mysite]~# cat $OMD_ROOT/var/log/cmc.log |grep "client 22" |grep "request"
Copy this GET query and execute it with time. What kind of GET query is it?
time lq "GET ........... "
Usually, these are queries like "GET log Columns...".
In this case, you should check the following:
- Do you have a script accessing data via Livestatus?
- Check your views and dashboards.
  - Maybe there is a view causing this long-running query.
- Is the core history corrupted?
  - A symptom of this would be very high CPU utilization of the CMC process itself.
  - You can troubleshoot this by stopping the core, moving "$OMD_ROOT/var/check_mk/core/archive/" away, and starting the core again. Make sure to have tested backups of your site!
  - If the issues are gone with the history archive moved away, you need to find the corrupted history file. This can be done by moving the history files back one by one and restarting the core each time.
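The step of moving the history files back one by one can be sketched as a small helper. This is only an illustration under assumptions: the function name `restore_next_history_file` and the directory holding the moved-away files are hypothetical, and the core still has to be restarted and checked manually after every call.

```shell
# Move the next file (in alphabetical order) from the moved-away directory
# back into the archive directory and print its name.
# Returns 1 when no file is left to move back.
restore_next_history_file() {
    local backup_dir="$1" archive_dir="$2" f
    for f in "$backup_dir"/*; do
        [ -f "$f" ] || continue
        mv "$f" "$archive_dir/" && basename "$f"
        return
    done
    return 1
}
```

After each call, restart the core and check the CPU utilization of the CMC process. The file printed last before the problem reappears is the likely candidate.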
Debugging long-running GET log commands
The "GET log" query fetches all kinds of log data for the Checkmk Views, e.g., Events of the last four hours.
Depending on how many such views you have and how big your history is, this query can take a long time and make Livestatus unresponsive.
The limiting factor for log parsing is, of course, the storage of the Checkmk server.
How big is the history and archive of Checkmk? Do you really need all data?
OMD[mysite]:~/var/check_mk/core$ du -sh history archive/
688K history
113M archive/
One quick and dirty solution is to remove old history files to speed things up.
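To see how much the queried time range matters, you can time the same "GET log" query with different time filters. The following sketch builds such a query. The function name `log_query_since` is hypothetical, the column selection is just an example, and it is an assumption that `lq` accepts a query containing real line breaks.

```shell
# Print a "GET log" query restricted to the last N hours.
# The optional second argument overrides "now" (Unix time), mainly for testing.
log_query_since() {
    local hours="$1" now="${2:-$(date +%s)}"
    printf 'GET log\nFilter: time >= %s\nColumns: time class host_name\n' \
        "$(( now - hours * 3600 ))"
}
```

For example, compare `time lq "$(log_query_since 4)"` with `time lq "$(log_query_since 720)"`. If the second call is much slower, the size of the history is the main cost driver.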
Settings to improve the log parsing
In Setup → General → Global settings we have several settings to improve the log parsing.
Maximum concurrent Livestatus connections | Typically, the default value should be acceptable. If you have a larger number of users, views, or distributed monitoring, you can increase this value step by step (for example, to 50, then 100, then 150). |
Maximum number of cached log messages | To speed up queries for historical data, the core keeps an in-memory cache of log file messages. The size of this cache can be configured here. A larger number requires more RAM. Note: even if you set this to 0, there might be some cases where messages need to be cached anyway. You can set this to one million if you have enough memory. |
History log rotation: Rotate by size (Limit of the size) | A log file rotation will be forced whenever the file size exceeds that limit. In a large environment, you can increase this value to, e.g., 200 MB. Checkmk then parses the same amount of data from fewer files. |
Maximum number of parsed lines per log file | To avoid long timeouts caused by oversized history log files, the core limits the number of lines read from history log files. The limit applies per file and can be configured here. Lines beyond the limit are ignored, and an error is logged in the CMC daemon log file. |
For example:
2021-12-03 09:30:30 [3] [client 1] more than 500000 lines in "/omd/sites/$OMD_SITE/var/check_mk/core/history", ignoring the rest!
2021-12-03 09:30:37 [3] [client 1] more than 500000 lines in "/omd/sites/$OMD_SITE/var/check_mk/core/history", ignoring the rest!
2021-12-03 09:30:42 [3] [client 2] more than 500000 lines in "/omd/sites/$OMD_SITE/var/check_mk/core/history", ignoring the rest!
2021-12-03 09:30:42 [3] [client 1] more than 500000 lines in "/omd/sites/$OMD_SITE/var/check_mk/core/history", ignoring the rest!
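Before changing the line limit or the rotation size, it can help to know which history files are actually affected. A minimal sketch for this could look as follows. The function name `files_over_line_limit` is hypothetical, and the default of 500000 is only taken from the log example above, so use the value configured on your site.

```shell
# List all files below a directory that contain more lines than a limit.
# Output: "<line count> <path>"
files_over_line_limit() {
    local dir="$1" limit="${2:-500000}" f lines
    find "$dir" -type f | sort | while IFS= read -r f; do
        lines=$(( $(wc -l < "$f") ))   # arithmetic strips padding from wc
        if [ "$lines" -gt "$limit" ]; then
            echo "$lines $f"
        fi
    done
}
```

As site user, this could be run as `files_over_line_limit "$OMD_ROOT/var/check_mk/core"`. Every file listed is only parsed partially with the current limit.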
Settings in a distributed setup
In a distributed setup, you are using the Livestatus Proxy Daemon. It can be tuned for the central site and all remote sites here:
Setup → General → Global settings → Livestatus Proxy → Livestatus Proxy default connection parameters
Specific settings for each remote site can be done here: Setup → Distributed monitoring → → Livestatus Proxy → Livestatus Proxy default connection parameters
Change "Number of channels to keep open" from 5 to the number you actually need. You can determine this number from the state file on the central site: ~/var/log/liveproxyd.state. This file shows the state of all remote sites as well as all pending or waiting connections.
Related articles