Info

This article details how to debug performance issues involving Livestatus.

...

If the "OMD $OMD_SITE Performance" service reports "CRIT - Site currently not running", or Livestatus is always dead, work through this guide to find the bottleneck:

...

Code Block
OMD[mysite]~# time lq "GET status\nColumns: program_start"

...


The "OMD $OMD_SITE Performance" service uses this command:

Code Block
OMD[mysite]~# time echo -e "GET status" | waitmax 5 "/omd/sites/$OMD_SITE/bin/unixcat" "/omd/sites/$OMD_SITE/tmp/run/live"
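
The check above only succeeds if Livestatus answers within five seconds. As a rough illustration of that idea (not the actual check code), the following sketch wraps an arbitrary command in a time limit. It assumes GNU coreutils `timeout` is available and uses it in place of `waitmax`; the function name `responds_within` is made up for this example.

```shell
# Hypothetical helper: report whether a command finishes within a time limit.
# Assumes GNU coreutils `timeout`, which exits with 124 when the limit is hit.
responds_within() {
    limit=$1
    shift
    rc=0
    timeout "$limit" "$@" >/dev/null 2>&1 || rc=$?
    case $rc in
        0)   echo "OK" ;;
        124) echo "TIMEOUT after ${limit}s" ;;
        *)   echo "ERROR (exit code $rc)" ;;
    esac
}
```

As a site user, this could be called as `responds_within 5 lq "GET status"` to see whether a simple status query returns in time.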

...


Now we need to find out why Livestatus is busy. This command shows some statistics taken from the core log ($OMD_ROOT/var/log/cmc.log):

Code Block
OMD[mysite]~# grep "processed request in" "$OMD_ROOT/var/log/cmc.log" | cut -d" " -f9 | sed 's,$, 100 / 100 * p,' | dc | sort -n | uniq -c
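
The pipeline above rounds each request duration down to a multiple of 100 ms (via `dc`) and counts how many requests fall into each bin. A possible equivalent using `awk` is sketched below. It assumes the log lines contain the text "processed request in N ms", as the grep pattern above suggests; the function name `bucket_request_times` is hypothetical.

```shell
# Hypothetical helper: histogram of request durations in 100 ms bins.
# $1: path to a cmc.log-style file
# Prints "<bin start in ms> <number of requests>", sorted by bin.
bucket_request_times() {
    grep -o 'processed request in [0-9]* ms' "$1" |
        awk '{ bin = int($4 / 100) * 100; count[bin]++ }
             END { for (b in count) print b, count[b] }' |
        sort -n
}
```

For example, `bucket_request_times "$OMD_ROOT/var/log/cmc.log"` prints one line per bin. A large count in the higher bins points to slow queries.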

...


This command shows the processing time of every request:

Code Block
OMD[mysite]~# grep -Po "client.*processed request in \d+ ms" "$OMD_ROOT/var/log/cmc.log" | sort | uniq -c
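
To jump straight to the slowest requests, the same log lines can be sorted by duration. The sketch below assumes each matching line looks like "[client N] processed request in M ms" (the bracket after the client number is inferred from the log excerpts elsewhere in this article); the function name `slowest_requests` is made up for illustration.

```shell
# Hypothetical helper: list the slowest requests together with their client ID.
# $1: path to a cmc.log-style file
# $2: number of entries to show (default: 5)
slowest_requests() {
    grep -o 'client [0-9]*] processed request in [0-9]* ms' "$1" |
        awk '{ gsub(/[^0-9]/, "", $2); print $6, "client", $2 }' |
        sort -rn |
        head -n "${2:-5}"
}
```

Each output line starts with the duration in milliseconds, followed by the client that issued the request.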



This command gives an overview of all Livestatus query types:

Code Block
OMD[mysite]~# grep -Eo "request: GET \w+" "$OMD_ROOT/var/log/cmc.log" | sort -u
request: GET eventconsoleevents
request: GET hosts
request: GET log
request: GET services
request: GET status
request: GET timeperiods
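
Besides listing the query types, it can help to know how often each table is queried. The following sketch counts the requests per table. It relies only on the "request: GET <table>" text matched by the command above; the function name `count_query_types` is hypothetical.

```shell
# Hypothetical helper: count Livestatus requests per queried table.
# $1: path to a cmc.log-style file
# Prints "<count> <table>", most frequent table first.
count_query_types() {
    grep -Eo 'request: GET [[:alnum:]_]+' "$1" |
        awk '{ print $3 }' |
        sort | uniq -c | sort -rn |
        awk '{ print $1, $2 }'
}
```

A table such as `log` dominating this list is a hint that history queries are keeping Livestatus busy.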

...

Now you can search for the client with the long-running query. In this example, it is "client 22":

Code Block
OMD[mysite]~# grep "client 22" "$OMD_ROOT/var/log/cmc.log" | grep "request"

...


Copy this GET query and run it with time. What kind of GET query is it?

Code Block
time lq "GET ........... "

...


Usually, we mostly see queries like "GET log Columns...".

...

  • Do you have a script accessing data via Livestatus?
  • Your views and dashboards
    • Maybe there is a view causing this long-running command
  • Is the core history corrupted?
    • A symptom of this would be very high CPU utilization of the CMC process itself.
    • You can troubleshoot this by stopping the core, moving "$OMD_ROOT/var/check_mk/core/archive/" away, and starting the core again. Make sure you have tested backups of your site first!
    • If the issues are gone with the history archive moved away, you need to find the corrupted history file. This can be done by moving the history files back one by one and restarting the core each time.

Debugging long-running GET log commands

Note

The "GET log" query fetches all kinds of log data for the Checkmk views, e.g., the events of the last four hours.

Depending on how many such views you have and how big your history is, this can take a long time and break Livestatus:

Screenshot of Livestatus error. Unhandled exception 400. Timeout while waiting for free Livestatus channel.

The limiting factor in log parsing is, of course, the storage of the Checkmk server.


How big is the history and archive of Checkmk? Do you really need all data?
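
To answer that question, the size of the history data can be measured on disk. The sketch below prints the number of files and the total size of a directory. The function name `history_footprint` is made up for this example, and the archive path in the usage note is the one mentioned earlier in this article.

```shell
# Hypothetical helper: show how many files a directory holds and how big it is.
# $1: directory to inspect
history_footprint() {
    files=$(find "$1" -type f | wc -l | tr -d ' ')
    kib=$(du -sk "$1" | cut -f1)
    echo "$1: $files files, $kib KiB"
}
```

On a site, this could be run as `history_footprint "$OMD_ROOT/var/check_mk/core/archive"`.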

...

In Setup → General → Global settings, there are several settings that improve log parsing.

Screenshot of global settings with max concurrent livestatus connections set to 20. Max number of cached log messages set to 500000. History log rotation size limit set to 50 MiB.  Max number of parsed lines per log file set to 1000000.




Maximum concurrent Livestatus connections

Typically, the default value should be acceptable. If you have a larger number of users or views, or use distributed monitoring, you can increase this value step by step (50, 100, 150).

Maximum number of cached log messages

To speed up queries for historical data, the core keeps an in-memory cache of log file messages. This number can be configured here. A larger number requires more RAM. Note: even if you set this to 0, there might be some cases where messages need to be cached anyway.

You can set this to one million if you have enough memory.

History log rotation: Rotate by size (Limit of the size)

A log file rotation will be forced whenever its size exceeds that limit. In a large environment, you can increase this value to, e.g., 200 MB. Checkmk then has to parse the same amount of data spread across fewer files.

Maximum number of parsed lines per log file

To avoid long timeouts in the case of oversized history log files, the core limits the number of lines read from history log files. The limit is on a per-file basis and can be configured here. Excess lines are ignored, and an error is logged in the CMC daemon log file.

For example:

Code Block
2021-12-03 09:30:30 [3] [client 1] more than 500000 lines in "/omd/sites/$OMD_SITE/var/check_mk/core/history", ignoring the rest!
2021-12-03 09:30:37 [3] [client 1] more than 500000 lines in "/omd/sites/$OMD_SITE/var/check_mk/core/history", ignoring the rest!
2021-12-03 09:30:42 [3] [client 2] more than 500000 lines in "/omd/sites/$OMD_SITE/var/check_mk/core/history", ignoring the rest!
2021-12-03 09:30:42 [3] [client 1] more than 500000 lines in "/omd/sites/$OMD_SITE/var/check_mk/core/history", ignoring the rest!
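
To see in advance which history files would run into this limit, the line counts can be compared against the configured value. The following is a minimal sketch; the function name `files_over_line_limit` is hypothetical, and the limit you pass in should match your own global setting.

```shell
# Hypothetical helper: list files with more lines than a given limit.
# $1: directory to scan (searched recursively)
# $2: maximum number of lines per file
files_over_line_limit() {
    find "$1" -type f | while IFS= read -r f; do
        # Arithmetic expansion strips any padding that wc may add.
        lines=$(($(wc -l < "$f")))
        if [ "$lines" -gt "$2" ]; then
            echo "$f: $lines lines"
        fi
    done
}
```

For instance, `files_over_line_limit "$OMD_ROOT/var/check_mk/core/archive" 1000000` would list archived history files that exceed one million lines.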


...