Info |
---|
This article details how to debug performance issues involving Livestatus. |
...
Code Block | ||||
---|---|---|---|---|
| ||||
OMD[mysite]~# time lq "GET status\nColumns: program_start" |
.
The "OMD <SITENAME> performance" service uses this command:
Code Block | ||||
---|---|---|---|---|
| ||||
OMD[mysite]~# time echo -e "GET status" | waitmax 5 "/omd/sites/$OMD_SITE/bin/unixcat" "/omd/sites/$OMD_SITE/tmp/run/live" |
...
Now we need to find out why Livestatus is busy. This command shows a histogram of request processing times, rounded down to 100 ms steps:
Code Block | ||||
---|---|---|---|---|
| ||||
OMD[mysite]~# grep "processed request in" "$OMD_ROOT/var/log/cmc.log" | cut -d" " -f9 | sed 's,$, 100 / 100 * p,' | dc | sort -n | uniq -c |
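The same bucketing can also be written with awk, which may be easier to adapt than the dc expression. This is only a minimal sketch and not part of Checkmk: `request_histogram` is a hypothetical helper, and it assumes that every relevant log line contains the text "processed request in <N> ms".

```shell
# Hypothetical helper: count requests per processing-time bucket.
# $1: path to cmc.log
# $2: bucket width in ms (default: 100)
# Assumption: relevant lines contain "processed request in <N> ms".
request_histogram() {
    # Extract the duration, round it down to the bucket, count per bucket.
    sed -n 's/.*processed request in \([0-9][0-9]*\) ms.*/\1/p' "$1" \
        | awk -v w="${2:-100}" '{ count[int($1 / w) * w]++ }
              END { for (b in count) print count[b], b }' \
        | sort -k2,2n
}

# Usage: request_histogram "$OMD_ROOT/var/log/cmc.log" 100
```

Each output line shows the number of requests followed by the lower bound of the bucket in ms. Many requests in high buckets suggest that a few expensive queries keep Livestatus busy.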
...
With this command, you will see the processing time of every request:
Code Block | ||||
---|---|---|---|---|
| ||||
OMD[mysite]~# cat $OMD_ROOT/var/log/cmc.log |grep -Po "client.*processed request in [\d]* ms" |sort -n |uniq -c |
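Because these lines start with text, `sort -n` does not order them by duration. To see the slowest requests first, the duration can be moved to the front of each line before sorting. The following is a sketch under assumptions: `slowest_requests` is a hypothetical helper, and it assumes that the client appears as "client <id>" before "processed request in <N> ms" on the same line.

```shell
# Hypothetical helper: list the slowest requests first.
# $1: path to cmc.log
# $2: number of entries to show (default: 10)
# Assumption: lines look like "... client <id> ... processed request in <N> ms ...".
slowest_requests() {
    # Print "<ms> client <id>" per request, then sort by duration, descending.
    sed -n 's/.*\(client [0-9][0-9]*\).*processed request in \([0-9][0-9]*\) ms.*/\2 \1/p' "$1" \
        | sort -rn \
        | head -n "${2:-10}"
}

# Usage: slowest_requests "$OMD_ROOT/var/log/cmc.log" 5
```

The client IDs at the top of this list are the ones worth looking up in the next step.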
.
With this command, you can see an overview of all Livestatus query types.
...
Now you can search for the client with the long-running command. In my case, it is "client 22":
Code Block | ||||
---|---|---|---|---|
| ||||
OMD[mysite]~# cat $OMD_ROOT/var/log/cmc.log |grep "client 22" |grep "request" |
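Note that a plain `grep "client 22"` also matches clients such as 220 or 221. A slightly stricter variant could look like the sketch below. `client_requests` is a hypothetical helper name, and the pattern only assumes that the client ID is followed by a non-digit character or the end of the line.

```shell
# Hypothetical helper: show request lines of exactly one client.
# $1: path to cmc.log
# $2: numeric client ID, e.g. 22
client_requests() {
    # The ID must not be followed by another digit, so 22 does not match 220.
    grep -E "client $2([^0-9]|\$)" "$1" | grep "request"
}

# Usage: client_requests "$OMD_ROOT/var/log/cmc.log" 22
```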
...
Copy this GET query and execute it with time. What kind of GET command is this?
Code Block | ||||
---|---|---|---|---|
| ||||
time lq "GET ........... " |
.
Most of the time, these are commands like "GET log Columns...".
...
- Do you have a script accessing data via Livestatus?
- Your views and dashboards
  - Maybe there is a view causing this long-running command.
- Is the core history corrupted?
  - A symptom of this would be very high CPU utilization of the CMC process itself.
  - You can troubleshoot this by stopping the core, moving "$OMD_ROOT/var/check_mk/core/archive/" away, and starting the core again. Make sure you have tested backups of your site!
  - If the issues are gone after moving the history archive away, you need to find the corrupted history file. To do this, move the history files back one by one and restart the core each time.
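Moving the history files back one by one can be scripted. The following is only a sketch: `restore_next_history_file` is a hypothetical helper, the directory holding the moved-away files is whatever location you chose, and the files are simply taken in alphabetical order. Starting and stopping the core is deliberately left to you.

```shell
# Hypothetical helper: move the next set-aside history file back.
# $1: directory that holds the files you moved away
# $2: archive directory, e.g. "$OMD_ROOT/var/check_mk/core/archive"
# Prints the name of the restored file, or fails if nothing is left.
restore_next_history_file() {
    for f in "$1"/*; do
        # An unmatched glob stays literal, so stop if the path does not exist.
        [ -e "$f" ] || break
        mv "$f" "$2/" && basename "$f"
        return
    done
    echo "no files left in $1" >&2
    return 1
}
```

Stop the core, call the helper once, start the core again, and watch the CPU utilization of the CMC process. The file that was restored right before the problem comes back is the likely culprit.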
Debugging long-running GET log commands
Note |
---|
The "GET log" query fetches all kinds of log data for the Checkmk views, e.g., the events of the last four hours. Depending on how many such views you have and how big your history is, this can take a long time and break Livestatus. The limiting factor in log parsing is, of course, the storage of the Checkmk server. |
...
How big are the history and the archive of Checkmk? Do you really need all of the data?
...
In Setup → General → Global settings, there are several settings that can improve log parsing.
Setting | Description |
---|---|
Maximum concurrent Livestatus connections | Typically, the default value should be acceptable. If you have a larger number of users, views, or distributed monitoring, you can increase this value step by step (50, 100, 150). |
Maximum number of cached log messages | To speed up queries for historical data, the core keeps an in-memory cache of log file messages. This number can be configured here. A larger number requires more RAM. Note: even if you set this to 0, there might be some cases where messages need to be cached anyway. You can set this to one million if you have enough memory. |
History log rotation: Rotate by size (Limit of the size) | A log file rotation will be forced whenever its size exceeds that limit. In a large environment, you can increase this value to, e.g., 200 MB. Checkmk will then need to parse the same amount of data spread over fewer files. |
Maximum number of parsed lines per log file | To avoid long timeouts in the case of oversized history log files, the core limits the number of lines read from history log files. The limit is on a per-file basis and can be configured here. Lines beyond the limit are ignored, and an error is logged in the CMC daemon log file. For example: |
...