How to debug NagVis
This how-to explains how to debug NagVis performance issues.
LAST TESTED ON CHECKMK 2.4.0P1
Getting Started
Background information regarding this subject is available in our Official documentation.
Basics
Check the OMD <SITENAME> performance graphs of the affected site.
The following graphs are important:
Livestatus Connects and Requests - localhost - OMD nagnis_central performance.
Livestatus Requests per Connection - localhost - OMD nagnis_central performance.
Livestatus usage - localhost - OMD nagnis_central performance.
Check_MK helper usage - localhost - OMD nagnis_central performance.
Do you see peaks in these graphs? If yes, please check the liveproxyd.log inside the site user context.
Please check the Livestatus Proxy settings.
"Maximum concurrent Livestatus connections": inside the global and site-specific global settings.
"Livestatus Proxy default connection parameters": inside the global and site-specific global settings.
Clean up your map:
Do you have objects in your map that are no longer available in Checkmk?
Do you have a map with nested maps? Please check if you have objects that are no longer available in Checkmk.
How often does your NagVis map refresh? You can change the refresh interval.
If the map takes a lot of time to open, you might need to debug further. In this case, we recommend checking the Livestatus queries while reloading the map.
Network analysis
To see how long the map really takes to load, we recommend using the network analyzer of your web browser: Network Analyze with the internet browser.
Debugging with Livestatus
Enable the debug log
How to collect troubleshooting data for various issue types - Livestatus Proxy
Debug with the lq queries
The best way to debug with the lq queries is:
tail -f ~/var/log/liveproxyd.log >/path/to/file.txt
reload the NagVis map
analyze the file
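The capture step above can be limited in time so that the file contains only the lines written during one map reload. The following is a minimal sketch, assuming GNU coreutils (tail and timeout) are available in the site user context; the function name capture_new_log_lines and its arguments are illustrative and not part of Checkmk:

```shell
# Hypothetical helper: record only the lines appended to a log file
# during the next N seconds.
#   $1 = log file to follow, $2 = output file, $3 = capture time in seconds
capture_new_log_lines() {
    # -n 0 skips the lines already in the log, so only new entries are captured.
    # timeout stops tail after $3 seconds and then exits with status 124,
    # which is the expected outcome here and is therefore treated as success.
    timeout "$3" tail -n 0 -f "$1" > "$2" || [ $? -eq 124 ]
}
```

For example, capture_new_log_lines ~/var/log/liveproxyd.log /tmp/lq_nagvis.txt 30 would leave 30 seconds to reload the map before the capture stops on its own.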
Detect long-running lq queries.
Do you see any of the following?
a bigger lq query
a log query
a periodic message
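One way to get an overview of the captured file is to count the client requests per Livestatus table. This is only a sketch: it assumes the log entries have the "Send request 'GET <table>" form shown in the example further down, and the function name count_requests_per_table is made up for illustration:

```shell
# Hypothetical helper: count client requests per Livestatus table
# in a captured liveproxyd debug log ($1 = path to the captured file).
count_requests_per_table() {
    # Only the "Send request 'GET <table>" entries are counted, so the
    # matching "Send: 'GET ..." channel entries are not counted twice.
    grep -o "Send request 'GET [a-z_]*" "$1" \
        | awk '{ print $4 }' \
        | sort | uniq -c | sort -rn
}
```

The table with the most requests is printed first. A log query would appear in this list as the table name log.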
You can try to execute this query via the network and see how long it takes:
Livestatus queries over network
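A query copied from the log still contains literal "\n" sequences and the KeepAlive and ResponseHeader headers. The sketch below turns such a log entry into a plain multi-line query that can be run once by hand. The helper name log_entry_to_query is hypothetical, and dropping the two headers is a choice made for manual testing, not something the log requires:

```shell
# Hypothetical helper: convert the escaped query text copied from the log
# into a real multi-line Livestatus query ($1 = quoted text from the log).
log_entry_to_query() {
    # printf '%b' expands the literal "\n" sequences into line breaks.
    # KeepAlive would keep the connection open, and ResponseHeader would
    # prepend a status header to the output, so both lines are removed
    # together with the empty lines at the end of the request.
    printf '%b' "$1" | grep -v -e '^KeepAlive:' -e '^ResponseHeader:' -e '^$'
}
```

In the site user context, the result could then be piped into lq and timed with the shell's time keyword, for example: time log_entry_to_query '<text from the log>' | lq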
One example
Infrastructure
OS: Ubuntu 20.04
Version: Checkmk 1.6.0p24
Sites: one central and one remote
The map
This is a dynamic map with my remote site as a backend. I created and accessed the map via the central site.
The Debugging
This approach applies only to a distributed setup. In that case, you can run the following commands on the central site.
If you are running a single site, please increase the Livestatus log level to debug (see How to collect troubleshooting data for various issue types) and check the cmc.log instead.
OMD[mysite]:~$ tail -f var/log/liveproxyd.log >/tmp/lq_nagvis.txt
OMD[mysite]:~$ cat /tmp/lq_nagvis.txt |grep "GET downtimes" |more
2021-07-21 13:53:04,645 [10] [cmk.liveproxyd.(1108792).Site(cmes).Client(13)] Send request 'GET downtimes\nColumns: author comment start_time end_time\nFilter: host_name = random_0955431380\nOutputFormat: json\nKeepAlive: on\nResponseHeader: fixed16\n\n'
2021-07-21 13:53:04,646 [10] [cmk.liveproxyd.(1108792).Site(cmes).Thread(Thread-2).Channel(7)] Send: 'GET downtimes\nColumns: author comment start_time end_time\nFilter: host_name = random_0955431380\nOutputFormat: json\nKeepAlive: on\nResponseHeader: fixed16\n\n'
2021-07-21 13:53:04,696 [10] [cmk.liveproxyd.(1108792).Site(cmes).Client(13)] Send request 'GET downtimes\nColumns: author comment start_time end_time\nFilter: host_name = random_0955431380\nOutputFormat: json\nKeepAlive: on\nResponseHeader: fixed16\n\n'
2021-07-21 13:53:04,697 [10] [cmk.liveproxyd.(1108792).Site(cmes).Thread(Thread-2).Channel(7)] Send: 'GET downtimes\nColumns: author comment start_time end_time\nFilter: host_name = random_0955431380\nOutputFormat: json\nKeepAlive: on\nResponseHeader: fixed16\n\n'
2021-07-21 13:53:04,747 [10] [cmk.liveproxyd.(1108792).Site(cmes).Client(13)] Send request 'GET downtimes\nColumns: author comment start_time end_time\nFilter: host_name = random_0955431380\nOutputFormat: json\nKeepAlive: on\nResponseHeader: fixed16\n\n'
2021-07-21 13:53:04,748 [10] [cmk.liveproxyd.(1108792).Site(cmes).Thread(Thread-2).Channel(7)] Send: 'GET downtimes\nColumns: author comment start_time end_time\nFilter: host_name = random_0955431380\nOutputFormat: json\nKeepAlive: on\nResponseHeader: fixed16\n\n'
2021-07-21 13:53:04,798 [10] [cmk.liveproxyd.(1108792).Site(cmes).Client(13)] Send request 'GET downtimes\nColumns: author comment start_time end_time\nFilter: host_name = random_0955431380\nOutputFormat: json\nKeepAlive: on\nResponseHeader: fixed16\n\n'
2021-07-21 13:53:04,798 [10] [cmk.liveproxyd.(1108792).Site(cmes).Thread(Thread-2).Channel(7)] Send: 'GET downtimes\nColumns: author comment start_time end_time\nFilter: host_name = random_0955431380\nOutputFormat: json\nKeepAlive: on\nResponseHeader: fixed16\n\n'
The whole logfile: lq_nagvis.txt
What I noticed in the logfile
A significant number of lq "GET downtimes" commands during the map reload.
If I count the "GET downtimes" lines, there are 4836:
OMD[mysite]:~$ cat /tmp/lq_nagvis.txt |grep "GET downtimes" |wc -l
4836
All the other commands look small and reasonable.
Further debugging
I noticed a lot of "GET downtimes" queries in the log. Whenever I reload the map, my central site sends thousands of commands via Livestatus.
When I checked my Checkmk site, I found that I had set several host downtimes. This could explain why my central site collects all downtimes before NagVis shows the map.
The Workaround
Remove all downtimes. The map will open faster.
Access the map directly via the remote site/local site
We fixed this behavior with Checkmk 2.0. The downtimes no longer affect the reload time of the map.