Troubleshooting overlapping clusters: "Empty output from agent at aa.bb.cc.dd:6556" error
This article helps troubleshoot errors with overlapping clusters.
LAST TESTED ON CHECKMK 2.3.0P1
Problem
When one or more nodes belong to more than one cluster (overlapping clusters), the Check_MK service may show the error
[agent] Empty output from agent at <ip>:6556.
This happens because, under certain circumstances, the agent is queried in parallel for each cluster due to a misconfiguration.
This leads to systemd, xinetd, or cmk-agent-ctl blocking the connection.
Furthermore, every additional check_tcp or host check command against port 6556/tcp consumes one more per-source connection.
Using Reschedule active checks on all related Check_MK services with Spread over 0 minutes can also have a negative effect.
Always spread the rescheduling over 1 or more minutes (for example, a value equal to the Normal check interval for service checks).
Solution
Set Global Setting > Maximum cache file age for clusters to, e.g., 1.5 times the Normal check interval for service checks of the Check_MK service of the cluster nodes.
If you use the default Normal check interval for service checks of 1 min for the Check_MK service and leave Maximum cache file age for clusters at 90 seconds, you are fine.
If you are using a service check interval of, e.g., 3 min or 5 min for the Check_MK service on the nodes, you will probably run into this problem.
Use a lower or equal Normal check interval for service checks for the Check_MK service on the cluster nodes compared to the cluster hosts.
In other words: check the cluster node (physical thing) more frequently than the cluster host (virtual thing), or at least at the same interval.
The cluster hosts can then use the cached agent output of the cluster nodes, because it is recent enough, and do not have to initiate a TCP connection to the agent.
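The sizing rule above can be sketched as a small shell helper. This is only an illustration under simplifying assumptions: the function names are hypothetical and not part of Checkmk, all values are in seconds, the factor 1.5 is the example value from the text, and the cache is assumed to be at most one node check interval old.

```shell
#!/bin/sh
# Hypothetical helpers; not part of Checkmk.

# Suggest a "Maximum cache file age for clusters" in seconds:
# 1.5 times the node's "Normal check interval for service checks".
recommended_cache_age() {
    interval_seconds=$1
    echo $(( interval_seconds * 3 / 2 ))
}

# Succeeds (exit status 0) if the maximum cache age is at least as long as
# the node's check interval, so the cached output never expires between
# two node checks (simplified model).
cache_covers_interval() {
    interval_seconds=$1
    max_cache_age_seconds=$2
    [ "$max_cache_age_seconds" -ge "$interval_seconds" ]
}

# Example: nodes checked every 3 minutes, cache age left at 90 seconds.
recommended_cache_age 180
if cache_covers_interval 180 90; then
    echo "cache age is sufficient"
else
    echo "cache age too short: cluster hosts may contact the agent themselves"
fi
```

With a 3-minute interval and the 90-second cache age, the check fails, which matches the problematic case described above; the suggested value for that interval would be 270 seconds.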
Other alternatives
Usually, there is no reason why you should increase the per_source connection limit for the reasons explained above.
But still, here is how to do it:
If you can't or do not want to set Normal check interval for service checks and Maximum cache file age for clusters as described above, you can configure a higher per-source limit for the agent.
Since there are at least 3 different methods to get the agent output from port 6556/tcp, there are also 3 different ways to do it.
xinetd
Edit /etc/xinetd.conf and add this line to the defaults section:
per_source = <the number of clusters that this node belongs to>
Restart the xinetd daemon after that change.
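To double-check the edit, the configured value can be read back from the file. The helper below is a hypothetical sketch, not part of xinetd or Checkmk. It assumes the setting is written as a single per_source = <value> line, and it prints every such line in the file, not only the one in the defaults section.

```shell
#!/bin/sh
# Hypothetical helper: print the per_source value(s) found in an xinetd
# configuration file (default path: /etc/xinetd.conf).
#
# Assumed layout of the relevant part of the file:
#   defaults
#   {
#       per_source = 3
#   }
xinetd_per_source() {
    conf_file=${1:-/etc/xinetd.conf}
    # Split each line at "=", strip blanks and tabs from the value.
    awk -F= '/^[ \t]*per_source[ \t]*=/ {
        gsub(/[ \t]/, "", $2)
        print $2
    }' "$conf_file"
}
```

Calling xinetd_per_source without an argument reads /etc/xinetd.conf; the printed number should match the number of clusters the node belongs to.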
systemd
We have to distinguish between 2.0 and 2.1 here.
With 2.1, cmk-agent-ctl is listening on 6556/tcp, not a systemd socket; in that case, see the cmk-agent-ctl section below.
Edit /etc/systemd/system/check-mk-agent.socket and add this line to the Socket section:
MaxConnectionsPerSource=<the number of clusters that this node belongs to>
Reload the systemd manager configuration by issuing
systemctl daemon-reload
Verify your change by executing
systemctl show check-mk-agent.socket | grep MaxConnectionsPerSource
cmk-agent-ctl
The cmk-agent-ctl has its own per-source limit protection, which is not handled by systemd. Currently, it is not configurable via the bakery, but you can control it with an environment variable.
Edit the systemd unit to set an environment variable:
systemctl edit cmk-agent-ctl-daemon.service
Set the environment variable DEBUG_MAX_CONNECTIONS:
# /etc/systemd/system/cmk-agent-ctl-daemon.service.d/override.conf
[Service]
Environment="DEBUG_MAX_CONNECTIONS=16"
Make systemd aware of this change, then restart the cmk-agent-ctl-daemon unit so that it uses the environment variable DEBUG_MAX_CONNECTIONS:
systemctl daemon-reload
systemctl restart cmk-agent-ctl-daemon.service
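As with the systemd socket, the override can be checked after the restart. The sketch below is an assumption-laden illustration: the function name is hypothetical, and it expects the single Environment=... line that systemctl show prints for the unit, with variables separated by spaces.

```shell
#!/bin/sh
# Hypothetical helper: read the "Environment=..." line printed by
# systemctl show on stdin and print the DEBUG_MAX_CONNECTIONS value.
extract_debug_max_connections() {
    sed -e 's/^Environment=//' \
        | tr ' ' '\n' \
        | sed -n 's/^DEBUG_MAX_CONNECTIONS=//p'
}

# Usage on the monitored host (requires systemd):
#   systemctl show cmk-agent-ctl-daemon.service --property=Environment \
#       | extract_debug_max_connections
```

If nothing is printed, the variable is not set for the unit and the override was probably not picked up.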