Troubleshooting overlapping clusters: "Empty output from agent at aa.bb.cc.dd:6556" error

This article helps troubleshoot errors with overlapping clusters.

LAST TESTED ON CHECKMK 2.0.0P1

Problem

When one or more nodes belong to more than one cluster (overlapping clusters), the Check_MK service can show the error:

[agent] Empty output from agent at <ip>:6556.


This happens when, because of a misconfiguration, the agent on a node is queried in parallel, once for each cluster the node belongs to.
The per-source connection limit of systemd, xinetd, or cmk-agent-ctl then blocks the surplus connections.
In addition, every check_tcp check or host check command that connects to port 6556/tcp also uses up one of the per-source connections.

Running Reschedule active checks on all related Check_MK services with Spread over 0 minutes can also trigger the problem.
Always spread the rescheduling over 1 or more minutes (for example, a value equal to the Normal check interval for service checks).

Solution

  • Set Global Setting > Maximum cache file age for clusters to, e.g., 1.5 times the Normal check interval for service checks of the Check_MK service of the cluster nodes.

  • If you keep the defaults, that is, a Normal check interval for service checks of 1 min for the Check_MK service and a Maximum cache file age for clusters of 90 seconds, you are fine.

  • If you are using a service check interval of, e.g., 3 min or 5 min for the Check_MK service on the nodes, you will probably run into this problem with the default cache file age.

  • Use a lower or equal Normal check interval for service checks for the Check_MK service on the cluster nodes compared to the cluster hosts.
    In other words: check the cluster node (physical thing) more frequently than the cluster host (virtual thing), or at least at the same interval.

  • The cluster hosts can then use the cached agent output of the cluster nodes, because it is recent enough, and do not have to initiate a TCP connection to the agent.
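
As a rough illustration of the 1.5 times rule of thumb from the first bullet, the following shell sketch computes a candidate value for Maximum cache file age for clusters from a given check interval. The function name recommended_cache_age is hypothetical and not part of Checkmk, and the factor 1.5 is only the example value mentioned above.

```shell
# Hypothetical helper, not part of Checkmk: prints a cache file age in
# seconds that is 1.5 times the given check interval (also in seconds).
recommended_cache_age() {
    interval_seconds=$1
    # Integer arithmetic: multiply by 3, then divide by 2 (= factor 1.5).
    echo $(( interval_seconds * 3 / 2 ))
}

recommended_cache_age 60    # check interval of 1 min
recommended_cache_age 180   # check interval of 3 min
recommended_cache_age 300   # check interval of 5 min
```

For the default interval of 60 seconds this prints 90, which matches the default cache file age. For intervals of 3 min and 5 min it prints 270 and 450 seconds.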

Other alternatives

For the reasons explained above, there is usually no need to increase the per-source connection limit.
If you still want to, here is how to do it:

  • If you can't or do not want to set Normal check interval for service checks and Maximum cache file age for clusters as described above, you can configure a higher per-source limit for the agent.
  • Since there are at least 3 different methods of serving the agent output on port 6556/tcp, there are also 3 different ways to raise the limit.

xinetd

Edit /etc/xinetd.conf and add this line to the defaults section:

 per_source = <the number of clusters that this node belongs to>

Restart the xinetd daemon after that change.
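
For illustration, assuming a node that belongs to 3 clusters, the defaults section could look like the following fragment. The value 3 is an assumption made for this example only, and any entries already present in your defaults section stay unchanged.

```
defaults
{
        # ... keep the existing entries of your distribution here ...

        # Assumption for this example: the node belongs to 3 clusters.
        per_source = 3
}
```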

systemd

  • We have to distinguish between Checkmk 2.0 and 2.1 here.
  • With 2.1, cmk-agent-ctl listens on 6556/tcp instead of a systemd socket. In that case, see the cmk-agent-ctl section.

Edit /etc/systemd/system/check-mk-agent.socket and add this line to the Socket section:

MaxConnectionsPerSource=<the number of clusters that this node belongs to>

Reload the systemd manager configuration by issuing

systemctl daemon-reload

Verify your change by executing

systemctl show check-mk-agent.socket | grep MaxConnectionsPerSource
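
For illustration, assuming a node that belongs to 3 clusters, the Socket section would gain the line shown in this fragment. The value 3 is an assumption made for this example only, and the other settings of the section, not shown here, stay as shipped.

```
[Socket]
# Assumption for this example: the node belongs to 3 clusters.
MaxConnectionsPerSource=3
```

With this value in place, the verification command above should report MaxConnectionsPerSource=3.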

cmk-agent-ctl

cmk-agent-ctl enforces its own per-source connection limit, independently of systemd. Currently, this limit cannot be configured via the bakery, but you can control it with an environment variable.


  1. Edit the systemd unit to set an environment variable:

     systemctl edit cmk-agent-ctl-daemon.service

  2. Set the environment variable DEBUG_MAX_CONNECTIONS:

     # /etc/systemd/system/cmk-agent-ctl-daemon.service.d/override.conf
     [Service]
     Environment="DEBUG_MAX_CONNECTIONS=16"

  3. Make systemd aware of this change:

     systemctl daemon-reload

  4. Restart the cmk-agent-ctl-daemon unit so that it picks up the environment variable DEBUG_MAX_CONNECTIONS:

     systemctl restart cmk-agent-ctl-daemon.service