Displayed interface bandwidth bigger than interface speed

A device such as a switch does not report bandwidth directly; it only exposes byte counters. The bandwidth is therefore calculated from two consecutive counter readings, basically like this:

(B(t_2) - B(t_1)) / (t_2 - t_1)

Here t_1 < t_2 are two points in time, and B(t) is the total number of bytes (octets) that have passed through the interface up to time t, counted from 0 when the device boots.
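The formula can be sketched in Python as follows. This is a minimal illustration and not Checkmk's actual implementation; the function name rate_bytes_per_second and the sample values are assumptions made up for this example.

```python
def rate_bytes_per_second(b1: int, t1: float, b2: int, t2: float) -> float:
    """Average byte rate between two counter samples (B(t_1), t_1) and (B(t_2), t_2)."""
    if t2 <= t1:
        raise ValueError("t2 must be later than t1")
    return (b2 - b1) / (t2 - t1)


# Hypothetical samples taken 60 s apart: 750,000,000 bytes were transferred.
rate = rate_bytes_per_second(b1=1_000_000_000, t1=0.0, b2=1_750_000_000, t2=60.0)
print(rate)      # 12500000.0 bytes per second
print(rate * 8)  # 100000000.0 bits per second, i.e. 100 Mbit/s
```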

In the case of SNMP, such a counter can be read from an OID like .1.3.6.1.2.1.31.1.1.1.10.18 (ifHCOutOctets; the trailing .18 is the interface index). On a Linux system, the output of ifconfig shows the corresponding large RX/TX byte counts in each interface section.
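On Linux, the kernel also exposes these cumulative counters as files under /sys/class/net/<interface>/statistics/. The following is a minimal sketch for reading them, assuming a Linux host; the function name read_interface_bytes and its base_dir parameter are hypothetical and exist only for this illustration.

```python
from pathlib import Path


def read_interface_bytes(interface: str, base_dir: str = "/sys/class/net") -> tuple[int, int]:
    """Return the cumulative (rx_bytes, tx_bytes) counters of an interface."""
    stats = Path(base_dir) / interface / "statistics"
    rx = int((stats / "rx_bytes").read_text().strip())
    tx = int((stats / "tx_bytes").read_text().strip())
    return rx, tx


if __name__ == "__main__":
    # The loopback interface "lo" is used here only as an example name.
    if (Path("/sys/class/net") / "lo").exists():
        print(read_interface_bytes("lo"))
```

Reading these values twice, together with a timestamp for each reading, yields the B(t_1) and B(t_2) needed for the formula.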

The value read from the counter B may lag slightly behind the true value, because under high traffic the device may not update its counters instantaneously. As this systematic error affects both readings, it roughly cancels out when the difference is calculated, so B_error is taken to be minor for the moment.

The byte counter can be 32 or 64 bits wide. With the smaller 32-bit counters, you risk a counter overflow between two polls. The longer the polling interval, the higher this risk; you should therefore stay at 1 min whenever possible, which is advisable for other reasons as well. Over time, measures have been implemented in Checkmk to detect and compensate for counter overflows, but these cannot be perfect.
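To illustrate why overflow compensation has limits, here is a minimal sketch of a common modulo-based correction, together with the time a counter needs to wrap at full line rate. It is not Checkmk's actual logic; the function names and sample values are assumptions made up for this example.

```python
def counter_delta(previous: int, current: int, bits: int = 32) -> int:
    """Difference between two counter samples, assuming at most one wrap."""
    modulus = 2 ** bits
    return (current - previous) % modulus


def seconds_until_wrap(bits: int, link_bits_per_second: float) -> float:
    """Time a byte counter of the given width needs to wrap at full line rate."""
    return 2 ** bits / (link_bits_per_second / 8)


# A 32-bit counter that wrapped once between two polls (hypothetical values).
print(counter_delta(previous=4_294_967_000, current=704))  # 1000

# A 32-bit byte counter on a fully loaded 1 Gbit/s link.
print(round(seconds_until_wrap(32, 1e9), 1))  # 34.4
```

At 1 Gbit/s, a 32-bit counter can thus wrap more than once within a 1 min polling interval, and a modulo correction of this kind cannot tell how many wraps occurred. It also cannot distinguish a wrap from a counter reset caused by a device reboot.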

Taking the two measurements also takes some time, which introduces a t_error that may be positive or negative. Clearly, only a negative t_error, where the interval used in the calculation is shorter than the real one, leads to a higher bandwidth than expected. Instabilities of the Checkmk site can have a large impact on t_error. For example, fetcher saturation leads to fetcher latency, which can increase t_error significantly. Other potential sources are restarting a site, temporary inactivity of a site, or activating changes. Ideally, Checkmk would work precisely regardless of such situations, and development is working in this direction. Note, however, that Checkmk is not delivered as a scientific measurement tool. Of course, there may also be bugs in Checkmk or in the devices.
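The effect of t_error can be made concrete with a small sketch. It assumes a simplified model in which the counter difference corresponds to the true interval, while the calculation divides by the true interval plus t_error; the function name displayed_rate and all values are hypothetical.

```python
def displayed_rate(true_rate: float, true_interval: float, t_error: float) -> float:
    """Rate obtained when the interval is taken as true_interval + t_error."""
    transferred = true_rate * true_interval  # bytes actually counted
    return transferred / (true_interval + t_error)


line_rate = 125_000_000.0  # 1 Gbit/s expressed in bytes per second

print(displayed_rate(line_rate, 60.0, 0.0))   # exact timing: equals the line rate
print(displayed_rate(line_rate, 60.0, -5.0))  # negative t_error: above the line rate
print(displayed_rate(line_rate, 60.0, 5.0))   # positive t_error: below the line rate
```

In this model, an interface running at full line rate with a t_error of -5 s on a 60 s interval is displayed at roughly 9 % above its nominal speed.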

To debug this further, two SNMP walks or agent outputs are necessary, together with the service graphs of the Checkmk site. Unfortunately, the measurement data that caused the strange graph will no longer be available, so only rough testing is possible here. Usually, debugging ends with a plausible explanation for t_error.