Sometimes a network interface graph shows a speed higher than is physically possible. This article explains how that can happen.


Overview

Monitoring network interface throughput is typically counter-based: SNMP implementations and classical operating systems start counters at zero on boot and increment them for every octet or packet sent or received. For a monitoring solution to actually monitor throughput, it therefore needs to collect two samples at different points in time and calculate the difference between them. This yields quite accurate readings, but the logic can fail for different reasons. Read on to find out about these reasons.

Reason

There are two main possible reasons: timing issues and counter overflows.

Timing issues

As the calculation of the interface speed needs data from two points in time, timing is of the essence. Consider an SNMP device that is under heavy load and fails to respond in a timely manner. Depending on the exact scenario, this can lead to both decreased and increased throughput readings.
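
For illustration (the numbers are assumed, not measured): consider an interface saturated at 1 Gbit/s (125 MB/s) that is polled every 60 seconds, and a calculation that divides by the nominal interval. If the device answers one poll 10 seconds late, the counter has grown by 70 seconds' worth of traffic (8.75 GB); divided by 60 seconds this yields roughly 1.17 Gbit/s on a link that can physically carry only 1 Gbit/s. The following interval then covers only 50 seconds of traffic and shows a correspondingly low value.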

Counter overflows

The OIDs that are queried for the interface throughput calculation can be 32-bit or 64-bit counters. While a 32-bit counter is far more likely to overflow (at a sustained 1 Gbit/s, a 32-bit octet counter wraps around after roughly 34 seconds), 64-bit counters can also overflow. The result of such an overflow is almost certainly an incorrect throughput calculation.

Original text:

As there is no direct way to poll bandwidth data from, e.g., a switch, the bandwidth is basically calculated like this:

(B(t_2) - B(t_1)) / (t_2 - t_1)

Where t_2 > t_1 are points in time, and B(t) is the total number of bytes (octets) passed through an interface, starting from 0 when the device boots.
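
As a minimal sketch (this is not Checkmk code), the same difference calculation could look like this in Python; read_counter() is a hypothetical placeholder for whatever actually fetches B(t), e.g. an SNMP query for ifHCOutOctets:

import time

def read_counter() -> int:
    """Hypothetical placeholder: fetch B(t), e.g. via SNMP (ifHCOutOctets)."""
    raise NotImplementedError

# Two samples of the counter, taken one polling interval apart
b1, t1 = read_counter(), time.monotonic()
time.sleep(60)                      # nominal polling interval of 1 min
b2, t2 = read_counter(), time.monotonic()

# (B(t_2) - B(t_1)) / (t_2 - t_1), in bytes per second
rate = (b2 - b1) / (t2 - t1)
print(f"{rate * 8 / 1_000_000:.1f} Mbit/s")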


In the case of SNMP, such an OID can be .1.3.6.1.2.1.31.1.1.1.10.18 (ifHCOutOctets). On a Linux system, you can also see correspondingly large RX/TX byte counters in the per-interface sections of the ifconfig output.
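
On Linux, such a counter can also be read directly from sysfs, which is where tools like ifconfig take it from. A minimal sketch, assuming a Linux system and the example interface name eth0:

from pathlib import Path

def tx_bytes(interface: str = "eth0") -> int:
    # Total number of bytes transmitted on the interface since boot,
    # i.e. the same counter that ifconfig reports as "TX bytes"
    return int(Path(f"/sys/class/net/{interface}/statistics/tx_bytes").read_text())

print(tx_bytes())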

The counter B may lag somewhat behind the current value, as there might be high traffic on the interface and the counters are not updated instantaneously. This systematic error tends to roughly cancel out in the difference calculation; thus, B_error is taken to be minor for the moment.

The byte counter can be 32-bit or 64-bit. With the smaller counters you risk encountering a counter overflow. The longer the polling interval, the higher this risk; thus, you should stay at 1 min whenever possible, also for other reasons. Over the course of Checkmk development, measures have been implemented to detect and compensate for counter overflows, which nevertheless will not be perfect.
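
Purely as an illustration of the wraparound arithmetic (this is not Checkmk's actual compensation logic), a single overflow between two samples can be compensated like this, assuming the counter wrapped at most once within the polling interval:

def counter_delta(old: int, new: int, bits: int = 32) -> int:
    # If new < old, assume the counter overflowed exactly once; multiple
    # wraps within one polling interval cannot be detected from the samples alone.
    return (new - old) % (1 << bits)

# Example: a 32-bit octet counter wrapped between two samples
print(counter_delta(old=4_294_000_000, new=1_000_000))  # 1967296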


Taking the two measurements also requires some time, which introduces a t_error that may be positive or negative. Clearly, only a negative t_error leads to a higher bandwidth than expected. Instabilities of the Checkmk site can have a large impact on t_error: e.g., fetcher saturation leading to fetcher latency can increase t_error significantly. Other potential problem sources are restarting a site, temporary inactivity of a site, or activating changes. Ideally, Checkmk would work precisely regardless of such situations, and Dev is also working in this direction. Note, however, that we do not deliver Checkmk as a scientific measurement tool. Of course, there may also be bugs in Checkmk or in the devices.

For debugging this further, two SNMP walks or agent outputs are necessary, together with the service graphs of the Checkmk site. Unfortunately, the measurement data that caused the strange graph display will no longer be available, so only rough testing is possible here. Usually, at the end of debugging there will be plausible reasons for t_error.
