How to collect troubleshooting data for Kubernetes

This article explains what information to provide to support to help troubleshoot Kubernetes issues.

LAST TESTED ON CHECKMK 2.1.0P1

An overview of all supported Kubernetes flavors is available in our GitHub repository.

Getting Started

Background information regarding this subject is available in our documentation.

General data

For diagnostic purposes, please send us a support diagnostics dump so we can take a closer look at what might be happening.

Detailed instructions on how to create such a dump are available in our official guide: https://docs.checkmk.com/latest/en/support_diagnostics.html

Please check the following boxes when creating the dump and attach the file to this ticket:

  • Local files
  • OMD Config
  • Checkmk Overview
  • Checkmk Configuration files
  • Performance Graphs of Checkmk Server
  • Global Settings
In addition, please provide the following:

  • The Kubernetes distribution and version you are using, for example via:

    uname -a
    cat /etc/os-release
  • Some additional outputs, as described in Debug Kubernetes Cluster Components below:

    kubectl get pods -A
    kubectl get nodes
    kubectl version -o json
  • The output of the special agent when run on the command line
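The special agent can be run from within the Checkmk site itself. As a sketch (the site name mysite and the host name mykubernetescluster are placeholders for your setup):

```shell
# Switch to the Checkmk site user (site name is a placeholder)
su - mysite

# Dump the agent output for the cluster host as Checkmk sees it
cmk -d mykubernetescluster

# Re-run the checks verbosely without submitting results,
# to surface any errors from the special agent
cmk --debug -vvn mykubernetescluster
```

Please attach the full output of these commands to the ticket.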

Debug Kubernetes Cluster Components

A couple of pods are deployed to the Kubernetes cluster for monitoring:

  • Cluster collector that runs on some worker node (depending on where Kubernetes decides to schedule it)
  • Two types of node collectors run on each worker node:
    • container metrics collector (collects CPU and memory metrics on containers running on the respective nodes)
    • machine sections collector (runs the Checkmk agent on the respective nodes)

The Docker images that run inside these pods as containers can be found on Docker Hub: https://hub.docker.com/r/checkmk/kubernetes-collector

The customer decides which namespace to deploy these containers in. The default is checkmk-monitoring.
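If the namespace is not known, the collector pods can usually be located by searching all namespaces (this assumes the default resource names contain "checkmk"):

```shell
# Search all namespaces for the Checkmk collector pods
kubectl get pods -A | grep -i checkmk
```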

Verify that the above components have been deployed correctly and are running by executing:

kubectl get pods -n NAMESPACE


The output may look something like this:

$ kubectl get pods -n checkmk-monitoring
NAME                                                  READY   STATUS    RESTARTS   AGE
safe-checkmk-cluster-collector-55d5f46bf6-8jzcx       1/1     Running   1          11d
safe-checkmk-node-collector-container-metrics-m778x   2/2     Running   2          11d
safe-checkmk-node-collector-container-metrics-rm69w   2/2     Running   2          11d
safe-checkmk-node-collector-container-metrics-zn2nj   2/2     Running   0          7d19h
safe-checkmk-node-collector-container-metrics-zqwn8   2/2     Running   2          11d
safe-checkmk-node-collector-machine-sections-4z2lr    1/1     Running   1          11d
safe-checkmk-node-collector-machine-sections-9mpx5    1/1     Running   0          7d19h
safe-checkmk-node-collector-machine-sections-rf4xb    1/1     Running   1          11d
safe-checkmk-node-collector-machine-sections-wbhqv    1/1     Running   1          11d


For each node, you should see one set of node collector pods and one occurrence of a cluster collector pod on some arbitrary node.

The status of these pods may flap between "Running" and "Error" or "CrashLoopBackOff". If this is the case, try to narrow down the cause by running the commands below:


kubectl get events -n NAMESPACE
kubectl logs [--previous] POD -n NAMESPACE [-c CONTAINER]


The elements in brackets in the second command are optional and have the following effect:

  • --previous: shows the logs of the previously failed container. This is useful if the current container has been running successfully for the time being and does not emit any error logs.
  • -c CONTAINER: selects a container if there is more than one inside a pod. This is the case for the container metrics collector: it runs cadvisor (a third-party open-source tool) and a container-metrics-collector. Usually, we are interested in the latter.
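Putting this together for the container metrics collector, a call might look like the following. The pod name is taken from the example output above; the namespace and the container name are assumptions and depend on the actual deployment:

```shell
# Logs of the previously crashed container-metrics-collector container
# inside one of the node collector pods (names are placeholders)
kubectl logs --previous \
  safe-checkmk-node-collector-container-metrics-m778x \
  -n checkmk-monitoring \
  -c container-metrics-collector
```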

In addition to the above steps, you can ask the customer to set the log level to debug. This is done in the YAML manifests or Helm charts used to deploy the Kubernetes components. Once the modifications have been made, the components must be redeployed to the cluster.
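Redeploying after such an edit can look like the following; all file, release, and chart names here are placeholders for whatever was used in the original deployment:

```shell
# Re-apply edited manifests to the monitoring namespace
kubectl apply -f checkmk-collectors.yaml -n checkmk-monitoring

# Or, when deployed via Helm (release and chart names are placeholders):
helm upgrade checkmk-monitoring CHART -n checkmk-monitoring -f values.yaml
```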

Lastly, grab the log output, as explained above.
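To spot flapping pods at a glance, the Running/restart check described above can be scripted. This is a small sketch using awk on the tabular "kubectl get pods" output; the namespace in the usage comment is a placeholder:

```shell
# Print pods that are not Running or have restarted at least once.
# Reads "kubectl get pods" output on stdin; skips the header line,
# then checks the STATUS ($3) and RESTARTS ($4) columns.
flag_unhealthy_pods() {
  awk 'NR > 1 && ($3 != "Running" || $4 > 0) { print $1, $3, "restarts=" $4 }'
}

# Usage against a live cluster (namespace is a placeholder):
#   kubectl get pods -n checkmk-monitoring | flag_unhealthy_pods
```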
