How-to collect troubleshooting data for Kubernetes

An overview of all supported Kubernetes flavors is available in our GitHub repository.

Getting Started

Background information regarding this subject is available on our:

General data

For diagnostic purposes, please send us a dump of the support diagnostics so we can take a closer look at what might be happening.

Detailed instructions on how to create such a dump are available in our official guide: https://docs.checkmk.com/latest/en/support_diagnostics.html

Please check the following boxes when creating the dump and attach the file to this ticket:

Local files
OMD Config
Checkmk Overview
Checkmk Configuration files
Performance Graphs of Checkmk Server
Global Settings
What kind of Kubernetes distro + version are you using?
```
uname -a
cat /etc/os-release
```
Some additional outputs, as described in Debug Kubernetes Cluster Components
```
kubectl get pods -A
kubectl get nodes
kubectl version -o json
```
Please run the special agent on the command line
- Command for Special Agents
- Debugging the Kubernetes - k8s special agent
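
To attach all of these outputs in one go, the commands above can be wrapped in a small helper. The following is a minimal sketch, not an official Checkmk tool: the function name `collect_general_data` and the output file names are assumptions chosen for illustration, and it presumes that `kubectl` is already configured for the affected cluster.

```shell
#!/usr/bin/env bash
# Sketch: gather the general outputs requested above into one directory.
# collect_general_data and the file names are illustrative, not part of Checkmk.
collect_general_data() {
    local outdir="${1:-k8s-support-data}"
    mkdir -p "$outdir" || return 1

    # Operating system details of the machine the commands are run on
    uname -a > "$outdir/uname.txt" 2>&1
    cat /etc/os-release > "$outdir/os-release.txt" 2>&1

    # Cluster overview; errors are written to the file instead of aborting
    kubectl get pods -A > "$outdir/pods.txt" 2>&1
    kubectl get nodes > "$outdir/nodes.txt" 2>&1
    kubectl version -o json > "$outdir/version.json" 2>&1

    # Print the directory so the caller knows what to attach
    echo "$outdir"
}
```

Calling `collect_general_data ./ticket-data` after sourcing the snippet writes one file per command into `./ticket-data`, which can then be archived and attached to the ticket.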

Debug Kubernetes Cluster Components

A couple of pods are deployed to the Kubernetes cluster for monitoring:

- Cluster collector: runs on one worker node (wherever Kubernetes decides to schedule it)
- Node collectors: two types run on each worker node:
  - container metrics collector (collects CPU and memory metrics of the containers running on the respective node)
  - machine sections collector (runs the Checkmk agent on the respective node)

The Docker images that run as containers inside these pods are available on Docker Hub: https://hub.docker.com/r/checkmk/kubernetes-collector

The customer decides which namespace to deploy these containers in. The default is checkmk-monitoring.

Verify that the above components have been deployed correctly and are running successfully by running the following command, replacing NAMESPACE with the namespace used for the deployment:

```
kubectl get pods -n NAMESPACE
```

The output may look something like this:

```
$ kubectl get pods -n checkmk-monitoring
NAME                                                  READY   STATUS    RESTARTS   AGE
safe-checkmk-cluster-collector-55d5f46bf6-8jzcx       1/1     Running   1          11d
safe-checkmk-node-collector-container-metrics-m778x   2/2     Running   2          11d
safe-checkmk-node-collector-container-metrics-rm69w   2/2     Running   2          11d
safe-checkmk-node-collector-container-metrics-zn2nj   2/2     Running   0          7d19h
safe-checkmk-node-collector-container-metrics-zqwn8   2/2     Running   2          11d
safe-checkmk-node-collector-machine-sections-4z2lr    1/1     Running   1          11d
safe-checkmk-node-collector-machine-sections-9mpx5    1/1     Running   0          7d19h
safe-checkmk-node-collector-machine-sections-rf4xb    1/1     Running   1          11d
safe-checkmk-node-collector-machine-sections-wbhqv    1/1     Running   1          11d
```

You should see one pair of node collector pods (container metrics and machine sections) per node, plus a single cluster collector pod running on an arbitrary node.
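
On larger clusters it can be easier to list only the pods that need attention. The following is a minimal sketch under the assumption that the input has the column layout shown above (NAME, READY, STATUS, ...) without the header line; the function name `unhealthy_pods` is hypothetical.

```shell
# Sketch: print pods that are not fully ready or not in the Running state.
# Reads "kubectl get pods --no-headers" output on stdin.
# Expected columns: NAME READY STATUS RESTARTS AGE
unhealthy_pods() {
    awk '{
        split($2, ready, "/")    # READY looks like "2/2"
        if (ready[1] != ready[2] || $3 != "Running")
            print $1, $2, $3
    }'
}
```

A possible invocation is `kubectl get pods -n NAMESPACE --no-headers | unhealthy_pods`. Empty output means that every pod reported all of its containers as ready and running at that moment; because the status may flap, it is worth repeating the check a few times.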

The status of these pods may flap between "Running" and "Error" or "CrashLoopBackOff". If this is the case, try to narrow down the cause of the error by running the following commands:

```
kubectl get events -n NAMESPACE
kubectl logs [--previous] POD -n NAMESPACE [-c CONTAINER]
```

The elements in brackets in the second command are optional and have the following effect:

- --previous: shows the logs of the previous, failed instance of the container. This is useful if the current instance has been running successfully so far and has not written any error logs. It also helps if the containers are failing and being recreated too quickly to capture the logs of the current instance.
- -c CONTAINER: selects the container if there is more than one container inside a pod. This is the case for the container metrics collector: it runs cadvisor (a third-party open source tool) and a container-metrics-collector. Usually, we are interested in the latter.
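
When several pods are affected, running these commands by hand for every pod and container becomes tedious. The following is a minimal sketch that loops over all pods in a namespace and stores the events plus the current and previous logs of each container. The function name `collect_pod_logs` and the output file naming are assumptions for illustration; only the `kubectl` calls shown above, plus `-o jsonpath` to list pod and container names, are used.

```shell
# Sketch: save events plus current and previous logs of every container
# in a namespace. collect_pod_logs is a hypothetical helper name.
collect_pod_logs() {
    local ns="$1" outdir="${2:-pod-logs}" pod container
    mkdir -p "$outdir" || return 1

    kubectl get events -n "$ns" > "$outdir/events.txt" 2>&1

    for pod in $(kubectl get pods -n "$ns" -o jsonpath='{.items[*].metadata.name}'); do
        for container in $(kubectl get pod "$pod" -n "$ns" -o jsonpath='{.spec.containers[*].name}'); do
            # Logs of the currently running container instance
            kubectl logs "$pod" -n "$ns" -c "$container" \
                > "$outdir/$pod.$container.log" 2>&1
            # Logs of the previous instance; kubectl reports an error if there is none
            kubectl logs --previous "$pod" -n "$ns" -c "$container" \
                > "$outdir/$pod.$container.previous.log" 2>&1
        done
    done

    echo "$outdir"
}
```

For example, `collect_pod_logs checkmk-monitoring ./collector-logs` would write one pair of log files per container. A `.previous.log` file that only contains an error message simply means that the container has not been restarted.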

In addition to the above steps, you can ask the customer to set the log level to debug. This is done in the YAML manifests or Helm charts used to deploy the Kubernetes components. After the change, the components must be redeployed to the cluster for the new log level to take effect.

Lastly, grab the log output, as explained above.

