This article explains what information to provide to support to help troubleshoot Kubernetes issues.
LAST TESTED ON CHECKMK 2.1.0P1
An overview of all supported Kubernetes flavors can be found in our GitHub repository.
General data
For diagnostic purposes, please send us a support diagnostics dump so we can take a closer look at what might be happening.
Detailed instructions on how to create such a dump are available in our official guide: https://docs.checkmk.com/latest/en/support_diagnostics.html
Please check the following boxes when creating the dump and attach the file to this ticket:
- Local files
- OMD Config
- Checkmk Overview
- Checkmk Configuration files
- Performance Graphs of Checkmk Server
- Global Settings
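If the customer prefers the command line over the GUI, the dump can also be created as the site user. This is a minimal sketch; the checkboxes listed above correspond to additional options of this command, which are described in the guide linked above (option names may differ between Checkmk versions, so verify them against the guide):
omd su MYSITE
cmk --create-diagnostics-dump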
What kind of Kubernetes distro + version are you using?
uname -a
cat /etc/os-release
Some additional outputs, as described under Debug Kubernetes Cluster Components below:
kubectl get pods
kubectl get nodes
kubectl version -o json
Please run the special agent on the command line.
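A minimal sketch of how this can be done on the Checkmk server as the site user, assuming HOSTNAME is the cluster host that uses the Kubernetes special agent:
cmk -D HOSTNAME
cmk --debug -vvn HOSTNAME
The first command dumps the host's configuration, including the full special agent command line, which can be copied and executed directly; the second runs the checks, including the special agent, with verbose output and without submitting results.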
Debug Kubernetes Cluster Components
The following pods are deployed to the Kubernetes cluster for monitoring (see the command after this list for a quick way to check the corresponding workloads):
- Cluster collector that runs on one worker node (depending on where Kubernetes decides to schedule it)
- Two types of node collectors that run on each worker node:
  - container metrics collector (collects CPU and memory metrics of the containers running on the respective node)
  - machine sections collector (runs the Checkmk agent on the respective node)
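The cluster collector is deployed as a Deployment and the two node collectors as DaemonSets, so a quick way to confirm that all of them exist is the following (the object names depend on the release name chosen during installation):
kubectl get deployments,daemonsets -n NAMESPACE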
The Docker images that run inside these pods as containers can be found on Docker Hub: https://hub.docker.com/r/checkmk/kubernetes-collector
The customer decides which namespace to deploy these containers in. The default is checkmk-monitoring.
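To check which collector image versions are actually deployed, for example when verifying compatibility with the Checkmk site version, the images can be listed per pod; this is a sketch using the default namespace:
kubectl get pods -n checkmk-monitoring -o custom-columns=NAME:.metadata.name,IMAGES:.spec.containers[*].image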
Verify that the above components have been deployed correctly and are running successfully by running:
kubectl get pods -n NAMESPACE
The output may look something like this:
$ kubectl get pods -n checkmk-monitoring
NAME                                                  READY   STATUS    RESTARTS   AGE
safe-checkmk-cluster-collector-55d5f46bf6-8jzcx       1/1     Running   1          11d
safe-checkmk-node-collector-container-metrics-m778x   2/2     Running   2          11d
safe-checkmk-node-collector-container-metrics-rm69w   2/2     Running   2          11d
safe-checkmk-node-collector-container-metrics-zn2nj   2/2     Running   0          7d19h
safe-checkmk-node-collector-container-metrics-zqwn8   2/2     Running   2          11d
safe-checkmk-node-collector-machine-sections-4z2lr    1/1     Running   1          11d
safe-checkmk-node-collector-machine-sections-9mpx5    1/1     Running   0          7d19h
safe-checkmk-node-collector-machine-sections-rf4xb    1/1     Running   1          11d
safe-checkmk-node-collector-machine-sections-wbhqv    1/1     Running   1          11d
For each node, you should see one set of node collector pods and one occurrence of a cluster collector pod on some arbitrary node.
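To verify this distribution, the wide output format adds a NODE column showing where each pod is scheduled:
kubectl get pods -n NAMESPACE -o wide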
The status of these pods may flap between "Running" and "Error" or "CrashLoopBackOff". If this is the case, try to narrow down the reason for the error by running the commands below:
kubectl get events -n NAMESPACE
kubectl logs [--previous] POD -n NAMESPACE [-c CONTAINER]
The elements in brackets in the second command are optional and have the following effect:
- --previous: shows the logs of the previously failed container. This is useful if the current container has been running successfully for the time being and therefore does not emit any error logs.
- -c CONTAINER: selects the container if there is more than one container inside a pod. This is the case for the container metrics collector: it runs cAdvisor (a third-party open-source tool) and a container-metrics-collector. Usually, we are interested in the latter.
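In addition, kubectl describe shows a pod's recent events and is often the quickest way to spot scheduling, image pull, or resource problems that never appear in the container logs:
kubectl describe pod POD -n NAMESPACE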
In addition to the above steps, you can ask the client to set the log level to debug. This is done in the YAML manifests or Helm charts used to deploy the Kubernetes components. Once the modifications have been made, the components must be deployed to the cluster again.
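For reference, a minimal sketch of such a redeployment; RELEASE, CHART, and MANIFEST are placeholders, and the exact key for the log level must be taken from the customer's values.yaml or manifests, as it depends on the chart/manifest version:
helm upgrade RELEASE CHART -n NAMESPACE -f values.yaml
or, if plain YAML manifests are used:
kubectl apply -f MANIFEST.yaml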
Lastly, grab the log output, as explained above.