Debugging the Kubernetes (k8s) special agent
This document describes how to debug the new Kubernetes (k8s) special agent introduced with Checkmk 2.1 (Werk #13810).
LAST TESTED ON CHECKMK 2.0.0P1
Getting Started
Background information on this subject is available in our official documentation.
Installation of the Checkmk Cluster Collectors (a.k.a. the Checkmk Kubernetes agent)
We strongly recommend installing the Checkmk Cluster Collectors via our Helm charts unless you are very experienced with Kubernetes and want to install the agent using the YAML manifests we provide.
The Helm chart installs and configures all components necessary to run the agent and exposes several configuration options that let you set up complex resources automatically. The prerequisites have to be fulfilled before you proceed with the installation.
Below is an example of deploying the Helm chart using a LoadBalancer (this requires that the cluster is able to provision a LoadBalancer):
$ helm repo add checkmk-chart https://checkmk.github.io/checkmk_kube_agent
$ helm repo update
$ helm upgrade --install --create-namespace -n checkmk-monitoring checkmk checkmk-chart/checkmk --set clusterCollector.service.type="LoadBalancer"
Release "checkmk" does not exist. Installing it now.
NAME: checkmk
LAST DEPLOYED: Tue May 17 22:01:07 2022
NAMESPACE: checkmk-monitoring
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
You can access the checkmk `cluster-collector` via:
LoadBalancer:
==========
NOTE: It may take a few minutes for the LoadBalancer IP to be available.
You can watch the status by running 'kubectl get --namespace checkmk-monitoring svc -w checkmk-cluster-collector'
export SERVICE_IP=$(kubectl get svc --namespace checkmk-monitoring checkmk-cluster-collector --template "{{ range (index .status.loadBalancer.ingress 0) }}{{.}}{{ end }}");
echo http://$SERVICE_IP:8080
# Cluster-internal DNS of `cluster-collector`: checkmk-cluster-collector.checkmk-monitoring
=========================================================================================
With the token of the service account named `checkmk-checkmk` in the namespace `checkmk-monitoring` you can now issue queries against the `cluster-collector`.
Run the following to fetch its token and the ca-certificate of the cluster:
export TOKEN=$(kubectl get secret checkmk-checkmk -n checkmk-monitoring -o=jsonpath='{.data.token}' | base64 --decode);
export CA_CRT="$(kubectl get secret checkmk-checkmk -n checkmk-monitoring -o=jsonpath='{.data.ca\.crt}' | base64 --decode)";
# Note: Quote the variable when echo'ing to preserve proper line breaks: `echo "$CA_CRT"`
To test access you can run:
curl -H "Authorization: Bearer $TOKEN" http://$SERVICE_IP:8080/metadata | jq
You can append further configuration options to the above helm command. The following are examples; depending on your requirements, you can combine several of them:
Flags | Description |
---|---|
--set clusterCollector.service.type="LoadBalancer" | Sets the cluster collector service type to LoadBalancer. |
--set clusterCollector.service.type="NodePort" --set clusterCollector.service.nodePort=30035 | Sets the cluster collector service type to NodePort and specifies the node port. |
--version 1.0.0-beta.2 | Specifies a version constraint for the chart version to use. |
We recommend using a values.yaml file to configure your Helm chart. To deploy with it, run:
$ helm upgrade --install --create-namespace -n checkmk-monitoring myrelease checkmk-chart/checkmk -f values.yaml
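If you do not yet have such a file, the following is a minimal sketch. The clusterCollector.service keys mirror the --set flags shown above; treat the exact key names as assumptions and consult the chart's bundled values.yaml for the authoritative reference.
$ cat > values.yaml <<'EOF'
# Minimal example: expose the cluster collector via NodePort 30035
# (keys mirror the --set flags used earlier - verify against the chart's own values.yaml)
clusterCollector:
  service:
    type: NodePort
    nodePort: 30035
EOF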
After the chart has been deployed successfully, you will be presented with a set of commands to access the cluster-collector from the command line. To display those commands again, run:
helm status checkmk -n checkmk-monitoring
At the same time, you can verify that all essential resources in the namespace have been deployed successfully. The command below lists the most important ones:
$ kubectl get all -n checkmk-monitoring
NAME                                               READY   STATUS    RESTARTS   AGE
pod/checkmk-cluster-collector-57c7f5f54b-xgqvx     1/1     Running   0          19m
pod/checkmk-node-collector-container-metrics-lflhs 2/2     Running   0          20m
pod/checkmk-node-collector-container-metrics-s59lb 2/2     Running   0          20m
pod/checkmk-node-collector-container-metrics-tnccf 2/2     Running   0          20m
pod/checkmk-node-collector-machine-sections-9k441  1/1     Running   0          20m
pod/checkmk-node-collector-machine-sections-fc795  1/1     Running   0          19m
pod/checkmk-node-collector-machine-sections-lfv9l  1/1     Running   0          20m

NAME                                TYPE           CLUSTER-IP     EXTERNAL-IP    PORT(S)          AGE
service/checkmk-cluster-collector   LoadBalancer   10.20.10.165   34.107.19.22   8080:31168/TCP   20m

NAME                                                      DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/checkmk-node-collector-container-metrics   3         3         3       3            3           <none>          20m
daemonset.apps/checkmk-node-collector-machine-sections    3         3         3       3            3           <none>          20m

NAME                                        READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/checkmk-cluster-collector   1/1     1            1           20m

NAME                                                   DESIRED   CURRENT   READY   AGE
replicaset.apps/checkmk-cluster-collector-57c7f5f54b   1         1         1       20m
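If any of these pods are not Running, inspecting events and logs is usually the quickest next step. A minimal sketch, using the resource names from the listing above:
# Show the events of a failing pod (pod name taken from the listing above):
$ kubectl describe pod checkmk-cluster-collector-57c7f5f54b-xgqvx -n checkmk-monitoring

# Tail the cluster collector logs:
$ kubectl logs deployment/checkmk-cluster-collector -n checkmk-monitoring --tail=50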
Exposing the Checkmk Cluster Collector
By default, the API of the Checkmk Cluster Collector is not exposed to the outside (not to be confused with the Kubernetes API itself). Exposing it is required so that Checkmk can gather usage metrics and enrich your monitoring.
Checkmk pulls data from this API, which can be exposed via the service checkmk-cluster-collector. To do so, re-run the helm command with one of the following flags, or set the corresponding options in your values.yaml.
Flags | Description |
---|---|
--set clusterCollector.service.type="LoadBalancer" | Sets the cluster collector service type to LoadBalancer. |
--set clusterCollector.service.type="NodePort" --set clusterCollector.service.nodePort=30035 | Sets the cluster collector service type to NodePort and specifies the node port. |
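Once exposed, you can verify from outside the cluster that the collector API answers. A minimal sketch, assuming the NodePort 30035 from the table above and a placeholder node IP; the token handling and the /metadata endpoint are taken from the helm NOTES shown earlier:
# Fetch the service account token (as in the helm NOTES above):
$ export TOKEN=$(kubectl get secret checkmk-checkmk -n checkmk-monitoring -o=jsonpath='{.data.token}' | base64 --decode)

# Query the collector's metadata endpoint through the exposed NodePort:
$ curl -H "Authorization: Bearer $TOKEN" http://<NODE-IP>:30035/metadata | jq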
Debugging the k8s special agent
- The first step is to find the complete command line of the Kubernetes special agent.
The command can be found under "Type of agent >> Program". It consists of multiple parameters, depending on how the datasource program rule has been configured.
OMD[mysite]:~$ cmk -D k8s | more
k8s
Addresses:              No IP
Tags:                   [address_family:no-ip], [agent:special-agents], [criticality:prod], [networking:lan], [piggyback:auto-piggyback], [site:a21], [snmp_ds:no-snmp], [tcp:tcp]
Labels:                 [cmk/kubernetes/cluster:at], [cmk/kubernetes/object:cluster], [cmk/site:k8s]
Host groups:            check_mk
Contact groups:         all
Agent mode:             No Checkmk agent, all configured special agents
Type of agent:
  Program: /omd/sites/mysite/share/check_mk/agents/special/agent_kube '--cluster' 'k8s' '--token' 'xyz' '--monitored-objects' 'deployments' 'daemonsets' 'statefulsets' 'nodes' 'pods' '--api-server-endpoint' 'https://<YOUR-IP>:6443' '--api-server-proxy' 'FROM_ENVIRONMENT' '--cluster-collector-endpoint' 'https://<YOUR-ENDPOINT>:30035' '--cluster-collector-proxy' 'FROM_ENVIRONMENT'
  Process piggyback data from /omd/sites/mysite/tmp/check_mk/piggyback/k8s
Services:
...
An easier way is this command:
/bin/sh -c "$(cmk -D k8s | grep -A1 "^Type of agent:" | grep "Program:" | cut -f2- -d':')"
Please note that if a line matching "^Type of agent:" followed by a line matching "Program:" occurs more than once, the output may be garbled.
The special agent provides the following options for debugging purposes:
OMD[mysite]:~$ /omd/sites/mysite/share/check_mk/agents/special/agent_kube -h
...
--debug               Debug mode: raise Python exceptions
-v / --verbose        Verbose mode (for even more output use -vvv)
--vcrtrace FILENAME   Enables VCR tracing for the API calls
...
Now, you can modify the above command of the Kubernetes special agent like this:
OMD[mysite]:~$ /omd/sites/mysite/share/check_mk/agents/special/agent_kube \
  '--cluster' 'at' \
  '--token' 'xyz' \
  '--monitored-objects' 'deployments' 'daemonsets' 'statefulsets' 'nodes' 'pods' \
  '--api-server-endpoint' 'https://<YOUR-IP>:6443' \
  '--api-server-proxy' 'FROM_ENVIRONMENT' \
  '--cluster-collector-endpoint' 'https://<YOUR-ENDPOINT>:30035' \
  '--cluster-collector-proxy' 'FROM_ENVIRONMENT' \
  --debug -vvv --vcrtrace ~/tmp/vcrtrace.txt > ~/tmp/k8s_with_debug.txt 2>&1
You can also reduce the list after '--monitored-objects' to a few resources to get less output.
Run the special agent without debug options to create an agent output, or download it from the cluster host via the Checkmk web interface.
/omd/sites/mysite/share/check_mk/agents/special/agent_kube \
  '--cluster' 'at' \
  '--token' 'xyz' \
  '--monitored-objects' 'deployments' 'daemonsets' 'statefulsets' 'nodes' 'pods' \
  '--api-server-endpoint' 'https://<YOUR-IP>:6443' \
  '--api-server-proxy' 'FROM_ENVIRONMENT' \
  '--cluster-collector-endpoint' 'https://<YOUR-ENDPOINT>:30035' \
  '--cluster-collector-proxy' 'FROM_ENVIRONMENT' > ~/tmp/k8s_agent_output.txt 2>&1
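Checkmk agent output delimits its sections with <<<...>>> headers, so a quick way to see which sections the agent actually produced is:
# List the section headers contained in the agent output, with counts:
$ grep -o '^<<<[^>]*>>>' ~/tmp/k8s_agent_output.txt | sort | uniq -c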
Please upload the following files to the support ticket.
File | Description |
---|---|
~/tmp/vcrtrace.txt | Trace file |
~/tmp/k8s_with_debug.txt | Debug output |
~/tmp/k8s_agent_output.txt | Agent output |
Common errors
- Context: the Kubernetes special agent is slightly unconventional compared to other special agents, as it handles up to three different datasources (the API, the cluster collector container metrics, and the cluster collector node metrics).
- The connection to the Kubernetes API server is mandatory, while the connection to the others is optional (and decided through the configured datasource rule).
- Failure to connect to the Kubernetes API server will be shown by the Checkmk service (as usual) → the agent crashes
- Failure to connect to the cluster collector will be highlighted in the Cluster Collector service → the error is not raised by the agent in production
- The error is only raised when executing the agent with the --debug flag.
- Version: We only support the latest three Kubernetes versions (see https://kubernetes.io/releases/).
- If a customer has the latest release and the release itself is quite new (less than one month), ask one of the devs if we already have support.
- Kubernetes API connection error: If the agent fails to connect to the Kubernetes API (e.g., 401 Unauthorized when querying api/v1/pods), the output produced with the --debug flag should be sufficient. A manual connectivity check is sketched after the list below.
- Common causes:
- The service account was not configured correctly in the Kubernetes cluster.
- A wrong token was configured.
- The ca.crt was not uploaded under Global settings >> Trusted certificate authorities for SSL although --verify-cert-api is enabled.
- Wrong IP or port.
- The proxy is not configured in the datasource rule.
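A minimal sketch of such a manual check against the API server, using the endpoint and token from the datasource rule (<YOUR-IP> is a placeholder):
# Verify that the API server is reachable and the token is accepted;
# -k skips certificate verification - drop it once the CA is trusted.
$ curl -k -H "Authorization: Bearer $TOKEN" https://<YOUR-IP>:6443/api/v1/pods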
- Checkmk Cluster Collector connection error:
- Common causes (a quick reachability check is sketched after this list):
- The cluster collector is not exposed via NodePort, LoadBalancer, or Ingress.
- Essential resources such as pods, deployments, and daemonsets are not running or are restarting frequently.
- A firewall or a security group blocks the cluster collector IP.
- Wrong IP or port.
- The ca.crt was not uploaded under Global settings >> Trusted certificate authorities for SSL although --verify-cert-collector is enabled.
- The proxy is not configured in the datasource rule.
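A minimal reachability sketch, assuming a NodePort exposure on port 30035 and a placeholder node IP:
# Check that the service exposes the expected port:
$ kubectl get svc checkmk-cluster-collector -n checkmk-monitoring

# Test raw TCP reachability from the Checkmk server (firewall / security group check):
$ nc -vz <NODE-IP> 30035

# Query the collector API directly (token as fetched earlier):
$ curl -H "Authorization: Bearer $TOKEN" http://<NODE-IP>:30035/metadata | jq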
- API processing error: If the agent reports a bug similar to "value ... was not set", the user should be asked for the vcrtrace file.