Debugging the Kubernetes (k8s) special agent

This document describes how to debug the Kubernetes (k8s) special agent introduced with Checkmk 2.1 (Werk #13810).

LAST TESTED ON CHECKMK 2.0.0p1

Getting Started

Background information on this subject is available in our official documentation.

Installation of the Checkmk Cluster Collectors (a.k.a. the Checkmk Kubernetes agent)

We strongly recommend installing the Checkmk Cluster Collectors with our Helm charts, unless you are very experienced with Kubernetes and prefer to install the agent from the YAML manifests we provide.

Please remember that we cannot support you in installing the agent via the plain manifests.


The Helm chart installs and configures all components required to run the agent and exposes several configuration options that help you set up complex resources automatically. Make sure the prerequisites are fulfilled before you proceed with the installation.

Below is an example of deploying the Helm chart using a LoadBalancer (this requires that the cluster can provision LoadBalancer services):

$ helm repo add checkmk-chart https://checkmk.github.io/checkmk_kube_agent

$ helm repo update

$ helm upgrade --install --create-namespace -n checkmk-monitoring checkmk checkmk-chart/checkmk --set clusterCollector.service.type="LoadBalancer"

Release "checkmk" does not exist. Installing it now.
NAME: checkmk
LAST DEPLOYED: Tue May 17 22:01:07 2022
NAMESPACE: checkmk-monitoring
STATUS: deployed
REVISION: 1
TEST SUITE: None

NOTES:
You can access the checkmk `cluster-collector` via:

LoadBalancer:
==========  
NOTE: It may take a few minutes for the LoadBalancer IP to be available.
      You can watch the status of it by running 'kubectl get --namespace checkmk-monitoring svc -w checkmk-cluster-collector'

  export SERVICE_IP=$(kubectl get svc --namespace checkmk-monitoring checkmk-cluster-collector --template "{{ range (index .status.loadBalancer.ingress 0) }}{{.}}{{ end }}");

  echo http://$SERVICE_IP:8080

# Cluster-internal DNS of `cluster-collector`: checkmk-cluster-collector.checkmk-monitoring
=========================================================================================
With the token of the service account named `checkmk-checkmk` in the namespace `checkmk-monitoring` you can now issue queries against the `cluster-collector`.

Run the following to fetch its token and the ca-certificate of the cluster:

  export TOKEN=$(kubectl get secret checkmk-checkmk -n checkmk-monitoring -o=jsonpath='{.data.token}' | base64 --decode);
  export CA_CRT="$(kubectl get secret checkmk-checkmk -n checkmk-monitoring -o=jsonpath='{.data.ca\.crt}' | base64 --decode)";
  
# Note: Quote the variable when echo'ing to preserve proper line breaks: `echo "$CA_CRT"`

To test access you can run:
  curl -H "Authorization: Bearer $TOKEN" http://$SERVICE_IP:8080/metadata | jq
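
Since the cluster's ca-certificate has to be uploaded to Checkmk later (Global settings >> Trusted certificate authorities for SSL, see "Common errors" below), it can be handy to write it to a file right away. A small sketch using the CA_CRT variable from the NOTES above (the file name is arbitrary):

$ echo "$CA_CRT" > checkmk-cluster-ca.crt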


You can also pass further configuration options to the above helm command. The following flags are examples; depending on your requirements, you can combine several of them:

Flag / Description

--set clusterCollector.service.type="LoadBalancer"
    Sets the cluster collector service type to LoadBalancer.

--set clusterCollector.service.type="NodePort" --set clusterCollector.service.nodePort=30035
    Sets the service type to NodePort and lets you specify the port.

--version 1.0.0-beta.2
    Specifies a version constraint for the chart version to use.

We recommend using a values.yaml file to configure your Helm chart.
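
A minimal sketch of such a file, mirroring the NodePort flags from the table above (any other chart value can be set the same way):

$ cat > values.yaml <<'EOF'
# Expose the cluster collector via NodePort
# (equivalent to the --set flags shown above)
clusterCollector:
  service:
    type: NodePort
    nodePort: 30035
EOF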


To deploy the chart with a values.yaml file, run:

$ helm upgrade --install --create-namespace -n checkmk-monitoring myrelease checkmk-chart/checkmk -f values.yaml


After the chart has been deployed successfully, you will be presented with a set of commands for accessing the cluster-collector from the command line. To display those commands again, run:

$ helm status checkmk -n checkmk-monitoring


You can also verify that all essential resources in the namespace have been deployed successfully. The following command lists the most important ones:


$ kubectl get all -n checkmk-monitoring

NAME                                                 READY   STATUS    RESTARTS   AGE
pod/checkmk-cluster-collector-57c7f5f54b-xgqvx       1/1     Running   0          19m
pod/checkmk-node-collector-container-metrics-lflhs   2/2     Running   0          20m
pod/checkmk-node-collector-container-metrics-s59lb   2/2     Running   0          20m
pod/checkmk-node-collector-container-metrics-tnccf   2/2     Running   0          20m
pod/checkmk-node-collector-machine-sections-9k441    1/1     Running   0          20m
pod/checkmk-node-collector-machine-sections-fc795    1/1     Running   0          19m
pod/checkmk-node-collector-machine-sections-lfv9l    1/1     Running   0          20m

NAME                                TYPE           CLUSTER-IP     EXTERNAL-IP    PORT(S)          AGE
service/checkmk-cluster-collector   LoadBalancer   10.20.10.165   34.107.19.22   8080:31168/TCP   20m

NAME                                                       DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/checkmk-node-collector-container-metrics    3         3         3       3            3           <none>          20m
daemonset.apps/checkmk-node-collector-machine-sections     3         3         3       3            3           <none>          20m

NAME                                        READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/checkmk-cluster-collector   1/1     1            1           20m

NAME                                                    DESIRED   CURRENT   READY   AGE
replicaset.apps/checkmk-cluster-collector-57c7f5f54b    1         1         1       20m
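
If any of these resources is not ready, the logs and events of the affected workload are usually the quickest pointer. A sketch using the resource names from the output above (<POD-NAME> is a placeholder):

$ kubectl logs -n checkmk-monitoring deployment/checkmk-cluster-collector
$ kubectl describe pod -n checkmk-monitoring <POD-NAME>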

Exposing the Checkmk Cluster Collector

By default, the API of the Checkmk Cluster Collector is not exposed to the outside (not to be confused with the Kubernetes API itself). Exposing it is required so that Checkmk can gather usage metrics and enrich your monitoring.
Checkmk pulls data from this API, which can be exposed via the service checkmk-cluster-collector. To do so, install the chart with one of the following flags, or set the corresponding values in a values.yaml:

Flag / Description

--set clusterCollector.service.type="LoadBalancer"
    Sets the cluster collector service type to LoadBalancer.

--set clusterCollector.service.type="NodePort" --set clusterCollector.service.nodePort=30035
    Sets the service type to NodePort and lets you specify the port.
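
For example, to switch an existing installation to a NodePort service on port 30035, re-run the install command with the corresponding flags (release and namespace names as used above):

$ helm upgrade --install --create-namespace -n checkmk-monitoring checkmk checkmk-chart/checkmk --set clusterCollector.service.type="NodePort" --set clusterCollector.service.nodePort=30035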

Debugging the Kubernetes special agent

  1. The first step is to find the complete command line of the Kubernetes special agent.
     
    1. The command can be found under "Type of agent >> Program". It consists of multiple parameters, depending on how the datasource rule has been configured.

      OMD[mysite]:~$ cmk -D k8s | more
      
      k8s 
      Addresses: No IP
      Tags: [address_family:no-ip], [agent:special-agents], [criticality:prod], [networking:lan],
      [piggyback:auto-piggyback], [site:a21], [snmp_ds:no-snmp], [tcp:tcp]
      Labels: [cmk/kubernetes/cluster:at], [cmk/kubernetes/object:cluster], [cmk/site:k8s]
      Host groups: check_mk
      Contact groups: all
      Agent mode: No Checkmk agent, all configured special agents
      Type of agent: 
      Program: /omd/sites/mysite/share/check_mk/agents/special/agent_kube '--cluster' 'k8s' '--token' 'xyz' '--monitored-objects' 'deployments' 'daemonsets' 'statefulsets' 'nodes' 'pods' '--api-server-endpoint' 'https://<YOUR-IP>:6443' '--api-server-proxy' 'FROM_ENVIRONMENT' '--cluster-collector-endpoint' 'https://<YOUR-ENDPOINT>:30035' '--cluster-collector-proxy' 'FROM_ENVIRONMENT'
      Process piggyback data from /omd/sites/mysite/tmp/check_mk/piggyback/k8s
      Services:
      ...

      An easier way would be this command: /bin/sh -c "$(cmk -D k8s | grep -A1 "^Type of agent:" | grep "Program:" | cut -f2- -d':')"

      Please note that if a line matching "^Type of agent:" followed by a line matching "^  Program:" occurs more than once, the output may be garbled; see the sketch below for a more defensive variant.
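
      A minimal sketch that only uses the first Program line (assuming the cmk -D output layout shown above):

      cmk -D k8s | sed -n 's/^ *Program: //p' | head -n 1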


    2. The special agent provides the following options for debugging purposes:

      OMD[mysite]:~$ /omd/sites/mysite/share/check_mk/agents/special/agent_kube -h
      ...
      --debug                     Debug mode: raise Python exceptions
      -v / --verbose              Verbose mode (for even more output use -vvv)
      --vcrtrace FILENAME         Enables VCR tracing for the API calls
      ...


    3. Now you can modify the above command of the Kubernetes special agent like this:

      OMD[mysite]:~$ /omd/sites/mysite/share/check_mk/agents/special/agent_kube  \
      '--cluster' 'at' \
      '--token' 'xyz' \
      '--monitored-objects' 'deployments' 'daemonsets' 'statefulsets' 'nodes' 'pods' \
      '--api-server-endpoint' 'https://<YOUR-IP>:6443' \
      '--api-server-proxy' 'FROM_ENVIRONMENT' \
      '--cluster-collector-endpoint' 'https://<YOUR-ENDPOINT>:30035' \
      '--cluster-collector-proxy' 'FROM_ENVIRONMENT' \
      --debug -vvv --vcrtrace ~/tmp/vcrtrace.txt > ~/tmp/k8s_with_debug.txt 2>&1

      Here, you can also reduce the '--monitored-objects' list to a few resources to get less output.
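
      Once the run has finished, a quick way to scan the debug output for problems (a simple sketch; adjust the patterns to your needs):

      grep -iE 'error|exception|traceback' ~/tmp/k8s_with_debug.txt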

    4. Run the special agent without debug options to create a plain agent output, or download the output from the cluster host via the Checkmk web interface.

      /omd/sites/mysite/share/check_mk/agents/special/agent_kube '--cluster' 'at' '--token' 'xyz' '--monitored-objects' 'deployments' 'daemonsets' 'statefulsets' 'nodes' 'pods' '--api-server-endpoint' 'https://<YOUR-IP>:6443' '--api-server-proxy' 'FROM_ENVIRONMENT' '--cluster-collector-endpoint' 'https://<YOUR-ENDPOINT>:30035' '--cluster-collector-proxy' 'FROM_ENVIRONMENT' > ~/tmp/k8s_agent_output.txt 2>&1
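
      To get a quick overview of which piggyback hosts the agent produced, you can list the piggyback markers in the output (assuming the standard Checkmk piggyback format, where sections for other hosts are wrapped in <<<<host>>>> markers):

      grep -oE '^<<<<[^>]*>>>>' ~/tmp/k8s_agent_output.txt | sort -u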


  2. Please upload the following files to the support ticket.

  • ~/tmp/vcrtrace.txt: trace file
  • ~/tmp/k8s_with_debug.txt: debug output
  • ~/tmp/k8s_agent_output.txt: agent output

Common errors

  • Context: the Kubernetes special agent is slightly unconventional compared with other special agents, as it handles up to three different datasources (the Kubernetes API, the cluster collector container metrics, and the cluster collector node metrics)
    • the connection to the Kubernetes API server is mandatory, while the connection to the others is optional (and decided through the configured Datasource rule)
      • Failure to connect to the Kubernetes API server will be shown by the Checkmk service (as usual) → the agent crashes
      • Failure to connect to the cluster collector will be highlighted in the Cluster Collector service → the error is not raised by the agent in production
        • the error is only raised when executing the agent with the --debug flag

  • Version: We only support the latest three Kubernetes versions (https://kubernetes.io/releases/#:~:text=The%20Kubernetes%20project%20maintains%20release,9%20months%20of%20patch%20support.)
    • If a customer has the latest release and the release itself is quite new (less than one month old), ask one of the devs if we already support it.
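    • To check which version the cluster is actually running, you can query the API server directly with kubectl:

      $ kubectl version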

  • Kubernetes API connection error: If the agent fails to connect to the Kubernetes API (e.g., a 401 Unauthorized when querying api/v1/core/pods), the output produced with the --debug flag should be sufficient; a manual check is sketched after this list.
    • Common causes:
      • The service account was not configured correctly in the Kubernetes cluster.
      • A wrong token is configured.
      • The ca.crt was not uploaded under Global settings >> Trusted certificate authorities for SSL, but --verify-cert-api is enabled.
      • Wrong IP or port.
      • The proxy is not configured in the datasource rule.
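    • To rule out token, certificate, or network issues independently of Checkmk, you can query the API server directly (a sketch; <YOUR-IP> and $TOKEN are the values configured in the datasource rule, and -k skips certificate verification for testing only):

      curl -k -H "Authorization: Bearer $TOKEN" https://<YOUR-IP>:6443/version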

  • Checkmk Cluster Collector connection error (a manual connectivity test is sketched after this list):
    • Common causes:
      • The cluster collector is not exposed via NodePort, LoadBalancer, or Ingress.
      • Essential resources (pods, deployments, DaemonSets, ReplicaSets, etc.) are not running or are restarting frequently.
      • A firewall or a security group blocks the cluster collector IP.
      • Wrong IP or port.
      • The ca.crt was not uploaded under Global settings >> Trusted certificate authorities for SSL, but --verify-cert-collector is enabled.
      • The proxy is not configured in the datasource rule.
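    • The collector connection can be tested the same way as shown in the Helm chart NOTES above (endpoint and token as configured in the datasource rule):

      curl -H "Authorization: Bearer $TOKEN" http://<YOUR-ENDPOINT>:30035/metadata | jq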

  • API processing error: If the agent reports a bug similar to "value ... was not set", the user should be asked for the vcrtrace file.