Debugging Special Agents (Combined)

This article helps debug issues with various Checkmk special agents.

LAST TESTED ON CHECKMK 2.0.0P1




AWS special agent

If you want to execute the special agent from the command line, please run the following commands.


Checkmk versions 1.6, 2.0, and 2.1

For Checkmk 1.6 and below
echo '{"access_key_id": "xxxxxxxxxxxxxxxx", "secret_access_key": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"}' | ~/share/check_mk/agents/special/agent_aws '--regions' 'eu-central-1' '--services' 'cloudwatch_alarms' 'dynamodb' 'ebs' 'ec2' 'elb' 'elbv2' 'glacier' 'rds' 's3' 'wafv2' '--ec2-limits' '--ebs-limits' '--s3-limits' '--glacier-limits' '--elb-limits' '--elbv2-limits' '--rds-limits' '--cloudwatch_alarms-limits' '--dynamodb-limits' '--wafv2-limits' '--cloudwatch-alarms' '--wafv2-cloudfront' '--hostname' 'aws' --debug --v


Checkmk Versions 2.2 and above

For Checkmk 2.0, 2.1 and 2.2
/omd/sites/mysite/share/check_mk/agents/special/agent_aws --access-key-id MYACCESSKEYID --secret-access-key MYSECRETACCESSKEY --regions MYAWSREGION --global-services ce cloudfront route53 --services cloudwatch_alarms dynamodb ebs ec2 ecs elasticache elb elbv2 glacier lambda rds s3 sns wafv2 --ec2-limits --ebs-limits --s3-limits --glacier-limits --elb-limits --elbv2-limits --rds-limits --cloudwatch_alarms-limits --dynamodb-limits --wafv2-limits --lambda-limits --sns-limits --ecs-limits --elasticache-limits --s3-requests --cloudwatch-alarms --wafv2-cloudfront --cloudfront-host-assignment aws_host --hostname aws --piggyback-naming-convention ip_region_instance


Viewing AWS Options

To view more AWS options, run cmk -D aws and filter the output with grep:

OMD[mysite]:~$ cmk -D aws |grep -A2  "Type of agent"

Type of agent:          
  Program: /omd/sites/mysite/share/check_mk/agents/special/agent_aws --access-key-id MYACCESSKEY --secret-access-key MYSECRETKEY --regions MYAWSREGION --global-services ce cloudfront route53 --services cloudwatch_alarms dynamodb ebs ec2 ecs elasticache elb elbv2 glacier lambda rds s3 sns wafv2 --ec2-limits --ebs-limits --s3-limits --glacier-limits --elb-limits --elbv2-limits --rds-limits --cloudwatch_alarms-limits --dynamodb-limits --wafv2-limits --lambda-limits --sns-limits --ecs-limits --elasticache-limits --s3-requests --cloudwatch-alarms --wafv2-cloudfront --cloudfront-host-assignment aws_host --hostname aws --piggyback-naming-convention ip_region_instance
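As a minimal sketch, the Program line(s) can also be pulled out of the cmk -D output with a small helper. The function name extract_agent_programs is hypothetical and not part of Checkmk; it only assumes the output layout shown above.

```shell
# Hypothetical helper (not part of Checkmk): print every special agent
# command found in "cmk -D <host>" output read from stdin.
extract_agent_programs() {
    # Keep only lines of the form "  Program: <command>" and strip the label.
    # "Program stdin:" lines do not match because no colon follows "Program".
    sed -n 's/^[[:space:]]*Program:[[:space:]]*//p'
}

# Usage (as site user): cmk -D aws | extract_agent_programs
```

Unlike grep -A2, this prints one command per line even when a host has several Program entries.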


Azure special agent


Azure endpoints

If you're experiencing connectivity issues or receiving unexpected data, please ensure that your endpoints are available from the Checkmk server.

Please review Microsoft's documentation for further information.

Checkmk versions 1.6 and below

If you want to execute the special agent from the command line, please run the following commands.

For Checkmk 1.6 and below
echo '{"secret": "xxxxxxxxx"}'|/omd/sites/mysite/share/check_mk/agents/special/agent_azure '--subscription' 'xxxxxxxxx' '--tenant' 'xxxxxxxxx' '--client' 'xxxxxxxxx' '--piggyback_vms' 'self' --debug -vvv  --vcrtrace /tmp/TRACEFILE


  • To enable debugging, you need the parameter "--debug"
  • To enable verbose output, you need the parameter "-vvv"

Checkmk Versions 2.0 and above

For Checkmk 2.0, 2.1 and 2.2
/omd/sites/mysite/share/check_mk/agents/special/agent_azure '--subscription' 'MYSUBSCRIPTIONKEY' '--tenant' 'MYTENANTKEY' '--client' 'MYCLIENTKEY' '--piggyback_vms' 'self' --debug -vvv  --vcrtrace /tmp/TRACEFILE


Display Azure Options

To view more Azure options, run cmk -D azure and filter the output with grep:

OMD[mysite]:~$ cmk -D azure |grep -A2  "Type of agent"
Type of agent:          
  Program: /omd/sites/mysite/share/check_mk/agents/special/agent_azure --tenant MYTENANTKEY --client MYCLIENTKEY --secret MYSECRETKEY --subscription MYSUBSCRIPTIONKEY --piggyback_vms self --services users_count ad_connect app_registrations Microsoft.Compute/virtualMachines Microsoft.Network/virtualNetworkGateways Microsoft.Sql/servers/databases Microsoft.Storage/storageAccounts Microsoft.Web/sites Microsoft.DBforMySQL/servers Microsoft.DBforPostgreSQL/servers Microsoft.Network/trafficmanagerprofiles Microsoft.Network/loadBalancers Microsoft.RecoveryServices/vaults Microsoft.Network/applicationGateways
  Program: /omd/sites/mysite/share/check_mk/agents/special/agent_azure_status australiacentral australiacentral2 australiaeast australiasoutheast brazilsouth brazilsoutheast canadacentral germanynorth germanywestcentral koreacentral westeurope



Troubleshooting Microsoft Azure - "Graph client: Insufficient privileges to complete the operation" error


If you see the error message "Graph client: Insufficient privileges to complete the operation." when connecting to Azure, do the following:

  1. Open the Azure Portal

  2. Click Azure Active Directory 

    Screenshot of Azure services. Azure Active Directory highlighted.


  3. Click App registrations in the left bar

    Screenshot of the Azure Active Directory sidebar. App registrations is highlighted.


  4. Click the app you registered for Checkmk


  5. Click API permissions in the left bar

    Screenshot of the Monitoring sidebar. API permissions highlighted.


  6. Click Add Permissions and add the required permissions for Microsoft Graph

Screenshot of the API permissions screen. Directory.Read.All and User.Read.All listed.



Full list of access rights needed:


These are the metrics we get via the Azure agents

Resource URI                                Metric name
Microsoft.Network/virtualNetworkGateways    AverageBandwidth, P2SBandwidth
Microsoft.Sql/servers/databases             storage_percent, deadlock, cpu_percent, dtu_consumption_percent, connection_successful, connection_failed
Microsoft.Storage/storageAccounts           UsedCapacity, Ingress, Egress, Transactions, SuccessServerLatency, SuccessE2ELatency, Availability
Microsoft.Web/sites


SSL error - bad handshake


Problem

When trying to monitor your Microsoft Azure environment, you see the following error message in the Checkmk service Azure Agent Info:

Screenshot of the Azure Agent Info service. The state is currently at Critical with an SSLError of bad handshake.


Solution

You need to make sure that your Checkmk server can connect to the following two addresses of MS Azure: management.azure.com and login.microsoftonline.com

When a connection from your Checkmk server is impossible or times out, monitoring Azure will not be possible. You can quickly check this as the site user of your Checkmk site with either Telnet or Netcat:

OMD[mysite]:~$ nc -zv login.microsoftonline.com 443
OMD[mysite]:~$ nc -zv management.azure.com 443


The output of these commands should look like this:

OMD[mysite]:~$ nc -zv login.microsoftonline.com 443
Connection to login.microsoftonline.com 443 port [tcp/https] succeeded!

OMD[mysite]:~$ nc -zv management.azure.com 443
Connection to management.azure.com 443 port [tcp/https] succeeded!


If the output looks any different, check the connection from your Checkmk server to Azure or contact your network team. A firewall blocking this connection is a common cause.
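The two checks can be wrapped in a small loop. This is a sketch under assumptions: check_endpoints and the PROBE_CMD override are hypothetical names introduced here, and the default probe assumes that your netcat supports the -z and -w options.

```shell
# Sketch: probe each endpoint on port 443 and report the result.
# PROBE_CMD is a hypothetical override (useful for testing); by default
# netcat is called with a 5 second timeout.
check_endpoints() {
    failed=0
    for host in "$@"; do
        # Intentionally unquoted so that the default command splits into words.
        if ${PROBE_CMD:-nc -z -w 5} "$host" 443 >/dev/null 2>&1; then
            echo "OK   $host:443"
        else
            echo "FAIL $host:443"
            failed=1
        fi
    done
    return "$failed"
}

# Usage (as site user):
#   check_endpoints login.microsoftonline.com management.azure.com
```

The function returns a non-zero exit code as soon as one endpoint cannot be reached, so it can also be used in a script.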

BI special agent


With Werk #6679, we introduced the new BI special agent. The special agents receive the key directly via stdin.


If you want to execute the special agent from the command line, please run the following command.

Step-by-step guide

  1. If you use special agents installed from a Feature Pack, you can find the special agents in:

    OMD[mysite]:~$ ls ~/local/share/check_mk/agents/special/

  2. The special agents which are already shipped with Checkmk can be found here:

    OMD[mysite]:~$ ls ~/share/check_mk/agents/special/

  3. Display the special agent call:

    OMD[mysite]:~$ cmk -D bi |head -n15
    
    bi                                                                             
    Addresses:              No IP
    Tags:                   [address_family:no-ip], [agent:cmk-agent], [checkmk-agent:checkmk-agent], [criticality:prod], [networking:lan], [piggyback:auto-piggyback], [site:bi], [snmp_ds:no-snmp], [tcp:tcp]
    Labels:                 [cmk/site:mysite]
    Host groups:            check_mk
    Contact groups:         all
    Agent mode:             Normal Checkmk agent, or special agent if configured
    Type of agent:          
      Program: /omd/sites/mysite/share/check_mk/agents/special/agent_bi 
      Program stdin:
    [{'site': 'local', 'credentials': 'automation', 'filter': {'groups': ['Hosts']}, 'assignments': {'querying_host': 'querying_host'}}]
      Process piggyback data from /omd/sites/bi/tmp/check_mk/piggyback/bi
    Services:
      checktype                 item                                 params                                             description                               groups

    With this command, you get the special agent call, including the parameters it receives via stdin. Now you can continue debugging.

  4. Execute the command

    echo "[{'site': 'local', 'credentials': 'automation', 'filter': {'groups': ['Hosts']}, 'assignments': {'querying_host': 'querying_host'}}]" | /omd/sites/mysite/share/check_mk/agents/special/agent_bi

Jenkins special agent and access rights

For the Jenkins Special Agent to work, the user that logs on to Jenkins has to have the following access rights:

  • General: Read

  • Agent: Connect

  • Element: Read & Workspace

  • Views: Read

Kubernetes - k8s special agent

Getting Started

Background information regarding this subject is available on our:


Installation of Checkmk Cluster Collectors (a.k.a Checkmk Kubernetes agent)

We strongly recommend using our helm charts for installing the Checkmk Cluster Collectors unless you are very experienced with Kubernetes and want to install the agent using the manifests in YAML format provided by us.

Please remember that we cannot support you in installing the agent while using the manifests.


The Helm chart installs and configures all necessary components to run the agent and exposes several helpful configuration options that will help you automatically set up complex resources. The prerequisites have to be fulfilled before you proceed with the installation.

Below is an example of deploying the Helm chart using a LoadBalancer (this requires that the cluster is able to create a LoadBalancer):

$ helm repo add checkmk-chart https://checkmk.github.io/checkmk_kube_agent

$ helm repo update

$ helm upgrade --install --create-namespace -n checkmk-monitoring checkmk checkmk-chart/checkmk --set clusterCollector.service.type="LoadBalancer"

Release "checkmk" does not exist. Installing it now.
NAME: checkmk
LAST DEPLOYED: Tue May 17 22:01:07 2022
NAMESPACE: checkmk-monitoring
STATUS: deployed
REVISION: 1
TEST SUITE: None

NOTES:You can access the checkmk `cluster-collector` via:

LoadBalancer:
==========  
NOTE: It may take a few minutes for the LoadBalancer IP to be available.
      You can watch the status of by running 'kubectl get --namespace checkmk-monitoring svc -w checkmk-cluster-collector'

  export SERVICE_IP=$(kubectl get svc --namespace checkmk-monitoring checkmk-cluster-collector --template "{{ range (index .status.loadBalancer.ingress 0) }}{{.}}{{ end }}");

  echo http://$SERVICE_IP:8080

# Cluster-internal DNS of `cluster-collector`: checkmk-cluster-collector.checkmk-monitoring
=========================================================================================
With the token of the service account named `checkmk-checkmk` in the namespace `checkmk-monitoring` you can now issue queries against the `cluster-collector`.

Run the following to fetch its token and the ca-certificate of the cluster:

  export TOKEN=$(kubectl get secret checkmk-checkmk -n checkmk-monitoring -o=jsonpath='{.data.token}' | base64 --decode);
  export CA_CRT="$(kubectl get secret checkmk-checkmk -n checkmk-monitoring -o=jsonpath='{.data.ca\.crt}' | base64 --decode)";
  
# Note: Quote the variable when echo'ing to preserve proper line breaks: `echo "$CA_CRT"`

To test access you can run:
  curl -H "Authorization: Bearer $TOKEN" http://$SERVICE_IP:8080/metadata | jq


You can pass further configuration options to the helm command above. The following are examples; depending on your requirements, you can combine several of them:

Flags and their descriptions:

--set clusterCollector.service.type="LoadBalancer"
    This sets the cluster collector service type to LoadBalancer.

--set clusterCollector.service.type="NodePort" --set clusterCollector.service.nodePort=30035
    Here, you can specify a different service type and port.

--version 1.0.0-beta.2
    Specifies a version constraint for the chart version to use.

We recommend using a values.yaml file to configure your Helm chart.

To do so, run the following command:

$ helm upgrade --install --create-namespace -n checkmk-monitoring myrelease checkmk-chart/checkmk -f values.yaml
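As a hedged sketch, a values.yaml equivalent to the NodePort flags mentioned earlier could be generated as follows. The helper name write_values_yaml is hypothetical; the keys simply mirror the --set paths clusterCollector.service.type and clusterCollector.service.nodePort, and the chart may offer many more options.

```shell
# Sketch: write a minimal values.yaml that mirrors the NodePort flags.
# $1: target file (optional, defaults to ./values.yaml).
write_values_yaml() {
    cat > "${1:-values.yaml}" <<'EOF'
clusterCollector:
  service:
    type: NodePort
    nodePort: 30035
EOF
}

# Usage:
#   write_values_yaml values.yaml
#   helm upgrade --install --create-namespace -n checkmk-monitoring myrelease checkmk-chart/checkmk -f values.yaml
```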


After the chart has been successfully deployed, you will be presented with a set of commands to access the cluster-collector from the command line. To display those commands again later, run:

helm status checkmk -n checkmk-monitoring


At the same time, you can also verify that all the essential resources in the namespace have been deployed successfully. The following command lists some important resources:


$ kubectl get all -n checkmk-monitoring

NAME                                                    READY   STATUS    RESTARTS   AGE
pod/checkmk-cluster-collector-57c7f5f54b-xgqvx          1/1     Running   0          19m
pod/checkmk-node-collector-container-metrics-lflhs      2/2     Running   0          20m
pod/checkmk-node-collector-container-metrics-s59lb      2/2     Running   0          20m
pod/checkmk-node-collector-container-metrics-tnccf      2/2     Running   0          20m
pod/checkmk-node-collector-machine-sections-9k441       1/1     Running   0          20m
pod/checkmk-node-collector-machine-sections-fc795       1/1     Running   0          19m
pod/checkmk-node-collector-machine-sections-lfv9l       1/1     Running   0          20m

NAME                                TYPE           CLUSTER-IP     EXTERNAL-IP    PORT(S)          AGE
service/checkmk-cluster-collector   LoadBalancer   10.20.10.165   34.107.19.22   8080:31168/TCP   20m

NAME                                                      DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/checkmk-node-collector-container-metrics   3         3         3       3            3           <none>          20m
daemonset.apps/checkmk-node-collector-machine-sections    3         3         3       3            3           <none>          20m

NAME                                        READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/checkmk-cluster-collector   1/1     1            1           20m

NAME                                                   DESIRED   CURRENT   READY   AGE
replicaset.apps/checkmk-cluster-collector-57c7f5f54b   1         1         1       20m
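
With many pods, scanning the list by eye gets tedious. A minimal sketch of a filter that prints only pods that are not fully ready; not_ready_pods is a hypothetical helper and assumes the default column layout of kubectl get pods (NAME, READY, STATUS, ...):

```shell
# Sketch: read "kubectl get pods" output on stdin and print every pod
# whose READY counts differ or whose STATUS is not "Running".
not_ready_pods() {
    awk '
        $1 == "NAME" || NF < 3 { next }      # skip header and blank lines
        {
            split($2, ready, "/")            # READY column, e.g. "2/2"
            if (ready[1] != ready[2] || $3 != "Running")
                print $1
        }
    '
}

# Usage: kubectl get pods -n checkmk-monitoring | not_ready_pods
```

An empty result means that every listed pod is running with all of its containers ready.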

Exposing the Checkmk Cluster Collector

By default, the API of the Checkmk Cluster Collector (not to be confused with the Kubernetes API itself) is not exposed to the outside. Exposing it is required to gather usage metrics and enrich your monitoring.
Checkmk pulls data from this API, which can be exposed via the service checkmk-cluster-collector. To do so, you must run it with one of the following flags or set them in a values.yaml.

Flags and their descriptions:

--set clusterCollector.service.type="LoadBalancer"
    This sets the cluster collector service type to LoadBalancer.

--set clusterCollector.service.type="NodePort" --set clusterCollector.service.nodePort=30035
    Here, you can specify a different service type and port.

Debugging K8s special agent

  1. The first step would be to find the complete command of the Kubernetes special agent.
     
    1. The command can be found under "Type of agent >> Program." It will consist of multiple parameters depending on how the datasource program rule has been configured. 

      OMD[mysite]:~$ cmk -D k8s | more
      
      k8s 
      Addresses: No IP
      Tags: [address_family:no-ip], [agent:special-agents], [criticality:prod], [networking:lan],
      [piggyback:auto-piggyback], [site:a21], [snmp_ds:no-snmp], [tcp:tcp]
      Labels: [cmk/kubernetes/cluster:at], [cmk/kubernetes/object:cluster], [cmk/site:k8s]
      Host groups: check_mk
      Contact groups: all
      Agent mode: No Checkmk agent, all configured special agents
      Type of agent: 
      Program: /omd/sites/mysite/share/check_mk/agents/special/agent_kube '--cluster' 'k8s' '--token' 'xyz' '--monitored-objects' 'deployments' 'daemonsets' 'statefulsets' 'nodes' 'pods' '--api-server-endpoint' 'https://<YOUR-IP>:6443' '--api-server-proxy' 'FROM_ENVIRONMENT' '--cluster-collector-endpoint' 'https://<YOUR-ENDPOINT>:30035' '--cluster-collector-proxy' 'FROM_ENVIRONMENT'
      Process piggyback data from /omd/sites/mysite/tmp/check_mk/piggyback/k8s
      Services:
      ...

      An easier way would be this command: /bin/sh -c "$(cmk -D k8s | grep -A1 "^Type of agent:" | grep "Program:" | cut -f2- -d':')"

      Please note that if the output contains more than one line matching "^Type of agent:" followed by a line matching "^  Program:", this one-liner may not assemble the command correctly.

    2. The special agent has the below options available for debugging purposes:

      OMD[mysite]:~$ /omd/sites/mysite/share/check_mk/agents/special/agent_kube -h
      ...
      --debug                     Debug mode: raise Python exceptions
      -v / --verbose              Verbose mode (for even more output use -vvv)
      --vcrtrace FILENAME         Enables VCR tracing for the API calls
      ...


    3. Now, you can modify the above command of the Kubernetes special agent like this:

      OMD[mysite]:~$ /omd/sites/mysite/share/check_mk/agents/special/agent_kube  \
      '--cluster' 'at' \
      '--token' 'xyz' \
      '--monitored-objects' 'deployments' 'daemonsets' 'statefulsets' 'nodes' 'pods' \
      '--api-server-endpoint' 'https://<YOUR-IP>:6443' \
      '--api-server-proxy' 'FROM_ENVIRONMENT' \
      '--cluster-collector-endpoint' 'https://<YOUR-ENDPOINT>:30035' \
      '--cluster-collector-proxy' 'FROM_ENVIRONMENT' \
      --debug -vvv --vcrtrace ~/tmp/vcrtrace.txt > ~/tmp/k8s_with_debug.txt 2>&1

      Here, you can also reduce the number of '--monitored-objects' to a few resources to get less output. 

    4. Run the special agent with no debug options to create an agent output, or you could download it from the cluster host via the Checkmk web interface. 

      /omd/sites/mysite/share/check_mk/agents/special/agent_kube '--cluster' 'at' '--token' 'xyz' '--monitored-objects' 'deployments' 'daemonsets' 'statefulsets' 'nodes' 'pods' '--api-server-endpoint' 'https://<YOUR-IP>:6443' '--api-server-proxy' 'FROM_ENVIRONMENT' '--cluster-collector-endpoint' 'https://<YOUR-ENDPOINT>:30035' '--cluster-collector-proxy' 'FROM_ENVIRONMENT' > ~/tmp/k8s_agent_output.txt 2>&1

  2. Please upload the following files to the support ticket.

~/tmp/vcrtrace.txt Tracefile
~/tmp/k8s_with_debug.txt Debug output
~/tmp/k8s_agent_output.txt Agent output
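
The commands above end in "> FILE 2>&1" so that both the agent output and the debug messages land in one file. A small sketch of that pattern; emit_both is a hypothetical stand-in for an agent that writes to stdout and stderr, and capture_all is a hypothetical wrapper:

```shell
# Hypothetical stand-in for an agent: one line on stdout, one on stderr.
emit_both() {
    echo "agent section output"
    echo "debug message" >&2
}

# Sketch: capture both streams in one file.
# stdout is redirected first, then stderr is pointed at the same target.
capture_all() {
    # $1: output file; remaining arguments: command to run
    out_file=$1
    shift
    "$@" > "$out_file" 2>&1
}

# Usage: capture_all ~/tmp/k8s_with_debug.txt <agent command with --debug -vvv>
```

The order matters: with "2>&1 > FILE", stderr would still go to the terminal and the debug messages would be missing from the file.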

Common errors

  • Context: the Kubernetes special agent is slightly unconventional relative to other Special agents as it handles up to three different datasources (the API, the cluster collector container metrics, and the cluster collector node metrics)
    • the connection to the Kubernetes API server is mandatory, while the connection to the others is optional (and decided through the configured Datasource rule)
      • Failure to connect to the Kubernetes API server will be shown by the Checkmk service (as usual) → the agent crashes
      • Failure to connect to the cluster collector will be highlighted in the Cluster Collector service → the error is not raised by the agent in production
        • the error is only raised when executing the agent with the --debug flag

  • Version: We only support the latest three Kubernetes versions (Kubernetes Release History)
    • If a customer has the latest release and the release itself is quite new (less than one month), ask one of the devs if we already have support.

  • Kubernetes API connection error: If the agent fails to make a connection to the Kubernetes API (e.g., 401 Unauthorized to query api/v1/core/pods), then the output based on the --debug flag should be sufficient
    • common causes:
      • service account was not configured correctly in the Kubernetes cluster 
      • wrong token configured
      • The ca.crt was not uploaded under Global settings >> Trusted certificate authorities for SSL, but --verify-cert-api is enabled.
      • Wrong IP or Port
      • Proxy is not configured in the datasource rule.

  • Checkmk Cluster  Collector connection error:
    • Common causes:
      • The cluster collector is not exposed via either NodePort or Ingress.
      • The essential resources like pods, deployments, daemon-sets, replicas, etc., are not running or frequently restarting.
      • A firewall or a security group blocks the cluster collector IP. 
      • Port/IP incorrect.
      • The ca.crt was not uploaded under Global settings >> Trusted certificate authorities for SSL, but --verify-cert-api is enabled.
      • Proxy is not configured in the datasource rule.

  • API processing error: If the agent reports a bug similar to "value ... was not set, " the user should be asked for the vcrtrace file.


Linux agent over SSH

Problem

When executing the Checkmk agent for Linux via SSH, you might encounter error messages when something is not configured properly. Usually, the service Check_MK will notify you about any connection problems that might occur. Below, we list a couple of these error messages and give some pointers as to what might solve them.

screenshot of Services of host myserver. Check_mk service reports Permission denied, please try again later.


Solution

Error Message 01

Agent exited with code 255: Permission denied, please try again.

Possible Cause

The public key in the file authorized_keys on the host might contain an error. This can easily happen when - for example - a line break is somehow inserted in the key, or you omitted a single character, when copying the key to the host.

Possible Solution

Double-check that the public key on the host you are trying to monitor is exactly the same as the one on your Checkmk server.
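
Comparing long keys by eye is error-prone. As a sketch, the comparison can be scripted; key_is_authorized is a hypothetical helper, and the file names in the usage line are assumptions (use the key pair and the copy of authorized_keys that apply to your setup):

```shell
# Sketch: succeed if the key blob of a public key file appears as a
# complete field in an authorized_keys file.
# $1: public key file, $2: authorized_keys file (or a copy of it)
key_is_authorized() {
    # The second field of a public key file is the base64 key blob.
    blob=$(awk 'NF >= 2 { print $2; exit }' "$1")
    [ -n "$blob" ] || return 2
    # authorized_keys lines may start with options, so check every field.
    awk -v blob="$blob" '
        { for (i = 1; i <= NF; i++) if ($i == blob) found = 1 }
        END { exit found ? 0 : 1 }
    ' "$2"
}

# Usage (file names are examples):
#   key_is_authorized ~/.ssh/id_rsa.pub /tmp/authorized_keys_from_host && echo "key matches"
```

A key that was split by a line break or lost a character no longer matches as a whole field, so the function fails in exactly the cases described above.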


Error Message 02

Agent exited with code 255: Host key verification failed. CRIT, Got no information from the host, execution time 0.0 sec


Possible Cause

The error message here is clear: the host key verification failed. But what does this mean? It usually means that your Checkmk server has never connected to the host before, and hence the key fingerprint is not available in the file ~/.ssh/known_hosts on your Checkmk server.

Possible Solution

This one can be resolved easily. Log in to your site and create an SSH connection to your host. SSH should now ask you if you actually want to connect to this machine. You should answer by typing 'yes'. This will add the host to the list of known hosts.


Netapp

  1. Log in to the Checkmk server and become the site user

    root@linux:# su - mysite
    
    OMD[mysite]:~$ cmk -D <netapp_host> | head -n 15   

    This should display the whole special agent query, including all arguments (similar to vSphere debugging)

  2. Copy the complete command from the Program line of that output
  3. Paste it and add the debug option to it like so:

    /omd/sites/mysite/share/check_mk/agents/special/agent_netapp 'hostname' 'user' 'password' --vcrtrace /tmp/TRACEFILE '-no_counters' --debug --xml > /tmp/debug.txt 2>&1

  4. Add the agent_netapp command line (password stripped) and the debug.txt to your support ticket
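
Stripping the password by hand is easy to forget. A minimal sketch of a filter that replaces a literal secret in any text; redact_literal and the REDACT_SECRET variable are hypothetical names. Note that the debug output may contain the password as well, so it is worth running the files through the same filter before uploading them.

```shell
# Sketch: replace every literal occurrence of a secret in stdin with
# the word REDACTED. The secret is passed via the environment so that
# awk does not interpret backslashes in it.
redact_literal() {
    REDACT_SECRET=$1 awk '
        BEGIN { secret = ENVIRON["REDACT_SECRET"] }
        secret == "" { print; next }
        {
            rest = $0
            out = ""
            while ((pos = index(rest, secret)) > 0) {
                out = out substr(rest, 1, pos - 1) "REDACTED"
                rest = substr(rest, pos + length(secret))
            }
            print out rest
        }
    '
}

# Usage: redact_literal 'password' < /tmp/debug.txt > /tmp/debug_redacted.txt
```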

Special Agents with parameters via stdin

Step-by-step guide

A couple of our special agents get their parameters via stdin. For example, the Prometheus special agent or the AWS special agent. You can see this in the output of the command cmk -D myhost. If after the line for Program  you find a line beginning with Program stdin, you have to pipe these parameters into the special agent with echo.

Let's say you want to debug the special agent for Prometheus. You configured a rule and pinned it to the host myprometheushost. Log in as the site user and run the following command:

OMD[mysite]:~$ cmk -D myprometheushost


The output will look something like this:

myprometheushost                                                               
Addresses:              10.18.49.2
Tags:                   [address_family:ip-v4-only], [agent:cmk-agent], [checkmk-agent:checkmk-agent], [criticality:prod], [ip-v4:ip-v4], [networking:lan], [piggyback:auto-piggyback], [site:kube], [snmp_ds:no-snmp], [tcp:tcp]
Labels:                 [cmk/site:mysite]
Host groups:            check_mk
Contact groups:         all
Agent mode:             Normal Checkmk agent, or special agent if configured
Type of agent:          
  Program: /omd/sites/kube/local/share/check_mk/agents/special/agent_prometheus 
  Program stdin:
{'connection': ('ip_address', {'port': 31275}), 'verify-cert': False, 'protocol': 'http', 'exporter': [('kube_state', {'cluster_name': 'mypromcluster', 'prepend_namespaces': 'use_namespace', 'entities': ['cluster', 'nodes', 'services', 'pods', 'daemon_sets']})], 'promql_checks': [], 'host_address': '10.18.49.2', 'host_name': 'myprometheushost'}
  Process piggyback data from /omd/sites/mysite/tmp/check_mk/piggyback/myprometheushost


Now go ahead and copy the block after Program stdin, wrap it in double quotes, and prepend it with an echo. Next, add a pipe and the path to the special agent that you find in the line starting with Program. Together, it looks like this:

OMD[mysite]:~$ echo "{'connection': ('ip_address', {'port': 31275}), 'verify-cert': False, 'protocol': 'http', 'exporter': [('kube_state', {'cluster_name': 'mypromcluster', 'prepend_namespaces': 'use_namespace', 'entities': ['cluster', 'nodes', 'services', 'pods', 'daemon_sets']})], 'promql_checks': [], 'host_address': '10.18.49.2', 'host_name': 'myprometheushost'}" | /omd/sites/mysite/local/share/check_mk/agents/special/agent_prometheus


In most cases, the special agents offer the possibility to activate verbose output or debug output from Python. Simply append -vvv  and/or --debug at the very end of the command above.
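
If copying and quoting the stdin block by hand is awkward, the block can also be piped straight from the cmk -D output. This is a sketch under the assumption that the stdin block is a single line directly after "Program stdin:", as in the examples above; agent_stdin_block is a hypothetical helper name.

```shell
# Sketch: print the line that follows "Program stdin:" in "cmk -D" output.
agent_stdin_block() {
    awk '/^[[:space:]]*Program stdin:/ { if ((getline line) > 0) print line; exit }'
}

# Usage (as site user), with the agent path taken from the Program line:
#   cmk -D myprometheushost | agent_stdin_block | \
#       /omd/sites/mysite/local/share/check_mk/agents/special/agent_prometheus --debug
```

Piping the block avoids quoting problems that can occur when the parameters themselves contain double quotes.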

StoreOnce 4x special agent

Problem

The StoreOnce agent crashes with the following message:

<<<storeonce4x_d2d_services:sep(0)>>>
Traceback (most recent call last):
  File "/omd/sites/mysite/lib/python3/requests_oauthlib/oauth2_session.py", line 477, in request
    url, headers, data = self._client.add_token(
  File "/omd/sites/mysite/lib/python3/oauthlib/oauth2/rfc6749/clients/base.py", line 198, in add_token
    raise TokenExpiredError()
oauthlib.oauth2.rfc6749.errors.TokenExpiredError: (token_expired)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "share/check_mk/agents/special/agent_storeonce4x", line 10, in <module>
    main()
  File "/omd/sites/mysite/lib/python3/cmk/special_agents/agent_storeonce4x.py", line 260, in main
    special_agent_main(parse_arguments, agent_storeonce4x_main)
  File "/omd/sites/mysite/lib/python3/cmk/special_agents/utils/agent_common.py", line 159, in special_agent_main
    _special_agent_main_core(parse_arguments, main_fn, sys.argv[1:])
  File "/omd/sites/mysite/lib/python3/cmk/special_agents/utils/agent_common.py", line 141, in _special_agent_main_core
    main_fn(args)
  File "/omd/sites/mysite/lib/python3/cmk/special_agents/agent_storeonce4x.py", line 251, in agent_storeonce4x_main
    writer.append_json(function(oauth_session))
  File "/omd/sites/mysite/lib/python3/cmk/special_agents/utils/agent_common.py", line 51, in append_json
    for l in data:
  File "/omd/sites/mysite/lib/python3/cmk/special_agents/agent_storeonce4x.py", line 154, in handler_simple
    yield from (requester.get(uri) for uri in uris)
  File "/omd/sites/mysite/lib/python3/cmk/special_agents/agent_storeonce4x.py", line 154, in <genexpr>
    yield from (requester.get(uri) for uri in uris)
  File "/omd/sites/mysite/lib/python3/cmk/special_agents/agent_storeonce4x.py", line 142, in get
    resp = self._oauth_session.request(
  File "/omd/sites/mysite/lib/python3/requests_oauthlib/oauth2_session.py", line 496, in request
    token = self.refresh_token(
  File "/omd/sites/mysite/lib/python3/requests_oauthlib/oauth2_session.py", line 446, in refresh_token
    self.token = self._client.parse_request_body_response(r.text, scope=self.scope)
  File "/omd/sites/mysite/lib/python3/oauthlib/oauth2/rfc6749/clients/base.py", line 421, in parse_request_body_response
    self.token = parse_token_response(body, scope=scope)
  File "/omd/sites/mysite/lib/python3/oauthlib/oauth2/rfc6749/parameters.py", line 431, in parse_token_response
    validate_token_parameters(params)
  File "/omd/sites/mysite/lib/python3/oauthlib/oauth2/rfc6749/parameters.py", line 441, in validate_token_parameters
    raise MissingTokenError(description="Missing access token parameter.")
oauthlib.oauth2.rfc6749.errors.MissingTokenError: (missing_token) Missing access token parameter.


Solution

  1. Example with Special Agent of storeonce4x

    1. Find out the detailed special agent command (Type of agent column)

      OMD[mysite]:~$ cmk -D hostname

      An easier way would be this command: /bin/sh -c "$(cmk -D hostname | grep -A1 "^Type of agent:" | grep "Program:" | cut -f2- -d':')"

      Please note that if the output contains more than one line matching "^Type of agent:" followed by a line matching "^  Program:", this one-liner may not assemble the command correctly.

    2. Check if there are some options for debugging

      OMD[mysite]:~$ ~/share/check_mk/agents/special/agent_storeonce4x -h


      There are three options for debugging the request:


      --debug, -d           Enable debug mode (keep some exceptions unhandled)
      --verbose, -v
      --vcrtrace TRACEFILE, --tracefile TRACEFILE
                                  If this flag is set to a TRACEFILE that does not exist yet, it will be created and
                                  all requests the program sends and their corresponding answers will be recorded in said file.
                                  If the file already exists, no requests are sent to the server, but the responses will be
                                  replayed from the tracefile. 
      


    3. Modify the special agent command by adding these three options

      OMD[mysite]:~$ ~/share/check_mk/agents/special/agent_storeonce4x <OTHER ARGUMENTS> --debug -v --vcrtrace ~/tmp/vcrtrace.txt > ~/tmp/storeonce4x_with_debug.txt 2>&1

    4. Run the special agent with no debug options to create an agent output. With this file, we can reproduce your issue

      OMD[mysite]:~$ /omd/sites/mysite/share/check_mk/agents/special/agent_storeonce4x <OTHER ARGUMENTS> > ~/tmp/storeonce4x_agent_output.txt

  2. Rename the token file

    The storeonce4x special agent is using username/password for authentication. After the successful login, we obtain the access token. The access token is used for future REST requests.

    If you want to read more, you can check this out: https://hewlettpackard.github.io/storeonce-rest/#Authentication

    1. We save the token file inside the site in

      ~/tmp/check_mk/special_agents/agent_storeonce4x/<hostname>_oAuthToken.json


    2. Rename the file to <hostname>_oAuthToken.json.back

      OMD[mysite]:~$ mv ~/tmp/check_mk/special_agents/agent_storeonce4x/<hostname>_oAuthToken.json ~/tmp/check_mk/special_agents/agent_storeonce4x/<hostname>_oAuthToken.json.back

    3. Run the special agent again
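
The rename step can be sketched as a small function that only moves the file if it exists. backup_token_file is a hypothetical helper; the default directory is the token path named above, and the optional second argument is an assumption added to make the function easy to test.

```shell
# Sketch: move the cached OAuth token of one host out of the way.
# $1: host name; $2: token directory (defaults to the path named above).
backup_token_file() {
    token_dir=${2:-"$HOME/tmp/check_mk/special_agents/agent_storeonce4x"}
    token_file="$token_dir/${1}_oAuthToken.json"
    if [ -f "$token_file" ]; then
        mv "$token_file" "$token_file.back"
        echo "moved $token_file to $token_file.back"
    else
        echo "no token file found at $token_file" >&2
        return 1
    fi
}

# Usage (as site user): backup_token_file <hostname>
```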


VMware vSphere

Although containers and their management with Kubernetes took the IT industry by storm, virtualization still has its "right to exist" in on-prem environments and wherever containerization does not fit.

This is an extension to Monitoring VMware ESXi

Getting Started

Background information regarding this subject is available in our Official documentation

Datastore provisioning in vSphere

When you add the vCenter to Checkmk, you automatically get full insight into datastore provisioning, and you can be alerted if too many VMs are provisioned as "Thin" and thus claim more logical space than the datastore can physically provide.

If you add only the ESXi host, you will probably still see the "provisioning" value in your filesystems, but it is then identical to the "Used filesystem" value. Only the vCenter knows the real provisioned values.
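
To make the overcommitment concrete, here is a small worked example. The numbers are made up, and the function name `provisioning_percent` is hypothetical. It only illustrates the arithmetic behind the idea and is not a description of how Checkmk computes the value internally.

```shell
#!/bin/sh
# Sketch: provisioned space as a percentage of the physical datastore capacity.
# All sizes are in GB; integer arithmetic, rounded down.
provisioning_percent() {
    capacity_gb="$1"
    shift
    total=0
    for vm_gb in "$@"; do
        total=$((total + vm_gb))
    done
    echo $((total * 100 / capacity_gb))
}

# Three thin-provisioned VMs of 400 GB each on a 1000 GB datastore
provisioning_percent 1000 400 400 400   # prints 120
```

A value above 100 means the VMs could, in total, claim more space than the datastore physically has.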


Piggyback-only with ESXi hosts

Although we consider it best practice to use a read-only user on both the ESXi hosts and the vCenter to allow continuous monitoring, in some cases direct access to the hosts may not be allowed.

If piggyback is configured correctly, you only need a read-only user with access to the vCenter inventory. Tip: use a local vSphere user (e.g., monitoring@vsphere.local) rather than an AD user, as AD authentication might time out during the query. Test access by logging in with this user at the vSphere console site.

When you add the vCenter with all available data and then add the ESXi hosts in Checkmk, piggyback automatically assigns all resources to the ESXi hosts as you see them in the vCenter (e.g., CPU, memory, datastores).

Disadvantage: if the ESXi hosts are not monitored directly via the special agent (or SNMP, which we have seen, too), the local partitions on the hosts are not visible.


More piggyback! Snapshot monitoring

When you add the VMs as hosts in Checkmk, several more checks are automatically added to them without any further configuration. Their names begin with "ESX", and they mainly display the VM's resource consumption.

One of them, "ESX Snapshots", lets you monitor all existing snapshots of the VM and alerts you if they get too old. This is very useful for reminding POs to delete their manually created snapshots in a timely fashion.


Basic debugging

  1. Example with the vSphere special agent

    1. Find out the detailed special agent command

      OMD[mysite]:~$ cmk -D <vcenter-host> | more
      
      vcenter 
      Addresses: x.x.x.x
      Tags: [add_ip_addresses:add_ip_addresses_1], [address_family:ip-v4-only], [agent:special-agents], [criticality:prod], 
      [ip-v4:ip-v4], [networking:lan], [piggyback:auto-piggyback], [site:nagnis_master], [snmp_ds:no-snmp], [tcp:tcp]
      Labels: [cmk/vsphere_object:vm]
      Host groups: check_mk
      Contact groups: all
      Agent mode: No Checkmk agent, all configured special agents
      Type of agent: 
      Program: /omd/sites/mysite/share/check_mk/agents/special/agent_vsphere -u 'user' -s 'password' -i hostsystem,virtualmachine,datastore,counters,licenses -P --spaces cut --snapshot_display vCenter --no-cert-check 'x.x.x.x'
      Process piggyback data from /omd/sites/mysite/tmp/check_mk/piggyback/vcenter
      Services:
      checktype item params

      An easier way is to extract and run the command in one step:

      OMD[mysite]:~$ /bin/sh -c "$(cmk -D vcenter | grep -A1 "^Type of agent:" | grep "^  Program:" | cut -f2- -d':')"

      Please note that if a line matching "^Type of agent:" followed by a line matching "^  Program:" appears more than once, this command may not work as expected.

    2. Check if there are options for debugging.

      OMD[mysite]:~$ /omd/sites/mysite/share/check_mk/agents/special/agent_vsphere -h

      There are two options for debugging the request.

      --debug                       Debug mode: let Python exceptions come through
      
      --tracefile FILENAME          Log all outgoing and incoming data into the given tracefile
      


    3. Modify the special agent command by adding these two options

      OMD[mysite]:~$ /omd/sites/mysite/share/check_mk/agents/special/agent_vsphere -u 'user' -s 'password' --debug --tracefile $OMD_ROOT/tmp/vcenter.out -i hostsystem,virtualmachine,datastore,counters,licenses -P --spaces cut --no-cert-check 'x.x.x.x' > $OMD_ROOT/tmp/vcenter.debug

      In Checkmk 1.6.0, you might find the option "--snapshot_display vCenter" in your cmk -D output. If so, include this parameter as well.

    4. Run the special agent without any debug options to create an agent output. With this file, we can reproduce your issue.

      OMD[mysite]:~$ /omd/sites/mysite/share/check_mk/agents/special/agent_vsphere -u 'user' -s 'password' -i hostsystem,virtualmachine,datastore,counters,licenses -P --spaces cut --no-cert-check 'x.x.x.x' > ~/tmp/agent.output

  2. Please send us all three files so that we can investigate further.

    ~/tmp/vcenter.debug      # Debug Output
    ~/tmp/vcenter.out        # Tracefile
    ~/tmp/agent.output       # Agent Output
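
As an alternative to the grep/cut pipeline in step 1, the command can also be extracted with awk. The following is a minimal sketch, not an official Checkmk tool: the function name `extract_agent_command` is hypothetical, and it assumes the `cmk -D` output has the layout shown above, with an indented "Program:" line directly after "Type of agent:". It reads the output on stdin and only prints the command, so you can review it (it contains credentials) before running it.

```shell
#!/bin/sh
# Sketch: print the special agent command found in "cmk -D <host>" output.
# Only a "Program:" line directly following "Type of agent:" is considered.
extract_agent_command() {
    awk '
        /^Type of agent:/ { in_agent = 1; next }
        in_agent && /^[ \t]+Program:/ {
            # Strip the label, keep everything after it (including any colons)
            sub(/^[ \t]+Program:[ \t]*/, "")
            print
        }
        { in_agent = 0 }
    '
}
```

Example usage on a site: `cmk -D vcenter | extract_agent_command`.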

Advanced Debugging Examples

Collect several agent outputs over a period of time (here: every 60 seconds for 10 minutes):

t=60; s=0; while [ $s -le 600 ]; do echo $s; cmk -d "$VSPHERE_HOST" > /tmp/agent_vsphere_output.$s; s=$((s+t)); sleep $t; done


Collect several trace files over a period of time:

t=60; s=0; while [ $s -le 600 ]; do echo $s; ./agent_vsphere --tracefile /tmp/agent_vsphere_trace.$s $OTHER_COMMAND_PARAMS; s=$((s+t)); sleep $t; done
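
If you need such loops more often, the pattern can be put into a reusable function. This is a sketch under the same assumptions as the one-liners above; the function name `collect_periodically` and its parameter order (interval, total duration, output prefix, command) are hypothetical.

```shell
#!/bin/sh
# Sketch: run a command every INTERVAL seconds until TOTAL seconds have passed
# and write each run's stdout to PREFIX.<elapsed seconds>.
# INTERVAL must be greater than 0, otherwise the loop never ends.
collect_periodically() {
    interval="$1"
    total="$2"
    prefix="$3"
    shift 3

    elapsed=0
    while [ "$elapsed" -le "$total" ]; do
        "$@" > "$prefix.$elapsed"
        elapsed=$((elapsed + interval))
        # Skip the final sleep once the last sample has been taken
        if [ "$elapsed" -le "$total" ]; then
            sleep "$interval"
        fi
    done
    return 0
}
```

For example, `collect_periodically 60 600 /tmp/agent_vsphere_output cmk -d "$VSPHERE_HOST"` writes one agent output per minute for ten minutes, matching the first one-liner above.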