Troubleshooting a long-running Windows agent

This article helps debug long-running Windows agents. 

LAST TESTED ON CHECKMK 2.3.0P1


Problem

By default, a Windows agent run takes only a few seconds, which is well within the default timeout of 60 seconds. Agent extensions (e.g., plugins, local checks) or a misconfiguration can make the agent run longer, and this can lead to a timeout in the Checkmk UI.

Step-by-step Guide

Steps to figure out which section/plugin is affected

  1. Open PowerShell and run the agent locally to measure how long a run takes:

    Measure-Command { C:\ProgramData\checkmk\agent\bin\cmk-agent-ctl.exe -vv dump > agent_output.txt } > agent_output_time.txt

  2. Search the check_mk.log* files with the following grep commands:

    Disclaimer

    For easier troubleshooting, I moved the log files to a Linux server. The commands below rely on GNU grep (the -P option).



    This command lists all sections and how long each of them took, sorted by section name.

    grep -roP "Section '\w+' took \[\d+\]" | sort -t '[' -k1,1 | uniq -c



    This command lists all sections and how long each of them took, sorted by time in descending order so that the long-running sections appear first.

    grep -roP "Section '\w+' took \[\d+\]" | sort -t '[' -k2,2nr | uniq -c
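
If the logs cover many agent runs, a per-section summary can be easier to read than the raw list. The following is a minimal sketch, not part of the agent or of Checkmk: the function name summarize_sections is hypothetical, and it assumes that each log line contains at most one entry of the form Section 'name' took [number], as matched by the grep commands above. The number is reported as-is, without assuming a unit.

```shell
#!/usr/bin/env bash
# Hypothetical helper: summarize section run times from agent log text.
# Assumption: lines of interest contain  Section 'name' took [123]

summarize_sections() {
    # Reads log text on stdin.
    # Prints one line per section: "max mean count section",
    # ordered by the maximum time, largest first.
    sed -nE "s/.*Section '([A-Za-z0-9_]+)' took \[([0-9]+)\].*/\1 \2/p" |
        awk '
            {
                name = $1; t = $2 + 0
                count[name]++
                sum[name] += t
                if (t > max[name]) max[name] = t
            }
            END {
                for (n in count)
                    printf "%d %.1f %d %s\n", max[n], sum[n] / count[n], count[n], n
            }
        ' |
        sort -k1,1nr -k4,4
}
```

For example, run `cat check_mk.log* | summarize_sections` in the directory that holds the copied log files. A section with a high maximum but a low mean points to occasional slow runs rather than a consistently slow section.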

Solution

  • Before tweaking or changing anything, first understand why that section/plugin is taking so long.

  • Two possible solutions:

    • If the plugin needs more than 60 seconds to provide data, follow this guide to run it asynchronously:
      Asynchronous execution of Windows plugins 

    • If it is a section inside the agent itself, you have to change the agent interval.