Troubleshooting stuck or lingering downtime alerts

A downtime can linger on a host or service even though it is no longer relevant. The issue is rare, but it has been reported by a customer.

LAST TESTED ON CHECKMK 2.2.0P1

Getting Started

Background information on this subject is available in our official documentation.

Problem

The customer cannot clear an ongoing downtime on a specific host. The downtime was started without a set end time.

Screenshot of Host last check was 48 minutes ago

Troubleshooting

  1. Check the downtimes for both services and hosts:
    Screenshot of Display rules for recurring downtimes for services
    Screenshot of Display rules for recurring downtimes for hosts


  2. If that does not give enough information, run a Livestatus query to list all current downtimes.

    OMD[mysite]~$ lq "GET downtimes\nColumns: downtime_author downtime_comment downtime_duration downtime_end_time downtime_entry_time downtime_fixed downtime_id downtime_is_service downtime_origin downtime_recurring downtime_start_time host_has_been_checked host_labels host_name host_scheduled_downtime_depth host_state service_description service_has_been_checked service_state"

  3. If nothing else turns up, search the history file in ~/var/check_mk/core/ for the summary (comment) of the downtime.

    In this example, the summary to search for is 'DT2'.

    OMD[mysite]~$ grep -rl <DOWNTIMESUMMARY> ~/var/check_mk/core/history
    OMD[mysite]:~/var/check_mk/core$ grep -r DT2 history
    history:[1684243614] EXTERNAL COMMAND: SCHEDULE_HOST_DOWNTIME;localhost2;1684243614;1684243734;1;0;0;cmkadmin;DT2
    history:[1684243614] HOST DOWNTIME ALERT: localhost2;STARTED;DT2
    OMD[mysite]:~/var/check_mk/core$

  4. If the history file is large, it can also help to search the files in ~/var/check_mk/core/archive/. The history files contain Unix timestamps, which help to place each entry in time.

    OMD[mysite]~$ grep -rl <DOWNTIMESUMMARY> ~/var/check_mk/core/archive/*
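
The Unix timestamps in a history entry are easier to compare with the GUI once they are converted to readable times. The following is a minimal sketch, not a Checkmk tool: history_times and to_utc are hypothetical helper names, GNU date is assumed, and the field layout of SCHEDULE_HOST_DOWNTIME (command;host;start;end;...) is assumed to be the usual semicolon-separated one.

```shell
#!/bin/sh
# Sketch only: history_times and to_utc are hypothetical helpers, not Checkmk commands.
# Assumes GNU date, which accepts "-d @SECONDS".

to_utc() {
    # Convert seconds since the epoch to a readable UTC time.
    date -u -d "@$1" '+%Y-%m-%d %H:%M:%S UTC'
}

history_times() {
    line=$1
    # Every history entry starts with "[<unix timestamp>]": the time it was logged.
    logged=$(printf '%s\n' "$line" | sed -n 's/^\[\([0-9][0-9]*\)].*/\1/p')
    if [ -n "$logged" ]; then
        echo "logged: $(to_utc "$logged")"
    fi
    # Assumed layout: SCHEDULE_HOST_DOWNTIME;host;start;end;...
    case $line in
        *SCHEDULE_HOST_DOWNTIME*)
            start=$(printf '%s\n' "$line" | cut -d';' -f3)
            end=$(printf '%s\n' "$line" | cut -d';' -f4)
            echo "start: $(to_utc "$start")"
            echo "end: $(to_utc "$end")"
            ;;
    esac
}

# Example with the DT2 entry from this article:
history_times '[1684243614] EXTERNAL COMMAND: SCHEDULE_HOST_DOWNTIME;localhost2;1684243614;1684243734;1;0;0;cmkadmin;DT2'
```

For the DT2 entry, this reports 2023-05-16 13:26:54 UTC as the logged and start time and 2023-05-16 13:28:54 UTC as the end time.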


Solution

Warning

The following steps are not officially supported. Create a backup of the Checkmk site before you proceed.


If the event is found in the history file but nowhere else:

  1. Stop the site 

    OMD[mysite]~$ omd stop

  2. Open the history file in a command-line text editor (such as vim) and remove the entry.

    OMD[mysite]~$ vi ~/var/check_mk/core/history

  3. Start the site again.

    OMD[mysite]~$ omd start

    These steps will effectively clear the active lingering downtime.
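
As an alternative to editing the file by hand, the removal can be scripted so that a copy of the original is always kept. The following is only a sketch under assumptions: remove_downtime_entries is a hypothetical helper, it assumes the downtime summary is the last semicolon-separated field of each affected line (as in the DT2 example), and it treats the summary as a plain word without regular-expression characters. Like the manual steps, it is not officially supported and should only be run while the site is stopped.

```shell
#!/bin/sh
# Sketch only: remove_downtime_entries is a hypothetical helper, not a Checkmk command.
# Run it only while the site is stopped.

remove_downtime_entries() {
    history_file=$1   # for example ~/var/check_mk/core/history
    summary=$2        # downtime summary to remove, for example DT2
    backup="${history_file}.bak"

    # Keep an untouched copy so the change can be reverted.
    cp -p "$history_file" "$backup" || return 1

    # Write back every line that does not end in ";<summary>".
    # grep exits with 1 when no lines remain, which is not an error here.
    grep -v -- ";${summary}\$" "$backup" > "$history_file" || [ $? -eq 1 ]
}
```

On a stopped site it could be called as remove_downtime_entries ~/var/check_mk/core/history DT2. The original file remains available as history.bak next to it.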