Troubleshooting outdated BI downtime alerts
Occasionally a downtime lingers on a host even though it is no longer relevant. The issue is rare, but it has been reported by a customer.
Last tested on Checkmk 2.2.0p1
Getting Started
Background information on this subject is available in our official documentation.
Problem
A specific host stays in a permanent downtime that the customer cannot remove. In the reported case, the downtime had been scheduled without an associated end time.
Troubleshooting
- Check the downtimes for both services and hosts.
If no relevant downtime is found there, run a Livestatus query to list all existing downtimes:
lq "GET downtimes\nColumns: downtime_author downtime_comment downtime_duration downtime_end_time downtime_entry_time downtime_fixed downtime_id downtime_is_service downtime_origin downtime_recurring downtime_start_time host_has_been_checked host_labels host_name host_scheduled_downtime_depth host_state service_description service_has_been_checked service_state"
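On a site with many downtimes the full listing can be long. As a minimal sketch, the query could be narrowed to a single host with a Livestatus Filter header. The function name build_downtime_query and the reduced column set are illustrative choices, not part of the original procedure; localhost2 is the host used in this article's example.

```shell
# Hypothetical helper: print a Livestatus query that lists only the
# downtimes of one host. The column subset is an illustrative choice.
build_downtime_query() {
    host=$1
    # printf expands \n in the format string, giving one header per line
    printf 'GET downtimes\nColumns: downtime_id host_name service_description downtime_author downtime_comment downtime_start_time downtime_end_time\nFilter: host_name = %s\n' "$host"
}
```

The result could then be sent to Livestatus with `build_downtime_query localhost2 | lq`, assuming lq reads the query from standard input when it is called without arguments.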
If the downtime still cannot be found, search the history file in ~/var/check_mk/core for the downtime's summary, that is, the comment entered when the downtime was scheduled. In this example, the summary to search for is 'DT2'.
OMD[mysite]:~$ grep -rl "<DOWNTIMESUMMARY>" ~/var/check_mk/core/history
OMD[mysite]:~/var/check_mk/core$ grep -r DT2
history:[1684243614] EXTERNAL COMMAND: SCHEDULE_HOST_DOWNTIME;localhost2;1684243614;1684243734;1;0;0;cmkadmin;DT2
history:[1684243614] HOST DOWNTIME ALERT: localhost2;STARTED;DT2
OMD[mysite]:~/var/check_mk/core$
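To read such an entry more easily, the arguments of the SCHEDULE_HOST_DOWNTIME command can be split into named fields. The sketch below assumes the field order of the Nagios external command format (host, start, end, fixed, trigger ID, duration, author, comment); the function name describe_host_downtime is hypothetical.

```shell
# Hypothetical helper: name the fields of a SCHEDULE_HOST_DOWNTIME line.
# Assumed field order: host;start;end;fixed;trigger_id;duration;author;comment
describe_host_downtime() {
    # Keep only the part after "SCHEDULE_HOST_DOWNTIME;"
    args=${1#*SCHEDULE_HOST_DOWNTIME;}
    # Split on ';' into the eight command arguments
    IFS=';' read -r host start end fixed trigger duration author comment <<EOF
$args
EOF
    printf 'host=%s start=%s end=%s fixed=%s author=%s comment=%s\n' \
        "$host" "$start" "$end" "$fixed" "$author" "$comment"
}
```

For example, `describe_host_downtime "$(grep -h SCHEDULE_HOST_DOWNTIME ~/var/check_mk/core/history | grep DT2 | head -n 1)"` would print the host, the start and end timestamps, and the author for the first matching entry.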
If the entry is not in the current history file, or that file is very large, it can also help to search the files in ~/var/check_mk/core/archive. Each history entry begins with a Unix timestamp, which helps to establish when the downtime was scheduled.
OMD[mysite]:~$ grep -rl "<DOWNTIMESUMMARY>" ~/var/check_mk/core/archive/*
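Because the Unix timestamps are hard to read at a glance, matching lines can be prefixed with a readable date. This is a minimal sketch that assumes GNU date is available (as on typical Linux hosts); the function name annotate_history_times is hypothetical.

```shell
# Hypothetical helper: prefix each history line read from standard input
# with the UTC time of its leading [epoch] timestamp. Requires GNU date.
annotate_history_times() {
    while IFS= read -r line; do
        # Extract the first bracketed run of digits, e.g. [1684243614]
        epoch=$(printf '%s\n' "$line" | sed -n 's/^[^[]*\[\([0-9][0-9]*\)\].*/\1/p')
        if [ -n "$epoch" ]; then
            printf '%s  %s\n' "$(date -u -d "@$epoch" '+%Y-%m-%d %H:%M:%S UTC')" "$line"
        else
            # Lines without a timestamp are passed through unchanged
            printf '%s\n' "$line"
        fi
    done
}
```

It could be used as `grep -r DT2 ~/var/check_mk/core/history ~/var/check_mk/core/archive | annotate_history_times`.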
Solution
Warning
Please note that the following steps are unsupported, and a backup of the Checkmk site should be created before proceeding.
If the event is found in the history file but nowhere else:
Stop the site:
OMD[mysite]:~$ omd stop
Open a CLI text editor and remove the entry from the history file:
OMD[mysite]:~$ vi ~/var/check_mk/core/history
Start the site again:
OMD[mysite]:~$ omd start
These steps will effectively clear the active lingering downtime.
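Since the history file is edited by hand, it may help to keep a copy of it and to list the matching line numbers before opening the editor, while the site is stopped. The following is only a sketch and not part of the original procedure; the function name backup_and_locate and the .bak suffix are illustrative assumptions.

```shell
# Hypothetical helper: copy the history file, then list the matching
# lines with their line numbers so they are easy to find in the editor.
backup_and_locate() {
    file=$1      # path to the history file
    needle=$2    # downtime summary to look for, e.g. DT2
    # -p preserves the mode and timestamps of the original on the copy
    cp -p -- "$file" "$file.bak" || return 1
    # -n prints line numbers, -F treats the summary as a literal string
    grep -nF -- "$needle" "$file"
}
```

For example, `backup_and_locate ~/var/check_mk/core/history DT2` would leave a copy at ~/var/check_mk/core/history.bak and print each matching line preceded by its line number.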