Customers occasionally report a lingering downtime that is no longer relevant. Although this issue is rare, a similar customer case is documented HERE.
Last tested on Checkmk 2.2.0p1
Problem
The customer cannot clear a persistent downtime on a specific host. The downtime was set without an associated end time.
Troubleshooting
- First, check the downtimes for both hosts and services.
If nothing relevant turns up, run a Livestatus query to retrieve all existing downtimes:
OMD[mysite]~$ lq "GET downtimes\nColumns: downtime_author downtime_comment downtime_duration downtime_end_time downtime_entry_time downtime_fixed downtime_id downtime_is_service downtime_origin downtime_recurring downtime_start_time host_has_been_checked host_labels host_name host_scheduled_downtime_depth host_state service_description service_has_been_checked service_state"
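To narrow the output to the affected host, a filter can be added to the query. The following is a minimal sketch rather than part of the documented procedure: `downtime_query` is a hypothetical helper, `myhost` is a placeholder host name, and the usage line assumes that `lq` reads the query from standard input when called without arguments.

```shell
# Hypothetical helper: print a Livestatus query for the downtimes of one host.
# Column names are given without the "downtime_" prefix here.
downtime_query() {
    printf 'GET downtimes\n'
    printf 'Columns: id host_name service_description author comment start_time end_time fixed\n'
    printf 'Filter: host_name = %s\n' "$1"
}

# Usage on the site (assumes lq accepts the query on stdin):
#   OMD[mysite]~$ downtime_query myhost | lq
```

The `id` column identifies each downtime in the result, and `start_time` and `end_time` are returned as Unix timestamps.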
If still nothing is found, search the history file in ~/var/check_mk/core for the summary of the downtime. In this scenario, the summary to search for is 'DT2'.
OMD[mysite]~$ grep -rl <DOWNTIMESUMMARY> ~/var/check_mk/core/history
If the entry is not in the current history file, it is also worth searching the older files in ~/var/check_mk/core/archive. The entries in these history files carry Unix timestamps, which help narrow down when the downtime was set.
OMD[mysite]~$ grep -rl <DOWNTIMESUMMARY> ~/var/check_mk/core/archive/*
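Because the timestamps are in Unix epoch format, converting them makes the matching entries easier to read. The sketch below assumes that each history line starts with a bracketed timestamp such as `[1700000000]` and that GNU `date` is available. `humanize_history` is a hypothetical helper name, not a Checkmk command.

```shell
# Hypothetical helper: replace a leading "[<epoch>]" on each input line
# with a readable local date. Lines without that prefix pass through unchanged.
humanize_history() {
    while IFS= read -r line; do
        case $line in
            \[[0-9]*\]\ *)
                ts=${line#\[}        # drop the leading "["
                ts=${ts%%\]*}        # keep only the digits before "]"
                rest=${line#*\] }    # text after the "] " separator
                # date -d "@<epoch>" is GNU date syntax (an assumption here)
                printf '%s %s\n' "$(date -d "@$ts" '+%Y-%m-%d %H:%M:%S')" "$rest"
                ;;
            *)
                printf '%s\n' "$line"
                ;;
        esac
    done
}

# Usage on the site, with 'DT2' as the summary from this scenario:
#   OMD[mysite]~$ grep -h 'DT2' ~/var/check_mk/core/history | humanize_history
```

The date is printed in the local time zone of the shell; set `TZ=UTC` beforehand if UTC is preferred.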
Solution
Warning
The following steps are unsupported. Create a backup of the Checkmk site before proceeding.
If the event is found in the history file but nowhere else:
Stop the site.
OMD[mysite]~$ omd stop
Open a CLI text editor and remove the entry from the history file.
OMD[mysite]~$ vi ~/var/check_mk/core/history
Start the site again.
OMD[mysite]~$ omd start
These steps clear the lingering downtime.
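Where editing the file by hand is impractical, the manual edit of the history file could be done non-interactively. This is only a sketch under assumptions: `remove_history_lines` is a hypothetical helper, it removes every line containing the given string (which may be more than the one stale entry, so review the matches with grep first), and it should only be run while the site is stopped. The original file is kept as a `.bak` copy.

```shell
# Hypothetical helper: remove all lines containing a fixed string from a file.
# The untouched original is preserved as "<file>.bak".
remove_history_lines() {
    file=$1
    needle=$2
    cp -p "$file" "$file.bak" || return 1
    # -F treats the needle as a literal string, -v keeps non-matching lines.
    grep -vF -- "$needle" "$file.bak" > "$file"
    status=$?
    # grep exits with 1 when no lines remain; only 2 or higher is an error.
    [ "$status" -le 1 ]
}

# Usage on the stopped site, with 'DT2' as the summary from this scenario:
#   OMD[mysite]~$ remove_history_lines ~/var/check_mk/core/history 'DT2'
```

If the result is not as expected, copying the `.bak` file back over the history file restores the previous state.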