Error handling with RRD files after conversion to the new format

Some customers may encounter errors after converting Round Robin Database (RRD) files to the new format.

LAST TESTED ON CHECKMK 2.1.0P1

Follow this troubleshooting article at your own risk! Make sure you have a complete backup before attempting any of these steps.

We have observed this rare behavior only on installations that have been updated or migrated over the years (e.g., (1.5 →) 1.6 → 2.0).

If you are unsure whether this applies to you, or if you need help, do not hesitate to contact us.

Prerequisites

This article assumes that rules are set up in 'Setup → Services → Service monitoring rules → Configuration of RRD databases of services' and 'Setup → Hosts → Host monitoring rules → Configuration of RRD databases of hosts' as follows:

Screenshot of creating a new service monitoring rule. The rule is Configuration of RRD databases of services.

Screenshot of creating a new host monitoring rule. The rule is Configuration of RRD databases of hosts.

Problem

After converting the RRD files to the new format (described here: https://docs.checkmk.com/latest/en/graphing.html#rrdformat), data may, in rare cases, still be written to ~/var/pnp4nagios/perfdata/<hostname>.

At the same time, you might see error messages like the following in cmc.log:

2021-09-20 09:18:14 [4] [rrdcached thread] [rrdcached at "/omd/sites/mysite/tmp/run/rrdcached.sock"] [log] -1 No such file: /omd/sites/mysite/var/pnp4nagios/perfdata/<hostname>/Memory_and_pagefile_pagefile_total.rrd
2021-09-20 09:18:14 [4] [rrdcached thread] [rrdcached at "/omd/sites/mysite/tmp/run/rrdcached.sock"] [log] -1 No such file: /omd/sites/mysite/var/pnp4nagios/perfdata/<hostname>/Memory_and_pagefile_pagefile_avg.rrd
2021-09-20 09:18:14 [4] [rrdcached thread] [rrdcached at "/omd/sites/mysite/tmp/run/rrdcached.sock"] [log] -1 No such file: /omd/sites/mysite/var/pnp4nagios/perfdata/<hostname>/Power_cpu0_Cores_w.rrd
2021-09-20 09:18:14 [4] [rrdcached thread] [rrdcached at "/omd/sites/mysite/tmp/run/rrdcached.sock"] [log] -1 No such file: /omd/sites/mysite/var/pnp4nagios/perfdata/<hostname>/Power_cpu0_DRAM_w.rrd
2021-09-20 09:18:14 [4] [rrdcached thread] [rrdcached at "/omd/sites/mysite/tmp/run/rrdcached.sock"] [log] -1 No such file: /omd/sites/mysite/var/pnp4nagios/perfdata/<hostname>/Power_cpu0_Graphics_w.rrd
2021-09-20 09:18:14 [4] [rrdcached thread] [rrdcached at "/omd/sites/mysite/tmp/run/rrdcached.sock"] [log] -1 No such file: /omd/sites/mysite/var/pnp4nagios/perfdata/<hostname>/Power_cpu0_Package_w.rrd
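
To find out which hosts are affected, you can extract the host names from these log lines. The following is a minimal shell sketch, not an official Checkmk tool: the function name list_missing_rrd_hosts is hypothetical, and it assumes that cmc.log is located at ~/var/log/cmc.log inside the site (pass a different path as the first argument if that is not the case).

```shell
# Hypothetical helper: print the unique host names for which rrdcached
# reports "No such file" below the legacy PNP4Nagios perfdata directory.
# Assumption: the log file is ~/var/log/cmc.log unless a path is given as $1.
list_missing_rrd_hosts() {
    local log="${1:-$HOME/var/log/cmc.log}"
    # Keep only the relevant rrdcached lines, then reduce each line to the
    # directory name directly below .../var/pnp4nagios/perfdata/ (the host).
    grep 'No such file: .*/var/pnp4nagios/perfdata/' "$log" \
        | sed -E 's#.*/var/pnp4nagios/perfdata/([^/]+)/.*#\1#' \
        | sort -u
}
```

As the site user, paste the function into the shell and call it:

    OMD[mysite]:~ list_missing_rrd_hosts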

At the same time, you can see that some RRD files in ~/var/pnp4nagios/perfdata/<hostname> are still being updated. Most likely, not all hosts are affected, only a few, possibly fewer than 10%.
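
Another way to spot affected hosts is to look for RRD files in the old location that are still being modified. The sketch below relies on GNU find; the function name list_recently_updated_legacy_hosts and the default time window of 30 minutes are assumptions for illustration, so adjust the window to your check interval.

```shell
# Hypothetical helper: print the host directories below the legacy
# perfdata path that contain *.rrd files modified within the last N minutes.
# Assumptions: base directory ~/var/pnp4nagios/perfdata ($1), window 30 min ($2).
list_recently_updated_legacy_hosts() {
    local base="${1:-$HOME/var/pnp4nagios/perfdata}"
    local minutes="${2:-30}"
    # Depth 2 means <base>/<hostname>/<file>.rrd; %P prints the path relative
    # to the base, so the first path component is the host name.
    find "$base" -mindepth 2 -maxdepth 2 -type f -name '*.rrd' -mmin "-$minutes" -printf '%P\n' \
        | cut -d/ -f1 \
        | sort -u
}
```

For example, to check the last hour:

    OMD[mysite]:~ list_recently_updated_legacy_hosts ~/var/pnp4nagios/perfdata 60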

Solution


  1. Switch to the site user and stop the site (if not already done).

  2. Run the following command for one of the affected hosts:

    OMD[mysite]:~ cmk -vv --convert-rrds --delete-rrds <hostname>

  3. If messages like the following appear:

    HOST_ PNP -> CMC WARNING: XML /opt/omd/sites/mysite/var/pnp4nagios/perfdata/<hostname>/_HOST_.xml refers to not existing RRD /opt/omd/sites/mysite/var/pnp4nagios/perfdata/<hostname>/_HOST__rta.rrd. Nothing to convert. Cleanup the XML file manually in case this is OK.

  4. You can delete the XML file named in the warning.
  5. Restart the site. The RRD files should now be written correctly.
  6. If this works, run the command for each remaining affected host separately, as shown above, or for all hosts at once.

    Please note that the site must be stopped for this:

    OMD[mysite]:~ cmk -vv --convert-rrds --delete-rrds
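
If several hosts are affected, steps 2 to 4 can be wrapped in a small shell loop. This is a sketch under the following assumptions: it runs as the site user with the site already stopped, the host names are passed as arguments, and the helper names convert_hosts and xml_files_to_clean are hypothetical. It only uses the cmk command shown above and deletes nothing; it prints the XML files named in the "refers to not existing RRD" warnings so that you can review and delete them manually.

```shell
# Hypothetical helper: read cmk output on stdin and print each XML file
# named in a "refers to not existing RRD" warning, once.
xml_files_to_clean() {
    grep -o 'XML [^ ]*\.xml refers to not existing RRD' \
        | awk '{print $2}' \
        | sort -u
}

# Hypothetical helper: convert the given hosts one by one.
# Assumption: the site is already stopped (omd stop).
convert_hosts() {
    local host
    for host in "$@"; do
        echo "== $host" >&2
        # Show the full cmk output on stderr and list the XML candidates on stdout.
        cmk -vv --convert-rrds --delete-rrds "$host" 2>&1 \
            | tee /dev/stderr \
            | xml_files_to_clean
    done
}
```

A possible session, where host01 and host02 are placeholders:

    OMD[mysite]:~ omd stop
    OMD[mysite]:~ convert_hosts host01 host02
    OMD[mysite]:~ omd start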