How-to debug failed backup jobs

This how-to extends our general Checkmk backup article.

LAST TESTED ON CHECKMK 2.4.0P1

Problem

Basic information about mkbackup

After you configure a backup job, Checkmk creates a cron job for it. You can inspect this cron job, and run the backup manually, on the command line after logging in via SSH as the site user:

OMD[mysite]:~$ cat etc/cron.d/mkbackup
# Written by mkbackup configuration
0 0 * * * mkbackup backup mybackup >/dev/null
OMD[mysite]:~$ mkbackup backup mybackup
2022-05-17 16:02:22 --- Starting backup (Check_MK-cma-mysite-mybackup to mytarget) ---
2022-05-17 16:02:24 Verifying backup consistency
2022-05-17 16:02:24 Cleaning up previously completed backup
2022-05-17 16:02:24 --- Backup completed (Duration: 0:00:01, Size: 42.00 MB, IO: 0.42 B/s) ---
OMD[mysite]:~$
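If several backup jobs are configured, a small helper can summarize which job runs when. The following is only a sketch: the function name list_backup_schedules is hypothetical, and it assumes that every cron line has the layout shown above (five schedule fields followed by "mkbackup backup JOB").

```shell
# List each mkbackup job found in a cron file together with its schedule.
# Assumption: lines look like "0 0 * * * mkbackup backup mybackup >/dev/null".
# Usage: list_backup_schedules etc/cron.d/mkbackup
list_backup_schedules() {
    awk '
        /^[[:space:]]*#/ { next }    # skip comment lines
        # fields 1-5 are the schedule, field 8 is the job name
        $6 == "mkbackup" && $7 == "backup" {
            printf "%s\t%s %s %s %s %s\n", $8, $1, $2, $3, $4, $5
        }
    ' "$1"
}
```

After pasting the function into the site user's shell, call it with the cron file as its argument:

OMD[mysite]:~$ list_backup_schedules etc/cron.d/mkbackup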

 

If you need more debugging output, add --verbose and --debug to the mkbackup command:

OMD[mysite]:~$ mkbackup --verbose --debug backup mybackup
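The debug output can be long, so it may help to capture it in a file together with the exit status. The wrapper below is a minimal sketch. The function name run_logged and the log file path are hypothetical and not part of Checkmk.

```shell
# Run a command, capture stdout and stderr in a log file,
# and append the exit status of the command to that file.
# Usage: run_logged LOGFILE COMMAND [ARGS...]
run_logged() {
    log_file=$1
    shift
    "$@" >"$log_file" 2>&1
    status=$?
    printf 'exit status: %s\n' "$status" >>"$log_file"
    return "$status"
}
```

For example:

OMD[mysite]:~$ run_logged ~/mkbackup_debug.log mkbackup --verbose --debug backup mybackup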

 

Collection of error messages

Error 1

Failed to perform backup: [Errno 104] Connection reset by peer

2021-03-17 11:10:20 --- Starting backup ---
2021-03-17 11:10:20 Performing system backup (system.tar)
2021-03-17 11:10:25 Performing system data backup (system-data.tar)
2021-03-17 11:10:48 Performing site backup: test
Site backup failed: Failed to perform backup: [Errno 104] Connection reset by peer

 

Solution

Find the correct backup job

OMD[mysite]:~$ mkbackup jobs
Job-ID       Title
------------------------------------------------------------
myid         mytitle
OMD[mysite]:~$
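If you want to use the job IDs in a script, the table can be reduced to its first column. This is a sketch only: the function name job_ids is hypothetical, and it assumes the output layout shown above (a "Job-ID" header line, a line of dashes, then one job per line).

```shell
# Read the table printed by "mkbackup jobs" on stdin and print only the job IDs.
# Assumption: header line starts with "Job-ID", followed by a line of dashes.
job_ids() {
    awk '!/^Job-ID/ && !/^-+$/ && NF { print $1 }'
}
```

For example:

OMD[mysite]:~$ mkbackup jobs | job_ids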


Run the backup directly on the command line and redirect the output to a log file:

OMD[mysite]:~$ omd -v backup --no-compression mybackup - >~/path/to/my_backup.txt
Pausing RRD updates for /omd/sites/mysite/var/pnp4nagios/perfdata/myhost/my_disk_read.rrd
rrdcached command: SUSPEND /omd/sites/mysite/var/pnp4nagios/perfdata/myhost/my_disk_read.rrd
rrdcached response: '-1 /omd/sites/mysite/var/pnp4nagios/perfdata/myhost/my_disk_read.rrd - No such file or directory\n'
Resuming RRD updates for /omd/sites/mysite/var/pnp4nagios/perfdata/myhost/my_disk_read.rrd
rrdcached command: RESUME /omd/sites/mysite/var/pnp4nagios/perfdata/myhost/my_disk_read.rrd
skipping rrdcached command (broken pipe)
Pausing RRD updates for /omd/sites/mysite/var/pnp4nagios/perfdata/myhost/my_disk_write.rrd
rrdcached command: SUSPEND /omd/sites/mysite/var/pnp4nagios/perfdata/myhost/my_disk_write.rrd
rrdcached response: '-1 /omd/sites/mysite/var/pnp4nagios/perfdata/myhost/my_disk_write.rrd - No such file or directory\n'
Resuming RRD updates for /omd/sites/mysite/var/pnp4nagios/perfdata/myhost/my_disk_write.rrd
rrdcached command: RESUME /omd/sites/mysite/var/pnp4nagios/perfdata/myhost/my_disk_write.rrd
Failed to perform backup: [Errno 104] Connection reset by peer

Here the paths under var/pnp4nagios/perfdata indicate that the site still stores its performance data in the PNP4Nagios format. We recommend converting the performance data to the Checkmk RRD format. Please follow the steps described here: Customizing the RRD structure

Don't forget to stop the site before converting the files!

After the conversion, the backup should run without errors.
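To see at a glance which RRD files rrdcached could not find, you can filter a captured log for the "No such file or directory" responses. The function below is a sketch. Its name is hypothetical, and it assumes the response lines look exactly like the ones in the log excerpt above.

```shell
# Print the unique RRD paths that rrdcached reported as missing.
# Assumption: lines look like
#   rrdcached response: '-1 /path/file.rrd - No such file or directory\n'
# Usage: rrds_missing_in_rrdcached LOGFILE
rrds_missing_in_rrdcached() {
    sed -n "s/^rrdcached response: '-1 \(.*\) - No such file or directory.*/\1/p" "$1" | sort -u
}
```

For example, with the log file written by the command above:

OMD[mysite]:~$ rrds_missing_in_rrdcached ~/path/to/my_backup.txt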

 

Error 2

UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' in position 122: surrogates not allowed

Job state: Site mysite Backup
#############################################
Site backup
State    Failed
Runtime  Started at 2022-06-21 03:00:02, Finished at 2022-06-21 03:00:02 (Duration: 0:16:36)
Output
2022-06-21 03:00:02 --- Starting backup (Check_MK-mysite+cmk2-mysite-mysite+bak to Reload) ---
2022-06-21 03:00:02 Found previous incomplete backup. Cleaning up those files.
Site backup failed: Traceback (most recent call last):
  File "/omd/sites/mysite/bin/omd", line 60, in <module>
    omdlib.main.main()
  File "/omd/versions/2.0.0p23.cee/lib/python3/omdlib/main.py", line 4022, in main
    command.handler(version_info, site, global_opts, args, command_options)
  File "/omd/versions/2.0.0p23.cee/lib/python3/omdlib/main.py", line 2753, in main_backup
    omdlib.backup.backup_site_to_tarfile(site, fh, tar_mode, options, global_opts.verbose)
  File "/omd/versions/2.0.0p23.cee/lib/python3/omdlib/backup.py", line 54, in backup_site_to_tarfile
    _backup_site_files_to_tarfile(site, tar, options)
  File "/omd/versions/2.0.0p23.cee/lib/python3/omdlib/backup.py", line 112, in _backup_site_files_to_tarfile
    tar.add(site.dir, site.name, filter=filter_files)
  File "/omd/versions/2.0.0p23.cee/lib/python3/omdlib/backup.py", line 134, in add
    super(BackupTarFile, self).add(name, arcname, recursive, filter=filter)
  File "/omd/versions/2.0.0p23.cee/lib/python3.8/tarfile.py", line 1977, in add
    self.add(os.path.join(name, f), os.path.join(arcname, f),
  File "/omd/versions/2.0.0p23.cee/lib/python3/omdlib/backup.py", line 134, in add
    super(BackupTarFile, self).add(name, arcname, recursive, filter=filter)
  File "/omd/versions/2.0.0p23.cee/lib/python3.8/tarfile.py", line 1977, in add
    self.add(os.path.join(name, f), os.path.join(arcname, f),
  File "/omd/versions/2.0.0p23.cee/lib/python3/omdlib/backup.py", line 134, in add
    super(BackupTarFile, self).add(name, arcname, recursive, filter=filter)
  File "/omd/versions/2.0.0p23.cee/lib/python3.8/tarfile.py", line 1977, in add
    self.add(os.path.join(name, f), os.path.join(arcname, f),
  File "/omd/versions/2.0.0p23.cee/lib/python3/omdlib/backup.py", line 134, in add
    super(BackupTarFile, self).add(name, arcname, recursive, filter=filter)
  File "/omd/versions/2.0.0p23.cee/lib/python3.8/tarfile.py", line 1977, in add
    self.add(os.path.join(name, f), os.path.join(arcname, f),
  File "/omd/versions/2.0.0p23.cee/lib/python3/omdlib/backup.py", line 134, in add
    super(BackupTarFile, self).add(name, arcname, recursive, filter=filter)
  File "/omd/versions/2.0.0p23.cee/lib/python3.8/tarfile.py", line 1977, in add
    self.add(os.path.join(name, f), os.path.join(arcname, f),
  File "/omd/versions/2.0.0p23.cee/lib/python3/omdlib/backup.py", line 134, in add
    super(BackupTarFile, self).add(name, arcname, recursive, filter=filter)
  File "/omd/versions/2.0.0p23.cee/lib/python3.8/tarfile.py", line 1971, in add
    self.addfile(tarinfo, f)
  File "/omd/versions/2.0.0p23.cee/lib/python3/omdlib/backup.py", line 158, in addfile
    self._suspend_rrd_update(rrd_file_path)
  File "/omd/versions/2.0.0p23.cee/lib/python3/omdlib/backup.py", line 169, in _suspend_rrd_update
    self._send_rrdcached_command("SUSPEND %s" % path)
  File "/omd/versions/2.0.0p23.cee/lib/python3/omdlib/backup.py", line 199, in _send_rrdcached_command
    self._sock.sendall(("%s\n" % cmd).encode("utf-8"))
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' in position 122: surrogates not allowed

 

Solution

Run the backup directly on the command line and redirect the output to a log file:

OMD[mysite]:~$ omd -v backup --no-compression mybackup - >~/path/to/my_backup.txt


Let's check the log now:

...
rrdcached command: SUSPEND /opt/omd/sites/mysite/var/pnp4nagios/perfdata/mysite/Check_MK_Jun_17_12_29_15_49152.18456_MSSQLSERVER_NT-AUTORIT�.rrd
Traceback (most recent call last):
  File "/omd/sites/rrd2/bin/omd", line 60, in <module>
    omdlib.main.main()
  File "/omd/versions/2.0.0p26.cee/lib/python3/omdlib/main.py", line 4022, in main
    command.handler(version_info, site, global_opts, args, command_options)
  File "/omd/versions/2.0.0p26.cee/lib/python3/omdlib/main.py", line 2753, in main_backup
    omdlib.backup.backup_site_to_tarfile(site, fh, tar_mode, options, global_opts.verbose)
  File "/omd/versions/2.0.0p26.cee/lib/python3/omdlib/backup.py", line 54, in backup_site_to_tarfile
    _backup_site_files_to_tarfile(site, tar, options)
  File "/omd/versions/2.0.0p26.cee/lib/python3/omdlib/backup.py", line 112, in _backup_site_files_to_tarfile
    tar.add(site.dir, site.name, filter=filter_files)
  File "/omd/versions/2.0.0p26.cee/lib/python3/omdlib/backup.py", line 134, in add
    super(BackupTarFile, self).add(name, arcname, recursive, filter=filter)
  File "/omd/versions/2.0.0p26.cee/lib/python3.8/tarfile.py", line 1977, in add
    self.add(os.path.join(name, f), os.path.join(arcname, f),
  File "/omd/versions/2.0.0p26.cee/lib/python3/omdlib/backup.py", line 134, in add
    super(BackupTarFile, self).add(name, arcname, recursive, filter=filter)
  File "/omd/versions/2.0.0p26.cee/lib/python3.8/tarfile.py", line 1977, in add
    self.add(os.path.join(name, f), os.path.join(arcname, f),
  File "/omd/versions/2.0.0p26.cee/lib/python3/omdlib/backup.py", line 134, in add
    super(BackupTarFile, self).add(name, arcname, recursive, filter=filter)
  File "/omd/versions/2.0.0p26.cee/lib/python3.8/tarfile.py", line 1977, in add
    self.add(os.path.join(name, f), os.path.join(arcname, f),
  File "/omd/versions/2.0.0p26.cee/lib/python3/omdlib/backup.py", line 134, in add
    super(BackupTarFile, self).add(name, arcname, recursive, filter=filter)
  File "/omd/versions/2.0.0p26.cee/lib/python3.8/tarfile.py", line 1977, in add
    self.add(os.path.join(name, f), os.path.join(arcname, f),
  File "/omd/versions/2.0.0p26.cee/lib/python3/omdlib/backup.py", line 134, in add
    super(BackupTarFile, self).add(name, arcname, recursive, filter=filter)
  File "/omd/versions/2.0.0p26.cee/lib/python3.8/tarfile.py", line 1977, in add
    self.add(os.path.join(name, f), os.path.join(arcname, f),
  File "/omd/versions/2.0.0p26.cee/lib/python3/omdlib/backup.py", line 134, in add
    super(BackupTarFile, self).add(name, arcname, recursive, filter=filter)
  File "/omd/versions/2.0.0p26.cee/lib/python3.8/tarfile.py", line 1971, in add
    self.addfile(tarinfo, f)
  File "/omd/versions/2.0.0p26.cee/lib/python3/omdlib/backup.py", line 158, in addfile
    self._suspend_rrd_update(rrd_file_path)
  File "/omd/versions/2.0.0p26.cee/lib/python3/omdlib/backup.py", line 169, in _suspend_rrd_update
    self._send_rrdcached_command("SUSPEND %s" % path)
  File "/omd/versions/2.0.0p26.cee/lib/python3/omdlib/backup.py", line 199, in _send_rrdcached_command
    self._sock.sendall(("%s\n" % cmd).encode("utf-8"))
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' in position 122: surrogates not allowed

 

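The issue is that the file name ends with a byte that is not valid UTF-8, displayed here as "AUTORIT�.rrd". Python represents such an undecodable byte as a lone surrogate (here '\udcc3' for the byte 0xC3), which cannot be encoded when the path is sent to rrdcached.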

To correct this, rename or delete the file. Renaming is the safer option:

OMD[mysite]:~$ mv oldfilename newfilename
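More than one file may be affected, so it can be worth searching the performance data directory for further names that are not valid UTF-8 before rerunning the backup. The following is a minimal sketch, assuming that iconv is installed and exits with a non-zero status on invalid input (as GNU iconv does). The function name is hypothetical.

```shell
# Print every path below a directory whose name is not valid UTF-8.
# Note: this starts one iconv process per path, so it can be slow on large trees.
# Usage: find_invalid_utf8_names DIRECTORY
find_invalid_utf8_names() {
    find "$1" -exec sh -c '
        for path in "$@"; do
            # iconv fails if the path contains bytes that are not valid UTF-8
            if ! printf "%s" "$path" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
                printf "%s\n" "$path"
            fi
        done
    ' sh {} +
}
```

For example, for the directory from the log above:

OMD[mysite]:~$ find_invalid_utf8_names var/pnp4nagios/perfdata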

 

Advanced Debugging

Tracing backup with strace

If a backup fails or behaves unexpectedly and the normal backup output does not provide enough detail, you can use strace to trace which files and directories are accessed during the backup. This is useful for identifying permission issues, unexpected paths, or filesystem-related problems.

Run the backup with strace enabled

Switch to the Checkmk site user and change to the system /tmp directory.
Do not use the site’s ~/tmp directory.

OMD[mysite]:~$ cd /tmp


Run the backup command wrapped with strace:

OMD[mysite]:/tmp$ strace -f -s 4096 -t -o /tmp/strace_backup.log omd -vvv backup --no-compression mybackupjobname - >& backup.log

Replace mybackupjobname with the name of the backup job you want to analyze.

This command will:

  • Trace all system calls made during the backup, including child processes

  • Write the full trace output to /tmp/strace_backup.log

  • Run the backup in verbose mode

  • Disable compression to simplify analysis

  • Store the backup output (stdout and stderr) in backup.log in the current directory

 

Using -s 4096 with strace

The -s option in strace defines the maximum string size that will be printed in the output.

By default, strace only shows the first 32 characters of a string. This often results in truncated file paths, SQL queries, or command arguments, which can make troubleshooting more difficult.

Using -s 4096 increases the maximum displayed string length to 4096 characters. This is especially helpful when debugging backup issues because you can see:

  • Full file paths being accessed

  • Complete SQL queries

  • Full command line arguments

  • Complete data being sent or received

 

Example with full string output (recommended for detailed debugging)

# Recommended - see full strings
OMD[mysite]:/tmp$ strace -f -s 4096 -t -o /tmp/strace_backup.log omd -vvv backup --no-compression mybackupjobname - >& backup.log

This command provides much more detailed output, which is useful when investigating complex or unclear backup problems.

 

Example without -s 4096 (smaller output file)

# Will work but strings truncated
OMD[mysite]:/tmp$ strace -f -t -o /tmp/strace_backup.log omd -vvv backup --no-compression mybackupjobname - >& backup.log

This will still work, but strings will be truncated to 32 characters.

 

Important note
Using -s 4096 can produce a very large output file, especially on busy or large sites. If disk space is limited or the backup runs for a long time, you may prefer to:

  • Run without -s 4096 first

  • Only rerun with -s 4096 if more detail is required

In many cases, the default output is sufficient. Use -s 4096 when you specifically need full string visibility for deeper analysis.
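Because the trace file can grow large, you may want to check the free space in /tmp before starting a traced backup. The helpers below are a sketch. The function names are hypothetical, and the threshold in the usage example (5242880 KiB, that is 5 GiB) is an arbitrary value that you should adapt to your site.

```shell
# Print the free space (in KiB) of the filesystem that holds the given path.
free_kib() {
    # -P forces the one-line-per-filesystem POSIX format, -k reports KiB
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed only if at least MIN_KIB KiB are free on the filesystem holding DIR.
# Usage: has_free_space DIR MIN_KIB
has_free_space() {
    [ "$(free_kib "$1")" -ge "$2" ]
}
```

For example:

OMD[mysite]:/tmp$ has_free_space /tmp 5242880 || echo "Less than 5 GiB free in /tmp"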

 

List directories accessed during the backup

To extract a list of directories accessed during the backup, run:

OMD[mysite]:/tmp$ grep "O_DIRECTORY" /tmp/strace_backup.log | awk -F'"' '{print $2}' | grep "^/omd/sites/" | sort -u

This will produce a unique, sorted list of Checkmk site directories that were opened during the backup process.
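For suspected permission or path problems, it can also help to list the paths for which a system call failed with a specific error. This sketch assumes the usual strace line format, in which a failed call ends with " = -1 ERRNO (description)" and the path is the first quoted string on the line. The function name is hypothetical.

```shell
# Print the unique paths whose system calls failed with the given errno name.
# Usage: failed_paths TRACE_FILE ERRNO_NAME   (for example EACCES or ENOENT)
failed_paths() {
    grep -- " = -1 $2 " "$1" | awk -F'"' 'NF > 1 { print $2 }' | sort -u
}
```

For example, to list paths that could not be accessed because of missing permissions:

OMD[mysite]:/tmp$ failed_paths /tmp/strace_backup.log EACCES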

 

When this is useful

This method can be helpful in situations such as:

  • Backups failing without clear error messages

  • Backups including unexpected data

  • Backups missing expected files or directories

  • Suspected filesystem or permission issues

 

Cleanup

After completing your analysis, remove the trace file and the captured backup output to free disk space:

OMD[mysite]:/tmp$ rm /tmp/strace_backup.log /tmp/backup.log

 
