Troubleshooting "no space left on device" or filesystem full errors
Filesystems on the Checkmk Appliance can fill up for expected reasons or due to misconfiguration. This guide helps identify the cause and suggests solutions.
APPLICABLE TO ALL CHECKMK APPLIANCES
Problem
There are various reasons why a filesystem on the Checkmk Appliance may fill up. The first step is to identify which filesystem is affected, as each filesystem serves a different purpose and requires a different approach.
You should usually start by checking the Checkmk site running on the appliance itself. Since the appliance monitors its own resources, it often provides a quick overview of the current situation.
On a typical Checkmk Appliance, you will see the following filesystems (filesystems ending in /tmp can be ignored):
/
/ro
/rw
/omd
The first three filesystems (/, /ro, /rw) should never fill up during normal operation. The /omd filesystem, however, can grow over time depending on the number of sites, hosts, services, and stored monitoring data.
Identifying disk usage
To get an overview of disk usage on the appliance, log in via SSH and run:
root@myappliance~$ df -hl -x tmpfs
Example output:
Filesystem Size Used Avail Use% Mounted on
udev 7,8G 0 7,8G 0% /dev
/dev/sdc2 219G 1,8G 217G 1% /ro
/dev/disk/by-label/mk-rw 3,9G 3,9G 0G 100% /rw
overlay 3,9G 3,9G 0G 100% /
efivarfs 304K 79K 221K 27% /sys/firmware/efi/efivars
/dev/md0p2 435G 22G 414G 5% /omd
In this example, both / and /rw are at 100% usage.
Because / is mounted as an overlay filesystem, you can ignore it for this guide. The physical disk usage is on /rw, and space can only be freed there.
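To spot full filesystems without reading the table by eye, the df output can be filtered. The following is a minimal sketch and not part of the appliance tooling: the function name full_filesystems and the default threshold of 90% are assumptions made for illustration. It expects df -P (POSIX format) output on standard input and assumes mount points without spaces.

```shell
# Print "<mount point> <usage>" for every filesystem at or above a threshold.
# Reads `df -P` output on stdin; the first line (the header) is skipped.
# The function name and the default of 90 are illustrative assumptions.
full_filesystems() {
    threshold="${1:-90}"
    awk -v t="$threshold" 'NR > 1 {
        use = $5              # capacity column, e.g. "100%"
        sub(/%/, "", use)     # strip the percent sign
        if (use + 0 >= t) print $6, $5
    }'
}

# Usage on the appliance (local filesystems only, tmpfs excluded):
#   df -P -l -x tmpfs | full_filesystems 90
```

Reading from standard input rather than calling df inside the function keeps the filter easy to try against saved output.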
Finding what consumes disk space
Once you know which filesystem is full, you need to identify the files or directories consuming the most space.
Start at /rw and work your way down to find the large files or directories. Use du -sh to inspect disk usage:
root@myappliance~$ du -sh /rw/*
root@myappliance~$ du -sh /rw/overlay-rw/*
root@myappliance~$ du -sh /rw/overlay-rw/mnt/*
root@myappliance~$ du -sh /rw/overlay-rw/mnt/static/*
Each command lists the size of every entry one level below the specified path. The path followed here reflects one of the most likely causes: a broken mount whose data ends up filling /rw.
Identify the largest directories and continue drilling down until you find the files responsible for the disk usage.
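The drill-down is quicker when the entries are sorted by size. The following is a minimal sketch, assuming GNU du and sort as found on Debian-based systems. The function name largest_entries and its defaults are illustrative and not part of the appliance.

```shell
# List the largest entries directly below a directory, biggest first.
# Sizes are in KiB. The first line is the total for the directory itself.
# -x keeps du on a single filesystem, so data on other mounts is not counted.
# The function name and defaults are illustrative assumptions.
largest_entries() {
    dir="${1:-/rw}"
    count="${2:-10}"
    du -x -k --max-depth=1 "$dir" 2>/dev/null | sort -rn | head -n "$count"
}

# Usage: run it on the full filesystem, then repeat on the largest subdirectory.
#   largest_entries /rw
#   largest_entries /rw/overlay-rw 5
```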
Examples
A full filesystem can cause scheduled maintenance tasks to fail. Typical error messages include:
/etc/cron.daily/dpkg
cp: error writing 'dpkg.status': No space left on device
gzip: .//dpkg.status.0.gz: No space left on device
mv: cannot stat './/dpkg.status.0.gz': No such file or directory
/etc/cron.daily/logrotate
/etc/cron.daily/logrotate:
error: Compressing program wrote following message to stderr when compressing log /var/log/apache2/ssl_access.log.1:
gzip: stdout: No space left on device
error: failed to compress log /var/log/apache2/ssl_access.log.1
run-parts: /etc/cron.daily/logrotate exited with return code 1
/etc/cron.daily/man-db
gdbm fatal: read error
run-parts: /etc/cron.daily/man-db exited with return code 1
These errors indicate that routine background jobs cannot write temporary or compressed files due to insufficient disk space.
Possible solutions
Cleanup
For additional cleanup steps, refer to the documentation:
How to remove custom changes to files in the appliance
If you have identified large files or directories, verify the following:
Possible failed mounts, typically located under /rw/overlay-rw/mnt/
Verify that the source works and is accessible
Clean up the local files and re-mount the share
Verify successful mount
If a log file is growing excessively
Is debug logging enabled?
Is a script writing excessive log output?
Is an application repeatedly crashing and logging errors?
Large files in any other location most likely stem from an unsupported modification of the appliance firmware. Refer to the documentation linked above for how to remove such modifications.
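To verify that a share is mounted before and after the cleanup, the mountpoint utility from util-linux can be used. The following is a minimal sketch: the function name check_mount is an assumption, and the path in the usage comment is a placeholder that must be replaced with the mount target configured on your appliance.

```shell
# Report whether a directory is currently an active mount point.
# mountpoint -q exits with 0 for a mount point and non-zero otherwise.
# The function name is an illustrative assumption.
check_mount() {
    if mountpoint -q "$1"; then
        echo "mounted: $1"
    else
        echo "NOT mounted: $1"
    fi
}

# Usage (placeholder path, not an appliance default):
#   check_mount /path/to/share
```

If the share is reported as not mounted, anything written to that directory lands on the local filesystem instead of the remote source.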
Resize
It is normal for the /omd filesystem to grow over time. If cleanup does not free sufficient space, resizing may be required.
Currently, there is no online resizing option for the Checkmk Appliance. Expanding disk space requires setting up a new appliance and migrating existing sites.
Virtual appliance
Create a new appliance.
Before the first boot, resize the second disk to a sufficient size.
Boot and configure the appliance.
After the basic setup is complete, migrate your sites from the old appliance as described in the migration documentation.
Physical appliance
The process is technically similar. However, physical appliances are available only in fixed sizes.
If you are already using the larger rack5 model, no further resizing is possible.
Werk 9295
With Werk #9295, the default size of the /rw filesystem was increased from 800 MB to 4 GB.
To use the larger /rw filesystem, follow these steps:
Update to the latest 1.4 firmware.
Create a backup of the appliance as described in Configuring and using the appliance.
Reset the appliance to the factory settings.
Restore the backup.
After completing these steps, the /rw filesystem will have a size of 4 GB.