We use the block device sync tool bdsync (https://github.com/TargetHolding/bdsync) to back up the tools and misc shares weekly from the passive server of the labstore secondary cluster (1004|5) to labstore2003|4 in codfw. The backup jobs are set up via puppet cron.
Current issues:
- The cron jobs use MAILTO to send email notifications, but these emails are sent irrespective of job success or failure, so we tend to treat them as noise. We should change this to only send emails when something goes wrong.
- Cron’s MAILTO cannot distinguish between stdout and stderr: it mails whatever output the job produces, so it cannot be told to send emails only when log lines are written to stderr. One way to work around this in our case is to produce meaningful stderr output only when something goes wrong, and no stdout at all when the cron-ed script completes successfully. This would mean cleaning up block_sync.py and snapshot-manager.py (the python script it shells out to for snapshot creation and deletion), having them default to log level ERROR when run without the debug option, and making sure that log lines for conditions that cause the scripts to exit have log level ERROR or CRITICAL.
- Sometimes we miss checking the email notifications anyway, and we have no check to ensure the integrity of the backups. Some way to check contextually, and alert through icinga, that the backup status is good would be really helpful.
- One possible way of doing this is to write a timestamp file to the volume before it’s backed up, and then have a monitoring script check that the timestamp file has a recent value (less than 1 week old, since we run backups every week).
- Since we do LVM level device backups, and not rsync style directory backups, writing or checking a file is a little more complicated.
- On the source NFS cluster side, the volumes are only mounted on the active NFS server, and the backups happen from the passive server, where they are not mounted. This is by design. We could make the block_sync script aware of the active NFS server and have it reach out and write a timestamp file to a specific location before kicking off the backup. Because of our DRBD replication setup, this would be replicated immediately to the passive server we are backing up from.
- On the destination backup server side, we’d need to set up a monitoring script that mounts the backed-up volume, checks that the timestamp file is recent and matches the source timestamp file, and then unmounts the volume. We can use snapshot-manager.py to do all these operations safely, and run this check once a day or so. The monitoring scripts are simple bash/python scripts that emit a valid OK/ERROR message and exit status, and we then use nrpe::monitor_service in puppet to set up icinga checks. See https://github.com/wikimedia/puppet/blob/production/modules/labstore/files/monitor/check_drbd_status and https://github.com/wikimedia/puppet/blob/production/modules/labstore/manifests/monitoring/secondary.pp#L29 for an example of this.
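For the MAILTO issue above, a minimal sketch of the logging setup could look like the following. It is not the current code in block_sync.py or snapshot-manager.py. The function names, the logger name, the `--debug` flag spelling and the `job` callable are all assumptions made for illustration. The idea is that nothing is written on success, and only ERROR or CRITICAL lines reach stderr on failure.

```python
import argparse
import logging
import sys


def configure_logging(debug):
    """Send log records to stderr; ERROR and above unless debug is set."""
    logger = logging.getLogger("block_sync")  # logger name is an assumption
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    handler = logging.StreamHandler(sys.stderr)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG if debug else logging.ERROR)
    logger.propagate = False  # keep the root logger from emitting anything extra
    return logger


def main(argv, job):
    """Run job() quietly; write to stderr and return 1 only on failure."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--debug", action="store_true")  # flag name is an assumption
    args = parser.parse_args(argv)
    log = configure_logging(args.debug)
    log.info("backup job starting")  # suppressed unless --debug is given
    try:
        job()  # hypothetical stand-in for the actual sync work
    except Exception:
        log.exception("backup job failed")  # ERROR level, includes the traceback
        return 1
    log.info("backup job finished")  # suppressed unless --debug is given
    return 0
```

A real entry point would call `sys.exit(main(sys.argv[1:], job))`. Since cron only sends mail when a job produces output, a successful run would then generate no email at all.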
Outcomes:
- Cron MAILTO emails are only sent when backup jobs fail, with relevant information
- We have an icinga check set up on both backup servers that will alert if backups are potentially corrupted or not up to date.
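The freshness part of such an icinga check could be sketched roughly as below. This assumes the timestamp file holds a single Unix epoch value, which is a format chosen here for illustration and not something already decided. The function names and the one day of slack on top of the weekly interval are also assumptions. Mounting and unmounting the backed-up volume via snapshot-manager.py, and comparing against the source timestamp, are left out. The exit codes follow the usual Nagios/Icinga plugin convention (0 for OK, 2 for CRITICAL).

```python
import time

OK, CRITICAL = 0, 2  # standard Nagios/Icinga plugin exit codes
MAX_AGE_SECONDS = 8 * 24 * 3600  # weekly backups plus one day of slack (assumption)


def evaluate(written, now, max_age=MAX_AGE_SECONDS):
    """Return (exit_code, message) for a timestamp written at `written`."""
    age = now - written
    age_days = age / 86400
    if age > max_age:
        return CRITICAL, f"CRITICAL - backup timestamp is {age_days:.1f} days old"
    return OK, f"OK - backup timestamp is {age_days:.1f} days old"


def check_timestamp_file(path, now=None):
    """Read an epoch timestamp from `path` and check that it is recent."""
    try:
        with open(path) as handle:
            written = float(handle.read().strip())
    except (OSError, ValueError) as exc:
        # A missing or unparseable file is treated as a failed backup here.
        return CRITICAL, f"CRITICAL - cannot read timestamp from {path}: {exc}"
    return evaluate(written, time.time() if now is None else now)


def main(argv):
    """argv[1] is the path of the timestamp file on the mounted volume."""
    status, message = check_timestamp_file(argv[1])
    print(message)  # a one-line status message for icinga
    return status
```

The script would end with `sys.exit(main(sys.argv))` so that the exit status is passed on to the NRPE check.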