
Better monitoring for labstore backup crons
Closed, ResolvedPublic

Description

We use the block device sync tool bdsync (https://github.com/TargetHolding/bdsync) to back up the tools and misc shares weekly from the passive server of the labstore secondary cluster (labstore1004|5) to labstore2003|4 in codfw. The backup jobs are set up via a puppet cron.

Current issues:

  • It uses MAILTO to send email notifications, but these emails are sent regardless of whether the job succeeded or failed, so we tend to treat them as noise. We should change this to send email only when something goes wrong.
    • Cron’s MAILTO has no way to distinguish stdout from stderr and mail only when log lines are written to stderr. One way to work around this for our case is to produce meaningful stderr output only when something goes wrong, and no stdout at all when the cron-ed script completes successfully. This means cleaning up block_sync.py and the Python script it shells out to for snapshot creation and deletion, snapshot-manager.py: have them default to log level ERROR when run without the debug option, and make sure the log lines that cause the scripts to exit are logged at ERROR or CRITICAL.
  • Sometimes we miss checking the email notifications anyway, and we have no check that ensures the integrity of the backups. Coming up with a way to contextually check the backup status and alert through icinga would be really helpful.
    • One possible way of doing this is to write a timestamp file to the volume before it’s backed up, and then have a monitoring script check that the timestamp file has a recent value (<1 week, since we run backups weekly).
    • Since we do LVM level device backups, and not rsync style directory backups, writing or checking a file is a little more complicated.
    • On the source NFS cluster side, we only have the volumes mounted on the active NFS server, and the backups happen from the passive server where they are not mounted. This is by design. We could make the block_sync script aware of the active NFS server and have it reach out and write a timestamp file to a specific location before kicking off the backup. Due to our DRBD replication set up, this would be immediately replicated to the passive server we are backing up from.
    • On the destination backup server side, we’d need to set up a monitoring script that will have to mount the backed up volume, check that the timestamp file is recent and matches the source timestamp file and then unmount it. We can use snapshot-manager.py to do all these operations safely, and run this check once a day or so. The monitoring scripts are simple bash/python scripts with a valid OK/ERROR message and exit status, and then we use nrpe::monitor_service in puppet to set up icinga checks. See https://github.com/wikimedia/puppet/blob/production/modules/labstore/files/monitor/check_drbd_status and https://github.com/wikimedia/puppet/blob/production/modules/labstore/manifests/monitoring/secondary.pp#L29 for an example of this.

Outcomes:

  • Cron MAILTO emails are sent only when backup jobs fail, and include relevant information.
  • An icinga check is set up on both backup servers that alerts if backups are potentially corrupted or not up to date.

Event Timeline

bd808 moved this task from Backlog to Shared Storage on the Data-Services board.Jul 23 2017, 11:41 PM
madhuvishy removed madhuvishy as the assignee of this task.Jan 5 2018, 6:55 PM
chasemp assigned this task to Bstorm.Jun 6 2018, 7:26 PM

Change 443643 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] labstore: block_sync backup job should email on error only

https://gerrit.wikimedia.org/r/443643

Change 443643 merged by Bstorm:
[operations/puppet@production] labstore: block_sync backup job should email on error only

https://gerrit.wikimedia.org/r/443643

Change 451181 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] labstore: set up an icinga plugin to check cron exit codes

https://gerrit.wikimedia.org/r/451181

Change 451657 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] labstore: Change backup cron to a systemd timer

https://gerrit.wikimedia.org/r/451657

Change 451657 merged by Bstorm:
[operations/puppet@production] labstore: Change backup cron to a systemd timer

https://gerrit.wikimedia.org/r/451657

Change 453137 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] labstore: Change backup cron to a systemd timer

https://gerrit.wikimedia.org/r/453137

Change 453137 merged by Bstorm:
[operations/puppet@production] labstore: Change backup cron to a systemd timer

https://gerrit.wikimedia.org/r/453137

Change 453156 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] labstore: trying to make dependency issues work

https://gerrit.wikimedia.org/r/453156

Change 453156 merged by Bstorm:
[operations/puppet@production] labstore: trying to make dependency issues work

https://gerrit.wikimedia.org/r/453156

Change 453173 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] labstore and systemd: change timer module to use simpler interface

https://gerrit.wikimedia.org/r/453173

Change 453173 abandoned by Bstorm:
labstore and systemd: change timer module to use simpler interface

Reason:
Won't work this way, and this potentially reduces flexibility. I don't like it.

https://gerrit.wikimedia.org/r/453173

Change 453180 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] labstore and systemd: Change timer dependency to unit instead of service

https://gerrit.wikimedia.org/r/453180

Change 453180 merged by Bstorm:
[operations/puppet@production] labstore and systemd: Change timer dependency to unit instead of service

https://gerrit.wikimedia.org/r/453180

Converted the crons to systemd timers. Now they should actually page when broken. That said, I'm inclined to keep this open until next week so I can confirm that they ran.
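
For reference, the cron-to-timer conversion is roughly the following pair of units (a sketch of the shape, not the exact puppet output; the unit names, ExecStart path and OnCalendar schedule are assumptions):

```ini
# block_sync.service (sketch) - Type=oneshot means a non-zero exit
# from ExecStart puts the unit into a failed state that monitoring
# can alert on, unlike cron, which discards the exit code.
[Unit]
Description=DRBD Device Backup Job

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/block_sync

# block_sync.timer (sketch) - replaces the crontab schedule.
[Unit]
Description=Weekly DRBD device backup timer

[Timer]
OnCalendar=Tue 20:00
Persistent=true

[Install]
WantedBy=timers.target
```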

Change 451181 abandoned by Bstorm:
labstore: set up an icinga plugin to check cron exit codes

Reason:
Went a different route and set up systemd.timers

https://gerrit.wikimedia.org/r/451181

Aug 21 20:00:01 labstore2003 systemd[1]: Starting DRBD Device Backup Job...
Aug 21 20:00:01 labstore2003 systemd[1]: Started DRBD Device Backup Job.
Aug 21 20:00:02 labstore2003 block_sync[25716]: 2018-08-21 20:00:02,385 INFO force is enabled
Aug 21 20:00:02 labstore2003 block_sync[25716]: 2018-08-21 20:00:02,410 INFO removing tools-project-backup
Aug 21 20:00:02 labstore2003 block_sync[25716]: 2018-08-21 20:00:02,461 INFO removing tools-project-backup
Aug 21 20:00:02 labstore2003 block_sync[25716]: 2018-08-21 20:00:02,968 INFO creating tools-project-backup at 2T
Aug 21 20:00:03 labstore2003 block_sync[25716]: 2018-08-21 20:00:03,935 INFO force is enabled
Aug 21 20:00:03 labstore2003 block_sync[25716]: 2018-08-21 20:00:03,978 INFO removing tools-snap
Aug 21 20:00:04 labstore2003 block_sync[25716]: 2018-08-21 20:00:04,042 INFO removing tools-snap
Aug 21 20:00:05 labstore2003 block_sync[25716]: 2018-08-21 20:00:05,907 INFO creating tools-snap at 1T

That looks like success to me! The systemd health checker should pick up if the timer enters a failed state.

Bstorm closed this task as Resolved.Aug 24 2018, 6:12 PM
Bstorm reopened this task as Open.EditedAug 31 2018, 2:26 PM

Re-opening this task because a backup failed during kernel upgrades (Wednesday is the misc backup). I can confirm that it did NOT set off an alarm, because it didn't exit with anything that the systemd service would be bothered by. It did throw some information into the log, though:

Aug 29 20:55:44 labstore2004 block_sync[10111]: Connection to 10.64.37.20 closed by remote host.
Aug 29 20:55:44 labstore2004 block_sync[10111]: Bad data

I suspect the problem is that a Python script calls a bash script, and so forth, so the exit codes are lost along the way.
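
The usual fix for that failure mode is to make each wrapper propagate its child's exit status explicitly. A hedged sketch in Python (not necessarily what the actual patch does):

```python
import subprocess
import sys


def run_step(cmd):
    """Run one backup step and propagate its failure.

    check=True makes subprocess.run raise CalledProcessError on a
    non-zero exit, so the wrapper itself exits non-zero and systemd
    marks the service as failed instead of silently succeeding.
    """
    try:
        subprocess.run(cmd, check=True)
    except subprocess.CalledProcessError as e:
        print("step %r failed with exit code %d" % (cmd, e.returncode),
              file=sys.stderr)
        sys.exit(e.returncode)
```

In the bash layers, `set -e` and `set -o pipefail` (or an explicit `|| exit $?`) do the same job of keeping the exit code from being lost.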

Change 456740 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] block_sync: Small improvement to the drbd backup script

https://gerrit.wikimedia.org/r/456740

Change 456740 merged by Bstorm:
[operations/puppet@production] block_sync: Small improvement to the drbd backup script

https://gerrit.wikimedia.org/r/456740

Bstorm closed this task as Resolved.Sep 5 2018, 8:05 PM

This should actually be working now.