
Set up monitoring for secondary labstore HA cluster
Closed, Resolved, Public

Description

The new DRBD-backed HA labstore cluster (labstore1004/5) needs monitoring.

Things we should monitor explicitly

  • DRBD roles are assigned to the nodes in the cluster as expected
  • Status of DRBD node (connection and disk OK)
  • Status of DRBD service
  • Cluster IP is assigned to DRBD primary
  • NFS is served over the cluster IP
  • Backup jobs status
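
The first two items in this list (roles, connection and disk state) could be checked along the following lines. This is a minimal sketch, not the script from the patches referenced on this task: it assumes the DRBD 8.x `/proc/drbd` layout, where each resource line carries `cs:` (connection state), `ro:` (local/peer role) and `ds:` (local/peer disk state). The function names `parse_drbd_status` and `find_problems` are hypothetical.

```python
import re

# Matches one resource line of /proc/drbd in the DRBD 8.x layout, e.g.
#  0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
RESOURCE_LINE = re.compile(
    r"^\s*(?P<minor>\d+):\s+cs:(?P<cs>\S+)\s+ro:(?P<ro>\S+)\s+ds:(?P<ds>\S+)"
)


def parse_drbd_status(text):
    """Return {minor: {"cs", "role", "disk"}} for each resource line found."""
    resources = {}
    for line in text.splitlines():
        match = RESOURCE_LINE.match(line)
        if not match:
            continue  # version header, counters, blank lines
        # Roles and disk states are printed as local/peer; keep the local half.
        local_role = match.group("ro").partition("/")[0]
        local_disk = match.group("ds").partition("/")[0]
        resources[int(match.group("minor"))] = {
            "cs": match.group("cs"),
            "role": local_role,
            "disk": local_disk,
        }
    return resources


def find_problems(text, expected_role):
    """Return a list of human-readable problems; an empty list means healthy."""
    resources = parse_drbd_status(text)
    if not resources:
        return ["no DRBD resources found"]
    problems = []
    for minor, res in sorted(resources.items()):
        if res["cs"] != "Connected":
            problems.append("resource %d: connection state %s" % (minor, res["cs"]))
        if res["role"] != expected_role:
            problems.append("resource %d: role %s, expected %s"
                            % (minor, res["role"], expected_role))
        if res["disk"] != "UpToDate":
            problems.append("resource %d: disk state %s" % (minor, res["disk"]))
    return problems
```

A check wrapper would read `/proc/drbd` on each node, call `find_problems` with the role that node is expected to hold, and exit non-zero when the list is non-empty.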

Event Timeline

Change 311723 had a related patch set uploaded (by Madhuvishy):
labstore: Add monitoring script for secondary HA cluster health

https://gerrit.wikimedia.org/r/311723

OK, things we don't monitor yet:

  • DRBD service state (and add the service to the role so that it starts after all resources)
  • A check that validates that NFS is available over the cluster IP (even just showmount -e foo.eqiad.wmnet)
  • Backup jobs status (which will need new backup jobs written)
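
The `showmount -e` idea above could be wrapped in a small check such as the one below. It is a sketch only, assuming that `showmount` is installed on the monitoring host and that its first output line is the usual "Export list for ..." header. The names `parse_exports` and `nfs_is_served`, and the injectable `runner` parameter, are illustrative and not taken from the merged patches.

```python
import subprocess


def parse_exports(showmount_output):
    """Return the exported paths from `showmount -e` output."""
    lines = showmount_output.splitlines()
    # The first line is the "Export list for <host>:" header; skip it.
    return [line.split()[0] for line in lines[1:] if line.strip()]


def nfs_is_served(host, timeout=10, runner=subprocess.run):
    """Return True if `showmount -e host` succeeds and lists at least one export.

    `runner` is injectable so the function can be exercised without a
    real NFS server.
    """
    try:
        result = runner(
            ["showmount", "-e", host],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # showmount is missing, or the server did not answer in time.
        return False
    return result.returncode == 0 and len(parse_exports(result.stdout)) > 0
```

Called as `nfs_is_served("foo.eqiad.wmnet")`, with the placeholder name replaced by the name that resolves to the cluster IP, this only passes when the node currently holding that IP answers the export query.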

Change 311723 merged by Madhuvishy:
labstore: Add monitoring for secondary HA cluster health

https://gerrit.wikimedia.org/r/311723

Change 315742 had a related patch set uploaded (by Madhuvishy):
nfs: Add sudo permissions for nagios user to run drbd commands

https://gerrit.wikimedia.org/r/315742

Change 315742 merged by Madhuvishy:
nfs: Add sudo permissions for nagios user to run drbd commands

https://gerrit.wikimedia.org/r/315742

Change 320935 had a related patch set uploaded (by Madhuvishy):
labstore: Add drbd service monitoring

https://gerrit.wikimedia.org/r/320935

Change 320935 merged by Madhuvishy:
labstore: Add drbd service monitoring

https://gerrit.wikimedia.org/r/320935

Change 320946 had a related patch set uploaded (by Madhuvishy):
labstore: Check that NFS is being served over Cluster IP for secondary cluster

https://gerrit.wikimedia.org/r/320946

Change 320946 merged by Madhuvishy:
labstore: Check that NFS is being served over Cluster IP for secondary cluster

https://gerrit.wikimedia.org/r/320946

Change 320962 had a related patch set uploaded (by Madhuvishy):
labstore: Set mailto address for secondary backups cron

https://gerrit.wikimedia.org/r/320962

Change 320962 merged by Madhuvishy:
labstore: Set mailto address for secondary backups cron

https://gerrit.wikimedia.org/r/320962

Change 321117 had a related patch set uploaded (by Madhuvishy):
labstore: Make secondary backup script fail if already running

https://gerrit.wikimedia.org/r/321117

Change 321117 merged by Madhuvishy:
labstore: Make secondary backup script fail if already running

https://gerrit.wikimedia.org/r/321117
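
One common way to make a job fail when another instance is already running is a non-blocking exclusive lock on a lock file. The sketch below illustrates that general technique on Linux. It is not necessarily how the change above implements it, and the `AlreadyRunning` exception and `acquire_lock` helper are hypothetical names.

```python
import fcntl
import os


class AlreadyRunning(Exception):
    """Raised when another process already holds the lock."""


def acquire_lock(path):
    """Take an exclusive, non-blocking flock on `path` and return the fd.

    The caller keeps the returned descriptor open for the lifetime of the
    job; the kernel releases the lock when the descriptor is closed or the
    process exits.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        raise AlreadyRunning(path)
    return fd
```

A backup script would call `acquire_lock` first and exit non-zero on `AlreadyRunning`, so that cron reports the overlap by mail instead of letting two backups run at once.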