Set up monitoring for secondary labstore HA cluster
Closed, Resolved · Public

Description

The new DRBD-backed HA labstore cluster (labstore1004/5) needs monitoring.

Things we should monitor explicitly

  • DRBD roles are assigned to nodes in cluster as expected
  • Status of DRBD node (connection and disk ok)
  • Status of DRBD service
  • Cluster IP is assigned to DRBD primary
  • NFS is served over the cluster IP
  • Backup jobs status
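The first two items above (roles, connection state, and disk state) could be checked roughly as follows. This is a minimal sketch, not the script merged for this task: it assumes DRBD 8.x, where `/proc/drbd` prints one `cs:… ro:… ds:…` line per resource, and the names `parse_proc_drbd`, `check_drbd`, and `SAMPLE` are hypothetical. The return codes follow the usual Nagios plugin convention (0 = OK, 2 = CRITICAL).

```python
import re

# Matches one resource line of DRBD 8.x /proc/drbd output, for example:
#  0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
RESOURCE_LINE = re.compile(
    r"^\s*(?P<minor>\d+):\s+cs:(?P<cs>\S+)"
    r"\s+ro:(?P<local_role>\w+)/(?P<peer_role>\w+)"
    r"\s+ds:(?P<local_disk>\w+)/(?P<peer_disk>\w+)"
)

# Illustrative sample only (assumed format, not captured from labstore1004/5).
SAMPLE = """version: 8.4.5 (api:1/proto:86-101)
 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
"""


def parse_proc_drbd(text):
    """Return one dict per configured DRBD resource found in the text."""
    resources = []
    for line in text.splitlines():
        match = RESOURCE_LINE.match(line)
        if match:
            resources.append(match.groupdict())
    return resources


def check_drbd(text, expected_role):
    """Return (nagios_exit_code, message) for the given /proc/drbd text."""
    resources = parse_proc_drbd(text)
    if not resources:
        return 2, "CRITICAL: no DRBD resources found"
    problems = []
    for res in resources:
        name = "drbd" + res["minor"]
        if res["local_role"] != expected_role:
            problems.append(
                "%s role is %s, expected %s"
                % (name, res["local_role"], expected_role)
            )
        if res["cs"] != "Connected":
            problems.append("%s connection state is %s" % (name, res["cs"]))
        if res["local_disk"] != "UpToDate" or res["peer_disk"] != "UpToDate":
            problems.append(
                "%s disk state is %s/%s"
                % (name, res["local_disk"], res["peer_disk"])
            )
    if problems:
        return 2, "CRITICAL: " + "; ".join(problems)
    return 0, "OK: %d DRBD resource(s) healthy" % len(resources)
```

On a real node the text would come from reading `/proc/drbd`, with `expected_role` set to `Primary` or `Secondary` depending on the host.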

Event Timeline

Restricted Application added a subscriber: Aklapper. · Sep 2 2016, 7:51 PM

Change 311723 had a related patch set uploaded (by Madhuvishy):
labstore: Add monitoring script for secondary HA cluster health

https://gerrit.wikimedia.org/r/311723

chasemp added a comment. · Edited · Oct 7 2016, 8:06 PM

OK, things we don't monitor yet:

  • DRBD service state (and add the service to the role so that it starts after all other resources)
  • A check that validates that NFS is available over the cluster IP (even just showmount -e foo.eqiad.wmnet)
  • Backup jobs status (which will need new backup jobs written)
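The `showmount -e` idea above could be wrapped in a Nagios-style check along these lines. This is only a sketch under assumptions: the function names are hypothetical, the check treats an empty export list as a failure, and it is not the check that was later merged. The `run` parameter exists only so the command can be stubbed out when testing.

```python
import subprocess


def parse_showmount(output):
    """Return the exported paths listed in `showmount -e` output."""
    exports = []
    # The first line is the "Export list for <host>:" header; skip it.
    for line in output.splitlines()[1:]:
        fields = line.split()
        if fields:
            exports.append(fields[0])
    return exports


def check_nfs(host, run=subprocess.run, timeout=10):
    """Return (nagios_exit_code, message) for NFS exports served by host."""
    try:
        result = run(
            ["showmount", "-e", host],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return 2, "CRITICAL: showmount failed for %s: %s" % (host, exc)
    if result.returncode != 0:
        return 2, "CRITICAL: showmount exited %d for %s" % (result.returncode, host)
    exports = parse_showmount(result.stdout)
    if not exports:
        return 2, "CRITICAL: %s exports nothing" % host
    return 0, "OK: %s exports %s" % (host, ", ".join(exports))
```

Calling `check_nfs("foo.eqiad.wmnet")` with the placeholder name from the comment above would be replaced by the cluster IP's hostname in practice.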

Change 311723 merged by Madhuvishy:
labstore: Add monitoring for secondary HA cluster health

https://gerrit.wikimedia.org/r/311723

Change 315742 had a related patch set uploaded (by Madhuvishy):
nfs: Add sudo permissions for nagios user to run drbd commands

https://gerrit.wikimedia.org/r/315742

Change 315742 merged by Madhuvishy:
nfs: Add sudo permissions for nagios user to run drbd commands

https://gerrit.wikimedia.org/r/315742

Change 320935 had a related patch set uploaded (by Madhuvishy):
labstore: Add drbd service monitoring

https://gerrit.wikimedia.org/r/320935

Change 320935 merged by Madhuvishy:
labstore: Add drbd service monitoring

https://gerrit.wikimedia.org/r/320935

Change 320946 had a related patch set uploaded (by Madhuvishy):
labstore: Check that NFS is being served over Cluster IP for secondary cluster

https://gerrit.wikimedia.org/r/320946

Change 320946 merged by Madhuvishy:
labstore: Check that NFS is being served over Cluster IP for secondary cluster

https://gerrit.wikimedia.org/r/320946

Change 320962 had a related patch set uploaded (by Madhuvishy):
labstore: Set mailto address for secondary backups cron

https://gerrit.wikimedia.org/r/320962

Change 320962 merged by Madhuvishy:
labstore: Set mailto address for secondary backups cron

https://gerrit.wikimedia.org/r/320962

Change 321117 had a related patch set uploaded (by Madhuvishy):
labstore: Make secondary backup script fail if already running

https://gerrit.wikimedia.org/r/321117

Change 321117 merged by Madhuvishy:
labstore: Make secondary backup script fail if already running

https://gerrit.wikimedia.org/r/321117
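The task does not show how the backup script detects that it is already running. One common way to get that behaviour on Linux is an exclusive, non-blocking `flock` on a lock file, sketched below. The helper name `acquire_lock` and any lock file path are hypothetical, and the merged change may use a different mechanism.

```python
import fcntl
import os


def acquire_lock(path):
    """Take an exclusive non-blocking lock on path.

    Returns the open file descriptor on success (keep it open for the
    lifetime of the job), or None if another holder already has the lock.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        # Another open file description holds the lock: a run is in progress.
        os.close(fd)
        return None
    return fd
```

A backup script using this would exit non-zero when `acquire_lock` returns `None`, so that cron reports the failure to the configured mailto address. The kernel releases the lock automatically when the process exits, so a crashed run does not leave a stale lock behind.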

@madhuvishy satisfied we can close?

madhuvishy closed this task as Resolved. · Jan 25 2017, 10:13 PM

@chasemp Yup, closing.