[ceph] Add monitoring for inter-osd/mon/cloudvirt connectivity
Closed, Resolved (Public)

Description

Add metrics and alerts for basic connectivity between ceph nodes and cloudvirts.

The alerts should not page; they should only show up in alertmanager when:

  • More than 4 pings from a node failed in a short period
  • More than 4 pings to a node failed in a short period
  • More than 10% of the pings in total failed in a short period
  • More than 2 pings from a node failed continuously in a medium period
  • More than 2 pings to a node failed continuously in a medium period
  • More than 5% of the pings in total failed in a medium period
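The thresholds above could be evaluated roughly as in the sketch below. It is only an illustration, not the node_pinger implementation: the `Ping` record, the `window_alerts` function, and the alert names are hypothetical, and the task does not define the length of the "short" and "medium" periods. The sketch also simplifies "failed continuously" to a plain count of failures inside the window.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Ping:
    """One ping result (hypothetical record, not the real metric format)."""
    src: str   # node that sent the ping
    dst: str   # node that was pinged
    ok: bool   # True if a reply was received


def window_alerts(pings, max_failed_per_node, max_failed_ratio):
    """Return the names of the alerts that fire for one window of pings.

    Assumption: "failed continuously" is approximated by counting failures
    in the window, not by checking for consecutive failures.
    """
    failed = [p for p in pings if not p.ok]
    failed_from = Counter(p.src for p in failed)
    failed_to = Counter(p.dst for p in failed)

    alerts = []
    # Too many failed pings sent from a single node.
    for node, count in sorted(failed_from.items()):
        if count > max_failed_per_node:
            alerts.append(f"pings_from_failed:{node}")
    # Too many failed pings sent to a single node.
    for node, count in sorted(failed_to.items()):
        if count > max_failed_per_node:
            alerts.append(f"pings_to_failed:{node}")
    # Too large a share of all pings failed.
    if pings and len(failed) / len(pings) > max_failed_ratio:
        alerts.append("total_failure_ratio")
    return alerts


# Thresholds taken from the task description.
SHORT = {"max_failed_per_node": 4, "max_failed_ratio": 0.10}
MEDIUM = {"max_failed_per_node": 2, "max_failed_ratio": 0.05}

# Example window: 5 failed pings from "a" to "b", 45 successful pings.
sample = [Ping("a", "b", False)] * 5 + [Ping("a", "c", True)] * 45
short_alerts = window_alerts(sample, **SHORT)
medium_alerts = window_alerts(sample, **MEDIUM)
```

With this sample, the short-period thresholds fire the per-node alerts only, because 5 of 50 failures is exactly 10% and not more. The stricter medium-period thresholds also fire the total-ratio alert.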

Event Timeline

dcaro changed the task status from Open to In Progress.
dcaro triaged this task as High priority.
dcaro moved this task from To refine to Doing on the User-dcaro board.

Change 824202 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] node_pinger: use jumbo frames

https://gerrit.wikimedia.org/r/824202

Change 824202 merged by David Caro:

[operations/puppet@production] node_pinger: use jumbo frames

https://gerrit.wikimedia.org/r/824202

dcaro moved this task from Backlog to Done on the cloud-services-team (FY2022/2023-Q3) board.