
[ceph] Add monitoring for inter-osd/mon/cloudvirt connectivity
Closed, Resolved (Public)

Description

Add metrics and alerts for basic connectivity between ceph nodes and cloudvirts, specifically:

  • mons <-> osds (non-jumbo frames, public interface only)
  • osds <-> osds (jumbo frames, public and cluster interfaces)
  • cloudvirts -> osds (non-jumbo frames, public interface only; will change to jumbo frames once T330075: [cloudvirt] Move to jumbo frames is done)
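
As a rough illustration of the jumbo vs. non-jumbo distinction above, the sketch below computes the largest ICMP echo payload that fits in a single unfragmented IPv4 packet and builds a matching ping command. The MTU values (9000 for jumbo, 1500 otherwise), the helper names, and the target hostname are assumptions for illustration and are not taken from the node_pinger patch; `-c`, `-M do` and `-s` are the Linux iputils `ping` options for count, don't-fragment and payload size.

```python
from typing import List

IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo request header

# Assumed MTUs; the task does not state the actual values in use.
JUMBO_MTU = 9000
STANDARD_MTU = 1500


def icmp_payload_size(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


def ping_command(target: str, mtu: int, count: int = 1) -> List[str]:
    """Build a ping invocation that fails if the path cannot carry `mtu` bytes."""
    return [
        "ping",
        "-c", str(count),                   # number of echo requests
        "-M", "do",                         # set don't-fragment, so oversized packets fail
        "-s", str(icmp_payload_size(mtu)),  # payload size in bytes
        target,
    ]


if __name__ == "__main__":
    # "osd-host.example" is a hypothetical target name.
    print(" ".join(ping_command("osd-host.example", JUMBO_MTU)))
    print(" ".join(ping_command("osd-host.example", STANDARD_MTU)))
```

With don't-fragment set, a full-size ping only succeeds if every hop on the path accepts the frame, which is what makes it usable as a jumbo-frame connectivity check.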

The alerts should not page; they should only show up in Alertmanager when any of the following holds:

  • More than 4 pings from a node failed in a short period
  • More than 4 pings to a node failed in a short period
  • More than 10% of the pings in total failed in a short period
  • More than 2 pings from a node failed continuously in a medium period
  • More than 2 pings to a node failed continuously in a medium period
  • More than 5% of the pings in total failed in a medium period
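
The conditions above could be evaluated roughly as in the following sketch. It is not the alerting rules actually deployed: the data layout (a mapping from source/destination pair to an ordered list of ping outcomes within one window), the function names, and the reading of "failed continuously" as consecutive failures on a single source/destination pair are all assumptions. The lengths of the "short" and "medium" periods are not specified in the task, so the caller is expected to pass in results already restricted to the relevant window.

```python
from collections import Counter
from typing import Dict, List, Tuple

# (source, destination) -> ordered ping outcomes in the window, True = success
PingResults = Dict[Tuple[str, str], List[bool]]


def failed_counts(results: PingResults, index: int) -> Counter:
    """Count failed pings per node; index 0 groups by source, 1 by destination."""
    counts: Counter = Counter()
    for pair, outcomes in results.items():
        counts[pair[index]] += outcomes.count(False)
    return counts


def failure_ratio(results: PingResults) -> float:
    """Fraction of all pings in the window that failed."""
    total = sum(len(outcomes) for outcomes in results.values())
    failed = sum(outcomes.count(False) for outcomes in results.values())
    return failed / total if total else 0.0


def longest_failure_streak(outcomes: List[bool]) -> int:
    """Length of the longest run of consecutive failures."""
    longest = current = 0
    for ok in outcomes:
        current = 0 if ok else current + 1
        longest = max(longest, current)
    return longest


def short_window_alerts(results: PingResults) -> List[str]:
    """More than 4 failures from/to a node, or more than 10% failed overall."""
    alerts = [f"from:{n}" for n, c in sorted(failed_counts(results, 0).items()) if c > 4]
    alerts += [f"to:{n}" for n, c in sorted(failed_counts(results, 1).items()) if c > 4]
    if failure_ratio(results) > 0.10:
        alerts.append("total")
    return alerts


def medium_window_alerts(results: PingResults) -> List[str]:
    """More than 2 consecutive failures from/to a node, or more than 5% failed overall."""
    streak_from: Counter = Counter()
    streak_to: Counter = Counter()
    for (src, dst), outcomes in results.items():
        streak = longest_failure_streak(outcomes)
        streak_from[src] = max(streak_from[src], streak)
        streak_to[dst] = max(streak_to[dst], streak)
    alerts = [f"from:{n}" for n, s in sorted(streak_from.items()) if s > 2]
    alerts += [f"to:{n}" for n, s in sorted(streak_to.items()) if s > 2]
    if failure_ratio(results) > 0.05:
        alerts.append("total")
    return alerts
```

Keeping the per-source, per-destination and total checks separate mirrors the list above: a single bad sender, a single unreachable receiver, and a cluster-wide problem each surface as a distinct alert.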

Event Timeline

dcaro changed the task status from Open to In Progress. Feb 15 2023, 10:06 AM
dcaro triaged this task as High priority.
dcaro created this task.
dcaro moved this task from To refine to Doing on the User-dcaro board.

Change 824202 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] node_pinger: use jumbo frames

https://gerrit.wikimedia.org/r/824202

Change 824202 merged by David Caro:

[operations/puppet@production] node_pinger: use jumbo frames

https://gerrit.wikimedia.org/r/824202

dcaro moved this task from Backlog to Done on the cloud-services-team (FY2022/2023-Q3) board.