[cloudceph] Improve downtime when a switch goes down
Open, In Progress, High, Public

Description

We have had several major outages in the Ceph cluster (and thus in everything that depends on it: cloudvps, toolforge, paws and quarry) caused by switches going down or misbehaving.

Each Ceph outage has a high impact (to be elaborated on), so this task is to come up with a highly available setup that minimizes outages given the current technical and financial constraints.

The live document (will be moved to the task eventually) is: https://docs.google.com/document/d/1UtMK8ZLLfn1CFbcgBccBvzTlIuab244XAs98_gTuBlg/edit?usp=sharing