We had several major outages in the ceph cluster (and thus all of cloudvps + toolforge + paws + quarry) due to switches going down or misbehaving:
- T373986: cloudsw1-c8-eqiad is unstable (2 outages: one when the switch misbehaved, and one when it was rebooted)
- T329535: Cloud Ceph outage 2023-02-13 (2 outages too)
- T314870: Setup cloudcephosd10[25-34] into the ceph eqiad cluster (another outage)
Each ceph outage has a high impact (to be elaborated on), so this task is to come up with a highly available setup that minimizes outages given the current technical and financial constraints.
The live document (will be moved to the task eventually) is: https://docs.google.com/document/d/1UtMK8ZLLfn1CFbcgBccBvzTlIuab244XAs98_gTuBlg/edit?usp=sharing