[cloudceph] Improve downtime when a switch goes down
Open, In Progress, High, Public

Assigned To

Authored By

	dcaro
	Sep 19 2024, 2:30 PM

Description

We have had several major outages in the Ceph cluster (and thus in all of cloudvps + toolforge + paws + quarry) caused by switches going down or misbehaving:

T373986: cloudsw1-c8-eqiad is unstable (2 outages: one when the switch misbehaved, and one when it was rebooted)
T329535: Cloud Ceph outage 2023-02-13 (2 outages too)
T314870: Setup cloudcephosd10[25-34] into the ceph eqiad cluster (another outage)

Each Ceph outage has a high impact (to be elaborated on), so this task is to come up with a highly available setup that minimizes outages given the current technical and financial constraints.

The live document (will be moved to the task eventually) is: https://docs.google.com/document/d/1UtMK8ZLLfn1CFbcgBccBvzTlIuab244XAs98_gTuBlg/edit?usp=sharing

Related Objects

Mentioned Here: T314870: Setup cloudcephosd10[25-34] into the ceph eqiad cluster
T329535: Cloud Ceph outage 2023-02-13
T373986: cloudsw1-c8-eqiad is unstable

Event Timeline

dcaro created this task. Sep 19 2024, 2:30 PM

Restricted Application added a subscriber: Aklapper. Sep 19 2024, 2:30 PM

dcaro changed the task status from Open to In Progress. Sep 19 2024, 2:30 PM

dcaro triaged this task as High priority.

dcaro moved this task from Backlog to In progress on the cloud-services-team (FY2024/2025-Q1-Q2) board.

fnegri subscribed. Sep 19 2024, 3:31 PM

aborrero added a project: User-aborrero. Sep 20 2024, 8:25 AM

aborrero moved this task from Backlog to Radar/observer on the User-aborrero board.

aborrero subscribed.

taavi added a project: Cloud-VPS. Sep 28 2024, 12:22 PM

taavi moved this task from Unsorted to Storage on the Cloud-VPS board. Fri, Nov 1, 7:04 PM
