
Order spare cloudvirt SSDs for eqiad
Closed, Declined | Public

Description

We've had a few scares with failing SSDs in cloudvirts -- if we were to lose the wrong two drives in a row we'd suffer loss of user data.

The prudent thing is probably to keep spares on hand so we can replace drives as they fail and avoid two overlapping failures. Unfortunately there's quite a variety of drive sizes in the cloudvirts, so we'd have to keep several different drive models around.

This task is part of https://wikitech.wikimedia.org/wiki/Incident_documentation/20190213-cloudvps

Related Objects

Status | Subtype | Assigned | Task
Resolved | | Andrew |
Declined | | None |

Event Timeline

@GTirloni suggests that we add a live spare to each cloudvirt to avoid data loss. Seems like a good idea, although in many cases we won't have spare drive bays for this.

https://www.dell.com/support/manuals/br/pt/brbsdt1/poweredge-rc-h730/perc9ugpublication/creating-global-hot-spares?guid=guid-138c7a5d-acd5-465b-ae14-a7cf236232f4&lang=en-us

aborrero added a project: Wikimedia-Incident.
aborrero moved this task from Inbox to Soon! on the cloud-services-team (Kanban) board.

I'm less sure that we need drives on hand now. We seem to be able to get replacements more-or-less overnight, and adding spare drives to the RAIDs will reduce the urgency of replacement.