review eqiad database server quantities / warranties / service(s)
Closed, Resolved, Public

Description

Recently, we've had a new ticket to order more SSDs for replacement in older model databases. https://rt.wikimedia.org/Ticket/Display.html?id=9455

We'll need to gather our current database servers and their warranty expiration dates. We can then break them up into service groups and have Jaime or Sean give us some insight into which are nearing the end of life for their service, and which are worth retaining by replacing bad SSDs and other parts.

Once we know how many of these systems we plan to continue using long term, we can better estimate how many SSDs to keep on hand for replacements.
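As a rough illustration of the review described above, here is a minimal Python sketch. Everything in it is an assumption for illustration only: the `DbServer` fields, the placeholder hostnames, and the `spare_percent` rule of thumb are hypothetical and do not come from the real inventory or from any agreed policy.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass
class DbServer:
    hostname: str           # placeholder name, not a real inventory entry
    service_group: str      # hypothetical service label
    warranty_expires: date
    ssd_count: int


def group_by_service(servers):
    """Break the server list up into service groups."""
    groups = defaultdict(list)
    for server in servers:
        groups[server.service_group].append(server)
    return dict(groups)


def out_of_warranty(servers, today):
    """Return the servers whose warranty expired before `today`."""
    return [s for s in servers if s.warranty_expires < today]


def spare_ssd_estimate(servers, keep, spare_percent):
    """Estimate SSDs to keep on hand for the servers we plan to retain.

    `spare_percent` is an assumed rule of thumb (spares per 100 installed
    SSDs); the result is rounded up using integer arithmetic.
    """
    total = sum(s.ssd_count for s in servers if s.hostname in keep)
    return (total * spare_percent + 99) // 100
```

With a filled-in list, `group_by_service` gives the per-service view to review with the DBAs, `out_of_warranty` flags candidates for decommissioning, and `spare_ssd_estimate` turns the retained set into a spares count once a percentage has been agreed.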

Event Timeline

RobH claimed this task.
RobH raised the priority of this task from to High.
RobH updated the task description.
RobH added projects: acl*sre-team, DBA.
RobH added subscribers: RobH, Christopher, jcrespo, Springle.

db1002 (may be decommissioned) also just got a disk failure.

Noting it only because I think it is one more of the "old disks" that may not be worth replacing. Also to check model reliability (there should be no disk load there).

RobH lowered the priority of this task from High to Medium. Jul 2 2015, 4:02 PM

So the plan is, for critical databases with immediate needs, to use replacement parts coming from decommissioning db1002 to db1007, maybe db1035 too.

The long-term plan is to replace older (>3 years) servers with low memory, since not only disks and memory are failing but chassis too, like db1035. It is not only that hardware is degrading: requirements, like those of s3, have grown. We should consolidate onto fewer, faster servers, which frees up rack space and makes life easier on both the datacenter and DBA sides.

@RobH, I am not sure what the conditions are to resolve this ticket. I would not buy new SAS disks for core production servers except in very specific cases (none at this time), as we should upgrade the servers that need them instead. db1050 (which I think originated the RT ticket) could be fixed this way, always subject to the onsite operator's OK.

I think this task is resolved, as these are indeed being decommissioned per other discussions and the refresh of the specification for eqiad.

Additionally, Jaime already has a decom plan for these in a related Google document. I'm resolving this task.