
Replacement hardware for buster/stretch upgrade of contint1001 and contint2001
Closed, Invalid (Public)

Description

contint1001.eqiad.wmnet and contint2001.eqiad.wmnet are physical hosts for Jenkins and Zuul. Both are running Jessie, which is approaching end-of-life, so we need to upgrade. The upgrade would be easier if we could make a clean swap over to a new host, rather than running dist-upgrade on the existing system with no way to roll back. There is no need to upgrade the specs on either of these hosts: both are working fine for their intended purpose.

Specs:

32 cores
64 GB RAM
2 TB storage (4 x 1 TB SSD)
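
For reference, a quick sanity check of these specs on the existing hosts could look like the following (standard Debian tooling, nothing host-specific; output omitted and will vary per host):

# nproc
# free -h
# lshw -class disk -short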

Related Objects

Status     Assigned
Stalled    None
Resolved   None
Resolved   akosiaris
Resolved   Jdforrester-WMF
Resolved   Jdforrester-WMF
Resolved   Jdforrester-WMF
Invalid    Jdforrester-WMF
Resolved   MoritzMuehlenhoff
Resolved   Krinkle
Resolved   Krinkle
Resolved   hashar
Resolved   Jdforrester-WMF
Resolved   Jdforrester-WMF
Declined   Jdforrester-WMF
Duplicate  None
Resolved   Milimetric
Resolved   Milimetric
Resolved   Ladsgroup
Resolved   akosiaris
Declined   None
Resolved   Mholloway
Duplicate  None
Resolved   None
Resolved   None
Declined   None
Resolved   MSantos
Duplicate  None
Resolved   jeena
Resolved   Jdforrester-WMF
Resolved   Jdrewniak
Duplicate  None
Resolved   Jdforrester-WMF
Resolved   Jdforrester-WMF
Resolved   Jdforrester-WMF
Resolved   MoritzMuehlenhoff
Resolved   hashar
Resolved   hashar
Invalid    thcipriani

Event Timeline

colewhite triaged this task as Medium priority. (Dec 5 2019, 5:58 PM)

I'd be the one to take these from the "role::spare" state to serving their puppet roles. We'll see what breaks on buster and whether we can go to buster or just stretch for now.
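
As a quick sanity check before deciding, the current release on each host can be confirmed with standard Debian tooling (output omitted here):

# cat /etc/os-release
# cat /etc/debian_version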

contint1001 has four 1 TB SSDs:

# lshw -class disk -short
H/W path             Device     Class          Description
==========================================================
/0/86/0.0.0          /dev/sda   disk           1TB ST91000640NS
/0/87/0.0.0          /dev/sdb   disk           1TB ST91000640NS
/0/88/0.0.0          /dev/sdc   disk           1TB ST1000NX0313
/0/89/0.0.0          /dev/sdd   disk           1TB ST1000NX0313

sda and sdb have:

1G raid for swap
50G raid for / (OS)
~950G raid with LVM for /srv (notably Jenkins build artifacts)

sdc and sdd were added recently to hold Docker images:

250G raid with LVM for /mnt/docker (Docker images)
750G free for future use
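
For anyone auditing or reproducing this layout, a minimal sketch of how to inspect it with standard mdadm/LVM tooling (device names are those from the lshw output above; output omitted):

# lsblk -o NAME,SIZE,TYPE,MOUNTPOINT
# cat /proc/mdstat
# pvs; vgs; lvs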

So this is a LOT of hardware churn that DC-Ops would rather avoid, at least from my perspective.

If we are upgrading both contint1001 and contint2001, why can't one become active while the other is reimaged/upgraded? Otherwise we have to take these perfectly good servers (R430s) and move them to the spares pool, where they are likely never to be re-used, given their age relative to the rest of the spares pool (which is mostly R440s).

The end result is that this will, in practice, prematurely end-of-life the servers currently on the contint assignment. That is non-ideal.

Can this upgrade be handled by upgrading either the codfw or the eqiad host first as a test, then failing over to it so the remaining host can be upgraded?

@thcipriani: Please comment with additional reasoning on why we need to swap the hardware, given my comment above about how this effectively puts the existing hosts into EOL, and assign back to me for follow-up (if needed).

Thanks!

Thank you for the cross-linking of information, it is appreciated!


+1 to all of the above.

Thanks for the detailed response as always @RobH.