
Replacement hardware for buster/stretch upgrade of contint1001 and contint2001
Closed, Invalid (Public)

Description

contint1001.eqiad.wmnet and contint2001.eqiad.wmnet are physical hosts for Jenkins and Zuul. Both are running Jessie. We're running up on end-of-life for Jessie and need to upgrade. The upgrade would be easier if we could make a clean swap over to a new host, rather than running dist-upgrade on our existing systems with no way to roll back. There is no need to upgrade the specs on either of these hosts: both are working fine for their intended purpose.

Specs:

32 cores
64 GB RAM
2 TB storage (4 x 1 TB SSD)

Event Timeline

Paladox added a subscriber: Paladox.Dec 5 2019, 1:02 AM
colewhite triaged this task as Medium priority.Dec 5 2019, 5:58 PM
Dzahn added a comment.Dec 5 2019, 6:26 PM

I'd be the one to take these from "role::spare"-state to serving the puppet roles. We'll see what breaks on buster and whether we can go to buster or just stretch for now.

Dzahn awarded a token.Dec 5 2019, 6:27 PM
Dzahn added a subscriber: Muehlenhoff.

contint1001 has four 1 TB SSDs:

# lshw -class disk -short
H/W path             Device     Class          Description
==========================================================
/0/86/0.0.0          /dev/sda   disk           1TB ST91000640NS
/0/87/0.0.0          /dev/sdb   disk           1TB ST91000640NS
/0/88/0.0.0          /dev/sdc   disk           1TB ST1000NX0313
/0/89/0.0.0          /dev/sdd   disk           1TB ST1000NX0313

sda and sdb have:

1G RAID for swap
50G RAID for / (OS)
~950G RAID with LVM for /srv (notably Jenkins build artifacts)

sdc and sdd were added recently to hold Docker images:

250G RAID with LVM for /mnt/docker (Docker images)

750G free for future use.
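The figures above can be sanity-checked with a little arithmetic. The following is only a sketch: it assumes each pair of disks is mirrored (RAID1), which the thread does not state explicitly, and it treats each disk as a nominal 1000 GB.

```python
# Sketch of the disk layout described in this comment.
# ASSUMPTION: each pair is a RAID1 mirror; the thread only says "raid".
DISK_GB = 1000  # nominal size of each of the four 1 TB disks

# Volumes per mirrored pair, sizes in GB as quoted in the thread.
# The /srv figure is approximate ("~950G") in the original.
layout = {
    "sda+sdb": {"swap": 1, "/": 50, "/srv": 950},
    "sdc+sdd": {"/mnt/docker": 250},
}


def allocated_gb(volumes):
    """Total space allocated to the volumes on one mirrored pair."""
    return sum(volumes.values())


def free_gb(volumes, pair_capacity_gb=DISK_GB):
    """Unallocated space on a mirrored pair (usable capacity of one disk)."""
    return pair_capacity_gb - allocated_gb(volumes)


# Two mirrored pairs give the "2 TB storage (4 x 1 TB SSD)" from the spec list.
usable_total_gb = DISK_GB * len(layout)

print("usable total:", usable_total_gb, "GB")
print("allocated on sda+sdb:", allocated_gb(layout["sda+sdb"]), "GB (approximate)")
print("free on sdc+sdd:", free_gb(layout["sdc+sdd"]), "GB")
```

Under that assumption the sda/sdb pair is essentially fully allocated, and the sdc/sdd pair has the 750G of headroom mentioned above.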

This ticket needs more @RobH. :-)

wiki_willy assigned this task to RobH.Dec 17 2019, 10:52 PM
RobH added a comment.Tue, Jan 14, 8:27 PM

So this is a LOT of hardware churn that is non-desired by DC-Ops, at least from my perspective.

If we are upgrading both contint1001 and contint2001, why can't one become active while the other is reimaged/upgraded? Otherwise we have to take these perfectly good servers (R430s) and move them to the spares pool, where they are likely never to be re-used due to their age relative to the rest of the spares pool (which is mostly R440s).

The end result is that this will, in reality, prematurely end-of-life the servers currently assigned to contint. That is non-ideal.

Can this upgrade be handled by upgrading either the codfw or the eqiad host first to test, and then failing over to it to upgrade the remaining one?
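The ordering proposed here can be sketched as a simple plan. This is illustrative only: `rolling_upgrade_plan` is a hypothetical helper, the step descriptions are placeholders rather than real tooling, and which host is currently active is an assumption (the thread does not say).

```python
def rolling_upgrade_plan(active, standby):
    """Return ordered steps for upgrading two hosts one at a time.

    The standby host is reimaged and checked first, so the active host
    keeps serving until there is an upgraded host to fail over to.
    """
    return [
        f"reimage {standby}",
        f"verify services on {standby}",
        f"fail over from {active} to {standby}",
        f"reimage {active}",
        f"verify services on {active}",
    ]


# ASSUMPTION: contint1001 is the active host and contint2001 the standby.
plan = rolling_upgrade_plan(active="contint1001", standby="contint2001")
for number, step in enumerate(plan, start=1):
    print(number, step)
```

The point of this ordering is that the original hardware stays in service throughout, which avoids moving the existing R430s to the spares pool.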

RobH reassigned this task from RobH to thcipriani.Tue, Jan 14, 8:27 PM
RobH added a comment.Tue, Jan 14, 8:33 PM

@thcipriani: Please comment with additional reasoning on why we need to swap the hardware, when compared to my comment above on how this basically puts the existing hosts into EOL, and assign back to me for followup (if needed.)

Thanks!

RobH added a comment.Tue, Jan 14, 8:34 PM

Thank you for the cross-linking of information, it is appreciated!

thcipriani closed this task as Invalid.Thu, Jan 16, 7:52 PM

> So this is a LOT of hardware churn that is non-desired by DC-Ops, at least from my perspective.
> If we are upgrading both contint1001 and contint2001, why can't one become active while the other is reimaged/upgraded? Otherwise we have to take these perfectly good servers (R430s) and move them to the spares pool, where they are likely never to be re-used due to their age relative to the rest of the spares pool (which is mostly R440s).
> The end result is that this will, in reality, prematurely end-of-life the servers currently assigned to contint. That is non-ideal.
> Can this upgrade be handled by upgrading either the codfw or the eqiad host first to test, and then failing over to it to upgrade the remaining one?

> Thank you for the cross-linking of information, it is appreciated!

+1 to all of the above.

Thanks for the detailed response as always @RobH.