Replacement hardware for buster/stretch upgrade of contint1001 and contint2001
Closed, Invalid (Public)

Authored By

	thcipriani
	Dec 5 2019, 12:52 AM

Description

contint1001.eqiad.wmnet and contint2001.codfw.wmnet are the physical hosts for Jenkins and Zuul. Both run Jessie, which is approaching end-of-life, so we need to upgrade. The upgrade would be easier with a clean swap over to new hosts than with a dist-upgrade on the existing systems, which leaves no way to roll back. There is no need to upgrade the specs of either host: both work fine for their intended purpose.

Specs:

32 cores
64 GB RAM
2 TB storage (4 x 1 TB SSD)

Related Objects

Status	Assigned	Task
Stalled	None	T302086 Set scap minimum python version to 3.7
Resolved	None	T247045 Migrate all of production metal and VMs to Buster or later
Resolved	akosiaris	T249724 Track and remove jessie based container images from production
Resolved	Jdforrester-WMF	T224908 Drop jessie testing support
Resolved	Jdforrester-WMF	T224906 Drop php56 testing support
Resolved	Jdforrester-WMF	T211784 Upgrade all CI jobs from node6/npm3 to node10/npm6 across all projects
Invalid	Jdforrester-WMF	T211785 Upgrade the mobileapps CI job from npm3 to npm6
Resolved	MoritzMuehlenhoff	T203239 Create Debian packages for Node.js 10 upgrade
Resolved	Krinkle	T213944 Jenkins jobs for npm-test fail on project with deps on node-gyp which requires python2.7
Resolved	Krinkle	T215562 npm 6 consistently fails with "Z_DATA_ERROR: invalid distance too far back" on some repos
Resolved	hashar	T217545 Update selenium-daily-beta-* jobs to node10/npm6
Resolved	Jdforrester-WMF	T222406 Switch quibble-based CI jobs from node6 to node10
Resolved	Jdforrester-WMF	T224983 mediawiki-phpunit-coverage-patch-docker fails to install fibers@3.1.1
Declined	Jdforrester-WMF	T224997 Update MobileFrontend-npm-run-lint-modules-docker to run node10
Duplicate	None	T224978 WikibaseMediaInfo selenium tests failing when run against beta commons
Resolved	Milimetric	T228451 Fix the analytics/mediawiki-storage repo to work on node10
Resolved	Milimetric	T228452 Fix the analytics/wikistats2 repo to work on node10
Resolved	Ladsgroup	T228453 Fix the data-values/value-view repo to work on node10
Resolved	akosiaris	T218733 Migrate mobileapps to k8s and node 10
Declined	None	T215539 Node.js 10 changes encoding for at least one Georgian character
Resolved	• Mholloway	T258186 Investigate why mobileapps in k8s "/{domain}/v1/data/css/mobile/site" endpoint takes way longer than on scb to complete
Duplicate	None	T225107 Migrate recommendation-api to node 10
Resolved	None	T225678 Migrate 3d2png to k8s
Resolved	None	T267327 Run latest Thumbor on Docker with Buster + Python 3
Declined	None	T269215 Blubber "copies" and "builder command" steps should run in the opposite order
Resolved	MSantos	T217114 Migrate Proton to k8s and nodejs 10
Duplicate	None	T228907 Migrate the wikimedia-portals-build timed CI job to node10
Resolved	jeena	T213806 Migrate wikimedia-portals-build to Docker container
Resolved	Jdforrester-WMF	T237479 Update the wikimedia-portals repo's CI/linting code for various security issues
Resolved	Jdrewniak	T247996 Fix issues with Gulp 4 migration
Duplicate	None	T229276 Fix the data-values/value-view repo to work on node10
Resolved	Jdforrester-WMF	T230841 Migrate documentation generation to Node 10.15.2 from node 6.11.0
Resolved	Jdforrester-WMF	T235570 Move the OOUI repo to a new custom docker image for node10 and php72
Resolved	Jdforrester-WMF	T247536 Migrate mediawiki-core-jsduck-docker-publish off node 6 so it works again
Resolved	MoritzMuehlenhoff	T224549 Track remaining jessie systems in production
Resolved	hashar	T249268 Reduce size of artifacts stored on the CI Jenkins master
Resolved	hashar	T224591 Migrate contint* hosts to Buster
Invalid	thcipriani	T239880 Replacement hardware for buster/stretch upgrade of contint1001 and contint2001

Event Timeline

thcipriani created this task. Dec 5 2019, 12:52 AM

Paladox subscribed. Dec 5 2019, 1:02 AM

colewhite triaged this task as Medium priority. Dec 5 2019, 5:58 PM

I'd be the one to take these from "role::spare"-state to serving the puppet roles. We'll see what breaks on buster and whether we can go to buster or just stretch for now.

Dzahn awarded a token. Dec 5 2019, 6:27 PM

Dzahn added a subscriber: Muehlenhoff.

Dzahn mentioned this in T239151: Gerrit VM to test data migration. Dec 5 2019, 6:29 PM

hashar added a project: Continuous-Integration-Infrastructure (phase-out-jessie). Dec 7 2019, 3:40 AM

hashar updated the task description.

hashar updated the task description. Dec 7 2019, 3:44 AM

hashar added a parent task: T224591: Migrate contint* hosts to Buster.

contint1001 has four 1TB SSDs:

# lshw -class disk -short
H/W path             Device     Class          Description
==========================================================
/0/86/0.0.0          /dev/sda   disk           1TB ST91000640NS
/0/87/0.0.0          /dev/sdb   disk           1TB ST91000640NS
/0/88/0.0.0          /dev/sdc   disk           1TB ST1000NX0313
/0/89/0.0.0          /dev/sdd   disk           1TB ST1000NX0313

sda and sdb have

1G	raid for swap
50G	raid for / (OS)
~ 950G	raid with lvm for `/srv` (notably Jenkins build artifacts)

sdc and sdd have been added recently to hold Docker images

250G	raid with lvm for `/mnt/docker` (Docker images)

750G free for future use.
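
As a quick sanity check of the layout above, here is a minimal Python sketch that tallies the allocations per disk pair. It assumes each pair is mirrored, so usable space equals one 1TB disk; the comment above only says "raid" without a level, so that is an assumption. The names `layout` and `free_gb` are made up for illustration.

```python
# Sizes in GB, copied from the layout above. "~950G" is a rounded figure.
DISK_GB = 1000  # lshw reports each drive as 1TB

# Assumption: each pair of disks is mirrored, so one pair offers
# roughly one disk's worth of usable space.
layout = {
    "sda+sdb": {"swap": 1, "/": 50, "/srv": 950},
    "sdc+sdd": {"/mnt/docker": 250},
}


def free_gb(volumes, capacity_gb=DISK_GB):
    """Return the unallocated space (GB) left on one mirrored pair."""
    return capacity_gb - sum(volumes.values())


for pair, volumes in layout.items():
    allocated = sum(volumes.values())
    print(f"{pair}: {allocated}G allocated, {free_gb(volumes)}G unallocated")
```

Under that assumption the second pair comes out at 750G unallocated, which matches the free space quoted above. The first pair is effectively full; the small negative remainder only reflects the rounding in "~950G".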

This ticket needs more @RobH. :-)

hashar mentioned this in T224591: Migrate contint* hosts to Buster. Dec 13 2019, 3:46 PM

wiki_willy assigned this task to RobH. Dec 17 2019, 10:52 PM

So this is a LOT of hardware churn that is undesirable for DC-Ops, at least from my perspective.

If we are upgrading both contint1001 and contint2001, why can't one become active while the other is reimaged/upgraded? Otherwise we have to take these perfectly good servers (R430s) and move them to the spares pool, where they are likely never to be reused due to their age relative to the rest of the spares pool (mostly R440s).

The end result is that this will, in reality, prematurely end-of-life the servers currently assigned to contint. That is non-ideal.

Can this upgrade be handled by upgrading either the codfw or the eqiad host first to test, and then failing over to it to upgrade the remaining one?
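
The ordering proposed here can be sketched as a short Python function. This is only an illustration of the sequence, not an actual procedure or tool: the function name, the step wording, and which host is currently active are all assumptions, since the task does not say which host is serving.

```python
def rolling_upgrade_plan(active, standby, target="buster"):
    """Return ordered steps for upgrading two hosts in place.

    The standby host is upgraded and tested first, then service fails
    over to it so the formerly active host can be upgraded in turn.
    No replacement hardware is involved.
    """
    return [
        f"reimage {standby} to {target}",
        f"test {standby} on {target}",
        f"fail over from {active} to {standby}",
        f"reimage {active} to {target}",
    ]


# Assumption for the example: contint1001 is active, contint2001 is standby.
for step in rolling_upgrade_plan("contint1001", "contint2001"):
    print(step)
```

Swapping the two arguments gives the same plan starting from the other datacenter, which is the "either the codfw or eqiad first" choice mentioned above.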

RobH reassigned this task from RobH to thcipriani. Jan 14 2020, 8:27 PM

@RobH: You're completely right, see https://phabricator.wikimedia.org/T224591#5737877

@thcipriani: Please comment with additional reasoning on why we need to swap the hardware, when compared to my comment above on how this basically puts the existing hosts into EOL, and assign back to me for followup (if needed.)

Thanks!

In T239880#5803332, @MoritzMuehlenhoff wrote:

@RobH: You're completely right, see https://phabricator.wikimedia.org/T224591#5737877

Thank you for the cross-linking of information, it is appreciated!

In T239880#5803313, @RobH wrote:

So this is a LOT of hardware churn that is undesirable for DC-Ops, at least from my perspective.

If we are upgrading both contint1001 and contint2001, why can't one become active while the other is reimaged/upgraded? Otherwise we have to take these perfectly good servers (R430s) and move them to the spares pool, where they are likely never to be reused due to their age relative to the rest of the spares pool (mostly R440s).

The end result is that this will, in reality, prematurely end-of-life the servers currently assigned to contint. That is non-ideal.

Can this upgrade be handled by upgrading either the codfw or the eqiad host first to test, and then failing over to it to upgrade the remaining one?

In T239880#5803335, @RobH wrote:

In T239880#5803332, @MoritzMuehlenhoff wrote:

@RobH: You're completely right, see https://phabricator.wikimedia.org/T224591#5737877

Thank you for the cross-linking of information, it is appreciated!

+1 to all of the above.

Thanks for the detailed response as always @RobH.
