
Q3:rack/setup/install ml-staging2003
Open, High, Public

Description

This task will track the racking, setup, and OS installation of ml-staging2003.

Hostname / Racking / Installation Details

Hostnames: ml-staging2003
Racking Proposal: Where should these systems be racked? Can they share with any existing systems or should they avoid any other systems sharing their rack or row? (Note EQIAD now has rows A-F.) This host should not share a rack with other ml-staging hosts; otherwise it is free to go anywhere in codfw.
Networking Setup: # of Connections:1/2 - Speed:1G/10G. - VLAN:Private/Public/Other(Specify) : AAAA records:Y/N, Additional IP records (Cassandra)? Yes/No
Partitioning/Raid: HW Raid: Y/N, Partman recipe and/or desired Raid Level: same partman recipe as ml-staging2001/2
OS Distro: Bullseye (default unless otherwise specified)
Sub-team Technical Contact: @klausman

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

ml-staging2003:
  • Receive in system on procurement task T357414 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml and site.pp, with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
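
The ordering in the checklist above can be sketched in a few lines of Python. This is only an illustration: the helper names (`CHECKLIST`, `next_step`, `can_run`) are hypothetical, the step labels are paraphrased from the list, and the sketch assumes the steps run strictly in the listed order. It does not invoke Netbox or any cookbook.

```python
# Hypothetical helper that models the per-host checklist order from this task.
# Assumption: steps are completed strictly in the order listed.
# It only tracks ordering; it does not call Netbox or run any cookbook.
CHECKLIST = [
    "receive system (procurement task T357414, Coupa)",
    "rack system and update Netbox",
    "run Netbox script: Provision a server's network attributes",
    "run cookbook sre.dns.netbox",
    "run cookbook sre.hosts.provision",
    "run cookbook sre.hardware.upgrade-firmware",
    "update operations/puppet (preseed.yaml, site.pp)",
    "run cookbook sre.hosts.reimage",
]


def next_step(done):
    """Return the first checklist step not yet in `done`, or None if finished."""
    for step in CHECKLIST:
        if step not in done:
            return step
    return None


def can_run(step, done):
    """A step may run only once every earlier checklist step is done."""
    idx = CHECKLIST.index(step)
    return all(earlier in done for earlier in CHECKLIST[:idx])
```

For example, once the first three steps are recorded as done, `next_step` returns the sre.dns.netbox entry, matching the note that the DNS and provision cookbooks follow the Netbox script immediately.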


Event Timeline

RobH added a parent task: Unknown Object (Task). Feb 13 2024, 1:46 PM
RobH mentioned this in Unknown Object (Task).
RobH moved this task from Backlog to Racking Tasks on the ops-codfw board.

Tobias,

Please review this racking task for the GPU test host we're ordering for codfw, provide the needed details for the network, confirm the hostname, etc.

Please note that starting this quarter, the DC Ops team is asking SRE sub-teams to update the puppet repo (as not everyone in DC Ops has merge rights to puppet) for both the site.pp entry for this host (using the insetup role) and the partman info repo entries.

If you need assistance with these updates, please reach out to me and I can walk you (or whoever is designated within the ML team) through how to update these files so the DC Ops initial racking and installation can proceed.

As this is also our first of both the AMD CPU/chassis and the GPU, I expect we'll experience some technical challenges during this installation. When that happens we'll likely tag in other folks (like Moritz), but I didn't want to add them yet and spam them with non-relevant task updates.

Once this racking info is confirmed, you can remove the assignment from yourself and leave it assigned to no one. As it is in the codfw racking task column of the workboard, whichever of our onsites is present in codfw when it arrives can triage and rack the host. (This will most likely be Jenn, but I'm not sure; it depends on when it lands.)

RobH renamed this task from Q#:rack/setup/install ml-staging2003 to Q3:rack/setup/install ml-staging2003. Feb 13 2024, 1:52 PM
calbon raised the priority of this task from Medium to High. Feb 13 2024, 3:13 PM

I've updated the partman lines. I will update modules/profile/data/profile/installserver/preseed.yaml to include the new host in a moment, so standard imaging should pick the right recipe for the host.

Change 1006927 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/puppet@production] partman/preseed: Add ml-staging2003 to standard LW worker recipe

https://gerrit.wikimedia.org/r/1006927

Change 1006927 merged by Klausman:

[operations/puppet@production] partman/preseed: Add ml-staging2003 to standard LW worker recipe

https://gerrit.wikimedia.org/r/1006927

Removed Tobias as assignee so the new node can be initialized.