(Need By: TBD) rack/setup/install ml-serve200[1-4]
Closed, ResolvedPublic

Description

This task will track the racking, setup, and OS installation of ml-serve200[1-4].

As these hosts will eventually have a GPU but do not have one now, they should likely be set up purely to ensure all current hardware is operational. They likely won't be able to have final roles applied, as adding the GPU will require a reimage.

Hostname / Racking / Installation Details

Hostnames: ml-serve200[1-4]
Racking Proposal: All 4 hosts will be in the same cluster, so each should be in a different rack at minimum.
Networking/Subnet/VLAN/IP: 1G, internal1 vlan (systems have 1G/10G nics, only need 1G at this time)
Partitioning/Raid: hw raid 1 of ssd for os, hw raid1 of sata for data/srv
OS Distro: buster
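
The bracket notation `ml-serve200[1-4]` used above is shorthand for four hosts. As a minimal sketch (not part of any existing tooling), it could be expanded into fully qualified names like this. The function name `expand_host_range` is hypothetical, and the `codfw.wmnet` domain is taken from the reimage log entries later in this task.

```python
import re
from typing import List


def expand_host_range(pattern: str, domain: str = "codfw.wmnet") -> List[str]:
    """Expand a pattern such as 'ml-serve200[1-4]' into fully qualified names.

    Hypothetical helper for illustration only; it handles a single
    numeric [start-end] range and nothing more elaborate.
    """
    match = re.fullmatch(
        r"(?P<prefix>[^\[\]]*)\[(?P<start>\d+)-(?P<end>\d+)\](?P<suffix>[^\[\]]*)",
        pattern,
    )
    if match is None:
        # No bracket range present: treat the pattern as one hostname.
        return [f"{pattern}.{domain}"]

    start, end = int(match["start"]), int(match["end"])
    if start > end:
        raise ValueError(f"empty range in pattern: {pattern!r}")

    # Preserve zero padding if the range was written with leading zeros.
    width = len(match["start"])
    return [
        f"{match['prefix']}{n:0{width}d}{match['suffix']}.{domain}"
        for n in range(start, end + 1)
    ]


hosts = expand_host_range("ml-serve200[1-4]")
for host in hosts:
    print(host)
```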

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

ml-serve2001: Rack A5 U13/14 ge-5/0/12

  • - receive in system on procurement task T267651 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

ml-serve2002: Rack B5 U15/16 ge-5/0/14

  • - receive in system on procurement task T267651 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

ml-serve2003: Rack C5 U1/2 ge-5/0/0

  • - receive in system on procurement task T267651 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

ml-serve2004: Rack D6 U5/6 ge-6/0/4

  • - receive in system on procurement task T267651 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

Once the system(s) above have had all checkbox steps completed, this task can be resolved.
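
The racking proposal asks for the four hosts to sit in different racks. A small sketch of how that constraint could be checked against the rack positions listed in the checklist headings above follows. The `RACKING` mapping is copied by hand from this task, and `shared_racks` is a hypothetical helper, not an existing Netbox or ops tool.

```python
from collections import Counter
from typing import Dict, List

# Rack assignments copied from the per-host checklist headings in this task.
RACKING: Dict[str, str] = {
    "ml-serve2001": "A5",
    "ml-serve2002": "B5",
    "ml-serve2003": "C5",
    "ml-serve2004": "D6",
}


def shared_racks(racking: Dict[str, str]) -> Dict[str, List[str]]:
    """Return racks that hold more than one host, with the hosts in each.

    An empty result means every host is in its own rack.
    """
    counts = Counter(racking.values())
    return {
        rack: sorted(host for host, r in racking.items() if r == rack)
        for rack, n in counts.items()
        if n > 1
    }


conflicts = shared_racks(RACKING)
if conflicts:
    print("Hosts sharing a rack:", conflicts)
else:
    print("All hosts are in distinct racks.")
```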

Event Timeline

Restricted Application added a subscriber: Aklapper.
RobH added a parent task: Unknown Object (Task).
RobH moved this task from Backlog to Racking Tasks on the ops-codfw board.
RobH removed a subscriber: RobH.
RobH mentioned this in Unknown Object (Task). Nov 10 2020, 6:02 PM

@klausman what is the partman recipe to use for those servers and are we using 10G or 1G?

Thanks

Networking will be 1G. No hw RAID.

As for partitioning, there currently is no partman recipe available that does exactly what we want (2x SSD RAID1 for the OS, 2x (or more) spinning disks in RAID1 for /srv).

FWIW, as long as we can get the DRACs configured and pointers to the Netbox entries, we can probably figure out the auto-install setup (including partitioning) ourselves.

If that isn't possible, I'll need some time to figure out a partman recipe using a test cloud vm first.

Change 648258 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/puppet@production] DHCP: Add MAC address for ml-serve200[1234]

https://gerrit.wikimedia.org/r/648258

Change 648258 merged by Papaul:
[operations/puppet@production] DHCP: Add MAC address for ml-serve200[1234]

https://gerrit.wikimedia.org/r/648258

Change 648262 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/puppet@production] Add ml-serve200[1234] to site.pp

https://gerrit.wikimedia.org/r/648262

Change 648262 merged by Papaul:
[operations/puppet@production] Add ml-serve200[1234] to site.pp

https://gerrit.wikimedia.org/r/648262

Papaul updated the task description.
Papaul added a subscriber: Papaul.

@klausman this is done from my end

Change 655024 had a related patch set uploaded (by Kormat; owner: Kormat):
[operations/puppet@production] install_server: Use a dummy partman config for d-i-test

https://gerrit.wikimedia.org/r/655024

Change 655024 merged by Kormat:
[operations/puppet@production] install_server: Use a dummy partman config for d-i-test

https://gerrit.wikimedia.org/r/655024

Change 655025 had a related patch set uploaded (by Kormat; owner: Kormat):
[operations/puppet@production] install_server: Drop virtual.cfg from d-i-test partman cfg

https://gerrit.wikimedia.org/r/655025

Change 655025 merged by Kormat:
[operations/puppet@production] install_server: Drop virtual.cfg from d-i-test partman cfg

https://gerrit.wikimedia.org/r/655025

Change 655026 had a related patch set uploaded (by Kormat; owner: Kormat):
[operations/puppet@production] install_server: Fix typo in d-i-test config

https://gerrit.wikimedia.org/r/655026

Change 655026 merged by Kormat:
[operations/puppet@production] install_server: Fix typo in d-i-test config

https://gerrit.wikimedia.org/r/655026

Change 655041 had a related patch set uploaded (by Kormat; owner: Kormat):
[operations/puppet@production] install_server: Fix ml-serve/raid1-2x2dev partman configs

https://gerrit.wikimedia.org/r/655041

Change 655041 merged by Kormat:
[operations/puppet@production] install_server: Fix ml-serve/raid1-2x2dev partman configs

https://gerrit.wikimedia.org/r/655041

Script wmf-auto-reimage was launched by klausman on cumin2001.codfw.wmnet for hosts:

rdb2004.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202101081218_klausman_11845_rdb2004_codfw_wmnet.log.

Completed auto-reimage of hosts:

['rdb2004.codfw.wmnet']

Of which those FAILED:

['rdb2004.codfw.wmnet']

Script wmf-auto-reimage was launched by klausman on cumin2001.codfw.wmnet for hosts:

ml-serve2001.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202101081231_klausman_13556_ml-serve2001_codfw_wmnet.log.

Completed auto-reimage of hosts:

['ml-serve2001.codfw.wmnet']

and were ALL successful.

Change 655050 had a related patch set uploaded (by Kormat; owner: Kormat):
[operations/puppet@production] install_server: Revert changes to d-i-test's config

https://gerrit.wikimedia.org/r/655050

Change 655050 merged by Kormat:
[operations/puppet@production] install_server: Revert changes to d-i-test's config

https://gerrit.wikimedia.org/r/655050

Script wmf-auto-reimage was launched by klausman on cumin2001.codfw.wmnet for hosts:

ml-serve2004.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202101081313_klausman_23128_ml-serve2004_codfw_wmnet.log.

Script wmf-auto-reimage was launched by klausman on cumin2001.codfw.wmnet for hosts:

ml-serve2003.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202101081313_klausman_23083_ml-serve2003_codfw_wmnet.log.

Script wmf-auto-reimage was launched by klausman on cumin2001.codfw.wmnet for hosts:

ml-serve2002.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202101081313_klausman_23043_ml-serve2002_codfw_wmnet.log.

Completed auto-reimage of hosts:

['ml-serve2003.codfw.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['ml-serve2002.codfw.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['ml-serve2004.codfw.wmnet']

and were ALL successful.
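
The log paths in the reimage messages above appear to encode a start time, the user, a numeric run identifier, and the hostname with dots replaced by underscores. That layout is inferred only from the four paths in this task, not from any `wmf-auto-reimage` documentation, so the sketch below is an assumption. The names `ReimageLog` and `parse_reimage_log` are hypothetical, and the numeric field may or may not be a process ID.

```python
from datetime import datetime
from pathlib import PurePosixPath
from typing import NamedTuple


class ReimageLog(NamedTuple):
    started: datetime  # assumed: YYYYMMDDHHMM prefix of the file name
    user: str          # assumed: user who launched the script
    run_id: int        # assumed: numeric field, possibly a PID
    host: str          # hostname with underscores turned back into dots


def parse_reimage_log(path: str) -> ReimageLog:
    """Split a log path into its apparent fields.

    Assumes the user name contains no underscore, which holds for
    the paths shown in this task but is not guaranteed in general.
    """
    stem = PurePosixPath(path).stem
    stamp, user, run_id, host = stem.split("_", 3)
    return ReimageLog(
        started=datetime.strptime(stamp, "%Y%m%d%H%M"),
        user=user,
        run_id=int(run_id),
        host=host.replace("_", "."),
    )


log = parse_reimage_log(
    "/var/log/wmf-auto-reimage/"
    "202101081231_klausman_13556_ml-serve2001_codfw_wmnet.log"
)
print(log.host, log.started.isoformat())
```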

The machines now have a base install (i.e. there is nothing special for them in puppet).

The machines in eqiad should install correctly out of the box, though I have noticed that the boot order of the codfw machines was wrong, in that they always tried to PXE boot first.

@klausman any update on this? If the install is done, can you please resolve the task?
Thanks.

Hi @klausman - just following up here, to see if we can close out this task? Thanks, Willy

Yes, this is all done!