This task will track the racking, setup, and OS installation of ml-serve100[1-4].
Please note these are GPU-capable chassis/servers ordered via T266482, with the GPU cards being ordered on T266516 and a future task. Only 1 GPU was ordered initially, as the new AMD Radeon 5700 XT has NOT yet been test-fitted and power-tested in the chassis. Once a single GPU has passed testing, we can order the remaining GPUs.
We were conservative with this ordering cadence (one card to test first), as GPU cards are non-returnable once opened, and even cards returned unopened incur a restocking fee.
Hostname / Racking / Installation Details
Hostnames: ml-serve100[1-4]
Racking Proposal: All 4 hosts will be in the same cluster, so each must be in a different rack at minimum. Row diversity is preferred, but if a single row is at capacity, it is understood that 2 hosts may need to share a row.
Networking/Subnet/VLAN/IP: 1G, internal1 vlan (systems have 1G/10G NICs; only 1G is needed at this time)
Partitioning/Raid: match an-worker[1096-1101]
OS Distro: Buster
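As a rough illustration of the racking proposal above, the sketch below expands the hostname range and checks a candidate placement against the two constraints: distinct racks are required, and shared rows only produce a warning. The row and rack labels used in the example are hypothetical placeholders, not actual datacenter locations, and the helper names `expand_hosts` and `check_racking` are made up for this sketch.

```python
import re
from collections import Counter


def expand_hosts(pattern):
    """Expand a bracket range such as 'ml-serve100[1-4]' into hostnames."""
    m = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", pattern)
    if m is None:
        return [pattern]  # no range present, treat as a single hostname
    prefix, start, end, suffix = m.groups()
    width = len(start)  # keep zero padding if the range uses it
    return [
        f"{prefix}{n:0{width}d}{suffix}"
        for n in range(int(start), int(end) + 1)
    ]


def check_racking(placement):
    """Check a placement of hostname -> (row, rack).

    Returns (ok, warnings): ok is False if any two hosts share a rack
    (hard requirement); warnings list rows holding more than one host
    (row diversity is preferred, not required).
    """
    racks = Counter(placement.values())
    shared_racks = [loc for loc, n in racks.items() if n > 1]
    rows = Counter(row for row, _rack in placement.values())
    warnings = [
        f"row {row} holds {n} hosts"
        for row, n in sorted(rows.items())
        if n > 1
    ]
    return (not shared_racks, warnings)


# Hypothetical placement: rows/racks are illustrative only.
hosts = expand_hosts("ml-serve100[1-4]")
placement = dict(zip(hosts, [("A", "3"), ("B", "5"), ("C", "2"), ("C", "7")]))
ok, warnings = check_racking(placement)
```

Here the placement would be accepted (all racks differ), with a warning that one row holds two hosts, which matches the fallback the proposal allows when a row is at capacity.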
Per host setup checklist
Each host should have its own setup checklist copied and pasted into the list below.
ml-serve1001:
- - receive in system on procurement task T266482 & in coupa
- - receive in GPU card on procurement task T266516 & in coupa
- - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
- - bios/drac/serial setup/testing
- - mgmt dns entries added for both asset tag and hostname
- - network port setup (description, enable, vlan)
- end on-site specific steps
- - production dns entries added
- - operations/puppet update (install_server at minimum, other files if possible)
- - OS installation
- - puppet accept/initial run (with role:spare)
- - host state in netbox set to staged
ml-serve1002:
- - receive in system on procurement task T266482 & in coupa
- - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
- - bios/drac/serial setup/testing
- - mgmt dns entries added for both asset tag and hostname
- - network port setup (description, enable, vlan)
- end on-site specific steps
- - production dns entries added
- - operations/puppet update (install_server at minimum, other files if possible)
- - OS installation
- - puppet accept/initial run (with role:spare)
- - host state in netbox set to staged
ml-serve1003:
- - receive in system on procurement task T266482 & in coupa
- - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
- - bios/drac/serial setup/testing
- - mgmt dns entries added for both asset tag and hostname
- - network port setup (description, enable, vlan)
- end on-site specific steps
- - production dns entries added
- - operations/puppet update (install_server at minimum, other files if possible)
- - OS installation
- - puppet accept/initial run (with role:spare)
- - host state in netbox set to staged
ml-serve1004:
- - receive in system on procurement task T266482 & in coupa
- - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
- - bios/drac/serial setup/testing
- - mgmt dns entries added for both asset tag and hostname
- - network port setup (description, enable, vlan)
- end on-site specific steps
- - production dns entries added
- - operations/puppet update (install_server at minimum, other files if possible)
- - OS installation
- - puppet accept/initial run (with role:spare)
- - host state in netbox set to staged
Once the system(s) above have had all checkbox steps completed, this task can be resolved.
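Since the four checklists above differ only in the GPU receiving step (present for ml-serve1001 alone, as only one card was ordered initially), they could be produced from a single template rather than copied by hand. The following is a minimal sketch under that assumption; the function name `checklist_for` and the `has_gpu` flag are hypothetical, and the `- -` item marker simply mirrors how the list items appear in this task.

```python
ON_SITE_STEPS = [
    "receive in system on procurement task T266482 & in coupa",
    "rack system with proposed racking plan (see above) & update netbox "
    "(include all system info plus location, state of planned)",
    "bios/drac/serial setup/testing",
    "mgmt dns entries added for both asset tag and hostname",
    "network port setup (description, enable, vlan)",
]
GPU_STEP = "receive in GPU card on procurement task T266516 & in coupa"
REMOTE_STEPS = [
    "production dns entries added",
    "operations/puppet update (install_server at minimum, other files if possible)",
    "OS installation",
    "puppet accept/initial run (with role:spare)",
    "host state in netbox set to staged",
]


def checklist_for(host, has_gpu=False):
    """Return the checklist lines for one host, in the layout used above."""
    on_site = list(ON_SITE_STEPS)
    if has_gpu:
        # The GPU receiving step follows the system receiving step.
        on_site.insert(1, GPU_STEP)
    lines = [f"{host}:"]
    lines += [f"- - {step}" for step in on_site]
    lines.append("- end on-site specific steps")
    lines += [f"- - {step}" for step in REMOTE_STEPS]
    return lines


# Only ml-serve1001 receives a GPU card in the initial order.
for n in range(1, 5):
    host = f"ml-serve100{n}"
    print("\n".join(checklist_for(host, has_gpu=(n == 1))))
```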