
(Need By: TBD) rack/setup/install ml-serve100[1-4]
Closed, ResolvedPublic

Description

This task will track the racking, setup, and OS installation of ml-serve100[1-4].

Please note these are GPU capable chassis/servers ordered via T266482, with the GPU cards being ordered on T266516 & a future task. Only 1 GPU was ordered initially, as the new AMD Radeon 5700 XT has NOT been test fitted and power tested in the chassis. Once a single GPU has passed testing, we can order the remaining GPUs.

We were conservative with this ordering cadence (one card to test), as GPU cards are non-returnable once opened, and even unopened returns incur a restocking fee.

Hostname / Racking / Installation Details

Hostnames: ml-serve100[1-4]
Racking Proposal: The 4 hosts will be in the same cluster, so they must be in different racks at minimum. Row diversity is preferred, but if a single row is at capacity, 2 hosts may need to share a row.
Networking/Subnet/VLAN/IP: 1G, internal1 vlan (systems have 1G/10G nics, only need 1G at this time)
Partitioning/Raid: match an-worker[1096-1101]
OS Distro: Buster
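
For anyone scripting against the `ml-serve100[1-4]` shorthand used in this task, here is a minimal sketch of expanding it into individual hostnames. The helper name `expand_hosts` is hypothetical (not an existing tool), and it only handles a single numeric range without zero padding.

```python
import re
from typing import List


def expand_hosts(pattern: str) -> List[str]:
    """Expand one bracketed numeric range, e.g. 'ml-serve100[1-4]'.

    Hypothetical helper: handles a single [start-end] range only.
    Patterns without a range are returned unchanged.
    """
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", pattern)
    if match is None:
        return [pattern]
    prefix, start, end, suffix = match.groups()
    return [f"{prefix}{n}{suffix}" for n in range(int(start), int(end) + 1)]


print(expand_hosts("ml-serve100[1-4]"))
# ['ml-serve1001', 'ml-serve1002', 'ml-serve1003', 'ml-serve1004']
```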

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

ml-serve1001:

  • - receive in system on procurement task T266482 & in coupa
  • - receive in GPU card on procurement task T266516 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

ml-serve1002:

  • - receive in system on procurement task T266482 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

ml-serve1003:

  • - receive in system on procurement task T266482 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

ml-serve1004:

  • - receive in system on procurement task T266482 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

Once the system(s) above have had all checkbox steps completed, this task can be resolved.


Event Timeline

RobH added parent tasks: Unknown Object (Task), Unknown Object (Task).
RobH mentioned this in Unknown Object (Task).Nov 2 2020, 6:37 PM
RobH added a parent task: Unknown Object (Task).Nov 6 2020, 12:07 AM
RobH renamed this task from (Need By: TBD) rack/setup/install ml-deploy100[1-4] to (Need By: TBD) rack/setup/install ml-serve100[1-4].Nov 10 2020, 5:53 PM
RobH updated the task description. (Show Details)
RobH updated the task description. (Show Details)
RobH mentioned this in Unknown Object (Task).Nov 30 2020, 7:48 PM
wiki_willy added subscribers: Jclark-ctr, wiki_willy.

Servers arrived Dec 23. @Jclark-ctr - can you install the GPU into one of these hosts?

Thanks,
Willy

@RobH @wiki_willy
Attached 3 photos. Unfortunately, the large cooling fins have fitment issues in the Dell case. Any future GPU cannot extend past the backplane bracket.

This comment was removed by Jclark-ctr.

@Jclark-ctr Thanks for the photos. For clarification, does this mean the GPU does not fit, or that the GPU does fit but any future GPUs ordered should be smaller in dimensions?

@Jclark-ctr Got it. Thanks.

@RobH @wiki_willy Can we set up a call to discuss next steps?

Sure. Additionally, my understanding of the GPU market is that any given GPU chipset is sold by multiple card manufacturers. This was a "XFX - AMD Radeon RX 5700 XT 8GB GDDR6 256Bit PCIE, 246601", so we may be able to find another manufacturer of this GPU chipset whose card has a smaller form factor that fits within the case.

This is exactly why we only wanted to order 1 to test! =]

Additionally I've created T271351 to determine if there is another version of this GPU chipset via Rahi that may have a differing fan/heatsink assembly that would suit our needs. Discussion for that can take place on task T271351.

Please note that after our sync up with @wiki_willy and @calbon, the followup steps are:

  • a new GPU will be ordered for testing via T271351.
    • due to having to order another test GPU, the actual rollout of GPUs through the ml-serve fleet will be delayed.
  • systems will be racked and imaged without GPUs, as they can perform a sub-set of their intended function without the GPU. (They will just be less powerful, and not do quite as much.)
  • the cryptocurrency mining boom has put a strain on the worldwide GPU market, so sourcing a replacement GPU may be problematic for a month or so; the system installs will proceed without GPUs for now.
    • once the GPU arrives and is installed into a system, that system will need to be reimaged to ensure the GPU is supported whenever the system is imaged/reimaged in the future.

This racking task had a mistake, introduced by me: it specified 10G networking. These hosts have 1G/10G NICs, but ONLY need 1G. I had updated the codfw racking task, but failed to update this one.

As a result, @Jclark-ctr had already racked 2 of them in 10G racks as of today, and will need to relocate them into 1G racks. I've also updated the task to note that row diversity is preferred, but if one row in eqiad is overpopulated due to cloud, then two servers may have to share a row (but none should share a rack).

Hosts racked, working on cabling, netbox updated.
host          port  rack
ml-serve1001  12    a1
ml-serve1002  36    b5
ml-serve1003  5     c3
ml-serve1004  27    d8
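
The racking constraint from the task description (no two hosts in the same rack, row diversity preferred) can be checked against the assignments above with a small sketch like this. It assumes the first letter of each rack label is the row, which matches the a1/b5/c3/d8 labels here but is not stated explicitly in the task; `check_diversity` is a hypothetical helper.

```python
from collections import Counter

# Rack assignments as reported in the comment above (host -> rack).
RACKS = {
    "ml-serve1001": "a1",
    "ml-serve1002": "b5",
    "ml-serve1003": "c3",
    "ml-serve1004": "d8",
}


def check_diversity(racks):
    """Return (shared_racks, shared_rows).

    Assumption: the first character of a rack label is its row.
    A shared rack violates the hard requirement; a shared row is
    only a deviation from the stated preference.
    """
    rack_counts = Counter(rack.lower() for rack in racks.values())
    row_counts = Counter(rack[0].lower() for rack in racks.values())
    shared_racks = sorted(r for r, n in rack_counts.items() if n > 1)
    shared_rows = sorted(r for r, n in row_counts.items() if n > 1)
    return shared_racks, shared_rows


shared_racks, shared_rows = check_diversity(RACKS)
print("shared racks:", shared_racks)
print("shared rows:", shared_rows)
```

With the four assignments listed, both lists come back empty, i.e. each host is in its own rack and its own row.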

Ran the dns cookbook for these hosts fine, but the homer script has issues:

robh@cumin1001:~$ sudo homer asw*eqiad* diff
INFO:homer.devices:Initialized 35 devices
INFO:homer:Generating diff for query asw*eqiad*
INFO:homer:Gathering global Netbox data
INFO:homer.devices:Matched 4 device(s) for query 'asw*eqiad*'
INFO:homer:Generating configuration for asw2-a-eqiad.mgmt.eqiad.wmnet
ERROR:homer.transports.junos:Failed to get diff for asw2-a-eqiad.mgmt.eqiad.wmnet: ConfigLoadError(severity: error, bad_element: 12, message: error: invalid interface name in 12
error: error recovery ignores input until this point
warning: mgd: statement must contain additional statements
error: invalid interface type: 12
error: error recovery ignores input until this point
warning: mgd: statement has no contents; ignored)
INFO:homer:Generating configuration for asw2-b-eqiad.mgmt.eqiad.wmnet
ERROR:homer.transports.junos:Failed to get diff for asw2-b-eqiad.mgmt.eqiad.wmnet: ConfigLoadError(severity: error, bad_element: 36, message: error: invalid interface name in 36
error: error recovery ignores input until this point
warning: mgd: statement must contain additional statements
error: invalid interface type: 36
error: error recovery ignores input until this point
warning: mgd: statement has no contents; ignored)
INFO:homer:Generating configuration for asw2-c-eqiad.mgmt.eqiad.wmnet
ERROR:homer.transports.junos:Failed to get diff for asw2-c-eqiad.mgmt.eqiad.wmnet: ConfigLoadError(severity: error, bad_element: 5, message: error: invalid interface name in 5
error: error recovery ignores input until this point
warning: mgd: statement must contain additional statements
error: invalid interface type: 5
error: error recovery ignores input until this point
warning: mgd: statement has no contents; ignored)
INFO:homer:Generating configuration for asw2-d-eqiad.mgmt.eqiad.wmnet
ERROR:homer.transports.junos:Failed to get diff for asw2-d-eqiad.mgmt.eqiad.wmnet: ConfigLoadError(severity: error, bad_element: 27, message: error: invalid interface name in 27
error: error recovery ignores input until this point
warning: mgd: statement must contain additional statements
error: invalid interface type: 27
error: error recovery ignores input until this point
warning: mgd: statement has no contents; ignored)
Changes for 4 devices: ['asw2-a-eqiad.mgmt.eqiad.wmnet', 'asw2-b-eqiad.mgmt.eqiad.wmnet', 'asw2-c-eqiad.mgmt.eqiad.wmnet', 'asw2-d-eqiad.mgmt.eqiad.wmnet']
# Failed
---------------
ERROR:homer:Homer run had issues on 4 devices: ['asw2-a-eqiad.mgmt.eqiad.wmnet', 'asw2-b-eqiad.mgmt.eqiad.wmnet', 'asw2-c-eqiad.mgmt.eqiad.wmnet', 'asw2-d-eqiad.mgmt.eqiad.wmnet']
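
The `bad_element` values in the errors above (12, 36, 5, 27) are the same bare port numbers listed in the racking comment, which suggests the interface names reaching the switch config were just numbers rather than full Junos interface names. A rough pre-flight check could look like the sketch below. The regex covers only the common `type-fpc/pic/port` form for `ge`, `xe` and `et` interfaces, and `ge-1/0/12` is a made-up example name; the real interface names for these hosts are not given in this task.

```python
import re

# Junos physical interfaces are named type-fpc/pic/port, e.g. ge-0/0/0.
# Only ge/xe/et are accepted here; other types would need adding.
JUNOS_IFACE = re.compile(r"^(ge|xe|et)-\d+/\d+/\d+$")


def invalid_interfaces(names):
    """Return the names that do not look like Junos physical interfaces."""
    return [name for name in names if not JUNOS_IFACE.match(name)]


# The bare port numbers rejected in the homer output above.
print(invalid_interfaces(["12", "36", "5", "27"]))

# Hypothetical well-formed name, for comparison only.
print(invalid_interfaces(["ge-1/0/12"]))
```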

host configured and racked

fixed the netbox issue, hosts will be imaged later today

I hadn't circled back to this yet, but it's on my radar to address/image today!

So I went to image these, and none of the mgmt interfaces are pingable. So either they aren't plugged in, or they were misconfigured. Since they aren't remotely accessible, I cannot troubleshoot this.

We'll need either @Jclark-ctr or @Cmjohnson to investigate and update the task description about the hosts being mgmt configured.
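
For reference, a minimal sketch of the kind of mgmt reachability check described above. The `<host>.mgmt.eqiad.wmnet` naming is an assumption inferred from the switch names in the homer output (e.g. `asw2-a-eqiad.mgmt.eqiad.wmnet`), and the `-c`/`-W` flags assume the Linux iputils `ping`. The function names are hypothetical.

```python
import subprocess


def mgmt_fqdn(host, site="eqiad"):
    """Build the assumed mgmt name: <host>.mgmt.<site>.wmnet."""
    return f"{host}.mgmt.{site}.wmnet"


def is_pingable(fqdn, timeout_s=2):
    """Send one ICMP echo with the system ping; True if it got a reply."""
    try:
        result = subprocess.run(
            ["ping", "-c", "1", "-W", str(timeout_s), fqdn],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s + 3,
        )
    except (OSError, subprocess.TimeoutExpired):
        # ping binary missing, or the call hung: treat as unreachable.
        return False
    return result.returncode == 0


def report(hosts):
    """Print one reachability line per host's mgmt interface."""
    for host in hosts:
        fqdn = mgmt_fqdn(host)
        state = "reachable" if is_pingable(fqdn) else "UNREACHABLE"
        print(f"{fqdn}: {state}")
```

Calling `report(["ml-serve1001", "ml-serve1002", "ml-serve1003", "ml-serve1004"])` from a host with access to the mgmt network would print one line per host. An unreachable result cannot distinguish an unplugged cable from a misconfigured interface, so on-site investigation is still needed.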

Change 655793 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] updating puppet repo for ml-serve100[1-4]

https://gerrit.wikimedia.org/r/655793

Change 655793 merged by RobH:
[operations/puppet@production] updating puppet repo for ml-serve100[1-4]

https://gerrit.wikimedia.org/r/655793

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

ml-serve1001.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202101131809_robh_27635_ml-serve1001_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['ml-serve1001.eqiad.wmnet']

Of which those FAILED:

['ml-serve1001.eqiad.wmnet']

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

['ml-serve1001.eqiad.wmnet', 'ml-serve1002.eqiad.wmnet', 'ml-serve1003.eqiad.wmnet', 'ml-serve1004.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202101131946_robh_10085.log.

Completed auto-reimage of hosts:

['ml-serve1002.eqiad.wmnet', 'ml-serve1001.eqiad.wmnet', 'ml-serve1003.eqiad.wmnet', 'ml-serve1004.eqiad.wmnet']

Of which those FAILED:

['ml-serve1002.eqiad.wmnet', 'ml-serve1001.eqiad.wmnet', 'ml-serve1003.eqiad.wmnet', 'ml-serve1004.eqiad.wmnet']

So 100[123] failed the final part of the reimage script (the puppet run). I don't know why yet and need to investigate.

1004 is failing to get a DHCP response for PXE boot, and isn't being seen by the installer host request log/syslog. The switch shows the port online, so I need to investigate that as well.

Overall I've been fighting these all day to make them do what I want, so I'm quite frustrated and need to step away from them for a short bit.

Change 655997 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] correcting ml-serve1004 mac

https://gerrit.wikimedia.org/r/655997

Change 655997 merged by RobH:
[operations/puppet@production] correcting ml-serve1004 mac

https://gerrit.wikimedia.org/r/655997

ml-serve1004 will PXE boot now.

Next step is to use the reimage script again and investigate its errors.
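
Since the PXE failure came down to a wrong MAC in the puppet repo, here is a small sketch for comparing a recorded MAC against the one observed on the switch or iDRAC, regardless of how each is formatted. The MAC values in the example are made up; the real ml-serve1004 MAC is not given in this task.

```python
import re


def normalize_mac(mac):
    """Normalize a MAC to lowercase colon-separated form.

    Accepts common separators (':', '-', '.') or none at all.
    """
    digits = re.sub(r"[^0-9A-Fa-f]", "", mac).lower()
    if len(digits) != 12:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2))


def mac_matches(recorded, observed):
    """True if both strings describe the same MAC address."""
    return normalize_mac(recorded) == normalize_mac(observed)


# Made-up values for illustration only.
print(mac_matches("AA-BB-CC-00-11-22", "aabb.cc00.1122"))      # same address
print(mac_matches("aa:bb:cc:00:11:22", "aa:bb:cc:00:11:23"))   # off by one
```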

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

ml-serve1001.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202101142000_robh_6361_ml-serve1001_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['ml-serve1001.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

['ml-serve1002.eqiad.wmnet', 'ml-serve1003.eqiad.wmnet', 'ml-serve1004.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202101142059_robh_23189.log.

Completed auto-reimage of hosts:

['ml-serve1002.eqiad.wmnet', 'ml-serve1004.eqiad.wmnet', 'ml-serve1003.eqiad.wmnet']

and were ALL successful.

RobH updated the task description. (Show Details)

All four ml-serve hosts have been installed. They do not yet have GPUs, but our sync up showed they could be used without them for now.

Ready for your team @calbon