
Q3:rack/setup/install cloudcephosd10(3[5-9]|40)
Open, Stalled, Medium, Public

Description

This task will track the racking, setup, and OS installation of cloudcephosd103[5-9] and cloudcephosd1040.

Hostname / Racking / Installation Details

Hostnames: cloudcephosd103[5-9], cloudcephosd1040
Racking Proposal: Rack in WMCS, can be spread around or racked in any rack as needed
Networking Setup: 2 10G connections (same as cloudcephosd1034)
Partitioning/Raid: Same as cloudcephosd1034
OS Distro: Bullseye
Sub-team Technical Contact: @dcaro
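The hostname range is written two ways in this task: as a regex in the title and as bracket shorthand in the Hostnames field. A minimal Python sketch, assuming only the pattern from the title, expands the range and checks that both notations describe the same six hosts. The helper name `expand_hosts` is hypothetical and not part of any WMF tooling.

```python
import re

# Pattern copied from the task title; fullmatch() avoids partial matches.
HOST_PATTERN = re.compile(r"cloudcephosd10(3[5-9]|40)")


def expand_hosts(prefix="cloudcephosd", start=1035, end=1040):
    """Return the hostnames in the inclusive numeric range (hypothetical helper)."""
    return [f"{prefix}{n}" for n in range(start, end + 1)]


hosts = expand_hosts()

# Every expanded hostname should match the title's regex.
assert all(HOST_PATTERN.fullmatch(h) for h in hosts)
```

The existing host cloudcephosd1034, referenced as the model for networking and partitioning, falls outside the pattern and is not matched.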

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

cloudcephosd1035:
  • - receive in system on procurement task T319446 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
cloudcephosd1036:
  • - receive in system on procurement task T319446 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
cloudcephosd1037:
  • - receive in system on procurement task T319446 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
cloudcephosd1038:
  • - receive in system on procurement task T319446 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
cloudcephosd1039:
  • - receive in system on procurement task T319446 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
cloudcephosd1040:
  • - receive in system on procurement task T319446 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
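
The per-host checklist above is the same eight steps repeated for each host. As a rough sketch (not the tooling actually used to write this task), the copy-and-paste could be generated in Python. The step text is taken from the checklist; the function name `render_checklist` and the plain `-` bullet are assumptions made for illustration.

```python
# Step text copied from the checklist in this task.
CHECKLIST_STEPS = [
    "receive in system on procurement task T319446 & in coupa",
    "rack system with proposed racking plan (see above) & update netbox "
    "(include all system info plus location, state of planned)",
    "add mgmt dns (asset tag and hostname) and production dns entries in netbox, "
    "run cookbook sre.dns.netbox.",
    "network port setup via netbox, run homer from an active cumin host to commit",
    "bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details",
    "firmware update (idrac, bios, network, raid controller)",
    "operations/puppet update - this should include updates to netboot.pp, "
    "and site.pp role(insetup) or cp systems use role(insetup::nofirm).",
    "OS installation & initial puppet run via sre.hosts.reimage cookbook.",
]


def render_checklist(hosts, steps=CHECKLIST_STEPS):
    """Render one checklist per host as plain text (hypothetical helper)."""
    lines = []
    for host in hosts:
        lines.append(f"{host}:")
        lines.extend(f"  - {step}" for step in steps)
    return "\n".join(lines)


checklist_text = render_checklist(
    [f"cloudcephosd{n}" for n in range(1035, 1041)]
)
```

Printing `checklist_text` yields six host headings, each followed by the eight steps.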

Event Timeline

RobH added a parent task: Unknown Object (Task).
RobH updated the task description.
RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.
RobH unsubscribed.

cloudcephosd1035 E4 U33 cableid. 20220009 port. 0 cableid. 20220007 port. 1
cloudcephosd1036 E4 U34 cableid. 20220008 port. 2 cableid. 20220016 port. 3
cloudcephosd1037 E4 U35 cableid. 20220011 port. 4 cableid. 20220014 port. 5
cloudcephosd1038 F4 U33 cableid. 20220005 port. 0 cableid. 20220012 port. 1
cloudcephosd1039 F4 U34 cableid. 20220015 port. 2 cableid. 20220013 port. 3
cloudcephosd1040 F4 U35 cableid. 20220019 port. 4 cableid. 20220006 port. 5
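
The racking lines above follow a regular shape: hostname, rack, rack unit, then repeated cable ID and port pairs. A minimal Python sketch, assuming that exact layout, turns one line into a structured record, which could help when cross-checking against Netbox. The names `RackEntry` and `parse_rack_line` are hypothetical.

```python
import re
from typing import List, NamedTuple, Tuple


class RackEntry(NamedTuple):
    host: str
    rack: str
    unit: int
    cables: List[Tuple[str, int]]  # (cable ID, switch port) pairs


# Layout assumed from the comment above: "<host> <rack> U<unit> cableid. <id> port. <n> ..."
LINE_RE = re.compile(
    r"^(?P<host>\S+)\s+(?P<rack>[A-Z]\d+)\s+U(?P<unit>\d+)\s+(?P<rest>.*)$"
)
CABLE_RE = re.compile(r"cableid\.\s*(\d+)\s+port\.\s*(\d+)")


def parse_rack_line(line: str) -> RackEntry:
    """Parse one racking line into a RackEntry (hypothetical helper)."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised rack line: {line!r}")
    cables = [(cid, int(port)) for cid, port in CABLE_RE.findall(match["rest"])]
    return RackEntry(match["host"], match["rack"], int(match["unit"]), cables)


entry = parse_rack_line(
    "cloudcephosd1035 E4 U33 cableid. 20220009 port. 0 cableid. 20220007 port. 1"
)
```

Cable IDs are kept as strings so that any leading zeros would survive; ports and rack units are converted to integers.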

Change 894582 had a related patch set uploaded (by Cmjohnson; author: Cmjohnson):

[operations/puppet@production] Adding new cloudcephosd hosts to site.pp insetup::nofirm

https://gerrit.wikimedia.org/r/894582

Change 894582 merged by Cmjohnson:

[operations/puppet@production] Adding new cloudcephosd hosts to site.pp insetup::nofirm

https://gerrit.wikimedia.org/r/894582

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1035.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1035.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephosd1035 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1040.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1040.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephosd1040 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Failed install but I didn't change the raid controller.

There doesn't seem to be a raid controller.

Screen Shot 2023-03-30 at 2.11.58 PM.png (353×571 px, 70 KB)

wiki_willy subscribed.

@Jclark-ctr - can you take a peek at this one to see if it's pending on anything from our side? Thanks, Willy

@Jclark-ctr These should be set up with software RAID just like last time. See @Andrew's comment: https://phabricator.wikimedia.org/T294972#8029219. @Andrew feel free to correct or jump in if needed.

Change 929800 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/puppet@production] Update role in site.pp for cloudcephosd1035-1040

https://gerrit.wikimedia.org/r/929800

Change 929800 merged by Papaul:

[operations/puppet@production] Update role in site.pp for cloudcephosd1035-1040

https://gerrit.wikimedia.org/r/929800

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudcephosd1035.eqiad.wmnet with OS bullseye

RobH mentioned this in Unknown Object (Task). Jun 14 2023, 1:39 PM
RobH changed the task status from Open to Stalled. Jun 14 2023, 1:42 PM

IRC Update Summary:

  • Papaul attempted to install the hosts, but the OS won't see the disks.
  • The disks show in the BIOS; Rob was pinged to double-check.
  • It turns out we ordered these custom hosts over many quotation iterations, and somewhere in that process the disk controller was dropped entirely.
    • Corrected our SKU review to disallow the 'no controller' SKU.
    • Emailed our reps via T339131 to order 6 disk controllers.
  • Once the controllers arrive, they'll be installed by on-site staff, which will unblock this installation.

Hello folks, sorry for arriving late to this ticket.

I'm not sure if this will be useful or not, but here's some context about what we'll be using these for: ceph OSDs typically have a pair of OS drives and then all the other drives are JBOD, managed directly by ceph. So we don't necessarily need hw raid support in these hosts, we could do sw raid for the OS drives.

I'm guessing that when you say they're missing a raid controller you mean that the drives simply don't work AT ALL, in which case we still have the same problem as before.

(edited to add:) apparently only the dell RAID controllers have nvme support, so that explains the raid confusion.

@Andrew usually we use the raid controller to configure the OS drives. I do not know if our OS install would recognize the correct drives. @Papaul would you know if we can configure the OS drives in software raid prior to the OS install?

@Jclark-ctr @Andrew even with the SW raid you still need the controller to be able to see the drives.

Any update on status from Dell on getting this hardware operational? Are we still waiting on the correct controller cards?

So this system only supports UEFI mode, which we have not supported for installs within WMF.

If I recall correctly, a few years ago we had a discussion with Moritz about UEFI mode and why we didn't use it at the time, but I may be misrecalling.

@MoritzMuehlenhoff do you recall this discussion? (If not, I've misrecalled; no worries.)

IRC update: Chatted with Moritz in IRC and we're nowhere near supporting UEFI mode in the near to mid term. We should likely return these.

Can someone provide an update on what's happening with these machines? Were they indeed sent back? Do we have replacement hardware?

So these machines aren't bootable and are being sent back. I'll sync with Willy later today to find out how to budget/process the replacement server order (since the purchase was in last fiscal year, but the refund lands in this fiscal year, etc.).

Servers have been boxed up and shipped out.

These hosts are still in Netbox and are marked as occupying switch ports etc - can those be cleaned up?

There are pending DNS changes in Netbox not committed to the auto-generated DNS repository related to those hosts since yesterday:

Fri 22:05:05   icinga-wm| PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL:...

Please always remember to run the sre.dns.netbox cookbook when modifying DNS records in Netbox. CC @RobH (from Netbox's changelog).

@wiki_willy @Jclark-ctr @RobH As I see that Rob is out this week, to unblock the rest of SREs with any DNS-related change in Netbox I'm running the sre.dns.netbox cookbook.

Done.
FYI cloudcephosd1040 had a wrong WMF asset tag, wmf108805; I guess it was supposed to be wmf10805.