
Q3:rack/setup/install phab2003
Closed, Resolved (Public)

Description

This task will track the racking, setup, and OS installation of phab2003

Hostname / Racking / Installation Details

Hostnames: phab2003.codfw.wmnet
Racking Proposal: codfw row C - C3
Networking Setup: # of Connections: 1 - Speed: 1G - VLAN: Private
OS Distro: Trixie
Boot Method: Legacy BIOS
Sub-team Technical Contact: Dzahn

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

phab2003:
  • Receive in system on procurement task T417685 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook

Event Timeline

RobH moved this task from Backlog to Racking Tasks on the ops-codfw board.

Please update the site.pp file with the insetup role for your team (detailed on https://wikitech.wikimedia.org/wiki/SRE/Dc-operations) and add the new servers to preseed.yml for partition info.

If possible, please reference this task number in your patch set, so it is clear when it is complete. Once complete, just un-assign yourself (leaving no assignee) from this task; once the hardware arrives, on-site engineers will claim this task for racking and setup. Please don't re-subscribe me to this task unless there is a direct question for me.

Thank you!

RobH mentioned this in Unknown Object (Task). Mar 3 2026, 6:32 PM
RobH added a parent task: Unknown Object (Task).
RobH unsubscribed.

Change #1247665 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site: add phab2003 with collab insetup role

https://gerrit.wikimedia.org/r/1247665

Change #1247665 merged by Dzahn:

[operations/puppet@production] site: add phab2003 with collab insetup role

https://gerrit.wikimedia.org/r/1247665

Dzahn removed Dzahn as the assignee of this task. Mar 3 2026, 7:33 PM
Dzahn subscribed.

Thank you. Done! Added to site.pp with insetup role. preseed.yml already covered by a wildcard.

Okay, two things with this server so far.
@Dzahn we won't be able to do legacy BIOS on these R470 servers. We'll need an EFI boot.

@elukey a new little difference on these config B servers, as opposed to the config Fs we got working: the provision cookbook can't seem to set PXE boot on the NIC.

(Added because I hit the button too soon.)

==> Unable to auto-detect NIC with link. Pick the one to set PXE on:
['NIC.Slot.5-1-1', 'NIC.Slot.5-2-1']
> NIC.Slot.5-1-1
User input is: "NIC.Slot.5-1-1"
Failed to run cookbooks.sre.hosts.provision.DellProvisionRunner._config_host: 'NIC.Slot.5-1-1'

It also throws this error when it tries to reboot the server to apply changes:

Running IPMI command: ipmitool -I lanplus -H phab2003.mgmt.codfw.wmnet -U root -E chassis power status
Error: Unable to establish IPMI v2 / RMCP+ session
Exception raised while executing cookbook sre.hosts.provision:
Traceback (most recent call last):
 File "/usr/lib/python3/dist-packages/spicerack/ipmi.py", line 86, in command
  output = run(command + command_parts, env=self.env.copy(), stdout=PIPE, check=True).stdout.decode()
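
The IPMI failure above surfaces as a raw traceback because the command is run with check=True. As a minimal sketch (not the spicerack implementation), the same ipmitool invocation shown in the log can be built and run so that a failed session is reported as a short message instead. The helper names `ipmi_command` and `run_checked` are hypothetical; the flags are copied from the logged command, and `-E` makes ipmitool read the password from the IPMI_PASSWORD environment variable.

```python
import subprocess
import sys


def ipmi_command(mgmt_host: str, *parts: str) -> list[str]:
    """Build the ipmitool command line seen in the log (hypothetical helper)."""
    return ["ipmitool", "-I", "lanplus", "-H", mgmt_host, "-U", "root", "-E", *parts]


def run_checked(command: list[str]) -> tuple[bool, str]:
    """Run a command and return (ok, text) instead of raising on failure."""
    try:
        result = subprocess.run(
            command, stdout=subprocess.PIPE, stderr=subprocess.PIPE, check=True
        )
    except FileNotFoundError:
        return False, f"{command[0]} is not installed"
    except subprocess.CalledProcessError as exc:
        # Prefer the tool's own error text, fall back to the exit code.
        return False, exc.stderr.decode().strip() or f"exit code {exc.returncode}"
    return True, result.stdout.decode().strip()


status_cmd = ipmi_command("phab2003.mgmt.codfw.wmnet", "chassis", "power", "status")

# Demonstrate the failure path with a stand-in command that exits non-zero,
# so the sketch does not need ipmitool or a reachable management interface.
failing = [sys.executable, "-c", "import sys; sys.stderr.write('boom'); sys.exit(1)"]
ok, text = run_checked(failing)
```

Calling `run_checked(status_cmd)` on a cumin host would then return `(False, ...)` with the ipmitool error text rather than an exception when the session cannot be established.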

Change #1270107 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] installserver: set UEFI-only recipes for newer phab* hosts

https://gerrit.wikimedia.org/r/1270107

Change #1270107 merged by Dzahn:

[operations/puppet@production] installserver: set UEFI-only recipes for newer phab* hosts

https://gerrit.wikimedia.org/r/1270107

@VRiley-WMF sorry for the lag, I completely lost track of this task!

I tried running the provision script today, but I get stuck at the first check_connection:

2026-04-17 15:53:48,195 elukey 2412615 [ERROR] GET https://10.193.2.154/redfish returned HTTP 401
Response payload: {'error': {'code': 'Base.1.18.GeneralError', 'message': 'A general error has occurred. See ExtendedInfo for more information.', '@Message.ExtendedInfo': [{'@odata.type': '#Message.v1_1_0.Message', 'MessageId': 'Base.1.18.AccessDenied', 'Message': 'The authentication credentials included with this request are missing or invalid.', 'MessageArgs': [], 'MessageArgs@odata.count': 0, 'RelatedProperties': [], 'RelatedProperties@odata.count': 0, 'Severity': 'Critical', 'Resolution': 'Attempt to ensure that the URI is correct and that the service has the appropriate credentials.'}]}}
2026-04-17 15:53:48,196 elukey 2412615 [ERROR] Failed to run cookbooks.sre.hosts.provision.DellProvisionRunner.run.<locals>.check_connection: Unable to connect to the Redfish API of phab2003. Follow https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Platform-specific_documentation/Dell_Documentation#Troubleshooting_2

I'll try to investigate more on Monday. Is anything weird happening on the host?
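
The 401 response above carries its real cause inside @Message.ExtendedInfo rather than in the top-level message. A minimal sketch of pulling that out, assuming the payload has already been decoded into a dict, could look like the following. The function names are hypothetical and not part of spicerack; the payload is a trimmed copy of the one in the log.

```python
from typing import Any


def redfish_message_ids(payload: dict[str, Any]) -> list[str]:
    """Collect the MessageId of every entry in the error's ExtendedInfo list."""
    extended = payload.get("error", {}).get("@Message.ExtendedInfo", [])
    return [entry["MessageId"] for entry in extended if "MessageId" in entry]


def is_access_denied(status_code: int, payload: dict[str, Any]) -> bool:
    """Treat HTTP 401 or an *.AccessDenied message as a credentials problem."""
    return status_code == 401 or any(
        message_id.endswith(".AccessDenied")
        for message_id in redfish_message_ids(payload)
    )


# Trimmed copy of the payload returned by GET /redfish in the log above.
payload = {
    "error": {
        "code": "Base.1.18.GeneralError",
        "message": "A general error has occurred. See ExtendedInfo for more information.",
        "@Message.ExtendedInfo": [
            {
                "MessageId": "Base.1.18.AccessDenied",
                "Message": "The authentication credentials included with this request are missing or invalid.",
                "Severity": "Critical",
            }
        ],
    }
}

message_ids = redfish_message_ids(payload)
denied = is_access_denied(401, payload)
```

Here the only message ID is Base.1.18.AccessDenied, which points at the management credentials rather than at the URI or the network path.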

Once the setup issues are resolved we will implement Phabricator and replace phab2002 with it over at T423727.

I was able to repro:

2026-04-20 17:39:37,345 elukey 3425810 [DEBUG wmflib.interactive:229 in confirm_on_failure] Traceback
Traceback (most recent call last):
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/provision.py", line 1021, in run
    self._config_host()
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/provision.py", line 1124, in _config_host
    self._disable_lldp(config)
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/provision.py", line 1283, in _disable_lldp
    if config.components[nic].get(attribute, '') == 'Enabled':
       ~~~~~~~~~~~~~~~~~^^^^^
KeyError: 'NIC.Slot.5-1-1'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/wmflib/interactive.py", line 224, in confirm_on_failure
    ret = func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/provision.py", line 1124, in _config_host
    self._disable_lldp(config)
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/provision.py", line 1283, in _disable_lldp
    if config.components[nic].get(attribute, '') == 'Enabled':
       ~~~~~~~~~~~~~~~~~^^^^^
KeyError: 'NIC.Slot.5-1-1'

And indeed:

>>> a.components.keys()
dict_keys(['BIOS.Setup.1-1', 'EventFilters.Audit.1', 'EventFilters.Configuration.1', 'EventFilters.Storage.1', 'EventFilters.SystemHealth.1', 'EventFilters.Updates.1', 'EventFilters.WorkNotes.1', 'LifecycleController.Embedded.1', 'RAID.SL.1-1', 'SupportAssist.Embedded.1', 'System.Embedded.1', 'iDRAC.Embedded.1'])

This looks like the same issue as T392851#10975444, which affects iDRAC 10 hosts, so _disable_lldp in the provision cookbook may not be compatible with iDRAC 10 yet.
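
The KeyError comes from indexing config.components with a NIC name that the iDRAC 10 configuration dump does not contain. As a rough sketch of one defensive approach (not the fix that was merged, and not the cookbook's actual code), the lookup can report missing NICs instead of raising. The function name and the attribute name `ExampleLldpAttribute` are hypothetical placeholders; the component keys and NIC names are taken from the output above.

```python
def nics_with_attribute_enabled(
    components: dict[str, dict[str, str]],
    nics: list[str],
    attribute: str,
) -> tuple[list[str], list[str]]:
    """Split NICs into those with the attribute enabled and those missing from the dump."""
    enabled: list[str] = []
    missing: list[str] = []
    for nic in nics:
        settings = components.get(nic)  # .get() avoids the KeyError from components[nic]
        if settings is None:
            missing.append(nic)
            continue
        if settings.get(attribute, "") == "Enabled":
            enabled.append(nic)
    return enabled, missing


# Component keys as reported by a.components.keys() on phab2003 (values omitted).
components = {
    key: {}
    for key in [
        "BIOS.Setup.1-1",
        "LifecycleController.Embedded.1",
        "RAID.SL.1-1",
        "System.Embedded.1",
        "iDRAC.Embedded.1",
    ]
}

enabled, missing = nics_with_attribute_enabled(
    components,
    ["NIC.Slot.5-1-1", "NIC.Slot.5-2-1"],
    "ExampleLldpAttribute",  # hypothetical attribute name
)
```

With this dump both NICs end up in `missing`, which a caller could log as a warning and skip, leaving LLDP to be handled another way on these hosts.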

The other error seems to be:

Created attribute BIOS.Setup.1-1 -> UncoreFrequency (with Set On Import True) with value DynamicUFS

[...]

The host advertises itself via LLDP as a Broadcom BCM57414 2x25G OCP Ethernet NIC (fw_version:AFW_236.1.126.0). Is it possible to look through the BIOS options to see whether there is something to disable LLDP there?
Once it is flipped, I can look at what changed in scp_dump() and provide a patch.

See T250367#11843361: I was able to disable it manually but not via Redfish. I'm going to test and merge https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1275509 to unblock this host and also solve T418899#11841356.

Change #1275889 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/cookbooks@master] sre.hosts.provision: make UncoreFrequency dynamic for iDRAC 10

https://gerrit.wikimedia.org/r/1275889

I used test-cookbook with https://gerrit.wikimedia.org/r/1275889 and it worked, the host is now provisioned. I'll wait for Jesse's review and merge it as well.

@Jhancock.wm you should be unblocked to reimage!

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host phab2003.codfw.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host phab2003.codfw.wmnet with OS trixie completed:

  • phab2003 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202604220159_jhancock_1838622_phab2003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
Jhancock.wm updated the task description.

@Dzahn finished up and ready to go

Thank you to everyone involved in getting this installed. Handing over to @Arnoldokoth

Change #1275889 merged by Elukey:

[operations/cookbooks@master] sre.hosts.provision: make UncoreFrequency dynamic for iDRAC 10

https://gerrit.wikimedia.org/r/1275889