Q3:rack/setup/install ms-fe102[1-4]
Closed, Resolved (Public)

Description

This task will track the racking, setup, and OS installation of ms-fe102[1-4].

Hostname / Racking / Installation Details

Hostnames: ms-fe102[1-4]
Racking Proposal: Where should these systems be racked? Avoid existing ms-fe nodes, please
Networking Setup: # of Connections: 1 - Speed: 10G - VLAN: Private
OS Distro: bullseye
Boot Method: UEFI
Sub-team Technical Contact: @MatthewVernon
Please add SRE-swift-storage to subsequent racking tasks

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

ms-fe1021
  • Receive in system on procurement task T413266 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, with state set to planned)
  • Run the "Provision a server's network attributes" Netbox script - Note that you must run the DNS and Provision cookbooks after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
ms-fe1022
  • Receive in system on procurement task T413266 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, with state set to planned)
  • Run the "Provision a server's network attributes" Netbox script - Note that you must run the DNS and Provision cookbooks after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
ms-fe1023
  • Receive in system on procurement task T413266 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, with state set to planned)
  • Run the "Provision a server's network attributes" Netbox script - Note that you must run the DNS and Provision cookbooks after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
ms-fe1024
  • Receive in system on procurement task T413266 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, with state set to planned)
  • Run the "Provision a server's network attributes" Netbox script - Note that you must run the DNS and Provision cookbooks after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook

Event Timeline

Jhancock.wm mentioned this in Unknown Object (Task). Feb 2 2026, 9:20 PM

@elukey These are failing as well, just like backup1015 in T414725. I haven’t made any changes to the server — it’s still using the default user and password.

@Jclark-ctr sadly from Redfish I don't see any LinkUp:

>>> pprint(r.request("GET", f"{r.system_manager}/EthernetInterfaces/NIC.Integrated.1-2-1").json()['LinkStatus'])
None
>>> pprint(r.request("GET", f"{r.system_manager}/EthernetInterfaces/NIC.Integrated.1-1-1").json()['LinkStatus'])
None

This appears to be iDRAC 9 with firmware version 7.20.80.50; I'm not sure why it behaves like that. I think the best way to unblock this is to manually select NIC.Integrated.1-1-1. I don't have other ideas :(
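
The reasoning in the comment above (use a port that reports LinkUp, otherwise fall back to a manually chosen one) can be sketched in plain Python. This is only an illustration: the function name `pick_nic` and the `observed` dictionary are hypothetical and not part of any cookbook, and the values simply mirror the two `None` results shown in the Redfish output. `"LinkUp"` is one of the values the Redfish `LinkStatus` property can take.

```python
from typing import Mapping, Optional


def pick_nic(link_status: Mapping[str, Optional[str]], fallback: str) -> str:
    """Return the first NIC reporting LinkUp, else the manually chosen fallback."""
    for nic, status in link_status.items():
        if status == "LinkUp":
            return nic
    # No port reported LinkUp (here the BMC returned None for both ports),
    # so use the NIC selected by hand.
    return fallback


# Hypothetical input mirroring the Redfish responses quoted above.
observed = {
    "NIC.Integrated.1-2-1": None,
    "NIC.Integrated.1-1-1": None,
}

print(pick_nic(observed, fallback="NIC.Integrated.1-1-1"))
```

With both ports returning `None`, the sketch falls through to the fallback, which matches the manual selection of NIC.Integrated.1-1-1 suggested in the comment.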

@MatthewVernon can you help with updating preseed.yaml for EFI booting?

Screenshot 2026-02-03 at 2.06.59 PM.png (878×1 px, 263 KB)

Jclark-ctr updated Other Assignee, added: Jclark-ctr.
Jclark-ctr subscribed.

Change #1236671 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] installserver: add EFI preseed config for ms-fe102[14]

https://gerrit.wikimedia.org/r/1236671

Change #1236671 merged by Elukey:

[operations/puppet@production] installserver: add EFI preseed config for ms-fe102[14]

https://gerrit.wikimedia.org/r/1236671

@Jclark-ctr Matthew is out this week, I just merged a change that should unblock you. Lemme know how it goes!

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host ms-fe1021.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host ms-fe1023.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host ms-fe1024.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host ms-fe1023.eqiad.wmnet with OS bullseye executed with errors:

  • ms-fe1023 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ms-fe1023.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host ms-fe1023.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host ms-fe1021.eqiad.wmnet with OS bullseye completed:

  • ms-fe1021 (WARN)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202602041244_jclark_2527779_ms-fe1021.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host ms-fe1024.eqiad.wmnet with OS bullseye completed:

  • ms-fe1024 (WARN)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202602041305_jclark_2528735_ms-fe1024.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host ms-fe1022.eqiad.wmnet with OS bullseye

Change #1236750 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] installserver: fix preseed config for ms-fe102[1-4]

https://gerrit.wikimedia.org/r/1236750

Change #1236750 merged by Elukey:

[operations/puppet@production] installserver: fix preseed config for ms-fe102[1-4]

https://gerrit.wikimedia.org/r/1236750
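
The two preseed changes differ in the host pattern in their titles: `ms-fe102[14]` versus `ms-fe102[1-4]`. The task does not show the contents of either change, so the following is only a guess at why the first change may not have covered every host. Assuming the patterns are bracket expressions of the kind used in shell globs, `[14]` matches the single characters `1` or `4`, while `[1-4]` matches the whole range. The helper name `matching_hosts` is hypothetical.

```python
from fnmatch import fnmatchcase

# The four hostnames from the task description.
HOSTS = [f"ms-fe102{n}" for n in range(1, 5)]


def matching_hosts(pattern, hosts=HOSTS):
    """Return the hosts whose name matches a shell-style glob pattern."""
    return [h for h in hosts if fnmatchcase(h, pattern)]


# "[14]" is a character set: it matches "1" or "4" only.
print(matching_hosts("ms-fe102[14]"))
# "[1-4]" is a character range: it matches "1" through "4".
print(matching_hosts("ms-fe102[1-4]"))
```

Under that assumption the first pattern selects only ms-fe1021 and ms-fe1024, which is at least consistent with those two hosts reimaging first while ms-fe1022 and ms-fe1023 failed until the second change was merged.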

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host ms-fe1022.eqiad.wmnet with OS bullseye executed with errors:

  • ms-fe1022 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ms-fe1022.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host ms-fe1023.eqiad.wmnet with OS bullseye executed with errors:

  • ms-fe1023 (FAIL)
    • Host successfully migrated to the new VLAN
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ms-fe1023.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host ms-fe1022.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host ms-fe1023.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host ms-fe1023.eqiad.wmnet with OS bullseye completed:

  • ms-fe1023 (PASS)
    • Host successfully migrated to the new VLAN
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202602041434_jclark_2543945_ms-fe1023.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host ms-fe1022.eqiad.wmnet with OS bullseye completed:

  • ms-fe1022 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202602041436_jclark_2543935_ms-fe1022.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Jclark-ctr updated the task description.

Change #1239643 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] swift: add 4 new eqiad frontends ms-fe102[1-4]

https://gerrit.wikimedia.org/r/1239643

Change #1239645 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] installserver: new ms-fe nodes are UEFI booted

https://gerrit.wikimedia.org/r/1239645

Change #1239643 merged by MVernon:

[operations/puppet@production] swift: add 4 new eqiad frontends ms-fe102[1-4]

https://gerrit.wikimedia.org/r/1239643

Change #1239645 merged by MVernon:

[operations/puppet@production] installserver: new ms-fe nodes are UEFI booted

https://gerrit.wikimedia.org/r/1239645