
Q4:rack/setup/install ms-fe101[56]
Closed, Resolved, Public

Description

This task will track the racking, setup, and OS installation of ms-fe101[56].

Hostname / Racking / Installation Details

Hostnames: ms-fe101[56]
Racking Proposal: Avoid racks with current ms-fe10* nodes
Networking Setup: 10G production network
OS Distro: Bullseye
Sub-team Technical Contact: @MatthewVernon
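
The `ms-fe101[56]` shorthand in the title and hostname field reads as a character class: one hostname per character inside the brackets. A minimal Python sketch of that expansion follows. The function name `expand_hosts` is hypothetical, and the `.eqiad.wmnet` suffix is taken from the cookbook log lines later in this task. It handles only a single bracketed set of single characters (commas ignored, as in `ms-fe101[5,6]`), not ranges.

```python
import re


def expand_hosts(pattern: str, domain: str = "") -> list[str]:
    """Expand one bracketed set of single characters into hostnames."""
    match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", pattern)
    if match is None:
        # No bracket expression: treat the pattern as a single hostname.
        return [pattern + domain]
    prefix, chars, suffix = match.groups()
    # Commas are treated as separators, so "[56]" and "[5,6]" are equivalent.
    return [prefix + c + suffix + domain for c in chars if c != ","]


print(expand_hosts("ms-fe101[56]", ".eqiad.wmnet"))
```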

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

ms-fe1015
  • Receive in system on procurement task T385040 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the "Provision a server's network attributes" Netbox script. Note that you must run the DNS and Provision cookbooks after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
ms-fe1016
  • Receive in system on procurement task T385040 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the "Provision a server's network attributes" Netbox script. Note that you must run the DNS and Provision cookbooks after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
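
Since each host needs its own copy of the same ordered steps, the checklist can be treated as a template. The sketch below is only an illustration of that idea: the step order and cookbook names come from the checklist above, the step wording is abbreviated, and `SETUP_STEPS` and `render_checklists` are hypothetical names rather than part of any existing tooling.

```python
# Step order copied from the checklist above; the wording is abbreviated.
SETUP_STEPS = [
    "Receive in system on procurement task and in Coupa",
    "Rack system and update Netbox",
    "Run the Netbox network-provisioning script",
    "Run the sre.dns.netbox cookbook",
    "Run the sre.hosts.provision cookbook",
    "Run the sre.hardware.upgrade-firmware cookbook",
    "Update the operations/puppet repo (preseed.yaml, site.pp)",
    "Run the sre.hosts.reimage cookbook",
]


def render_checklists(hosts: list[str]) -> str:
    """Return one plain-text checklist per host, in the task's bullet style."""
    lines = []
    for host in hosts:
        lines.append(host)
        lines.extend(f"  • {step}" for step in SETUP_STEPS)
    return "\n".join(lines)


print(render_checklists(["ms-fe1015", "ms-fe1016"]))
```

Keeping the steps in a single list makes the ordering constraint explicit: the DNS and provision cookbooks come directly after the Netbox script, and the reimage comes last.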

Event Timeline

RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.
RobH mentioned this in Unknown Object (Task).
RobH added a parent task: Unknown Object (Task).
RobH unsubscribed.

Please note the workflow for racking tasks has changed this fiscal year, and we now require the puppet updates from the sub-team receiving the new servers. This is due to the majority of DC Ops not having root/merge puppet rights.

Please update the site.pp file with the insetup role for your team (detailed on https://wikitech.wikimedia.org/wiki/SRE/Dc-operations) and add the new servers to preseed.yaml for partition info.

If possible, please reference this task number in your patch set, so it is clear when it is complete. Once complete, just un-assign yourself (leaving no assignee) from this task; once the hardware arrives, on-sites will claim this task for racking and setup. Please don't re-subscribe me to this task unless there is a direct question for me.

Thank you!

No changes needed for these nodes (install and site.pp is ready for ms-fe*)

ms-fe1015
Rack E8
U 21
Port 17
CableID 240707900054

ms-fe1016
Rack F8
U 22
Port 17
CableID 240707900052

These have been added to Netbox with their information.

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host ms-fe1015.eqiad.wmnet with OS bullseye

Hey @MatthewVernon, I have been trying to install these. However, would you be able to check the preseed? I could be wrong, but I wasn't able to find preseed information for these hosts. I see codfw has 2015 and 2016, but I don't see anything similar in eqiad for 1015 and 1016. Let me know, thank you!

@VRiley-WMF what preseed are you seeing for 2015/2016 that isn't for 1015/1016?

The changeset relating to the 2015/16 nodes https://gerrit.wikimedia.org/r/c/operations/puppet/+/1134210 is a post-install thing.

The existing preseed.yaml setup should specify exactly the same things for ms-fe*

Okay, thanks. These servers were having trouble imaging, and I was trying to check whether they had been added to the preseed.

@VRiley-WMF I had a quick look at the console of ms-fe1015 and it looks like there's some problem with its network setup?

ms-fe1015-sadness.png (584×1 px, 120 KB)

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-fe1016.eqiad.wmnet with OS bullseye

I see the same failure mode on ms-fe1016:

Booting from BRCM MBA Slot 0400 v21.6.4

Broadcom UNDI PXE-2.1 v21.6.4
Copyright (C) 2000-2024 Broadcom Limited
Copyright (C) 1997-2000 Intel Corporation
All rights reserved.
PXE-E61: Media test failure, check cable
PXE-M0F: Exiting Broadcom PXE ROM.

Booting from Hard drive C:
No operating system is currently installed on this computer.

Looks like either the NIC isn't connected, or it's trying to PXE off the wrong one, or there's a firmware issue (the error message suggests a hardware problem, but these messages have been known to mislead...).
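
When comparing failures across hosts, it can help to pull just the PXE status lines out of a console capture. The following is a minimal sketch, assuming the console text has been saved as a string; the function name `pxe_messages` is hypothetical, and the pattern is inferred only from the two codes visible in the output above (`PXE-E61`, `PXE-M0F`), not from a full list of PXE codes.

```python
import re

# Matches lines such as "PXE-E61: Media test failure, check cable".
PXE_LINE = re.compile(r"^PXE-([EM][0-9A-F]{2}): (.+)$", re.MULTILINE)


def pxe_messages(console_text: str) -> list[tuple[str, str]]:
    """Return (code, message) pairs for every PXE status line in the text."""
    return [(m.group(1), m.group(2).strip()) for m in PXE_LINE.finditer(console_text)]


console = """\
Booting from BRCM MBA Slot 0400 v21.6.4
PXE-E61: Media test failure, check cable
PXE-M0F: Exiting Broadcom PXE ROM.
Booting from Hard drive C:
"""

for code, message in pxe_messages(console):
    print(code, message)
```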

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-fe1016.eqiad.wmnet with OS bullseye executed with errors:

  • ms-fe1016 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ms-fe1016.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host ms-fe1015.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host ms-fe1016.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host ms-fe1015.eqiad.wmnet with OS bullseye completed:

  • ms-fe1015 (WARN)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202504301947_vriley_1216903_ms-fe1015.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host ms-fe1016.eqiad.wmnet with OS bullseye completed:

  • ms-fe1016 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202504302010_vriley_1240354_ms-fe1016.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
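
The reimage summaries above each carry a per-host status (`FAIL`, `WARN`, or `PASS`) on their first bullet. As a rough sketch of how those could be collected from pasted task text, the snippet below keeps the most recent status per host. The name `reimage_statuses` is hypothetical, and the line format is assumed from the summaries in this task rather than from the cookbook's source.

```python
import re

# Matches summary lines such as "  • ms-fe1015 (WARN)".
STATUS_LINE = re.compile(r"^[ \t]*•[ \t]+(\S+)[ \t]+\((PASS|WARN|FAIL)\)[ \t]*$", re.MULTILINE)


def reimage_statuses(summary: str) -> dict[str, str]:
    """Map each hostname to the last status reported for it in the text."""
    return {host: status for host, status in STATUS_LINE.findall(summary)}


summary = """\
  • ms-fe1016 (FAIL)
    • Host rebooted via IPMI
  • ms-fe1015 (WARN)
    • Icinga status is not optimal, downtime not removed
  • ms-fe1016 (PASS)
    • Icinga status is optimal
"""

print(reimage_statuses(summary))
```

Later entries overwrite earlier ones, so a host that failed and was then reimaged successfully ends up with its final status.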

Change #1140752 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] swift: add ms-fe101[5,6] as new proxy nodes

https://gerrit.wikimedia.org/r/1140752

Icinga downtime and Alertmanager silence (ID=60e63b14-6e88-4b5b-aa82-3177a7ab590b) set by mvernon@cumin1002 for 2 days, 18:00:00 on 1 host(s) and their services with reason: not yet in prod

ms-fe1015.eqiad.wmnet

Icinga downtime and Alertmanager silence (ID=aed9f089-6890-4797-9578-4420d20fd11c) set by mvernon@cumin1002 for 2 days, 18:00:00 on 1 host(s) and their services with reason: not yet in prod

ms-fe1016.eqiad.wmnet
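
The duration `2 days, 18:00:00` in the two downtime entries has the same shape as the string form of a Python `timedelta`. Assuming that is how it was produced (the log itself does not say), the sketch below reconstructs the value and converts it to hours.

```python
from datetime import timedelta

downtime = timedelta(days=2, hours=18)

# str() of a timedelta yields the same "2 days, 18:00:00" form seen in the log.
print(str(downtime))
# Total length of the silence in hours.
print(downtime.total_seconds() / 3600)
```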

Change #1140752 merged by MVernon:

[operations/puppet@production] swift: add ms-fe101[5,6] as new proxy nodes

https://gerrit.wikimedia.org/r/1140752

Host rebooted by mvernon@cumin1002 with reason: None

Host rebooted by mvernon@cumin1002 with reason: final reboot before bringing into service

Mentioned in SAL (#wikimedia-operations) [2025-05-07T15:04:01Z] <Emperor> pool ms-fe1015 ms-fe1016 new frontends T388886 T391354