
Q4:rack/setup/install apus-fe1003
Closed, Resolved (Public)

Description

This task will track the racking, setup, and OS installation of apus-fe1003

Hostname / Racking / Installation Details

Hostnames: apus-fe1003
Racking Proposal: If possible avoid racks containing moss-fe* nodes (D4, A2)
Networking Setup: 10G production network
OS Distro: Bookworm
Sub-team Technical Contact: @MatthewVernon

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

apus-fe1003:
  • Receive in system on procurement task T388239 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the "Provision a server's network attributes" Netbox script. Note that you must run the DNS and Provision cookbooks after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
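
As a rough illustration, the checklist above can be modelled as an ordered list so that the next outstanding item is easy to find. This is only a sketch: the step labels paraphrase the checklist, `next_step` is a hypothetical helper, and the strict top-to-bottom ordering is an assumption (in this task the puppet change was merged before the host was racked).

```python
from typing import Iterable, Optional

# Step labels paraphrase the per-host checklist in this task.
# The list and helper are illustrative, not part of any WMF tooling.
CHECKLIST = [
    "receive in system (procurement task and Coupa)",
    "rack system and update Netbox",
    "run Netbox 'Provision a server's network attributes' script",
    "sre.dns.netbox",
    "sre.hosts.provision",
    "sre.hardware.upgrade-firmware",
    "update operations/puppet (preseed.yaml, site.pp)",
    "sre.hosts.reimage",
]


def next_step(completed: Iterable[str]) -> Optional[str]:
    """Return the first checklist item not yet completed, or None if all are done."""
    done = set(completed)
    for name in CHECKLIST:
        if name not in done:
            return name
    return None
```

For example, once the first two items are marked complete, `next_step` returns the Netbox provisioning script step, which the checklist says must be followed immediately by the DNS and provision cookbooks.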

Event Timeline

RobH assigned this task to MatthewVernon.
RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.
RobH unsubscribed.

@MatthewVernon,

Please note the workflow for racking tasks has changed this fiscal year, and we now require the puppet updates from the sub-team receiving the new servers. This is due to the majority of DC Ops not having root/merge puppet rights.

Please update the site.pp file with the insetup role for your team (detailed on https://wikitech.wikimedia.org/wiki/SRE/Dc-operations) and add the new servers to preseed.yaml for partition info.

If possible, please reference this task number in your patch set, so it is clear when it is complete. Once complete, just un-assign yourself (leaving no assignee) from this task; once the hardware arrives, on-sites will claim this task for racking and setup. Please don't re-subscribe me to this task unless there is a direct question for me.

Thank you!

RobH mentioned this in Unknown Object (Task). Mar 21 2025, 3:41 PM
RobH added a parent task: Unknown Object (Task).

Change #1130151 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] site/install: prep for new apus and thanos nodes

https://gerrit.wikimedia.org/r/1130151

Puppet work done, unassigning myself

Change #1130151 merged by MVernon:

[operations/puppet@production] site/install: prep for new apus and thanos nodes

https://gerrit.wikimedia.org/r/1130151

apus-fe1003
Racked and added into netbox

C2
U14
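
The chosen rack can be checked against the racking proposal with a few lines. This is a minimal sketch that assumes "(D4, A2)" in the proposal names the racks holding moss-fe* nodes; `AVOID_RACKS` and `rack_ok` are hypothetical names, not part of Netbox or any cookbook.

```python
# Racks the racking proposal asks to avoid if possible
# (assumed to be the ones containing moss-fe* nodes).
AVOID_RACKS = {"D4", "A2"}


def rack_ok(rack: str) -> bool:
    """Return True if the rack is not one the proposal asks to avoid."""
    return rack.strip().upper() not in AVOID_RACKS


# The host was racked in C2, which satisfies the proposal.
print(rack_ok("C2"))
```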

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host apus-fe1003.wikimedia.org with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host apus-fe1003.wikimedia.org with OS bookworm executed with errors:

  • apus-fe1003 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console apus-fe1003.wikimedia.org" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host apus-fe1003.wikimedia.org with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host apus-fe1003.wikimedia.org with OS bookworm executed with errors:

  • apus-fe1003 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console apus-fe1003.wikimedia.org" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host apus-fe1003.wikimedia.org with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host apus-fe1003.wikimedia.org with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host apus-fe1003.wikimedia.org with OS bookworm executed with errors:

  • apus-fe1003 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console apus-fe1003.wikimedia.org" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host apus-fe1003.eqiad.wmnet with OS bookworm

@MatthewVernon Can you update the eqiad.yaml file for this one? I think some things were missed; it will not image in eqiad for @VRiley-WMF.

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host apus-fe1003.eqiad.wmnet with OS bookworm executed with errors:

  • apus-fe1003 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console apus-fe1003.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1002 for host apus-fe1003.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1002 for host apus-fe1003.eqiad.wmnet with OS bookworm executed with errors:

  • apus-fe1003 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console apus-fe1003.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1002 for host apus-fe1003.eqiad.wmnet with OS bookworm

@Jclark-ctr I've had a look at this, and the problem seems to be that it's failing to PXE boot at all. The reimage cookbook brings the host up fine (and I see "IPMI: Boot to PXE Boot Requested by iDRAC" on the console), but at the point I'd expect it to boot, the screen blanks for ~30s and then it falls back to trying to boot from the hard drive, which fails (as expected on a new host).

So whatever the problem is, it's not preseed - we're not even getting that far. Is this system correctly set up to use the right port of the right NIC for PXE? It was odd that I wasn't seeing any message from the NIC PXE at the point of trying to boot (you usually get either an error message or an initialisation message before it PXE-boots), and the BIOS wasn't obviously letting me select a NIC to boot from...

FWIW, the preseed configuration in preseed.yaml uses the same setup that worked OK with apus-fe2003 in codfw.
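
The reasoning above (no "Host up (Debian installer)" step means the installer, and therefore preseed, was never reached) can be sketched as a small classifier over the step lists the reimage cookbook posts to this task. The stage labels and the `furthest_stage` helper are hypothetical; only the step strings are taken from the cookbook summaries in this timeline. An "installer-not-reached" result is consistent with a PXE boot failure but does not prove it.

```python
from typing import List

# Step strings as they appear in the cookbook summaries posted on this task.
INSTALLER_UP = "Host up (Debian installer)"
OS_UP = "Host up (new fresh bookworm OS)"


def furthest_stage(steps: List[str]) -> str:
    """Classify how far a reimage run got, based on its reported steps."""
    # Drop surrounding whitespace and the leading bullet character.
    cleaned = [s.strip().lstrip("•").strip() for s in steps]
    if OS_UP in cleaned:
        return "os-installed"
    if INSTALLER_UP in cleaned:
        return "installer-reached"
    return "installer-not-reached"


failed_run = [
    "• Removed from Puppet and PuppetDB if present and deleted any certificates",
    "• Removed from Debmonitor if present",
    "• Forced PXE for next reboot",
    "• Host rebooted via IPMI",
]
print(furthest_stage(failed_run))
```

Runs that stop right after "Host rebooted via IPMI" fall into the last category, which matches the observation that preseed is not the problem because the installer never starts.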

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1002 for host apus-fe1003.eqiad.wmnet with OS bookworm executed with errors:

  • apus-fe1003 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console apus-fe1003.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host apus-fe1003.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host apus-fe1003.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host apus-fe1003.eqiad.wmnet with OS bookworm completed:

  • apus-fe1003 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202505081908_vriley_95758_apus-fe1003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

This has been completed

Change #1143821 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] apus: bring new frontend apus-fe1003 into service

https://gerrit.wikimedia.org/r/1143821

Change #1143821 merged by MVernon:

[operations/puppet@production] apus: bring new frontend apus-fe1003 into service

https://gerrit.wikimedia.org/r/1143821