
Q3:rack/setup/install backup1015
Closed, ResolvedPublic

Description

This task will track the racking, setup, and OS installation of backup1015

Hostname / Racking / Installation Details

Hostnames: backup1015
Racking Proposal: This can go almost anywhere. Use https://fault-tolerance.toolforge.org/map to optimize this placement.
Raid Setup: Configure the 2 SSDs as 2 standalone disks (or as 2 separate RAID 0 virtual disks, which is equivalent). Leave the HDs unconfigured; they will be set up post-install by the data persistence team. Ideally, the SSDs should be the first virtual disks, so that the partman recipe works as intended.
Networking Setup: # of Connections:1 - Speed:10G. - VLAN:Private
OS Distro: Bookworm
Boot Method: UEFI.
Sub-team Technical Contact: Jaime Crespo

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

backup1015
  • Receive in system on procurement task T414723 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook

Event Timeline

RobH assigned this task to jcrespo.
RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.
RobH unsubscribed.

Jaime,

I had to split up the expansion and refresh budget lines for backup this quarter, so this racking task (and its parent order task) only covers the line item: "Unbudgeted Request for ms-be - eqiad (T410028)"

I've taken a guess at the next available backup hostname, but this needs your update in the task description on rack preference plus a double check of the info I've provided.

Please update the site.pp file with the insetup role for your team (detailed on https://wikitech.wikimedia.org/wiki/SRE/Dc-operations) and add the new servers to preseed.yaml for partition info.

If possible, please reference this task number in your patch set, so it is clear when it is complete. Once complete, just un-assign yourself (leaving no assignee) from this task; once the hardware arrives, on-site engineers will claim this task for racking and setup. Please don't re-subscribe me to this task unless there is a direct question for me.

Thank you!

RobH mentioned this in Unknown Object (Task).Jan 15 2026, 7:53 PM
RobH added a parent task: Unknown Object (Task).
RobH renamed this task from Q#:rack/setup/install X to Q3:rack/setup/install backup1015.Jan 15 2026, 8:35 PM

Change #1227773 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] install: Reimage with format backup1015-backup1020 & backup2015-backup2020

https://gerrit.wikimedia.org/r/1227773

Change #1227773 merged by Jcrespo:

[operations/puppet@production] install: Reimage with format backup1015-backup1020 & backup2015-backup2020

https://gerrit.wikimedia.org/r/1227773

This is almost ready to reimage (partman is ready), but I want to give the RAID controller setup and UEFI a last review, to see if further changes are needed.

Change #1227792 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] backup: Set up backup1015-backup1020 & backup2015-backup2020

https://gerrit.wikimedia.org/r/1227792

Change #1227792 merged by Jcrespo:

[operations/puppet@production] backup: Set up backup1015-backup1020 & backup2015-backup2020

https://gerrit.wikimedia.org/r/1227792

@elukey I’m having issues with this server failing to provision. I've manually set the username, password, and idrac network, but it continues to fail to pick up the link for Integrated NIC 1:1, even though it is visible as connected in the iDRAC GUI.

Change #1236297 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/cookbooks@master] sre.hosts.provision: initialize dict when setting lldp

https://gerrit.wikimedia.org/r/1236297

The host is provisioned now! I didn't have any issue picking up the NIC; the main problem was that LLDP was not being set correctly (an error in the cookbook that I already have a fix for).

Change #1236297 merged by jenkins-bot:

[operations/cookbooks@master] sre.hosts.provision: set self.config_changes as defaultdict

https://gerrit.wikimedia.org/r/1236297
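
For context on the fix above: the two change titles suggest the cookbook was writing an LLDP setting into a nested mapping whose outer key had not been initialized. The following is only a minimal sketch of that failure mode and of the defaultdict remedy, not the cookbook's actual code; the key names (`NIC.Integrated.1-1-1`, `LldpEnable`) and both helper functions are hypothetical.

```python
from collections import defaultdict


def set_lldp_plain(config_changes: dict, nic: str) -> None:
    """Write a nested setting into a plain dict; raises KeyError if `nic` is absent."""
    config_changes[nic]["LldpEnable"] = "Enabled"  # hypothetical key names


def set_lldp_safe(config_changes: defaultdict, nic: str) -> None:
    """Same write, but the outer mapping creates the inner dict on demand."""
    config_changes[nic]["LldpEnable"] = "Enabled"


# With a plain dict, the first write for a new NIC fails.
plain: dict = {}
try:
    set_lldp_plain(plain, "NIC.Integrated.1-1-1")
    plain_failed = False
except KeyError:
    plain_failed = True

# With defaultdict(dict), the inner dict is initialized automatically.
safe: defaultdict = defaultdict(dict)
set_lldp_safe(safe, "NIC.Integrated.1-1-1")
```

Either approach named in the change titles (initializing the inner dict explicitly, or making the container a defaultdict) avoids the missing-key error; the defaultdict variant simply removes the need to remember the initialization at each call site.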

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host backup1015.eqiad.wmnet with OS bookworm

@elukey Thanks for the help. I’m having issues now getting it to start PXE. I’ll have to look around and see if anything else changed in the BIOS.

Booting from PXE Device 1: Embedded NIC 1 Port 1 Partition 1
PXE: No media detected.
Boot Failed: PXE Device 1: Embedded NIC 1 Port 1 Partition 1

Booting from HTTP Device 1: Integrated NIC 1 Port 1 Partition 1

>>Start HTTP boot over IPv4
.
 Unable to obtain IPv4 address and other boot information from DHCP server.
 Please ensure the DHCP server is configured correctly and try again.

  Error: Could not retrieve NBP file size from HTTP server.

  Error: Server response timeout.
Boot Failed: HTTP Device 1: Integrated NIC 1 Port 1 Partition 1

No boot device available or Operating System detected.
Please ensure a compatible bootable media is available.

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host ms-fe1021.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host ms-fe1023.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host backup1015.eqiad.wmnet with OS bookworm executed with errors:

  • backup1015 (FAIL)
    • Host successfully migrated to the new VLAN
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console backup1015.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1003 for host ms-fe1024.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host ms-fe1021.eqiad.wmnet with OS bullseye executed with errors:

  • ms-fe1021 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ms-fe1021.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1003 for host ms-fe1024.eqiad.wmnet with OS bullseye executed with errors:

  • ms-fe1024 (FAIL)
    • Host successfully migrated to the new VLAN
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ms-fe1024.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1003 for host ms-fe1024.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host backup1015.eqiad.wmnet with OS bookworm

@jcrespo With eLukey's and Topranks' help we were able to get it to start imaging, but it is failing because preseed.yaml is missing EFI booting. Can you help fix that?

Screenshot 2026-02-03 at 3.06.11 PM.png (906×1 px, 265 KB)

Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1003 for host ms-fe1024.eqiad.wmnet with OS bullseye executed with errors:

  • ms-fe1024 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ms-fe1024.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host backup1015.eqiad.wmnet with OS bookworm executed with errors:

  • backup1015 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console backup1015.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

The only documentation I see on wikitech is a link to a patch with no detailed diff, and I see no further or clearer instructions. I am leaving this for I.F. to update the docs or create a patch, as I don't know how to follow up based on the current docs. I am CCing @elukey, but meaning anyone from his team. CC @LSobanski.

Please note the urgency of this request: the stalled setup of this host (and similar ones) is preventing us from generating new media backups.

Change #1236688 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] install_server: add UEFI partman recipe for backup1015

https://gerrit.wikimedia.org/r/1236688

@jcrespo Hi! I guess you are referring to https://wikitech.wikimedia.org/wiki/UEFI_Boot; we can definitely add more docs together if you have time. I filed https://gerrit.wikimedia.org/r/c/operations/puppet/+/1236688 as a follow-up. The idea is the following:

  • Since it is a custom recipe, duplicate it into its EFI variant.
  • Add the EFI boot partition.
  • Modify the sw RAID config numbering to address the right partitions (since adding the EFI one changes the partition numbers, etc.).

Lemme know if it makes sense!
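
The renumbering step in the list above can be illustrated with a small sketch. It assumes the EFI partition is added as the first partition on each disk, so every existing partition number shifts up by one, and that RAID members are written as `#`-separated device paths. The device names and the `shift_partition_refs` helper are hypothetical and are not taken from the actual recipe in the patch.

```python
import re


def shift_partition_refs(raid_members: str, offset: int = 1) -> str:
    """Shift the partition number of every /dev/sdXN path in the string by `offset`."""

    def bump(match: re.Match) -> str:
        disk, number = match.group(1), int(match.group(2))
        return f"{disk}{number + offset}"

    return re.sub(r"(/dev/sd[a-z]+)(\d+)", bump, raid_members)


# Hypothetical RAID member list from the non-EFI variant of a recipe.
bios_members = "/dev/sda1#/dev/sdb1"

# After adding an EFI partition in front, the same members move to partition 2.
uefi_members = shift_partition_refs(bios_members)
```

If the EFI partition were placed anywhere other than first, only the partitions after it would shift, so a blanket offset like this one would no longer be correct.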

Change #1236688 merged by Elukey:

[operations/puppet@production] install_server: add UEFI partman recipe for backup1015

https://gerrit.wikimedia.org/r/1236688

Cookbook cookbooks.sre.hosts.reimage was started by jynus@cumin1003 for host backup1015.eqiad.wmnet with OS trixie

Change #1236708 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] install_server: Prevent reimage of backup1015 and setup all other new hosts

https://gerrit.wikimedia.org/r/1236708

Cookbook cookbooks.sre.hosts.reimage started by jynus@cumin1003 for host backup1015.eqiad.wmnet with OS trixie completed:

  • backup1015 (WARN)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202602041145_jynus_2518658_backup1015.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Thanks @jcrespo for updating the preseed file and kicking off the imaging!

Change #1236708 merged by Jcrespo:

[operations/puppet@production] install_server: Prevent reimage of backup1015 and setup all other new hosts

https://gerrit.wikimedia.org/r/1236708