Q3:rack/setup/install backup20[16-20]
Closed, Resolved, Public

Description

This task will track the racking, setup, and OS installation of backup20[16-20]

Hostname / Racking / Installation Details

Hostnames: backup20[16-20] (backup2015 being used for expansion order)
Racking Proposal: Spread the hosts out as much as is reasonable. Use https://fault-tolerance.toolforge.org/map to optimize the placement.
Raid Setup: Configure the 2 SSDs as 2 standalone disks (or as 2 separate RAID 0 virtual disks; the result is the same). Leave the HDs unconfigured; they will be set up post-install by the data persistence team. Ideally, make the SSDs the first virtual disks so that the partman recipe works as intended.
Networking Setup: # of Connections: 1 - Speed: 10G - VLAN: Private
OS Distro: trixie
Boot Method: UEFI.
Sub-team Technical Contact: Jaime Crespo
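
The hostname range in the details above uses bracket shorthand. As a minimal sketch (the helper name `expand_hosts` is hypothetical and not part of any Wikimedia tooling), the range can be expanded into the individual hostnames that each need a checklist:

```python
import re


def expand_hosts(pattern: str) -> list[str]:
    """Expand a single 'prefix[start-end]suffix' range into hostnames."""
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", pattern)
    if match is None:
        return [pattern]  # no bracket range: treat as a literal hostname
    prefix, start, end, suffix = match.groups()
    width = len(start)  # preserve any zero padding used in the range
    return [
        f"{prefix}{n:0{width}d}{suffix}"
        for n in range(int(start), int(end) + 1)
    ]


print(expand_hosts("backup20[16-20]"))
```

This handles only one bracketed range per pattern, which is all this task needs.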

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

backup2016
  • Receive in system on procurement task T412434 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the "Provision a server's network attributes" Netbox script - Note that you must run the DNS and Provision cookbooks after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
backup2017
  • Receive in system on procurement task T412434 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the "Provision a server's network attributes" Netbox script - Note that you must run the DNS and Provision cookbooks after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
backup2018
  • Receive in system on procurement task T412434 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the "Provision a server's network attributes" Netbox script - Note that you must run the DNS and Provision cookbooks after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
backup2019
  • Receive in system on procurement task T412434 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the "Provision a server's network attributes" Netbox script - Note that you must run the DNS and Provision cookbooks after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
backup2020
  • Receive in system on procurement task T412434 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the "Provision a server's network attributes" Netbox script - Note that you must run the DNS and Provision cookbooks after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
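
Since every host repeats the same eight steps, the checklists above could be generated rather than copied by hand. The sketch below is illustrative only: `CHECKLIST_STEPS` and `render_checklists` are hypothetical names, and some step texts are shortened from the wording in this task.

```python
# Steps abbreviated from the per-host checklist in this task.
CHECKLIST_STEPS = [
    "Receive in system on procurement task T412434 & in Coupa",
    "Rack system with proposed racking plan (see above) & update Netbox "
    "(include all system info plus location, state of planned)",
    "Run the Provision a server's network attributes Netbox script",
    "Immediately run the sre.dns.netbox cookbook",
    "Immediately run the sre.hosts.provision cookbook",
    "Run the sre.hardware.upgrade-firmware cookbook",
    "Update the operations/puppet repo (preseed.yaml and site.pp)",
    "Run the sre.hosts.reimage cookbook",
]


def render_checklists(hosts: list[str]) -> str:
    """Return one hostname heading plus the bulleted steps for each host."""
    lines = []
    for host in hosts:
        lines.append(host)
        lines.extend(f"  • {step}" for step in CHECKLIST_STEPS)
    return "\n".join(lines)


hosts = [f"backup20{n}" for n in range(16, 21)]
print(render_checklists(hosts))
```

Keeping the steps in one list means a wording change is made once and applies to every host.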

Related Objects

Status: Resolved
Assigned: Jhancock.wm

Event Timeline

RobH mentioned this in Unknown Object (Task). Jan 15 2026, 8:31 PM
RobH added a parent task: Unknown Object (Task).
RobH moved this task from Backlog to Racking Tasks on the ops-codfw board.
RobH unsubscribed.

Jaime,

I had to split up the expansion and refresh budget lines for backup this quarter, so this racking task (and its parent order task) only covers the line item for refreshing backup hosts in this site (not expansion; the expansion racking task was already filed via T414724).

I've taken a guess at the next available backup hostname, but this needs your update in the task description on rack preference plus a double check of the info I've provided.

Please update the site.pp file with the insetup role for your team (detailed on https://wikitech.wikimedia.org/wiki/SRE/Dc-operations) and add the new servers to preseed.yaml for partition info.

If possible, please reference this task number in your patch set, so it is clear when it is complete. Once complete, just un-assign yourself from this task (leaving no assignee); once the hardware arrives, the on-site engineers will claim this task for racking and setup. Please don't re-subscribe me to this task unless there is a direct question for me.

Thank you!

Change #1227773 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] install: Reimage with format backup1015-backup1020 & backup2015-backup2020

https://gerrit.wikimedia.org/r/1227773

Change #1227773 merged by Jcrespo:

[operations/puppet@production] install: Reimage with format backup1015-backup1020 & backup2015-backup2020

https://gerrit.wikimedia.org/r/1227773

Change #1227792 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] backup: Set up backup1015-backup1020 & backup2015-backup2020

https://gerrit.wikimedia.org/r/1227792

Change #1227792 merged by Jcrespo:

[operations/puppet@production] backup: Set up backup1015-backup1020 & backup2015-backup2020

https://gerrit.wikimedia.org/r/1227792

jcrespo subscribed.

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host backup2016.codfw.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host backup2017.codfw.wmnet with OS trixie

Jhancock.wm subscribed.

@jcrespo i need an edit to the site.pp file. the backup20XX servers have eqiad in the name. they should be codfw. Thank you!

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host backup2016.codfw.wmnet with OS trixie executed with errors:

  • backup2016 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console backup2016.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host backup2017.codfw.wmnet with OS trixie executed with errors:

  • backup2017 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console backup2017.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Change #1240203 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] backups: Fix error on domain for backup hosts

https://gerrit.wikimedia.org/r/1240203

Change #1240203 merged by Jcrespo:

[operations/puppet@production] backups: Fix error on domain for backup hosts

https://gerrit.wikimedia.org/r/1240203

@jcrespo i need an edit to the site.pp file. the backup20XX servers have eqiad in the name. they should be codfw. Thank you!

Done.
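
The site mix-up discussed above (backup20XX hosts listed under eqiad instead of codfw) is the kind of error a small pre-merge check can catch. The sketch below assumes that the first digit of the host number identifies the site; only the two sites that appear in this task are mapped. The names `SITE_BY_FIRST_DIGIT`, `expected_fqdn` and `misplaced` are hypothetical and are not part of the puppet repo.

```python
import re

# Assumption: first digit of the host number identifies the site.
# Only the two sites mentioned in this task are listed.
SITE_BY_FIRST_DIGIT = {"1": "eqiad", "2": "codfw"}


def expected_fqdn(hostname: str) -> str:
    """Build the FQDN a short hostname such as 'backup2016' should have."""
    match = re.fullmatch(r"[a-z-]+(\d)\d{3}", hostname)
    if match is None or match.group(1) not in SITE_BY_FIRST_DIGIT:
        raise ValueError(f"cannot infer site for {hostname!r}")
    return f"{hostname}.{SITE_BY_FIRST_DIGIT[match.group(1)]}.wmnet"


def misplaced(fqdns: list[str]) -> list[str]:
    """Return the FQDNs whose domain does not match their host number."""
    return [f for f in fqdns if f != expected_fqdn(f.split(".", 1)[0])]


print(misplaced(["backup1015.eqiad.wmnet", "backup2016.eqiad.wmnet"]))
```

With the two sample names, only `backup2016.eqiad.wmnet` is reported, which matches the problem raised in this task.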

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host backup2016.codfw.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host backup2016.codfw.wmnet with OS trixie completed:

  • backup2016 (WARN)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202602181953_jhancock_2702651_backup2016.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host backup2017.codfw.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host backup2018.codfw.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host backup2017.codfw.wmnet with OS trixie completed:

  • backup2017 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202602182054_jhancock_2737703_backup2017.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host backup2018.codfw.wmnet with OS trixie completed:

  • backup2018 (WARN)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202602182058_jhancock_2738025_backup2018.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host backup2019.codfw.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host backup2020.codfw.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host backup2019.codfw.wmnet with OS trixie completed:

  • backup2019 (WARN)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202602191714_jhancock_3358214_backup2019.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host backup2020.codfw.wmnet with OS trixie completed:

  • backup2020 (WARN)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202602191718_jhancock_3358939_backup2020.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
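
The reimage reports above end in PASS, WARN or FAIL per host, and a host can appear more than once when a run is retried. As a rough sketch (the regex and the helper names `summarize` and `needs_follow_up` are assumptions based only on the report lines shown in this task), the latest status per host can be collected to see which hosts still need an Icinga follow-up:

```python
import re

# Matches report header lines such as "  • backup2016 (WARN)".
HOST_STATUS = re.compile(r"^\s*•\s+(\S+)\s+\((PASS|WARN|FAIL)\)\s*$")


def summarize(log_text: str) -> dict[str, str]:
    """Return the last reported status for each host in the pasted reports."""
    statuses: dict[str, str] = {}
    for line in log_text.splitlines():
        match = HOST_STATUS.match(line)
        if match:
            statuses[match.group(1)] = match.group(2)  # later runs overwrite earlier ones
    return statuses


def needs_follow_up(statuses: dict[str, str]) -> list[str]:
    """List hosts whose latest run did not finish with PASS."""
    return sorted(host for host, status in statuses.items() if status != "PASS")


sample = "  • backup2016 (FAIL)\n  • backup2016 (WARN)\n  • backup2017 (PASS)\n"
print(needs_follow_up(summarize(sample)))
```

Applied to the reports in this task, backup2017 is the only host whose latest run passed; the others finished with WARN because the Icinga status was not optimal and the downtime was not removed.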
Jhancock.wm claimed this task.
Jhancock.wm updated the task description.