
Q4:rack/setup/install backup1010, backup1011
Closed, ResolvedPublic

Description

This task will track the racking, setup, and OS installation of backup1010, backup1011

Hostname / Racking / Installation Details

Hostnames: backup1010, backup1011 (unless other backup hosts are purchased in-between)
Racking Proposal: Anywhere, as redundant as possible with backup1002, backup1008 & backup1009, and with each other (different rows or, failing that, different racks).
Networking Setup: 1 10G. Vlan (Specify) AAAA records:Y (both ipv4 and ipv6 enabled)
Partitioning/Raid:

  • First SSD - not part of the RAID.
  • Second SSD - not part of the RAID.
  • All others (HDs): RAID6.
  • The OS should end up seeing 3 disks: sda and sdb (approx. 480 GB each) and sdc (around 160 TB available).
  • The recipe after that is: partman/custom/backup-format.cfg
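As a rough sanity check of the layout described above, the expected disks can be written down and compared with what the OS reports. This is only a sketch: the helper names, the 15% tolerance, and the 12 x 16 TB example are assumptions for illustration, since the task does not state how many HDs are in the array or their size. The only firm fact used is that RAID6 reserves two disks' worth of capacity for parity.

```python
GB = 10**9
TB = 10**12

# Expected view from the OS, taken from the partitioning notes above.
EXPECTED = {
    "sda": 480 * GB,   # first SSD, outside the RAID
    "sdb": 480 * GB,   # second SSD, outside the RAID
    "sdc": 160 * TB,   # RAID6 virtual disk built from the HDs
}


def raid6_usable(n_disks: int, disk_bytes: int) -> int:
    """Usable bytes of a RAID6 array: two disks' worth goes to parity."""
    if n_disks < 4:
        raise ValueError("RAID6 needs at least 4 disks")
    return (n_disks - 2) * disk_bytes


def layout_problems(seen: dict, tolerance: float = 0.15) -> list:
    """Compare {device: size_in_bytes} with EXPECTED; return a list of issues.

    The tolerance is a hypothetical value chosen because the task only gives
    approximate sizes.
    """
    problems = []
    for name, want in EXPECTED.items():
        got = seen.get(name)
        if got is None:
            problems.append(f"{name}: missing")
        elif abs(got - want) > tolerance * want:
            problems.append(f"{name}: size {got} is not within {tolerance:.0%} of {want}")
    for name in sorted(set(seen) - set(EXPECTED)):
        problems.append(f"{name}: unexpected disk")
    return problems


# Purely illustrative numbers: 12 disks of 16 TB would give 160 TB usable.
print(raid6_usable(12, 16 * TB) // TB)
print(layout_problems({"sda": 480 * GB, "sdb": 480 * GB, "sdc": 160 * TB}))
```

The sizes passed to `layout_problems` would in practice come from whatever tool reports block devices on the host; that step is deliberately left out here.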

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

backup1010:
  • - receive in system on procurement task T325225 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role::insetup::data_persistence
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
backup1011:
  • - receive in system on procurement task T325225 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role::insetup::data_persistence
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.

Related Objects

Status    Subtype    Assigned      Task
Resolved             Jclark-ctr

Event Timeline

RobH added a parent task: Unknown Object (Task).
RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.
RobH mentioned this in Unknown Object (Task).
RobH unsubscribed.
RobH renamed this task from Q3:rack/setup/install backup1010, backup1011 to Q4:rack/setup/install backup1010, backup1011. Jan 10 2023, 10:05 PM

backup1010 E1 U5 PORT5 CABLE ID 20220246
backup1011 F1 U5 PORT5 CABLE ID 20220245
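The racking proposal asks for different rows or, failing that, different racks. A small sketch of that check for the two placements recorded above is given below. The function names are hypothetical, and the code assumes that the first character of a rack label such as E1 is the row. The racks of backup1002, backup1008 and backup1009 are not listed in this task, so they are not included.

```python
def row_and_rack(location: str) -> tuple:
    """Split a rack label such as 'E1' into (row, rack)."""
    location = location.strip().rstrip(".").upper()
    return location[0], location


def redundancy(locations: dict) -> str:
    """Classify how well a set of hosts is spread across rows and racks."""
    rows = {row_and_rack(loc)[0] for loc in locations.values()}
    racks = {row_and_rack(loc)[1] for loc in locations.values()}
    if len(rows) == len(locations):
        return "different rows"
    if len(racks) == len(locations):
        return "different racks"
    return "shared rack"


# Placements as recorded in the comment above.
placement = {"backup1010": "E1", "backup1011": "F1"}
print(redundancy(placement))  # different rows
```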

wiki_willy raised the priority of this task from Medium to High. Jun 8 2023, 6:10 PM
wiki_willy subscribed.

Change 928681 had a related patch set uploaded (by Jclark-ctr; author: jclark):

[operations/puppet@production] Add backup101[0-1] site.pp and netboot.cfg

https://gerrit.wikimedia.org/r/928681

Change 928681 merged by Jclark-ctr:

[operations/puppet@production] Add backup101[0-1] site.pp and netboot.cfg

https://gerrit.wikimedia.org/r/928681

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host backup1010.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host backup1010.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host backup1010.eqiad.wmnet with OS bullseye executed with errors:

  • backup1010 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host backup1011.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host backup1010.eqiad.wmnet with OS bullseye executed with errors:

  • backup1010 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host backup1011.eqiad.wmnet with OS bullseye executed with errors:

  • backup1011 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host backup1010.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host backup1011.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host backup1010.eqiad.wmnet with OS bullseye executed with errors:

  • backup1010 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host backup1011.eqiad.wmnet with OS bullseye executed with errors:

  • backup1011 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

@Papaul I did not have any luck imaging the servers last night; they are still failing.

@Jclark-ctr you started the reimage last night just after we did the puppet merge, but we didn't run puppet on the apt server. That might be the reason why it failed. You can try again and let me know.

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host backup1010.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host backup1011.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host backup1010.eqiad.wmnet with OS bullseye executed with errors:

  • backup1010 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host backup1011.eqiad.wmnet with OS bullseye executed with errors:

  • backup1011 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

@Jclark-ctr when you launch the reimage cookbook, make sure you have at least a terminal console open to see what's going on on that server. backup1010 is not able to PXE boot because the 10G NIC firmware is at version 22; you need to downgrade the firmware to version 21.

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host backup1010.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host backup1011.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host backup1011.eqiad.wmnet with OS bullseye completed:

  • backup1011 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202306121441_jclark_911270_backup1011.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host backup1010.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host backup1010.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host backup1010.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host backup1010.eqiad.wmnet with OS bullseye executed with errors:

  • backup1010 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host backup1010.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host backup1010.eqiad.wmnet with OS bullseye executed with errors:

  • backup1010 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host backup1010.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host backup1010.eqiad.wmnet with OS bullseye executed with errors:

  • backup1010 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host backup1010.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host backup1010.eqiad.wmnet with OS bullseye completed:

  • backup1010 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202306121954_jclark_973367_backup1010.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)
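The failed cookbook reports in this timeline stop at different points: some end right after "Removed from Debmonitor if present", others only after "Host rebooted via IPMI". A minimal sketch for pulling the last completed step out of a pasted report is shown below; the function names are hypothetical and the parsing assumes the bullet format shown in this task. What a given stopping point means still has to be confirmed from the cookbook logs.

```python
import re
from typing import Optional

# First line of a report block, e.g. "backup1010 (FAIL)".
HEADER = re.compile(r"^(?P<host>\S+) \((?P<status>PASS|FAIL)\)$")


def parse_report(text: str) -> tuple:
    """Return (host, status, steps) from one bulleted cookbook report."""
    lines = [ln.strip().lstrip("•").strip() for ln in text.splitlines() if ln.strip()]
    match = HEADER.match(lines[0])
    if match is None:
        raise ValueError(f"unrecognised header: {lines[0]!r}")
    return match["host"], match["status"], lines[1:]


def last_step_before_failure(steps: list) -> Optional[str]:
    """Last step listed before the generic 'The reimage failed' line."""
    done = [s for s in steps if not s.startswith("The reimage failed")]
    return done[-1] if done else None


report = """
  • backup1010 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details
"""

host, status, steps = parse_report(report)
print(host, status, last_step_before_failure(steps))
# backup1010 FAIL Host rebooted via IPMI
```

A run that stops after the IPMI reboot without ever reaching "Host up (Debian installer)" is at least consistent with the PXE boot problem described in the comments above, though the report alone does not prove it.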