Q1:rack/setup/install ulsfo misc class hosts
Closed, Resolved (Public)

Description

This task will track the racking, setup, and OS installation of the 9 misc class hosts ordered for ulsfo refresh.

Hostname / Racking / Installation Details

Hostnames: lvs40(0[89]|10), ganeti400[5678], dns400[34]
Racking Proposal: odd numbered hostnames in rack 22, even numbered hostnames in rack 23
Networking Setup: # of Connections: 1 for everything except LVS hosts, which need 2. Speed: 10G. Vlan: Match existing hosts. AAAA records: Y. Additional IP records (Cassandra)?
Partitioning/Raid: HW Raid: N, Partman recipe and/or desired Raid Level:
OS Distro: Bullseye
Sub-team Technical Contact:@BBlack
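
As a quick sanity check of the racking proposal, here is a minimal sketch that applies the odd/even rule to the nine hostnames. The host list is expanded by hand from the pattern above; the function names `proposed_rack` and `racking_plan` are made up for illustration and are not part of any existing tooling.

```python
import re

# Hostnames expanded by hand from the task's pattern:
# lvs40(0[89]|10), ganeti400[5678], dns400[34]
HOSTNAMES = [
    "lvs4008", "lvs4009", "lvs4010",
    "ganeti4005", "ganeti4006", "ganeti4007", "ganeti4008",
    "dns4003", "dns4004",
]


def proposed_rack(hostname: str) -> int:
    """Return the rack from the proposal: odd hostnames -> 22, even -> 23."""
    match = re.search(r"(\d+)$", hostname)
    if match is None:
        raise ValueError(f"no trailing number in {hostname!r}")
    return 22 if int(match.group(1)) % 2 else 23


def racking_plan(hostnames):
    """Group hostnames by their proposed rack, keeping input order."""
    plan = {22: [], 23: []}
    for name in hostnames:
        plan[proposed_rack(name)].append(name)
    return plan


if __name__ == "__main__":
    for rack, hosts in racking_plan(HOSTNAMES).items():
        print(rack, ", ".join(hosts))
```

With this rule, each rack ends up with at least one host of every class.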

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

lvs4008:
  • - receive in system on procurement task T311749 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - racked with power and mgmt, idrac not yet set up, allowing for automation to troubleshoot the auto provisioning script in ulsfo
  • - LVS hosts must be wired in to both their own ASW and the adjacent rack ASW
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller) - downgraded nic from 22. to 21.85.21.92
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
lvs4009:
  • - receive in system on procurement task T311749 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller) NIC downgraded from 22.00.07.60 to 21.85.21.92
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
lvs4010:
  • - receive in system on procurement task T311749 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller) NIC downgraded from 22.00.07.60 to 21.85.21.92
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
ganeti4005:
  • - receive in system on procurement task T311749 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details - fails the provision script for redfish connection, idrac manually set up for remote connectivity so @Volans can troubleshoot what's up
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
ganeti4006:
  • - receive in system on procurement task T311749 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
ganeti4007:
  • - receive in system on procurement task T311749 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller) ganeti4007 (NETWORK): now at version: 21.85.21.92
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
ganeti4008:
  • - receive in system on procurement task T311749 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
dns4003:
  • - receive in system on procurement task T311749 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller) - bios newest, idrac capped, network rolled back to working version 21.85 from 22.x
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
dns4004:
  • - receive in system on procurement task T311749 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
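
Several firmware steps above record a NIC downgrade from 22.00.07.60 to 21.85.21.92. A minimal sketch of how such dotted versions could be compared is below; `parse_version` and `is_downgrade` are hypothetical helpers, and they only handle fully numeric versions (not shortened forms like "22.x").

```python
def parse_version(version: str) -> tuple:
    """Split a dotted, all-numeric firmware version into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def is_downgrade(current: str, target: str) -> bool:
    """Return True if moving from `current` to `target` lowers the version."""
    return parse_version(target) < parse_version(current)


if __name__ == "__main__":
    # Versions taken from the lvs4009 / lvs4010 checklist entries.
    print(is_downgrade("22.00.07.60", "21.85.21.92"))
```

Comparing integer tuples avoids the string-comparison trap where "9.0" would sort after "21.85".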

Details

Repo                     Branch      Lines +/-
operations/puppet        production  +1 -1
operations/puppet        production  +2 -5
operations/puppet        production  +4 -0
operations/puppet        production  +1 -11
operations/puppet        production  +4 -6
operations/homer/public  master      +0 -1
operations/homer/public  master      +1 -0
operations/puppet        production  +8 -3
operations/homer/public  master      +0 -1
operations/puppet        production  +1 -6
operations/puppet        production  +1 -0
operations/homer/public  master      +2 -1
operations/puppet        production  +9 -6
operations/homer/public  master      +0 -1
operations/puppet        production  +3 -1
operations/homer/public  master      +1 -0
operations/dns           master      +2 -0
operations/puppet        production  +1 -2
operations/puppet        production  +1 -0
operations/puppet        production  +1 -1
operations/puppet        production  +1 -2
operations/puppet        production  +2 -2
operations/puppet        production  +2 -5
operations/puppet        production  +8 -0
operations/homer/public  master      +1 -0
operations/puppet        production  +9 -5
operations/puppet        production  +2 -5
operations/puppet        production  +4 -0
operations/puppet        production  +4 -0
operations/dns           master      +1 -1
operations/homer/public  master      +1 -0
operations/puppet        production  +0 -0
operations/puppet        production  +5 -5
operations/puppet        production  +4 -1

Event Timeline

The issue with ganeti4005 was that the BIOS boot setting was set to UEFI, which is why Redfish was failing. After I changed it to BIOS, I had no issues.

END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti4005.mgmt.ulsfo.wmnet with reboot policy FORCED
pt1979@cumin2002:~$
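
To catch this kind of mismatch before running the provision cookbook, a pre-check along these lines might help. This is only a sketch: it assumes the BIOS attributes have already been fetched into a dict, and the `BootMode` key with values like "Bios" and "Uefi" is an assumption about how the BMC reports it, not something confirmed in this task. `boot_mode_problem` is a hypothetical helper name.

```python
from typing import Optional

# Assumption: the boot mode is exposed as "BootMode" with values such as
# "Bios" or "Uefi". Adjust the key and values to whatever the BMC reports.
BOOT_MODE_KEY = "BootMode"
EXPECTED_BOOT_MODE = "Bios"


def boot_mode_problem(bios_attributes: dict) -> Optional[str]:
    """Return a human-readable problem, or None if the boot mode looks fine."""
    mode = bios_attributes.get(BOOT_MODE_KEY)
    if mode is None:
        return f"{BOOT_MODE_KEY} attribute not reported"
    if mode.lower() != EXPECTED_BOOT_MODE.lower():
        return f"boot mode is {mode}, expected {EXPECTED_BOOT_MODE}"
    return None


if __name__ == "__main__":
    # Example input mirroring the state ganeti4005 was found in.
    print(boot_mode_problem({"BootMode": "Uefi"}))
```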

Change 845070 had a related patch set uploaded (by RobH; author: RobH):

[operations/puppet@production] ulsfo ganeti4005 lvs4008

https://gerrit.wikimedia.org/r/845070

Change 845070 merged by RobH:

[operations/puppet@production] ulsfo ganeti4005 lvs4008

https://gerrit.wikimedia.org/r/845070

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host ganeti4005.ulsfo.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host ganeti4005.ulsfo.wmnet with OS bullseye executed with errors:

  • ganeti4005 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host ganeti4005.ulsfo.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host lvs4008.ulsfo.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host ganeti4005.ulsfo.wmnet with OS bullseye completed:

  • ganeti4005 (WARN)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Unable to downtime the new host on Icinga/Alertmanager, the sre.hosts.downtime cookbook returned 99
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202210202338_robh_1954717_ganeti4005.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host lvs4008.ulsfo.wmnet with OS bullseye completed:

  • lvs4008 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202210210007_robh_1959939_lvs4008.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged
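
With this many reimage summaries in one task, it can help to pull out the per-host result automatically. The sketch below parses text in the bullet layout shown above (a host line ending in PASS, WARN or FAIL, followed by indented step lines). `parse_summary` and `needs_attention` are hypothetical helpers, and the sample is a shortened excerpt of the two summaries above.

```python
import re

# A host line looks like "  • ganeti4005 (WARN)"; step lines are other bullets.
HOST_LINE = re.compile(r"^\s*•\s+(?P<host>\S+)\s+\((?P<status>PASS|WARN|FAIL)\)\s*$")
STEP_LINE = re.compile(r"^\s*•\s+(?P<step>.+?)\s*$")


def parse_summary(text: str) -> dict:
    """Map each host to its overall status and the list of reported steps."""
    results = {}
    current = None
    for line in text.splitlines():
        host = HOST_LINE.match(line)
        if host:
            current = host.group("host")
            results[current] = {"status": host.group("status"), "steps": []}
            continue
        step = STEP_LINE.match(line)
        if step and current is not None:
            results[current]["steps"].append(step.group("step"))
    return results


def needs_attention(results: dict) -> list:
    """Return hosts whose overall status is anything other than PASS."""
    return [host for host, data in results.items() if data["status"] != "PASS"]


SAMPLE = """\
  • ganeti4005 (WARN)
    • Forced PXE for next reboot
    • Icinga status is not optimal, downtime not removed
  • lvs4008 (PASS)
    • Forced PXE for next reboot
    • Icinga status is optimal
"""

if __name__ == "__main__":
    print(needs_attention(parse_summary(SAMPLE)))
```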

Change 849010 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Make ganeti4005 a Ganeti node

https://gerrit.wikimedia.org/r/849010

Change 849010 merged by Muehlenhoff:

[operations/puppet@production] Make ganeti4005 a Ganeti node

https://gerrit.wikimedia.org/r/849010

Change 849023 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Swap ganeti4003 with ganeti4005 for blackbox smoke tests

https://gerrit.wikimedia.org/r/849023

Change 849023 merged by Muehlenhoff:

[operations/puppet@production] Swap ganeti4002/ganeti4003 for blackbox smoke tests

https://gerrit.wikimedia.org/r/849023

Mentioned in SAL (#wikimedia-operations) [2022-10-25T09:36:43Z] <moritzm> drain ganeti4002 for eventual decom T317247

Change 849054 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Remove ganeti4002 from Puppet for decom

https://gerrit.wikimedia.org/r/849054

Change 849054 merged by Muehlenhoff:

[operations/puppet@production] Remove ganeti4002 from Puppet for decom

https://gerrit.wikimedia.org/r/849054

cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: ganeti4002.ulsfo.wmnet

  • ganeti4002.ulsfo.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Icinga/Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

I have set up ganeti4005 as a node in the ulsfo Ganeti cluster and moved a VM to it to confirm it works as expected.

@RobH: I've also decommissioned ganeti4002, you can unrack it the next time you're going to ulsfo.

Change 849105 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/dns@master] Depool ulsfo for cp hosts hardware refresh

https://gerrit.wikimedia.org/r/849105

Change 849105 merged by Ssingh:

[operations/dns@master] Depool ulsfo for cp hosts hardware refresh

https://gerrit.wikimedia.org/r/849105

Mentioned in SAL (#wikimedia-operations) [2022-10-27T09:17:39Z] <moritzm> failover ganeti master in ulsfo to ganeti4008, unblocking future decom of ganeti4003 T317247

Change 850260 had a related patch set uploaded (by RobH; author: RobH):

[operations/puppet@production] ganeti4006

https://gerrit.wikimedia.org/r/850260

Change 850260 merged by RobH:

[operations/puppet@production] ganeti4006

https://gerrit.wikimedia.org/r/850260

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host ganeti4006.ulsfo.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host ganeti4006.ulsfo.wmnet with OS buster executed with errors:

  • ganeti4006 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host ganeti4006.ulsfo.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host ganeti4006.ulsfo.wmnet with OS bullseye completed:

  • ganeti4006 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202210272035_robh_3631110_ganeti4006.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged

Change 850413 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Make ganeti4006 a Ganeti node

https://gerrit.wikimedia.org/r/850413

Change 850413 merged by Muehlenhoff:

[operations/puppet@production] Make ganeti4006 a Ganeti node

https://gerrit.wikimedia.org/r/850413

Mentioned in SAL (#wikimedia-operations) [2022-10-28T09:53:36Z] <moritzm> drain ganeti4003 for eventual decom T317247

Change 850451 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Remove ganeti4003 from Puppet

https://gerrit.wikimedia.org/r/850451

Change 850451 merged by Muehlenhoff:

[operations/puppet@production] Remove ganeti4003 from Puppet

https://gerrit.wikimedia.org/r/850451

cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: ganeti4003.ulsfo.wmnet

  • ganeti4003.ulsfo.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Icinga/Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

cookbooks.sre.hosts.decommission executed by volans@cumin1001 for hosts: ganeti4003.ulsfo.wmnet

  • ganeti4003.ulsfo.wmnet (FAIL)
    • Unable to find/resolve the mgmt DNS record, using the IP instead: 10.128.129.19
    • Host not found on Icinga, unable to downtime it
    • Found physical host
    • Management interface not found on Icinga, unable to downtime it
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Failed to power off, manual intervention required: Remote IPMI for 10.128.129.19 failed (exit=1): b''
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

Change 855583 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/homer/public@master] sites.yaml: add lvs4008, replacing lvs4005 (ulsfo hardware refresh)

https://gerrit.wikimedia.org/r/855583

Change 855607 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] lvs4008: commission new LVS host (ulsfo hardware refresh)

https://gerrit.wikimedia.org/r/855607

Change 855583 merged by Ssingh:

[operations/homer/public@master] sites.yaml: add lvs4008 (ulsfo hardware refresh)

https://gerrit.wikimedia.org/r/855583

Change 855607 merged by Ssingh:

[operations/puppet@production] lvs4008: commission new LVS host (ulsfo hardware refresh)

https://gerrit.wikimedia.org/r/855607

cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: lvs4005.ulsfo.wmnet

  • lvs4005.ulsfo.wmnet (WARN)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Management interface not found on Icinga, unable to downtime it
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

Change 856946 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/homer/public@master] sites.yaml: remove decommissioned host lvs4005

https://gerrit.wikimedia.org/r/856946

Change 856946 merged by Ssingh:

[operations/homer/public@master] sites.yaml: remove decommissioned host lvs4005

https://gerrit.wikimedia.org/r/856946

Change 858336 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] lvs4009: commission new LVS host (ulsfo hardware refresh)

https://gerrit.wikimedia.org/r/858336

Change 859065 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/homer/public@master] sites.yaml: add lvs4009 (ulsfo hardware refresh)

https://gerrit.wikimedia.org/r/859065

Change 859086 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] lvs4006: set profile::pybal::bgp to no

https://gerrit.wikimedia.org/r/859086

Change 858336 merged by Ssingh:

[operations/puppet@production] lvs4009: commission new LVS host (ulsfo hardware refresh)

https://gerrit.wikimedia.org/r/858336

Change 859065 merged by Ssingh:

[operations/homer/public@master] sites.yaml: add lvs4009 (ulsfo hardware refresh)

https://gerrit.wikimedia.org/r/859065

Mentioned in SAL (#wikimedia-operations) [2022-11-22T18:46:29Z] <sukhe> cr[34]-ulsfo: set routing-options static route 198.35.26.112/28 next-hop 10.128.0.9: T317247

Change 859086 merged by Ssingh:

[operations/puppet@production] lvs4006: set profile::pybal::bgp to no

https://gerrit.wikimedia.org/r/859086

Change 859598 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] lvs4009: set as high-traffic2 primary LVS and remove lvs4006 (decomm)

https://gerrit.wikimedia.org/r/859598

Change 859600 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/homer/public@master] sites.yaml: remove decommissioned host lvs4006

https://gerrit.wikimedia.org/r/859600

cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: lvs4006.ulsfo.wmnet

  • lvs4006.ulsfo.wmnet (WARN)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Management interface not found on Icinga, unable to downtime it
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

Change 859598 merged by Ssingh:

[operations/puppet@production] lvs4009: set as high-traffic2 primary LVS and remove lvs4006 (decomm)

https://gerrit.wikimedia.org/r/859598

Change 859600 merged by jenkins-bot:

[operations/homer/public@master] sites.yaml: remove decommissioned host lvs4006

https://gerrit.wikimedia.org/r/859600

Change 860067 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] lvs4010: commission new LVS host (ulsfo hardware refresh)

https://gerrit.wikimedia.org/r/860067

Change 860089 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/homer/public@master] sites.yaml: add lvs4010 (ulsfo hardware refresh)

https://gerrit.wikimedia.org/r/860089

Change 860067 merged by Ssingh:

[operations/puppet@production] lvs4010: commission new LVS host (ulsfo hardware refresh)

https://gerrit.wikimedia.org/r/860067

Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs4010.ulsfo.wmnet with OS buster

Change 860094 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] lvs4010: set as secondary LVS and remove lvs4007 (decom)

https://gerrit.wikimedia.org/r/860094

Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs4010.ulsfo.wmnet with OS buster completed:

  • lvs4010 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202211231845_sukhe_397048_lvs4010.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active

Change 860089 merged by Ssingh:

[operations/homer/public@master] sites.yaml: add lvs4010 (ulsfo hardware refresh)

https://gerrit.wikimedia.org/r/860089

Change 860103 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/homer/public@master] sites.yaml: remove decommissioned host lvs4007

https://gerrit.wikimedia.org/r/860103

cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: lvs4007.ulsfo.wmnet

  • lvs4007.ulsfo.wmnet (WARN)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Management interface not found on Icinga, unable to downtime it
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

Change 860103 merged by Ssingh:

[operations/homer/public@master] sites.yaml: remove decommissioned host lvs4007

https://gerrit.wikimedia.org/r/860103

Change 860094 merged by Ssingh:

[operations/puppet@production] lvs4010: set as secondary LVS and remove lvs4007 (decom)

https://gerrit.wikimedia.org/r/860094

Change 860930 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] hiera: unify ulsfo LVS configuration

https://gerrit.wikimedia.org/r/860930

Change 860930 merged by Ssingh:

[operations/puppet@production] hiera: unify ulsfo LVS configuration

https://gerrit.wikimedia.org/r/860930

Change 866441 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] site.pp: update LVS hosts in ulsfo

https://gerrit.wikimedia.org/r/866441

Change 868733 had a related patch set uploaded (by RobH; author: RobH):

[operations/puppet@production] ganeti4007 insetup role

https://gerrit.wikimedia.org/r/868733

Change 868733 merged by RobH:

[operations/puppet@production] ganeti4007 insetup role

https://gerrit.wikimedia.org/r/868733

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host ganeti4007.ulsfo.wmnet with OS bullseye

RobH updated the task description.

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host ganeti4007.ulsfo.wmnet with OS bullseye completed:

  • ganeti4007 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202212161819_robh_1894479_ganeti4007.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
RobH claimed this task.

@MoritzMuehlenhoff,

ganeti4007 is all yours and this resolves all pending misc ulsfo installs =]

Change 869220 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Make ganeti4007 a Ganeti node

https://gerrit.wikimedia.org/r/869220

Change 869220 merged by Muehlenhoff:

[operations/puppet@production] Make ganeti4007 a Ganeti node

https://gerrit.wikimedia.org/r/869220

ganeti4007 has been added to the ulsfo Ganeti cluster.

Mentioned in SAL (#wikimedia-operations) [2022-12-20T10:16:40Z] <moritzm> rebalance ganeti cluster in ulsfo after adding new node and decom of the old hardware T317247

Change 866441 merged by Ssingh:

[operations/puppet@production] site.pp: update LVS hosts in ulsfo

https://gerrit.wikimedia.org/r/866441