Q1:rack/setup/install ulsfo misc class hosts
Open, Medium, Public

Description

This task will track the racking, setup, and OS installation of the 9 misc class hosts ordered for ulsfo refresh.

Hostname / Racking / Installation Details

Hostnames: lvs40(0[89]|10), ganeti400[5678], dns400[34]
Racking Proposal: odd numbered hostnames in rack 22, even numbered hostnames in rack 23
Networking Setup: # of Connections: 1 for everything except LVS, which needs 2. Speed: 10G. Vlan: Match existing hosts. AAAA records: Y. Additional IP records (Cassandra)?
Partitioning/Raid: HW Raid: N, Partman recipe and/or desired Raid Level:
OS Distro: Bullseye
Sub-team Technical Contact:@BBlack
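
The hostname patterns and the racking proposal above can be expanded mechanically. The following is a minimal Python sketch, not part of the task's tooling: the candidate number range (4000 to 4019) and the helper names `expand` and `proposed_rack` are assumptions made for illustration only.

```python
import re

# Patterns copied from the "Hostnames" line of the task description.
PATTERNS = [r"lvs40(0[89]|10)", r"ganeti400[5678]", r"dns400[34]"]


def expand(patterns, prefixes=("lvs", "ganeti", "dns"), numbers=range(4000, 4020)):
    """Return every candidate hostname that fully matches one of the patterns.

    The candidate space (prefix + number in 4000..4019) is an assumption.
    """
    candidates = [f"{prefix}{number}" for prefix in prefixes for number in numbers]
    return [name for name in candidates
            if any(re.fullmatch(pattern, name) for pattern in patterns)]


def proposed_rack(hostname):
    """Apply the racking proposal: odd hostnames in rack 22, even in rack 23."""
    number = int(re.search(r"\d+$", hostname).group())
    return 22 if number % 2 else 23


hosts = expand(PATTERNS)
for host in hosts:
    print(host, "-> rack", proposed_rack(host))
```

The expansion yields the 9 hosts tracked by this task, and the rack assignment follows only the odd/even rule stated in the racking proposal.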

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

lvs4008:
  • - receive in system on procurement task T311749 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - racked with power and mgmt, idrac not yet set up, allowing for automation to troubleshoot the auto provisioning script in ulsfo
  • - LVS hosts must be wired in to both their own ASW and the adjacent rack ASW
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller) - downgraded nic from 22. to 21.85.21.92
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
lvs4009:
  • - receive in system on procurement task T311749 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller) NIC downgraded from 22.00.07.60 to 21.85.21.92
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
lvs4010:
  • - receive in system on procurement task T311749 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller) NIC downgraded from 22.00.07.60 to 21.85.21.92
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
ganeti4005:
  • - receive in system on procurement task T311749 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details - fails the provision script for redfish connection, idrac manually set up for remote connectivity so @Volans can troubleshoot what's up
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
ganeti4006:
  • - receive in system on procurement task T311749 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
ganeti4007:
  • - receive in system on procurement task T311749 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
ganeti4008:
  • - receive in system on procurement task T311749 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
dns4003:
  • - receive in system on procurement task T311749 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller) - bios newest, idrac capped, network rolled back to working version 21.85 from 22.x
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
dns4004:
  • - receive in system on procurement task T311749 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.

Details

Project                  Branch      Lines +/-
operations/puppet        production  +1 -11
operations/puppet        production  +4 -6
operations/homer/public  master      +0 -1
operations/homer/public  master      +1 -0
operations/puppet        production  +8 -3
operations/homer/public  master      +0 -1
operations/puppet        production  +1 -6
operations/puppet        production  +1 -0
operations/homer/public  master      +2 -1
operations/puppet        production  +9 -6
operations/homer/public  master      +0 -1
operations/puppet        production  +3 -1
operations/homer/public  master      +1 -0
operations/dns           master      +2 -0
operations/puppet        production  +1 -2
operations/puppet        production  +1 -0
operations/puppet        production  +1 -1
operations/puppet        production  +1 -2
operations/puppet        production  +2 -2
operations/puppet        production  +2 -5
operations/puppet        production  +8 -0
operations/homer/public  master      +1 -0
operations/puppet        production  +9 -5
operations/puppet        production  +2 -5
operations/puppet        production  +4 -0
operations/puppet        production  +4 -0
operations/dns           master      +1 -1
operations/homer/public  master      +1 -0
operations/puppet        production  +0 -0
operations/puppet        production  +5 -5
operations/puppet        production  +4 -1

Event Timeline

There are a very large number of changes, so older changes are hidden.

@ssingh: dns4004 installed fine, so it's ready for role and reimage as needed by Traffic. I also kicked the dns4002 decom task over to you for puppet & configuration cleanup.

Change 841390 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Make ganeti4008 a Ganeti node

https://gerrit.wikimedia.org/r/841390

Change 841496 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] dns4004: add Puppet role and DNS/NTP configs

https://gerrit.wikimedia.org/r/841496

Change 841390 merged by Muehlenhoff:

[operations/puppet@production] Make ganeti4008 a Ganeti node

https://gerrit.wikimedia.org/r/841390

Change 841533 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/homer/public@master] sites.yaml: add dns4004 to anycast_neighbors

https://gerrit.wikimedia.org/r/841533

Change 841496 merged by Ssingh:

[operations/puppet@production] dns4004: add Puppet role and DNS/NTP configs

https://gerrit.wikimedia.org/r/841496

Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns4004.wikimedia.org with OS buster

I have set up ganeti4008 as a node in the ulsfo Ganeti cluster and moved a VM to it to confirm it works as expected.

Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns4004.wikimedia.org with OS buster completed:

  • dns4004 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202210111504_sukhe_3935795_dns4004.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Failed to get Netbox script results, try manually: https://netbox.discovery.wmnet/api/extras/job-results/3844849/

Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns4004.wikimedia.org with OS buster executed with errors:

  • dns4004 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202210111504_sukhe_3935795_dns4004.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Failed to get Netbox script results, try manually: https://netbox.discovery.wmnet/api/extras/job-results/3844849/
    • The reimage failed, see the cookbook logs for the details
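
In the dns4004 report above, every installation step is listed as done and the only failing step is fetching the Netbox script results. A report in this layout can be scanned for such steps with a small Python sketch. This is an illustration only, not part of the cookbook: the function name `parse_report` and the list of prefixes treated as problems are assumptions, and the sample is an abbreviated copy of the report.

```python
import re

# A host summary line looks like "• dns4004 (FAIL)".
HOST_LINE = re.compile(r"^\s*•\s+(\S+)\s+\((PASS|WARN|FAIL)\)\s*$")
# Any other bullet is treated as one step of the run.
STEP_LINE = re.compile(r"^\s*•\s+(.*\S)\s*$")
# Assumption: steps starting with these words indicate a problem.
PROBLEM_PREFIXES = ("Failed", "Unable", "The reimage failed")


def parse_report(text):
    """Split a pasted reimage report into host, status, steps and problem steps."""
    host, status, steps = None, None, []
    for line in text.splitlines():
        match = HOST_LINE.match(line)
        if match and host is None:
            host, status = match.group(1), match.group(2)
            continue
        match = STEP_LINE.match(line)
        if match and host is not None:
            steps.append(match.group(1))
    problems = [step for step in steps if step.startswith(PROBLEM_PREFIXES)]
    return {"host": host, "status": status, "steps": steps, "problems": problems}


SAMPLE = """\
  • dns4004 (FAIL)
    • Icinga status is optimal
    • Updated Netbox data from PuppetDB
    • Failed to get Netbox script results, try manually: https://netbox.discovery.wmnet/api/extras/job-results/3844849/
    • The reimage failed, see the cookbook logs for the details
"""

report = parse_report(SAMPLE)
print(report["host"], report["status"])
for problem in report["problems"]:
    print("problem:", problem)
```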

Change 841533 merged by Ssingh:

[operations/homer/public@master] sites.yaml: add dns4004 to anycast_neighbors

https://gerrit.wikimedia.org/r/841533

dns4004 has been commissioned.

RobH added a subscriber: Volans.

When attempting to run the provision script for both ganeti4005 and lvs4008 I get the same error:

Testing Redfish API connection to lvs4008 (10.128.128.17)
Failed to run cookbooks.sre.hosts.provision.ProvisionRunner.run.<locals>.check_connection: Unable to connect to the Redfish API of lvs4008. Follow https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Platform-specific_documentation/Dell_Documentation#Troubleshooting_2
==> What do you want to do? "retry" the last command, manually fix the issue and "skip" the last command to continue the execution or completely "abort" the execution.

Following the directions in that link, I can then run eval $(sudo grep '/usr/bin/base64' /var/log/spicerack/sre/hosts/provision.log | grep lvs4008 | tail -n1 | grep -o "/bin/echo.*base64 -d") on the cumin host, and it does indeed read the correct idrac serial number (compared to the physical label on the server and to netbox).

robh@install4001:~$ sudo tcpdump -vvv 'udp and (src port 67 or src port 68 or src port 69)' | grep 'Hostname Option'
tcpdump: listening on ens13, link-type EN10MB (Ethernet), capture size 262144 bytes

It sits and sees nothing while the script is running and failing with the error message above.

@Volans: Any idea what's up with this?

Output after running the cookbook on lvs4008:

END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host lvs4008.mgmt.ulsfo.wmnet with reboot policy FORCED

Output when running the cookbook on ganeti4005:

Entering configuration mode
warning: statement not found
[edit interfaces xe-1/0/9]
-   description DISABLED;
+   description "ganeti4005 {#ganeti4005d}";
-   disable;
+   mtu 9192;
+   unit 0 {
+       family ethernet-switching {
+           interface-mode access;
+           vlan {
+               members private1-ulsfo;
+           }
+       }
+   }
load complete
Exiting configuration mode

I commit this and get another output:

Failed to run cookbooks.sre.hosts.provision.ProvisionRunner._config: 'BiosBootSeq' What do you want to do? "retry" the last command, manually fix the issue and "skip" the last command to continue the execution or completely "abort" the execution.

I skip the above and get:

No configuration change needed on the switch for ##PRIMARY##
END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti4005.mgmt.ulsfo.wmnet with reboot policy FORCED

It passed, but Redfish was not able to set the BiosBootSeq, so I am looking at this now.

So the issue with ganeti4005 was that the BIOS boot setting was set to UEFI, which is why Redfish was failing. After I changed it to BIOS I had no issues:

END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti4005.mgmt.ulsfo.wmnet with reboot policy FORCED
pt1979@cumin2002:~$
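
One plausible reading of the bare `'BiosBootSeq'` in the failure message is a dictionary lookup on a key that is absent when the host boots in UEFI mode. The Python sketch below only illustrates that failure mode and does not reproduce the provision cookbook's code. `BiosBootSeq` comes from the error message above; the attribute names `BootMode` and `UefiBootSeq`, the function names, and the sample values are assumptions.

```python
def boot_sequence_attribute(bios_attributes):
    """Return the boot-sequence attribute name matching the configured boot mode."""
    mode = str(bios_attributes.get("BootMode", "")).lower()
    return "UefiBootSeq" if mode == "uefi" else "BiosBootSeq"


def read_bios_boot_sequence(bios_attributes):
    """Read the legacy BIOS boot sequence by direct indexing.

    Raises KeyError('BiosBootSeq') when the attribute is absent, for example
    on a host configured for UEFI boot in this hypothetical data model.
    """
    return bios_attributes["BiosBootSeq"]


# Hypothetical attribute sets; the values are made up for the example.
uefi_host = {"BootMode": "Uefi", "UefiBootSeq": "NIC.PxeDevice.1-1"}
bios_host = {"BootMode": "Bios", "BiosBootSeq": "NIC.Integrated.1-1-1"}

message = None
try:
    read_bios_boot_sequence(uefi_host)
except KeyError as error:
    # str() of a KeyError is the quoted key name, as seen in the cookbook output.
    message = str(error)

print("UEFI host lookup failed with:", message)
print("BIOS host boot sequence:", read_bios_boot_sequence(bios_host))
print("attribute to use on the UEFI host:", boot_sequence_attribute(uefi_host))
```

Checking the boot mode before reading the sequence, as `boot_sequence_attribute` does, would surface the UEFI setting directly instead of a bare key name.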

Change 845070 had a related patch set uploaded (by RobH; author: RobH):

[operations/puppet@production] ulsfo ganeti4005 lvs4008

https://gerrit.wikimedia.org/r/845070

Change 845070 merged by RobH:

[operations/puppet@production] ulsfo ganeti4005 lvs4008

https://gerrit.wikimedia.org/r/845070

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host ganeti4005.ulsfo.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host ganeti4005.ulsfo.wmnet with OS bullseye executed with errors:

  • ganeti4005 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host ganeti4005.ulsfo.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host lvs4008.ulsfo.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host ganeti4005.ulsfo.wmnet with OS bullseye completed:

  • ganeti4005 (WARN)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Unable to downtime the new host on Icinga/Alertmanager, the sre.hosts.downtime cookbook returned 99
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202210202338_robh_1954717_ganeti4005.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host lvs4008.ulsfo.wmnet with OS bullseye completed:

  • lvs4008 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202210210007_robh_1959939_lvs4008.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged

Change 849010 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Make ganeti4005 a Ganeti node

https://gerrit.wikimedia.org/r/849010

Change 849010 merged by Muehlenhoff:

[operations/puppet@production] Make ganeti4005 a Ganeti node

https://gerrit.wikimedia.org/r/849010

Change 849023 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Swap ganeti4003 with ganeti4005 for blackbox smoke tests

https://gerrit.wikimedia.org/r/849023

Change 849023 merged by Muehlenhoff:

[operations/puppet@production] Swap ganeti4002/ganeti4003 for blackbox smoke tests

https://gerrit.wikimedia.org/r/849023

Mentioned in SAL (#wikimedia-operations) [2022-10-25T09:36:43Z] <moritzm> drain ganeti4002 for eventual decom T317247

Change 849054 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Remove ganeti4002 from Puppet for decom

https://gerrit.wikimedia.org/r/849054

Change 849054 merged by Muehlenhoff:

[operations/puppet@production] Remove ganeti4002 from Puppet for decom

https://gerrit.wikimedia.org/r/849054

cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: ganeti4002.ulsfo.wmnet

  • ganeti4002.ulsfo.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Icinga/Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

I have set up ganeti4005 as a node in the ulsfo Ganeti cluster and moved a VM to it to confirm it works as expected.

@RobH: I've also decommissioned ganeti4002, you can unrack it the next time you're going to ulsfo.

Change 849105 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/dns@master] Depool ulsfo for cp hosts hardware refresh

https://gerrit.wikimedia.org/r/849105

Change 849105 merged by Ssingh:

[operations/dns@master] Depool ulsfo for cp hosts hardware refresh

https://gerrit.wikimedia.org/r/849105

Mentioned in SAL (#wikimedia-operations) [2022-10-27T09:17:39Z] <moritzm> failover ganeti master in ulsfo to ganeti4008, unblocking future decom of ganeti4003 T317247

Change 850260 had a related patch set uploaded (by RobH; author: RobH):

[operations/puppet@production] ganeti4006

https://gerrit.wikimedia.org/r/850260

Change 850260 merged by RobH:

[operations/puppet@production] ganeti4006

https://gerrit.wikimedia.org/r/850260

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host ganeti4006.ulsfo.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host ganeti4006.ulsfo.wmnet with OS buster executed with errors:

  • ganeti4006 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host ganeti4006.ulsfo.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host ganeti4006.ulsfo.wmnet with OS bullseye completed:

  • ganeti4006 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202210272035_robh_3631110_ganeti4006.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged

Change 850413 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Make ganeti4006 a Ganeti node

https://gerrit.wikimedia.org/r/850413

Change 850413 merged by Muehlenhoff:

[operations/puppet@production] Make ganeti4006 a Ganeti node

https://gerrit.wikimedia.org/r/850413

Mentioned in SAL (#wikimedia-operations) [2022-10-28T09:53:36Z] <moritzm> drain ganeti4003 for eventual decom T317247

Change 850451 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Remove ganeti4003 from Puppet

https://gerrit.wikimedia.org/r/850451

Change 850451 merged by Muehlenhoff:

[operations/puppet@production] Remove ganeti4003 from Puppet

https://gerrit.wikimedia.org/r/850451

cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: ganeti4003.ulsfo.wmnet

  • ganeti4003.ulsfo.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Icinga/Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

cookbooks.sre.hosts.decommission executed by volans@cumin1001 for hosts: ganeti4003.ulsfo.wmnet

  • ganeti4003.ulsfo.wmnet (FAIL)
    • Unable to find/resolve the mgmt DNS record, using the IP instead: 10.128.129.19
    • Host not found on Icinga, unable to downtime it
    • Found physical host
    • Management interface not found on Icinga, unable to downtime it
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Failed to power off, manual intervention required: Remote IPMI for 10.128.129.19 failed (exit=1): b''
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

Change 855583 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/homer/public@master] sites.yaml: add lvs4008, replacing lvs4005 (ulsfo hardware refresh)

https://gerrit.wikimedia.org/r/855583

Change 855607 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] lvs4008: commission new LVS host (ulsfo hardware refresh)

https://gerrit.wikimedia.org/r/855607

Change 855583 merged by Ssingh:

[operations/homer/public@master] sites.yaml: add lvs4008 (ulsfo hardware refresh)

https://gerrit.wikimedia.org/r/855583

Change 855607 merged by Ssingh:

[operations/puppet@production] lvs4008: commission new LVS host (ulsfo hardware refresh)

https://gerrit.wikimedia.org/r/855607

cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: lvs4005.ulsfo.wmnet

  • lvs4005.ulsfo.wmnet (WARN)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Management interface not found on Icinga, unable to downtime it
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

Change 856946 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/homer/public@master] sites.yaml: remove decommissioned host lvs4005

https://gerrit.wikimedia.org/r/856946

Change 856946 merged by Ssingh:

[operations/homer/public@master] sites.yaml: remove decommissioned host lvs4005

https://gerrit.wikimedia.org/r/856946

Change 858336 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] lvs4009: commission new LVS host (ulsfo hardware refresh)

https://gerrit.wikimedia.org/r/858336

Change 859065 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/homer/public@master] sites.yaml: add lvs4009 (ulsfo hardware refresh)

https://gerrit.wikimedia.org/r/859065

Change 859086 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] lvs4006: set profile::pybal::bgp to no

https://gerrit.wikimedia.org/r/859086

Change 858336 merged by Ssingh:

[operations/puppet@production] lvs4009: commission new LVS host (ulsfo hardware refresh)

https://gerrit.wikimedia.org/r/858336

Change 859065 merged by Ssingh:

[operations/homer/public@master] sites.yaml: add lvs4009 (ulsfo hardware refresh)

https://gerrit.wikimedia.org/r/859065

Mentioned in SAL (#wikimedia-operations) [2022-11-22T18:46:29Z] <sukhe> cr[34]-ulsfo: set routing-options static route 198.35.26.112/28 next-hop 10.128.0.9: T317247
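
For readers checking the static route in the SAL entry above, the prefix arithmetic can be verified with Python's standard `ipaddress` module. This is a small sketch for illustration; the variable names are arbitrary and nothing here touches the routers.

```python
import ipaddress

# Values copied from the SAL entry.
route = ipaddress.ip_network("198.35.26.112/28")
next_hop = ipaddress.ip_address("10.128.0.9")

first, last = route[0], route[-1]

# A /28 leaves 4 host bits, so the prefix spans 16 addresses.
print("addresses in prefix:", route.num_addresses)
print("range:", first, "-", last)
# The next hop sits in private (RFC 1918) address space.
print("next hop is private:", next_hop.is_private)
```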

Change 859086 merged by Ssingh:

[operations/puppet@production] lvs4006: set profile::pybal::bgp to no

https://gerrit.wikimedia.org/r/859086

Change 859598 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] lvs4009: set as high-traffic2 primary LVS and remove lvs4006 (decomm)

https://gerrit.wikimedia.org/r/859598

Change 859600 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/homer/public@master] sites.yaml: remove decommissioned host lvs4006

https://gerrit.wikimedia.org/r/859600

cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: lvs4006.ulsfo.wmnet

  • lvs4006.ulsfo.wmnet (WARN)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Management interface not found on Icinga, unable to downtime it
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

Change 859598 merged by Ssingh:

[operations/puppet@production] lvs4009: set as high-traffic2 primary LVS and remove lvs4006 (decomm)

https://gerrit.wikimedia.org/r/859598

Change 859600 merged by jenkins-bot:

[operations/homer/public@master] sites.yaml: remove decommissioned host lvs4006

https://gerrit.wikimedia.org/r/859600

Change 860067 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] lvs4010: commission new LVS host (ulsfo hardware refresh)

https://gerrit.wikimedia.org/r/860067

Change 860089 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/homer/public@master] sites.yaml: add lvs4010 (ulsfo hardware refresh)

https://gerrit.wikimedia.org/r/860089

Change 860067 merged by Ssingh:

[operations/puppet@production] lvs4010: commission new LVS host (ulsfo hardware refresh)

https://gerrit.wikimedia.org/r/860067

Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs4010.ulsfo.wmnet with OS buster

Change 860094 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] lvs4010: set as secondary LVS and remove lvs4007 (decom)

https://gerrit.wikimedia.org/r/860094

Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs4010.ulsfo.wmnet with OS buster completed:

  • lvs4010 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202211231845_sukhe_397048_lvs4010.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active

Change 860089 merged by Ssingh:

[operations/homer/public@master] sites.yaml: add lvs4010 (ulsfo hardware refresh)

https://gerrit.wikimedia.org/r/860089

Change 860103 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/homer/public@master] sites.yaml: remove decommissioned host lvs4007

https://gerrit.wikimedia.org/r/860103

cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: lvs4007.ulsfo.wmnet

  • lvs4007.ulsfo.wmnet (WARN)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Management interface not found on Icinga, unable to downtime it
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

Change 860103 merged by Ssingh:

[operations/homer/public@master] sites.yaml: remove decommissioned host lvs4007

https://gerrit.wikimedia.org/r/860103

Change 860094 merged by Ssingh:

[operations/puppet@production] lvs4010: set as secondary LVS and remove lvs4007 (decom)

https://gerrit.wikimedia.org/r/860094

Change 860930 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] hiera: unify ulsfo LVS configuration

https://gerrit.wikimedia.org/r/860930

Change 860930 merged by Ssingh:

[operations/puppet@production] hiera: unify ulsfo LVS configuration

https://gerrit.wikimedia.org/r/860930