
Q2:rack/setup/install eqsin refresh
Closed, Resolved (Public)

Description

This task will track the racking, setup, and OS installation of:
cp50[17-32]

cp5017-cp5024: text
cp5025-cp5032: upload

lvs500[456]
ganeti500[4567]
dns500[34]
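
The bracket shorthand above mixes two forms: a numeric range (`cp50[17-32]`) and a set of single digits (`lvs500[456]`). A minimal sketch of how the shorthand could be expanded into individual hostnames follows. The `expand_hosts` helper is hypothetical and written only for this task's notation; it is not part of any Wikimedia tooling.

```python
import re
from typing import List


def expand_hosts(pattern: str) -> List[str]:
    """Expand shorthand such as 'cp50[17-32]' or 'lvs500[456]' into hostnames."""
    match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", pattern)
    if match is None:
        # No bracket expression: treat the input as a single hostname.
        return [pattern]
    prefix, body, suffix = match.groups()
    range_match = re.fullmatch(r"(\d+)-(\d+)", body)
    if range_match:
        # Numeric range, inclusive on both ends, keeping the digit width.
        start, end = range_match.groups()
        width = len(start)
        values = [str(n).zfill(width) for n in range(int(start), int(end) + 1)]
    else:
        # Otherwise treat the body as a set of single characters.
        values = list(body)
    return [f"{prefix}{value}{suffix}" for value in values]


if __name__ == "__main__":
    for shorthand in ("cp50[17-32]", "lvs500[456]", "ganeti500[4567]", "dns500[34]"):
        print(shorthand, "->", ", ".join(expand_hosts(shorthand)))
```

Under this reading, the task covers 16 cp hosts, 3 lvs hosts, 4 ganeti hosts, and 2 dns hosts, which matches the 25 checklists below.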

This will need to be scheduled with Jin of DreamIIC to handle all on-site work.

Hostname / Racking / Installation Details

This task will track the installation of eqsin's refresh of hosts. The racking elevation diagram handed to Jin is here.

Automation Testing

Please note that cp5032 has been added to netbox, and the netbox network provisioning, the dns cookbook, and the network port cookbook have all been run for it. The cookbook that automatically sets up the bios/idrac has not been run; that is left for automation testing.

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

cp5017
  • - receive in system on procurement task T313266 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
cp5018
  • - receive in system on procurement task T313266 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
cp5019
  • - receive in system on procurement task T313266 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
cp5020
  • - receive in system on procurement task T313266 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
cp5021
  • - receive in system on procurement task T313266 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
cp5022
  • - receive in system on procurement task T313266 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
cp5023
  • - receive in system on procurement task T313266 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
cp5024
  • - receive in system on procurement task T313266 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
cp5025
  • - receive in system on procurement task T313266 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
cp5026
  • - receive in system on procurement task T313266 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
cp5027
  • - receive in system on procurement task T313266 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
cp5028
  • - receive in system on procurement task T313266 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
cp5029
  • - receive in system on procurement task T313266 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
cp5030
  • - receive in system on procurement task T313266 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
cp5031
  • - receive in system on procurement task T313266 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
cp5032
  • - receive in system on procurement task T313266 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned) https://netbox.wikimedia.org/dcim/devices/4485/
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
lvs5004
  • - receive in system on procurement task T313266 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller) - bios and idrac latest allowed revisions, downgraded nic to 21.85 from 22.x
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
lvs5005
  • - receive in system on procurement task T313266 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
lvs5006
  • - receive in system on procurement task T313266 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
ganeti5004
  • - receive in system on procurement task T313266 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
ganeti5005
  • - receive in system on procurement task T313266 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
ganeti5006
  • - receive in system on procurement task T313266 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
ganeti5007
  • - receive in system on procurement task T313266 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
dns5003
  • - receive in system on procurement task T313266 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
dns5004
  • - receive in system on procurement task T313266 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
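
Since every host carries the same eight-step checklist, the copy-and-paste step could be scripted. The sketch below is only an illustration: `CHECKLIST_STEPS`, `render_checklist`, and `render_all` are hypothetical names, and the step text is taken from the checklists above.

```python
from typing import Iterable

# Step text copied from the per-host checklist in this task.
CHECKLIST_STEPS = [
    "receive in system on procurement task T313266 & in coupa",
    "rack system with proposed racking plan (see above) & update netbox "
    "(include all system info plus location, state of planned)",
    "add mgmt dns (asset tag and hostname) and production dns entries in netbox, "
    "run cookbook sre.dns.netbox.",
    "network port setup via netbox, run homer from an active cumin host to commit",
    "bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details",
    "firmware update (idrac, bios, network, raid controller)",
    "operations/puppet update - this should include updates to netboot.pp, "
    "and site.pp role(insetup) or cp systems use role(insetup::nofirm).",
    "OS installation & initial puppet run via sre.hosts.reimage cookbook.",
]


def render_checklist(hostname: str) -> str:
    """Return one host's checklist in the layout used in this task."""
    lines = [hostname] + [f"  • - {step}" for step in CHECKLIST_STEPS]
    return "\n".join(lines)


def render_all(hostnames: Iterable[str]) -> str:
    """Return the checklists for several hosts, one after another."""
    return "\n".join(render_checklist(host) for host in hostnames)


if __name__ == "__main__":
    print(render_all(["dns5003", "dns5004"]))
```

Per-host notes, such as the netbox device link on cp5032 or the NIC firmware downgrade on lvs5004, would still need to be added by hand.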

Details

Related Changes in Gerrit:
Repo | Branch | Lines +/-
operations/puppet | production | +1 -2
operations/puppet | production | +2 -11
operations/puppet | production | +1 -1
operations/homer/public | master | +1 -0
operations/puppet | production | +10 -3
operations/homer/public | master | +1 -0
operations/homer/public | master | +1 -0
operations/puppet | production | +1 -1
operations/puppet | production | +3 -118
operations/puppet | production | +9 -2
operations/puppet | production | +8 -0
operations/puppet | production | +4 -0
operations/homer/public | master | +1 -0
operations/homer/public | master | +1 -0
operations/puppet | production | +5 -6
operations/puppet | production | +1 -1
operations/puppet | production | +1 -2
operations/puppet | production | +1 -1
operations/puppet | production | +0 -2
operations/puppet | production | +9 -2
operations/puppet | production | +4 -0
operations/puppet | production | +14 -1
operations/puppet | production | +14 -1
operations/puppet | production | +14 -1
operations/puppet | production | +12 -1
operations/puppet | production | +12 -1
operations/puppet | production | +12 -1
operations/puppet | production | +12 -1
operations/puppet | production | +14 -1
operations/puppet | production | +14 -1
operations/puppet | production | +14 -1
operations/puppet | production | +14 -1
operations/puppet | production | +14 -1
operations/puppet | production | +12 -1
operations/puppet | production | +12 -1
operations/puppet | production | +12 -1
operations/puppet | production | +12 -1
operations/puppet | production | +2 -0
operations/puppet | production | +12 -1
operations/puppet | production | +1 -1

Event Timeline

Change 862998 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/homer/public@master] sites.yaml: add dns5004 (eqsin hardware refresh)

https://gerrit.wikimedia.org/r/862998

Change 862996 merged by Ssingh:

[operations/puppet@production] dns5004: add Puppet role and DNS/NTP configs

https://gerrit.wikimedia.org/r/862996

Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns5004.wikimedia.org with OS buster

Change 862943 merged by Ssingh:

[operations/puppet@production] lvs5004: commission new LVS host (eqsin hardware refresh)

https://gerrit.wikimedia.org/r/862943

Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs5004.eqsin.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs5004.eqsin.wmnet with OS buster executed with errors:

  • lvs5004 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details
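
The cookbook reports in this timeline share one shape: a `host (STATUS)` header line followed by one line per completed step. A minimal sketch of a parser for that shape is shown below, for example to tally PASS/WARN/FAIL outcomes per host. The `ReimageReport` type and `parse_report` function are hypothetical and assume only the layout seen in this task, not any documented Spicerack output format.

```python
import re
from typing import List, NamedTuple


class ReimageReport(NamedTuple):
    host: str
    status: str
    steps: List[str]


# Header such as "  • lvs5004 (FAIL)"; steps are any later bullet lines.
HEADER = re.compile(r"^\s*•\s*(\S+)\s+\((PASS|FAIL|WARN)\)\s*$")
STEP = re.compile(r"^\s*•\s*(.+?)\s*$")


def parse_report(text: str) -> ReimageReport:
    """Extract the host, overall status, and step list from one report."""
    host = None
    status = None
    steps: List[str] = []
    for line in text.splitlines():
        header = HEADER.match(line)
        if header and host is None:
            host, status = header.groups()
        elif host is not None:
            step = STEP.match(line)
            if step:
                steps.append(step.group(1))
    if host is None or status is None:
        raise ValueError("no '<host> (STATUS)' header found")
    return ReimageReport(host, status, steps)


if __name__ == "__main__":
    sample = (
        "  • lvs5004 (FAIL)\n"
        "    • Removed from Puppet and PuppetDB if present\n"
        "    • The reimage failed, see the cookbook logs for the details\n"
    )
    report = parse_report(sample)
    print(report.host, report.status, len(report.steps))
```

The last step of a failed run is the generic "The reimage failed" line, so the step just before it shows how far the cookbook got before the failure.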

Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns5004.wikimedia.org with OS buster executed with errors:

  • dns5004 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202212011714_sukhe_2310196_dns5004.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • The reimage failed, see the cookbook logs for the details
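
The log paths quoted in these reports, such as `202212011714_sukhe_2310196_dns5004.out`, appear to encode a start timestamp, the invoking user, a numeric identifier, and the short hostname. The sketch below splits a path on that assumption. The field meanings are inferred from the filenames in this task rather than from Spicerack documentation, and the names `parse_log_name` and `run_id` are hypothetical.

```python
from datetime import datetime
from pathlib import PurePosixPath
from typing import Any, Dict


def parse_log_name(path: str) -> Dict[str, Any]:
    """Split a reimage log path into its apparent components."""
    stem = PurePosixPath(path).stem  # e.g. "202212011714_sukhe_2310196_dns5004"
    stamp, user, run_id, host = stem.split("_", 3)
    return {
        # Assumed format YYYYMMDDHHMM; the time zone is not stated in the name.
        "started": datetime.strptime(stamp, "%Y%m%d%H%M"),
        "user": user,
        # Meaning of the numeric field is an assumption (possibly a process ID).
        "run_id": int(run_id),
        "host": host,
    }


if __name__ == "__main__":
    info = parse_log_name(
        "/var/log/spicerack/sre/hosts/reimage/202212011714_sukhe_2310196_dns5004.out"
    )
    print(info)
```

Sorting the parsed entries by `started` gives the order in which the reimage attempts for a host were launched.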

Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns5004.wikimedia.org with OS buster

Change 863046 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] hiera: temporarily remove references to dns5004

https://gerrit.wikimedia.org/r/863046

Change 863046 merged by Ssingh:

[operations/puppet@production] hiera: temporarily remove references to dns5004

https://gerrit.wikimedia.org/r/863046

Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns5004.wikimedia.org with OS buster completed:

  • dns5004 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202212011940_sukhe_2345615_dns5004.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active

Mentioned in SAL (#wikimedia-operations) [2022-12-02T07:41:34Z] <moritzm> draining ganeti5001 for eventual decom T322048

Change 863236 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Switch to ganeti5004 in blackbox smoke tests

https://gerrit.wikimedia.org/r/863236

Change 863237 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Remove ganeti5001 from Puppet

https://gerrit.wikimedia.org/r/863237

Change 863236 merged by Muehlenhoff:

[operations/puppet@production] Switch to ganeti5004 in blackbox smoke tests

https://gerrit.wikimedia.org/r/863236

Change 863237 merged by Muehlenhoff:

[operations/puppet@production] Remove ganeti5001 from Puppet

https://gerrit.wikimedia.org/r/863237

cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: ganeti5001.eqsin.wmnet

  • ganeti5001.eqsin.wmnet (WARN)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Management interface not found on Icinga, unable to downtime it
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ganeti5001 has been decommissioned and can be unracked.

Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns5004.wikimedia.org with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1001 for host lvs5004.eqsin.wmnet with OS buster

Change 863367 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] lvs5004: update interface names in profile::lvs::interface_tweaks

https://gerrit.wikimedia.org/r/863367

Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns5004.wikimedia.org with OS buster completed:

  • dns5004 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202212021438_sukhe_2539134_dns5004.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change 863367 merged by Ssingh:

[operations/puppet@production] lvs5004: update interface names in profile::lvs::interface_tweaks

https://gerrit.wikimedia.org/r/863367

Change 862998 merged by Ssingh:

[operations/homer/public@master] sites.yaml: add dns5004 (eqsin hardware refresh)

https://gerrit.wikimedia.org/r/862998

Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1001 for host lvs5004.eqsin.wmnet with OS buster completed:

  • lvs5004 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202212021448_sukhe_2619696_lvs5004.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active

Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1001 for host lvs5004.eqsin.wmnet with OS buster executed with errors:

  • lvs5004 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202212021448_sukhe_2619696_lvs5004.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The reimage failed, see the cookbook logs for the details
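
The report above ends in FAIL even though most steps completed, so it helps to pull out only the steps that signal a problem. The following is a minimal sketch, assuming the bullet layout shown in this task; `summarize`, `PROBLEM_MARKERS` and `SAMPLE` are hypothetical names for illustration and are not part of Spicerack or the reimage cookbook.

```python
import re

# Hypothetical helper for reading cookbook reports pasted into a task.
# It is NOT part of Spicerack; the layout is inferred from the reports above.
HOST_RE = re.compile(r"^\s*•\s+(\S+)\s+\((PASS|FAIL|WARN)\)\s*$")
STEP_RE = re.compile(r"^\s*•\s+(.*\S)\s*$")

# Phrases seen in this task's reports that indicate a step went wrong.
PROBLEM_MARKERS = ("Unable to", "not optimal", "failed", "Failed", "not found")


def summarize(report: str) -> dict:
    """Map each host to its overall status and its problem steps."""
    hosts = {}
    current = None
    for line in report.splitlines():
        host = HOST_RE.match(line)
        if host:
            current = host.group(1)
            hosts[current] = {"status": host.group(2), "problems": []}
            continue
        step = STEP_RE.match(line)
        if step and current is not None:
            text = step.group(1)
            if any(marker in text for marker in PROBLEM_MARKERS):
                hosts[current]["problems"].append(text)
    return hosts


# Excerpt of the lvs5004 report, copied from this task.
SAMPLE = """\
  • lvs5004 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Automatic Puppet run was successful
    • Icinga status is not optimal, downtime not removed
    • The reimage failed, see the cookbook logs for the details
"""

if __name__ == "__main__":
    for name, info in summarize(SAMPLE).items():
        print(name, info["status"])
        for problem in info["problems"]:
            print("  -", problem)
```

The marker list is deliberately small and based only on wording that appears in this task; other cookbooks may phrase failures differently.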

Change 862944 merged by Ssingh:

[operations/homer/public@master] sites.yaml: add lvs5004 (eqsin hardware refresh)

https://gerrit.wikimedia.org/r/862944

Change 863379 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/puppet@production] profile::pybal: expand the lvs hostname regexen.

https://gerrit.wikimedia.org/r/863379

Change 863379 merged by Ssingh:

[operations/puppet@production] profile::pybal: expand the lvs hostname regexen.

https://gerrit.wikimedia.org/r/863379

Change 865120 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/puppet@production] eqsin cp: unify per-node hieradata

https://gerrit.wikimedia.org/r/865120

Change 865124 had a related patch set uploaded (by RobH; author: RobH):

[operations/puppet@production] adding eqsin ganeti

https://gerrit.wikimedia.org/r/865124

Change 865124 merged by RobH:

[operations/puppet@production] adding eqsin ganeti

https://gerrit.wikimedia.org/r/865124

RobH changed the task status from Open to In Progress. Dec 6 2022, 6:47 PM
RobH updated the task description.

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host ganeti5005.eqsin.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host ganeti5006.eqsin.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host ganeti5007.eqsin.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host ganeti5006.eqsin.wmnet with OS bullseye completed:

  • ganeti5006 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202212061903_robh_3611796_ganeti5006.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host ganeti5007.eqsin.wmnet with OS bullseye completed:

  • ganeti5007 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202212061903_robh_3611800_ganeti5007.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host ganeti5005.eqsin.wmnet with OS bullseye completed:

  • ganeti5005 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202212061903_robh_3611787_ganeti5005.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active

FYI, there are outstanding Homer diffs for asw1-eqsin:

[edit interfaces]
-   ge-0/0/16 {
-       description DISABLED;
-       disable;
-   }
[edit interfaces xe-0/0/19]
-   description cp5029;
+   description "cp5029 {#2022110007}";

Thanks! Not sure why and how these specific ones were not committed. I am guessing these should be safe to merge?

Change 865613 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] lvs5005: commission new LVS host (eqsin hardware refresh)

https://gerrit.wikimedia.org/r/865613

Change 865615 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/homer/public@master] sites.yaml: add lvs5005 (eqsin hardware refresh)

https://gerrit.wikimedia.org/r/865615

I'd guess they're an oversight from doing changes directly in Netbox and forgetting to run Homer.

Yep, it looks safe to merge.

Change 865657 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] dns5003: add Puppet role and DNS/NTP configs

https://gerrit.wikimedia.org/r/865657

Change 865657 merged by Ssingh:

[operations/puppet@production] dns5003: add Puppet role and DNS/NTP configs

https://gerrit.wikimedia.org/r/865657

Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns5003.wikimedia.org with OS buster

Change 865660 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/homer/public@master] sites.yaml: add dns5003 (eqsin hardware refresh)

https://gerrit.wikimedia.org/r/865660

Change 865613 merged by Ssingh:

[operations/puppet@production] lvs5005: commission new LVS host (eqsin hardware refresh)

https://gerrit.wikimedia.org/r/865613

Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs5005.eqsin.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns5003.wikimedia.org with OS buster completed:

  • dns5003 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202212071432_sukhe_3824833_dns5003.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active

Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs5005.eqsin.wmnet with OS buster completed:

  • lvs5005 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202212071444_sukhe_3825989_lvs5005.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active

Change 865660 merged by Ssingh:

[operations/homer/public@master] sites.yaml: add dns5003 (eqsin hardware refresh)

https://gerrit.wikimedia.org/r/865660

Change 865120 merged by BBlack:

[operations/puppet@production] eqsin cp: unify per-node hieradata

https://gerrit.wikimedia.org/r/865120

Change 865615 merged by Ssingh:

[operations/homer/public@master] sites.yaml: add lvs5005 (eqsin hardware refresh)

https://gerrit.wikimedia.org/r/865615

Mentioned in SAL (#wikimedia-operations) [2022-12-07T16:38:07Z] <sukhe> cr[23]-eqsin*: set routing-options static route 103.102.166.240/28 next-hop 10.132.0.6: T322048
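
To make the scope of that static route concrete, here is a small sketch using Python's standard `ipaddress` module. The prefix and next-hop are copied from the SAL entry above; the variable names are illustrative only, and nothing here reflects how the route was actually applied on the routers.

```python
import ipaddress

# Values copied verbatim from the SAL entry above.
route = ipaddress.ip_network("103.102.166.240/28")
next_hop = ipaddress.ip_address("10.132.0.6")

# A /28 leaves 4 host bits, so the prefix spans 2**4 = 16 addresses.
first, last = route[0], route[-1]

print(f"{route} spans {route.num_addresses} addresses: {first} to {last}")
print(f"next-hop {next_hop} is in private address space: {next_hop.is_private}")
```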

Change 865706 had a related patch set uploaded (by RobH; author: RobH):

[operations/puppet@production] updating role

https://gerrit.wikimedia.org/r/865706

Change 865706 merged by RobH:

[operations/puppet@production] updating role

https://gerrit.wikimedia.org/r/865706

Change 865722 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] lvs5006: commission new LVS host (eqsin hardware refresh)

https://gerrit.wikimedia.org/r/865722

Change 865732 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/homer/public@master] sites.yaml: add lvs5006 (eqsin hardware refresh)

https://gerrit.wikimedia.org/r/865732

RobH changed the task status from In Progress to Open. Dec 7 2022, 7:55 PM
RobH removed RobH as the assignee of this task.
RobH assigned this task to ssingh.

@ssingh,

Once the final OS installations are completed please resolve this task. Thanks!

Change 865722 merged by Ssingh:

[operations/puppet@production] lvs5006: commission new LVS host (eqsin hardware refresh)

https://gerrit.wikimedia.org/r/865722

Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs5006.eqsin.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs5006.eqsin.wmnet with OS buster completed:

  • lvs5006 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202212072042_sukhe_3895070_lvs5006.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active

Change 865732 merged by Ssingh:

[operations/homer/public@master] sites.yaml: add lvs5006 (eqsin hardware refresh)

https://gerrit.wikimedia.org/r/865732

ssingh updated the task description.
ssingh added subscribers: cmooney, Volans, BBlack.

Thanks to @RobH, @Papaul, @BBlack, @cmooney, @MoritzMuehlenhoff, @Volans for all their help in the eqsin refresh.

Change 865799 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] hiera: unify eqsin LVS configuration

https://gerrit.wikimedia.org/r/865799

cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: ganeti5002.eqsin.wmnet

  • ganeti5002.eqsin.wmnet (WARN)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Management interface not found on Icinga, unable to downtime it
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-12-12T11:43:42Z] <moritzm> failover Ganeti master in eqsin to ganeti5004 (5003 will be decommissioned) T322048

Mentioned in SAL (#wikimedia-operations) [2022-12-12T11:49:20Z] <moritzm> drain ganeti5003 for eventual decom T322048

Change 867154 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] blackbox smoke tests: Switch to ganeti5007 for rack 603

https://gerrit.wikimedia.org/r/867154

Change 867154 merged by Muehlenhoff:

[operations/puppet@production] blackbox smoke tests: Switch to ganeti5007 for rack 603

https://gerrit.wikimedia.org/r/867154

Change 865799 merged by Ssingh:

[operations/puppet@production] hiera: unify eqsin LVS configuration

https://gerrit.wikimedia.org/r/865799

Mentioned in SAL (#wikimedia-operations) [2022-12-16T08:35:04Z] <moritzm> power down ganeti5003 manually (mgmt/IPMI broken) for pending decom T322048

Change 868617 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Remove ganeti5003 from Puppet

https://gerrit.wikimedia.org/r/868617

Change 868617 merged by Muehlenhoff:

[operations/puppet@production] Remove ganeti5003 from Puppet

https://gerrit.wikimedia.org/r/868617

cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: ganeti5003.eqsin.wmnet

  • ganeti5003.eqsin.wmnet (FAIL)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Management interface not found on Icinga, unable to downtime it
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Failed to power off, manual intervention required: Remote IPMI for ganeti5003.mgmt.eqsin.wmnet failed (exit=1): b''
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above