Replacement of esams VMs in knams Ganeti clusters
Closed, Resolved (Public)

Description

Tracking task for the decommissioning and recreation of Ganeti VMs in esams.

| Old VM name | New VM name | New cluster (esams01 or esams02) | Old VM decommed |
| --- | --- | --- | --- |
| bast3006.wikimedia.org | bast3007.wikimedia.org | esams01 | Done |
| doh3001.wikimedia.org | doh3003 | esams01 | Done |
| doh3002.wikimedia.org | doh3004 | esams02 | Done |
| durum3001.esams.wmnet | durum3003.esams.wmnet | esams01 | Done |
| durum3002.esams.wmnet | durum3004.esams.wmnet | esams02 | Done |
| install3002.wikimedia.org | install3003.wikimedia.org | esams02 | Done |
| ncredir3001.esams.wmnet | ncredir3003.esams.wmnet | esams01 | Done |
| ncredir3002.esams.wmnet | ncredir3004.esams.wmnet | esams02 | Done |
| netflow3002.esams.wmnet | netflow3002.esams.wmnet | esams02 | Done |
| ping3003.esams.wmnet | Not needed for now | n/a | Done |

Event Timeline

Change 949537 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Add netflow3003 to Ferm rules for Kafka jumbo

https://gerrit.wikimedia.org/r/949537

Change 949537 abandoned by Muehlenhoff:

[operations/puppet@production] Add netflow3003 to Ferm rules for Kafka jumbo

Reason:

Abandoned in favour of 949534

https://gerrit.wikimedia.org/r/949537

cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: bast3006.wikimedia.org

  • bast3006.wikimedia.org (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster esams to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster esams to Netbox

Change 949541 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Remove bast3006/ping3003 from site.pp

https://gerrit.wikimedia.org/r/949541

Change 949541 merged by Muehlenhoff:

[operations/puppet@production] Remove bast3006/ping3003 from site.pp

https://gerrit.wikimedia.org/r/949541

cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: ping3003.esams.wmnet

  • ping3003.esams.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster esams to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster esams to Netbox

Change 949543 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] New install server for new esams

https://gerrit.wikimedia.org/r/949543

Change 949543 merged by Muehlenhoff:

[operations/puppet@production] New install server for new esams

https://gerrit.wikimedia.org/r/949543

cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: netflow3002.esams.wmnet

  • netflow3002.esams.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster esams to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster esams to Netbox

Change 949552 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Make install3003 the new install server for esams

https://gerrit.wikimedia.org/r/949552

cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: doh[3001-3002].wikimedia.org

  • doh3001.wikimedia.org (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster esams to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster esams to Netbox
  • doh3002.wikimedia.org (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster esams to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster esams to Netbox
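
The cookbook entry above selects its hosts with a bracketed numeric range, doh[3001-3002].wikimedia.org, and reports one result block per expanded host. A minimal Python sketch of that expansion follows. It is not Cumin's actual implementation: the function name expand_hosts is hypothetical, and it handles only a single start-end range per pattern.

```python
import re


def expand_hosts(pattern: str) -> list[str]:
    """Expand one 'prefix[start-end]suffix' range into individual host names.

    Patterns without a bracketed range are returned unchanged, as a one-item list.
    """
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", pattern)
    if match is None:
        return [pattern]
    prefix, start, end, suffix = match.groups()
    width = len(start)  # preserve zero padding, if the range uses any
    return [
        f"{prefix}{number:0{width}d}{suffix}"
        for number in range(int(start), int(end) + 1)
    ]


print(expand_hosts("doh[3001-3002].wikimedia.org"))
# → ['doh3001.wikimedia.org', 'doh3002.wikimedia.org']
```

The same helper covers the other ranges used in this task, such as lvs[3005-3007].esams.wmnet, which expands to three hosts.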

cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: durum[3001-3002].esams.wmnet

  • durum3001.esams.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster esams to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster esams to Netbox
  • durum3002.esams.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster esams to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster esams to Netbox

Change 949558 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] ncredir300x: decommission hosts in esams

https://gerrit.wikimedia.org/r/949558

cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: lvs[3005-3007].esams.wmnet

  • lvs3005.esams.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • lvs3006.esams.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • lvs3007.esams.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

Change 949552 merged by Muehlenhoff:

[operations/puppet@production] Make install3003 the new install server for esams

https://gerrit.wikimedia.org/r/949552

Change 949628 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/dns@master] Point the esams webproxy to install3003

https://gerrit.wikimedia.org/r/949628

Change 949628 merged by Muehlenhoff:

[operations/dns@master] Point the esams webproxy to install3003

https://gerrit.wikimedia.org/r/949628

Icinga downtime and Alertmanager silence (ID=756bda9d-0fe5-407f-8e34-35d788d9ab8c) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(s) and their services with reason: decom in progress

install3002.wikimedia.org

cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: install3002.wikimedia.org

  • install3002.wikimedia.org (FAIL)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • Failed to shutdown VM, manually run gnt-instance remove on the Ganeti master for the esams cluster: Cumin execution failed (exit_code=2)
    • Started forced sync of VMs in Ganeti cluster esams to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • Failed to remove VM, manually run gnt-instance remove on the Ganeti master for the esams cluster: Cumin execution failed (exit_code=2)
    • Started forced sync of VMs in Ganeti cluster esams to Netbox

ERROR: some step on some host failed, check the bolded items above
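
The error line above points at "bolded items", but bold formatting does not survive in a plain-text copy of the timeline. A small Python sketch for recovering the failed steps from such a summary is shown below. It assumes, based only on the entries in this task, that host lines end in (PASS), (FAIL) or (WARN) and that failing steps begin with "Failed to"; the function name failed_steps is hypothetical and not part of the cookbook tooling.

```python
def failed_steps(summary: str) -> dict[str, list[str]]:
    """Map each host in a decommission summary to its steps starting with 'Failed to'."""
    result: dict[str, list[str]] = {}
    host = None
    for raw in summary.splitlines():
        # Drop indentation and the leading bullet character.
        line = raw.strip().lstrip("•").strip()
        if not line:
            continue
        if line.endswith(("(PASS)", "(FAIL)", "(WARN)")):
            # Host line: keep the name, drop the trailing status.
            host = line.rsplit(" ", 1)[0]
            result[host] = []
        elif host is not None and line.startswith("Failed to"):
            result[host].append(line)
    return result


sample = """\
  • install3002.wikimedia.org (FAIL)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • Failed to shutdown VM, manually run gnt-instance remove on the Ganeti master for the esams cluster: Cumin execution failed (exit_code=2)
    • Removed from DebMonitor
"""

for name, steps in failed_steps(sample).items():
    print(name, len(steps))
# → install3002.wikimedia.org 1
```

Hosts that passed appear in the result with an empty list, so the output also doubles as a per-host overview.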

Change 949558 merged by Muehlenhoff:

[operations/puppet@production] ncredir300x: decommission hosts in esams

https://gerrit.wikimedia.org/r/949558

cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: ncredir3002.esams.wmnet

  • ncredir3002.esams.wmnet (FAIL)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster esams to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • Failed to remove VM, manually run gnt-instance remove on the Ganeti master for the esams cluster: Cumin execution failed (exit_code=2)
    • Started forced sync of VMs in Ganeti cluster esams to Netbox

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: ncredir3001.esams.wmnet

  • ncredir3001.esams.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster esams to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster esams to Netbox

cookbooks.sre.hosts.decommission executed by filippo@cumin1001 for hosts: prometheus3002.esams.wmnet

  • prometheus3002.esams.wmnet (FAIL)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • Failed to shutdown VM, manually run gnt-instance remove on the Ganeti master for the esams cluster: Cumin execution failed (exit_code=2)
    • Started forced sync of VMs in Ganeti cluster esams to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • Failed to remove VM, manually run gnt-instance remove on the Ganeti master for the esams cluster: Cumin execution failed (exit_code=2)
    • Started forced sync of VMs in Ganeti cluster esams to Netbox

ERROR: some step on some host failed, check the bolded items above

Change 949837 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] Out with prometheus3002, in with prometheus3003

https://gerrit.wikimedia.org/r/949837

Cookbook cookbooks.sre.hosts.reimage was started by filippo@cumin1001 for host prometheus3003.esams.wmnet with OS bullseye

Change 949838 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/dns@master] wmnet: use prometheus3003 in esams

https://gerrit.wikimedia.org/r/949838

Change 949837 merged by Filippo Giunchedi:

[operations/puppet@production] Out with prometheus3002, in with prometheus3003

https://gerrit.wikimedia.org/r/949837

Change 949838 merged by Filippo Giunchedi:

[operations/dns@master] wmnet: use prometheus3003 in esams

https://gerrit.wikimedia.org/r/949838

Cookbook cookbooks.sre.hosts.reimage started by filippo@cumin1001 for host prometheus3003.esams.wmnet with OS bullseye completed:

  • prometheus3003 (WARN)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot media to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run failed and logged in /var/log/spicerack/sre/hosts/reimage/202308170922_filippo_3583585_prometheus3003.out, asking the operator what to do
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308170927_filippo_3583585_prometheus3003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change 949941 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Add ncredir300[34]

https://gerrit.wikimedia.org/r/949941

Change 949941 merged by Muehlenhoff:

[operations/puppet@production] Add ncredir300[34]

https://gerrit.wikimedia.org/r/949941

Change 949977 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Add durum300[34] to site.pp

https://gerrit.wikimedia.org/r/949977

Change 949977 merged by Muehlenhoff:

[operations/puppet@production] Add durum300[34] to site.pp

https://gerrit.wikimedia.org/r/949977

Change 949987 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Add doh300[34]

https://gerrit.wikimedia.org/r/949987

Change 949987 merged by Muehlenhoff:

[operations/puppet@production] Add doh300[34]

https://gerrit.wikimedia.org/r/949987

Change 950000 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] site: reimage ncredir300[34] to proper role

https://gerrit.wikimedia.org/r/950000

Change 950000 merged by Ssingh:

[operations/puppet@production] site: reimage ncredir300[34] to proper role

https://gerrit.wikimedia.org/r/950000

Change 950026 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] conf-tool/esams: add ncredir300[34]

https://gerrit.wikimedia.org/r/950026

Change 950026 merged by Ssingh:

[operations/puppet@production] conf-tool/esams: add ncredir300[34]

https://gerrit.wikimedia.org/r/950026

Change 950159 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Make bast3007 a bastion

https://gerrit.wikimedia.org/r/950159

Change 950159 merged by Muehlenhoff:

[operations/puppet@production] Make bast3007 a bastion

https://gerrit.wikimedia.org/r/950159

Change 951528 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] site: add wikidough VMs for esams

https://gerrit.wikimedia.org/r/951528

Change 951528 merged by Ssingh:

[operations/puppet@production] site: add wikidough VMs for esams

https://gerrit.wikimedia.org/r/951528

Change 951531 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] site: remove older references to doh300[34]

https://gerrit.wikimedia.org/r/951531

Change 951531 merged by Ssingh:

[operations/puppet@production] site: remove older references to doh300[34]

https://gerrit.wikimedia.org/r/951531

Change 951532 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] hiera: update authorized_hosts for acme_chief for WDNS

https://gerrit.wikimedia.org/r/951532

Mentioned in SAL (#wikimedia-operations) [2023-08-22T15:58:29Z] <sukhe> sudo cookbook sre.ganeti.makevm --vcpus 2 --memory 8 --disk 15 --network public --os bullseye --cluster esams01 --group BY27 -t T344355 doh3003
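
The SAL entry above records the exact sre.ganeti.makevm invocation used for doh3003. As an illustration, the following Python sketch assembles the same command line from a small record. It uses only the flags visible in that entry; the VmSpec class and makevm_command function are hypothetical helpers, and the units of --memory and --disk are not stated in the log, so the values are passed through as-is.

```python
import shlex
from dataclasses import dataclass


@dataclass
class VmSpec:
    """Parameters for one VM; defaults mirror the doh3003 SAL entry."""
    hostname: str
    cluster: str
    group: str
    task: str
    vcpus: int = 2
    memory: int = 8
    disk: int = 15
    network: str = "public"
    os: str = "bullseye"


def makevm_command(spec: VmSpec) -> str:
    """Build the command line in the same flag order as the SAL entry."""
    args = [
        "sudo", "cookbook", "sre.ganeti.makevm",
        "--vcpus", str(spec.vcpus),
        "--memory", str(spec.memory),
        "--disk", str(spec.disk),
        "--network", spec.network,
        "--os", spec.os,
        "--cluster", spec.cluster,
        "--group", spec.group,
        "-t", spec.task,
        spec.hostname,
    ]
    return shlex.join(args)  # quotes arguments only where the shell needs it


doh3003 = VmSpec(hostname="doh3003", cluster="esams01", group="BY27", task="T344355")
print(makevm_command(doh3003))
```

Building the argument list first and joining it with shlex.join keeps each value a separate shell word, which avoids quoting mistakes if a value ever contains spaces.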

Change 951532 merged by Ssingh:

[operations/puppet@production] hiera: update authorized_hosts for acme_chief for WDNS

https://gerrit.wikimedia.org/r/951532

cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: doh3003.wikimedia.org

  • doh3003.wikimedia.org (WARN)
    • Host not found on Icinga, unable to downtime it
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster esams01 to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster esams01 to Netbox

Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host doh3003.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host doh3003.wikimedia.org with OS bullseye completed:

  • doh3003 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot media to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308221742_sukhe_584678_doh3003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host doh3004.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host doh3004.wikimedia.org with OS bullseye completed:

  • doh3004 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot media to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308221821_sukhe_632198_doh3004.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: doh3004.wikimedia.org

  • doh3004.wikimedia.org (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster esams01 to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster esams01 to Netbox

Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host doh3004.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host doh3004.wikimedia.org with OS bullseye completed:

  • doh3004 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot media to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308221916_sukhe_688765_doh3004.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change 951581 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/homer/public@master] devices: add doh300[34] to asw1-b*27-esams

https://gerrit.wikimedia.org/r/951581

Change 951581 merged by Ssingh:

[operations/homer/public@master] devices: add doh300[34] to asw1-b*27-esams

https://gerrit.wikimedia.org/r/951581