
Revisit CDN<-->Swift communication
Closed, Resolved (Public)

Description

We've been spotting some issues on the upload cluster that aren't happening on text. One of the big differences between text and upload is that almost all origin servers behind text use envoy for TLS termination, whereas swift uses nginx. That nginx puppetization is the one the traffic team used to rely on to perform TLS termination for untrusted clients.

One of these issues is a FetchError logged by varnish-frontend stating "Timed out reusing backend connection". According to logstash, all occurrences of this error during the last month are limited to the upload cluster.

Progress on migration to envoy:

  • ms-fe1009
  • ms-fe1010
  • ms-fe1011
  • ms-fe1012
  • ms-fe1013
  • ms-fe1014
  • moss-fe1001
  • ms-fe2009
  • ms-fe2010
  • ms-fe2011
  • ms-fe2012
  • ms-fe2013
  • ms-fe2014
  • moss-fe2001

Outstanding: changing the value of profile::swift::proxy::use_envoy for the ms-* clusters (or maybe globally, but that's likely to upset beta).
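To picture the difference between flipping the setting per host, per cluster, or globally, here is a toy model of a layered lookup in Python. Only the key name profile::swift::proxy::use_envoy comes from this task; the layer layout, the values, and the lookup helper are illustrative assumptions, not the real hiera hierarchy.

```python
from typing import Any, Dict, List

# Key name taken from the task description; everything else is hypothetical.
KEY = "profile::swift::proxy::use_envoy"


def lookup(key: str, layers: List[Dict[str, Any]], default: Any = None) -> Any:
    """Return the value from the first (most specific) layer defining key."""
    for layer in layers:
        if key in layer:
            return layer[key]
    return default


# Assumed layers, ordered most specific first: host, cluster, global.
common = {KEY: False}          # assumed global default (nginx), leaves beta alone
per_host = {KEY: True}         # per-host override used while rolling out
cluster_default = {KEY: True}  # cluster-wide default once every host is done

during_rollout = lookup(KEY, [per_host, {}, common])
after_rollout = lookup(KEY, [{}, cluster_default, common])
beta = lookup(KEY, [{}, {}, common])

print(during_rollout, after_rollout, beta)
```

Under these assumptions, a host that already has a per-host override resolves to the same value once the cluster default is set, so moving the setting up a layer would not change behaviour on already-migrated hosts, while anything relying only on the global layer is unaffected.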

Event Timeline


Mentioned in SAL (#wikimedia-operations) [2023-11-21T13:32:22Z] <Emperor> repool ms-fe2014 with new envoy TLS setup T317616

Change 976229 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] hiera: move two more swift frontends to envoy

https://gerrit.wikimedia.org/r/976229

Change 976229 merged by MVernon:

[operations/puppet@production] hiera: move two more swift frontends to envoy

https://gerrit.wikimedia.org/r/976229

Mentioned in SAL (#wikimedia-operations) [2023-11-22T08:59:03Z] <Emperor> depool ms-fe2013 to reimage with new envoy TLS setup T317616

Mentioned in SAL (#wikimedia-operations) [2023-11-22T08:59:12Z] <Emperor> depool ms-fe1013 to reimage with new envoy TLS setup T317616

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host ms-fe1013.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-fe2013.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-fe2013.codfw.wmnet with OS bullseye completed:

  • ms-fe2013 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311220917_mvernon_2407767_ms-fe2013.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host ms-fe1013.eqiad.wmnet with OS bullseye completed:

  • ms-fe1013 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311220936_mvernon_1292627_ms-fe1013.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2023-11-22T10:26:52Z] <Emperor> repool ms-fe1013 with new envoy TLS setup T317616

Mentioned in SAL (#wikimedia-operations) [2023-11-22T10:27:35Z] <Emperor> repool ms-fe2013 with new envoy TLS setup T317616

Change 976672 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] hiera: move two more swift frontends to envoy

https://gerrit.wikimedia.org/r/976672

Change 976673 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] hiera: move two more swift frontends to envoy

https://gerrit.wikimedia.org/r/976673

Change 976674 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] hiera: move two more swift frontends to envoy

https://gerrit.wikimedia.org/r/976674

Change 976675 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] hiera: move two more swift frontends to envoy

https://gerrit.wikimedia.org/r/976675

Change 976676 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] hiera: move two more swift frontends to envoy

https://gerrit.wikimedia.org/r/976676

Change 976677 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] hiera: move final swift frontend to envoy

https://gerrit.wikimedia.org/r/976677

Mentioned in SAL (#wikimedia-operations) [2023-11-22T11:33:42Z] <Emperor> depool ms-fe1012 to reimage with new envoy TLS setup T317616

Mentioned in SAL (#wikimedia-operations) [2023-11-22T11:34:05Z] <Emperor> depool ms-fe2012 to reimage with new envoy TLS setup T317616

Change 976672 merged by MVernon:

[operations/puppet@production] hiera: move two more swift frontends to envoy

https://gerrit.wikimedia.org/r/976672

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host ms-fe1012.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-fe2012.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host ms-fe1012.eqiad.wmnet with OS bullseye completed:

  • ms-fe1012 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311221150_mvernon_1369438_ms-fe1012.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-fe2012.codfw.wmnet with OS bullseye completed:

  • ms-fe2012 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311221153_mvernon_2482647_ms-fe2012.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2023-11-22T12:15:08Z] <Emperor> repool ms-fe2012 with new envoy TLS setup T317616

Mentioned in SAL (#wikimedia-operations) [2023-11-22T12:15:52Z] <Emperor> repool ms-fe1012 with new envoy TLS setup T317616

Mentioned in SAL (#wikimedia-operations) [2023-11-22T12:45:32Z] <Emperor> depool ms-fe2011 to reimage with new envoy TLS setup T317616

Mentioned in SAL (#wikimedia-operations) [2023-11-22T12:45:41Z] <Emperor> depool ms-fe1011 to reimage with new envoy TLS setup T317616

Change 976673 merged by MVernon:

[operations/puppet@production] hiera: move two more swift frontends to envoy

https://gerrit.wikimedia.org/r/976673

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host ms-fe1011.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-fe2011.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host ms-fe1011.eqiad.wmnet with OS bullseye completed:

  • ms-fe1011 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311221302_mvernon_1402383_ms-fe1011.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-fe2011.codfw.wmnet with OS bullseye completed:

  • ms-fe2011 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311221305_mvernon_2516527_ms-fe2011.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2023-11-22T13:22:51Z] <Emperor> repool ms-fe1011 with new envoy TLS setup T317616

Mentioned in SAL (#wikimedia-operations) [2023-11-22T13:23:44Z] <Emperor> repool ms-fe2011 with new envoy TLS setup T317616

Change 976674 merged by MVernon:

[operations/puppet@production] hiera: move two more swift frontends to envoy

https://gerrit.wikimedia.org/r/976674

Mentioned in SAL (#wikimedia-operations) [2023-11-22T13:27:43Z] <Emperor> depool ms-fe1010 to reimage with new envoy TLS setup T317616

Mentioned in SAL (#wikimedia-operations) [2023-11-22T13:27:53Z] <Emperor> depool ms-fe2010 to reimage with new envoy TLS setup T317616

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host ms-fe1010.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-fe2010.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host ms-fe1010.eqiad.wmnet with OS bullseye completed:

  • ms-fe1010 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311221344_mvernon_1428230_ms-fe1010.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-fe2010.codfw.wmnet with OS bullseye completed:

  • ms-fe2010 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311221347_mvernon_2539112_ms-fe2010.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2023-11-22T14:14:16Z] <Emperor> repool ms-fe1010 with new envoy TLS setup T317616

Mentioned in SAL (#wikimedia-operations) [2023-11-22T14:14:52Z] <Emperor> repool ms-fe2010 with new envoy TLS setup T317616

Change 976675 merged by MVernon:

[operations/puppet@production] hiera: move two more swift frontends to envoy

https://gerrit.wikimedia.org/r/976675

Mentioned in SAL (#wikimedia-operations) [2023-11-22T14:19:03Z] <Emperor> depool ms-fe1009 to reimage with new envoy TLS setup T317616

Mentioned in SAL (#wikimedia-operations) [2023-11-22T14:19:12Z] <Emperor> depool ms-fe2009 to reimage with new envoy TLS setup T317616

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host ms-fe1009.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-fe2009.codfw.wmnet with OS bullseye

MatthewVernon changed the task status from Stalled to In Progress. Nov 22 2023, 2:35 PM
MatthewVernon updated the task description.

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host ms-fe1009.eqiad.wmnet with OS bullseye completed:

  • ms-fe1009 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311221435_mvernon_1450084_ms-fe1009.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-fe2009.codfw.wmnet with OS bullseye completed:

  • ms-fe2009 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Unable to downtime the new host on Icinga/Alertmanager, the sre.hosts.downtime cookbook returned 99
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311221435_mvernon_2560896_ms-fe2009.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2023-11-22T14:59:49Z] <Emperor> repool ms-fe2009 with new envoy TLS setup T317616

Mentioned in SAL (#wikimedia-operations) [2023-11-22T15:00:02Z] <Emperor> repool ms-fe1009 with new envoy TLS setup T317616

Mentioned in SAL (#wikimedia-operations) [2023-11-22T15:02:46Z] <Emperor> depool moss-fe2001 to reimage with new envoy TLS setup T317616

Mentioned in SAL (#wikimedia-operations) [2023-11-22T15:03:11Z] <Emperor> depool moss-fe1001 to reimage with new envoy TLS setup T317616

Change 976676 merged by MVernon:

[operations/puppet@production] hiera: move two more swift frontends to envoy

https://gerrit.wikimedia.org/r/976676

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host moss-fe1001.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host moss-fe2001.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host moss-fe2001.codfw.wmnet with OS bullseye executed with errors:

  • moss-fe2001 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run failed and the operator aborted
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host moss-fe2001.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host moss-fe1001.eqiad.wmnet with OS bullseye executed with errors:

  • moss-fe1001 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311221520_mvernon_1473703_moss-fe1001.out
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host moss-fe1001.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host moss-fe2001.codfw.wmnet with OS bullseye executed with errors:

  • moss-fe2001 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311221546_mvernon_2597483_moss-fe2001.out
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host moss-fe2001.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host moss-fe1001.eqiad.wmnet with OS bullseye completed:

  • moss-fe1001 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311221543_mvernon_1488465_moss-fe1001.out
    • Unable to run puppet on config-master2001.codfw.wmnet,config-master1001.eqiad.wmnet to update configmaster.wikimedia.org with the new host SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2023-11-22T16:14:25Z] <Emperor> repool moss-fe1001 with new envoy TLS setup T317616

Mentioned in SAL (#wikimedia-operations) [2023-11-22T16:16:05Z] <Emperor> depool ms-fe1014 to reimage with new envoy TLS setup T317616

Change 976677 merged by MVernon:

[operations/puppet@production] hiera: move final swift frontend to envoy

https://gerrit.wikimedia.org/r/976677

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host ms-fe1014.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host moss-fe2001.codfw.wmnet with OS bullseye executed with errors:

  • moss-fe2001 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311221612_mvernon_2609578_moss-fe2001.out
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host moss-fe2001.codfw.wmnet with OS bullseye

(Perhaps the moss-fe2001 puppet failures are due to T350809.)

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host moss-fe2001.codfw.wmnet with OS bullseye completed:

  • moss-fe2001 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311221634_mvernon_2632566_moss-fe2001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2023-11-22T16:46:53Z] <Emperor> repool moss-fe2001 with new envoy TLS setup T317616

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host ms-fe1014.eqiad.wmnet with OS bullseye completed:

  • ms-fe1014 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311221638_mvernon_1512681_ms-fe1014.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Mentioned in SAL (#wikimedia-operations) [2023-11-22T16:56:36Z] <Emperor> repool ms-fe1014 with new envoy TLS setup T317616

Change 977077 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] hiera: use envoy by default in ms clusters (nfc)

https://gerrit.wikimedia.org/r/977077

Change 977077 merged by MVernon:

[operations/puppet@production] hiera: use envoy by default in ms clusters (nfc)

https://gerrit.wikimedia.org/r/977077

MatthewVernon claimed this task.

I think this is now done - ms clusters default to using envoy (I've not done anything to beta, but it should carry on using nginx just fine).
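For anyone wanting to review how long each frontend was out of rotation, the SAL entries in the timeline above follow a regular enough shape to be parsed. The following is a minimal sketch, assuming the entries keep the "[timestamp] <nick> depool|repool host ..." form seen in this task; the function name depool_windows and the regular expression are my own and not part of any existing tooling.

```python
import re
from datetime import datetime, timedelta
from typing import Dict, Iterable

# Matches e.g. "[2023-11-22T08:59:12Z] <Emperor> depool ms-fe1013 to reimage ..."
SAL_RE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})Z\]\s+<\w+>\s+"
    r"(?P<action>depool|repool)\s+(?P<host>[\w-]+)"
)


def depool_windows(lines: Iterable[str]) -> Dict[str, timedelta]:
    """Map each host to the time between its depool and its next repool."""
    started: Dict[str, datetime] = {}
    windows: Dict[str, timedelta] = {}
    for line in lines:
        match = SAL_RE.search(line)
        if match is None:
            continue  # not a depool/repool SAL entry
        ts = datetime.strptime(match["ts"], "%Y-%m-%dT%H:%M:%S")
        host = match["host"]
        if match["action"] == "depool":
            started[host] = ts
        elif host in started:
            windows[host] = ts - started.pop(host)
    return windows


# Two entries copied from the timeline of this task.
SAMPLE = [
    "Mentioned in SAL (#wikimedia-operations) [2023-11-22T08:59:12Z] <Emperor> "
    "depool ms-fe1013 to reimage with new envoy TLS setup T317616",
    "Mentioned in SAL (#wikimedia-operations) [2023-11-22T10:26:52Z] <Emperor> "
    "repool ms-fe1013 with new envoy TLS setup T317616",
]

for host, window in depool_windows(SAMPLE).items():
    print(host, window)
```

A repool with no preceding depool in the input is ignored, which matters here because older timeline entries are not all shown on this page.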