
Run Thanos backend on Bullseye
Closed, Resolved · Public

Description

This task tracks moving all Thanos backends (thanos-be*) hosts to Bullseye.

Most of the work has been done in Pontoon (i.e. provisioning a Bullseye instance and running the thanos::backend role in the swift project). The first host reimaged in production was thanos-be2001; with T301657 fixed we're good to proceed with reimaging the rest.

  • thanos-be1001.eqiad.wmnet
  • thanos-be1002.eqiad.wmnet
  • thanos-be1003.eqiad.wmnet
  • thanos-be1004.eqiad.wmnet
  • thanos-be2001.codfw.wmnet
  • thanos-be2002.codfw.wmnet
  • thanos-be2003.codfw.wmnet
  • thanos-be2004.codfw.wmnet

Event Timeline

Change 713230 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] swift: ship uwsgi config for account/container server

https://gerrit.wikimedia.org/r/713230

Change 713608 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] swift: add support for loopback storage device

https://gerrit.wikimedia.org/r/713608

Change 713609 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] swift: stop carrying drive-audit patch starting with Bullseye

https://gerrit.wikimedia.org/r/713609

Change 713230 merged by Filippo Giunchedi:

[operations/puppet@production] swift: ship uwsgi config for account/container server

https://gerrit.wikimedia.org/r/713230

Change 713608 merged by Filippo Giunchedi:

[operations/puppet@production] swift: add support for loopback storage device

https://gerrit.wikimedia.org/r/713608

Change 713609 merged by Filippo Giunchedi:

[operations/puppet@production] swift: stop carrying drive-audit patch starting with Bullseye

https://gerrit.wikimedia.org/r/713609

The other problem I noticed, not specific to Thanos but rather to ferm + Pontoon, is that @resolve calls fail in WMCS:

-- Boot 5c400f48d7c24abaaa78d50a591d8e8c --
Sep 13 12:37:35 thanos-fe-01 ferm[202]: Starting Firewall: ferm
Sep 13 12:37:35 thanos-fe-01 ferm[219]: Error in /etc/ferm/conf.d/10_memcached line 10:
Sep 13 12:37:35 thanos-fe-01 ferm[219]:         saddr
Sep 13 12:37:35 thanos-fe-01 ferm[219]:         (
Sep 13 12:37:35 thanos-fe-01 ferm[219]:             deferred=ARRAY(0x5597c750d800)
Sep 13 12:37:35 thanos-fe-01 ferm[219]:         )
Sep 13 12:37:35 thanos-fe-01 ferm[219]:         <--
Sep 13 12:37:35 thanos-fe-01 ferm[219]: DNS query for 'thanos-be-01.swift.eqiad1.wikimedia.cloud' failed: Network is unreachable
Sep 13 12:37:35 thanos-fe-01 ferm[338]:  failed!

Not surprising, though, since ferm is wanted by/ordered before network-pre.target, whereas the network is brought up via dhclient in network.target.

Sep 13 12:37:35 thanos-fe-01 ferm[219]: DNS query for 'thanos-be-01.swift.eqiad1.wikimedia.cloud' failed: Network is unreachable
Sep 13 12:37:35 thanos-fe-01 ferm[338]:  failed!
Sep 13 12:37:35 thanos-fe-01 systemd[1]: ferm.service: Main process exited, code=exited, status=101/n/a
Sep 13 12:37:35 thanos-fe-01 systemd[1]: ferm.service: Failed with result 'exit-code'.
Sep 13 12:37:35 thanos-fe-01 systemd[1]: Failed to start ferm firewall configuration.
Sep 13 12:37:35 thanos-fe-01 systemd[1]: Reached target Network (Pre).
Sep 13 12:37:35 thanos-fe-01 systemd[1]: Starting Raise network interfaces...
Sep 13 12:37:35 thanos-fe-01 ifup[355]: resolvconf disabled.
Sep 13 12:37:35 thanos-fe-01 ifup[365]: net.ipv4.conf.all.arp_ignore = 1
Sep 13 12:37:35 thanos-fe-01 ifup[365]: net.ipv4.conf.all.arp_announce = 2
Sep 13 12:37:35 thanos-fe-01 dhclient[380]: Internet Systems Consortium DHCP Client 4.4.1
...

And indeed, overriding ferm to be wanted by/ordered after network.target does fix the problem; I'm not sure yet about a more permanent solution though.
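
The workaround described above can be sketched as a systemd drop-in that delays ferm until the network is up, so that @resolve DNS lookups can succeed. This sketch writes into a scratch root for safe illustration; on a real host the file would go under /etc/systemd/system/ferm.service.d/ followed by `systemctl daemon-reload`. The override content is an assumption, not necessarily the fix that was deployed.

```shell
# Write a systemd drop-in overriding ferm's unit ordering (illustrative
# sketch; uses a scratch root instead of the real /etc).
root="$(mktemp -d)"
dropin="$root/etc/systemd/system/ferm.service.d"
mkdir -p "$dropin"
cat > "$dropin/override.conf" <<'EOF'
[Unit]
# Start only once the network is up, rather than before
# network-pre.target, so @resolve can reach the DNS resolvers.
Wants=network.target
After=network.target
EOF
cat "$dropin/override.conf"
```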


To clarify: this issue "fixes" itself at the first puppet run when ferm is (re)started

I checked thanos-be-01.swift.eqiad1.wikimedia.cloud and couldn't find any obvious errors or problems, so I'll proceed with reimaging a thanos backend in production.

Cookbook cookbooks.sre.hosts.reimage was started by filippo@cumin1001 for host thanos-be2001.codfw.wmnet with OS bullseye

Mentioned in SAL (#wikimedia-operations) [2022-02-08T14:07:48Z] <godog> update NIC firmware on thanos-be2001 - T288937

Mentioned in SAL (#wikimedia-operations) [2022-02-08T14:17:20Z] <godog> update PERC firmware on thanos-be2001 - T288937

Cookbook cookbooks.sre.hosts.reimage started by filippo@cumin1001 for host thanos-be2001.codfw.wmnet with OS bullseye completed:

  • thanos-be2001 (PASS)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202202081236_filippo_29297_thanos-be2001.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

thanos-be2001 is running Bullseye, the reimage itself went fine and I'll leave the host alone to see if any obvious problems pop up.

I've run into an old/known issue with disk renumbering (i.e. megasas doesn't always enumerate/assign disks in the same order). In practice this isn't normally a problem because we label filesystems (and Puppet works as long as sda and sdb are in the right place). I've updated the PERC firmware (just in case, to no avail). A reboot usually fixes things, though in this case it hasn't yet.

(notice the SCSI target IDs are not in sequence with the device letters: sde/sdf, sdh/sdi, and sdm/sdn are swapped)

thanos-be2001:~$ ls -la /dev/disk/by-path/ | grep -v part | sort -k11
drwxr-xr-x 2 root root 720 Feb  8 16:07 .
drwxr-xr-x 8 root root 160 Feb  8 16:07 ..
total 0
lrwxrwxrwx 1 root root   9 Feb  8 16:07 pci-0000:3b:00.0-scsi-0:2:0:0 -> ../../sda
lrwxrwxrwx 1 root root   9 Feb  8 16:07 pci-0000:3b:00.0-scsi-0:2:1:0 -> ../../sdb
lrwxrwxrwx 1 root root   9 Feb  8 16:07 pci-0000:3b:00.0-scsi-0:2:2:0 -> ../../sdc
lrwxrwxrwx 1 root root   9 Feb  8 16:07 pci-0000:3b:00.0-scsi-0:2:3:0 -> ../../sdd
lrwxrwxrwx 1 root root   9 Feb  8 16:07 pci-0000:3b:00.0-scsi-0:2:5:0 -> ../../sde
lrwxrwxrwx 1 root root   9 Feb  8 16:07 pci-0000:3b:00.0-scsi-0:2:4:0 -> ../../sdf
lrwxrwxrwx 1 root root   9 Feb  8 16:07 pci-0000:3b:00.0-scsi-0:2:6:0 -> ../../sdg
lrwxrwxrwx 1 root root   9 Feb  8 16:07 pci-0000:3b:00.0-scsi-0:2:8:0 -> ../../sdh
lrwxrwxrwx 1 root root   9 Feb  8 16:07 pci-0000:3b:00.0-scsi-0:2:7:0 -> ../../sdi
lrwxrwxrwx 1 root root   9 Feb  8 16:07 pci-0000:3b:00.0-scsi-0:2:9:0 -> ../../sdj
lrwxrwxrwx 1 root root   9 Feb  8 16:07 pci-0000:3b:00.0-scsi-0:2:10:0 -> ../../sdk
lrwxrwxrwx 1 root root   9 Feb  8 16:07 pci-0000:3b:00.0-scsi-0:2:11:0 -> ../../sdl
lrwxrwxrwx 1 root root   9 Feb  8 16:07 pci-0000:3b:00.0-scsi-0:2:13:0 -> ../../sdm
lrwxrwxrwx 1 root root   9 Feb  8 16:07 pci-0000:3b:00.0-scsi-0:2:12:0 -> ../../sdn

And we're good now:

thanos-be2001:~$ ls -la /dev/disk/by-path/ | grep -v part | sort -k11
drwxr-xr-x 2 root root 720 Feb  9 07:51 .
drwxr-xr-x 8 root root 160 Feb  9 07:51 ..
total 0
lrwxrwxrwx 1 root root   9 Feb  9 07:51 pci-0000:3b:00.0-scsi-0:2:0:0 -> ../../sda
lrwxrwxrwx 1 root root   9 Feb  9 07:51 pci-0000:3b:00.0-scsi-0:2:1:0 -> ../../sdb
lrwxrwxrwx 1 root root   9 Feb  9 07:51 pci-0000:3b:00.0-scsi-0:2:2:0 -> ../../sdc
lrwxrwxrwx 1 root root   9 Feb  9 07:51 pci-0000:3b:00.0-scsi-0:2:3:0 -> ../../sdd
lrwxrwxrwx 1 root root   9 Feb  9 07:51 pci-0000:3b:00.0-scsi-0:2:4:0 -> ../../sde
lrwxrwxrwx 1 root root   9 Feb  9 07:51 pci-0000:3b:00.0-scsi-0:2:5:0 -> ../../sdf
lrwxrwxrwx 1 root root   9 Feb  9 07:51 pci-0000:3b:00.0-scsi-0:2:6:0 -> ../../sdg
lrwxrwxrwx 1 root root   9 Feb  9 07:51 pci-0000:3b:00.0-scsi-0:2:7:0 -> ../../sdh
lrwxrwxrwx 1 root root   9 Feb  9 07:51 pci-0000:3b:00.0-scsi-0:2:8:0 -> ../../sdi
lrwxrwxrwx 1 root root   9 Feb  9 07:51 pci-0000:3b:00.0-scsi-0:2:9:0 -> ../../sdj
lrwxrwxrwx 1 root root   9 Feb  9 07:51 pci-0000:3b:00.0-scsi-0:2:10:0 -> ../../sdk
lrwxrwxrwx 1 root root   9 Feb  9 07:51 pci-0000:3b:00.0-scsi-0:2:11:0 -> ../../sdl
lrwxrwxrwx 1 root root   9 Feb  9 07:51 pci-0000:3b:00.0-scsi-0:2:12:0 -> ../../sdm
lrwxrwxrwx 1 root root   9 Feb  9 07:51 pci-0000:3b:00.0-scsi-0:2:13:0 -> ../../sdn
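
A quick way to check listings like the ones above is to verify that ascending SCSI target IDs map to ascending sdX letters. The helper name and the inlined sample lines below are illustrative; on a real host you would pipe in `ls -l /dev/disk/by-path` output instead.

```shell
# Check that SCSI target ID order matches device-letter order
# (illustrative sketch with sample by-path lines).
check_order() {
    sort -t: -k5,5n |      # order lines by the SCSI target ID field
        sed 's/.*sd//' |   # keep only the trailing drive letter
        sort -c            # exits non-zero if letters are out of order
}
printf '%s\n' \
    'pci-0000:3b:00.0-scsi-0:2:0:0 -> ../../sda' \
    'pci-0000:3b:00.0-scsi-0:2:1:0 -> ../../sdb' \
    'pci-0000:3b:00.0-scsi-0:2:2:0 -> ../../sdc' |
    check_order && echo "disk order OK"
```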

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host thanos-be2002.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host thanos-be2002.codfw.wmnet with OS bullseye completed:

  • thanos-be2002 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202212081525_mvernon_4090101_thanos-be2002.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host thanos-be2003.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host thanos-be2003.codfw.wmnet with OS bullseye completed:

  • thanos-be2003 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202212090933_mvernon_82015_thanos-be2003.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host thanos-be2004.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host thanos-be2004.codfw.wmnet with OS bullseye completed:

  • thanos-be2004 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202212091100_mvernon_98100_thanos-be2004.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host thanos-be1001.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host thanos-be1001.eqiad.wmnet with OS bullseye completed:

  • thanos-be1001 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202212091235_mvernon_115228_thanos-be1001.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host thanos-be1002.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host thanos-be1002.eqiad.wmnet with OS bullseye completed:

  • thanos-be1002 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202212120854_mvernon_794389_thanos-be1002.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host thanos-be1003.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host thanos-be1003.eqiad.wmnet with OS bullseye completed:

  • thanos-be1003 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202212121159_mvernon_830230_thanos-be1003.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host thanos-be1004.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host thanos-be1004.eqiad.wmnet with OS bullseye completed:

  • thanos-be1004 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202212121330_mvernon_847406_thanos-be1004.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
MatthewVernon claimed this task.
MatthewVernon updated the task description.
MatthewVernon subscribed.

All Thanos nodes are now running Bullseye.