
Run Thanos backend on Bullseye
Closed, Resolved · Public

Description

This task tracks moving all Thanos backends (thanos-be*) hosts to Bullseye.

Most of the work has been done in Pontoon (i.e. provisioning a Bullseye instance and running the thanos::backend role in the swift project). The first host reimaged in production was thanos-be2001; with T301657 fixed we're good to proceed with reimaging the rest.

  • thanos-be1001.eqiad.wmnet
  • thanos-be1002.eqiad.wmnet
  • thanos-be1003.eqiad.wmnet
  • thanos-be1004.eqiad.wmnet
  • thanos-be2001.codfw.wmnet
  • thanos-be2002.codfw.wmnet
  • thanos-be2003.codfw.wmnet
  • thanos-be2004.codfw.wmnet

Event Timeline

Change 713230 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] swift: ship uwsgi config for account/container server

https://gerrit.wikimedia.org/r/713230

Change 713608 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] swift: add support for loopback storage device

https://gerrit.wikimedia.org/r/713608

Change 713609 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] swift: stop carrying drive-audit patch starting with Bullseye

https://gerrit.wikimedia.org/r/713609

Change 713230 merged by Filippo Giunchedi:

[operations/puppet@production] swift: ship uwsgi config for account/container server

https://gerrit.wikimedia.org/r/713230

Change 713608 merged by Filippo Giunchedi:

[operations/puppet@production] swift: add support for loopback storage device

https://gerrit.wikimedia.org/r/713608

Change 713609 merged by Filippo Giunchedi:

[operations/puppet@production] swift: stop carrying drive-audit patch starting with Bullseye

https://gerrit.wikimedia.org/r/713609

The other problem I noticed, not specific to Thanos but rather to ferm + Pontoon, is that @resolve calls fail in WMCS:

-- Boot 5c400f48d7c24abaaa78d50a591d8e8c --
Sep 13 12:37:35 thanos-fe-01 ferm[202]: Starting Firewall: ferm
Sep 13 12:37:35 thanos-fe-01 ferm[219]: Error in /etc/ferm/conf.d/10_memcached line 10:
Sep 13 12:37:35 thanos-fe-01 ferm[219]:         saddr
Sep 13 12:37:35 thanos-fe-01 ferm[219]:         (
Sep 13 12:37:35 thanos-fe-01 ferm[219]:             deferred=ARRAY(0x5597c750d800)
Sep 13 12:37:35 thanos-fe-01 ferm[219]:         )
Sep 13 12:37:35 thanos-fe-01 ferm[219]:         <--
Sep 13 12:37:35 thanos-fe-01 ferm[219]: DNS query for 'thanos-be-01.swift.eqiad1.wikimedia.cloud' failed: Network is unreachable
Sep 13 12:37:35 thanos-fe-01 ferm[338]:  failed!

Not surprising, though, since ferm is wanted by/ordered before network-pre.target, whereas the network is brought up via dhclient in network.target.

Sep 13 12:37:35 thanos-fe-01 ferm[219]: DNS query for 'thanos-be-01.swift.eqiad1.wikimedia.cloud' failed: Network is unreachable
Sep 13 12:37:35 thanos-fe-01 ferm[338]:  failed!
Sep 13 12:37:35 thanos-fe-01 systemd[1]: ferm.service: Main process exited, code=exited, status=101/n/a
Sep 13 12:37:35 thanos-fe-01 systemd[1]: ferm.service: Failed with result 'exit-code'.
Sep 13 12:37:35 thanos-fe-01 systemd[1]: Failed to start ferm firewall configuration.
Sep 13 12:37:35 thanos-fe-01 systemd[1]: Reached target Network (Pre).
Sep 13 12:37:35 thanos-fe-01 systemd[1]: Starting Raise network interfaces...
Sep 13 12:37:35 thanos-fe-01 ifup[355]: resolvconf disabled.
Sep 13 12:37:35 thanos-fe-01 ifup[365]: net.ipv4.conf.all.arp_ignore = 1
Sep 13 12:37:35 thanos-fe-01 ifup[365]: net.ipv4.conf.all.arp_announce = 2
Sep 13 12:37:35 thanos-fe-01 dhclient[380]: Internet Systems Consortium DHCP Client 4.4.1
...

And indeed, overriding ferm to be wanted by/ordered after network.target does fix the problem; I'm not sure yet about a more permanent solution though.
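
The workaround described above can be sketched as a systemd drop-in that delays ferm until the network is up, so that @resolve DNS lookups can succeed. This sketch writes into a scratch root for safe illustration; on a real host the file would go under /etc/systemd/system/ferm.service.d/ followed by `systemctl daemon-reload`. The override content is an assumption, not necessarily the fix that was deployed.

```shell
# Write a systemd drop-in overriding ferm's unit ordering (illustrative
# sketch; uses a scratch root instead of the real /etc).
root="$(mktemp -d)"
dropin="$root/etc/systemd/system/ferm.service.d"
mkdir -p "$dropin"
cat > "$dropin/override.conf" <<'EOF'
[Unit]
# Start only once the network is up, rather than before
# network-pre.target, so @resolve can reach the DNS resolvers.
Wants=network.target
After=network.target
EOF
cat "$dropin/override.conf"
```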


To clarify: this issue "fixes" itself at the first puppet run when ferm is (re)started

I checked thanos-be-01.swift.eqiad1.wikimedia.cloud and couldn't find any obvious errors or problems, so I'll proceed with reimaging a thanos backend in production.

Cookbook cookbooks.sre.hosts.reimage was started by filippo@cumin1001 for host thanos-be2001.codfw.wmnet with OS bullseye

Mentioned in SAL (#wikimedia-operations) [2022-02-08T14:07:48Z] <godog> update NIC firmware on thanos-be2001 - T288937

Mentioned in SAL (#wikimedia-operations) [2022-02-08T14:17:20Z] <godog> update PERC firmware on thanos-be2001 - T288937

Cookbook cookbooks.sre.hosts.reimage started by filippo@cumin1001 for host thanos-be2001.codfw.wmnet with OS bullseye completed:

  • thanos-be2001 (PASS)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202202081236_filippo_29297_thanos-be2001.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

thanos-be2001 is running Bullseye, the reimage itself went fine and I'll leave the host alone to see if any obvious problems pop up.

I've run into an old/known issue with disk renumbering (i.e. megasas doesn't always enumerate/assign disks in the same order). In practice this isn't normally a problem because we label filesystems (and Puppet works as long as sda and sdb are in the right place). I've updated the PERC firmware (just in case, to no avail). A reboot usually fixes things, though in this case it hasn't yet.

(notice the SCSI target IDs are not in sequence with the device letters: sde/sdf, sdh/sdi, and sdm/sdn are swapped)

thanos-be2001:~$ ls -la /dev/disk/by-path/ | grep -v part | sort -k11
drwxr-xr-x 2 root root 720 Feb  8 16:07 .
drwxr-xr-x 8 root root 160 Feb  8 16:07 ..
total 0
lrwxrwxrwx 1 root root   9 Feb  8 16:07 pci-0000:3b:00.0-scsi-0:2:0:0 -> ../../sda
lrwxrwxrwx 1 root root   9 Feb  8 16:07 pci-0000:3b:00.0-scsi-0:2:1:0 -> ../../sdb
lrwxrwxrwx 1 root root   9 Feb  8 16:07 pci-0000:3b:00.0-scsi-0:2:2:0 -> ../../sdc
lrwxrwxrwx 1 root root   9 Feb  8 16:07 pci-0000:3b:00.0-scsi-0:2:3:0 -> ../../sdd
lrwxrwxrwx 1 root root   9 Feb  8 16:07 pci-0000:3b:00.0-scsi-0:2:5:0 -> ../../sde
lrwxrwxrwx 1 root root   9 Feb  8 16:07 pci-0000:3b:00.0-scsi-0:2:4:0 -> ../../sdf
lrwxrwxrwx 1 root root   9 Feb  8 16:07 pci-0000:3b:00.0-scsi-0:2:6:0 -> ../../sdg
lrwxrwxrwx 1 root root   9 Feb  8 16:07 pci-0000:3b:00.0-scsi-0:2:8:0 -> ../../sdh
lrwxrwxrwx 1 root root   9 Feb  8 16:07 pci-0000:3b:00.0-scsi-0:2:7:0 -> ../../sdi
lrwxrwxrwx 1 root root   9 Feb  8 16:07 pci-0000:3b:00.0-scsi-0:2:9:0 -> ../../sdj
lrwxrwxrwx 1 root root   9 Feb  8 16:07 pci-0000:3b:00.0-scsi-0:2:10:0 -> ../../sdk
lrwxrwxrwx 1 root root   9 Feb  8 16:07 pci-0000:3b:00.0-scsi-0:2:11:0 -> ../../sdl
lrwxrwxrwx 1 root root   9 Feb  8 16:07 pci-0000:3b:00.0-scsi-0:2:13:0 -> ../../sdm
lrwxrwxrwx 1 root root   9 Feb  8 16:07 pci-0000:3b:00.0-scsi-0:2:12:0 -> ../../sdn

And we're good now:

thanos-be2001:~$ ls -la /dev/disk/by-path/ | grep -v part | sort -k11
drwxr-xr-x 2 root root 720 Feb  9 07:51 .
drwxr-xr-x 8 root root 160 Feb  9 07:51 ..
total 0
lrwxrwxrwx 1 root root   9 Feb  9 07:51 pci-0000:3b:00.0-scsi-0:2:0:0 -> ../../sda
lrwxrwxrwx 1 root root   9 Feb  9 07:51 pci-0000:3b:00.0-scsi-0:2:1:0 -> ../../sdb
lrwxrwxrwx 1 root root   9 Feb  9 07:51 pci-0000:3b:00.0-scsi-0:2:2:0 -> ../../sdc
lrwxrwxrwx 1 root root   9 Feb  9 07:51 pci-0000:3b:00.0-scsi-0:2:3:0 -> ../../sdd
lrwxrwxrwx 1 root root   9 Feb  9 07:51 pci-0000:3b:00.0-scsi-0:2:4:0 -> ../../sde
lrwxrwxrwx 1 root root   9 Feb  9 07:51 pci-0000:3b:00.0-scsi-0:2:5:0 -> ../../sdf
lrwxrwxrwx 1 root root   9 Feb  9 07:51 pci-0000:3b:00.0-scsi-0:2:6:0 -> ../../sdg
lrwxrwxrwx 1 root root   9 Feb  9 07:51 pci-0000:3b:00.0-scsi-0:2:7:0 -> ../../sdh
lrwxrwxrwx 1 root root   9 Feb  9 07:51 pci-0000:3b:00.0-scsi-0:2:8:0 -> ../../sdi
lrwxrwxrwx 1 root root   9 Feb  9 07:51 pci-0000:3b:00.0-scsi-0:2:9:0 -> ../../sdj
lrwxrwxrwx 1 root root   9 Feb  9 07:51 pci-0000:3b:00.0-scsi-0:2:10:0 -> ../../sdk
lrwxrwxrwx 1 root root   9 Feb  9 07:51 pci-0000:3b:00.0-scsi-0:2:11:0 -> ../../sdl
lrwxrwxrwx 1 root root   9 Feb  9 07:51 pci-0000:3b:00.0-scsi-0:2:12:0 -> ../../sdm
lrwxrwxrwx 1 root root   9 Feb  9 07:51 pci-0000:3b:00.0-scsi-0:2:13:0 -> ../../sdn
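
A quick way to check listings like the ones above is to verify that ascending SCSI target IDs map to ascending sdX letters. The helper name and the inlined sample lines below are illustrative; on a real host you would pipe in `ls -l /dev/disk/by-path` output instead.

```shell
# Check that SCSI target ID order matches device-letter order
# (illustrative sketch with sample by-path lines).
check_order() {
    sort -t: -k5,5n |      # order lines by the SCSI target ID field
        sed 's/.*sd//' |   # keep only the trailing drive letter
        sort -c            # exits non-zero if letters are out of order
}
printf '%s\n' \
    'pci-0000:3b:00.0-scsi-0:2:0:0 -> ../../sda' \
    'pci-0000:3b:00.0-scsi-0:2:1:0 -> ../../sdb' \
    'pci-0000:3b:00.0-scsi-0:2:2:0 -> ../../sdc' |
    check_order && echo "disk order OK"
```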

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host thanos-be2002.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host thanos-be2002.codfw.wmnet with OS bullseye completed:

  • thanos-be2002 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202212081525_mvernon_4090101_thanos-be2002.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host thanos-be2003.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host thanos-be2003.codfw.wmnet with OS bullseye completed:

  • thanos-be2003 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202212090933_mvernon_82015_thanos-be2003.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host thanos-be2004.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host thanos-be2004.codfw.wmnet with OS bullseye completed:

  • thanos-be2004 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202212091100_mvernon_98100_thanos-be2004.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host thanos-be1001.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host thanos-be1001.eqiad.wmnet with OS bullseye completed:

  • thanos-be1001 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202212091235_mvernon_115228_thanos-be1001.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host thanos-be1002.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host thanos-be1002.eqiad.wmnet with OS bullseye completed:

  • thanos-be1002 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202212120854_mvernon_794389_thanos-be1002.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host thanos-be1003.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host thanos-be1003.eqiad.wmnet with OS bullseye completed:

  • thanos-be1003 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202212121159_mvernon_830230_thanos-be1003.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host thanos-be1004.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host thanos-be1004.eqiad.wmnet with OS bullseye completed:

  • thanos-be1004 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202212121330_mvernon_847406_thanos-be1004.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
MatthewVernon claimed this task.
MatthewVernon updated the task description.
MatthewVernon subscribed.

All Thanos nodes are now running Bullseye.