[Epic] Migrate all Search Platform servers to Debian Bullseye
Closed, Resolved · Public

Assigned To

Authored By

	Gehel
	Nov 28 2022, 3:29 PM

Description

Our OS Update Policy states that we intend to end the use of Debian Buster by the end of September 2023. For the Search Platform team, the following hosts are affected:

WDQS / WCQS
Ganeti VMs for search-loader

Notes:

search-loader will probably be made obsolete by the new Search Updater Pipeline.
Blazegraph (WDQS / WCQS) does not support Java 11 / 17. We maintain an up-to-date Java 8 backport for Bullseye, so this should not be an issue.
Individual migrations to Bullseye should be tracked as subtasks
sudo cumin -b 10 'P:contacts%role_contacts~"Search Platform"' 'facter -p lsbdistcodename' can be used to check which hosts are running which Debian version

AC:

All Search Platform servers are migrated away from Buster

Related Objects

Status	Assigned	Task
Open	None	T291916 Tracking task for Bullseye migrations in production
Resolved	Gehel	T323921 [Epic] Migrate all Search Platform servers to Debian Bullseye
Resolved	bking	T328325 Reimage wdqs20[13-22] servers to Bullseye
Resolved	bking	T331297 Audit/update NIC firmware on Search Platform-owned Buster hosts
Resolved	bking	T331300 Ensure WDQS stack works on Bullseye
Resolved	bking	T336540 Ensure prometheus-blazegraph-exporter-wdqs-* services can start in Bullseye or later
Resolved	bking	T336443 Investigate performance differences between wdqs2022 and older hosts
Resolved	VRiley-WMF	T358727 Reclaim recently-decommed CP host for WDQS (see T352253)
Resolved	bking	T340793 Implement depool (source only) and keep-downtime options on data-transfer cookbook
Resolved	Gehel	T342060 Investigate WDQS categories update failures on Bullseye hosts
In Progress	Sandeeps	T342162 "scap deploy"'s config-deploy should check for broken symlinks
Duplicate	None	T342701 Ensure WCQS stack works on Bullseye or later
Resolved	bking	T343124 Migrate WDQS and WCQS servers to Debian Bullseye
Resolved	Papaul	T344518 hw troubleshooting: wdqs1010 unreachable from SSH or DRAC
Resolved	bking	T346039 Migrate search-loader hosts to Bullseye or later
Resolved	Gehel	T346272 1 codfw VM requested for search-loader
Resolved	bking	T346273 eqiad: 1 VM requested for search-loader
Resolved	EBernhardson	T346373 Ensure mjolnir can work on Python 3.9 or later
Invalid	None	T350078 Decom search-loader VMs still using Buster
Resolved	bking	T351123 Decommission search-loader1001/2001 VMs
Invalid	bking	T351233 Update search-loader dashboard to reflect new search-loader hosts
Resolved	brouberol	T346053 Migrate apifeatureusage hosts to Bullseye or later

Event Timeline

Gehel created this task. Nov 28 2022, 3:29 PM

Restricted Application added a subscriber: Aklapper. Nov 28 2022, 3:29 PM

MPhamWMF triaged this task as High priority. Nov 28 2022, 4:40 PM

MPhamWMF moved this task from needs triage to Ops / SRE on the Discovery-Search board.

MoritzMuehlenhoff renamed this task from [Epic] Migrate all Search Platform servers to Debian Buster to [Epic] Migrate all Search Platform servers to Debian Bullseye. Nov 29 2022, 12:19 PM

MoritzMuehlenhoff updated the task description.

MoritzMuehlenhoff subscribed. Nov 29 2022, 12:26 PM

Gehel edited projects, added Discovery-Search (Current work); removed Discovery-Search. Dec 16 2022, 8:19 AM

Gehel edited projects, added Discovery-Search; removed Discovery-Search (Current work). Jan 23 2023, 4:47 PM

Note that we have a bunch of wdqs hosts that are online, but not yet in production. We may wish to use these to test Bullseye.

search-loader will probably be made obsolete by the new Search Updater Pipeline.

search-loader runs two daemons, mjolnir-bulk and mjolnir-msearch. The bulk daemon will be made obsolete by the search updater pipeline, but the msearch daemon will not be replaced at this time. It may be possible to move what the msearch daemon does in-process into the mjolnir Spark jobs, but that would require engineering work.

Gehel edited projects, added Discovery-Search (Current work); removed Discovery-Search. Mar 16 2023, 2:02 PM

MPhamWMF moved this task from Incoming to Epics on the Discovery-Search (Current work) board. Apr 10 2023, 3:31 PM

bking closed subtask T331297: Audit/update NIC firmware on Search Platform-owned Buster hosts as Resolved. Jun 5 2023, 10:23 PM

Gehel closed subtask T328325: Reimage wdqs20[13-22] servers to Bullseye as Resolved. Jul 28 2023, 9:51 AM

Gehel closed subtask T331300: Ensure WDQS stack works on Bullseye as Resolved. Jul 28 2023, 9:55 AM

Gehel added a subtask: T342701: Ensure WCQS stack works on Bullseye or later. Jul 31 2023, 12:49 PM

Gehel added a parent task: T291916: Tracking task for Bullseye migrations in production.

Gehel edited projects, added Data-Platform-SRE; removed Discovery-Search (Current work). Jul 31 2023, 12:55 PM

Gehel moved this task from Incoming to Epics on the Data-Platform-SRE board.

Gehel updated the task description. Aug 3 2023, 7:11 PM

Gehel updated the task description. Aug 3 2023, 7:13 PM

We currently still have 30 nodes on Buster:

APIFeature Usage: apifeatureusage2001.codfw.wmnet,apifeatureusage1001.eqiad.wmnet
Search Loader: search-loader2001.codfw.wmnet,search-loader1001.eqiad.wmnet
W[CD]QS: wcqs[2001-2003].codfw.wmnet,wcqs[1001-1003].eqiad.wmnet,wdqs[2007-2012].codfw.wmnet,wdqs[1003-1016].eqiad.wmnet
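
The total of 30 can be checked by expanding the bracketed ranges in the list above. This is a minimal Python sketch under stated assumptions: it is a simplified stand-in for the node-set syntax, not cumin's own parser, and `split_hosts` and `expand` are hypothetical helper names that only handle the `name[a-b,c-d].domain` form seen in this task.

```python
import re

# Host lists copied from the comment above.
BUSTER_HOSTS = {
    "APIFeature Usage": "apifeatureusage2001.codfw.wmnet,apifeatureusage1001.eqiad.wmnet",
    "Search Loader": "search-loader2001.codfw.wmnet,search-loader1001.eqiad.wmnet",
    "W[CD]QS": "wcqs[2001-2003].codfw.wmnet,wcqs[1001-1003].eqiad.wmnet,"
               "wdqs[2007-2012].codfw.wmnet,wdqs[1003-1016].eqiad.wmnet",
}


def split_hosts(expr):
    """Split on commas that are not inside [...] (hypothetical helper)."""
    parts, depth, current = [], 0, ""
    for ch in expr:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append(current)
            current = ""
        else:
            current += ch
    if current:
        parts.append(current)
    return parts


def expand(expr):
    """Expand one 'name[a-b,c-d].domain' expression into host names."""
    match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", expr)
    if match is None:
        return [expr]  # no bracketed range: already a single host
    prefix, ranges, suffix = match.groups()
    hosts = []
    for item in ranges.split(","):
        first, _, last = item.partition("-")
        last = last or first  # a bare number is a range of one
        for number in range(int(first), int(last) + 1):
            # Keep the zero-padding width of the first number in the range.
            hosts.append(f"{prefix}{number:0{len(first)}d}{suffix}")
    return hosts


counts = {
    group: sum(len(expand(part)) for part in split_hosts(hosts))
    for group, hosts in BUSTER_HOSTS.items()
}
total = sum(counts.values())
print(counts, total)
```

With the lists above this gives 2 APIFeature Usage hosts, 2 Search Loader hosts, and 26 WCQS/WDQS hosts, for 30 in total.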

gehel@cumin1001:~$ sudo cumin -b 10 'P:contacts%role_contacts~"Search Platform"' 'facter -p lsbdistcodename'
150 hosts will be targeted:
apifeatureusage2001.codfw.wmnet,apifeatureusage1001.eqiad.wmnet,cloudelastic[1001-1006].wikimedia.org,elastic[2037-2048,2050-2086].codfw.wmnet,elastic[1053-1102].eqiad.wmnet,flink-zk[1001-1003].eqiad.wmnet,relforge[1003-1004].eqiad.wmnet,search-loader2001.codfw.wmnet,search-loader1001.eqiad.wmnet,wcqs[2001-2003].codfw.wmnet,wcqs[1001-1003].eqiad.wmnet,wdqs[2007-2022].codfw.wmnet,wdqs[1003-1016].eqiad.wmnet
OK to proceed on 150 hosts? Enter the number of affected hosts to confirm or "q" to quit: 150
===== NODE GROUP =====                                                          
(3) flink-zk[1001-1003].eqiad.wmnet                                             
----- OUTPUT of 'facter -p lsbdistcodename' -----                               
bookworm                                                                        
===== NODE GROUP =====                                                          
(117) cloudelastic[1001-1006].wikimedia.org,elastic[2037-2048,2050-2086].codfw.wmnet,elastic[1053-1102].eqiad.wmnet,relforge[1003-1004].eqiad.wmnet,wdqs[2013-2022].codfw.wmnet
----- OUTPUT of 'facter -p lsbdistcodename' -----                               
bullseye                                                                        
===== NODE GROUP =====                                                          
(30) apifeatureusage2001.codfw.wmnet,apifeatureusage1001.eqiad.wmnet,search-loader2001.codfw.wmnet,search-loader1001.eqiad.wmnet,wcqs[2001-2003].codfw.wmnet,wcqs[1001-1003].eqiad.wmnet,wdqs[2007-2012].codfw.wmnet,wdqs[1003-1016].eqiad.wmnet
----- OUTPUT of 'facter -p lsbdistcodename' -----                               
buster                                                                          
================                                                                
PASS |██████████████████████████████| 100% (150/150) [00:41<00:00,  3.63hosts/s]
FAIL |                                        |   0% (0/150) [00:41<?, ?hosts/s]
100.0% (150/150) success ratio (>= 100.0% threshold) for command: 'facter -p lsbdistcodename'.
100.0% (150/150) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
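
For repeated checks, the node-group section of this output can be reduced to a per-release host count. The sketch below is illustrative only: `hosts_by_release` is a hypothetical helper, and it assumes the layout shown above, where each `(N) hosts` line is followed by an `----- OUTPUT ... -----` line and exactly one line of command output. It would need adjusting for multi-line output or a different cumin output format.

```python
import re

# Node-group section of the cumin output above, copied verbatim.
CUMIN_OUTPUT = """\
===== NODE GROUP =====
(3) flink-zk[1001-1003].eqiad.wmnet
----- OUTPUT of 'facter -p lsbdistcodename' -----
bookworm
===== NODE GROUP =====
(117) cloudelastic[1001-1006].wikimedia.org,elastic[2037-2048,2050-2086].codfw.wmnet,elastic[1053-1102].eqiad.wmnet,relforge[1003-1004].eqiad.wmnet,wdqs[2013-2022].codfw.wmnet
----- OUTPUT of 'facter -p lsbdistcodename' -----
bullseye
===== NODE GROUP =====
(30) apifeatureusage2001.codfw.wmnet,apifeatureusage1001.eqiad.wmnet,search-loader2001.codfw.wmnet,search-loader1001.eqiad.wmnet,wcqs[2001-2003].codfw.wmnet,wcqs[1001-1003].eqiad.wmnet,wdqs[2007-2012].codfw.wmnet,wdqs[1003-1016].eqiad.wmnet
----- OUTPUT of 'facter -p lsbdistcodename' -----
buster
================
"""

# Matches a group header such as "(30) host1,host[1-3].example".
GROUP_LINE = re.compile(r"^\((\d+)\) (\S+)$")


def hosts_by_release(text):
    """Map each reported release name to (host count, host expression)."""
    lines = [line.strip() for line in text.splitlines()]
    result = {}
    for index, line in enumerate(lines):
        match = GROUP_LINE.match(line)
        has_output = (
            index + 2 < len(lines)
            and lines[index + 1].startswith("----- OUTPUT")
        )
        if match and has_output:
            release = lines[index + 2]
            result[release] = (int(match.group(1)), match.group(2))
    return result


summary = hosts_by_release(CUMIN_OUTPUT)
for release, (count, hosts) in sorted(summary.items()):
    print(f"{release}: {count} hosts")
```

On the output above this reports 3 hosts on bookworm, 117 on bullseye, and 30 on buster, which adds up to the 150 targeted hosts.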

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wcqs2001.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wcqs2001.codfw.wmnet with OS bullseye executed with errors:

wcqs2001 (FAIL)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308041518_bking_62915_wcqs2001.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- The reimage failed, see the cookbook logs for the details

Mentioned in SAL (#wikimedia-operations) [2023-08-04T18:13:36Z] <bking@cumin1001> START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on wcqs2001.codfw.wmnet with reason: T323921

Mentioned in SAL (#wikimedia-operations) [2023-08-04T18:14:00Z] <bking@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on wcqs2001.codfw.wmnet with reason: T323921

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wcqs2002.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wcqs2002.codfw.wmnet with OS bullseye executed with errors:

wcqs2002 (FAIL)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308041834_bking_99274_wcqs2002.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- The reimage failed, see the cookbook logs for the details

Mentioned in SAL (#wikimedia-operations) [2023-08-04T19:12:04Z] <bking@cumin1001> START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on wcqs2002.codfw.wmnet with reason: T323921

Mentioned in SAL (#wikimedia-operations) [2023-08-04T19:12:12Z] <bking@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on wcqs2002.codfw.wmnet with reason: T323921

Mentioned in SAL (#wikimedia-operations) [2023-08-04T20:04:34Z] <bking@cumin1001> START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on wcqs2002.codfw.wmnet with reason: T323921

Mentioned in SAL (#wikimedia-operations) [2023-08-04T20:04:47Z] <bking@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on wcqs2002.codfw.wmnet with reason: T323921

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wcqs2003.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wcqs2003.codfw.wmnet with OS bullseye executed with errors:

wcqs2003 (FAIL)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308072019_bking_923178_wcqs2003.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wcqs1001.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wcqs1002.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wcqs1001.eqiad.wmnet with OS bullseye completed:

wcqs1001 (WARN)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308081738_bking_1167782_wcqs1001.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is not optimal, downtime not removed
- Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wcqs1002.eqiad.wmnet with OS bullseye executed with errors:

wcqs1002 (FAIL)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308081741_bking_1168860_wcqs1002.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wcqs1003.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wcqs1003.eqiad.wmnet with OS bullseye executed with errors:

wcqs1003 (FAIL)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308082118_bking_1212577_wcqs1003.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs2007.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs2007.codfw.wmnet with OS bullseye executed with errors:

wdqs2007 (FAIL)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308101843_bking_1740817_wdqs2007.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- The reimage failed, see the cookbook logs for the details

This is complete; closing.

Reopening as the following hosts are still on Buster: apifeatureusage2001.codfw.wmnet,apifeatureusage1001.eqiad.wmnet,search-loader2001.codfw.wmnet,search-loader1001.eqiad.wmnet

Will check for/create new task for these migrations.

bking added a subtask: T346039: Migrate search-loader hosts to Bullseye or later. Sep 11 2023, 1:38 PM

Gehel added a subtask: T346053: Migrate apifeatureusage hosts to Bullseye or later. Sep 13 2023, 8:30 AM

RKemper closed subtask T343124: Migrate WDQS and WCQS servers to Debian Bullseye as Resolved. Sep 18 2023, 10:21 PM

Gehel closed subtask T346039: Migrate search-loader hosts to Bullseye or later as Resolved. Dec 1 2023, 9:17 AM

brouberol closed subtask T346053: Migrate apifeatureusage hosts to Bullseye or later as Resolved. Feb 13 2024, 11:51 AM

Gehel moved this task from Epics to Quarterly Goals on the Data-Platform-SRE board. Feb 29 2024, 9:31 AM

Gehel closed this task as Resolved. Mar 22 2024, 8:53 AM

Gehel claimed this task.