[Epic] Migrate all Search Platform servers to Debian Bullseye
Closed, Resolved (Public)

Description

Our OS Update Policy states that we intend to end the use of Debian Buster by the end of September 2023. For the Search Platform team, the affected hosts are:

  • WDQS / WCQS
  • Ganeti VMs for search-loader

Notes:

  • search-loader will probably be made obsolete by the new Search Updater Pipeline.
  • Blazegraph (WDQS / WCQS) does not support Java 11 / 17. We maintain an up-to-date Java 8 backport for Bullseye, so this should not be an issue.
  • Individual migrations to Bullseye should be tracked as subtasks
  • sudo cumin -b 10 'P:contacts%role_contacts~"Search Platform"' 'facter -p lsbdistcodename' can be used to check which hosts are running which Debian version
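
The cumin command in the note above nests double quotes inside single quotes, which is easy to get wrong when copying it into a script. As a minimal sketch (it only tokenizes the command string the way a POSIX shell would and does not run cumin), the following shows which arguments the query and the remote command end up in:

```python
import shlex

# The command exactly as written in the task description.
command = """sudo cumin -b 10 'P:contacts%role_contacts~"Search Platform"' 'facter -p lsbdistcodename'"""

# shlex.split applies POSIX shell quoting rules: the single quotes are
# consumed, while the inner double quotes stay part of the query argument.
argv = shlex.split(command)

for index, arg in enumerate(argv):
    print(index, arg)
```

The host query and the command to execute each arrive as a single argument, so the space in "Search Platform" does not split the query.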

AC:

  • All Search Platform servers are migrated away from Buster

Related Objects

Status | Assigned
Open | None
Resolved | Gehel
Resolved | bking
Resolved | bking
Resolved | bking
Resolved | bking
Resolved | bking
Resolved | VRiley-WMF
Resolved | bking
Resolved | Gehel
In Progress | Sandeeps
Duplicate | None
Resolved | bking
Resolved | Papaul
Resolved | bking
Resolved | Gehel
Resolved | bking
Resolved | EBernhardson
Invalid | None
Resolved | bking
Invalid | bking
Resolved | brouberol

Event Timeline

MPhamWMF moved this task from needs triage to Ops / SRE on the Discovery-Search board.
MoritzMuehlenhoff renamed this task from [Epic] Migrate all Search Platform servers to Debian Buster to [Epic] Migrate all Search Platform servers to Debian Bullseye. (Nov 29 2022, 12:19 PM)
MoritzMuehlenhoff updated the task description.

Note that we have a bunch of wdqs hosts that are online, but not yet in production. We may wish to use these to test Bullseye.

search-loader will probably be made obsolete by the new Search Updater Pipeline.

search-loader runs two daemons, mjolnir-bulk and mjolnir-msearch. The bulk daemon will be made obsolete by the search updater pipeline, but the msearch daemon will not be replaced at this time. It may be possible to move what the msearch daemon does in-process within the mjolnir Spark jobs, but that would require engineering work.

We currently still have 30 nodes on Buster:

  • APIFeature Usage: apifeatureusage2001.codfw.wmnet,apifeatureusage1001.eqiad.wmnet
  • Search Loader: search-loader2001.codfw.wmnet,search-loader1001.eqiad.wmnet
  • W[CD]QS: wcqs[2001-2003].codfw.wmnet,wcqs[1001-1003].eqiad.wmnet,wdqs[2007-2012].codfw.wmnet,wdqs[1003-1016].eqiad.wmnet
gehel@cumin1001:~$ sudo cumin -b 10 'P:contacts%role_contacts~"Search Platform"' 'facter -p lsbdistcodename'
150 hosts will be targeted:
apifeatureusage2001.codfw.wmnet,apifeatureusage1001.eqiad.wmnet,cloudelastic[1001-1006].wikimedia.org,elastic[2037-2048,2050-2086].codfw.wmnet,elastic[1053-1102].eqiad.wmnet,flink-zk[1001-1003].eqiad.wmnet,relforge[1003-1004].eqiad.wmnet,search-loader2001.codfw.wmnet,search-loader1001.eqiad.wmnet,wcqs[2001-2003].codfw.wmnet,wcqs[1001-1003].eqiad.wmnet,wdqs[2007-2022].codfw.wmnet,wdqs[1003-1016].eqiad.wmnet
OK to proceed on 150 hosts? Enter the number of affected hosts to confirm or "q" to quit: 150
===== NODE GROUP =====                                                          
(3) flink-zk[1001-1003].eqiad.wmnet                                             
----- OUTPUT of 'facter -p lsbdistcodename' -----                               
bookworm                                                                        
===== NODE GROUP =====                                                          
(117) cloudelastic[1001-1006].wikimedia.org,elastic[2037-2048,2050-2086].codfw.wmnet,elastic[1053-1102].eqiad.wmnet,relforge[1003-1004].eqiad.wmnet,wdqs[2013-2022].codfw.wmnet
----- OUTPUT of 'facter -p lsbdistcodename' -----                               
bullseye                                                                        
===== NODE GROUP =====                                                          
(30) apifeatureusage2001.codfw.wmnet,apifeatureusage1001.eqiad.wmnet,search-loader2001.codfw.wmnet,search-loader1001.eqiad.wmnet,wcqs[2001-2003].codfw.wmnet,wcqs[1001-1003].eqiad.wmnet,wdqs[2007-2012].codfw.wmnet,wdqs[1003-1016].eqiad.wmnet
----- OUTPUT of 'facter -p lsbdistcodename' -----                               
buster                                                                          
================                                                                
PASS |██████████████████████████████| 100% (150/150) [00:41<00:00,  3.63hosts/s]
FAIL |                                        |   0% (0/150) [00:41<?, ?hosts/s]
100.0% (150/150) success ratio (>= 100.0% threshold) for command: 'facter -p lsbdistcodename'.
100.0% (150/150) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
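
As a cross-check on the host counts in the output above, here is a minimal Python sketch that expands the bracketed range notation and counts the resulting hosts. The helper name `expand_hosts` is hypothetical, and the parsing rules are an assumption inferred from the output shown here, not cumin's own implementation.

```python
import re


def expand_hosts(expr: str) -> list[str]:
    """Expand e.g. 'wdqs[2007-2012].codfw.wmnet,foo1001.eqiad.wmnet' into hostnames."""
    hosts = []
    # Split on commas, except those inside square brackets.
    for part in re.split(r",(?![^\[]*\])", expr):
        match = re.fullmatch(r"([^\[\]]*)\[([^\]]+)\](.*)", part)
        if match is None:
            hosts.append(part)  # plain hostname, no range to expand
            continue
        prefix, ranges, suffix = match.groups()
        for item in ranges.split(","):
            low, _, high = item.partition("-")
            high = high or low  # a single number such as '2050'
            width = len(low)    # preserve zero padding, if any
            for number in range(int(low), int(high) + 1):
                hosts.append(f"{prefix}{number:0{width}d}{suffix}")
    return hosts


# The Buster node group, copied from the cumin output.
buster = (
    "apifeatureusage2001.codfw.wmnet,apifeatureusage1001.eqiad.wmnet,"
    "search-loader2001.codfw.wmnet,search-loader1001.eqiad.wmnet,"
    "wcqs[2001-2003].codfw.wmnet,wcqs[1001-1003].eqiad.wmnet,"
    "wdqs[2007-2012].codfw.wmnet,wdqs[1003-1016].eqiad.wmnet"
)

print(len(expand_hosts(buster)))  # 30
```

The result matches the "(30)" that cumin prints for the buster group.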

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wcqs2001.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wcqs2001.codfw.wmnet with OS bullseye executed with errors:

  • wcqs2001 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308041518_bking_62915_wcqs2001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • The reimage failed, see the cookbook logs for the details

Mentioned in SAL (#wikimedia-operations) [2023-08-04T18:13:36Z] <bking@cumin1001> START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on wcqs2001.codfw.wmnet with reason: T323921

Mentioned in SAL (#wikimedia-operations) [2023-08-04T18:14:00Z] <bking@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on wcqs2001.codfw.wmnet with reason: T323921

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wcqs2002.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wcqs2002.codfw.wmnet with OS bullseye executed with errors:

  • wcqs2002 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308041834_bking_99274_wcqs2002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • The reimage failed, see the cookbook logs for the details

Mentioned in SAL (#wikimedia-operations) [2023-08-04T19:12:04Z] <bking@cumin1001> START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on wcqs2002.codfw.wmnet with reason: T323921

Mentioned in SAL (#wikimedia-operations) [2023-08-04T19:12:12Z] <bking@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on wcqs2002.codfw.wmnet with reason: T323921

Mentioned in SAL (#wikimedia-operations) [2023-08-04T20:04:34Z] <bking@cumin1001> START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on wcqs2002.codfw.wmnet with reason: T323921

Mentioned in SAL (#wikimedia-operations) [2023-08-04T20:04:47Z] <bking@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on wcqs2002.codfw.wmnet with reason: T323921

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wcqs2003.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wcqs2003.codfw.wmnet with OS bullseye executed with errors:

  • wcqs2003 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308072019_bking_923178_wcqs2003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wcqs1001.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wcqs1002.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wcqs1001.eqiad.wmnet with OS bullseye completed:

  • wcqs1001 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308081738_bking_1167782_wcqs1001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wcqs1002.eqiad.wmnet with OS bullseye executed with errors:

  • wcqs1002 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308081741_bking_1168860_wcqs1002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wcqs1003.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wcqs1003.eqiad.wmnet with OS bullseye executed with errors:

  • wcqs1003 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308082118_bking_1212577_wcqs1003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs2007.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs2007.codfw.wmnet with OS bullseye executed with errors:

  • wdqs2007 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308101843_bking_1740817_wdqs2007.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • The reimage failed, see the cookbook logs for the details
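
The reimage reports above all share the same shape: a top-level bullet with the short hostname and a status in parentheses, followed by the individual steps. As a rough sketch for summarising many such reports, the snippet below extracts the host and status from each header line. The function name `reimage_statuses` is hypothetical. Only WARN and FAIL appear in this task, so accepting PASS as a status is an assumption.

```python
import re
from collections import Counter

# Matches header bullets such as "  • wcqs2001 (FAIL)".
# PASS is assumed to exist; only WARN and FAIL are seen in this task.
HEADER = re.compile(r"^\s*•\s+(?P<host>[\w-]+)\s+\((?P<status>PASS|WARN|FAIL)\)\s*$")


def reimage_statuses(report_text: str) -> dict[str, str]:
    """Map each short hostname to the status on its report header line."""
    statuses = {}
    for line in report_text.splitlines():
        match = HEADER.match(line)
        if match:
            statuses[match["host"]] = match["status"]
    return statuses


# Two abbreviated reports copied from this task.
sample = """\
  • wcqs2001 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • The reimage failed, see the cookbook logs for the details
  • wcqs1001 (WARN)
    • Icinga status is not optimal, downtime not removed
"""

statuses = reimage_statuses(sample)
print(statuses)
print(Counter(statuses.values()))
```

Step lines are ignored because they do not end in a parenthesised status, so the full report text can be passed in as-is.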

This is complete; closing.

Reopening as the following hosts are still on Buster: apifeatureusage2001.codfw.wmnet,apifeatureusage1001.eqiad.wmnet,search-loader2001.codfw.wmnet,search-loader1001.eqiad.wmnet

Will check for/create new task for these migrations.

Gehel claimed this task.