Upgrade Presto servers to Bullseye
Closed, ResolvedPublic

Description

Presto machines to be upgraded

  • presto-worker-test 1 - an-test-presto1001.eqiad.wmnet
  • an-presto1001
  • an-presto1002
  • an-presto1003
  • an-presto1004
  • an-presto1005

These machines should be relatively easy to upgrade, owing to the following factors.

  • There is no need for data retention - There is a /srv/ volume mounted, but it does not contain any data of value.
  • We have already shown that presto on bullseye works with an-presto10[06-15]

Therefore, the procedure to upgrade these hosts should be as simple as running the cookbook for each host as follows:

sudo cookbook sre.hosts.reimage --os bullseye -t 329361 an-presto1001
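As a minimal sketch of "running the cookbook for each host", the snippet below only prints one command line per production worker from the list above. The helper name `reimage_command` is hypothetical, the task ID is passed exactly as written in the description, and the test host is left out because the timeline shows it being reimaged with the Ganeti variant of the cookbook.

```python
import shlex

# Production workers listed in the task description.
HOSTS = [f"an-presto{n}" for n in range(1001, 1006)]


def reimage_command(host: str, task: str, os_name: str = "bullseye") -> str:
    """Return the reimage command line for one host as a shell-safe string."""
    argv = ["sudo", "cookbook", "sre.hosts.reimage", "--os", os_name, "-t", task, host]
    return shlex.join(argv)


if __name__ == "__main__":
    for host in HOSTS:
        # Task ID copied as-is from the description above.
        print(reimage_command(host, task="329361"))
```

Printing rather than executing keeps the hosts going one at a time, so that only one worker is down at any moment.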

We could also add one (or more) of the currently disabled presto hosts into the cluster to pick up the load of one worker being down.
i.e. delete this file: https://phabricator.wikimedia.org/source/operations-puppet/browse/production/hieradata/hosts/an-presto1006.yaml

Event Timeline

BTullis renamed this task from Upgrade Presto clients to Bullseye to Upgrade Presto servers to Bullseye.Feb 10 2023, 11:45 AM
nfraison changed the task status from Open to In Progress.Feb 13 2023, 12:52 PM

Mentioned in SAL (#wikimedia-analytics) [2023-02-13T14:06:34Z] <nfraison> Reimage an-test-presto1001 to upgrade to bullseye T329361

Cookbook cookbooks.sre.ganeti.reimage was started by nfraison@cumin1001 for host an-test-presto1001.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.ganeti.reimage started by nfraison@cumin1001 for host an-test-presto1001.eqiad.wmnet with OS bullseye executed with errors:

  • an-test-presto1001 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Reimage failed

For info, please visit https://www.isc.org/software/dhcp/
/etc/dhcp/automation/ttyS0-115200/an-test-presto1001.conf line 6: host an-test-presto1001: already exists
}
 ^
/etc/dhcp/automation/proxies/ttyS0-115200.conf line 2: /etc/dhcp/automation/ttyS0-115200/an-test-presto1001.conf: bad parse.
include "/etc/dhcp/automation/ttyS0-115200/an-test-presto1001.conf"
         ^
/etc/dhcp/dhcpd.conf line 768: /etc/dhcp/automation/proxies/ttyS0-115200.conf: bad parse.
        include "/etc/dhcp/automation/proxies/ttyS0-115200.conf"
                 ^
Configuration file errors encountered -- exiting

If you think you have received this message due to a bug rather
than a configuration issue please read the section on submitting
bugs on either our web page at www.isc.org or in the README file
before submitting a bug.  These pages explain the proper
process and the information we find helpful for debugging.

exiting.
2023-02-13 14:11:07,913 [ERROR] dhcp config test returned non-zero.
================
100.0% (1/1) of nodes failed to execute command '/usr/local/sbin/...cludes -r commit': install1004.wikimedia.org
0.0% (0/1) success ratio (< 100.0% threshold) for command: '/usr/local/sbin/...cludes -r commit'. Aborting.
0.0% (0/1) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.
Exception raised while executing cookbook sre.ganeti.reimage:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/dhcp.py", line 218, in refresh_dhcp
    self._hosts.run_sync("/usr/local/sbin/dhcpincludes -r commit", print_progress_bars=False)
  File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 520, in run_sync
    return self._execute(
  File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 720, in _execute
    raise RemoteExecutionError(ret, "Cumin execution failed")
spicerack.remote.RemoteExecutionError: Cumin execution failed (exit_code=2)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 234, in run
    raw_ret = runner.run()
  File "/srv/deployment/spicerack/cookbooks/sre/ganeti/reimage.py", line 470, in run
    with self.dhcp.config(self.dhcp_config):
  File "/usr/lib/python3.9/contextlib.py", line 117, in __enter__
    return next(self.gen)
  File "/usr/lib/python3/dist-packages/spicerack/dhcp.py", line 299, in config
    self.push_configuration(dhcp_config)
  File "/usr/lib/python3/dist-packages/spicerack/dhcp.py", line 250, in push_configuration
    self.refresh_dhcp()
  File "/usr/lib/python3/dist-packages/spicerack/dhcp.py", line 220, in refresh_dhcp
    raise DHCPRestartError("restarting generating dhcp config or restarting dhcpd failed") from exc
spicerack.dhcp.DHCPRestartError: restarting generating dhcp config or restarting dhcpd failed
**The reimage failed, see the cookbook logs for the details**
Reimage executed with errors:
- an-test-presto1001 (**FAIL**)
  - Downtimed on Icinga/Alertmanager
  - Disabled Puppet
  - Removed from Puppet and PuppetDB if present
  - Deleted any existing Puppet certificate
  - Removed from Debmonitor if present
  - **The reimage failed, see the cookbook logs for the details**

Manually removed the ttyS0-115200/an-test-presto1001.conf file, as indicated in the documentation: https://wikitech.wikimedia.org/wiki/Server_Lifecycle/Reimage#Virtual_hosts

nfraison@install1004:~$ rm /etc/dhcp/automation/ttyS0-115200/an-test-presto1001.conf
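The "host an-test-presto1001: already exists" error means dhcpd saw the same host declared twice. A rough sketch of how such duplicates could be spotted before a reimage is shown below. It works on in-memory text so it can be tried anywhere; the function name `duplicate_hosts` and the second path in the usage example are hypothetical, and this is not what the cookbook itself does.

```python
import re
from collections import defaultdict

# Matches the start of an ISC dhcpd host declaration, e.g. "host an-test-presto1001 {".
HOST_DECL = re.compile(r"^\s*host\s+(\S+)\s*\{", re.MULTILINE)


def duplicate_hosts(snippets: dict[str, str]) -> dict[str, list[str]]:
    """Map each host name declared more than once to the paths declaring it."""
    seen = defaultdict(list)
    for path, text in snippets.items():
        for name in HOST_DECL.findall(text):
            seen[name].append(path)
    return {name: paths for name, paths in seen.items() if len(paths) > 1}


if __name__ == "__main__":
    snippets = {
        # Path taken from the error output above.
        "/etc/dhcp/automation/ttyS0-115200/an-test-presto1001.conf":
            "host an-test-presto1001 {\n}\n",
        # Hypothetical second location holding a leftover static entry.
        "/etc/dhcp/example-static.conf":
            "host an-test-presto1001 {\n}\n",
    }
    print(duplicate_hosts(snippets))
```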

Cookbook cookbooks.sre.ganeti.reimage was started by nfraison@cumin1001 for host an-test-presto1001.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.ganeti.reimage started by nfraison@cumin1001 for host an-test-presto1001.eqiad.wmnet with OS bullseye executed with errors:

  • an-test-presto1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Change 888711 had a related patch set uploaded (by Nicolas Fraison; author: Nicolas Fraison):

[operations/puppet@production] chore(install_server): remove an-test-presto1001 entries for reimage

https://gerrit.wikimedia.org/r/888711

Change 888711 merged by Nicolas Fraison:

[operations/puppet@production] chore(install_server): remove an-test-presto1001 entries for reimage

https://gerrit.wikimedia.org/r/888711

Cookbook cookbooks.sre.ganeti.reimage was started by nfraison@cumin1001 for host an-test-presto1001.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.ganeti.reimage started by nfraison@cumin1001 for host an-test-presto1001.eqiad.wmnet with OS bullseye executed with errors:

  • an-test-presto1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.ganeti.reimage was started by nfraison@cumin1001 for host an-test-presto1001.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.ganeti.reimage started by nfraison@cumin1001 for host an-test-presto1001.eqiad.wmnet with OS bullseye executed with errors:

  • an-test-presto1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.ganeti.reimage was started by nfraison@cumin1001 for host an-test-presto1001.eqiad.wmnet with OS bullseye

Puppet failed on the Presto test host:

Feb 13 15:22:48 an-test-presto1001 puppet-agent[3743]: (/Stage[main]/Rsyslog/Service[rsyslog]) Skipping because of failed dependencies
Feb 13 15:22:48 an-test-presto1001 puppet-agent[3743]: (/Stage[main]/Nrpe/Base::Service_unit[nagios-nrpe-server]/Service[nagios-nrpe-server]) Skipping because of failed dependencies
Feb 13 15:22:48 an-test-presto1001 puppet-agent[3743]: (/Stage[main]/Profile::Presto::Server/Nrpe::Monitor_service[presto-server]/Nrpe::Check[check_presto-server]/Sudo::User[nrpe-check_presto-server]/File[/etc/sudoers.d/nrpe-check_presto-server]) Skippi>
Feb 13 15:22:43 an-test-presto1001 puppet-agent[3743]: (/Stage[main]/Profile::Presto::Server/Nrpe::Monitor_service[presto-server]/Nrpe::Check[check_presto-server]/File[/etc/nagios/nrpe.d/check_presto-server.cfg]) Skipping because of failed dependencies
Feb 13 15:22:43 an-test-presto1001 puppet-agent[3743]: (/Stage[main]/Presto::Server/Service[presto-server]) Skipping because of failed dependencies
Feb 13 15:22:43 an-test-presto1001 puppet-agent[3743]: (/Stage[main]/Presto::Server/Presto::Catalog[analytics_test_iceberg]/Presto::Properties[catalog/analytics_test_iceberg]/File[/etc/presto/catalog/analytics_test_iceberg.properties]) Skipping because >
Feb 13 15:22:43 an-test-presto1001 puppet-agent[3743]: (/Stage[main]/Presto::Server/Presto::Catalog[analytics_test_hive]/Presto::Properties[catalog/analytics_test_hive]/File[/etc/presto/catalog/analytics_test_hive.properties]) Skipping because of failed>
Feb 13 15:22:43 an-test-presto1001 puppet-agent[3743]: (/Stage[main]/Profile::Presto::Server/Sslcert::X509_to_pkcs12[presto_keystore]/File[/etc/presto/ssl/server.p12]) Skipping because of failed dependencies
Feb 13 15:22:43 an-test-presto1001 puppet-agent[3743]: (/Stage[main]/Profile::Presto::Server/Sslcert::X509_to_pkcs12[presto_keystore]/Exec[sslcert generate presto_keystore.p12]) Skipping because of failed dependencies
Feb 13 15:22:43 an-test-presto1001 puppet-agent[3743]: (/Stage[main]/Main/Cfssl::Cert[discovery__an-test-presto1001_eqiad_wmnet]/File[/etc/presto/ssl/discovery__an-test-presto1001_eqiad_wmnet.chained.pem]) Skipping because of failed dependencies
Feb 13 15:22:43 an-test-presto1001 puppet-agent[3743]: (/Stage[main]/Main/Cfssl::Cert[discovery__an-test-presto1001_eqiad_wmnet]/Exec[create chained cert /etc/presto/ssl/discovery__an-test-presto1001_eqiad_wmnet.chain.pem]) Skipping because of failed de>
Feb 13 15:22:43 an-test-presto1001 puppet-agent[3743]: (/Stage[main]/Main/Cfssl::Cert[discovery__an-test-presto1001_eqiad_wmnet]/Exec[renew certificate - discovery__an-test-presto1001_eqiad_wmnet]) Skipping because of failed dependencies
Feb 13 15:22:43 an-test-presto1001 puppet-agent[3743]: (/Stage[main]/Main/Cfssl::Cert[discovery__an-test-presto1001_eqiad_wmnet]/Exec[Generate cert discovery__an-test-presto1001_eqiad_wmnet refresh]) Skipping because of failed dependencies
Feb 13 15:22:43 an-test-presto1001 puppet-agent[3743]: (/Stage[main]/Main/Cfssl::Cert[discovery__an-test-presto1001_eqiad_wmnet]/Exec[Generate cert discovery__an-test-presto1001_eqiad_wmnet]) Skipping because of failed dependencies
Feb 13 15:22:43 an-test-presto1001 puppet-agent[3743]: (/Stage[main]/Main/Cfssl::Cert[discovery__an-test-presto1001_eqiad_wmnet]/Cfssl::Csr[/etc/cfssl/csr/discovery__an-test-presto1001_eqiad_wmnet.csr]/File[/etc/cfssl/csr/discovery__an-test-presto1001_e>
Feb 13 15:22:33 an-test-presto1001 puppet-agent[3743]: (/Stage[main]/Presto::Server/Rsyslog::Conf[presto-server]/File[/etc/rsyslog.d/60-presto-server.conf]) Skipping because of failed dependencies
Feb 13 15:22:33 an-test-presto1001 puppet-agent[3743]: (/Stage[main]/Presto::Server/Logrotate::Conf[presto-server]/File[/etc/logrotate.d/presto-server]) Skipping because of failed dependencies
Feb 13 15:22:33 an-test-presto1001 puppet-agent[3743]: (/Stage[main]/Presto::Server/Presto::Properties[log]/File[/etc/presto/log.properties]) Skipping because of failed dependencies
Feb 13 15:22:33 an-test-presto1001 puppet-agent[3743]: (/Stage[main]/Presto::Server/Presto::Properties[node]/File[/etc/presto/node.properties]) Skipping because of failed dependencies
Feb 13 15:22:33 an-test-presto1001 puppet-agent[3743]: (/Stage[main]/Presto::Server/Presto::Properties[config]/File[/etc/presto/config.properties]) Skipping because of failed dependencies
Feb 13 15:22:33 an-test-presto1001 puppet-agent[3743]: (/Stage[main]/Main/Cfssl::Cert[discovery__an-test-presto1001_eqiad_wmnet]/File[/etc/presto/ssl/discovery__an-test-presto1001_eqiad_wmnet.chain.pem]) Skipping because of failed dependencies
Feb 13 15:22:33 an-test-presto1001 puppet-agent[3743]: (/Stage[main]/Main/Cfssl::Cert[discovery__an-test-presto1001_eqiad_wmnet]/File[/etc/presto/ssl/discovery__an-test-presto1001_eqiad_wmnet-key.pem]) Skipping because of failed dependencies
Feb 13 15:22:33 an-test-presto1001 puppet-agent[3743]: (/Stage[main]/Main/Cfssl::Cert[discovery__an-test-presto1001_eqiad_wmnet]/File[/etc/presto/ssl/discovery__an-test-presto1001_eqiad_wmnet.csr]) Skipping because of failed dependencies
Feb 13 15:22:33 an-test-presto1001 puppet-agent[3743]: (/Stage[main]/Main/Cfssl::Cert[discovery__an-test-presto1001_eqiad_wmnet]/File[/etc/presto/ssl/discovery__an-test-presto1001_eqiad_wmnet.pem]) Skipping because of failed dependencies
Feb 13 15:22:33 an-test-presto1001 puppet-agent[3743]: (/Stage[main]/Main/Cfssl::Cert[discovery__an-test-presto1001_eqiad_wmnet]/File[/etc/presto/ssl]) Skipping because of failed dependencies
Feb 13 15:22:30 an-test-presto1001 puppet-agent[3743]: (/Stage[main]/Profile::Kerberos::Keytabs/File[/etc/security/keytabs/presto/presto.keytab]) Skipping because of failed dependencies
Feb 13 15:22:30 an-test-presto1001 puppet-agent[3743]: (/Stage[main]/Profile::Kerberos::Keytabs/File[/etc/security/keytabs/presto]) Skipping because of failed dependencies
Feb 13 15:22:30 an-test-presto1001 puppet-agent[3743]: (/Stage[main]/Presto::Server/File[/var/log/presto]) Skipping because of failed dependencies
Feb 13 15:22:30 an-test-presto1001 puppet-agent[3743]: (/Stage[main]/Presto::Server/File[/srv/presto/var/log]) Skipping because of failed dependencies
Feb 13 15:22:30 an-test-presto1001 puppet-agent[3743]: (/Stage[main]/Presto::Server/File[/srv/presto]) Skipping because of failed dependencies
Feb 13 15:22:30 an-test-presto1001 puppet-agent[3743]: (/Stage[main]/Presto::Server/User[presto]) Skipping because of failed dependencies
Feb 13 15:22:30 an-test-presto1001 puppet-agent[3743]: (/Stage[main]/Profile::Presto::Server/File[/usr/local/bin/presto]) Skipping because of failed dependencies
Feb 13 15:22:30 an-test-presto1001 puppet-agent[3743]: (/Stage[main]/Profile::Presto::Server/File[/usr/local/bin/presto]) Dependency Package[presto-server] has failures: true
Feb 13 15:22:30 an-test-presto1001 puppet-agent[3743]: (/Stage[main]/Presto::Server/Package[presto-server]/ensure) change from 'absent' to 'present' failed: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install presto-server' r>
Feb 13 15:22:30 an-test-presto1001 puppet-agent[3743]: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install presto-server' returned 100: E: dpkg was interrupted, you must manually run 'dpkg --configure -a' to correct the probl>
Feb 13 15:22:26 an-test-presto1001 puppet-agent[3743]: Applying configuration version '(2610014ec0) John Bond - postgresql::user: need to also include the escape in the final command'

Cookbook cookbooks.sre.ganeti.reimage started by nfraison@cumin1001 for host an-test-presto1001.eqiad.wmnet with OS bullseye completed:

  • an-test-presto1001 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/ganeti/reimage/202302131446_nfraison_627860_an-test-presto1001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed

And then this error:

Error: Cannot create /srv/presto/var/log; parent directory /srv/presto/var does not exist
Error: /Stage[main]/Presto::Server/File[/srv/presto/var/log]/ensure: change from 'absent' to 'directory' failed: Cannot create /srv/presto/var/log; parent directory /srv/presto/var does not exist

This is due to my recent change, which doesn't create the intermediate folder (I need to check how to do this in Puppet).
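The error arises because a Puppet file resource does not create missing parent directories, so each level has to be managed explicitly. The sketch below lists the directories that would each need their own resource; the helper name `directories_to_declare` is hypothetical and the follow-up patch may have solved this differently.

```python
from pathlib import PurePosixPath


def directories_to_declare(base: str, target: str) -> list[str]:
    """List every directory from base down to target, parents first."""
    base_path = PurePosixPath(base)
    # relative_to raises ValueError if target is not located under base.
    relative = PurePosixPath(target).relative_to(base_path)
    result = [str(base_path)]
    current = base_path
    for part in relative.parts:
        current = current / part
        result.append(str(current))
    return result


if __name__ == "__main__":
    # Paths taken from the Puppet error above.
    for directory in directories_to_declare("/srv/presto", "/srv/presto/var/log"):
        print(directory)
```

For the paths in the error, the middle entry /srv/presto/var is the one that was not being created.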

Change 888760 had a related patch set uploaded (by Nicolas Fraison; author: Nicolas Fraison):

[operations/puppet@production] fix(presto): create intermediate ${data_dir}/var fodler

https://gerrit.wikimedia.org/r/888760

Change 888760 merged by Nicolas Fraison:

[operations/puppet@production] fix(presto): create intermediate ${data_dir}/var fodler

https://gerrit.wikimedia.org/r/888760

The Presto service is up but fails to reach the Presto coordinator:

Feb 16 15:35:01 an-test-presto1001 presto-server[540653]: 2023-02-16T15:35:01.491Z        ERROR        Announcer-0        com.facebook.airlift.discovery.client.Announcer        Service announcement failed after 110.01ms. Next request will happen within 512.00ms

Looking at the network traces, I can see some errors due to:

TLSv1.2 Record Layer: Alert (Level: Fatal, Description: Certificate Unknown)

The newly created server.p12 file is missing the intermediate certificate.
It is the same on an-test-coord1001.

On the prod cluster, the intermediate certificate is present in the file. Also, the issuer has changed from issuer=CN = Puppet CA: palladium.eqiad.wmnet to issuer=C = US, L = San Francisco, O = "Wikimedia Foundation, Inc", OU = SRE Foundations, CN = discovery

Seems to be a bad new pattern :(

Manually creating the file with the intermediate certificate makes the service work.
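For illustration only, one way to build a PKCS#12 file that includes the intermediate certificate is `openssl pkcs12 -export` with `-certfile` pointing at the chain. The sketch below only assembles the argument list. The certificate paths are the ones that appear in the Puppet log above, while the helper name and the password file location are hypothetical. This is not necessarily what the manual fix or the follow-up patch did.

```python
import shlex


def pkcs12_export_argv(cert: str, key: str, chain: str, out: str, password_file: str) -> list[str]:
    """Build an openssl command that bundles cert, key and chain into one .p12 file."""
    return [
        "openssl", "pkcs12", "-export",
        "-in", cert,            # leaf certificate
        "-inkey", key,          # private key matching the leaf certificate
        "-certfile", chain,     # extra certificates to include (the intermediate)
        "-out", out,
        "-passout", f"file:{password_file}",  # read the export password from a file
    ]


if __name__ == "__main__":
    base = "/etc/presto/ssl/discovery__an-test-presto1001_eqiad_wmnet"
    argv = pkcs12_export_argv(
        cert=f"{base}.pem",
        key=f"{base}-key.pem",
        chain=f"{base}.chain.pem",
        out="/etc/presto/ssl/server.p12",
        password_file="/path/to/keystore-password",  # hypothetical location
    )
    print(shlex.join(argv))
```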

Change 889822 had a related patch set uploaded (by Nicolas Fraison; author: Nicolas Fraison):

[operations/puppet@production] fix(presto): create pkcs12 server file with intermediate certificate

https://gerrit.wikimedia.org/r/889822

Change 889822 merged by Nicolas Fraison:

[operations/puppet@production] fix(presto): create pkcs12 server file with intermediate certificate

https://gerrit.wikimedia.org/r/889822

Mentioned in SAL (#wikimedia-analytics) [2023-02-20T13:11:01Z] <nfraison> Reimage an-presto1001 to upgrade to bullseye T329361

Cookbook cookbooks.sre.hosts.reimage was started by nfraison@cumin1001 for host an-presto1001.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by nfraison@cumin1001 for host an-presto1001.eqiad.wmnet with OS bullseye executed with errors:

  • an-presto1001 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by nfraison@cumin1001 for host an-presto1001.eqiad.wmnet with OS bullseye

The reboot of an-presto1001.eqiad.wmnet seems stuck; nothing is displayed on the IPMI console.
Forcing a manual reset of the node:

ipmitool -I lanplus -H 'an-presto1001.mgmt.eqiad.wmnet' -U root -E power reset

It finally failed; see the previous message from the cookbook.

Cookbook cookbooks.sre.hosts.reimage started by nfraison@cumin1001 for host an-presto1001.eqiad.wmnet with OS bullseye executed with errors:

  • an-presto1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Again stuck on [138/240, retrying in 10.00s] Attempt to run 'spicerack.remote.RemoteHosts.wait_reboot_since' raised: Unable to get uptime for an-presto1001.eqiad.wmnet

This time even a reset doesn't seem to unstick things.

The node starts up to the BIOS load and then nothing happens:

Enumerating Boot options...
Enumerating Boot options... Done
Lifecycle Controller: Collecting System Inventory...
...

And then nothing...

I checked that PXE boot is correctly requested (confirmed by the log below):

IPMI: Boot to PXE Requested

But still the same behavior.

Cookbook cookbooks.sre.hosts.reimage was started by nfraison@cumin1001 for host an-presto1001.eqiad.wmnet with OS bullseye

DHCP configuration looks fine

root@install1004:/etc/dhcp/automation/proxies# cat ttyS0-115200.conf
# Automatically generated by dhcpincludes for /etc/dhcp/automation/ttyS0-115200
root@install1004:/etc/dhcp/automation/proxies# cat ttyS1-115200.conf
# Automatically generated by dhcpincludes for /etc/dhcp/automation/ttyS1-115200
include "/etc/dhcp/automation/ttyS1-115200/an-presto1001.conf";
root@install1004:/etc/dhcp/automation/proxies# cat /etc/dhcp/automation/ttyS1-115200/an-presto1001.conf

host an-presto1001 {
    host-identifier option agent.circuit-id "asw2-d-eqiad:xe-2/0/16.0:analytics1-d-eqiad";
    fixed-address 10.64.53.39;
    option pxelinux.pathprefix "http://apt.wikimedia.org/tftpboot/bullseye-installer/";
}

Retrying the reimage.

This time the console moves forward, but the Debian installer is stuck on some missing firmware:

┌───────────────────┤ [!] Detect network hardware ├────────────────────┐    
    │                                                                      │    
    │ Some of your hardware needs non-free firmware files to operate. The  │    
    │ firmware can be loaded from removable media, such as a USB stick or  │    
    │ floppy.                                                              │    
    │                                                                      │    
    │ The missing firmware files are: bnx2x/bnx2x-e2-7.13.21.0.fw          │    
    │ bnx2x/bnx2x-e2-7.13.21.0.fw                                          │    
    │                                                                      │    
    │ If you have such media available now, insert it, and continue.       │    
    │                                                                      │    
    │ Load missing firmware from removable media?                          │    
    │                                                                      │    
    │     <Yes>                                                   <No>     │

Cookbook cookbooks.sre.hosts.reimage started by nfraison@cumin1001 for host an-presto1001.eqiad.wmnet with OS bullseye executed with errors:

  • an-presto1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Looking at https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1006500, it seems that this firmware is missing from the bullseye setup.

Looking at our bullseye-installer on install1004, we are indeed relying on https://debian.pkgs.org/11/debian-nonfree-arm64/firmware-bnx2x_20210315-3_all.deb.html, which doesn't contain this version of the firmware:

nfraison@pop-os:~/Downloads/test2$ ls -al lib/firmware/bnx2x/bnx2x-e*
-rw-r--r-- 1 nfraison nfraison 161368 Jul 25  2021 lib/firmware/bnx2x/bnx2x-e1-7.0.29.0.fw
-rw-r--r-- 1 nfraison nfraison 164392 Jul 25  2021 lib/firmware/bnx2x/bnx2x-e1-7.10.51.0.fw
-rw-r--r-- 1 nfraison nfraison 170192 Jul 25  2021 lib/firmware/bnx2x/bnx2x-e1-7.12.30.0.fw
-rw-r--r-- 1 nfraison nfraison 170096 Jul 25  2021 lib/firmware/bnx2x/bnx2x-e1-7.13.1.0.fw
-rw-r--r-- 1 nfraison nfraison 170168 Jul 25  2021 lib/firmware/bnx2x/bnx2x-e1-7.13.15.0.fw
-rw-r--r-- 1 nfraison nfraison 163592 Jul 25  2021 lib/firmware/bnx2x/bnx2x-e1-7.8.19.0.fw
-rw-r--r-- 1 nfraison nfraison 168680 Jul 25  2021 lib/firmware/bnx2x/bnx2x-e1h-7.0.29.0.fw
-rw-r--r-- 1 nfraison nfraison 173016 Jul 25  2021 lib/firmware/bnx2x/bnx2x-e1h-7.10.51.0.fw
-rw-r--r-- 1 nfraison nfraison 178984 Jul 25  2021 lib/firmware/bnx2x/bnx2x-e1h-7.12.30.0.fw
-rw-r--r-- 1 nfraison nfraison 178992 Jul 25  2021 lib/firmware/bnx2x/bnx2x-e1h-7.13.1.0.fw
-rw-r--r-- 1 nfraison nfraison 178608 Jul 25  2021 lib/firmware/bnx2x/bnx2x-e1h-7.13.15.0.fw
-rw-r--r-- 1 nfraison nfraison 171920 Jul 25  2021 lib/firmware/bnx2x/bnx2x-e1h-7.8.19.0.fw
-rw-r--r-- 1 nfraison nfraison 289848 Jul 25  2021 lib/firmware/bnx2x/bnx2x-e2-7.0.29.0.fw
-rw-r--r-- 1 nfraison nfraison 321456 Jul 25  2021 lib/firmware/bnx2x/bnx2x-e2-7.10.51.0.fw
-rw-r--r-- 1 nfraison nfraison 321320 Jul 25  2021 lib/firmware/bnx2x/bnx2x-e2-7.12.30.0.fw
-rw-r--r-- 1 nfraison nfraison 320936 Jul 25  2021 lib/firmware/bnx2x/bnx2x-e2-7.13.1.0.fw
-rw-r--r-- 1 nfraison nfraison 323360 Jul 25  2021 lib/firmware/bnx2x/bnx2x-e2-7.13.15.0.fw
-rw-r--r-- 1 nfraison nfraison 310440 Jul 25  2021 lib/firmware/bnx2x/bnx2x-e2-7.8.19.0.fw
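The comparison between the file requested by the installer and the files shipped in the package can be written down as a small check. The version list below is copied from the e2 entries in the listing above; the function name `missing_firmware` is hypothetical.

```python
# e2 firmware versions present in the extracted package, per the listing above.
PACKAGED_E2_VERSIONS = ["7.0.29.0", "7.10.51.0", "7.12.30.0", "7.13.1.0", "7.13.15.0", "7.8.19.0"]
PACKAGED = [f"bnx2x/bnx2x-e2-{version}.fw" for version in PACKAGED_E2_VERSIONS]


def missing_firmware(required: list[str], available: list[str]) -> list[str]:
    """Return the required firmware files that are not in the available list."""
    available_set = set(available)
    return [name for name in required if name not in available_set]


if __name__ == "__main__":
    # File name taken from the installer prompt.
    print(missing_firmware(["bnx2x/bnx2x-e2-7.13.21.0.fw"], PACKAGED))
```

The newest e2 firmware in the package is 7.13.15.0, so the 7.13.21.0 file requested by the installer is reported as missing.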

According to https://phabricator.wikimedia.org/T308106, this is a known issue that requires a manual acknowledgement on the host.

Cookbook cookbooks.sre.hosts.reimage was started by nfraison@cumin1001 for host an-presto1001.eqiad.wmnet with OS bullseye

Reimage relaunched and the prompt manually acknowledged.

But blocked again due to:

┌──────────────────────┤ [!] Partition disks ├───────────────────────┐─┐   
  │  │                                                                    │ │   
  │  │                        95.3 GB is too small                        │ │   
  │  │ You asked for 95.3 GB to be used for guided partitioning, but the  │ │   
  │ C│ selected partitioning recipe requires at least 300.0 GB.           │ │   
  │  │                                                                    │ │   
  └──│     <Go Back>                                       <Continue>

This seems to be linked to commit ac64acc642e76155f683975c68046a80a81c21f5, which moved both new and old presto nodes to partman/custom/kafka-jumbo.cfg instead of partman/custom/cloudvirtan.cfg.

Before

modules/install_server/files/autoinstall/partman/custom/cloudvirtan.cfg
# Configuration to create:
# Hardware RAID1 on 2 SFF drives in flex bays mounted at /dev/sda
# 1G on /boot outside of LVM
# LVM volume of 95% remainder of sda is /
# Hardware RAID10 on 12 LFF 4TB SATA disks mounted at /dev/sdb
# 95% of sdb allocated as /srv

# remove any LVM already on the disks
d-i    partman-lvm/device_remove_lvm   boolean true

# We'll be creating LVMs and partitioning disks SDA and SDB
d-i    partman-auto/method     string  lvm
d-i    partman-auto/disk       string  /dev/sda /dev/sdb

# setup a /boot partition of 1GB outside of the LVM
d-i    partman-auto/expert_recipe      string  lvm ::  \
               1000 2000 1000 ext4     \
                               $primary{ }             \
                               $bootable{ }    \
                               method{ format }        \
                               format{ }               \
                               use_filesystem{ }       \
                               filesystem{ ext4 }      \
                               mountpoint{ /boot }     \
                               device { /dev/sda }     \

               .       \
# setup the / filesystem within the LVM filling the 95% of the remaining space
               80000 1000 -1 ext4      \
                               method{ format }        \
                               format{ }               \
                               use_filesystem{ }       \
                               filesystem{ ext4 }      \
                               lv_name{ root }         \
                               $defaultignore{ }       \
                               $lvmok{ }               \
                               mountpoint{ / } \
                               device { /dev/sda }     \
               .       \
# setup the SDB disk with a single LVM at 95% of the disk, and a mount in xfs for /srv
                       100000 1000 -1 xfs              \
                               method{ format }        \
                               format{ }               \
                               use_filesystem{ }       \
                               filesystem{ xfs }       \
                               lv_name{ srv }          \
                               $defaultignore{ }       \
                               $lvmok{ }               \
                               mountpoint{ /var/lib/nova/instances }   \
                               device { /dev/sdb }     \


               .

d-i partman-auto-lvm/guided_size          string  95%
d-i partman/confirm_write_new_label       boolean true
d-i partman/choose_partition              select  finish
d-i partman/confirm                       boolean true
d-i partman/confirm_nooverwrite           boolean true
d-i partman-md/confirm                    boolean true
d-i partman-md/confirm_nooverwrite        boolean true
d-i partman-lvm/confirm                   boolean true
d-i partman-lvm/confirm_nooverwrite       boolean true

partman-basicfilesystems partman-basicfilesystems/no_swap boolean false

Now

# configuration:
#  * hardware raid on kafka-jumbo hosts
#  * sda hw raid1 (Flex Bay): 2 * 1TB / 2 * 500GB
#  * sdb hw raid10: 12 * 4TB
#
# * GPT partitions:
#   - boot 300MB (biosgrub type, see below)
#   - LVM
#   - /:    ext4, max of /dev/sda (varies across hosts)
#   - /srv: ext4, max of /dev/sdb
#
# The GPT biosgrub partition is made 300MB to future-proof it for EFI: in that
# case the partition is large enough to be turned into the ESP without touching
# GPT partition sizes. Also 300MB is big enough to work on 4k sector disks and FAT.

d-i	partman-auto/method	string	lvm
d-i	partman-auto/disk	string	/dev/sda /dev/sdb
d-i	partman-auto-lvm/guided_size	string	80%

# the install makes sure we want to wipe the lvm
d-i	partman-lvm/device_remove_lvm	boolean	true
d-i	partman-lvm/confirm	boolean	true
d-i	partman-lvm/confirm_nooverwrite	boolean	true
d-i	partman/confirm	boolean	true
d-i	partman-auto-lvm/no_boot	boolean	true

# Force GPT
d-i	partman-basicfilesystems/choose_label	string	gpt
d-i	partman-basicfilesystems/default_label	string	gpt
d-i	partman-partitioning/choose_label	string	gpt
d-i	partman-partitioning/default_label	string	gpt
d-i	partman/choose_label			string	gpt
d-i	partman/default_label			string	gpt

d-i	partman-auto/choose_recipe	lvm

d-i     partman-auto/expert_recipe	string	\
		lvm ::							\
		300 300 300 grub		        \
			$primary{ }	             	\
			method{ biosgrub }	        \
		.				                \
		100000 300000 -1 ext4	        \
			$defaultignore{ }			\
			$primary{ }					\
			method{ lvm }				\
			device{ /dev/sda }			\
			vg_name{ vg0 }				\
		.								\
		500000 300 -1 ext4				\
			$defaultignore{ }			\
			$primary{ }					\
			method{ lvm }				\
			device{ /dev/sdb }			\
			vg_name{ vg1 }				\
		.								\
		300000 900000 -1 ext4    		\
			$lvmok{ }					\
			method{ format }			\
			format{ }					\
			use_filesystem{ }			\
			filesystem{ ext4 }			\
			mountpoint{ / }				\
			in_vg{ vg0 }				\
			lv_name{ root }				\
		.								\
		6000000 21000000 24000000 ext4	\
			$lvmok{ }					\
			method{ format }			\
			format{ }					\
			use_filesystem{ }			\
			filesystem{ ext4 }			\
			mountpoint{ /srv }			\
			in_vg{ vg1 }				\
			lv_name{ srv }				\
		.

d-i	partman/choose_partition	\
		select	finish
d-i	partman-partitioning/confirm_write_new_label	boolean	true

d-i	partman/confirm_nooverwrite	boolean	true
partman-basicfilesystems partman-basicfilesystems/no_swap boolean false

# do not prompt for 'no filesystem on partition'
d-i	partman-basicmethods/method_only	boolean false
d-i	partman-basicfilesystems/no_mount_point boolean false
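One reading of the installer error, offered as an assumption rather than a confirmed diagnosis, is that the 300.0 GB in the message corresponds to the 300000 MB minimum of the root volume in the kafka-jumbo recipe, whereas the cloudvirtan recipe only asked for a minimum of 80000 MB. The sketch below compares those minimums with the 95.3 GB reported by the installer. The function name is hypothetical and sizes are treated as decimal megabytes.

```python
def volumes_too_large(guided_gb: float, minimums_mb: dict[str, int]) -> list[str]:
    """Return the recipe names whose minimum size exceeds the guided size."""
    guided_mb = guided_gb * 1000  # assumption: decimal units, 1 GB = 1000 MB
    return [name for name, minimum in minimums_mb.items() if minimum > guided_mb]


if __name__ == "__main__":
    # Minimum size of the root volume in each recipe, first number of the recipe line.
    root_minimums = {
        "cloudvirtan.cfg": 80000,   # "80000 1000 -1 ext4"
        "kafka-jumbo.cfg": 300000,  # "300000 900000 -1 ext4"
    }
    # 95.3 GB is the guided size reported by the installer dialog.
    print(volumes_too_large(95.3, root_minimums))
```

Under these assumptions only the kafka-jumbo recipe exceeds the available space, which would be consistent with the reimage working before the recipe change.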

Cookbook cookbooks.sre.hosts.reimage started by nfraison@cumin1001 for host an-presto1001.eqiad.wmnet with OS bullseye executed with errors:

  • an-presto1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Change 890488 had a related patch set uploaded (by Nicolas Fraison; author: Nicolas Fraison):

[operations/puppet@production] kafka-jumbo: reduce min size of root partition

https://gerrit.wikimedia.org/r/890488

Change 890488 merged by Nicolas Fraison:

[operations/puppet@production] netboot: create dedicated partman recipe for presto workers

https://gerrit.wikimedia.org/r/890488

Cookbook cookbooks.sre.hosts.reimage was started by nfraison@cumin1001 for host an-presto1001.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by nfraison@cumin1001 for host an-presto1001.eqiad.wmnet with OS bullseye completed:

  • an-presto1001 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202302211050_nfraison_2093684_an-presto1001.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by nfraison@cumin1001 for host an-presto1002.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by nfraison@cumin1001 for host an-presto1002.eqiad.wmnet with OS bullseye executed with errors:

  • an-presto1002 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202302211250_nfraison_2422524_an-presto1002.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by nfraison@cumin1001 for host an-presto1003.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by nfraison@cumin1001 for host an-presto1003.eqiad.wmnet with OS bullseye completed:

  • an-presto1003 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202302211615_nfraison_2622550_an-presto1003.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by nfraison@cumin1001 for host an-presto1004.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by nfraison@cumin1001 for host an-presto1004.eqiad.wmnet with OS bullseye executed with errors:

  • an-presto1004 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202302220817_nfraison_2879932_an-presto1004.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by nfraison@cumin1001 for host an-presto1005.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by nfraison@cumin1001 for host an-presto1005.eqiad.wmnet with OS bullseye completed:

  • an-presto1005 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202302220936_nfraison_2899978_an-presto1005.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB