cloudgw: upgrade servers to Debian 11 Bullseye
Closed, Resolved, Public

Description

They are currently running Debian 10 Buster.

  • cloudgw2001-dev.codfw.wmnet
  • cloudgw2002-dev.codfw.wmnet
  • cloudgw1001.eqiad.wmnet
  • cloudgw1002.eqiad.wmnet

Event Timeline

Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin2002 for host cloudgw2001-dev.codfw.wmnet with OS bullseye

aborrero changed the task status from Open to In Progress. Mar 24 2022, 12:54 PM
aborrero triaged this task as Low priority.
aborrero moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.

Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin2002 for host cloudgw2001-dev.codfw.wmnet with OS bullseye executed with errors:

  • cloudgw2001-dev (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202203241253_aborrero_2900873_cloudgw2001-dev.out
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin2002 for host cloudgw2002-dev.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin2002 for host cloudgw2002-dev.codfw.wmnet with OS bullseye completed:

  • cloudgw2002-dev (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202203241341_aborrero_2907817_cloudgw2002-dev.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change 773585 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] keepalived: use version from bullseye-bpo

https://gerrit.wikimedia.org/r/773585

Change 773586 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudgw: don't install kernel or nft from backports

https://gerrit.wikimedia.org/r/773586

I'll wait a few more days before upgrading the eqiad servers, to run a few more tests, merge these patches, etc.

Change 773585 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] keepalived: use version from bullseye-bpo

https://gerrit.wikimedia.org/r/773585

Change 773586 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudgw: don't install kernel or nft from backports

https://gerrit.wikimedia.org/r/773586

Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudgw1001.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudgw1001.eqiad.wmnet with OS bullseye executed with errors:

  • cloudgw1001 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudgw1001.eqiad.wmnet with OS bullseye

Change 777313 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudgw1001: hieradata: refresh NIC names

https://gerrit.wikimedia.org/r/777313

Change 777313 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudgw1001: hieradata: refresh NIC names

https://gerrit.wikimedia.org/r/777313

The reimage resulted in new NIC names for cloudgw :-( The new names are longer, and the VLAN tag can no longer be appended to them.
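A minimal sketch of why a longer NIC name might stop accepting a VLAN suffix, assuming the cause is the Linux kernel's interface name limit (IFNAMSIZ is 16 bytes including the trailing NUL, so 15 usable characters). The task does not state the actual new NIC names, so `enp101s0f1np1` below is a hypothetical stand-in; `dataplane` and the VLAN IDs 1107 and 1120 are the names that appear later in this task.

```python
# Linux IFNAMSIZ is 16 bytes including the trailing NUL: 15 usable characters.
MAX_IFNAME_LEN = 15


def vlan_ifname(base: str, vlan_id: int) -> str:
    """Build a VLAN sub-interface name in the '<base>.<vlan>' style."""
    return f"{base}.{vlan_id}"


def fits(name: str) -> bool:
    """Return True if the name fits within the kernel's interface name limit."""
    return len(name.encode()) <= MAX_IFNAME_LEN


def report(bases, vlan_ids):
    """Return (name, length, fits) tuples for every base/VLAN combination."""
    rows = []
    for base in bases:
        for vlan_id in vlan_ids:
            name = vlan_ifname(base, vlan_id)
            rows.append((name, len(name), fits(name)))
    return rows


# 'enp101s0f1np1' is a hypothetical bullseye-style name, not taken from the task.
for name, length, ok in report(["enp101s0f1np1", "dataplane"], [1107, 1120]):
    print(f"{name:<20} {length:>2} chars  {'ok' if ok else 'too long'}")
```

Under that assumption, giving the NIC a short custom name such as `dataplane` leaves room for a four-digit VLAN tag.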

Change 777333 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudgw1001: use a custom name for the dataplane NIC

https://gerrit.wikimedia.org/r/777333

Change 777333 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudgw1001: use a custom name for the dataplane NIC

https://gerrit.wikimedia.org/r/777333

Change 777335 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudgw: conntrackd: refresh NIC name

https://gerrit.wikimedia.org/r/777335

Change 777335 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudgw: conntrackd: refresh NIC name

https://gerrit.wikimedia.org/r/777335

Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudgw1001.eqiad.wmnet with OS bullseye executed with errors:

  • cloudgw1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202204050834_aborrero_3071366_cloudgw1001.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudgw1001.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudgw1001.eqiad.wmnet with OS bullseye completed:

  • cloudgw1001 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202204051030_aborrero_3141213_cloudgw1001.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

aborrero@cloudgw1001:~ 4 $ sudo systemctl status systemd-sysctl
● systemd-sysctl.service - Apply Kernel Variables
     Loaded: loaded (/lib/systemd/system/systemd-sysctl.service; static)
     Active: active (exited) since Tue 2022-04-05 12:16:43 UTC; 10min ago
       Docs: man:systemd-sysctl.service(8)
             man:sysctl.d(5)
    Process: 434 ExecStart=/lib/systemd/systemd-sysctl (code=exited, status=0/SUCCESS)
   Main PID: 434 (code=exited, status=0/SUCCESS)
        CPU: 15ms

Apr 05 12:16:43 cloudgw1001 systemd[1]: Starting Apply Kernel Variables...
Apr 05 12:16:43 cloudgw1001 systemd-sysctl[434]: Couldn't write '1' to 'net/ipv4/conf/dataplane.1107/forwarding', ignoring: No such file or directory
Apr 05 12:16:43 cloudgw1001 systemd-sysctl[434]: Couldn't write '0' to 'net/ipv4/conf/dataplane.1107/rp_filter', ignoring: No such file or directory
Apr 05 12:16:43 cloudgw1001 systemd-sysctl[434]: Couldn't write '1' to 'net/ipv4/conf/dataplane.1120/forwarding', ignoring: No such file or directory
Apr 05 12:16:43 cloudgw1001 systemd-sysctl[434]: Couldn't write '0' to 'net/ipv4/conf/dataplane.1120/rp_filter', ignoring: No such file or directory
Apr 05 12:16:43 cloudgw1001 systemd-sysctl[434]: Couldn't write '0' to 'net/ipv6/conf/dataplane.1107/accept_ra', ignoring: No such file or directory
Apr 05 12:16:43 cloudgw1001 systemd-sysctl[434]: Couldn't write '1' to 'net/ipv6/conf/dataplane.1107/forwarding', ignoring: No such file or directory
Apr 05 12:16:43 cloudgw1001 systemd-sysctl[434]: Couldn't write '0' to 'net/ipv6/conf/dataplane.1120/accept_ra', ignoring: No such file or directory
Apr 05 12:16:43 cloudgw1001 systemd-sysctl[434]: Couldn't write '1' to 'net/ipv6/conf/dataplane.1120/forwarding', ignoring: No such file or directory
Apr 05 12:16:43 cloudgw1001 systemd[1]: Finished Apply Kernel Variables.

This is likely a race condition: systemd-sysctl runs before the NIC gets renamed, so the per-interface sysctl settings can't be applied.
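To see at a glance which interfaces were missing when systemd-sysctl ran, the "Couldn't write" lines can be grouped by interface. This is only a rough sketch: the regular expression is modeled on the log lines pasted above, and the helper names (`failed_sysctls`, `group_by_interface`) are made up for illustration.

```python
import re
from collections import defaultdict

# Pattern modeled on the journal lines shown in this task; other systemd
# versions may word the message differently.
PATTERN = re.compile(
    r"Couldn't write '(?P<value>[^']*)' to '(?P<key>[^']+)', ignoring: (?P<reason>.+)$"
)


def failed_sysctls(log_text):
    """Return (key, value, reason) for every sysctl write that was skipped."""
    failures = []
    for line in log_text.splitlines():
        match = PATTERN.search(line)
        if match:
            failures.append((match["key"], match["value"], match["reason"].strip()))
    return failures


def group_by_interface(failures):
    """Group 'net/<family>/conf/<iface>/<setting>' keys by interface name."""
    by_iface = defaultdict(list)
    for key, value, _reason in failures:
        parts = key.split("/")
        if len(parts) == 5 and parts[0] == "net" and parts[2] == "conf":
            by_iface[parts[3]].append(f"{parts[1]}.{parts[4]}={value}")
    return dict(by_iface)


# Two lines copied from the output above.
SAMPLE = """\
Apr 05 12:16:43 cloudgw1001 systemd-sysctl[434]: Couldn't write '1' to 'net/ipv4/conf/dataplane.1107/forwarding', ignoring: No such file or directory
Apr 05 12:16:43 cloudgw1001 systemd-sysctl[434]: Couldn't write '0' to 'net/ipv6/conf/dataplane.1120/accept_ra', ignoring: No such file or directory
"""

for iface, settings in group_by_interface(failed_sysctls(SAMPLE)).items():
    print(iface, settings)
```

The real input would come from something like the `systemctl status` output above, saved to a file or piped in.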

Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudgw1001.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudgw1001.eqiad.wmnet with OS bullseye completed:

  • cloudgw1001 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202204060806_aborrero_3616294_cloudgw1001.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudgw1002.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudgw1002.eqiad.wmnet with OS bullseye executed with errors:

  • cloudgw1002 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudgw1002.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudgw1002.eqiad.wmnet with OS bullseye executed with errors:

  • cloudgw1002 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudgw1002.eqiad.wmnet with OS bullseye

Change 777763 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudgw1002: rename interface names

https://gerrit.wikimedia.org/r/777763

Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudgw1002.eqiad.wmnet with OS bullseye executed with errors:

  • cloudgw1002 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202204060951_aborrero_3667008_cloudgw1002.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • The reimage failed, see the cookbook logs for the details

Change 777763 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudgw1002: rename interface names

https://gerrit.wikimedia.org/r/777763

Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudgw1002.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudgw1002.eqiad.wmnet with OS bullseye completed:

  • cloudgw1002 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202204061123_aborrero_3754913_cloudgw1002.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB