Upgrade Airflow instances to Bullseye
Closed, Resolved (Public)

Description

Upgrade all Apache-Airflow instances to Bullseye.

  • an-test-client1002 - analytics_test
  • an-airflow1002 - research
  • an-airflow1004 - platform_eng
  • an-airflow1005 - search
  • an-airflow1006 - analytics_product
  • an-airflow1007 - wmde

I won't include an-launcher1002 - analytics because the upgrade for that instance is already covered by T332580: Upgrade an-launcher1002 to bullseye.

Event Timeline

Gehel triaged this task as High priority. (Nov 22 2023, 9:50 AM)
Gehel moved this task from Misc to Ready for Work on the Data-Platform-SRE board.
BTullis updated the task description.
BTullis removed a subscriber: Stevemunene.

I'm proposing to start with an-airflow1007, since it appears not to be used for anything yet. I have checked with the users in the #wmf-wmde Slack channel.

Mentioned in SAL (#wikimedia-analytics) [2024-01-29T10:46:47Z] <btullis> upgrading an-airflow1007 to bullseye for T335261

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host an-airflow1007.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host an-airflow1007.eqiad.wmnet with OS bullseye completed:

  • an-airflow1007 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Set boot media to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202401291105_btullis_736712_an-airflow1007.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
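
The first-run log path reported above appears to encode the start time, the user, a numeric run identifier and the host name. A minimal sketch for pulling those fields out when collating several reimage runs, assuming the file name follows `<YYYYMMDDHHMM>_<user>_<number>_<host>.out` (the layout is inferred from the paths in this task rather than from Spicerack documentation, and `ReimageLog`, `parse_reimage_log` and `run_id` are hypothetical names):

```python
import re
from datetime import datetime
from typing import NamedTuple


class ReimageLog(NamedTuple):
    """Fields inferred from a reimage log file name (hypothetical structure)."""
    started: datetime
    user: str
    run_id: int  # assumption: meaning of this number is not stated in the task
    host: str


# Assumed layout: <YYYYMMDDHHMM>_<user>_<number>_<host>.out
_LOG_NAME = re.compile(
    r"(?P<stamp>\d{12})_(?P<user>[^_/]+)_(?P<run_id>\d+)_(?P<host>[^/]+)\.out$"
)


def parse_reimage_log(path: str) -> ReimageLog:
    """Split a reimage log path into start time, user, run number and host."""
    match = _LOG_NAME.search(path)
    if match is None:
        raise ValueError(f"unrecognised reimage log name: {path!r}")
    return ReimageLog(
        started=datetime.strptime(match["stamp"], "%Y%m%d%H%M"),
        user=match["user"],
        run_id=int(match["run_id"]),
        host=match["host"],
    )


if __name__ == "__main__":
    example = (
        "/var/log/spicerack/sre/hosts/reimage/"
        "202401291105_btullis_736712_an-airflow1007.out"
    )
    print(parse_reimage_log(example))
```

The same function should accept the log paths reported for the other hosts in this task, since they share the naming pattern.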

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host an-airflow1006.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host an-airflow1006.eqiad.wmnet with OS bullseye completed:

  • an-airflow1006 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Set boot media to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202401291340_btullis_760963_an-airflow1006.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change 993727 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] airflow/analytics_product: Keep Python 2

https://gerrit.wikimedia.org/r/993727

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host an-airflow1005.eqiad.wmnet with OS bullseye

Mentioned in SAL (#wikimedia-analytics) [2024-01-30T10:17:41Z] <btullis> upgrading an-airflow1005 (search) to bullseye for T335261

Change 993727 merged by Muehlenhoff:

[operations/puppet@production] airflow: Keep Python 2 due to Hive

https://gerrit.wikimedia.org/r/993727

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host an-airflow1005.eqiad.wmnet with OS bullseye completed:

  • an-airflow1005 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Set boot media to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202401301032_btullis_946556_an-airflow1005.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host an-airflow1004.eqiad.wmnet with OS bullseye

Mentioned in SAL (#wikimedia-analytics) [2024-02-02T09:46:32Z] <btullis> reimaging an-airflow1004 to bullseye for T335261

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host an-airflow1004.eqiad.wmnet with OS bullseye completed:

  • an-airflow1004 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Set boot media to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202402021001_btullis_1591657_an-airflow1004.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host an-airflow1002.eqiad.wmnet with OS bullseye

Mentioned in SAL (#wikimedia-analytics) [2024-02-02T10:27:33Z] <btullis> reimaging an-airflow10042to bullseye for T335261

Mentioned in SAL (#wikimedia-analytics) [2024-02-02T10:27:48Z] <btullis> correction: reimaging an-airflow1002 to bullseye for T335261

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host an-airflow1002.eqiad.wmnet with OS bullseye completed:

  • an-airflow1002 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Set boot media to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202402021044_btullis_1598578_an-airflow1002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
BTullis updated the task description.