
Migrate an-web1001 to Debian bullseye
Closed, Resolved (Public)

Description

We currently run two websites on an-web1001.eqiad.wmnet.

We need to upgrade this server to Debian bullseye.

We cannot yet use bookworm because we need this server to be an HDFS client, so it requires our bigtop packages for bullseye.

Event Timeline

Gehel triaged this task as High priority. Nov 15 2023, 9:45 AM

Change 997798 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Configure reuse-parts for the analytics webserver

https://gerrit.wikimedia.org/r/997798

BTullis renamed this task from Migrate an-web1001 to Debian bullseye (or bookworm) to Migrate an-web1001 to Debian bullseye. Feb 6 2024, 10:48 AM
BTullis updated the task description.
BTullis updated the task description.

Change 997798 merged by Btullis:

[operations/puppet@production] Configure reuse-parts for the analytics webserver

https://gerrit.wikimedia.org/r/997798

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host an-web1001.eqiad.wmnet with OS bullseye

Mentioned in SAL (#wikimedia-analytics) [2024-02-06T11:03:27Z] <btullis> reimaging an-web1001 to bullseye for T349398

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host an-web1001.eqiad.wmnet with OS bullseye executed with errors:

  • an-web1001 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202402061143_btullis_2404627_an-web1001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • The reimage failed, see the cookbook logs for the details

Change 997810 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Fix the reuse-analytics-raid1-2dev partman recipe

https://gerrit.wikimedia.org/r/997810

Change 997810 merged by Btullis:

[operations/puppet@production] Fix the reuse-analytics-raid1-2dev partman recipe

https://gerrit.wikimedia.org/r/997810

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host an-web1001.eqiad.wmnet with OS bullseye

Change 997812 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Fix the reuse-analytics-raid1-2dev recipe

https://gerrit.wikimedia.org/r/997812

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host an-web1001.eqiad.wmnet with OS bullseye executed with errors:

  • an-web1001 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details
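
For comparing runs like the two failed attempts above, a small script can pull the host status and the last step reached out of a pasted cookbook summary. This is only a sketch: `parse_summary` and `last_completed_step` are hypothetical helpers written for this note, not part of the cookbook or spicerack, and they assume the bullet layout shown in this task.

```python
import re
from typing import NamedTuple, Optional


class ReimageSummary(NamedTuple):
    host: str
    status: str
    steps: list[str]


# Assumed layout: a host bullet such as "  • an-web1001 (FAIL)",
# followed by one bullet per step.
HOST_LINE = re.compile(r"^\s*•\s*(?P<host>\S+)\s+\((?P<status>PASS|FAIL)\)\s*$")
STEP_LINE = re.compile(r"^\s*•\s*(?P<step>.+?)\s*$")


def parse_summary(text: str) -> ReimageSummary:
    """Parse one pasted summary into host, status and ordered steps."""
    host = None
    status = None
    steps = []
    for line in text.splitlines():
        host_match = HOST_LINE.match(line)
        if host_match and host is None:
            host, status = host_match["host"], host_match["status"]
            continue
        step_match = STEP_LINE.match(line)
        if step_match and host is not None:
            steps.append(step_match["step"])
    if host is None:
        raise ValueError("no host line found in summary")
    return ReimageSummary(host, status, steps)


def last_completed_step(summary: ReimageSummary) -> Optional[str]:
    """Return the last real step, skipping the generic failure notice."""
    done = [s for s in summary.steps if not s.startswith("The reimage failed")]
    return done[-1] if done else None


sample = """\
  • an-web1001 (FAIL)
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details
"""
summary = parse_summary(sample)
print(summary.host, summary.status, last_completed_step(summary))
```

Run against the second attempt, this reports "Checked BIOS boot parameters are back to normal" as the last step reached, so that run stopped before "Host up (new fresh bullseye OS)". The cookbook logs remain the authoritative source for the actual failure.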

Change 997812 merged by Btullis:

[operations/puppet@production] Fix the reuse-analytics-raid1-2dev recipe

https://gerrit.wikimedia.org/r/997812

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host an-web1001.eqiad.wmnet with OS bullseye

Doing an in-place reimage here means that analytics.wm.o and stats.wm.o are currently down. Is there a reason why this could not be done by creating a new VM and then switching traffic to it?

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host an-web1001.eqiad.wmnet with OS bullseye completed:

  • an-web1001 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202402061336_btullis_2444619_an-web1001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Doing an in-place reimage here means that analytics.wm.o and stats.wm.o are currently down. Is there a reason why this could not be done by creating a new VM and then switching traffic to it?

It's a bare metal host. Apologies for the extended downtime. Everything should be back now. I had some issues with the reuse recipe for partman.