
esams text cp nvme upgrade
Closed, ResolvedPublic

Description

This task will track the planning and execution of the NVMe upgrade to all (8) cp text hosts in esams.

These hosts are as follows: cp3066 cp3067 cp3068 cp3069 cp3070 cp3071 cp3072 cp3073

The SSDs were ordered and have arrived via T344768.

A proposal of the work window and type is below, summarized from past IRC discussion between robh and @ssingh, but subject to correction by @ssingh after this task filing.

The SSDs were delivered to ESAMS shipping via DEL0158639 and confirmed onsite via CS1520630.

Remote work task is via CS1553796, remote hands has confirmed receipt of the SSDs and work to take place on March 27th @ 11AM CET.

Maint Window Details

Event Window: March 27 starting at 9AM CET.
Scope: Full depool

Assumptions: it takes roughly 1 hour to depool a site without significant user impact. If it takes longer, the 9AM start time must shift earlier, or the scheduled on-site work must shift from 11AM to later in the day, subject to Traffic's approval.

Timeline

2024-03-27 @ 0900 : @ssingh depools esams and reroutes traffic to drmrs
2024-03-27 @ 1000 : @ssingh puts the cp text hosts into downtime in Icinga and powers them off in advance of Interxion remote hands (see the command sketch after this timeline).
2024-03-27 @ 1100 : Interxion remote hands begins work, unplugging the cp hosts one at a time, installing the PCIe NVMe SSD, and plugging each host back in to a fully accessible state before moving on to the next host. During this time, robh will be online and will attempt to remotely connect to each host as it is completed, to confirm function.
2024-03-27 @ 1300 : Estimated roughly 2 hours for remote hands to complete installation of 4 NVMe SSDs into 4 text cp hosts.
2024-03-27 @ 1300 : Reinstallation/reimage of the text cp hosts as required, by either robh or @ssingh.
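
For illustration, the depool, downtime, and power-off steps above might be driven with commands roughly like the following. This is a sketch only, assuming authdns-update, the sre.hosts.downtime cookbook, and cumin are run from the usual management hosts; flag names and selectors are from memory and may not match the exact tooling used on the day.

# Depool esams: merge the operations/dns change, then push it out from an authoritative DNS host.
sudo authdns-update

# Downtime all eight text cp hosts in Icinga/Alertmanager for the day (run from a cumin host).
sudo cookbook sre.hosts.downtime --hours 24 -r "preparing for new disk" -t T360430 'cp[3066-3073].esams.wmnet'

# Power the hosts off cleanly so remote hands only touch machines that are already down.
sudo cumin 'cp[3066-3073].esams.wmnet' 'shutdown -h now'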

Action list

  • Depool ESAMS DC and verify traffic is switched to DRMRS (at least 4h before the scheduled intervention)
  • Downtime impacted hosts to be ready for power off
    • Extra: silence/ack any other alerts that fire
  • Power off impacted hosts
  • New SSD installation and hosts power on
  • Verify host health and ATS cache (without using the new NVMe disk); a command-level sketch follows this list.
    • Run the puppet agent
    • Check that metrics are OK in Grafana
    • Check Icinga status (all green for the host)
    • Check with lsblk and nvme list that the new disk is visible and has the correct naming (nvme1n1)
    • Check ATS status (traffic_server -C check)
    • Check for a "sure hit" on ATS: curl -vv -H 'Host: en.wikipedia.org' http://localhost:3128/wiki/Main_Page -o /dev/null
    • Check for both a first miss and a subsequent hit (e.g. issuing two requests like curl -vv -H 'Host: en.wikipedia.org' http://localhost:3128/wiki/Caffe1ne -o /dev/null and checking the X-Cache-Status value)
  • Remove downtime for hosts
  • Remove downtime from Alertmanager
  • Manually repool hosts with conftool (auto-depooled by PyBal?)
  • Repool ESAMS DC and verify the traffic: https://gerrit.wikimedia.org/r/c/operations/dns/+/1015018 done @12:15UTC
  • (in the next days) Merge the hiera config to add the new disk, host by host, and reimage the hosts one by one with an appropriate interval to let the cache warm (see the sketch under Reimaging Process below).
    • cp3066
    • cp3067
    • cp3068
    • cp3069
    • cp3070
    • cp3071
    • cp3072
    • cp3073
  • Remove the per-host hiera overrides and apply the configuration to the whole ESAMS DC
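
The per-host verification item above, expressed as a command-level sketch. The checklist is the source of truth; run-puppet-agent is an assumed wrapper (plain "puppet agent -t" would do the same), and the grep is only a convenience for spotting the X-Cache-Status header.

sudo run-puppet-agent                # run the puppet agent
lsblk                                # new disk should be visible
sudo nvme list                       # and named nvme1n1
traffic_server -C check              # ATS config sanity check
# "Sure hit": Main_Page should already be cached.
curl -vv -H 'Host: en.wikipedia.org' http://localhost:3128/wiki/Main_Page -o /dev/null 2>&1 | grep -i x-cache-status
# First miss, then subsequent hit: issue the same request twice and compare X-Cache-Status.
curl -vv -H 'Host: en.wikipedia.org' http://localhost:3128/wiki/Caffe1ne -o /dev/null 2>&1 | grep -i x-cache-status
curl -vv -H 'Host: en.wikipedia.org' http://localhost:3128/wiki/Caffe1ne -o /dev/null 2>&1 | grep -i x-cache-status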

Reimaging Process
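
A rough sketch of the per-host cadence referenced in the action list (depool, merge that host's hieradata change, reimage, repool), not the exact procedure: it assumes confctl (the conftool CLI) and the sre.hosts.reimage cookbook are driven from a cumin host, and the selectors and flags may differ.

# Depool the host from the cp text service.
sudo confctl select 'name=cp3066.esams.wmnet' set/pooled=no

# Merge the per-host hieradata change (e.g. https://gerrit.wikimedia.org/r/1015968 for cp3066), then reimage.
sudo cookbook sre.hosts.reimage --os bullseye -t T360430 cp3066

# Once the host is healthy and the ATS checks pass, repool it and wait for the cache to warm
# before moving on to the next host.
sudo confctl select 'name=cp3066.esams.wmnet' set/pooled=yes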

Event Timeline

RobH added a parent task: Unknown Object (Task). Mar 19 2024, 1:36 PM
RobH mentioned this in Unknown Object (Task).
RobH updated the task description.

Chatted with @ssingh as I had neglected some items we had discussed previously:

  • Adjusted this from a single installation window to 2 windows, 1 week apart, each falling on a Wednesday.
    • This allows esams to stay pooled, with a fallback of depooling to drmrs if required.
    • Half of the cp text hosts for esams will be offline in a given window; we'll work on one rack, then the other, to reduce complexity and the chance of failure.

I'm going to leave the proposed changes in the task description above for review, and if we don't think of anything we need to adjust, I'll enter the remote hands task tomorrow for the two proposed date windows.

Hi @RobH:

Thanks for creating the task. In some further discussion with @BBlack today, we decided that we will do the following:

  • We will depool esams prior to the event, four hours before the work is supposed to start. I will send out an email about this later, once we finalize the timing. (9 AM CET March 27 works for us if that's what we are keeping.)
  • Remote hands can then go in, power down the hosts, install the disks, and power them back up (site will be depooled). We will not be reimaging during this time but will do so later, on our time and without requiring remote hands to be there.
  • We will verify everything is working fine and then repool the site. Then we will start reimaging the hosts, one by one to pick up the new disks.
    • We technically don't need to reimage as we can simply restart ATS to pick up the new disk (after configuring it) but we will still reimage, as we have done in the past.

The total time for the DNS/site depool thus comes out to: 4 hours before the event, ~4 hours for the work, and then 2 hours for checking and repooling, for a total of ~10 hours, though it can be shorter. That's fine and still falls within the 24h TTL window on the caches.

So the changes from the above are that we are no longer doing the reimages during the event, and there is really only one event now. Sorry for the confusion; I hope that's clear, and let me know if there are any questions so that we can finalize this today.

Remote hands won't have any ability to power down a host other than by pressing the front power button. It would reduce potential complexity if we power down all the hosts they are to work on, to prevent confusion. That way they know that if a host is powered off and matches the list, they can work on it.

Would that adjustment work, with Traffic sending a power-off to those hosts in advance of the work?

Thanks, that works for us and we will make sure that the hosts are powered off.

Rob, once the time/date is confirmed, please let me know here or on IRC and I will send an email to sre@. Thanks!

Hi Rob: Checking if the date/time above has been confirmed by remote hands?

We would like remote hands to fetch shipment DEL0158639, which contains (8) 6.5TB NVMe PCIe SSDs shipped from Dell NL to Wikimedia.

Proposed Work Window: 2024-03-27 @ 1100 CET

Once fetched, please unbox, photograph the contents and packing slip, and stage them for installation in our servers.

We would like to schedule the actual installation to take place on 2024-03-27 @ 1100 CET. The installation of these PCIe cards by remote hands should take anywhere from 1-3 hours.

We would like remote hands to unbox and install (1) PCIe NVMe SSD card into each of (8) of our hosts, which currently contain only (1) PCIe NVMe SSD each; this will upgrade them to (2) PCIe NVMe SSDs per host.

This will be repeated for a total of eight hosts in our racks as follows:
hostname/serial Rack:U-space

cp3066/6QGW8X3 BW27:U2
cp3067/3QGW8X3 BY27:U2
cp3068/5QGW8X3 BW27:U3
cp3069/2QGW8X3 BY27:U3
cp3070/1QGW8X3 BW27:U4
cp3071/7QGW8X3 BY27:U4
cp3072/4QGW8X3 BW27:U5
cp3073/JPGW8X3 BY27:U5

We would prefer the cadence of the work to be as follows:

  • Unplug cables from a listed host that has been powered down.
  • Note the serial of the PCIe NVMe SSD and install it into the host.
  • Push the host back into the rack rails and re-attach all cables.
  • Power on the host and update us remotely so we can begin testing the new hardware while remote hands works on the next one.

During the work window I'll be online remotely from the USA, so any updates can be sent via the ticket or via text message to +1.727.255.4597 or via email or google hangout to rhalsell@wikimedia.org.

Once the first host has the PCIe card installed and cables re-attached, please update us so we can begin remote testing to ensure there are no issues while remote hands continues to install PCIe SSDs into the rest of the 8 hosts.

Would remote hands please review the above directions for clarity or questions, and confirm the work window of 2024-03-27 @ 1100 CET.

Thank you in advance,

@RobH: Verified the hosts, serial numbers, racking and the cadence. Looks good!

CS1553796 created. Will update once they confirm the window.

Remote work task is via CS1553796, remote hands has confirmed receipt of the SSDs and work to take place on March 27th @ 11AM CET.

Change #1014514 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/dns@master] depool esams for text cluster drive upgrade

https://gerrit.wikimedia.org/r/1014514

Change #1014571 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/puppet@production] hiera: second nvme disk to text_esams

https://gerrit.wikimedia.org/r/1014571

Change #1014514 merged by Fabfur:

[operations/dns@master] depool esams for text cluster drive upgrade

https://gerrit.wikimedia.org/r/1014514

Mentioned in SAL (#wikimedia-operations) [2024-03-27T05:57:46Z] <fabfur> running authdns-update on dns1004 to depool ESAMS (T360430)

ESAMS DC started depooling @05:58UTC

Icinga downtime and Alertmanager silence (ID=e71791c7-a0fa-43b5-81ae-e92b275e5cc3) set by fabfur@cumin1002 for 1 day, 0:00:00 on 8 host(s) and their services with reason: preparing for new disk

cp[3066-3073].esams.wmnet

ESAMS remote hands began hands-on work at 11:10 CET; it is now ongoing.

Fabfur added a subscriber: 12.

esams has been repooled at 12:15UTC

Reassigning from myself over to @Fabfur for reimaging at Traffic's leisure.

Change #1015968 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] cp3066: update hieradata for dual NVMe disks configuration

https://gerrit.wikimedia.org/r/1015968

Change #1015969 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] cp3067: update hieradata for dual NVMe disks configuration

https://gerrit.wikimedia.org/r/1015969

Change #1015970 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] cp3068: update hieradata for dual NVMe disks configuration

https://gerrit.wikimedia.org/r/1015970

Change #1015971 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] cp3069: update hieradata for dual NVMe disks configuration

https://gerrit.wikimedia.org/r/1015971

Change #1015972 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] cp3070: update hieradata for dual NVMe disks configuration

https://gerrit.wikimedia.org/r/1015972

Change #1015973 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] cp3071: update hieradata for dual NVMe disks configuration

https://gerrit.wikimedia.org/r/1015973

Change #1015974 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] cp3072: update hieradata for dual NVMe disks configuration

https://gerrit.wikimedia.org/r/1015974

Change #1015975 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] cp3073: update hieradata for dual NVMe disks configuration

https://gerrit.wikimedia.org/r/1015975

Mentioned in SAL (#wikimedia-operations) [2024-04-02T13:32:34Z] <fabfur> depool cp3066 for reimage (T360430)

Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1002 for host cp3066.esams.wmnet with OS bullseye

Change #1015968 merged by Fabfur:

[operations/puppet@production] cp3066: update hieradata for dual NVMe disks configuration

https://gerrit.wikimedia.org/r/1015968

For posterity, an annotated Grafana dashboard that shows incoming traffic to esams during and after the depool and power-off events: https://grafana.wikimedia.org/goto/DbIJc7bSk?orgId=1. Note that the current TTL for dyna.wikimedia.org is five minutes, reduced in T140365. This is only for incoming traffic to Varnish, not the number of connections.

  • In five minutes (the actual TTL expiring), there is a ~50% drop in traffic.
  • In ten minutes: ~83% drop.
  • In 120 minutes: ~92%.
  • By the time of power-off, just before the hands-on maintenance began, we had seen a ~95% drop in traffic.

Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1002 for host cp3066.esams.wmnet with OS bullseye completed:

  • cp3066 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404021405_fabfur_273941_cp3066.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2024-04-02T14:38:46Z] <fabfur> repooling cp3066 after reimage (T360430)

cp3066 has been reimaged successfully; no evidence of errors.

Change #1015969 merged by Fabfur:

[operations/puppet@production] cp3067: update hieradata for dual NVMe disks configuration

https://gerrit.wikimedia.org/r/1015969

Mentioned in SAL (#wikimedia-operations) [2024-04-03T08:24:12Z] <fabfur> depool cp3067 for reimage (T360430)

Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1002 for host cp3067.esams.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1002 for host cp3067.esams.wmnet with OS bullseye completed:

  • cp3067 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404030855_fabfur_433556_cp3067.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2024-04-04T15:54:37Z] <fabfur> depooling cp3068 for reimage (T360430)

Change #1015970 merged by Fabfur:

[operations/puppet@production] cp3068: update hieradata for dual NVMe disks configuration

https://gerrit.wikimedia.org/r/1015970

Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1002 for host cp3068.esams.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1002 for host cp3068.esams.wmnet with OS bullseye completed:

  • cp3068 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404041632_fabfur_684884_cp3068.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Fabfur added a subscriber: RobH.

Mentioned in SAL (#wikimedia-operations) [2024-04-08T14:19:47Z] <sukhe> depool cp3069 to prepare for reimaging: T360430

Change #1015971 merged by Ssingh:

[operations/puppet@production] cp3069: update hieradata for dual NVMe disks configuration

https://gerrit.wikimedia.org/r/1015971

Change #1015972 merged by Fabfur:

[operations/puppet@production] cp3070: update hieradata for dual NVMe disks configuration

https://gerrit.wikimedia.org/r/1015972

Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1002 for host cp3070.esams.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1002 for host cp3070.esams.wmnet with OS bullseye completed:

  • cp3070 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404100818_fabfur_1754090_cp3070.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1015973 merged by Ssingh:

[operations/puppet@production] cp3071: update hieradata for dual NVMe disks configuration

https://gerrit.wikimedia.org/r/1015973

Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host cp3071.esams.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host cp3071.esams.wmnet with OS bullseye completed:

  • cp3071 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404101854_sukhe_1856882_cp3071.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1015974 merged by Fabfur:

[operations/puppet@production] cp3072: update hieradata for dual NVMe disks configuration

https://gerrit.wikimedia.org/r/1015974

Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1002 for host cp3072.esams.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1002 for host cp3072.esams.wmnet with OS bullseye executed with errors:

  • cp3072 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404110813_fabfur_1965699_cp3072.out
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console cp3072.esams.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1002 for host cp3072.esams.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1002 for host cp3072.esams.wmnet with OS bullseye completed:

  • cp3072 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404110906_fabfur_1982322_cp3072.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1015975 merged by Ssingh:

[operations/puppet@production] cp3073: update hieradata for dual NVMe disks configuration

https://gerrit.wikimedia.org/r/1015975

Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host cp3073.esams.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host cp3073.esams.wmnet with OS bullseye completed:

  • cp3073 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404111401_sukhe_2052551_cp3073.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1014571 merged by Ssingh:

[operations/puppet@production] hiera: unify trafficserver storage elements for esams

https://gerrit.wikimedia.org/r/1014571

Fabfur updated the task description.