
Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234])
Open, Medium, Public

Description

This task tracks the receipt and installation of eight (8) 6.4TB NVMe PCIe SSDs into the text cp fleet in eqsin.

Target Date: 2024-06-25 @ 9AM Singapore Time (1AM GMT; 6PM Pacific on 2024-06-24).

Order was via parent task T348064.

cp50(1[789]|2[01234]) are text hosts.
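As a quick sanity check, the host pattern can be expanded into an explicit list. This is a minimal sketch: the pattern is written as a complete regular expression (with its closing parenthesis), and the candidate range cp5000 to cp5099 is an assumption chosen only for illustration.

```python
import re

# Host pattern from the task title, written as a complete regular expression.
HOST_PATTERN = re.compile(r"cp50(1[789]|2[01234])")


def matching_hosts(candidates):
    """Return the candidates that the pattern matches in full, in input order."""
    return [name for name in candidates if HOST_PATTERN.fullmatch(name)]


# Candidate range is illustrative only: cp5000 through cp5099.
candidates = [f"cp50{n:02d}" for n in range(100)]
hosts = matching_hosts(candidates)
print(len(hosts), hosts)
```

The expansion yields the eight hosts cp5017 through cp5024, one per SSD in the order.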

RobH will work on this with Jin from DreamIIC and ssingh.

This task is a copy of ulsfo nvme task T364891.

Scheduling

Target Date: 2024-06-25 @ 9AM Singapore Time (1AM GMT; 6PM Pacific on 2024-06-24).
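The three local times can be cross-checked with a short conversion. This sketch uses fixed UTC offsets rather than a time zone database, which is an assumption that holds for late June 2024 (Singapore at UTC+8, US Pacific on daylight time at UTC-7).

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets are an assumption valid for late June 2024:
# Singapore is UTC+8 all year; US Pacific is on daylight time (UTC-7).
SGT = timezone(timedelta(hours=8), "SGT")
PDT = timezone(timedelta(hours=-7), "PDT")

# Maintenance start as announced: 9AM Singapore time on 2024-06-25.
start = datetime(2024, 6, 25, 9, 0, tzinfo=SGT)

for tz in (SGT, timezone.utc, PDT):
    print(start.astimezone(tz).strftime("%Y-%m-%d %H:%M %Z"))
```

Note that the Pacific time falls on the previous calendar day, 2024-06-24.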

Scope: Full depool

Summary Checklist (Detailed Action Checklist below in task description):

  • SSDs arrive, Rob updates this task with their arrival.
  • Rob generates quote with Jin@DreamIIC for Jin to install as remote hands.
  • Rob and Sukhbir (with Jin) determine the best date for SSD installation and set/announce a maintenance window.
  • Sukhbir depools all traffic from eqsin on the scheduled date.
  • Jin performs the onsite work.
  • Sukhbir & Rob ensure all hosts are back online and accessible.
  • Sukhbir re-pools eqsin for user traffic.
  • Sukhbir/Traffic will reimage the upgraded text CP hosts individually while the site is serving traffic. This will take place over the course of several days, not within the initial maintenance window.

Communication

Rob will create a Google Hangout room with Jin, robh, ssingh, and bcornwall. This will allow Traffic to communicate directly with both Rob and Jin at the same time, as Jin does not have IRC.

Jin is highly responsive via Google Hangout during onsite work.

Action checklist

  • Depool eqsin DC and verify traffic is switched to ulsfo (at least 4h before scheduled intervention)
  • Downtime impacted hosts to be ready for power off
    • Extra: silence/ack any other alerts that fire
  • Power off impacted hosts
  • New SSD installation and hosts power on
  • Verify host health and ATS cache (without using the new NVMe disk).
    • run puppet agent
    • Check that metrics are ok in grafana
    • Check Icinga status (all green for the host)
    • Check with lsblk and nvme list that the new disk is visible and has the correct naming (nvme1n1)
    • Check ATS status (traffic_server -C check)
    • Check for "sure hit" on ATS: curl -vv -H 'Host: en.wikipedia.org' http://localhost:3128/wiki/Main_Page -o /dev/null
    • Check for both a first miss and a subsequent hit (e.g. issue two requests like curl -vv -H 'Host: en.wikipedia.org' http://localhost:3128/wiki/Caffeine -o /dev/null and check the X-Cache-Status value each time)
    • cp5017
    • cp5018
    • cp5019
    • cp5020
    • cp5021
    • cp5022
    • cp5023
    • cp5024
  • Remove downtime for hosts
  • Remove downtime from Alertmanager
  • Manually repool hosts with conftool (auto-depooled by PyBal?)
  • Repool eqsin DC and verify the traffic
  • (over the following days) Merge the hiera config to add the new disk, host by host, and depool/merge/reimage/repool hosts one by one with an appropriate interval between them to help warm the cache.
  • Remove the per-host hiera overrides and apply the configuration to the whole eqsin DC
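Two of the per-host checks above (the new disk being visible as nvme1n1, and a first miss followed by a hit) can be sketched as small helpers. This is only an illustration under stated assumptions: the disk check expects the JSON produced by lsblk -J, and the cache check assumes X-Cache-Status values begin with "miss" or "hit". The sample inputs are made up, not captured from a real host.

```python
import json


def disk_visible(lsblk_json, name="nvme1n1"):
    """True if a top-level block device with this name appears in `lsblk -J` output."""
    devices = json.loads(lsblk_json).get("blockdevices", [])
    return any(dev.get("name") == name for dev in devices)


def miss_then_hit(first_status, second_status):
    """True if the first request missed the cache and the second one hit it.

    Assumption: the X-Cache-Status value starts with "miss" or "hit"
    (for example "hit-local"); adjust the prefixes to the values actually seen.
    """
    first = first_status.strip().lower()
    second = second_status.strip().lower()
    return first.startswith("miss") and second.startswith("hit")


# Sample inputs made up for illustration, not captured from a real host.
sample_lsblk = json.dumps({"blockdevices": [
    {"name": "nvme0n1", "type": "disk"},
    {"name": "nvme1n1", "type": "disk"},
]})
print(disk_visible(sample_lsblk))
print(miss_then_hit("miss", "hit-local"))
print(miss_then_hit("hit-front", "hit-front"))
```

The header values would come from the two curl requests in the checklist; the helpers only compare strings and do not contact the host.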

Reimaging Process

  • Depool the host you are going to work on
  • Merge the patch in the patchset
  • Merge on Puppet master
  • Run the reimaging cookbook
  • Check that everything is fine after reimaging: Icinga, disks
  • Pool the host back
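The one-host-at-a-time ordering can be sketched as a simple plan generator. The step names are taken from the list above; the two-hour interval and the function name are hypothetical placeholders, since the task only asks for an "appropriate interval" and does not give a value.

```python
from datetime import timedelta

# Step names mirror the reimaging checklist; they are labels, not commands.
STEPS = (
    "depool host",
    "merge patch",
    "merge on Puppet master",
    "run reimage cookbook",
    "check Icinga and disks",
    "pool host back",
)


def rollout_plan(hosts, interval):
    """Yield (offset, host, step) tuples, finishing each host before the next.

    `offset` is the earliest start for that host, counted from the first one.
    `interval` is a placeholder for the cache warm-up gap between hosts.
    """
    for index, host in enumerate(hosts):
        for step in STEPS:
            yield index * interval, host, step


hosts = [f"cp50{n}" for n in range(17, 25)]
# The two-hour interval is hypothetical, not a value from the task.
plan = list(rollout_plan(hosts, interval=timedelta(hours=2)))
for offset, host, step in plan[:len(STEPS)]:
    print(offset, host, step)
```

Keeping every step of a host together makes it explicit that no second host is depooled before the previous one has been pooled back.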

Event Timeline

RobH renamed this task from Q#:rack/setup/install X to Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234]). May 23 2024, 9:17 PM
RobH mentioned this in Unknown Object (Task). May 23 2024, 9:24 PM
RobH added a parent task: Unknown Object (Task).
RobH added a subtask: Unknown Object (Task).
RobH added a subscriber: ssingh.
RobH added a subscriber: Fabfur.
ssingh updated the task description.

Change #1049168 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] cp5017: update hieradata for dual NVMe disks configuration

https://gerrit.wikimedia.org/r/1049168

Change #1049169 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] cp5018: update hieradata for dual NVMe disks configuration

https://gerrit.wikimedia.org/r/1049169

Change #1049170 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] cp5019: update hieradata for dual NVMe disks configuration

https://gerrit.wikimedia.org/r/1049170

Change #1049171 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] cp5020: update hieradata for dual NVMe disks configuration

https://gerrit.wikimedia.org/r/1049171

Change #1049172 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] cp5021: update hieradata for dual NVMe disks configuration

https://gerrit.wikimedia.org/r/1049172

Change #1049173 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] cp5022: update hieradata for dual NVMe disks configuration

https://gerrit.wikimedia.org/r/1049173

Change #1049174 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] cp5023: update hieradata for dual NVMe disks configuration

https://gerrit.wikimedia.org/r/1049174

Change #1049175 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] cp5024: update hieradata for dual NVMe disks configuration

https://gerrit.wikimedia.org/r/1049175

Change #1049232 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/dns@master] depool ulsfo for text cluster drive upgrade

https://gerrit.wikimedia.org/r/1049232

RobH updated the task description.

Change #1049232 merged by BCornwall:

[operations/dns@master] depool eqsin for text cluster drive upgrade

https://gerrit.wikimedia.org/r/1049232

Mentioned in SAL (#wikimedia-operations) [2024-06-24T23:02:05Z] <brett> Running authdns-update on dns1004 to depool eqsin - T365763

Mentioned in SAL (#wikimedia-operations) [2024-06-25T00:01:08Z] <brett@cumin2002> START - Cookbook sre.hosts.downtime for 4:00:00 on 8 hosts with reason: T365763

Mentioned in SAL (#wikimedia-operations) [2024-06-25T00:01:33Z] <brett@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 8 hosts with reason: T365763

Mentioned in SAL (#wikimedia-operations) [2024-06-25T01:40:11Z] <brett> Removing downtime for cp[5017-5024] as nvme drives are installed and hosts back online - T365763

Mentioned in SAL (#wikimedia-operations) [2024-06-25T01:48:27Z] <brett> Running authdns-update on dns1004 to pool eqsin - T365763
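The SAL timestamps above are enough to work out how long the site was depooled and how long the hosts were downtimed. A minimal sketch, with the event labels chosen here for illustration:

```python
from datetime import datetime

# Timestamps copied from the SAL entries; the dictionary keys are
# illustrative labels, not names used by any tool.
events = {
    "depool": "2024-06-24T23:02:05Z",
    "downtime_start": "2024-06-25T00:01:08Z",
    "downtime_removed": "2024-06-25T01:40:11Z",
    "repool": "2024-06-25T01:48:27Z",
}


def parse(ts):
    """Parse a SAL-style UTC timestamp such as 2024-06-25T00:01:08Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S%z")


depooled_for = parse(events["repool"]) - parse(events["depool"])
downtimed_for = parse(events["downtime_removed"]) - parse(events["downtime_start"])
print("eqsin depooled for:", depooled_for)
print("hosts downtimed for:", downtimed_for)
```

Going by the logged times, eqsin was depooled for 2:46:22 and the hosts were downtimed for 1:39:03, within the 4:00:00 downtime that was set.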

BCornwall updated the task description.
RobH removed RobH as the assignee of this task. Tue, Jun 25, 4:24 PM
RobH closed subtask Unknown Object (Task) as Resolved.
RobH updated the task description.
RobH updated the task description.
RobH unsubscribed.

Change #1049168 merged by BCornwall:

[operations/puppet@production] cp5017: update hieradata for dual NVMe disks configuration

https://gerrit.wikimedia.org/r/1049168

Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp5017.eqsin.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp5017.eqsin.wmnet with OS bullseye executed with errors:

  • cp5017 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" cp5017.eqsin.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp5017.eqsin.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp5017.eqsin.wmnet with OS bullseye completed:

  • cp5017 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202406251928_brett_704326_cp5017.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1049169 merged by BCornwall:

[operations/puppet@production] cp5018: update hieradata for dual NVMe disks configuration

https://gerrit.wikimedia.org/r/1049169

Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp5018.eqsin.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp5018.eqsin.wmnet with OS bullseye completed:

  • cp5018 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202406261706_brett_1276932_cp5018.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1049170 merged by BCornwall:

[operations/puppet@production] cp5019: update hieradata for dual NVMe disks configuration

https://gerrit.wikimedia.org/r/1049170

Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp5019.eqsin.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp5019.eqsin.wmnet with OS bullseye executed with errors:

  • cp5019 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" cp5019.eqsin.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp5019.eqsin.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp5019.eqsin.wmnet with OS bullseye completed:

  • cp5019 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202406261933_brett_1340367_cp5019.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1049171 merged by BCornwall:

[operations/puppet@production] cp5020: update hieradata for dual NVMe disks configuration

https://gerrit.wikimedia.org/r/1049171

Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp5020.eqsin.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp5020.eqsin.wmnet with OS bullseye executed with errors:

  • cp5020 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" cp5020.eqsin.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp5020.eqsin.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp5020.eqsin.wmnet with OS bullseye completed:

  • cp5020 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202406262150_brett_1403267_cp5020.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp5021.eqsin.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp5021.eqsin.wmnet with OS bullseye completed:

  • cp5021 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202406262323_brett_1443394_cp5021.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB