This task tracks the receipt and installation of (8) 6.4TB NVMe PCIe SSDs in the text cp fleet in eqsin.
Target Date: 2024-06-25 @ 9AM Singapore Time, 1AM GMT, 6PM Pacific (2024-06-24).
Order was via parent task T348064.
cp50(1[789]|2[01234]) are text hosts.
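As a quick sanity check, the host pattern above can be expanded to confirm that it covers exactly the eight hosts receiving SSDs. This is a minimal sketch; the candidate range cp5001 to cp5032 is an illustrative assumption and is not taken from this task.

```python
import re

# Host pattern from this task. fullmatch() below requires the whole
# hostname to match, so names such as "cp50170" are rejected.
TEXT_HOSTS = re.compile(r"cp50(1[789]|2[01234])")


def matching_hosts(candidates):
    """Return the candidates whose full name matches the text-host pattern."""
    return [host for host in candidates if TEXT_HOSTS.fullmatch(host)]


# Hypothetical candidate range, used only to exercise the pattern.
candidates = [f"cp50{n:02d}" for n in range(1, 33)]
hosts = matching_hosts(candidates)
print(len(hosts), hosts)
```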
RobH will work on this with Jin from DreamIIC and ssingh.
This task is a copy of the ulsfo NVMe task T364891.
Scheduling
Target Date: 2024-06-25 @ 9AM Singapore Time, 1AM GMT, 6PM Pacific (2024-06-24).
Scope: Full depool
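The three times in the target date can be cross-checked with a short conversion. This sketch uses fixed UTC offsets rather than a time zone database, on the assumption that they hold for June 2024: Singapore is UTC+8 year-round and US Pacific is on daylight time (UTC-7).

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets, assumed valid for June 2024 only.
SGT = timezone(timedelta(hours=8), "SGT")
PDT = timezone(timedelta(hours=-7), "PDT")

# Start of the maintenance window as stated in this task.
window_start = datetime(2024, 6, 25, 9, 0, tzinfo=SGT)

for tz in (SGT, timezone.utc, PDT):
    print(window_start.astimezone(tz).strftime("%Y-%m-%d %H:%M %Z"))
```

Note that the Pacific time falls on the previous calendar day, 2024-06-24.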
Summary Checklist (Detailed Action Checklist below in task description):
- SSDs arrive; Rob updates this task to confirm their arrival.
- Rob generates a quote with Jin@DreamIIC for Jin to install the SSDs as remote hands.
- Rob and Sukhbir (with Jin) determine the best date for SSD installation and set/announce a maintenance window.
- Sukhbir depools all traffic from eqsin on the scheduled date.
- Jin performs the onsite work.
- Sukhbir & Rob ensure all hosts are back online and accessible.
- Sukhbir re-pools eqsin for user traffic.
- Sukhbir/Traffic will reimage the upgraded text CP hosts one at a time while the site is serving traffic. This will take place over several days, not within the initial maintenance window.
Communication
Rob will create a Google Hangout room with Jin, robh, ssingh, and bcornwall. This will allow Traffic to communicate directly with both Rob and Jin at the same time, as Jin does not have IRC.
Jin is highly responsive via Google Hangout during onsite work.
Action checklist
- Depool eqsin DC and verify traffic is switched to ulsfo (at least 4h before scheduled intervention)
- Downtime impacted hosts to be ready for power off
- Extra: silence/ack any other alerts that may fire
- Power off impacted hosts
- New SSD installation and hosts power on
- Verify host health and ATS cache (without using the new NVMe disk).
- run puppet agent
- Check that metrics are ok in grafana
- Check Icinga status (all green for the host)
- Check with lsblk and nvme list that the new disk is visible and has the correct naming (nvme1n1)
- Check ATS status (traffic_server -C check)
- Check for "sure hit" on ATS: curl -vv -H 'Host: en.wikipedia.org' http://localhost:3128/wiki/Main_Page -o /dev/null
- Check for both a first miss and a subsequent hit (e.g. by issuing the same request twice, such as curl -vv -H 'Host: en.wikipedia.org' http://localhost:3128/wiki/Caffeine -o /dev/null, and checking the X-Cache-Status value each time)
- cp5017
- cp5018
- cp5019
- cp5020
- cp5021
- cp5022
- cp5023
- cp5024
- Remove downtime for hosts
- Removed downtime from alertmanager
- Manually repooled hosts with conftool (auto depooled by PyBal?)
- Repool eqsin DC and verify the traffic
- (over the following days) Merge the hiera config to add the new disk, host by host, and depool/merge/reimage/repool hosts one at a time, with an appropriate interval between hosts to let the cache warm.
- per-host hiera override required, see example: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1015968
- cp5017
- cp5018
- cp5019
- cp5020
- cp5021
- cp5022
- cp5023
- cp5024
- Remove the per-host hiera overrides and apply the config to the whole eqsin DC
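The "first miss, subsequent hit" check in the checklist above can be sketched as a small parser over the verbose output of the two curl requests. This is only an illustration: it assumes the response header appears in `curl -vv` output as a line starting with `<`, and that X-Cache-Status values begin with "miss" or "hit". The sample outputs are made up, not captured from a cp host.

```python
import re

# Response headers in curl's verbose output are prefixed with "< ".
HEADER_RE = re.compile(
    r"^<\s*x-cache-status:\s*(.+?)\s*$", re.IGNORECASE | re.MULTILINE
)


def cache_status(curl_verbose_output):
    """Extract the X-Cache-Status value from `curl -vv` output, or None."""
    match = HEADER_RE.search(curl_verbose_output)
    return match.group(1) if match else None


def is_miss_then_hit(first_output, second_output):
    """True if the first request was a miss and the second was a hit."""
    first = cache_status(first_output)
    second = cache_status(second_output)
    if first is None or second is None:
        return False
    # Assumption: status values start with "miss" or "hit".
    return first.lower().startswith("miss") and second.lower().startswith("hit")


# Illustrative samples only.
first_request = "< HTTP/1.1 200 OK\r\n< X-Cache-Status: miss\r\n"
second_request = "< HTTP/1.1 200 OK\r\n< X-Cache-Status: hit\r\n"
print(is_miss_then_hit(first_request, second_request))
```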
Reimaging Process
- Depool the host you are going to work on
- Merge the host's patch from the patchset
- Merge on the Puppet master
- Run the reimaging cookbook
- Check that everything is fine after reimaging: Icinga, disks
- Pool the host back
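The disk part of the post-reimage check can be sketched by parsing the JSON output of `lsblk --json` and confirming that the expected device name (nvme1n1, per the action checklist) is present. The sample JSON below is illustrative and was not captured from a real host; the helper names are hypothetical.

```python
import json


def block_device_names(lsblk_json):
    """Return all device names from `lsblk --json` output, including children."""
    names = []

    def walk(devices):
        for dev in devices:
            names.append(dev["name"])
            walk(dev.get("children", []))

    walk(json.loads(lsblk_json).get("blockdevices", []))
    return names


def has_new_disk(lsblk_json, expected="nvme1n1"):
    """True if the expected NVMe device appears in the lsblk output."""
    return expected in block_device_names(lsblk_json)


# Illustrative sample, not real host output.
sample = (
    '{"blockdevices": ['
    '{"name": "nvme0n1", "type": "disk", '
    '"children": [{"name": "nvme0n1p1", "type": "part"}]}, '
    '{"name": "nvme1n1", "type": "disk"}]}'
)
print(has_new_disk(sample))
```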