This task will track the receipt and installation of (8) 6.4TB NVMe PCIe SSDs into the text cp fleet in eqsin.
Order was via parent task T348064.
`cp50(1[789]|2[0-4])` are the text hosts, i.e. cp5017 through cp5024.
This will be worked on by @RobH as the on-site engineer and @ssingh from Traffic.
This task is a copy of the ulsfo NVMe task T364891.
== Scheduling ==
Target Date: 2024-06-25 @ 9AM Singapore Time, 1AM GMT, 6PM Pacific.
Scope: Full depool
Assumptions: it takes roughly 1 hour to depool a site without significant user impact. If it takes longer, the 9AM start time must shift earlier, or the on-site engineer's scheduled work must shift from 11AM to later in the day, subject to Traffic approval.
Cadence:
* SSDs arrive; Rob updates this task with their arrival.
* Rob generates a quote with Jin@DreamIIC for Jin to install the SSDs as remote hands.
* Rob and Sukhbir (with Jin) determine the best date for the SSD installation and set/announce a maintenance window.
* Sukhbir depools all traffic from eqsin on the scheduled date (see the depool sketch below).
* Rob goes on-site, gracefully shuts down each cp host, installs the PCIe SSD, and powers the host back up.
* Sukhbir & Rob ensure all hosts are back online and accessible.
* Sukhbir repools eqsin for user traffic.
* Sukhbir/Traffic will reimage the upgraded text cp hosts individually while the site is serving traffic. This will take place over the course of days, not within the initial maintenance window.
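For reference, a rough sketch of the full-site depool via GeoDNS, assuming the `admin_state` mechanism in the operations/dns repo is still the current method (the exact line syntax and review flow below are assumptions; check the Wikitech documentation):

```
# In a checkout of operations/dns, mark eqsin as down in the geo map
# (line syntax is an assumption, check the repo's documented examples):
echo 'geoip/generic-map/eqsin => DOWN' >> admin_state
# Commit, get review, then deploy from an authoritative DNS host:
sudo -i authdns-update
# Wait for DNS TTLs to expire and confirm eqsin has drained before
# the on-site work starts.
```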
== Communication ==
Rob will create a Google Hangouts room with Jin, @robh, @ssingh, and @fabfur. This will allow #traffic to communicate directly with both Rob and Jin at the same time, as Jin does not have IRC.
Jin is highly responsive via Google Hangouts during on-site work.
== Action checklist ==
- [ ] Depool the eqsin DC and verify traffic is switched to codfw (at least 4h before the scheduled intervention)
- [ ] Downtime the impacted hosts so they are ready for power off (see the downtime sketch after this checklist)
- Extra: silence/ack any other alerts
- [ ] Power off impacted hosts
- [ ] New SSD installation and hosts power on
- [ ] Verify host health and ATS cache (without using the new NVMe disk); see the verification sketch after this checklist.
- Run the puppet agent
- Check that metrics are OK in Grafana
- Check Icinga status (all green for the host)
- Check with `lsblk` and `nvme list` that the new disk is visible and has the correct naming (`nvme1n1`)
- Check ATS status (`traffic_server -C check`)
- Check for "sure hit" on ATS: `curl -vv -H 'Host: en.wikipedia.org' http://localhost:3128/wiki/Main_Page -o /dev/null`
- Check for both a first miss and a subsequent hit (e.g. issuing two requests like `curl -vv -H 'Host: en.wikipedia.org' http://localhost:3128/wiki/Caffe1ne -o /dev/null` and checking the X-Cache-Status value)
- [ ] Remove downtime for hosts
- [ ] Remove downtime from Alertmanager
- [ ] Manually repool hosts with `conftool` (auto-depooled by PyBal?); see the repool sketch after this checklist.
- [ ] Repool the eqsin DC and verify traffic
- [ ] (in the following days) Merge the Hiera config to add the new disk, host by host, and reimage hosts one by one with an appropriate interval to help warm the cache.
- [ ] cp5017
- [ ] cp5018
- [ ] cp5019
- [ ] cp5020
- [ ] cp5021
- [ ] cp5022
- [ ] cp5023
- [ ] cp5024
- [ ] Remove the custom Hiera overrides and apply the config to the whole eqsin DC
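A minimal sketch of the downtime step above, assuming the `sre.hosts.downtime` cookbook and ClusterShell-style host expansion (duration and reason string are illustrative):

```
# Downtime all eight eqsin text cp hosts ahead of the power-off:
sudo cookbook sre.hosts.downtime --hours 4 \
    -r "eqsin text cp NVMe SSD installation" \
    'cp50[17-24].eqsin.wmnet'
```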
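A sketch of the per-host verification, folding the checks above into one pass (`nvme list` comes from nvme-cli; the exact `X-Cache-Status` values to expect are an assumption):

```
# Confirm the new disk is visible under the expected name:
lsblk -o NAME,SIZE,MODEL
sudo nvme list                 # new disk expected as /dev/nvme1n1

# Sanity-check the ATS configuration:
sudo traffic_server -C check

# First request should be a miss that fills the cache, the second a hit;
# watch the X-Cache-Status header on each response:
for i in 1 2; do
  curl -s -D - -o /dev/null -H 'Host: en.wikipedia.org' \
    http://localhost:3128/wiki/Caffe1ne | grep -i 'x-cache-status'
done
```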
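And a sketch of the manual per-host repool with conftool, assuming the usual `confctl` selector syntax (cp5017 stands in for each of the eight hosts):

```
# Repool a verified host and read back its state:
sudo confctl select 'name=cp5017.eqsin.wmnet' set/pooled=yes
sudo confctl select 'name=cp5017.eqsin.wmnet' get
```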
== Reimaging Process ==
- Depool the host you are going to work on
- Merge the patch in the patchset
- Merge on the Puppet master
- Run the reimaging cookbook (see the sketch below)
- Check that everything is fine after reimaging: Icinga, disks
- Repool the host
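A minimal sketch of one reimage iteration, assuming the `sre.hosts.reimage` cookbook (OS codename and flags are illustrative and should be confirmed at execution time):

```
# Depool, reimage, verify, repool: one host at a time.
sudo confctl select 'name=cp5017.eqsin.wmnet' set/pooled=no
sudo cookbook sre.hosts.reimage --os bullseye cp5017
# After Icinga and disk checks pass:
sudo confctl select 'name=cp5017.eqsin.wmnet' set/pooled=yes
```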