This task will track the receiving and installation of (8) 6.4TB NVMe PCIe SSDs to install into the text cp fleet in ulsfo.
Order was via parent task T359167.
cp40(3[789]|4[01234]) matches the text hosts (cp4037 through cp4044).
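As a quick sanity check, the host regex can be expanded against the expected host list (a trivial sketch; the host names and regex are the ones from this task):

```shell
# Confirm the regex covers exactly the eight ulsfo text hosts cp4037-cp4044.
for n in $(seq 4037 4044); do echo "cp${n}"; done \
  | grep -cE '^cp40(3[789]|4[01234])$'
# prints 8
```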
This will be worked on with RobH as on-site engineer and @BCornwall.
SSDs are not expected to arrive until 2024-05-15.
The proposed work window and approach are summarized below from past IRC discussion with robh and ssingh, subject to correction by ssingh after this task is filed.
Scheduling
Cadence:
- SSDs arrive, Rob updates this task with their arrival.
- Rob and Brett determine the best date for SSD installation and set/announce a maintenance window.
- Brett depools all traffic from ulsfo on scheduled date.
- Rob goes on-site, gracefully shuts down each cp host, installs the PCIe SSD, and powers the host back up.
- Brett & Rob ensure all hosts are back online and accessible.
- Brett re-pools ulsfo for user traffic.
- Brett/Traffic will reimage the upgraded text cp hosts individually while the site is serving traffic. This will take place over the course of days, not within the initial maintenance window.
Maintenance window
Event window: 2024-06-12 at 15:00 UTC through 19:00 UTC
Scope: Full depool
Assumptions: it takes roughly 1 hour to depool a site without significant user impact. If it takes longer, the 16:00 UTC power-off time must shift later, or the on-site engineer's scheduled work must shift from 17:00 UTC to later in the day, subject to Traffic approval.
Timeline
2024-06-12 @ 15:00 : @BCornwall and @CDobbins depool ulsfo to let traffic start routing to other DCs.
2024-06-12 @ 16:00 : @BCornwall and @CDobbins put the cp text hosts into downtime in Icinga and power them off in advance of on-site hands.
2024-06-12 @ 17:00 : Remote hands begin work, unplugging the cp hosts one at a time, installing the PCIe NVMe SSD, and plugging each host back in to a fully accessible state before moving on to the next host. During this time, RobH will be online and will attempt to remotely connect to each host as it is completed and confirm function. Estimated roughly 2 hours for on-site hands to install all 8 NVMe SSDs into the 8 text cp hosts.
2024-06-12 @ 19:00 : Re-pooling of ulsfo
Post-maintenance window: Reinstallation/reimage of text cp hosts as required, performed by either RobH or @BCornwall and @CDobbins
Action checklist
- Depool ulsfo DC and verify traffic is switched to other DCs (at least 2h before scheduled intervention)
- Downtime impacted hosts to be ready for power off
- Extra: silence/ack any other alerts that fire
- Power off impacted hosts
- New SSD installation and hosts power on
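A possible sketch of the downtime/power-off steps. The cookbook name and flags, and the cumin invocation, are assumptions based on common WMF tooling and should be verified against the current help output; DRY_RUN=echo only prints the commands rather than executing them:

```shell
# Sketch only: print (not run) the downtime and power-off commands for the
# eight text cp hosts. Cookbook/cumin names and flags are assumptions.
DRY_RUN=echo
for n in $(seq 4037 4044); do
  host="cp${n}.ulsfo.wmnet"
  $DRY_RUN sudo cookbook sre.hosts.downtime --hours 6 -r "PCIe NVMe SSD install" "$host"
  $DRY_RUN sudo cumin "$host" 'shutdown -h +1'   # graceful shutdown, 1 min notice
done
```

Drop the DRY_RUN line (or set it empty) once the commands have been reviewed.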
Steps to be carried out after the disks are installed:
- Verify host health and ATS cache (without using the new NVMe disk).
- Run puppet agent
- Check that metrics are OK in Grafana (Host Overview dashboard)
- Check Icinga status (all green for the host)
- Check with lsblk and nvme list that the new disk is visible and has the correct naming (nvme1n1)
- Check ATS status (traffic_server -C check)
- Check for "sure hit" on ATS: curl -vv -H 'Host: en.wikipedia.org' http://localhost:3128/wiki/Main_Page -o /dev/null
- Check for both a first miss and a subsequent hit (e.g. issuing two requests like curl -vv -H 'Host: en.wikipedia.org' http://localhost:3128/wiki/Caffe1ne -o /dev/null and checking the X-Cache-Status value)
- cp4037
- cp4038
- cp4039
- cp4040
- cp4041
- cp4042
- cp4043
- cp4044
- Remove downtime for hosts
- Remove downtime from Alertmanager
- Manually repool hosts with conftool (auto-depooled by PyBal?)
- Repool ulsfo DC and verify the traffic
- (in the next days) Merge the hiera config to add the new disk, host by host, and depool/merge/reimage/repool the hosts one at a time, with an appropriate interval between them to help warm the cache.
- per-host hiera override required, see example: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1015968
- cp4037
- cp4038
- cp4039
- cp4040
- cp4041
- cp4042
- cp4043
- cp4044
- Remove the per-host hiera overrides and apply the config to the whole ulsfo DC
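The X-Cache-Status check above can be wrapped in a small helper (a hypothetical convenience, not existing tooling; the curl invocation is the one from this task):

```shell
# Hypothetical helper: pull the X-Cache-Status value out of curl's verbose
# output, so the first-request "miss" / second-request "hit" pattern is easy
# to eyeball or script. curl -v writes response headers to stderr as "< ...".
cache_status() {
  grep -i '^< x-cache-status:' | awk '{print tolower($3)}' | tr -d '\r'
}
# Usage on a cp host (expect "miss" on the first run, "hit" on the second):
#   curl -sv -H 'Host: en.wikipedia.org' http://localhost:3128/wiki/Caffe1ne \
#     -o /dev/null 2>&1 | cache_status
```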
Reimaging Process
- Depool host you are going to work on
- Merge patch in the patchset
- Merge on Puppet master
- Run the reimaging cookbook
- Check that everything is fine after reimaging: Icinga, disks
- Pool host back
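The per-host cycle above might look like the following sketch. The sre.hosts.reimage cookbook is real, but the --os value and the pool/depool helper names are assumptions to verify; DRY_RUN=echo only prints the commands:

```shell
# Sketch only: per-host depool / reimage / repool cycle for the ulsfo text
# hosts. OS flag value and pool/depool helper names are assumptions.
DRY_RUN=echo
for n in $(seq 4037 4044); do
  host="cp${n}.ulsfo.wmnet"
  $DRY_RUN ssh "$host" sudo depool
  $DRY_RUN sudo cookbook sre.hosts.reimage --os bullseye "cp${n}"
  $DRY_RUN ssh "$host" sudo pool
  # leave time between hosts for the cache to warm before the next depool
done
```

In practice the loop body would be run manually per host, with Icinga/Grafana checks and a cache warm-up interval between iterations rather than back-to-back.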