This task will track the planning and execution of the NVMe upgrade to all eight (8) cp text hosts in esams.
These hosts are: cp3066, cp3067, cp3068, cp3069, cp3070, cp3071, cp3072, cp3073.
The SSDs were ordered and have arrived via T344768.
The proposed work window and scope are summarized below from past IRC discussion with robh and @ssingh, but are subject to correction by @ssingh after this task is filed.
The SSDs were delivered to esams shipping via DEL0158639 and confirmed onsite via CS1520630.
The remote work request is CS1553796; remote hands has confirmed receipt of the SSDs and that the work will take place on March 27th at 11AM CET.
Maint Window Details
Event Window: March 27 starting at 9AM CET.
Scope: Full depool
Assumptions: it takes roughly 1 hour to depool a site without significant user impact. If it takes longer, the 9AM start time must move earlier, or the scheduled on-site work must shift from 11AM to later in the day, subject to Traffic approval.
Timeline
2023-03-27 @ 0900 : @ssingh depools esams and reroutes traffic to drmrs (see the command sketch after this timeline).
2023-03-27 @ 1000 : @ssingh puts the cp text hosts into downtime in Icinga and powers them off in advance of the Interxion remote hands work.
2023-03-27 @ 1100 : Interxion remote hands begins work, unplugging the cp hosts one at a time, installing the PCIe NVMe SSD, and plugging each host back in to a fully accessible state before moving on to the next. During this time, robh will be online and will attempt to connect remotely to each host as it is finished and confirm it is functional.
2023-03-27 @ 1300 : Estimated completion of the remote hands work; roughly 2 hours to install the 4 NVMe SSDs into 4 text cp hosts.
2023-03-27 @ 1300 : Reinstallation/reimage of the text cp hosts as required, by either robh or @ssingh.
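A minimal command sketch for the depool and downtime steps above, assuming the GeoDNS depool is done through the admin_state file in the operations/dns repo and that the hosts are downtimed and powered off from a cumin host; the admin_state stanza syntax, the cookbook flags, and the downtime duration are assumptions to verify before use.
```
# Depool esams at the GeoDNS layer (assumption: via the admin_state file in
# the operations/dns repo; check the stanza syntax against the file's comments):
#   geoip/generic-map/esams => DOWN

# Downtime the eight cp text hosts, then power them off, from a cumin host
# (the cookbook and cumin exist; flags and duration here are from memory):
sudo cookbook sre.hosts.downtime --hours 6 -r "esams cp NVMe installation" 'cp30[66-73].esams.wmnet'
sudo cumin 'cp30[66-73].esams.wmnet' 'poweroff'
```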
Action list
- Depool the esams DC and verify traffic is switched to drmrs (at least 4h before the scheduled intervention)
- Downtime the impacted hosts so they are ready for power-off
- Extra: silence/ack any other alerts that may fire
- Power off the impacted hosts
- New SSD installation and host power-on by remote hands
- Verify host health and the ATS cache (without using the new NVMe disk); see the command sketch after this list:
- Run the Puppet agent
- Check that metrics are OK in Grafana
- Check Icinga status (all green for the host)
- Check with lsblk and nvme list that the new disk is visible and has the correct naming (nvme1n1)
- Check ATS status (traffic_server -C check)
- Check for "sure hit" on ATS: curl -vv -H 'Host: en.wikipedia.org' http://localhost:3128/wiki/Main_Page -o /dev/null
- Check for both a first miss and a subsequent hit (e.g. issuing two requests like curl -vv -H 'Host: en.wikipedia.org' http://localhost:3128/wiki/Caffe1ne -o /dev/null and checking the X-Cache-Status value)
- Remove downtime for the hosts in Icinga
- Remove downtime from Alertmanager
- Manually repool the hosts with conftool (auto-depooled by PyBal?)
- Repool the esams DC and verify traffic: https://gerrit.wikimedia.org/r/c/operations/dns/+/1015018 (done at 12:15 UTC)
- (in the following days) Merge the Hiera config to add the new disk, host by host, and reimage the hosts one by one, with an appropriate interval between them to help warm the cache:
- cp3066
- cp3067
- cp3068
- cp3069
- cp3070
- cp3071
- cp3072
- cp3073
- Remove the custom Hiera overrides and apply the config to the whole esams DC
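A per-host verification sketch covering the checks in the list above; the curl targets, port 3128, and the traffic_server check come from the list itself, while the grep filters, the confctl selector, and the use of sudo are assumptions.
```
# Run on each cp text host after power-on (cp3066 used as the example):
sudo run-puppet-agent                 # WMF wrapper; alternatively: sudo puppet agent -t
lsblk                                 # new disk should be visible
sudo nvme list                        # expect the new device as nvme1n1
traffic_server -C check               # ATS configuration check

# "Sure hit" against the local ATS instance on port 3128:
curl -sv -H 'Host: en.wikipedia.org' http://localhost:3128/wiki/Main_Page -o /dev/null 2>&1 | grep -i 'x-cache'

# First miss then hit: issue the same request twice and compare X-Cache-Status:
curl -sv -H 'Host: en.wikipedia.org' http://localhost:3128/wiki/Caffe1ne -o /dev/null 2>&1 | grep -i 'x-cache-status'
curl -sv -H 'Host: en.wikipedia.org' http://localhost:3128/wiki/Caffe1ne -o /dev/null 2>&1 | grep -i 'x-cache-status'

# Once the host looks healthy, repool it with conftool (selector is illustrative):
sudo confctl select 'name=cp3066.esams.wmnet' set/pooled=yes
```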
Reimaging Process
- Depool the host you are going to work on
- Merge the host's patch from the patchset https://gerrit.wikimedia.org/r/c/operations/puppet/+/1015968
- Merge the change on the Puppet master
- Run the reimage cookbook (see the sketch after this list)
- Check that everything is fine after reimaging: Icinga status, disks
- Pool the host back
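A sketch of one reimage cycle, assuming depool/repool via conftool and the sre.hosts.reimage cookbook run from a cumin host; the OS codename, cookbook flags, and task ID below are assumptions to confirm before running, and the host's Puppet change must already be merged as described above.
```
# Example cycle for cp3066; repeat per host, leaving enough time between hosts
# for the cache to warm up.
sudo confctl select 'name=cp3066.esams.wmnet' set/pooled=no

# After merging the host's change from
# https://gerrit.wikimedia.org/r/c/operations/puppet/+/1015968 in Gerrit and on
# the Puppet master, reimage the host (OS and flags are from memory; T<task-id>
# is a placeholder for the tracking task):
sudo cookbook sre.hosts.reimage --os bullseye -t T<task-id> cp3066

# Verify Icinga is green and the new nvme1n1 disk is set up as expected, then repool:
sudo confctl select 'name=cp3066.esams.wmnet' set/pooled=yes
```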