In early 2025 we carried out an upgrade of the an-worker data volume HDDs, tracked in T385485: Q3: an-worker data volumes HDD upgrade tracking task.
Since then, we have seen an unusually high number of disk failure alerts. These alerts automatically create Phabricator tasks and require intervention from both the ops-eqiad and Data-Platform-SRE teams to resolve.
Unfortunately, the current Hadoop worker configuration is not ideal: the servers include a RAID controller, but each data disk has to be configured as a single-disk RAID0 volume.
Naturally, there is no redundancy in this configuration, so if one of these logical volumes fails, its configuration on the RAID controller has to be deleted and re-created.
On top of the RAID controller configuration, any replacement disk also needs a partition table, volume label, filesystem tunables etc. created manually, as described in Data_Platform/Systems/Hadoop/Administration#Swapping_broken_disk.
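As a rough illustration of the manual steps involved, here is a dry-run sketch of re-provisioning a replacement disk. The device name, label, and tunables below are illustrative only; the Administration wiki page is authoritative for the exact values.

```shell
# Hypothetical helper sketching the manual re-provisioning steps.
# Device name, label and tunables are illustrative, not the exact
# values prescribed by the Administration doc.
# With DRY_RUN=1 the commands are printed rather than executed.
provision_disk() {
  local dev="$1" label="$2"
  local run=""
  [ "${DRY_RUN:-0}" = 1 ] && run=echo
  $run parted -s "$dev" mklabel gpt mkpart primary ext4 0% 100%  # new partition table
  $run mkfs.ext4 -L "$label" "${dev}1"                           # filesystem + volume label
  $run tune2fs -m 0 "${dev}1"                                    # fs tunables (reserved blocks)
}

DRY_RUN=1 provision_disk /dev/sdd hadoop-j
```

Every one of these steps has to be repeated for each failed disk, which is part of why the operational cost of these incidents is so high.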
These are all examples of this type of incident from the last 12 months: T387732, T389751, T396703, T398773, T399355, T399991, T401504, T406293, T408359, T409060, T409938, T413704, T413336, T411209, T413360, T414861, T416066
It is possible that these are all legitimate disk failures, an effect of the infant-mortality phase seen in the bathtub curve of hardware reliability.
However, the rate of this type of incident has remained so stubbornly high that it is also possible that there is a more direct cause of the failures.
Perhaps they are not failures of the disks themselves, but some kind of intermittent dropout of the connection between the motherboard or RAID controller and the disk(s).
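One way to distinguish the two cases is to check whether the kernel log around each failure shows genuine medium errors or only link resets (which would point at the connection rather than the disk). The log lines below are hypothetical examples of each pattern; on a worker we would run the greps against `journalctl -k` instead.

```shell
# Hypothetical kernel log excerpt: one link-level event and one
# genuine media error. On a real host, read `journalctl -k` instead.
kernel_log() {
  cat <<'EOF'
ata7.00: exception Emask 0x10 SAct 0x0 SErr 0x4050000 action 0xe frozen
ata7: hard resetting link
sd 0:2:3:0: [sdd] tag#0 Medium Error [current]
EOF
}

# Link resets suggest a connection/power problem; medium errors
# suggest the disk surface itself is failing.
link_resets=$(kernel_log | grep -c 'resetting link')
medium_errors=$(kernel_log | grep -c 'Medium Error')
echo "link resets: ${link_resets}, medium errors: ${medium_errors}"
```

If the affected hosts show link resets without accompanying SMART/medium errors, that would support the connection-dropout theory.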
@Jclark-ctr suggested a possible line of investigation in T415002#11550039:

> The system is currently set to Performance Per Watt (OS). Given the intermittent disk dropouts that occasionally recover after a reboot, we may want to switch the System Profile to Performance to disable deep CPU C-states and PCIe power saving.
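If we pursue that suggestion, the profile change could be made out-of-band via the iDRAC rather than from the BIOS setup screen. A sketch, assuming Dell iDRAC `racadm` (attribute and value names should be verified against the installed BIOS version with `racadm get BIOS.SysProfileSettings` before use):

```shell
# Sketch only: switch the Dell System Profile from
# "Performance Per Watt (OS)" to "Performance" via racadm.
# Verify attribute/value names on the target hosts first.
racadm set BIOS.SysProfileSettings.SysProfile PerfOptimized

# Stage the pending BIOS change and apply it with a power cycle.
racadm jobqueue create BIOS.Setup.1-1
racadm serveraction powercycle
```

This would be worth trialling on one or two of the affected hosts first, so that we can compare their disk-alert rate against unchanged hosts.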
I suggest that we use this ticket to track the investigation into this problem, as well as possible solutions.
