Unusually high disk errors on the an-worker nodes since upgrading the disks
Closed, Resolved (Public)

Description

In early 2025 we carried out this work: T385485: Q3: an-worker data volumes HDD upgrade tracking task

Since then, we have seen an unusually high number of disk failure alerts, which automatically create Phabricator tasks and require intervention from both the ops-eqiad and Data-Platform-SRE teams to resolve.

Unfortunately, the current Hadoop worker configuration is not ideal, since it includes a RAID controller, but each of the data disks has to be configured as a single-disk RAID0 volume.
Naturally, there is no redundancy in this configuration, so if a logical volume (LV) in the RAID controller configuration fails, the LV configuration has to be deleted and re-created.
In addition to this RAID controller configuration, any new disk also has to have a partition table, volume label, filesystem tunables, etc. created manually, as per Data_Platform/Systems/Hadoop/Administration#Swapping_broken_disk.

These are all examples of this type of incident from the last 12 months: T387732, T389751, T396703, T398773, T399355, T399991, T401504, T406293, T408359, T409060, T409938, T413704, T413336, T411209, T413360, T414861, T416066

It is possible that these are all legitimate disk failures and that they are an effect of the Infant Mortality Phase seen in the Bathtub Curve.

However, the rate of this type of incident is so stubbornly high that it is also possible that there is a more direct cause of the failures.
Perhaps they are not failures of the disks themselves, but some kind of intermittent outage of the connection between the motherboard or RAID controller and the disk(s).

@Jclark-ctr suggested a possible line of investigation in T415002#11550039

The system is currently set to Performance Per Watt (OS). Given the intermittent disk dropouts that occasionally recover after a reboot, we may want to switch the System Profile to Performance to disable deep CPU C-states and PCIe power saving.
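As a rough way to check this from the OS side, the sketch below lists the CPU idle states that the Linux kernel exposes through the standard cpuidle sysfs files (`stateN/name` and `stateN/disable`). The function name `list_idle_states` is made up for illustration. Depending on the idle driver in use, what the kernel exposes may not exactly mirror the BIOS System Profile, so this is only a sanity check and not proof that the profile has been applied.

```python
from pathlib import Path


def list_idle_states(cpuidle_dir):
    """Return a list of (name, disabled) tuples, one per stateN directory.

    cpuidle_dir is normally /sys/devices/system/cpu/cpu0/cpuidle, but it is
    passed in so that the function can be exercised against a test directory.
    """
    state_dirs = sorted(
        Path(cpuidle_dir).glob("state[0-9]*"),
        key=lambda p: int(p.name[len("state"):]),  # numeric sort: state2 before state10
    )
    states = []
    for state_dir in state_dirs:
        name = (state_dir / "name").read_text().strip()
        # The "disable" file holds 0 when the state may be used, non-zero otherwise.
        disabled = (state_dir / "disable").read_text().strip() != "0"
        states.append((name, disabled))
    return states


if __name__ == "__main__":
    root = Path("/sys/devices/system/cpu/cpu0/cpuidle")
    if root.is_dir():
        for name, disabled in list_idle_states(root):
            print(f"{name}: {'disabled' if disabled else 'enabled'}")
    else:
        print("No cpuidle directory found; idle states are not exposed on this host")
```

Running this on one worker before and after the profile change would show whether the deeper states are still visible to the kernel.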

I suggest that we use this ticket to track the investigation into this problem, as well as possible solutions.

Event Timeline

Host an-worker1175.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1176.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1177.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1178.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1179.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1180.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1181.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1182.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1183.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1184.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1185.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1186.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1187.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1188.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1189.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1190.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1191.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1192.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1193.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1194.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1195.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1196.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1197.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1198.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1199.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1200.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1201.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1202.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1203.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1204.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1205.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1206.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1207.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1208.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1209.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1210.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1211.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1212.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1213.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1214.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1215.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1216.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1217.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1218.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1219.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Change #1248070 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Remove the cpufrequtils class from the hadoop workers

https://gerrit.wikimedia.org/r/1248070

Change #1248070 merged by Btullis:

[operations/puppet@production] Remove the cpufrequtils class from the hadoop workers

https://gerrit.wikimedia.org/r/1248070

Host an-worker1220.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1221.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1222.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1223.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1224.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1225.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1226.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1227.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1228.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1229.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1230.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1231.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1232.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1233.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1234.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1235.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1236.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

I have nearly finished applying the new server settings.
We can see how many of the hosts still have only 10 or 11 data volumes out of 12 like this:

btullis@cumin1003:~$ sudo cumin A:hadoop-worker 'findmnt|grep /var/lib/hadoop/data|wc -l'
94 hosts will be targeted:
an-worker[1142-1167,1169-1236].eqiad.wmnet
OK to proceed on 94 hosts? Enter the number of affected hosts to confirm or "q" to quit: 94
===== NODE GROUP =====
(1) an-worker1199.eqiad.wmnet
----- OUTPUT of 'findmnt|grep /va...adoop/data|wc -l' -----
10
===== NODE GROUP =====
(87) an-worker[1142-1158,1160-1167,1169-1177,1179-1193,1195-1198,1201-1204,1206,1208-1236].eqiad.wmnet
----- OUTPUT of 'findmnt|grep /va...adoop/data|wc -l' -----
12
===== NODE GROUP =====
(6) an-worker[1159,1178,1194,1200,1205,1207].eqiad.wmnet
----- OUTPUT of 'findmnt|grep /va...adoop/data|wc -l' -----
11
================

So there are still 7 hosts with missing data volumes, which I will try to fix now.
Then we will have a baseline from which to start measuring the future failure rate.
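To avoid adding up the node groups by hand each time, the cumin output could be summarized with something like the sketch below. It only relies on the layout visible in the paste above (a `(N) hosts` header followed by a single numeric output line); the function names and the `expected=12` default are assumptions for illustration and are not part of cumin itself.

```python
import re
from collections import Counter

# Matches a node group header such as "(6) an-worker[1159,1178].eqiad.wmnet".
GROUP_RE = re.compile(r"^\((\d+)\)\s+\S+")


def summarize_mount_counts(cumin_output):
    """Return a Counter mapping reported mount count -> number of hosts."""
    hosts_per_count = Counter()
    group_size = None
    for raw_line in cumin_output.splitlines():
        line = raw_line.strip()
        match = GROUP_RE.match(line)
        if match:
            group_size = int(match.group(1))
        elif group_size is not None and line.isdigit():
            hosts_per_count[int(line)] += group_size
            group_size = None
        # Anything else (separators, ssh errors) is ignored, so hosts that
        # failed to respond are not counted at all.
    return hosts_per_count


def hosts_below(hosts_per_count, expected=12):
    """Count the hosts that reported fewer than `expected` data volumes."""
    return sum(n for count, n in hosts_per_count.items() if count < expected)


SAMPLE = """\
===== NODE GROUP =====
(1) an-worker1199.eqiad.wmnet
----- OUTPUT of 'findmnt|grep /va...adoop/data|wc -l' -----
10
===== NODE GROUP =====
(87) an-worker[1142-1158,1160-1167,1169-1177,1179-1193,1195-1198,1201-1204,1206,1208-1236].eqiad.wmnet
----- OUTPUT of 'findmnt|grep /va...adoop/data|wc -l' -----
12
===== NODE GROUP =====
(6) an-worker[1159,1178,1194,1200,1205,1207].eqiad.wmnet
----- OUTPUT of 'findmnt|grep /va...adoop/data|wc -l' -----
11
"""

if __name__ == "__main__":
    counts = summarize_mount_counts(SAMPLE)
    print(dict(counts))
    print("hosts with missing data volumes:", hosts_below(counts))
```

Hosts that time out over ssh produce no numeric line, so they are left out of the totals and would need to be checked separately.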

Host an-worker1220.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1221.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1222.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1223.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1224.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1225.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1226.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1227.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1228.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1229.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1230.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1231.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1232.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1233.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1234.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1235.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1236.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1199.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1194.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1205.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

Host an-worker1207.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting to pick up new server profile

I've applied the BIOS settings to all hadoop workers and re-added any data disks that were unmounted.
Now we can see the following.

btullis@cumin1003:~$ sudo cumin A:hadoop-worker 'findmnt|grep /var/lib/hadoop/data|wc -l'
94 hosts will be targeted:
an-worker[1142-1167,1169-1236].eqiad.wmnet
OK to proceed on 94 hosts? Enter the number of affected hosts to confirm or "q" to quit: 94
===== NODE GROUP =====
(1) an-worker1231.eqiad.wmnet
----- OUTPUT of 'findmnt|grep /va...adoop/data|wc -l' -----
ssh: connect to host an-worker1231.eqiad.wmnet port 22: Connection timed out
===== NODE GROUP =====
(2) an-worker[1159,1178].eqiad.wmnet
----- OUTPUT of 'findmnt|grep /va...adoop/data|wc -l' -----
11
===== NODE GROUP =====
(91) an-worker[1142-1158,1160-1167,1169-1177,1179-1230,1232-1236].eqiad.wmnet
----- OUTPUT of 'findmnt|grep /va...adoop/data|wc -l' -----
12
12

So there are three outstanding problems, each of which has a ticket:

There is one more open ticket:

I think that we can say we have a baseline now, so I will close this ticket and we can observe whether the failure rate drops over time.
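One way to make the before and after comparison concrete is to normalize incidents by the number of data volumes and the length of the observation window. The sketch below is illustrative only: the function name is made up, and the example inputs assume the 17 tasks listed in the description, 94 hosts with 12 data volumes each (from the cumin output), and that every volume was in service for the whole 12 months. The listed tasks are described as examples, so the real figure may be higher.

```python
def incidents_per_100_volume_years(incidents, volumes, observation_days):
    """Normalize an incident count to incidents per 100 volume-years."""
    if volumes <= 0 or observation_days <= 0:
        raise ValueError("volumes and observation_days must be positive")
    volume_years = volumes * observation_days / 365.0
    return 100.0 * incidents / volume_years


if __name__ == "__main__":
    # Assumed inputs, see the caveats above: 17 listed tasks, 94 hosts x 12 volumes.
    baseline = incidents_per_100_volume_years(17, 94 * 12, 365)
    print(f"baseline: {baseline:.2f} incidents per 100 volume-years")
```

With those assumed inputs the baseline works out at about 1.5 incidents per 100 volume-years. The same function can be re-run with the incident count observed after the BIOS change, using the number of days elapsed, to see whether the rate has moved.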

It's also worth noting that we haven't seen any rise in power consumption due to this BIOS change.

https://grafana.wikimedia.org/goto/eff7605y0uuwwd?orgId=1