
Review recurrent Hadoop worker disk saturation events
Open, Low, Public

Description

Our Hadoop workers sometimes experience disk saturation events that can last for minutes. For example, let's pick an-worker1082:

Screen Shot 2020-10-26 at 9.33.04 AM.png (245 KB)

https://grafana.wikimedia.org/d/VSyI1AWMk/cluster-overview-thanos?orgId=1&from=1603572068854&to=1603574015874&var-site=eqiad&var-cluster=analytics&var-instance=an-worker1082&var-datasource=thanos

It seems to happen on the datanode partitions when a high IOPS read load occurs, which is somewhat expected given the amount of data that we shuffle around. What is not great is that a disk can stay saturated for several minutes, which suggests we might be hitting some bottleneck.
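
A quick way to confirm a saturation window locally is to watch per-device utilization on the affected worker; a sketch, assuming the hot device is one of the datanode disks and using an arbitrary 5s interval:

# Extended per-device stats in MB/s, refreshed every 5 seconds; a sustained
# %util close to 100 on a datanode disk matches what the Grafana panels show.
elukey@an-worker1082:~$ iostat -dxm 5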

On all workers, these are the settings:

elukey@an-worker1080:~$ sudo megacli -LDPDInfo -aAll | grep "Current Cache Policy:" | uniq -c
     13 Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU

Due to how the Dell HW RAID controller works, we have every disk configured as a single-disk RAID-0 volume. It seems possible to force the disks into JBOD mode, but the command appears to be a destructive action (namely, the partition on top would need to be recreated).
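
For completeness, checking whether the controller supports JBOD at all would look something like the sketch below; the EnableJBOD property name and its exact syntax are an assumption, since they vary across MegaCli/firmware versions:

# Query the adapter property (assuming this MegaCli build exposes EnableJBOD).
sudo megacli -AdpGetProp EnableJBOD -aALL
# Flipping it (and recreating the filesystems on top) is the destructive part
# mentioned above, so it is not something to run casually:
# sudo megacli -AdpSetProp EnableJBOD 1 -aALL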

All the options are now set via a cookbook, so I added some comments in https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/633941/3/cookbooks/sre/hadoop/init-hadoop-workers.py about the meaning of each option.

Checked via cumin whether the settings are consistent across workers (note the two outliers with a different cache policy, analytics1049 and analytics1057):

elukey@cumin1001:~$ sudo cumin "A:hadoop-worker" 'megacli -LDPDInfo -aAll | grep "Current Cache Policy:" | uniq -c' -b 10
69 hosts will be targeted:
an-worker[1078-1117].eqiad.wmnet,analytics[1049-1077].eqiad.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====
(1) analytics1049.eqiad.wmnet
----- OUTPUT of 'megacli -LDPDInf...licy:" | uniq -c' -----
     13 Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU
===== NODE GROUP =====
(1) analytics1057.eqiad.wmnet
----- OUTPUT of 'megacli -LDPDInf...licy:" | uniq -c' -----
     12 Current Cache Policy: WriteThrough, ReadAdaptive, Direct, No Write Cache if Bad BBU
===== NODE GROUP =====
(1) analytics1055.eqiad.wmnet
----- OUTPUT of 'megacli -LDPDInf...licy:" | uniq -c' -----
     12 Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
===== NODE GROUP =====
(6) an-worker[1096-1101].eqiad.wmnet
----- OUTPUT of 'megacli -LDPDInf...licy:" | uniq -c' -----
     24 Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
===== NODE GROUP =====
(60) an-worker[1078-1095,1102-1117].eqiad.wmnet,analytics[1050-1054,1056,1058-1077].eqiad.wmnet
----- OUTPUT of 'megacli -LDPDInf...licy:" | uniq -c' -----
     13 Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU

Disk saturation is not really a huge deal for our use case, so this task is not high priority, but if there is some setting we could test/apply to improve disk performance it would be great :)
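
If we do want to test something, an easy first candidate would be aligning the two outliers above with the WriteBack / ReadAdaptive policy used by the rest of the fleet; a sketch, assuming this MegaCli version accepts the usual -LDSetProp shorthands (WB for WriteBack, ADRA for adaptive read-ahead):

# To be run only on analytics1049 / analytics1057, after double checking the BBU state.
sudo megacli -LDSetProp WB -LAll -aAll
sudo megacli -LDSetProp ADRA -LAll -aAll
# Verify the result:
sudo megacli -LDPDInfo -aAll | grep "Current Cache Policy:" | uniq -c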

Event Timeline

I added some Datanode metrics to the Hadoop Grafana dashboard, and started 3 iotop sessions (dumping to a file) on an-worker108[1-3] to get an idea of which processes are hammering the disks periodically.
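
For reference, the iotop sessions are along these lines (a sketch; the exact flags and output path are illustrative, not necessarily the ones actually used):

# Batch mode, only processes currently doing I/O, timestamped, appended to a log.
elukey@an-worker1081:~$ sudo iotop -botqqq >> iotop-$(hostname).log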

There was an error in https://grafana.wikimedia.org/d/VSyI1AWMk/cluster-overview-thanos and in the other dashboard (the user cluster overview): read/write metrics were supposed to be shown as -/+, but in reality they were all +. I modified the dashboard, and I'll fix the description of the task as well.

As far as I can see from iotop's logs, a lot of the following are present when disks are saturated:

10:19:19 29017 be/4 hdfs     4597.34 K/s    0.00 K/s  0.00 %  1.09 % java -Dproc_datanode -Xmx1000m [..] org.apache.hadoop.hdfs.server.datanode.DataNode [DataXceiver for]

A few of those piled up likely add up to 50-60 MB/s, which should be enough to saturate a disk. The DataXceiver thread handles generic input/output data exchange between a client and the datanode. At the same time, when the disks saturate, the new datanode metrics show a lot of local client reads (so tasks local to the datanode are hammering the disks for blocks).
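
The same local-vs-remote read counters can be pulled directly from the Datanode's JMX servlet for a quick spot check; a sketch, assuming the stock Hadoop 2.x datanode HTTP port and DataNodeActivity metric names:

# ReadsFromLocalClient vs ReadsFromRemoteClient under DataNodeActivity
# (50075 is the default datanode HTTP port; our puppetized value may differ).
elukey@an-worker1082:~$ curl -s 'http://localhost:50075/jmx?qry=Hadoop:service=DataNode,name=DataNodeActivity*' | grep -E 'ReadsFrom(Local|Remote)Client'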

Ottomata triaged this task as Medium priority.
Ottomata moved this task from Backlog to Q3 2020/2021 on the Analytics-Clusters board.
odimitrijevic lowered the priority of this task from Medium to Low. Dec 3 2021, 10:53 PM
odimitrijevic moved this task from Incoming (new tickets) to Ops Week on the Data-Engineering board.