
Review recurrent Hadoop worker disk saturation events
Open, Low, Public

Description

Our Hadoop workers sometimes experience disk saturation events that can last for minutes. For example, let's pick an-worker1082:

Screen Shot 2020-10-26 at 9.33.04 AM.png (245 KB)

https://grafana.wikimedia.org/d/VSyI1AWMk/cluster-overview-thanos?orgId=1&from=1603572068854&to=1603574015874&var-site=eqiad&var-cluster=analytics&var-instance=an-worker1082&var-datasource=thanos

It seems to happen on the datanode partitions when a high IOPS read load occurs, which is somewhat expected given the amount of data that we shuffle around. What is not great is that a disk can stay saturated for several minutes, which suggests we might be hitting some bottleneck.
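
A quick way to confirm a saturation window locally is to watch per-device utilization on the affected worker; a sketch, assuming the hot device is one of the datanode disks and using an arbitrary 5s interval:

# Extended per-device stats in MB/s, refreshed every 5 seconds; a sustained
# %util close to 100 on a datanode disk matches what the Grafana panels show.
elukey@an-worker1082:~$ iostat -dxm 5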

On all workers, these are the settings:

elukey@an-worker1080:~$ sudo megacli -LDPDInfo -aAll | grep "Current Cache Policy:" | uniq -c
     13 Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU

Due to how the Dell HW RAID controller works, we have every disk configured as a single-disk RAID-0 volume. It seems possible to force the disks into JBOD mode, but the command appears to be a destructive action (namely, the partition on top would need to be recreated).
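
For completeness, checking whether the controller supports JBOD at all would look something like the sketch below; the EnableJBOD property name and its exact syntax are an assumption, since they vary across MegaCli/firmware versions:

# Query the adapter property (assuming this MegaCli build exposes EnableJBOD).
sudo megacli -AdpGetProp EnableJBOD -aALL
# Flipping it (and recreating the filesystems on top) is the destructive part
# mentioned above, so it is not something to run casually:
# sudo megacli -AdpSetProp EnableJBOD 1 -aALL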

All the options are now set via a cookbook, so I added some comments in https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/633941/3/cookbooks/sre/hadoop/init-hadoop-workers.py about the meaning of each option.

Checked via cumin whether the settings are consistent across workers (note the two outliers with a different cache policy, analytics1049 and analytics1057):

elukey@cumin1001:~$ sudo cumin "A:hadoop-worker" 'megacli -LDPDInfo -aAll | grep "Current Cache Policy:" | uniq -c' -b 10
69 hosts will be targeted:
an-worker[1078-1117].eqiad.wmnet,analytics[1049-1077].eqiad.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====
(1) analytics1049.eqiad.wmnet
----- OUTPUT of 'megacli -LDPDInf...licy:" | uniq -c' -----
     13 Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU
===== NODE GROUP =====
(1) analytics1057.eqiad.wmnet
----- OUTPUT of 'megacli -LDPDInf...licy:" | uniq -c' -----
     12 Current Cache Policy: WriteThrough, ReadAdaptive, Direct, No Write Cache if Bad BBU
===== NODE GROUP =====
(1) analytics1055.eqiad.wmnet
----- OUTPUT of 'megacli -LDPDInf...licy:" | uniq -c' -----
     12 Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
===== NODE GROUP =====
(6) an-worker[1096-1101].eqiad.wmnet
----- OUTPUT of 'megacli -LDPDInf...licy:" | uniq -c' -----
     24 Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
===== NODE GROUP =====
(60) an-worker[1078-1095,1102-1117].eqiad.wmnet,analytics[1050-1054,1056,1058-1077].eqiad.wmnet
----- OUTPUT of 'megacli -LDPDInf...licy:" | uniq -c' -----
     13 Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU

Disk saturation is not really a huge deal for our use case, so this task is not high priority, but if there is some setting we could test/apply to improve disk performance it would be great :)
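
If we do want to test something, an easy first candidate would be aligning the two outliers above with the WriteBack / ReadAdaptive policy used by the rest of the fleet; a sketch, assuming this MegaCli version accepts the usual -LDSetProp shorthands (WB for WriteBack, ADRA for adaptive read-ahead):

# To be run only on analytics1049 / analytics1057, after double checking the BBU state.
sudo megacli -LDSetProp WB -LAll -aAll
sudo megacli -LDSetProp ADRA -LAll -aAll
# Verify the result:
sudo megacli -LDPDInfo -aAll | grep "Current Cache Policy:" | uniq -c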

Event Timeline

I added some Datanode metrics to the Hadoop Grafana dashboard, and started 3 iotop sessions (dumping to a file) on an-worker108[1-3] to get an idea of which processes are hammering the disks periodically.
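
For reference, the iotop sessions are along these lines (a sketch; the exact flags and output path are illustrative, not necessarily the ones actually used):

# Batch mode, only processes currently doing I/O, timestamped, appended to a log.
elukey@an-worker1081:~$ sudo iotop -botqqq >> iotop-$(hostname).log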

There was an error in https://grafana.wikimedia.org/d/VSyI1AWMk/cluster-overview-thanos and in the other dashboard (the user cluster overview): read/write metrics were supposed to be shown as -/+, but in reality they were all +. I modified the dashboard, and I'll fix the description of the task as well.

As far as I can see from iotop's logs, a lot of the following are present when disks are saturated:

10:19:19 29017 be/4 hdfs     4597.34 K/s    0.00 K/s  0.00 %  1.09 % java -Dproc_datanode -Xmx1000m [..] org.apache.hadoop.hdfs.server.datanode.DataNode [DataXceiver for]

A few of those piled up likely add up to 50-60 MB/s, which should be enough to saturate a disk. The DataXceiver thread handles generic input/output data exchange between a client and the datanode. At the same time, when the disks saturate, the new datanode metrics show a lot of local client reads (so tasks local to the datanode are hammering the disks for blocks).
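
The same local-vs-remote read counters can be pulled directly from the Datanode's JMX servlet for a quick spot check; a sketch, assuming the stock Hadoop 2.x datanode HTTP port and DataNodeActivity metric names:

# ReadsFromLocalClient vs ReadsFromRemoteClient under DataNodeActivity
# (50075 is the default datanode HTTP port; our puppetized value may differ).
elukey@an-worker1082:~$ curl -s 'http://localhost:50075/jmx?qry=Hadoop:service=DataNode,name=DataNodeActivity*' | grep -E 'ReadsFrom(Local|Remote)Client'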

Ottomata triaged this task as Medium priority.
Ottomata moved this task from Backlog to Q3 2020/2021 on the Analytics-Clusters board.
odimitrijevic lowered the priority of this task from Medium to Low. Dec 3 2021, 10:53 PM
odimitrijevic moved this task from Incoming (new tickets) to Ops Week on the Data-Engineering board.