The HDFS block size is currently configured at 256MB.
On the Spark side, the value is still the default: 128MB.
This mismatch is a problem because Spark uses this value (among other things) to cap how much data a single task reads from HDFS.
Example: Spark needs to read a 160MB file. With the limit set to 128MB, it will create 2 tasks:
- task 1 reads the first 128MB
- task 2 reads the remaining 32MB
This is suboptimal: splitting what fits in a single 256MB HDFS block across two tasks tends to generate unnecessary network I/O (one of the reads often ends up non-local), and the extra tasks add scheduling overhead. See the sketch below.
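A quick way to observe this is to compare partition counts under both limits. The file path below is hypothetical, and the expected counts assume a modest default parallelism (Spark also factors in spark.sql.files.openCostInBytes and the number of available cores when packing splits), so treat this as a sketch rather than a guaranteed result:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("maxPartitionBytes-demo")
  .getOrCreate()

// Default 128MB limit: a ~160MB file is cut into roughly 2 read tasks.
spark.conf.set("spark.sql.files.maxPartitionBytes", 128L * 1024 * 1024)
val dfDefault = spark.read.parquet("hdfs:///data/example.parquet")
println(s"Partitions at 128MB: ${dfDefault.rdd.getNumPartitions}")   // typically 2

// Limit raised to the 256MB HDFS block size: a single task can read the whole file.
spark.conf.set("spark.sql.files.maxPartitionBytes", 256L * 1024 * 1024)
val dfAligned = spark.read.parquet("hdfs:///data/example.parquet")
println(s"Partitions at 256MB: ${dfAligned.rdd.getNumPartitions}")   // typically 1
```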
So I think spark.sql.files.maxPartitionBytes should be set to 256MB (268435456) cluster-wide:
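A sketch of how the change could be rolled out: the cluster-wide setting would normally go into spark-defaults.conf (exact deployment depends on how the cluster is managed), and the per-application override shown below is useful for validating the effect before changing the default for everyone.

```scala
// Cluster-wide: add this line to spark-defaults.conf on the driver/gateway hosts:
//   spark.sql.files.maxPartitionBytes   268435456
//
// Per-application override, e.g. while testing the change:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("block-aligned-reads")
  .config("spark.sql.files.maxPartitionBytes", 268435456L)
  .getOrCreate()
```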