
Set Spark maxPartitionBytes to 256MB
Closed, Resolved · Public

Description

The current block size in HDFS configuration is set to 256MB.
The current value in Spark is the default: 128MB.

This is a problem because Spark uses this value (among other things) to limit the size of its reads from HDFS.

Example: Spark needs to read a 160MB file. With the limit set at 128MB, it will create 2 tasks:

  • task 1 reads the first 128MB
  • task 2 reads the remaining 32MB

This is suboptimal, as it triggers unnecessary network I/O. There is also overhead from the excessive creation of tasks.
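
For illustration, here is a rough sketch of the splitting described above. It is simplified on purpose: Spark's real partition planning also factors in things like spark.sql.files.openCostInBytes and the available parallelism, so this only models the basic "chop the file into maxPartitionBytes-sized chunks" behaviour.

  # Simplified illustration of how one file is chopped into read tasks.
  # NOT Spark's actual planner, which also weighs spark.sql.files.openCostInBytes
  # and the cluster's parallelism.
  def split_file(file_size_bytes, max_partition_bytes):
      tasks = []
      offset = 0
      while offset < file_size_bytes:
          length = min(max_partition_bytes, file_size_bytes - offset)
          tasks.append((offset, length))
          offset += length
      return tasks

  MB = 1024 * 1024
  # 160MB file with the 128MB default: two tasks (128MB + 32MB).
  print(split_file(160 * MB, 128 * MB))  # [(0, 134217728), (134217728, 33554432)]
  # Same file with a 256MB limit: a single task reads the whole file.
  print(split_file(160 * MB, 256 * MB))  # [(0, 167772160)]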

So I think spark.sql.files.maxPartitionBytes should be set to 256MB (268435456) cluster-wide.
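
For reference, a per-job override looks like the sketch below (the app name and session setup are only illustrative). The point of this task is to make it the cluster-wide default instead, via spark-defaults.

  # Hypothetical per-session override; the cluster-wide fix would set the
  # same property in spark-defaults instead.
  from pyspark.sql import SparkSession

  spark = (
      SparkSession.builder
      .appName("maxPartitionBytes-example")  # illustrative app name
      .config("spark.sql.files.maxPartitionBytes", 268435456)  # 256MB, matching the HDFS block size
      .getOrCreate()
  )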

Event Timeline

Cool, okay! Antoine, would you like to learn how to make a Puppet patch to do this? I can help show you how. (you can say no! :) )

Actually, looking at the code, it might not be obvious to do this nicely. I think I can make this match the Hadoop setting by default.

Change 758529 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Set spark maxPartitionBytes to hadoop dfs block size

https://gerrit.wikimedia.org/r/758529

Let's merge and apply this on Monday.

Change 758529 merged by Ottomata:

[operations/puppet@production] Set spark maxPartitionBytes to hadoop dfs block size

https://gerrit.wikimedia.org/r/758529
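
Once the change is applied, one way to sanity-check that the effective Spark default now matches the HDFS block size is sketched below (it peeks at the Hadoop configuration through the internal _jsc handle, so treat it as a rough check rather than a supported API).

  # Sanity check after the Puppet change: the Spark default should match
  # the HDFS block size (dfs.blocksize).
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.getOrCreate()
  print(spark.conf.get("spark.sql.files.maxPartitionBytes"))  # expect 268435456
  # Uses the internal _jsc handle to read the Hadoop conf seen by this session.
  print(spark.sparkContext._jsc.hadoopConfiguration().get("dfs.blocksize"))  # expect 268435456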

Mentioned in SAL (#wikimedia-analytics) [2022-02-07T14:38:33Z] <ottomata> merged Set spark maxPartitionBytes to hadoop dfs block size - T300299