The HDFS block size is currently configured at 256MB.
On the Spark side, the value is still the default: 128MB.
This mismatch is a problem because Spark uses this value (among other things) to cap how much data a single task reads from HDFS.
Example: Spark needs to read a 160MB file. With the limit set to 128MB, it will create 2 tasks:
- task 1 reads the first 128MB
- task 2 reads the remaining 32MB
This is suboptimal: splitting what fits in a single 256MB HDFS block across two tasks tends to generate unnecessary network I/O (one of the reads often ends up non-local), and the extra tasks add scheduling overhead. See the sketch below.
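A quick way to observe this is to compare partition counts under both limits. The file path below is hypothetical, and the expected counts assume a modest default parallelism (Spark also factors in spark.sql.files.openCostInBytes and the number of available cores when packing splits), so treat this as a sketch rather than a guaranteed result:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("maxPartitionBytes-demo")
  .getOrCreate()

// Default 128MB limit: a ~160MB file is cut into roughly 2 read tasks.
spark.conf.set("spark.sql.files.maxPartitionBytes", 128L * 1024 * 1024)
val dfDefault = spark.read.parquet("hdfs:///data/example.parquet")
println(s"Partitions at 128MB: ${dfDefault.rdd.getNumPartitions}")   // typically 2

// Limit raised to the 256MB HDFS block size: a single task can read the whole file.
spark.conf.set("spark.sql.files.maxPartitionBytes", 256L * 1024 * 1024)
val dfAligned = spark.read.parquet("hdfs:///data/example.parquet")
println(s"Partitions at 256MB: ${dfAligned.rdd.getNumPartitions}")   // typically 1
```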
So I think spark.sql.files.maxPartitionBytes should be set to 256MB (268435456) cluster-wide:
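A sketch of how the change could be rolled out: the cluster-wide setting would normally go into spark-defaults.conf (exact deployment depends on how the cluster is managed), and the per-application override shown below is useful for validating the effect before changing the default for everyone.

```scala
// Cluster-wide: add this line to spark-defaults.conf on the driver/gateway hosts:
//   spark.sql.files.maxPartitionBytes   268435456
//
// Per-application override, e.g. while testing the change:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("block-aligned-reads")
  .config("spark.sql.files.maxPartitionBytes", 268435456L)
  .getOrCreate()
```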