
Fix skein/spark memory unit mismatch
Closed, ResolvedPublic2 Estimated Story Points

Description

In Spark, you specify memory as K/M/G or KB/MB/GB (case doesn't matter), and this is interpreted as kibibytes/mebibytes/gibibytes = 2^10/2^20/2^30 bytes.
In skein, when you specify memory the same way you do in Spark, you get kilobytes/megabytes/gigabytes = 10^3/10^6/10^9 bytes! To get binary units you need to write the "i" explicitly: KiB/MiB/GiB.
This is what makes our spark-skein jobs be configured as, for instance:
Skein master -> 3815 MiB
Spark driver -> 4G
With these settings we have ~4*10^9 = 4000000000 bytes available in the container, while Spark can request up to 4*2^30 = 4294967296 bytes. If the container comes under memory pressure, YARN will kill it.
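The shortfall is easy to quantify. A minimal sketch of the arithmetic above (the numbers come straight from the example configuration):

```python
# Spark interprets "4G" as 4 GiB (binary units); skein interprets a
# bare "4 GB" as 4 decimal gigabytes. Same string, different byte counts.
spark_bytes = 4 * 2**30   # what Spark may actually request: 4 GiB
skein_bytes = 4 * 10**9   # what the container is sized to: 4 GB

shortfall = spark_bytes - skein_bytes
print(spark_bytes)              # 4294967296
print(skein_bytes)              # 4000000000
print(shortfall)                # 294967296 bytes, i.e. ~281 MiB missing
```

So the container is roughly 281 MiB smaller than what Spark believes it can use, which is exactly the window in which YARN kills the container.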

Proposed solution: Parse the memory values passed from Spark to skein and convert them to explicit binary units.
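One possible shape for that conversion, as a sketch (the helper name `spark_memory_to_mib` is hypothetical, not from the actual patch): parse the Spark-style suffix, treat it as a binary unit the way Spark does, and emit a whole number of MiB that skein can be given with an explicit "MiB" unit. Rounding up avoids under-provisioning the container.

```python
import math
import re

def spark_memory_to_mib(value: str) -> int:
    """Convert a Spark-style memory string (e.g. "4g", "512m", "3815mib")
    to whole MiB, interpreting bare k/m/g/t as binary units like Spark does.
    Rounds up so the skein container is never smaller than Spark's view."""
    m = re.fullmatch(r"\s*(\d+)\s*([kmgt])i?b?\s*", value.lower())
    if m is None:
        raise ValueError(f"unparseable memory string: {value!r}")
    amount, unit = int(m.group(1)), m.group(2)
    # k -> 1024^1 bytes, m -> 1024^2, g -> 1024^3, t -> 1024^4
    num_bytes = amount * 1024 ** ("kmgt".index(unit) + 1)
    return math.ceil(num_bytes / 1024**2)

print(spark_memory_to_mib("4g"))   # 4096
print(spark_memory_to_mib("512m")) # 512
```

The skein resource spec would then be built from the converted value with an explicit unit string such as "4096 MiB", which skein parses as binary, so both sides agree on the byte count.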

Event Timeline

@Gehel @JAllemandou @dcausse What do you think the priority here should be?

Is there a related task?

Ottomata set the point value for this task to 2.Jan 14 2025, 6:56 PM

I faced this issue while migrating a Hive query to Spark SQL, where the partition selection used high-level SQL functions, causing the Spark driver to hit its 4g default limit and be rapidly killed by YARN. I fixed it by using simpler criteria, so it's not blocking the search team in any way. What is particularly annoying is that Spark itself does not show the usual warnings (GC overhead exceptions or the like); the driver is just killed right away, which can make debugging harder.

We can repro this issue in mjolnir, and it's currently blocking the MLR pipeline.

Just discussed with @dcausse, and we'd rather fix this upstream than add a workaround for the specific failing task. I'll be working on a patch today.