
Investigate and tune mjolnir resource allocation
Closed, Resolved (Public)

Description

Recently, we encountered occasional memory-related failures in mjolnir that require investigation. It is possible that the Spark configuration can be tuned to better utilize YARN resources under the current workloads.

AC

  • mjolnir's config and resource consumption have been reviewed, and obvious bottlenecks have been identified

Event Timeline

After switching the Spark deployment from cluster mode to client mode, we ran into memory pressure because the Spark driver was under-resourced.
This occurred because the driver was running in a Skein-managed container instead of directly atop YARN, and Skein was not sizing that container correctly.

This issue was resolved by:

  • T383589: Fix skein/spark memory unit missfit
  • Skein ignored the spark.driver.memoryOverhead property set by the feature selection job. This was fixed by explicitly passing skein_memory=<spark.driver.memory>+<spark.driver.memoryOverhead> to the Spark submit operator.
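
The sizing rule above (container memory = driver memory + driver memory overhead) can be sketched as follows. This is a minimal illustration, not the code from the actual fix: the helper names `spark_memory_to_mib` and `driver_container_mib` are hypothetical, only the `k`/`m`/`g`/`t` suffixes are handled, a bare number is assumed to mean MiB, and whether `skein_memory` expects a plain MiB integer or a unit-suffixed string is not stated in this task.

```python
import re

# Spark-style suffixes expressed in KiB (binary units).
_SUFFIX_TO_KIB = {"k": 1, "m": 1024, "g": 1024 ** 2, "t": 1024 ** 3}


def spark_memory_to_mib(value: str) -> int:
    """Convert a Spark memory string such as '4g' or '512m' to whole MiB.

    Assumption: a bare number (no suffix) is interpreted as MiB.
    Fractions of a MiB are rounded up so the container is never undersized.
    """
    match = re.fullmatch(r"(\d+)([kmgt]?)", value.strip().lower())
    if match is None:
        raise ValueError(f"unrecognised memory string: {value!r}")
    amount = int(match.group(1))
    suffix = match.group(2) or "m"
    kib = amount * _SUFFIX_TO_KIB[suffix]
    return -(-kib // 1024)  # ceiling division


def driver_container_mib(spark_conf: dict) -> int:
    """Sum driver heap and overhead, both normalised to MiB."""
    return (
        spark_memory_to_mib(spark_conf["spark.driver.memory"])
        + spark_memory_to_mib(spark_conf["spark.driver.memoryOverhead"])
    )


# Example values only; not the settings used by the feature selection job.
example_conf = {
    "spark.driver.memory": "4g",
    "spark.driver.memoryOverhead": "512m",
}
skein_memory_mib = driver_container_mib(example_conf)
```

Normalising both properties to a single unit before adding them avoids the kind of unit mismatch between Spark and Skein that T383589 addressed.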
Gehel claimed this task.