
[Refine DAG Improvement] Add Parameter to Reduce Spark Driver Logs in Skein Log Collection
Closed, Resolved (Public)

Description

During the development of the new Refine DAG, Skein log collection was activated to facilitate faster development by making meaningful information easily accessible through the Airflow UI. However, it had to be disabled due to disk constraints on an-launcher1002.

Currently, when Skein log collection is activated, it retrieves logs only for the Skein application, and these consist primarily of Spark Driver logs. They are often cluttered with verbose Spark-internal lines, which makes them hard to navigate and consumes disk space unnecessarily.

Proposed Solution:
• Introduce a parameter to the SparkSubmitOperator that reduces verbosity in the Spark Driver logs.
• Configure this parameter to suppress internal Spark log lines while retaining meaningful information for debugging and monitoring.
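One way such a parameter could work is sketched below: a helper renders a logging properties file from a mapping of logger names to levels, and a second helper builds the JVM option that points the Spark driver at that file. This is a minimal sketch under stated assumptions, not the change that was merged. It assumes the log4j 1.x properties format (newer Spark releases use Log4j 2, whose syntax differs), and the function names, the default logger list, and the file path are all hypothetical.

```python
from typing import Dict, Mapping

# Hypothetical defaults: which loggers to quiet is an assumption made for
# illustration, not the list used in the actual patch.
DEFAULT_QUIET_LOGGERS: Dict[str, str] = {
    "org.apache.spark": "WARN",
    "org.apache.hadoop": "WARN",
}

# Standard log4j 1.x level names.
VALID_LEVELS = {"ALL", "TRACE", "DEBUG", "INFO", "WARN", "ERROR", "FATAL", "OFF"}


def _check_level(level: str) -> str:
    """Return the upper-cased level, or raise ValueError if it is unknown."""
    normalized = level.upper()
    if normalized not in VALID_LEVELS:
        raise ValueError(f"Unknown log level: {level!r}")
    return normalized


def render_log4j_properties(
    logger_levels: Mapping[str, str] = DEFAULT_QUIET_LOGGERS,
    root_level: str = "INFO",
) -> str:
    """Render log4j 1.x properties text with per-logger level overrides."""
    lines = [
        f"log4j.rootLogger={_check_level(root_level)}, console",
        "log4j.appender.console=org.apache.log4j.ConsoleAppender",
        "log4j.appender.console.target=System.err",
        "log4j.appender.console.layout=org.apache.log4j.PatternLayout",
        "log4j.appender.console.layout.ConversionPattern="
        "%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n",
    ]
    # Sort for a stable, diff-friendly output.
    for name, level in sorted(logger_levels.items()):
        lines.append(f"log4j.logger.{name}={_check_level(level)}")
    return "\n".join(lines) + "\n"


def driver_java_options(properties_path: str) -> str:
    """Build the JVM flag that makes log4j 1.x load the given properties file."""
    return f"-Dlog4j.configuration=file:{properties_path}"
```

The returned option string would typically be passed through the `spark.driver.extraJavaOptions` Spark setting, with the properties file made available on the driver host. Keeping the root logger at INFO while raising only the Spark-internal loggers to WARN is what preserves application-level messages for debugging.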

Benefits:
• Reduced size of collected logs, mitigating disk usage issues.
• Enhanced clarity in log content, making it easier to debug and monitor the Refine DAG.
• Re-enablement of Skein log collection without compromising system resources.

Details

Related Changes in Gerrit:
Related Changes in GitLab:
| Title | Reference | Author | Source Branch | Dest Branch |
| Enable customizable logger levels for Spark apps | repos/data-engineering/airflow-dags!953 | aqu | T381074_quieter_spark_logger | main |

Event Timeline

Change #1100394 had a related patch set uploaded (by Aqu; author: Aqu):

[analytics/refinery@master] Improve Spark Logger Quietness

https://gerrit.wikimedia.org/r/1100394

Change #1100394 merged by Aqu:

[analytics/refinery@master] Improve Spark Logger Quietness

https://gerrit.wikimedia.org/r/1100394

Are we making this the default for all Spark skein apps and not just Refine? Seems like it would be useful for everyone.

Antoine_Quhen changed the task status from Open to In Progress. Mar 26 2025, 1:52 PM

Let's roll out progressively.

The property file is done.
The helper integration within the Airflow Spark operator is in review.

The custom logger config is currently active on the Refine staging DAG.