
Spark history server lags behind and some tasks are not indexed in time
Closed, ResolvedPublic

Description

@JAllemandou noticed that some jobs only appeared in the Spark History UI about a day after they had run (see Slack thread). We need to figure out why the server is lagging, in order to restore near-real-time indexing.

Event Timeline

Change 1005727 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] spark-history: expand the an-worker subnets the SHS can egress to

https://gerrit.wikimedia.org/r/1005727

Change 1005727 merged by Brouberol:

[operations/deployment-charts@master] spark-history: expand the an-worker subnets the SHS can egress to

https://gerrit.wikimedia.org/r/1005727

Mentioned in SAL (#wikimedia-analytics) [2024-02-22T11:52:51Z] <brouberol> redeploying the spark-history server with expanded egress rules for hadoop workers - T358206

The spark history server is now catching up on its lag after a redeploy. No more tracebacks of failed connections are observed.
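
For future debugging, here is a minimal sketch of the kind of check that could flag this class of problem: given a list of allowed egress CIDRs and a list of worker IPs, it reports which workers fall outside every allowed subnet. This is only one plausible reading of the fix, based on the patch title. The function name `uncovered_hosts` is hypothetical, and the addresses are placeholder documentation ranges, not the real an-worker subnets or the values from the chart.

```python
import ipaddress


def uncovered_hosts(host_ips, allowed_cidrs):
    """Return the host IPs that fall outside every allowed egress CIDR."""
    networks = [ipaddress.ip_network(cidr) for cidr in allowed_cidrs]
    return [
        ip
        for ip in host_ips
        if not any(ipaddress.ip_address(ip) in net for net in networks)
    ]


# Placeholder values (RFC 5737 documentation ranges), NOT the real subnets.
allowed_before = ["192.0.2.0/24"]
allowed_after = ["192.0.2.0/24", "198.51.100.0/24"]
workers = ["192.0.2.10", "198.51.100.7"]

# Workers the server could not reach under the narrower rules.
print(uncovered_hosts(workers, allowed_before))
# With the expanded rules, every worker is covered.
print(uncovered_hosts(workers, allowed_after))
```

An empty list from the second call would mean every worker is covered by the egress rules, which would be consistent with the failed-connection tracebacks disappearing after the redeploy.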