Spark history server lags behind and some tasks are not indexed in time
Closed, ResolvedPublic

Assigned To

Authored By

	brouberol
	Feb 22 2024, 11:36 AM

Description

@JAllemandou noticed that some jobs only appeared in the Spark History UI about a day after they had run (see Slack thread). We need to figure out why the server is lagging, in order to restore near-real-time indexing.

Details

	Subject	Repo	Branch	Lines +/-
	spark-history: expand the an-worker subnets the SHS can egress to	operations/deployment-charts	master	+14 -0


Event Timeline

brouberol created this task. Feb 22 2024, 11:36 AM

Restricted Application added a subscriber: Aklapper. Feb 22 2024, 11:36 AM

Change 1005727 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] spark-history: expand the an-worker subnets the SHS can egress to

https://gerrit.wikimedia.org/r/1005727

gerritbot added a project: Patch-For-Review. Feb 22 2024, 11:37 AM

Change 1005727 merged by Brouberol:

[operations/deployment-charts@master] spark-history: expand the an-worker subnets the SHS can egress to

https://gerrit.wikimedia.org/r/1005727

brouberol triaged this task as High priority. Feb 22 2024, 11:52 AM

Mentioned in SAL (#wikimedia-analytics) [2024-02-22T11:52:51Z] <brouberol> redeploying the spark-history server with expanded egress rules for hadoop workers - T358206

The Spark History Server is now catching up on its lag after the redeploy. No more tracebacks from failed connections are observed.
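
For anyone hitting a similar symptom, here is a minimal sketch of how one might check whether a set of worker IPs is covered by a list of allowed egress CIDRs. It is not the actual chart logic: the function name `uncovered_hosts` is hypothetical, and the addresses are documentation-range placeholders (RFC 5737), not the real an-worker subnets.

```python
import ipaddress


def uncovered_hosts(host_ips, allowed_cidrs):
    """Return the host IPs that fall outside every allowed egress CIDR."""
    networks = [ipaddress.ip_network(cidr) for cidr in allowed_cidrs]
    return [
        ip
        for ip in host_ips
        if not any(ipaddress.ip_address(ip) in net for net in networks)
    ]


# Hypothetical values, for illustration only (not the real subnets).
allowed = ["192.0.2.0/25"]
workers = ["192.0.2.10", "192.0.2.200", "198.51.100.7"]

# Workers listed here would be unreachable under the assumed egress rules.
missing = uncovered_hosts(workers, allowed)
print(missing)
```

Any host reported as missing would need its subnet added to the egress rules before the server could connect to it.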

Maintenance_bot removed a project: Patch-For-Review. Feb 22 2024, 12:31 PM

BTullis moved this task from Backlog to Done on the Data-Platform-SRE (2024.02.12 - 2024.03.03) board. Feb 22 2024, 12:37 PM
