
Completed dump pods take a long time to get terminated
Closed, Resolved (Public)

Description

When the dump process of a dump task pod terminates, the pod enters a 2/3 NotReady state, as both sidecar containers are still ready.

That state is detected by the job-sidecar-controller, which is in charge of exec-ing into each sidecar container and killing its PID 1.
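To illustrate the detection step, here is a minimal sketch of how such a controller might decide which sidecars to stop. It is not the actual job-sidecar-controller code: the function name `sidecars_to_kill`, the main container name `dump` and the sidecar names are hypothetical. Only the `status.containerStatuses` layout comes from the Kubernetes Pod API.

```python
from typing import Any


def sidecars_to_kill(pod: dict[str, Any], main_container: str) -> list[str]:
    """Return the names of sidecars still running after the main container exited.

    `pod` is a Pod object as a plain dict, e.g. parsed from
    `kubectl get pod <name> -o json`.
    """
    statuses = pod.get("status", {}).get("containerStatuses", [])
    by_name = {s["name"]: s for s in statuses}

    main = by_name.get(main_container)
    # Only act once the main container has terminated.
    if main is None or "terminated" not in main.get("state", {}):
        return []

    return [
        name
        for name, s in by_name.items()
        if name != main_container and "running" in s.get("state", {})
    ]


# Made-up pod in the 2/3 NotReady state; container names are illustrative only.
example_pod = {
    "status": {
        "containerStatuses": [
            {"name": "dump", "state": {"terminated": {"exitCode": 0}}},
            {"name": "sidecar-a", "state": {"running": {}}},
            {"name": "sidecar-b", "state": {"running": {}}},
        ]
    }
}
print(sidecars_to_kill(example_pod, "dump"))
```

A real controller would then exec into each returned container and signal PID 1. That part is left out here because it depends on the cluster client in use.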

We observe that it takes quite a while (about 26 seconds in the example below) for the task pods to transition from the 2/3 NotReady state to the 0/3 Completed state:

mediawiki-angwikisource-sql-xml-angwikisource-dump-batch-0-7fodo5b   0/3     ContainerCreating   0          2s
mediawiki-angwikisource-sql-xml-angwikisource-dump-batch-0-7fodo5b   0/3     ContainerCreating   0          6s
mediawiki-angwikisource-sql-xml-angwikisource-dump-batch-0-7fodo5b   2/3     Running             0          7s
mediawiki-angwikisource-sql-xml-angwikisource-dump-batch-0-7fodo5b   3/3     Running             0          8s
mediawiki-angwikisource-sql-xml-angwikisource-dump-batch-0-7fodo5b   2/3     NotReady            0          23s
mediawiki-angwikisource-sql-xml-angwikisource-dump-batch-0-7fodo5b   1/3     NotReady            0          45s
mediawiki-angwikisource-sql-xml-angwikisource-dump-batch-0-7fodo5b   1/3     NotReady            0          49s
mediawiki-angwikisource-sql-xml-angwikisource-dump-batch-0-7fodo5b   0/3     Completed           0          49s
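The delay can be read off the AGE column of the watch output above. The following is a small sketch for computing it, assuming the five-column `kubectl get pods -w` layout (NAME, READY, STATUS, RESTARTS, AGE); the helper names `parse_age` and `first_age` are made up for this example.

```python
import re
from typing import Optional

POD = "mediawiki-angwikisource-sql-xml-angwikisource-dump-batch-0-7fodo5b"
WATCH_OUTPUT = f"""\
{POD}   0/3     ContainerCreating   0          2s
{POD}   0/3     ContainerCreating   0          6s
{POD}   2/3     Running             0          7s
{POD}   3/3     Running             0          8s
{POD}   2/3     NotReady            0          23s
{POD}   1/3     NotReady            0          45s
{POD}   1/3     NotReady            0          49s
{POD}   0/3     Completed           0          49s
"""

UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_age(age: str) -> int:
    """Convert an age such as '49s' or '1m30s' to seconds."""
    return sum(int(n) * UNIT_SECONDS[u] for n, u in re.findall(r"(\d+)([smhd])", age))


def first_age(output: str, ready: str, status: str) -> Optional[int]:
    """Age in seconds of the first row matching the READY and STATUS columns."""
    for line in output.splitlines():
        fields = line.split()
        if len(fields) == 5 and fields[1] == ready and fields[2] == status:
            return parse_age(fields[4])
    return None


not_ready = first_age(WATCH_OUTPUT, "2/3", "NotReady")
completed = first_age(WATCH_OUTPUT, "0/3", "Completed")
if not_ready is not None and completed is not None:
    print(f"sidecar shutdown took about {completed - not_ready}s")
```

For this pod the dump container exits at 23s and the pod completes at 49s, so the sidecar shutdown accounts for roughly 26 of the pod's 49 seconds.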

My hypothesis is that the job sidecar controller is being CPU throttled.

We probably need to increase its CPU resources.
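One way to check the throttling hypothesis is to look at the `nr_periods` and `nr_throttled` counters in the `cpu.stat` file of the controller container's cgroup. The sketch below computes the fraction of throttled periods from that file's contents. The function name `throttled_ratio` is hypothetical and the sample values are invented for illustration, not measured on dse-k8s-eqiad.

```python
def throttled_ratio(cpu_stat: str) -> float:
    """Fraction of CPU enforcement periods in which the cgroup was throttled.

    `cpu_stat` is the text of a cgroup `cpu.stat` file: one `key value` pair per line.
    """
    stats = {}
    for line in cpu_stat.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])

    periods = stats.get("nr_periods", 0)
    if periods == 0:
        # No CPU limit enforced yet, or the counters are missing.
        return 0.0
    return stats.get("nr_throttled", 0) / periods


# Invented sample values, for illustration only.
SAMPLE_CPU_STAT = """\
usage_usec 1200000
nr_periods 1000
nr_throttled 400
throttled_usec 30000000
"""
print(f"{throttled_ratio(SAMPLE_CPU_STAT):.0%} of periods were throttled")
```

A ratio that stays high while the controller is processing finished pods would support raising its CPU limit.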

Event Timeline

brouberol triaged this task as Medium priority.

Change #1140136 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] dse-k8s-eqiad: substantially increase job sidecar controller CPU resources

https://gerrit.wikimedia.org/r/1140136

brouberol changed the task status from Open to In Progress. (Apr 30 2025, 9:26 AM)
brouberol claimed this task.

Change #1140136 merged by Brouberol:

[operations/deployment-charts@master] dse-k8s-eqiad: substantially increase job sidecar controller CPU resources

https://gerrit.wikimedia.org/r/1140136

We've observed a dump of ~150 wikis taking about an hour, when it used to take 1h30. I'm not sure we can attribute that to the job sidecar controller CPU increase, as we may also be benefiting from the pre-existing dump. That being said, it seems to have had some positive effect.