
CI frozen waiting for "Waiting for next available executor on 'integration-castor05'"
Closed, Resolved · Public

Description

https://integration.wikimedia.org/ci/computer/integration-castor05/ suggests it is waiting on https://integration.wikimedia.org/ci/job/castor-save-workspace-cache/6193771/, but if you look at that URL, the job actually finished a while ago: "20:13:41 Finished: SUCCESS".

Screenshot 2026-01-31 at 20.29.11.png (947×349 px, 104 KB)
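(For reference, the agent's executor state can be inspected directly via the Jenkins JSON API; a minimal sketch, assuming jq is available locally:)

```
# Ask Jenkins what the integration-castor05 executors claim to be doing.
# depth=1 expands the executor objects so idle/stuck state is visible.
curl -s 'https://integration.wikimedia.org/ci/computer/integration-castor05/api/json?depth=1' \
  | jq '.executors[] | {idle, likelyStuck, progress}'
```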

Event Timeline

Reedy triaged this task as High priority. (Sat, Jan 31, 8:28 PM)
Reedy updated the task description.

Screenshot_20260131_231017_Samsung Internet.png (208 KB)

The UI has looked like this for the last 56 minutes.

Mentioned in SAL (#wikimedia-releng) [2026-01-31T21:44:42Z] <James_F> Fighting T416078, took integration-castor-5 offline, disconnected, sshed in to kill threads, then reconnected; no change in aspect.
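(The offline/disconnect/reconnect cycle above roughly corresponds to the following Jenkins CLI steps; this is a sketch only, with credentials and the agent-side process name assumed, not the exact commands that were run:)

```
# Take the stuck agent out of rotation, disconnect it, kill any leftover
# agent processes on the host, then bring it back.
# $JENKINS_URL and $AUTH (user:apitoken) are assumed to be set.
java -jar jenkins-cli.jar -s "$JENKINS_URL" -auth "$AUTH" \
  offline-node integration-castor05 -m 'T416078: stuck executor'
java -jar jenkins-cli.jar -s "$JENKINS_URL" -auth "$AUTH" \
  disconnect-node integration-castor05 -m 'T416078'
# The remoting process name on the agent is an assumption and may differ.
ssh integration-castor05 'sudo pkill -f agent.jar || true'
java -jar jenkins-cli.jar -s "$JENKINS_URL" -auth "$AUTH" \
  connect-node integration-castor05
java -jar jenkins-cli.jar -s "$JENKINS_URL" -auth "$AUTH" \
  online-node integration-castor05
```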

Mentioned in SAL (#wikimedia-releng) [2026-01-31T21:45:20Z] <James_F> Running sudo systemctl restart jenkins on contint for T416078
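(A minimal sketch of that restart plus a post-restart sanity check, assuming shell access to the controller host:)

```
# Restart the Jenkins controller service and confirm it came back up.
sudo systemctl restart jenkins
sudo systemctl status jenkins --no-pager
# Tail recent service logs for startup errors.
sudo journalctl -u jenkins --since '5 min ago' --no-pager
```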

Mentioned in SAL (#wikimedia-releng) [2026-01-31T21:49:37Z] <James_F> Deleted Jenkins's job entry for castor-save-workspace-cache 6193776 and this seems to have unstuck things for T416078?
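(For the record, a single build record can also be deleted through Jenkins' REST API; a sketch, since the actual deletion may have been done from the UI or the script console:)

```
# Delete one stuck build record via the REST API (credentials assumed).
curl -X POST --user "$USER:$API_TOKEN" \
  'https://integration.wikimedia.org/ci/job/castor-save-workspace-cache/6193776/doDelete'
```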

It started to run, but extremely slowly.

> It started to run, but extremely slowly.

Do you have some data for this claim? From my monitoring, castor jobs are running and exiting in 5–25 seconds as normal.
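(Anyone can spot-check recent castor run times against the Jenkins JSON API; a minimal sketch using the standard remote-API tree filter, assuming jq is available:)

```
# List the last 20 castor-save-workspace-cache builds with result and
# duration (in milliseconds).
curl -s 'https://integration.wikimedia.org/ci/job/castor-save-workspace-cache/api/json?tree=builds[number,result,duration]{0,20}' \
  | jq '.builds[] | {number, result, duration}'
```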

Jdforrester-WMF claimed this task.

Let's call this Resolved, and hope that the very odd Jenkins confusion/deadlock doesn't recur.

>> It started to run, but extremely slowly.
> Do you have some data for this claim? From my monitoring, castor jobs are running and exiting in 5–25 seconds as normal.

Sure. Today, just before this bug:

Screenshot_20260201_000140_Samsung Internet.png (122 KB)

The same change now:
Screenshot_20260201_000304_Samsung Internet.png (154 KB)

>>> It started to run, but extremely slowly.
>> Do you have some data for this claim? From my monitoring, castor jobs are running and exiting in 5–25 seconds as normal.
> Sure.
> [Snip]

Those are screenshots of UX approximations of macro-scale system behaviour during recovery from an outage, indeed. The part of the system that was broken, and which is the subject of this task, is now fixed and operating normally as far as I can tell, so the wider system should recover to normal performance over the next few (tens of) minutes.

Please be more specific in future, so we don't burn effort frantically looking for missed issues late on a Saturday night.

I do not know enough to see the difference, or at least to understand what you just said. So I said the tests are much slower, and they are: I waited at least three times the usual time on this run.
UPD: Tried once more, and it was much faster, though still slower than usual. It looks like things continue to improve.