https://integration.wikimedia.org/ci/computer/integration-castor05/ suggests it's waiting on https://integration.wikimedia.org/ci/job/castor-save-workspace-cache/6193771/, but if you look at that URL, that job finished a while ago ("20:13:41 Finished: SUCCESS").
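For reference, the "finished but still shown as waiting" state can be double-checked against Jenkins's per-build JSON API (`<build-url>/api/json`), which reports `"building": true` while a build runs and a non-null `"result"` once it ends. A minimal sketch; the sample payload below is illustrative, not the actual response for build 6193771:

```python
def build_finished(payload: dict) -> bool:
    """Return True if a Jenkins build's /api/json payload says it has ended."""
    # A running build has "building": true and "result": null;
    # a completed one has "building": false and a result like "SUCCESS".
    return not payload.get("building", True) and payload.get("result") is not None

# Illustrative payload, shaped like GET <build-url>/api/json (not real data)
sample = {"building": False, "result": "SUCCESS", "timestamp": 1769890421000}
print(build_finished(sample))
```

If this returns True while the executor page still shows the node as busy on that build, the executor's state in Jenkins has diverged from the build record, which matches what was observed here.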
Description
Event Timeline
Mentioned in SAL (#wikimedia-releng) [2026-01-31T21:44:42Z] <James_F> Fighting T416078, took integration-castor-5 offline, disconnected, sshed in to kill threads, then reconnected; no change in aspect.
Mentioned in SAL (#wikimedia-releng) [2026-01-31T21:45:20Z] <James_F> Running sudo systemctl restart jenkins on contint for T416078
Mentioned in SAL (#wikimedia-releng) [2026-01-31T21:49:37Z] <James_F> Deleted Jenkins's job entry for castor-save-workspace-cache 6193776 and this seems to have unstuck things for T416078?
Do you have some data for this claim? From my monitoring, castor jobs are running and exiting in 5–25 seconds as normal.
Let's call this Resolved, and hope that the very odd Jenkins confusion/deadlock doesn't recur.
[Snip]
Those are screenshots of UX approximations of macro-scale system behaviour during recovery from an outage, indeed. The part of the system that was broken, which is the subject of this task, is now fixed and operating normally as far as I can tell, so the wider system should recover to normal performance over the next few (tens of) minutes.
Please be more specific in future, so we don't burn effort frantically looking for missed issues late on a Saturday night.
I do not know enough to see the difference, or at least to understand what you just said. That is why I said the tests are much slower, and they are: this time I waited at least three times as long as usual.
UPD: Tried once more, and it was much faster, but still slower than usual. Looks like it continues to improve.