Coverage and patch-performance pipelines appear stuck
Closed, Resolved (Public)

Description

For example, the job for 797337 has been pending for almost 4 hours

(See also https://grafana-rw.wikimedia.org/d/000000322/zuul-gearman?orgId=1&from=now-6h&to=now&viewPanel=10, which suggests a more systemic issue)

zuul_gearman_queue.png (559×659 px, 45 KB)

Event Timeline

hashar subscribed.

Changes in the patch-performance and coverage pipelines have low precedence and are only triggered after everything else.

The spike is probably what caused the problem, and it is slowly recovering.

Ah, that makes sense, thank you @hashar — still, figuring out that spike would be useful 😄

hashar renamed this task from Coverage pipeline appears stuck to Coverage and patch-performance pipelines appear stuck. May 23 2022, 10:07 PM
hashar updated the task description.

After a chat with @TheresNoTime, @dancy and @Legoktm on IRC:

There was a large spike between roughly 15:13 UTC and 15:33 UTC, most probably due to a series of patches sent together to one of the big repositories such as mediawiki/core or Wikibase. It had mostly recovered by 15:49 UTC. This is a known issue with our outdated Zuul: T151089, T140297. The Zuul merger is slow as well: we would need more of them (T222645) and some optimization to git (T307620); there might be more tasks around. TL;DR: our Zuul is obsolete.

After that, until at least 22:00 UTC, there was a long-standing queue of ~600 Gearman functions waiting.

The reason, found by @Legoktm, is that LibUp kicked in. Although it busy-waits on the test and gate-and-submit pipelines, it still keeps them busy. Since the Zuul Gearman server only runs the low-precedence pipelines when there is room, their jobs end up never running. Additionally, the stuck jobs have a mutex in Zuul, so only one of them can run at a time.

Thus, as LibUp keeps the higher-precedence pipelines busy, there is only a very small window for a change in the low-precedence pipelines (coverage and patch-performance) to have a chance to trigger its job. When one does manage to get a job running, only one is triggered, due to the mutex.
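To illustrate the starvation described above, here is a rough toy model in Python. It is not how Zuul or Gearman are implemented: the `simulate` function, the one-tick job duration, the worker count and the arrival rates are all assumptions made up for the sketch. Only the backlog of ~600 waiting functions comes from the observations above.

```python
import heapq
from itertools import count

HIGH, LOW = 0, 1  # lower number = higher precedence (hypothetical encoding)


def simulate(ticks, workers, high_per_tick, low_backlog):
    """Return how many low-precedence jobs ran after `ticks` rounds.

    Assumptions: every job takes exactly one tick, `high_per_tick`
    high-precedence jobs arrive each tick, and low-precedence jobs
    share a mutex so at most one of them runs per tick.
    """
    order = count()  # tie-breaker: FIFO within the same precedence
    queue = [(LOW, next(order)) for _ in range(low_backlog)]
    heapq.heapify(queue)
    low_done = 0
    for _ in range(ticks):
        # New high-precedence work keeps arriving (e.g. from LibUp).
        for _ in range(high_per_tick):
            heapq.heappush(queue, (HIGH, next(order)))
        free = workers
        mutex_held = False
        deferred = []
        while free and queue:
            job = heapq.heappop(queue)
            if job[0] == LOW:
                if mutex_held:
                    # Another low-precedence job holds the mutex: wait.
                    deferred.append(job)
                    continue
                mutex_held = True
                low_done += 1
            free -= 1
        for job in deferred:
            heapq.heappush(queue, job)
    return low_done


# Workers saturated by high-precedence work: the backlog never moves.
print(simulate(ticks=100, workers=4, high_per_tick=4, low_backlog=600))
# One worker left over per tick: the backlog drains one job at a time.
print(simulate(ticks=100, workers=4, high_per_tick=3, low_backlog=600))
# No competing work at all: still one job per tick because of the mutex.
print(simulate(ticks=100, workers=4, high_per_tick=0, low_backlog=600))
```

In this model the low-precedence backlog only shrinks when a worker is left over, and even then by one job per tick, which matches the long tail we are seeing.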

We thus face a long tail which will resolve by itself.

hashar claimed this task.

Looks all good now