
Jenkins jobs regularly being queued while resources appear to be readily available
Closed, Resolved · Public

Description

I've seen this for several years, and each time I convinced myself it's a temporary issue due to ... something something. I also suspect that for a while in 2016-2017, this wasn't an issue. But maybe I just didn't notice it.

Let me report it here in case it's a known issue and/or is something we can fix.

Problem

Quite often I see that the Zuul status page has dozens of stalled (queued) jobs with only one or two actually being executed. Meanwhile, Jenkins shows dozens of idle executor slots available, seemingly not doing anything.

The example below isn't the worst case, as it still shows 5 change sets being processed (16 jobs in the execution phase). But it still demonstrates the issue: dozens of jobs stalled while executor slots appear to be available!

Zuul status page

  • executing: 16 jobs
  • waiting: 41 jobs (queued)

zuul.png (5×2 px, 774 KB)

Jenkins executor overview (docker hosts)

  • in use: 17 executors (including 1 non-zuul job for selenium, from a cron timer)
  • idle: 55 executors

jenkins.png (6×678 px, 441 KB)

Event Timeline

I have trouble figuring out how the jobs end up being scheduled.

The Zuul scheduler triggers Gearman jobs (e.g. build:npm-node-6-docker).

The Gearman function is registered by the Jenkins Gearman worker for each node/executor that could run it. Hence for that function, we have 60 workers available:

$ zuul-gearman.py workers|grep build:npm-node-6-docker|cut -d\  -f-3
16 127.0.0.1 integration-slave-docker-1051_exec-3
22 127.0.0.1 integration-slave-docker-1054_exec-1
24 127.0.0.1 integration-slave-docker-1051_exec-4
27 127.0.0.1 integration-slave-docker-1041_exec-3
28 127.0.0.1 integration-slave-docker-1048_exec-1
30 127.0.0.1 integration-slave-docker-1051_exec-0
31 127.0.0.1 integration-slave-docker-1041_exec-4
32 127.0.0.1 integration-slave-docker-1041_exec-1
35 127.0.0.1 integration-slave-docker-1040_exec-0
36 127.0.0.1 integration-slave-docker-1040_exec-3
37 127.0.0.1 integration-slave-docker-1054_exec-3
38 127.0.0.1 integration-slave-docker-1054_exec-2
39 127.0.0.1 integration-slave-docker-1034_exec-0
40 127.0.0.1 integration-slave-docker-1048_exec-2
42 127.0.0.1 integration-slave-docker-1043_exec-4
45 127.0.0.1 integration-slave-docker-1040_exec-2
46 127.0.0.1 integration-slave-docker-1043_exec-3
49 127.0.0.1 integration-slave-docker-1040_exec-4
50 127.0.0.1 integration-slave-docker-1021_exec-0
52 127.0.0.1 integration-slave-docker-1054_exec-4
53 127.0.0.1 integration-slave-docker-1048_exec-0
55 127.0.0.1 integration-slave-docker-1052_exec-3
56 127.0.0.1 integration-slave-docker-1034_exec-3
58 127.0.0.1 integration-slave-docker-1041_exec-2
59 127.0.0.1 integration-slave-docker-1053_exec-0
60 127.0.0.1 integration-slave-docker-1050_exec-3
62 127.0.0.1 integration-slave-docker-1034_exec-1
64 127.0.0.1 integration-slave-docker-1048_exec-4
68 127.0.0.1 integration-slave-docker-1049_exec-1
69 127.0.0.1 integration-slave-docker-1034_exec-2
70 127.0.0.1 integration-slave-docker-1052_exec-2
71 127.0.0.1 integration-slave-docker-1043_exec-2
72 127.0.0.1 integration-slave-docker-1052_exec-1
73 127.0.0.1 integration-slave-docker-1049_exec-0
74 127.0.0.1 integration-slave-docker-1021_exec-1
75 127.0.0.1 integration-slave-docker-1050_exec-0
77 127.0.0.1 integration-slave-docker-1051_exec-1
79 127.0.0.1 integration-slave-docker-1048_exec-3
80 127.0.0.1 integration-slave-docker-1052_exec-0
82 127.0.0.1 integration-slave-docker-1043_exec-1
84 127.0.0.1 integration-slave-docker-1050_exec-4
86 127.0.0.1 integration-slave-docker-1034_exec-4
87 127.0.0.1 integration-slave-docker-1050_exec-2
90 127.0.0.1 integration-slave-docker-1052_exec-4
91 127.0.0.1 integration-slave-docker-1043_exec-0
92 127.0.0.1 integration-slave-docker-1054_exec-0
93 127.0.0.1 integration-slave-docker-1051_exec-2
94 127.0.0.1 integration-slave-docker-1040_exec-1
96 127.0.0.1 integration-slave-docker-1050_exec-1
97 127.0.0.1 integration-slave-docker-1041_exec-0
48 127.0.0.1 integration-slave-docker-1056_exec-3
76 127.0.0.1 integration-slave-docker-1056_exec-0
88 127.0.0.1 integration-slave-docker-1056_exec-4
89 127.0.0.1 integration-slave-docker-1056_exec-1
95 127.0.0.1 integration-slave-docker-1056_exec-2
98 127.0.0.1 integration-slave-docker-1055_exec-1
99 127.0.0.1 integration-slave-docker-1055_exec-4
100 127.0.0.1 integration-slave-docker-1055_exec-3
101 127.0.0.1 integration-slave-docker-1055_exec-0
102 127.0.0.1 integration-slave-docker-1055_exec-2

An alternate view:

$ zuul-gearman.py status|grep build:npm-node-6-docker
build:npm-node-6-docker	0	0	60

The fields are: function, waiting, working, total workers.
The Gearman server keeps the job requests in a queue.
The Jenkins executors grab jobs from the server, which hands out the oldest request in the queue first (FIFO).
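
To make that concrete, here is a toy Python sketch of the dispatch model (purely illustrative, with invented names; not the actual Gearman protocol or server code): one FIFO queue per function name, and whichever registered worker asks for work first receives the oldest pending request.

from collections import deque

class ToyGearmanServer:
    """Toy model of a Gearman server: one FIFO queue per function name."""

    def __init__(self):
        self.queues = {}    # function name -> deque of pending job payloads
        self.workers = {}   # function name -> set of worker ids that can run it

    def register_worker(self, function, worker_id):
        # Jenkins registers one worker per node/executor, e.g.
        # register_worker("build:npm-node-6-docker", "integration-slave-docker-1051_exec-3")
        self.workers.setdefault(function, set()).add(worker_id)

    def submit_job(self, function, payload):
        # The Zuul scheduler enqueues a build request for the function.
        self.queues.setdefault(function, deque()).append(payload)

    def grab_job(self, function):
        # A free worker asks for work and gets the oldest queued request (FIFO),
        # or None when nothing is pending for that function.
        queue = self.queues.get(function)
        return queue.popleft() if queue else None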

I have no idea how the Jenkins Gearman Plugin is implemented; I would guess it asks the Jenkins scheduler for an executor, then gets that executor to grab the job from the Gearman server.

Anyway, the logic in Jenkins is in:

./core/src/main/java/hudson/model/LoadBalancer.java
./core/src/main/java/hudson/model/Queue.java

There is only one LoadBalancer implemented in Jenkins core, and it uses consistent hashing based on the node name. It is described on the page of the commercial Even Scheduler plugin, https://go.cloudbees.com/docs/plugins/even-scheduler/#_default_jenkins_behavior:

Default Jenkins Behavior

To better understand what this plugin does, let us examine the default scheduling algorithm of Jenkins. How does it choose one node out of all the qualifying nodes?

By default, Jenkins employs the algorithm known as consistent hashing to make this decision. More specifically, it hashes the name of the node, in numbers proportional to the number of available executors, then hashes the job name to create a probe point for the consistent hash. More intuitively speaking, Jenkins creates a priority list for each job that lists all the agents in their "preferred" order, then picks the most preferred available node. This priority list is different from one job to another, and it is stable, in that adding or removing nodes generally only causes limited changes to these priority lists.

As a result, from the user’s point of view, it looks as if Jenkins tries to always use the same node for the same job, unless it’s not available, in which case it’ll build elsewhere. But as soon as the preferred node is available, the build comes back to it.

This behaviour is based on the assumption that it is preferable to use the same workspace as much as possible, because SCM updates are more efficient than SCM checkouts. In a typical continuous integration situation, each build only contains a limited number of changes, so indeed updates (which only fetch updated files) run substantially faster than checkouts (which refetch all files from scratch.)

This locality is also useful for a number of other reasons. For example, on a large Jenkins instance with many jobs, this tends to keep the number of the workspaces on each node small. Some build tools (such as Maven and RVM) use local caches, and they work faster if Jenkins keeps building a job on the same node.

However, the notable downside of this strategy is that when each agent is configured with multiple executors, it doesn't try to actively create balanced load on nodes. Say you have two agents X and Y, each with 4 executors. If at one point X is building FOO #1 and Y is completely idle, then on average, the upcoming BAR #1 still gets assigned to X with 3/7 chance (because X has 3 idle executors and Y has 4 idle executors).
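
To make the quoted behaviour concrete, here is a rough Python sketch of that kind of consistent-hash node selection (a simplified illustration with invented names, not Jenkins' actual LoadBalancer code): each node gets hash points in proportion to its executor count, the job name provides a stable probe point, and the first available node at or after the probe wins. Because the probe only depends on the job name, a job keeps landing on the same node as long as that node has any idle executor, however busy it already is.

import hashlib
from bisect import bisect_left

def _hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

def pick_node(job_name, nodes, is_available):
    """nodes: dict of node name -> executor count.
    is_available: callable(node_name) -> bool (has at least one idle executor)."""
    # Place each node on the hash ring once per executor.
    ring = sorted(
        (_hash(f"{name}-{i}"), name)
        for name, executors in nodes.items()
        for i in range(executors)
    )
    # Probe point derived from the job name: stable from one build to the next.
    probe = _hash(job_name)
    points = [point for point, _ in ring]
    start = bisect_left(points, probe)
    # Walk the ring from the probe point and take the first available node.
    for offset in range(len(ring)):
        _, name = ring[(start + offset) % len(ring)]
        if is_available(name):
            return name
    return None  # everything is busy

# The same job name maps to the same preferred node as long as it has an idle executor:
nodes = {"integration-slave-docker-1051": 4, "integration-slave-docker-1054": 4}
print(pick_node("npm-node-6-docker", nodes, lambda n: True))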

https://jenkins.io/doc/developer/extensions/jenkins-core/#loadbalancer lists extensions for hudson.model.LoadBalancer (the strategy that decides which Task gets run on which Executor). Among them is the Least Load plugin: https://plugins.jenkins.io/leastload

TLDR: we should look at installing the Least Load plugin https://plugins.jenkins.io/leastload
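
By contrast, a least-load strategy is conceptually much simpler (again a hypothetical sketch, not the plugin's actual code): ignore the job name and send the build to whichever node currently has the most idle executors.

def pick_least_loaded(nodes_idle):
    """nodes_idle: dict of node name -> number of idle executors."""
    candidates = {name: idle for name, idle in nodes_idle.items() if idle > 0}
    if not candidates:
        return None  # everything is busy, the build has to wait
    return max(candidates, key=candidates.get)

# In the X/Y example quoted above, the next build would always go to the fully idle Y:
print(pick_least_loaded({"X": 3, "Y": 4}))  # -> Y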

Mentioned in SAL (#wikimedia-releng) [2019-05-06T13:31:28Z] <hashar> Jenkins: installed Least Load plugin | T218458

hashar claimed this task.

I am assuming the plugin magically fixed it up.

Zuul status page

  • executing: 16 jobs
  • waiting: 41 jobs (queued)

zuul.png (5×2 px, 774 KB)

This shows the gate-and-submit pipeline only running jobs for two changes, the others being pending. Dependent pipelines have a window of changes they can act on: the window grows linearly as changes succeed but shrinks exponentially when a change fails. Most probably some changes recently failed and caused the window of actionable changes to be reduced to two.

zuul/layout.yaml
-   name: gate-and-submit
    window: 5  # initial value
    window-floor: 2  # minimum
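
As a rough illustration of that behaviour (a simplified Python sketch based on the description above, not Zuul's actual implementation), the actionable window would evolve roughly like this, assuming it grows by one per merged change and is halved per failed change, never dropping below window-floor:

def next_window(window, change_succeeded, floor=2, increase=1, decrease_factor=2):
    """Window of changes a dependent pipeline (e.g. gate-and-submit) may act on."""
    if change_succeeded:
        return window + increase                   # linear growth on success
    return max(floor, window // decrease_factor)   # exponential shrink on failure

window = 5  # initial value from layout.yaml
for succeeded in [True, False, False, True, True]:
    window = next_window(window, succeeded)
    print(window)  # prints 6, 3, 2, 3, 4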

Which brings back to memory T93701: Zuul status page should show the pipelines' "window" value. In status.json, each queue in a pipeline has a window value. So potentially in the web UI, next to the queue name (e.g. mediawiki) we could also display the window value, and potentially show pending changes as being held on purpose.
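
For instance, something along these lines could already pull the value out of status.json (a sketch assuming the queue objects carry a window field as described above; the exact key names, e.g. change_queues, are my guess at the Zuul v2 status format):

import json
import urllib.request

# Status endpoint of the Wikimedia Zuul instance (adjust for another deployment).
STATUS_URL = "https://integration.wikimedia.org/zuul/status.json"

with urllib.request.urlopen(STATUS_URL) as response:
    status = json.load(response)

for pipeline in status.get("pipelines", []):
    for queue in pipeline.get("change_queues", []):
        if "window" in queue:
            # e.g. "gate-and-submit / mediawiki: window=2"
            print(f"{pipeline['name']} / {queue['name']}: window={queue['window']}")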

The puppet catalog compiler jobs benefit from it. They run on two instances, and one of them previously had its disk filling up quickly. Heron confirmed that those jobs are now more evenly balanced: T221969#5176406

SUCCESS! Thank you Timo.