Jenkins jobs invoked by Zuul tend to be clustered on a few integration-agent-docker-* nodes rather than being spread out across all of them evenly. This is a problem because more jobs running on a node means more pressure on the /srv filesystem, increasing the likelihood of Shinken alerts and/or the maintenance-disconnect-full-disks job taking nodes offline.
Spreading the jobs out evenly across all of the available nodes would reduce the pressure on /srv and improve overall resource utilization.
T218458 attempted to deal with this problem by adding the Least Load Jenkins plugin, but the solution was incomplete.
I looked at the source code for the implementation of Gearman that is used by Zuul: https://opendev.org/opendev/gear
The Gearman server maintains a list of connected workers (in our case, these are Jenkins executors). When a client (in this case, Zuul) submits a job, the list of connections is scanned in order and each idle worker that can handle the job is sent a NOOP message. Each worker reacts by sending a GRAB_JOB message, which asks to be assigned any available job that it can perform. The first worker to respond gets the job. Since the list of connected workers is likely to have all executors for a given node adjacent to each other, this pattern results in jobs clustering onto a few nodes.
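The clustering effect can be illustrated with a toy model. This is only a sketch, not the gear implementation: it assumes that the first idle worker in list order always wins the job (in practice the winner depends on which GRAB_JOB arrives first), and the node names and helper functions are hypothetical.

```python
from collections import Counter


def build_connections(nodes, executors_per_node):
    """Build a connection list with each node's executors adjacent to each other."""
    return [(node, slot) for node in nodes for slot in range(executors_per_node)]


def assign_in_order(connections, busy, job_count):
    """Give each job to the first idle executor found by an in-order scan.

    Assumption: the first worker woken is also the first to respond.
    """
    assigned = []
    for _ in range(job_count):
        for conn in connections:
            if conn not in busy:
                busy.add(conn)
                assigned.append(conn)
                break
    return assigned


# Hypothetical setup: 4 nodes with 4 executors each, 4 jobs submitted.
nodes = ["node-1", "node-2", "node-3", "node-4"]
connections = build_connections(nodes, executors_per_node=4)
assigned = assign_in_order(connections, set(), job_count=4)
per_node = Counter(node for node, _ in assigned)
print(dict(per_node))  # {'node-1': 4}
```

Under these assumptions, all four jobs land on the first node while the other three stay idle.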
A simple first attempt to deal with job distribution is to randomize the order of the list of connections before sending the NOOP messages.
xref: gear/__init__.py, Server.wakeConnections():
    class Server(BaseClientServer):
        def wakeConnections(self, job=None):
            p = Packet(constants.RES, constants.NOOP, b'')
            for connection in self.active_connections:
                if connection.state == 'SLEEP':
                    if ((job and job.name in connection.functions) or
                            (job is None)):
                        connection.changeState("AWAKE")
                        connection.sendPacket(p)
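A minimal sketch of the randomization idea follows. It is not a patch against gear: `FakeConnection` and `wake_connections_shuffled` are hypothetical stand-ins that mirror only the attributes used in the snippet above, and the string "NOOP" stands in for the real packet. The point is simply to shuffle a copy of the connection list before waking workers.

```python
import random


class FakeConnection:
    """Hypothetical stand-in for a gear server connection (not the real class)."""

    def __init__(self, name, functions):
        self.name = name
        self.functions = set(functions)
        self.state = "SLEEP"
        self.sent = []

    def changeState(self, state):
        self.state = state

    def sendPacket(self, packet):
        self.sent.append(packet)


def wake_connections_shuffled(active_connections, job_name=None, rng=random):
    """Wake eligible sleeping workers in random order; return them in wake order."""
    candidates = list(active_connections)  # copy, so the original list is untouched
    rng.shuffle(candidates)
    woken = []
    for connection in candidates:
        if connection.state != "SLEEP":
            continue
        if job_name is None or job_name in connection.functions:
            connection.changeState("AWAKE")
            connection.sendPacket("NOOP")  # placeholder for the real NOOP packet
            woken.append(connection)
    return woken


# Hypothetical usage: 3 nodes with 2 executors each, all able to run "build".
conns = [
    FakeConnection(f"node-{n}/exec-{e}", ["build"])
    for n in range(1, 4)
    for e in range(2)
]
woken = wake_connections_shuffled(conns, "build", rng=random.Random(0))
print([c.name for c in woken])
```

Shuffling only changes the order in which NOOP messages go out. Which worker actually gets the job still depends on whose GRAB_JOB reaches the server first, so this would make clustering less likely rather than guarantee an even spread.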
Useful xref: http://gearman.org/protocol/