
Improve scheduling of CI jobs invoked by zuul
Open, Stalled, High, Public

Description

Jenkins jobs invoked by Zuul tend to be clustered on a few integration-agent-docker-* nodes rather than being spread out across all of them evenly. This is a problem because more jobs running on a node means more pressure on the /srv filesystem, increasing the likelihood of Shinken alerts and/or the maintenance-disconnect-full-disks job taking nodes offline.

Spreading the jobs out evenly across all of the available nodes would reduce the pressure on /srv and improve overall resource utilization.

T218458 attempted to deal with this problem by adding the Least Load Jenkins plugin, but the solution was incomplete.

I looked at the source code for the implementation of Gearman that is used by Zuul: https://opendev.org/opendev/gear

The Gearman server maintains a list of connected workers (in our case these are Jenkins executors). When a client (in this case, Zuul) submits a job, the list of connections is scanned in order and each idle worker that can handle the job is sent a NOOP message. Each worker reacts by sending a GRAB_JOB message, which asks to be assigned any available job that it can perform. The first worker to respond gets the job. Since the list of connected workers is likely to have all executors for a given node adjacent to each other, this pattern results in clustering of jobs onto nodes.
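The clustering effect described above can be illustrated with a toy simulation. This is only a sketch and does not use gear itself: `assign_jobs`, the node names, and the executor counts are hypothetical, and it assumes that the first worker to be woken is also the first to answer with GRAB_JOB.

```python
import random
from collections import Counter


def assign_jobs(workers, n_jobs, shuffle=False, rng=None):
    """Hand out n_jobs to idle workers and count jobs per node.

    workers is a list of (node, executor) tuples in connection order.
    Assumption: the first worker woken is the first to send GRAB_JOB,
    so it wins the job.
    """
    rng = rng or random.Random()
    idle = list(workers)
    per_node = Counter()
    for _ in range(n_jobs):
        if not idle:
            break
        candidates = list(idle)
        if shuffle:
            # Randomize the wake-up order instead of scanning in list order.
            rng.shuffle(candidates)
        winner = candidates[0]
        idle.remove(winner)
        per_node[winner[0]] += 1
    return per_node


# Four hypothetical nodes with three executors each; executors of the
# same node sit next to each other in the connection list.
workers = [(f"node-{n}", e) for n in range(4) for e in range(3)]

in_order = assign_jobs(workers, 3)
shuffled = assign_jobs(workers, 3, shuffle=True, rng=random.Random(0))

print(dict(in_order))  # all three jobs land on node-0
print(dict(shuffled))  # distribution depends on the random order
```

With an in-order scan, every job goes to the first node until its executors are all busy. Shuffling does not guarantee an even spread, but it removes the bias toward whichever node happens to be first in the list.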

A simple first attempt to deal with job distribution is to randomize the list of connections before sending the NOOP message.
xref: gear/__init__.py, Server.wakeConnections():

class Server(BaseClientServer):

    def wakeConnections(self, job=None):
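        # Build one NOOP packet and send it to every sleeping connection
        # that can run the job (or to all of them when no job is given).
        p = Packet(constants.RES, constants.NOOP, b'')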
        for connection in self.active_connections:
            if connection.state == 'SLEEP':
                if ((job and job.name in connection.functions) or
                        (job is None)):
                    connection.changeState("AWAKE")
                    connection.sendPacket(p)
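A minimal sketch of the randomization idea follows. It is not the change that was later uploaded to Gerrit: `FakeConnection` and `wake_connections_shuffled` are hypothetical stand-ins so the example runs without gear installed, and the packet is represented by a plain bytes value.

```python
import random


class FakeConnection:
    """Hypothetical stand-in for a gear server connection."""

    def __init__(self, name, functions):
        self.name = name
        self.functions = functions
        self.state = 'SLEEP'
        self.sent = []

    def changeState(self, state):
        self.state = state

    def sendPacket(self, packet):
        self.sent.append(packet)


def wake_connections_shuffled(active_connections, packet, job_name=None,
                              rng=random):
    """Wake sleeping connections in random order; return names woken."""
    # Shuffle a copy so the caller's connection list keeps its order.
    connections = list(active_connections)
    rng.shuffle(connections)
    woken = []
    for connection in connections:
        if connection.state == 'SLEEP':
            if job_name is None or job_name in connection.functions:
                connection.changeState("AWAKE")
                connection.sendPacket(packet)
                woken.append(connection.name)
    return woken


# Two hypothetical nodes with three executors each, adjacent in the list.
connections = [
    FakeConnection(f"node-{n}/executor-{e}", {"build"})
    for n in range(2) for e in range(3)
]
print(wake_connections_shuffled(connections, b'NOOP', job_name="build"))
```

Shuffling a copy rather than the list itself is a design choice made for this sketch; it leaves the stored connection order untouched so other code that iterates over it is unaffected.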

Useful xref: http://gearman.org/protocol/

Event Timeline

dancy created this task. Wed, Jul 22, 6:18 PM
Restricted Application added a subscriber: Aklapper. Wed, Jul 22, 6:18 PM
dancy triaged this task as High priority. Wed, Jul 22, 9:03 PM
dancy updated the task description.
dancy updated the task description.
dancy updated the task description. Wed, Jul 22, 9:05 PM
brennen moved this task from Backlog to Watching on the User-brennen board.
dancy updated the task description. Wed, Jul 22, 9:13 PM
dancy updated the task description.
hashar updated the task description. Mon, Jul 27, 7:12 PM
hashar added a subscriber: hashar. Mon, Jul 27, 7:29 PM

The patch can be sent upstream to https://opendev.org/opendev/gear.git. You would need an Ubuntu / Launchpad account to participate.

Meanwhile, we need to fork the repository to our own Gerrit, which I have done: https://gerrit.wikimedia.org/r/integration/gear

To bump the dependency for Zuul: we have the source code in integration/zuul under the branch patch-queue/debian/jessie-wikimedia, where gear is pinned to an old version:

requirements.txt
gear==0.7.0

Either we add a patch on top of that 0.7.0 tag, or we upgrade gear to the latest released version plus the patch. We can then have pip install from our Gerrit repository instead of PyPI by replacing the requirement with something such as:

git+https://gerrit.wikimedia.org/r/integration/gear.git@master#egg=gear

master can be replaced by a custom tag, perhaps something like 0.15.1.1.

Change 616599 had a related patch set uploaded (by Ahmon Dancy; owner: Ahmon Dancy):
[integration/gear@master] wakeConnections: Randomize connections before scanning them

https://gerrit.wikimedia.org/r/616599

Change 616819 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Add tox for integration/gear

https://gerrit.wikimedia.org/r/616819

Change 616819 merged by jenkins-bot:
[integration/config@master] Add tox for integration/gear

https://gerrit.wikimedia.org/r/616819

Change 616599 merged by jenkins-bot:
[integration/gear@master] wakeConnections: Randomize connections before scanning them

https://gerrit.wikimedia.org/r/616599

Mentioned in SAL (#wikimedia-releng) [2020-07-28T15:48:18Z] <hashar> integration/gear : git tag 0.15.1+wmf1 5fd9c37 for T258630

Change 616858 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/zuul@patch-queue/debian/jessie-wikimedia] WMF: use gear from our forked version

https://gerrit.wikimedia.org/r/616858

Change 616858 merged by Hashar:
[integration/zuul@patch-queue/debian/jessie-wikimedia] WMF: use gear from our forked version

https://gerrit.wikimedia.org/r/616858

Change 617404 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/zuul/deploy@master] Upgrade gear from 0.7.0 to 1.15.1+wmf1

https://gerrit.wikimedia.org/r/617404