Improve scheduling of CI jobs invoked by zuul
Closed, Resolved (Public)

Description

Jenkins jobs invoked by Zuul tend to be clustered on a few integration-agent-docker-* nodes rather than being spread out across all of them evenly. This is a problem because more jobs running on a node means more pressure on the /srv filesystem, increasing the likelihood of Shinken alerts and/or the maintenance-disconnect-full-disks job taking nodes offline.

Spreading the jobs out evenly across all of the available nodes would reduce the pressure on /srv and improve overall resource utilization.

T218458 attempted to deal with this problem by adding the Least Load Jenkins plugin but the solution was incomplete.

I looked at the source code for the implementation of Gearman that is used by Zuul: https://opendev.org/opendev/gear

The Gearman server maintains a list of connected workers (in our case these are Jenkins executors). When a client (in this case, Zuul) submits a job, the list of connections is scanned in order and each idle worker that can handle the job is sent a NOOP message. Each worker reacts by sending a GRAB_JOB message (which asks to be assigned any available job that it can perform). The first worker to respond gets the job. Since the list of connected workers is likely to have all executors for a given node adjacent to each other, this pattern results in clustering of jobs onto nodes.
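
The clustering effect can be illustrated with a toy model. This is only a sketch, not gear's code: it assumes workers answer the NOOP in the order they were woken, that no executor frees up while the jobs are being submitted, and the node names and executor counts are made up for the example.

```python
from collections import Counter


def assign_jobs_in_order(connections, n_jobs):
    """Give each job to the first idle executor found when scanning in list order.

    connections: list of (node_name, executor_index) tuples, in the order the
    server holds them. Returns a Counter of jobs per node.
    """
    busy = set()
    jobs_per_node = Counter()
    for _ in range(n_jobs):
        for conn in connections:
            if conn not in busy:
                # Assumption: the first worker woken is the first to grab the job.
                busy.add(conn)
                jobs_per_node[conn[0]] += 1
                break
    return jobs_per_node


# Hypothetical fleet: 3 nodes with 4 executors each, registered node by node.
connections = [(f"node-{n}", e) for n in range(3) for e in range(4)]

print(assign_jobs_in_order(connections, 4))  # Counter({'node-0': 4})
```

Under these assumptions, four concurrent jobs all land on the first node while the other two stay idle.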

A simple first attempt to deal with job distribution is to randomize the list of connections before sending the NOOP messages.
xref: gear/__init__.py, Server.wakeConnections():

class Server(BaseClientServer):

    def wakeConnections(self, job=None):
        p = Packet(constants.RES, constants.NOOP, b'')
        for connection in self.active_connections:
            if connection.state == 'SLEEP':
                if ((job and job.name in connection.functions) or
                        (job is None)):
                    connection.changeState("AWAKE")
                    connection.sendPacket(p)
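
One way the randomization could look is sketched below. It is not the change that was eventually merged; FakeConnection, wake_connections_shuffled and the string "NOOP" are hypothetical stand-ins so the example runs without gear installed.

```python
import random


class FakeConnection:
    """Hypothetical stand-in for a gear server-side worker connection."""

    def __init__(self, name, functions):
        self.name = name
        self.functions = functions
        self.state = "SLEEP"
        self.sent = []

    def changeState(self, state):
        self.state = state

    def sendPacket(self, packet):
        self.sent.append(packet)


def wake_connections_shuffled(active_connections, job_name=None, rng=random):
    """Wake eligible sleeping connections in random order.

    Returns the names of the woken connections, in the order they were woken.
    """
    # Shuffle a copy so the server's own connection list keeps its order.
    candidates = list(active_connections)
    rng.shuffle(candidates)
    woken = []
    for connection in candidates:
        if connection.state == "SLEEP":
            if job_name is None or job_name in connection.functions:
                connection.changeState("AWAKE")
                connection.sendPacket("NOOP")  # placeholder for the NOOP packet
                woken.append(connection.name)
    return woken


# Hypothetical executors: two nodes, two executors each, one unrelated worker.
conns = [
    FakeConnection("node-0/exec-0", {"build:tox"}),
    FakeConnection("node-0/exec-1", {"build:tox"}),
    FakeConnection("node-1/exec-0", {"build:tox"}),
    FakeConnection("node-1/exec-1", {"build:tox"}),
    FakeConnection("merger", {"merger:merge"}),
]
print(wake_connections_shuffled(conns, "build:tox"))
```

The same set of workers is woken as before; only the order in which they receive the NOOP changes, so executors on the same node are no longer favoured by their position in the list.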

Useful xref: http://gearman.org/protocol/

Event Timeline

dancy triaged this task as High priority. (Jul 22 2020, 9:03 PM)
dancy updated the task description.

The patch can be sent upstream to https://opendev.org/opendev/gear.git . You would need an Ubuntu / Launchpad account to participate.

Meanwhile we will need to fork the repository to our own Gerrit which I have done: https://gerrit.wikimedia.org/r/integration/gear

We then need to bump the dependency for Zuul. We have the source code in integration/zuul under the branch patch-queue/debian/jessie-wikimedia, where gear is pinned to an old version:

requirements.txt
gear==0.7.0

Either we add a patch on top of that 0.7.0 tag, or we upgrade gear to the latest released version plus the patch. We can then have pip install from our Gerrit repository instead of PyPI by replacing the requirement with something such as:

git+https://gerrit.wikimedia.org/r/integration/gear.git@master#egg=gear

master could be replaced by a custom tag, for example 0.15.1.1.

Change 616599 had a related patch set uploaded (by Ahmon Dancy; owner: Ahmon Dancy):
[integration/gear@master] wakeConnections: Randomize connections before scanning them

https://gerrit.wikimedia.org/r/616599

Change 616819 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Add tox for integration/gear

https://gerrit.wikimedia.org/r/616819

Change 616819 merged by jenkins-bot:
[integration/config@master] Add tox for integration/gear

https://gerrit.wikimedia.org/r/616819

Change 616599 merged by jenkins-bot:
[integration/gear@master] wakeConnections: Randomize connections before scanning them

https://gerrit.wikimedia.org/r/616599

Mentioned in SAL (#wikimedia-releng) [2020-07-28T15:48:18Z] <hashar> integration/gear : git tag 0.15.1+wmf1 5fd9c37 for T258630

Change 616858 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/zuul@patch-queue/debian/jessie-wikimedia] WMF: use gear from our forked version

https://gerrit.wikimedia.org/r/616858

Change 616858 merged by Hashar:
[integration/zuul@patch-queue/debian/jessie-wikimedia] WMF: use gear from our forked version

https://gerrit.wikimedia.org/r/616858

Change 617404 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/zuul/deploy@master] Upgrade gear from 0.7.0 to 1.15.1+wmf1

https://gerrit.wikimedia.org/r/617404

Change 617404 merged by Hashar:
[integration/zuul/deploy@master] Upgrade gear from 0.7.0 to 1.15.1+wmf1

https://gerrit.wikimedia.org/r/617404

Mentioned in SAL (#wikimedia-operations) [2020-08-20T07:28:58Z] <hashar@deploy1001> Started deploy [zuul/deploy@5989ed0]: Upgrade gear from 0.7.0 to 1.15.1+wmf1 - T258630

Mentioned in SAL (#wikimedia-operations) [2020-08-20T07:29:11Z] <hashar@deploy1001> Finished deploy [zuul/deploy@5989ed0]: Upgrade gear from 0.7.0 to 1.15.1+wmf1 - T258630 (duration: 00m 13s)

That also affects the load sharing of zuul-merger. The distribution can be compared using grep -c CreateZuulRef /var/log/zuul/merger-debug.log.*, and the counts reflect a bias toward contint1001:

Date        contint2001  contint1001
2020-07-21         2114         3821
2020-07-22         2178         4108
2020-07-23         2608         4581
2020-07-24         2801         4459
2020-07-25         1357         2112
2020-07-26          527          868
2020-07-27         2240         4215
2020-07-28         3685         5693
2020-07-29         1693         3221
2020-07-30         1669         3003
2020-07-31         1344         2437
2020-08-01          293          505
2020-08-02          126          227
2020-08-03         1184         1829
2020-08-04         1965         3563
2020-08-05         1635         2731
2020-08-06         1093         2238
2020-08-07         1038         1919
2020-08-08          614         1140
2020-08-09          280          481
2020-08-10          978         1775
2020-08-11         1069         1797
2020-08-12         1313         2326
2020-08-13         1513         2611
2020-08-14          909         1798
2020-08-15          468          847
2020-08-16          204          350
2020-08-17         1262         2429
2020-08-18         1726         3032
2020-08-19         1650         2967
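
To put a number on the bias, the per-day counts can be turned into a share handled by contint1001. The snippet below is a small sketch; the helper name contint1001_share is made up, and the three rows are copied from the first days of the table above.

```python
def contint1001_share(contint2001_count, contint1001_count):
    """Fraction of CreateZuulRef entries that were logged on contint1001."""
    total = contint2001_count + contint1001_count
    if total == 0:
        raise ValueError("no CreateZuulRef entries counted")
    return contint1001_count / total


# (date, contint2001, contint1001), taken from the table above.
rows = [
    ("2020-07-21", 2114, 3821),
    ("2020-07-22", 2178, 4108),
    ("2020-07-23", 2608, 4581),
]

for date, c2001, c1001 in rows:
    print(f"{date}: contint1001 handled {contint1001_share(c2001, c1001):.0%}")
```

An even split between the two mergers would give a share of 50%.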

The last action is to upstream the patch. I have sent it as https://review.opendev.org/#/c/747119/

thcipriani lowered the priority of this task from High to Low. (Aug 26 2020, 5:44 PM)
thcipriani added a subscriber: thcipriani.

Deployed on our infrastructure, now waiting for upstream

Date        contint2001  contint1001
2020-08-19         1650         2967
2020-08-20         1932         3289
2020-08-21          937         1644
2020-08-22          362          702
2020-08-23         1425         1737
2020-08-24         1186         2136
2020-08-25         1262         2424

There are still more jobs being handled by the contint1001 zuul-merger for some reason. I have re-verified the code and setup, and everything seems all right.

Maybe contint2001 just takes more time, possibly due to slower disks or larger repositories.

On the Jenkins side I haven't dug into the logs to verify that the load is better spread.

I have raised awareness of Ahmon's patch ( https://review.opendev.org/#/c/747119/ ) on one of the upstream mailing lists: http://lists.opendev.org/pipermail/service-discuss/2020-November/000127.html

The project apparently lacks reviewers (sounds familiar? :D )

Meanwhile I have reviewed a few of the other pending changes.

I have poked upstream today about this change specifically.

From the conversation:

<fungi> it's seemed at least close to "feature-complete" for some years, but if there are fixes or in-scope improvements then we'll try to review them
though also its importance to us may shrink with zuul 5.0.0 and the distributed scheduler work moving that coordination into zookeeper

So the python gear module is mostly in maintenance mode nowadays.

The bulk of the work has been done. Once upstream has released a new version, we will be able to adjust our integration/zuul requirements.txt to drop the git+https://gerrit.wikimedia.org/r/integration/gear.git@0.15.1+wmf1#egg=gear requirement, then rebuild the deploy repo and upgrade gear ;)

Upstream hasn't cut a new release yet.

Spoke a bit with upstream about it today (#zuul on OFTC IRC):

sounds like after zuul 4.7.0 we should merge a gear<0.16 pin in zuul's requirements, then after the next zuul release after 4.7.0 we can tag gear 0.16.0

They need to ensure Zuul does not get broken as a result of upgrading and thus take some extra care when changing the dependency versions.

T289512