Improve scheduling of CI jobs invoked by zuul
Closed, Resolved · Public

Authored By

	• dancy
	Jul 22 2020, 6:18 PM

Description

Jenkins jobs invoked by Zuul tend to be clustered on a few integration-agent-docker-* nodes rather than being spread out across all of them evenly. This is a problem because more jobs running on a node means more pressure on the /srv filesystem, increasing the likelihood of Shinken alerts and/or the maintenance-disconnect-full-disks job taking nodes offline.

Spreading the jobs out evenly across all of the available nodes would reduce the pressure on /srv and improve overall resource utilization.

T218458 attempted to deal with this problem by adding the Least Load Jenkins plugin but the solution was incomplete.

I looked at the source code for the implementation of Gearman that is used by Zuul: https://opendev.org/opendev/gear

The gearman server maintains a list of connected workers (in our case these are Jenkins executors). When a client (in this case, zuul) submits a job, the list of connections is scanned in order and each idle worker that can handle the job is sent a NOOP message. Each worker will react by sending a GRAB_JOB message (which asks to be assigned any available job that it can perform). The first worker to respond gets the job. Since the list of connected workers is likely to have all executors for a given node adjacent to each other, this pattern results in clustering of jobs onto nodes.

A simple first attempt at improving job distribution is to randomize the list of connections before sending the NOOP messages. The relevant code is Server.wakeConnections() in gear/__init__.py:

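class Server(BaseClientServer):

    def wakeConnections(self, job=None):
        # A single NOOP packet is reused to wake every eligible worker.
        p = Packet(constants.RES, constants.NOOP, b'')
        # Connections are scanned in list order, which tends to favor
        # workers that sit next to each other in the list.
        for connection in self.active_connections:
            if connection.state == 'SLEEP':
                # Wake the worker if it can run this job, or wake every
                # sleeping worker when no specific job is given.
                if ((job and job.name in connection.functions) or
                        (job is None)):
                    connection.changeState("AWAKE")
                    connection.sendPacket(p)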
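
To illustrate why shuffling should help, here is a minimal simulation. It is not the actual patch. It assumes that executors of the same node are adjacent in the connection list and that the first worker woken is the one that wins the job; it also ignores workers becoming idle again. `make_workers` and `assign` are hypothetical names.

```python
import random
from collections import Counter


def make_workers(nodes=4, executors_per_node=3):
    # Assumption: executors of one node are adjacent in the list.
    return [(f"node-{n}", e)
            for n in range(nodes)
            for e in range(executors_per_node)]


def assign(jobs, workers, shuffle, seed=0):
    """Hand out `jobs` jobs and return how many each node received."""
    rng = random.Random(seed)
    idle = list(workers)
    per_node = Counter()
    for _ in range(jobs):
        candidates = list(idle)
        if shuffle:
            # Proposed change: randomize the scan order for each wake-up.
            rng.shuffle(candidates)
        # Assumption: the first worker woken answers first and gets the job.
        winner = candidates[0]
        idle.remove(winner)
        per_node[winner[0]] += 1
    return per_node


workers = make_workers()
print("in order:", dict(assign(4, workers, shuffle=False)))
print("shuffled:", dict(assign(4, workers, shuffle=True)))
```

With in-order scanning, the first three of four jobs land on node-0 and the fourth on node-1. With shuffling, the result depends on the seed, but jobs are no longer tied to list position.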

Useful xref: http://gearman.org/protocol/
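
For reference, the NOOP and GRAB_JOB packets mentioned above share the 12-byte binary header described in the protocol document: a 4-byte magic code, a 4-byte big-endian packet type, and a 4-byte big-endian payload size. The sketch below builds them by hand, assuming the type numbers listed on the protocol page (NOOP = 6, GRAB_JOB = 9). `build_packet` is a hypothetical helper and not part of gear.

```python
import struct

REQ = b"\x00REQ"  # magic code for requests (worker/client to server)
RES = b"\x00RES"  # magic code for responses (server to worker/client)
NOOP = 6          # assumed type number, taken from the protocol page
GRAB_JOB = 9      # assumed type number, taken from the protocol page


def build_packet(magic, ptype, data=b""):
    # Header: magic, then packet type and payload size as big-endian uint32.
    return magic + struct.pack(">II", ptype, len(data)) + data


noop = build_packet(RES, NOOP)          # server wakes a sleeping worker
grab_job = build_packet(REQ, GRAB_JOB)  # worker asks for any job it can run
print(noop, grab_job)
```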

Details

Subject	Repo	Branch	Lines +/-
Use gear from upstream and bump it to 0.16.0	integration/zuul/deploy	master	+1 -1
WMF: use gear from our forked version	integration/zuul	patch-queue/debian/jessie-wikimedia	+1 -1
Upgrade gear from 0.7.0 to 1.15.1+wmf1	integration/zuul/deploy	master	+2 -2
wakeConnections: Randomize connections before scanning them	integration/gear	master	+8 -1
Add tox for integration/gear	integration/config	master	+10 -0

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		• dancy	T258630 Improve scheduling of CI jobs invoked by zuul
		Resolved		hashar	T259611 The python-build images regenerate wheels even when matching ones are already available

Event Timeline

• dancy created this task. · Jul 22 2020, 6:18 PM

Restricted Application added a subscriber: Aklapper. · Jul 22 2020, 6:18 PM

• dancy triaged this task as High priority. · Jul 22 2020, 9:03 PM

• dancy updated the task description.

• dancy updated the task description. · Jul 22 2020, 9:05 PM

brennen added a project: User-brennen. · Jul 22 2020, 9:09 PM

brennen moved this task from Backlog to Radar on the User-brennen board.

• dancy updated the task description. · Jul 22 2020, 9:13 PM

• dancy updated the task description.

• dancy added a project: Zuul. · Jul 24 2020, 7:37 PM

hashar added a project: Continuous-Integration-Infrastructure. · Jul 26 2020, 12:37 PM

hashar updated the task description. · Jul 27 2020, 7:12 PM

The patch can be sent upstream to https://opendev.org/opendev/gear.git. You would need an Ubuntu / Launchpad account to participate.

Meanwhile, we need a fork of the repository in our own Gerrit, which I have created: https://gerrit.wikimedia.org/r/integration/gear

To bump the dependency for Zuul: the source code is in integration/zuul under the branch patch-queue/debian/jessie-wikimedia, where gear is pinned to an old version:

requirements.txt

gear==0.7.0

Either we add a patch on top of that 0.7.0 tag, or we upgrade gear to the latest released version plus the patch. We can then have pip install from our Gerrit repository instead of PyPI by replacing the requirement with something such as:

git+https://gerrit.wikimedia.org/r/integration/gear.git@master#egg=gear

master could be replaced by a custom tag, such as 0.15.1.1.

Change 616599 had a related patch set uploaded (by Ahmon Dancy; owner: Ahmon Dancy):
[integration/gear@master] wakeConnections: Randomize connections before scanning them

https://gerrit.wikimedia.org/r/616599

gerritbot added a project: Patch-For-Review. · Jul 27 2020, 9:50 PM

• dancy moved this task from Pondering to Awaiting review/merge on the User-dancy board. · Jul 27 2020, 9:56 PM

Change 616819 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Add tox for integration/gear

https://gerrit.wikimedia.org/r/616819

Change 616819 merged by jenkins-bot:
[integration/config@master] Add tox for integration/gear

https://gerrit.wikimedia.org/r/616819

Change 616599 merged by jenkins-bot:
[integration/gear@master] wakeConnections: Randomize connections before scanning them

https://gerrit.wikimedia.org/r/616599

Mentioned in SAL (#wikimedia-releng) [2020-07-28T15:48:18Z] <hashar> integration/gear : git tag 0.15.1+wmf1 5fd9c37 for T258630

Change 616858 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/zuul@patch-queue/debian/jessie-wikimedia] WMF: use gear from our forked version

https://gerrit.wikimedia.org/r/616858

Change 616858 merged by Hashar:
[integration/zuul@patch-queue/debian/jessie-wikimedia] WMF: use gear from our forked version

https://gerrit.wikimedia.org/r/616858

Change 617404 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/zuul/deploy@master] Upgrade gear from 0.7.0 to 1.15.1+wmf1

https://gerrit.wikimedia.org/r/617404

Stalled pending https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/605653

hashar added projects: Release-Engineering-Team (CI & Testing services), Release-Engineering-Team-TODO (2020-07-01 to 2020-09-30 (Q1)). · Aug 4 2020, 12:37 PM

hashar added a subtask: T259611: The python-build images regenerate wheels even when matching ones are already available.

• dancy moved this task from Awaiting review/merge to Watching on the User-dancy board. · Aug 4 2020, 6:59 PM

Change 617404 merged by Hashar:
[integration/zuul/deploy@master] Upgrade gear from 0.7.0 to 1.15.1+wmf1

https://gerrit.wikimedia.org/r/617404

Mentioned in SAL (#wikimedia-operations) [2020-08-20T07:28:58Z] <hashar@deploy1001> Started deploy [zuul/deploy@5989ed0]: Upgrade gear from 0.7.0 to 1.15.1+wmf1 - T258630

Mentioned in SAL (#wikimedia-operations) [2020-08-20T07:29:11Z] <hashar@deploy1001> Finished deploy [zuul/deploy@5989ed0]: Upgrade gear from 0.7.0 to 1.15.1+wmf1 - T258630 (duration: 00m 13s)

That also affects the load sharing of zuul-merger. It can be compared by looking at the output of grep -c CreateZuulRef /var/log/zuul/merger-debug.log.*, which reflects a bias toward contint1001:

Date	contint2001	contint1001
2020-07-21	2114	3821
2020-07-22	2178	4108
2020-07-23	2608	4581
2020-07-24	2801	4459
2020-07-25	1357	2112
2020-07-26	527	868
2020-07-27	2240	4215
2020-07-28	3685	5693
2020-07-29	1693	3221
2020-07-30	1669	3003
2020-07-31	1344	2437
2020-08-01	293	505
2020-08-02	126	227
2020-08-03	1184	1829
2020-08-04	1965	3563
2020-08-05	1635	2731
2020-08-06	1093	2238
2020-08-07	1038	1919
2020-08-08	614	1140
2020-08-09	280	481
2020-08-10	978	1775
2020-08-11	1069	1797
2020-08-12	1313	2326
2020-08-13	1513	2611
2020-08-14	909	1798
2020-08-15	468	847
2020-08-16	204	350
2020-08-17	1262	2429
2020-08-18	1726	3032
2020-08-19	1650	2967
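
The per-day counts above come from `grep -c`. An equivalent sketch in Python follows, assuming the rotated logs are plain text (compressed rotations would need `gzip.open` instead). `count_matches` and `counts_by_file` are hypothetical helpers.

```python
import glob


def count_matches(path, needle="CreateZuulRef"):
    """Count lines containing `needle`, like `grep -c needle path`."""
    with open(path, errors="replace") as fh:
        return sum(1 for line in fh if needle in line)


def counts_by_file(pattern="/var/log/zuul/merger-debug.log.*"):
    # One count per rotated log file, in sorted file-name order.
    return {path: count_matches(path) for path in sorted(glob.glob(pattern))}


if __name__ == "__main__":
    for path, count in counts_by_file().items():
        print(f"{path}:{count}")
```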

The last action is to upstream the patch. I have sent it as https://review.opendev.org/#/c/747119/

hashar moved this task from Backlog to Patch proposed upstream on the Upstream board. · Aug 20 2020, 8:07 AM

Deployed on our infrastructure, now waiting for upstream.

Date	contint2001	contint1001
2020-08-19	1650	2967
2020-08-20	1932	3289
2020-08-21	937	1644
2020-08-22	362	702
2020-08-23	1425	1737
2020-08-24	1186	2136
2020-08-25	1262	2424

There are still more jobs being handled by the contint1001 zuul-merger for some reason. I have reverified the code and the setup, and everything seems all right.
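
To put a number on the remaining bias, contint1001's share of the merges can be computed from the table above. This is a minimal sketch using the rows from 2020-08-20 onward; `share_of_contint1001` is a hypothetical name, and an even split would give 0.5.

```python
rows = [
    # (date, contint2001, contint1001), copied from the table above
    ("2020-08-20", 1932, 3289),
    ("2020-08-21", 937, 1644),
    ("2020-08-22", 362, 702),
    ("2020-08-23", 1425, 1737),
    ("2020-08-24", 1186, 2136),
    ("2020-08-25", 1262, 2424),
]


def share_of_contint1001(rows):
    """Fraction of all counted merges that were handled by contint1001."""
    total_2001 = sum(r[1] for r in rows)
    total_1001 = sum(r[2] for r in rows)
    return total_1001 / (total_2001 + total_1001)


for date, c2001, c1001 in rows:
    print(date, round(c1001 / (c2001 + c1001), 3))
print("overall", round(share_of_contint1001(rows), 3))
```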

Maybe contint2001 just takes more time, possibly due to slower disks or larger repositories.

On the Jenkins side, I haven't dug into the logs to verify that the load is better spread.

thcipriani edited projects, added Release-Engineering-Team-TODO (2020-10-01 to 2020-12-31 (Q2)); removed Release-Engineering-Team-TODO (2020-07-01 to 2020-09-30 (Q1)). · Oct 21 2020, 1:07 AM

thcipriani moved this task from INBOX to Maintenance on the Release-Engineering-Team-TODO (2020-10-01 to 2020-12-31 (Q2)) board. · Oct 21 2020, 5:18 PM

I have raised awareness of Ahmon's patch ( https://review.opendev.org/#/c/747119/ ) on one of the upstream mailing lists: http://lists.opendev.org/pipermail/service-discuss/2020-November/000127.html

The project lacks reviews apparently (sounds familiar? :D )

hashar closed subtask T259611: The python-build images regenerate wheels even when matching ones are already available as Declined. · Feb 5 2021, 1:12 PM

brennen removed a project: User-brennen. · Mar 8 2021, 11:05 PM

Meanwhile I have reviewed a few of the other pending changes.

I have poked upstream today about this change specifically.

From the conversation:

<fungi> it's seemed at least close to "feature-complete" for some years, but if there are fixes or in-scope improvements then we'll try to review them
though also its importance to us may shrink with zuul 5.0.0 and the distributed scheduler work moving that coordination into zookeeper

So the python gear module is mostly in maintenance mode nowadays.

Upstream patch got merged :] https://review.opendev.org/c/opendev/gear/+/747119

Yay!

Bulk of the work has been done. Once upstream has released a new version, we will be able to adjust our integration/zuul requirements.txt to drop the git+https://gerrit.wikimedia.org/r/integration/gear.git@0.15.1+wmf1#egg=gear. Then rebuild the deploy repo and upgrade gear ;)

thcipriani removed a project: Release-Engineering-Team-TODO (2020-10-01 to 2020-12-31 (Q2)). · Apr 20 2021, 12:21 AM

thcipriani added a project: Release-Engineering-Team-TODO (2021-04-01 to 2021-06-30 (Q4)). · Apr 20 2021, 12:34 AM

thcipriani removed a project: Release-Engineering-Team (CI & Testing services). · Apr 20 2021, 1:09 AM

thcipriani edited projects, added Release-Engineering-Team (Doing); removed Release-Engineering-Team-TODO (2021-04-01 to 2021-06-30 (Q4)). · Apr 20 2021, 4:25 AM

Upstream hasn't cut a new release yet.

hashar moved this task from Backlog to Enhancements on the Zuul board. · May 26 2021, 3:22 PM

Spoke a bit with upstream about it today (#zuul on OFTC IRC):

sounds like after zuul 4.7.0 we should merge a gear<0.16 pin in zuul's requirements, then after the next zuul release after 4.7.0 we can tag gear 0.16.0

They need to ensure that Zuul does not break as a result of the upgrade, and thus take extra care when changing dependency versions.

In T258630#6920707, @hashar wrote:

Bulk of the work has been done. Once upstream has released a new version, we will be able to adjust our integration/zuul requirements.txt to drop the git+https://gerrit.wikimedia.org/r/integration/gear.git@0.15.1+wmf1#egg=gear. Then rebuild the deploy repo and upgrade gear ;)

T289512