
Gerrit events not received by Zuul due to TCP Proxy timeout (CI is not triggered for some patches)
Closed, ResolvedPublic

Description

For two of my recent gerrit patches, CI was not triggered after I uploaded them, and I had to comment "recheck" to manually trigger the jobs.

https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1239506
https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CommunityRequests/+/1239420 (both after uploading PS1 and PS2)

Another patch was +2ed on Thursday but gate-and-submit was not triggered so it wasn't merged: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/FlaggedRevs/+/1238850

This also seems to affect other people's patches, e.g. https://gerrit.wikimedia.org/r/c/mediawiki/services/parsoid/+/1239508

https://gerrit.wikimedia.org/r/c/integration/config/+/1239227 - 2026-02-12 - 20:14:57 UTC

Event Timeline

Restricted Application added a subscriber: Aklapper.

I have looked in the Zuul log on contint1002.wikimedia.org and the affected patchsets do not show up in it. The events were never received.

In the Gerrit logs (via private access https://logstash.wikimedia.org/app/dashboards#/view/AW1f-0k0ZKA7RpirlnKV) I see https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1239506 being uploaded at Feb 14, 2026 @ 21:56:08.778, and it certainly was uploaded since the change exists.

I don't see any lost connection for zuul.GerritEventConnector either. So it is a complete mystery, and I don't see why the events would have vanished :-\

I have also looked at integration/config/+/1239227, which is from earlier: February 12. The test pipeline reported at 20:14:53 UTC; jdforrester cast a vote at 20:14:57 UTC which never made it to Zuul. Here is the log:

2026-02-12 20:14:53,114 INFO zuul.IndependentPipelineManager: Reporting item <QueueItem 0x7ff4b974c290 for <Change 0x7ff4e7817cd0 1239227,1> in test-prio>, actions: [<GerritReporter connection: gerrit://gerrit>]
2026-02-12 20:14:55,170 INFO zuul.Gearman: Build <gear.Job 0x7ff4d3adb6d0 handle: H:::ffff:127.0.0.1:2060291 name: build:mwcore-phpunit-coverage-patch unique: 9e2d81b06c4c4ed68cf612be1d03baea> complete, result SUCCESS
2026-02-12 20:14:55,180 INFO zuul.IndependentPipelineManager: Reporting item <QueueItem 0x7ff51d34b110 for <Change 0x7ff5400c9550 1239225,1> in coverage>, actions: [<GerritReporter connection: gerrit://gerrit>]
2026-02-12 20:15:20,356 INFO zuul.Gearman: Build <gear.Job 0x7ff4d3629dd0 handle: H:::ffff:127.0.0.1:2060205 name: build:quibble-vendor-mysql-php83-selenium unique: e694d76ff21945478e05d6404875a0e1> complete, result SUCCESS
2026-02-12 20:15:20,377 INFO zuul.IndependentPipelineManager: Reporting item <QueueItem 0x7ff4d3efcb10 for <Change 0x7ff4cd311110 1239223,1> in test>, actions: [<GerritReporter connection: gerrit://gerrit>]

In the debug log, the last entry for 1239227 is the end of the test pipeline:

2026-02-12 20:14:53,389 DEBUG zuul.IndependentPipelineManager: Removing change <Change 0x7ff4e7817cd0 1239227,1> from queue
hashar renamed this task from Gerrit CI is not triggered for some patches to Gerrit events not received by Zuul (CI is not triggered for some patches). Feb 17 2026, 6:18 AM
hashar triaged this task as High priority.

Change #1239868 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/puppet@production] zuul: log Gerrit stream-events

https://gerrit.wikimedia.org/r/1239868

There are events that are never received by Zuul. I thought about logging the whole stream of events on the Zuul side (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1239868), then I realized it might be the connection dropping. Zuul automatically retries connecting every 5 seconds (iirc), and since that is done over SSH we can see the connection attempts on the Gerrit side.

From gerrit1003:

zgrep -c 'jenkins-bot.*stream' /var/log/gerrit/sshd_log.2026*
/var/log/gerrit/sshd_log.2026-01-17.gz:1
/var/log/gerrit/sshd_log.2026-01-18.gz:22
/var/log/gerrit/sshd_log.2026-01-19.gz:5
/var/log/gerrit/sshd_log.2026-01-20.gz:1
/var/log/gerrit/sshd_log.2026-01-21.gz:0
/var/log/gerrit/sshd_log.2026-01-22.gz:0
/var/log/gerrit/sshd_log.2026-01-23.gz:0
/var/log/gerrit/sshd_log.2026-01-24.gz:0
/var/log/gerrit/sshd_log.2026-01-25.gz:0
/var/log/gerrit/sshd_log.2026-01-26.gz:0
/var/log/gerrit/sshd_log.2026-01-27.gz:0
/var/log/gerrit/sshd_log.2026-01-28.gz:0
/var/log/gerrit/sshd_log.2026-01-29.gz:0
/var/log/gerrit/sshd_log.2026-01-30.gz:0
/var/log/gerrit/sshd_log.2026-01-31.gz:9
/var/log/gerrit/sshd_log.2026-02-01.gz:14
/var/log/gerrit/sshd_log.2026-02-02.gz:4
/var/log/gerrit/sshd_log.2026-02-03.gz:0
/var/log/gerrit/sshd_log.2026-02-04.gz:0
/var/log/gerrit/sshd_log.2026-02-05.gz:3
/var/log/gerrit/sshd_log.2026-02-06.gz:5
/var/log/gerrit/sshd_log.2026-02-07.gz:17
/var/log/gerrit/sshd_log.2026-02-08.gz:14
/var/log/gerrit/sshd_log.2026-02-09.gz:6
/var/log/gerrit/sshd_log.2026-02-10.gz:3
/var/log/gerrit/sshd_log.2026-02-11.gz:893
/var/log/gerrit/sshd_log.2026-02-12.gz:1178
/var/log/gerrit/sshd_log.2026-02-13.gz:1368
/var/log/gerrit/sshd_log.2026-02-14.gz:2104
/var/log/gerrit/sshd_log.2026-02-15.gz:2106

From gerrit2003 (we switched Gerrit over to it yesterday):

$ zgrep -c 'jenkins-bot.*stream' /var/log/gerrit/sshd_log.2026* /var/log/gerrit/sshd_log
/var/log/gerrit/sshd_log.2026-02-16:592
/var/log/gerrit/sshd_log:502

The first disconnection happened at 2026-02-11T04:47:39.560Z.

To correlate with T417497#11621525:

The test pipeline reported at 20:14:53 UTC; jdforrester cast a vote at 20:14:57 UTC which never made it to Zuul.

The sshd_log on gerrit1003 has:

Report of the test completion:

[2026-02-12T20:14:53.376Z] 8cd6fca1 [SSH gerrit review --project integration/config --message Main test build succeeded. [trimmed] --tag autogenerated:ci-test --verified 2 1239227,1 (jenkins-bot)] jenkins-bot a/75 gerrit.review.--project.integration/config.--message.Main test build succeeded.

There is a connection between 20:14:22 and 20:14:52:

[2026-02-12T20:14:22.681Z] 0f5ad9fe [sshd-SshDaemon[1bc4ec31](port=22)-nio2-thread-15] jenkins-bot a/75 LOGIN FROM 2620:0:861:107:10:64:48:45
[2026-02-12T20:14:52.691Z] 0f5ad9fe [SshCommandDestroy-0] jenkins-bot a/75 gerrit.stream-events 0ms 30003ms - 0 - -10181ms -8480ms -1277191528

Jdforrester casts a vote at 20:14:57 UTC, which is not received since Zuul is not streaming events at that moment.

Zuul reconnects a second later:

[2026-02-12T20:14:58.025Z] 69a595e8 [sshd-SshDaemon[1bc4ec31](port=22)-nio2-thread-19] jenkins-bot a/75 LOGIN FROM 2620:0:861:107:10:64:48:45
[2026-02-12T20:16:26.314Z] 69a595e8 [SshCommandDestroy-0] jenkins-bot a/75 gerrit.stream-events 0ms 88280ms - 0 - -9969ms -8290ms -1165918768

Change #1239878 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/dns@master] wikimedia: revert gerrit behind the CDN

https://gerrit.wikimedia.org/r/1239878

Can you re-check if this is still an issue and update the severity accordingly? There was some progress regarding Gerrits 502 errors, see T417536#11622345.

https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1239870 is the latest patch where this is happening; a new patchset (rebase) was uploaded 10 minutes ago, but no CI jobs seem to be running.
There's also https://gerrit.wikimedia.org/r/c/operations/puppet/+/1239899 from 1h16m ago.

This is still happening, I'm unable to merge https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CampaignEvents/+/1236795 despite 3 attempts (the patch already had a +2 from ~1h ago that also failed, but my attempts are from a few minutes ago).

I believe this is working as intended as the patch depends on a parent which is not merged yet.

Eeeek yes sorry, you're right. I saw it fail to merge and immediately assumed it was this :/

Change #1239928 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] tcpproxy: raise connection limit from 200 to 400

https://gerrit.wikimedia.org/r/1239928

This will keep happening; we have/had two issues in parallel which use different protocols/paths:

  • T417536, which is for HTTPS requests that are now handled by the edge cache and Apache Traffic Server (ATS). That one had a workaround applied which seems to have fixed it.
  • This task, which is for the SSH connection disconnecting because it now passes through a TCP proxy (T408532).

I debugged the issue early this morning. Looking again at the log, the stream-events connection is often terminated after just 30 seconds, and indeed it is easy to reproduce by subscribing to an event that rarely occurs, such as changes being abandoned:

time gerrit stream-events --subscribe change-abandoned
+ ssh -p 29418 hashar@gerrit.wikimedia.org gerrit stream-events --subscribe change-abandoned
+ set +x

real	0m30,466s
user	0m0,019s
sys	0m0,011s

CQFD (Q.E.D.)!

I thought about using the SSH client keep-alive settings, e.g.:

ssh -vv -o ServerAliveInterval=14 -o ServerAliveCountMax=2 \
  -p 29418 gerrit.wikimedia.org gerrit stream-events --subscribe change-abandoned

But Zuul uses Paramiko (a Python implementation of SSH), and the EOL version of Zuul we run has no setting to enable keepalives.
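
For illustration, here is a minimal sketch of what enabling keepalives with Paramiko would look like, assuming a Zuul version that exposed the transport (ours does not). Host, account and interval mirror this task; the key path is illustrative:

import paramiko

client = paramiko.SSHClient()
client.load_system_host_keys()
client.connect('gerrit.wikimedia.org', port=29418,
               username='jenkins-bot',
               key_filename='/var/lib/zuul/.ssh/id_rsa')  # illustrative path

# set_keepalive() makes Paramiko send an SSH keepalive packet every N seconds,
# which would keep the TCP proxy from seeing the connection as idle.
client.get_transport().set_keepalive(14)

stdin, stdout, stderr = client.exec_command('gerrit stream-events')
for line in stdout:
    print(line, end='')  # each line is one JSON-encoded Gerrit event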

I think the proxy should exempt the SSH connections from any timeout and let Gerrit handle it. I imagine this affects various bots to varying degrees, though I haven't dug into the logs. Is it possible to disable the timeout for connections targeting Gerrit port 29418?
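
For what it's worth, in HAProxy terms I imagine the exemption could look something like the sketch below. This is hypothetical, not the actual tcpproxy Puppet config; the listener name, timeouts and backend are all illustrative:

# Hypothetical HAProxy listener giving the Gerrit SSH port its own,
# much larger timeouts instead of the short global TCP ones.
listen gerrit_ssh
    bind :29418
    mode tcp
    timeout client 1h
    timeout server 1h
    server gerrit1003 gerrit1003.wikimedia.org:29418 check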

For me, the sentence "CI is not triggered for some patches" isn't something new; it has been happening from time to time for weeks. But yes, much more in the last couple of days.

Change #1239878 abandoned by Jelto:

[operations/dns@master] wikimedia: revert gerrit behind the CDN

Reason:

not needed anymore, temporary workaround was found

https://gerrit.wikimedia.org/r/1239878

Change #1239868 abandoned by Hashar:

[operations/puppet@production] zuul: log Gerrit stream-events

Reason:

I found the issue: Zuul keeps being disconnected from Gerrit and thus loses any events that happen until it has reconnected (after 5 seconds, iirc).

https://gerrit.wikimedia.org/r/1239868

Change #1240243 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] tcpproxy: add internal gerrit backend with higher timeout

https://gerrit.wikimedia.org/r/1240243

Mentioned in SAL (#wikimedia-releng) [2026-02-18T12:15:47Z] <hashar> zuul-1001.zuul3.eqiad1.wikimedia.cloud: added keepalive=20 to the scheduler Gerrit driver and restarted scheduler container # T417497

There are other accounts listening for stream-events. They can be seen with gerrit show-connections, and whether they disconnect shows up on the Gerrit server in /var/log/gerrit/sshd_log. Besides jenkins-bot, used by Zuul, there are two accounts, suchabot and zuul-test, which I investigated this morning:

suchabot

That is the bot behind https://www.mediawiki.org/wiki/Wikibugs which relays Gerrit events to IRC.

Source is at https://gitlab.wikimedia.org/toolforge-repos/wikibugs2/-/blob/main/src/wikibugs2/gerrit.py; it uses asyncssh with a keepalive probe set at a 10-second interval:

keepalive_interval=10,  # Send a ping every 10 seconds
keepalive_count_max=3,  # Disconnect after 3 unanswered pings

And I have confirmed via sshd_log that it did not get disconnected.
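
For context, a minimal sketch (not the actual wikibugs2 code) of how those two options plug into an asyncssh connection; the host and account come from this task, the rest is illustrative:

import asyncio
import asyncssh

async def stream_events() -> None:
    async with asyncssh.connect(
        'gerrit.wikimedia.org', port=29418, username='suchabot',
        keepalive_interval=10,   # send a keepalive probe every 10 seconds
        keepalive_count_max=3,   # disconnect after 3 unanswered probes
    ) as conn:
        async with conn.create_process('gerrit stream-events') as proc:
            async for line in proc.stdout:
                print(line, end='')  # one JSON event per line

asyncio.run(stream_events())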

zuul-test

That is the newer Zuul dev platform on WMCS, which kept being disconnected. I have edited its scheduler configuration to set keepalive=20 in the gerrit driver and restarted the scheduler, which should keep the connection up. To be verified.
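
For reference, the modern Zuul Gerrit driver has a keepalive option (seconds between SSH keepalive probes), so the edit amounts to something like this sketch of the connection section (other options elided, values illustrative apart from keepalive=20 from the SAL entry above):

[connection gerrit]
driver=gerrit
server=gerrit.wikimedia.org
user=zuul-test
keepalive=20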

I'm seeing this on Parsoid patches too, for what it's worth.

hashar renamed this task from Gerrit events not received by Zuul (CI is not triggered for some patches) to Gerrit events not received by Zuul due to TCP Proxy timeout (CI is not triggered for some patches). Feb 19 2026, 1:41 PM
hashar added a subscriber: Vgutierrez.

TLDR

Zuul/CI does not receive events from Gerrit. The TCP proxy's 30-second timeout needs to be removed or raised significantly.

Summary

I have confirmed early Tuesday morning that since Wednesday 11/02 the Zuul → Gerrit SSH connection keeps being disconnected. During the 5+ seconds it takes Zuul to reconnect, events are missed and the associated CI jobs and workflows are not triggered.

The reason is that the connection now passes through a TCP proxy (T411895) which has a 30-second timeout.

I raised the issue with collaboration-services over IRC and exchanged with @Jelto yesterday. Arnaud pointed me this morning to an ongoing patch, https://gerrit.wikimedia.org/r/c/operations/puppet/+/1240243; I offered a comment there, but I am not familiar with HAProxy, nor do I know much about the TCP proxy infrastructure. Arnaud mentioned Traffic and @Vgutierrez, hence the CC here.

The traffic I know of:

  • git traffic over SSH from the outside; Gerrit imposes a 30-second idle timeout on those connections and terminates them.
  • zuul-merger on contint1002/contint2002 doing git clones/fetches from Gerrit. These are subject to the same 30-second idle timeout.
  • Zuul on contint1002/contint2002 connecting to Gerrit to receive events. That one can be idle for more than 30 seconds.

Only the last one is affected; git over SSH is already subject to a 30-second idle timeout from Gerrit.

There are a couple of alternatives I considered yesterday which I apparently have not mentioned:

  1. use split-horizon DNS and have GeoDNS return the IP of the Gerrit hosts to requests from the internal network. But that is not our model, so I guess it can be rejected.
  2. change the Zuul scheduler to make its long-living connection directly to the Gerrit hosts via gerrit.discovery.wmnet.

By having Zuul connect directly to the host, it is exempt from the TCP proxy timeout. I think that can be achieved by changing the driver's upstream server:

/etc/zuul/zuul.conf
  [connection gerrit]
  driver=gerrit
- server=gerrit.wikimedia.org
+ server=gerrit.discovery.wmnet
  user=jenkins-bot
  baseurl=https://gerrit.wikimedia.org/r
  sshkey=/var/lib/zuul/.ssh/id_rsa
  event_delay=5

If I get it right, it will SSH to gerrit.discovery.wmnet. I am not sure what else it affects in the Zuul code; I gotta check. That would save us from fiddling with the TCP proxy timeouts, which can have side effects (such as public clients no longer having a timeout). I'll investigate.

I think directly connecting via the discovery name makes a lot of sense and is the right direction here.

Change #1240738 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/puppet@production] gerrit::sshkey: add discovery.wmnet entry

https://gerrit.wikimedia.org/r/1240738

Change #1240739 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/puppet@production] zuul: allow varying Gerrit settings between merger and server

https://gerrit.wikimedia.org/r/1240739

Change #1240740 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/puppet@production] zuul: change server to SSH to gerrit.discovery.wmnet

https://gerrit.wikimedia.org/r/1240740

Change #1240738 merged by Dzahn:

[operations/puppet@production] gerrit::sshkey: add discovery.wmnet entry

https://gerrit.wikimedia.org/r/1240738

Yup I think so. I have crafted three Puppet patches for that:

  1. [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/1240738 | add gerrit.discovery.wmnet ]] to the list of Gerrit SSH known_hosts, the same way the load balancer entry was added. This can be verified by doing ssh -p 29418 gerrit.discovery.wmnet on contint.wikimedia.org before (prompts for the host key) / after (connects); see the sketch after this list.
  2. allow varying Gerrit settings between merger and server, that is, some Puppet refactoring to allow setting a different Gerrit server domain for the server and the merger. It should be a noop.
  3. change the server to SSH to gerrit.discovery.wmnet, which ends up being a one-liner that is easy to deploy/roll back. But I think it requires Zuul to be restarted.
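
A sketch of the check described in item 1, with BatchMode added so the host-key prompt fails instead of hanging (the flag is my addition; gerrit version is just a convenient no-op command):

# Before https://gerrit.wikimedia.org/r/1240738 this fails on the unknown
# host key; after, it connects and prints the Gerrit version.
ssh -p 29418 -o BatchMode=yes gerrit.discovery.wmnet gerrit version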

;)

Change #1240747 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/puppet@production] gerrit tcp haproxy: rationalize timeouts

https://gerrit.wikimedia.org/r/1240747

Change #1240761 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/puppet@production] gerrit::sshkey: add discovery.wmnet as host alias

https://gerrit.wikimedia.org/r/1240761

Change #1240761 merged by Dzahn:

[operations/puppet@production] gerrit::sshkey: add discovery.wmnet as host alias

https://gerrit.wikimedia.org/r/1240761

Change #1240747 merged by Dzahn:

[operations/puppet@production] gerrit tcp haproxy: rationalize timeouts

https://gerrit.wikimedia.org/r/1240747

Change #1240739 merged by Dzahn:

[operations/puppet@production] zuul: allow varying Gerrit settings between merger and server

https://gerrit.wikimedia.org/r/1240739

Change #1240740 merged by Dzahn:

[operations/puppet@production] zuul: change server to SSH to gerrit.discovery.wmnet

https://gerrit.wikimedia.org/r/1240740

hashar lowered the priority of this task from Unbreak Now! to High. Feb 19 2026, 6:12 PM
hashar added a subscriber: CDanis.

@Dzahn and I have rolled out the change to make the Zuul server connect to gerrit.discovery.wmnet, which makes it connect directly to the Gerrit server instead of passing through the proxies.

The zuul-merger processes running on contint1002 and contint2002 are left connecting to gerrit.wikimedia.org and thus through the load balancer. They are doing clones/fetches, which are shortish SSH connections and are unlikely to ever be idle for more than 30 seconds.

We can later consider moving the zuul-merger to connect directly as well.

@CDanis has another change for tweaking the proxy timeout https://gerrit.wikimedia.org/r/c/operations/puppet/+/1240747

I am lowering the priority to High. I will reverify tomorrow that things are stable, and if so I guess I will finally mark this resolved 🎉

The change to the tcp-proxy timeouts has also been deployed.

But Zuul simply does not go through it anymore. Zuul has been restarted and could connect via SSH from contint to gerrit.discovery.wmnet after the merges above.

So these issues should be gone now. A "recheck" also worked for me.

Turns out Zuul is still being disconnected after 3600 s, which I think is Gerrit terminating the idling stream-events command, since that matches the Gerrit config sshd.idleTimeout=3600.

$ ssh gerrit2003.wikimedia.org grep 'jenkins-bot.*stream' /var/log/gerrit/sshd_log
[2026-02-20T04:11:40.324Z] 99ffd6fe [SshCommandDestroy-0] jenkins-bot a/75 gerrit.stream-events 0ms 36430847ms - 0 - -12754ms -10930ms -1290902864
[2026-02-20T05:11:45.938Z] 251618ef [SshCommandDestroy-0] jenkins-bot a/75 gerrit.stream-events 0ms 3600001ms - 0 - -13923ms -11750ms -1365415112
[2026-02-20T06:11:51.558Z] 2fc9654e [SshCommandDestroy-0] jenkins-bot a/75 gerrit.stream-events 0ms 3600001ms - 0 - -14649ms -12490ms -1403369376
[2026-02-20T07:11:57.174Z] c29465f4 [SshCommandDestroy-0] jenkins-bot a/75 gerrit.stream-events 0ms 3600002ms - 0 - -13900ms -11650ms -1419365560

And I guess that was already the case previously, in which case I will mark this resolved and file another one.
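
For reference, the matching Gerrit setting would look like this in gerrit.config (the file path and exact value format are assumptions; the 3600 matches both hashar's comment above and the exec times in the log):

# /var/lib/gerrit/etc/gerrit.config (path assumed)
[sshd]
  idleTimeout = 3600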

It took me a while to find the historical data:

  • the old gerrit1003 got reimaged so we don't have the raw log files anymore
  • the OpenSearch dashboard does not properly account for the different kinds of logs that are ingested (error, sshd and replication), which is T417753.
  • my brain still does not understand how to properly do freeform query searches in OpenSearch

Eventually, on the Gerrit dashboard, I went with the Lucene search query:

message:gerrit.stream-events AND user.name:jenkins-bot AND process.thread.name:SshCommandDestroy*

with messages since Feb 10th that gives:

gerrit_stream_events_jenkins_bot_sshCommandDestroy.png (225×579 px, 15 KB)

That shows the number of times the jenkins-bot account (used by the Zuul server) had its gerrit stream-events SSH session terminated by Gerrit.

After we moved Zuul to connect directly to the host (rather than through the proxy) by pointing it at gerrit.discovery.wmnet, we can see there are almost no disconnections left, though there are still some (at 3600 s; see my previous message).

What happened before? Well, it got disconnected as well. Here is a view for December 28 to January 2nd, which had less activity due to the new year holidays:

gerrit_stream_events_jenkins_bot_sshCommandDestroy_week_of_dec_jan.png (225×579 px, 13 KB)

Or over the last weekend of January (Friday, Jan 30 22:00 UTC):

And labels.exec_time shows that Gerrit did disconnect the stream-events sessions after its 3600-second idle timeout:

stream_events_exec_time.png (401×763 px, 66 KB)

hashar claimed this task.

Summary

The introduction of the TCP proxy caused the Zuul → Gerrit connection to be terminated after 30 seconds. Zuul uses gerrit stream-events, which does not emit any packets when there is no activity in Gerrit. The connection would get terminated, and Zuul missed events for the 5+ seconds it takes to reconnect.

We have changed Zuul to connect directly to the Gerrit servers, bypassing the TCP proxy, by using gerrit.discovery.wmnet in the Zuul config.

As for this task, it is resolved. I have confirmed Zuul keeps the connection open (up to 3600 s), restoring the behavior from before the switch. Thank you:

  • @CDanis for the HAProxy config tweaking and the exchanges about timeouts

  • @Dzahn for the on-the-fly switch to gerrit.discovery.wmnet yesterday!

The session is still terminated after 3600 seconds by the Gerrit server itself. I have filed T417996 to fix that, tune down the TCP proxy timeout, have bots turn on keepalives, and have the new Zuul use keepalives and connect to gerrit.discovery.wmnet.

Sure it's resolved? Because now it's worse: the tests have all been at 0% for 15 minutes now.

Screenshot_20260221_125349_Samsung Internet.png (532×1 px, 147 KB)

UPD: the jobs started after 18 minutes.

@IKhitron yes it is solved.

stream-events works

Here is the log of Zuul (connected as jenkins-bot) being disconnected from the Gerrit stream of events:

[2026-02-21T02:53:06.075Z] [SshCommandDestroy-0] jenkins-bot a/75 gerrit.stream-events 0ms 69313029ms
[2026-02-21T03:53:11.705Z] [SshCommandDestroy-0] jenkins-bot a/75 gerrit.stream-events 0ms 3600002ms
[2026-02-21T04:53:17.334Z] [SshCommandDestroy-0] jenkins-bot a/75 gerrit.stream-events 0ms 3600001ms
[2026-02-21T05:53:22.968Z] [SshCommandDestroy-0] jenkins-bot a/75 gerrit.stream-events 0ms 3600001ms
[2026-02-21T06:53:28.605Z] [SshCommandDestroy-0] jenkins-bot a/75 gerrit.stream-events 0ms 3600001ms
[2026-02-21T07:53:34.240Z] [SshCommandDestroy-0] jenkins-bot a/75 gerrit.stream-events 0ms 3600002ms
[2026-02-21T08:53:39.840Z] [SshCommandDestroy-0] jenkins-bot a/75 gerrit.stream-events 0ms 3600002ms

It still gets disconnected after idling for 3600 seconds, which was already the case previously and will be addressed via T417996.

Spike of jobs

There was a spike of jobs at 10:38 UTC which took a while to complete; processing finished at 11:30 UTC.

You can see it at the bottom of the Zuul status page:

jobs_spike.png (205×446 px, 22 KB)

It was most probably caused by the chain of several changes ending at https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1235547 . It takes a while to process those chains with the legacy Zuul system, but that will be addressed once we have upgraded.

Network issue?

We apparently had a network issue/flapping between 11:11 UTC and 11:20 UTC: Jenkins kept re-registering its jobs with Zuul. That uses the Gearman protocol, and the advertisement is done with CAN_DO packets. It can be seen on the Gearman protocol dashboard:

Jenkins_CAN_DO.png (498×734 px, 24 KB)

I haven't checked that one; I guess it somehow got overloaded, timed out, and went to re-register its jobs, maybe.
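
For context, here is a minimal Python sketch of what that advertisement looks like with the OpenStack gear library (which legacy Zuul is built on; Jenkins itself uses the Java gearman-plugin). The address and job name are illustrative:

import gear

# A worker connects to the Gearman server embedded in the Zuul scheduler.
worker = gear.Worker('jenkins-example')
worker.addServer('127.0.0.1', 4730)
worker.waitForServer()

# registerFunction() sends one CAN_DO packet per job name, telling the server
# which builds this worker can run. A reconnect re-sends all of them, which is
# the re-registration spike visible on the dashboard.
worker.registerFunction('build:quibble-vendor-mysql-php83-selenium')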

Conclusion

Thus I think your patch ended up being caught in one of those two unusual events, which delayed the start of job execution for your change/patchset.

I see, thanks. In that case, it should not happen again, I hope. If it does, I'll bring it here.

If you witness the issue again, may you please file it as another task in Phabricator? It is unlikely to be the same root cause as this one. When filing it you can subscribe me and add the tag #continuous-integration-infrastructure :] Thanks!

Change #1240243 abandoned by Jelto:

[operations/puppet@production] tcpproxy: add internal gerrit backend with higher timeout

Reason:

not needed anymore, discovery name is used

https://gerrit.wikimedia.org/r/1240243

Change #1239928 abandoned by Jelto:

[operations/puppet@production] tcpproxy: raise connection limit from 200 to 400

Reason:

not needed, the incident was not related to the connection limit

https://gerrit.wikimedia.org/r/1239928