On Zuul status page:
Data from that page:
Last reconfigured: Wed Mar 04 2026 12:27:46 GMT+0100 (Central European Standard Time)
Queue lengths: 1393 events, 15 results.
| hashar | Mar 4 2026, 1:27 PM |
| F72504543: stack_dump.txt | Mar 4 2026, 2:24 PM |
| F72504105: gerrit_indexing.png | Mar 4 2026, 2:04 PM |
| F72502651: zuul_status_not_reporting.png | Mar 4 2026, 1:27 PM |
It also seems that CI doesn't pick up new patches, e.g.: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1247957
2026-03-04 12:25:10,813 ERROR zuul.GerritEventConnector: Exception moving Gerrit event:
Traceback (most recent call last):
  File "/srv/deployment/zuul/venv/lib/python2.7/site-packages/zuul/connection/gerrit.py", line 134, in run
    self._handleEvent()
  File "/srv/deployment/zuul/venv/lib/python2.7/site-packages/zuul/connection/gerrit.py", line 122, in _handleEvent
    event.change_number, event.patch_number, refresh=True)
  File "/srv/deployment/zuul/venv/lib/python2.7/site-packages/zuul/source/gerrit.py", line 173, in _getChange
    self._updateChange(change, history)
  File "/srv/deployment/zuul/venv/lib/python2.7/site-packages/zuul/source/gerrit.py", line 245, in _updateChange
    raise exceptions.ChangeNotFound(change.number, change.patchset)
ChangeNotFound: Change 1247774,1 not found

That is Zuul having received an event for that change and not being able to find it in Gerrit. I think it is the Gerrit index lagging behind, even though Zuul has a 5-second delay, which might have been enough for the index to catch up.
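If index lag is indeed the cause, one possible mitigation is to retry the lookup a few times before giving up, instead of relying on a single fixed delay. The following is only a sketch of that idea, not what Zuul does: `get_change_with_retry`, the `query` callable, and the local `ChangeNotFound` class are all hypothetical names made up for illustration.

```python
import time


class ChangeNotFound(Exception):
    """Local stand-in for the exception seen in the traceback (hypothetical)."""


def get_change_with_retry(query, number, patchset,
                          attempts=3, delay=5.0, sleep=time.sleep):
    """Call query(number, patchset) until it returns a change.

    `query` is a hypothetical callable that returns None while the change
    is not yet visible in the index. The wait grows linearly between
    attempts (delay, 2 * delay, ...).
    """
    for attempt in range(1, attempts + 1):
        change = query(number, patchset)
        if change is not None:
            return change
        if attempt < attempts:
            sleep(delay * attempt)
    raise ChangeNotFound("Change %s,%s not found" % (number, patchset))
```

The `sleep` parameter is injectable only so the behaviour can be exercised without actually waiting.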
The debug log does not show much; I will have to investigate.
Mentioned in SAL (#wikimedia-releng) [2026-03-04T13:41:05Z] <hashar> Took a Zuul stack dump on contint1002.wikimedia.org using SIGUSR1 # T419009
It could be related to T418108 (gerrit: move gerrit-replica behind CDN); the DNS change has been reverted.
Mentioned in SAL (#wikimedia-releng) [2026-03-04T13:49:30Z] <hashar> Stopping Zuul # T419009
Mentioned in SAL (#wikimedia-releng) [2026-03-04T13:54:24Z] <hashar> SIGKILL Zuul cause it can't gracefully stop most probably due to being locked attempting to report back to Gerrit # T419009
https://grafana.wikimedia.org/d/Zh_ncGsWk/queues-upstream?orgId=1&from=now-6h&to=now&timezone=utc&var-instance=gerrit.wikimedia.org:443&var-replica=b&refresh=1m&viewPanel=panel-18 did not show any spikes of indexing:
I asked Zuul to stop gracefully (by sending SIGUSR1), but it failed to shut down. I think it was waiting for the 15 or so reports to be sent, which were stalled for some reason.
I disconnected Zuul from Gerrit (gerrit kill connection for the jenkins-bot connections); it eventually reconnected but did not resume the reports.
I then went with systemctl stop zuul. That did kill the embedded gearman but never completed (it was left as a zombie process). I think I have a stack dump of that part.
End result: I sent SIGKILL to the zuul-server. The pending reports got lost, and a couple thousand events as well, so we gotta recheck a few patches.
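For readers unfamiliar with how a stack dump of a running Python daemon can be taken, here is a minimal sketch of the general technique: a signal handler that writes every thread's current stack. This is an illustration, not Zuul's actual handler; `format_thread_stacks` and `install_dump_handler` are hypothetical names, and which signal to bind is a deployment choice.

```python
import signal
import sys
import threading
import traceback


def format_thread_stacks():
    """Return a text dump of the current stack of every live thread."""
    names = {t.ident: t.name for t in threading.enumerate()}
    lines = []
    for thread_id, frame in sys._current_frames().items():
        lines.append("Thread: %s (%s)" % (thread_id, names.get(thread_id, "unknown")))
        lines.extend(entry.rstrip("\n") for entry in traceback.format_stack(frame))
    return "\n".join(lines)


def install_dump_handler(signum, stream=sys.stderr):
    """Write a stack dump to `stream` whenever `signum` is received.

    Must be called from the main thread, as signal.signal requires.
    """
    def handler(received_signum, frame):
        stream.write(format_thread_stacks() + "\n")

    signal.signal(signum, handler)
```

On Python 3, the standard library's `faulthandler.register` offers a similar dump without custom code. Either way, a thread blocked in a socket read shows up with `recv` or `read` as its innermost frame, which is how a stuck reporter thread can be spotted.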
The last review sent by Zuul to Gerrit was:
[2026-03-04T12:01:25.976Z] 0ded5238 [SSH gerrit review --project mediawiki/extensions/PropertySuggester --message Main test build failed. [trimmed] --tag autogenerated:ci-test --verified -1 1247978,1 (jenkins-bot)] jenkins-bot a/75 gerrit.review.--project.mediawiki/extensions/PropertySuggester.--message.Main test build failed.
After that it was only querying for information.
The query behind "ChangeNotFound: Change 1247774,1 not found" is:
[2026-03-04T12:25:00.623Z] 0ded5238 [SSH gerrit query --format json --all-approvals --comments --commit-message --current-patch-set --dependencies --files --patch-sets --submit-records change:1247774 (jenkins-bot)] jenkins-bot a/75 gerrit.query.--format.json.--all-approvals.--comments.--commit-message.--current-patch-set.--dependencies.--files.--patch-sets.--submit-records.change:1247774 3ms 8ms - 0 - 5ms 0ms 1023368
I don't think it is related but who knows really.
I am calling this fixed, with the cause being some issue inside Zuul. Attached is the sole stack dump that was captured:
There is one thread busy reporting back to Gerrit.
  File "/usr/lib/python2.7/threading.py", line 774, in __bootstrap
    self.__bootstrap_inner()
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/srv/deployment/zuul/venv/lib/python2.7/site-packages/zuul/scheduler.py", line 986, in run
    self.process_event_queue()
  File "/srv/deployment/zuul/venv/lib/python2.7/site-packages/zuul/scheduler.py", line 1044, in process_event_queue
    pipeline.manager.addChange(change)
  File "/srv/deployment/zuul/venv/lib/python2.7/site-packages/zuul/scheduler.py", line 1436, in addChange
    self.reportStart(item)
  File "/srv/deployment/zuul/venv/lib/python2.7/site-packages/zuul/scheduler.py", line 1276, in reportStart
    self.pipeline.source, item)
  File "/srv/deployment/zuul/venv/lib/python2.7/site-packages/zuul/scheduler.py", line 1293, in sendReport
    ret = reporter.report(source, self.pipeline, item)
  File "/srv/deployment/zuul/venv/lib/python2.7/site-packages/zuul/reporter/gerrit.py", line 36, in report
    item.change.project.name, 'refs/heads/' + item.change.branch)
  File "/srv/deployment/zuul/venv/lib/python2.7/site-packages/zuul/source/gerrit.py", line 49, in getRefSha
    refs = self.connection.getInfoRefs(project)
  File "/srv/deployment/zuul/venv/lib/python2.7/site-packages/zuul/connection/gerrit.py", line 413, in getInfoRefs
    data = urllib.request.urlopen(url).read()
  File "/usr/lib/python2.7/socket.py", line 355, in read
    data = self._sock.recv(rbufsize)
  File "/usr/lib/python2.7/httplib.py", line 583, in read
    return self._read_chunked(amt)
  File "/usr/lib/python2.7/httplib.py", line 625, in _read_chunked
    line = self.fp.readline(_MAXLINE + 1)
  File "/usr/lib/python2.7/socket.py", line 480, in readline
    data = self._sock.recv(self._rbufsize)
  File "/usr/lib/python2.7/ssl.py", line 786, in recv
    return self.read(buflen)
  File "/usr/lib/python2.7/ssl.py", line 673, in read
    v = self._sslobj.read(len)
That thread is processing an incoming event. The change is added to the queue and the scheduler invokes reportStart, which I think sends a message to Gerrit along the lines of "starting gate-and-submit".
The message uses a reporter, which needs information about the change being reported. That is done by fetching the refs over HTTPS using:
    def getInfoRefs(self, project):
        url = "%s/%s/info/refs?service=git-upload-pack" % (
            self.baseurl, project)
        try:
            data = urllib.request.urlopen(url).read()

My guess is that Zuul was stuck reading incoming packets, or Gerrit was not terminating the connection, or something similar -:-\
That is where I stop debugging :-] Hard restarting Zuul solved it.
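The excerpt of getInfoRefs calls urlopen without a timeout, so unless a global socket default is set elsewhere, a read on a stalled connection can block indefinitely. A possible hardening, shown here only as a sketch in Python 3 and not as a patch against Zuul, is to pass an explicit timeout. `fetch_info_refs`, its `opener` parameter, and the 30-second default are assumptions for illustration.

```python
import socket
import urllib.request


def fetch_info_refs(base_url, project, timeout=30.0,
                    opener=urllib.request.urlopen):
    """Fetch the ref advertisement, giving up if the socket stalls.

    `timeout` bounds each blocking socket operation (connect, each recv),
    not the total duration of the request.
    """
    url = "%s/%s/info/refs?service=git-upload-pack" % (base_url, project)
    try:
        with opener(url, timeout=timeout) as response:
            return response.read()
    except socket.timeout as exc:
        # Surface the stall as an error instead of blocking the caller forever.
        raise RuntimeError(
            "Timed out reading %s after %ss" % (url, timeout)) from exc
```

With a timeout in place, a hung read would raise in the reporting thread rather than hold up the whole event queue, at the cost of that one report failing.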