
Zuul has lost connection to gerrit, freezing all CI in queue-only mode
Closed, Resolved (Public)

Description

On Zuul status page:

zuul_status_not_reporting.png (468×431 px, 62 KB)

Data from that page:

Last reconfigured: Wed Mar 04 2026 12:27:46 GMT+0100 (Central European Standard Time)

Queue lengths: 1393 events, 15 results.

Event Timeline

hashar triaged this task as Unbreak Now! priority. Mar 4 2026, 1:28 PM
hashar added a subscriber: Zabe.
2026-03-04 12:25:10,813 ERROR zuul.GerritEventConnector: Exception moving Gerrit event:
Traceback (most recent call last):
  File "/srv/deployment/zuul/venv/lib/python2.7/site-packages/zuul/connection/gerrit.py", line 134, in run
    self._handleEvent()
  File "/srv/deployment/zuul/venv/lib/python2.7/site-packages/zuul/connection/gerrit.py", line 122, in _handleEvent
    event.change_number, event.patch_number, refresh=True)
  File "/srv/deployment/zuul/venv/lib/python2.7/site-packages/zuul/source/gerrit.py", line 173, in _getChange
    self._updateChange(change, history)
  File "/srv/deployment/zuul/venv/lib/python2.7/site-packages/zuul/source/gerrit.py", line 245, in _updateChange
    raise exceptions.ChangeNotFound(change.number, change.patchset)
ChangeNotFound: Change 1247774,1 not found

Zuul received an event for that change but could not find the change in Gerrit. I think the Gerrit index is lagging behind, even though Zuul has a 5-second delay, which would normally be enough for the index to have caught up.
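If index lag really is the cause, a single fixed delay is fragile. A minimal sketch of retrying the lookup with a growing delay is below. It is not Zuul's code: `fetch_change_with_retry`, the `query_change` callable and the local `ChangeNotFound` class are hypothetical stand-ins, and the 5-second first delay is only taken from the figure mentioned above.

```python
import time


class ChangeNotFound(Exception):
    """Hypothetical local stand-in for Zuul's exceptions.ChangeNotFound."""


def fetch_change_with_retry(query_change, number, patchset,
                            attempts=4, first_delay=5.0, sleep=time.sleep):
    """Query a change, waiting longer before each retry.

    query_change is an assumed callable (number, patchset) -> change data
    that raises ChangeNotFound while the index has not caught up yet.
    """
    delay = first_delay
    for attempt in range(1, attempts + 1):
        sleep(delay)  # give the index time to catch up before querying
        try:
            return query_change(number, patchset)
        except ChangeNotFound:
            if attempt == attempts:
                raise  # still missing after the last attempt: give up
            delay *= 2  # 5s, 10s, 20s, ...
```

With the defaults this waits at most 5 + 10 + 20 + 40 seconds before giving up, so a change that is genuinely missing still surfaces as an error rather than being retried forever.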

The debug log does not show much; I will have to investigate.

Mentioned in SAL (#wikimedia-releng) [2026-03-04T13:41:05Z] <hashar> Took a Zuul stack dump on contint1002.wikimedia.org using SIGUSR1 # T419009

Jdforrester-WMF renamed this task from Zuul status page is frozen / CI frozen? to Zuul has lost connection to gerrit, freezing all CI in queue-only mode. Mar 4 2026, 1:51 PM

Mentioned in SAL (#wikimedia-releng) [2026-03-04T13:54:24Z] <hashar> SIGKILL Zuul cause it can't gracefully stop most probably due to being locked attempting to report back to Gerrit # T419009

hashar lowered the priority of this task from Unbreak Now! to High. Mar 4 2026, 2:04 PM

https://grafana.wikimedia.org/d/Zh_ncGsWk/queues-upstream?orgId=1&from=now-6h&to=now&timezone=utc&var-instance=gerrit.wikimedia.org:443&var-replica=b&refresh=1m&viewPanel=panel-18 did not show any spikes of indexing:

gerrit_indexing.png (401×897 px, 25 KB)

I asked Zuul to stop gracefully (by sending SIGUSR1), but it failed to shut down. I think it was waiting for the 15 or so reports to be sent, and those were stalled for some reason.

I disconnected Zuul from Gerrit (gerrit kill connection for the jenkins-bot connections). It eventually reconnected but did not resume the reports.

I then went with systemctl stop zuul. That did kill the embedded gearman, but the stop never completed (it was left as a zombie process). I think I have a stack dump of that part.

End result: I had to SIGKILL the zuul-server. The pending reports got lost, along with a couple thousand events, so we need to recheck a few patches.
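The escalation above (graceful signal first, SIGKILL only once it is clear the process will not exit) can be sketched as follows. This is only an illustration: `stop_with_deadline` is a hypothetical helper, the deadline value is arbitrary, and it assumes the caller owns the process as a child via `subprocess.Popen`, which is not how Zuul runs under systemd.

```python
import signal
import subprocess


def stop_with_deadline(proc, graceful_signal=signal.SIGUSR1, deadline=300.0):
    """Ask proc to stop gracefully; SIGKILL it if it has not exited in time.

    Returns True if the process exited on its own, False if it was killed.
    """
    proc.send_signal(graceful_signal)
    try:
        proc.wait(timeout=deadline)
        return True
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL: anything still queued in the process is lost
        proc.wait()  # reap the child so it does not linger as a zombie
        return False
```

A `False` return is the signal to go looking for lost work, which here meant rechecking patches.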

The last review sent by Zuul to Gerrit was:

[2026-03-04T12:01:25.976Z] 0ded5238 [SSH gerrit review --project mediawiki/extensions/PropertySuggester --message Main test build failed. [trimmed] --tag autogenerated:ci-test --verified -1 1247978,1 (jenkins-bot)] jenkins-bot a/75 gerrit.review.--project.mediawiki/extensions/PropertySuggester.--message.Main test build failed.

After that, Zuul was only sending queries for information.

The ChangeNotFound: Change 1247774,1 not found query is:

[2026-03-04T12:25:00.623Z] 0ded5238 [SSH gerrit query --format json --all-approvals --comments --commit-message --current-patch-set --dependencies --files --patch-sets --submit-records change:1247774 (jenkins-bot)] jenkins-bot a/75 gerrit.query.--format.json.--all-approvals.--comments.--commit-message.--current-patch-set.--dependencies.--files.--patch-sets.--submit-records.change:1247774 3ms 8ms - 0 - 5ms 0ms 1023368

I don't think it is related but who knows really.

I am claiming this to have been fixed and to be caused by some issue inside Zuul. Attached is the sole stack dump that was captured:

hashar claimed this task.

There is one thread busy reporting back to Gerrit.

Thread: 140279755544320
File "/usr/lib/python2.7/threading.py", line 774, in __bootstrap
  self.__bootstrap_inner()
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
  self.run()
File "/srv/deployment/zuul/venv/lib/python2.7/site-packages/zuul/scheduler.py", line 986, in run
  self.process_event_queue()
File "/srv/deployment/zuul/venv/lib/python2.7/site-packages/zuul/scheduler.py", line 1044, in process_event_queue
  pipeline.manager.addChange(change)
File "/srv/deployment/zuul/venv/lib/python2.7/site-packages/zuul/scheduler.py", line 1436, in addChange
  self.reportStart(item)
File "/srv/deployment/zuul/venv/lib/python2.7/site-packages/zuul/scheduler.py", line 1276, in reportStart
  self.pipeline.source, item)
File "/srv/deployment/zuul/venv/lib/python2.7/site-packages/zuul/scheduler.py", line 1293, in sendReport
  ret = reporter.report(source, self.pipeline, item)
File "/srv/deployment/zuul/venv/lib/python2.7/site-packages/zuul/reporter/gerrit.py", line 36, in report
  item.change.project.name, 'refs/heads/' + item.change.branch)
File "/srv/deployment/zuul/venv/lib/python2.7/site-packages/zuul/source/gerrit.py", line 49, in getRefSha
  refs = self.connection.getInfoRefs(project)
File "/srv/deployment/zuul/venv/lib/python2.7/site-packages/zuul/connection/gerrit.py", line 413, in getInfoRefs
  data = urllib.request.urlopen(url).read()
File "/usr/lib/python2.7/socket.py", line 355, in read
  data = self._sock.recv(rbufsize)
File "/usr/lib/python2.7/httplib.py", line 583, in read
  return self._read_chunked(amt)
File "/usr/lib/python2.7/httplib.py", line 625, in _read_chunked
  line = self.fp.readline(_MAXLINE + 1)
File "/usr/lib/python2.7/socket.py", line 480, in readline
  data = self._sock.recv(self._rbufsize)
File "/usr/lib/python2.7/ssl.py", line 786, in recv
  return self.read(buflen)
File "/usr/lib/python2.7/ssl.py", line 673, in read
  v = self._sslobj.read(len)

That thread is processing an incoming event. The change is added to the queue and the scheduler invokes reportStart, which I think sends a message to Gerrit such as "starting gate-and-submit".
The message goes through a reporter, which needs information about the change being reported. That information is fetched over HTTPS using:

def getInfoRefs(self, project):
    url = "%s/%s/info/refs?service=git-upload-pack" % (
        self.baseurl, project)
    try:
        data = urllib.request.urlopen(url).read()

My guess is that Zuul was stuck reading the response, either because no more packets were coming in or because Gerrit never terminated the response, or something similar -:-\
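That guess fits the excerpt above: `urlopen` is called without a `timeout`, so the read uses the global socket default, which is no timeout at all unless `socket.setdefaulttimeout()` was called elsewhere. A minimal sketch of the same fetch with an explicit timeout is below. It is written for Python 3 rather than the Python 2.7 in the traceback, and the function name `get_info_refs`, its standalone signature and the 30-second default are assumptions, not Zuul's code.

```python
import socket
import urllib.error
import urllib.request


def get_info_refs(base_url, project, timeout=30.0):
    """Fetch the info/refs advertisement, failing instead of blocking forever.

    The timeout applies to each blocking socket operation (connect, recv),
    so a peer that stops sending data raises an error after `timeout` seconds.
    """
    url = "%s/%s/info/refs?service=git-upload-pack" % (base_url, project)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except (urllib.error.URLError, socket.timeout) as err:
        # Both are OSError subclasses; re-raise with the URL for easier debugging.
        raise OSError("Cannot get references from %s: %s" % (url, err))
```

With a timeout, a stalled response would have surfaced as an exception in the scheduler thread instead of blocking it in `ssl.read()`.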

That is where I stop debugging :-] Hard restarting Zuul solved it.