Page MenuHomePhabricator

Change force merged cause a deadlock in Zuul gate-and-submit pipeline
Closed, DeclinedPublic

Description

We have noticed yesterday a deadlock in Zuul. When a change is in the gate-and-submit but get force merged by someone in Gerrit, Zuul sometime ends up sleeping up to 5 minutes. Meanwhile all changes behind in the queues are obviously blocked.

What is ever more annoying is that Zuul consider the change hasn't entered the repository and thus consider it to be a failure. It thus retrigger the whole queue. So given six patches +2ed and force merged, it takes 6 * 5 = 30 minutes for Zuul to resume to normal operations.

A stacktrace:

Thread: 140156753389312
  File "/usr/lib/python2.7/threading.py", line 524, in __bootstrap
    self.__bootstrap_inner()
  File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
    self.run()
  File "zuul/scheduler.py", line 784, in run
    while pipeline.manager.processQueue():
  File "zuul/scheduler.py", line 1358, in processQueue
    item, nnfi, ready_ahead)
  File "zuul/scheduler.py", line 1330, in _processOneItem
    self.reportItem(item)
  File "zuul/scheduler.py", line 1421, in reportItem
    item.change.branch)
  File "zuul/trigger/gerrit.py", line 224, in isMerged
    if self.waitForRefSha(change.project, ref, change._ref_sha):
  File "zuul/trigger/gerrit.py", line 203, in waitForRefSha
    time.sleep(self.replication_retry_interval)

Example changes: 198885,1 and 198950,1

The questions are: why cant it find the sha1 commit since the change has actually been merged? Maybe it is missing a git remote update.

That also prevent the other queues from being run :(

Event Timeline

hashar raised the priority of this task from to Needs Triage.
hashar updated the task description. (Show Details)
hashar added subscribers: hashar, Legoktm.
hashar set Security to None.

I have looked at the debug log for change 198885,1

2015-03-24 21:30:19,442 DEBUG zuul.Gerrit: Checking if change <Change 0x7f78adba01d0 198885,1> is merged
2015-03-24 21:30:19,923 DEBUG zuul.Gerrit: Waiting for <Change 0x7f78adba01d0 198885,1> to appear in git repo
2015-03-24 21:35:23,663 DEBUG zuul.Gerrit: Change <Change 0x7f78adba01d0 198885,1> did not appear in the git repo

The code is zuul/trigger/gerrit.py class Gerrit, but I have no idea why it can not find the reference in the Gerrit repo since it has been merged. It does retry every 5 seconds apparently.

The method:

def waitForRefSha(self, project, ref, old_sha=''):
    # Wait for the ref to show up in the repo
    start = time.time()
    while time.time() - start < self.replication_timeout:
        sha = self.getRefSha(project.name, ref)
        if old_sha != sha:
            return True
        time.sleep(self.replication_retry_interval)
    return False

And it is invoked by:

def isMerged(self, change, head=None):
...
        ref = 'refs/heads/' + change.branch
        self.log.debug("Waiting for %s to appear in git repo" % (change))
        if self.waitForRefSha(change.project, ref, change._ref_sha):
            self.log.debug("Change %s is in the git repo" %
                           (change))
            return True
        self.log.debug("Change %s did not appear in the git repo" %
                       (change))
        return False

Thus waitForRefSha() is invoked with the tip of the branch (refs/heads/master) and the change reference. It will succeed when they are different which is not the case since the change just got merged.

Seems some parameters are inverted as well. Will have to look at it again when rested.

Krinkle claimed this task.
Krinkle subscribed.

Upstream wontfix.

If people are allowed to self-merge on a project, the project can't use Zuul. The two are mutually exclusive. Zero tolerance.

Is there a link for an upstream ticket?

That was from a discussion on #openstack-infra http://eavesdrop.openstack.org/irclogs/%23openstack-infra/%23openstack-infra.2015-04-14.log


<jeblair> below is the original author of Zuul and one of the principal developers.

<Krinkle> Hm.. I'm seeing some strange behaviour in our Zuul instance. The gate had about a dozen things queued up in a dependent pipeline. Everything was passing. The top one was the slowest so everythign behind it was already 'success' but waiting. Then after the top one succeeded as well and completed, eveything behind it suddenly went from 'sucess' back to 'queued' and started over.
<Krinkle> jeblair: The change in question was merged directly by another user.
<jeblair> Krinkle: zuul expects to be the only thing with permissions to merge (gerrit "submit") changes

<Krinkle> jeblair: Some people bypass Jenkins for backports. It's a bad practice that evolved because Zuul automatically combines alll changes regardless of repo or branch in the same queue due to the simplification of jobs.


Besides the deadlock, Krinkle also mentioned on IRC that for future jobs using that repo from zuul it won't contain the merged change unless something made zuul fetch updates.

Change 301768 had a related patch set uploaded (by Hashar):
WMF: reduce Gerrit replication delay from 300 to 15 secs

https://gerrit.wikimedia.org/r/301768

Change 301768 merged by Hashar:
WMF: reduce Gerrit replication delay from 300 to 15 secs

https://gerrit.wikimedia.org/r/301768

Change 301769 had a related patch set uploaded (by Hashar):
2.1.0-391-gbc58ea3-wmf2precise1

https://gerrit.wikimedia.org/r/301769

Change 301774 had a related patch set uploaded (by Hashar):
2.1.0-391-gbc58ea3-wmf2precise1

https://gerrit.wikimedia.org/r/301774

Mentioned in SAL [2016-07-29T09:38:54Z] <hashar> Upgrading Zuul to get rid of a forced sleep(300) whenever a patch is merged T93812. zuul_2.1.0-391-gbc58ea3-wmf2precise1

Mentioned in SAL [2016-07-29T09:38:57Z] <hashar> Upgrading Zuul to get rid of a forced sleep(300) whenever a patch is merged T93812. zuul_2.1.0-391-gbc58ea3-wmf2precise1

Change 301774 abandoned by Hashar:
2.1.0-391-gbc58ea3-wmf2precise1

Reason:
wrong branch

https://gerrit.wikimedia.org/r/301774

Change 301769 merged by Hashar:
2.1.0-391-gbc58ea3-wmf2precise1

https://gerrit.wikimedia.org/r/301769

Change 301774 restored by Hashar:
2.1.0-391-gbc58ea3-wmf2precise1

https://gerrit.wikimedia.org/r/301774

Change 301774 merged by Hashar:
Merge branch 'debian/precise-wikimedia' into debian/jessie-wikimedia

https://gerrit.wikimedia.org/r/301774