Page MenuHomePhabricator

Jenkins 'test' pipeline should cancel old jobs after new patch sets
Open, MediumPublic

Description

Noticed today that when on https://gerrit.wikimedia.org/r/489443 Patch Set 24 was uploaded a few minutes after Patch Set 23, that the old jobs weren't cancelled.

Afaik Zuul used to properly cancel old jobs. If it no longer that, that would certainly increase load and traffic significantly given that it's fairly common to do a rebase followed by an amendment, and then maybe a commit message inline edit after that (that's three patch set versions, and thus three times the 25-minute whole pipeline being queued up with 5 job executors each).

Screenshot 2019-08-03 at 00.16.40.png (1×1 px, 228 KB)
Screenshot 2019-08-03 at 00.17.34.png (1×990 px, 214 KB)
Running (after 2 minutes)..Still running (after 5 minutes)..

Event Timeline

We would need to reproduce the issue to gather some debug logs then dig into Zuul debug and see what is going on. It should indeed cancel old changes.

Declining since we could not reproduce the issue.

Marking this as open again because I have an example of this.

@Dreamy_Jazz awesome! Everything ends up easier when there is a very recent case since that means we still have the debug logs from Zuul. So here we go.

Patches for 1018767 got sent in a quick succession:

17:54:42 Patchset 18 (which is a rebase)
17:56:08 Patchset 19 (which is a rebase)
17:56:17 Patchset 20 (which is a rebase)
17:57:14 Patchset 21

On the Zuul side (/var/log/zuul/debug.log):

2024-04-15 17:56:14,597 DEBUG zuul.Scheduler: Processing trigger event <TriggerEvent patchset-created mediawiki/extensions/CheckUser master 1018767,19>
2024-04-15 17:56:14,598 DEBUG zuul.IndependentPipelineManager: Change <Change 0x7f46b4375450 1018767,19> is a new version of <Change 0x7f468613d110 1018767,18>, removing <QueueItem 0x7f46658ea690 for <Change 0x7f468613d110 1018767,18> in test>
2024-04-15 17:56:25,365 DEBUG zuul.Scheduler: Processing trigger event <TriggerEvent patchset-created mediawiki/extensions/CheckUser master 1018767,20>
2024-04-15 17:57:19,984 DEBUG zuul.Scheduler: Processing trigger event <TriggerEvent patchset-created mediawiki/extensions/CheckUser master 1018767,21>
2024-04-15 17:57:19,984 DEBUG zuul.IndependentPipelineManager: Change <Change 0x7f46710ec210 1018767,21> is a new version of <Change 0x7f46b4375450 1018767,19>, removing <QueueItem 0x7f46608f5650 for <Change 0x7f46b4375450 1018767,19> in test>

When patchset 20 got sent, that did not cause patchset 19 to be removed for whatever reason (time based race condition?)?

When patchset 21 is sent, there are 19 and 20 in the queue however the Zuul code is only removing ONE item for that change (since there is supposedly never more than one copy).

def process_event_queue(self):
        self.log.debug("Processing trigger event %s" % event)
...
                if event.type == 'patchset-created':
                    pipeline.manager.removeOldVersionsOfChange(change)
def removeOldVersionsOfChange(self, change):
    if not self.pipeline.dequeue_on_new_patchset:
        return
    old_item = self.findOldVersionOfChangeAlreadyInQueue(change)
    if old_item:
        self.log.debug("Change %s is a new version of %s, removing %s" %
                       (change, old_item.change, old_item))
        self.removeItem(old_item)

That removes one item and move on.

def findOldVersionOfChangeAlreadyInQueue(self, change):
    for item in self.pipeline.getAllItems():
        if not item.live:
            continue
        if change.isUpdateOf(item.change):
            return item
    return None

Iterates through all items in the pipeline and return the first one matching. The isUpdateOf() is:

def isUpdateOf(self, other):
    if ((hasattr(other, 'number') and self.number == other.number) and
        (hasattr(other, 'patchset') and
         self.patchset is not None and
         other.patchset is not None and
         int(self.patchset) > int(other.patchset))):
        return True
    return False

I don't have an explanation as to why patchset 19 was not removed.

Another example where the failure for the first patchset occurred after the third had been uploaded for at least 7 or so minutes: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CheckUser/+/1023890/3