
Investigate Zuul crashing due to BlueSpice REL1_31 backlog
Closed, Declined · Public

Description

On 2020-03-04 (at ~15:12) I restarted Zuul because it was no longer responding to Gerrit events. Why exactly that happened, I'm not sure, but I see a number of errors in the Zuul error.log that I don't normally see, namely a few related to mutexes -- possibly a thread deadlock:

2020-03-02 08:19:45,253 ERROR zuul.MutexHandler: Held mutex mwext-phpunit-coverage-docker-publish being released because the build that holds it is complete
2020-03-02 08:19:45,324 ERROR zuul.MutexHandler: Mutex can not be released for <QueueItem 0x7fae2e26b9d0 for <Change 0x7fae358b5790 575851,1> in postmerge> which does not hold it
2020-03-03 07:18:54,232 ERROR zuul.MutexHandler: Held mutex mwext-phpunit-coverage-docker-publish being released because the build that holds it is complete
2020-03-03 07:18:54,247 ERROR zuul.MutexHandler: Mutex can not be released for <QueueItem 0x7fae35a02e50 for <Change 0x7fae35a02a50 576201,1> in postmerge> which does not hold it

Event Timeline

We have a tool that helps us update I18N and bump version numbers in all BlueSpice repos. This might have caused a huge number of CR+2 votes. Sorry for that. Maybe we should change that tool to commit/push directly without creating a review? How does translatewiki-bot deal with this?

Hrm, on closer inspection it seems that the mutex problem is not that uncommon:

[thcipriani@contint1001 ~]$ grep 'ERROR' /var/log/zuul/error.log.2020-* | cut -d' ' -f 3- | awk '/QueueItem/ {print $1,$2,$3,$4,$5,$6,$7} !/QueueItem/ {print $0}' | sort | uniq -c | sort -rn
    163 ERROR zuul.DependentPipelineManager: Exception while reporting:
     36 ERROR zuul.GerritEventConnector: Exception moving Gerrit event:
     35 ERROR zuul.MutexHandler: Mutex can not be released
     30 ERROR zuul.IndependentPipelineManager: Exception while reporting:
     26 ERROR zuul.Scheduler: Exception reporting runtime stats
     20 ERROR zuul.MutexHandler: Held mutex mwext-phpunit-coverage-docker-publish being released because the build that holds it is complete
     14 ERROR zuul.Scheduler: Exception in run handler:
      8 ERROR zuul.MutexHandler: Held mutex mwext-codehealth-master-non-voting being released because the build that holds it is complete
      4 ERROR zuul.source.Gerrit: Exception looking for ref refs/heads/master
      4 ERROR zuul.MutexHandler: Held mutex mediawiki-fresnel-patch-docker being released because the build that holds it is complete
      2 ERROR zuul.MutexHandler: Held mutex mwcore-codehealth-master-non-voting being released because the build that holds it is complete
      1 ERROR zuul.MutexHandler: Held mutex mediawiki-core-jsduck-docker-publish being released because the build that holds it is complete
      1 ERROR zuul.DependentPipelineManager: Exception while reporting start:

In fact, no errors are unique to today:

[thcipriani@contint1001 ~]$ grep 'ERROR' /var/log/zuul/error.log.2020-03* | cut -d' ' -f 3- | awk '/QueueItem/ {print $1,$2,$3,$4,$5,$6,$7} !/QueueItem/ {print $0}' | sort | uniq -c | sort -rn
      4 ERROR zuul.DependentPipelineManager: Exception while reporting:
      2 ERROR zuul.Scheduler: Exception reporting runtime stats
      2 ERROR zuul.MutexHandler: Mutex can not be released
      2 ERROR zuul.MutexHandler: Held mutex mwext-phpunit-coverage-docker-publish being released because the build that holds it is complete
      2 ERROR zuul.GerritEventConnector: Exception moving Gerrit event:
      1 ERROR zuul.source.Gerrit: Exception looking for ref refs/heads/master
      1 ERROR zuul.IndependentPipelineManager: Exception while reporting:
[thcipriani@contint1001 ~]$ grep 'ERROR' /var/log/zuul/error.log | cut -d' ' -f 3- | awk '/QueueItem/ {print $1,$2,$3,$4,$5,$6,$7} !/QueueItem/ {print $0}' | sort | uniq -c | sort -rn
     25 ERROR zuul.DependentPipelineManager: Exception while reporting:
      1 ERROR zuul.MutexHandler: Mutex can not be released
      1 ERROR zuul.MutexHandler: Held mutex mwext-codehealth-master-non-voting being released because the build that holds it is complete

So it looks like maybe it was just load that caused zuul to fall over.

> We have a tool that helps us update I18N and bump version numbers in all BlueSpice repos. This might have caused a huge number of CR+2 votes. Sorry for that. Maybe we should change that tool to commit/push directly without creating a review? How does translatewiki-bot deal with this?

Ah! That might make sense. l10n-bot changes go through their own pipeline that is pretty lightweight (since we don't need to run unit tests for them).

Pushing directly won't help at all; CI will still be DoS'ed, it'll just be pointlessly DoS'ed.

We can change CI to ignore patches from your tool, but it'd disrupt every existing patch in the test/gate pipelines.

It'd be easiest if you rate limited your tool to maybe one change every 30s or so; is that possible?
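
For illustration, here is a minimal sketch of that kind of client-side throttle, assuming the tool pushes its changes from a Python loop; `changes`, `push_one`, and the 30-second interval are placeholders, not anything that exists in the BlueSpice tooling:

import time

PUSH_INTERVAL = 30  # seconds between pushes, per the suggestion above


def push_throttled(changes, push_one):
    """Push changes one at a time, sleeping between them.

    `changes` is any iterable of pending changes and `push_one` is
    whatever function the tool already uses to push and CR+2 a single
    change; both are hypothetical names used only for this sketch.
    """
    for change in changes:
        push_one(change)
        time.sleep(PUSH_INTERVAL)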

A couple of troubleshooting notes from this event:

  • I didn't see Zuul processing events from Gerrit (using tail -f /var/log/zuul/debug.log on contint1001 and commenting "recheck" on a random patchset of mine); see the sketch after this list for another way to check this
  • I could see connections from zuul to gerrit (on the gerrit side -- ssh -p 29418 gerrit.wikimedia.org -- gerrit show-connections -w)
  • I saw a large number of idle workers in gearman (using zuul-gearman.py workers|cut -b-72 on contint1001)
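
Another (untested) way to check the first point without tailing logs would be to poll Zuul's status page; this sketch assumes the Zuul 2.x status.json layout with top-level trigger_event_queue / result_event_queue objects:

import requests

STATUS_URL = 'https://integration.wikimedia.org/zuul/status.json'


def event_queue_lengths():
    """Return (trigger, result) event queue lengths from Zuul's status page.

    If Zuul is receiving Gerrit events but no longer processing them,
    the trigger event queue length keeps growing instead of draining.
    Key names assume the Zuul 2.x status.json format.
    """
    data = requests.get(STATUS_URL, timeout=10).json()
    return (data['trigger_event_queue']['length'],
            data['result_event_queue']['length'])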

> Pushing directly won't help at all; CI will still be DoS'ed, it'll just be pointlessly DoS'ed.
>
> We can change CI to ignore patches from your tool, but it'd disrupt every existing patch in the test/gate pipelines.
>
> It'd be easiest if you rate limited your tool to maybe one change every 30s or so; is that possible?

+1 to rate limiting the tool if possible

That was caused by a lot of changes being sent and immediately voted Code-Review +2, so that each change ends up triggering the jobs in the test pipeline AND each of them is queued in the gate. There are simply not enough executors on CI to handle that kind of spike load, so the jobs are left waiting for a free slot.

Eventually some of the changes got force merged, which caused Zuul to cancel all jobs in the gate, reorder the changes, and retrigger the jobs, only to have them cancelled again as another change was force merged. Example: https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/BlueSpiceSMWConnector/+/576793/

My guess is that we should remove the submit permission in Gerrit for MediaWiki repositories and reserve it for only a handful of people.

I have a better option:

  1. Create a separate account for the tool that updates I18N and bumps version numbers in all BlueSpice repos
  2. Add a configuration like the one for L10n-bot, so that only mediawiki-i18n-check-docker runs in the gate-and-submit pipeline

In either case, the tool should be rate-limited to at least one change every 10-15 seconds.

We will definitely implement a rate limit in our tool. Can you point me to the configuration that L10n-bot uses, so I can create a similar one for our tool?

> Eventually some of the changes got force merged, which caused Zuul to cancel all jobs in the gate, reorder the changes, and retrigger the jobs, only to have them cancelled again as another change was force merged. Example: https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/BlueSpiceSMWConnector/+/576793/
>
> My guess is that we should remove the submit permission in Gerrit for MediaWiki repositories and reserve it for only a handful of people.

I wasn't aware that force merging has this kind of effect on Zuul. If you remove the submit permission, this would mean we are not able to force-merge anything ourselves, wouldn't it? I'd ask you not to do that.

> We will definitely implement a rate limit in our tool. Can you point me to the configuration that L10n-bot uses, so I can create a similar one for our tool?

I am not sure about l10n-bot, but LibraryUpgrader ( https://www.mediawiki.org/wiki/Libraryupgrader ) definitely has throttling implemented. It looks at how many changes are being processed in the test and gate-and-submit pipelines (since the tool only targets the master branches) and throttles accordingly. Code:

libup/gerrit.py
import time

import requests

session = requests.Session()  # module-level HTTP session used for all requests


def zuul_queue_length(q='gate-and-submit'):
    # ?time is for cache busting, just like jQuery does
    r = session.get('https://integration.wikimedia.org/zuul/status.json?' + str(time.time()))
    r.raise_for_status()

    data = r.json()
    for pipeline in data['pipelines']:
        if pipeline['name'] != q:
            continue
        # Count every patch in this pipeline that has jobs attached
        count = 0
        for change_q in pipeline['change_queues']:
            if change_q['heads']:
                for head in change_q['heads']:
                    for patch in head:
                        if patch['jobs']:
                            count += 1

        return count

    # We never found the gate-and-submit queue?
    return 0


def wait_for_zuul_test_gate(count: int):
    # Block until the combined test + gate-and-submit backlog is at most `count`
    zuul = zuul_queue_length('gate-and-submit') + zuul_queue_length('test')
    while zuul > count:
        print('test+gate-and-submit has %s jobs, waiting...' % zuul)
        time.sleep(10)
        zuul = zuul_queue_length('gate-and-submit') + zuul_queue_length('test')
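
For context, a hypothetical call site: a bot calls the helper before pushing its next change, so it only proceeds once the combined backlog is small enough (the threshold of 5 below is arbitrary, not what LibraryUpgrader actually uses):

# Wait until test + gate-and-submit hold at most 5 active changes,
# then push the next change; the threshold is illustrative only.
wait_for_zuul_test_gate(5)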

>> Eventually some of the changes got force merged, which caused Zuul to cancel all jobs in the gate, reorder the changes, and retrigger the jobs, only to have them cancelled again as another change was force merged. Example: https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/BlueSpiceSMWConnector/+/576793/
>>
>> My guess is that we should remove the submit permission in Gerrit for MediaWiki repositories and reserve it for only a handful of people.
>
> I wasn't aware that force merging has this kind of effect on Zuul. If you remove the submit permission, this would mean we are not able to force-merge anything ourselves, wouldn't it? I'd ask you not to do that.

Force merging should not be needed, though there are a few edge cases for which we never found a proper solution (such as updating mediawiki/vendor.git, which might conflict with mediawiki/core's composer.json), and some rare backward-compatibility issues that are tricky. But that should really, really be the exception rather than the rule.

If you have to routinely force-merge changes, it is an indication that something is broken, either in the code, the test suites, or in the CI config. We should address that!

> If you have to routinely force-merge changes, it is an indication that something is broken, either in the code, the test suites, or in the CI config. We should address that!

Unfortunately we use this quite often, as all CI tests on our custom branch REL1_31_dev break (see also T235807#5935902). Our policy is to force-merge changes when the respective CI tests in master/REL1_31 succeed. Yes, we should address this. Any help from your side is much appreciated. If you want, we could have a hangout about this.

This hasn't recurred, and at the time I had no good lead; I'm just assuming that Zuul got confused due to heavy load at the time.