
CI is doing nada (Gearman)
Closed, Resolved (Public)

Description

No jobs seem to be executing...

Gearman issue?

Screenshot 2021-12-22 at 01.57.05.png (926×1 px, 276 KB)

Event Timeline

Reedy triaged this task as High priority. (Dec 22 2021, 1:55 AM)

I think Lucas upset it with his stack of nearly 20 patches...

2021-12-22 02:10:07,810 DEBUG zuul.Repo: Resetting repository /srv/zuul/git/mediawiki/core
2021-12-22 02:10:07,810 DEBUG zuul.Repo: Updating repository /srv/zuul/git/mediawiki/core
2021-12-22 02:10:43,364 DEBUG zuul.Repo: Checking out 16bd6eaeb497ba577460775038a530e202ef3b0f
2021-12-22 02:10:46,265 DEBUG zuul.Repo: Merging refs/changes/25/709125/9 with args ['-s', 'resolve', 'FETCH_HEAD']
2021-12-22 02:10:46,749 DEBUG zuul.Merger: Unable to merge {u'oldrev': None, u'newrev': None, u'refspec': u'refs/changes/25/709125/9', u'merge_mode': 2, u'connection_name': u'gerrit', u'number': 709125, u'project': u'mediawiki/core', u'url': u'ssh://jenkins-bot@gerrit.wikimedia.org:29418/mediawiki/core', u'branch': u'master', u'patchset': 9, u'ref': u'Z3ab5d6900dfc41abbc0f61ae9e4138d9'}
Traceback (most recent call last):
  File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/zuul/merger/merger.py", line 277, in _mergeChange
    commit = repo.merge(item['refspec'], 'resolve')
  File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/zuul/merger/merger.py", line 165, in merge
    repo.git.merge(*args)
  File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/git/cmd.py", line 548, in <lambda>
    return lambda *args, **kwargs: self._call_process(name, *args, **kwargs)
  File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/git/cmd.py", line 1014, in _call_process
    return self.execute(call, **exec_kwargs)
  File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/git/cmd.py", line 825, in execute
    raise GitCommandError(command, status, stderr_value, stdout_value)
GitCommandError: Cmd('git') failed due to: exit code(1)
  cmdline: git merge -s resolve FETCH_HEAD
  stdout: 'Trying really trivial in-index merge...
Nope.
Trying simple merge.
Simple merge failed, trying Automatic merge.
Auto-merging RELEASE-NOTES-1.38
Auto-merging resources/Resources.php
Auto-merging resources/lib/foreign-resources.yaml
Auto-merging resources/src/vue/index.js
Automatic merge failed; fix conflicts and then commit the result.'
  stderr: 'error: Merge requires file-level merging
ERROR: content conflict in RELEASE-NOTES-1.38
ERROR: content conflict in resources/src/vue/index.js
ERROR: resources/src/vue/vuex.js: Not handling case 5d009854c54a87df6bd3bbc59700c08140ecbcbc -> f7722c7d8bfe89c5afd617e179c8b5c243bded73 -> 
fatal: merge program failed'
2021-12-22 02:10:46,852 DEBUG zuul.MergeServer: Got merge job: 980aa0b7a87e40a4b5a999308b2a1c53
2021-12-22 02:10:46,853 DEBUG zuul.Merger: Merging for change 709125,9.
2021-12-22 02:10:46,853 DEBUG zuul.Merger: Processing refspec refs/changes/25/709125/9 for project mediawiki/core / master ref Z3ab5d6900dfc41abbc0f61ae9e4138d9
2021-12-22 02:10:48,939 DEBUG zuul.Merger: Unable to find commit for ref master/Z3ab5d6900dfc41abbc0f61ae9e4138d9
2021-12-22 02:10:48,939 DEBUG zuul.Merger: No base commit found for (u'mediawiki/core', u'master')
2021-12-22 02:10:48,939 DEBUG zuul.Repo: Resetting repository /srv/zuul/git/mediawiki/core
2021-12-22 02:10:48,940 DEBUG zuul.Repo: Updating repository /srv/zuul/git/mediawiki/core
[01:43:24] --> Lucas_WMDE (~Lucas_WMD@user/lucas-wmde/x-3192532) has joined #wikimedia-releng
[01:43:55] <Lucas_WMDE> I’m not sure if anyone’s still up, but I pushed a bunch of Termbox changes (late at night, hoping to avoid disturbing others), and apparently Zuul hasn’t even started running them yet
[01:44:08] <Lucas_WMDE> no builds since Dec 20 at https://integration.wikimedia.org/ci/job/trigger-termbox-pipeline-test/
[01:44:42] <Lucas_WMDE> if they don’t recover by themselves until tomorrow, feel free to just cancel the builds, at this stage I’m not interested in running CI for these changes
[01:44:49] <Lucas_WMDE> I just wanted to have them on Gerrit
[01:45:22] <Lucas_WMDE> (I still have nine more commits locally but I won’t push them for now)
[01:46:14] <-- Lucas_WMDE (~Lucas_WMD@user/lucas-wmde/x-3192532) has quit (Client Quit)
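
For anyone digging through the same log later: a minimal sketch for tallying which changes the merger gave up on, based only on the shape of the `zuul.Merger: Unable to merge {...}` debug lines pasted above. The function name `failed_merges` is made up for illustration and is not part of Zuul.

```python
import re
from collections import Counter

# Field patterns taken from the dict repr that appears in the pasted log line.
NUMBER = re.compile(r"u'number': (\d+)")
PATCHSET = re.compile(r"u'patchset': (\d+)")


def failed_merges(lines):
    """Count 'Unable to merge' entries per (change number, patchset)."""
    counts = Counter()
    for line in lines:
        if "zuul.Merger: Unable to merge" not in line:
            continue
        number = NUMBER.search(line)
        patchset = PATCHSET.search(line)
        if number and patchset:
            counts[(int(number.group(1)), int(patchset.group(1)))] += 1
    return counts
```

Feeding it the lines of /var/log/zuul/merger-debug.log should show whether a handful of conflicting changes account for most of the failed merge attempts.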

I've hit rebase on https://gerrit.wikimedia.org/r/c/mediawiki/core/+/709125/9 because it was showing in conflict...

And (temporarily) removed the Depends-On from https://gerrit.wikimedia.org/r/c/wikibase/termbox/+/744831

We also have this fun:

reedy@contint1001:~$ sudo /usr/sbin/service zuul status
● zuul.service
   Loaded: masked (Reason: Unit zuul.service is masked.)
   Active: inactive (dead)

contint2001.wikimedia.org is apparently the active zuul host. That should maybe be added to https://www.mediawiki.org/wiki/Continuous_integration/Zuul?

Mentioned in SAL (#wikimedia-operations) [2021-12-22T02:38:15Z] <legoktm> restarted zuul on contint2001, was totally stuck. (T298177)

I looked in the error and debug logs and didn't really see anything noteworthy:

error.log
2021-12-21 23:38:01,776 ERROR zuul.MutexHandler: Held mutex mwcore-codehealth-master-non-voting being released because the build that holds it is complete
2021-12-21 23:38:01,780 ERROR zuul.MutexHandler: Mutex can not be released for <QueueItem 0x7fa4d3ec8050 for <Change 0x7fa4f1e52a50 747877,19> in postmerge> which does not hold it
zuul.log
2021-12-22 02:36:44,171 WARNING zuul.Scheduler: Build set <BuildSet item: <QueueItem 0x7fa4e72eae50 for <Change 0x7fa51df5d290 749287,1> in test> #builds: 0 merge state: PENDING> is not current

contint2001.wikimedia.org is apparently the active zuul host.

If you ssh directly to contint.wikimedia.org, you'll end up on the correct host since it's a CNAME to contint2001.
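
If you want to check where the alias points without logging in, something like the sketch below should do. It relies on Python's `socket.gethostbyname_ex`, which returns `(hostname, aliaslist, ipaddrlist)`; how a CNAME shows up in that tuple depends on the platform resolver, so treat the output as a hint. The `describe_host` helper and its `resolver` parameter are made up for illustration.

```python
import socket


def describe_host(name, resolver=socket.gethostbyname_ex):
    """Return (canonical_name, aliases) for a hostname.

    The resolver is injectable so the function can be exercised offline;
    by default it performs a real lookup.
    """
    canonical, aliases, _addresses = resolver(name)
    return canonical, aliases


if __name__ == "__main__":
    # Performs a real DNS lookup when run as a script.
    print(describe_host("contint.wikimedia.org"))
```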

I've stripped most contint1001 references on mw.org and updated Wikitech to mention this alias:
https://wikitech.wikimedia.org/wiki/Contint

I just did a recheck and contint2001 seems to pick it up: I see zuul processing the queue in /var/log/zuul/debug.log and changes merging in /var/log/zuul/merger-debug.log. There's not much running through it at the moment, though.

thcipriani claimed this task.

Ran a recheck on a core patch. This looks resolved.

If zuul stalled out, restarting it is the nuclear option—that is likely to unstick whatever is clogging up the works.

Optimistically calling this one closed. Thanks for jumping in and sorry I wasn't looking at my email in time to save you from the depths of the zuul debug log <3

I've rechecked everything that looked like it got dropped during the restart.

hashar subscribed.

The reason is that so many patches caused a lot of merge requests (roughly 850, based on the Gearman job queue graph). Those take a while to process since we only have two zuul-merger daemons handling them, each running on HDD rather than SSD (iirc).
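
As a rough back-of-the-envelope check, here is a sketch that takes the one merge attempt in the log pasted earlier in this task (repository reset at 02:10:07,810, merge failure at 02:10:46,749) as a stand-in for the cost of a typical merge job. That is an assumption: a single sample that ended in a conflict may not be representative. The helper names are made up for illustration.

```python
from datetime import datetime

# Timestamp format used by the Zuul log lines in this task.
FMT = "%Y-%m-%d %H:%M:%S,%f"


def seconds_between(start, end):
    """Elapsed seconds between two Zuul log timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


def drain_hours(jobs, seconds_per_job, mergers):
    """Hours needed to work through `jobs` if `mergers` process them in parallel."""
    return jobs * seconds_per_job / mergers / 3600.0


per_job = seconds_between("2021-12-22 02:10:07,810", "2021-12-22 02:10:46,749")
print(round(per_job, 3), round(drain_hours(850, per_job, 2), 1))
```

With about 39 seconds per job, 850 jobs and two mergers, that comes out to a bit over four and a half hours.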

The system would certainly have recovered on its own after a few hours. The quick recovery is indeed to hard restart the Zuul scheduler, which empties the event queues and the pending Gearman functions.
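
To see the backlog without the graph, the Gearman admin protocol has a plain-text `status` command. Its reply is one line per function with tab-separated fields (function name, total jobs, running jobs, available workers), ended by a line containing only a dot. Below is a minimal parsing sketch; the helper names are made up, and the function name used in the usage note is a placeholder rather than something read from the live server.

```python
def parse_gearman_status(text):
    """Parse the reply to Gearman's admin `status` command into a dict."""
    stats = {}
    for line in text.splitlines():
        if line == ".":  # end-of-reply marker
            break
        if not line.strip():
            continue
        name, total, running, workers = line.split("\t")
        stats[name] = {
            "total": int(total),
            "running": int(running),
            "workers": int(workers),
        }
    return stats


def waiting(stats):
    """Jobs queued but not yet picked up by a worker, per function."""
    return {name: s["total"] - s["running"] for name, s in stats.items()}
```

For example, a reply of `"merger:merge\t850\t2\t2\n.\n"` (placeholder values) would parse to 848 waiting jobs for that function. The reply itself can be fetched by sending `status` followed by a newline to the Gearman server's port, 4730 by default.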

Thank you for the quick fix!