Page MenuHomePhabricator

Zuul deadlocks if unknown repo has activity in Gerrit
Closed, ResolvedPublic

Description

In the night of Friday to Saturday, Zuul got caught in a death loop that prevented it from processing changes. @JanZerebecki had to restart Zuul entirely.

It occurred again. The exception reported looks like:

2016-03-02 01:20:04,597 ERROR zuul.Scheduler: Exception in run handler:
Traceback (most recent call last):
  File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/scheduler.py", line 836, in run
    while pipeline.manager.processQueue():
  File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/scheduler.py", line 1456, in processQueue
    item, nnfi)
  File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/scheduler.py", line 1414, in _processOneItem
    ready = self.prepareRef(item)
  File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/scheduler.py", line 1321, in prepareRef
    url = self.pipeline.source.getGitUrl(item.change.project)
  File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/trigger/gerrit.py", line 500, in getGitUrl
    url = 'ssh://%s@%s:%s/%s' % (user, server, port, project.name)
AttributeError: 'NoneType' object has no attribute 'name'

It happens when the Zuul scheduler prepare a reference for a project. For some reason the project is missing a name which should really not happen :-}

It magically solved itself with last error being reported at 2016-03-02 08:32:19.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Seems the issue has been triggered by https://gerrit.wikimedia.org/r/#/c/274318/ which is a change against mediawiki/vagrant and with a few Depends-On headers for changes that got earlier:

https://gerrit.wikimedia.org/r/#/c/274312/I1f194cc318c18e7c292c5450ea011811de8df204mediawiki/event-schemas
https://gerrit.wikimedia.org/r/#/c/274313/Ibbf84cdf0c923a0a69a3d68226ffcf1c5c5a3361mediawiki/core

The change got created on 2016-03-02 00:08:59 UTC and has ChangeId I3bbc1e809a40a408c820a43f651c0624cf9675c5

The chain of events from the debug.log. At first event is received and dependencies received (by searching for change ID in other commit as well as looking for the Gerrit built-in dependency tree):

00:09:04,997 INFO zuul.Gerrit: Updating information for 274318,1
00:09:06,382 DEBUG zuul.Gerrit: Running query change:Ibbf84cdf0c923a0a69a3d68226ffcf1c5c5a3361 to find needed changes
00:09:06,554 DEBUG zuul.Gerrit: Running query change:I1f194cc318c18e7c292c5450ea011811de8df204 to find needed changes
00:09:07,414 DEBUG zuul.Gerrit: Getting commit-dependent change 274313,1
00:09:07,415 DEBUG zuul.Gerrit: Getting commit-dependent change 274312,1
00:09:07,415 DEBUG zuul.Gerrit: Running query message:I3bbc1e809a40a408c820a43f651c0624cf9675c5 to find changes needed-by

The trigger event is added and then processed by the scheduler pipelines. The item is properly added to the test pipeline:

00:09:15,753 INFO zuul.Scheduler: Adding mediawiki/vagrant, <Change 0x7f4980f458d0 274318,1> to <Pipeline test>
00:09:15,754 DEBUG zuul.IndependentPipelineManager: Considering adding change <Change 0x7f4980f458d0 274318,1>
00:09:15,757 DEBUG zuul.IndependentPipelineManager: Checking for changes needed by <Change 0x7f4980f458d0 274318,1>: 
00:09:15,757 DEBUG zuul.IndependentPipelineManager:   Change <Change 0x7f4980f458d0 274318,1> needs change <Change 0x7f4983f5b490 274313,1>: 
00:09:15,757 DEBUG zuul.IndependentPipelineManager:   Change <Change 0x7f4983f5b490 274313,1> is needed
00:09:15,758 DEBUG zuul.IndependentPipelineManager:   Change <Change 0x7f4980f458d0 274318,1> needs change <Change 0x2a08390 274312,1>: 
00:09:15,758 DEBUG zuul.IndependentPipelineManager:   Change <Change 0x2a08390 274312,1> is needed
00:09:15,758 DEBUG zuul.IndependentPipelineManager:   Changes [<Change 0x7f4983f5b490 274313,1>, <Change 0x2a08390 274312,1>] must be merged ahead of <Change 0x7f4980f458d0 274318,1>
00:09:15,758 DEBUG zuul.IndependentPipelineManager: Considering adding change <Change 0x7f4983f5b490 274313,1>
00:09:15,758 DEBUG zuul.IndependentPipelineManager: Checking for changes needed by <Change 0x7f4983f5b490 274313,1>: 
00:09:15,758 DEBUG zuul.IndependentPipelineManager:   No changes needed
00:09:15,758 DEBUG zuul.IndependentPipelineManager: Adding change <Change 0x7f4983f5b490 274313,1> to queue <ChangeQueue test: mediawiki/vagrant>
00:09:15,758 DEBUG zuul.IndependentPipelineManager: Considering adding change <Change 0x2a08390 274312,1>
00:09:15,758 DEBUG zuul.IndependentPipelineManager: Checking for changes needed by <Change 0x2a08390 274312,1>: 
00:09:15,758 DEBUG zuul.IndependentPipelineManager:   No changes needed
00:09:15,758 DEBUG zuul.IndependentPipelineManager: Adding change <Change 0x2a08390 274312,1> to queue <ChangeQueue test: mediawiki/vagrant>

Then the scheduler report its pipeline status to statsd. To forge the metric names, it replaces / in Gerrit project names with dots .. Somehow one change (no clue which one) is missing the project name:

00:09:15,759 ERROR zuul.IndependentPipelineManager: Exception reporting pipeline stats
Traceback (most recent call last):
  File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/scheduler.py", line 1767, in reportStats
    project_name = item.change.project.name.replace('/', '.')
AttributeError: 'NoneType' object has no attribute 'name'

The vagrant change is being added to the test pipeline:

00:09:15,849 DEBUG zuul.IndependentPipelineManager: Adding change <Change 0x7f4980f458d0 274318,1> to queue <ChangeQueue test: mediawiki/vagrant>

When preparing the Zuul refs (a merge commit of patchset on tip of target branch) for the two dependent changes:

00:09:15,850 DEBUG zuul.IndependentPipelineManager: Preparing ref for: <Change 0x7f4983f5b490 274313,1>
00:09:15,850 DEBUG zuul.MergeClient: Submitting job <gear.Job 0x7f4981164d10 handle: None name: merger:merge unique: 596c2a9451e84083a4307c4fe004188a> with data
{'items': [
  {'oldrev': None, 'newrev': None, 'refspec': u'refs/changes/13/274313/1',
   'merge_mode': 2, 'number': u'274313',
   'project': 'mediawiki/core',
   'url': 'ssh://jenkins-bot@ytterbium.wikimedia.org:29418/mediawiki/core',
   'branch': u'master',
   'patchset': u'1', 'ref': 'Z628016885c3843928c549ae5b84e330b'}]}

The dependent change 274313 definitely has a project. It is then time for the other dependent change:

00:09:15,852 DEBUG zuul.IndependentPipelineManager: Preparing ref for: <Change 0x2a08390 274312,1>
00:09:15,852 INFO zuul.IndependentPipelineManager: Change <Change 0x2a08390 274312,1> depends on changes [<Change 0x7f4983f5b490 274313,1>]
00:09:15,852 ERROR zuul.Scheduler: Exception in run handler:
Traceback (most recent call last):
  File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/scheduler.py", line 836, in run
    while pipeline.manager.processQueue():
  File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/scheduler.py", line 1456, in processQueue
    item, nnfi)
  File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/scheduler.py", line 1414, in _processOneItem
    ready = self.prepareRef(item)
  File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/scheduler.py", line 1315, in prepareRef
    merger_items = map(self._makeMergerItem, all_items)
  File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/scheduler.py", line 1288, in _makeMergerItem
    return dict(project=item.change.project.name,
AttributeError: 'NoneType' object has no attribute 'name'

So somehow when Zuul got the informations for 274312,1 it failed to acquire the project.

From there whenever the scheduler run handler process the changes it knows about, it keeps erroring out because of that change.


The mediawiki/core change 274313 got processed properly and had tests run for it:

23:29:06,298 INFO zuul.Gerrit: Updating information for 274313,1
23:29:16,096 DEBUG zuul.IndependentPipelineManager: Adding change <Change 0x7f4983f5b490 274313,1> to queue <ChangeQueue test: mediawiki>
23:29:16,097 DEBUG zuul.IndependentPipelineManager: Preparing ref for: <Change 0x7f4983f5b490 274313,1>
23:29:16,097 DEBUG zuul.MergeClient: Submitting job <gear.Job 0x7f4981147890 handle: None name: merger:merge unique: 28a45abfe1bd4316ad9a8d49f206e7a0> with data
{'items': [{
  'oldrev': None, 'newrev': None, 'refspec': u'refs/changes/13/274313/1', 'merge_mode': 2,
  'number': u'274313',
  'project': 'mediawiki/core',
  'url': 'ssh://jenkins-bot@ytterbium.wikimedia.org:29418/mediawiki/core',
  'branch': u'master', 'patchset': u'1', 'ref': 'Z3fd08e960c424c24888cd4e58dfb6708'}]}

For the mediawiki/event-schemas change 274312 that is another story:

23:27:29,552 INFO zuul.Gerrit: Updating information for 274312,1
23:27:41,157 DEBUG zuul.Scheduler: Done adding trigger event: <TriggerEvent patchset-created mediawiki/event-schemas master 274312,1>
23:27:41,158 DEBUG zuul.Scheduler: Processing trigger event <TriggerEvent patchset-created mediawiki/event-schemas master 274312,1>

23:27:41,158 DEBUG zuul.Scheduler: Project mediawiki/event-schemas not found

23:27:41,158 DEBUG zuul.IndependentPipelineManager: Starting queue processor: test
23:27:41,158 DEBUG zuul.IndependentPipelineManager: Finished queue processor: test (changed: False)

The project is not found! The message is emitted when processing the event queue. It looks in the layout whether the project is defined and skip the event if not:

project = self.layout.projects.get(event.project_name)
if not project:
    self.log.debug("Project %s not found" % event.project_name)
    return

As a result the change is not enqueued and has not been added to the test pipeline (changed: False above).

So i have no idea why:

  • mediawiki/event-schemas is not found in the layout file
  • the change item ends up missing the project name

The whole issue got fixed by @thcipriani which restarted Zuul:

01:19	<thcipriani>	force restarting zuul because the queue is very stuck https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Restart
01:20:22,265 INFO zuul.Server: Starting scheduler

I did a recheck of https://gerrit.wikimedia.org/r/#/c/274312/1

2016-03-02 10:15:48,293 DEBUG zuul.Scheduler: Processing trigger event <TriggerEvent comment-added mediawiki/event-schemas master 274312,1>
2016-03-02 10:15:48,293 DEBUG zuul.Scheduler: Project mediawiki/event-schemas not found

And indeed that repository is not defined in the Zuul layout. That would cause a corner case issue that cause the change item to not be properly defined.

hashar triaged this task as Unbreak Now! priority.

Change 274356 had a related patch set uploaded (by Hashar):
[mediawiki/event-schemas] Dummy definition

https://gerrit.wikimedia.org/r/274356

Change 274356 merged by jenkins-bot:
[mediawiki/event-schemas] Dummy definition

https://gerrit.wikimedia.org/r/274356

Mentioned in SAL [2016-03-02T10:38:27Z] <hashar> Zuul should no more be caught in death loop due to Depends-On on an event-schemas change. Hole filled with https://gerrit.wikimedia.org/r/#/c/274356/ T128569

Did a recheck on the change having the two dependent changes ( https://gerrit.wikimedia.org/r/#/c/274318/ ). Zuul properly handled the event-schema dependent change 274312

10:27:46,142 DEBUG zuul.IndependentPipelineManager: Preparing ref for: <Change 0x7fbe1e684810 274312,1>
10:27:46,145 DEBUG zuul.MergeClient: Submitting job <gear.Job 0x7fbe1eab4e10 handle: None name: merger:merge unique: e001931672d2405895025b93484a8065> with data
{'items': [
{'oldrev': None, 'newrev': None, 'refspec': u'refs/changes/13/274313/1', 'merge_mode': 2, 'number': u'274313', 'project': 'mediawiki/core', 'url': 'ssh://jenkins-bot@ytterbium.wikimedia.org:29418/mediawiki/core', 'branch': u'master', 'patchset': u'1', 'ref': 'Za0f51fb06fa34ab9a41d61b68b93313b'},
{'oldrev': None, 'newrev': None, 'refspec': u'refs/changes/12/274312/1', 'merge_mode': 2, 
'number': u'274312',                  # <----------
'project': 'mediawiki/event-schemas', # <----------
'url': 'ssh://jenkins-bot@ytterbium.wikimedia.org:29418/mediawiki/event-schemas', 'branch': u'master', 'patchset': u'1', 'ref': 'Z233c94126be94bf88109d3f4448e4310'},
{'oldrev': None, 'newrev': None, 'refspec': u'refs/changes/18/274318/1', 'merge_mode': 2, 'number': u'274318', 'project': 'mediawiki/vagrant', 'url': 'ssh://jenkins-bot@ytterbium.wikimedia.org:29418/mediawiki/vagrant', 'branch': u'master', 'patchset': u'1', 'ref': 'Zc225cfdedff4465f9573594cd0ef2a9b'}]}

Tests got run properly as expected.

hashar edited projects, added Essential-Work; removed Patch-For-Review.

The root cause was a change depending on a change made on a repository not known to Zuul. It causes the dependent change to not be fully known by Zuul and namely missing the project => death loop.

The proper fix is to have all repositories defined in Zuul layout.

Zuul assumes the project is available on a change and thus does not deal with error condition which should really not happen ™.

A similar issue occurred on Saturday, I have not investigated it but the stracktraces were identical. It was mentioned by @JanZerebecki during the CI meeting. I quickly looked at it and dismissed investigation (was late, assumed it is a weird corner case that would not happen again).

Mentioned in SAL [2016-03-02T10:42:39Z] <hashar> Zuul should no more be caught in death loop due to Depends-On on an event-schemas change. Hole filled with https://gerrit.wikimedia.org/r/#/c/274356/ T128569

Krinkle renamed this task from Zuul get caught in an error loop preventing it from processing changes to Zuul deadlocks if unknown repo has activity in Gerrit.Mar 14 2016, 5:01 PM
Krinkle reopened this task as Open.
Krinkle subscribed.

This was not resolved. It was hacked around.

Happening again:

2016-03-14 17:00:28,100 ERROR zuul.Scheduler: Exception in run handler:
Traceback (most recent call last):
  File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/scheduler.py", line 836, in run
    while pipeline.manager.processQueue():
  File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/scheduler.py", line 1456, in processQueue
    item, nnfi)
  File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/scheduler.py", line 1414, in _processOneItem
    ready = self.prepareRef(item)
  File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/scheduler.py", line 1321, in prepareRef
    url = self.pipeline.source.getGitUrl(item.change.project)
  File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/trigger/gerrit.py", line 500, in getGitUrl
    url = 'ssh://%s@%s:%s/%s' % (user, server, port, project.name)
AttributeError: 'NoneType' object has no attribute 'name'

Presumably triggered by https://gerrit.wikimedia.org/r/#/c/277264/.

Mentioned in SAL [2016-03-14T17:02:59Z] <Krinkle> Doing full Zuul restart due to deadlock (T128569)

Another crash happended because of an unkown project which I have added here https://gerrit.wikimedia.org/r/277268

A bug needs to be filed upstream with high priority since this will affect everyone.

@hashar Zuul didn't used to crash when Gerrit receives a commit to a repo not known to Zuul. This used to work fine. Can you confirm this is an upstream regression? Maybe it is already fixed in latest upstream?

Looks like we either have to wait for them to release there next version which may be a while or we can update straight of the master branch. OpenStack uses it of the master branch so looks stable plus fixes the crash issue I hope.

@hashar Zuul didn't used to crash when Gerrit receives a commit to a repo not known to Zuul. This used to work fine. Can you confirm this is an upstream regression? Maybe it is already fixed in latest upstream?

Zuul works fine when it receives from Gerrit an event for a repository not known to Zuul. But when there is a Depends-On set that points to a repo not know to Zuul it dies. So it is not a regression but a bug in the cross repository dependency management.

@Paladox wrote:

I think its fixed upstream here https://github.com/openstack-infra/zuul/commit/0deaaadac7143692961b9d28abee8cea570ff3ce
Looks like we either have to wait for them to release there next version which may be a while or we can update straight of the master branch. OpenStack uses it of the master branch so looks stable plus fixes the crash issue I hope.

Yup that upstream change looks correct at first glance. I will review it and it is not too far from our current version. Once happy, I will bump our Zuul version or cherry pick that patch then upgrade our Zuul. Good finding!

We may want to include all changes in the master branch because I think it includes a fix for something performance wise with gerrit. Maybe it will fix the issue we have whereby it takes more then 5sec for a change to be recognised and the builds start.

@hashar and @Krinkle ive found an another fix that actually does fix it. https://github.com/openstack-infra/zuul/commit/07cc33c85c2979c9584d8473cb0a3d7934f146dd

It was fix a week after 2.1.0 update was published so not long. But we should still go up to the commit I linked to above to make sure it does not happened.

Mentioned in SAL [2016-04-26T20:07:31Z] <hashar> Killing Zuul entirely due to deadlock T128569

Change 287634 had a related patch set uploaded (by Hashar):
[analytics/wmde/scripts] define noop

https://gerrit.wikimedia.org/r/287634

Change 287634 merged by jenkins-bot:
[analytics/wmde/scripts] define noop

https://gerrit.wikimedia.org/r/287634

Change 291286 had a related patch set uploaded (by Luke081515):
[maps/tilerator/deploy] define noop

https://gerrit.wikimedia.org/r/291286

Change 291286 merged by Thcipriani:
[maps/tilerator/deploy] define noop

https://gerrit.wikimedia.org/r/291286

Change 291289 had a related patch set uploaded (by Paladox):
Fix gerrit stalling because of an unkown project

https://gerrit.wikimedia.org/r/291289

Change 291290 had a related patch set uploaded (by Paladox):
Fix gerrit stalling because of an unkown project

https://gerrit.wikimedia.org/r/291290

Change 291291 had a related patch set uploaded (by Paladox):
Fix gerrit stalling because of an unkown project

https://gerrit.wikimedia.org/r/291291

Paladox lowered the priority of this task from Unbreak Now! to High.May 27 2016, 6:06 PM

Changing status to high since ive noticed this issue twice today requiring zuul to be restarted.

Paladox raised the priority of this task from High to Needs Triage.May 27 2016, 6:07 PM

Didn't realise it was on unbreak.

Paladox triaged this task as Unbreak Now! priority.May 27 2016, 6:19 PM

I have rebased our Zuul to include the patch 0deaaadac7143692961b9d28abee8cea570ff3ce and will attempt to upgrade.

Our new version is going to be zuul_2.1.0-95-g66c8e52-wmf1precise1

Change 291730 had a related patch set uploaded (by Hashar):
Test dependency to unknown repository

https://gerrit.wikimedia.org/r/291730

Change 291730 abandoned by Hashar:
Test dependency to unknown repository

https://gerrit.wikimedia.org/r/291730

Fixed by zuul_2.1.0-95-g66c8e52-wmf1precise1

I have archived the Debian packages on rutherfordium.eqiad.wmnet:/home/hashar/public_html/debs/zuul_2.1.0-95-g66c8e52

Or: https://people.wikimedia.org/~hashar/debs/zuul_2.1.0-95-g66c8e52/

Change 291290 abandoned by Hashar:
Fix gerrit stalling because of an unkown project

Reason:
Upstream patches comes either from a bump of the 'upstream' branch or via a quilt package managed in the patch-queue/debian/precise-wikimedia branch.

https://gerrit.wikimedia.org/r/291290

Change 291289 abandoned by Hashar:
Fix gerrit stalling because of an unkown project

https://gerrit.wikimedia.org/r/291289

Change 291291 abandoned by Paladox:
Fix gerrit stalling because of an unkown project

https://gerrit.wikimedia.org/r/291291

hashar added a subscriber: Reedy.

Happened again with Zuul version: 2.1.0-151-g30a433b-wmf4precise1

It was https://gerrit.wikimedia.org/r/#/c/300804

[23:43:56] <legoktm> yes that was the problem
[23:44:03] <legoktm> the vendor repo is not controlled by zuul

/var/log/zuul/error.log has a spam of:

2016-07-25 22:19:30,252 ERROR zuul.Scheduler: Exception in run handler:
Traceback (most recent call last):
  File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/scheduler.py", line 974, in run
    while pipeline.manager.processQueue():
  File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/scheduler.py", line 1610, in processQueue
    item, nnfi)
  File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/scheduler.py", line 1568, in _processOneItem
    ready = self.prepareRef(item)
  File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/scheduler.py", line 1474, in prepareRef
    url = self.pipeline.source.getGitUrl(item.change.project)
  File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/source/gerrit.py", line 354, in getGitUrl
    return self.connection.getGitUrl(project)
  File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/connection/gerrit.py", line 433, in getGitUrl
    project.name)
AttributeError: 'NoneType' object has no attribute 'name'

It is supposedly fixed by rCIZU0deaaadac7143692961b9d28abee8cea570ff3ce which I have deployed end of May :(

hashar lowered the priority of this task from Unbreak Now! to High.Jul 28 2016, 7:06 PM

I got Zuul upgraded on July 27th at 14:00 UTC: Zuul version: 2.1.0-391-gbc58ea3-wmf1precise1

Which is only a couple commits behind tip of master. rCIZU0deaaadac7143692961b9d28abee8cea570ff3ce is definitely included in it.

The trouble is digging in the log and figure out which corner case managed to escape the fix above.

This comment was removed by Paladox.

Change 307060 had a related patch set uploaded (by Hashar):
2.5.0-8-gcbc7f62-wmf1precise1

https://gerrit.wikimedia.org/r/307060

Change 307060 merged by Hashar:
2.5.0-8-gcbc7f62-wmf1precise1

https://gerrit.wikimedia.org/r/307060

I have upgraded Zuul server on gallium with zuul_2.5.0-8-gcbc7f62-wmf1precise1. It includes upstream patch @Paladox has mentioned:

a8b90b3 - Always create foreign projects if needed (4 days ago) <James E. Blair>

The main cause has been fixed by upstream commit: 0deaaadac7143692961b9d28abee8cea570ff3ce

With a follow up upstream to deal with a corner case issue in the changes cache which might still cause some projects to be None in a rare race condition: a8b90b38094587bbd82ffa5e4028aef1dfd02987

We have both deployed on our Zuul server as of zuul_2.5.0-8-gcbc7f62-wmf1precise1.

It must be fixed now. I am resolving this task, the sub task left open is "just" pending for the .deb package to be pushed to apt.wikimedia.org which is not a blocker.