
Re-evaluate use of "Dependent Pipeline" in Zuul for gate-and-submit
Closed, Resolved · Public

Description

We should disable the dependent gate because it is too disruptive:

  • The Zuul queue concept is not working well, because practically all projects end up in the same queue (so any change in any branch of any project is blocked on all other changes anywhere else).
  • It exposes code paths in Zuul that cause deadlocks (T93812).
  • Its implementation has limitations (it is not all that smart; it only catches incompatible changes if they happen to be merged within the same ~10 minute window).
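
For context, the dependent vs. independent behaviour is selected by the pipeline manager in Zuul's layout.yaml. A minimal sketch of what disabling the dependent gate could look like; the trigger and reporting values below are placeholders, not our actual configuration:

```
# Sketch only: trigger and reporting values are placeholders, not our real layout.yaml.
pipelines:
  - name: gate-and-submit
    # DependentPipelineManager queues changes behind each other and tests them
    # speculatively; IndependentPipelineManager tests each change against the
    # current branch tip only.
    manager: IndependentPipelineManager
    trigger:
      gerrit:
        - event: comment-added
          approval:
            - code-review: 2
    success:
      gerrit:
        verified: 2
        submit: true
    failure:
      gerrit:
        verified: -2
```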

Here's a brief example of what a dependent pipeline does:

Imagine two changes, A and B, sitting in code review. One in an upstream-like project, and one downstream. The changes are incompatible. For example, commit A in MediaWiki core removes the deprecated function wfExample(), and commit B in a MediaWiki extension adds new code introducing a call to that same function.

Scenario 1 (Perfect):
Commit A is merged first. Later, a user tries to merge commit B. In the final test ("gate"), commit B will fail (as it should). This is because we locally merge the commit into latest master, and use latest master of upstream. The test fails, commit B merge gets rejected and will need amending.

Scenario 2 (OK):
Commit B is merged first. Later, commit A is merged. The extension is now incompatible with MediaWiki master. Users will not discover that until the next commit. This is normal in software development. In the next commit the maintainers find out and fix it. This is like the amend in Scenario 1, but with separate commits.

Scenario 3 (Perfect):
Users try to merge commit A and B around the same time. Without a Dependent Pipeline, this would follow Scenario 1 or 2 (in parallel). But a Dependent Pipeline can help catch the incompatibility earlier (in this commit, instead of the next). Zuul will speculate that Commit A passes tests and gets merged. When Zuul sends commit B to Jenkins, commit A is still being tested. Yet, it will test commit B with MediaWiki master including commit A. Thus, commit B will fail (as it should).

So it can only help in one of three scenarios. Without a dependent gate pipeline, scenario 3 will be like scenario 2, which is quite normal and probably what most people expect to happen. A dependent gate pipeline is a really cool concept, but not a well-known one.

Before we consider re-enabling it, we should get the following sorted out:

  • The deadlock caused when changes are merged manually should be fixed (T93812).
  • Projects should not all share the same queue (e.g. "mediawiki"). We have to find a way to drastically reduce our queue sizes. I'm thinking maybe we disable automatic queueing completely and do it manually. Remember that this is an optimisation (Scenario 3), not something critical. So whenever in doubt, it is better not to depend (opt-in, not opt-out).
  • ...anything else?

Follows-up T50419.

Event Timeline

Krinkle raised the priority of this task from to High.
Krinkle updated the task description. (Show Details)
Krinkle added subscribers: Krinkle, hashar.
Krinkle renamed this task from Disable "Dependent Pipeline" in Zuul for for gate-and-submit to Disable "Dependent Pipeline" in Zuul for gate-and-submit.Mar 29 2015, 12:17 AM
Krinkle set Security to None.

+1. Since we've been moving things to use generic jobs, nearly all projects are in the mediawiki queue now.

The Zuul dependent pipeline works just fine, but it is based on the convention that projects sharing a common job must depend on each other and thus end up in the same queue.

When the jobs were refactored to no longer include the project name (i.e. 'someproject-npm' became 'npm'), that had the side effect of gathering almost all projects in the same queue.

I was busy when the refactoring happened, and when I realized the impact it was really too late to roll back.

I want us to keep the dependent pipeline but work out another approach to form the list of projects sharing the same queue. Potentially we could hack Zuul to disable the automatic gathering and have the queues defined explicitly, maybe by specifying a queue name in the project definitions of layout.yaml.
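
Something along these lines, as a purely hypothetical sketch (Zuul does not support an explicit queue key out of the box today; the key name and job names below are made up and would depend on whatever patch lands):

```
# Hypothetical sketch: explicit, opt-in queue membership per project.
projects:
  - name: mediawiki/core
    queue-name: mediawiki        # hypothetical key, not current Zuul syntax
    gate-and-submit:
      - mediawiki-phpunit        # placeholder job name

  - name: some/other-project
    # No queue-name: the project gates on its own rather than being pulled
    # into the "mediawiki" queue just because it shares a generic job.
    gate-and-submit:
      - npm
```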

Will comment on Timo's scenarios in the next comment.

Sorry, that is a long wall of text. Hopefully it gives more details as to what has been worked on from last year until February, and I hope to resume that work once CI isolation is done. Might be worth copy-pasting on the wiki.


Let's start our journey in the utopian continuous integration world which ensures code remains compatible.

Imagine two changes, A and B, sitting in code review. One in an upstream-like project, and one downstream. The changes are incompatible. For example, commit A in MediaWiki core removes the deprecated function wfExample(), and commit B in a MediaWiki extension adds new code introducing a call to that same function.

Commit A that removes the deprecated function wfExample() is now breaking all extensions that still rely on it. I worked last year to eventually reach a point where core + extensions run a shared job which ensures that they remain compatible no matter what happens. The parent task is T60772 and there is an RFC (T1350).

That made me reuse the OpenStack workflow, porting it to Python as zuul-cloner, and I adjusted the upstream documentation to explain cross-project dependencies.

The first pass was to create the mediawiki-extensions-{hhvm,zend} jobs, which cover the mobile-related extensions and a few more. Adding more extensions is postponed since I am working on isolated CI (among other things).

Thus, what should really happen is:

A is +2'ed, then B is +2'ed

Change A, which removes wfExample(), is enqueued; it runs the core + all extensions tests and fails because some extensions still rely on the deprecated call.

Change B, which has been enqueued behind it, is tested with Change A applied (wfExample() removed) plus Change B itself, which introduces a new call to the deprecated function. Change B will fail since it is tested with A. Since A failed, Zuul speculates that A might well have been the cause of the failure; B is re-enqueued and tested without A.

Change B is tested and merged because wfExample() is still in master (i.e. Change A has been rejected).

The new states of the repositories are:

mediawiki/core @ master without Change A
extension @ change B

When Change A is proposed again, it has an additional test failure due to Change B introducing yet another call to wfExample().

How do you get A merged? Get rid of all the calls to the deprecated function from the extensions. The changes made to extensions will merge since the function is still around. Once the code has been updated, the function can be removed from mediawiki/core since it is no longer needed. Hence, Change A, when re-proposed, will pass. People would be able to use recheck on Change A to track the migration progress.

Doing so, we retain compatibility between the platform (MediaWiki core) and its consumers (extensions). It means that after each change is merged we have a list of repos @ sha1 that are known to pass tests together, and we will eventually be able to continuously deploy such sets. That is the unicorn we are aiming for.

Scenario 1 (Perfect):
Commit A is merged first. Later, a user tries to merge commit B. In the final test ("gate"), commit B will fail (as it should). This is because we locally merge the commit into latest master, and use latest master of upstream. The test fails, commit B merge gets rejected and will need amending.

That is not perfect. Now mediawiki/core is breaking backward compatibility with extensions, which need to be adjusted / tweaked. The b/c issues should be caught and fixed first; once they have been, we can remove the orphaned code.

Scenario 2 (OK):
Commit B is merged first. Later, commit A is merged. The extension is now incompatible with MediaWiki master. Users will not discover that until the next commit. This is normal in software development. In the next commit the maintainers find out and fix it. This is like the amend in Scenario 1, but with separate commits.

Commit B being merged is fine: it adds another call to a deprecated function, but since the function still exists in mediawiki/core it does not cause any trouble. The change just adds to the tech debt of deprecated calls. Commit A should not merge since it breaks the extension.

We did not bother much about breaking backward compatibility until release 1.18 or 1.19. Someone from VistaPrint posted a very detailed technical report of all the tweaks he had to make to his extensions when upgrading. That prompted us to be much firmer about how we deprecate our functions, and I am sure it is still enforced.

Moreover, since we do not periodically run tests for all extensions, we are unlikely to catch backward compatibility issues until we release a new version of mediawiki/core. They are sometimes spotted when someone sends a patch which fails for a reason unrelated to the change being proposed; people then blame Jenkins. Anyway, not enforcing B/C causes us to solve them all just when we do the release, which adds unnecessary pressure, or ends up with fatals and stack traces when we deploy to production (since some extensions haven't had tests run with the core being deployed).

By catching b/c issues when changes are proposed, we give ourselves more time to fix them and ship a valid set of products.

Breaking code should be the exception, not the norm.

Scenario 3 (Perfect):
Users try to merge commit A and B around the same time. Without a Dependent Pipeline, this would follow Scenario 1 or 2 (in parallel). But a Dependent Pipeline can help catch the incompatibility earlier (in this commit, instead of the next). Zuul will speculate that Commit A passes tests and gets merged. When Zuul sends commit B to Jenkins, commit A is still being tested. Yet, it will test commit B with MediaWiki master including commit A. Thus, commit B will fail (as it should).

As I described above, A should not merge. But let's assume there is no shared job and the extension tests are not run. With an independent pipeline, change B will merge as well, because it would have been tested against whatever state of the repo did not yet include A.

The end result is that the extension's code is broken and we have no confidence that the tips of our branches work together. When another change is sent to the extension, it mysteriously breaks for a reason unrelated to the change being proposed. That is very confusing and is more or less what is happening to extensions that are not yet in the shared job (mediawiki-extensions-{hhvm,zend}). People blame Jenkins or rage, wondering why they have to work on keeping it up to date when they would prefer working on their code.

Krinkle renamed this task from Disable "Dependent Pipeline" in Zuul for gate-and-submit to Re-evaluate use of "Dependent Pipeline" in Zuul for gate-and-submit in the short term.Apr 3 2015, 12:23 AM

Upstream has foreseen the same problem. They're working on a way to configure queues manually instead of the current automatic queuing inferred from common jobs (which doesn't scale).

http://lists.openstack.org/pipermail/openstack-infra/2015-February/002471.html

@hashar: What you described is an ideal situation, but the reality is that any project that uses a general job like phplint or tox-flake8 is dragged into the "mediawiki" queue, slowing that project down for no reason. Given that upstream is aware of this, I think it makes sense to go back to the independent queue and re-evaluate using a dependent queue once upstream has fixed the problems in the current implementation.

Change 202958 had a related patch set uploaded (by Legoktm):
Make gate-and-submit an independent pipeline

https://gerrit.wikimedia.org/r/202958

Commit A that removes the deprecated function wfExample() is now breaking all extensions that still rely on it.

I don't think we will ever enforce that MediaWiki core cannot break compatibility with an extension. MediaWiki core maintainers are not responsible for third-party extensions hosted in Gerrit. Similarly, if jQuery upstream changes something, we will discover it if/when we update jQuery in our repo for the next release. If the CDB project changes something, we will discover it when we update the package in composer.json. If the OOjs project changes something, etc.

Also remember that we generally never remove a function and update extensions later. That is not how we work. We already have an established practice and convention at Wikimedia that changes are backwards-compatible for 1 version. We usually deprecate first, then give extensions time to update code, then later remove the function. In our example, "Commit A" removes wfExample *after* it has been deprecated and extensions have been given time to update their code. This is a social convention, not the responsibility of CI.

Also, some third parties update their extension only once in a MediaWiki release cycle (e.g. when we have a release candidate or beta). This makes a lot of sense, as the average maintainer does not have time to update their extension every week. Continuous integration is not going to change the resources other people have for maintaining their extensions.

For Wikimedia wikis, we already have:

  1. The mwext-testextension job to test with latest MediaWiki master. This ensures the extension is compatible with MediaWiki.
  2. The mediawiki-extensions-* job. This ensures MediaWiki is compatible with certain extensions. This is a bonus job to catch breakage earlier than the next mwext-testextension run would. Nice!
  3. Beta labs. This catches breakage with extensions that are not part of the mediawiki-extensions-* group.
  4. wikimedia-production phase0 deployment.

This is good stuff and catches lots of compatibility issues. I assume we will continue to add more extensions to mediawiki-extensions-*. But we should be careful about adding too many. I don't think we will add all wmf extensions, because that would make it too slow right now.

We should maybe also make submodules work in wmf branches. That way we run the tests against all extensions only for the wmf branch, instead of on every commit in master. Or maybe only post-merge on master, or nightly from a Jenkins timer.
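
As a rough sketch of the kind of layout.yaml pipelines that could carry those runs off the gate's critical path (the pipeline names, schedule and job below are illustrative assumptions, not a concrete proposal):

```
# Illustrative only: names, schedule and job are made up.
pipelines:
  - name: postmerge
    manager: IndependentPipelineManager
    trigger:
      gerrit:
        - event: change-merged     # run after a change lands, not while it is gating

  - name: nightly
    manager: IndependentPipelineManager
    trigger:
      timer:
        - time: '0 2 * * *'        # run the broad extension matrix once a night

projects:
  - name: mediawiki/core
    nightly:
      - mediawiki-extensions-hhvm  # the broad shared job, run nightly instead of per commit
```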

The problem:

The problem is that our queues are slow and I am unable to explain to others (or myself) why we still have the Dependent Pipeline enabled. The only reason is to catch a compatibility regression if two changes are in the queue at the same time. That is a minor feature! A feature we did not have before last year, and everything was fine. It is a feature that is good, but we should not have it enabled if it does not work well with our current reality. It will work in the future, but it doesn't work right now.

If the Dependent Pipeline is an important quarterly goal, maybe we should allocate time to:

  • fix the deadlocks triggered by manual merges (T93812);
  • find a way to drastically reduce queue sizes (e.g. explicit, opt-in queue configuration).

However it is not my quarterly goal and I don't have time allocated to work on these two/three things. I also don't have time allocated to manually fix deadlocks every day. This has been going on for almost 6 months now. It got worse because we consolidated jobs (that was necessary to survive in other areas), but it was a problem before then as well.

The queue slowness is causing developers problems every day (especially during SWAT). Resolving this issue will cost one person a few minutes of work in the Zuul config, and it will free up many other people's time to work on important projects.

Change 202958 abandoned by Hashar:
Make gate-and-submit an independent pipeline

Reason:
Per my arguments at T94322

Kunal has a patch for Zuul at https://gerrit.wikimedia.org/r/#/c/203290/

https://gerrit.wikimedia.org/r/202958

@hashar: What you described is an ideal situation, but the reality is that any project that uses a general job like phplint or tox-flake8 is dragged into the "mediawiki" queue, slowing that project down for no reason. Given that upstream is aware of this, I think it makes sense to go back to the independent queue and re-evaluate using a dependent queue once upstream has fixed the problems in the current implementation.

That is Zuul's implicit behaviour. When the jobs got unified, that had the side effect of most projects ending up in the same queue. I wish I had noticed the job renaming; I would probably have caught that.

So either we revert to having jobs prefixed with the project name (or some queue name) to split the projects again, or we teach Zuul some magic to let us define the queues explicitly. @Legoktm has a patch to that effect at https://gerrit.wikimedia.org/r/#/c/203290/

hashar claimed this task.

Closing this; we keep the dependent pipeline. To work around projects sharing the same queue, their jobs can be made dedicated, as they used to be.

Is it time to re-evaluate this given that we can now use the Depends-On header to mark cross-repo dependencies?

I'm re-opening this given that the dependent pipeline is still a major source of problems. I regularly observe jobs failing in gate-and-submit (sometimes flakiness, often out of our control like composer failures), causing all test results to be thrown away and run again. mediawiki/core jobs take about ~20 minutes to run, so you can easily be looking at an hour or two to merge a core patch (which in itself is not an issue, the problem is that there's no good reason it takes so long).

The main good feature that this provided was the ability to do cross-repo dependencies, but we now have that with Depends-On. I think we've invested more time in workarounds for this (e.g. gate-and-submit-swat) than the benefits it brings are worth.
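
For reference, a cross-repo dependency is now just a commit message footer. A sketch of how an extension change could declare that it needs a core change to land first; the Change-Id values, bug number, and function name are made up for illustration:

```
Use wfExampleReplacement() instead of the deprecated wfExample()

This extension change must not merge before the core change that
introduces wfExampleReplacement().

Bug: T12345
Depends-On: I0123456789abcdef0123456789abcdef01234567
Change-Id: Ifedcba9876543210fedcba9876543210fedcba98
```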

Legoktm renamed this task from Re-evaluate use of "Dependent Pipeline" in Zuul for gate-and-submit in the short term to Re-evaluate use of "Dependent Pipeline" in Zuul for gate-and-submit.Aug 15 2018, 9:25 PM

I'm with Timo and Kunal here, we should've disabled the dependent pipeline ages ago. It doesn't scale and the cases it catches are minimal in practice versus the disruption it causes.

In theory it's a good idea but timely results are more important IMHO than ensuring every single change is atomically compatible with every other. We have other measures in place--code review, deprecation policies, deployments to beta and phase0--to protect us as well.

We talked about this a bit at the RelEng Monday meeting, and there were a few points that I wanted to talk about here:

  1. It would be good to have some numbers about the frequency with which Zuul re-queues patch sets. More ambitiously, numbers about how often that indicates a real problem vs flakiness (although those numbers may be harder to obtain / less reliable). There may be something about Zuul re-queuing in the Zuul debug log; surfacing that in a raw form would probably help inform this task. There was also some talk about stats for the investigation of resource utilization of the integration machines that @dduvall is doing, so maybe there is some overlap there.
  2. One alternative to running a dependent pipeline for gate-and-submit that was mentioned might be a broad set of tests run post-merge; however, there is a worry that beyond merge there are currently few ways to alert developers about breakage until the train (for MediaWiki and extensions, anyway).
hashar changed the task status from Open to Stalled.May 27 2021, 10:08 AM
Krinkle assigned this task to Antoine.

Mostly solved. Generally speaking, as long as you have a shared gate in which code is properly integrated and tested together with other repositories, the speculative execution feature makes this significantly faster.

Alternative implementations provide green CI guarantees by simply waiting serially, which is naturally much slower.

Given that CI runs on patch submission, and not just on +2, we can generally assume that patches are most likely passing by the time they are +2'ed.

The issues I described were largely to do with deadlocks (which upstream has since fixed) and with fully utilizing available capacity (which remains unsolved, but still naturally beats any form of serial execution!). The other misunderstanding I held was treating speculative execution and the shared gate as separate concepts, when in practice they are one and the same. One doesn't make sense without the other.

Krinkle added a subscriber: Antoine.
Krinkle removed a subscriber: Antoine.