Page MenuHomePhabricator

Fix operations/puppet.git "rebase hell"
Closed, ResolvedPublic

Description

The operations/puppet in Gerrit is configured to require changes to be fast forward in order to be merged by Gerrit. Hence people have to manually rebase a change before Gerrit let it be submitted and ultimately merged. When two or three people are willing to merge their changes, they will race for the rebase and submit. As soon as one change merged, the other changes all have to be rebased: rebase hell.

Most changes in the Gerrit interface ends up being flagged as being in merge conflict.

The reason lies in the repository configuration at https://gerrit.wikimedia.org/r/#/admin/projects/operations/puppet which is:

Submit TypeFast Forward Only
Allow content mergesFalsenote: does not apply to ff type

A while ago we had a discussion about it at T166888#3311916 and I have proposed to change the strategy to Rebase if necessary and allowing content merges. The git history will stay linear since Gerrit will automatically rebase non fast forward changes.

A gotcha is that the test results at the time the patch got uploaded might no more be valid since the target branch might have evolved between.

Consider:

  • change A change a linting rule. Test passes at the time. Change has verified +2
  • the production branch has more code adhering to the old linting rule
  • someone submit change A. Gerrit rebase it and since there is no conflict merge the change. No tests have been run despite the branch receiving changes in between
  • tests no more pass

We can adjust that by adjusting the workflow. When previously one would:

  • rebase
  • wait for test to complete and CI to vote Verified+2
  • vote CR+2 and submit
  • Gerrit merge

We would switch the Gerrit submit type to Rebase if necessary and let CI handles the merge. The workflow then becomes:

  • Code-Review +2
  • CI test the change against the tip of production catching up with all changes that happened between
  • CI vote Verified +2 and submit
  • Gerrit merge

One advantage is that people no more race on each other. Changes can just all be voted Code-Review +2 and CI will serialize them in the Queue and get them merged as tests are passing.

Culpirt: that brings submit right to the CI jenkins-bot and hence to CI operators (which has more members than the Gerrit Administrators group).

Event Timeline

hashar created this task.May 21 2019, 3:43 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMay 21 2019, 3:43 PM

+ @faidon and @Catrope who participated on T166888#3311916 and following.

I'm assuming that in cases in which the rebase fails because of conflicts or the CI fails after the rebase Jenkins would vote -1 and the patch would be out of the merging queue. Correct me if I'm wrong.

@hashar how this would work with the additional step of puppet-merge on the puppetmasters?
We still need to go to the puppetmasters and run puppet-merge and that step is designed so that each +2er goes and merge his own stuff, because knows the consequences of it. If Jenkins is the one queuing them it wouldn't be harder to know what is the real order? People would need to discover it running puppet-merge repeatedly? It seems to me that if we want to go this way we'll need other steps too to make it work.
We have for example and additional check that if multiple commits are found would ask the user for a different confirmation to make sure we don't merge multiple patches by accident.

One of the main resistance in previous discussions was also the potential destructive consequences of unattended rebases.
One patch can change file A and another patch file B, but those are connected via Puppet logic and one change breaks the other. The counter argument is that even today during the manual "merge-conflicts" those might not be spotted or checked.

Volans triaged this task as Normal priority.May 21 2019, 4:54 PM
CDanis added a subscriber: CDanis.May 21 2019, 5:38 PM

@Volans wrote:
I'm assuming that in cases in which the rebase fails because of conflicts or the CI fails after the rebase Jenkins would vote -1 and the patch would be out of the merging queue. Correct me if I'm wrong.

Although they are two entirely different failures, in both case the change will be rejected.

  • rebase fails because of conflicts is caught when CI fetches and checkouts the tip of the branch and then attempt to git merge each of the changes in the queue. If any merge fail (due to a file conflict), the change get rejected and a message is reported back to Gerrit such as "Merge Failed. This change or one of its cross-repo dependencies was unable to be automatically merged with the current state of the repository. Please rebase the change and upload a new patchset" and voted Verified - 1
  • CI fails after the rebase that is done after CI successfully git merge the change. The resulting merge commit is then fetch by CI jobs and tests run. If any job fails, the change is dropped from the queue, jobs are waited to be completed, then the change is reported back in Gerrit and voted Verified -1

@hashar how this would work with the additional step of puppet-merge on the puppetmasters?
We still need to go to the puppetmasters and run puppet-merge and that step is designed so that each +2er goes and merge his own stuff, because knows the consequences of it. If Jenkins is the one queuing them it wouldn't be harder to know what is the real order? People would need to discover it running puppet-merge repeatedly? It seems to me that if we want to go this way we'll need other steps too to make it work.
We have for example and additional check that if multiple commits are found would ask the user for a different confirmation to make sure we don't merge multiple patches by accident.

As the system currently stand, if two humans get their changes manually merged, they will surely rely sync with each other when the additional check notices multiple commits. Or maybe they sync with each other to avoid the rebase hell.

With this proposal, the ordering is managed by CI and depends on the order changes get voted Code-Review +2. In the end when humans currently have to manually rebase, conflict with each other, rebase again, wait for test, and ultimately manually submit, this proposal would have CI handle that automatically behind the hood. So that if ones +2 changes A and B, the CI queue looks like:

(production) < A < B

And eventually if A and B pass tests, the git repository would have: A -> B (production). The additional would then notice the multiple commits and users will still have to sync as to who is going to puppet merge / deploy what. But at least they saved themselves the hassles of the rebase hell.

A gotcha is if A fails in the example above. CI will drop A from the queue, cancel jobs of change B, attempt to merge it against (production) and trigger new jobs. So eventually even if A and B got a Code-Review +2, the git repository might just have B being merged. Probably the author of A would not want to deploy B, but I assume the puppet-merge script displays the commit which is about to be merged.

One of the main resistance in previous discussions was also the potential destructive consequences of unattended rebases.
One patch can change file A and another patch file B, but those are connected via Puppet logic and one change breaks the other. The counter argument is that even today during the manual "merge-conflicts" those might not be spotted or checked.

I think in most case people just "blindly" hit the Gerrit rebase button, relying on tests to catch up a potential code issue. In the example above with two changes A and B, B is behind in the queue but CI test it as if A already got merged. So that if there is a test covering the logic coupling, the logic fault between A and B will be noticed.

In practice, we use the puppet compiler to build confidence, if the puppet compiler passed and the change can still be merged as a fast forward, we know the end result is valid. By allowing Gerrit to automatically rebase, the puppet compiler test is no more valid since it does not take in account that the branch has moved forward. And indeed the logic fault is not covered. Then I am not sure people actually rerun the puppet compiler when rebasing / submitting the change.

The only way for CI to notice would be to run the puppet compiler when voting Code-Review +2. Which is arguably a bit slower :]

@hashar another question for you. If I have 2 CRs, chained one on top of another and I +2 both of them because I want to deploy them together, and the first one fails but the second one passes, would the second one be rebased on top of production branch and merged despite the fact that its parent was not merged?
If this is the case this would be a blocker IMHO.

A few thoughts:

  • None of our CI on ops/puppet provides very strong guarantees of correctness regardless of what we do here. It's possible for the automatic CI to succeed and for puppet-compiler to succeed on every possible host, and yet after merging the actual agent runs on the hosts fail due to an issue not caught by either one.
  • Submission timing being commonly driven by CI V+2 could be annoying in the same cases that rebase hell is? Timing puppet-merges, timing other manual work related to the changes, not understanding clearly what order several changes are applying in and who's running the puppet-merges for them, etc. Perhaps it's no worse that rebase hell in the common timing-conflict cases, I don't know. But when 3 people click C+2 in the same short time window, what happens with waiting on the V+2 + Submits and puppet-merge execution by whom?
  • If submission is actually gated by CI (only happens when jenkins gives V+2 and no ability to manually override), that can be really problematic, as sometimes we do in tricky situations rely on fast manual C/V+2->Submit (for expediency in emergencies and/or fixing things that are breaking CI itself).
BBlack added a comment.EditedMay 23 2019, 3:00 PM

One more:

  • Automatic merges are not fool-proof in general. This is clear in the separate-file case (e.g. changing two separate manifests in two seemingly-unrelated modules, which is perfectly clean for rebase, but might break due to deep puppet-level interactions between the two modules). But it's also possible in near-miss cases. Sometimes two commits which do touch the same regions of the same file are incompatible in a way that's *very* obvious to a human observer of the situation and requires some manual editing, but the merge algorithm decides it's all fine and moves forward mostly-silently. I think we already have this problem with blind clicks of the rebase button, but it's food for thought when considering arguments here about better ways forward.

If I have 2 CRs, chained one on top of another and I +2 both of them because I want to deploy them together, and the first one fails but the second one passes, would the second one be rebased on top of production branch and merged despite the fact that its parent was not merged?
If this is the case this would be a blocker IMHO.

Given the first change failed, if the second change has for parent the first change, then:

  • Gerrit would not lot allow the change to be merged since the patchset depends on a patchset that has not been merged.
  • On CI front, when the first change is removed from the queue, the second change get all jobs canceled and enters tries to enter the queue again. The change metadata are fectched from Gerrit, and since the parent change has not been merged and is NOT ahead in the queue, the change can not be merged since its missing dependencies. The change thus does not enter the queue.

Or in short, CI does honor the dependency chain from Gerrit. Its even enhanced to support indicating dependencies across repositories by adding in the commit message: Depends-On: SomeChangeIdHere. So you can have a change to puppet.git to depend on a change to someother repository. CI will refuse to process the change in puppet.git until the dependent change get merged. That is handful when a job runs tests across multiple repositories which we do for MediaWiki There are a few culprits, but that is the general idea.

hashar added a comment.EditedMay 24 2019, 9:21 AM

A few thoughts:

  • None of our CI on ops/puppet provides very strong guarantees of correctness regardless of what we do here. It's possible for the automatic CI to succeed and for puppet-compiler to succeed on every possible host, and yet after merging the actual agent runs on the hosts fail due to an issue not caught by either one.

Indeed, since we really just check whether the catalog looks valid but we do not actually applies it. Execution of commands, apt conflicts are not checked for example. Eventually they can be first deployed to a staging cluster (eg: beta cluster), then progressively deployed to production (stop puppet on affected host, unpool one of the host, enable and run puppet, verify, pool, verify, enable puppet on more hosts etc).

A potential way would be to provision machines using the patch then run acceptance tests against the resulting cluster. Given infinite resources, one could imagine that each patchset would bring up a whole cluster and run every single end to end test we might have (eg a change to manifests/site.pp would also get the mobile apps smoke tests). A middle ground might be to find a way to apply the catalog in Docker containers to ensure they provision properly, or first apply it to a small copy of our cluster (which I think is what the staging project is going to be for).

  • Submission timing being commonly driven by CI V+2 could be annoying in the same cases that rebase hell is? Timing puppet-merges, timing other manual work related to the changes, not understanding clearly what order several changes are applying in and who's running the puppet-merges for them, etc. Perhaps it's no worse that rebase hell in the common timing-conflict cases, I don't know. But when 3 people click C+2 in the same short time window, what happens with waiting on the V+2 + Submits and puppet-merge execution by whom?

Indeed that pushes the annoyance from Gerrit to the puppet-merge host. As you described, the 3 CR+2 would most probably result in all 3 changes to be merged in a very short window of time (since CI processes changes in parallel, if it has enough capacity all changes will be merged almost at the same time in the order they entered the queue / got voted CR+2). Thus there will need some sync up at puppet-merge, but I believe that one is easier to achieve. One can move the branch forward by one change, apply it, then hand off to the second person that move by one change. Or eventually all three peoples agree that all those three changes can be fast forwarded to in a single operation, each person then checking the result of their change.

I think it is worth trying and see how the workflow works for you. I believe it would be an improvement compared to rebase hell, but it might well cause havoc and potentially lead to unexpected deployments if one miss the puppet-merge warning.

  • If submission is actually gated by CI (only happens when jenkins gives V+2 and no ability to manually override), that can be really problematic, as sometimes we do in tricky situations rely on fast manual C/V+2->Submit (for expediency in emergencies and/or fixing things that are breaking CI itself).

In Gerrit, the puppet.git project is owned by SRE members and we have configured Gerrit to allow project owner to vote Verified +2 and submit changes. That is defined in the special repo https://gerrit.wikimedia.org/r/#/admin/projects/All-Projects,access All-Projects.git:

[access "refs/*"]
    owner = group ldap/ops  # also grants submit right
    label-Verified = -1..+2 group Project Owners
    label-Code-Review = -2..+2 group Project Owners

Thus can still remove the Verified -1, vote C/V+2 and press submit. Effectively bypassing CI entirely. We use that on MediaWiki when we have breaking/incompatible changes, although that is rare and is against the general (mediawiki) policy.

For MediaWiki / mediawiki-config, the fastest way to revert is to just git revert on the deployment host and roll the deploy:

ssh deploy1001.eqiad.wmnet
cd /srv/mediawiki-staging
git revert
scap sync "Rollback $(git log -n1 --oneline HEAD^)"

Then catch up in Gerrit by sending the revert commit (git push origin HEAD:refs/for/master and CR+2 it).

Which I guess can also be used when one is already on the host that does puppet-merge. That could even be automated in the script (puppet-merge revert).

  • Automatic merges are not fool-proof in general. This is clear in the separate-file case (e.g. changing two separate manifests in two seemingly-unrelated modules, which is perfectly clean for rebase, but might break due to deep puppet-level interactions between the two modules). But it's also possible in near-miss cases. Sometimes two commits which do touch the same regions of the same file are incompatible in a way that's *very* obvious to a human observer of the situation and requires some manual editing, but the merge algorithm decides it's all fine and moves forward mostly-silently. I think we already have this problem with blind clicks of the rebase button, but it's food for thought when considering arguments here about better ways forward.

The puppet compiler would most probably solves it. At least with the current situation, if one runs the puppet compiler and merge the change, they have a guarantee of the result. If CI gates the changes we are in the same situation as your second example with rebase button: the resulting change has no guarantee to see be correct / pass the puppet compiler.

The only way to solve that would be to have CI to actually run the puppet-compiler. It has some support for that already but that is still triggered manually (check experimental).

Maybe those potential conflicts are concerning enough that we should keep doing manual rebase / verification. At the price of the rebase annoyance. But it might safer in the end.

(I edited my previous comment since I submitted it before I was done with it)

Just a note, I have filled this task since several people talked about it recently. My intuition is that having CI to handle the merges and resolve conflicts would saves a lot of annoyances caused by having to rebase constantly. As shown by @Volans and @BBlack above, that also potentially open paths for mistakes or a resulting code which might cause issues and thus potentially an outage :-(

I honestly do not have a strong opinion either way.

Joe added a comment.May 27 2019, 6:32 AM

I am 100% against having ci handle merges of ops/puppet. Think of the case ci is down and we need puppet for anything.

I am also all for moving to rebase-if-necessary and manual inspection in puppet-merge, I don't think the risks we incur are in any way comparable to the inane amount of time I wasted rebasing and waiting for CI multiple times over the years.

When the repo is busy, it can take up to 10 minutes or more to merge a change, unless you race in front of others by clicking V+2 in gerrit, which kinda defies the whole logic here.

I can see multiple ways to fix this, but I have zero time to dedicate to it and I think we could go with rebase-if-necessary and just be careful.

hashar added a comment.EditedMay 28 2019, 11:31 AM

I am 100% against having ci handle merges of ops/puppet. Think of the case ci is down and we need puppet for anything.

See my third answer to @BBlack above at T224033#5209860. TLDR: drop the verified-1, vote verified+2, submit the change. Done.

I am also all for moving to rebase-if-necessary and manual inspection in puppet-merge, I don't think the risks we incur are in any way comparable to the inane amount of time I wasted rebasing and waiting for CI multiple times over the years.
When the repo is busy, it can take up to 10 minutes or more to merge a change, unless you race in front of others by clicking V+2 in gerrit, which kinda defies the whole logic here.

That is what this task proposal will help on. Since CR+2 ed changes get their CI job run in parallel and would thus end up being merged faster when there are lot of contention (up to CI capacity limits).

hashar added a comment.Jun 6 2019, 6:19 PM

I am assuming you will be able to talk about this during the SRE offsite next week?

Have you had a chance to speak about this "rebase hell" during the SRE offsite?

I think I answered to all the concerns that have been raised so far. Should I then just proceed with the implementation and craft a note about the workflow change?

CDanis added a comment.Aug 8 2019, 2:31 PM

We unfortunately did not discuss this during the SRE summit.

Here is my two lepta:

  • The current situation of ff-only is both not as safe as it seems, and often creates tedium/toil as @Joe describes.
    • Additionally, I think the tedium created here encourages people to blindly mash the rebase button in gerrit, which reduces actual-safety further.
  • So we might as well save ourselves some tedium/toil, and stop pretending that safety exists in the status quo, when really it doesn't.
  • For patches that someone deemed extra-risky or extra-invasive, they could still manually do the procedure of rebasing by hand, running tests and puppet compiler, and only C+2'ing when they were confident in the change and when its parent matched tip-of-production.
    • This is even pretty easily scriptable: cmp <(git ls-remote origin refs/heads/production | awk '{print $1}') <(git rev-parse HEAD^) && ssh -p 29418 gerrit.wikimedia.org gerrit review --code-review +2 --submit $(git log -1 | egrep '^\s+Change-Id:' | awk '{print $2}')
      • Yes, there is a possible race condition, but in practice the window should only be a second or two.
  • As long as SRE can manually CR+2 V+2 and submit in an emergency -- which it sounds like would be true in any possible future -- I'm happy.
  • The perfect is the enemy of the good. IMO, let's move forward with rebase-if-necessary.
Joe added a comment.Aug 20 2019, 8:32 AM

Here is my two lepta:

I did miss this comment last week. Thanks for reviving this.

  • The current situation of ff-only is both not as safe as it seems, and often creates tedium/toil as @Joe describes.

If the safety is that CI will run again, I think it's a pretty small guarantee right now. I would really only trust the puppet compiler results as a confirmation everything's fine with the current changes. So I agree.

    • Additionally, I think the tedium created here encourages people to blindly mash the rebase button in gerrit, which reduces actual-safety further.
  • So we might as well save ourselves some tedium/toil, and stop pretending that safety exists in the status quo, when really it doesn't.
  • For patches that someone deemed extra-risky or extra-invasive, they could still manually do the procedure of rebasing by hand, running tests and puppet compiler, and only C+2'ing when they were confident in the change and when its parent matched tip-of-production.
    • This is even pretty easily scriptable: cmp <(git ls-remote origin refs/heads/production | awk '{print $1}') <(git rev-parse HEAD^) && ssh -p 29418 gerrit.wikimedia.org gerrit review --code-review +2 --submit $(git log -1 | egrep '^\s+Change-Id:' | awk '{print $2}')
      • Yes, there is a possible race condition, but in practice the window should only be a second or two.

I think large / risky changes can be scheduled ahead of time or just be taken care of with special care as you describe.

  • As long as SRE can manually CR+2 V+2 and submit in an emergency -- which it sounds like would be true in any possible future -- I'm happy.
  • The perfect is the enemy of the good. IMO, let's move forward with rebase-if-necessary.

Indeed, I agree. I think having to deal with some API breakage (and thus, puppet failure) when two large changes to correlated modules happen and someone was not careful with their rebases is outweighed by the constant pain of having to go through 4 rebases/ci cycles to get a value changed from 4 to 5 in a hiera file.

jbond added a subscriber: jbond.Oct 3 2019, 3:39 PM
herron awarded a token.Oct 3 2019, 3:50 PM
herron added a subscriber: herron.
fgiunchedi added a subscriber: fgiunchedi.

I'm +1 on turning on rebase if necessary and see how things play out, if they don't for some reason it is an easy revert

CDanis added a comment.Oct 9 2019, 5:55 PM

Was this discussed during the Monday meeting? What was the outcome?

IRC says the meeting was mostly consumed by OKR discussion, it may have been talked about a little, nobody remembers any new big blocker being raised.

I'm going to be bold and just switch it today. If carnage ensues, you can all blame me and revert it.

BBlack closed this task as Resolved.Oct 17 2019, 2:45 PM
BBlack claimed this task.

It's switched to Rebase-if-necessary now