Applying security patches should be robust and also give some useful output
Closed, ResolvedPublic
Actions

Description

When applying the security patches with scap apply-patches --train, it immediately completed leaving me to wonder whether the patches actually got applied. I eventually looked at the available patches and checked the affected repositories to confirm.

It would be nice to print the list that got applied / or already applied.

Beside that, it is a great functionality. It definitely saves a noticeable amount of time. Thanks for that!

Also, apply-patches can fail in various cases, if patches don't all apply cleanly. It should never fail for that, and should be atomic: either all patches have been applied, or none.

Details

	Subject	Repo	Branch	Lines +/-
	feat! rewrite apply-patches, drop list-patches and test-patches	mediawiki/tools/scap	master	+566 -496

Customize query in gerrit

Related Objects
Search...

Status	Assigned	Task
Resolved	dancy	T196515 Automate the Train
Resolved	• LarsWirzenius	T245185 Automate scap patch
Resolved	• LarsWirzenius	T272145 Tag Scap release 1.37.1 and ask serviceops to deploy it to production
Resolved	dancy	T277793 Increase reliability of scap release process
Resolved	• LarsWirzenius	T269153 Applying security patches should be robust and also give some useful output

Event Timeline

hashar created this task.Dec 1 2020, 6:07 PM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptDec 1 2020, 6:07 PM

Would output like the following be acceptable:

Applying 2 security patches.
Applying patch 01-xyzzy: OK
Applying patch-02-blugh: OK
All security patches applied OK.

In other words: list every patch, and be explicit about whether each patch applied OK, and also be explicit at the end that all applied OK.

Is more needed?

• LarsWirzenius added a project: Release-Engineering-Team-TODO (2020-10-01 to 2020-12-31 (Q2)).Dec 2 2020, 1:18 PM

• LarsWirzenius claimed this task.Dec 3 2020, 2:59 PM

The lack of output surprised me and I was left wondering whether the patches actually applied.

The proposal looks good, cConsider using the full file path which carries the MediaWiki version but most importantly the repository name.

hashar added a project: Release-Engineering-Team (Development services).Dec 3 2020, 4:42 PM

• LarsWirzenius moved this task from INBOX to New Work on the Release-Engineering-Team-TODO (2020-10-01 to 2020-12-31 (Q2)) board.Dec 4 2020, 1:17 PM

• LarsWirzenius edited projects, added Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)); removed Release-Engineering-Team-TODO (2020-10-01 to 2020-12-31 (Q2)).Jan 11 2021, 3:49 PM

• LarsWirzenius moved this task from INBOX to New Work on the Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)) board.

• LarsWirzenius renamed this task from Applying security patches should give some output to Applying security patches should be robust and also give some useful output.Jan 15 2021, 12:57 PM

• LarsWirzenius updated the task description. (Show Details)

Noting here that I ran into a situation with a patch that didn't apply cleanly under --no-3way, but did with the traditional instructions:

USERNAME@deploy1001:/srv/mediawiki-staging/php-[VERSION]$ git apply --check --3way /srv/patches/[VERSION]/core/[NUMBER]-T[NUMBER].patch
USERNAME@deploy1001:/srv/mediawiki-staging/php-[VERSION]$ git am --3way /srv/patches/[VERSION]/core/[NUMBER]-T[NUMBER].patch

From memory, I think this is a fairly common scenario.

"apply-patches" should be atomic: either all patches get applied, or none.

Also, it should probably automatically notice if a patch has already been applied and tell the user, who can then decide to delete it.

This might warrant a couple of separate tickets, but I wanted to get this captured while I still had it in scrollback.

Summarizing some IRC discussion from 2021-01-19:

We've traditionally used 3-way merges when dealing with security patches.
scap apply-patches was originally written to use --no-3way but no one quite remembers why.
<thcipriani> I do think that if 3way fails it's going to fail silently (semantic error and not exception error)
<thcipriani> running a nightly test with security patches with --no-3way and alerting security if that fails would probably ...at least answer the question of how often that happens
RelEng shouldn't generally be evaluating semantic content of security patches
scap should at least do a --check --3-way and inform the deployer that patches _would_ apply that way.
<dancy> So have we landed on this?: If git am --no-3way fails for security patches, that is a train blocker.
- General agreement, I think, but noting that this is a policy change.
- We should check automatically and inform folks well in advance of train.

I may have missed some nuance here, and I believe @thcipriani did some further experimentation with git.

brennen moved this task from Backlog to Radar on the User-brennen board.Feb 1 2021, 7:56 PM

• LarsWirzenius mentioned this in T245185: Automate scap patch.Feb 9 2021, 5:38 PM

• LarsWirzenius added a parent task: T245185: Automate scap patch.

I've been thinking about this. Here's where I've got so far. I welcome feedback.

Context: the security team maintains patches for embargoed security problems in MediaWiki and its extensions. Once the security problem is made public, the patch is merged into the relevant repository's master branch, but until then, the patches can't be revealed.

During a weekly train deployment, a "rrain branch" is created, by creating, in Gerrit, a branch named "1.3x.0-wmf.yz" in the git repository for MediaWiki core and each extension in WMF production. All those repositories are cloned to (or refreshed on) the deployment host, and the branch is checked out. This is done by "scap prep".

After that, the security patches are applied, locally, and the change committed, locally. These are copied to application servers. Thus, MediaWiki is deployed.

Scap has a sub-command "apply-patches" that does the application of security patches. However, it's simplistic, and doesn't do a good job. In many circumstances it fails, needlessly. This task is about sorting out a way to do the process robustly, but automated.

Some open questions:

There are many git repositories and patches involved: should we try to be atomic so that either all patches have been applied, or none of them? In other words, if some patches apply successfully, but one or more do not, should we revert the successful changes? Doing that would be reasonably easy (a "git reset" will work), but maybe that's unnecessary?
- I think I'd personally not care about being atomic here, except for applying individual patches: either a patch is applied successfully, and that git repository is changed, or applying it fails, and afterwards the repository and working tree haven't changed at all.
Do we care if /srv/mediawiki-staging on the deployment server is being used by multiple people deploying at the same time? I don't think we care for other Scap operations, so I don't think we need to do that for patch application.

Assuming my biases above are acceptable, I'm proposing that "scap apply-patches" should work like this:

Enumerate all patches in /srv/patches.
Verify that the relevant git repository exists under /srv/mediawiki-staging. Do this before applying any patches.
For each patch, apply it with "git am --3way". If that fails, remember that for reporting later.
At the end, report which patches were applied successfully, and which ones failed, and how they failed. Show the full pathname of each patch file, and the pathname of the directory it was applied, and the output of "git apply" for failures.

I would skip the test step ("git apply --check --3way") that we currently do. It's somewhat inappropriate as it doesn't handle the case of patches building on top of each other.

For "scap test-patches" do the same, except for each repository, make a local clone to a temporary directory, and apply the patch there. This is a more complete test than running "git apply --check", but avoids changing /srv/mediawiki-staging, possibly in a catastrophic manner.

"scap apply-patches" does not currently use --3way, but that has proven to be a problem, so I suggest we start using it again. We might have Scap run "git am" both with and without it and report an error if the results differ, at least for "scap test-patches". Or some other approach: using or not using that option is easy to change; I'm more worried about other aspects in the short term.

• LarsWirzenius added a subscriber: dancy.Mar 9 2021, 2:48 PM

• LarsWirzenius added a subscriber: • mmodell.

I note that I would like to run "scap test-patches" in a cron job so we are alerted about problems with security patches before train Tuesday, but that's not quite the topic of this task.

About being atomic. All security patches must be applied successfully for the train deployment to continue.

For each repository we have a series of patch and they might depend on each other. So probably as soon as one fail, we should skip the rest of the stack.

I think git apply --check was meant as a fast step to avoid writing to disk and just inform there is a conflict. But we can probably leave the working space/index dirty since the security patches are applied during the train deployment and there is no other operations going on for that wmf branch. So yeah I think we can drop that step entirely and directly apply. They are applied during the train preparation step and nothing get deployed until all patches have been applied.

I don't see a need to use a temporary workspace area. The working space can easily be reverted back to a good state by resetting to the original branch (git reset --hard origin/wmf/1.36.0-wmf.33). /srv/mediawiki-staging is already the working copy to "play" with.

What I would expect is an output like:

[APPLIED] applied patch core/fixapy.patch
[FAIL] conflict in core/morefixup.patch
[APPLIED] applied patch extensions/MobileFrontend/0001-T1234-fixup.patch
[APPLIED] applied patch extensions/Wikibase/0001-T1235-fixstuff.patch
[APPLIED] applied patch extensions/Wikibase/0002-T1236-fixmore.patch
[FAIL] conflict in extensions/Flow/0001-T1237-fixup.patch
[SKIP] skipping extensions/Flow/0002-T1238-followupfix.patch

Error: some repositories failed to apply all patches. See repositories working copy for details, fix conflict and export the fixed patch then reapply.

/srv/mediawiki-staging/php-1.36.0-wmf.32/
/srv/mediawiki-staging/php-1.36.0-wmf.32/Flow

Once patch is fixed and exported to /srv/patches one can rerun the command and continue from there:

[OK] already applied patch core/fixapy.patch
[APPLIED] applied core/morefixup.patch
[OK] already applied patch extensions/MobileFrontend/0001-T1234-fixup.patch
[OK] already applied patch extensions/Wikibase/0001-T1235-fixstuff.patch
[OK] already applied patch extensions/Wikibase/0002-T1236-fixmore.patch
[APPLIED] applied extensions/Flow/0001-T1237-fixup.patch
[FAIL] conflict in extensions/Flow/0002-T1238-followupfix.patch

Error: some repositories failed to apply all patches. See repositories working copy for details, fix conflict and export the fixed patch then reapply.

/srv/mediawiki-staging/php-1.36.0-wmf.32/Flow

Fix that last Flow patch, export it run again:

[OK] already applied patch core/fixapy.patch
[OK] already applied core/morefixup.patch
[OK] already applied patch extensions/MobileFrontend/0001-T1234-fixup.patch
[OK] already applied patch extensions/Wikibase/0001-T1235-fixstuff.patch
[OK] already applied patch extensions/Wikibase/0002-T1236-fixmore.patch
[OK] applied extensions/Flow/0001-T1237-fixup.patch
[APPLIED] conflict in extensions/Flow/0002-T1238-followupfix.patch

Success!

Something like that.

For --3way, I guess it is to solve trivial rebase conflict. I am guessing it should be added to scap apply-patches unless the script history mentions a reason for explicitly not using it.

To detect potential conflict early on, we would need a cron script which clone the master branches of repositories having security patches, try to apply the patch and alert. But that sounds out of scope for this task.

brennen mentioned this in T71367: page_recent_contributors leaks revdeleted user names (CVE-2021-31545).Mar 9 2021, 5:49 PM

In T269153#6897343, @hashar wrote:

What I would expect is an output like:

[APPLIED] applied patch core/fixapy.patch
[FAIL] conflict in core/morefixup.patch
[APPLIED] applied patch extensions/MobileFrontend/0001-T1234-fixup.patch
[APPLIED] applied patch extensions/Wikibase/0001-T1235-fixstuff.patch
[APPLIED] applied patch extensions/Wikibase/0002-T1236-fixmore.patch
[FAIL] conflict in extensions/Flow/0001-T1237-fixup.patch
[SKIP] skipping extensions/Flow/0002-T1238-followupfix.patch

Error: some repositories failed to apply all patches. See repositories working copy for details, fix conflict and export the fixed patch then reapply.

/srv/mediawiki-staging/php-1.36.0-wmf.32/
/srv/mediawiki-staging/php-1.36.0-wmf.32/Flow

Once patch is fixed and exported to /srv/patches one can rerun the command and continue from there:

This is pretty much the ideal situation, IMO - it should just keep track of which patches applied and which ones didn't so you can easily see what needs to be fixed. This is what I was going for when I originally built scap patch but I never got that far on it.

I'll add that, ideally, whoever merges a patch to master should remove it from /srv/patches so that we wouldn't often see any problems with patches conflicting with themself. Then we'd only rarely have conflicts from code changes in master that actually conflict with an active security patch.

brennen mentioned this in T274938: 1.36.0-wmf.34 deployment blockers.Mar 9 2021, 6:45 PM

Thank you for your feedback, Antoine and Mukunda. I think we're getting to a consensus, which I understand to be as follows:

We don't need "scap apply-patches" to be atomic. The train conductor will need to run it repeatedly, anyway, if there's any problem, so there is no need for it to be atomic.

If a patch fails, we should skip any later patches for the same repository. When the train conductor re-runs "scap apply-patches" after fixing what ever problem there was, the later patches will be applied, if they can be.

"scap apply-patches" should be very clear about which patch has been applied, was already applied, failed, or was skipped due to a previous patch failing. It should also be clear in which directories patches have failed, so that it's easier for the train conductor to figure out what to do. This will make the life of the train conductor that much easier.

Apply patches with "git am --3way" directly in the correct directory under /srv/patches. If it fails, leave that directory in a dirty state: don't run "git am --abort" in that directory. The --3way option is useful for making "git am" idempotent: i.e., not care if the patch had already been applied; this saves us from keeping fragile state across run.

Change "scap sync-world" and "scap sync-file" to check that no directory being synced is in a dirty state.

Antoine:

About being atomic. All security patches must be applied successfully for the train deployment to continue.

Yep, but that doesn't mean one run of "scap apply-patches" needs to be atomic. However, from the stuff you said later, I think we agree that the atomic operations is not needed.

Antoine:

For each repository we have a series of patch and they might depend on each other. So probably as soon as one fail, we should skip the rest of the stack.

Hmm. If there are patches 1, 2, and 3 for a repository, such that patch 2 depends on patch 1, but patch 3 applies with or without 1 and 2, in principle we could apply 3 even if 1 fails. The fact that 3 comes later may just be a co-incidence of timing. However, I'm happy to make the assumption that if one patch fails, nothing later for that repository should be applied. The person using Scap is supposed to fix the problem with patch 1 and re-run "scap apply-patches", which means patch 3 would be applied then. I agree, let's skip any later patches for a repository if one fails.

Antoine:

I think git apply --check was meant as a fast step to avoid writing to disk and just inform there is a conflict.

I'm skeptical this involves so much writing it matters for performance.

Antoine:

I don't see a need to use a temporary workspace area.

I do see a need for that. Bitter experience has taught me to distrust complicated "undo". Complicated things are fragile and fail in unfortunate and unforeseeable ways. Specifically, I don't want to apply anything in /srv/mediawiki-staging when running "scap test-patches". I'm less worried about "scap apply-patches", even if that's illogical of me.

It's true that /srv/mediawiki-staging is a "staging" area, but mistakes happen, and I'd strongly prefer to avoid taking risks here. People can and do deploy from there to production.

This reminds me that "scap sync-file" and "scap sync-world" should probably check that each repository is in a clean state: no uncommitted changed, not in the middle of a git am, etc.

Antoine:

What I would expect is an output like:

That looks good to me. I'll aim at that.

Mukunda:

This is pretty much the ideal situation, IMO - it should just keep track of which patches applied and which ones didn't so you can easily see what needs to be fixed. This is what I was going for when I originally built scap patch but I never got that far on it.

I find keeping track of state across runs of a process is error prone: how do you know something else didn't change something? However, having "scap apply-patches" apply patches in a way that's idempotent is OK and what I want. This way, if Scap applies a patch that's already applied is safe and harmless.

It turns out "git apply --3way" applies patches in an idempotent manner. That would be an additional reason to keep it.

Mukunda:

I'll add that, ideally, whoever merges a patch to master should remove it from /srv/patches so that we wouldn't often see any problems with patches conflicting with themself. Then we'd only rarely have conflicts from code changes in master that actually conflict with an active security patch.

Agreed. Unfortunately, it seems it's rarely RelEng doing the merging, and that the responsibility is diluted enough that removing the patch from /srv/patches does not happen sufficiently systematically. However, we can automate a check to make sure there aren't any merged patches in /srv/patches anymore. (But that's outside the scope for this task.)

In T269153#6896577, @LarsWirzenius wrote:

I would skip the test step ("git apply --check --3way") that we currently do. It's somewhat inappropriate as it doesn't handle the case of patches building on top of each other.

For "scap test-patches" do the same, except for each repository, make a local clone to a temporary directory, and apply the patch there. This is a more complete test than running "git apply --check", but avoids changing /srv/mediawiki-staging, possibly in a catastrophic manner.

Just FYI - when manually applying security patches, I always run a git apply --check prior to the git am, just as a sanity check. I don't think it's led to any error notification that was more than a whitespace difference or maybe an out-of-order patch application though.

I'm sketching a scenario for the Scap integration test as follows:

This scenario verifies that "scap apply-patches" can be run several
times, and problems can be fixed between runs. It assumes two
repositories, core and an extension, which both contain a file
`file.php`, which contains several lines of text. There are also three
kinds of patches: two good ones that change different parts of the
file, and one that expects a different content in the file. See below
for actual file contents.

First we set up the repositories and patches.

~~~scenario
given git repository core
given git repository core/extensions/Foo
given patch patches/core/01 from good-patch-1
given patch patches/core/02 from good-patch-2
given patch patches/extensions/Foo/01 from good-patch-1
given patch patches/extensions/Foo/02 from bad-patch-1
given patch patches/extensions/Foo/03 from good-patch-2
given an installed scap
~~~

Then we run "scap apply-patches". This fails, because one of the
patches is bad, but we verify that the output is as expected and that
the repositories are left in the expected state.

~~~scenario
when I run scap apply-patches
then it fails with exit code 1
then stdout contains "[APPLIED] patches/core/01"
then stdout contains "[APPLIED] patches/core/02"
then stdout contains "[APPLIED] patches/core/03"
then stdout contains "[APPLIED] patches/extensions/Foo/01"
then stdout contains "[FAILED] patches/extensions/Foo/02"
then stdout contains "[SKIPPED] patches/extensions/Foo/03"
then git repository core is clean
then git repository core/extensions/Foo is the middle of git am
~~~

If we run "scap apply-patches" at this point, it won't do anything,
and will tell us there is a problem.

~~~scenario
when I run scap apply-patches
then it fails with exit code 1
then stdout contains "[OK] patches/core/01"
then stdout contains "[OK] patches/core/02"
then stdout contains "[OK] patches/core/03"
then stdout contains "[ERROR] patches/extension/Foo in the middle of git am"
then git repository core is clean
then git repository core/extensions/Foo is the middle of git am
~~~

We then fix the problem by aborting the "git am" operation that was in
progress, and fixing the problematic patch.

~~~scenario
when I run, in core/extensions/Foo, git am --abort
when I replace patches/extensions/Foo/02 with good-patch-3
~~~

We can then run "scap apply-patches" again and it finishes successfully.

~~~scenario
when I run scap apply-patches
then it succeeds
then stdout contains "[OK] patches/core/01"
then stdout contains "[OK] patches/core/02"
then stdout contains "[OK] patches/core/03"
then stdout contains "[OK] patches/extension/Foo/01"
then stdout contains "[APPLIED] patches/extension/Foo/02"
then stdout contains "[APPLIED] patches/extension/Foo/03"
then git repository core is clean
then git repository core/extensions/Foo is clean
~~~

If we run "scap apply-patches" yet again, after all patches are
already applied, it's a benign operation.

~~~scenario
when I run scap apply-patches
then it succeeds
then stdout contains "[OK] patches/core/01"
then stdout contains "[OK] patches/core/02"
then stdout contains "[OK] patches/core/03"
then stdout contains "[OK] patches/extension/Foo/01"
then stdout contains "[OK] patches/extension/Foo/02"
then stdout contains "[OK] patches/extension/Foo/03"
then git repository core is clean
then git repository core/extensions/Foo is clean
~~~

~~~{#file.php .file}
This is line 1
This is line 2
This is line 3
This is line 4
This is line 5
This is line 6
This is line 7
This is line 8
This is line 9
~~~

good-patch-1 edits line 1
good-patch-2 edits line 9
bad-patch-1 edits line 5, but expects it to say "This is line 42"

There are a few typo here and there but it is not important. The scenario accurately capture the idea I had in mind for applying security patches,continuing on failure or checking the state. That is an excellent spec! I specially like the summary before each scenario entry, that is like reading a book. Well done!

RhinosF1 subscribed.Mar 12 2021, 4:38 PM

thcipriani added a parent task: T277793: Increase reliability of scap release process.Mar 23 2021, 4:09 PM

thcipriani mentioned this in T277793: Increase reliability of scap release process.

Change 674871 had a related patch set uploaded (by Lars Wirzenius; author: Lars Wirzenius):
[mediawiki/tools/scap@master] feat! rewrite apply-patches, drop list-patches and test-patches

https://gerrit.wikimedia.org/r/674871

gerritbot added a project: Patch-For-Review.Mar 25 2021, 1:14 PM

• LarsWirzenius added a parent task: T272145: Tag Scap release 1.37.1 and ask serviceops to deploy it to production.Mar 30 2021, 4:11 PM

• LarsWirzenius edited projects, added Release-Engineering-Team-TODO (2021-04-01 to 2021-06-30 (Q4)); removed Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)).Apr 6 2021, 4:26 PM

Change 674871 merged by jenkins-bot:

[mediawiki/tools/scap@master] feat! rewrite apply-patches, drop list-patches and test-patches

https://gerrit.wikimedia.org/r/674871

• LarsWirzenius closed this task as Resolved.Apr 7 2021, 3:02 PM

Maintenance_bot removed a project: Patch-For-Review.Apr 7 2021, 3:11 PM

I ran the train today and ran the scap apply-patches --train command twice. I am happy with the result, thank you!

Applying security patches should be robust and also give some useful outputClosed, ResolvedPublicActions

Description

Details

Related ObjectsSearch...

Event Timeline

Applying security patches should be robust and also give some useful output
Closed, ResolvedPublic
Actions

Related Objects
Search...