
`scap clean` failure
Open, Medium, Public

Description

zfilipin@deploy1001:/srv/mediawiki-staging$ find . -mindepth 2 -maxdepth 2 -type f -path './php-*/README' -ctime +30 -exec dirname {} \;
./php-1.33.0-wmf.17

zfilipin@deploy1001:/srv/mediawiki-staging$ scap clean --delete 1.33.0-wmf.17
 ___ ____
 ⎛ ⎛ ,----
 \ //==--'
 _//|,.·//==--' ____________________________
 _OO≣=- ︶ ᴹw ⎞_§ ______ ___\ ___\ ,\__ \/ __ \
 (∞)_, ) ( | ______/__ \/ /__ / /_/ / /_/ /
 ¨--¨|| |- ( / ______\____/ \___/ \__^_/ .__/
 ««_/ «_/ jgs/bd808 /_/

13:36:52 Checking for new runtime errors locally
13:36:53 Started clean-l10nupdate-cache
13:36:53 Finished clean-l10nupdate-cache (duration: 00m 00s)
13:36:53 Started clean-l10nupdate-owned-files
13:36:53 Finished clean-l10nupdate-owned-files (duration: 00m 00s)
13:36:53 Started clean-ExtensionMessages
13:36:53 Unable to delete /srv/mediawiki-staging/wmf-config/ExtensionMessages-1.33.0-wmf.17.php, already missing
13:36:53 Finished clean-ExtensionMessages (duration: 00m 00s)
13:36:53 Started prune-git-branches
Received disconnect from 2620:0:861:3:208:80:154:85 port 29418:2: Too many authentication failures: 7
Authentication failed.
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
Received disconnect from 2620:0:861:3:208:80:154:85 port 29418:2: Too many authentication failures: 7
Authentication failed.
fatal: Could not read from remote repository.

...

Details

Related Gerrit Patches:
operations/mediawiki-config (master): Train: scap clean, feature flag prune branches
operations/mediawiki-config (master): scap: add logging to clean > prune-git-branches

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript · Mar 20 2019, 1:50 PM
zeljkofilipin triaged this task as Unbreak Now! priority. · Mar 20 2019, 1:50 PM
Restricted Application added subscribers: Liuxinyu970226, TerraCodes. · View Herald Transcript · Mar 20 2019, 1:51 PM
hashar added a subscriber: hashar. · Mar 20 2019, 1:54 PM

13:44:00 Started prune-git-branches
Received disconnect from 2620:0:861:3:208:80:154:85 port 29418:2: Too many authentication failures: 7
Authentication failed.

prune-git-branches comes from our scap plugin:

/operations/mediawiki-config (master u=)$ git grep -A7 prune.git

scap/plugins/clean.py
with log.Timer('prune-git-branches', self.get_stats()):
    # Prune all the submodules' remote branches
    with utils.cd(self.branch_stage_dir):
        submodule_cmd = 'git submodule foreach "{} ||:"'.format(
            ' '.join(git_prune_cmd))          
        subprocess.check_output(submodule_cmd, shell=True)
        if subprocess.call(git_prune_cmd) != 0:
            logger.info('Failed to prune core branch')

Or to summarize, our scap plugin runs:

git submodule foreach  git push origin --delete wmf/1.33.0-wmf.17

Which deletes the legacy branch.

When scap clean is issued, Gerrit's sshd_log eventually shows:

AUTH FAILURE FROM 2620:0:861:103:10:64:32:16 no-matching-key

But ssh -p 29418 zfilipin@gerrit.wikimedia.org works.

So the insteadOf rewrite in .gitconfig works. It seems the proper SSH key is not passed when running under scap? :-(
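
For reference, a way to double-check both pieces from the deploy host (a suggested diagnostic, not commands taken from the task) could be:

# Show the insteadOf rewrite git is applying, and the keys the current ssh-agent offers:
git config --get-regexp insteadof
ssh-add -l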

One thing that can be attempted is to manually replay the commands that scap/plugins/clean.py runs. Namely:

cd /srv/mediawiki-staging/php-1.33.0-wmf.17
git submodule foreach "git push origin --quiet --delete wmf/1.33.0-wmf.17 || :" ; echo $?

And for core:

cd /srv/mediawiki-staging/php-1.33.0-wmf.17
git push origin --quiet --delete wmf/1.33.0-wmf.17 ; echo $?

greg lowered the priority of this task from Unbreak Now! to High. · Mar 20 2019, 4:13 PM
greg added a subscriber: greg.

Not a blocker to the train. But needs to be dealt with this week or next.

Change 497781 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/mediawiki-config@master] scap: add logging to clean > prune-git-branches

https://gerrit.wikimedia.org/r/497781

Mentioned in SAL (#wikimedia-operations) [2019-03-26T19:18:52Z] <marxarelli> scap clean failure due to T218783. train is rolling without cleanup

I think what's happening is that scap overwrites the environment variable $SSH_AUTH_SOCK with whatever is in /etc/scap.cfg's ssh_auth_sock setting (which is /run/keyholder/proxy.sock).
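
If that is what happens, the failure should be reproducible by pointing git at the same socket by hand. A hedged sketch, reusing the paths and branch from this task (--dry-run so nothing is actually deleted):

cd /srv/mediawiki-staging/php-1.33.0-wmf.17
# With keyholder's socket (what scap would use): expected to hit the same authentication failure
SSH_AUTH_SOCK=/run/keyholder/proxy.sock git push --dry-run origin --delete wmf/1.33.0-wmf.17
# With the user's own agent (what the plain ssh test above used)
git push --dry-run origin --delete wmf/1.33.0-wmf.17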

Change 502316 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[operations/mediawiki-config@master] Train: scap clean, feature flag prune branches

https://gerrit.wikimedia.org/r/502316

Change 497781 abandoned by Hashar:
scap: add logging to clean > prune-git-branches

https://gerrit.wikimedia.org/r/497781

Change 502316 merged by jenkins-bot:
[operations/mediawiki-config@master] Train: scap clean, feature flag prune branches

https://gerrit.wikimedia.org/r/502316

Marostegui raised the priority of this task from High to Unbreak Now!. · Apr 10 2019, 5:01 AM
Marostegui added a subscriber: Marostegui.

mwdebug2001 and mwdebug2002 are now full:

root@mwdebug2002:~# df -hT /
Filesystem     Type  Size  Used Avail Use% Mounted on
/dev/vda1      ext4   39G   37G     0 100% /


root@mwdebug2001:~# df -hT /
Filesystem     Type  Size  Used Avail Use% Mounted on
/dev/vda1      ext4   39G   37G     0 100% /

greg added a comment. · Apr 10 2019, 5:19 AM

Since these are VMs, can their disks be expanded easily? I ask because they're the oddballs in the MediaWiki fleet :) I know Tyler wants to fix this issue this week, though.

Joe added a subscriber: Joe. · Apr 10 2019, 5:21 AM

To be precise, the last version of the train did not deploy correctly to any of the debug servers.

Deployments should be considered blocked (train and SWAT) until we have at least pruned some directories manually.
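
(For reference, manual pruning on an affected host might look like the following. The path is assumed from the staging layout above, and the version should be double-checked against wikiversions before deleting anything.)

# Check which branch directories are eating the disk, then remove an unused one:
du -sh /srv/mediawiki/php-* | sort -h
sudo rm -rf /srv/mediawiki/php-1.33.0-wmf.17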

Joe added a comment. · Apr 10 2019, 5:53 AM

Also please note that with the next train, the eqiad servers will fill their disks up as well.

Joe added a comment. · Apr 10 2019, 5:54 AM

> Since these are VMs, can their disks be expanded easily? I ask because they're the oddballs in the MediaWiki fleet :) I know Tyler wants to fix this issue this week, though.

It's some work (basically, reimaging 2 servers) and more importantly takes time to complete. We will be doing it but it's not a painless quick fix.

> Since these are VMs, can their disks be expanded easily? I ask because they're the oddballs in the MediaWiki fleet :) I know Tyler wants to fix this issue this week, though.

Ugh, I did "fix" scap clean (for some value of "fix" -- it should no longer fail, but it also doesn't do everything it used to, namely deleting old branches on Gerrit), but we still need to run scap clean for versions 17-20.

I ran it for version 17 yesterday and that seems to have worked. I'll clean up 18-20 today.
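
(Based on the command in the description, that cleanup presumably amounts to something like the following; the exact version numbers are assumed.)

scap clean --delete 1.33.0-wmf.18
scap clean --delete 1.33.0-wmf.19
scap clean --delete 1.33.0-wmf.20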

mmodell lowered the priority of this task from Unbreak Now! to Medium. · Apr 11 2019, 5:59 PM

OK, I ran scap clean for wmf.18-wmf.20 and it seems like it got everything cleaned up.

The error no longer shows up, since the branch pruning has been disabled behind a feature flag by https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/502316/

So one now has to run scap clean --delete-gerrit-branch, which we do not do. The repository still has the old branches:

$ git ls-remote https://gerrit.wikimedia.org/r/mediawiki/core 'refs/heads/wmf/*'
5e425f78328119c494f85bde2b62bf882af957b3	refs/heads/wmf/1.34.0-wmf.13
7df499c9ac0da2ec9c5dfae79091a7dcfab0721b	refs/heads/wmf/1.34.0-wmf.14
2cf6cfe13f23dca5430e47d337936f503bfe6115	refs/heads/wmf/1.34.0-wmf.15
4c69337a8a74c57294bf3d5b3ef8a73aec333b13	refs/heads/wmf/1.34.0-wmf.16
f84a4abb418de8e2c53c87f5a3dc1379acfd2f63	refs/heads/wmf/1.34.0-wmf.17
5d54761bb099434397d9e2accb3a2d396c07989c	refs/heads/wmf/1.34.0-wmf.19
ddfeb42049c2ce35157ab9e941d24b294cbb2924	refs/heads/wmf/1.34.0-wmf.20
2286620289584e937ed9f595942988e28ba6b4f7	refs/heads/wmf/1.34.0-wmf.21
5a907677b69dd008498f170c3683e5d5e9e821b3	refs/heads/wmf/1.34.0-wmf.22
56f788d5bb941b109119c5ed374e1e49004776bf	refs/heads/wmf/1.34.0-wmf.23
68ccab1e007112d8f45954088ac3c2fa4b0692d0	refs/heads/wmf/1.34.0-wmf.24
8df64470fdead29ddd42c88ee1ecba6e77d88311	refs/heads/wmf/1.34.0-wmf.24-bak
54dbe5f05df6636dc2c3f7f20616a7a2acd07f47	refs/heads/wmf/1.34.0-wmf.25
f01da543f11746acc1aa42cd46e02688b2b7b9de	refs/heads/wmf/1.34.0-wmf.25-bak
0ba3e32a84a5960d266ed9cfba5e36ac4f1a9256	refs/heads/wmf/1.35.0-wmf.1
21e7707b532027baf3bd8f1563b920f237327ca8	refs/heads/wmf/1.35.0-wmf.2
50fc302ed798530164baf5839b375e099591d608	refs/heads/wmf/1.35.0-wmf.3
7af252a5be5b00646446532f61f59bd4111033d0	refs/heads/wmf/1.35.0-wmf.4
6ecbe69f4d1e5e9d2b15113036d3552bb674e4ec	refs/heads/wmf/1.35.0-wmf.5

I had a patch to add at least some level of logging to the scap clean command: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/497781/

Then @thcipriani wrote:

> Gerrit's use of HTTP auth tokens has been temporarily removed. Until it
> is re-enabled we rely on Gerrit's SSH auth. In the case of scap clean,
> since scap overrides the SSH_AUTH_SOCK env var with a pointer to
> keyholder, we are not authorized to prune any branches on Gerrit since
> there are no appropriate keys in keyholder.
>
> Until either Gerrit's HTTP auth tokens are re-enabled, or we have a
> shared key to prune branches in keyholder, we need to feature flag
> branch prune so that we ensure that we're doing the rest of scap clean
> (removing old branches on the application servers themselves).

I am in favor of creating a maintenance / branch cutter user in Gerrit with an SSH key held in the deployment hosts' keyholder :] Then we would need the branch deletion part to use that username (is that mwdeploy?) and drop the feature flag.
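
(A rough sketch of what the prune step could then look like; the username and URL form are placeholders for whatever account actually gets created, and --dry-run is used so nothing is deleted.)

# Hypothetical: prune push authenticating as the branch-cutter user via keyholder's socket
SSH_AUTH_SOCK=/run/keyholder/proxy.sock git push --dry-run ssh://branch-cutter@gerrit.wikimedia.org:29418/mediawiki/core --delete wmf/1.33.0-wmf.17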