
`scap clean` failure
Open, Normal, Public

Description

zfilipin@deploy1001:/srv/mediawiki-staging$ find . -mindepth 2 -maxdepth 2 -type f -path './php-*/README' -ctime +30 -exec dirname {} \;
./php-1.33.0-wmf.17

zfilipin@deploy1001:/srv/mediawiki-staging$ scap clean --delete 1.33.0-wmf.17
 ___ ____
 ⎛ ⎛ ,----
 \ //==--'
 _//|,.·//==--' ____________________________
 _OO≣=- ︶ ᴹw ⎞_§ ______ ___\ ___\ ,\__ \/ __ \
 (∞)_, ) ( | ______/__ \/ /__ / /_/ / /_/ /
 ¨--¨|| |- ( / ______\____/ \___/ \__^_/ .__/
 ««_/ «_/ jgs/bd808 /_/

13:36:52 Checking for new runtime errors locally
13:36:53 Started clean-l10nupdate-cache
13:36:53 Finished clean-l10nupdate-cache (duration: 00m 00s)
13:36:53 Started clean-l10nupdate-owned-files
13:36:53 Finished clean-l10nupdate-owned-files (duration: 00m 00s)
13:36:53 Started clean-ExtensionMessages
13:36:53 Unable to delete /srv/mediawiki-staging/wmf-config/ExtensionMessages-1.33.0-wmf.17.php, already missing
13:36:53 Finished clean-ExtensionMessages (duration: 00m 00s)
13:36:53 Started prune-git-branches
Received disconnect from 2620:0:861:3:208:80:154:85 port 29418:2: Too many authentication failures: 7
Authentication failed.
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
Received disconnect from 2620:0:861:3:208:80:154:85 port 29418:2: Too many authentication failures: 7
Authentication failed.
fatal: Could not read from remote repository.

...
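
For reference, the find invocation at the top of the paste selects php-* directories whose README was last changed more than 30 days ago. A rough Python equivalent is sketched below. The function name stale_branch_dirs and the now parameter are made up for illustration and are not part of scap. It follows find's -ctime +N rule (age truncated to whole days, strictly greater than N), but unlike find -type f it also accepts a README that is a symlink to a file.

```python
import os
import time


def stale_branch_dirs(staging_dir, min_age_days=30, now=None):
    """Return php-* directories whose README changed more than min_age_days ago.

    Hypothetical helper, not part of scap. Mirrors `find -ctime +N`: the age
    is truncated to whole days and must be strictly greater than N.
    """
    now = time.time() if now is None else now
    stale = []
    for entry in sorted(os.listdir(staging_dir)):
        readme = os.path.join(staging_dir, entry, "README")
        # Only php-* directories that directly contain a README file qualify.
        if not entry.startswith("php-") or not os.path.isfile(readme):
            continue
        age_days = int((now - os.stat(readme).st_ctime) // 86400)
        if age_days > min_age_days:
            stale.append(os.path.join(staging_dir, entry))
    return stale
```

Calling stale_branch_dirs("/srv/mediawiki-staging") on the deployment host should list the same directories as the find command, assuming the layout shown above.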

Event Timeline

Restricted Application added a subscriber: Aklapper. Mar 20 2019, 1:50 PM
zeljkofilipin triaged this task as Unbreak Now! priority.
Restricted Application added subscribers: Liuxinyu970226, TerraCodes. Mar 20 2019, 1:51 PM
hashar added a subscriber: hashar. Mar 20 2019, 1:54 PM

13:44:00 Started prune-git-branches
Received disconnect from 2620:0:861:3:208:80:154:85 port 29418:2: Too many authentication failures: 7
Authentication failed.

prune-git-branches comes from our scap plugin:

/operations/mediawiki-config(masteru=)$ git grep -A7 prune.git

scap/plugins/clean.py
with log.Timer('prune-git-branches', self.get_stats()):
    # Prune all the submodules' remote branches
    with utils.cd(self.branch_stage_dir):
        submodule_cmd = 'git submodule foreach "{} ||:"'.format(
            ' '.join(git_prune_cmd))          
        subprocess.check_output(submodule_cmd, shell=True)
        if subprocess.call(git_prune_cmd) != 0:
            logger.info('Failed to prune core branch')

Or, to summarize, our scap plugin does:

git submodule foreach  git push origin --delete wmf/1.33.0-wmf.17

Which deletes the legacy branch.
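
The plugin excerpt above discards the output of the git commands, which makes failures like this one hard to diagnose. Below is a minimal sketch of the same two steps with the output logged on failure. The names run_logged and prune_branch are hypothetical, and this is neither the actual plugin code nor the content of any patch on this task.

```python
import logging
import shlex
import subprocess

logger = logging.getLogger("clean")


def run_logged(cmd, cwd=None):
    """Run cmd, log its combined output on failure, and return the exit code."""
    proc = subprocess.run(
        cmd,
        cwd=cwd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # fold stderr into stdout so one log entry has both
        universal_newlines=True,
    )
    if proc.returncode != 0:
        logger.warning(
            "%s exited %d:\n%s", " ".join(cmd), proc.returncode, proc.stdout
        )
    return proc.returncode


def prune_branch(stage_dir, branch):
    """Delete `branch` from origin for every submodule, then for core.

    Hypothetical helper; returns (submodules_exit_code, core_exit_code).
    """
    push = ["git", "push", "origin", "--quiet", "--delete", branch]
    # "|| :" keeps `git submodule foreach` going when one submodule fails.
    per_submodule = " ".join(shlex.quote(part) for part in push) + " || :"
    foreach = ["git", "submodule", "foreach", per_submodule]
    return run_logged(foreach, cwd=stage_dir), run_logged(push, cwd=stage_dir)
```

With logging configured, prune_branch("/srv/mediawiki-staging/php-1.33.0-wmf.17", "wmf/1.33.0-wmf.17") would record the "Authentication failed" output next to the command that produced it.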

When scap clean runs, the Gerrit sshd_log eventually shows:

AUTH FAILURE FROM 2620:0:861:103:10:64:32:16 no-matching-key

But ssh -p 29418 zfilipin@gerrit.wikimedia.org works.

So the insteadOf in .gitconfig works. It seems the proper SSH key is not passed when using scap? :-(

One thing to try is manually replaying the commands that scap/plugins/clean.py runs, namely:

cd /srv/mediawiki-staging/php-1.33.0-wmf.17
git submodule foreach "git push origin --quiet --delete wmf/1.33.0-wmf.17 || :" ; echo $?

And for core:

cd /srv/mediawiki-staging/php-1.33.0-wmf.17
git push origin --quiet --delete wmf/1.33.0-wmf.17 ; echo $?

greg lowered the priority of this task from Unbreak Now! to High. Mar 20 2019, 4:13 PM
greg added a subscriber: greg.

Not a blocker to the train. But needs to be dealt with this week or next.

Change 497781 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/mediawiki-config@master] scap: add logging to clean > prune-git-branches

https://gerrit.wikimedia.org/r/497781

Mentioned in SAL (#wikimedia-operations) [2019-03-26T19:18:52Z] <marxarelli> scap clean failure due to T218783. train is rolling without cleanup

I think what's happening is that scap overwrites the environment variable $SSH_AUTH_SOCK with whatever is in /etc/scap.cfg:ssh_auth_sock (which is /run/keyholder/proxy.sock).
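
To illustrate the hypothesis above: a child process talks to whichever agent socket is in the environment it inherits, so overriding SSH_AUTH_SOCK in the parent changes which keys git and ssh can offer. The sketch below uses made-up helper names (child_env, agent_seen_by_child) and does not reproduce how scap actually builds its environment.

```python
import os
import subprocess
import sys


def child_env(override_sock=None, base_env=None):
    """Build a child environment, optionally replacing SSH_AUTH_SOCK.

    Hypothetical helper: copies base_env (default: os.environ) so the
    caller's own environment is left untouched.
    """
    env = dict(os.environ if base_env is None else base_env)
    if override_sock is not None:
        env["SSH_AUTH_SOCK"] = override_sock
    return env


def agent_seen_by_child(env):
    """Return the SSH_AUTH_SOCK value a subprocess started with env observes."""
    code = "import os; print(os.environ.get('SSH_AUTH_SOCK', ''))"
    out = subprocess.check_output([sys.executable, "-c", code], env=env)
    return out.decode().strip()
```

If the deployer's environment points at their own forwarded agent, agent_seen_by_child(child_env("/run/keyholder/proxy.sock")) returns the keyholder socket instead. If scap does the same, git push would be offered the keyholder's keys rather than the deployer's. Running ssh-add -l once with each socket in SSH_AUTH_SOCK should show which keys each agent holds.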

Change 502316 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[operations/mediawiki-config@master] Train: scap clean, feature flag prune branches

https://gerrit.wikimedia.org/r/502316

Change 497781 abandoned by Hashar:
scap: add logging to clean > prune-git-branches

https://gerrit.wikimedia.org/r/497781

Change 502316 merged by jenkins-bot:
[operations/mediawiki-config@master] Train: scap clean, feature flag prune branches

https://gerrit.wikimedia.org/r/502316

Marostegui raised the priority of this task from High to Unbreak Now!. Apr 10 2019, 5:01 AM
Marostegui added a subscriber: Marostegui.

mwdebug2001 and mwdebug2002 are now full:

root@mwdebug2002:~# df -hT /
Filesystem     Type  Size  Used Avail Use% Mounted on
/dev/vda1      ext4   39G   37G     0 100% /


root@mwdebug2001:~# df -hT /
Filesystem     Type  Size  Used Avail Use% Mounted on
/dev/vda1      ext4   39G   37G     0 100% /

greg added a comment. Apr 10 2019, 5:19 AM

Since these are vms, can their disk be expanded easily? I ask because they're the oddballs in the mw fleet :) I know Tyler wants to fix this issue this week, though.

Joe added a subscriber: Joe. Apr 10 2019, 5:21 AM

To be precise, the last version of the train did not deploy correctly to any of the debug servers.

Deployments should be considered blocked (train and SWAT) until we have at least pruned some directories manually.

Joe added a comment. Apr 10 2019, 5:53 AM

Also please note that with the next train, the eqiad servers would fill their disk up as well.

Joe added a comment. Apr 10 2019, 5:54 AM

Since these are vms, can their disk be expanded easily? I ask because they're the oddballs in the mw fleet :) I know Tyler wants to fix this issue this week, though.

It's some work (basically, reimaging 2 servers) and more importantly takes time to complete. We will be doing it but it's not a painless quick fix.

Since these are vms, can their disk be expanded easily? I ask because they're the oddballs in the mw fleet :) I know Tyler wants to fix this issue this week, though.

Ugh, I did "fix" scap clean (for some value of "fix" -- it should no longer fail, but it also doesn't do everything it used to -- namely deleting old branches on Gerrit), but we still need to run scap clean for versions 17-20.

I ran it for version 17 yesterday and that seems to have worked. I'll clean up 18-20 today.

mmodell lowered the priority of this task from Unbreak Now! to Normal. Apr 11 2019, 5:59 PM

OK, I ran scap clean for wmf.18-wmf.20 and it seems to have cleaned everything up.