
Scap train-presync failed to prepare 1.41.0-wmf.12
Closed, Resolved (Public)

Description

Systemd timer ran the following command:

/usr/bin/scap stage-train -Dfull_image_build:True --yes auto

Its return value was 70 and emitted the following output:

03:00:14 Initializing stage-train auto mode
03:00:14 Retrieving train information...
03:00:16 Using version 1.41.0-wmf.12
03:00:16 ----------------------------
1. Starting: prep
03:00:16 Started scap prep 1.41.0-wmf.12
03:00:16 Copying patches from /srv/patches/1.41.0-wmf.11 to /srv/patches/1.41.0-wmf.12
03:00:16 Finished scap prep 1.41.0-wmf.12 (duration: 00m 00s)
03:00:16 Unhandled error:
Traceback (most recent call last):
  File "/var/lib/scap/scap/lib/python3.7/site-packages/scap/plugins/prep.py", line 166, in main
    self._prep_mw_branch(self.arguments.branch, logger)
  File "/var/lib/scap/scap/lib/python3.7/site-packages/scap/plugins/prep.py", line 191, in _prep_mw_branch
    self._setup_patches(branch)
  File "/var/lib/scap/scap/lib/python3.7/site-packages/scap/plugins/prep.py", line 291, in _setup_patches
    git.add_all(patch_base_dir, message='Scap prep for "{}"'.format(version))
  File "/var/lib/scap/scap/lib/python3.7/site-packages/scap/git.py", line 238, in add_all
    gitcmd("add", "--all", cwd=location)
  File "/var/lib/scap/scap/lib/python3.7/site-packages/scap/runcmd.py", line 91, in gitcmd
    return _runcmd(["git", subcommand] + list(args), **kwargs)
  File "/var/lib/scap/scap/lib/python3.7/site-packages/scap/runcmd.py", line 78, in _runcmd
    raise FailedCommand(argv, p.returncode, stdout, stderr)
scap.runcmd.FailedCommand: Command 'git add --all' failed with exit code 128;

stdout:

stderr:

fatal: Unable to create 'srv/patches.git/index.lock': Permission denied

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/scap/scap/lib/python3.7/site-packages/scap/cli.py", line 530, in run
    exit_status = app.main(app.extra_arguments)
  File "/var/lib/scap/scap/lib/python3.7/site-packages/scap/plugins/prep.py", line 170, in main
    history.log(self.new_history, self.config["history_log"])
  File "/var/lib/scap/scap/lib/python3.7/site-packages/scap/history.py", line 54, in log
    with utils.open_with_lock(path, 'a+') as f:
  File "/usr/lib/python3.7/contextlib.py", line 112, in __enter__
    return next(self.gen)
  File "/var/lib/scap/scap/lib/python3.7/site-packages/scap/utils.py", line 995, in open_with_lock
    with open(path, mode, *args, **kwargs) as f:
PermissionError: [Errno 13] Permission denied: '/srv/mediawiki-staging/scap/log/history.log'
03:00:16 prep failed: <PermissionError> [Errno 13] Permission denied: '/srv/mediawiki-staging/scap/log/history.log'
03:00:16 stage-train failed: <CalledProcessError> Command '['/usr/bin/scap', 'prep', '1.41.0-wmf.12', '-D', 'full_image_build:True']' returned non-zero exit status 70.
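
The second traceback above comes from the error path itself: writing history.log failed while the git failure was still being handled, so the summary line reports the PermissionError rather than the original FailedCommand. Below is a minimal sketch of one way to keep the original error visible. The names run_with_history, failing_step and failing_history are hypothetical and this is not Scap's actual code.

```python
import logging

logger = logging.getLogger("prep-sketch")


def run_with_history(step, record_history):
    """Run step(), then always try to record history.

    A failure while recording history is logged as a warning so that it
    never replaces an error raised by step() itself.
    """
    try:
        step()
    finally:
        try:
            record_history()
        except OSError as log_error:
            # Secondary failure: report it, let the primary error propagate.
            logger.warning("could not write history log: %s", log_error)


def failing_step():
    # Stand-in for the failed `git add --all` (hypothetical).
    raise RuntimeError("git add --all failed with exit code 128")


def failing_history():
    # Stand-in for the unwritable history.log (hypothetical).
    raise PermissionError(13, "Permission denied", "history.log")
```

Calling run_with_history(failing_step, failing_history) raises the RuntimeError from the step, and the history failure only shows up as a warning.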

Followup actions:

  • If /srv/patches has the set-group-id flag set (g+s), then newly created directories and files would all belong to the deployment group, which should reduce the need to use the fix-staging-perms script.
  • Files in /srv/patches should be owned by a user other than the uids of former employees.
    • This last step hasn't been accomplished. Maybe later we can change the nameless uid to nobody.

Event Timeline

hashar triaged this task as Unbreak Now! priority. Jun 6 2023, 7:45 AM

That is obviously blocking the train and as such is an Unbreak Now!

The few things I found from a quick look on deployment.eqiad.wmnet (deploy1002.eqiad.wmnet):

Ghost uid

/srv/patches/ is owned by a nonexistent user with UID 2246 (same on the spare deploy2002).

Looking at Puppet modules/admin/data/data.yaml, that uid was assigned to former WMF employee Chris Steipp. He used to work on the MediaWiki security front but left back in 2017. I am guessing the ownership kept being carried over by rsync as the machines got rebuilt, and indeed some files under the .git directory date from February 2016.

Surely we should figure out who should own that directory: mwpresync or mwdeploy?
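
A ghost uid like this can be spotted mechanically, because a lookup in the account database fails for it. Below is a minimal sketch, assuming hypothetical helper names (owner_name, ghost_owned). It is not part of Scap or of fix-staging-perms.

```python
import os
import pwd


def owner_name(uid, resolve=pwd.getpwuid):
    """Return the account name for uid, or None if no account has that uid."""
    try:
        return resolve(uid).pw_name
    except KeyError:
        return None


def ghost_owned(root, resolve=pwd.getpwuid):
    """Yield (path, uid) for root and everything below it whose owner uid
    has no matching account. Symbolic links are inspected, not followed."""
    is_ghost = {}
    paths = [root]
    for dirpath, dirnames, filenames in os.walk(root):
        paths.extend(os.path.join(dirpath, name) for name in dirnames + filenames)
    for path in paths:
        uid = os.lstat(path).st_uid
        if uid not in is_ghost:
            is_ghost[uid] = owner_name(uid, resolve) is None
        if is_ghost[uid]:
            yield path, uid
```

For example, list(ghost_owned("/srv/patches")) would list every entry still owned by a uid with no account. The resolve parameter only exists so the lookup can be replaced in tests.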

/srv/patches group mismatches

The /srv/patches directories are owned by different groups:

  • On the spare deploy2002, the group is mwbuilder
  • On the primary deploy1002:
    • wmf.10 and wmf.11 have group wikidev
    • wmf.12 has group deployment

Permissions denied

fatal: Unable to create '/srv/patches.git/index.lock': Permission denied

/srv/patches/.git is owned by the wikidev group and is group writable. Its last changes are below, but I doubt they are relevant:

Access: 2023-06-05 20:51:42.900202464 +0000
Modify: 2023-06-05 20:29:14.157772734 +0000
Change: 2023-06-05 20:51:39.852188518 +0000

Last change to git index was on June 5th 20:14 by urbanecm.

There is a leftover COMMIT_EDITMSG from Scap clean for "1.41.0-wmf.9":

-rw-rw-r--   1                  1011 wikidev      30 May 30 03:52 COMMIT_EDITMSG

The uid is 1011, which is assigned to Yuri Astrakhan, a former WMF employee who left in 2017.

For the inner follow-up exception:

PermissionError: [Errno 13] Permission denied: '/srv/mediawiki-staging/scap/log/history.log'

The file is owned by brennen:wikidev and user/group writable.

The parent directory /srv/mediawiki-staging/scap/log is owned by gjg:wikidev, group writable with the set-group-ID flag to enforce wikidev group.

Possible culprit is https://gerrit.wikimedia.org/r/c/operations/puppet/+/927269

CommitDate: Mon Jun 5 19:41:13 2023 +0000

    fix-stagging-perms: Fix group owner change for /srv/patches
    
    Previously, the script was erroring out with
    "paths must precede expression `group`", because
    the group predicate was missing a dash.

Which is more or less in line with the last change to .git/index by @Urbanecm on June 5th at 20:14.

The patch changes the fix-staging-perms.sh script:

-find /srv/patches -not group wikidev -print0 | xargs -0 -r chgrp wikidev
+find /srv/patches -not -group wikidev -print0 | xargs -0 -r chgrp wikidev

That is part of T338180 and previously the command would not do anything?
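
To check what the corrected find expression selects without changing anything, the same test can be sketched in Python. This is only an illustration: not_in_group is a hypothetical helper, not something the script contains.

```python
import os


def not_in_group(root, gid):
    """Return root and every path below it whose group id differs from gid.

    Roughly what `find ROOT -not -group NAME` selects. lstat() is used so
    that a symbolic link is judged by its own group, not its target's.
    """
    paths = [root]
    for dirpath, dirnames, filenames in os.walk(root):
        paths.extend(os.path.join(dirpath, name) for name in dirnames + filenames)
    return [path for path in paths if os.lstat(path).st_gid != gid]
```

A dry run could then be not_in_group("/srv/patches", grp.getgrnam("wikidev").gr_gid), using the standard grp module to resolve the group name. An empty list means the chgrp would have nothing to do.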

Then why is /srv/patches/1.41.0-wmf.12 owned by group deployment instead of wikidev?

Change 927584 had a related patch set uploaded (by Urbanecm; author: Urbanecm):

[operations/puppet@production] fix-staging-perms: Change group owner to deployment

https://gerrit.wikimedia.org/r/927584

I thought there had to be a reason for the deployment group ownership :). I uploaded a fix; deploying it and running the fix-staging-perms script again should unbreak the train.

T338180 had:

In addition to this, I noticed that the group owner of /srv/patches changed to deployment (this may or may not be the case for /srv/mediawiki-staging before I ran the fixing script and changed it to wikidev).
Can someone confirm the group owner for those two paths should still be wikidev?

That was not listed on the Puppet patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/927269 which got deployed overnight (from a European point of view). The now-fixed script thus changed the group to wikidev when we expect deployment. Indeed, from Puppet:

modules/scap/manifests/master.pp
file { $patches_path:
    ensure => 'directory',
    owner  => 'mwdeploy',
    group  => $deployment_group,
    mode   => '2775',
}

Which comes from:

modules/profile/manifests/mediawiki/deployment/server.pp
String $deployment_group             = lookup('deployment_group', {default_value => 'wikidev'}),

Which points to hiera:

hieradata/role/common/deployment_server/kubernetes.yaml
deployment_group: "deployment"

But that one got set back in May 2022 :/
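
The lookup chain above can be modelled in a few lines to show why the wikidev default is easy to misread. This is a toy model only: the lookup function below is a hypothetical Python stand-in, not Puppet's implementation, and other_role is an invented example of a role that lacks the key.

```python
def lookup(key, hiera, **options):
    """Toy Hiera-style lookup: return hiera[key] when present, otherwise
    options['default_value'] if one was given, otherwise raise KeyError."""
    if key in hiera:
        return hiera[key]
    if "default_value" in options:
        return options["default_value"]
    raise KeyError(f"no value for {key!r} and no default")


# Value taken from hieradata/role/common/deployment_server/kubernetes.yaml
kubernetes_role = {"deployment_group": "deployment"}
# Hypothetical role whose hieradata does not set the key
other_role = {}

effective = lookup("deployment_group", kubernetes_role, default_value="wikidev")
fallback = lookup("deployment_group", other_role, default_value="wikidev")
```

With the hiera value present, the default is never used, so reading only the class signature suggests the wrong group. Without a default, a role that forgets the key fails loudly instead of silently falling back to wikidev.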

Short story:

/srv/patches/.git is now owned by wikidev group due to the script being fixed but it should be in the deployment group for scap to be able to act on it.

Surely the permissions should be fixed by setting the set-group-ID flag on /srv/patches.
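
The effect of the set-group-ID flag on a directory can be checked with a small sketch. It assumes Linux semantics, the helper names are hypothetical, and the leading 2 in the Puppet mode '2775' is the same bit as stat.S_ISGID.

```python
import os
import stat
import tempfile


def set_setgid(directory):
    """Add the set-group-ID bit (g+s) to directory, keeping its other mode bits."""
    mode = stat.S_IMODE(os.stat(directory).st_mode)
    os.chmod(directory, mode | stat.S_ISGID)


def inherits_group(directory):
    """Create a scratch file in directory and report whether the file
    received the directory's group. The scratch file is removed afterwards."""
    fd, path = tempfile.mkstemp(dir=directory)
    os.close(fd)
    try:
        return os.stat(path).st_gid == os.stat(directory).st_gid
    finally:
        os.unlink(path)
```

On a directory with g+s, inherits_group should hold whatever the primary group of the creating user is. That is the property that would make a later chgrp pass less necessary.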

Change 927584 merged by Jcrespo:

[operations/puppet@production] fix-staging-perms: Change group owner to deployment

https://gerrit.wikimedia.org/r/927584

Mentioned in SAL (#wikimedia-operations) [2023-06-06T08:59:49Z] <urbanecm> deploy1002: run /usr/local/sbin/fix-staging-perms (T338205)

The ownership should be fixed now. Leaving re-running the presync command to @hashar / releng.

hashar lowered the priority of this task from Unbreak Now! to Medium. Jun 6 2023, 9:25 AM
hashar added a subscriber: jcrespo.

@Urbanecm fixed it up and @jcrespo reran the train-presync systemd service. The train is progressing, hence this task is no longer a blocker.

A few follow-up actions are needed:

  • fix-staging-perms hardcodes the deployment unix group; that should be fed by Puppet. That would prevent forgetting to update one of the occurrences of the group name.
  • The deployment_group variable in the Puppet classes defaults to wikidev, which sounds potentially misleading. I'd go for removing the default and relying only on the deployment_group: deployment hiera variable. Beta-Cluster-Infrastructure will need something similar.
  • Files in /srv/patches should be owned by a user other than the uids of former employees.
  • If /srv/patches has the set-group-id flag set (g+s), then newly created directories and files would all belong to the deployment group, which should reduce the need to use the fix-staging-perms script.

I am adding those to the task description.

Change 927674 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/puppet@production] fix-staging-perms: set group name from Puppet

https://gerrit.wikimedia.org/r/927674

Change 927675 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/puppet@production] scap3: stop defaulting deployment_group to 'wikidev'

https://gerrit.wikimedia.org/r/927675

Change 927676 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/puppet@production] fix-staging-perms: set set-group-id on /srv/patches subdirs

https://gerrit.wikimedia.org/r/927676

I have sent Puppet patches for 3 of the 4 action items. Not sure who can review them though.

The last one is about files being owned by former employees:

Files in /srv/patches should be owned by a user other than the uids of former employees.

That is probably not that important. Then there are plenty of old legacy objects, so surely it should get some automatic maintenance from time to time, but I digress.

Mentioned in SAL (#wikimedia-releng) [2023-06-20T10:43:05Z] <hashar> deployment-prep: running /usr/local/sbin/fix-staging-perms on deployment-deploy03 using the cherry pick of https://gerrit.wikimedia.org/r/c/operations/puppet/+/927674 . That normalizes all of /srv/mediawiki-staging to be group owned by wikidev # T338205

Change 927674 merged by Clément Goubert:

[operations/puppet@production] fix-staging-perms: set group name from Puppet

https://gerrit.wikimedia.org/r/927674

Change 978541 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/puppet@production] fix-staging-perms: chgrp symbolic link, not its target!

https://gerrit.wikimedia.org/r/978541

Change 978541 merged by Clément Goubert:

[operations/puppet@production] fix-staging-perms: chgrp symbolic link, not its target!

https://gerrit.wikimedia.org/r/978541
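
The distinction named in that change title (changing the group of a symbolic link rather than of its target) can be illustrated as follows. This is a sketch under assumptions, not the content of the patch: chgrp_no_follow is a hypothetical helper, and the shell counterpart is chgrp -h.

```python
import os


def chgrp_no_follow(path, gid):
    """Change the group of path itself. If path is a symbolic link, the
    link is changed and its target is left untouched (like `chgrp -h`)."""
    # uid -1 means "leave the owner as it is".
    os.chown(path, -1, gid, follow_symlinks=False)
```

With the default follow_symlinks=True, os.chown would act on the target instead and would fail outright on a dangling link. That is why the no-follow form suits a tree that may contain links pointing elsewhere.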

Change 927675 merged by Clément Goubert:

[operations/puppet@production] scap3: stop defaulting deployment_group to 'wikidev'

https://gerrit.wikimedia.org/r/927675

Change 927676 merged by Clément Goubert:

[operations/puppet@production] fix-staging-perms: set set-group-id on /srv/patches subdirs

https://gerrit.wikimedia.org/r/927676

hashar added a subscriber: Clement_Goubert.

@claime gave the final round of review and rolled all the patches. He ran the fix-staging-perms script so we should be all set.

I have left a note on the next train task (T350085) pointing back here.

THANK YOU @claime !