beta-scap-eqiad mira / deployment-bastion permissions problem
Closed, Resolved (Public)

Description

beta-scap-eqiad has been failing because of permissions issues on mira.

There were a couple of weird problems, one of which I fixed, one of which I'm punting on until tomorrow.

Problem the first: the ownership of the rsync-created /srv/mediawiki-staging/.~tmp~ directory was all whack:

thcipriani@mira:/srv/mediawiki-staging$ stat .~tmp~
  File: ‘.~tmp~’
  Size: 4096            Blocks: 8          IO Block: 4096   directory
Device: fc00h/64512d    Inode: 2097528     Links: 2
Access: (2700/drwx--S---)  Uid: (  993/mwdeploy)   Gid: (  500/ wikidev)
Access: 2015-10-28 23:19:17.296223237 +0000
Modify: 2015-10-28 23:46:06.872485202 +0000
Change: 2015-10-28 23:46:06.872485202 +0000
 Birth: -
thcipriani@mira:/srv/mediawiki-staging$ getent passwd mwdeploy
mwdeploy:x:603:603:mwdeploy:/home/mwdeploy:/bin/bash

I'm not sure what caused that problem (mwdeploy on deployment-bastion is also uid 603), but a chown -R 603 solved it.
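A quick way to see how far the wrong ownership has spread is to walk the tree and compare each owner against the expected numeric uid. This is only a minimal sketch; the function name find_foreign_owned is made up for illustration and is not part of scap.

```python
import os


def find_foreign_owned(root, expected_uid):
    """Return paths under root (root included) not owned by expected_uid.

    Hypothetical helper for illustration only; it is not part of scap.
    """
    paths = [root]
    for dirpath, dirnames, filenames in os.walk(root):
        # Collect every directory and file entry below the current directory.
        paths.extend(os.path.join(dirpath, name) for name in dirnames + filenames)
    # lstat() so that symlinks are checked themselves and not followed.
    return [p for p in paths if os.lstat(p).st_uid != expected_uid]
```

On mira this would be called with something like find_foreign_owned("/srv/mediawiki-staging", 603), or with pwd.getpwnam("mwdeploy").pw_uid if the name lookup can be trusted.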

Problem the second: rsync wants to set times on /srv/mediawiki-staging which it can't do because it's owned by root. A chown mwdeploy:wikidev /srv/mediawiki-staging fixed this for 1 run, but this needs to be fixed somewhere in puppet.

Event Timeline

thcipriani raised the priority of this task to Needs Triage.
thcipriani updated the task description.
thcipriani added subscribers: thcipriani, bd808, demon, hashar.

Change 249684 had a related patch set uploaded (by Alex Monk):
Make mediawiki-config clone be owned by mwdeploy

https://gerrit.wikimedia.org/r/249684

hashar triaged this task as Unbreak Now! priority. Oct 29 2015, 3:08 PM
hashar added a project: Essential-Work.

That breaks beta-scap-eqiad on deployment-bastion:

15:07:15 15:07:08 Started rsync master
15:07:15 rsync: failed to set times on "/srv/mediawiki-staging/.": Operation not permitted (1)
15:07:15 rsync: rename "/srv/mediawiki-staging/.wikiversions-labs.php.tsl1fr" -> ".~tmp~/wikiversions-labs.php": Permission denied (13)
15:07:15 rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1655) [generator=3.1.0]

Cherry-picked patch on deployment-puppetmaster, ran puppet on deployment-bastion:
Notice: /Stage[main]/Scap::Master/Git::Clone[operations/mediawiki-config]/File[/srv/mediawiki-staging]/owner: owner changed 'root' to 'mwdeploy'

Hmm, mira:/srv/mediawiki-staging/.~tmp~ is owned by 993 again:

thcipriani@mira:/srv/mediawiki-staging$ stat .~tmp~
  File: ‘.~tmp~’
  Size: 4096            Blocks: 8          IO Block: 4096   directory
Device: fc00h/64512d    Inode: 2097528     Links: 2
Access: (2700/drwx--S---)  Uid: (  993/mwdeploy)   Gid: (  500/ wikidev)
Access: 2015-10-29 12:33:25.842298135 +0000
Modify: 2015-10-29 12:37:25.838273282 +0000
Change: 2015-10-29 12:37:25.838273282 +0000
Birth: -

I wonder what the root cause of this is?

$ stat /srv/mediawiki-staging
  File: `/srv/mediawiki-staging'
  Size: 4096      	Blocks: 8          IO Block: 4096   directory
Device: fc00h/64512d	Inode: 657883      Links: 14
Access: (2775/drwxrwsr-x)  Uid: (    0/    root)   Gid: (  500/ wikidev)
Access: 2015-10-29 15:07:33.602030795 +0000
Modify: 2015-10-29 15:07:30.430031178 +0000
Change: 2015-10-29 15:07:30.430031178 +0000
 Birth: -

So it belongs to root:wikidev (with the setgid bit set).

The mwdeploy user does not belong to the wikidev group:

$ groups mwdeploy
mwdeploy : mwdeploy
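To spell out why that matters: with mode 2775 and root:wikidev ownership, a user who is neither the owner nor in the group only gets the "other" bits (r-x). Below is a rough sketch of the classic owner/group/other check; can_write is a made-up helper, the numeric ids are the ones from the stat and getent output in this task, and ACLs and capabilities are ignored. Note that this only covers write access; setting explicit timestamps, which rsync also wants to do, additionally requires owning the directory.

```python
import stat


def can_write(mode, owner_uid, group_gid, uid, gids):
    """Classic POSIX owner/group/other write check (ACLs ignored).

    Hypothetical helper for illustration only.
    """
    if uid == 0:
        return True  # root bypasses the permission bits
    if uid == owner_uid:
        return bool(mode & stat.S_IWUSR)
    if group_gid in gids:
        return bool(mode & stat.S_IWGRP)
    return bool(mode & stat.S_IWOTH)


# /srv/mediawiki-staging as shown above: root (0), wikidev (500), mode 2775.
staging_mode = stat.S_IFDIR | 0o2775
print(stat.filemode(staging_mode))

# mwdeploy (uid 603) with only its own group, then with wikidev added.
print(can_write(staging_mode, 0, 500, 603, {603}))
print(can_write(staging_mode, 0, 500, 603, {603, 500}))
```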

From the puppet logs:

puppet.log.2.gz:Notice: /Stage[main]/Mediawiki::Users/Sudo::User[mwdeploy]/File[/etc/sudoers.d/mwdeploy]/content: 
puppet.log.2.gz:--- /etc/sudoers.d/mwdeploy	2015-09-08 19:44:30.965728133 +0000
puppet.log.2.gz: mwdeploy ALL = (root) NOPASSWD: /usr/sbin/service apache2 start
puppet.log.2.gz: mwdeploy ALL = (root) NOPASSWD: /sbin/start hhvm
puppet.log.2.gz: mwdeploy ALL = (root) NOPASSWD: /usr/sbin/apache2ctl graceful-stop
puppet.log.2.gz:-mwdeploy ALL = (mwdeploy:wikidev) NOPASSWD: /usr/bin/rsync *\:\:common /srv/mediawiki-staging
puppet.log.2.gz:Info: /Stage[main]/Mediawiki::Users/Sudo::User[mwdeploy]/File[/etc/sudoers.d/mwdeploy]: Filebucketed /etc/sudoers.d/mwdeploy to puppet with sum 3f738b182013e41d81a1bb09cc5b3bb0
puppet.log.2.gz:Notice: /Stage[main]/Mediawiki::Users/Sudo::User[mwdeploy]/File[/etc/sudoers.d/mwdeploy]/content: content changed '{md5}3f738b182013e41d81a1bb09cc5b3bb0' to '{md5}b264916d4404ffa410a0233fef11d18d'
puppet.log.2.gz:Info: /Stage[main]/Mediawiki::Users/Sudo::User[mwdeploy]/File[/etc/sudoers.d/mwdeploy]: Scheduling refresh of Exec[sudo_user_mwdeploy_linting]
puppet.log.2.gz:Notice: /Stage[main]/Mediawiki::Users/Sudo::User[mwdeploy]/Exec[sudo_user_mwdeploy_linting]: Triggered 'refresh' from 1 events

I.e. the following sudo rule, which mentions both wikidev and rsync, got removed:

mwdeploy ALL = (mwdeploy:wikidev) NOPASSWD: /usr/bin/rsync *\:\:common /srv/mediawiki-staging

That occurred on 2015-10-27 19:15:15 UTC.

I don't see anything related in either of operations/puppet.git or Hiera:deployment-prep.

Maybe that was a local hack, a manual change or a cherry picked patch that got removed from the puppet master? :/

I wonder what the root cause of this is?

A shadow account for mwdeploy in the /etc/passwd file on mira. We have had problems with this happening before when there is a hiccup talking to LDAP during a Puppet run.

mira.deployment-prep:~
bd808$ grep 993 /etc/passwd
mwdeploy:x:993:993::/home/mwdeploy:/bin/bash
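Since this has bitten us before, a check along these lines could flag shadow accounts early. It is a sketch only: it assumes the directory (LDAP) side can be dumped in passwd format, and both function names are invented for illustration.

```python
def parse_passwd(text):
    """Map user name -> uid from passwd-format lines (name:pw:uid:gid:...)."""
    users = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        users[fields[0]] = int(fields[2])
    return users


def shadowed_accounts(local_text, directory_text):
    """Return {name: (local_uid, directory_uid)} for names whose uids differ.

    Hypothetical helper; directory_text is assumed to be a passwd-format
    dump of the LDAP accounts.
    """
    local = parse_passwd(local_text)
    directory = parse_passwd(directory_text)
    return {
        name: (uid, directory[name])
        for name, uid in local.items()
        if name in directory and directory[name] != uid
    }
```

Fed the /etc/passwd line above and the getent line from the task description, it reports mwdeploy with the uid pair (993, 603).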

https://gerrit.wikimedia.org/r/#/c/249684/ has been cherry-picked and /srv/mediawiki-staging now belongs to the mwdeploy user.

hashar renamed this task from beta-scap-eqiad mira permissions problem to beta-scap-eqiad mira / deployment-bastion permissions problem. Oct 29 2015, 3:33 PM
hashar moved this task from To Triage to Done on the Beta-Cluster-Infrastructure board.
hashar set Security to None.

This is all related to T104826: [scap] Add support for syncing /srv/mediawiki-staging including fully working git data to warm spare deploy server. @demon tested things out in the last couple of days, but he manually chmoded things to get it working initially and we missed the need to change the ownership of /srv/mediawiki-staging while that happened. Thanks to @Krenair for coming to the rescue with the right puppet patch for that. Now we just need to get @Joe or another opsen with motivation to get mira working in production to review and merge the two puppet patches.

Change 249684 merged by Muehlenhoff:
Make mediawiki-config clone be owned by mwdeploy

https://gerrit.wikimedia.org/r/249684

krenair@tin:~$ ls -al /srv | grep mediawiki-staging
drwxrwsr-x 28 mwdeploy  wikidev  4096 Oct 30 22:05 mediawiki-staging

What is the other patch needed?

hashar lowered the priority of this task from Unbreak Now! to High. Nov 9 2015, 1:07 PM

This problem is back after deploying the 1.27.0-wmf.6 branch. @mmodell and I talked about the general problem a bit on IRC and think that we may have a better solution than the current permissions dance.

  1. Create a shell script in operations/puppet that runs /usr/bin/rsync --archive --delete-delay --delay-updates --compress --delete --exclude=**/cache/l10n/*.cdb --exclude=*.swp "$1::common" "$2"
  2. Change the 'scap-master-sync' sudoers grant in ::scap::master to allow mwdeploy to run that script as root
  3. Change scap.tasks.sync_master to call the new script as root passing the current master and stage_dir as parameters
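Roughly, the scap side of step 3 could look like the sketch below. The wrapper path /usr/local/bin/scap-master-sync and both function names are assumptions for illustration; the real script name and location are whatever the puppet patch ends up installing. rsync_argv only shows what the wrapper would run with its two parameters.

```python
# rsync options from step 1 (rsync long options take two leading dashes).
RSYNC_OPTIONS = [
    "--archive",
    "--delete-delay",
    "--delay-updates",
    "--compress",
    "--delete",
    "--exclude=**/cache/l10n/*.cdb",
    "--exclude=*.swp",
]


def rsync_argv(master, stage_dir):
    """Command the wrapper script would run for $1=master, $2=stage_dir."""
    return ["/usr/bin/rsync"] + RSYNC_OPTIONS + ["%s::common" % master, stage_dir]


def sync_master_argv(master, stage_dir, wrapper="/usr/local/bin/scap-master-sync"):
    """Command sync_master would hand to subprocess.check_call().

    The wrapper path is hypothetical. -n makes sudo fail instead of
    prompting when no matching NOPASSWD rule exists.
    """
    return ["sudo", "-n", "--", wrapper, master, stage_dir]
```

Keeping the rsync flags inside the wrapper means the sudoers grant only has to match one fixed script path rather than a full rsync command line.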

Change 253040 had a related patch set uploaded (by BryanDavis):
scap: Create wrapper script for master-master rsync

https://gerrit.wikimedia.org/r/253040

mmodell added a revision: Restricted Differential Revision. Nov 13 2015, 10:26 PM

Change 253040 merged by Filippo Giunchedi:
scap: Create wrapper script for master-master rsync

https://gerrit.wikimedia.org/r/253040

https://gerrit.wikimedia.org/r/253040 got merged at Wed Nov 18 09:22:43 2015 UTC.

That caused the beta-scap Jenkins job to fail https://integration.wikimedia.org/ci/job/beta-scap-eqiad/79039/

00:01:46.845 09:47:05 Started sync-masters
00:01:46.847 sync-masters:   0% (ok: 0; fail: 0; left: 1)                                    
00:01:48.533 09:47:06 ['/srv/deployment/scap/scap/bin/sync-master', 'deployment-bastion.deployment-prep.eqiad.wmflabs'] on mira.deployment-prep.eqiad.wmflabs returned [70]: Warning: Permanently added 'mira.deployment-prep.eqiad.wmflabs,10.68.17.215' (ECDSA) to the list of known hosts.
00:01:48.533 09:47:06 Copying to mira.deployment-prep.eqiad.wmflabs from deployment-bastion.deployment-prep.eqiad.wmflabs
00:01:48.533 09:47:06 Started rsync master
00:01:48.533 sudo: a password is required
00:01:48.533 09:47:06 Finished rsync master (duration: 00m 00s)
00:01:48.533 09:47:06 Unhandled error:
00:01:48.533 Traceback (most recent call last):
00:01:48.533   File "/srv/deployment/scap/scap/scap/cli.py", line 276, in run
00:01:48.533     exit_status = app.main(extra_args)
00:01:48.533   File "/srv/deployment/scap/scap/scap/main.py", line 349, in main
00:01:48.533     verbose=self.verbose
00:01:48.533   File "/srv/deployment/scap/scap/scap/utils.py", line 348, in context_wrapper
00:01:48.533     return func(*args, **kwargs)
00:01:48.533   File "/srv/deployment/scap/scap/scap/tasks.py", line 278, in sync_master
00:01:48.533     subprocess.check_call(rsync)
00:01:48.533   File "/usr/lib/python2.7/subprocess.py", line 540, in check_call
00:01:48.533     raise CalledProcessError(retcode, cmd)
00:01:48.533 CalledProcessError: Command '['sudo', '-u', 'mwdeploy', '-g', 'wikidev', '-n', '--', '/usr/bin/rsync', '--archive', '--delete-delay', '--delay-updates', '--compress', '--delete', '--exclude=**/cache/l10n/*.cdb', '--no-perms', 'deployment-bastion.deployment-prep.eqiad.wmflabs::common', '/srv/mediawiki-staging']' returned non-zero exit status 1
00:01:48.533 09:47:06 sync-master failed: <CalledProcessError> Command '['sudo', '-u', 'mwdeploy', '-g', 'wikidev', '-n', '--', '/usr/bin/rsync', '--archive', '--delete-delay', '--delay-updates', '--compress', '--delete', '--exclude=**/cache/l10n/*.cdb', '--no-perms', 'deployment-bastion.deployment-prep.eqiad.wmflabs::common', '/srv/mediawiki-staging']' returned non-zero exit status 1
00:01:48.533 
00:01:48.535 sync-masters: 100% (ok: 0; fail: 1; left: 0)                                    
00:01:48.536 sync-masters: 100% (ok: 0; fail: 1; left: 0)                                    
00:01:48.536 
00:01:48.536 09:47:06 1 masters had sync errors

We've gotten past this particular problem now that the associated scap change has been merged but we have run into a new one:

00:03:46.625 18:04:10 Failure processing (u'/srv/mediawiki-staging/php-master/cache/l10n', u'l10n_cache-nds.cdb', True)
00:03:46.625 Traceback (most recent call last):
00:03:46.625   File "/srv/deployment/scap/scap/scap/tasks.py", line 431, in update_l10n_cdb_wrapper
00:03:46.625     return update_l10n_cdb(*args)
00:03:46.625   File "/srv/deployment/scap/scap/scap/utils.py", line 348, in context_wrapper
00:03:46.625     return func(*args, **kwargs)
00:03:46.625   File "/srv/deployment/scap/scap/scap/tasks.py", line 404, in update_l10n_cdb
00:03:46.625     with open(tmp_cdb_path, 'wb') as fp:
00:03:46.625 IOError: [Errno 13] Permission denied: u'/srv/mediawiki-staging/php-master/cache/l10n/l10n_cache-nds.cdb.tmp'
00:03:46.625 18:04:10 Unhandled error:
00:03:46.625 Traceback (most recent call last):
00:03:46.625   File "/srv/deployment/scap/scap/scap/cli.py", line 276, in run
00:03:46.625     exit_status = app.main(extra_args)
00:03:46.625   File "/srv/deployment/scap/scap/scap/main.py", line 353, in main
00:03:46.625     verbose=self.verbose
00:03:46.625   File "/srv/deployment/scap/scap/scap/utils.py", line 348, in context_wrapper
00:03:46.625     return func(*args, **kwargs)
00:03:46.625   File "/srv/deployment/scap/scap/scap/tasks.py", line 282, in sync_master
00:03:46.625     merge_cdb_updates(cache_dir, use_cores, True, True)
00:03:46.625   File "/srv/deployment/scap/scap/scap/utils.py", line 348, in context_wrapper
00:03:46.625     return func(*args, **kwargs)
00:03:46.625   File "/srv/deployment/scap/scap/scap/tasks.py", line 207, in merge_cdb_updates
00:03:46.625     itertools.repeat(trust_mtime))), 1):
00:03:46.625   File "/usr/lib/python2.7/multiprocessing/pool.py", line 659, in next
00:03:46.625     raise value
00:03:46.625 IOError: [Errno 13] Permission denied: u'/srv/mediawiki-staging/php-master/cache/l10n/l10n_cache-nds.cdb.tmp'
00:03:46.625 18:04:10 sync-master failed: <IOError> [Errno 13] Permission denied: u'/srv/mediawiki-staging/php-master/cache/l10n/l10n_cache-nds.cdb.tmp'

This is a permissions problem caused by the difference between /srv/mediawiki-staging and /srv/mediawiki. The l10n cache directories in the staging tree are owned by the l10nupdate user so that the nightly l10nupdate cron job can modify the CDB files. In the deploy tree everything is owned by mwdeploy. Scap is currently trying to directly reuse the existing logic for creating CDB files from their JSON dumps in the staging tree. This fails because the target directory is owned by the l10nupdate user rather than the mwdeploy user that is running the update_l10n_cdb task.

The fix for this is to introduce a new entry point (e.g. scap-update-l10n-cdb) that can execute scap.tasks.update_l10n_cdb as the l10nupdate user, and to change scap.tasks.sync_master to invoke the new wrapper script as the l10nupdate user via sudo. The mwdeploy user already has full sudoer rights as the l10nupdate user, so this will only need scap patches.
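As a sketch of what sync_master would end up invoking: the positional arguments and the --trust-mtime flag below are invented for illustration (they mirror the (directory, file, trust_mtime) tuple in the traceback), and the real entry point may take something quite different.

```python
def update_l10n_cdb_argv(cache_dir, cdb_file, trust_mtime,
                         entry_point="scap-update-l10n-cdb",
                         user="l10nupdate"):
    """Build a sudo command that runs the new entry point as l10nupdate.

    The command-line interface shown here is hypothetical.
    """
    argv = ["sudo", "-u", user, "-n", "--", entry_point, cache_dir, cdb_file]
    if trust_mtime:
        argv.append("--trust-mtime")  # assumed flag name
    return argv
```

The resulting list can be passed to subprocess.check_call(), the same way sync_master already runs rsync.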