Page MenuHomePhabricator

localisationupdate broken on wmf wikis by scap master-master sync changes
Closed, ResolvedPublic

Description

Starting l10nupdate at Fri Nov 27 02:00:01 UTC 2015.
Updating git clone ...
fatal: You don't exist. Go away!
Unexpected end of command stream
Updating core FAILED.

Not sure if this is after the userid change for the l10nupdate user....

sudo -u l10nupdate git pull ran fine... And it's running in screen as me on tin..

Revisions and Commits

Event Timeline

Reedy raised the priority of this task from to Needs Triage.
Reedy updated the task description. (Show Details)
Reedy added projects: WMF-General-or-Unknown, SRE.
Reedy added subscribers: Reedy, Dzahn.
fatal: You don't exist. Go away!

That's apparently a ssh related error

l10nupdate@tin:~$ git var GIT_AUTHOR_IDENT

*** Please tell me who you are.

Run

  git config --global user.email "you@example.com"
  git config --global user.name "Your Name"

to set your account's default identity.
Omit --global to set the identity only in this repository.

fatal: empty ident name (for <l10nupdate@tin.eqiad.wmnet>) not allowed
l10nupdate@tin:~$ git var GIT_COMMITTER_IDENT

*** Please tell me who you are.

Run

  git config --global user.email "you@example.com"
  git config --global user.name "Your Name"

to set your account's default identity.
Omit --global to set the identity only in this repository.

fatal: empty ident name (for <l10nupdate@tin.eqiad.wmnet>) not allowed
l10nupdate@tin:~$

So this will fix it, but I don't know why it got broken. Was it by mutante changing the uid? How did it ever work before? Can we puppetise this somehow?

l10nupdate@tin:~$ git config --global user.name l10nupdate
l10nupdate@tin:~$ git config --global user.email "l10nupdate@wikimedia.org"
l10nupdate@tin:~$ git var GIT_COMMITTER_IDENT
l10nupdate <l10nupdate@wikimedia.org> 1448744334 +0000
l10nupdate@tin:~$ git var GIT_AUTHOR_IDENT
l10nupdate <l10nupdate@wikimedia.org> 1448744338 +0000
l10nupdate@tin:~$

Can we puppetise this somehow?

We would just need to provision a ~l10nupdate/.gitconfig file I think. No idea why this is suddenly broken/needed.

Can we puppetise this somehow?

We would just need to provision a ~l10nupdate/.gitconfig file I think. No idea why this is suddenly broken/needed.

I guess we should do, as with datacentre swapping, l10nupdate should be able to be run on tin or mira...

tin:/srv/mediawiki-staging  (git master $)
l10nupdate$ /usr/local/bin/sudo-withagent l10nupdate /srv/deployment/scap/scap/bin/sync-dir --no-shared-authsock -D ssh_user:l10nupdate --verbose php-1.27.0-wmf.5/cache/l10n "bd808 testing l10nupdate sync-dir using stale branch"
Starting ssh-agent
Agent pid 32486
Identity added: /home/l10nupdate/.ssh/id_rsa (/home/l10nupdate/.ssh/id_rsa)
           ___ ____
         ⎛   ⎛ ,----
          \  //==--'
     _//|,.·//==--'    ____________________________
    _OO≣=-  ︶ ᴹw ⎞_§ ______  ___\ ___\ ,\__ \/ __ \
   (∞)_, )  (     |  ______/__  \/ /__ / /_/ / /_/ /
     ¨--¨|| |- (  / ______\____/ \___/ \__^_/  .__/
         ««_/  «_/ jgs/bd808                /_/

22:46:43 Running command: `find '/srv/mediawiki-staging/php-1.27.0-wmf.5/cache/l10n' -name '*.php' -or -name '*.inc' -or -name '*.phtml'  -or -name '*.php5' | xargs -n1 -P6 -exec php -l >/dev/null`
22:47:55 Started sync-masters
22:47:56 ['/srv/deployment/scap/scap/bin/sync-master', '--verbose', 'tin.eqiad.wmnet'] on mira.codfw.wmnet returned [70]: 22:47:56 Copying to mira.codfw.wmnet from tin.eqiad.wmnet
22:47:56 Running rsync command: `sudo -n -- /usr/local/bin/scap-master-sync tin.eqiad.wmnet`
22:47:56 Started rsync master
sudo: a password is required
22:47:56 Finished rsync master (duration: 00m 00s)
22:47:56 Unhandled error:
Traceback (most recent call last):
  File "/srv/deployment/scap/scap/scap/cli.py", line 276, in run
    exit_status = app.main(extra_args)
  File "/srv/deployment/scap/scap/scap/main.py", line 364, in main
    verbose=self.verbose
  File "/srv/deployment/scap/scap/scap/utils.py", line 348, in context_wrapper
    return func(*args, **kwargs)
  File "/srv/deployment/scap/scap/scap/tasks.py", line 273, in sync_master
    subprocess.check_call(rsync)
  File "/usr/lib/python2.7/subprocess.py", line 540, in check_call
    raise CalledProcessError(retcode, cmd)
CalledProcessError: Command '['sudo', '-n', '--', '/usr/local/bin/scap-master-sync', 'tin.eqiad.wmnet']' returned non-zero exit status 1
22:47:56 sync-master failed: <CalledProcessError> Command '['sudo', '-n', '--', '/usr/local/bin/scap-master-sync', 'tin.eqiad.wmnet']' returned non-zero exit status 1

sync-masters: 100% (ok: 0; fail: 1; left: 0)
22:47:56 1 masters had sync errors
22:47:56 Finished sync-masters (duration: 00m 00s)
22:47:56 Started sync-proxies
sync-proxies: 100% (ok: 12; fail: 0; left: 0)
22:47:59 Finished sync-proxies (duration: 00m 02s)
22:47:59 Started sync-apaches
sync-common: 100% (ok: 467; fail: 0; left: 0)
22:48:13 Finished sync-apaches (duration: 00m 14s)
22:48:13 Synchronized php-1.27.0-wmf.5/cache/l10n: bd808 testing l10nupdate sync-dir using stale branch (duration: 01m 29s)

The curse of l10nupdate strikes again. When we granted the sudoer rights to run the master-master sync as root we only granted to wikidev. We need to also grant to l10nupdate.

The curse of l10nupdate strikes again. When we granted the sudoer rights to run the master-master sync as root we only granted to wikidev. We need to also grant to l10nupdate.

It feels like the better fix here may be to take away some of the special properties of the l10nupdate process and user. We could add new scap command (sync-l10n) that takes a branch as an argument (eg 1.27.0-wmf.7) and does a sync-dir followed by cdb rebuild across the fleet. We could then grant l10nupdate the right to run this script as the mwdeploy user and kill off the separate l10nupdate ssh key.

Change 255916 had a related patch set uploaded (by BryanDavis):
l10nupdate: replace ssh key with new scap script

https://gerrit.wikimedia.org/r/255916

The combination of D65 and https://gerrit.wikimedia.org/r/255916 should make l10nupdate-1 less unique when it comes to syncing things. As a bonus it also kills a usage of dsh and a ssh keypair.

bd808 renamed this task from localisationupdate broken on wmf wikis to localisationupdate broken on wmf wikis by scap master-master sync changes.Nov 29 2015, 4:57 AM
bd808 triaged this task as High priority.
bd808 added a project: Deployments.
bd808 set Security to None.

D65 is merged and deployed to beta cluster. It is not deployed to production yet.

In beta cluster testing I found a blocker to https://gerrit.wikimedia.org/r/255916: the keyholder system that protects the shared ssh-agent is only usable by members of the wikidev group. Making the mwdeploy user a member of wikidev is not the right thing to do (See https://wikitech.wikimedia.org/wiki/UID for reasons) but maybe it would be ok to allow it to have access to its own ssh key? I'd trust @faidon and/or @tstarling for opinions on the possible ramifications of allowing the mwdeploy user on a deployment server to access its own ssh key.

I think there is some configurability in the keyholder process these days via yaml config files. Maybe @thcipriani or @ori can tell us how to setup keyholder to do this (assuming it's not a horrible idea for other reasons)?

I think there is some configurability in the keyholder process these days via yaml config files. Maybe @thcipriani or @ori can tell us how to setup keyholder to do this (assuming it's not a horrible idea for other reasons)?

Keyholder reads the files in /etc/keyholder-auth.d/*.yml during startup to determine key-access. Each yaml file is in the format:

group_name: ['public-key-fingerprint1', 'public-key-fingerprint2']

for example:

deploy-service: ['96:d3:76:32:0d:d1:c7:85:ef:2d:d1:34:c7:68:bf:87']

Would give anyone in the deploy-service group access to the private key that matched the public-key fingerprint 96:d3:76:32:0d:d1:c7:85:ef:2d:d1:34:c7:68:bf:87.

This is somewhat puppetized in the keyholder::agent resource, but the use case for that resource is not entirely what you're trying to do:

https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modules/keyholder/manifests/agent.pp;cb6f705652167035a1eba7f24a8b007bdadd8866$37

This is somewhat puppetized in the keyholder::agent resource, but the use case for that resource is not entirely what you're trying to do:

https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modules/keyholder/manifests/agent.pp;cb6f705652167035a1eba7f24a8b007bdadd8866$37

I think what I would like here is for the keyholder::agent instance in role::deployment::mediawiki to grant access to two groups: wikidev and mwdeploy. Another option would be to add support for granting access to a specific user (or list of users) in the yaml file in addition to the current group grant ability. Baring any of that new complexity, is it possible to have multiple keyholder::agent defines that point to the same key?

I think what I would like here is for the keyholder::agent instance in role::deployment::mediawiki to grant access to two groups: wikidev and mwdeploy. Another option would be to add support for granting access to a specific user (or list of users) in the yaml file in addition to the current group grant ability. Baring any of that new complexity, is it possible to have multiple keyholder::agent defines that point to the same key?

I think the simplest refactor here would be to allow the $trusted_groups in keyholder::agent to accept an array of groups. In this way, keyholder::agent would be scoped to acting on a single key (since it includes a reference to keyholder::private_key it needs to act on a single-key or get rid of that referrence). It should just require modifying the template on line 10 of the keyholder::agent resource. ssh-agent-proxy should be able to handle having multiple groups point to the same key.

I updated keyholder::agent in https://gerrit.wikimedia.org/r/255916 to support associating multiple groups with a key and granted the mwdeploy group. Following a cherry-pick to deployment-puppetmaster I tested the concept on deployment-bastion:

# keyholder-proxy needs a manual restart
$ sudo service keyholder-proxy restart
$ $ sudo -u l10nupdate -n -- sudo -u mwdeploy -n -- /srv/deployment/scap/scap/bin/sync-l10n --verbose master
           ___ ____
         ⎛   ⎛ ,----
          \  //==--'
     _//|,.·//==--'    ____________________________
    _OO≣=-  ︶ ᴹw ⎞_§ ______  ___\ ___\ ,\__ \/ __ \
   (∞)_, )  (     |  ______/__  \/ /__ / /_/ / /_/ /
     ¨--¨|| |- (  / ______\____/ \___/ \__^_/  .__/
         ««_/  «_/ jgs/bd808                /_/

23:36:43 Started sync-masters
sync-masters: 100% (ok: 1; fail: 0; left: 0)
23:36:49 Finished sync-masters (duration: 00m 05s)
23:36:49 Started sync-proxies
23:36:49 Job ['/srv/deployment/scap/scap/bin/sync-common', '--no-update-l10n', '--include', 'php-master', '--include', 'php-master/cache', '--include', 'php-master/cache/l10n', '--include', 'php-master/cache/l10n/***', '--verbose', 'deployment-bastion.deployment-prep.eqiad.wmflabs'] called with an empty host list.
23:36:49 Finished sync-proxies (duration: 00m 00s)
23:36:49 Started sync-apaches
sync-common: 100% (ok: 7; fail: 0; left: 0)
23:36:53 Finished sync-apaches (duration: 00m 04s)
23:36:53 Started scap-rebuild-cdbs
scap-rebuild-cdbs: 100% (ok: 7; fail: 0; left: 0)
23:36:57 Finished scap-rebuild-cdbs (duration: 00m 03s)
23:36:57 sync-l10nupdate completed (master) (duration: 00m 13s)

Success! (Note that I omitted the failure before I figured out that keyholder-proxy needed to be restarted and the subsequent failure on 3 of the 7 scap-rebuild-cdbs hosts because the trebuchet deploy had not updated their local script.)

Change 255916 merged by Ori.livneh:
l10nupdate: replace ssh key with new scap script

https://gerrit.wikimedia.org/r/255916

tin:/srv/mediawiki-staging  (git master $)
bd808$ sudo -u l10nupdate -n -- sudo -u mwdeploy -n -- /srv/deployment/scap/scap/bin/sync-l10n --verbose 1.27.0-wmf.7
           ___ ____
         ⎛   ⎛ ,----
          \  //==--'
     _//|,.·//==--'    ____________________________
    _OO≣=-  ︶ ᴹw ⎞_§ ______  ___\ ___\ ,\__ \/ __ \
   (∞)_, )  (     |  ______/__  \/ /__ / /_/ / /_/ /
     ¨--¨|| |- (  / ______\____/ \___/ \__^_/  .__/
         ««_/  «_/ jgs/bd808                /_/

01:04:16 Started sync-masters
sync-masters: 100% (ok: 1; fail: 0; left: 0)
01:04:25 Finished sync-masters (duration: 00m 08s)
01:04:25 Started sync-proxies
sync-proxies: 100% (ok: 12; fail: 0; left: 0)
01:04:27 Finished sync-proxies (duration: 00m 02s)
01:04:27 Started sync-apaches
sync-common: 100% (ok: 467; fail: 0; left: 0)
01:04:44 Finished sync-apaches (duration: 00m 16s)
01:04:44 Started scap-rebuild-cdbs
scap-rebuild-cdbs: 100% (ok: 479; fail: 0; left: 0)
01:05:36 Finished scap-rebuild-cdbs (duration: 00m 51s)
01:05:36 sync-l10n completed (1.27.0-wmf.7) (duration: 01m 19s)

Change 255952 had a related patch set uploaded (by BryanDavis):
Provision ~/.gitconfig file for l10nupdate user

https://gerrit.wikimedia.org/r/255952

Change 255952 merged by Ori.livneh:
Provision ~/.gitconfig file for l10nupdate user

https://gerrit.wikimedia.org/r/255952

  1. tin has those two:
root@tin:~# cat /etc/sudoers.d/l10nupdate
# This file is managed by Puppet!

l10nupdate ALL = (www-data,mwdeploy) NOPASSWD: ALL
root@tin:~# cat /etc/sudoers.d/l10nupdate-sync 
# This file is managed by Puppet!

l10nupdate ALL = (mwdeploy) NOPASSWD: /srv/deployment/scap/scap/bin/sync-l10n

Either l10nupdate-sync is superfluous or l10nupdate must go (preferrably the latter, as it's way too broad)

  1. Is this needed?
root@mw1001:~# cat /etc/sudoers.d/l10nupdate 
# This file is managed by Puppet!

l10nupdate ALL = (www-data,mwdeploy) NOPASSWD: ALL

Why is it so broad?

  1. sudo permissions on MW boxes -and especially tin- are just insane in general:
root@tin:~# grep l10n /etc/sudoers.d/*
/etc/sudoers.d/deployment:%deployment ALL = (www-data,apache,mwdeploy,l10nupdate) NOPASSWD: ALL
/etc/sudoers.d/l10nupdate:l10nupdate ALL = (www-data,mwdeploy) NOPASSWD: ALL
/etc/sudoers.d/l10nupdate-sync:l10nupdate ALL = (mwdeploy) NOPASSWD: /srv/deployment/scap/scap/bin/sync-l10n
/etc/sudoers.d/mwdeploy:mwdeploy ALL = (www-data,mwdeploy,l10nupdate) NOPASSWD: ALL
/etc/sudoers.d/wikidev_deploy:	%wikidev ALL = (apache,mwdeploy,l10nupdate) NOPASSWD: ALL

Between wikidev, mwdeploy, deployment, l10nupdate, plus the trebuchet/wikidev's filesystem access to scap there are quite a few permutations of how one can sudo their privileges from one group to another.

  1. Why is l10nupdate a normal user that also has a normal home directory under /home? This is wrong — I realize this isn't new and could have been easily missed, but adding /home/l10nupdate/.gitconfig should had at least made people suspicious.

For 1 and 2 - l10nupdate-sync looks like the new one as it's for a scap script specifically. Ori and Bryan should be able to confirm so we can hopefully clean these up

3 - Fixing 1 and 2 will remove one line at least. Still others to deal with

4 - That was added to fix an issue that cropped up seemingly after mutante changed the uid. Not sure how it worked before, but it did... Happy to revert if there's another way, or similarly, if we can find out why it's broken

As far as I know, the l10nupdate user is only used to [[https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modules/scap/manifests/l10nupdate.pp;9807f39a15dece5256e7d7c67c9ab30dd84a0da8$21|execute /usr/local/bin/l10nupdate-1 via cron]].

The l10nupdate-1 script has 2 explicit sudo calls:

  • sudo -u www-data mkdir "$tempCacheDir" # a directory in /var/lib/l10nupdate/caches
  • sudo -u mwdeploy -n -- $SCAPDIR/sync-l10n --verbose $mwVerNum # Added by https://gerrit.wikimedia.org/r/255916 in response to this bug

The script also uses /usr/local/bin/mwscript which basically calls sudo -u www-data php /srv/mediawiki/multiversion/MWScript.php $@.

Change 256025 had a related patch set uploaded (by BryanDavis):
Clean up l10nupdate settings (1/2)

https://gerrit.wikimedia.org/r/256025

Change 256026 had a related patch set uploaded (by BryanDavis):
Clean up l10nupdate settings (2/2)

https://gerrit.wikimedia.org/r/256026

The changes in https://gerrit.wikimedia.org/r/256025 and https://gerrit.wikimedia.org/r/256026 should clean things up quite a bit for reasoning about and further restricting the l10nudpate user. Once they are applied the user will not be provisioned on new MediaWiki servers (I didn't clean up the user on existing servers) and will only be needed on the deployment servers (tin and mira in production).

So this will fix it, but I don't know why it got broken. Was it by mutante changing the uid? How did it ever work before? Can we puppetise this somehow?

l10nupdate@tin:~$ git config --global user.name l10nupdate
l10nupdate@tin:~$ git config --global user.email "l10nupdate@wikimedia.org"
l10nupdate@tin:~$ git var GIT_COMMITTER_IDENT
l10nupdate <l10nupdate@wikimedia.org> 1448744334 +0000
l10nupdate@tin:~$ git var GIT_AUTHOR_IDENT
l10nupdate <l10nupdate@wikimedia.org> 1448744338 +0000
l10nupdate@tin:~$

This may have been a red herring. The run before this failed with:

tin:/var/lib/l10nupdate/mediawiki
bd808$ less /var/log/l10nupdatelog/l10nupdate.log-20151128.gz
Starting l10nupdate at Sat Nov 28 02:00:01 UTC 2015.
Updating git clone ...
error: Cannot update the ref 'refs/remotes/origin/master': unable to append to .git/logs/refs/remotes/origin/master: Permission denied
From https://gerrit.wikimedia.org/r/p/mediawiki/core
 ! de4571e..2bfde35  master     -> origin/master  (unable to update local ref)
Updating core FAILED.

There are still some files under /var/lib/l10nupdate/mediawiki owned by the old l10nupdate user id. They could be fixed by a root using find /var/lib/l10nupdate/mediawiki -uid 997 -exec chown l10nupdate.

Change 256025 merged by Ori.livneh:
Clean up l10nupdate settings (1/2)

https://gerrit.wikimedia.org/r/256025

There are still some files under /var/lib/l10nupdate/mediawiki owned by the old l10nupdate user id. They could be fixed by a root using find /var/lib/l10nupdate/mediawiki -uid 997 -exec chown l10nupdate.

Confirmed these files were there, owned by the old uid. I fixed these again, but i was really sure these were not there when i checked last time after changing the UID. Reedy added that he also had Yuvipanda re-fix some of these on the weekend, so something kept changing them.

We found that there is a cronjob doing a git clone and the cron service had not been restarted, so it kept using the old UID.

I opened the crontab of user l10nupdate and re-wrote it, stopped and started the cron service, and to confirm this should have fixed it i added a temp. cron job that writes to a temp file and it was owned by the new UID.

Change 256026 merged by Ori.livneh:
Clean up l10nupdate settings (2/2)

https://gerrit.wikimedia.org/r/256026

Change 255952 merged by Ori.livneh:
Provision ~/.gitconfig file for l10nupdate user

https://gerrit.wikimedia.org/r/255952

I'm now wondering if this was just another red herring @Reedy stumbled on while other things were broken such as cron running using the 997 uid instead of the newer 10002 uid. It sort of makes sense that git would freak out when being executed as a uid that did not exist in /etc/passwd.

The nightly l10nupdate cron job seems to be working fine now. I created T120585: Make l10nupdate user a system user to track the additional cleanup that @faidon has asked for.