l10nupdate user can't access scap shared ssh key causing nightly l10nupdate sync process to fail
Closed, ResolvedPublic

Description

For example, https://www.mediawiki.org/w/index.php?title=MediaWiki:Donate_interface-informationsharing/es is still showing a version of the message which was updated on Nov 22, in patch https://gerrit.wikimedia.org/r/#/c/175252/

Maybe I don't understand how LocalisationUpdate is supposed to work, or DonationInterface messages are not being pulled in by that mechanism. This extension is a bit special, cos there are multiple i18n subdirectories...

Event Timeline

awight created this task.Nov 26 2014, 7:10 PM
awight raised the priority of this task from to High.
awight updated the task description. (Show Details)
awight changed Security from none to None.
awight added subscribers: Reedy, atgo, Nikerabbit.
awight added a subscriber: awight.
Reedy added a comment.Dec 1 2014, 8:42 PM

the problem is a tonne of these:

02:10:28 Started sync-proxies
02:10:28 ['/srv/deployment/scap/scap/bin/sync-common', '--no-update-l10n', '--include', 'php-1.25wmf9', '--include', 'php-1.25wmf9/cache', '--include', 'php-1.25wmf9/cache/l10n', '--include', 'php-1.25wmf9/cache/l10n/***'] on mw1161.eqiad.wmnet returned [255]: Permission denied (publickey).

02:10:28 ['/srv/deployment/scap/scap/bin/sync-common', '--no-update-l10n', '--include', 'php-1.25wmf9', '--include', 'php-1.25wmf9/cache', '--include', 'php-1.25wmf9/cache/l10n', '--include', 'php-1.25wmf9/cache/l10n/***'] on mw1070.eqiad.wmnet returned [255]: Permission denied (publickey).

02:10:28 ['/srv/deployment/scap/scap/bin/sync-common', '--no-update-l10n', '--include', 'php-1.25wmf9', '--include', 'php-1.25wmf9/cache', '--include', 'php-1.25wmf9/cache/l10n', '--include', 'php-1.25wmf9/cache/l10n/***'] on mw1010.eqiad.wmnet returned [255]: Permission denied (publickey).

02:10:28 ['/srv/deployment/scap/scap/bin/sync-common', '--no-update-l10n', '--include', 'php-1.25wmf9', '--include', 'php-1.25wmf9/cache', '--include', 'php-1.25wmf9/cache/l10n', '--include', 'php-1.25wmf9/cache/l10n/***'] on mw1201.eqiad.wmnet returned [255]: Permission denied (publickey).

sync-proxies: 100% (ok: 0; fail: 4; left: 0)
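The `--include` arguments in the log above follow rsync's filter rules: each ancestor directory is listed on its own, and a trailing `***` matches the directory together with everything under it. A minimal sketch of how such a list could be generated from a path follows; the function names are hypothetical and this is not taken from scap's source.

```python
def rsync_includes(path):
    """Return include patterns that select `path` and everything below it."""
    parts = path.strip("/").split("/")
    # List every ancestor directory so that rsync can descend to the target
    # even when a catch-all exclude rule follows the includes.
    patterns = ["/".join(parts[:i]) for i in range(1, len(parts) + 1)]
    # "dir/***" matches the directory itself and all of its contents.
    patterns.append("/".join(parts) + "/***")
    return patterns


def include_args(path):
    """Flatten the patterns into an argv fragment of --include options."""
    args = []
    for pattern in rsync_includes(path):
        args.extend(["--include", pattern])
    return args
```

Calling `include_args("php-1.25wmf9/cache/l10n")` yields the same four `--include` pairs that appear in the failing commands.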
Reedy added a subscriber: bd808.EditedDec 1 2014, 8:54 PM

[20:52:08] <bd808> Reedy: Bah. It's scap
[20:52:18] <Reedy> :(
[20:53:06] <bd808> Reedy: This is clobbering SSH_AUTH_SOCK -- https://github.com/wikimedia/mediawiki-tools-scap/blob/master/scap/cli.py#L171-L175
[20:53:49] <bd808> Reedy: So that needs a patch to only replace the auth sock if the shared one is present and readable I think
[20:56:56] <Reedy> bd808: os.path.isfile?
[20:57:12] <bd808> Reedy: The other way to fix it would be to change the permissions on the shared auth socket so that l10nupdate can read from it. In the long term that would be even better.
[20:58:53] <bd808> Reedy: os.path.exists. It could be a symlink in theory
[21:00:28] <bd808> Reedy: Or more pythonically, open the file for reading and only change the env is that succeeds
[21:00:33] <Reedy> if auth_sock is not None and os.path.exists(auth_sock):
[21:00:40] <Reedy> heh
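
The idea discussed in the chat above (only replace SSH_AUTH_SOCK when the shared socket is present and accessible) can be sketched roughly as below. This is an illustration written for Python 3, not the code in scap's cli.py; `choose_auth_sock` and `setup_environ` are hypothetical names.

```python
import os


def choose_auth_sock(configured, environ):
    """Pick the configured socket if it looks usable, else keep the caller's."""
    if (
        configured
        and os.path.exists(configured)  # follows symlinks, as suggested above
        and os.access(configured, os.R_OK | os.W_OK)
    ):
        return configured
    # Fall back to whatever agent the calling user already has, if any.
    return environ.get("SSH_AUTH_SOCK")


def setup_environ(configured, environ=None):
    """Set SSH_AUTH_SOCK in `environ` without clobbering a working value."""
    environ = os.environ if environ is None else environ
    sock = choose_auth_sock(configured, environ)
    if sock is not None:
        environ["SSH_AUTH_SOCK"] = sock
    return environ
```

With this shape, a user such as l10nupdate who cannot access the shared socket would keep the agent started for its own key.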

Change 176750 had a related patch set uploaded (by Reedy):
Only use config ssh_auth_sock if set/readable/useable

https://gerrit.wikimedia.org/r/176750

Patch-For-Review

Urgency just became High+1 for us, it turns out there is no workaround, LocalisationUpdate is overriding newer, manually deployed extension messages.

greg raised the priority of this task from High to Unbreak Now!.Dec 1 2014, 11:51 PM
greg added a subscriber: greg.

@Reedy: this is probably affecting/will affect banners. We should get to this ASAP.

greg moved this task from To Triage to In-progress on the Deployments board.Dec 1 2014, 11:51 PM

Change 176750 merged by jenkins-bot:
Only use config ssh_auth_sock if set/readable/useable

https://gerrit.wikimedia.org/r/176750

greg added a subscriber: gerritbot.Dec 2 2014, 6:32 PM

Annnnd, reverted.

Reedy added a comment.Dec 2 2014, 7:00 PM

Urgency just became High+1 for us, it turns out there is no workaround, LocalisationUpdate is overriding newer, manually deployed extension messages.

You did scap afterwards?

l10nupdate isn't actually syncing anything, so it should mean that nothing is actually changing on the mw servers

bd808 added a subscriber: ori.Dec 2 2014, 7:06 PM

When @Reedy tried the patch he and @ori made in https://gerrit.wikimedia.org/r/#/c/176750/ it failed in prod:

17:58:47 sync-common failed: <error> [Errno 2] No such file or directory
17:58:47 ['/srv/deployment/scap/scap/bin/sync-common', '--no-update-l10n', '--include', 'wmf-config', '--include', 'wmf-config/***', 'mw1010.eqiad.wmnet', 'mw1070.eqiad.wmnet', 'mw1161.eqiad.wmnet', 'mw1201.eqiad.wmnet'] on mw1122 returned [70]: 17:58:47 Unhandled error:
Traceback (most recent call last):
  File "/srv/deployment/scap/scap/scap/cli.py", line 283, in run
    app._setup_environ()
  File "/srv/deployment/scap/scap/scap/cli.py", line 178, in _setup_environ
    sock.connect(auth_sock)
  File "/usr/lib/python2.7/socket.py", line 224, in meth
    return getattr(self._sock,name)(*args)
error: [Errno 2] No such file or directory

I think this mostly means we need to catch additional errors/exceptions in the try/except block but it could use additional debugging to figure that out.
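
As a rough sketch of what catching the additional errors might look like, the probe below treats any connection failure as "socket not usable" instead of letting the exception escape. It is written for Python 3, where `socket.error` is an alias of `OSError` (the traceback above is from Python 2.7), and `agent_is_usable` is a hypothetical helper rather than scap's actual code.

```python
import socket


def agent_is_usable(path):
    """Return True only if a Unix socket at `path` accepts a connection."""
    if not path or not hasattr(socket, "AF_UNIX"):
        return False
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.connect(path)
    except OSError:
        # Covers a missing path (ENOENT, as in the traceback), insufficient
        # permissions (EACCES), and a socket file with no listener behind it.
        return False
    finally:
        sock.close()
    return True
```

A caller would then override SSH_AUTH_SOCK only when `agent_is_usable(auth_sock)` returns True.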

awight added a comment.Dec 2 2014, 7:08 PM

I only pushed extension code, not a full scap. Are production app servers not reading from the messages files, is that why you ask? I gotta learn more :)

awight added a comment.Dec 2 2014, 7:12 PM

Okay, scap does in fact update the messages. Thanks! This means we have a workaround, I'm lowering the urgency again.

awight lowered the priority of this task from Unbreak Now! to High.Dec 2 2014, 7:12 PM
Reedy added a comment.Dec 2 2014, 8:18 PM

I only pushed extension code, not a full scap. Are production app servers not reading from the messages files, is that why you ask? I gotta learn more :)

Yeah, they don't. Loading the hundreds of json (was PHP) files, then any fallback chains is just slow.

l10nupdate/scap build cdb files with the full message set for a specific language, so it's effectively all in one place
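
To illustrate why a prebuilt per-language message set avoids the slow path, here is a small sketch that resolves a fallback chain once and produces a single flat mapping. Plain dicts stand in for the JSON files and the CDB output; the data layout and the name `flatten_messages` are assumptions made for the example, not MediaWiki's implementation.

```python
def flatten_messages(lang, messages, fallbacks):
    """Merge a language's messages with those of its fallback chain."""
    merged = {}
    # Walk from the last fallback up to the language itself so that more
    # specific languages overwrite less specific ones.
    for code in reversed([lang] + fallbacks.get(lang, [])):
        merged.update(messages.get(code, {}))
    return merged


messages = {
    "en": {"donate": "Donate", "thanks": "Thank you"},
    "es": {"donate": "Donar"},
}
fallbacks = {"es": ["en"]}
es_messages = flatten_messages("es", messages, fallbacks)
```

Here `es_messages` holds the Spanish "donate" text plus the English "thanks" text, so a lookup at request time needs only one mapping.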

Where are we here? Is this still a problem, or has the underlying issue been sorted out?

bd808 renamed this task from LocalisationUpdate is not updating messages to l10nupdate user can't access scap shared ssh key causing nightly l10nupdate sync process to fail.Feb 17 2015, 5:38 PM

Still happening as of 2015-02-16:

02:14:23 ['/srv/deployment/scap/scap/bin/sync-common', '--no-update-l10n', '--include', 'php-1.25wmf16', '--include', 'php-1.25wmf16/cache', '--include', 'php-1.25wmf16/cache/l10n', '--include', 'php-1.25wmf16/cache/l10n/***'] on mw1070.eqiad.wmnet returned [255]: Permission denied (publickey).

sync-proxies:  16% (ok: 0; fail: 1; left: 5)

l10nupdate is logging in SAL every morning that it worked, but the /var/log/l10nupdatelog logs on tin show that the l10n cache is built but then fails to sync.
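
One way to avoid reporting success when the sync actually failed would be to read the progress lines and look at the final fail count of each stage. The sketch below assumes the line format seen in the logs in this task; the regular expression and function names are hypothetical and not part of l10nupdate.

```python
import re

# Matches lines such as "sync-proxies:  16% (ok: 0; fail: 1; left: 5)".
PROGRESS_RE = re.compile(
    r"^(?P<name>[\w -]+?):\s+(?P<pct>\d+)% "
    r"\(ok: (?P<ok>\d+); fail: (?P<fail>\d+); left: (?P<left>\d+)\)"
)


def final_failures(log_text):
    """Map each stage name to the fail count of its last progress line."""
    result = {}
    for line in log_text.splitlines():
        match = PROGRESS_RE.match(line.strip())
        if match:
            result[match.group("name")] = int(match.group("fail"))
    return result


def run_succeeded(log_text):
    """True if at least one stage was seen and none ended with failures."""
    failures = final_failures(log_text)
    return bool(failures) and all(count == 0 for count in failures.values())
```

A wrapper script could call `run_succeeded` on the night's log before writing the SAL entry.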

greg added a subscriber: mmodell.Feb 17 2015, 6:29 PM

@mmodell: help here please.

There are two possible ways to fix this problem for the l10nupdate user invoking scap:

  1. Make a way (probably via a new command line option) for scap to bypass using the shared ssh-agent socket so that the special ssh key for the l10nupdate user can be used again as it was before scap was updated to use a shared agent.
  2. Fix the file system permission on the shared ssh-agent to allow the l10nupdate user to read the socket.

Option 2 would be tricky to do without either giving the l10nupdate user more privileges than it currently has or making the socket permission look strange (eg 1770 l10nupdate:wikidev). I'm actually not sure if the socket can be changed to be owned by anyone other than the current keyholder user either. Making l10nupdate a member of wikidev would be a fairly large privilege escalation and should not be done in my opinion.

A possible third option that would be a bit of a combination of the other two but more involved would be to extend the keyholder service to be capable of running multiple instances simultaneously, running a proxy for the l10nupdate user's key and passing an alternate -D ssh_auth_sock:/run/keyholder/some-other-proxy.sock setting to the scap invocation.

Wait. Is this as easy as calling scap as scap -D ssh_auth_sock:$SSH_AUTH_SOCK?

Not quite, apparently. The l10nupdate-1 script invokes sync-dir via sudo-withagent. sudo-withagent invokes itself via sudo and then starts ssh-agent before running the passed command. $SSH_AUTH_SOCK is not set in the outer calling environment of l10nupdate-1 and thus can't be set directly there via shell variable expansion. The shell variable would need to be expanded inside the recursive sudo-withagent invocation after $SSH_AUTH_SOCK has been set. This would require modifications to sudo-withagent to somehow mark a command line argument as something that needed to be eval'd at the correct time.
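
The timing problem can be reproduced in miniature. In the sketch below an inner shell stands in for sudo-withagent: it sets SSH_AUTH_SOCK and then runs a command. The socket path and helper name are made up for the demonstration, and it assumes a POSIX /bin/sh is available.

```python
import os
import subprocess


def run_nested(outer_command):
    """Run `outer_command` in a shell whose environment lacks SSH_AUTH_SOCK."""
    env = {k: v for k, v in os.environ.items() if k != "SSH_AUTH_SOCK"}
    done = subprocess.run(
        ["/bin/sh", "-c", outer_command],
        env=env, capture_output=True, text=True, check=True,
    )
    return done.stdout.strip()


# Double quotes: the outer shell expands the variable before the inner shell
# has set it, so the inner command receives an empty value.
EARLY = '/bin/sh -c "SSH_AUTH_SOCK=/tmp/demo-agent.sock; echo ssh_auth_sock:$SSH_AUTH_SOCK"'
# Single quotes: expansion is deferred to the inner shell, after assignment.
LATE = "/bin/sh -c 'SSH_AUTH_SOCK=/tmp/demo-agent.sock; echo ssh_auth_sock:$SSH_AUTH_SOCK'"

early_result = run_nested(EARLY)
late_result = run_nested(LATE)
```

`early_result` ends up with no path after the colon, while `late_result` carries the socket path, which is the deferred evaluation that sudo-withagent would have to support.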

Change 191248 had a related patch set uploaded (by BryanDavis):
Add universal argument to ignore ssh_auth_sock

https://gerrit.wikimedia.org/r/191248

Patch-For-Review

Change 191251 had a related patch set uploaded (by BryanDavis):
l10nupdate: use --no-shared-authsock with sync-dir

https://gerrit.wikimedia.org/r/191251

Patch-For-Review

Change 191248 merged by jenkins-bot:
Add universal argument to ignore ssh_auth_sock

https://gerrit.wikimedia.org/r/191248

scap change +2'd but I don't have +2 on ops/puppet so that one isn't merged.

Change 191251 merged by Ottomata:
l10nupdate: use --no-shared-authsock with sync-dir

https://gerrit.wikimedia.org/r/191251

Both the scap and l10nupdate changes are deployed in production now (and we didn't break scap for normal deployers this time!). Let's see what happens on the next run in a few hours.

Still same error in the log for last run.

The sync-dir call is still failing with ssh key permissions issues:

Syncing to Apaches at 2015-02-23 02:02:22+00:00
Starting ssh-agent
Agent pid 6530
Identity added: /home/l10nupdate/.ssh/id_rsa (/home/l10nupdate/.ssh/id_rsa)
02:02:22 Started sync-proxies
sync-proxies:   0% (ok: 0; fail: 0; left: 6)                                    
02:02:22 ['/srv/deployment/scap/scap/bin/sync-common', '--no-update-l10n',
  '--include', 'php-1.25wmf17', '--include', 'php-1.25wmf17/cache', 
  '--include', 'php-1.25wmf17/cache/l10n', '--include', 'php-1.25wmf17/cache/l10n/***'
  ] on mw1033.eqiad.wmnet returned [255]: Permission denied (publickey).

sync-proxies:  16% (ok: 0; fail: 1; left: 5)
[...snip...]
sync-proxies: 100% (ok: 0; fail: 6; left: 0)

I can however use the l10nupdate user's private key to authenticate from tin:

$ /usr/local/bin/sudo-withagent l10nupdate ssh mw1033.eqiad.wmnet
Starting ssh-agent
Agent pid 24302
Identity added: /home/l10nupdate/.ssh/id_rsa (/home/l10nupdate/.ssh/id_rsa)
Linux mw1033 3.13.0-24-generic #47-Ubuntu SMP Fri May 2 23:30:00 UTC 2014 x86_64
Ubuntu 14.04.1 LTS
mw1033 is role::mediawiki::appserver
The last Puppet run was at Mon Feb 23 20:52:13 UTC 2015 (5 minutes ago).
Ubuntu 14.04.1 LTS auto-installed on Mon Dec 1 11:20:22 UTC 2014.
Last login: Mon Feb 23 20:55:04 2015 from tin.eqiad.wmnet
l10nupdate@mw1033:~$ logout
Connection to mw1033.eqiad.wmnet closed.

I can also do this via sudo -u l10nupdate -- /usr/local/bin/sudo-withagent l10nupdate ssh -v mw1033.eqiad.wmnet which would seem to rule out the l10nupdate user not having access to its own keys.

@mmodell could you find a time to interactively run the sync-dir as /usr/local/bin/l10nupdate-1 does it with a --verbose flag appended and see if we can get any more information on what is keeping this from working?

greg assigned this task to mmodell.Feb 23 2015, 9:24 PM

Setting assignee for tracking.

@bd808 I'll see what I can figure out

aude added a subscriber: aude.Feb 23 2015, 10:49 PM

It would be super nice to have this finally fixed. :)

We have a new i18n message that we would like updated this week when we add a "Wikibooks" site links section on Wikidata. I am not sure how else to properly pull in the updated messages.

bd808 added a comment.EditedFeb 23 2015, 11:01 PM

The l10n cache files on tin are being updated every night, so at the worst right now a manual scap (or the normal train scap) pushes all the updates out to the wikis.

aude added a comment.Feb 23 2015, 11:09 PM

@bd808 thanks. we can do scap if/when needed... but hopefully we'll figure out why we have this issue with l10nupdate

mmodell added a comment.EditedFeb 24 2015, 6:00 AM

@bd808: So, yes, you can authenticate with the remote server via the key on tin, however, the remote server then has to connect back to tin to rsync the files, right? And on the remote machine there aren't any ssh keys in ~/.ssh/ ... how is the remote server supposed to authenticate, does it use agent forwarding for that?

Am I confused somehow about the way this works? I can't tell 100% whether it's pushing from tin or pulling from the remote machines, but it seems like it's connecting to the remote machines which in turn use rsync to connect back and grab the files.

The connection from tin to the target host uses ssh. On the target host side, rsync is run as the mwdeploy user to connect back to tin (or an rsync proxy). The rsync is done over the built-in rsync protocol via a connection to the running rsync daemon on tin (or an rsync proxy). We don't use the --rsh or -e options of rsync. See the tasks.sync_common method for more gory details.

Ok $SSH_AUTH_SOCK doesn't seem to be getting passed on to the sync targets. Don't we need to enable agent forwarding in ssh config or via command line?

hmmm I guess it must be something in scap then. I don't know how to get more debugging info from scap, adding --verbose didn't change anything at all.

One thing that might help debugging the apparent ssh connection issues would be to add one or two -v flags to DEFAULT_RSYNC_ARGS in tasks.py. At some point I was planning on making the --verbose flag to scap and sync-* do that automatically but I never got a method for passing the flag on to tasks that I liked. The quick and dirty test would be just to live edit the file on tin (doesn't need to be synced to other scap hosts) and run sync-dir as the l10nupdate user.

apparently it's not getting to RSYNC because that didn't make any difference...

I'm not even sure why I thought that would help debug. I was thinking of some other random problem from the past apparently. What I really meant to suggest was adding verbose flags to the ssh command which is set up in the scap.ssh module by the SSH tuple at the top of the file.

scap and sync-* make the ssh connection as the mwdeploy user. The l10nupdate ssh key is only good for itself and not this shared deploy user. To make this work we need to update scap to allow the caller to specify the user to run the ssh commands as. Something like sync-dir --ssh-user l10nupdate ....

Change 196143 had a related patch set uploaded (by BryanDavis):
l10nupdate: connect to remote hosts as l10nupdate user

https://gerrit.wikimedia.org/r/196143

We already have the means to override config values using the -D key:value option, so https://gerrit.wikimedia.org/r/196143 uses that.
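
For readers unfamiliar with this style of override, the sketch below shows how a repeatable -D key:value option could be folded into a config dict. It is an illustration only, not scap's argument parser; the `ssh_user` key and the helper names are assumptions, while the `ssh_auth_sock` example is the one quoted earlier in this task.

```python
import argparse


def parse_define(text):
    """Split "key:value" on the first colon so values may contain colons."""
    key, sep, value = text.partition(":")
    if not sep or not key:
        raise argparse.ArgumentTypeError("expected key:value, got %r" % text)
    return key, value


def build_config(defaults, argv):
    """Return `defaults` updated with any -D overrides found in `argv`."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-D", dest="defines", action="append",
                        type=parse_define, default=[], metavar="key:value")
    # Other arguments (for example the directory to sync) are left untouched.
    args, _rest = parser.parse_known_args(argv)
    config = dict(defaults)
    config.update(args.defines)
    return config


config = build_config(
    {"ssh_user": "mwdeploy", "ssh_auth_sock": None},  # hypothetical defaults
    ["-D", "ssh_user:l10nupdate", "php-1.25wmf20/cache/l10n"],
)
```

In this example `config["ssh_user"]` becomes `"l10nupdate"` and the untouched default for `ssh_auth_sock` is preserved.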

Change 196143 merged by Tim Starling:
l10nupdate: connect to remote hosts as l10nupdate user

https://gerrit.wikimedia.org/r/196143

mmodell reassigned this task from mmodell to bd808.Mar 12 2015, 5:08 PM

looks like you have this one handled.

greg added a comment.Mar 13 2015, 6:23 PM

How'd it go last night?

bd808 closed this task as Resolved.Mar 13 2015, 6:48 PM

02:21:16 Started sync-proxies
sync-proxies: 100% (ok: 6; fail: 0; left: 0)
02:22:04 Finished sync-proxies (duration: 00m 47s)
02:22:04 Started sync-apaches
sync-common: 100% (ok: 267; fail: 0; left: 0)
02:28:25 Finished sync-apaches (duration: 06m 20s)
02:28:25 Synchronized php-1.25wmf20/cache/l10n: (no message) (duration: 07m 08s)
l10n merge: 100% (ok: 379; fail: 0; left: 0)
02:28:56 Updated 379 CDB files(s) in /srv/mediawiki/php-1.25wmf20/cache/l10n

w00t!

Snaevar removed a subscriber: Snaevar.Mar 14 2015, 4:49 PM
greg moved this task from In-progress to Done on the Deployments board.Mar 16 2015, 3:46 PM