LocalisationUpdate broken since 2014-12-16
Closed, Resolved (Public)

Description

The last run of LocalisationUpdate on the WMF cluster was on 2014-12-16, per https://wikitech.wikimedia.org/wiki/Server_Admin_Log.

Event Timeline

Raymond raised the priority of this task from to Needs Triage.
Raymond updated the task description.
Raymond added a subscriber: Raymond.

Are you sure this is not intentional? Last MediaWiki software deployment was 2014-12-17, and that's on purpose: https://www.mediawiki.org/wiki/MediaWiki_1.25/Roadmap

LocalisationUpdate has run every night for some years, independent of any deployment cycle.

Aklapper triaged this task as High priority. Jan 5 2015, 4:36 PM

@Reedy / @mmodell help please.

I don't see a !log in the SAL re disabling it... so....

reedy@tin:/var/log/l10nupdatelog$ ls -al l10nupdate.log-2015010*
-rw-rw-r-- 1 l10nupdate l10nupdate 10437 Jan  1 02:09 l10nupdate.log-20150101.gz
-rw-rw-r-- 1 l10nupdate l10nupdate  9405 Jan  2 02:09 l10nupdate.log-20150102.gz
-rw-rw-r-- 1 l10nupdate l10nupdate 13952 Jan  3 02:11 l10nupdate.log-20150103.gz
-rw-rw-r-- 1 l10nupdate l10nupdate 12782 Jan  4 02:10 l10nupdate.log-20150104.gz
-rw-rw-r-- 1 l10nupdate l10nupdate 12692 Jan  5 02:10 l10nupdate.log-20150105.gz

If we zcat this... At the end we see

02:10:15 Unhandled error:
Traceback (most recent call last):
  File "/srv/deployment/scap/scap/scap/cli.py", line 273, in run
    exit_status = app.main(extra_args)
  File "/srv/deployment/scap/scap/scap/main.py", line 318, in main
    super(SyncDir, self).main(*extra_args)
  File "/srv/deployment/scap/scap/scap/main.py", line 37, in main
    with utils.lock(self.config['lock_file']):
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/srv/deployment/scap/scap/scap/utils.py", line 249, in lock
    raise LockFailedError('Failed to lock %s: %s' % (filename, e))
LockFailedError: Failed to lock /var/lock/scap: [Errno 13] Permission denied: '/var/lock/scap'
02:10:15 sync-dir failed: <LockFailedError> Failed to lock /var/lock/scap: [Errno 13] Permission denied: '/var/lock/scap'
Failed to sync-dir 'php-1.25wmf12/cache/l10n'
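
The same check can be scripted rather than done by hand with zcat. The following is only a rough sketch: the function name `find_failures` and the marker list are assumptions made for illustration, with the markers taken from the log excerpt above. It is not part of scap or of the l10nupdate script.

```python
import gzip

# Substrings seen in the failing log excerpt above; an assumed list, extend as needed.
FAILURE_MARKERS = ("Unhandled error", "failed", "Failed", "Permission denied")


def find_failures(path, markers=FAILURE_MARKERS):
    """Return the lines of a gzip-compressed log that contain any marker."""
    hits = []
    # "rt" decodes the compressed stream as text, line by line.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as log:
        for line in log:
            if any(marker in line for marker in markers):
                hits.append(line.rstrip("\n"))
    return hits
```

Calling `find_failures` on each rotated file in /var/log/l10nupdatelog would show which nightly runs hit the lock error, regardless of what was logged to the SAL.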

I see

reedy@tin:~$ ls -al /var/lock/scap
-rw-rw-r-- 1 ori wikidev 0 Jan  5 17:43 /var/lock/scap

But not sure what it was before

Well, the l10nupdate user can't write to the scap lock file as it's not in the wikidev user group....

l10nupdate@tin:/var/log/l10nupdatelog$ groups
l10nupdate
l10nupdate@tin:/var/log/l10nupdatelog$
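
To spell out why this produces Errno 13: the lock file is mode -rw-rw-r-- and owned by ori:wikidev, so only the owner and members of wikidev get the write bit. A simplified sketch of the classic owner/group/other check follows. The numeric ids are hypothetical placeholders, and root, ACLs and directory permissions are ignored.

```python
import stat


def can_write(mode, file_uid, file_gid, uid, gids):
    """Approximate the Unix owner/group/other write check for a file.

    Only the first matching class is consulted, as the kernel does.
    """
    if uid == file_uid:
        return bool(mode & stat.S_IWUSR)
    if file_gid in gids:
        return bool(mode & stat.S_IWGRP)
    return bool(mode & stat.S_IWOTH)


# Hypothetical numeric ids standing in for ori, wikidev and l10nupdate.
ORI_UID, WIKIDEV_GID = 1000, 500
L10N_UID, L10N_GID = 2000, 2000
LOCK_MODE = 0o664  # -rw-rw-r--, as shown by ls above

# l10nupdate is only in its own group, so it falls into "other": no write bit.
print(can_write(LOCK_MODE, ORI_UID, WIKIDEV_GID, L10N_UID, {L10N_GID}))
# If it were also in wikidev, the group write bit would apply.
print(can_write(LOCK_MODE, ORI_UID, WIKIDEV_GID, L10N_UID, {L10N_GID, WIKIDEV_GID}))
```

The first call prints False and the second prints True, which matches the two options discussed here: change who owns the file, or change group membership.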

SRE help, please?

I think @bd808 suggested "we" just make scap delete the lock file at the end of an operation...

Wikidev is the default group for all users, so adding a system account to it seems weird to me. Why isn't this file written with permissions appropriate to the intention of l10nupdate management?

I think @bd808 suggested "we" just make scap delete the lock file at the end of an operation...

This seems like the easier solution.

I've just had ops chown the lock file to l10nupdate so I can at least do a manual run. Whether the permissions will actually stick...

Change 183560 had a related patch set uploaded (by BryanDavis):
Delete scap lock file on unlock

https://gerrit.wikimedia.org/r/183560

Patch-For-Review

Change 183560 merged by jenkins-bot:
Delete scap lock file on unlock

https://gerrit.wikimedia.org/r/183560
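
For readers following along, the idea behind "delete the lock file on unlock" can be sketched as below. This is not the code from change 183560. It is a minimal illustration assuming a flock-based lock on Unix, and it reuses the names `lock` and `LockFailedError` from the traceback above only for familiarity.

```python
import contextlib
import fcntl
import os


class LockFailedError(Exception):
    """Raised when the lock file cannot be opened or locked."""


@contextlib.contextmanager
def lock(filename):
    """Hold an exclusive, non-blocking lock on filename; delete it on exit."""
    try:
        # Creating the file requires write access to the directory only,
        # so a stale file owned by another user no longer gets in the way.
        fd = os.open(filename, os.O_WRONLY | os.O_CREAT, 0o664)
    except OSError as e:
        raise LockFailedError('Failed to lock %s: %s' % (filename, e))
    try:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError as e:
            raise LockFailedError('Failed to lock %s: %s' % (filename, e))
        try:
            yield
        finally:
            # Remove the file while the lock is still held, so the next
            # user (whoever it is) creates a fresh file it owns.
            os.unlink(filename)
    finally:
        os.close(fd)  # closing the descriptor releases the flock
```

Because each run now creates the file itself, ownership left behind by the previous deployer stops mattering. The trade-off is a small race window: a process that opened the old file just before it was unlinked could lock a file that is no longer visible at that path.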

So we're back to

02:16:13 ['/srv/deployment/scap/scap/bin/sync-common', '--no-update-l10n', '--include', 'php-1.25wmf13', '--include', 'php-1.25wmf13/cache', '--include', 'php-1.25wmf13/cache/l10n', '--include', 'php-1.25wmf13/cache/l10n/***', 'mw1070.eqiad.wmnet', 'mw1161.eqiad.wmnet', 'mw1201.eqiad.wmnet'] on mw1018 returned [255]: Permission denied (publickey).

@bd808 did we have a task for tracking this issue? I can't see it at first glance

T76061: "l10nupdate user can't access scap shared ssh key causing nightly l10nupdate sync process to fail", which could be renamed to something like "l10nupdate can't access shared ssh-agent".

Where are we here? Is this still a problem, or has the underlying issue been sorted out?

Per https://wikitech.wikimedia.org/wiki/Server_Admin_Log it has been working well again since January 9th.

The SAL entries are a lie unfortunately.

02:14:23 ['/srv/deployment/scap/scap/bin/sync-common', '--no-update-l10n', '--include', 'php-1.25wmf16', '--include', 'php-1.25wmf16/cache', '--include', 'php-1.25wmf16/cache/l10n', '--include', 'php-1.25wmf16/cache/l10n/***'] on mw1070.eqiad.wmnet returned [255]: Permission denied (publickey).

sync-proxies:  16% (ok: 0; fail: 1; left: 5)

The l10nupdate user that runs the nightly script is still not able to use scap to push the changes it makes to the wikis.

Any update here? I suspect it is still not working.

I have seen patches in Gerrit trying to work out the permissions issues. Someone closer to those is needed to give a more detailed status update.

We had a successful manual run in the early morning (UTC) of 2015-03-12. If the normal cron job is successful on 2015-03-13 we may be able to declare victory. Attempts to fix the backing problem are being tracked by T76061.