The last run of LocalisationUpdate on the WMF cluster was on 2014-12-16 per https://wikitech.wikimedia.org/wiki/Server_Admin_Log.
Description
Details
Subject | Repo | Branch | Lines +/-
---|---|---|---
Delete scap lock file on unlock | mediawiki/tools/scap | master | +1 -0
Status | Assigned | Task
---|---|---
Resolved | bd808 | T85790 LocalisationUpdate broken since 2014-12-16
Resolved | bd808 | T76061 l10nupdate user can't access scap shared ssh key causing nightly l10nupdate sync process to fail
Event Timeline
Are you sure this is not intentional? Last MediaWiki software deployment was 2014-12-17, and that's on purpose: https://www.mediawiki.org/wiki/MediaWiki_1.25/Roadmap
LocalisationUpdate has run every night for some years, independent of any deployment cycle.
reedy@tin:/var/log/l10nupdatelog$ ls -al l10nupdate.log-2015010*
-rw-rw-r-- 1 l10nupdate l10nupdate 10437 Jan 1 02:09 l10nupdate.log-20150101.gz
-rw-rw-r-- 1 l10nupdate l10nupdate 9405 Jan 2 02:09 l10nupdate.log-20150102.gz
-rw-rw-r-- 1 l10nupdate l10nupdate 13952 Jan 3 02:11 l10nupdate.log-20150103.gz
-rw-rw-r-- 1 l10nupdate l10nupdate 12782 Jan 4 02:10 l10nupdate.log-20150104.gz
-rw-rw-r-- 1 l10nupdate l10nupdate 12692 Jan 5 02:10 l10nupdate.log-20150105.gz
If we zcat this... At the end we see
02:10:15 Unhandled error:
Traceback (most recent call last):
  File "/srv/deployment/scap/scap/scap/cli.py", line 273, in run
    exit_status = app.main(extra_args)
  File "/srv/deployment/scap/scap/scap/main.py", line 318, in main
    super(SyncDir, self).main(*extra_args)
  File "/srv/deployment/scap/scap/scap/main.py", line 37, in main
    with utils.lock(self.config['lock_file']):
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/srv/deployment/scap/scap/scap/utils.py", line 249, in lock
    raise LockFailedError('Failed to lock %s: %s' % (filename, e))
LockFailedError: Failed to lock /var/lock/scap: [Errno 13] Permission denied: '/var/lock/scap'
02:10:15 sync-dir failed: <LockFailedError> Failed to lock /var/lock/scap: [Errno 13] Permission denied: '/var/lock/scap'
Failed to sync-dir 'php-1.25wmf12/cache/l10n'
I see
reedy@tin:~$ ls -al /var/lock/scap
-rw-rw-r-- 1 ori wikidev 0 Jan 5 17:43 /var/lock/scap
But not sure what it was before
Well, the l10nupdate user can't write to the scap lock file as it's not in the wikidev user group....
l10nupdate@tin:/var/log/l10nupdatelog$ groups
l10nupdate
l10nupdate@tin:/var/log/l10nupdatelog$
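To make the reasoning above concrete, here is a rough sketch of the standard Unix owner/group/other check applied to the lock file. Nothing here comes from scap; `mode_allows_write` and `can_write` are hypothetical helper names, and the numeric uids and gids in the usage note are made up for illustration.

```python
import os
import pwd
import stat


def mode_allows_write(st_mode, st_uid, st_gid, uid, gids):
    """Apply the classic Unix permission check for write access.

    Only the first matching class (owner, then group, then other) is consulted.
    """
    if uid == 0:
        return True  # root bypasses the mode bits
    if uid == st_uid:
        return bool(st_mode & stat.S_IWUSR)
    if st_gid in gids:
        return bool(st_mode & stat.S_IWGRP)
    return bool(st_mode & stat.S_IWOTH)


def can_write(username, path):
    """Look up the account and the file, then apply mode_allows_write."""
    user = pwd.getpwnam(username)
    # Primary group plus supplementary groups of the account.
    gids = set(os.getgrouplist(user.pw_name, user.pw_gid))
    st = os.stat(path)
    return mode_allows_write(st.st_mode, st.st_uid, st.st_gid, user.pw_uid, gids)
```

With a file of mode 0o664 owned by uid 1000 and gid 500 (standing in for ori:wikidev), an account with uid 2000 whose only group is 2000 is denied, while the same account with gid 500 added would be allowed. On tin, `can_write('l10nupdate', '/var/lock/scap')` would be the equivalent check.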
I think @bd808 suggested "we" just make scap delete the lock file at the end of an operation...
Wikidev is the default group for all users, so adding a system account to it seems weird to me. Why isn't this file now written with permissions appropriate to the way l10nupdate is meant to be managed?
I've just had ops chown the lock file to l10nupdate so I can at least do a manual run. Whether the permissions will actually stick...
Change 183560 had a related patch set uploaded (by BryanDavis):
Delete scap lock file on unlock
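For readers following along, a minimal sketch of the idea behind that patch, assuming a context-manager lock like the `utils.lock` seen in the traceback. This is not the scap code and not the actual one-line change; the function body, the use of `fcntl.flock`, and the 0o664 mode are assumptions, and it is written for Python 3 rather than the Python 2.7 in the log.

```python
import contextlib
import errno
import fcntl
import os


class LockFailedError(Exception):
    """Raised when the lock file cannot be opened or locked."""


@contextlib.contextmanager
def lock(filename):
    """Hold an exclusive lock on filename and delete the file on unlock."""
    try:
        fd = os.open(filename, os.O_WRONLY | os.O_CREAT, 0o664)
    except OSError as e:
        raise LockFailedError('Failed to lock %s: %s' % (filename, e))
    try:
        # Non-blocking: fail immediately if another process holds the lock.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError as e:
        os.close(fd)
        raise LockFailedError('Failed to lock %s: %s' % (filename, e))
    try:
        yield
    finally:
        # Remove the file so the next user creates a fresh one that it owns,
        # instead of inheriting a file owned by whoever ran scap last.
        try:
            os.unlink(filename)
        except OSError as e:
            if e.errno != errno.ENOENT:
                raise
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)
```

Used as `with lock('/var/lock/scap'): ...`, no stale file owned by the previous user is left behind, so the next account only needs permission to create a file in the lock directory rather than membership in the previous owner's group.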
So we're back to
02:16:13 ['/srv/deployment/scap/scap/bin/sync-common', '--no-update-l10n', '--include', 'php-1.25wmf13', '--include', 'php-1.25wmf13/cache', '--include', 'php-1.25wmf13/cache/l10n', '--include', 'php-1.25wmf13/cache/l10n/***', 'mw1070.eqiad.wmnet', 'mw1161.eqiad.wmnet', 'mw1201.eqiad.wmnet'] on mw1018 returned [255]: Permission denied (publickey).
@bd808 did we have a task for tracking this issue? I can't see it at first glance
T76061: l10nupdate user can't access scap shared ssh key causing nightly l10nupdate sync process to fail, which could be renamed to something like "l10nupdate can't access shared ssh-agent"
Where are we here? Is this still a problem, or has the underlying issue been sorted out?
Per https://wikitech.wikimedia.org/wiki/Server_Admin_Log it has been working again since January 9th.
The SAL entries are a lie unfortunately.
02:14:23 ['/srv/deployment/scap/scap/bin/sync-common', '--no-update-l10n', '--include', 'php-1.25wmf16', '--include', 'php-1.25wmf16/cache', '--include', 'php-1.25wmf16/cache/l10n', '--include', 'php-1.25wmf16/cache/l10n/***'] on mw1070.eqiad.wmnet returned [255]: Permission denied (publickey).
sync-proxies: 16% (ok: 0; fail: 1; left: 5)
The l10nupdate user that runs the nightly script is still not able to use scap to push the changes it makes to the wikis.
I have seen patches in Gerrit trying to work out the permissions issues. Someone closer to those is needed to give a more detailed status update.
We had a successful manual run in the early morning (UTC) of 2015-03-12. If the normal cron job is successful on 2015-03-13 we may be able to declare victory. Attempts to fix the backing problem are being tracked by T76061.